code, coder, programming

How to Hash Data Without Errors (part 1)

Posts in this series:

  1. Introduction to Hash Functions
  2. The Principles of Hashing (in Python)
  3. Hash Functions for Ethereum Developers

Hash functions are one of the fundamental building tools in cryptography, and pretty much in any blockchain technology. You can find them at every step, when computing a digital signature or creating a Merkle Tree. But there are quite a few traps when using them, and it’s very easy to make mistakes and compute the wrong hash instead. If you’ve wondered why your signatures don’t verify, or your hashed digests don’t agree with someone else’s, read on, you may find some answers in here.

What are Hash Functions?

A ‘hash’ is a short way to denote the output of a hashing function, usually of ‘cryptographic grade’. These functions have a number of guarantees, the most common of which are:

  • they are one way, meaning that from a hash we can not easiliy find the message that originated it;
  • they are collision-resistant, that is, we can’t easily find a pair of inputs that create the same hash;
  • they are pseudo-random in practice, meaning that we expect the images of a hash function to be uniformly distributed when the pre-image is chosen at random.

In structure, they’re all similar: they can take an input of an arbitrary length and return an output of a fixed and very small size. But the ways they are computed varies greatly, with some functions belonging to common families or following some standard design model (for example, the Merkle Damgård or Sponges).

Hash Function Standards

There have been several famous hash functions since the early days of modern cryptography. The NIST has been standardizing functions for cryptographic use since the 1970s, and such a certification is usually taken as an indication of a function’s security and general quality.

Today, the most widespread NIST standards for Hash functions are SHA2, which comes in several variants, and SHA3, also frequently known as Keccak. Strictly, Keccak is the name of the algorithm chosen as the winner of the SHA3 competition. The standardized version fixes some parameters and therefore SHA3 is a restricted version of Keccak. In both standards, the most common outputs are 256- and 512-bit lengths. In practical implementations, naming conventions are inconsistent. The Solidity programming language for Ethereum smart contracts specifies three hash functions: sha256, keccak256 and ripemd160. The first is the SHA2 standard, whereas the second is a the 256-bit version of sha3.

Two other hash functions were very successful as well, until they were deemed too old and obsolete: SHA1 and MD5. They may still be found used as fingerprinting functions, for example to show that a download is genuine, but under no circumstances should they be used nowadays for serious cryptographic protocols. The output of SHA1 is 160-bit long, and the output of MD5 is 128-bit long.

What Can We Hash?

The beauty of hash functions is that we can hash anything. A hash function provides a summary of any object, as long as you can represent it in some digital format. The input to a function is called in mathematical terms the ‘pre-image’. Conversely, the output of a function is called the ‘image’. Hash functions are not injective. This means that many different pre-images can generate the same hash, but if we know only the final result only, we can never know which one used to produce it. Hash functions can take inputs of any size. They are usually built in an iterative fashion: the input is broken into blocks of equal size, and then passed into a core function together with the current state of the computation. This core function takes inputs of fixed size, and the result becomes the new state, to be used with the next block.

In practice, hashes are usually based on bit-wise operations, since these are faster to compute. This is the case of all SHA family hash functions and MD5, for example. The natural input to these is a sequence of bytes which may then be operated upon as words of fixed sizes or even individual bits. For some esoteric cases, for example the ZK-Snarks that are currently the rage in Distributed Ledger Technology, these standard functions are inefficient, and should be replaced by alternatives that operate on group and field elements instead. The structure of these is totally different from the ones above, and are not in the scope of this post.

What Can Hash Functions be Used For?

Hashes are frequently used for integrity, since they can easily detect an accidental corruption of bits. When this happens, the resulting hash will be totally different and signal that a change has occurred. In the same way, this can protect against a deliberate change. This is the main tool used in Bitcoin’s Proof of Work, where each block’s integrity is protected by a hash that is also later used as part of the pre-image for the next block’s hash. Any change, even of a single bit, in a block, will consequently change that block’s hash and that of every other block, making it unfeasible to convincingly alter a block. Such a change would require recomputing the proper hash for each block afterwards, which is a long task that in normal circumstances can not be completed before the blockchain is changed by other participants.

Hashes can also be used as summaries for quick comparison of objects. For example, if we want to compare two versions of a collection of objects, and find which ones have been changed, we could keep an index of hashes instead and just check these for differences. Each hash is just a small bunch of bytes (currently, the most used value is 32 bytes) and so these comparisons are way more efficient than checking each byte of each object. For the same reason, hashes are usually employed as a first step in a digital signature. A digital signature is a public-key construction, and currently that means its keys are based on some mathematical-hardness assumption and are relatively slow. Also, we would like them to be of a fixed size, despite the input being as long as necessary. Instead of applying the signature computation over the original message, we typically apply it only over the hash of the message, which ensures a fixed input size, and so a fixed output size, to the signing algorithm.

This is a good place to stop. In the next post, I will go over actual details of creating hashes in Python and Ethereum smart contracts, the unexpectedly easy ways that we can make mistakes, and how to avoid them. Until then, have fun coding and exploring cryptography and crypto-currencies.

Leave a Reply