
How to Hash Data Without Errors (part 2)

Posts in this series:

  1. Introduction to Hash Functions
  2. The Principles of Hashing (in Python)
  3. Hash Functions for Ethereum Developers

In my previous post, I introduced hash functions and gave a general overview of what they are and what they can be used for. That was the setup for a series in which I explore the pitfalls in their use. Hash functions are among the most fundamental tools in the construction of blockchains, and indeed in all of cryptography, so it is essential to have a good understanding of how they behave. This post is more practical than the previous one and gives abundant examples in Python.

Hash Function Inputs

I said in the last installment that a hash function can hash anything, and that this is one of its great advantages. That is only a half-truth, perhaps, in the same sense that you can say everything in a computer is a binary string. It is more accurate to say that you can hash anything that you can fully represent in some canonical fashion. In practice, you can hash strings, numbers and binary data, but keep in mind that this may require some transformation of your data, and I still have much to say about that. It is, in fact, the whole point of this post.

For now, let us just take a leaf from convention and accept that the easiest way to represent the data passed to a hash function is as a simple string.

What is definitely true is that we are not limited in the length of this representation: in theory, you could hash the whole contents of your computer’s memory, if you had a mind to, and the patience to go with it. But we don’t typically need anything that extreme. I’ll go over such details later; for now, let us start with the basics: how to choose a particular hash algorithm and use it in Python.

Hash Functions Available in Python

There are many well-known and standardized hash functions out there. It is common for libraries to provide a choice among these, and in Python that choice is conveniently given to us. To access the hashing functions in your program, include this at the top:

import hashlib

This gives you a list of options. In Python 3.7, you are guaranteed to have access to the following, but your particular implementation could offer more:

  • Obsolete hashes:
    • ‘md5’,
    • ‘sha1’
  • The SHA-2 series:
    • ‘sha224’,
    • ‘sha256’,
    • ‘sha384’,
    • ‘sha512’,
  • Blake hashes:
    • ‘blake2b’,
    • ‘blake2s’,
  • The SHA-3 series:
    • ‘sha3_224’,
    • ‘sha3_256’,
    • ‘sha3_384’,
    • ‘sha3_512’,
  • Shake hashes:
    • ‘shake_128’,
    • ‘shake_256’

You can obtain this guaranteed list, or the list of algorithms actually available on your system, with:

hashlib.algorithms_guaranteed
hashlib.algorithms_available
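
As a side note, and as a minimal sketch only: hashlib also offers the generic new() constructor, which builds a hash object from any of these names (the SHAKE functions additionally need an explicit output length when you ask for the digest):

import hashlib

# Build each guaranteed hash by name and print its digest of the empty input.
for name in sorted(hashlib.algorithms_guaranteed):
    h = hashlib.new(name, b"")
    if name.startswith("shake_"):
        print(name, h.hexdigest(16))   # SHAKE is extendable-output: choose a length
    else:
        print(name, h.hexdigest())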

The name within quotes is the actual function name you use to invoke the hash. For example, you can obtain the SHA256 hash of the empty string with:

hashlib.sha256()

Test Vectors

The first thing I do when I use a hash function for the first time in a given language or platform is to make sure I’m doing it correctly. That means two things:

  • ensuring I get the syntax right
  • verifying I can get the right result for a specific input

The second point is more important than it looks on the surface. It’s easy to mislead yourself into thinking you’re hashing an object properly, only to find that your mates receive a different result for the same supposed input. There are a few details to get right, and remember that if a single bit is out of place, the hash result will differ in such an unpredictable way that you will likely not be able to guess what went wrong. Getting that point right ensures both that you are using the correct algorithm and that you are encoding your input correctly.

Test vectors, lists of well-defined inputs and outputs, are fundamental tools for this. Currently, I most often use SHA256, so that will be my main example throughout. The test vectors published by NIST are available here. Another useful resource is this page by DI Management Services. Notice the bias towards representing data as strings. A further example, which shows up in a ZoKrates tutorial I previously learned from, provides another interesting known hash for SHA256: that of the integer 5.

It may look simple and trivial to you, but I assure you, the pain and headache I had with that single example was the ancestral source of all the knowledge I’ll share with you in this series.
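
To give a concrete picture of what reproducing a test vector looks like, here is a minimal sanity check I like to keep around. It assumes only the two well-known SHA-256 vectors for the empty input and for the ASCII string "abc" (don’t worry about the b prefix yet; I explain it further below):

import hashlib

# Two published SHA-256 test vectors: the empty input and the ASCII string "abc".
assert hashlib.sha256(b"").hexdigest() == \
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
assert hashlib.sha256(b"abc").hexdigest() == \
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
print("SHA-256 test vectors reproduced")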

Invoking Hash Functions

Let’s run a few experiments and explore how these functions work. I start with the empty string, the most obvious and expected test case:

hashlib.sha256()
<sha256 HASH object @ 0x009AAD28>

This simple call is not quite enough. It does compute the hash, but does not report it in any meaningful way. We could call the method digest to get the result, but I find the answer is neither compact nor immediately enlightening:

hashlib.sha256().digest()
b"\xe3\xb0\xc4B\x98\xfc\x1c\x14\x9a\xfb\xf4\xc8\x99o\xb9

There is too much noise in here. The result looks like a string, but it is in fact a byte array (notice the little ‘b’ at the start). The \x sequences are not real data; they just indicate that the next two characters are hex digits, which is a way to describe non-printable byte values. But it is easy to overlook the many interspersed bytes that are shown as their character value, which means the length of this printed string cannot even be easily related to the number of bytes in the array. What we really want to use instead is the hex digest, which can be read immediately and compared to other implementations, and whose length is always exactly double the number of bytes.

hashlib.sha256().hexdigest()
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

This is the right value, according to the DI Management test case. Good. Let’s now move to the next test case and hash some real data. Let’s try text first:

hashlib.sha256("abc").hexdigest()
TypeError: Unicode-objects must be encoded before hashing     

Hmmm. What about this?

hashlib.sha256(5).hexdigest()
TypeError: object supporting the buffer API required

What Python is telling us here is that it can’t hash objects with an arbitrary structure. Instead, it expects to receive an array of bytes, so we have to transform our input into this format beforehand. This, however, is not entirely straightforward. Let’s look at numbers first.

Hashing Numbers

In Python and a few other languages, we can’t directly hash numbers. There are many online hash generators, which are often more convenient to use, but they are also a veritable mixed bag of behaviour. Some will give you the option to choose from a lot of different algorithms, others won’t; some will expect to receive general textual strings, others require just hexadecimal strings; others will happily accept numbers as input, but will interpret them as strings instead of as values. For illustration, I will use this generator and specifically pick SHA256. Let us try to reproduce the ZoKrates example and hash 5. The output of that tutorial is:

c6481e22c5ff4164af680b8cfaa5e8ed3120eeff89c4f307c4a6faaae059ce10

But when I enter 5 in the generator, my result is:

ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d  

That is not quite the same. We can investigate in Python.

As I said above, we have to turn numbers into bytes before we can hash them. That should be more or less trivial: after all, a number can be naturally represented in bits, and bytes are just these bits grouped 8 by 8. For 5, that is easy: it can be represented in a single byte as 00000101. Let’s try that:

hashlib.sha256(bytes([5])).hexdigest()
'e77b9a9ae9e30b0dbdb6f510a264ef9de781501d7b6b92ae89eb059c5ab743db'

Not what we expected. Maybe we can try another way of encoding 5. In Python, we can apply the to_bytes method to a number. It requires two arguments: a length and an endianness option. The first describes how many bytes we want the encoding to have, and the second describes at which end of the array the least-significant byte goes. Let’s look at a few examples:

(5).to_bytes(4,'big')
b'\x00\x00\x00\x05'
(5).to_bytes(8,'big')
b'\x00\x00\x00\x00\x00\x00\x00\x05'

Bit-Endianness

The above two examples show how to encode 5 in 32- and 64-bit sizes, which correspond to standard integers and longs in languages like C, Java and C#. But notice that these place the least-significant byte in the right-most position, which means that the lowest byte occupies the highest index in the byte array. We can change the second argument to turn this around:

(5).to_bytes(4,'little')
b'\x05\x00\x00\x00'
(5).to_bytes(8,'little')
b'\x05\x00\x00\x00\x00\x00\x00\x00'

but although this reverses the order of bytes, keep in mind it does not reverse the order of bits, as the following example shows:

(1102000).to_bytes(3,'big')
b'\x10\xd0\xb0'
bin(0x10d0b0)
'0b100001101000010110000' 

(1102000).to_bytes(3,'little')
b'\xb0\xd0\x10'
bin(0xb0d010)
'0b101100001101000000010000'

The Correct Hash

Now, obviously all of this has an impact on the value of the hash, as the following examples demonstrate:

hashlib.sha256((5).to_bytes(1, 'big')).hexdigest()
'e77b9a9ae9e30b0dbdb6f510a264ef9de781501d7b6b92ae89eb059c5ab743db'
hashlib.sha256((5).to_bytes(4, 'big')).hexdigest()
'221f8af2372a95064f2ef7d7712216a9ab46e7ef98482fd237e106f83eaa7569'
hashlib.sha256((5).to_bytes(8, 'big')).hexdigest()
'5dee4dd60ff8d0ba9900fe91e90e0dcf65f0570d42c431f727d0300dd70dc431'

Now that we understand this, let’s have a stab at solving the problem and reproducing the hash from that tutorial. The authors of the example use a SHA256 function that expects to receive 512 bits of input. That is 64 bytes. We know we can encode 5 in a single byte, so we have to pad with 63 zero bytes. The only question is where to place the non-zero byte, as we have two choices. If we look at their post, we can see that they pass 4 numbers of 128 bits each, representing [0,0,0,5]. That is, the least-significant byte sits in the highest position of the array. This is a big-endian representation, and so we can try:

hashlib.sha256((5).to_bytes(64, 'big')).hexdigest()
'c6481e22c5ff4164af680b8cfaa5e8ed3120eeff89c4f307c4a6faaae059ce10'

That’s it, we’ve got it. This is the hash we wanted.

But if you were paying attention, there is still one unanswered question. What is that mysterious hash we received from the online generator? What can that value be, since it matches none of the hashes we have generated with Python so far? We passed the single value 5 as input, and though we tried byte representations of 5 in several lengths, none of them matched. Could it be that 5 is being interpreted as a string? Let’s run a simple experiment:

hashlib.sha256(b"5").hexdigest()
'ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d'

Bingo!

Hashing Strings

The previous example is the perfect segue for the next section. On the surface, the syntax is very simple, and it almost looks like we are hashing a simple string. But what about what I said earlier on, that we need to encode strings before hashing them?

That is what that small b is doing there. When a string literal is prefixed by b, it is not a string at all but a bytes object, holding the encoding of each of its characters. If you’re not very familiar with string encodings, you may think this should be the same as the transformations we did above, getting bytes from numbers, but that is not the case. We produce byte arrays in both cases, but the values are very different. Encoding a string returns an array of bytes containing the charset code of each character in the string. For standard ASCII characters, such as digits and Latin letters without diacritics, each symbol takes up exactly one byte:

[int(x) for x in b"5555"]
[53, 53, 53, 53]

By contrast, when we convert a number to bytes, we get an arithmetic representation of that number, effectively in base 256:

[int(x) for x in (5555).to_bytes(2, 'big')]
[21, 179]

With this out of the way, let’s talk a bit about string encodings.

String Encodings

A string is a sequence of characters, but in memory everything is just a sequence of bits. Characters are no exception, and have to be encoded in one or more bytes. One of the oldest encodings is ASCII, which assigns a 7-bit code to each character, which means you can’t have more than 128 characters in your alphabet. These include 32 non-printable characters, the digits, punctuation symbols and the upper- and lower-case letters of the English alphabet. This is a tight range, and other languages have larger alphabets, include diacritics or use totally different writing systems that are simply incompatible with ASCII. We therefore need other encodings.

In Python, we can convert a string to bytes according to a specific encoding using the encode() method:

"action".encode()

which is equivalent to the notation I showed above

b"action"

This uses the default encoding, UTF-8. In Anglo-centric communication, you may not worry much about the encoding type: most of the common codesets agree on the basic symbols, letters and numbers. For example, the following all return the same array of bytes:

[c for c in "action".encode()] 
[c for c in "action".encode("ascii")] 
[c for c in "action".encode("cp860")] 
[c for c in "action".encode("latin")] 
[c for c in "action".encode("utf8")] 

[97, 99, 116, 105, 111, 110]

But try the Portuguese version of this word instead, and you will immediately note the differences:

[c for c in "acção".encode()] 
[97, 99, 195, 167, 195, 163, 111] 
[c for c in "acção".encode("ascii")] 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128) 
[c for c in "acção".encode("cp860")] 
[97, 99, 135, 132, 111] 
[c for c in "acção".encode("latin")] 
[97, 99, 231, 227, 111] 
[c for c in "acção".encode("utf8")] 
[97, 99, 195, 167, 195, 163, 111]

Above, you can see one encoding that fails to handle the string, encodings that generate a different number of bytes, and encodings with the same length but different values for some characters. As a result, the hashes will all be different. Because of this, I prefer not to think of a hash’s input as a string at all: if the encoding is not well-defined, we cannot predict how the string will be turned into bytes.
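
To make the consequence concrete, here is a small sketch that hashes the same word under three of the encodings above. The byte arrays differ, so the digests differ as well (I omit the outputs; you can check that no two of them match):

import hashlib

word = "acção"
for encoding in ("utf8", "latin", "cp860"):
    data = word.encode(encoding)                        # different byte arrays...
    print(encoding, hashlib.sha256(data).hexdigest())   # ...and therefore different hashes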

This is more than just a nuisance. There are many moving parts when dealing with hashes, and when a hash verification fails, all you know is that at some point something was computed wrong. This often happens when two different applications, or parts thereof, have to compute the same hash from their own view of the data. There is nothing in the mismatched hash to tell you what went wrong: it could have been an extra whitespace, the wrong character encoding, or even a difference in format that no one noticed before hashing. But much of this goes away when, instead of a string, you pass an array of bytes and verify that those bytes are correct.

Avoiding Pitfalls When Using Hash Functions

Hash functions are an extremely useful tool. We often use them to provide an almost-guaranteed-to-be-unique identifier for a longer and more complex object. This could be a file, an object instance, the current state of a GitHub repo, or the digest of a message to be digitally signed. They are so pervasive, and so conveniently offered by programming languages, that it is easy to overlook the subtleties in their use.

In this post, I argued that you should, at all times, remember that a hash function receives and returns nothing but a series of bits. There is no high-level meaning for a hash function: to it, the string "5" and the integer 5 are two very different things. They mean nothing until you describe exactly how you want them encoded, in how many bytes and in what order.

The Main Rule

For your safety, always think in terms of bytes. This will also better equip you to think about hashing large objects or collections of objects. Of course, you need a convenient way to transform your original input into bytes, and here is another trap: you need to settle on a canonical encoding first. Suppose you want to hash a cryptographic key. Keys are often represented in a printable format, like hexadecimal or Base64. But don’t fall into the trap of hashing those literal strings!

Some time later, you may want to change the format in which your key is stored (for example, to a binary file), and then you’ll forever have to convert back to the original textual form before hashing and comparing again. It is better to take whatever representation you have, convert it to raw bytes first, and hash those. This way, you’ll never have to guess which particular format you chose to represent your data.

In Python, it is easy to create bytes from a string. For example, if you receive a hexadecimal description like

'3c266e73574b51665776eedf5f5bdd'

then you can retrieve the corresponding bytes with

bytes.fromhex('3c266e73574b51665776eedf5f5bdd')
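
As a quick sketch of why this matters, using that same illustrative hexadecimal value: hashing the printable hex string and hashing the bytes it represents give two completely unrelated digests.

import hashlib

hex_repr = '3c266e73574b51665776eedf5f5bdd'

# Hashing the printable representation: 30 ASCII characters, i.e. 30 bytes.
print(hashlib.sha256(hex_repr.encode('ascii')).hexdigest())

# Hashing the underlying data: the 15 raw bytes the string describes.
print(hashlib.sha256(bytes.fromhex(hex_repr)).hexdigest())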

Encoding Complex Objects

The same applies to objects you want to serialize. You may not want to depend on the system serialization, which is in turn dependent on the language and operating system you use, and may be tied to a specific word size and endianness.

Instead, you might prefer to convert the object to JSON and hash that string. But once again, remember that spaces matter, and newline separators do as well. Plus, newlines are encoded differently on Windows, Mac and Linux systems (as I have explained before), so make sure your project has a clear definition of the canonical representation, at the byte level, before starting development and banging your head against the wall because of those pesky hash functions.
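
As one possible convention, and a sketch rather than a standard: serialize to JSON with sorted keys, no insignificant whitespace and an explicit UTF-8 encoding, so that every party produces exactly the same bytes.

import hashlib
import json

def canonical_hash(obj):
    # One possible canonical form: sorted keys, no extra whitespace, UTF-8 bytes.
    data = json.dumps(obj, sort_keys=True, separators=(',', ':')).encode('utf-8')
    return hashlib.sha256(data).hexdigest()

print(canonical_hash({"b": 2, "a": 1}))
print(canonical_hash({"a": 1, "b": 2}))   # same digest: key order no longer matters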

Hash Function Naming

Another issue to be aware of is that there are many hash functions, and their names can be confusing. For example, there have been several iterations of the SHA (Secure Hash Algorithm) family: SHA-0, SHA-1, SHA-2 and SHA-3. SHA-0 and SHA-1 are old and by now considered insecure, yet SHA-1 is still available in Python, as we saw above.

The current recommendation is to use SHA-2 or SHA-3, but note that each of these names refers to a family of algorithms, characterized by different output lengths: 224, 256, 384 and 512 bits. When referring to SHA-2, the version number is normally omitted, since it is the recommended default. Therefore, we get names like 'sha256' to identify the SHA-2 variant and 'sha3_256' for the SHA-3 version of the same size. Also note that the SHA-2 standard defines two variants for each of the 224 and 256 output lengths (SHA-224 and SHA-512/224, SHA-256 and SHA-512/256), but only one of them may be available in your language.

To make matters worse, the SHA-3 function was chosen in an open competition promoted by NIST to design a new hash function. The winner was Keccak, which is a broad family of algorithms of which SHA-3 is a subset. But the standardized SHA-3 is a specific version of Keccak, with some changes relative to the original specification (see here and here). This throws a lot of confusion into the air, with some functions called 'sha3' that do not behave like the standardized version. Case in point: the hash of the empty input differs between the two:

In Ethereum: 
------------
keccak256("")
0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470

In Python (and the NIST standard): 
----------------------------------
sha3_256()
0xa7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a
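
To reproduce both values in Python, you can use hashlib for the standardized SHA-3 and a third-party library for the original Keccak. The sketch below assumes the pycryptodome package is installed (one of several libraries that expose Keccak); the expected outputs are the two values shown above.

import hashlib
from Crypto.Hash import keccak   # provided by the pycryptodome package

# Standardized SHA3-256 of the empty input.
print(hashlib.sha3_256(b"").hexdigest())

# Original Keccak-256 of the empty input, as used by Ethereum.
k = keccak.new(digest_bits=256)
k.update(b"")
print(k.hexdigest())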

Parting Thoughts

As a general rule, I take the following approach, and recommend you do the same:

  • Always start by finding an authoritative list of test cases for your chosen hash function, including the empty-input case.
  • Never assume you can hash until you can reproduce the empty-input test case. This ensures you have the right function.
  • Follow up by reproducing one of the test cases with non-empty input. If you can, this ensures you can actually encode the data correctly.
  • Specify unambiguously the encoding of each type of data you want to hash: this includes numbers (how many bytes, what endianness, how many bits for mantissa and exponent, if applicable), strings (charset encoding) and general objects (field separators, treatment of whitespace, encoding of newline characters, etc.).

Preferably, and despite official test vectors often representing inputs as strings, think of your input as a sequence of bytes. This is particularly important when dealing with binary data, like cryptographic keys, ciphertexts or just data files. These are often encoded as printable strings, by reducing the data to hexadecimal or Base64. But hashing those strings is not equivalent to hashing the actual objects. In these cases, we typically want to hash the actual bytes, not the encoding of a textual representation of those bytes. So, I add a last recommendation:

  • Never hash hex, binary or Base64 strings: convert them to bytes first and hash the bytes instead.

I hope this has helped clear some common problems with hashing. If you have any doubts or feel that I should have mentioned something else, please write so in the comments and I will reply.

Meanwhile, in the next post I will talk a bit about how to hash in Solidity and web3.js. Until then, have fun exploring and coding.
