A few days ago, I was writing a small parser in C++ for a ZoKrates output file and ran into an unexpected time-killer: my parser failed to read the following section of the file (contents changed for brevity):
B: [[BigNumber1, BigNumber2],
[BigNumber3, BigNumber4]],
In reality, these BigNumbers were 254-bit integers represented in hexadecimal, but treating them as short words does not change the example.
This line break is produced by ZoKrates, but it is unneeded and is probably there by mistake. All the other lines follow the same format without the break, so I decided to read those two lines, concatenate them, and print the result to the console. I expected to see this:
B: [[BigNumber1, BigNumber2], [BigNumber3, BigNumber4]],
Instead, I got something like this:
[[BigNumber1, BigNumber2], [BigNumber3, BigNumber4]],
I was confused: why was I losing the first line? I did several prints to make sure I had read both lines, that I had kept each in its own variable, and that the string concatenation functions were working properly (I tried several different methods, from cout to plain old strcat). I always got strange results; somehow it seemed as if the second line had negative spaces, or something that wrote the second line over the start of the first one.
This turned out to be a bigger time-killer than I expected, and the source was one of those facts of the C language that may well be obscure by now.
I have to tell you about my setup, because that is the source of all these woes. Astute readers may have picked up on the cause already, but I’ll fill you in.
I was compiling my code on Ubuntu with g++, inside a Docker container running on my Windows host. On top of that, I had mapped a folder in Windows onto a directory in the Ubuntu container so that editing the files would be much easier.
The key here is the Windows-Linux interface: the two systems represent the newline character differently, and Docker does its own thing when mapping files from one to the other.
First of all, before Unicode existed, characters in computers were represented in ASCII, which, being an American standard, didn't care about diacritics and so could fit all its character codes in 7 bits. This encoding assigns a numeric value to every character in the alphabet, and indeed in C there is not even a separate byte type: a char is just a small integer. To this day, even in C#, my favourite way of coding a Caesar cipher is something like this:
void Main()
{
    string s = "Caesar";
    string r = "";
    int n = 3;                         // shift amount
    for (int i = 0; i < s.Length; i++) {
        byte c = (byte) s[i];          // a character is just its numeric code...
        r = r + (char) (c + n);        // ...so shifting it is plain integer arithmetic
    }
    Console.Write(r);
}
On top of the ASCII code, larger code pages were built that could accommodate the different requirements of each language, either by including diacritics or by changing the script completely. These were 8-bit pages that included the ASCII 7-bit definitions and placed the new characters in the extended range of values from 128 to 255. This legacy remains even today in Unicode, where the first 128 codes are still the old ASCII ones.
The ASCII codeset was designed with teleprinters in mind and has a number of characters that are not printable, called control characters. Therein we find some famous ones, like the horizontal tab (\t) and the null character (\0), and some oddities like the bell or alarm character (\a in C, which actually made the machine beep), device control codes, and transmission synchronization commands.
Among these, two stand out for this post:
Code 10: Line Feed (LF)
Code 13: Carriage Return (CR)
These are deeply tied to the roots of the code. Computer terminals were a bit like typewriters, where a paper roll would move the sheet up line by line and a carriage would move the paper horizontally past the typebars as the user typed. At the end of a line, the carriage had to be returned to its starting horizontal position and the paper had to be rolled up one line: two separate mechanical actions, hence two control characters.
These two control characters were adopted as the encoding for a single 'newline' action on Windows machines, where they appear in the order CR LF. In other words, if you look at the byte representation of a plain text file, you will see the bytes 0x0D 0x0A at the end of each line. In C-derived languages a newline is usually written as the escape sequence '\n', but strictly speaking '\n' stands for LF alone; CR has its own escape, '\r'. On Windows it is the text-mode translation in the runtime library that turns a single '\n' into the CR LF pair, so the two escapes are best kept distinct (or better yet, use the OS-independent newline constant your language provides).
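To see this concretely, here is a minimal sketch in C++ (the language of my parser; the file name newline_demo.txt is just an illustration). It writes '\n' through a regular text-mode stream and then reads the file back in binary mode, so you can see exactly which bytes ended up on disk: 0x0D 0x0A at the end of the line on Windows, a lone 0x0A on Linux.

#include <cstdio>
#include <fstream>

int main()
{
    {
        // Text mode (the default): on Windows the runtime expands '\n' to CR LF.
        std::ofstream out("newline_demo.txt");
        out << "one line\n";
    }

    // Binary mode: no translation, so we see the raw bytes on disk.
    std::ifstream in("newline_demo.txt", std::ios::binary);
    char c;
    while (in.get(c))
        std::printf("0x%02X ", static_cast<unsigned char>(c));
    std::printf("\n");
}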
Only, I wouldn't be writing this post if things were so simple. See, on Linux and Unix systems the same newline action is represented by just one of these characters, namely LF. And on the classic Mac OS, for some obscure reason, by the single character CR. That means we have a problem at the interface whenever we want to use files from one system on another.
Text editors nowadays are smart enough to offer conversion between these encodings, and so in my Windows Notepad++ editor I was using Unix-encoded files. But Docker, on mapping the Windows files onto Linux, converted all the newlines back to the CRLF encoding. (Admittedly, it does tell you that when the image is started, but the consequences didn't occur to me just then.)
So, to wrap up, what was happening with my C++ parser?
Well, the instruction to read a line from the file reads the string up to the newline character and removes it. Said newline character is the Linux newline, i.e. LF. What comes before it is therefore a string with a CR at the end, which, when printed, is dutifully interpreted to mean 'print whatever is in the string, then bring the next printing position back to the beginning of the line, but (crucially) don't move to another line'.
When I only printed that line, I could not tell this was happening: the next printing position is never visible in plain output. But when I printed another string after it, the latter would start at the beginning of the line, simply overwriting what was there before.
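Here is a minimal sketch of the effect using std::getline (not necessarily the exact call my parser used, but it behaves the same way), with placeholder strings standing in for the real ZoKrates output. On Linux, getline strips only the LF, so each line keeps a hidden trailing CR, and printing the concatenation makes the second part overwrite the first.

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // Simulate a CRLF-encoded ("Windows") file being read on Linux.
    std::istringstream file("B: [[BigNumber1, BigNumber2],\r\n"
                            "[BigNumber3, BigNumber4]],\r\n");

    std::string first, second;
    std::getline(file, first);    // removes the '\n' but leaves the '\r' at the end
    std::getline(file, second);

    // The hidden CR sends the cursor back to column 0,
    // so 'second' is printed over the start of 'first'.
    std::cout << first + second << std::endl;
}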
This ended up being a trip back in time, to a period when problems like this were rife, and I feel like they should belong to a different era. Sadly, for some low-level functions you still have to think in low-level terms and know about minutiae that have mostly been hidden away by the abstractions of modern programming languages.
The solution to the problem, you may ask? In the end, a simple trim() to remove all trailing whitespace.
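Standard C++ has no trim() of its own, so the helper below is just a sketch of the kind of function I mean (trim_trailing is a made-up name, not a library call): it drops any trailing whitespace, the stray CR included, from a line after it has been read.

#include <string>

// Remove trailing whitespace, including the '\r' left behind by a CRLF line ending.
void trim_trailing(std::string& s)
{
    const auto last = s.find_last_not_of(" \t\r\n");
    s.erase(last == std::string::npos ? 0 : last + 1);
}

Calling it right after every line is read keeps the rest of the parser blissfully unaware of which operating system produced the file.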
Keep coding and have fun.
Until the next time.