binary, one, cyborg

Working with Strings in Solidity

It feels like a long time since I started this blog, and until now it has been dedicated to writing about statistics. I don’t particularly enjoy this monotony; it can quickly become boring and tiresome. Moreover, this blog was aimed from the start at probing two different areas, and after so long dedicated to one of them, I have been feeling the urge to write about the other.

And so, today I give you a post about programming for the Ethereum blockchain using the Solidity language. I won’t follow any plan in doing this: my objective is only to write about my obstacles in learning this language and the practical difficulties I encounter in my daily work.

I want the freedom to write about any topic without having first to introduce preliminary material, as I’d have to do if I were writing a textbook. If you notice me talking about things I have not explained before, that is by design. Leave me a comment below and I’ll come back to them in a later post.

 

Basic access

Today, I want to talk about strings in Solidity. At first sight, Solidity’s syntax resembles JavaScript and other C-like languages, so a newcomer with a grounding in any of these widespread languages can quickly grasp what a Solidity program does. Nevertheless, the devil is in the proverbial details, and Solidity hides plenty of unforeseen difficulties there. Such is the case with the type string and the related type bytes.

Both of these are dynamic array types, which means that they can store data of an arbitrary size. Each element of a variable of type bytes is, unsurprisingly, a single byte. A variable of type string holds the characters of the string, stored as UTF-8 encoded bytes. So far so good, but first appearances are deceiving. Someone coming from other languages might expect the string type to provide several useful functions, like:

  • determining the string’s length
  • reading or changing the character at a given location in the string
  • joining two strings
  • extracting part of a string

Bad news: Solidity’s string does none of this, at least in the 0.4 series of the compiler used in this post! If we need any of the above, we have to do it manually.

So, let’s explore some of these difficulties and see what we can do about them. I open Remix and type the following code in a new file called string.sol.

pragma solidity ^0.4.24;
contract String {
    string store = "abcdef";
    
    function getStore() public view returns (string) {
        return store;
    }
    
    function setStore(string _value) public {
        store = _value;
    }
}

The right side of the screen, in Remix, is taken by the developer’s area. In the Compile tab, I check the Auto-Compile option, so that Remix will notify me of errors and code-analysis warnings as I write my code. The static code-analysis is controlled by the options in the tab Analysis, and I usually have all options selected.


In the current case, Remix will report 2 warnings of the same kind: the methods I have written can potentially have a high to infinite gas cost. I will ignore that in this post.

The above contract is minimal. It defines a state variable store of type string, a method to set it and a method to get it. Let’s test it.

In the Run tab, I hit Deploy and if there are no problems with the contract, a new area will appear below that button with the address where the contract is located and the functions that are available.

Below the working area, Remix shows a detailed record of the transaction’s result. Initially, it shows only a line indicating the account that deployed the contract, the contract and method that was called, i.e. String.(constructor), and how much Ether was passed to the execution. This amount is shown in Wei, the smallest unit of Ether, corresponding to 10^-18 Ether. We can expand the record by clicking on its header, revealing logs, execution and transaction costs, available gas, final result, etc.

At this point, I just want to press the getStore button on the right and notice how the result appears beneath it:

0: string: abcdef

Likewise, there is a new transaction log on the left and by clicking it we can see

{
    "0": "string: abcdef"
}

in the decoded output. All is well.

Now, I type “0123456789” in the textbox to the right of setStore and hit that button. Then I call getStore again and I receive that string. Thumbs up, we can do basic storage/retrieval with strings!

Let’s now go for more interesting things.

Creating new strings: data location

So far, I have accessed a string literal and we have seen how we can change it by assigning to it. But that is only a very coarse way of dealing with strings. Let us create a string character by character. This will introduce us to one peculiarity of Solidity programming: data location.

I create a new method that only returns a new string with 3 specific characters: “Abc”.

    function createString() public returns (string) {
        string newString = new string(3);
        newString[0] = "A";
        newString[1] = "b";
        newString[2] = "c";
        return newString;
    }

This is a well-intentioned effort, but it does not work. Remix is kind enough to immediately point out 4 errors and 1 warning:

  • 2 of these are on the same line: string newString = new string(3);
    • Warning: Variable is declared as a storage pointer. Use an explicit “storage” keyword to silence this warning.
    • TypeError: Type string memory is not implicitly convertible to expected type string storage pointer
  • The other three occur in the following lines, e.g. newString[0] = "A"; and are all of the same type:
    • TypeError: Index access for string is not possible.

To understand the first issue, I have to tell you about data location. Writing to the blockchain is very expensive. Every node that runs the transaction has to do the same writing, which makes the transaction more expensive and the blockchain bigger. When a node downloads a block containing this transaction, it will incur larger storage costs because of this writing. In Ethereum, every transaction has an associated cost, called gas, to incentivise programmers to be as economical as possible.

When writing a contract, authors have a choice of where to keep their data. memory is cheap, i.e. it costs relatively little gas, but the data are volatile and lost after a function finishes executing. storage is the most expensive, and is absolutely needed for contract state, which must persist from function call to function call. There is also a calldata location: a read-only area that holds the arguments of a call made to an external function. This is the cheapest location to use, since the arguments are read in place instead of being copied to memory, but it cannot be modified. Separately from these three locations, the EVM keeps small local values on a stack in which only the 16 topmost slots are directly reachable. In particular, that means that functions may be limited in their number of arguments and local variables.

Every data type has a default location. This is from the documentation:

Forced data location:

  • parameters (not return) of external functions: calldata
  • state variables: storage

Default data location:

  • parameters (also return) of functions: memory
  • all other local variables: storage

Notice the subtlety: function parameters are by default stored in memory, except if the function is external, in which case they stay in calldata. A calldata array or string takes up two stack slots (its offset and its length) rather than the single slot of a memory pointer. This means that a function that is perfectly alright when public can suddenly run into the compiler’s “Stack too deep” error, as if it had too many arguments, when made external.

Now, let’s come back to our code and examine the line

string newString = new string(3);

This is a local variable inside the function, and so by default it is in storage. The new keyword is used to specify the initial size of a memory dynamic array. Memory arrays cannot be resized. On the other hand, we can change the size of a storage dynamic array by changing its length property, but can’t use new with them.
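The difference between the two kinds of dynamic array can be sketched with a small contract. This is only an illustration: the contract ArrayLocations and its members are hypothetical names, and it assumes the 0.4 compiler used in this post, which still accepts assignments to length.

```solidity
pragma solidity ^0.4.24;

// Hypothetical contract, only meant to contrast storage and memory arrays.
contract ArrayLocations {
    // State variables are always in storage.
    uint[] numbers;

    // A storage dynamic array is resized through its length property.
    function growStorage(uint _newLength) public {
        numbers.length = _newLength;
    }

    // A memory dynamic array gets its size once, from the new keyword.
    function makeMemory(uint _size) public pure returns (uint[]) {
        uint[] memory temp = new uint[](_size);
        // temp.length = _size + 1; // would not compile: memory arrays cannot be resized
        return temp;
    }
}
```

Calling makeMemory(3) should return an array of three zeros, since memory arrays start out zero-initialised.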

This is the source of our error. In this case, all we want to do with this string is create it and return it to the outside. Let the outside world decide what to do with it, whether it is wholly temporary or important enough to persist on the blockchain. For this example, the storage is not important, and the string will be created in memory. To do that, we add the memory keyword in the declaration, like this:

string memory newString = new string(3);

Direct access to strings: equivalence with bytes

Let’s now look at the second kind of error. This is simple and unavoidable: Solidity does not currently allow index access to strings. From the FAQ:

string is basically identical to bytes only that it is assumed to hold the UTF-8 encoding of a real string. Since string stores the data in UTF-8 encoding it is quite expensive to compute the number of characters in the string (the encoding of some characters takes more than a single byte). Because of that,

string s; s.length;

is not yet supported and not even index access s[2]

The alternative is to first transform the string into bytes, and then access it directly. This works because string is an array type, albeit with some restrictions.

But there is a trap to watch out for. bytes stores raw data; string stores UTF-8 characters. The following code does not always return the number of characters in _s:

// THIS CODE HAS AN ERROR: it returns the number of bytes, not of characters
function getStringLength(string _s) public pure returns (uint) {
    bytes memory bs = bytes(_s);
    return bs.length;
}

The problem here occurs if _s contains any character that takes more than 1 byte to represent in UTF-8. In that case, the function returns the length of the byte representation of the input string, which is greater than the number of characters.

This also has an impact when trying to address a particular character of the string, as we cannot predict at which location the character’s bytes will be. We have to parse the string linearly, identifying any multi-byte character, or else make sure we restrict our input to characters of fixed length. If we work exclusively with ASCII strings, for example, we’ll be safe.
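As a rough sketch of what such a linear parse could look like, the function below counts characters by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx. The names Utf8Counter and utf8Length are mine, not part of Solidity, and the code assumes the input is valid UTF-8. Strictly speaking, it counts code points, which is not always what a reader would call a character (think of combining accents).

```solidity
pragma solidity ^0.4.24;

// Hypothetical helper contract, not a library that ships with Solidity.
contract Utf8Counter {
    // Counts UTF-8 code points, assuming _s is validly encoded.
    function utf8Length(string _s) public pure returns (uint count) {
        bytes memory bs = bytes(_s);
        for (uint i = 0; i < bs.length; i++) {
            // Continuation bytes look like 10xxxxxx; every other byte
            // starts a new code point.
            if ((uint8(bs[i]) & 0xC0) != 0x80) {
                count++;
            }
        }
    }
}
```

The price is a loop over the whole string, so the gas cost grows with its length. This is the kind of expense the FAQ alludes to.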

Returning to our previous function, this works:

function createString() public pure returns (string) {
    string memory newString = new string(3);
    bytes memory byteString = bytes(newString);
    byteString[0] = "A";
    byteString[1] = "b";
    byteString[2] = "c";        

    return string(byteString);
}
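The same byte-level approach covers another of the operations from the list at the start of this post: joining two strings. Here is a minimal sketch, with the hypothetical names StringJoiner and concat. Because it only copies whole strings, it never splits a multi-byte character.

```solidity
pragma solidity ^0.4.24;

// Hypothetical helper contract showing manual concatenation.
contract StringJoiner {
    function concat(string _a, string _b) public pure returns (string) {
        bytes memory a = bytes(_a);
        bytes memory b = bytes(_b);
        // The result needs room for the bytes of both inputs.
        bytes memory joined = new bytes(a.length + b.length);
        uint i;
        // Copy the bytes of the first string...
        for (i = 0; i < a.length; i++) {
            joined[i] = a[i];
        }
        // ...followed by the bytes of the second one.
        for (i = 0; i < b.length; i++) {
            joined[a.length + i] = b[i];
        }
        return string(joined);
    }
}
```

Copying byte by byte is simple but not cheap in gas for long strings. I chose it here for clarity rather than efficiency.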

By contrast, the following code, which tries to set the third character of a string to X, misbehaves when it receives multi-byte characters.

// THIS CODE HAS AN ERROR: index 2 is the third byte, not the third character
function makeThirdCharacterX(string _s) public pure returns (string) {
    bytes memory byteString = bytes(_s);
    byteString[2] = "X";
    return string(byteString);
}

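This returns “AbXdef” for an input of “Abcdef”, but for an input of “€bÁnç!” the result displays as “XbÁnç!”. The € sign takes three bytes in UTF-8 (0xE2 0x82 0xAC), so index 2 points at its last byte: writing X there destroys the € and leaves the actual third character, Á, untouched.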
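One way out, if we are willing to restrict the input as discussed earlier, is to reject anything that is not plain ASCII before touching the bytes. The following is a sketch under that assumption; the contract AsciiOnly and the function makeThirdCharacterXAscii are names I made up for the example.

```solidity
pragma solidity ^0.4.24;

// Hypothetical variant that only accepts ASCII input.
contract AsciiOnly {
    function makeThirdCharacterXAscii(string _s) public pure returns (string) {
        bytes memory byteString = bytes(_s);
        // There must be a third character to replace.
        require(byteString.length >= 3);
        // ASCII characters are exactly the bytes below 0x80, so in an
        // ASCII-only string byte indexes and character indexes agree.
        for (uint i = 0; i < byteString.length; i++) {
            require(uint8(byteString[i]) < 0x80);
        }
        byteString[2] = "X";
        return string(byteString);
    }
}
```

With this version, “Abcdef” still becomes “AbXdef”, while “€bÁnç!” makes the call revert instead of returning a corrupted string.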

Conclusion

There are still many more things that can be said about this topic, but this is a long enough post already, so I’ll wrap up. The key concept regarding the type string is that it is an array of UTF-8 encoded bytes and can be seamlessly converted to bytes. This is the only way of manipulating the string at all. But it is important to note that UTF-8 characters do not match bytes one to one. The conversion in either direction will be accurate, but there is no immediate relation between a byte index and the corresponding character index.

For most things, there may be an advantage in representing the string directly as the type bytes (avoiding conversions), while being very careful with characters that UTF-8 encodes in more than one byte.

That’s enough for now. See you another day, with more steps in this coding adventure.

 
