Notes

2015-10-20

I would like to share some thoughts I had about tagged data. Sum types and such are getting really common in mainstream languages, they are usually implemented using what I'll call a tag field. A tag is a small number that indicates what is the kind of the data considered. For example, in C, we often use the following idiom to mimic sum types:

struct Fruit {
	enum {
		FruitNone, /* invalid fruit */
		FruitApple,
		FruitBanana,
		FruitCherry,
	} kind;
	union {
		struct Apple apple;
		struct Banana banana;
		struct Cherry cherry;
	} u;
};

Like in the example, most of the programming languages represent the tag using a machine integer. However, if you are programming with high space constraints it might be worth thinking more about the appropriate representation of the tag field.

Let's say that you want a tagged structure that can be either one integer of type A in the range [0-255] or integers of type B, C, and D that lie in the range [0-63], then, because there are 4 different cases, the tag field needs to be at least 2 bits long, and because the type A needs at least 8 bits (8 bits represent at most 2⁸ = 256 values), the whole tagged data-structure would need at least 8+2 = 10 bits for storage.

The point is now that the whole reasoning above assumes that the tag bit-width is constant, but in the above case it is actually smarter to have a variable bit-width for the tag. Here is why: if we use the tag 0 to represent the kind of data A, and 3, 5, 7 to represent respectively the kinds B, C, and D, then we use only 1 bit for the tag in the A case and 3 bits in all the other cases. So instead of using 10 bits of storage as above we use max(8+1,6+3) = 9 bits! (Note that the above works because 3, 5, and 7 have 1 as least significant bit, so they are never confused with the A kind.)