# Notes

These are my personal opinions on various subjects, they are sorted in chronological order. I tried to keep them short. If you have any comments, I would be glad that you drop me a mail.

### January 15, 2019

It has been forever since my last note! I hereby take it as new year's resolution to share more about the interesting thought nuggets I come across.

In computer hardware, arithmetic primitives are in fact modular primitives; this means that the result of `a+b` is defined as the mathematical sum of `a` and `b` modulo a power of 2 (usually, 2³² or 2⁶⁴). The mathematical structure obtained is a modular ring in which all operations "wrap around". It is only a ring and not a field because the multiplication operation does not have an inverse; to see why an inverse function for multiplication cannot be defined, observe that 42 is the result of both 21×2 and (2³¹+21)×2!

An elementary result equivalent to Bézout's identity is that any number co-prime with the modulus has a multiplicative inverse. In a computer, this means that for any odd number `a`, a number `b` exists shuch that `a✕b = 1`. Now there is a particularly cute algorithm to find multiplicative inverses, and it is what motivated me to write this note.

The core idea of the algorithm lies in an observation I found quite surprising: a multiplicative inverse `b` of `a` modulo 2ⁿ can be used to find an inverse `c` of `a` modulo 2²ⁿ. Let's see how.

```ab = 1 (mod 2ⁿ)
⇒ ab - 1 = k2ⁿ
⇒ (ab - 1)² = 0      (mod 2²ⁿ)
⇒ a²b² - 2ab + 1 = 0 (mod 2²ⁿ)
⇒ 2ab - a²b² = 1     (mod 2²ⁿ)
⇒ a(2b - ab²) = 1    (mod 2²ⁿ)
⇒ c = 2b - ab²       (mod 2²ⁿ)
```

This gives us the inductive step of the algorithm; to bootstrap it we simply need to find an inverse `b` of `a` modulo some small power of 2. To that end, one can check that all odd numbers are their own inverse modulo 4 (and 8). Putting it together in C.

```uint32_t mulinv(uint32_t a) {
uint32_t b = a;   /* 1/a mod 2² */
b *= 2 - a*b;     /* 1/a mod 2⁴ */
b *= 2 - a*b;     /* 1/a mod 2⁸ */
b *= 2 - a*b;     /* 1/a mod 2¹⁶ */
b *= 2 - a*b;     /* 1/a mod 2³² */
return b;
}
```

For those really eager for performance, it turns out that `b = 3*a ^ 2` is a cute initialization by Peter Montgomery which yields an inverse of `a` modulo 2⁵ and allows to cut one refinement step!

### February 15, 2017

Undefined behavior in C is a common source of bugs, and sometimes, of funny ones. Here is my story about one. A few months ago I was working on a function that looked like this.

```for (i=0; arr[i]!=0 && i<2; ++i) {
/* do some work */
}
```

When my program was compiled without optimizations it would behave correctly, but turning optimizations on made it incorrect. I was able to quickly pinpoint the above loop as the root of my issues, but the symptom of the bug was quite unusual: Stepping in the debugger revealed that the loop body was only executed once, when it should have been running twice! Indeed, the array `arr` contained two non-zero values.

After a little bit of head scratching, I eventually realized what the compiler was doing. The variable `arr` was declared as `int arr[2]`, so accessing its third element is undefined behavior. Because of this, a valid program cannot access `arr[2]`; but if the loop body is run twice, the test condition will check `arr[2]!=0` at the end of the second iteration. The consequence of this reasoning is that, assuming the input program is valid, the second loop iteration will not be run and can be gotten rid of!

I thought this was quite a remarkable use of undefined behavior: Invalid array accesses do not happen in valid programs, so if the compiler can prove such access happens in a branch, it means that the branch is dead code.

### February 9, 2017

The φ nodes of SSA form are often defined as taking one argument per predecessor block. In both my compiler and LLVM, the data structures are based on this definition. However, with some more experience, I think it is more sensible to require φ arguments to be linked with incoming edges, not predecessors. I will give two practical examples to prove this point, but first, let me clarify the concept of edge.

In a control flow graph (CFG), I call edge a link between two basic blocks. An essential point is that edges to a block are not in one-to-one correspondence with the block's predecessors. This is because of jumps like ```if(%cond) goto @b else @b```, where a conditional jump jumps to the same block in two (or more) cases. While such a jump may look contrived, it can result from optimizations, or even from lowering C-style `switch` statements where multiple cases have the same body.

The program below will serve to illustrate my claim.

```@src
%var = ....
%cond = ...
if(%cond) goto @dst else @dst
@dst
%foo = φ(@src %var, ...)
```

Following the usual requirement, the φ node `%foo` has one argument for the `@src` predecessor block. Assume now that the predecessor is dead code and gets deleted by this natural deletion code.

```void blockdel(Block block) {
/* do stuff... */
foreach succ in block.successors {
foreach phi in succ.phis {
phi.remove_arg(block)
}
}
}
```

Stepping through this function reveals that the `remove_arg` method will erroneously be called twice. In my compiler, the second invocation caused an assertion failure. On the other hand, if φ nodes have one argument per incoming edge, the loop above is naturally changed to some correct code.

Another case where things could go wrong is in a CFG simplification pass. Such a pass will likely remove spurious empty blocks and adjust links elsewhere in consequence. But consider the following program.

```@blk0
...
if(%cond) goto @blk1 else @blk2
@blk1
jmp @blk3
@blk2
jmp @blk3
@blk3
%foo = φ(@blk1 %v1, @blk2 %v2)
```

The simplification pass will remove both `@blk1` and `@blk2`. After deleting `@blk1` the block of the first φ argument will be updated to `@blk0`. Now if the same thing happens to the second empty block, two arguments of the φ node will bogusly end up with the same incoming block.

Solving these issues in my compiler required careful logic to handle the "multiplicity" of predecessors. However I think the edge-based design is better, and I might adopt it in the future.

### January 5, 2016

Because of a misuse of an archiving command I had to delete more than 40,000 files polluting my home directory. The UNIX way should make it easy because the decompression program can output the list of files it extracted. Unfortunately, it turned out to be surprisingly hard for at least two reasons: first 40,000 files is a lot and second, files used funny characters (white-spaces, dashes at the beginning, patheseses, etc.). After some Googling, here is the final command I used, phew.

```tr \\n \\0 <list.txt | xargs -0 -n 1 rm -f --
```

### November 13, 2015

While browsing information about compilers I found it difficult to find a formal definition of SSA form (Static Single Assignment). For whatever reason, it's often under-specified or simplifed. So anyway, if you are like me and like formal definitions, here it is.

A program is in SSA form when:

• Variables are assigned only once.
• All the uses of a variable are dominated by the unique definition.
• The φ nodes are at the beginning basic blocks.
• The φ nodes have one argument for each predecessor of their basic block.
• Uses of φ arguments are counted at the end of the corresponding predecessor.

The concept of domination is from graph theory: an instruction A dominates another B when all the paths from the entry point of the program to B go through A.

Please let me know if the definition above looks wrong to you. Finally, if you want informal examples, the Wikipedia page has plenty!

### October 20, 2015

I would like to share some thoughts I had about tagged data. Sum types and such are getting really common in mainstream languages, they are usually implemented using what I'll call a tag field. A tag is a small number that indicates what is the kind of the data considered. For example, in C, we often use the following idiom to mimic sum types:

```struct Fruit {
enum {
FruitNone, /* invalid fruit */
FruitApple,
FruitBanana,
FruitCherry,
} kind;
union {
struct Apple apple;
struct Banana banana;
struct Cherry cherry;
} u;
};
```

Like in the example, most of the programming languages represent the tag using a machine integer. However, if you are programming with high space constraints it might be worth thinking more about the appropriate representation of the tag field.

Let's say that you want a tagged structure that can be either one integer of type A in the range [0-255] or integers of type B, C, and D that lie in the range [0-63], then, because there are 4 different cases, the tag field needs to be at least 2 bits long, and because the type A needs at least 8 bits (8 bits represent at most 28 = 256 values), the whole tagged data-structure would need at least 8+2 = 10 bits for storage.

The point is now that the whole reasoning above assumes that the tag bit-width is constant, but in the above case it is actually smarter to have a variable bit-width for the tag. Here is why: if we use the tag 0 to represent the kind of data A, and 3, 5, 7 to represent respectively the kinds B, C, and D, then we use only 1 bit for the tag in the A case and 3 bits in all the other cases. So instead of using 10 bits of storage as above we use max(8+1,6+3) = 9 bits! (Note that the above works because 3, 5, and 7 have 1 as least significant bit, so they are never confused with the A kind.)

### September 19, 2015

The x64 architecture is better than i386 in many regards but one surprise it reserved to programmers is that immediates remain 32 bits constants instead of being promoted to 64 bits like everything else. In case you do not know it already, an immediate is a constant in an instruction, for example, if you want to add N to the register rax, you would write `addq \$N, %rax`. And in that case, we say that N is an immediate.

Because immediates are restricted to be only 32 bits long, it is impossible to add 233 to rax using one instruction only. One has to load the big constant in another register first (say rbx), and then perform the addition, like so: `addq %rbx, %rax`. This brings us to the following question, how does one load a constant into a 64 bits register? The answer to this is simple, if you do not care about the code size. But if you do, there are actually three different ways!

First, like most 64 bits instruction sets that allow operating 32 bits sub-registers, x64 zeroes the high bits of the larger parent 64 bits register for any 32 bits operation. (If you wonder why, lookup data-dependencies in computer architecture books.) So for instance, when the machine executes `movl \$42, %eax`, the high bits of rax will be set to 0, and the full rax register will contain the value 42. This is our first way to load a value, it is also the cheapest way and uses 5 bytes only.

Because 32 bits operations are zero-extended, the previous technique is unsuitable to load a negative value into a 64 bits register. In that case we have to mention rax explicitely in the instruction: ```movq \$-42, %rax```. On x64, mentioning a full width register costs you two extra bytes (one of them is the REX prefix, required for all 64 bits instructions). So that gives us the second most economical load instruction, it uses 7 bytes.

Finally if the constant to load does not fit in 32 bits, one has to use the only x64 instruction with a 64 bits immediate. In AT&T syntax we write `movabsq \$4294967296, %rax`. It really is a last-resort option because that instruction uses 10 bytes!

These three cases give us the following rules for compiling constant loads into rax on x64. (The same works for any other integer register.)

• When 0 ≤ N < 232, use `movl \$N, %eax` (5 bytes).
• When -231 ≤ N < 0, use `movq \$N, %rax` (7 bytes).
• Otherwise, use `movabsq \$N, %rax` (10 bytes).

Isn't that crazy?

### June 2, 2015

Here is a short note for those who have trouble remembering Russell's Paradox or the definition of the Y combinator in lambda-calculus. I will expose here one observation that allows you to remember just one of the two and recover the other. The actual link between these two notions prevents the lambda-calculus from serving the purpose Church had for it: giving a formal basis to mathematics. Fortunately the lambda-calculus found another application not any less important than grounding mathematics: it describes universal computations.

Russell's Paradox considers the set R = { X | X ∉ X } of sets that do not belong to themselves. Now if such set exists, R ∈ R if and only if R ∉ R, and we have a contradiction. Let us encode this reasoning in lambda-calculus, we start by observing that sets can be represented by their characteristic function: a function returning 1 for all members of the set and 0 for the others. Assume that we represent the negation using a function N. Then, we can define the set R in lambda-calculus by the following characteristic function:

```R = λ x. N (x x)
```

Let's observe what happens when we ask if R is in itself: We form `Y = R R` (that is exactly the Y combinator) and by one application of beta-reduction, we get `N (R R) ≡ N Y`.

We just found a fixpoint of N. In set theory we had a fixpoint of negation, but negation does not have fixpoint, hence the paradox. In the lambda-calculus case, we get out of this with a trick: if N has no fixpoint, the computation we get by applying these characteristic functions never terminates, it diverges and we face an ever-running program that never gives us the answer if R is in itself or not.

### March 5, 2015

I recently hacked together a small yacc for C. Initially, I was programming as if I lived in the future and performance was not a problem. However, with this LALR state generation you can't be too silly, as I soon discovered. Processing this C89 grammar took around 8 minutes, that was not really acceptable as bison could deal with it in a fraction of second! After eliminating a few massive performance hogs (sorted array to represent a set, loop invariant in a loop), I was processing the above C grammar in 5 seconds. Much better! Then comes my point, at that time the profiler was saying that my code was fairly fast and accounted for less than a second of runtime. The issue turned out to be in my use of malloc/free in the libc. I used the following pattern:

• Allocate initialize some memory to store a temporary,
• if condition C is true, store the pointer
• otherwise, free the memory.

The case 3 happened much more often than the case 2 so I thought that glibc's allocation code would simply give me the same block all the time. Switching to the scheme below got me a 5x speedup, so I guess I was wrong!

• Initialize a statically allocated object A,
• if condition C is true, allocate some memory and copy A to it, then store the pointer
• otherwise, do nothing.

The conclusion I can draw is that objects likely to be short-lived should really not be allocated dynamically. This makes me wonder how, for instance, C++'s RAII and Rust's automatic liveness mechanisms handle this kind of performance problems. The speed of my yacc is now similar to bison's after only 10% of the code was changed. This proves the point that, instead of designing for speed, we should design for simplicity and then, improve.

### September 22, 2014

I am grading some verified code submitted by students and observe one interesting fact. The code they have to verify is the classic inefficient bubble sort, they are given an array and need to convince Microsoft Research's Dafny verifier that the returned array is indeed sorted.

Now comes the interesting bit. It turns out that all students have in the precondition that the pointer to the array is not null, but most of them also require that `a.Length > 1` or `a.Length > 0`. That is, their sorting function can only be called on arrays that are either two elements long (in the worst case) or, non empty. This additional requirement helps the verifier going through but most of the time their code actually works for empty arrays! In other words, the requirement is useless but they did not go through the pain of trying to eliminate or understand it. I see this as disdain for small cases and would like to argue it is poor programming practice.

Now that I think about it, it reminds me of a mail I read some time ago, somebody implemented a vi clone and in its first versions, editing commands at the end of the file would crash or misbehave in most cases. To me, it is an indication of poor design. This software is undoubtedly still flawed since the corner case was patched afterwards, good code and data design would have allowed uniform programming and probably made the program safer.

I see going through the thinking to eliminate corner cases as an essential part of programming. The more your function or data structure is general the easier and more comfortable it is to use. By experience, adapting your code to the corner case will not make it unclear but more elegant and easier to read/use (assuming you don't dumbly patch it)! So, next time you write some code, ask yourself what happens when the window is 0 pixels wide, or if the file you are reading contains a nul byte...

### September 19, 2014

While restructuring my simple IO multiplexing module I had to solve a problem that was not completely trivial. It might make an instructive exercise so I will explain it here.

The problem is that I am iterating over a collection (C array) that can be modified by the functions called during the iteration. More specifically, some elements could be removed during the iteration. I thought the simplest way to do this is to have a flag that marks the validity of elements in my array. This way, instead of deleting during the iteration I change a flag and, during the iteration, I skip flagged elements.

After the iteration, we have to compact the array to remove flagged elements. My first (mental) try was quadratic but with some more thought, I found a linear algorithm. I let you meditate and find it, it's not so hard and pretty nice to hack (3 lines in my program).

### September 10, 2014

Here are my latest thoughts on compilers for low level languages (I think C). I am a pretty big fan of the C language. What I like with C is that when you write it and know your target platform you have pretty good idea what the code generated will be. It is very predictable. C was designed at a time where programs had to be very simple, and this includes compilers. So C is trying to be as expressive as possible while remaining very easy to compile.

It seems that this ideal is now slipping out of sight among language users and compiler writers. Recently, a new family of bugs has emerged, compilers exploit undefined behavior as defined in the C standard to enable "optimizations". This mail shows how GCC removes a check for overflow that has undefined behavior without any warning. In the same vein, this article describes a bug in the Linux kernel that was aggravated by another undefined behavior based GCC optimization. Another recent article tries to show how, in modern C, one should zero a piece of memory. This problem is made hard by optimizing compilers that know the semantics of memset and will optimize it out if the it does not change the observational behavior of C programs. These different behaviors of compilers greatly impair the predictability, and I think this is a real deal breaker for many people.

But why predictability is something important? Well, consider the following function. If the compiler is predictable, the memset call will be compiled down to assembly.

```void f() {
char *p = malloc(512);
if (!p) return;
sensitivestuff(p);
memset(p, 0, 512);
free(p);
}
```

If only standard C is using this function, it is perfectly fine to remove the memory clearing call to memset. But we are not in a world where only well-behaved C calls C. This function could very well be called by some malicious code that tries to access back the memory just freed. In that case, not zeroing the memory causes a security bug since sensitive information can be accessed. So, "modern" compilers, when they do their job, assume that all code is written in C and well behaved. This is obviously not true in the real world, where abstraction leaps exist and can be exploited.

The previous argument shows that predictability is essential for security. Another important area where predictability is important is performance evaluation. While basic optimizations are critical, when C is used for embedded applications, predictability of the compiler is usually preferred to crazy compiler tricks. This is true for a simple reason, programmers want to evaluate the cost of their programs at the C level but what runs is the actual compiled code. These two abstraction layers need to be related in some simple way for the analysis on the high level to be relevant. Airbus, for example, uses a custom GCC version with all optimizations disabled.

If I have to sum it up, I think simpler is better, and compiler writers are taking a slippery path that could very well lead their compilers to be unusable for many applications where C used to shine. The focus is a lot more on a literal and probably misguided interpretation of the standard than it is on usability and relevance of compilers to today's problems. And, to be honest, is any of the above "optimizations" critical to any application that is not a contrived benchmark?

### September 9, 2014

Don Knuth, in its Volume 1 of The Art of Computer Programming argues against garbage collection by saying that early freeing can lead to a more efficient usage of the memory by the allocator.

In the following diagrams which represent a memory state, the heavy gray area is not used by the program but the garbage collector did not reclaim it yet, the black areas are in use, and the white areas are free to use. If the program tries to allocate one small block in the current state, the allocator will use some of the free space, Now the garbage collector reclaims the gray zone and we are left with two disjoint blocks of memory. On the other hand, if memory is managed manually the gray block has been freed eagerly, Thus, when the application needs some extra memory it can be tightly packed and only leaves one large hole in the memory.

I guess this is all to be taken with a grain of salt since there are also easy arguments for garbage collection (for example fast allocation). All in all I feel that programmers using languages with GC will advocate for it, and the converse for the others. The reasonable choice to make is probably application specific and even in this case it is likely to be hard and controversial making the right decision at the planning stage.

### September 4, 2014

Today's test revealed that fancy text data structures might very well not be necessary.

Proved by experiment today: a linked list of hole buffers is enough to edit with no lag a file of 64 megs (1,000,000+ lines).

— Quentin Carbonneaux September 4, 2014

### September 3, 2014

While hacking on my text editor I discovered a fun idiom. Often, in a switch statement you have two cases that are very similar except in one or two points. In that situation it would be poor style to copy the big chunk of code that is common to the two cases, instead, you can use the following neat trick.

```switch(type) {
...
case X:
isx = 1;
if (0) {
case Y:
isx = 0;
}
common code;
...
break;
}
```

The line `isx = 0;` will only be executed when the type is Y because of the guard. This is very similar and sort of dual to Duff's device. For additional fun, the braces of `if (0)` can be removed!

### September 2, 2014

Here is a simple Awk program to check that your files respect the religious constraint of 80 characters per line.

```{
gsub(/\t/, "        ");
if (length(\$0) > 80)
printf("%d: %s\n", NR, \$0);
}
```