# Notes

These are my personal opinions on various subjects, sorted in reverse chronological order. I tried to keep them short. If you have any comments, I would be glad if you dropped me a mail.

### January 15, 2019

It has been forever since my last note! I hereby take it as a new year's resolution to share more about the interesting thought nuggets I come across.

In computer hardware, arithmetic primitives are in fact *modular* primitives; this means that the result of `a+b` is defined as the mathematical sum of `a` and `b` modulo a power of 2 (usually, 2³² or 2⁶⁴). The mathematical structure obtained is a modular ring in which all operations "wrap around". It is only a ring and not a field because the multiplication operation does not have an inverse; to see why an inverse function for multiplication cannot be defined, observe that, modulo 2³², 42 is the result of both 21×2 and (2³¹+21)×2!

An elementary result equivalent to Bézout's identity is that any number co-prime with the modulus has a multiplicative inverse. In a computer, this means that for any odd number `a`, a number `b` exists such that `a×b = 1`. Now there is a particularly cute algorithm to find multiplicative inverses, and it is what motivated me to write this note.

The core idea of the algorithm lies in an observation I found quite surprising: a multiplicative inverse `b` of `a` modulo 2ⁿ can be used to find an inverse `c` of `a` modulo 2²ⁿ. Let's see how.

```
ab = 1 (mod 2ⁿ)
⇒ ab - 1 = k2ⁿ
⇒ (ab - 1)² = 0 (mod 2²ⁿ)
⇒ a²b² - 2ab + 1 = 0 (mod 2²ⁿ)
⇒ 2ab - a²b² = 1 (mod 2²ⁿ)
⇒ a(2b - ab²) = 1 (mod 2²ⁿ)
⇒ c = 2b - ab² (mod 2²ⁿ)
```

This gives us the inductive step of the algorithm; to bootstrap it we simply need to find an inverse `b` of `a` modulo some small power of 2. To that end, one can check that all odd numbers are their own inverse modulo 4 (and 8). Putting it together in C:

```
uint32_t mulinv(uint32_t a)
{
	uint32_t b = a;  /* 1/a mod 2² */
	b *= 2 - a*b;    /* 1/a mod 2⁴ */
	b *= 2 - a*b;    /* 1/a mod 2⁸ */
	b *= 2 - a*b;    /* 1/a mod 2¹⁶ */
	b *= 2 - a*b;    /* 1/a mod 2³² */
	return b;
}
```

For those really eager for performance, it turns out that `b = 3*a ^ 2` is a cute initialization by Peter Montgomery which yields an inverse of `a` modulo 2⁵ and allows us to cut one refinement step!
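To make the saving concrete, here is a sketch of the variant with Montgomery's initialization (the function name is my own); since the start is already correct modulo 2⁵, three refinement steps reach 2⁴⁰ ≥ 2³²:

```c
#include <stdint.h>

/* Sketch of mulinv with Montgomery's start; `a` must be odd.
 * Note `*` binds tighter than `^`, so b = (3*a) ^ 2. */
uint32_t mulinv_fast(uint32_t a)
{
	uint32_t b = 3*a ^ 2; /* 1/a mod 2⁵ */
	b *= 2 - a*b;         /* 1/a mod 2¹⁰ */
	b *= 2 - a*b;         /* 1/a mod 2²⁰ */
	b *= 2 - a*b;         /* 1/a mod 2⁴⁰, truncated to 2³² */
	return b;
}
```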

### February 15, 2017

Undefined behavior in C is a common source of bugs and, sometimes, of funny ones. Here is my story about one of them. A few months ago I was working on a function that looked like this:

```
for (i=0; arr[i]!=0 && i<2; ++i) {
	/* do some work */
}
```

When my program was compiled without optimizations it would behave correctly, but turning optimizations on made it incorrect. I was able to quickly pinpoint the above loop as the root of my issues, but the symptom of the bug was quite unusual: stepping in the debugger revealed that the loop body was only executed once, when it should have been running twice! Indeed, the array `arr` contained two non-zero values.

After a little bit of head scratching, I eventually realized what the compiler was doing. The variable `arr` was declared as `int arr[2]`, so accessing its third element is undefined behavior. Because of this, a valid program *cannot* access `arr[2]`; but if the loop body is run twice, the test condition *will* check `arr[2]!=0` at the end of the second iteration. The consequence of this reasoning is that, assuming the input program is valid, the second loop iteration will not be run and can be gotten rid of!

I thought this was quite a remarkable use of undefined behavior: invalid array accesses do not happen in valid programs, so if the compiler can prove that such an access happens in a branch, it means that the branch is dead code.
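For completeness, the fix was simply to test the index before dereferencing, so `arr[2]` is never read; a minimal sketch (the function name is made up for illustration):

```c
/* Count the leading non-zero entries of a two-element array,
 * checking the bound *before* the element access. */
int count_nonzero(const int arr[2])
{
	int i;
	for (i = 0; i < 2 && arr[i] != 0; ++i) {
		/* do some work */
	}
	return i;
}
```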

### February 9, 2017

The φ nodes of SSA form are often defined
as taking one argument per predecessor block.
In both my compiler and LLVM, the data
structures are based on this definition.
However, with some more experience, I think
it is more sensible to require φ arguments
to be linked with incoming *edges*, not
predecessors. I will give two practical
examples to prove this point, but first, let
me clarify the concept of edge.

In a control flow graph (CFG), I call an edge a link between two basic blocks. An essential point is that edges to a block are *not* in one-to-one correspondence with the block's predecessors. This is because of jumps like `if(%cond) goto @b else @b`, where a conditional jump jumps to the same block in two (or more) cases. While such a jump may look contrived, it can result from optimizations, or even from lowering C-style `switch` statements where multiple cases have the same body.

The program below will serve to illustrate my claim.

```
@src
	%var = ...
	%cond = ...
	if(%cond) goto @dst else @dst
@dst
	%foo = φ(@src %var, ...)
```

Following the usual requirement, the φ node `%foo` has one argument for the `@src` predecessor block. Assume now that the predecessor is dead code and gets deleted by this natural deletion code:

```
void blockdel(Block block) {
	/* do stuff... */
	foreach succ in block.successors {
		foreach phi in succ.phis {
			phi.remove_arg(block)
		}
	}
}
```

Stepping through this function reveals that the `remove_arg` method will erroneously be called twice. In my compiler, the second invocation caused an assertion failure. On the other hand, if φ nodes have one argument per incoming edge, the loop above naturally turns into correct code.

Another case where things could go wrong is in a CFG simplification pass. Such a pass will likely remove spurious empty blocks and adjust links elsewhere accordingly. But consider the following program:

```
@blk0
	...
	if(%cond) goto @blk1 else @blk2
@blk1
	jmp @blk3
@blk2
	jmp @blk3
@blk3
	%foo = φ(@blk1 %v1, @blk2 %v2)
```

The simplification pass will remove both `@blk1` and `@blk2`. After deleting `@blk1`, the block of the first φ argument will be updated to `@blk0`. Now if the same thing happens to the second empty block, two arguments of the φ node will bogusly end up with the same incoming block.

Solving these issues in my compiler required careful logic to handle the "multiplicity" of predecessors. However I think the edge-based design is better, and I might adopt it in the future.
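To illustrate, here is a minimal C sketch of the edge-based design (the data structures and names are hypothetical, not from my compiler): φ arguments are keyed by incoming edges, so deleting a block removes exactly one argument per outgoing edge, even when several edges lead to the same successor.

```c
enum { MAXARGS = 4 };

typedef struct Edge Edge;  /* one CFG edge, identified by its address */
struct Edge { int from, to; };

typedef struct {
	Edge *edges[MAXARGS];  /* incoming edge of each argument */
	int   vals[MAXARGS];   /* value flowing along that edge */
	int   nargs;
} Phi;

/* Remove the argument attached to edge e; the deletion pass calls
 * this once per deleted edge, so duplicate predecessors can never
 * cause a double removal of the same argument. */
void phi_remove_edge(Phi *phi, Edge *e)
{
	for (int i = 0; i < phi->nargs; i++)
		if (phi->edges[i] == e) {
			phi->edges[i] = phi->edges[phi->nargs - 1];
			phi->vals[i]  = phi->vals[phi->nargs - 1];
			phi->nargs--;
			return;
		}
}
```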

### January 5, 2016

Because of a misuse of an archiving command I had to delete more than 40,000 files polluting my home directory. The *UNIX way* should make it easy because the decompression program can output the list of files it extracted. Unfortunately, it turned out to be surprisingly hard for at least two reasons: first, 40,000 files is a lot, and second, the files used funny characters (white-spaces, dashes at the beginning, parentheses, etc.). After some Googling, here is the final command I used, phew.

```
tr \\n \\0 <list.txt | xargs -0 -n 1 rm -f --
```
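As an aside, if GNU xargs is available, dropping `-n 1` lets xargs batch many file names per rm invocation, which should be noticeably faster on 40,000 files; the `--` still protects against names starting with a dash:

```shell
# Batched variant (assumes GNU xargs and one file name per line in list.txt).
tr '\n' '\0' <list.txt | xargs -0 rm -f --
```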

### November 13, 2015

While browsing information about compilers I found it difficult to find a formal definition of SSA form (Static Single Assignment). For whatever reason, it's often under-specified or simplified. So anyway, if you are like me and like formal definitions, here it is.

A program is in SSA form when:

- Variables are assigned only once.
- All the uses of a variable are *dominated* by its unique definition.
- The φ nodes are at the beginning of basic blocks.
- The φ nodes have one argument for each predecessor of their basic block.
- Uses of φ arguments are counted at the end of the corresponding predecessor.

The concept of domination is from graph theory: an instruction A *dominates* another instruction B when all the paths from the entry point of the program to B go through A.

Please let me know if the definition above looks wrong to you. Finally, if you want informal examples, the Wikipedia page has plenty!

### October 20, 2015

I would like to share some thoughts I had about tagged data. Sum types and such are getting really common in mainstream languages; they are usually implemented using what I'll call a *tag* field. A tag is a small number that indicates the kind of the data considered. For example, in C, we often use the following idiom to mimic sum types:

```
struct Fruit {
	enum {
		FruitNone, /* invalid fruit */
		FruitApple,
		FruitBanana,
		FruitCherry,
	} kind;
	union {
		struct Apple apple;
		struct Banana banana;
		struct Cherry cherry;
	} u;
};
```

Like in the example, most programming languages represent the tag using a machine integer. However, if you are programming under high space constraints it might be worth thinking more about the appropriate representation of the tag field.

Let's say that you want a tagged structure that can be either one integer of type A in the range [0-255] or integers of types B, C, and D that lie in the range [0-63]. Then, because there are 4 different cases, the tag field needs to be at least 2 bits long, and because the type A needs at least 8 bits (8 bits represent at most 2⁸ = 256 values), the whole tagged data structure would need at least 8+2 = 10 bits of storage.

The point is now that the whole reasoning above assumes that
the tag bit-width is *constant*, but in the above case it is
actually smarter to have a variable bit-width for the tag. Here
is why: if we use the tag 0 to represent the kind of data A, and
3, 5, 7 to represent respectively the kinds B, C, and D, then
we use only 1 bit for the tag in the A case and 3 bits in all
the other cases. So instead of using 10 bits of storage as
above we use max(8+1,6+3) = 9 bits! (Note that the above works
because 3, 5, and 7 have 1 as least significant bit, so they
are never confused with the A kind.)
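A sketch of how this could be packed in C (the layout and helper names are my own invention): the word keeps A's 8-bit payload above a single 0 tag bit, while B, C, and D put their 6-bit payloads above a 3-bit odd tag.

```c
#include <stdint.h>

/* Variable-width tag in a 16-bit word; only 9 bits are ever used.
 * Kind A:        payload in [0,255], tag = 0      (1 tag bit).
 * Kinds B, C, D: payload in [0,63],  tags 3, 5, 7 (3 tag bits, all odd). */
static uint16_t pack_a(uint8_t v)            { return (uint16_t)(v << 1); }
static uint16_t pack_bcd(int tag, uint8_t v) { return (uint16_t)(v << 3 | tag); }

static int     is_a(uint16_t w)        { return (w & 1) == 0; }
static uint8_t payload_a(uint16_t w)   { return (uint8_t)(w >> 1); }
static int     tag_bcd(uint16_t w)     { return w & 7; }
static uint8_t payload_bcd(uint16_t w) { return (uint8_t)(w >> 3); }
```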

### September 19, 2015

The x64 architecture is better than i386 in many regards, but one surprise it holds for programmers is that *immediates* remain 32-bit constants instead of being promoted to 64 bits like everything else. In case you do not know it already, an immediate is a constant embedded in an instruction; for example, if you want to add N to the register rax, you would write `addq $N, %rax`, and in that case we say that N is an immediate.

Because immediates are restricted to 32 bits, it is impossible to add 2³³ to rax using one instruction only. One has to load the big constant in another register first (say rbx), and then perform the addition, like so: `addq %rbx, %rax`. This brings us to the following question: how does one load a constant into a 64-bit register? The answer is simple if you do not care about code size. But if you do, there are actually three different ways!

First, like most 64-bit instruction sets that allow operating on 32-bit sub-registers, x64 zeroes the high bits of the larger parent 64-bit register for any 32-bit operation. (If you wonder why, look up data dependencies in computer architecture books.) So for instance, when the machine executes `movl $42, %eax`, the high bits of rax will be set to 0, and the full rax register will contain the value 42. This is our first way to load a value; it is also the cheapest and uses 5 bytes only.

Because 32-bit operations are zero-extended, the previous technique is unsuitable to load a negative value into a 64-bit register. In that case we have to mention rax explicitly in the instruction: `movq $-42, %rax`. On x64, mentioning a full-width register costs you two extra bytes (one of them is the REX prefix, required for all 64-bit instructions). So that gives us the second most economical load instruction; it uses 7 bytes.

Finally, if the constant to load does not fit in 32 bits, one has to use the only x64 instruction with a 64-bit immediate. In AT&T syntax we write `movabsq $4294967296, %rax`. It really is a last-resort option because that instruction uses 10 bytes!

These three cases give us the following rules for compiling constant loads into rax on x64. (The same works for any other integer register.)

- When 0 ≤ N < 2³², use `movl $N, %eax` (5 bytes).
- When -2³¹ ≤ N < 0, use `movq $N, %rax` (7 bytes).
- Otherwise, use `movabsq $N, %rax` (10 bytes).
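These rules are mechanical enough to sketch in C; here is a hypothetical helper returning the size, in bytes, of the cheapest load of N into rax:

```c
#include <stdint.h>

/* Size in bytes of the cheapest x64 instruction loading n into rax. */
int load_size(int64_t n)
{
	if (n >= 0 && n < (1LL << 32))
		return 5;    /* movl $n, %eax     (zero-extended) */
	if (n >= INT32_MIN && n < 0)
		return 7;    /* movq $n, %rax     (sign-extended imm32) */
	return 10;           /* movabsq $n, %rax  (full imm64) */
}
```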

Isn't that crazy?

### June 2, 2015

Here is a short note for those who have trouble remembering Russell's Paradox or the definition of the Y combinator in lambda-calculus. I will present one observation that allows you to remember just one of the two and recover the other. The actual link between these two notions is what prevents the lambda-calculus from serving the purpose Church had for it: giving a formal basis to mathematics. Fortunately the lambda-calculus found another application not any less important than grounding mathematics: it describes universal computations.

Russell's Paradox considers the set R = { X | X ∉ X } of sets that do not belong to themselves. Now if such a set exists, R ∈ R if and only if R ∉ R, and we have a contradiction. Let us encode this reasoning in lambda-calculus. We start by observing that sets can be represented by their characteristic function: a function returning 1 for all members of the set and 0 for the others. Assume that we represent negation by a function N. Then we can define the set R in lambda-calculus by the following characteristic function:

R = λ x. N (x x)

Let's observe what happens when we ask if R is in itself: we form `Y = R R` (abstracting over N, this is exactly the Y combinator) and by one application of beta-reduction, we get `N (R R) ≡ N Y`.

We just found a fixpoint of N. In set theory we had a fixpoint of negation, but negation does not have a fixpoint, hence the paradox. In the lambda-calculus case, we get out of this with a trick: since N has no fixpoint, the computation we get by applying these characteristic functions never terminates; it *diverges* and we face an ever-running program that never tells us whether R is in itself or not.

### March 5, 2015

I recently hacked together a small yacc for C. Initially, I was programming as if I lived in the future and performance was not a problem. However, with LALR state generation you can't be *too* silly, as I soon discovered. Processing this C89 grammar took around 8 minutes; that was not really acceptable as bison could deal with it in a fraction of a second! After eliminating a few massive performance hogs (a sorted array to represent a set, a loop invariant computed inside the loop), I was processing the above C grammar in 5 seconds. Much better! Then comes my point: at that time the profiler was saying that my code was fairly fast and accounted for less than a second of runtime. The issue turned out to be in my use of malloc/free in the libc. I used the following pattern:

- Allocate and initialize some memory to store a temporary,
- if condition C is true, store the pointer,
- otherwise, free the memory.

Case 3 happened much more often than case 2, so I thought that glibc's allocation code would simply give me the same block all the time. Switching to the scheme below got me a 5x speedup, so I guess I was wrong!

- Initialize a statically allocated object A,
- if condition C is true, allocate some memory and copy A to it, then store the pointer
- otherwise, do nothing.
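In C, the faster scheme looks roughly like this (the `Item` type and values are made up for illustration):

```c
#include <stdlib.h>

typedef struct { int key, val; } Item;

/* Build the temporary in static storage; call malloc only when kept. */
Item *make_item(int key, int keep)
{
	static Item tmp;      /* statically allocated scratch object */
	tmp.key = key;        /* initialize the temporary */
	tmp.val = 2 * key;
	if (!keep)
		return NULL;  /* common case: no allocator traffic */
	Item *p = malloc(sizeof *p);
	if (p)
		*p = tmp;     /* rare case: allocate and copy */
	return p;
}
```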

The conclusion I can draw is that objects likely to be short-lived should really *not* be allocated dynamically. This makes me wonder how, for instance, C++'s RAII and Rust's automatic liveness mechanisms handle this kind of performance problem. The speed of my yacc is now similar to bison's after only 10% of the code was changed. This proves the point that, instead of designing for speed, we should design for simplicity and then improve.

### September 22, 2014

I am grading some verified code submitted by students and I observed one interesting fact. The code they have to verify is the classic inefficient bubble sort; they are given an array and need to convince Microsoft Research's Dafny verifier that the returned array is indeed sorted.

Now comes the interesting bit. It turns out that all students have in the precondition that the pointer to the array is not null, but most of them also require that `a.Length > 1` or `a.Length > 0`. That is, their sorting function can only be called on arrays that are either at least two elements long (in the worst case) or non-empty. This additional requirement helps the verifier go through, but most of the time their code actually works for empty arrays! In other words, the requirement is useless but they did not go through the pain of trying to eliminate or understand it. I see this as disdain for small cases and would like to argue it is poor programming practice.

Now that I think about it, it reminds me of a mail I read some time ago: somebody implemented a vi clone, and in its first versions, editing commands at the end of the file would crash or misbehave in most cases. To me, this is an indication of poor design. The software is undoubtedly still flawed since the corner case was patched afterwards; good code and data design would have allowed uniform programming and probably made the program safer.

I see going through the thinking required to eliminate corner cases as an essential part of programming. The more general your function or data structure is, the easier and more comfortable it is to use. In my experience, adapting your code to the corner case will not make it unclear, but more elegant and easier to read and use (assuming you don't dumbly patch it)! So, next time you write some code, ask yourself what happens when the window is 0 pixels wide, or if the file you are reading contains a nul byte...

### September 19, 2014

While restructuring my simple IO multiplexing module I had to solve a problem that was not completely trivial. It might make an instructive exercise so I will explain it here.

The problem is that I am iterating over a collection (a C array) that can be modified by the functions called during the iteration. More specifically, some elements could be removed during the iteration. I thought the simplest way to handle this is to have a flag that marks the validity of elements in my array. This way, instead of deleting during the iteration, I change a flag and skip flagged elements.

After the iteration, we have to *compact* the array to remove flagged elements. My first (mental) try was quadratic, but with some more thought I found a linear algorithm. I'll let you meditate and find it; it's not so hard and pretty nice to hack (3 lines in my program).
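For the impatient, one linear-time compaction goes like this (a sketch; the element and flag types are made up, and my actual code differs):

```c
#include <stddef.h>

/* Shift valid elements left over the flagged ones; returns new length. */
size_t compact(int *arr, const char *flagged, size_t n)
{
	size_t j = 0;
	for (size_t i = 0; i < n; i++)
		if (!flagged[i])
			arr[j++] = arr[i];
	return j;
}
```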

### September 10, 2014

Here are my latest thoughts on compilers for low-level languages (I think C). I am a pretty big fan of the C language. What I like about C is that when you write it and know your target platform, you have a pretty good idea of what the generated code will be. It is very predictable. C was designed at a time when programs had to be very simple, and this includes compilers. So C tries to be as expressive as possible while remaining very easy to compile.

It seems that this ideal is now slipping out of sight among language users and compiler writers. Recently, a new family of bugs has emerged: compilers exploit *undefined behavior* as defined in the C standard to enable "optimizations". This mail shows how GCC removes a check for overflow that has undefined behavior, without any warning. In the same vein, this article describes a bug in the Linux kernel that was aggravated by another undefined-behavior-based GCC optimization. Another recent article tries to show how, in modern C, one should zero a piece of memory. This problem is made hard by optimizing compilers that know the semantics of memset and will optimize it out if it does not change the *observational behavior* of C programs. These different behaviors of compilers greatly impair predictability, and I think this is a real deal breaker for many people.

But why is predictability important? Well, consider the following function. If the compiler is predictable, the memset call will be compiled down to assembly.

```
void f() {
	char *p = malloc(512);
	if (!p)
		return;
	sensitivestuff(p);
	memset(p, 0, 512);
	free(p);
}
```

If only standard C is using this function, it is perfectly fine to remove the memory-clearing call to memset. But we are not in a world where only well-behaved C calls C. This function could very well be called by some malicious code that tries to access the memory just freed. In that case, not zeroing the memory causes a security bug since sensitive information can be accessed. So "modern" compilers, when they do their job, assume that all code is written in C and well behaved. This is obviously not true in the real world, where abstraction leaks exist and can be exploited.
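One known workaround, sketched below, is to call memset through a volatile function pointer: the compiler can no longer prove that the call is a plain memset on dying memory, so it cannot elide it. (C11's optional Annex K also offers `memset_s`, where available.)

```c
#include <string.h>

/* The compiler must assume memset_v may point to any function,
 * so the clearing call below cannot be proven dead and removed. */
static void *(* volatile memset_v)(void *, int, size_t) = memset;

void secure_clear(void *p, size_t n)
{
	memset_v(p, 0, n);
}
```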

The previous argument shows that predictability is essential for security. Another important area where predictability is important is performance evaluation. While basic optimizations are critical, when C is used for embedded applications, predictability of the compiler is usually preferred to crazy compiler tricks. This is true for a simple reason, programmers want to evaluate the cost of their programs at the C level but what runs is the actual compiled code. These two abstraction layers need to be related in some simple way for the analysis on the high level to be relevant. Airbus, for example, uses a custom GCC version with all optimizations disabled.

If I have to sum it up, I think simpler is better, and compiler writers are taking a slippery path that could very well lead their compilers to be unusable for many applications where C used to shine. The focus is a lot more on a literal and probably misguided interpretation of the standard than on the usability and relevance of compilers to today's problems. And, to be honest, are any of the above "optimizations" critical to any application that is not a contrived benchmark?

### September 9, 2014

Don Knuth, in Volume 1 of his The Art of Computer Programming, argues against garbage collection by saying that early freeing can lead to a more efficient usage of memory by the allocator.

In the following diagrams, which represent a memory state, the heavy gray area is not used by the program but the garbage collector has not reclaimed it yet, the black areas are in use, and the white areas are free to use. If the program tries to allocate one small block in the current state, the allocator will use some of the free space. Now the garbage collector reclaims the gray zone and we are left with two disjoint blocks of memory. On the other hand, if memory is managed manually, the gray block has been freed eagerly; thus, when the application needs some extra memory it can be tightly packed, leaving only one large hole in the memory.

I guess this is all to be taken with a grain of salt since there are also easy arguments for garbage collection (for example, fast allocation). All in all, I feel that programmers using languages with a GC will advocate for it, and conversely for the others. The reasonable choice is probably application-specific, and even then it is likely to be hard and controversial to make the right decision at the planning stage.

### September 4, 2014

Today's test revealed that fancy text data structures might very well not be necessary.

> Proved by experiment today: a linked list of hole buffers is enough to edit with no lag a file of 64 megs (1,000,000+ lines).
>
> — Quentin Carbonneaux, September 4, 2014

### September 3, 2014

While hacking on my text editor I discovered a fun idiom. Often, in a switch statement you have two cases that are very similar except in one or two points. In that situation it would be poor style to copy the big chunk of code that is common to the two cases; instead, you can use the following neat trick.

```
switch(type) {
	...
case X:
	isx = 1;
	if (0) {
case Y:
		isx = 0;
	}
	common code;
	...
	break;
}
```

The line `isx = 0;` will only be executed when the type is Y because of the guard. This is very similar, and sort of dual, to Duff's device. For additional fun, the braces of `if (0)` can be removed!
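Here is a runnable miniature of the trick (the kinds and values are invented for the demonstration): both cases reach the common tail, but only X executes the `isx = 1` line.

```c
enum { X = 1, Y = 2 };

int classify(int type)
{
	int isx = -1;
	switch (type) {
	case X:
		isx = 1;
		if (0) {       /* never entered by falling through */
	case Y:            /* ...but the switch can jump inside */
			isx = 0;
		}
		/* common code shared by X and Y */
		isx += 10;
		break;
	default:
		break;
	}
	return isx;
}
```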

### September 2, 2014

Here is a simple Awk program to check that your files respect the religious constraint of 80 characters per line.

```
{
	gsub(/\t/, " ");
	if (length($0) > 80)
		printf("%d: %s\n", NR, $0);
}
```