Notes
These are my personal opinions on various subjects, sorted in
chronological order. I tried to keep them short. If you have any
comments, I would be glad if you dropped me a mail.
It is typical, and often recommended, to use
tests when writing programs. Tests allow you
to ensure that your program functions as expected
and facilitate refactoring by making the desired
behavior explicit. In this note, I would
like to present a cheap way to multiply the number
of tests you have.
Assume you want to multiply the size of your
test suite of N tests by K. One
(expensive) way is to write (K-1)N new tests.
But if you find K ways in which your program can be
changed while remaining correct, you can re-use the N
tests of your suite on the K tweaked programs,
essentially running KN tests! (In fact, if the
tweaks do not interfere with one another, log(K) of
them are enough: combining independent tweaks in all
possible ways already yields K distinct variants.)
Plenty of handy frameworks are available to let
you test your program easily on a bunch of
inputs. But I claim that, in view of the previous
paragraph, test frameworks should also let you
test different variations of your program.
So, test framework writers, to your editors!
This observation occurred to me while working on
the register allocator of my compiler. Indeed,
my algorithm is supposed to work regardless of
the order in which basic blocks are processed.
During my refactoring I found it useful to stress
the allocation code by using many different orders.
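To give an idea, here is a minimal sketch of such a stress test in C;
allocate_in_order and check_allocation are made-up placeholders for the
compiler's own entry points, not code from my register allocator.

#include <stdlib.h>

/* Placeholders for the register allocator entry point and its
 * invariant checker. */
extern void allocate_in_order(const int *order, int n);
extern void check_allocation(void);

/* Run the allocator and its checks on K random permutations of
 * the n basic blocks. */
void test_orders(int n, int K)
{
	int *order = malloc(n * sizeof *order);
	int i, j, k, t;

	if (!order)
		return;
	for (i = 0; i < n; i++)
		order[i] = i;
	for (k = 0; k < K; k++) {
		/* Fisher-Yates shuffle of the processing order. */
		for (i = n - 1; i > 0; i--) {
			j = rand() % (i + 1);
			t = order[i]; order[i] = order[j]; order[j] = t;
		}
		allocate_in_order(order, n);
		check_allocation();
	}
	free(order);
}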
Undefined behavior in C is a common source
of bugs, and sometimes, of funny ones.
Here is my story about one.
A few months ago I was working on a
function that looked like this.
for (i=0; arr[i]!=0 && i<2; ++i) {
	/* do some work */
}
When my program was compiled without
optimizations it would behave correctly,
but turning optimizations on made it
incorrect. I was able to quickly
pinpoint the above loop as the root of
my issues, but the symptom of the bug
was quite unusual: stepping through the
code in the debugger revealed that the loop body was
executed only once, when it should have
run twice! Indeed, the array
arr contained two non-zero values.
After a little bit of head scratching,
I eventually realized what the compiler
was doing. The variable arr was declared
as int arr[2], so accessing its third
element is undefined
behavior. Because of this, a valid
program can never access arr[2];
but if the loop body runs twice, the
loop condition will check arr[2]!=0
at the end of the second iteration. The
consequence of this reasoning is that,
assuming the input program is valid, the
second loop iteration will never run
and can be gotten rid of!
I thought this was quite a remarkable
use of undefined behavior: invalid array
accesses do not happen in valid programs,
so if the compiler can prove that such an access
happens in a branch, it means that the
branch is dead code.
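A straightforward way to make the loop safe (not necessarily the fix I
used at the time) is to test the index before touching the array, so the
out-of-bounds read can never happen:

for (i=0; i<2 && arr[i]!=0; ++i) {
	/* do some work */
}

Thanks to the short-circuit evaluation of &&, arr[i] is never read once
i reaches 2.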
The φ nodes of SSA form are often defined
as taking one argument per predecessor block.
In both my compiler and LLVM, the data
structures are based on this definition.
However, with some more experience, I think
it is more sensible to require φ arguments
to be linked with incoming edges, not
predecessors. I will give two practical
examples to prove this point, but first, let
me clarify the concept of edge.
In a control flow graph (CFG), I call an edge a link
between two basic blocks. An essential point is
that the edges to a block are not in one-to-one
correspondence with the block's predecessors.
This is because of jumps like if(%cond) goto @b
else @b, where a conditional jump targets the
same block in two (or more) cases. While such
a jump may look contrived, it can result from
optimizations, or even from lowering C-style
switch statements where multiple cases have
the same body.
The program below will serve to illustrate my
claim.
@src
	%var = ...
	%cond = ...
	if(%cond) goto @dst else @dst
@dst
	%foo = φ(@src %var, ...)
Following the usual requirement, the φ
node %foo has one argument for the @src
predecessor block. Assume now that the
predecessor is dead code and gets deleted
by this natural deletion code.
void blockdel(Block block) {
	/* do stuff... */
	foreach succ in block.successors {
		foreach phi in succ.phis {
			phi.remove_arg(block)
		}
	}
}
Stepping through this function on the program above reveals
that the remove_arg method is erroneously
called twice: @dst appears twice among the successors of
@src, once per outgoing edge. In my compiler, the second
invocation caused an assertion failure. On the
other hand, if φ nodes have one argument per
incoming edge, the loop above naturally turns
into correct code, as sketched below.
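Here is what the edge-based version could look like, in the same
pseudocode as above; the exact names of the edge fields are made up,
the only point is that each incoming edge carries its own φ argument.

void blockdel(Block block) {
	/* do stuff... */
	foreach edge in block.out_edges {
		foreach phi in edge.target.phis {
			phi.remove_arg(edge)
		}
	}
}

Each outgoing edge is visited exactly once, so each φ argument is
removed exactly once, even when two edges lead to the same successor.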
Another case where things could go wrong is
in a CFG simplification pass. Such a pass will
likely remove spurious empty blocks and adjust
links elsewhere accordingly. But consider
the following program.
@blk0
	...
	if(%cond) goto @blk1 else @blk2
@blk1
	jmp @blk3
@blk2
	jmp @blk3
@blk3
	%foo = φ(@blk1 %v1, @blk2 %v2)
The simplification pass will remove both @blk1
and @blk2. After deleting @blk1, the block
of the first φ argument is updated to
@blk0. Now, if the same thing happens to the
second empty block, two arguments of the φ node
bogusly end up with the same incoming block:
%foo = φ(@blk0 %v1, @blk0 %v2). With edge-based
arguments, each of the two values would remain attached
to its own incoming edge from @blk0, and no ambiguity
would arise.
Solving these issues in my compiler required
careful logic to handle the "multiplicity" of
predecessors. However, I think the edge-based
design is better, and I might adopt it in the
future.
Because of a misuse of an archiving command, I had to delete more
than 40,000 files polluting my home directory. The UNIX way
should make this easy, because the decompression program can output
the list of files it extracted. Unfortunately, it turned out to
be surprisingly hard, for at least two reasons: first, 40,000 files
is a lot, and second, the files used funny characters (white space,
leading dashes, parentheses, etc.). After some Googling,
here is the final command I used, phew.
tr \\n \\0 <list.txt | xargs -0 -n 1 rm -f --
(The tr turns the newline-separated list into a NUL-separated one,
xargs -0 splits its input on NUL bytes so white space in names is
preserved, and the trailing -- prevents names starting with a dash
from being parsed as options by rm.)
While browsing information about compilers, I found it difficult
to find a formal definition of SSA form (Static Single Assignment).
For whatever reason, it's often under-specified or simplified.
So anyway, if you are like me and like formal definitions, here it is.
A program is in SSA form when:
- Variables are assigned only once.
- All the uses of a variable are dominated by its unique definition.
- The φ nodes are at the beginning of basic blocks.
- The φ nodes have one argument for each predecessor of their basic block.
- A use appearing as a φ argument is counted at the end of the corresponding predecessor block.
The concept of domination comes from graph theory: an instruction A dominates
another instruction B when all the paths from the entry point of the program to B go
through A.
Please let me know if the definition above looks wrong to you.
Finally, if you want informal examples, the Wikipedia page has
plenty!
I would like to share some thoughts I had about tagged data. Sum types and
the like are getting really common in mainstream languages, and they are usually
implemented using what I'll call a tag field. A tag is a small
number that indicates the kind of the data considered. For example,
in C, we often use the following idiom to mimic sum types:
struct Fruit {
	enum {
		FruitNone, /* invalid fruit */
		FruitApple,
		FruitBanana,
		FruitCherry,
	} kind;
	union {
		struct Apple apple;
		struct Banana banana;
		struct Cherry cherry;
	} u;
};
As in the example, most programming languages represent the tag
using a machine integer. However, if you are programming under tight space
constraints, it might be worth thinking more about the appropriate
representation of the tag field.
Let's say that you want a tagged structure that can hold either
one integer of type A in the range [0-255], or an integer of one of the types
B, C, and D that lie in the range [0-63]. Because there
are 4 different cases, the tag field needs to be at least 2 bits
long, and because type A needs at least 8 bits (8 bits
represent at most 2^8 = 256 values), the whole tagged
data structure would need at least 8+2 = 10 bits of storage.
The point is now that the whole reasoning above assumes that
the tag bit-width is constant, but in the above case it is
actually smarter to have a variable bit-width for the tag. Here
is why: if we use the tag 0 to represent the kind of data A, and
3, 5, 7 to represent respectively the kinds B, C, and D, then
we use only 1 bit for the tag in the A case and 3 bits in all
the other cases. So instead of using 10 bits of storage as
above we use max(8+1,6+3) = 9 bits! (Note that the above works
because 3, 5, and 7 have 1 as least significant bit, so they
are never confused with the A kind.)
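Here is a minimal sketch of such an encoding in C; the helper names are
made up for illustration, the only substance is the 0 / 3 / 5 / 7 tag
assignment described above.

#include <stdint.h>

/* The 9-bit encoding above, stored in the low bits of a uint16_t:
 *   A:     bit 0 is 0,           bits 1..8 hold the 8-bit value
 *   B/C/D: bits 0..2 are 3/5/7,  bits 3..8 hold the 6-bit value */
enum Kind { KindA, KindB, KindC, KindD };

uint16_t pack_a(uint8_t v)
{
	return (uint16_t)v << 1;                  /* tag bit 0 = 0 */
}

uint16_t pack_bcd(enum Kind k, uint8_t v)         /* v in [0-63] */
{
	uint16_t tag = k == KindB ? 3 : k == KindC ? 5 : 7;
	return (uint16_t)(v & 63) << 3 | tag;
}

enum Kind kind_of(uint16_t x)
{
	if ((x & 1) == 0)
		return KindA;
	return (x & 7) == 3 ? KindB : (x & 7) == 5 ? KindC : KindD;
}

unsigned value_of(uint16_t x)
{
	return (x & 1) == 0 ? (x >> 1) & 255 : (x >> 3) & 63;
}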
The x64 architecture is better than i386 in many regards, but one surprise
it has in store for programmers
is that immediates remain 32-bit constants instead of being
promoted to 64 bits like everything else. In case you do not know it
already, an immediate is a constant embedded in an instruction; for example,
if you want to add N to the register rax, you would write
addq $N, %rax, and in that case we say that N is an immediate.
Because immediates are restricted to 32 bits, it is impossible
to add 2^33 to rax using
only one instruction. One has to load the big constant into another register
first (say rbx), and then perform the addition, like so:
addq %rbx, %rax.
This brings us to the following question: how does one load a constant into
a 64-bit register?
The answer is simple if you do not care about code size.
But if you do, there are actually three different ways!
First, like most 64-bit instruction sets that allow operating on 32-bit
sub-registers, x64 zeroes the high bits of the larger parent 64-bit
register for any 32-bit operation. (If you
wonder why, look up data dependencies in computer architecture books.)
So for instance, when the machine executes movl $42, %eax, the high
bits of rax will be set to 0, and the full rax register will contain the
value 42. This is our first way to load a value; it is also the
cheapest, using only 5 bytes.
Because 32-bit operations are zero-extended, the previous technique
is unsuitable to load a negative value into a 64-bit register. In that
case we have to mention rax explicitly in the instruction: movq
$-42, %rax. On x64, mentioning the full-width register costs you
two extra bytes (one of them is the REX prefix, required for 64-bit
instructions). So that gives us the second most economical load
instruction; it uses 7 bytes.
Finally, if the constant to load does not fit in 32 bits, one has to
use the only x64 instruction with a 64-bit immediate. In
AT&T syntax we write movabsq $4294967296, %rax. It
really is a last-resort option because that instruction uses 10 bytes!
These three cases give us the following rules for compiling constant
loads into rax on x64. (The same works for any other integer register.)
- When 0 ≤ N < 2^32, use movl $N, %eax (5 bytes).
- When -2^31 ≤ N < 0, use movq $N, %rax (7 bytes).
- Otherwise, use movabsq $N, %rax (10 bytes).
Isn't that crazy?
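As a small sketch, assuming the constant is available as a signed 64-bit
integer, the rules above translate directly into the following size
computation (loadsize is a made-up name):

#include <stdint.h>

/* Number of bytes needed to load the constant n into a 64-bit
 * integer register, following the three rules above. */
int loadsize(int64_t n)
{
	if (n >= 0 && n < ((int64_t)1 << 32))
		return 5;   /* movl $n, %eax (zero-extended) */
	if (n >= -((int64_t)1 << 31) && n < 0)
		return 7;   /* movq $n, %rax (sign-extended imm32) */
	return 10;          /* movabsq $n, %rax (64-bit immediate) */
}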
Here is a short note for those who have trouble remembering Russell's
Paradox or the definition of the Y combinator in lambda-calculus.
I will present one observation that allows you to remember just one of
the two and recover the other. The actual link between these two notions
is what prevents the lambda-calculus from serving the purpose Church had for it:
giving a formal basis to mathematics. Fortunately, the lambda-calculus
found another application no less important than grounding mathematics:
it describes universal computation.
Russell's Paradox considers the set R = { X | X ∉ X }
of sets that do not belong to themselves. Now, if such a set exists, R ∈ R if and only
if R ∉ R, and we have a contradiction. Let us encode this reasoning
in lambda-calculus. We start by observing that sets can
be represented by their characteristic function: a function returning
1 for all members of the set and 0 for the others. Assume that
we represent negation by a function N. Then we can define
the set R in lambda-calculus by the following characteristic function:
R = λ x. N (x x)
Let's observe what happens when we ask whether R is in itself.
We form Y = R R (abstracting over N, that is exactly the Y combinator) and, by one
application of beta-reduction, we get N (R R) ≡ N Y.
We just found a fixpoint of N. In set theory we had a fixpoint of
negation, but negation has no fixpoint, hence the paradox. In the
lambda-calculus case, we get out of this with a trick: if N has no fixpoint,
the computation we get by applying these characteristic functions never
terminates; it diverges, and we face an ever-running program that
never tells us whether R is in itself or not.
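To make the mnemonic explicit, here is the reduction spelled out (this is
just the step described above, nothing more):

R R = (λ x. N (x x)) (λ x. N (x x)) →β N ((λ x. N (x x)) (λ x. N (x x))) = N (R R)

So R R reduces to N applied to R R itself, a fixpoint of N; and abstracting
N out of R R gives λ n. (λ x. n (x x)) (λ x. n (x x)), which is the usual
definition of the Y combinator.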
I recently hacked together a small yacc for C. Initially, I was programming
as if I lived in the future and performance was not a problem.
However, with LALR state generation you can't be too
silly, as I soon discovered. Processing this C89 grammar
took around 8 minutes; that was not really acceptable, as bison could deal with
it in a fraction of a second! After eliminating a few massive
performance hogs (a sorted array used to represent a set, a loop-invariant
computation left inside a loop), I was processing the above C grammar in 5 seconds.
Much better! Now comes my point: at that time, the profiler was saying
that my code was fairly fast and accounted for less than a second
of runtime. The issue turned out to be my use of the libc's malloc/free.
I used the following pattern:
- Allocate and initialize some memory to store a temporary object,
- if condition C is true, store the pointer,
- otherwise, free the memory.
The third case happened much more often than the second, so I thought
that glibc's allocation code would simply hand me back the same
block all the time. Switching to the scheme below got
me a 5x speedup, so I guess I was wrong!
- Initialize a statically allocated object A,
- if condition C is true, allocate some memory and copy A to it,
then store the pointer
- otherwise, do nothing.
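In C, the two patterns look roughly like this; the Temp type, the names,
and the keep condition are placeholders for illustration, not code from
my yacc.

#include <stdlib.h>
#include <string.h>

typedef struct { int data[4]; } Temp;    /* placeholder type */

/* Old pattern: always allocate, free when the object is not kept. */
Temp *tmpalloc(int keep)
{
	Temp *t = malloc(sizeof *t);

	if (!t)
		return NULL;
	memset(t, 0, sizeof *t);          /* initialize the temporary */
	if (keep)
		return t;                 /* case 2: store the pointer */
	free(t);                          /* case 3: the common one */
	return NULL;
}

/* New pattern: build the object in static storage, allocate only
 * when it must be kept. */
Temp *tmpcopy(int keep)
{
	static Temp a;                    /* statically allocated object */

	memset(&a, 0, sizeof a);          /* initialize it */
	if (!keep)
		return NULL;              /* common case: no malloc at all */
	Temp *t = malloc(sizeof *t);
	if (t)
		memcpy(t, &a, sizeof *t);
	return t;
}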
The conclusion I can draw is
that objects likely to be short-lived should really not be
allocated dynamically. This makes me wonder how, for instance,
C++'s RAII and Rust's automatic liveness mechanisms handle this
kind of performance problem. The speed of my yacc is now similar
to bison's after only 10% of the code was changed. This proves
the point that, instead of designing for speed, we should design
for simplicity and then improve.
I am grading some verified code submitted by students and observed
one interesting fact. The code they have to verify is the classic,
inefficient bubble sort: they are given an array and need to convince
Microsoft Research's Dafny verifier that the returned array is
indeed sorted.
Now comes the interesting bit. It turns out that
all students require in the precondition that the pointer to the array
is not null, but most of them also require that a.Length > 1
or a.Length > 0. That is, their sorting function can
only be called on arrays that have at least two elements (in the
worst case) or, at best, are merely non-empty. This additional requirement helps
the verifier go through, but most of the time their code actually
works for empty arrays! In other words, the requirement is useless,
but they did not go through the pain of trying to eliminate or
understand it. I see this as disdain for small cases and would
like to argue that it is poor programming practice.
Now that I think about it, it reminds me of a mail I read some
time ago: somebody implemented a vi clone, and in its first
versions editing commands at the end of the file would crash
or misbehave in most cases. To me, this is an indication of
poor design. The software is undoubtedly still flawed, since
the corner case was merely patched afterwards; good code and data
design would have allowed uniform programming and probably made
the program safer.
I see going through the thinking required to eliminate corner cases as
an essential part of programming. The more general your function or
data structure is, the easier and more comfortable it is
to use. In my experience, adapting your code to the corner cases will
not make it unclear, but rather more elegant and easier to read and use
(assuming you don't dumbly patch it)! So, next time you write
some code, ask yourself what happens when the window is 0 pixels
wide, or when the file you are reading contains a nul byte...
While restructuring my simple IO multiplexing module I had to solve
a problem that was not completely trivial. It might make an
instructive exercise so I will explain it here.
The problem is that I am iterating over a collection (C array)
that can be modified by the functions called during the iteration.
More specifically, some elements could be removed during the
iteration. I thought the simplest way to handle this was to have a
flag marking the validity of each element in the array. This way,
instead of deleting elements during the iteration, I flip a flag and
skip flagged elements for the rest of the iteration.
After the iteration, the array has to be compacted
to remove the flagged elements. My first (mental) attempt was quadratic,
but with some more thought I found a linear algorithm. I let
you meditate and find it; it's not so hard and pretty nice to
hack (3 lines in my program).
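(Spoiler ahead, in case you would rather find it yourself.) One classic
linear compaction, presumably close to what those three lines do, keeps a
write cursor that lags behind the read cursor; struct Fd and the valid
flag are made-up names for illustration.

struct Fd { int fd; int valid; };         /* made-up element type */

/* Compact the n-element array in place, keeping only the elements
 * whose valid flag is set; returns the new length. */
int compact(struct Fd *fds, int n)
{
	int i, j;

	for (i = j = 0; i < n; i++)
		if (fds[i].valid)
			fds[j++] = fds[i];
	return j;
}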
Here are my latest thoughts on compilers for low-level languages
(I am thinking of C). I am a pretty big fan of the C language. What I like
about C is that when you write it and know your target platform, you
have a pretty good idea of what the generated code will be. It is very
predictable. C was designed at a time when programs had to be very
simple, and this includes compilers. So C tries to be as
expressive as possible while remaining very easy to compile.
It seems that this ideal is now slipping out of sight among
language users and compiler writers. Recently, a new family of
bugs has emerged: compilers exploit undefined behavior,
as defined in the C standard, to enable "optimizations".
This mail shows how GCC removes an overflow check that relies on
undefined behavior, without any warning. In the same vein,
this article describes a bug in the Linux kernel that was aggravated by
another GCC optimization based on undefined behavior. Another recent article
tries to show how, in modern C, one should zero a piece
of memory. This problem is made hard by optimizing compilers
that know the semantics of memset and will
optimize it out if it does not change the observable
behavior of the C program.
These compiler behaviors greatly impair predictability,
and I think this is a real deal breaker for many people.
But why is predictability important? Well, consider
the following function. If the compiler is predictable, the
memset call will be compiled down to actual assembly and not optimized away.
#include <stdlib.h>
#include <string.h>

void sensitivestuff(char *);      /* elsewhere: works on secret data */

void f() {
	char *p = malloc(512);
	if (!p) return;
	sensitivestuff(p);
	memset(p, 0, 512);        /* erase the secrets before freeing */
	free(p);
}
If this function is only used from standard C code, it is perfectly
fine to remove the memory-clearing call to memset. But we are
not in a world where only well-behaved C calls C. This function
could very well be called by some malicious code that tries to
read back the memory just freed. In that case, not zeroing
the memory causes a security bug, since sensitive information can
be accessed. So "modern" compilers, when they do
their job, assume that all code is written in C and well behaved.
This is obviously not true in the real world, where abstraction
leaks exist and can be exploited.
The previous argument shows that predictability is essential
for security. Another important area where predictability
matters is performance evaluation. While basic optimizations
are critical, when C is used for embedded applications the
predictability of the compiler is usually preferred to crazy
compiler tricks. This is true for a simple reason: programmers
want to evaluate the cost of their programs at the C level,
but what runs is the actual compiled code. These two abstraction
layers need to be related in some simple way for the analysis
at the high level to be relevant. Airbus, for example, uses
a custom GCC version with all optimizations disabled.
If I have to sum it up, I think simpler is better, and
compiler writers are taking a slippery path that could very
well lead their compilers to be unusable for many applications
where C used to shine. The focus is a lot more on a literal
and probably misguided interpretation of the standard than it
is on usability and relevance of compilers to today's
problems. And, to be honest, are any of the above
"optimizations" critical to any application that is not
a contrived benchmark?
Don Knuth, in Volume 1 of The Art of Computer Programming,
argues against garbage collection by saying that freeing memory early can
lead to a more efficient use of memory by the allocator.
In the following diagrams, which represent a memory state, the
gray area is not used by the program but has not yet been reclaimed
by the garbage collector, the black areas are in use, and the white
areas are free to use.
If the program tries to allocate one small block in the current
state, the allocator will use some of the free space.
Only then does the garbage collector reclaim the gray zone, and we
are left with two disjoint free blocks of memory.
On the other hand, if memory is managed manually, the gray block
has already been freed eagerly.
Thus, when the application needs some extra memory, it can be tightly
packed against the blocks in use, leaving only one large hole in the memory.
I guess this is all to be taken with a grain of salt, since
there are also easy arguments for garbage collection (for example,
fast allocation). All in all, I feel that programmers using languages
with a GC will advocate for it, and conversely for the others.
The reasonable choice is probably application specific, and
even then, making the right decision at the planning stage is
likely to be hard and controversial.
Today's test revealed that fancy text data structures
might very well not be necessary.
While hacking on my text editor I discovered
a fun idiom. Often, in a switch statement, you have two
cases that are very similar except on one or two points.
In that situation it would be poor style to copy the big
chunk of code that is common to the two cases; instead,
you can use the following neat trick.
switch(type) {
...
case X:
	isx = 1;
	if (0) {
case Y:
		isx = 0;
	}
	common code;
	...
	break;
}
The line isx = 0; will only be executed when
the type is Y: when the type is X, the if (0) guard skips over it,
and when the type is Y, control jumps directly to the case label
inside the if and then falls through to the common code. This is
very similar, and in a way dual, to Duff's device. For additional fun,
the braces of if (0) can even be removed!
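To make it concrete, here is a small self-contained version of the trick;
the enum values and the printf calls are of course only there for
illustration.

#include <stdio.h>

enum { X, Y, Z };

void report(int type)
{
	int isx;

	switch (type) {
	case X:
		isx = 1;
		if (0) {
	case Y:
			isx = 0;
		}
		/* common code shared by the X and Y cases */
		printf("isx = %d\n", isx);
		break;
	case Z:
		printf("something else\n");
		break;
	}
}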
Here is a simple Awk program to check that your files
respect the religious constraint of 80 characters per
line.
{
	gsub(/\t/, "        ");   # count a tab as 8 columns (assumed tab width)
	if (length($0) > 80)
		printf("%d: %s\n", NR, $0);
}