Notes

2014-09-10

Here are my latest thoughts on compilers for low-level languages (I am thinking of C). I am a pretty big fan of the C language. What I like about C is that when you write it and know your target platform, you have a pretty good idea what the generated code will be. It is very predictable. C was designed at a time when programs had to be very simple, and this includes compilers. So C tries to be as expressive as possible while remaining very easy to compile.

It seems that this ideal is now slipping out of sight among language users and compiler writers. Recently, a new family of bugs has emerged: compilers exploit behavior left undefined by the C standard to enable "optimizations". This mail shows how GCC silently removes an overflow check because the check itself relies on undefined behavior. In the same vein, this article describes a bug in the Linux kernel that was aggravated by another GCC optimization based on undefined behavior. Another recent article tries to show how, in modern C, one should zero a piece of memory. This problem is made hard by optimizing compilers that know the semantics of memset and will optimize it out if it does not change the observable behavior of the C program. These compiler behaviors greatly impair predictability, and I think this is a real deal breaker for many people.
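To make the first case concrete, here is a minimal sketch of the pattern that mail describes (the function name is mine, not from the mail): signed overflow is undefined, so the compiler is allowed to assume a + 100 > a always holds and fold the test to a constant, silently deleting the check the programmer wrote on purpose.

#include <assert.h>

/* Signed overflow is undefined behavior, so an optimizing compiler may
   assume a + 100 never wraps, reduce the assertion to assert(1), and
   remove the overflow check entirely. */
int add_checked(int a) {
        assert(a + 100 > a);    /* may be compiled out, without warning */
        return a + 100;
}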

But why is predictability important? Well, consider the following function. If the compiler is predictable, the memset call will be compiled down to assembly that actually clears the buffer.

#include <stdlib.h>
#include <string.h>

void sensitivestuff(char *p);   /* defined elsewhere; fills p with secrets */

void f() {
        char *p = malloc(512);
        if (!p) return;
        sensitivestuff(p);
        memset(p, 0, 512);      /* clear the secrets before releasing the memory */
        free(p);
}

If this function is only ever called from standard C, it is perfectly fine to remove the memory-clearing call to memset. But we are not in a world where only well-behaved C calls C. This function could very well be called by some malicious code that tries to read back the memory just freed. In that case, not zeroing the memory is a security bug, since sensitive information can be recovered. So "modern" compilers, when they do their job, assume that all code is written in C and well behaved. This is obviously not true in the real world, where abstraction leaks exist and can be exploited.
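The article on zeroing memory surveys workarounds; a common one, sketched below under the assumption that the compiler does not look through volatile, is to call memset via a volatile function pointer, so the call can no longer be recognized and removed. The name secure_memzero is mine.

#include <stddef.h>
#include <string.h>

/* A volatile pointer to memset: the compiler must assume the pointer may
   change at any time, so it cannot prove the call is dead and optimize
   it away. This is a workaround, not a guarantee the standard gives. */
static void *(* volatile memset_ptr)(void *, int, size_t) = memset;

static void secure_memzero(void *p, size_t n) {
        memset_ptr(p, 0, n);
}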

The previous argument shows that predictability is essential for security. Another important area where predictability matters is performance evaluation. While basic optimizations are critical, when C is used for embedded applications, predictability of the compiler is usually preferred to crazy compiler tricks. The reason is simple: programmers want to evaluate the cost of their programs at the C level, but what runs is the actual compiled code. These two abstraction layers need to be related in some simple way for an analysis at the high level to be relevant. Airbus, for example, uses a custom GCC version with all optimizations disabled.
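As an illustration of the kind of C-level cost reasoning I mean (my example, not Airbus's code): with a predictable compiler, the loop below is 256 multiply-accumulates and its cost can be bounded from the source alone; once the compiler may unroll, vectorize, or otherwise rewrite it, the source no longer tells you what runs.

/* 256 iterations, one multiply-accumulate each: with a predictable
   compiler, the cost can be estimated directly from this source. */
int dot256(const int *a, const int *b) {
        int acc = 0;
        for (int i = 0; i < 256; i++)
                acc += a[i] * b[i];
        return acc;
}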

If I had to sum it up: I think simpler is better, and compiler writers are taking a slippery path that could very well lead their compilers to be unusable for many applications where C used to shine. The focus is much more on a literal and probably misguided reading of the standard than on the usability and relevance of compilers to today's problems. And, to be honest, are any of the above "optimizations" critical to any application that is not a contrived benchmark?