Trusted Logic

How hard can you believe in your logic?

Why formal artillery?

I once had to fix a bug in the VIM editor. This is a great piece of software but it kept crashing when I resized my window while some options (listchars and cursorline) were enabled. So I downloaded the latest version, compiled it with debugging information and ran the code under gdb. The bug was in the redrawing procedure, this function is around 1k lines of code and has one parameter that basically points to the whole world. After one afternoon of investigation I finally found the culprit line and fixed it. Then I reported the bug and my patch was numbered 7.3.140 by Bram.

Interestingly this patch exactly reverts the one numbered 7.3.116. I did not know this and could have found out using mercurial to bisect, but at this time I was not aware of this and just did it the old, brutal, way.

This raises one simple question, as programmers, how hard can we trust our code? Clearly the person suggesting the patch 116 did not think he was breaking anything, but on what ground? The only thing that mattered was that Bram was convinced that the patch did not introduce any bugs, but he turned out to be wrong.

My Program

Now, I write a text editor. I do not have crazy ideas but I want it well behaved and designed so I can dig with pleasure in the code when I need to introduce a new feature. Recently I had to write the function to save files. Files are represented internally as buffer objects, these buffers simply store runes (unicode codepoints) and can be queried at a given unsigned offset with the buf_get function. This query function will return the newline character for offsets that are out of bounds.

Writing a save function is usually trivial but I wanted it to be clever enough to remove trailing whitespaces at end of lines and trailing blank lines at the end of file. It turned out to be a good programming exercise and I hacked it together using a small state machine.

static void
puteb(EBuf *eb, FILE *fp)
{
        enum { Munching, Spitting } state = Munching;
        unsigned munchb = 0, munche = 0, nl = 0;
        Rune r;

	while (munche < eb->b.limbo) switch (state) {

        case Munching:
                r = buf_get(&eb->b, munche);
                if (r == ' ' || r == '\t' || r == '\n') {
                        nl += (r == '\n');
                        munche++;
                        continue;
                }
                for (; munchb < munche; munchb++) {
                        r = buf_get(&eb->b, munchb);
                        if ((r == ' ' || r == '\t') && nl)
                                continue;
                        assert(nl == 0 || r == '\n');
                        nl -= (r == '\n');
                        putrune(r, fp);
                }
                assert(munchb == munche);
		assert(nl == 0);
                state = Spitting;
                continue;

        case Spitting:
                r = buf_get(&eb->b, munchb);
                if (r == ' ' || r == '\t' || r == '\n') {
                        assert(nl == 0);
                        munche = munchb;
                        state = Munching;
                        continue;
                }
                putrune(r, fp);
                munchb++;
                continue;

        }

	putrune('\n', fp);
}

I heavily used assertions because I know that I often write wrong code. While my code is often wrong, my assertions are often right, and it is a great satisfaction when programs crash on assertions compared to when the output the unhelpful "Segmentation fault" message and suddenly quit. Good use of assertions spared me long debugging sessions.

I ran it on several tests and it worked fine. It is pretty good, but I wanted to go further for this piece of code. And, as E.W. Dijkstra puts it: Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence.

A Specification

One thing that the little story above emphasizes is that arguing about the correctness of a program with a human (even if he/she is a trained programmer) might not be enough to get it right. One alternative I explored is to convince a computer instead. I used the Coq system to program the above algorithm, specify it, and prove its correctness.

The first difficulty encountered was translating the informal specification above into a rigorous mathematical statement. After some thinking it occured to me that relating the stripped down output and the buffer contents had to be done incrementally and backwards, because conditions are always in this form:

A text filled with invisible characters can be stripped if it is before the end of file.
A sequence of spaces can be stripped if it preceeds a newline.
Finally, any rune of the text can be conserved in the output.

These observations can be transcribed precisely in the Coq inductive definition that follows.

Inductive represents: list Rune -> list Rune -> Prop :=

  (* first bullet *)
  | re_eos:
      ∀ s, white s -> nil ⊑ s

  (* second bullet *)
  | re_eol:
      ∀ s1 s2 nl sp,
      space sp -> is_newline nl -> s1 ⊑ s2 ->
      [nl] ++ s1 ⊑ sp ++ [nl] ++ s2

  (* third bullet *)
  | re_sam:
      ∀ s1 s2 r, s1 ⊑ s2 -> [r] ++ s1 ⊑ [r] ++ s2.

Using this formal definition I can state that the output out of the function represents the buffer text text:

Theorem output_represents_text:
  out ⊑ text.

The proof of this theorem turned out to be a bit tricky. What was hard was to relate the state machine machinery of my function with this ⊑ relation. I basically had to run an induction on blocks of "solid" and "white" characters. This induction by blocks is not natively supported by Coq so there is a bit of scaffolding I had to prove (see the next_block function in my proof).

But this is only one half of the result that I really wanted. Because, by applying the third bullet repeatedly, you can see that for any string s it is true that s ⊑ s which means that a simple while loop dumping the buffer intact would satisfy the previous theorem. We want to make sure that there is no way we can strip down the output of our function more, this can also be stated formally in Coq:

Theorem output_cannot_be_shorter:
  ∀ x, x ⊑ out -> x = out.

So if a string represents the output itself, then it must be the output. We say that the output is a normal form for the relation ⊑. To prove this theorem I first wrote a simple function which I thought was detecting normal forms, then proved that it indeed detected normal forms, and then proved that applying this function to my output validates it as a normal form.

Feel free to play and execute my proof step by step using Proof General!

Conclusion & Some limitations

This whole exercise took me around a week. You might very well argue with me that this is a lot too long for such a small function. But, how would you proceed to test it? By hand made tests? By fuzzing? And if you fuzz, how are you going to know the correct output? You would have to program a similar algorithm and you would also have to make sure that this one is correct by one mean or another.

The end product of this is that I am now rock solid convinced that this save code is correct. Even though you might find it funny looking I am sure it makes its job correctly because I convinced a narrow minded computer that it was.

Although, what could make the whole thing trip is a loosy specification. Indeed, if the specification is too weak, for instance only the second theorem above, a valid save function could be the one that writes an empty file regardless of the buffer contents! So we have to keep in mind that proofs on programs are only as strong as the specifications. Perhaps surprisingly, specification often turns out to be the hardest part of this game, for example, how would you specify a compiler, an OS?

There is also one somehow fatal flaw in my little game, how can I prove the correspondence between the algorithm I wrote in Coq and my original C code? If you read my proofs you will see that I tried to make them syntactically as close as possible, but this is in no way a formal proof. For instance, my Coq code uses unbounded integers while we all know that C has finite precision integers. What I would hope is that under certain assumptions (no overflow, for instance) the two programs are equivalent. This would require a formal proof. So I think that a tool taking C code and generating an equivalent Coq program (under some assumptions) could be really neat, unfortunately this is still science fiction today.

Update: I went to PLDI'14 in June and there was a very intersting paper published on exactly that subject. Maybe this translation from C to high level formal systems is no longer science fiction! Some people from NICTA Australia presented a paper Don't Sweat the Small Stuff: Formal Verification of C Code Without the Pain that seems to solve this abstraction barrier problem. Downside for me, it is for Isabelle/HOL, not Coq, but I guess it would have been too easy...