A Unit of Analogy

Porting Lemon to Zig: A YACC Shave

A Faithful Translation

In the last chapter, I went on a side-quest, discussing the coding style of lemon.c in detail. This is one of the things I was paying attention to as I read both the original, and golemon, to start to familiarize myself with the source, and get a handle on what I had gotten myself into.

I discovered that the Go port is a very faithful translation indeed of the original. I have a love/hate relationship with Go, and this port of Lemon has increased my appreciation for it. I have always praised Go for being easy to read, and often seek out (properly licensed) code in Go when researching an algorithm or data structure, for this very reason. It helps that Go largely eschews any fancy or clever enhancements to the language, so when I do find an example in my research, it’s unlikely to obscure the structure of the program using various language-specific hacks.

Golemon has convinced me that anyone who has an application (application specifically) which is written in C, and wishes to ‘modernize’ it by translating to a safer and more maintainable language, would do well to consider Go. Go and C share a lineage, and it shows: most of the pointer idioms in C translate directly to Go, which shares C’s pervasive nullability of everything, default zero allocation of structs, and even its convention for dereferencing. Plus you get to ignore the memory managing code entirely, which is a true luxury. The resulting binary would be larger, surely, and somewhat slower if the application has long CPU-bound regions, but for the common case I would expect the result to be quite satisfactory, and any remaining buffer overflows would panic rather than risking a privilege escalation.

It even has go to statements, just like C. Handy to get you out of a tight spot, those. Great idea right until they’re not.

Zig is another credible option, and markedly superior for libraries, which Go does poorly if at all outside of its own ecosystem. But it became clear, in the course of this initial reading, that my own task would be more difficult. As a minor thing, I would need to port the memory management code as well, which means understanding the program’s memory policy. But the major hurdle would be something else: nullability.

Zig is a null-safe language. A pointer for which null is a valid value is spelled ?*T, not *T, and this condition must be reckoned with for the duration of the program. So my task as a translator included finding out which pointers are in fact allowed to be null, and which are constructed so as never to be so.

A similar consideration applies to integers. In the original C language, a function returned int if not otherwise specified, un-typed parameters were int, and so on. BCPL, its ancestor, had no other type at all, drawing no distinction at all between a value holding an address and any other kind. Numeric types easily cast to int in arithmetic. Without typedefs, if you want an unsigned integer, you have to spell it out, unsigned, and remember, this is C, letters are expensive. It’s an int-oriented language which punishes you for using something else, basically.

Zig’s casting rules are strict, sometimes onerously so. There’s interest in refining them to allow fewer casts in some cases, but that’s a digression. It doesn’t go so far as Rust, which asks you to cast a u8 explicitly to u641, but it certainly won’t allow you to subscript an array or slice using a signed value, not without a cast. This is of course the most normal thing in the world in C, which, combined with the usual malloc strategy of placing allocation metadata in the bytes behind an allocation, can make life very exciting, especially before the proliferation of tools designed to wrap gauze around the various bleeding foot-holes which C code has a tendency to induce.

Go doesn’t mind either, and will, at least, properly panic on you if and when mistakes are made. While this solves memory-corruption issues, I still consider it sloppy: allowing a type where half+1 of the range is illegal for subscripting, to be used in subscripting, is a preventable recklessness, and worse, it removes intention from the type signatures in the program.

C programmers have another reason to prefer int even in these cases: it allows the conventional use of -1 to represent a missing value, as it lacks any concept of an optional in its type system, yet another trait it shares in common with Go2. Excusable in a language with such a long history as C has, but in anything from the 21st century, which Go certainly is, this is simply a hideous design flaw, which gets at the hate side of my love/hate relationship with it. It is a post-millennial Boomer language, a deliberate anachronism, and has produced a cadre of programmers who simply don’t think there’s anything wrong with pervasive nullability and magical negative numbers.

But here I admit that Zig’s answer to this is not, as yet, entirely acceptable as a substitute. It will be: at some future point, a ?u31, that is, an optional unsigned integer of 31 bits, will be four bytes in size, and store null as -1. Making it identical to this kind of int in terms of application, but superior in control flow, and in enforcing the invariant that a negative number is to be treated as if there is no value there.

By the way: did you know that if fork fails, it will return a -1? Also, say you pass -1 to kill, for instance, maybe you want to kill the child process and forgot a check? kill will then kill every single process on the machine, other than init and the calling process. True story!

But as of Zig v0.14, a ?u31 is eight bytes in size, not four, and the generated assembly is not quite as good either. I would encourage the intrepid to ignore this temporary fact, and use types such as ?u31, then enjoy the optimization when it arrives. Zig is stabilizing, but not yet complete; better to target the language it will be, rather than that which it happens to be today.

So I knew going into it that, in addition to figuring out the actual nullability of every pointer, I would need to find appropriate values for each int, be that i32, ?u31, just u32 or usize, and so on. So I wrote a little typedef of my own:

const int = i32;

So that I could distinguish between “I have decided this is a real i32” and “Let’s start with what the code says and see where it takes us”.

I realized as I did this comparison, between the original and the first port, that it would be essential to make as few changes as I could reasonably stomach. Every case where I made names or logic differ from the original, would be context I would need to carry forward while completing the translation. I decided there were two transformations which would be consistent and easy, namely capitalizing types, and lowercasing enum members, so I did those on the fly.

When it came to naming things, I stuck with this policy very closely, and this was essential as I would search and grep between the original and my translation. As we’ll see, I did translate some data structures to use standard Zig ones, and while I paid for this hubris, it was overall the right thing to do.

There’s Nothing to It but to Do It

So at last, sharpening my cursor to a pretty point, I started at the top and began to port. Declaration order is significant in C, which tends to put most of the types at the top, so this was smooth going at first: I made my best guess as to the null-status of each pointer, used int for int, made char * into []u8, and so on.

Then I hit a couple of functions, and bushwhacked my way through them, coming to a standstill as I realized I had hit the limits of this approach. The problem, and this is the major topic of this chapter, is that Zig’s type system is richer than C’s, and more particular, and this has real consequences when translating code of any significance.

I should mention that, largely in the service of interop with C, Zig does have exact equivalents of each and every C type. It would have been viable to do a closer translation using C-isms, but this would not have served me in two ways: for one, it wouldn’t help me understand the code I was porting, and for another, I would just have to go back and make it idiomatic, when the time came to transform it into a Zig parser generator. Spoiler alert: doing the job as I went wasn’t so bad, but at the same time, it was the lion’s share of the effort I needed to put in.

But at the time, I felt the growing force of debt, a debt I had no stomach for paying off all at once. This is due to a striking fact about Zig, its lazy compilation model. All this code I had written to this point was syntactically valid, it parses as Zig code. But since none of it was reachable, it had not undergone any semantic analysis whatsoever.

So I did what I always do: made a commit, closed the editor, enjoyed the rest of my day, and slept on it. When I woke I had the key insight, which made the rest of the task bearable.

What it was can wait until the next chapter .

  1. In fairness, it makes up for this: the as keyword is fairly light on the page. Zig’s approach is… not that.

    1

  2. Go takes the concept of nullability quite a bit further than C, and does eschew puns like treating a numeric zero (any number in fact) as holding a boolean value. But the thing about options is, it’s not the option to not have a value which is important, it’s the option to not have the option to have a value. That’s the part you want.

    2

Part Two: Lemon.c: A Critical Review
Part Four: The Angle of Attack