A Unit of Analogy

Nulls and Slices and Signs, Oh My

Zig is a credible alternative to C. This is rare. Whenever there’s talk of “replacing C”, it’s good to clarify what that could possibly mean.

Languages can achieve a level of success where they win for themselves a sort of undeath. COBOL is the classic example. There’s a lot of it out there, more is being written every day to support those programs, no one is choosing it for new software, not without stretching the meaning of ‘new’. Companies get rid of it when they can justify the expense.

We may not even have hit the point where the number of deployed lines of COBOL diminishes per year. We will, we might have, but even over that threshold, there’s a sort of half-life effect which kicks in. The time before the final line of load-bearing COBOL is finally retired might well be measured in centuries. Despite its reputation, COBOL is actually fairly good for what it does. It’s a wonky and verbose DSL for accounting and reporting on record-oriented data, and there’s a lot of such data which needs accounting for.

C is still very much alive, vibrantly so. COBOL is disliked; C is feared, mainly by those who don’t use it. New C is being written all the time, especially in industries you probably don’t think about much. Partly because there’s no alternative, but mainly because alternatives are resisted. C programmers like C.

C should also be feared by those who do use it. It’s a sharp object, it contains several design flaws which have been disastrously expensive to support over time. Even with more modern tooling, it’s just fiendishly difficult to stay within the boundaries of defined behavior in C code.

To say that Zig can replace C, then, is to say that Zig is a credible alternative to C, in the domains where C is thriving. That’s quite a check to write, and Zig as it is in 2026 cannot quite cash it. However, there are many cases, a growing number, where Zig is not just an acceptable or viable alternative to C, but a superior choice.

I’m not talking about the applications for which C is a dubious or even an indifferent choice. Alternatives of that nature have existed for decades, dozens of them, and more come down the pipeline every day. I said earlier that, if the mission is to port a CLI application in C into a language which is safer and easier to maintain, Go is probably the single best choice.

The elephant in the room is, of course, C++. What? You thought I was going to say Rust? C++ is Rust’s elephant. Rust is doing a credible job of offering a better alternative to C++, but has little to nothing to say about why, after decades of C++ existing, there’s still so much C out there. So if you’re looking to port a C++ codebase to a safer and nicer language, there’s also an obvious choice. I’m speaking, of course, of D. What?

There’s so much more to say here, but this is an intro, not an entire post on the subject raised in the first paragraph. When we left off, I was in the midst of porting a 30+ years old C program to Zig. I hit on a strategy, ultimately successful, of printing the aspects of the internal state which were altered by each operation, from both programs, then diffing the result. It worked great. Lemon is a compiler, a transpiler if you must, so verifying that identical inputs create identical intermediate states, and ultimately, identical outputs, is a viable way to confirm the base level of correctness of the new version.

I said, right at the beginning, that there were significant differences between allocation in Zig and in C, but then spent the rest of the post explaining how I was able to closely emulate the allocation policy of the original program, and never quite explained how those differences affected the process. That was deliberate, so let’s circle back and answer that.

A minor way: In C, the allocator (which is idiomatically global) is responsible for tracking how much memory was allotted to a given pointer. So it isn’t uncommon to allocate enough memory for several data structures, dole it out, then hand free the first of those pointers. It will take care of the rest. In Zig you can’t do that: the allocation contract says that the caller must indicate the number of bytes to free. One may certainly request some memory and break it up for parts, but without some way of reassembling it to return to the allocator, it’s a bad idea.
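A minimal sketch of that carve-up idiom in C. The header and entry structs and the function names are invented for illustration; none of this is taken from Lemon.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record types, made up for this sketch. */
struct header { int count; };
struct entry  { int key; int value; };

/* One malloc covers the header and n entries; the entries live
   directly after the header in the same block. */
struct header *make_table(int n, struct entry **entries_out) {
    assert(n >= 0 && entries_out);
    size_t bytes = sizeof(struct header) + (size_t)n * sizeof(struct entry);
    struct header *h = malloc(bytes);
    if (!h) return NULL;
    memset(h, 0, bytes);
    h->count = n;
    *entries_out = (struct entry *)(h + 1);  /* carve the tail out for entries */
    return h;
}

/* Freeing the first pointer releases the entries too: the allocator,
   not the caller, remembers how large the block was. */
void free_table(struct header *h) {
    free(h);
}
```

Nothing in free_table knows how many entries there were, and it doesn't need to. Zig's free takes a slice, so the length has to come back along with the pointer.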

Lemon takes advantage of this in several places. Rather than emulate that, which would have called for a touch of book-keeping and boilerplate, I just separated the allocations. I think both languages have the right idea for themselves. In Lemon the advantages of guaranteeing those structures occupy a contiguous region of memory would be minimal, but there are applications where it’s more important; if you have one of those, Zig can get you this result, it’s a bit off the beaten path, but so what.

The big differences all relate to several far-reaching differences in how Zig and C prefer to represent and work with data. Zig’s type collection is a proper superset of C’s, to support complete interoperability in both directions with the C ABI. But almost all Zig code will use data structures which C lacks, because Zig insists on tracking several things which C does not. Chief among them: length, and nullability.

C has a handful of languages which are lineal descendants: C++, D, and Objective-C, to name a few¹. But it has only one sequel: Go². Conventionally, Ken Thompson is credited with Unix, and Dennis Ritchie with C, but the reality was a much closer collaboration, and Ken wrote the B language, which came between BCPL and C itself. If you look at the awards the two of them share, it’s clear that the perception that they were the primary co-creators of both Unix and C is nigh universal. Commander Pike is no slouch either, just a bit later to the game. So I feel safe in saying that Go is what the creators of C wanted in a programming language which wasn’t C itself.

A Billion Here, a Billion There

Go has a proverb: “make the zero value useful”. Go also has slices. I must conclude from this that Ken Thompson disagrees with Tony Hoare about the billion dollar mistake being any kind of mistake at all. At least he saw the light about null terminated strings!

This carries over a practice from C, and yes, this influences allocation semantics as well, quite directly so. A C pointer is allowed any value, including zero, and zero is false in terms of control flow. C variables, when declared, are of arbitrary value, but for structs at least, memory is easily filled to zero: struct something a_struct = {}; will do it. So important is this convention that all operating systems outside of narrow embedded contexts won’t map the page containing 0x0, so that at least NULL references segfault, rather than the worse alternatives which we might contemplate.

Idiomatic C code makes heavy use of this fact. It has an elegance to it: if (some_p) is a null check, and the true branch may assume that some_p points somewhere valid (whether that’s true or not). This is used as a termination check in pointer-walking as well, something C does often, since arrays ‘decay’ to pointers when so much as glanced at.
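The idiom is short enough to show whole. This is a sketch with a made-up node type, not Lemon's code: the pointer doubles as the loop condition, and the same trick works on the NUL byte at the end of a string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, invented for this sketch. */
struct node {
    int value;
    struct node *next;   /* NULL (zero) marks the end of the list */
};

/* Sum a list by walking until the pointer is zero, i.e. false. */
int sum_list(const struct node *p) {
    int total = 0;
    while (p) {              /* the null check is the loop condition */
        total += p->value;
        p = p->next;
    }
    return total;
}

/* Count characters by bumping a pointer until the NUL byte,
   which is also false in a condition. */
size_t count_chars(const char *s) {
    assert(s);               /* a NULL s would be dereferenced below */
    size_t n = 0;
    while (*s++) n++;
    return n;
}
```

A zero-filled struct node (written `= {0}` in pre-C23 code) is a valid one-element list with value 0, which is the convenience the zero-value convention buys.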

Zig begs sharply to differ with the wisdom of this. This is the first of three core differences which end up pervading nearly every line of code I ended up porting. In Zig, a *T is allowed to have any value except zero. Any type in Zig may be made optional, and a ?*T is guaranteed to represent “no *T” with the value zero, pronounced null.

Zig does not initialize data types to zero unless specifically directed, and if the data type is *T, Zig won’t even allow it without considerable hassle. I happen to think this is perfectly correct, and that C and Go are stark raving bonkers here. We pronounce *T “pointer to T”, we don’t say “pointer to T unless it isn’t lol. lmao”.

C has calloc, which allocates memory guaranteed to be all zeroes. Zig does not; although I believe for performance reasons it should gain this ability, it would be of more limited use, since anything with a pointer couldn’t use it. realloc is defined so that, if given a NULL pointer, it will behave exactly like malloc. Try that with Zig and you’ll get a segfault. That bit me early in the process, I ended up allocating a bunch of zero-length []u8 slices up front as the cleanest, or at least most expedient, of several possible solutions.
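For reference, here is the C side of that in a small sketch. The int_vec type and function names are hypothetical; realloc and calloc behave as the C standard specifies.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable int array; a zero-filled one is a valid empty vector. */
struct int_vec {
    int *items;     /* NULL until the first push */
    size_t len;
    size_t cap;
};

/* Append a value, growing as needed. Returns 0 on success, -1 on failure.
   The first call passes items == NULL to realloc, which the C standard
   defines to behave like malloc. */
int vec_push(struct int_vec *v, int value) {
    assert(v);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->items, new_cap * sizeof *p);
        if (!p) return -1;
        v->items = p;
        v->cap = new_cap;
    }
    v->items[v->len++] = value;
    return 0;
}

/* calloc hands back memory with every byte set to zero. */
int *zeroed_ints(size_t n) {
    return calloc(n, sizeof(int));
}
```

The vector never needs a separate "first allocation" path, because the null pointer is an acceptable argument to realloc. That is the shortcut that has no direct equivalent once the pointer type forbids zero.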

In Zig code, bringing null into the type system solves problems, it doesn’t cause them. Even across the C boundary, a couple of lines of conversion code serve to induct a pointer into the type system. When porting C code, it does create problems, which need to be solved on the spot.

What is a program? A miserable pile of pointers.
- Dracula, maybe

Lemon, being a C program, is more-or-less entirely composed of pointers. Some of the most important data structures are linked lists, mostly intrusive, all of them singly linked. The link fields are easy: that is obviously a ?*T, because the list has to end, and that’s how it ends.

That just leaves everything else. As I went through the code, I was constantly asking myself: is this checked for 0³? Does this happen repeatedly? Fortunately, this is good C code, so it populates pointers which are not meant to be null as early as it can. I was able to initialize those as undefined and be confident they would get a value before being referenced. That’s always better, and sometimes it would involve initializing using a named block and a break statement. What I’ve found is that every change from the original, no matter how minor, is costly, but that kind of transformation is least so; the code looks different from the original, which makes it harder to figure out how it’s actually different when I was trying to find where I screwed up.

Use of *T did allow me to elide several assertions, and even the occasional control branch, but it’s an up-front cost on porting C which the interested should be aware of.

And if you’re not the interested: how on Earth did you get this far?

Strings By the Slice are Rather Nice

I said there were three core differences, optionals being the first. The second is slices. Lemon in C has dozens of char *, Lemon in Zig has none.

Advantage: Zig. Not in a totally uncomplicated way: you might assume, for instance, that C’s use of the NUL byte for string termination means that it is more sparing of memory than the Zig equivalent.

But you would be wrong. Remember, free just takes a pointer, so it is required by the standard to be able to free all of the pointed-to memory, just given that pointer. How does it accomplish this? The standard doesn’t say, but classically, malloc allocates a word behind the pointer to store the length. In Zig, this word isn’t stored in the allocation, it gets given back from alloc along with the pointer: a slice. There is no need for a NUL byte, although Zig has sentinel slices, spelled [:0]u8 in the char * equivalent case, for whenever that’s necessary or useful. I used one to store the input file buffer, so that the tokenizer could look for NUL instead of constantly length checking. This was both closer to the original, and more performant: Zig’s tokenizer does it as well.

Advantage: Zig, by one byte per string. Again, not uncomplicated: in the event that many pointers to the same string are traveling around, C gets its edge back. That happens.

But it was and is the wrong thing to optimize for, and since Go has slices, it would be nice to hear Ken Thompson admit that Niklaus Wirth was right.

Of course this is not the only advantage of slices, it’s not even one of the main ones. The OG Lemon tokenizer caches a byte in the input buffer, replacing it with zero, then putting it back once the parser is done with it. This perverse⁴ behavior is absolutely ordinary in C. The libc function strtok chops up what it’s given in this dissolute manner, not even favoring you by restoring the string, it’s just left with a bunch of holes in it. Good thing free doesn’t need to use strlen to free it eh?
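Both habits fit in a few lines of C. This is a sketch of the general trick, not Lemon's tokenizer: strtok is the real libc function, while token_len_by_patching and count_tokens are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <string.h>

/* Measure buf[start..end) with strlen by temporarily planting a NUL at
   index end, then putting the original byte back. buf must be writable
   and buf[end] must be inside the buffer. */
size_t token_len_by_patching(char *buf, size_t start, size_t end) {
    assert(buf && start <= end);
    char saved = buf[end];
    buf[end] = '\0';
    size_t n = strlen(buf + start);
    buf[end] = saved;
    return n;
}

/* Count tokens with strtok, which overwrites each delimiter it
   consumes with NUL and does not restore it. */
int count_tokens(char *buf, const char *delims) {
    int n = 0;
    for (char *tok = strtok(buf, delims); tok; tok = strtok(NULL, delims)) n++;
    return n;
}
```

After count_tokens runs on "alpha beta gamma", strlen on the same buffer reports 5: the spaces are gone and the string now ends after "alpha". Neither function can be pointed at a string literal or any other read-only memory.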

And of course, inevitably, that string’s length is needed at some point. The only way to get it is to go looking for NUL and hope that you find it. If you need it more than once, you’ll want to make a field to store it on, and congratulations, you have built a slice from parts.

There’s a routine in Lemon which copies a string to a scratch buffer, so that it can strip whitespace, because it needs to put the NUL in, then copies that over to the heap, so that the next iteration can use the scratch buffer, and frees both the array it’s using and all the strings at the end of the function, after printing what it put together. I did all of that with slices, and no additional allocation. That isn’t a flex: Lemon in both C and in Zig is perceptually instant when it executes, so any talk of efficiency is purely performative. It’s just a whole rigamarole I was not obliged to go through.
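To make the "slice from parts" point concrete, here is what the pointer-plus-length pair looks like when written out by hand in C. The str_slice type and its helpers are invented for this sketch; they are a stand-in for what a Zig []const u8 gives you for free.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hand-built slice: a pointer and a length kept together. */
struct str_slice {
    const char *ptr;
    size_t len;
};

/* Pay for strlen once, then carry the length alongside the pointer. */
struct str_slice slice_from_cstr(const char *s) {
    assert(s);
    struct str_slice out = { s, strlen(s) };
    return out;
}

/* Sub-slicing needs no copy and no NUL: just a new pointer and length. */
struct str_slice slice_sub(struct str_slice s, size_t start, size_t end) {
    assert(start <= end && end <= s.len);
    struct str_slice out = { s.ptr + start, end - start };
    return out;
}

/* Strip leading and trailing spaces without touching the underlying bytes. */
struct str_slice slice_trim(struct str_slice s) {
    size_t a = 0, b = s.len;
    while (a < b && s.ptr[a] == ' ') a++;
    while (b > a && s.ptr[b - 1] == ' ') b--;
    return slice_sub(s, a, b);
}
```

Trimming here allocates nothing and writes nothing, so it works on read-only input. The catch in C is that the result is not NUL-terminated, so it can't be handed to printf("%s") or strcmp without further ceremony.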

I also want to point out that we have this exotic construct in Zig: it’s spelled [*:0]u8, we call that a sentinel-terminated multipointer. The brackets mean you can do arithmetic on it, and the :0 means it has a NUL byte. Those usually arrive from C code, and we turn them into [:0]u8, a sentinel slice, just as soon as we can.

Advice to those of you on a porting adventure: just go straight for slices, right away. That’s the obvious part, this is less perfectly obvious but more important: represent “null” char * as a slice of length 0. ?[]u8 is a valid type, and it’s the same size as a []u8, but you’ve given yourself two ways to represent the absence of data, and you only ever need one. Go, due to that love affair with the zero value, always has both options available whether you want it or not, and you can spend as much time reading Go blogs talking about the nil slice vs. the empty slice as you wish, it’s a perennial bellyache in those circles.

Also, you’ll likely discover, as I did, that the C code already has slices. They’ll just be stored separately, as a pointer and an int, the latter usually called nFoo. Just be careful about that, there’s no law that says it’s counting every address in the decayed array, but usually it is, and you can drop the int and use the .len field directly. Then you can use the slice-traversing for loop and reduce the amount of i you have to deal with.

Separating the *T from the ?*T is labor, but it’s up-front labor, it smooths out as the port progresses. Turning char * into []u8 is dead easy, almost mechanical, and when you absentmindedly forget to turn if (some_string) into if (some_string.len > 0), the compiler will let you know. Nothing can solve the fact that string manipulation in C is a horrible faff full of bad life choices⁵, and that makes it a real bear to wrap your head around. So here’s another tip: when the code starts pointer bumping, then yeah, you can slice your slice and get the same effect, but consider throwing an index variable into the mix. It’s just cleaner to take all your regions out of the same slice, it compiles to the same code, thank me later.

Note that null-termination by itself is not such a bad thing. I made the file buffer a [:0]u8, not just because it made life easier in porting routines like the tokenizer, but also because having a defined termination value is more efficient for that sort of traversal. The Zig parser does the same thing, for the same reason. So remember that’s an option, for sure, but a sentinel-terminated slice is still a slice.
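A small C sketch of why the sentinel helps a scanner. Both functions are hypothetical and only measure an identifier-like run at the start of a buffer; the difference is how many tests each loop iteration performs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length-checked scan: every step tests the bound and the character. */
size_t ident_len_bounded(const char *buf, size_t len) {
    assert(buf);
    size_t i = 0;
    while (i < len && (isalnum((unsigned char)buf[i]) || buf[i] == '_')) i++;
    return i;
}

/* Sentinel scan: the terminating NUL is not an identifier character, so
   the character test alone stops the loop at the end of the buffer. */
size_t ident_len_sentinel(const char *buf) {
    assert(buf);
    const char *p = buf;
    while (isalnum((unsigned char)*p) || *p == '_') p++;
    return (size_t)(p - buf);
}
```

The sentinel version is only safe because the buffer is guaranteed to end in NUL, which is the guarantee a [:0]u8 carries in its type.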

So that’s two out of three, and we haven’t gotten to the hard part. If you’ve ported C to Zig you know what’s coming, otherwise you might not have guessed it, but it’s spelled int.

And it sucks.

Of Integers and Integrity: The Sign of the Times

“But why”, I’m pretending you asked, “Zig has i32, it has u32, isn’t that just int and unsigned?” and no it damn well ain’t. Yes, they are the same data type, they pass along the C ABI in just that way, but they are not theologically⁶ equivalent.

C code heavily uses int, including very frequently to represent values which logically can only be zero or positive. D. Richard Hipp’s C code more than most, and remember, this is the reigning champion of robust and correct C code we’re talking about here. It’s not out of ignorance and it isn’t laziness either.

In a way it’s worse than that: it’s idiom. It’s not just that unsigned is five letters longer than int, although this is C, letters are very expensive. C’s integer promotion rules favor signedness. C will quite happily let you index a pointer (or an array if you happen to be in the instant before it decays into a pointer) using a signed value⁷. Unsigned and signed are of equal rank, so if you assign an unsigned to a signed, there’s no type conflict, and the result is implementation-defined when the value doesn’t fit. In practice you get an unexpectedly negative number. A problem signed types of equal rank don’t have, since by definition they can hold the exact same range of positive values.
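Three of these behaviors in one C sketch. The function names are made up. Strictly, the standard leaves an out-of-range unsigned-to-int conversion implementation-defined; the wrap to a negative number noted in the comment is what common two's-complement compilers such as GCC and Clang do.

```c
#include <assert.h>
#include <limits.h>

/* Assigning an unsigned value to an int compiles without a cast.
   If the value doesn't fit, the result is implementation-defined;
   common compilers wrap it, so UINT_MAX typically becomes -1. */
int to_signed(unsigned int u) {
    int i = u;
    return i;
}

/* The check a careful caller would make before converting. */
int fits_in_int(unsigned int u) {
    return u <= (unsigned int)INT_MAX;
}

/* Mixed comparison: the int operand is converted to unsigned first,
   so a negative a is compared as a very large positive number. */
int less_than_mixed(int a, unsigned int b) {
    return a < b;
}

/* Indexing with a negative int is accepted; p[-1] is the element
   before p, which must exist for this to be valid. */
int element_before(const int *p) {
    assert(p);
    return p[-1];
}
```

less_than_mixed(-1, 1u) returns 0, which is the sort of surprise that pushes C programmers back toward plain int.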

The language nudges you in numerous ways to use signed integers, going so far as to reward the use of unsigned with new and exciting ways for your program to be wrong. On the positive end this is of little consequence: if your problem domain needs to model numbers over one billion-ish⁸, it’s probably not limited to two billion and you’re going to want a long.

One reason C code does use unsigned values is because arithmetic is defined to be modular, so it’s convenient for hash routines and such. Look out for that: if C code is doing arithmetic with unsigned, you probably want to use *%, +% and the like.
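As an illustration of the hash-routine case, here is 32-bit FNV-1a in C. FNV-1a is a stand-in chosen because it is short and well known; it is not claimed to be the hash Lemon uses.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1a. The multiply is expected to exceed 32 bits; because
   the operands are unsigned, C defines the result as reduced modulo 2^32. */
uint32_t fnv1a32(const char *s) {
    assert(s);
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                  /* FNV prime; wraps silently */
    }
    return h;
}
```

A line-for-line port of the multiply with a plain * would trip Zig's overflow check in safe builds, which is where the wrapping operator *% comes in.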

None of which is the idiom part. There is a more urgent reason for the preference for signed values, namely, C has no concept of an optional type. In the case where 0 is not a valid value, no problem, and you get clean control flow to boot, but this is very frequently not the case. Even if 0 can represent “you get nothing”, sometimes C code needs to say “actually something bad happened”, and they know full well you won’t be checking errno, so in-band signalling it is.

In C, idiomatically that is, both of the situations are handled with that most magic of numbers: -1. You can get control flow with >= or <, depending, and Heaven will not aid you if you assign this to an unsigned, because, equal rank remember? C takes the stance on numeric conversions that you either know what you’re doing or you get what you deserve, and it doesn’t care which.

But assign it to another int, and show the appropriate amount of discipline, and you have another notionally positive number which happens to not exist. Do remember to check.
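The idiom in miniature, as a C sketch with hypothetical function names: an index that can never legitimately be negative, with -1 pressed into service as "not found".

```c
#include <assert.h>
#include <string.h>

/* Return the index of name in names, or -1 if absent.
   Valid results are never negative, so -1 doubles as "no such entry". */
int find_name(const char *const *names, int n_names, const char *name) {
    assert(names && name);
    for (int i = 0; i < n_names; i++) {
        if (strcmp(names[i], name) == 0) return i;
    }
    return -1;
}

/* The caller has to remember the check; nothing in the type enforces it. */
int name_length_or_zero(const char *const *names, int n_names, const char *name) {
    int i = find_name(names, n_names, name);
    if (i < 0) return 0;
    return (int)strlen(names[i]);
}
```

Drop the `if (i < 0)` line and the code still compiles, then reads names[-1] on a miss. In a port, each function shaped like find_name raises the question of whether its return type should become an optional unsigned value.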

This is another way in which Go is an easier porting target for C code than Zig is. It’s just as pleased to permit signed indexing, and it has a runtime, so any out-of-bounds index, negative or otherwise, will panic on you so you can fix it with minimal hassle. That’s an improvement on C, Zig in safe modes checks bounds also. But.

As I write this, Zig’s type casting rule is admirably simple:

Type coercions are only allowed when it is completely unambiguous how to get from one type to another, and the transformation is guaranteed to be safe. There is one exception, which is C Pointers.

This is… wildly, extremely different from C. Drastically so. When this rule is applied to arithmetic, it can become genuinely onerous; there is some talk of alleviating this burden and I hope it happens.

For casting though, it’s the right thing. If you think you can get away with assigning a value to a type which may or may not be able to hold it, go right ahead, @intCast is ready and willing. Because Zig is not going to compile that without it.

The result of this is that Zig punishes you harshly, and repeatedly, for representing a properly-unsigned value as a signed one. Indexing is done with usize, whatever that is on your platform, but on mainstream targets it’s at least u32 so anything smaller will always index. No numeric type starting with i may index a slice without casting. The compiler won’t remember the last time you did it either. In Zig you use signed types when you will be dealing with negative numbers, or you pay for it.

Also, Zig has both optional types, and errors, with control flow which plays very nicely with both of these. The Magical Value Minus One is simply not a part of how things are done.

Properly, ?u31 should be the precise equivalent of int when used with the negative-null intention, right down to a representation of the null type which is bit-identical with -1. But Zig doesn’t optimize that specific case yet, so it’s eight bytes, same as ?u32. u31 introduces a required cast from the more usual u32, but it will at some future point take up half the space in optional form, so it’s a wash which way you want to go with this one. I finished the port with ?u32, then future-proofed it by narrowing those uses by one bit. Path of least resistance.

This was the bad one of the three. Those ints dogged me through the entire port, top to bottom. The most mind-bending example is the boolean evaluator for %if-type clauses in the grammar. It’s recursive, and canonically returns a 0 for false or a 1 for true, and takes a lineno parameter for error reporting. But when called recursively, lineno is -1, and if you get a syntax error in the recursive call, the return value will also be negative. This negative number is added-by-subtraction to the character pointer to report the location of the interior syntax error, and the syntax error routine knows whether or not to return a negative value or to die on the spot by whether or not lineno is negative. That was an interesting puzzle to figure out.

Here’s another pro tip: do this:

const int = i32;

And use it when you don’t know what’s going on. That way you can delete it at the end and be sure you got everything ship-shape and Bristol. Signed values which survive will represent integers including a negative range, or at least routines where you kept the C semantics to preserve your sanity.

I mentioned that Zig casting rules get rough with arithmetic, here’s what I mean. You don’t get to do this:

fn cmp(a: u32, b: u32) i64 {
  return a - b;
}

I mean, you can write it actually, which is maybe worse, but you get an integer ‘overflow’ if b is in fact larger. Zig simply does not consider the result location, casts are purely local and between operands, and so yes, you would need to individually cast both a and b and, I love this language and this specific thing will probably improve but, that looks like, er, this:

fn cmp(a: u32, b: u32) i64 {
  return @as(i64, @intCast(a)) - @as(i64, @intCast(b));
}

Which… is a lot⁹. I write an inline which lets me just say cast(i64, a) when this comes up, and it’s somewhere or another in most of my libraries.

cmp was a bit of foreshadowing: C code does a lot of this, qsort for instance wants a function where the semantics are negative, zero, and positive. It’s actually not so bad: Zig doesn’t do this, it uses a boolean “lessThanFn”. Which is required to be actually less than, but I digress. So you implement with comparisons, which are always safe and simple, instead of subtractions, which can get fairly fraught.

By the way, C will also do awful things to you if you try to write cmp with two unsigned integers. Why did Zig say we had an integer overflow up there? Yep. That’s why. So like I said, there are reasons why int gets used for logically non-negative values.
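Here is the C version of that trap, next to the comparison-based alternative, as a sketch with made-up function names. The widening to long long happens after the unsigned subtraction has already wrapped.

```c
#include <assert.h>
#include <stdlib.h>

/* Subtraction-style comparator on unsigned values. a - b is computed in
   unsigned arithmetic, so when b > a it wraps to a huge positive number
   before it is ever widened to long long. */
long long cmp_by_subtraction(unsigned int a, unsigned int b) {
    return a - b;
}

/* Comparison-style comparator: cannot overflow, always -1, 0 or 1. */
int cmp_by_comparison(unsigned int a, unsigned int b) {
    return (a > b) - (a < b);
}

/* Adapter with the signature qsort expects. */
int cmp_unsigned(const void *pa, const void *pb) {
    assert(pa && pb);
    unsigned int a = *(const unsigned int *)pa;
    unsigned int b = *(const unsigned int *)pb;
    return cmp_by_comparison(a, b);
}
```

cmp_by_subtraction(3, 5) comes back positive, claiming 3 is the larger value. C says nothing about it; Zig in a safe build stops the equivalent code with an overflow panic.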

Zig’s system here isn’t perfect, but it isn’t finished yet either. If it never changed I would still strongly prefer it to C. Zig has an integral concept of result location semantics, I think that’s a persuasive argument for performing the (perfectly safe) conversion of both operands in version one of cmp to i64 before subtracting; @as actually creates a result location, which is how @intCast knows what to do, so even inline you could have @as(i64, a - b), and it would make things a lot more expressive. You only need an i33 here, but let’s not get distracted.

Back to the port: translating a cmp-based comparison function to use < and its cognates is not especially recondite. But proofreading complex control flow to figure out if magical -1 is even in play or if, in fact, that’s a load-bearing genuine article negative number, that: that is fatiguing.

This bit me hardest right at the end, in the code which builds the action tables which drive the parser’s finite state machine. I saw some uses of -1 which were clearly marking invalid or missing data, and looked at the tables generated, and figured this could all be replaced with optional ?u32. I didn’t look carefully enough. The reduce tables have negative offsets, so despite negative values being illegitimate in many parts of the operation, the int data type was proper to the domain. That took some backing out of. The final code has no less than 22 @intCasts in it, in one form or another. Probably, I could factor some of these out. Probably, I will not.

That, and working out the behavior of string operations, with all the pointer bumping, indexing, pointers-to-pointers, NUL impositions, and panoply of libc functions with gnomic names¹⁰, that’s where most of my effort in the port went. Well that, and figuring out how Lemon does what it does, oh, and fixing my bugs, of course. But all that is priced in, if you feel me.

The last chapter of this saga will meander through some highlights of the porting process itself, give some broader advice gleaned from this experience, and talk a bit about why you, Dear Reader, might want to port an application from C to Zig as well!

¹ Not Zig, and not Rust. All three of the languages I cited can compile some subset of C directly, often a substantial one. Zig and Rust can’t parse so much as a line of it, other than by happenstance. zig-the-program can both translate and compile it, but that’s a secret third thing.

² Still in use, that is. Go owes as much to Limbo as it does to C, but Limbo is well into the undeath stage and likely to stay that way.

³ drh does not define a NULL macro, nor use the provided one. This is equally true in Pikchr as in Lemon. C89 includes the NULL macro in several of the headers which both said programs include, leading me to think it’s an affirmative preference. With a language server on hand to remind me of the type when I need that reminder, this poses no actual problems. It’s a quirk nonetheless.

⁴ Say you wanted to run the tokenizer on static memory, to test it or something. Sucks to be you!

⁵ Unix has a panoply of different utilities for string handling, including awk, a full-fledged programming language in its own right. From a certain perspective this is all to compensate for the original sin of C’s awful char * strings. Of course they’re all written in C, but you only have to do it once.

⁶ Is there a better way to put this? Not really.

⁷ This plays nicely with malloc’s habit of storing allocation information at p[-1] or thereabouts, it’s all very exciting.

⁸ I am studiously ignoring the fact that an int does not have a guaranteed size in the standard. There’s only so much of this I can stomach.

⁹ Rust makes you do it more often, but at least it’s just a as i64.

¹⁰ drh is quite disciplined about using the sane subset of those functions, as you might expect. You might not be so lucky.
