By Casey Muratori
I am not a hardware designer, so I usually don’t think too hard about hardware issues. How could I? Most of the things hardware designers deal with today are issues so far beyond my basic understanding of semiconductor engineering that I probably don’t even know enough to know what I don’t know. But once in a while I have questions about the part of hardware I see as a software developer, and I wonder if there’s a good reason why things are the way they are. Are the current practices due to external factors? If hardware designers could control the whole computing ecosystem, would they still do it the same way?
One example of this was in the early 2000s, as the gap between memory latency and core clock began to widen. I asked some engineers at Intel the innocent question, “Why is memory latency so high?” The answer was simply that the memory was literally far away from the CPU, so there’s too much physical distance between the processor and the memory to access it quickly. When I asked the follow-up question, “So… why don’t you just move the memory closer?”, the answer, sadly, was that Intel didn’t control the way motherboards were made, and memory was a third-party component, so forcing that ecosystem to switch to a design where the memory was physically more proximal was not something they felt they could accomplish. While the first answer was not so surprising, the second answer was, so I feel like I gained some insight into the realities of hardware design just by following it through (and amusingly enough, ~15 years later we’ve now come to the point where they are putting technology into production where significant amounts of main memory can be built very close to the CPU silicon).
Lately I’ve been having another one of those nagging questions, and it seems like the easiest way to get an answer is to just post it here. The question is, why do CPUs still take instruction streams encoded with register names?
If you’re a hardware designer, or a software engineer who knows a lot about hardware, that question may seem stupid and the answer obvious. In which case I’d like to know what it is :) If you’re not, then you probably don’t know why I’m asking that question in the first place, so let me elaborate a little.
Compilers only care about registers because they have to.
When you write a compiler, you typically have a lot of intermediate work that gets done to turn source code into executable code. Most of this work doesn’t have anything to do with registers, because for the most part, registers in a CPU are largely incidental to the correctness of the program. If it weren’t concerned with efficiency, the compiler could almost get away without thinking about registers at all, because all it really cares about is memory. Registers are just temporary places where math can occur. So the compiler natively just wants to do something like, “add the contents of this memory location to the contents of that memory location, and store the result in this other one”, and it is actually an additional burden on the compiler to then convert those operations to be expressed in terms of a limited set of registers.
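To make that concrete, here is a toy sketch (purely hypothetical, not any real compiler’s IR) of the memory-to-memory form a compiler naturally produces: every operand is a named memory slot, and registers never appear.

```python
# Hypothetical sketch: flatten a nested expression into memory-to-memory
# three-address code. No registers anywhere; register allocation would be
# a separate, later pass imposed by the target ISA.

def lower(expr, out, temps):
    """Flatten nested ("op", lhs, rhs) tuples into ops whose operands are
    all named memory slots; returns the slot holding the result."""
    if isinstance(expr, str):              # a variable already lives in memory
        return expr
    op, lhs, rhs = expr
    a = lower(lhs, out, temps)
    b = lower(rhs, out, temps)
    dst = f"t{len(temps)}"                 # fresh memory slot for the result
    temps.append(dst)
    out.append((op, dst, a, b))            # e.g. ("add", "t0", "a", "b")
    return dst

def run(ops, mem):
    fn = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    for op, dst, a, b in ops:
        mem[dst] = fn[op](mem[a], mem[b])  # memory in, memory out

# d = (a + b) * c, with no registers mentioned anywhere
ops, temps = [], []
result = lower(("mul", ("add", "a", "b"), "c"), ops, temps)
mem = {"a": 2, "b": 3, "c": 4}
run(ops, mem)
print(mem[result])  # (2 + 3) * 4 = 20
```

The extra burden the paragraph describes is precisely the pass that would map `t0`, `t1`, … onto a fixed, small set of register names.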
What this means is that any compiler could easily choose to spit out an executable that does not refer to registers at all, but instead encodes the operations with memory addresses directly. It couldn’t run that program anywhere, though, because no modern CPU operates exclusively on memory. x64 processors do support memory operands for some instructions, but support is incomplete, and there is also a reason why you might want to use registers even for the instructions that do.
The reason is that instruction size matters. Since code, like data, must be actively read into a CPU cache and processed, the more of it there is, the more you risk CPU stalls related simply to processing the memory that contains the code, separate from any data that code might access. Since memory operands take up more encoding space than register operands (since memory addresses are obviously many more bits than register numbers), this means that even when you could use a memory operand, you still might not.
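For a sense of the size difference, here are a few hand-assembled x64 instruction encodings (as I understand the format; NASM syntax in the comments). A register-register add is three bytes, while pulling an operand from memory costs one extra byte for a small stack-relative displacement and four extra bytes for a full 32-bit displacement.

```python
# Hand-assembled x64 encodings, illustrating the size cost of memory operands.
reg_reg    = bytes([0x48, 0x01, 0xD8])                          # add rax, rbx
mem_disp8  = bytes([0x48, 0x03, 0x5D, 0xF8])                    # add rbx, [rbp-8]
mem_disp32 = bytes([0x48, 0x03, 0x9D, 0x78, 0x56, 0x34, 0x12])  # add rbx, [rbp+0x12345678]
print(len(reg_reg), len(mem_disp8), len(mem_disp32))  # 3 4 7
```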
But, it’s worth noting that this may be an artificial tradeoff. If you think about it, there’s no reason a CPU architecture couldn’t encode memory operands in as few bits as register operands, provided the memory was laid out in a particular fashion. For example, consider a CPU which has a “stack relative” addressing mode. This mode would use a compressed encoding for the memory operand, whereby a small operand (say, 0 to 15) would fit in very few bits, but a full 32-bit or 64-bit address would take up many more bits. Presumably there would also be a striding bit which used the size of the operand (128 bits if it’s an SSE operand, 512 for AVX-512, etc.) as a memory stride, to mimic the behavior of registers (and this might be implicit in the type of instruction, too).
In this scheme, the “registers” of the processor now just become the top values of the program stack, and the CPU no longer needs any “registers” per se, rather just two “segments” used for addressing (the stack and the segment base). So instead of having a register file, the processor works entirely with memory operands, and there are just certain memory operands that are very cheap to encode, so the code will not be larger than the equivalent code for registers. It is just a pure stack machine at this point, much like the one I said the compiler already knew how to spit out, because it is effectively a prerequisite to the register-using version it has to produce now.
Sound good? I’m honestly asking, because again, I don’t know why it’s not done this way, so the answer may be “no it sounds very bad, and here’s why.”
Now let’s get to the meat of my question:
On a modern CPU, wouldn’t it always be better to use this kind of encoding?
The reason I ask is that modern CPUs don’t really think in terms of literal registers anyway. When you talk about a “register” on a modern CPU, you’re really talking about a “unique transient value”! Modern CPUs do all sorts of out-of-order execution, branch prediction, cache management (MESI and the like), hyperthreading, etc. An x64 CPU doesn’t read the instruction stream as something to blindly execute, the way an 8086 would have. It’s doing all sorts of crazy stuff to make things run as fast as possible, so a “register” is now a semantic idea first, and an actual physical thing second. It is more something that informs the microcode about what you wanted to have happen than something that names the literal place where it will happen.
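The “unique transient value” idea is what register renaming does in hardware. A minimal sketch (my own simplification, not a real rename unit): each write to an architectural name gets a fresh physical slot, so the name is just a tag, and false dependencies between reuses of the same name disappear.

```python
# Minimal register-renaming sketch: "rax" is only a tag; every write
# allocates a new physical slot, so two uses of the same architectural
# name can be tracked (and executed) independently.
phys, rename = [], {}

def write(arch_reg, value):
    rename[arch_reg] = len(phys)   # point the name at a fresh physical slot
    phys.append(value)

def read(arch_reg):
    return phys[rename[arch_reg]]

write("rax", 1); first = rename["rax"]
write("rax", 2); second = rename["rax"]
assert first != second and phys[first] == 1  # the old value still exists
print(read("rax"))  # 2
```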
So this is where I start getting really confused about sticking with the register paradigm. The compiler knows what all the unique values are, and where they are in memory. The CPU needs to know that, too, and just wants to do the right series of operations. By adding the constraint that most operations have to go through registers to happen, the current paradigm effectively adds a layer of obfuscation between the compiler and the CPU, making it harder for the CPU to tell what is actually necessary. This has real costs, even if the CPU is very efficient.
Now you might say, well, a “register” is actually information beyond just an additional addressable set of values. It also tells the CPU that these values are temporary, because they do not have permanently assigned locations in main memory and therefore don’t have to be “saved”. But these days, that’s very far from the case too: When an operating system wants to swap threads, it must do exactly that. It has to manually save the state of all the registers in the CPU to memory, then restore the state of all the registers from memory for another thread.
By comparison, if all register-esque operations were instead encoded as memory operations at the top of the stack, swapping threads would always be a matter of saving and restoring just three 64-bit values, no matter how complex the processor’s logic: the stack pointer, the segment pointer, and the IP (or four if you want two segments). The program can use as many “registers” of whatever size it wants, and the only cost to using more “registers” is that the instruction encoding becomes larger.
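Back-of-envelope numbers for that comparison (my figures, counting only GPRs, XMM registers, and RIP, and ignoring flags, AVX state, and so on): today’s model pays per architectural register, while the memory-backed model pays a flat three pointers.

```python
# Rough context-switch state sizes, in bytes (illustrative only).
X64_GPRS, XMM_REGS = 16, 16
today    = X64_GPRS * 8 + XMM_REGS * 16 + 8   # 16 GPRs + 16 XMM regs + RIP
proposed = 3 * 8                              # stack ptr, segment ptr, IP
print(today, proposed)  # 392 24
```

And the real x64 numbers only grow as register files widen (AVX-512 state is far larger), while the three-pointer cost stays fixed.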
Now, it may be tempting to think that memory-based “registers” would automatically be slow because the processor has to do a lot of extra memory traffic, or something similar. But as I mentioned earlier, the processor already deals with your instruction stream as a logical set of operations, not a mechanical prescription, and it already deals with the MESI protocol and intricate cache management operations. So again, not being a hardware designer, I couldn’t say for sure if there is some obvious reason why it would be any harder to treat the top of the stack as a logical register file internally, and get effectively the exact same performance it’s getting now. Or even better performance, since all the cruft of register naming is now gone.
It may turn out to be that the answer is, “It is very important that registers cannot themselves be addressed by values in registers”. Put another way, if the processor were required to allow a thread to hold a pointer into the part of its own stack being used as the virtual register file, that might force some pretty nasty things to happen to ensure proper execution. However, I’m not entirely certain it really does anything all that different from current MESI and branch-mispredict work. It could, for example, be specified that if anything does operate on pointers that end up targeting registers, the entire pipeline might have to get flushed at that point, in much the same way a branch mispredict would. Since programmers never really need to do such things anyway, this would never happen in any performance-concerned program, and certainly never from any compiler-generated output.
Furthermore, there is an even simpler version of this idea, which is that the register file address is just a register in the processor, and that’s the only value you save and restore. It’s not based on the stack or anything else, it’s just a memory address that says where all the CPU register state is stored, and to swap contexts you just write a new value into it. Are there obvious reasons such an approach is not considered in modern CPUs?
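A toy model of that simpler version (hypothetical, of course; `regbase` and the slot layout are my inventions): all register state lives in memory at an address held in one CPU register, and a context switch is a single store to that register rather than a per-register copy loop.

```python
# Hypothetical "register file base" model: registers are just memory at
# regbase + 8*n, so swapping contexts means rewriting one base address.
memory = bytearray(1024)

class CPU:
    def __init__(self):
        self.regbase = 0
    def reg(self, n):
        off = self.regbase + 8 * n
        return int.from_bytes(memory[off:off + 8], "little")
    def set_reg(self, n, v):
        off = self.regbase + 8 * n
        memory[off:off + 8] = v.to_bytes(8, "little")

cpu = CPU()
cpu.set_reg(0, 42)    # thread A's "register 0"
cpu.regbase = 512     # context switch: one store, no per-register copies
cpu.set_reg(0, 7)     # thread B's "register 0", in a different memory block
cpu.regbase = 0       # switch back
print(cpu.reg(0))  # 42
```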
To complicate the issue, at least from my perspective, people apparently have proposed instruction encodings not based on registers. For example, the Mill Architecture proposes a “belt” to replace registers, such that instructions will be encoded by saying “how many instructions ago” the value you’re referencing was produced. To me, these sorts of alternative architectures seem kind of crazy, because I’m not really sure what problem they’re solving… they still have the problem of needing manual save/restore, they still have additional artificial names for any memory-backed value, and they still require hard-coded ideas about how many of something you have, and how big those somethings are. If ideas this far-fetched are being proposed, surely someone must have considered just making a stack machine, or using a register file address? But yet I haven’t heard of it in modern times (there were very old CPUs that worked this way, forty years ago or so). So I keep thinking there must be something obvious I’m not considering.
What am I missing?
Probably quite a few things. But from an uneducated, external perspective, it seems like straight stack machines might be an interesting encoding for programs as of, say, when x64 was introduced. Were they? Or are there a lot of reasons why they aren’t, but you have to be familiar with the intricacies of processor design to really get why?
I am currently working on 1935, Meow the Infinite, and Handmade Hero.