I did a lot of research on this [1]. I got confirmation from Robert Garner (architect of the SPARC processor) that the NSA did indeed ask for the population count instruction. His story of meeting with the NSA is pretty amusing [2].
Having given a zillion interviews, I expect that they weren't looking for the One True Answer, but were interested in seeing whether you discussed plausible reasons in an informed way, as well as seeing what areas you focused on (e.g., do you discuss compiler issues or architecture issues). Saying "I dunno" is bad, especially after hints like "what about ...", and spouting complete nonsense is also bad.
(I'm just commenting on interviews in general, and this is in no way a criticism of your response.)
I think I said something about the stack efficiency. I was a kid who barely understood out-of-order execution. Register renaming and the rest was well beyond me. It was also a long time ago, so recollections are fuzzy. But I do recall that they didn't prompt anything. I suspect the only reason I got the interview is that I had done some SSE programming (AVX didn't exist yet, and to give timing context, AltiVec was discussed), and they figured if I was curious enough to do that I might not be garbage.
Edit: Jogging my memory, I believe they were explicit at the end of the interview that they were looking for a Master's candidate. They did say I was on a good path IIRC. It wasn't a bad interview, but I was very clearly not what they were looking for.
My compiler knowledge is limited, but I think that you end up with the same parse tree at a very early stage of processing, whether you use Reverse Polish notation or infix notation. So I don't think a language change would make a difference.
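A toy Python sketch of that point (the parser names are mine, purely for illustration): an infix parser and an RPN parser produce the same tree for the same expression, so everything downstream of parsing is identical.

    # Toy demonstration: infix "a + b * c" and RPN "a b c * +" parse to
    # the same tree, so later compiler stages see identical input.

    def parse_rpn(tokens):
        stack = []
        for t in tokens:
            if t in "+*":
                right, left = stack.pop(), stack.pop()
                stack.append((t, left, right))
            else:
                stack.append(t)
        return stack.pop()

    def parse_infix(tokens):
        # Minimal precedence parser: '*' binds tighter than '+'.
        pos = 0
        def expr():                     # expr := term ('+' term)*
            nonlocal pos
            node = term()
            while pos < len(tokens) and tokens[pos] == "+":
                pos += 1
                node = ("+", node, term())
            return node
        def term():                     # term := atom ('*' atom)*
            nonlocal pos
            node = tokens[pos]; pos += 1
            while pos < len(tokens) and tokens[pos] == "*":
                pos += 1
                nxt = tokens[pos]; pos += 1
                node = ("*", node, nxt)
            return node
        return expr()

    print(parse_rpn("a b c * +".split()))    # ('+', 'a', ('*', 'b', 'c'))
    print(parse_infix("a + b * c".split()))  # ('+', 'a', ('*', 'b', 'c'))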
The microcode was so huge that they had to use a semi-analog ROM that held two bits per transistor by using four transistor sizes.
As far as the layout, the outputs from the microcode ROM are the control signals that go to all parts of the chip, so it makes sense to give it a central location. There's not a lot of communication between the upper half of the chip (the bus interface to the 8086 and memory) and the lower half of the chip (the 80-bit datapath), so it doesn't get in the way too much. That said, I've been tracing out the chip and there is a surprising amount of wiring to move signals around. The wiring in the 8087 is optimized to be as dense as possible: things like running some parallel signals in silicon and some in polysilicon because the lines can get squeezed together just a bit more that way.
If you happen to know... what was the reasoning behind the oddball stack architecture? It feels like Intel must have had this already designed for some other purpose so they tossed it in. I can't imagine why anyone would think this arch was a good idea.
Then again... they did try to force VLIW and APX on us so Intel has a history of "interesting" ideas about processor design.
edit: You addressed it in the article and I guess that's probably the reason but for real... what a ridiculous hand-wavy thing to do. Just assume it will be fine? If the anecdotes about Itanium/VLIW are true they committed the same sin on that project: some simulations with 50 instructions were the (claimed) basis for that fiasco. Methinks cutting AMD out of the market might have been the real reason but I have no proof for that.
Stack-based architectures have an appeal, especially for mathematics. (Think of the HP calculator.) And the explanation that they didn't have enough instruction bits also makes sense. (The co-processor uses 8086 "ESCAPE" instructions, but 5 bits get used up by the ESCAPE itself.) I think that the 8087's stack could have been implemented a lot better, but even so, there's probably a reason that hardly any other systems use a stack-based architecture. And the introduction of out-of-order execution made stacks even less practical.
x86 has a general pattern for encoding operands, the ModR/M byte(s), which gives you either two register operands, or a register and a memory operand. Intel also did this trick of reusing the register field for extra opcode bits, at the cost of sacrificing one of the operands.
There are 8 escape opcodes, and each of them is followed by a ModR/M byte. If you use two-address instructions, that gives you just 8 instructions you can implement... not enough to do anything useful! But if you're happy with one-address instructions, you get 64 instructions with a register operand and 64 instructions with a memory operand.
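Here's a rough Python sketch of that arithmetic (the ModR/M field layout is the standard one from the x86 manuals; the helper names are mine):

    # ModR/M layout: | mod (2 bits) | reg (3 bits) | r/m (3 bits) |
    # The 8 x87 escape opcodes are 0xD8-0xDF, each followed by a ModR/M byte.

    def decode_modrm(byte):
        mod = (byte >> 6) & 0b11     # 0b11 = register operand, else memory
        reg = (byte >> 3) & 0b111    # normally a register; here, extra opcode bits
        rm  =  byte       & 0b111    # register number or memory addressing mode
        return mod, reg, rm

    # If I remember the encoding right, FADD ST, ST(1) is D8 C1:
    print(decode_modrm(0xC1))                  # (3, 0, 1)

    escapes = list(range(0xD8, 0xE0))          # 8 escape opcodes
    # Two-address form: the opcode alone names the operation -> only 8 instructions.
    two_address = len(escapes)                 # 8
    # One-address form: the reg field becomes 3 more opcode bits.
    memory_forms   = len(escapes) * 8          # mod != 11: 64 ops with a memory operand
    register_forms = len(escapes) * 8          # mod == 11: 64 ops with a register operand
    print(two_address, memory_forms, register_forms)   # 8 64 64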
A stack itself is pretty easy to compile for, until you have to spill because there are too many live values on the stack. Then the spill logic becomes a nightmare. My guess is that the designers were thinking along these lines--organizing the registers as a stack is an efficient way to use the encoding space, and a fairly natural way to write expressions--and didn't have the expertise or the communication to realize that the design came with some edge cases that were painfully sharp to deal with.
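A toy Python sketch of that (not how any real compiler backend works, just to make the point): the code generator is literally a postorder walk of the expression tree, and the trouble only starts once the depth would exceed the 8087's eight physical slots.

    # Toy code generator: a postorder tree walk emits stack-machine code
    # directly. The painful part starts when depth exceeds the 8 slots.

    def emit(node, depth=0, limit=8):
        if isinstance(node, str):                 # leaf: a variable
            if depth + 1 > limit:
                print("; depth %d: must spill to memory here" % (depth + 1))
            print("PUSH", node)
            return depth + 1
        op, left, right = node
        d = emit(left, depth, limit)
        d = emit(right, d, limit)
        print(op.upper())                         # pops 2, pushes 1
        return d - 1

    tree = ("add", ("mul", "a", "b"), ("mul", "c", "d"))   # a*b + c*d
    emit(tree)
    # PUSH a / PUSH b / MUL / PUSH c / PUSH d / MUL / ADD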
> there's probably a reason that hardly any other systems use a stack-based architecture
I don't know about other backend guys, but I disliked the stack architecture because it is just incompatible with enregistering variables, register allocation by live-range analysis, common subexpression elimination, etc.
There are software workarounds for some of those and very simple hardware workarounds for the others. In a stack-based architecture there should also be some directly-addressable registers for storing long-lived temporary variables. Most stack-based architectures included some set of stack shuffling operations that solved the problem of common subexpression elimination.
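For example, here's a generic stack machine in Python (DUP and SWAP are not real x87 mnemonics; on the 8087 the closest equivalents are FLD ST(0) and FXCH). A DUP keeps the common subexpression on the stack instead of recomputing it.

    # Generic stack machine. Computing (a*b) + (a*b)*c reuses the common
    # subexpression a*b via DUP rather than evaluating it twice.

    def run(program, env):
        stack = []
        for op, *arg in program:
            if op == "push": stack.append(env[arg[0]])
            elif op == "dup":  stack.append(stack[-1])
            elif op == "swap": stack[-1], stack[-2] = stack[-2], stack[-1]
            elif op == "mul":  stack.append(stack.pop() * stack.pop())
            elif op == "add":  stack.append(stack.pop() + stack.pop())
        return stack.pop()

    program = [("push", "a"), ("push", "b"), ("mul",),   # a*b
               ("dup",),                                 # keep a copy for reuse
               ("push", "c"), ("mul",),                  # (a*b)*c
               ("add",)]                                 # (a*b) + (a*b)*c
    print(run(program, {"a": 2, "b": 3, "c": 4}))        # 30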
The real disadvantage is that the stack operations share the output operand, which introduces a resource dependency between otherwise independent operations and prevents their concurrent execution.
There are hardware workarounds even for this, but the hardware would become much more complex, which is unlikely to be worthwhile.
The main influencer of the 8087 architecture, William Kahan, had previously worked on the firmware of the HP scientific calculators, so he was well experienced in implementing numeric algorithms by using stacks.
When writing in assembly language, the stack architecture is very convenient and it minimizes the program size. That is why most virtual machines used for implementing interpreters for programming languages have been stack-based.
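A minimal sketch of why (the opcodes here are invented for illustration): in a stack VM most instructions carry no operand bytes because the stack is implicit, whereas a register VM has to spend bits naming sources and a destination.

    # Minimal stack-VM dispatch loop. Most instructions need no operand
    # bytes; a register-based VM would spend extra bytes on src/dst names.

    PUSH_CONST, ADD, MUL, HALT = range(4)

    def run(code, consts):
        stack, pc = [], 0
        while True:
            op = code[pc]; pc += 1
            if op == PUSH_CONST:
                stack.append(consts[code[pc]]); pc += 1
            elif op == ADD:
                stack.append(stack.pop() + stack.pop())
            elif op == MUL:
                stack.append(stack.pop() * stack.pop())
            elif op == HALT:
                return stack.pop()

    # (2 + 3) * 4 in nine bytes of bytecode:
    code = [PUSH_CONST, 0, PUSH_CONST, 1, ADD, PUSH_CONST, 2, MUL, HALT]
    print(run(code, consts=[2, 3, 4]))   # 20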
The only real disadvantage of the stack architecture is that it prevents the concurrent execution of operations, because all operations have a resource dependency through sharing the stack as the output location.
At the time the 8087 was designed, parallel execution of instructions in hardware was still very far in the future, so this disadvantage was dismissed.
Replacing the stack with individually addressable registers is not the only possible way to enable concurrent execution of instructions. There are two alternatives that keep a stack architecture.
One can have multiple operand stacks and each instruction must contain a stack number. Then the compiler assigns each chain of dependent operations to one stack and the CPU can execute in parallel as many independent chains of dependent instructions as there are stacks.
The other variant is to also have multiple operand stacks but keep the same instruction set with only one implicit stack, while implementing simultaneous multi-threading (SMT). Then each hardware thread uses its own stack while sharing the parallel execution units, so one can execute in parallel as many instructions as there are threads. For this variant one would need many more threads than in a CPU with registers that combines superscalar execution with SMT, so one would need 8 or more SMT threads to be competitive.
Is the 8087 related to the FPU of the 432 in any way? I’ve always suspected the former’s stack nature was due to the latter being entirely stack-based, but precisely no sources mention that, so is it just a coincidence that Intel did two stack-based architectures essentially at the same time (and then never repeated that mistake)?
Yes, the iAPX 432's FPU is related to the 8087. I think they took the 8087 design and redid it for the 432, but I haven't been able to nail down the details. I should take a closer look at the dies and see if there is any similarity.
I looked at the iAPX 432's floating point more closely; it uses the same floating-point model (which became IEEE 754), but the hardware is completely different. In particular, the iAPX 432 doesn't have nearly the same hardware support for floating point that the 8087 does. The iAPX 432 uses a 16-bit ALU for both integer and floating-point math, so it's much slower than the 8087's specialized 80-bit datapath. The 432 also doesn't support transcendental functions like the 8087 does; it is much more limited, supporting arithmetic, absolute value, and square root.
I already have a San Francisco Public Library card because it gives me access to some very useful archives, but I had no idea that I could access O'Reilly as well. Thanks for mentioning this!
If you're dealing with computer graphics, audio, or data analysis, I highly recommend learning Fourier transforms, because they explain a whole lot of things that are otherwise mysterious.
[1] https://retrocomputing.stackexchange.com/a/8666/4158
[2] https://archive.computerhistory.org/resources/access/text/20...