
But is the "proper UNIX way" a good thing?

Funny that you would be arguing for that (unless I misunderstood the intention), given your many other posts about how C is a horrible broken unsafe language that should not be used by anyone ever. I tend to agree with that, btw, even if not so much with the "memory safety" hysteria.

Should every program, now and in the future, be forced to depend on libc, just because it's "grandfathered in"?

IMO, Linux is superior because you are in fact free to ignore libc, and directly interface with the kernel. Which is of course also written in C, but that's still one less layer of crud. Syscalls returning error codes directly instead of putting them into a thread-local variable would be one example of that.
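
To make that concrete, here is a minimal sketch of the raw interface on x86-64 Linux (Msg, Msg.len and error are placeholder labels): the kernel hands the error back directly in RAX as a negative errno, with no libc and no thread-local errno variable involved.

    ; write(1, Msg, Msg.len) through the raw 64-bit syscall interface
            mov     eax,1                   ;64-bit syscall# for write
            mov     edi,1                   ;handle
            lea     rsi,[Msg]               ;pointer
            mov     edx,Msg.len             ;length
            syscall
            test    rax,rax
            js      error                   ;RAX holds -errno on failure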

Should a hypothetical future OS written in Rust (or Ada, ALGOL 68, BLISS, ...) implement its own libc and force userspace applications to go through it, just because that's "proper"?


I don't think GP is arguing that's the best way to design an OS, just that interfacing with non-Linux Unixes is best done via libc, because that's the stable public interface.

With the WWW, from here on out and especially in multimedia WWW applications, frames are your friend. Use them always. Get good at framing. That is wisdom from Gary.

The problem most website designers have is that they do not recognize that the WWW, at its core, is framed. Pages are frames. As we want to better link pages, then we must frame these pages. Since you are not framing pages, then my pages, or anybody else's pages will interfere with your code (even when the people tell you that it can be locked - that is a lie). Sections in a single html page cannot be locked. Pages read in frames can be.

Therefore, the solution to this specific technical problem, and every technical problem that you will have in the future with multimedia, is framing.

Frames securely mediate, by design. Secure multi-mediation is the future of all webbing.


The better analogy would be https://en.wikipedia.org/wiki/Project_Stargate

"If there's a chance psychic powers are real..."


Forth generated code is basically a long series of "assembler macros", always doing the same maximally-generic thing for each primitive operation. Even a very simple-minded compiler of the 1980s could already beat that.

    VAR1 @ VAR2 @ + VAR3 !
will execute this at run time:

    ; push address of VAR1
    inc    bx
    inc    bx
    push   bx
    ; fetch and jump to next primitive
    lodsw
    mov    bx,ax
    jmp    [bx]
    ; push contents of variable
    pop    bx
    push   [bx]
    ; next primitive...
    ; push address of VAR2, next...
    ; push contents of variable, next...
    ; add top two stack elements, push sum, next...
    ; push address of VAR3, next...
    ; store to address, next...
There are some "low-hanging fruits", like keeping the top of stack in a register (which the Forth used here doesn't do), or using direct threading.

Still, an incredibly stupid compiler could do better (for execution speed, definitely not size) by outputting similar code fragments - including all the pushes and pops, who needs a register allocator? - just without the interpreter overhead (lodsw etc.) in between them.

A compiler producing worse code likely didn't exist before today's glorious world of vibe coding ;)

A slightly better compiler would directly load the variables instead of first pushing their addresses, etc. You don't need to be an expert in compiler theory to come up with enough ideas for boiling it down to the same three instructions that a competent assembly programmer would write for this operation. And at least for this case, Forth doesn't even have the size advantage anymore: the code would only take 10 bytes instead of 14.
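
For reference, a sketch of that three-instruction version as it might look on the 8086 (assuming 16-bit variables); with the short accumulator encodings it does come to 10 bytes:

    mov     ax,[VAR1]       ; 3 bytes (short accumulator form)
    add     ax,[VAR2]       ; 4 bytes
    mov     [VAR3],ax       ; 3 bytes (short accumulator form)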


The compiler space in 1985 was really thin. You were basically looking at Microsoft/Lattice C and Turbo Pascal. And while I don't have any of them handy for a test, that's pretty much exactly the character of the code they'd generate. In particular the 8086 calling conventions were a kludgey dance on the stack for every function, something forth could actually improve on.

I know Turbo Pascal produces really bad code (even in later versions), but it's not on the same level as a non-optimized Forth. Function prologue on x86 doesn't have much overhead either.

It's somewhat closer for 8-bit processors: the most popular ones had either no convenient addressing mode for stack frames at all, or an extremely slow one, like IX/IY on the Z80. For even more primitive CPUs, you might already need a virtual machine to make them "general-purpose programmable" at all -- for example, if there is only one memory address register and no way to save its contents without clobbering another one. I think some of Chuck Moore's earliest Forth implementations were for architectures like that.

Also memory constraints could have been more important than efficiency of course. I'm not saying Forth is worthless, but it's not competing with any compiled language in terms of speed, and IMHO it also does away with too many "niceties" like local variables or actually readable syntax. Your mileage may vary :)


Yes, the 80287 and 387 used some I/O port addresses reserved by Intel to transfer the opcode, and a "DMA controller"-like interface on the main processor for reading/writing operands, using the COREQ/COACK pins.

Instead of simply reading the first word of a memory operand and otherwise ignoring ESC opcodes, the CPU had to be aware of several different groups of FPU opcodes to set up the transfer, with a special register inside its BIU to hold the direction (read or write), address, and segment limit for the operand.

It didn't do all protection checks "up front", since that would have required even more microcode, and they likely also wanted to keep the interface flexible enough to support new instructions. At that time I think Intel had also planned other types of coprocessors, for things like cryptography or business data processing; those would have used the same interface but with completely different operand lengths.

So the CPU had to check the current address against the segment limit in the background whenever the coprocessor requested to transfer the next word. This is why there was a separate exception for "coprocessor segment overrun". Then of course the 486 integrated the FPU and made it all obsolete again.


Nitpick (footnote 3): "64-bit kernels can run 32-bit userspace processes, but 64-bit and 32-bit code can’t be mixed in the same process. ↩"

That isn't true on any operating system I'm aware of. If both modes are supported at all, there will be a ring 3 code selector defined in the GDT for each, and I don't think there would be any security benefit to hiding the "inactive" one. A program could even use the LAR instruction to search for them.
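
For illustration, a rough sketch of such a search in 32-bit code (use_it is a hypothetical continuation label, and the range of selectors scanned is arbitrary):

            mov     ecx,8                   ;start after the null selector
    scan:   lar     eax,ecx                 ;ZF=1 if the selector is valid here
            jnz     next
            bt      eax,21                  ;L bit set -> 64-bit code segment?
            jnc     next
            ; ECX now holds a candidate selector; a real program would also
            ; check that it's a present, DPL-3 code segment before using it
            jmp     use_it
    next:   add     ecx,8
            cmp     ecx,64                  ;arbitrary search limit for the sketch
            jb      scan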

At least on Linux, the kernel is perfectly fine with being called from either mode. FASM example code (with hardcoded selector, works on my machine):

    format elf executable at $1_0000
    entry start
    
    segment readable executable
    
    start:  mov     eax,4                   ;32-bit syscall# for write
            mov     ebx,1                   ;handle
            mov     ecx,Msg1                ;pointer
            mov     edx,Msg1.len            ;length
            int     $80
    
            call    $33:demo64
    
            mov     eax,4
            mov     ebx,1
            mov     ecx,Msg3
            mov     edx,Msg3.len
            int     $80
            mov     eax,1                   ;exit
            xor     ebx,ebx                 ;status
            int     $80
    
    use64
    demo64: mov     eax,1                   ;64-bit syscall# for write
            mov     edi,1                   ;handle
            lea     rsi,[Msg2]              ;pointer
            mov     edx,Msg2.len            ;length
            syscall
            retfd                           ;return to caller in 32 bit mode

    Msg1    db      "Hello from 32-bit mode",10
    .len=$-Msg1
    
    Msg2    db      "Now in 64-bit mode",10
    .len=$-Msg2
    
    Msg3    db      "Back to 32 bits",10
    .len=$-Msg3

This is also true on Windows. Malware loves it! https://encyclopedia.kaspersky.com/glossary/heavens-gate/

Isn't that how recent Wine runs 32-bit programs?

Much like there is 64-bit "code", there is also 32-bit "code" that can only be executed outside of a 64-bit code segment, namely all the BCD, segment-related and push/pop-all instructions that trigger an invalid opcode exception (#UD) when executed in 64-bit mode. In that strictest sense, "64-bit and 32-bit code can’t be mixed".
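
A few concrete examples (not an exhaustive list) of instructions that assemble fine for a 16- or 32-bit code segment but raise #UD as soon as they're executed in a 64-bit one:

    aaa                     ; BCD adjust instructions (AAA/AAS/AAM/AAD/DAA/DAS)
    daa
    push    ds              ; pushing/popping ES/CS/SS/DS (FS and GS still work)
    pop     es
    pusha                   ; push/pop-all
    popa
    into                    ; overflow trap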

x86 has (not counting the system-management mode stuff) 4 major modes: real mode, protected mode, virtual 8086 mode, and IA-32e mode. Protected mode and IA-32e mode rely on the bits within the code segment's descriptor to figure out whether or not it is 16-bit, 32-bit, or 64-bit. (For extra fun, you can also have "wrong-size" stack segments, e.g., 32-bit code + 16-bit stack segment!)

16-bit and 32-bit code segments work almost exactly the same in IA-32e mode (what Intel calls "compatibility mode") as they do in protected mode; I think the only real difference is that the task management stuff doesn't work in IA-32e mode (and consequently features that rely on task management--e.g., virtual-8086 mode--don't work either). It's worth pointing out that if you're running a 64-bit kernel, then all of your 32-bit applications are running in IA-32e mode and not in protected mode. This also means that it's possible to have a 32-bit application that runs 64-bit code!

But I can run the BCD instructions, the crazy segment stuff, etc. all within a 16-bit or 32-bit code segment of a 64-bit executable. I have the programs to prove it.


Yes, but you transition between the 2 modes with far jumps, far calls or far returns, which reload the code segment.

Without passing through a far jump/call/return, you cannot alternate between instructions that are valid only in 32-bit mode and instructions that are valid only in 64-bit mode.

Normally you would have 32-bit functions embedded in a 64-bit main program, or vice-versa. Unlike normal functions, which are invoked with near calls and end in near returns, such functions would be invoked with far calls and they would end in far returns.

However, there is no need to write such hybrid programs now. The 32-bit compatibility mode exists mainly for running complete legacy programs, which have been compiled for 32-bit CPUs.


The low 8 bits of SI, DI, BP and SP weren't accessible before, but now they are in 64-bit mode.

The earliest ancestor of x86 was the CPU of the Datapoint 2200 terminal, implemented originally as a board of TTL logic chips and then by Intel in a single chip (the 8008). On that architecture, there was only a single addressing mode for memory: it used two 8-bit registers "H" and "L" to provide the high and low byte of the address to be accessed.

Next came the 8080, which provided some more convenient memory access instructions, but the HL register pair was still important for all the old instructions that took up most of the opcode space. And the 8086 was designed to be somewhat compatible with the 8080, allowing automatic translation of 8080 assembly code.

16-bit x86 didn't yet allow all GPRs to be used for addressing, only BX or BP as "base", and SI/DI as "index" (no scaling either). BP, SI and DI were 16-bit registers with no equivalent on the 8080, but BX took the place of the HL register pair, that's why it can be accessed as high and low byte.
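
To illustrate, the complete menu of 16-bit addressing combinations is roughly this (destination registers chosen arbitrarily for the sketch):

    mov     al,[bx]         ; base only: BX or BP, optionally + displacement
    mov     al,[si]         ; index only: SI or DI
    mov     al,[bx+si]      ; base + index
    mov     al,[bp+di+2]    ; base + index + displacement
    ; [ax], [cx+dx] or [si*4] simply don't encode on the 8086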

Also the low 8 bits of the x86 flag register (Sign, Zero, always 0, AuxCarry, always 0, Parity, always 1, Carry) are exactly identical to those of the 8080 - that's why those reserved bits are there, and why the LAHF and SAHF instructions exist. The 8080 "PUSH PSW" (Z80 "PUSH AF") instruction pushed the A register and flags to the stack, so LAHF + PUSH AX emulates that (although the byte order is swapped, with flags in the high byte whereas it's the low byte on the 8080).
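
As a small sketch of that emulation:

    lahf                    ; AH := S,Z,0,AC,0,P,1,CY - same bit layout as the 8080 flags
    push    ax              ; AL (the 8080 "A") ends up at the lower address,
                            ; AH (the flags) at the higher one - reverse of PUSH PSW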


Fun fact that you obviously already know, but which may be interesting to others.

In the encoding the registers are ordered AX, CX, DX, BX to match the order of the 8080 registers AF, BC (which the Z80 uses as count register for the DJNZ instruction, similar to x86 LOOP), DE and HL (which like BX could be used to address memory).


So you can have bit arrays of any length in memory, rather than just 32 bits in a register.
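
A sketch of what that memory-operand form looks like (bitmap is a placeholder label); the register bit offset isn't limited to the operand width, so it can index past the first dword:

    bt      [bitmap],eax    ; CF := bit EAX of the bit array at bitmap
    setc    al              ; materialize the tested bit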

That makes sense. LLVM could probably do better here by using the memory operand version:

https://godbolt.org/z/jeqbaPsMz


The memory operand version tends to be as slow or slower than the manual implementation, so LLVM is right to avoid it.


I don't think the memory operand version would work here. If I understand the description in the x86 architecture manual, the 32-bit operand form interprets the bit offset as signed. A 64-bit operand could work around that, but would then run into issues with over-read due to fetching 64 bits of data.

Implementing rotate through carry like that was a really bad decision IMO - it's almost never by more than one bit left or right at a time, and that case could be handled much more efficiently than with the constant-time code, which is only faster when the count is > 6.
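
For context, the overwhelmingly common case is the single-bit form used in multi-word shifts, e.g. shifting a 32-bit value held in DX:AX left by one:

    shl     ax,1            ; low word: the bit shifted out goes into CF
    rcl     dx,1            ; high word: CF comes back in at the bottom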

Is the full microcode available anywhere?


I haven't published it yet as there are still some rough edges to clear up, but if you email me ([email protected]) I'll send you the current work-in-progress (the same one that nand2mario is working from).

Since the shifter is also used for bit tests, the 'most things are a 1-bit shift' assumption might not hold. Perhaps they did the analysis and it made sense.

There are separate opcodes for shift/rotate by 1, by CL, or by an immediate operand. Those are decoded to separate microcode entry points, so they could have at least optimized the "RCL/RCR x,1" case.

And the microcode for bit test has to be different anyway.


Except that there are tremendous advantages to constant-time execution, not the least of which is protection from timing security attacks/information leakage (which admittedly were less of a concern back then). Sure you can get the one instruction executed for the <6 case faster, but the transistor budget for that isn't worth it, particularly if you pipeline the execution into stages. It makes optimization far more complex...

RAMAC, not RAMDAC: https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_d...

However, it doesn't seem to be divided into sectors at all; each track is more like a loop of magnetic tape. In that context it makes a bit more sense to use decimal units, measuring in bits per second as for serial comms.

Or maybe there were some extra characters used for ECC? 5 million / 100 / 100 = 500 characters per track, which leaves 12 characters (72 bits) over for that purpose if the actual track size was 512.

The first floppy disks - also from IBM - had 128-byte sectors. IIRC, that size was chosen because it was the smallest power of two that could store an 80-column line of text (a width made standard by IBM punched cards).

Disk controllers need to know how many bytes to read for each sector, and the easiest way to do this is by detecting overflow of an n-bit counter. Comparing with 80 or 100 would take more circuitry.
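
A software analogy of that hardware trick (just a sketch; sector_done is a placeholder label): with a power-of-two sector size, "done" is simply the byte counter wrapping around, with no magnitude comparison against 80 or 100 needed.

    inc     cl              ; byte counter
    test    cl,7Fh          ; did the low 7 bits wrap back to zero?
    jz      sector_done     ; yes: 128 bytes transferred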

