
That is amazing, isn't it?

I want to understand, in:

  uint64_t sum5(Node *node) {
    uint64_t value = 0;
    for (; node; node = node->next) {
      for (;;) {
        value += node->value;
        if (node + 1 != node->next) {
          break;
        }
        node++;
      }
    }
    return value;
  }
How is this magic:

        if (node + 1 != node->next) {
          break;
        }
Better than the more intuitive:

        if (node->next == NULL) {
          break;
        }
Also... why is `node + 1` even a valid comparison? (Please forgive my rusty C/C++; college was a while ago.)


> Also... why is `node + 1` even a valid comparison here?

In C and C++, when you add 1 to a pointer, you actually make it point to the next object of that size in memory, e.g. if I have a pointer to a 4-byte integer at address 0x8000, incrementing it will make it point to address 0x8004.

Because arrays are contiguous chunks of memory, you can use this to iterate through arrays.

  int array[50];
  int *p = &array[0]; // Take the address of the first array element
  int sum = 0;
  for (int i = 50; i != 0; i--) {
      sum += *p;
      p++; // Point to the next element
  }
After this loop, p will be equal to &array[50], which is one past the last element of the array, because the loop will have run 50 times, and p is incremented once per loop.

What OP did is allocate an array of linked list nodes, and test to see if the next linked list node was actually just the next array element.
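For anyone who wants to reproduce that setup, here's a minimal sketch (my code, not OP's; make_contiguous_list is an illustrative name, but the Node fields match sum5):

  #include <stdint.h>
  #include <stdlib.h>

  typedef struct Node {
    uint64_t value;
    struct Node *next;
  } Node;

  // Allocate n nodes as one contiguous array and chain them in order,
  // so that node->next == node + 1 for every node except the last.
  Node *make_contiguous_list(size_t n) {
    Node *nodes = malloc(n * sizeof *nodes);
    if (!nodes) return NULL;
    for (size_t i = 0; i < n; i++) {
      nodes[i].value = i;
      nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    return nodes;
  }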


Makes sense, thanks! I'm still curious how this compiles down to faster ASM than a check for null.


CPUs can execute arithmetic instructions fast because we've spent decades making sequential instructions run faster. One key technique is the pipeline: each instruction is broken into stages (fetch, decode, execute, write-back), and different instructions occupy different stages at the same time. While one instruction's result is being calculated, the next is already being fetched and decoded; while a result is being written back to a register, the instruction behind it is executing.

Bottlenecks come from dependencies, e.g. when one instruction modifies a register and the next one immediately uses its value. With a pipeline, the first instruction's result might not be written back to the register in time for the second to read the correct value from it; rather than execute with a stale value, the CPU has to stall the pipeline and do nothing until that register is written.

Memory loads cause bottlenecks for similar reasons. In fact, a load that misses the cache and has to go out to main memory is usually the slowest instruction of all; it can take hundreds of cycles.
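To make the dependency idea concrete, here's a tiny illustrative sketch (my example, not from the article; the names are made up):

  #include <stdint.h>

  uint64_t demo(const uint64_t *p, uint64_t a, uint64_t b, uint64_t c) {
    // Independent: neither addition reads the other's result, so the
    // pipeline can have both in flight at the same time.
    uint64_t x = a + b;
    uint64_t y = b + c;

    // Dependent: the addition needs 'v', so it waits for the load; if
    // *p misses the cache, everything downstream of it waits too.
    uint64_t v = *p;
    uint64_t z = v + x;

    return x + y + z;
  }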

One way of getting around this is to execute instructions out of order, but in a way that still looks like in-order execution from the outside. For branches, the CPU predicts which way the branch will go and speculatively executes down that path, committing the results to visible CPU state only if the prediction turns out to be right.

By testing whether the next node in memory was actually the next node in the list, the code lets the CPU speculatively start executing the next iteration of the loop body as though it were. The key is that node + 1 is plain register arithmetic, available immediately, whereas node->next has to be loaded from memory. If the guess turns out to be wrong, the CPU just doesn't commit the results; if it's right, you've avoided a pipeline stall.
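Here's the inner loop again with annotations on what each step has to wait for (the comments are mine, not OP's):

  for (;;) {
    value += node->value;          // address 'node' is already in a register
    if (node + 1 != node->next) {  // the compare needs node->next from memory...
      break;
    }
    node++;                        // ...but node + 1 is computed instantly, so the
                                   // predicted next iteration's load can be issued
                                   // before node->next has even arrived
  }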

Note: I'm not a CPU expert; I just took a class in computer architecture recently. I could be mistaken about something, so feel free to correct me.


This jibes with my understanding of pipelined architectures and speculative execution, thank you for connecting the dots! Seriously cool to understand why that line works better.

It seems unfortunate the compiler isn't able to realize "no further assignments are ever performed on node->next; it's free of side effects, so optimize the speculative execution accordingly." Then again, how often would that hold in a real-world program of actual utility? Probably rarely.


No problem; I'm glad to be helpful.


The other person who replied may have already cleared this up for you, but the two conditions you listed mean very different things. The first means “if the next node is not adjacent in memory”, while the second means “if there is no next node”. Note that sum5 still handles the NULL case: the outer loop's for (; node; node = node->next) condition is what ends the list walk.
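Side by side (the paraphrases in the comments are mine):

  if (node + 1 != node->next)  // stop: the next node is not the adjacent slot in memory
  if (node->next == NULL)      // stop: there is no next node at all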


This is crucially followed by node++. As noted elsewhere, that increments the pointer by sizeof(Node), which lets the CPU run at the speed of its memory prefetcher and of the speculated inner loop, for as long as the sequence of list nodes is contiguous and sequential.
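A quick way to see that scaling for yourself (illustrative snippet, mine):

  #include <stdio.h>
  #include <stdint.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  int main(void) {
    Node nodes[2];
    Node *p = &nodes[0];
    // Pointer arithmetic is scaled by the pointee's size: p + 1 is
    // sizeof(Node) bytes past p, i.e. exactly &nodes[1].
    printf("%zu\n", (size_t)((char *)(p + 1) - (char *)p));  // prints sizeof(Node), e.g. 16
    return 0;
  }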



