
Can someone explain this in a bit more detail? Why does "the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and expensive instruction"?


ARM64 has separate instructions for computing the low and high halves of a product, in keeping with its single destination register approach. x86-64 has a single instruction that computes both halves simultaneously, writing to two registers.


It would have been simpler for the author to show us the difference in multiplication performance directly. Here the benchmarks compare completely different hash functions, and we have to take his word that it is this instruction that causes the performance discrepancy.


On a modern out-of-order uarch that kind of "complex" instruction would be split into two micro-operations, one for each result register. And Agner's tables [1] confirm that the 64x64 => 128 mul is split into two micro-ops on Skylake. So it doesn't give any strong advantage.

[1] https://www.agner.org/optimize/instruction_tables.pdf


Yes, but the second uop is not expensive like the first in this case. That is, it seems the full multiplication is done by the latency-3 uop on p1, and the other uop is just needed to move the high half of the result to the destination (indeed, instructions with 2 outputs always need 2 uops due to the way the renamer works). The whole 64x64->128 multiplication still has a latency of only 3 and a throughput of 1 per cycle.

So the 64x64->128 multiplication is still quite efficient compared to ARM, where two "full strength" multiplications are needed. It is odd, though, that there is nearly a 20x difference in relative speeds; I wouldn't expect multiply-upper to be that slow.


Note: the test seems to have been done on Skylark (aka Ampere), which is a non-standard ARM core. I can't find any documentation on Skylark's latency/throughput specifications.


I strongly suspect the ARM compiler is not optimized for 128-bit multiplies and just calls a generic software function to do the computation,

like this one: https://github.com/llvm-mirror/compiler-rt/blob/master/lib/b...


For 64-bit ARM there are no function calls when using clang: https://godbolt.org/z/IR2DIj


I just wanted to thank you for this website. It puts an end to so many discussions we have about code quality. As a bonus, it supports Fortran. Previously we were doing the debugging and disassembly ourselves.


You're right, and even on old versions of gcc and clang it seems that it is correctly generated as a mul + umulh pair.


Intel does 64x64 bit multiplies and returns a 128-bit result. ARM does 64x64 bit multiplies and returns just the lower (or upper) 64-bits.

Wyhash is built around the 128-bit result. This is fast on Intel, but slower on ARM.

