It's fast, but I figured doing that on both sides before adding looked a bit ine...

simias · on Feb 8, 2022

On a modern architecture, where integers usually default to 32 bits but the underlying CPU handles 64 bits natively, I'd just cast to u64 and call it a day.

Actually I was curious to see if GCC would be smart enough to automatically choose what's the best optimization depending on the underlying architecture, but it doesn't appear to be the case.

For x86_64 (with -O3 or -Os):

    avg_64bits:
    .LFB0:
        .cfi_startproc
        movl    %edi, %edi
        movl    %esi, %esi
        leaq    (%rdi,%rsi), %rax
        shrq    %rax
        ret
        .cfi_endproc

    avg_patented_do_not_steal:
    .LFB1:
        .cfi_startproc
        movl    %edi, %eax
        movl    %esi, %edx
        andl    %esi, %edi
        shrl    %eax
        shrl    %edx
        andl    $1, %edi
        addl    %edx, %eax
        addl    %edi, %eax
        ret

Clearly, just casting to 64 bits produces denser code here.

For ARM32 (-O3 and -Os):

    avg_64bits:
        push    {fp, lr}
        movs    r3, #0
        adds    fp, r1, r0
        adc     ip, r3, #0
        mov     r0, fp
        mov     r1, ip
        movs    r1, r1, lsr #1
        mov     r0, r0, rrx
        pop     {fp, pc}

    avg_patented_do_not_steal:
        and     r3, r1, #1
        ands    r3, r3, r0
        add     r0, r3, r0, lsr #1
        add     r0, r0, r1, lsr #1
        bx      lr

A lot more register shuffling in the 64-bit version (plus a push/pop of fp and lr), since it decides to do a true 64-bit add using two registers and an adc.

My code, for reference:

    uint32_t avg_64bits(uint32_t a, uint32_t b) {
        uint64_t la = a;
        uint64_t lb = b;

        return (la + lb) / 2;
    }

    uint32_t avg_patented_do_not_steal(uint32_t a, uint32_t b) {
        return (a / 2) + (b / 2) + (a & b & 1);
    }
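
In case anyone wants to check that the two versions really agree, here is a rough sketch of a sanity check. It repeats the two functions so it compiles on its own; `count_avg_mismatches` and the list of edge values are my own additions for illustration, not something I ran as part of the comparison above.

```c
#include <stddef.h>
#include <stdint.h>

/* Same two functions as above, repeated so this compiles standalone. */
uint32_t avg_64bits(uint32_t a, uint32_t b) {
    /* Widen first so the sum cannot overflow, then halve. */
    return (uint32_t)(((uint64_t)a + (uint64_t)b) / 2);
}

uint32_t avg_patented_do_not_steal(uint32_t a, uint32_t b) {
    /* Halve each operand, then add back the carry lost when both are odd. */
    return (a / 2) + (b / 2) + (a & b & 1);
}

/* Hypothetical helper: counts the edge-case input pairs on which the
 * two versions disagree. Expected to return 0 if they are equivalent. */
unsigned count_avg_mismatches(void) {
    static const uint32_t edge[] = {
        0u, 1u, 2u, 3u,
        0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFEu, 0xFFFFFFFFu
    };
    const size_t n = sizeof edge / sizeof edge[0];
    unsigned mismatches = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (avg_64bits(edge[i], edge[j]) !=
                avg_patented_do_not_steal(edge[i], edge[j])) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

Call `count_avg_mismatches()` from a `main` and print the result. It only covers values near the overflow boundaries, so it is a smoke test rather than a proof of equivalence.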