pext/pdep are incredible, I'm hoping to see them in more SIMD ISAs in the future.
But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, buy I care about the tranapose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the invers of a permutation, very fast histograming/binning
But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, buy I care about the tranapose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the invers of a permutation, very fast histograming/binning