`_mm_alignr_epi8` is a compile-time known shuffle that gets optimized well by LLVM [1].
If you need the exact behavior of `pshufb` you can use asm or the llvm intrinsic [2]. IIRC, I once got the compiler to emit a `pshufb` for a runtime shuffle whose indices were guaranteed to be in the 0..15 range.
Ironically, I also wanted to try zig by doing a StreamVByte implementation, but got derailed by the lack of SSE/AVX intrinsics support.
Oh, that's actually quite neat, it did not occur to me that you can use @shuffle with a compile time mask and it will optimize it to a specialized instruction.
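For the curious, here's a minimal Rust sketch (mine, not from the thread) of the direct route: calling the `_mm_shuffle_epi8` intrinsic (i.e. `pshufb`) with a compile-time-known mask. The function name `rotate_bytes` and the specific mask are made up for illustration.

```rust
// Sketch: a byte rotate via pshufb. Requires SSSE3, hence the
// target_feature attribute plus a runtime check at the call site.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn rotate_bytes(input: [u8; 16]) -> [u8; 16] {
    use std::arch::x86_64::*;
    let v = _mm_loadu_si128(input.as_ptr() as *const __m128i);
    // Shuffle mask 1,2,...,15,0: rotate the 16 bytes left by one.
    let mask = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0);
    let shuffled = _mm_shuffle_epi8(v, mask);
    let mut out = [0u8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, shuffled);
    out
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("ssse3") {
        let input: [u8; 16] = core::array::from_fn(|i| i as u8);
        let out = unsafe { rotate_bytes(input) };
        assert_eq!(out[0], 1);
        assert_eq!(out[15], 0);
    }
}
```

Since the mask is a constant, LLVM can keep the whole thing as a single `pshufb` after inlining.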
Massena Hospital is a 25-bed hospital. You might have to go to Canton or Ogdensburg for a family doctor (45 minutes by car). Most serious cases get referred to Syracuse or Burlington (3 hours away by car).
AFAIK, the cost[1] is "theoretically" nothing if annual income is less than the federal poverty line ($15,650 for an individual), and it might as well be free for incomes up to $39,125.
Alcoa (Aluminum Smelter, *cheap electricity*) was the major industry in the area. Massena plant now produces 85% less aluminum compared to ~15 years ago (AFAICT), leading to something of a ghost town (and cheap housing).
Limited internet connections (speed and/or data caps). Something like HughesNet (satellite ISP) couldn't stream more than 240p from YouTube during peak times. The data cap coerced users into doing downloads between 2am and 6am.
The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 & ~elt`). On Intel since Haswell (2013), on ymm inputs, that can manage 1.5 iterations per cycle if unrolled 2x to reduce the dependency chain.
Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.
On AMD Zen≤2, vpsadbw is still worse, but Zen≥3 manages to have the two approaches be equal.
On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.
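To make the idiom concrete, here's a scalar model (my sketch, not the benchmarked code) of the vpandn + vpaddb approach: each lane keeps a u8 counter, `1 & !mask` adds 1 when the compare mask is 0x00, and the lane counters must be flushed into a wide sum before they can wrap after 255 additions.

```rust
// Scalar model of byte-wise predicate counting with u8 lane
// accumulators (emulating vpandn + vpaddb over 16 lanes).
fn count_not_matching(data: &[u8], pred: impl Fn(u8) -> bool) -> u64 {
    const LANES: usize = 16;
    let mut total: u64 = 0;
    let mut acc = [0u8; LANES];
    for (i, chunk) in data.chunks(LANES).enumerate() {
        for (lane, &x) in chunk.iter().enumerate() {
            // A SIMD compare yields an all-ones (0xFF) or all-zeros mask.
            let mask: u8 = if pred(x) { 0xFF } else { 0x00 };
            // vpandn + vpaddb: add 1 only where the mask is clear.
            acc[lane] = acc[lane].wrapping_add(1 & !mask);
        }
        // Flush before any lane can exceed 255 additions.
        if i % 255 == 254 {
            total += acc.iter().map(|&b| b as u64).sum::<u64>();
            acc = [0u8; LANES];
        }
    }
    total + acc.iter().map(|&b| b as u64).sum::<u64>()
}

fn main() {
    // 1, 3 fail the "is even" predicate; 2, 4 pass.
    assert_eq!(count_not_matching(&[1, 2, 3, 4], |x| x % 2 == 0), 2);
}
```

In the real SIMD version the periodic flush would be a single vpsadbw, which is exactly why the horizontal reduction cost only shows up once every couple hundred iterations.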
It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:
let arr = [1, 3, 4, 6, 7, 0, 9, -4];
let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
assert_eq!(n_evens, 4);
I know that. But that’s still a different interface. If you have a predicate, you now have to wrap it in a closure that conforms to a different pattern.
This is the same argument as asking why have count_if at all when you could write a for loop.
Sure. But at least I interpreted the GP as just saying that the "count-if" operation can be implemented in terms of `reduce` if the latter is available.
Why not use regular rejection sampling when `limit` is known at compile time?
Does fastrange[1] have fewer rejections due to any excess random bits[2]?
Fastrange is slightly biased because, as Steve Canon observes in that Swift PR, it is just Knuth’s multiplicative reduction. The point of this post is that it’s possible to simplify Lemire’s nearly-divisionless debiasing when the limit is known at compile time.
Lemire’s algorithm rejects the fewest possible samples from the random number generator, so it’s generally the fastest. The multiplication costs very little compared to the RNG.
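For reference, here is a Rust sketch of Lemire's nearly-divisionless bounded generation (variable names mine; `next` stands in for whatever RNG you have). The expensive `% s` only executes on the rare path where the low half of the product falls below `s`:

```rust
// Lemire's nearly-divisionless unbiased bounded generation:
// map a 32-bit random x to [0, s) via the high half of x * s,
// rejecting only the few values that would bias the result.
fn bounded(mut next: impl FnMut() -> u32, s: u32) -> u32 {
    let mut m = (next() as u64) * (s as u64);
    if (m as u32) < s {
        // Threshold t = (2^32 - s) % s; computed only on the slow path.
        let t = s.wrapping_neg() % s;
        while (m as u32) < t {
            m = (next() as u64) * (s as u64);
        }
    }
    (m >> 32) as u32 // unbiased value in 0..s
}

fn main() {
    // Toy deterministic generator (a Weyl sequence), just for the demo.
    let mut state: u32 = 0;
    let mut next = || {
        state = state.wrapping_add(0x9E37_79B9);
        state
    };
    for _ in 0..1000 {
        assert!(bounded(&mut next, 10) < 10);
    }
}
```

When `s` is a compile-time constant, `s.wrapping_neg() % s` folds to a constant too, which is the simplification the post is about.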
That snippet would not reject bad chars; are non-digits rejected somewhere else?
A simple scalar loop that multiplies an accumulator by 10, itoa() style, would be faster.
There are two digits for months. A check is required per character to prevent bad chars (e.g. 0x3A) from being laundered by the multiply. The '19' comes from the fact that the first digit may only be 0 or 1, while the second digit may be any value 0..9 (e.g. September is '0','9' and October is '1','0'). The second check catches bad month values between 13 and 19 which were not caught by looking at the individual digits. Realistically, the first check may be overbuilt: it only needs to check is_digit, but it still has to ignore the padding bytes at the end somehow. Now... I believe there would be a problem with a month value of '00', because the code unconditionally subtracts 1 from that field and then uses it as a table index.
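To spell out the checks described above, here's a scalar Rust sketch (not the SIMD code under discussion, and `parse_month` is a name I made up): reject non-digit bytes before the multiply, then range-check the result, which also handles the '00' case.

```rust
// Scalar model of two-digit month validation: both bytes must be
// ASCII digits, and the value must land in 1..=12.
fn parse_month(d: [u8; 2]) -> Option<u8> {
    if !d[0].is_ascii_digit() || !d[1].is_ascii_digit() {
        return None; // rejects bytes like 0x3A (':') before the multiply
    }
    let m = (d[0] - b'0') * 10 + (d[1] - b'0');
    // Catches 13..=19 (and up), and also '00', before any table index.
    (1..=12).contains(&m).then_some(m)
}

fn main() {
    assert_eq!(parse_month(*b"09"), Some(9));
    assert_eq!(parse_month(*b"12"), Some(12));
    assert_eq!(parse_month(*b"13"), None);
    assert_eq!(parse_month(*b"00"), None);
    assert_eq!(parse_month(*b"1:"), None); // 0x3A would otherwise launder to 20
}
```

Doing the range check on the combined value, rather than only per digit, is what closes both the 13..19 hole and the '00' hole in one comparison.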