aqrit's comments | Hacker News

MS makes "Times New Roman" available (at no cost), but not "Calibri".


`_mm_alignr_epi8` is a compile-time known shuffle that gets optimized well by LLVM [1].

If you need the exact behavior of `pshufb` you can use asm or the llvm intrinsic [2]. iirc, I once got the compiler to emit a `pshufb` for a runtime shuffle... that always guaranteed indices in the 0..15 range?

Ironically, I also wanted to try zig by doing a StreamVByte implementation, but got derailed by the lack of SSE/AVX intrinsics support.

[1] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34... [2] https://github.com/aqrit/sse2zig/blob/444ed8d129625ab5deec34...


Oh, that's actually quite neat. It did not occur to me that you can use @shuffle with a compile-time mask and have it optimized to a specialized instruction.
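
For comparison, Rust's (nightly) portable SIMD has the same trick: a swizzle whose index list is a compile-time constant, which LLVM is free to lower to palignr/pshufb and friends. A rough sketch (the function name and the byte offset of 7 are just for illustration):

  #![feature(portable_simd)]
  use std::simd::{simd_swizzle, u8x16};

  // Compile-time-known shuffle: mimics _mm_alignr_epi8(hi, lo, 7),
  // i.e. bytes 7..22 of the 32-byte concatenation [lo, hi].
  fn alignr7(hi: u8x16, lo: u8x16) -> u8x16 {
      simd_swizzle!(lo, hi,
          [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])
  }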


> health services in the area

Massena Hospital is a 25-bed hospital. You might have to go to Canton or Ogdensburg for a family doctor (45 minutes by car). Anything serious gets referred to Syracuse or Burlington (3 hours away by car).

AFAIK, the cost[1] is "theoretically" nothing if annual income is less than the federal poverty line ($15,650 for an individual), and it might as well be free for an income up to $39,125.

[1] https://info.nystateofhealth.ny.gov/EssentialPlan


Alcoa (aluminum smelting, *cheap electricity*) was the major industry in the area. The Massena plant now produces 85% less aluminum than it did ~15 years ago (AFAICT), leading to something of a ghost town (and cheap housing).


Limited internet connections (speed and/or data caps). Something like HughesNet (satellite ISP) couldn't stream more than 240p from YouTube during peak times. The data cap coerced users into doing downloads between 2am and 6am.


An optimized version would use 64-bit accumulators (`psadbw` on SSE2, or some sort of horizontal add on NEON). The `255` max constraint is pointless.
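
Something like this, to make it concrete (a sketch using the same SSE2 intrinsic via Rust's core::arch; the function name and the scalar tail handling are mine):

  use core::arch::x86_64::*;

  // Sum a byte slice with psadbw against zero: each group of 8 bytes is
  // horizontally added into a 64-bit lane, so nothing saturates at 255.
  fn sum_bytes_sse2(bytes: &[u8]) -> u64 {
      let mut chunks = bytes.chunks_exact(16);
      let mut total: u64 = 0;
      unsafe {
          let zero = _mm_setzero_si128();
          let mut acc = _mm_setzero_si128(); // two 64-bit partial sums
          for chunk in &mut chunks {
              let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
              acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero)); // psadbw
          }
          total += _mm_cvtsi128_si64(acc) as u64;
          total += _mm_cvtsi128_si64(_mm_unpackhi_epi64(acc, acc)) as u64;
      }
      total + chunks.remainder().iter().map(|&b| b as u64).sum::<u64>()
  }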

Many programming languages/frameworks expose this operation as `reduce()`.


It's not that trivial:

The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 & ~elt`). On Intel since Haswell (2013), with ymm inputs, that can manage 1.5 iterations per cycle if unrolled 2x to reduce the dependency chain.

Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.

On AMD Zen≤2, vpsadbw is still worse, but on Zen≥3 the two approaches come out equal.

On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.
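
To make the comparison concrete, the wrapping version's inner loop looks roughly like this (AVX2 via Rust's core::arch; the 0xFF/0x00 predicate-byte input and the flush-every-255-blocks bookkeeping are my assumptions, not anyone's actual code):

  use core::arch::x86_64::*;

  // Count predicate bytes that are 0x00: the inner loop is just
  // vpandn + vpaddb, and the byte counters are flushed into 64-bit
  // lanes with vpsadbw before they can wrap (every <=255 rounds).
  #[target_feature(enable = "avx2")]
  unsafe fn count_zero_bytes_avx2(elts: &[__m256i]) -> u64 {
      let ones = _mm256_set1_epi8(1);
      let zero = _mm256_setzero_si256();
      let mut wide = _mm256_setzero_si256(); // four u64 totals
      for block in elts.chunks(255) {
          let mut bytes = _mm256_setzero_si256(); // per-byte counters
          for &e in block {
              // vpandn + vpaddb: bytes += 1 & ~e
              bytes = _mm256_add_epi8(bytes, _mm256_andnot_si256(e, ones));
          }
          // vpsadbw flush: sum groups of 8 byte counters into u64 lanes
          wide = _mm256_add_epi64(wide, _mm256_sad_epu8(bytes, zero));
      }
      let mut lanes = [0u64; 4];
      _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, wide);
      lanes.iter().sum()
  }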


Reduce does not accept a predicate.


It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:

  let arr = [1, 3, 4, 6, 7, 0, 9, -4];
  let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
  assert_eq!(n_evens, 4);
Or more generally,

  fn count_if<T>(it: impl Iterator<Item=T>, pred: impl Fn(&T) -> bool) -> usize {
      it.fold(0, |acc, t| acc + pred(&t) as usize)
  }


I know that. But that's still a different interface. If you have a predicate, you now have to wrap it in a closure that conforms to the new pattern.

This is the same argument as asking "why have count_if if I can write a for loop?"


Sure. But at least I interpreted the GP as just saying that the "count-if" operation can be implemented in terms of `reduce` if the latter is available.


Why not use regular rejection sampling when `limit` is known at compile time? Does fastrange[1] have fewer rejections due to any excess random bits[2]?

[1] https://github.com/lemire/fastrange

[2] https://github.com/swiftlang/swift/pull/39143
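
(For context, fastrange's 64-bit reduction boils down to the multiply-shift below; restated in Rust rather than the library's C.)

  // map a full-width random x into 0..limit with no modulo
  fn fastrange64(x: u64, limit: u64) -> u64 {
      ((x as u128 * limit as u128) >> 64) as u64
  }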


What do you mean by “regular” rejection sampling?

Fastrange is slightly biased because, as Steve Canon observes in that Swift PR, it is just Knuth’s multiplicative reduction. The point of this post is that it’s possible to simplify Lemire’s nearly-divisionless debiasing when the limit is known at compile time.

I previously experimented with really-divisionless debiasing but I was underwhelmed with the results https://dotat.at/@/2022-04-20-really-divisionless.html
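
For reference, the runtime-limit version of Lemire's nearly-divisionless method looks roughly like this (my restatement in Rust, not the code from the post; `rng64` stands in for the generator and `limit` must be nonzero):

  fn bounded(limit: u64, mut rng64: impl FnMut() -> u64) -> u64 {
      let mut m = (rng64() as u128) * (limit as u128);
      if (m as u64) < limit {
          // the only division, and it is rarely reached
          let threshold = limit.wrapping_neg() % limit; // 2^64 mod limit
          while (m as u64) < threshold {
              m = (rng64() as u128) * (limit as u128);
          }
      }
      (m >> 64) as u64
  }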


> “regular” rejection sampling

I was thinking of the naive approach: mask off the unwanted bits, then reject any value above the limit.
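
Roughly this, as a sketch (assuming `limit` > 1; `rng64` is a stand-in for whatever 64-bit generator is in use):

  fn bounded_mask_reject(limit: u64, mut rng64: impl FnMut() -> u64) -> u64 {
      // smallest all-ones mask that covers limit - 1
      let mask = u64::MAX >> (limit - 1).leading_zeros();
      loop {
          let x = rng64() & mask; // keep only bits that can be in range
          if x < limit {
              return x;
          }
          // otherwise reject and redraw; the mask keeps the accept rate
          // above 50%
      }
  }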

It would seem like https://c-faq.com/lib/randrange.html would also move the multiply (or divide by a constant) out of the loop.


Lemire’s algorithm rejects the fewest possible samples from the random number generator, so it’s generally the fastest. The multiplication costs very little compared to the RNG.


That snippet would not reject bad chars; are non-digits rejected somewhere else? A simple scalar loop that multiplies an accumulator by 10, atoi()-style, would be faster.
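
The sort of loop I mean, as a sketch in Rust (names are mine; overflow is ignored since date fields are only a few digits):

  fn parse_digits(s: &[u8]) -> Option<u32> {
      let mut acc: u32 = 0;
      for &c in s {
          let d = c.wrapping_sub(b'0');
          if d > 9 {
              return None; // not '0'..='9': reject
          }
          acc = acc * 10 + d as u32;
      }
      Some(acc)
  }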


The methods in those articles could be slightly improved by picking better multipliers. https://stackoverflow.com/a/71594769


There are two digits for months. A per-character check is required to prevent bad chars (e.g. 0x3A) from being laundered by the multiply. The '19' comes from the fact that the first digit may only be 0 or 1, while the second digit may be any value 0..9 (e.g. September is '09', October is '10'). The second check catches bad month values between 13..19, which were not caught by looking at the individual digits.

Realistically, the first check may be overbuilt; it only needs to check is_digit or not, but it still has to ignore the padding bytes at the end, somehow.

Now... I believe there would be a problem with a month value of '00', because it unconditionally subtracts 1 from that field and then uses it as a table index.
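
Restating those checks as scalar code, as I understand them (a sketch; names are mine, and the real code does this with SWAR/SIMD compares):

  fn month_index(b0: u8, b1: u8) -> Option<u8> {
      // first check: both chars must be digits, and the first may only
      // be '0' or '1', hence capping the pair at "19"
      if !(b'0'..=b'1').contains(&b0) || !(b'0'..=b'9').contains(&b1) {
          return None;
      }
      let month = (b0 - b'0') * 10 + (b1 - b'0');
      // second check: catch 13..19, which pass the per-digit test
      if month > 12 {
          return None;
      }
      // note: "00" slips through both checks, hence the problem with the
      // unconditional subtract-by-1 before the table lookup
      Some(month.wrapping_sub(1))
  }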


Ok, 8-bit unsigned ints and a zero-padded month, and then they do the zero-saturated subtract. I hadn't made it to the second check, doh.

