Hacker News | stephencanon's comments

For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.

There are a few good general options to extract more ILP for latency-dominated contexts, though all of them cost additional register pressure and usually a few extra operations; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved with respect to numerical stability, but factoring also lets you synthesize selected extra-precise coefficients naturally, without resorting to head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.

There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
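To make the tradeoff concrete, here is a sketch (in Python, purely to show the dependency structure; a real implementation would use fused multiply-adds in the vector ISA of interest) of Horner versus Estrin for a degree-7 polynomial with arbitrary placeholder coefficients:

```python
def horner(x, c):
    # One long serial dependency chain: 7 dependent multiply-adds,
    # minimal register pressure, minimal operation count.
    acc = c[7]
    for k in range(6, -1, -1):
        acc = acc * x + c[k]
    return acc

def estrin(x, c):
    # Pair coefficients, then combine with powers of x. The four
    # inner pairs are independent subexpressions, exposing ILP at
    # the cost of keeping x2 and x4 live and a few extra multiplies.
    x2 = x * x
    x4 = x2 * x2
    p01 = c[0] + c[1] * x
    p23 = c[2] + c[3] * x
    p45 = c[4] + c[5] * x
    p67 = c[6] + c[7] * x
    return (p01 + p23 * x2) + (p45 + p67 * x2) * x4
```

An out-of-order core can overlap Estrin's independent pairs, shortening the critical path from ~7 to ~4 fused operations for this degree.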


Ah, I didn't know of either scheme. Still, am I right in that this mainly makes sense for degrees above five or six or so?

You can often eke something out for order-four, depending on uArch details. But basically yeah.

Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.

These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as evidenced by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect similar performance on non-Apple x86 using Intel's math library (which does not seem to have been measured), and significantly better performance, while preserving accuracy, from a vectorized library call.

The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.

Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
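The "sub-ulp" and "2.7e11 ulp" figures above are distances in units in the last place. A minimal sketch of how such an error can be measured, using the ordered-integer interpretation of IEEE 754 bit patterns (restricted to positive finite doubles for brevity):

```python
import struct

def ulps_between(a, b):
    # For positive finite doubles, the IEEE 754 bit patterns, read
    # as unsigned 64-bit integers, are monotonically ordered, so the
    # integer difference counts the representable values between a
    # and b -- i.e., their distance in ulps.
    ia = struct.unpack('<Q', struct.pack('<d', a))[0]
    ib = struct.unpack('<Q', struct.pack('<d', b))[0]
    return abs(ia - ib)
```

Comparing a fast approximation against a correctly rounded reference with a helper like this is how worst-case ulp error is typically reported.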


Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.

Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).

Wait, what? Do you have a resource I could read up on about that? That is moderately concerning if your math isn't portable across chips.

When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise-linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no existing software that used them, this was "fine".

And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.

However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.

¹ e.g. Intel "specifies" theirs here: https://www.intel.com/content/www/us/en/developer/articles/c...

ARM's is a little more readable: https://developer.arm.com/documentation/ddi0596/2021-03/Shar...
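As a concrete illustration of why under-specification permits divergence: any estimate within the stated error band is conforming, so two chips can both be "correct" and still disagree bitwise. A sketch of the conformance check (the bound is Intel's stated one; the helper function is hypothetical):

```python
import math

# Intel's stated bound for rsqrtps/rsqrtss: relative error < 1.5 * 2**-12.
BOUND = 1.5 * 2 ** -12

def within_spec(x, est):
    # Every estimate inside the band is conforming, so distinct
    # conforming implementations are free to return different
    # bit patterns for the same input.
    exact = 1.0 / math.sqrt(x)
    return abs(est - exact) / exact < BOUND
```

For x = 4.0, any estimate within about ±0.00018 of 0.5 passes, which is plenty of room for table-based and iteration-based designs to land on different values.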


Thanks!



For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.

We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.
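For readers unfamiliar with the baseline being improved on: the standard refinement after a reciprocal-square-root estimate is a Newton-Raphson step with the fixed coefficients 1.5 and 0.5; the trick described above replaces those with Remez-fitted coefficients tuned to the estimate's actual error distribution. A sketch of the baseline, using the classic bit-trick estimate as a stand-in for rsqrtss:

```python
import struct

def rsqrt_estimate(x):
    # Crude initial guess via exponent manipulation (the well-known
    # 0x5F3759DF bit trick), standing in for a hardware estimate
    # whose exact output is implementation-defined.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5F3759DF - (i >> 1)
    return struct.unpack('<f', struct.pack('<I', i))[0]

def refine_newton(x, y):
    # Standard Newton-Raphson step for y ~ 1/sqrt(x): roughly
    # doubles the number of correct bits per application. A
    # Remez-fitted refinement perturbs the 1.5/0.5 constants to
    # minimize the worst-case error of the combined sequence.
    return y * (1.5 - 0.5 * x * y * y)
```

Two Newton steps take the ~12-bit estimate close to full single precision; the Remez variant extracts a little more accuracy from the same operation count.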


Yann is definitely more well-known outside of academia. Inside academia, it's going to depend a lot on your specific background and how old you are.

Mechanically, you're probably right, but the screen-centric controls of the newer generation are _awful_ by comparison to the F generation's physical buttons and dials (this isn't BMW though, it's the whole industry).

My wife and I both have F31s, which we will drive until we can no longer source replacement parts unless the industry comes to its senses first (unlikely). Any time we've ever looked at plausible replacements, the screen-based controls are an immediate hard no.


Our (~2015) 3-series controls are just about perfect. Where they differ from Honda/Toyota's controls that I am also very familiar with, they're noticeably better now that I'm familiar with them. Everything is really well thought-out.

Of course, now they (and almost every other manufacturer) have followed Tesla off the cliff and made everything a screen, so the current generation cars have abysmal controls.


Schubfach's table is quite large compared to some alternatives with similar performance characteristics. swiftDtoa's code and tables combined are smaller than just Schubfach's table in the linked implementation. Ryu and Dragonbox are larger than swiftDtoa, but also use smaller tables than Schubfach, IIRC.

If I$ is all you care about, then table size may not matter, but for constrained systems, other algorithms in general, and swiftDtoa in particular, may be better choices.


IEEE 754 is a floating point standard. It has a few warts that would be nice to fix if we had tabula rasa, but on the whole is one of the most successful standards anywhere. It defines a set of binary and decimal types and operations that make defensible engineering tradeoffs and are used across all sorts of software and hardware with great effect. In the places where better choices might be made knowing what we know today, there are historical reasons why different choices were made in the past.

DEC64 is just some bullshit one dude made up, and has nothing to do with “floating-point standards.”


It is important to remember that IEEE 754 is, in practice, aspirational. It is very complex and nobody gets it 100% correct. There are so many edge cases around the sticky bit, quiet vs. signaling NaNs, and so on, that a processor that handles every special case 100% correctly simply does not exist.

One of the most important things that IEEE 754 mandates is gradual underflow (denormals) in the smallest binade. Otherwise you would have a giant non-monotonic jump between the smallest normalizable float and zero, which plays havoc with the stability of numerical algorithms.
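A small illustration with Python doubles of what gradual underflow buys: differences of nearby tiny values stay nonzero instead of collapsing to zero.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022 for binary64

x = 1.5 * smallest_normal
y = 1.25 * smallest_normal
diff = x - y                           # 2**-1024: a subnormal value

# With flush-to-zero instead of gradual underflow, diff would be 0
# even though x != y, breaking the identity (x != y) => (x - y != 0)
# that many numerical algorithms implicitly rely on.
```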


Sorry, no. IEEE 754 is correctly implemented in pretty much all modern hardware [1], save for the fact that optional operations (e.g., the recommended transcendental operations) are not implemented.

The problem you run into is that the compiler generally does not implement the IEEE 754 model fully strictly, especially under default flags--you have to opt into strict IEEE 754 conformance, and even there, I'd be wary of the potential for bugs. (Hence one of the things I'm working on, quite slowly, is a special custom compiler that is designed to have 100% predictable assembly output for floating-point operations so that I can test some floating-point implementation things without having to worry about pesky optimizations interfering with me).

[1] The biggest stumbling block is denormal support: a lot of processors opted to support denormals only by trapping on it and having an OS-level routine to fix up the output. That said, both AMD and Apple have figured out how to support denormals in hardware with no performance penalty (Intel has some way to go), and from what I can tell, even most GPUs have given up and added full denormal support as well.


The orbital example where BDF loses momentum is really about the difference between a second-order method (BDF2) and a fourth-order method (RK), rather than explicit vs. implicit (but note: no method with order > 2 can be A-stable; since the whole point of implicit methods is to achieve stability, the higher-order BDF formulas are relatively niche).

There are whole families of _symplectic_ integrators that conserve physical quantities and are much more suitable for this sort of problem than either option discussed. Even a low-order symplectic method will conserve momentum on an example like this.
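As a sketch (semi-implicit "symplectic Euler" for a planar Kepler orbit in normalized units; the initial conditions and step size here are chosen arbitrarily): updating velocity from the force first, then position from the *new* velocity, conserves angular momentum exactly for a central force, up to rounding.

```python
# Symplectic (semi-implicit) Euler for a planar Kepler orbit,
# G = M = m = 1. The kick-then-drift ordering conserves angular
# momentum L = x*vy - y*vx exactly for any central force.

def step(x, y, vx, vy, dt):
    r3 = (x * x + y * y) ** 1.5
    vx -= dt * x / r3      # kick: acceleration points along -r
    vy -= dt * y / r3
    x += dt * vx           # drift: uses the updated velocity
    y += dt * vy
    return x, y, vx, vy

x, y, vx, vy = 1.0, 0.0, 0.0, 1.0      # circular orbit
L0 = x * vy - y * vx
for _ in range(10_000):
    x, y, vx, vy = step(x, y, vx, vy, 1e-3)
```

Energy, by contrast, is not conserved exactly by this method; it oscillates boundedly instead of drifting, which is already far better than non-symplectic explicit Euler.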


Obviously^1. But it illustrates the broader point of the article, even if for the concrete problem even better choices are available.

1) if you have studied these things in depth. Which many/most users of solver packages have not.


The fascinating thing is that discrete symplectic integrators typically can only conserve one of the physical quantities exactly, eg angular momentum but not energy in orbital mechanics.


I have always wanted to know if there is any theorem that says one cannot preserve all of the standard invariants.

For example, we know for mappings that we cannot preserve angles, distances and area simultaneously.


The short answer is that discretization can generally preserve only one invariant exactly; others must be approximate.

This could provide some evidence for the universe not being truly discrete, since we have multiple apparently exactly preserved kinematic quantities, but it's hard to tell experimentally, since proposed discrete spacetimes have discretization sizes on the order of hbar, which means deviations from the continuum would be hard to detect.


Thanks for replying on this now rather inactive thread.

I am really curious about this issue and am looking for a theorem that gives an impossibility result (or an existence result).

It might be well known, but I don't know enough about differential equations to be aware of the result.


leapfrog!


> Where you are not under any circumstances can be robbed by a random person on a street.

I will be very surprised if there's anywhere in the world where the expected loss from being robbed on the street while walking exceeds the expected loss from being in a car accident while driving.

Getting in a car is by far the most dangerous thing most people do routinely.

