Hacker News

The problem with x86 in particular is that there is tons of cruft. You can get lost for days reading about obsolete functionality.

Here's my general workflow for optimizing functions in HFT:

Write a function in C++ and compile it. Look at the annotated disassembly and try to improve on it using intrinsics, particularly vector intrinsics, measuring with rdtsc.

Then compile with "-ftree-vectorize -march=native" and compare the compiler's output to yours. Look up the instructions it used, and check both versions for redundancies, bad ordering, and register misuse/underuse.

Then see if you can improve that.

But all that being said, note that in general this kind of cycle-counting micro-optimization is often overshadowed by instruction- and data-cache behavior. It's rare that you have a few kilobytes of data that you constantly iterate over with the same function. Most learning resources and optimizing compilers seem to ignore this fact.



I’ve wondered why there aren’t more tools for predicting how a program interacts with cache lines and data caching effects. For given CPU parameters it seems a reasonable task to estimate cache-line usage from a sample dataset. Am I just missing what tools are used out there?


The best tool for this in my experience is callgrind with assembly annotation. You can configure it to more or less mimic the cache layout of whatever particular chip you're running and then execute your code on it.

You can use the start and stop macros in callgrind.h to show cache behaviour of a specific chain of function calls, like when a network event happens. Then, in the view menu of kcachegrind, select IL Fetch Misses and show the hierarchical function view.

It doesn't mimic the exact branch prediction or whatever of your architecture but when you compare it to actual timings it's damn close.


Wow, that's cool!


Why not just write the function in ASM in the first place?


1) Because the compiler gives you a clear reference implementation to test against for correctness and performance.

2) Because after you do this enough times, you will learn when to write your own, when not to, and when to spot inefficiencies in the compiler output. The point is to learn, both about how the instructions work and how the compiler works.

3) The C/C++ implementation serves as documentation of intent and is portable across architectures (including future x86-64 architectures). It's fucking atrocious when devs write pure assembly without a C/C++ reference that can replace it. To me, finding random assembly in a project without an equivalent C/C++ implementation is the ultimate indictment of a hot-rod programmer not thinking about the future or future maintainers.


Can you talk about your day job?



