
I wrote an extremely fast hot polled pipe in Rust for QEMU instrumentation. I’m sure there’s room to improve but it’s effectively bottlenecking on the uarch. https://github.com/MarginResearch/cannoli/blob/main/mempipe/...

This was specifically designed for one producer (the QEMU processor) and many consumers (processing the trace of instructions and memory accesses). I can't remember exactly what tunings I did for this specific model. Data structures like this can always be tuned slightly based on your workload, by changing structure shapes, shared cache lines, etc.

With about 3-4 consumers QEMU was not really blocking on the reporting of every single instruction executed, which is really cool. This requires a low noise system; just having a browser open can almost halve these numbers since there's just more cache coherency bus traffic occurring. If having a browser open doesn't affect your benchmark, it's probably not bottlenecking on the uarch yet ;)

https://github.com/MarginResearch/cannoli/blob/main/.assets/...

A little blog on Cannoli itself here: https://margin.re/2022/05/cannoli-the-fast-qemu-tracer/

Ultimately, mempipe is not really a big thing I've talked about, but it's actually what makes cannoli so good and enables the design in the first place.


Here's the mempipe benchmark latency for core<->core: https://github.com/MarginResearch/cannoli/blob/main/mempipe/...

https://raw.githubusercontent.com/MarginResearch/cannoli/mai...

You can see a massive improvement for shared hyperthreads, and on-CPU-socket messages.

In this case, about ~350 cycles local core, ~700 cycles remote core, ~90 cycles same hyperthread. Divide these cycle counts by your clock rate (assuming you're on Skylake+) to get the latency in seconds. Eg. about 87.5 nanoseconds on a 4 GHz processor for local core IPC.
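(For anyone following along, the conversion is just cycles divided by cycles-per-nanosecond. Tiny Rust sketch; the 4 GHz clock is an example value, not a property of the benchmark machine:)

```rust
// Convert a measured cycle count to nanoseconds, given the core clock in GHz.
// At X GHz, the core completes X cycles per nanosecond.
fn cycles_to_ns(cycles: f64, ghz: f64) -> f64 {
    cycles / ghz
}

fn main() {
    // Numbers from the mempipe benchmark, at an example 4 GHz clock.
    println!("local core : {:.1} ns", cycles_to_ns(350.0, 4.0)); // ~87.5 ns
    println!("remote core: {:.1} ns", cycles_to_ns(700.0, 4.0)); // ~175.0 ns
    println!("same HT    : {:.1} ns", cycles_to_ns(90.0, 4.0));  // ~22.5 ns
}
```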


Anyways, since you expressed disappointment in the ~1000 cycle cost: that's about right. The latency between cores is actually quite high and there's not much you can do about it, especially on a system like x86 which has extremely strong cache coherency by default. One thing that is really important to understand is that the ownership of cache lines dramatically affects the cost of memory.

For IPC, this effectively requires one thread writing to memory (thus, making that cache line modified on that core, and evicted from all other cores). Then, when the polling thread checks in on the line, it will have to demote that cache line from modified to shared by flushing it out (usually to L2 or L3, but also writing out to memory). This causes some memory traffic, and constantly means that the cores are fighting over the same cache line.

Since x86 is strongly ordered and caches are coherent, this traffic is extremely expensive. Think of a write as "tell all other cores that you modified this memory, so they have to evict/invalidate their cache lines". And a read as "tell all cores to flush their cache lines if they're modified, then wait for them to tell me they're done, then I can read the memory". This is effectively a massively blocking operation. The simple act of reading the mailbox/ticket/whatever from another core to check if a message is ready will actually dramatically affect the speed the other core can write to it (as now that write is effectively full latency).
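To make the pattern concrete, here's a minimal single-producer/single-consumer mailbox in Rust. This is a toy sketch of the general technique, not mempipe's actual layout; every load of `ready` in the poll loop is exactly the coherency traffic described above.

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// A one-slot mailbox. The consumer hot-polls `ready`, so the cache line
// holding it ping-pongs between the producer's and consumer's caches.
struct Mailbox {
    ready: AtomicBool,
    data:  UnsafeCell<u64>,
}
// Safe here because `ready` orders all access to `data` (Release/Acquire).
unsafe impl Sync for Mailbox {}

fn send_recv(payload: u64) -> u64 {
    let mb = Arc::new(Mailbox {
        ready: AtomicBool::new(false),
        data:  UnsafeCell::new(0),
    });
    let tx = mb.clone();
    let producer = thread::spawn(move || {
        // Write the payload (line goes Modified on the producer's core)...
        unsafe { *tx.data.get() = payload; }
        // ...then publish it. Release pairs with the consumer's Acquire.
        tx.ready.store(true, Ordering::Release);
    });
    // Hot-poll: each load forces the producer's modified line to be demoted,
    // which is where the hundreds of cycles of latency come from.
    while !mb.ready.load(Ordering::Acquire) {
        spin_loop();
    }
    let got = unsafe { *mb.data.get() };
    producer.join().unwrap();
    got
}

fn main() {
    assert_eq!(send_recv(1337), 1337);
    println!("message received");
}
```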

There are some tricks you can do to get extremely low latency between cores. One of them, is making sure you're on cores that are physically near each other (eg. on the same processor socket). This is only really relevant on servers, but it's a big thing. You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

Another trick that you can do is have both threads on the same core, thus using hyperthreads. Hyperthreads share the same core and thus a lot of resources, and are able to actually skip some of the more expensive coherency traffic, as they share the same L1 cache (since L1 is per-core). The lowest latency you will be able to observe for IPC will be on the same core with hyperthreads, but that's often not really useful for _processing_ the data, since performance will not be great with two busy threads on one core. But in theory, you can signal a hyperthread, the hyperthread can then go and raise some other signal, while the original hyperthread still continues doing some relevant work. As long as one of them is blocking/halted, the other won't really be affected by the two sharing the same core.

Finally, the most reasonable trick is making sure your tickets/buffers/mailboxes/whatever are _not_ sharing the same cache lines (unless they contain data which is passed at the same time). Once again, the CPU keeps things in sync at cache line granularity. So having two pieces of data being hammered by two cores on the same cache line is asking for hundreds of cycles per trivial data access.

This can be observed in an extreme case with many-core systems, with multiple sockets, fighting over locks. I've done this on my 96C/192T system and I've been able to get single `lock inc [mem]` instructions to take over 15,000 cycles to complete. Which is unreal for a single instruction. But that's what happens when there's 200-500 cycles of overhead every single time that cache line is "stolen" back from other cores.

So, effectively, keep in your head which state cache lines will be in. If they're going to be modified on one core, make sure they're not going to be read on another core while still being written. These transitions are expensive; you're only going to get your 3-4 cycle "random L1 data hit" performance if the cache line is being read and it's in the exclusive, modified, or shared state, and if it's being written, it has to be exclusive or modified. Anything else and you're probably paying hundreds of cycles for the access, and thus, also probably hurting the other side.
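In Rust the usual way to force this separation is an alignment wrapper. Sketch below (64-byte lines are typical for x86; the `Tickets` struct and its field names are made up for illustration):

```rust
use std::sync::atomic::AtomicU64;

// Force anything wrapped in this to start on its own 64-byte cache line.
// 64 bytes is the common x86 line size; adjust for your uarch.
#[repr(align(64))]
struct CacheLine<T>(T);

// Hypothetical layout: each side's hot counter gets a private line, so the
// producer writing `write_seq` never invalidates the line the consumer is
// hammering with `read_seq`, and vice versa.
struct Tickets {
    write_seq: CacheLine<AtomicU64>, // written by the producer only
    read_seq:  CacheLine<AtomicU64>, // written by the consumer only
}

fn main() {
    let t = Tickets {
        write_seq: CacheLine(AtomicU64::new(0)),
        read_seq:  CacheLine(AtomicU64::new(0)),
    };
    // The alignment guarantees the counters land on distinct lines.
    let a = &t.write_seq as *const CacheLine<AtomicU64> as usize;
    let b = &t.read_seq  as *const CacheLine<AtomicU64> as usize;
    assert!(a.abs_diff(b) >= 64);
    println!("counters are {} bytes apart", a.abs_diff(b));
}
```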

Ultimately, what you're asking from the CPU is actually extremely complex. Think about how hard it would be for you to keep a copy of a database in sync between hundreds of writers and readers (cores). The CPU is doing this automatically for you under the hood, on every single memory access. It is _not_ free. Thus, you really have to engineer around this problem, batch your operations, find a design that doesn't require as intense of IPC, etc. On more weakly ordered systems you can use some more tricks in page tables to get a bit more control over how cache coherency should be applied for various chunks of memory to get more explicit control.


Oh sliding in here late, but it’s also extremely important to pin your threads to specific cores. I kinda only think about computers in this mode so I didn’t bring it up, but the kernel will effectively randomly schedule your process to different physical cores. If you are doing intense IPC or otherwise relying on specific cache usage, getting assigned to a different core is a massive loss of existing state which takes time to recover from!


> You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

This is a very cool comment in general, but I'm intrigued by this bit in particular. I'd love to see an implementation, if anyone knows of any.


There’s one big public work by utexas that covers this in a few places! https://sites.utexas.edu/jdm4372/2021/05/27/locations-of-cor...

See additional references for PDFs.

I’ve also played around with this in my research kernel, but never thoroughly enough to write up. Something I should revisit, I think it’d be a really fun thing to discuss and work on. Doing this level of timing requires doing BIOS configuration to make the CPU as deterministic as possible (turn off dynamic throttling, make sure you’re thermally keeping the CPU stable, etc).

I always highly encourage people to write advanced benchmarks. It’s a great way to learn how computers work!


Sticking with x86, I'm pretty sure CPUID can tell you the topology of all the cores you can query it on. Not that I'd tell anyone not to infer it from timing, that just sounds like fun.


It can tell you specified topology, like cores, threads, NUMA nodes. But it can’t tell you the physical locations of cores. Processors are binned based on what cores are functional after fabrication, thus your 12 core processor is probably a 16-18 core processor, and the 4 disabled cores are “random” based on manufacturing defects. Knowing this is completely unimportant, but cool. Ultimately yes, cpuid will tell you any relevant topology, since anything beyond that likely requires trillions of iterations to even detect.


It’s common in Anandtech benchmarks, eg https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

Dunno what software they use…



You mention the most reasonable trick is to just avoid hammering/read-write the same cache lines; I guess you didn't hit this as a need because your QEMU instruction pipe was fast enough, but would you batch up your events along cache lines, fill up a line, and signal it's free to the consumers instead?


Yeah. So I forget the original tuning I did for that project. But, I fill up a buffer which is on its own cache line (Chunk) and then signal that the chunk is ready for ownership on the other side, thus sending it. I’m not sure why the signaling atomics aren’t on their own cache lines, I imagine I tried both and this was faster? There’s also a chance I never tried it because I felt I didn’t need it? I’m not sure!

Edit: Yep, looking at this I think I see some room for improvement in design. I'm getting about 60-70 ns/iter right now on my Intel(R) Core(TM) i9-9900KS CPU @ 4.00GHz with turbo on, and a loaded up system. This code is 2 years old and I've gotten a lot better at designing stuff like this. But, it's still really good. It's something I'd like to revisit at some point.

The main bottleneck I think I see is that `cur_seq` should be on its own cache line, as that's the only heavily thrashed value between cores. Most of the others are primarily in the shared cache line state, until a writer hits a mailbox. This design searches linearly through mailboxes for work to do; perhaps it would be faster to save the previous index? I'm also not sure I always need to have readers strongly sequenced, such that they can process more than one message at a time, but this could have been a requirement of the way I was using it?

While in theory storing/caching some information is great, in practice it often means unpredictable and dependent loads to use it. Eg, storing the previous index now means the CPU has to fetch the previous index from memory, and then loop on it. The compiler also has to treat this as a "volatile" value, and thus will struggle to do things like loop unrolling and bounds checking removal. With a low N (~4-16, number of mailboxes), it's probably always faster to just `for _ in 0..16` instead of doing smarts here, as the processor can do all 16 of those effectively in parallel by the time the CPU has even loaded the "cached index". Once again, tuning is extremely specific to a workload and parameters.

For maximum performance, it's pretty much always required to try out some un-ergonomic API. In this case the sequencing is nice, but perhaps I could rewrite the code that uses mempipe to not care and handle sequencing someplace else? I forget. In theory I could have multiple banks of tickets, and switch between them on some level of pressure or timeout. Using the same signal line between the writer and the reader (eg, they both must write to client_owned) is probably a mistake. I think it'd be better to use two indexes, bumped on one side when a writer writes, and bumped on the other side when a reader is done. This would allow better "bursting" of data in my opinion? Who knows; even a data structure as simple as this effectively requires building it and measuring it to really determine how it performs. There are so many resources in use in the CPU that it's a tough balance to keep in your head.
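The two-index idea sketches out roughly like this (hypothetical layout, not mempipe's actual code): the writer only ever stores to `head`, the reader only ever stores to `tail`, so neither side dirties a line the other side writes.

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicUsize, Ordering};

const SLOTS: usize = 4;

// Give each index its own cache line so the two sides don't false-share.
#[repr(align(64))]
struct Pad<T>(T);

struct Ring {
    head:  Pad<AtomicUsize>, // bumped by the writer after filling a slot
    tail:  Pad<AtomicUsize>, // bumped by the reader after draining a slot
    slots: [UnsafeCell<u64>; SLOTS],
}
// Safe for one writer + one reader: head/tail ordering guards the slots.
unsafe impl Sync for Ring {}

impl Ring {
    fn new() -> Self {
        Ring {
            head:  Pad(AtomicUsize::new(0)),
            tail:  Pad(AtomicUsize::new(0)),
            slots: std::array::from_fn(|_| UnsafeCell::new(0)),
        }
    }
    fn push(&self, val: u64) {
        let head = self.head.0.load(Ordering::Relaxed);
        // Full when the writer is SLOTS ahead of the reader.
        while head - self.tail.0.load(Ordering::Acquire) == SLOTS { spin_loop(); }
        unsafe { *self.slots[head % SLOTS].get() = val; }
        self.head.0.store(head + 1, Ordering::Release);
    }
    fn pop(&self) -> u64 {
        let tail = self.tail.0.load(Ordering::Relaxed);
        // Empty when the indices match.
        while self.head.0.load(Ordering::Acquire) == tail { spin_loop(); }
        let val = unsafe { *self.slots[tail % SLOTS].get() };
        self.tail.0.store(tail + 1, Ordering::Release);
        val
    }
}

fn main() {
    let ring = Ring::new();
    std::thread::scope(|s| {
        s.spawn(|| { for i in 0..100u64 { ring.push(i); } });
        for i in 0..100u64 { assert_eq!(ring.pop(), i); }
    });
    println!("ok");
}
```

The "bursting" the comment mentions falls out of this shape: the writer can fill several slots before the reader's tail bump ever touches a line the writer cares about.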


I put some thought into that; unfortunately I had some issues with writing these hooks in QEMU TCG itself. That's what I originally did, such that it was JIT-target agnostic, but unfortunately I couldn't get the register stuff to work correctly? Idk, probably an assumption of the QEMU TCG implementation that I violated.

I don't think they'd want it if it's specific to one target arch. Idk. In theory the hooks into QEMU are target agnostic, there's no assembly on the QEMU side, only in the Cannoli server side (shared library).

I had it working target agnostic if I did memory accesses from the JIT, but it "didn't work" when I tried to put my data in registers (eg. the only generic implementation I used caused 2 extra memory accesses per target instruction, which is brutal at these speeds)


Perhaps you want to check with the Qemu devs? They may have advice on what work would be needed regardless of whether they accept your contribution or not, and additionally what they'd want to see in order to accept your contribution.

In any case, thank you for such a nice creation!


Oooh, wasn't really expecting this to make it to HN cause it was meant to be more of an announcement than a description.

But yes, I've done about 7 or 8 operating systems for fuzzing in the past and it's a massive performance (and cleanliness) improvement. This one is going to be like an operating system I wrote 2-3 years ago for my vectorized emulation work.

To answer your QEMU questions, the goal is to effectively build QEMU with MUSL (just to make it static so I don't need a dynamic loader), and modify MUSL to turn all syscalls into `call` instructions. This means a "syscall" is just a call to another area, which will be handled by my Rust Linux emulator. I'll implement the bare minimum syscalls (and enum variants to those syscalls) to get QEMU to work, nothing more. The goal is not to run Linux applications, but run a QEMU+MUSL combination which may be modified lightly if it means a lower emulation burden (eg. getting rid of threading in QEMU [if possible] so we can avoid fork())

The main point of this isn't performance, it's determinism, but that is a side effect. A normal syscall instruction involves a context switch to the kernel, potentially cr3 swaps depending on CPU mitigation configuration, and the same to return back. This can easily be hundreds of cycles. A `call` instruction to something that handles the syscall is on the order of 1-4 cycles.

While for syscalls this isn't a huge deal, it's even more emphasized when it comes to KVM hypercalls. Transitions to a hypervisor are very expensive, and in this case, the kernel, the hypervisor, and QEMU (eg. device emulation) will all be running at the same privilege level, and there won't be a weird QEMU -> OS -> KVM -> other guest OS device -> KVM -> OS -> QEMU transition on every device interaction.

But then again, it's mainly for determinism. By emulating Linux deterministically (eg. not providing entropy through times or other syscall returns), we can ensure that QEMU has no source of external entropy, and thus, will always do the same thing. Even if it uses a random-seeded hash table, the seed would be derived from syscalls, and thus, will be the same every time. This determinism means the guest always will do the same thing, to the instruction. Interrupts happen on the same instructions, context switches do, etc. This means any bug, regardless of how complex, will reproduce every time.

All of this syscall emulation + determinism I have also done before, in a tool called tkofuzz that I wrote for Microsoft. That used Linux emulation + Bochs, and it was written in userspace. This has proven incredibly successful and it's what most researchers are using at Microsoft now. That being said, Bochs is about 100x slower than native execution, and now that people have gotten a good hold of snapshot fuzzing (there's a steep learning curve), it's time to get a more performant implementation.

With QEMU we get this with a JIT, which at least gets us a 2-5x improvement over Bochs while still "emulating", but even more value could be found if we get the KVM emulation working and can use a hypervisor. That being said, I do plan to support a "mode" where guests which do not touch devices (or more specifically, snapshots which are taken after device I/O has occurred) will be able to run without QEMU at all. We're really only using QEMU for device emulation + interrupt control; thus, if you take a snapshot to a function that just parses everything in one thread, without process IPC or device access (it's rare: when you "read" from a disk, you're likely just hitting OS RAM caches, and thus not devices), we can cut out all the "bloat" of QEMU and run in a very very thin hypervisor instead.

In fuzzing it's critical to have ways to quickly map and unmap memory as most fuzz cases last for hundreds of microseconds. This means after a few hundred microseconds, I want to restore all memory back to the state "before I handled user input" and continue again. This is extremely slow in every conventional operating system, and there's really no way around it. It's of course possible to make a driver or use CRIU, but these are still not exactly the solution that is needed here. I'd rather just make an OS that trivially runs in KVM/Hyper-V/Xen, and thus can run in a VM to get the cross-platform support, rather than writing a driver for every OS I plan to use this on.
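A toy model of that memory-reset idea, to show why it matters: keep a pristine snapshot, log which "pages" a fuzz case dirtied, and restore only those. The page size and the `GuestMem` structure here are purely illustrative, not anything from the actual kernel.

```rust
use std::collections::HashSet;

const PAGE: usize = 4096;

struct GuestMem {
    snapshot: Vec<u8>,      // pristine "before user input" state
    current:  Vec<u8>,      // memory being mutated by the fuzz case
    dirty:    HashSet<usize>, // page indices touched since the snapshot
}

impl GuestMem {
    fn new(size: usize) -> Self {
        let snapshot = vec![0u8; size];
        GuestMem { current: snapshot.clone(), snapshot, dirty: HashSet::new() }
    }

    // All guest writes funnel through here so dirty pages get logged, the
    // way page-table tricks / dirty bits do it in a real hypervisor.
    fn write(&mut self, addr: usize, val: u8) {
        self.dirty.insert(addr / PAGE);
        self.current[addr] = val;
    }

    // Reset cost scales with pages *touched*, not total guest memory, which
    // is the whole point when fuzz cases last hundreds of microseconds.
    fn reset(&mut self) {
        for &page in &self.dirty {
            let start = page * PAGE;
            let end = (start + PAGE).min(self.current.len());
            self.current[start..end].copy_from_slice(&self.snapshot[start..end]);
        }
        self.dirty.clear();
    }
}

fn main() {
    let mut mem = GuestMem::new(16 * PAGE);
    mem.write(0x1234, 0x41);
    assert_eq!(mem.current[0x1234], 0x41);
    mem.reset(); // restores only the one dirtied page
    assert_eq!(mem.current[0x1234], 0);
    println!("snapshot restored");
}
```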

Stay cute, ~gamozo


Your YouTube looks awesome, and I just followed you on Twitch. Looking forward to a marathon stream. :)

A question about fuzzing and determinism:

It seems like this requirement of determinism is connected to (and reinforces) a brute force approach to discovering problematic inputs.

This precludes strategies like performing a truly stochastic search (where you put complete faith in the randomness) over your input space.

Are people attempting this kind of thing? Are there additional requirements to fuzzing that make probabilistically finding a thin trajectory through the input space undesirable?


Determinism is already fairly important in fuzzing for 2 major reasons.

One, is that having determinism makes it easier to triage bugs. If I find a crash while fuzzing, but there's no way to reproduce it, it's going to pretty much never get fixed (unless the call stack from the first crash is obvious enough to prepare a fix). For fuzzing systems, this is effectively mandatory. Imagine a memory allocation failure leading to a NULL-deref: the odds of hitting that same bug are so low because it requires going OOM at the same point in a subsequent execution. Snapshot fuzzing (each fuzz case starts from a snapshotted state of memory/registers/whatever) mitigates some of this, but there's still noise from context switches that will affect RNGs, which will affect heap layouts, which will affect everything on the system. That being said, most fuzzing out there publicly is not systems fuzzing, and thus this is typically not something people think about.

But two is what most people use determinism for. In modern fuzzing, coverage guidance is pretty much mandatory. This means when new code is hit, saving off the input such that it can be built upon. At a very simple level, this means a problem which is 256^4 turns into 256*4, as all requirements do not need to be satisfied simultaneously; as long as the previous requirements cause new code to get hit, they can be built upon. Of course, if the program does not behave the same way every time, the noise can start to erode the value which is provided here, since you're not actually building upon the same thing.
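That 256^4 -> 256*4 collapse is easy to demo. In this toy, the target "reports coverage" as how many leading magic bytes matched, and the fuzzer keeps any input that reaches new code (deterministic byte sweep instead of random mutation, purely for illustration):

```rust
const MAGIC: [u8; 4] = [0xde, 0xad, 0xbe, 0xef];

// Toy target: "coverage" is how many leading bytes hit the right branch.
fn target(input: &[u8; 4]) -> usize {
    input.iter().zip(MAGIC.iter()).take_while(|(a, b)| a == b).count()
}

// Coverage-guided search: mutate on top of the best-known input, keep any
// mutation that reaches new code. Returns (executions, solving input).
fn coverage_guided() -> (usize, [u8; 4]) {
    let mut input = [0u8; 4];
    let mut best  = target(&input);
    let mut cases = 0usize;
    while best < 4 {
        for byte in 0..=255u8 {
            let mut cand = input;
            cand[best] = byte;
            cases += 1;
            if target(&cand) > best {
                best = target(&cand); // new code hit: build on this input
                input = cand;
                break;
            }
        }
    }
    (cases, input)
}

fn main() {
    let (cases, input) = coverage_guided();
    assert_eq!(input, MAGIC);
    // Worst case 4 * 256 = 1024 executions, vs 256^4 ≈ 4.3 billion for
    // brute force. Nondeterministic coverage would break the "build upon
    // the best input" step that makes this work.
    assert!(cases <= 1024);
    println!("solved in {} cases", cases);
}
```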

I'm not sure if this answers the question.


I think you're mixing two different concepts up here.

Determinism in the context of fuzzing is related to the reproducibility of a program state. It's deterministic in the sense that all the inputs to the system are known and controlled. This allows us to repeat all inputs and reproduce the exact same behaviour as before, e.g. an error state we stumbled upon or an interesting program state we want to continue exploring.

This in no way precludes sampling the input space stochastically. Brute forcing by sampling all possible inputs sequentially is usually untenable and wasteful. However, once you do encounter a new program state, you'll be able to perfectly recreate it forever.


I think you're right, I was mixing two concepts and my question wasn't really about determinism.

Writing a specialized OS suggests to me that someone is very focused on... the best way I can describe it is cutting a fat trajectory through the input space. I am curious if anyone is spending their effort on sparser (but more intelligent) sampling of the input space instead.


Yes, there's a lot of work being done on more intelligent fuzzing. To throw some terms into the mix, there's coverage-guided fuzzing (which is now an old technique), concolic testing (which combines concrete execution with symbolic execution in order to reach new branches in a targeted way) and grammar fuzzers (which generate valid inputs according to a grammar).

These are not really mutually exclusive with the type of work gamozolabs is doing because even with hyperintelligent input generation, you still ideally want raw speed.


Oopsies, yep, I got banned making Maplestory cheats with Ghidra when it came out.

Nevertheless, I got unbanned and I still stream at: https://www.twitch.tv/gamozo I upload all the vods at: https://www.youtube.com/user/gamozolabs

Hope y'all enjoy the content <3


Back in Pentium 4 days you could use the DS and CS override prefixes on conditional branches to hint taken and not taken, respectively.

Kinda neat, but it's not a thing anymore.

Some more info here: https://stackoverflow.com/questions/14332848/intel-x86-0x2e-...


Sorry about that. This blog kinda just hopped into the meat of it as I was trying to keep it short. It's largely a followup to a previous blog of mine https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... where I go into the details and descriptions.


Hehe, hyperthreading has some issues. This issue technically works single thread, but it's hard for sensitive data to survive a context switch. That being said, this issue is mitigated in all common OSes and latest microcode.

I'll be curious as to what there is to learn from this. It's more of a longshot goal for me to learn how things work, develop accurate uarch models, and then learn from those models better than I could guess and check hardware results.

Hard to say if it'll go well....


> That being said, this issue is mitigated in all common OSes and latest microcode.

Emphasis on this issue, eh?

> I'll be curious as to what there is to learn from this...

I'm also very interested to see what you and the community can discover using this trick!


I'll eat my hat for this, but effectively the mitigation to this is clearing all caches and internal buffers in the CPU on each context switch.

I'm sure we'll see more types of leaks, but unless they're actively fetching invalid data [1], there isn't much sensitive data to leak anymore.

I don't think there is much during speculation that can load _new_ data during that window.

[1]: So far almost every CPU bug has leaked something in an internal cache.


A machine clear clears the pipeline. Does it clear these internal caches? There is, of course, no machine clear instruction. Could you construct a machine clearing sequence, insert it into the context switch code and test your hypothesis?


The legacy `verw` instruction has been repurposed via microcode to flush internal buffers (load buffers, store buffers, etc). Any serializing instruction should (hopefully) cause a pipeline flush. This is the mitigation Intel made available to OS developers and should be what is being used.


> I'll eat my hat for this, but effectively the mitigation to this is clearing all caches and internal buffers in the CPU on each context switch.

Are you talking about software context switches or hardware (hyperthreading) ones?

If HW threading constantly clears caches wouldn't it cause a huge performance loss? Isn't that something that can occur 1-100 million times per second?

Sadly, software context switch cache clearing is pretty much given these days.


In this case a privilege transition requires flushing caches. The scheduler has to be aware to not schedule two different permission levels/domains on the same core. It's a huge amount of osdev work to make hyperthreading "safe". I'll be curious if Intel doubles down on HT again.


The x86 `syscall` instruction stores the return address into `rcx` as well as `RFLAGS` into `r11`. These are unconditionally clobbered and thus cannot be saved or used in a `syscall` transition.


This kernel is just kind of a playground for projects I have. Specifically there were a few from this past year or so.

I used this kernel originally for my vectorized emulator, which is designed as a high-performance fuzzer/harness to find bugs (more info https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...). I used vectorized emulation on Windows DHCP to find multiple RCEs (which were disclosed earlier this year), as well as one of the Intel MDS vulnerabilities (such as RIDL and Fallout) disclosed earlier this year (specifically I found "MLPDS", https://nvd.nist.gov/vuln/detail/CVE-2018-12127).

I do most of my work in a personal kernel as it really gives me an edge with optimization. I'm able to use page tables directly (super fast fork()-like behavior), and write hypervisors that don't have to go through crazy call stacks to vmexit, use bleeding-edge CPU features, etc. Ultimately I just do it because it's fun, but I've found ways to justify it from time to time.


Thanks for posting this. This is a fascinating project and I hope you post more about it. I had a question about something under the Microcode section which stated:

>"These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc."

Is a switch from ring 3 to ring 0 handled by microcode then? If so why is this?


Is it a pure hobby project on the side, or do you work for a company / research lab ? That is really interesting btw.


I currently work at Microsoft, however my work and my hobby are pretty much the same at this point so I put a lot of time into it!


I greatly appreciate the interest in support! I don't have a need for funding for projects like these, as I get plenty of enjoyment out of doing them on my own. For now I'd suggest helping the many other projects/blogs/charities out there! If my following grows I'd love to add a Patreon and directly feed any proceeds there back into the community via projects, free trainings, blogs, etc.

I'm happy enough with where I am in life, and I'd really like to just give back and teach others. I was fortunate enough to have a mentor at a very young age (and many other great teachers and leaders throughout life), and I recognize the huge advantage it gave me. If I can pass knowledge on to the next generation (and current generations of course), that's the best outcome I can hope for!

