
There are two ways I've tried to do this:

- On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an un-timed futex wait; just do a series of timed waits, with increasing wait times (so that you still eventually fix things if you hit the unlucky race between the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.)

- Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.


> Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).


It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.

You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g. one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kinds of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said, it didn't really pan out to be that useful in practice.)


> There's clearly an opportunity for a much longer talk, this teases that Paul (McKenney, of Linux fame) will have a very different take and we don't hear what it is, maybe that was presented at another session of this conference.

Paul's talk is here: https://www.youtube.com/watch?v=iJP6DWVrLjM


Thanks, I think I'll end up watching a good number of these


My usages are similar to yours, but new C standards still benefit me because I can opportunistically detect and make use of new features in a configure script.

To use my baby as an example: free_sized(void *ptr, size_t alloc_size) is new in C23. I can detect whether or not it's available and use it if so. If it's not available, I can just fall back to free() and get the same semantics, at some performance or safety cost.


The free_sized function is kind of a bad example, though. For years to come, using free_sized will break malloc interposition. Interposed mallocs will not support free_sized initially. (It's currently not in jemalloc, tcmalloc, mimalloc as far as I can see.) If we add it to glibc and your application picks it up, calls to free_sized will end up with the glibc allocator even if malloc/free/… have been interposed. Maybe there is a way to paper over this in the glibc implementation of free_sized (rather than calling free unconditionally), and still do something useful for the glibc allocator. I don't know.


> Maybe there is a way to paper over this in the glibc implementation of free_sized (rather than calling free unconditionally), and still do something useful for the glibc allocator. I don't know.

We emailed about this a little contemporaneously (thread "Sized deallocation for C", from February), and I think we came to the conclusion that glibc can make interposition work seamlessly even for interposed allocators lacking free_sized, by checking (in glibc's free_sized) if the glibc malloc/calloc/realloc has been called, and redirecting to free if it hasn't. (The poorly-named "Application Life-Cycle" section of the paper).


I don't fully understand the need or benefit of having free_sized() available tbh.

Spec says it's functionally equivalent to free(ptr) or undefined:

If ptr is a null pointer or the result obtained from a call to malloc, realloc, or calloc, where size is equal to the requested allocation size, this function is equivalent to free(ptr). Otherwise, the behavior is undefined.

Even the recommended practice does not really clarify things:

Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size parameter, and provide functionality equivalent to free in such cases.

When would someone use this instead of simply free(ptr) ?


> I don't fully understand the need or benefit of having free_sized() available tbh.

It's a performance optimization. Allocator implementations spend quite a lot of time in free() matching the provided pointer to the correct size bucket (as to why they don't have something like a ptr->bucket hash table, IDK, maybe it would consume too much memory overhead particularly for small allocations?). With free_sized() this step can be jumped over.


Thanks for your insights, which prompted me to actually jump into the malloc.c implementation.


Of particular interest is section 20.18 -- "Mirroring memory operands"; extending (something like) register renaming to memory.


I wonder what the limits are on that, and if it chains very deeply, e.g. mreg ("memory register") to op to mreg to op to mreg... Can this be used to extend the optimization of some chained computations via a kind of unrolling?


In Agner's forum post linked a few days ago[1], it sounded like it was quite limited. I mean, super impressive, but as far as I understood, it didn't handle nesting (just the single pointer/memory access + offset)[2]. I'm not very well versed on this stuff though, so maybe I misunderstood.

[1] https://news.ycombinator.com/item?id=24302057

[2] > The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.

> It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing.

Or from the PDF:

•The instructions must use general purpose registers.

•The memory operands must have the same address.

•The operand size must be 32 or 64 bits.

•You may have a 32 bit read after a 64 bit write to the same address, but not vice versa.

•The memory address must have a base pointer, no absolute address, and no rip-relative address. The memory address may have an index register, a scale factor, and an offset no bigger than 8 bits.

•The memory operand must be specified in exactly the same way with the same unmodified pointer and index registers in all the instructions involved.

•The memory address cannot cross a cache line boundary.

•The instructions can be simple MOV instructions, read-modify instructions, or read-modify-write instructions. It also works with PUSH and POP instructions.

•Complex instructions with multiple μops cannot be used.


My educated guess is that it is handled the same way as register renaming. Zen 2 has 180 GP registers.

For a 64-bit platform, you would need an algorithm that needs more than 15 registers (so you have to spill to the stack) and that needs to execute faster than cache access latency. There might be some, but I doubt many fall into this category.


A better name might be "Memory Operand Forwarding" since the data is probably coming from a write buffer.


> since the data is probably coming from a write buffer

Taking pending writes from the store buffer before they have retired is something else and has (obviously) been done since we first had OOO execution.

This is referring back to the operand's original value in a register file because that can be done with less latency than searching around in the store buffer, which I think isn't much different to L1.


>> This is referring back to the operand's original value in a register file because that can be done with less latency...

Thanks, I had missed that distinction.


That name doesn't really distinguish this feature from the normal store-to-load forwarding from the store buffer that processors have been doing for decades, though. I think memory renaming is probably the best name for it.


> Maybe there's some reason this doesn't work, or maybe the performance improvements aren't worth the effort?

I work on an allocator, and I've lobbied (Linux) kernel people for this over the years. I think everyone mostly agrees it'd be useful (and the TLB shootdown cost is indeed often nontrivial), but the current architecture of the various subsystems you'd need to touch makes it hard to do.

As allocators move more and more to a hugepages-first world, I think we'll see fewer and fewer benefits from faster unmappings (instead, we'll just spend more effort making sure we fill in the holes in existing in-use ranges).


I can't imagine how pissed off I'd be if I survived a nuclear winter and traveled hundreds of miles to read the prophesied guidestones that could save us all, only to find out that they had several weird eugenicist rules but zero information about like, water purification or animal husbandry.


Well, they do want to 'maintain' a small population, so why would they give you any survival tips?


Example usage on some code with interesting pipelining + resource contention properties: https://godbolt.org/z/11oyav

(The code is from the stream vbyte repo; see https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-... ).

Edit: Example interesting fact: the first iteration takes 35 cycles before it finishes; but the reciprocal throughput of the loop (assuming it executes a few times) is 5.3 cycles.


Depending on what exactly you mean by "hand it a block of memory to manage", you can do this with jemalloc's extent hook functionality. Though that's meant for "I have some long-lived data structures, I want all their memory stored on hugepages" more than "I have this small-ish data structure, I want to batch its allocations together".


Another fun attack vector that I don't think has been well explored yet involves the use of C++ sized deallocation functions. If a base class is missing a virtual destructor, or if an array allocated with new[] is deallocated with delete (instead of delete[]), then the allocation can be freed with an incorrect size parameter. If this happens, you can trigger some of the same sorts of state corruption issues that a double-free would cause (you set a random bitmap bit to "free" in the metadata, since you're calculating offsets from the start of a slab incorrectly).

Valgrind won't ever catch this, and Address Sanitizer won't always (it depends on both the exact type of bug, and the sanitization settings).



Oh yes; while certainly mismatched new[]/delete has been a source of problems since forever, what I mean specifically is the new types of exploits possible with the "operator delete(void* ptr, size_t sz)" overload (and its array cousin).

Before C++14, using delete (rather than delete[]) to deallocate an array of ints (or other trivially destructible data types) would be safe in practice, even though it was disallowed. In a world with sized deallocation functions, it's exploitable.



(As background, I'm a jemalloc developer; reposting my twitter comment on the same article): This is quite a bad way of doing a malloc benchmark -- getting realistic activity patterns is critical (see e.g. Wilson et al.'s survey). It doesn't meaningfully test inter-thread interactions, and randomizes in a way that hurts the effectiveness of thread-local caching.

For the large majority of server workloads on Linux, jemalloc or tcmalloc is probably the right choice of allocator. Trying these out (and spending a few additional test runs tuning their configuration) will often yield significant wins compared to the glibc allocator.


If your application needs to be up a lot, then you will also care about memory fragmentation; as far as I know, most benchmarks do not focus on such problems. You still must test it under real live conditions! No way around that.


jemalloc and tcmalloc have been much better than glibc for me, especially when it comes to avoiding fragmentation with some server workloads http://smalldatum.blogspot.com/2017/11/concurrent-large-allo... http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fr... http://smalldatum.blogspot.com/2015/10/myrocks-versus-alloca...


So true. Inside Google, it used to be the case that every N months someone had to figure out how to build their Python daemon with Blaze (Bazel) using tcmalloc. It was painful and I hope it's easier nowadays. Without it, after a few weeks you'd see memory leaks whose actual root cause was fragmentation. I can't remember how they interacted with the allocator in Python, but I always suspected that heavy use of protocol buffers made the problem a lot more acute.

