
There are two ways I've tried to do this:

- On the wait side, do the CAS to set the "waiter present" bit. On unlock, do a (relaxed) read of the lock word, and if "waiter present" isn't set, just do a release store to unlock (and go down some slow CAS-y wake path if a waiter is present). On the wait side, never do an un-timed futex wait; just do a series of timed waits, with increasing wait times (so that you still eventually fix things if you hit the unlucky race between the previous holder's check+store sequence). (You can also do some tricks with counters to let waiters do an unconditional sleep once they get their wait acknowledged.)

- Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

Practically, neither was ever "better enough" for anything but microbenchmarks to be worth the complexity.


> Split out the "waiter present" bit into its own byte, do a store-load sequence (with just a compiler reordering fence) to check for waiters, and have waiters either do a membarrier() syscall or wait "long enough" that they're sure they've gotten the same effect. (This gets tricky w.r.t. mutex lifetime though; you either need out of band lifetime knowledge or to use RCU or whatever and indirect through pointers).

If you are doing this, given the cost of membarrier, you are optimizing for almost always uncontended locks. At this point you can make your lock default-owned by the first thread to lock it and have the owner lock and unlock be basically free until it is contended. This is basically the biased locking optimization that Java implements (or used to).


It kinda depends; you only do the membarrier when you're about to sleep anyways, and the non-expedited membarrier() call is just a synchronize_rcu(), so it's not that drastically more expensive than a futex wait.

You don't necessarily want a biased lock for all this kind of stuff, because "sparsely contended" doesn't necessarily imply thread-associated. E.g. one place I was looking at this for was locks for pages of virtual memory in a heap; no thread "owns" any given heap page, but it was very uncommon to get unlucky and have two threads touching adjacent pages at the exact same time. These kinds of "sloppy mutexes" get half the fast-path speedup of biased locks but without the heavily asymmetric performance costs. (At least, that was the theory; like I said, it didn't really pan out to be that useful in practice.)


> There's clearly an opportunity for a much longer talk, this teases that Paul (McKenney, of Linux fame) will have a very different take and we don't hear what it is, maybe that was presented at another session of this conference.

Paul's talk is here: https://www.youtube.com/watch?v=iJP6DWVrLjM


Thanks, I think I'll end up watching a good number of these


My usages are similar to yours, but new C standards still benefit me because I can opportunistically detect and make use of new features in a configure script.

To use my baby as an example: free_sized(void *ptr, size_t alloc_size) is new in C23. I can detect whether or not it's available and use it if so. If it's not available, I can just fall back to free() and get the same semantics, at some performance or safety cost.


The free_sized function is kind of a bad example, though. For years to come, using free_sized will break malloc interposition. Interposed mallocs will not support free_sized initially. (It's currently not in jemalloc, tcmalloc, mimalloc as far as I can see.) If we add it to glibc and your application picks it up, calls to free_sized will end up with the glibc allocator even if malloc/free/… have been interposed. Maybe there is a way to paper over this in the glibc implementation of free_sized (rather than calling free unconditionally), and still do something useful for the glibc allocator. I don't know.


> Maybe there is a way to paper over this in the glibc implementation of free_sized (rather than calling free unconditionally), and still do something useful for the glibc allocator. I don't know.

We emailed about this a little contemporaneously (thread "Sized deallocation for C", from February), and I think we came to the conclusion that glibc can make interposition work seamlessly even for interposed allocators lacking free_sized, by checking (in glibc's free_sized) if the glibc malloc/calloc/realloc has been called, and redirecting to free if it hasn't. (The poorly-named "Application Life-Cycle" section of the paper).


I don't fully understand the need or benefit of having free_sized() available tbh.

Spec says it's functionally equivalent to free(ptr) or undefined:

If ptr is a null pointer or the result obtained from a call to malloc, realloc, or calloc, where size is equal to the requested allocation size, this function is equivalent to free(ptr). Otherwise, the behavior is undefined.

Even the recommended practice does not really clarify things:

Implementations may provide extensions to query the usable size of an allocation, or to determine the usable size of the allocation that would result if a request for some other size were to succeed. Such implementations should allow passing the resulting usable size as the size parameter, and provide functionality equivalent to free in such cases.

When would someone use this instead of simply free(ptr) ?


> I don't fully understand the need or benefit of having free_sized() available tbh.

It's a performance optimization. Allocator implementations spend quite a lot of time in free() matching the provided pointer to the correct size bucket (as to why they don't have something like a ptr->bucket hash table, IDK, maybe it would consume too much memory overhead particularly for small allocations?). With free_sized() this step can be jumped over.


Thanks for your insights, which prompted me to actually jump into the malloc.c implementation.


Of particular interest is section 20.18 -- "Mirroring memory operands"; extending (something like) register renaming to memory.


I wonder what the limits are on that, and if it chains very deeply, e.g. mreg ("memory register") to op to mreg to op to mreg... Can this be used to extend the optimization of some chained computations via a kind of unrolling?


In Agner's forum post linked a few days ago[1], it sounded like it was quite limited. I mean, super impressive, but as far as I understood, it didn't handle nesting (just the single pointer/memory access + offset)[2]. I'm not very well versed on this stuff though, so maybe I misunderstood.

[1] https://news.ycombinator.com/item?id=24302057

[2] > The mechanism works only under certain conditions. It must use general purpose registers, and the operand size must be 32 or 64 bits. The memory operand must use a pointer and optionally an index. It does not work with absolute or rip-relative addresses.

> It seems that the CPU makes assumptions about whether memory operands have the same address before the addresses have been calculated. This may cause problems in case of pointer aliasing.

Or from the PDF:

•The instructions must use general purpose registers.

•The memory operands must have the same address.

•The operand size must be 32 or 64 bits.

•You may have a 32 bit read after a 64 bit write to the same address, but not vice versa.

•The memory address must have a base pointer, no absolute address, and no rip-relative address. The memory address may have an index register, a scale factor, and an offset no bigger than 8 bits.

•The memory operand must be specified in exactly the same way with the same unmodified pointer and index registers in all the instructions involved.

•The memory address cannot cross a cache line boundary.

•The instructions can be simple MOV instructions, read-modify instructions, or read-modify-write instructions. It also works with PUSH and POP instructions.

•Complex instructions with multiple μops cannot be used.


My educated guess is that it is handled the same way as register renaming. Zen 2 has 180 GP registers.

For a 64-bit platform, you would need an algorithm that needs more than 15 registers (so you have to spill to the stack) and that needs to execute faster than cache access latency. There might be some, but I doubt many fall into this category.


A better name might be "Memory Operand Forwarding" since the data is probably coming from a write buffer.


> since the data is probably coming from a write buffer

Taking pending writes from the store buffer before they have retired is something else and has (obviously) been done since we first had OOO execution.

This is referring back to the operand's original value in a register file because that can be done with less latency than searching around in the store buffer, which I think isn't much different to L1.


>> This is referring back to the operand's original value in a register file because that can be done with less latency...

Thanks, I had missed that distinction.


That name doesn't really distinguish this feature from the normal store-to-load forwarding from the store buffer that processors have been doing for decades, though. I think memory renaming is probably the best name for it.


> Maybe there's some reason this doesn't work, or maybe the performance improvements aren't worth the effort?

I work on an allocator, and I've lobbied (Linux) kernel people for this over the years. I think everyone mostly agrees it'd be useful (and the TLB shootdown cost is indeed often nontrivial), but the current architecture of the various subsystems you'd need to touch makes it hard to do.

As allocators move more and more to a hugepages-first world, I think we'll see fewer and fewer benefits from faster unmappings (instead, we'll just spend more effort making sure we fill in the holes in existing in-use ranges).


I can't imagine how pissed off I'd be if I survived a nuclear winter and traveled hundreds of miles to read the prophesied guidestones that could save us all, only to find out that they had several weird eugenicist rules but zero information about like, water purification or animal husbandry.


Well, they do want to 'maintain' a small population, so why would they give you any survival tips?


Example usage on some code with interesting pipelining + resource contention properties: https://godbolt.org/z/11oyav

(The code is from the stream vbyte repo; see https://lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-... ).

Edit: Example interesting fact: the first iteration takes 35 cycles before it finishes; but the reciprocal throughput of the loop (assuming it executes a few times) is 5.3 cycles.


Depending on what exactly you mean by "hand it a block of memory to manage", you can do this with jemalloc's extent hook functionality. Though that's meant for "I have some long-lived data structures, I want all their memory stored on hugepages" more than "I have this small-ish data structure, I want to batch its allocations together".


Another fun attack vector that I don't think has been well explored yet involves the use of C++ sized deallocation functions. If a base class is missing a virtual destructor, or if an array allocated with new[] is deallocated with delete (instead of delete[]), then the allocation can be freed with an incorrect size parameter. If this happens, you can trigger some of the same sorts of state corruption issues that a double-free would cause (you set a random bitmap bit to "free" in the metadata, since you're calculating offsets from the start of a slab incorrectly).

Valgrind won't ever catch this, and Address Sanitizer won't always (it depends on both the exact type of bug, and the sanitization settings).



Oh yes; while certainly mismatched new[]/delete has been a source of problems since forever, what I mean specifically is the new types of exploits possible with the "operator delete(void* ptr, size_t sz)" overload (and its array cousin).

Before C++14, using delete (rather than delete[]) to deallocate an array of ints (or other trivially destructible data types) would be safe in practice, even though it was disallowed. In a world with sized deallocation functions, it's exploitable.



(As background, I'm a jemalloc developer; reposting my twitter comment on the same article): This is quite a bad way of doing a malloc benchmark -- getting realistic activity patterns is critical (see e.g. Wilson et al.'s survey). It doesn't meaningfully test inter-thread interactions, and randomizes in a way that hurts the effectiveness of thread-local caching.

For the large majority of server workloads on Linux, jemalloc or tcmalloc is probably the right choice of allocator. Trying these out (and spending a few additional test runs tuning their configuration) will often yield significant wins compared to the glibc allocator.


If your application needs to be up a lot, then you will also care about memory fragmentation; as far as I know, most benchmarks do not focus on such problems. You still must test it under real live conditions! No way around that.


jemalloc and tcmalloc have been much better than glibc for me, especially when it comes to avoiding fragmentation with some server workloads http://smalldatum.blogspot.com/2017/11/concurrent-large-allo... http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fr... http://smalldatum.blogspot.com/2015/10/myrocks-versus-alloca...


So true. Inside Google, it used to be the case that every N months someone had to figure out how to build their Python daemon with Blaze (Bazel) using tcmalloc. It was painful and I hope it's easier nowadays. Without it, after a few weeks you'd see memory leaks whose actual root cause was fragmentation. I can't remember how they interacted with the allocator in Python, but I always suspected that heavy use of protocol buffers made the problem a lot more acute.

