> 5-10% performance boost sounds huge. Wouldn't we have much larger TLBs if page walks were really this expensive?
It's pretty typical for large programs to spend 15+% of their "CPU time" waiting on TLB misses. [1] So larger pages really help, both changing the base page size 4 KiB -> 16 KiB (a 4x reduction in TLB pressure) and using 2 MiB huge pages (a 512x reduction where it works out).
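The reduction factors above follow directly from the page-size ratios; a trivial sketch of the arithmetic:

```python
# Each TLB entry maps one page, so a larger page size means
# proportionally fewer entries are needed to cover the same memory.
PAGE_4K = 4 * 1024
PAGE_16K = 16 * 1024
PAGE_2M = 2 * 1024 * 1024  # x86-64 huge page

print(PAGE_16K // PAGE_4K)  # 4x fewer entries with 16 KiB base pages
print(PAGE_2M // PAGE_4K)   # 512x fewer where huge pages apply
```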
I've also wondered why the TLB isn't larger.
> On the other hand, a 9% increase in memory usage also sounds huge. How did this affect memory usage that much?
This is the granularity at which physical memory is assigned, and there are a lot of reasons most of a page might be wasted:
* The heap allocator will typically cram many things together in a page, but it might, say, only use a given page for allocations in a certain size range, so not all allocations will snuggle in next to each other.
* Program stacks each use at least one distinct page of physical RAM because they're placed in distinct virtual address ranges with guard pages between them. So if you have 1,024 threads, they use at least 4 MiB of RAM with 4 KiB pages, or 16 MiB with 16 KiB pages.
* Anything from the filesystem that is cached in RAM ends up in the page cache, and true to the name, it has page granularity. So caching a 1-byte file would take 4 KiB before, 16 KiB after.
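The stack and page-cache numbers above can be sanity-checked with a quick sketch (helper names are mine, just for illustration):

```python
def min_stack_ram(threads: int, page_size: int) -> int:
    """Lower bound: each thread's stack touches at least one physical page."""
    return threads * page_size

def page_cache_usage(file_size: int, page_size: int) -> int:
    """The page cache allocates whole pages, so usage rounds up."""
    return -(-file_size // page_size) * page_size  # ceiling division

# 1,024 thread stacks: 4 MiB at 4 KiB pages, 16 MiB at 16 KiB pages
print(min_stack_ram(1024, 4 * 1024) // 2**20)   # 4
print(min_stack_ram(1024, 16 * 1024) // 2**20)  # 16

# Caching a 1-byte file: one full page either way
print(page_cache_usage(1, 4 * 1024))   # 4096
print(page_cache_usage(1, 16 * 1024))  # 16384
```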