Generally, when I want to run something with this much parallelism, I just write a small Go program instead and let Go's runtime handle the scheduling. It works remarkably well, and there's no execve() overhead either.
So, there are a few reasons why forkrun might work better than this, depending on the situation:
1. if what you want to run is built to be called from a shell (including multi-step shell functions) and not Go. This is the main appeal of forkrun in my opinion - extreme performance without needing to rewrite anything.
2. if you are running on NUMA hardware. Forkrun deals with NUMA hardware remarkably well - it distributes work between nodes almost perfectly with almost 0 cross-node traffic.
Actually, the thing that might need that would be make -j. Maybe it would make sense to make a BSD or GNU make version that integrates those optimizations. I'm actually running a lot of stuff in Cygwin, and there are huge fork penalties on Windows, so my hope would be to actually get some real payloads running faster...
I hate to say it, but forkrun probably won't work in Cygwin. I haven't tried it, but forkrun makes heavy use of Linux-only syscalls that I suspect aren't available in Cygwin.
forkrun might work under WSL2, as it's my understanding that WSL2 runs a full Linux kernel in a hypervisor.
AFAIK, the Go runtime is pretty NUMA-oblivious. The mcache helps a bit with locality of small allocations, but otherwise you aren't going to get the same benefits (though I absolutely hear you about avoiding execve overhead).
So...yes, the execve overhead is real. BUT there's still a lot you can accomplish with pure bash builtins (which don't have the execve overhead). And, if you're open to rewriting things (which would probably be required to some extent if you were to make something intended for shell to run in Go) you can port whatever you need to run into a bash builtin and bypass the execve overhead that way. In fact, doing that is EXACTLY what forkrun does, and is a big part of why it is so fast.
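To see the execve overhead in isolation, you can compare bash's builtin echo against the external /bin/echo binary. This is a rough illustration (exact timings vary by machine); the builtin version makes no fork/exec calls at all, while the external version pays for one fork+execve per invocation:

```shell
# builtin echo: runs entirely inside the bash process, no fork/exec per call
time for i in {1..10000}; do echo hi; done >/dev/null

# external echo: one fork+execve per call, typically orders of magnitude slower
time for i in {1..10000}; do /bin/echo hi; done >/dev/null

# confirm which one a bare `echo` resolves to
type -t echo    # prints: builtin
```

The same idea extends to `enable -f`, which lets you load your own compiled builtins into a running bash.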
0. vfork (which is sometimes better than CoW fork) + execve if exec is the only outcome of the spawned child. Or, use posix_spawn where available.
1. Inner-loop hot path code {sh,c}ould be made a bash built-in after proving that it's the source of a real performance bottleneck. (Just say "no" to premature optimization.) Otherwise, rewrite the whole thing in something performant enough like C, C++, Rust, etc.
2. I'm curious what the performance of forkrun "echo ." on a billion jobs would be vs., say, pure C doing it with one worker thread per core.
> I'm curious what the performance of forkrun "echo ." on a billion jobs would be vs., say, pure C
Short answer: in its fastest mode, forkrun gets very close to the practical dispatch limit for this kind of workload. A tight C loop would still be faster, but at that point you're no longer comparing “parallel job dispatch”—you're comparing raw in-process execution.
Let me try to at least show what kind of performance forkrun gives here. Let's set up a file of 1 billion newlines (about 1 GB) on a tmpfs:
cd /tmp
yes $'\n' | head -n 1000000000 > f1
Now let's try frun echo:
time { frun echo <f1 >/dev/null; }
real 0m43.779s
user 20m3.801s
sys 0m11.017s
forkrun in its "standard mode" hits about 25 million lines per second running newlines through a no-op (:), and ever so slightly less (23 million lines per second) running them through echo. The vast majority of this time is bash overhead. forkrun breaks the lines up into batches of (up to) 4096 (though for 1 billion lines the average batch size is probably 4095). Then, for each batch, a worker-specific data-reading fd is advanced to the byte offset where the batch's data starts, and the worker runs
mapfile -t -n $N -u $fd A # N is typically 4096 here
echo "${A[@]}"
The second command (specifically the array expansion into a long list of quoted empty args) is what takes up the vast majority of the time. frun has a flag (-U) that causes it to replace `"${A[@]}"` with `${A[*]}`, which (in the case of all-empty inputs) collapses the long list of quoted empty args into a single string of spaces -> 0 args. This considerably speeds things up when inputs are all empty.
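The difference between the two expansions is easy to demonstrate with bash's own argument counting. A small sketch, using a throwaway `count` helper (not part of forkrun):

```shell
# hypothetical helper that just reports how many arguments it received
count() { echo "$#"; }

A=("" "" "")          # three empty elements, like three empty input lines

count "${A[@]}"       # quoted per-element expansion: prints 3 (three empty args)
count ${A[*]}         # unquoted joined expansion: prints 0 (the joining spaces word-split away)
```

Quoted `"${A[@]}"` preserves each element as its own (empty) argument, so the callee still has to handle N args; unquoted `${A[*]}` joins the elements with spaces and then word-splitting discards the result entirely when every element is empty.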
time { frun -U echo <f1 >/dev/null; }
real 0m13.295s
user 6m0.567s
sys 0m7.267s
And now we are at 75 million lines per second. But we are still largely limited by passing data through bash... which is why forkrun also has a mode (`-s`) where it bypasses bash mapfile + array expansion altogether and instead splices (via one of the forkrun loadable builtins) data directly to the stdin of whatever you are parallelizing. If you are parallelizing a bash builtin (where there is no execve cost), forkrun gets REALLY fast.
time { frun -s : < f1; }
real 0m0.985s
user 0m13.894s
sys 0m12.398s
which means it is delimiter scanning, dynamically batching and distributing (in batches of up to 4096 lines) at a rate of OVER 1 BILLION LINES A SECOND, or ~250,000 batches per second.
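As a sanity check on those figures, here is the back-of-envelope arithmetic in shell (integer math, scaled by 1000 to avoid floats since 0.985 s is the measured wall time):

```shell
lines=1000000000
batch=4096
seconds_x1000=985                                 # 0.985 s, scaled up by 1000

echo $(( lines * 1000 / seconds_x1000 ))          # lines/sec: 1015228426 (~1.0B)
echo $(( lines / batch ))                         # total batches: 244140
echo $(( lines * 1000 / batch / seconds_x1000 ))  # batches/sec: 247858 (~250k)
```

Both headline numbers (over 1 billion lines/sec, ~250,000 batches/sec) fall straight out of the measured 0.985 s runtime and the 4096-line batch size.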
At that point the bottleneck is basically just delimiter scanning and kernel-level data movement. There’s very little “scheduler overhead” left to remove—whether you write it in bash+C hybrids (like forkrun) or pure C.
to a NUMA-Aware Contention-Free Dynamically-Auto-Tuning Bash-Native Streaming Parallelization Engine. I dare say 10 years is about the norm for going from "beginner" to "PhD-level" work.