I've been working through that repo and managed to run the 13B model on a single Pi 4 with 8 GB of RAM.
I've also replicated the work with OpenMPI (following a thread on the llama.cpp GitHub repo), and today I managed to get the 65B model running across three Pi 4 nodes.
I'm not saying this as any achievement of mine, but as a comment on the current reality of reproducible LLM inference at home on whatever hardware you've got.
The objective performance I'm getting is frankly poor, mostly because of the network I'm using. Then again, one node is on wireless until I can pull another drop and the rest are on 100 Mbit, so I'm really running a bargain-basement cluster; simply being able to do it at all is the point.
I don't know about SRD, but llama.cpp has MPI support built in. I didn't have to engineer or rewrite anything (I applied an optimization patch, but I didn't even come up with that one myself); I just compiled it with the right flags set.
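Roughly, the whole setup looked like the sketch below. The hostnames and model path are placeholders, and the flags are the ones from the llama.cpp MPI build instructions as I remember them, so check the current README before copying:

    # build with MPI support on every node (needs OpenMPI or MPICH installed)
    make CC=mpicc CXX=mpicxx LLAMA_MPI=1

    # hostfile: one line per node, host:slots
    pi-node1:1
    pi-node2:1
    pi-node3:1

    # run one rank per node across the cluster
    mpirun -hostfile hostfile -n 3 ./main -m ./models/65B/ggml-model-q4_0.bin -p "Hello" -n 128

The weights have to be present on every node, and MPI doesn't speed up a single generation; it just lets a model that won't fit in one node's RAM run at all.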
As far as performance on 65B goes, I'm still waiting for it to finish so I can get the timings :)
I eagerly await your numbers! Maybe I'll post some of my own if I can get far enough ahead at work.
I was thinking it's time to upgrade from 1GigE anyway; 10GigE is cheap, and at work we're ripping it out in favor of 25 and 50...
I'll look at the code; depending on how well the authors used MPI, it could be exciting times! It's not that hard (or expensive) to get a bunch of used servers off eBay and string 'em together with a cheap 10GigE switch. It would be loud and power hungry, but I wonder if I could have a 65B local model in the privacy of my own home for a fraction of the cost of buying an A100...
Edit: Oh, and SRD (Scalable Reliable Datagram) is a ... network protocol designed to work hand in hand with EFA (Elastic Fabric Adapter), and it can substantially improve the performance of HPC MPI workloads running on EC2 during network-bound phases.
The other way around is whole-number math. I added the 3-node output from the 13B model to GitHub; the timings are below. The 3-node 65B job hasn't finished yet.
    llama_print_timings:        load time =  17766.29 ms
    llama_print_timings:      sample time =    264.42 ms /   128 runs   (    2.07 ms per token,   484.07 tokens per second)
    llama_print_timings: prompt eval time =  10146.71 ms /     8 tokens ( 1268.34 ms per token,     0.79 tokens per second)
    llama_print_timings:        eval time = 287157.12 ms /   127 runs   ( 2261.08 ms per token,     0.44 tokens per second)
    llama_print_timings:       total time = 297598.22 ms
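For reference, the conversions are straight division: the eval line is 287157.12 ms / 127 tokens ≈ 2261 ms per token, and 1000 / 2261 ≈ 0.44 tokens per second, which is where that last figure comes from.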
It really feels like this technique has arrived.
https://github.com/cameronbunce/ClusterConfig