I've been working through that repo and managed to run the 13B model on a single Pi 4 with 8 GB of RAM.
I've also replicated the work with OpenMPI (following a thread on the llama.cpp GitHub repo), and today I managed to get the 65B model running across three Pi 4 nodes.
I'm not saying this as any achievement of mine, but as a comment on the current reality of reproducible LLM inference at home on whatever hardware you've got.
The objective performance I'm getting is frankly poor, mostly because of the network I'm using. Then again, one node is on wireless until I can pull another drop and the rest are on 100 Mbit, so I'm really running a bargain-basement cluster; simply being able to do it at all is the point.
I don't know about SRD, but llama.cpp has MPI support built in. I didn't have to engineer or rewrite anything (I applied an optimization patch, but I didn't even come up with that one myself); I just compiled it with the right flags set.
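Roughly, the whole setup looked like the sketch below. The hostnames and model path are placeholders, and the flags are the ones from the llama.cpp MPI build instructions as I remember them, so check the current README before copying:

    # build with MPI support on every node (needs OpenMPI or MPICH installed)
    make CC=mpicc CXX=mpicxx LLAMA_MPI=1

    # hostfile: one line per node, host:slots
    pi-node1:1
    pi-node2:1
    pi-node3:1

    # run one rank per node across the cluster
    mpirun -hostfile hostfile -n 3 ./main -m ./models/65B/ggml-model-q4_0.bin -p "Hello" -n 128

The weights have to be present on every node, and MPI doesn't speed up a single generation; it just lets a model that won't fit in one node's RAM run at all.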
As far as performance on 65B goes, I'm still waiting for it to finish so I can get the timings :)
I eagerly await your numbers! Maybe I'll post some of my own if I can get far enough ahead at work.
I was thinking it's time to upgrade from 1GigE anyway; 10GigE is cheap, and at work we're ripping it out in favor of 25 and 50...
I'll look at the code; depending on how well the authors used MPI, it could be exciting times! It's not that hard (or expensive) to get a bunch of used servers off eBay and string 'em together with a cheap 10GigE switch. It would be loud and power hungry, but I wonder if I could have a 65B local model in the privacy of my own home for a fraction of the cost of buying an A100...
Edit: Oh, and SRD (Scalable Reliable Datagram) is a ... network protocol designed to work hand in hand with EFA (Elastic Fabric Adapter), and it can substantially improve the performance of HPC MPI workloads running on EC2 during network-bound phases.
The other way around is whole-number math. I added the 3-node output from the 13B model to GitHub; the timings are below. The 3-node 65B job hasn't finished yet.
    llama_print_timings:        load time =  17766.29 ms
    llama_print_timings:      sample time =    264.42 ms /   128 runs   (    2.07 ms per token,   484.07 tokens per second)
    llama_print_timings: prompt eval time =  10146.71 ms /     8 tokens ( 1268.34 ms per token,     0.79 tokens per second)
    llama_print_timings:        eval time = 287157.12 ms /   127 runs   ( 2261.08 ms per token,     0.44 tokens per second)
    llama_print_timings:       total time = 297598.22 ms
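For reference, the conversions are straight division: the eval line is 287157.12 ms / 127 tokens ≈ 2261 ms per token, and 1000 / 2261 ≈ 0.44 tokens per second, which is where that last figure comes from.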
It really feels like this technique has arrived.
https://github.com/cameronbunce/ClusterConfig