The lack of data parallelism is implied by the computation being performed.
You gradient descend on your state.
Each step needs to work on up-to-date state; otherwise you're computing gradient descent from state that doesn't exist anymore, and the resulting delta is nonsensical when applied to the most recent state (it was calculated on the old one, so the direction your computation found is now wrong).
You also can't calculate it without having access to the whole state. You have to do a full forward and backward pass and mutate the weights.
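A minimal numpy sketch of that staleness argument (the toy quadratic loss, learning rate, and all numbers are my own illustrative assumptions, not anything from a real training setup):

```python
import numpy as np

# Toy loss: L(w) = ||w - target||^2, so grad(w) = 2 * (w - target)
target = np.array([3.0, -1.0])
grad = lambda w: 2 * (w - target)
lr = 0.4

# Gradient computed on one state, but applied to a different, newer state.
w_stale_source = np.array([0.0, 0.0])   # old state the gradient came from
w_current = np.array([5.0, -2.0])       # state after other updates landed

stale_step = w_current - lr * grad(w_stale_source)
fresh_step = w_current - lr * grad(w_current)

print(np.linalg.norm(w_current - target))  # ~2.24  distance before the step
print(np.linalg.norm(stale_step - target)) # ~4.75  stale delta moves AWAY
print(np.linalg.norm(fresh_step - target)) # ~0.45  fresh delta moves closer
```

The stale delta was a perfectly good descent direction for the state it was computed on; applied to the state that exists now, it makes things worse.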
There aren't any ways of slicing and distributing that make sense in terms of efficiency.
The reason is that too much data needs to be mutated, and then made readable again, at too high a frequency.
That's also the reason why Nvidia is focusing so much on hyper-efficient interconnects - because that's the bottleneck.
Computation itself is way ahead of data transfer in and out. Data transfer is the main problem, and moving toward an architecture that cuts the available transfer bandwidth by several orders of magnitude is just not the way to go.
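For a sense of scale, a rough back-of-envelope (the parameter count, precision, and step time below are illustrative assumptions, not measurements):

```python
# Why the interconnect is the bottleneck: naive data-parallel training
# all-reduces a full gradient every optimizer step.
params = 70e9            # assume a 70B-parameter model
bytes_per_param = 2      # assume fp16/bf16 gradients
step_time_s = 1.0        # assume one optimizer step per second

gradient_bytes = params * bytes_per_param
required_gbps = gradient_bytes * 8 / step_time_s / 1e9

print(f"~{gradient_bytes / 1e9:.0f} GB of gradients per step")
print(f"~{required_gbps:.0f} Gbit/s sustained, just for gradients")
# That's NVLink/InfiniBand territory; a commodity internet link is
# several orders of magnitude short of it.
```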
If somebody solves this problem, it'll mean they've solved a much more interesting problem, because it'll mean you can up-train a model locally and inject that knowledge into a bigger one arbitrarily.
Your gradient descent is an operation on a directed acyclic graph. The graph itself is stateless. You can compute parts of the graph without needing access to the entire graph, particularly for transformers. In fact, this is already done today for training and inference of large models. The transfer bottleneck applies to currently used model sizes and architectures. There's nothing to stop you from building a model so complex that compute itself becomes the bottleneck rather than data transfer. Except its ultimate usability, of course, as I already mentioned.
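For what "doing parts of the graph" looks like, here's a toy numpy sketch of pipeline/model parallelism (the layer sizes and the two-stage split are made up for illustration; real systems do this with frameworks like Megatron-LM or DeepSpeed):

```python
import numpy as np

rng = np.random.default_rng(0)

class Stage:
    """One pipeline stage: a private slice of the layer stack and its weights."""
    def __init__(self, dims):
        self.weights = [rng.standard_normal((a, b)) * 0.1
                        for a, b in zip(dims, dims[1:])]

    def forward(self, x):
        for w in self.weights:
            x = np.maximum(x @ w, 0.0)   # ReLU MLP block
        return x

# "Device A" owns layers 0-1, "device B" owns layers 2-3; only the
# activations at the cut point ever cross the interconnect.
stage_a = Stage([32, 64, 64])
stage_b = Stage([64, 64, 8])

batch = rng.standard_normal((16, 32))
activations = stage_a.forward(batch)     # computed where stage A's weights live
output = stage_b.forward(activations)    # stage B never sees stage A's weights
print(output.shape)                      # (16, 8)
```

Each stage only ever needs its own weights plus the activations at the cut point, and the backward pass can be pipelined the same way.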
Your DAG is big. It's stateless for a single pass. The next pass doesn't operate on it anymore; it operates on a new, updated one from the previous step. And it has fully connected sub-DAGs.
There is nothing stopping you from distributing the execution of assembly/machine-code instructions across CPUs either, yet nobody does it because it doesn't make sense from a performance perspective.
Or Amazon driving a truck from one depot to another to unload one package at a time in order to "distribute" the unloading, because "distributed = faster".
Yes, and if there were something interesting there, you'd think something would have happened since 2017. Reinforcement learning (which it is compared with) is not particularly famous for its performance (that's its biggest issue and the reason it isn't used that much). Also, transformers don't use it at all.
OpenAI has turned for-profit and stopped releasing any technical details regarding architectures or training. So how do you know that nothing has happened? Because they didn't release it? Do you see the issue here?