The lack of data parallelism is implied by the computation being performed.
You gradient descend on your state.
Each step needs to work on up-to-date state; otherwise you're computing gradient descent from state that doesn't exist anymore, and the resulting delta is nonsensical when applied to the most recent state (it was calculated on the old one, so the direction your computation found is now wrong).
You also can't calculate it without having access to the whole state. You have to do a full forward and backward pass and mutate the weights.
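A minimal numpy sketch of that staleness argument (the toy quadratic loss, learning rate, and all numbers are my own illustrative assumptions, not anything from a real training setup):

```python
import numpy as np

# Toy loss: L(w) = ||w - target||^2, so grad(w) = 2 * (w - target)
target = np.array([3.0, -1.0])
grad = lambda w: 2 * (w - target)
lr = 0.4

# Gradient computed on one state, but applied to a different, newer state.
w_stale_source = np.array([0.0, 0.0])   # old state the gradient came from
w_current = np.array([5.0, -2.0])       # state after other updates landed

stale_step = w_current - lr * grad(w_stale_source)
fresh_step = w_current - lr * grad(w_current)

print(np.linalg.norm(w_current - target))  # ~2.24  distance before the step
print(np.linalg.norm(stale_step - target)) # ~4.75  stale delta moves AWAY
print(np.linalg.norm(fresh_step - target)) # ~0.45  fresh delta moves closer
```

The stale delta was a perfectly good descent direction for the state it was computed on; applied to the state that exists now, it makes things worse.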
There aren't any ways of slicing and distributing that make sense in terms of efficiency.
The reason is that too much data needs to be mutated, and then made readable again, at too high a frequency.
That's also the reason why Nvidia is focusing so much on hyper-efficient interconnects - because that's the bottleneck.
Computation itself is way ahead of data transfer in and out. Data transfer is the main problem, and moving toward an architecture that cuts the available transfer bandwidth by several orders of magnitude is just not the way to go.
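For a sense of scale, a rough back-of-envelope (the parameter count, precision, and step time below are illustrative assumptions, not measurements):

```python
# Why the interconnect is the bottleneck: naive data-parallel training
# all-reduces a full gradient every optimizer step.
params = 70e9            # assume a 70B-parameter model
bytes_per_param = 2      # assume fp16/bf16 gradients
step_time_s = 1.0        # assume one optimizer step per second

gradient_bytes = params * bytes_per_param
required_gbps = gradient_bytes * 8 / step_time_s / 1e9

print(f"~{gradient_bytes / 1e9:.0f} GB of gradients per step")
print(f"~{required_gbps:.0f} Gbit/s sustained, just for gradients")
# That's NVLink/InfiniBand territory; a commodity internet link is
# several orders of magnitude short of it.
```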
If somebody solves this problem, it'll mean they've solved a much more interesting problem, because it'll mean you can up-train a model locally and inject that knowledge into a bigger one arbitrarily.
Your gradient descent is an operation on a directed acyclic graph. The graph itself is stateless. You can compute parts of the graph without needing access to the entire graph, particularly for transformers. In fact, this is already done today for training and inference of large models. The transfer bottleneck applies to currently used model sizes and architectures. There's nothing to stop you from building a model so complex that compute itself becomes the bottleneck rather than data transfer. Except its ultimate usability, of course, as I already mentioned.
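For what "doing parts of the graph" looks like, here's a toy numpy sketch of pipeline/model parallelism (the layer sizes and the two-stage split are made up for illustration; real systems do this with frameworks like Megatron-LM or DeepSpeed):

```python
import numpy as np

rng = np.random.default_rng(0)

class Stage:
    """One pipeline stage: a private slice of the layer stack and its weights."""
    def __init__(self, dims):
        self.weights = [rng.standard_normal((a, b)) * 0.1
                        for a, b in zip(dims, dims[1:])]

    def forward(self, x):
        for w in self.weights:
            x = np.maximum(x @ w, 0.0)   # ReLU MLP block
        return x

# "Device A" owns layers 0-1, "device B" owns layers 2-3; only the
# activations at the cut point ever cross the interconnect.
stage_a = Stage([32, 64, 64])
stage_b = Stage([64, 64, 8])

batch = rng.standard_normal((16, 32))
activations = stage_a.forward(batch)     # computed where stage A's weights live
output = stage_b.forward(activations)    # stage B never sees stage A's weights
print(output.shape)                      # (16, 8)
```

Each stage only ever needs its own weights plus the activations at the cut point, and the backward pass can be pipelined the same way.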
Your DAG is big. It's stateless for a single pass. The next pass doesn't operate on it anymore; it operates on a new, updated one from the previous step. And it has fully connected sub-DAGs.
There is nothing stopping you from distributing the execution of assembly/machine-code instructions across CPUs either, yet nobody does it because it doesn't make sense from a performance perspective.
Or Amazon driving a truck from one depot to another to unload one package at a time in order to "distribute" the unloading, because "distributed = faster".
Yes, and if there were something interesting there, you'd think something would have happened since 2017. Reinforcement learning (which it is compared with) is not particularly famous for its performance (that's its biggest issue and the reason it isn't used that much). Also, transformers don't use it at all.
OpenAI has turned for-profit and stopped releasing any technical details regarding architectures or training. So how do you know that nothing has happened? Because they didn't release it? Do you see the issue here?