
Your gradient descent is an operation on a directed acyclic graph. The graph itself is stateless. You can evaluate parts of the graph without access to the entire graph, particularly for transformers. In fact, this is already done today for both training and inference of large models. The transfer bottleneck applies to currently used model sizes and architectures. There's nothing to stop you from building a model so complex that compute itself becomes the bottleneck rather than data transfer. Except for its ultimate usability, of course, as I already mentioned.
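The "parts of the graph without the entire graph" idea can be sketched as pipeline-style splitting: each worker holds only one stage (a slice of layers) of a layered model, and only activations cross stage boundaries. All names below (`Stage`, `run_pipeline`, the toy layers) are illustrative, not any real framework's API.

```python
# Minimal sketch, assuming a purely layered (sequential) model: each Stage
# owns a contiguous slice of layers and never sees the rest of the graph.
# Only activations cross stage boundaries -- that inter-stage transfer is
# what makes bandwidth (not principle) the limiting factor.

def make_layer(scale):
    # Toy "layer": an elementwise affine map standing in for a real layer.
    return lambda xs: [scale * x + 1.0 for x in xs]

class Stage:
    """Owns a slice of layers; a worker holding this needs nothing else."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, activations):
        for layer in self.layers:
            activations = layer(activations)
        return activations

def run_pipeline(stages, inputs):
    acts = inputs
    for stage in stages:
        acts = stage.forward(acts)  # only activations move between stages
    return acts

# Split a 4-layer model into two 2-layer stages; result matches the
# monolithic model exactly.
layers = [make_layer(s) for s in (2.0, 0.5, 3.0, 1.0)]
full = Stage(layers)
split = [Stage(layers[:2]), Stage(layers[2:])]

x = [1.0, -2.0]
assert run_pipeline(split, x) == full.forward(x)
```

Real pipeline parallelism adds micro-batching and backward passes, but the structural point is the same: the split changes where data moves, not what is computed.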


Your DAG is big. It's stateless only for a single pass. The next pass doesn't operate on it anymore; it operates on a new, updated one produced by the previous step. And it contains fully connected sub-DAGs.
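The cross-step dependency described above can be sketched with a toy SGD loop: each step's graph may be stateless on its own, but step t+1 reads the parameters written by step t, so the steps chain together and cannot be evaluated independently. The objective f(w) = (w - 3)^2 is an arbitrary illustrative choice.

```python
# Minimal SGD on f(w) = (w - 3)^2 to show the sequential dependency:
# the per-step computation is stateless, but each step's output weights
# are the next step's input, so steps cannot run independently.

def grad(w):
    return 2.0 * (w - 3.0)  # d/dw of (w - 3)^2

def sgd(w, lr=0.1, steps=50):
    for _ in range(steps):
        w = w - lr * grad(w)  # updated parameters feed the *next* graph
    return w

w_final = sgd(0.0)
assert abs(w_final - 3.0) < 1e-3  # converges to the minimum at w = 3
```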

There is nothing stopping you from distributing individual CPU instructions across machines either, yet nobody does it, because it doesn't make sense from a performance perspective.

Or Amazon driving a truck from one depot to another to unload one package at a time, to "distribute" the unloading, because "distributed = faster".


Remember the famous Tanenbaum quote:

"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

It's only a question of scale, not of principle.



