
Your gradient descent is an operation on a directed acyclic graph. The graph itself is stateless. You can evaluate parts of the graph without access to the entire graph, particularly for transformers. In fact, this is already done today for both training and inference of large models. The transfer bottleneck applies to currently used model sizes and architectures. There's nothing to stop you from building a model so complex that compute itself becomes the bottleneck rather than data transfer. Except for its ultimate usability, of course, as I already mentioned.
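The "parts of the graph without the entire graph" idea can be sketched as pipeline-style splitting: each worker holds only one stage (a slice of layers) of a layered model, and only activations cross stage boundaries. All names below (`Stage`, `run_pipeline`, the toy layers) are illustrative, not any real framework's API.

```python
# Minimal sketch, assuming a purely layered (sequential) model: each Stage
# owns a contiguous slice of layers and never sees the rest of the graph.
# Only activations cross stage boundaries -- that inter-stage transfer is
# what makes bandwidth (not principle) the limiting factor.

def make_layer(scale):
    # Toy "layer": an elementwise affine map standing in for a real layer.
    return lambda xs: [scale * x + 1.0 for x in xs]

class Stage:
    """Owns a slice of layers; a worker holding this needs nothing else."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, activations):
        for layer in self.layers:
            activations = layer(activations)
        return activations

def run_pipeline(stages, inputs):
    acts = inputs
    for stage in stages:
        acts = stage.forward(acts)  # only activations move between stages
    return acts

# Split a 4-layer model into two 2-layer stages; result matches the
# monolithic model exactly.
layers = [make_layer(s) for s in (2.0, 0.5, 3.0, 1.0)]
full = Stage(layers)
split = [Stage(layers[:2]), Stage(layers[2:])]

x = [1.0, -2.0]
assert run_pipeline(split, x) == full.forward(x)
```

Real pipeline parallelism adds micro-batching and backward passes, but the structural point is the same: the split changes where data moves, not what is computed.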


Your DAG is big. It's stateless only for a single pass. The next pass doesn't operate on it anymore; it operates on a new, updated one produced by the previous step. And it contains fully connected sub-DAGs.
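The cross-step dependency described above can be sketched with a toy SGD loop: each step's graph may be stateless on its own, but step t+1 reads the parameters written by step t, so the steps chain together and cannot be evaluated independently. The objective f(w) = (w - 3)^2 is an arbitrary illustrative choice.

```python
# Minimal SGD on f(w) = (w - 3)^2 to show the sequential dependency:
# the per-step computation is stateless, but each step's output weights
# are the next step's input, so steps cannot run independently.

def grad(w):
    return 2.0 * (w - 3.0)  # d/dw of (w - 3)^2

def sgd(w, lr=0.1, steps=50):
    for _ in range(steps):
        w = w - lr * grad(w)  # updated parameters feed the *next* graph
    return w

w_final = sgd(0.0)
assert abs(w_final - 3.0) < 1e-3  # converges to the minimum at w = 3
```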

There is nothing stopping you from distributing individual CPU instructions across machines either, yet nobody does it, because it doesn't make sense from a performance perspective.

Or Amazon driving a truck from one depot to another to unload one package at a time, to "distribute" the unloading, because "distributed = faster".


Remember the famous Tanenbaum quote:

"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

It's only a question of scale, not of principle.



