But if we think of mixture-of-experts models outperforming "monolithic" models, why not? Maybe instead of 8 you can do 1000, and that is easy to parallelize. It sounds worth exploring to me.
I think MoE models are trained together just like any other network though, including the dispatcher layer that has to learn which "expert" to route each token to. Perhaps you could do some kind of technically worse model architecture that is trained separately, and then a more complex dispatcher that learns to utilize the individually trained experts as best it can?
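If it helps to picture that dispatcher-over-frozen-experts idea, here is a minimal PyTorch-style sketch under my own assumptions (all names hypothetical; the experts are assumed to be already trained and are kept frozen, so only the gate learns):

    # Minimal sketch: train only a router on top of separately trained, frozen experts.
    import torch
    import torch.nn as nn

    class FrozenExpertRouter(nn.Module):
        def __init__(self, experts, hidden_dim, top_k=1):
            super().__init__()
            self.experts = nn.ModuleList(experts)
            for p in self.experts.parameters():
                p.requires_grad = False                      # experts were trained separately; keep them frozen
            self.gate = nn.Linear(hidden_dim, len(experts))  # the only trainable part
            self.top_k = top_k

        def forward(self, x):
            # x: (batch, hidden_dim) token representations
            scores = self.gate(x).softmax(dim=-1)            # (batch, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
            return out

In a real per-layer MoE the routing happens inside every transformer block, but the shape of the idea is the same: the experts' weights never change, and gradients only flow into the gate.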
During MoE training you still need access to all weights.
1k experts at 7B params each would mean ~30 TB of state to juggle. Training and inference are infeasible at that size.
If you wanted to keep the total size while increasing the number of experts, you'd go from 7B experts down to ~56M each. What kind of computation can you do with a 56M model? Remember that an expert in an MoE runs the whole inference without consulting or otherwise reusing any information from the other experts; a thin network at the top just routes to one of them. At this small size those are not "experts" anymore, it'd be more like a Mixture of Idiots.
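Rough back-of-the-envelope behind those numbers (my assumptions: fp32 state, and "keeping the size" meaning a Mixtral-style 8x7B total budget split 1000 ways):

    # Back-of-the-envelope only; assumes fp32 weights and ignores optimizer state.
    n_experts = 1000
    bytes_per_param = 4                              # fp32

    # 1000 experts at 7B params each:
    print(n_experts * 7e9 * bytes_per_param / 1e12)  # ~28 TB, i.e. the ~30 TB above

    # Keep a Mixtral-style 8x7B total budget but split it 1000 ways:
    print(8 * 7e9 / n_experts / 1e6)                 # ~56M params per expert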
To put it another way, MoE is an optimization technique with a low scaling ceiling; it's more of a local-maximum solution than a global one (the idea quickly works against you if you push further in that direction).
I don't think MoE allows for that either. You'd have to come up with a whole new architecture that allows parts to be trained independently and still somehow be merged together in the end.
Cool paper. It's more independent than dense or normal MoE, but I think it's still far from the distributed training you're looking for, because you still need a seed LM that is trained normally, and when fine-tuning each expert from that seed LM you still need enough GPUs/VRAM to fine-tune the whole LLM. So you're still limited to large GPU clusters, which is the problem we're trying to avoid.
In the case of the paper, they use OPT-6.7B as the seed LM, which requires 8x V100 GPUs to fine-tune each expert. That's a combined 256 GB of VRAM for a single expert, while the 3090 only has 24 GB of VRAM and is still one of the most expensive consumer GPUs out there.
Maybe we could use something like PEFT or QLoRA in combination with this technique to make each expert small enough for the community to fine-tune and make a worse Mixtral 8x7b, but I don't know enough to say for sure.
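To make the "small enough to fine-tune at home" part concrete, this is roughly what a QLoRA-style fine-tune of one expert could look like with the Hugging Face transformers/peft/bitsandbytes stack. Just an illustrative sketch under my own assumptions; the hyperparameters and target modules are placeholders, not anything from the paper:

    # Illustrative only: QLoRA-style fine-tuning of one "expert" on a single consumer GPU.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                       # 4-bit base weights (the "Q" in QLoRA)
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    base = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-6.7b",                     # the seed LM used in the paper
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],     # which projections get adapters (placeholder choice)
        task_type="CAUSAL_LM",
    )
    expert = get_peft_model(base, lora_config)   # only the small LoRA matrices are trained
    expert.print_trainable_parameters()

The nice part is that the trainable LoRA weights are tiny compared to the 6.7B base, so each contributor would only need to ship their adapter rather than a full checkpoint.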
Or maybe it turns out we can make a good MoE model with thousands of smaller experts, each small enough for a separate member of the community to independently fine-tune on a normal GPU, but idk.
To get an LLM that is both performant and trained from scratch in a distributed way, we still need a completely different architecture, but this work is pretty cool and may mean that, if nothing else, there is something the community can do to help move things forward.
Also, I was going to say the MoE routing on this technique was lacking, but I found a more recent paper[0] by Meta which fixes this with a final fine-tuning stage.
The base model was still trained in the usual, non-distributed way (by far the biggest cost).
The fine-tunes were also trained in the usual, non-distributed way.
The proposed approach tries out several combinations to pick one that seems to perform better (where a "combination" means e.g. some ad hoc per-layer operation).
The merging step is not distributed either.
There is not much distribution happening overall beyond the fact that the fine-tunes were trained independently.
Taking weight averages, weighted weight averages, trimming low-magnitude diffs, doing arithmetic (subtracting the base model from the fine-tune), etc. are all ad hoc trials: throwing things at the wall and seeing what sticks. None of those work well.
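For reference, those merge recipes are about as simple as they sound. Roughly, on raw state dicts (an illustrative sketch, not any particular library's implementation):

    # Illustrative sketch of the merge recipes mentioned above, on raw state_dicts.
    import torch

    def weight_average(state_dicts, weights=None):
        """Plain or weighted average of fine-tuned checkpoints."""
        n = len(state_dicts)
        weights = weights or [1.0 / n] * n
        return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
                for k in state_dicts[0]}

    def task_arithmetic(base_sd, finetuned_sds, scale=1.0, trim_threshold=0.0):
        """Subtract the base model to get per-finetune deltas, optionally drop
        low-magnitude diffs, then add the summed deltas back onto the base."""
        merged = {}
        for k in base_sd:
            delta = sum(sd[k] - base_sd[k] for sd in finetuned_sds)
            if trim_threshold > 0:
                delta = torch.where(delta.abs() < trim_threshold,
                                    torch.zeros_like(delta), delta)
            merged[k] = base_sd[k] + scale * delta
        return merged

There is no principled reason any of this should preserve what the individual fine-tunes learned; it just sometimes happens to.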
For distributed training to work we'd have to have better algebra around this multidimensional/multilayer/multiconnectivity state. We don't have it, and there are many problems, e.g. evaluation is way too expensive. But solving the "no need to rerun the whole training/benchmark corpus to see whether my tiny change is better or not" problem would mean we've solved the problem of extracting the essence of intelligence. And if we do that, hyper-efficient data centers will still keep beating any distributed approach, and the whole question becomes largely irrelevant, because that's pure AGI already.