During MoE training you still need access to all of the weights.
1k experts at 7B parameters each would mean roughly 30 TB of state to juggle. Training and inference are infeasible at that size.
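A rough back-of-the-envelope check of that figure (my assumption: ~4 bytes of state per parameter; the 1,000-expert count and 7B-per-expert size are taken from above):

```python
# Back-of-the-envelope memory estimate for a hypothetical MoE
# with 1,000 experts of 7B parameters each.
# Assumption: ~4 bytes of state per parameter (e.g. fp32 weights,
# or fp16 weights plus partial optimizer state) -- not from the original text.

PARAMS_PER_EXPERT = 7e9
NUM_EXPERTS = 1_000
BYTES_PER_PARAM = 4

total_params = PARAMS_PER_EXPERT * NUM_EXPERTS
total_bytes = total_params * BYTES_PER_PARAM

print(f"Total parameters: {total_params:.1e}")        # 7.0e+12
print(f"Total state: {total_bytes / 1e12:.0f} TB")    # ~28 TB, i.e. roughly 30 TB
```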
If you want to keep the total size fixed while increasing the number of experts, each expert shrinks from 7B to roughly 56M parameters. What kind of computation can you do with a 56M model? Remember that an expert in an MoE runs the whole inference without consulting or otherwise reusing any information from the other experts; a thin network at the top just routes the input to one of them (see the sketch below). At that size they're not really "experts" anymore, it'd be more like a Mixture of Idiots.
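A minimal sketch of that routing picture: a thin gating layer scores the experts and a single expert handles the input on its own. Shapes, names, and the toy gating network are illustrative assumptions, not anything from the original text.

```python
import numpy as np

# Minimal sketch of top-1 routing: a thin gating network scores the
# experts and the single highest-scoring expert processes the input alone.
# All sizes here are toy values chosen for illustration.

rng = np.random.default_rng(0)

NUM_EXPERTS = 8
D_MODEL = 16

# Thin router: one linear layer producing a score per expert.
router_w = rng.normal(size=(D_MODEL, NUM_EXPERTS))

# Each "expert" is an independent weight matrix; they share nothing.
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route the input to a single expert; no information flows
    between experts."""
    scores = x @ router_w            # one score per expert
    chosen = int(np.argmax(scores))  # top-1 routing decision
    return x @ experts[chosen]       # only the chosen expert runs

x = rng.normal(size=(D_MODEL,))
y = moe_forward(x)
print(y.shape)  # (16,)
```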
To put it another way, MoE is an optimization technique with a low scaling ceiling; it's more of a local-maximum solution than a global one, and the idea works against you quickly if you try to push much further in that direction.