Mmh not really. As OP shows, speed increases with larger batch size, but only initially, until the GPU reaches high enough utilization; then the speed improvements flatten out (although you might hit OOM before that and never "really" see the flat part). Using a smaller batch size increases _noise_, so it quite literally decreases stability. That can sometimes be desirable: in the limit case where the batch is as large as your training set, the gradient has no noise at all, so you can fall into a local minimum and never get out of it. But that mostly applies to toy datasets like MNIST; here it's an entirely different beast.
With corpora as large (and as noisy) as the ones used here, gradient updates become very noisy, and that can harm quality. In any case, common lore is that you need a pretty large batch size for the language model to improve steadily.
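To make the noise point concrete, here's a quick toy sketch of my own (not from the paper, and nothing to do with OP's actual setup): it estimates how far a mini-batch gradient deviates from the full-batch gradient on a random linear-regression problem, at a few batch sizes. It assumes PyTorch; the problem size and batch sizes are just placeholders.

```python
# Toy sketch: how noisy is the mini-batch gradient at different batch sizes?
# Random linear-regression problem; assumes PyTorch is installed.
import torch

torch.manual_seed(0)
N, D = 100_000, 32
X = torch.randn(N, D)
true_w = torch.randn(D)
y = X @ true_w + 0.5 * torch.randn(N)

w = torch.zeros(D, requires_grad=True)

def batch_grad(idx):
    """Gradient of the MSE loss on the samples in `idx`, w.r.t. w."""
    pred = X[idx] @ w
    loss = ((pred - y[idx]) ** 2).mean()
    (g,) = torch.autograd.grad(loss, w)
    return g

full_grad = batch_grad(torch.arange(N))  # the "noise-free" full-batch gradient

for bs in (32, 256, 2048, 16384):
    # Sample many mini-batches and measure how far their gradient
    # estimates land from the full-batch one.
    devs = []
    for _ in range(200):
        idx = torch.randint(0, N, (bs,))
        devs.append((batch_grad(idx) - full_grad).norm())
    devs = torch.stack(devs)
    print(f"batch={bs:6d}  mean |g_batch - g_full| = {devs.mean():.4f}")
```

The printed deviation should shrink roughly by half for every 4x increase in batch size, i.e. the usual ~1/sqrt(batch size) noise scaling; that's all I mean by "smaller batch = noisier updates".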
Sorry, I only just opened that file and browsed through it very quickly, but my eye fell on this excerpt:
```
However, we did not observe any speedup by increasing the batch size from
65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage.
```
which I think is more or less my point: increasing the batch size essentially always helps, but with diminishing returns. Provided your dataset is large enough, a larger batch size will make you run a bit faster without sacrificing accuracy, but the speedup shrinks the further you push the batch size, until you're maxing out the GPU anyway and can't measure any speedup at all.
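If you want to see that flattening on your own hardware, something like the rough benchmark below works; it's my own sketch (not OP's setup), assuming PyTorch with a CUDA GPU and a made-up MLP whose sizes and batch sizes are just placeholders.

```python
# Rough throughput sketch: time forward+backward of a made-up MLP at
# growing batch sizes and report samples/sec. Assumes a CUDA device.
import time
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def samples_per_sec(batch_size, steps=20):
    x = torch.randn(batch_size, 1024, device=device)
    # Warm-up so kernel compilation / allocator overhead isn't timed.
    for _ in range(3):
        model(x).sum().backward()
        opt.step(); opt.zero_grad()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(steps):
        model(x).sum().backward()
        opt.step(); opt.zero_grad()
    torch.cuda.synchronize()
    return batch_size * steps / (time.perf_counter() - t0)

for bs in (8, 32, 128, 512, 2048, 8192):
    print(f"batch={bs:5d}  {samples_per_sec(bs):12.0f} samples/sec")
```

Typically samples/sec climbs steeply at small batch sizes and then plateaus once the GPU is saturated; past that point bigger batches just burn memory (and eventually OOM) without buying you any measurable speed.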