Maybe the real game changer in the future will be the ability to train the same model on very different kinds of inputs: video, images, text, audio... Imagine all the data cleaning tasks are already automated too: you just feed the model PDFs and a support model automatically extracts all the relevant metadata... or maybe you'll just be able to select a set of books from an online library and your model will train on them as well (for a non-trivial subscription, of course lol)
10e6 * 400e6 / 8e9 / 365 / 18 ≈ 76 images per person per waking hour. That's not implausible given how many cameras there are and how many moments people might snap to share with remote friends; I can easily believe we'll have always-on video chat with multiple people in AR glasses by that point.
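A quick sanity check of that Fermi estimate, spelled out. Note the interpretation of the figures is my own reading (400e6 as the size of a current image dataset, 10e6 as a scaling multiplier on it, 8e9 as world population, one year of production), not something stated in the thread:

```python
# Back-of-envelope check of the estimate above.
# Assumptions (my reading of the figures, not stated explicitly):
#   - 400e6: size of a current image dataset (roughly LAION-400M scale)
#   - 10e6:  hypothetical scaling multiplier on that dataset
#   - 8e9:   world population
#   - all images produced within one year: 365 days, 18 waking hours/day

total_images = 10e6 * 400e6          # 4e15 images needed
per_person = total_images / 8e9      # 500,000 images per person
per_day = per_person / 365           # ~1,370 per person per day
per_waking_hour = per_day / 18       # ~76 per person per waking hour

print(f"{per_waking_hour:.1f} images per person per waking hour")
# -> 76.1
```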
Most images are not shared though; just snapped. In the past you had photo albums no one ever looked in. And those weren't that many pics; now people (old and young) take 100s of pictures whenever, on iPhones often by holding the button so it snaps 100s of them in a few seconds.
The generated results can come from other means as well: for example, pretraining on rendered CG imagery is quite popular in the computer vision world, especially for problems where acquiring ground-truth data in the real world is difficult.
Will there really be 10 million times 400 million images floating around by then?