So say I have a site with 3,000 images, 2 MP each. How many GPU-months would it take to mark them? And how many gigabytes would I have to keep around for the model?
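Rough back-of-envelope (the throughput number here is completely made up, just to show the arithmetic; plug in whatever you measure on your own GPU):

    # Hypothetical watermarking cost estimate. assumed_imgs_per_sec is
    # an assumption, not a number from the paper.
    num_images = 3_000
    assumed_imgs_per_sec = 2.0  # e.g. ~0.5 s per 2 MP image on one GPU
    total_gpu_hours = num_images / assumed_imgs_per_sec / 3600
    print(f"~{total_gpu_hours:.2f} GPU-hours")  # ~0.42 GPU-hours at this rate

Even if the real per-image time is 10x worse, inference for 3,000 images is GPU-hours, not GPU-months; the big numbers in the paper are training cost, not marking cost.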
The GPU-time figures in the paper are for all the experiments, not just the training run for the final open-sourced model (which is what usually gets reported). People don't one-shot the final model.
Yes, although parameter count isn't directly tied to the FLOPs/speed of inference. What's nice about this AE architecture is that most of the compute (message embedding and merging) happens at low resolution, the same idea as behind latent diffusion models.
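To make the "work at low resolution" idea concrete, here's a minimal sketch of that pattern (shapes, layer choices, and message width are all my assumptions, not the paper's actual architecture):

    # Sketch: embed a watermark message into an image, doing the
    # message merging at 1/8 resolution. Not the paper's real model.
    import torch
    import torch.nn as nn

    class LowResWatermarker(nn.Module):
        def __init__(self, msg_bits: int = 48, latent_ch: int = 64):
            super().__init__()
            # Downsample 8x so the expensive work runs on a small tensor.
            self.encode = nn.Sequential(
                nn.Conv2d(3, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(latent_ch, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(latent_ch, latent_ch, 4, stride=2, padding=1),
            )
            # Project the message bits so they can be merged with the latent.
            self.msg_proj = nn.Linear(msg_bits, latent_ch)
            self.merge = nn.Conv2d(latent_ch, latent_ch, 3, padding=1)
            # Upsample back to full resolution to get a watermark residual.
            self.decode = nn.Sequential(
                nn.ConvTranspose2d(latent_ch, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(latent_ch, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(latent_ch, 3, 4, stride=2, padding=1),
            )

        def forward(self, img: torch.Tensor, msg: torch.Tensor) -> torch.Tensor:
            z = self.encode(img)                      # (B, C, H/8, W/8)
            m = self.msg_proj(msg)[:, :, None, None]  # broadcast message over space
            z = self.merge(z + m)                     # merging happens at low res
            return img + self.decode(z)               # add residual to the input

The point: the message-conditioned compute touches 64x fewer pixels than the input, so per-image cost at 2 MP is dominated by the cheap down/up-sampling convs, exactly the trick latent diffusion models use.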