Valid point. Conventional codecs draw things on screen that are not in the original, too, but we are used to low-quality images and videos, and have learned to unconsciously ignore the block edges and smudges. NN models "recover" far more complex and plausible-looking features. It is possible that some future general-purpose image compressor would do to small details what lossy JBIG2 did to digits: silently substitute a clean but wrong reconstruction.
How do we know whether the image actually contains 16 fingers, or it just looks like 16 fingers to us?
I looked at the bear example above, and I couldn't tell whether the AI thought there was an animal face embedded in the fur, or whether we are just seeing a face in the fur ourselves. We see all kinds of faces on toast, even though neither the bread slicers nor the toasters intend to create them.