At some point autoencoding during training will take care of that, it seems. We kind of already do it with 'tutor' training and GPT-generated training data, as well as synthetic logic and math problem sets.
Oh, having language models sit on top of autoencoders is 100% the right way to go, even if we weren't moving towards multi-modal models, just because right now LLMs can be brilliant in one language and hopeless in another depending on the training data. Putting an autoencoder in front and working with a "language-agnostic" encoding would solve that problem.
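To make the idea concrete, here's a minimal sketch of what "LLM on top of an autoencoder" could look like, written in PyTorch purely as an assumption; the class names (LatentAutoencoder, LatentLM), layer sizes, and vocab size are all illustrative, not anyone's actual design. The point is just the split: a per-language encoder/decoder maps tokens into a shared latent space, and the language model reasons over those latents instead of over tokens.

```python
import torch
import torch.nn as nn

class LatentAutoencoder(nn.Module):
    """Hypothetical: maps token sequences into a shared latent space and back."""
    def __init__(self, vocab_size: int, d_latent: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_latent)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_latent, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.decoder_head = nn.Linear(d_latent, vocab_size)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Language-specific tokens -> (ideally) language-agnostic latent vectors
        return self.encoder(self.embed(token_ids))

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # Latent vectors -> logits over the target language's vocabulary
        return self.decoder_head(latents)


class LatentLM(nn.Module):
    """Hypothetical: a language model that operates on latents, not tokens."""
    def __init__(self, d_latent: int = 512):
        super().__init__()
        self.core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_latent, nhead=8, batch_first=True),
            num_layers=12,
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.core(latents)


# Usage: encode in the source language, reason in latent space, decode anywhere.
ae = LatentAutoencoder(vocab_size=32000)
lm = LatentLM()
tokens = torch.randint(0, 32000, (1, 16))   # some tokenized input sentence
latents = ae.encode(tokens)                  # shared, language-agnostic representation
next_latents = lm(latents)                   # the LLM "thinks" in latent space
logits = ae.decode(next_latents)             # project back out to any vocabulary
```

Under this split, the per-language knowledge lives in the autoencoder, so the LLM's capability wouldn't depend on how much training data existed in any particular language.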