Some people still think that LLMs are just word predictors. Technically, they are not. First, transformer architectures don't process words: they process semantic representations stored as vectors, or embeddings, in a continuous space.
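To make that concrete, here is a minimal sketch of the word-to-vector step. It assumes the Hugging Face `transformers` library and uses GPT-2 purely as an illustrative model; nothing in the original post ties the argument to that specific model.

```python
# Sketch: discrete words -> discrete token ids -> continuous embedding vectors.
# GPT-2 is an arbitrary example model; any transformer works the same way here.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# The tokenizer turns text into discrete integer ids, not words.
ids = tokenizer("The cat sat", return_tensors="pt")["input_ids"]
print(ids.shape)        # one integer id per token

# Each id is then looked up in a continuous embedding matrix.
embeddings = model.get_input_embeddings()(ids)
print(embeddings.shape)  # e.g. (1, n_tokens, 768) for GPT-2 small
```

From this point on, the model never sees the words again, only these continuous vectors.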
What a lot of people don't understand is that in an LLM, we go from discrete values (the tokens) to continuous values (the embeddings) that the transformer takes as input. A transformer is an enormous parametric function, stacks of matrix multiplications and nonlinearities, that projects into this latent embedding space. It doesn't generate a word per se, but a vector that is then compared against the embedding space to find the closest matches. The decoding step is usually not deterministic.
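A short sketch of that output end, under the same assumptions as above (Hugging Face `transformers`, GPT-2 as an arbitrary example; the temperature value is illustrative, not prescribed by the post):

```python
# Sketch: the model emits a continuous vector, which is scored against the
# output embedding matrix; sampling from those scores is what makes decoding
# non-deterministic.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt")["input_ids"]

with torch.no_grad():
    hidden = model.transformer(ids).last_hidden_state   # continuous vectors
    last = hidden[:, -1, :]                              # vector for the next position
    # "Compared against the embedding space": a dot product of that vector
    # with every row of the (tied) output embedding matrix, one score per token.
    logits = last @ model.get_output_embeddings().weight.T

# A deterministic decoder would take argmax(logits); in practice we usually
# sample, here with temperature 0.8, so two runs can produce different tokens.
probs = torch.softmax(logits / 0.8, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(next_id[0]))
```

The sampling step at the end is the non-deterministic part; everything before it is a fixed function of the input tokens.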
This enormous function is the reason we can't fully understand what is going on inside a transformer. It doesn't mimic human speech; it builds a huge representation of the world, which it uses to respond to a query. This is not a conceptual graph, and it is not a mere semantic representation. It is a distillation of all the data it ingested. And each model is unique, because training is split over hundreds of GPUs, with no control over which GPU churns through which part of the dataset, or in which order.