That’s false. Larger LLMs learn token decompositions through their training, and in fact modern training pipelines are designed to occasionally produce uncommon tokenizations (including splitting words into individual characters) for this reason. Frontier models have no trouble spelling words even without tools. Even many mid-sized models can do that.
Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does/would that enable token introspection?
Because LLMs can learn from training context that different token sequences represent the same character sequence, just as they learn much more complex patterns from context.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
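If you want to see the kind of tokenization variation being described, here's a minimal sketch using SentencePiece's subword regularization (Kudo, 2018), one published technique of this sort; the toy corpus, vocab size, and sampling parameters are invented purely for illustration:

    import sentencepiece as spm

    # Toy corpus and vocab size, chosen only to keep the example self-contained.
    corpus = ["the quick brown fox jumps over the lazy dog"] * 200
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(corpus), model_prefix="toy",
        vocab_size=40, model_type="unigram")

    sp = spm.SentencePieceProcessor(model_file="toy.model")
    for _ in range(3):
        # enable_sampling draws a different segmentation on each call, so
        # training data built this way shows the model many different token
        # decompositions of the same character sequence.
        print(sp.encode("the quick fox", out_type=str,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))

Each print will show a different split of the same string; seeing enough of those pairs is what lets a model associate distinct token sequences with one underlying character sequence.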
> (so infrastructure for clean water and all chemicals)
Fabs are among the most complex chemical engineering sites in the world, handling some of the most dangerous substances there are. So don't underestimate the complexity of this part.
Every year I ask the latest version of ChatGPT a basic factual question about rugby results. It almost always gets it wrong - even when it does a web search and cites sources. Wrong scores, hallucinated matches, wrong locations - just gobsmacking amounts of wrongness.
The latest "Thinking" version gets it reliably right, but it spent about 3 minutes coming up with an answer that 10 seconds of googling provides.
So I don't believe LLMs are currently an effective replacement for search engines.
VisiCalc was "the" killer app for early micros, but being able to edit text on screen and then print it with letter-quality output was nothing to sneeze at, either. That was plausibly a key efficiency gain for the service sector, perhaps comparable to the 10-25% now being talked about re: LLMs (which is huge on a secular basis).