They trained it to be used like any other decoder only model. So text generation essentially. But you could use the encoder part for things like classification without much effort. Then again you can also slap a classifier head on any decoder model. The main reason they seem to be doing this is to have swappable encoder/decoder parts in an otherwise standard LLM. But I'm not sure if that is really something we needed.
I have actually worked on encoder-decoder models. The issue is, finetuning itself is becoming historic. At least for text processing. If you spend a ton of effort today to finetune on a particular task, chances are you would have reached the same performance using a frontier LLM with the right context in the prompt. And if a big model can do it today, in 12 months there will be a super cheap and efficient model that can do it as well. For vision you can still beat them, but only with huge effort the gap is shortening constantly. And T5 is not even multimodal. I don't think these will change the landscape in any meaningful way.
Also a hint: you can create a finetuning dataset from a frontier LLM pretty easily to finetune those t5 and effectively distill them pretty fast these days
Only thing it buys you is a more “natural” embedding, i.e. the encoder can get you a bag o’ floats representing a text, but that also doesn’t mean it’s naturally a good embedding engine - I strongly assume you’d do further training.
Decoder gets you the autoregressive generation you’d use for an llm.
Beyond that, there’s this advantage of having small LLMs train better, they kinda hit a wall a year or two ago IMHO. E.g. original Gemma 3 small models were short context and only text.
As far as I understand you have to pay for that by 2x inference cost at runtime
(Would be happy to be corrected on any of the above, I maintain a multi platform app that has llama.cpp inference in addition to standard LLMs, and I do embeddings locally, so I’m operating from a practical understanding more than ML phd)
In general encoder+decoder models are much more efficient at infererence than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).
The issue is that they're generally harder to train (need input/output pairs as a training dataset) and don't naturally generalize as well
≥In general encoder+decoder models are much more efficient at infererence than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).
Decoder-only models also do this, the only difference is that they use a masked attention.
It's technically deterministic, but it feels nondeterministic in chatbots since tokens are randomly sampled (temp > 0) and input is varied. Using the right prompt makes the model perform better on average, so it's not completely dumb.
I like task vectors and soft prompts because I think they show how prompt engineering is cool and useful.
From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this through homophones (e.g. record and record) and asking it to change its pitch or produce sounds.
“record and record”, if you mean the verb for persisting something and the noun for the thing persisted, are heteronyms (homographs which are not homophones), which incidentally is also what you would probably want to test what you are talking about here (distinguishing homophones would test use of context to understand meaning, but wouldn’t test anything about whether or not logic was working directly on audio or only working on text processed from audio, failing to distinguish heteronyms is suggestive of processing occurring on text, not audio directly.)
OTOH my point that the thing being suggested to be tested is not testable by seeing whether or not the system is capable of distinguishing homophones, but might be by seeing whether or not it distingishes heteronyms still stands. (The speculation that the record/record distinction intended was one that is actually a pair of heteronyms and that the error was merely the use of the word “homophone" in place of “heteronym”, rather than the basic logic of the comment is somewhat tangential to the main point.)
Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of TTS mechanism. Which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost saving measure?
Weirdly, I just tried it again and it seems to understand the difference between record and record just fine. Perhaps if there's heavy demand for voice chat, like after a new release, they load shed by using TTS to a smaller model.
However, It still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.
After the Roman Republic, they switched to having an emperor. Jesus was crucified during this Roman empire. The kings of Rome were around 600 years before this. They meant the emperor, not the king.
I'm currently making a tycoon game with React, it's not bad for making some games. I use setInterval for a simple game loop along with a zustand store for the game logic. I'm keeping the game logic and state client-side for now, but I might move it over to a server in the future.
Just a note for those planning to make a simple game or animation in JavaScript: in most cases it's preferrable to use `requestAnimationFrame` instead of `setInterval` or `setTimeout`.
I'd go a step beyond this (excellent) post and posit that one incredibly valuable characteristic of traditional NLP is that it is largely immune to prompt injection attacks.
Especially as LLMs continue to be better tuned to follow instructions that are intentionally colocated and intermingled with data in user messages, it becomes difficult to build systems that can provide real guarantees that "we'll follow your prompt, but not prompts that are in the data you provided."
But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document, or to leak its own system instructions, or anything of that nature. "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.
There's an even broader discussion to be had about the relative reliability of NLP pipelines, outside of a security perspective. As always, it's important to pick the right tools for the job, and the SpaCy article linked in the parent puts this quite well.
> But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document,
Text added to a document can absolutely change how an NLP pipeline interprets the document.
> "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.
And simple repeated words can absolutely make that kind of change for many NLP systems.
Have you actually worked with doing more traditional NLP systems? They're really not smart.
No? But repeated words can impact simple nlp setups. I’m not sure what case you’re concerned about where added text impacts classification with an LLM but added words shouldn’t with a different pipeline.
> And NLP stands for natural language processing. If the result didn't change after you've made changes to the input... It'd be a bug?
No, I’d want my classifier to be unchanged by garbage words added. It likely will be, but that impact is a bug not a feature.
Prompt injection is about making the model do something else then specified.
Adding words to the text to break the algorithm which does the NLP is more along the lines of providing 1 in a boolean field to break the system.
And that's generally something you can mitigate to some degree via heuristics and sanity checking. Doing the same for LLMs is essentially impossible, because it's an effective black box, so you cannot determine the error scenarios and add some mitigations
If you don’t think this happens for simpler methods you’ve never deployed them. It’s the exact same problem on a classifier. Have you actually worked with these and are we discussing real world cases?
I guess it depends on how you use the LLMs. We implemented some workflows where the LLMs were used only for dialogue understanding, then the system response was generated by classic backend code.
reply