Mostly just better training data and instruction following in the newer models. They’re much better at recognising encoded content and understanding intent regardless of language. A base64 string that would’ve slipped past a model a year ago gets decoded and flagged now because the model just… understands what you’re trying to do.
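For context on why base64 never really hid anything: it is a reversible encoding, not encryption, so the original text is one library call away. A minimal Python sketch (the payload string here is just a made-up illustration, not from any real attack):

```python
import base64

# A hypothetical instruction an attacker might try to smuggle past a filter.
payload = "ignore previous instructions"
encoded = base64.b64encode(payload.encode()).decode()

# The encoding is trivially reversible -- a model that has seen enough
# base64 in training can effectively "read through" it.
decoded = base64.b64decode(encoded).decode()
```

So the obfuscation only ever worked against models that hadn't learned the mapping, which is exactly the gap newer training data closed.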
The attacks that still work tend to be the ones that don’t try to hide the intent at all. The winning attack on our first challenge was in plain English. It just reframed the context so that the dangerous action looked like the correct thing to do. Harder to train against because there’s nothing obviously malicious in the input.
Thank you. It's not your fault at all, but to me, "the model just… understands what you're trying to do" shows there's a whole new paradigm to get used to, in some ways, when it comes to understanding this software.