Mostly just better training data and instruction following in the newer models. They’re much better at recognising encoded content and understanding intent regardless of language. A base64 string that would’ve slipped past a model a year ago gets decoded and flagged now because the model just… understands what you’re trying to do.
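For context on why base64 never really hid anything: it is a reversible encoding, not encryption, so the original text is one library call away. A minimal Python sketch (the payload string here is just a made-up illustration, not from any real attack):

```python
import base64

# A hypothetical instruction an attacker might try to smuggle past a filter.
payload = "ignore previous instructions"
encoded = base64.b64encode(payload.encode()).decode()

# The encoding is trivially reversible -- a model that has seen enough
# base64 in training can effectively "read through" it.
decoded = base64.b64decode(encoded).decode()
```

So the obfuscation only ever worked against models that hadn't learned the mapping, which is exactly the gap newer training data closed.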
The attacks that still work tend to be the ones that don’t try to hide the intent at all. The winning attack on our first challenge was in plain English. It just reframed the context so that the dangerous action looked like the correct thing to do. Harder to train against because there’s nothing obviously malicious in the input.
Thank you. It's not your fault at all, but to me, "the model just… understands what you're trying to do" shows there's a whole new paradigm to get used to, in some ways, when it comes to understanding this software.