Yup! But in my opinion the current state of guardrails is still lacking and I hope this is one way that helps improve our understanding of these systems.
The agent isn’t stateful across sessions, but the guardrail layer is — it has access to the full conversation history when evaluating each tool call. So you’d think it would catch exactly the kind of multi-step pattern you’re describing.
Scoped keys and least privilege make sense as a baseline. But I think the deeper issue is that if the main answer to “agents aren’t reliable enough” is “limit what they can do,” we’re leaving most of the value on the table. The whole promise of agents is that they can act autonomously across systems. If we scope everything down to the point where an agent can’t do damage, we’ve also scoped it down to where it can’t do much useful work either.
We think the more interesting problem is closing the trust gap - making the agent itself more reliable so you don't have to choose between autonomy and reliability. Our goal is ultimately to be able to take on the liability when agents fail.
Thanks for trying it out! Base64 encoding and language switching are solid approaches, but in my experience they no longer work against the latest models.
You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
Mostly just better training data and instruction following in the newer models. They’re much better at recognising encoded content and understanding intent regardless of language. A base64 string that would’ve slipped past a model a year ago gets decoded and flagged now because the model just… understands what you’re trying to do.
The attacks that still work tend to be the ones that don’t try to hide the intent at all. The winning attack on our first challenge was in plain English. It just reframed the context so that the dangerous action looked like the correct thing to do. Harder to train against because there’s nothing obviously malicious in the input.
Thank you. It's not your fault at all, but to me, "the model just… understands what you're trying to do" shows there's a whole new paradigm to get used to when it comes to understanding this software.
I found it more helpful to try to "steer" the LLM into self-correcting its actions when I detect misalignment. This generally improved our task completion rates by 20%.
Basically through two layers. Hard rules (token limits, tool allowlists, banned actions) trigger an immediate block - no steering, just stop. Soft rules use a lightweight evaluator model that scores each step against the original task intent. If it detects semantic drift over two consecutive steps, we inject a corrective prompt scoped to that specific workflow.
The key insight for us was that most failures weren't safety-critical; they were the agent losing context mid-task. A targeted nudge recovers those. Generic "stay on track" prompts don't work; the correction needs to reference the original goal and what specifically drifted.
Steer vs. kill comes down to reversibility. If no side effects have occurred yet, steer. If the agent already made an irreversible call or wrote bad data, kill.
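To make the two-layer flow concrete, here's a minimal sketch of how a hard-rule block plus a soft-rule drift counter could fit together. All names here (`score_alignment`, `DRIFT_THRESHOLD`, the tool lists, the step format) are illustrative assumptions, not the actual system:

```python
from dataclasses import dataclass

# Hypothetical sketch of the two-layer check described above.
BANNED_TOOLS = {"delete_account", "wire_transfer"}
MAX_TOKENS = 4096
DRIFT_THRESHOLD = 0.5

@dataclass
class GuardrailState:
    task_intent: str
    consecutive_drift: int = 0

def score_alignment(task_intent: str, step: dict) -> float:
    """Placeholder for the lightweight evaluator model:
    1.0 means perfectly aligned, 0.0 means complete drift."""
    # A real implementation would call a small scoring model here.
    return 1.0 if task_intent.split()[0] in step.get("summary", "") else 0.3

def check_step(state: GuardrailState, step: dict) -> str:
    # Layer 1: hard rules - immediate block, no steering.
    if step["tool"] in BANNED_TOOLS or step["tokens"] > MAX_TOKENS:
        return "block"
    # Layer 2: soft rules - track semantic drift across steps.
    if score_alignment(state.task_intent, step) < DRIFT_THRESHOLD:
        state.consecutive_drift += 1
    else:
        state.consecutive_drift = 0
    # Two consecutive drifting steps trigger a corrective prompt.
    if state.consecutive_drift >= 2:
        state.consecutive_drift = 0
        return "steer"  # inject a correction scoped to the workflow
    return "allow"
```

The point of the sketch is the control flow: hard rules short-circuit before any scoring happens, and the steer path only fires after drift persists across two steps.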
One thing I’m still unclear on: what runtime signal is the soft-rule evaluator actually binding to when it decides “semantic drift”?
In other words, what is the enforcement unit the policy is attached to in practice... a step, a plan node, a tool invocation, or the agent instance as a whole?
Tool invocation. Each time the agent emits a tool call, the evaluator assesses it against the original task intent plus a rolling window of recent tool results.
We tried coarser units (plan nodes, full steps), but drift compounds fast; by the time a step finishes, the agent may have already chained 3-4 bad calls. Tool-level gives the tightest correction loop.
The cost is ~200ms latency per invocation. For hot paths we sample (every 3rd call, or only on tool-category changes) rather than evaluate exhaustively.
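The sampling gate for hot paths could look something like this. The category mapping and the closure shape are my own assumptions for illustration:

```python
# Illustrative sketch: evaluate every Nth tool call, or
# immediately whenever the tool category changes.
TOOL_CATEGORIES = {
    "web_search": "read",
    "read_file": "read",
    "write_file": "write",
    "send_email": "external",
}

def make_sampler(every_n: int = 3):
    state = {"count": 0, "last_category": None}

    def should_evaluate(tool_name: str) -> bool:
        category = TOOL_CATEGORIES.get(tool_name, "unknown")
        state["count"] += 1
        changed = category != state["last_category"]
        state["last_category"] = category
        # Always evaluate on a category change; otherwise sample.
        return changed or state["count"] % every_n == 0
    return should_evaluate
```

Evaluating on every category change means the expensive check always runs at the moments where drift is most likely to start, while repeated same-category calls only pay the ~200ms cost one time in three.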
That makes sense - binding to the smallest viable control surface, and the sampling strategy for hot paths sounds like a pragmatic balance between latency and coverage. Thanks for the additional feedback here.
Some context: we kept finding that our internal red-teaming only covers so much - the attack surface for agents with real capabilities is too broad for any single team.
So we opened it up. A few things that might be interesting to folks here:
- These aren't toy prompts hiding a secret word. The agents have actual tool access and behave like production agents would.
- Anyone can propose a challenge - the scenario, the agent, the objective. Community votes on what goes live next.
We're genuinely looking for people to both break things and suggest ideas for what should be tested next. The agent runtime is being open-sourced separately.
The multi-step thing is exactly what makes agents with real tools so much harder to secure than chat-based setups. Each action looks fine in isolation; it's the sequence that's the problem. And most (but not all) guardrail systems are stateless - they evaluate each turn on its own.
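A toy example of the gap: every call below passes a per-call policy, but the sequence exfiltrates data. The tool names, the "secret in path" heuristic, and the trace format are all made up for illustration:

```python
# Each tool is individually allowed...
SAFE_TOOLS = {"read_file", "summarise", "send_email"}

def stateless_check(call: dict) -> bool:
    """Per-turn check: looks only at the current call."""
    return call["tool"] in SAFE_TOOLS

def stateful_check(history: list[dict], call: dict) -> bool:
    """Sequence-aware check: flags external sends that follow
    a read of sensitive data earlier in the trace."""
    read_sensitive = any(
        c["tool"] == "read_file" and "secret" in c["args"].get("path", "")
        for c in history
    )
    return stateless_check(call) and not (
        call["tool"] == "send_email" and read_sensitive
    )
```

The stateless check approves the final `send_email` because it has no memory of the earlier `read_file`; only the history-aware version catches the pattern.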
Yeah, the demo-to-production gap is massive. We see the same thing, with browser agents potentially being the most vulnerable. I think that's because the context gets stuffed with the page's HTML, which obscures small injection attempts.
Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.
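In sketch form, checking what the agent did rather than what it said is just a scan over the tool-call trace. The `reveal_access_code` tool name comes from the first challenge; the trace format here is an assumption:

```python
# Trace-based evaluation: inspect the agent's actual tool calls,
# not its response text.
def challenge_solved(trace: list[dict]) -> bool:
    """Return True if the agent was manipulated into calling
    the forbidden tool, regardless of what its reply says."""
    return any(call["tool"] == "reveal_access_code" for call in trace)
```

An agent that merely *talks about* the access code in its reply doesn't trigger this; only the actual tool invocation does, which is what makes it harder to game than text pattern-matching.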
What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?
Two techniques that keep working against agents with real tools:
Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.
Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.
Both are really simple. That's kind of the point.
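The second technique exploits a blind spot that's easy to show in code. A guardrail that only screens messages with `role == "user"` never sees what a fetched page injected; the fix is to scan tool results with the same scrutiny. The patterns and message format below are illustrative placeholders, not a real detector:

```python
import re

# Toy injection signatures - a real system would use a model, not regexes.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
]

def scan(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

def guard_message(message: dict) -> bool:
    """Screen user input AND tool results. Guardrails that only
    check role == 'user' miss instructions planted in pages the
    agent retrieves."""
    if message["role"] in ("user", "tool"):
        return scan(message["content"])
    return False
```

The structural point is the `"tool"` branch: anything an agent reads from the outside world is attacker-controlled input and needs to go through the same filter as the conversation itself.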
We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.