vincentvandeth's comments

Hard-coded checks before every action, plus a governance layer that separates "what the agent wants to do" from "what it's allowed to do." The deeper issue: if your agent decides whether to issue a refund, you're solving the wrong problem with prompt guards. A refund is a deterministic business rule — order exists, within return window, amount matches. That decision shouldn't be made by an LLM at all.

In my setup, agents propose actions and write structured reports. A deterministic quality advisory then runs — no LLM involved — producing a verdict (approve, hold, redispatch) based on pre-registered rules and open items. The agent can hallucinate all it wants inside its context window, but the only way its work reaches production is through a receipt that links output to a specific git commit, with a quality gate in between.

For anything with real consequences (database writes, API calls, refunds), the pattern is: LLM proposes → deterministic validator checks → human approves. The LLM never has direct write access to anything that matters.
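
A minimal sketch of that split for the refund case, in Python. The `Order` and `RefundProposal` shapes, the 30-day window, and the function name are assumptions for illustration; the thread only names the three rules (order exists, within return window, amount matches).

```python
from dataclasses import dataclass
from datetime import date, timedelta

RETURN_WINDOW = timedelta(days=30)  # assumed policy value, not from the thread


@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int
    purchased_on: date


@dataclass(frozen=True)
class RefundProposal:
    """What the LLM is allowed to produce: a proposal, never a write."""
    order_id: str
    amount_cents: int


def validate_refund(proposal: RefundProposal,
                    orders: dict[str, Order],
                    today: date) -> list[str]:
    """Return the violated rules; an empty list means the proposal is admissible."""
    order = orders.get(proposal.order_id)
    if order is None:
        return ["order does not exist"]
    problems = []
    if today - order.purchased_on > RETURN_WINDOW:
        problems.append("outside return window")
    if proposal.amount_cents != order.amount_cents:
        problems.append("amount does not match order")
    return problems
```

A proposal that comes back with an empty list would then go to a human approval queue; the model itself never holds credentials for the payment API.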

"Just hoping for the best" works until it doesn't. We tracked every agent decision in an append-only ledger — after a few hundred entries, you start seeing exactly where and how agents fail. That pattern data is more useful than any prompt guard.


> A refund is a deterministic business rule — order exists, within return window, amount matches. That decision shouldn't be made by an LLM at all.

I feel like this is the real key. LLMs are good at some things and bad at others. Deterministic logic (e.g. don't ever do "x") is one of the things they're bad at.


The separation between 'what the agent wants to do' and 'what it's allowed to do' is the right mental model.

The append-only ledger point is underrated too — pattern data from real failures is worth more than any upfront rule design.

How long did it take to build and maintain that governance layer? And as your agent evolves, do the rules keep up or is that becoming its own maintenance burden?


That’s exactly the mental split we’ve been leaning on.

The ledger part turned out to be more useful than we expected. Every freeze/reject event becomes a concrete example of where the agent tried to do something inadmissible, which is much more informative than hypothetical rule design.

On the governance layer: for us, keeping the core extremely small and deterministic is proving interesting. The gate itself doesn’t try to understand intent or policy; it only enforces mechanical invariants like sequencing, replay resistance and bounded actions.
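
To make the three invariants concrete, here is one way such a gate could look in Python. This is a sketch under assumptions: the `Gate` class, its fields, and the single numeric `ceiling` are hypothetical and stand in for whatever the real kernel tracks.

```python
from dataclasses import dataclass, field


@dataclass
class Gate:
    """Enforces mechanical invariants only; it knows nothing about intent."""
    ceiling: int                                   # bounded actions: max cumulative amount
    next_seq: int = 0                              # sequencing: the only number accepted next
    seen_nonces: set = field(default_factory=set)  # replay resistance
    spent: int = 0                                 # running total checked against the ceiling

    def admit(self, seq: int, nonce: str, amount: int) -> tuple[bool, str]:
        if seq != self.next_seq:
            return False, "out of sequence"
        if nonce in self.seen_nonces:
            return False, "replayed request"
        if amount < 0 or self.spent + amount > self.ceiling:
            return False, "exceeds ceiling"
        # All invariants hold: record the action and advance state.
        self.seen_nonces.add(nonce)
        self.spent += amount
        self.next_seq += 1
        return True, "admitted"
```

Changing `ceiling` for a new role is a constraint update; `admit` itself does not need to change.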

So when the agent evolves, we’re mostly not changing the kernel. What changes are the constraints around it (things like ceilings, roles, or context updates). That keeps the maintenance burden manageable because the core logic doesn’t grow with the agent’s complexity.

Early days, though. The real test will be how it behaves once the agents start doing more varied workflows.


About 6 months of iterating, but in bursts — I built it while using it on a production project, so the governance layer grew alongside real failure modes rather than being designed upfront.

The maintenance question is the right one. The rules themselves are low-maintenance because they're deliberately simple and deterministic — file size limits, test coverage thresholds, blocker counts. They don't need updating when the model changes because they don't depend on LLM behavior.
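
A rough Python sketch of what rules of that kind might look like. The threshold values, the `Report` fields, and the mapping from violations to verdicts (blockers hold, threshold misses redispatch) are assumptions; the thread names the kinds of rule and the verdict labels, not their details.

```python
from dataclasses import dataclass

# Assumed thresholds; the thread does not give the actual values.
MAX_FILE_LINES = 500
MIN_COVERAGE = 0.80


@dataclass(frozen=True)
class Report:
    largest_file_lines: int
    test_coverage: float   # 0.0 to 1.0
    open_blockers: int


def advise(report: Report) -> tuple[str, list[str]]:
    """Return (verdict, reasons) from fixed rules only; no model is consulted."""
    if report.open_blockers > 0:
        return "hold", [f"{report.open_blockers} open blocker(s)"]
    reasons = []
    if report.largest_file_lines > MAX_FILE_LINES:
        reasons.append("file size limit exceeded")
    if report.test_coverage < MIN_COVERAGE:
        reasons.append("test coverage below threshold")
    if reasons:
        return "redispatch", reasons
    return "approve", []
```

Nothing here depends on which model produced the report, which is why swapping models would not require touching the rules.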

What does evolve is the dispatch templates — how I scope tasks and what context I give agents upfront. That's where the ledger pays for itself. After 1100+ receipts, I can see patterns like "tasks scoped above 300 lines fail 3x more often" or "planning gates without explicit deliverables always need redispatch." Those patterns feed back into how I write dispatches, not into the rules themselves.
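
As an illustration of how a pattern like the 300-line one could be pulled out of an NDJSON ledger, here is a small Python sketch. The field names `scoped_lines` and `status` are hypothetical; the real receipt schema is not shown in this thread.

```python
import json
from collections import defaultdict


def failure_rate_by_scope(ndjson_text: str, threshold: int = 300) -> dict[str, float]:
    """Group receipts by scoped size and return the failure rate per group."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the ledger
        receipt = json.loads(line)
        bucket = "large" if receipt["scoped_lines"] > threshold else "small"
        totals[bucket] += 1
        if receipt["status"] != "success":
            failures[bucket] += 1
    return {bucket: failures[bucket] / totals[bucket] for bucket in totals}
```

Comparing the two rates is the whole analysis; the result changes how the next dispatch is scoped, not the gate rules.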

So the rules stay stable, but the way I use the system keeps improving. The governance layer is the boring part — the interesting part is the feedback loop from receipts to dispatch quality.


6 months and 1100+ receipts to get to useful patterns — that's the hidden cost nobody talks about. The governance layer is 'boring' but it's also 6 months you're not spending on the actual agent. That feedback loop from receipts to dispatch quality is exactly what we're building as infrastructure so teams don't start from zero.


Fair point on the time cost — but I'd frame it differently. The 6 months wasn't spent building a governance layer instead of building the agent. The governance layer grew out of the actual project work. Every receipt, every quality rule, every dispatch pattern was a direct response to something that broke in production. Day one I had zero governance and a working agent. By month six I had 1100+ receipts and a system that catches failures before they ship.

The infrastructure approach makes sense for teams who want to skip the learning curve. The trade-off is that pre-built governance rules are generic by definition — they can't know that your specific codebase breaks when tasks exceed 300 lines, or that planning gates without explicit deliverables always need redispatch. That pattern data only comes from running your own agents on your own work.

Curious what you're building — is it the ledger/tracking layer, the quality gates, or the full orchestration?


We're building the platform that manages all of the agent's policies.

Check out our launch post: https://news.ycombinator.com/item?id=47146354


Nice — just checked it out. The interceptor approach makes sense for teams that need policy enforcement across multiple agents.

Interesting difference in philosophy though: Limits enforces rules defined upfront, while what I built learns rules from production receipts. After 1100+ task completions, the dispatch patterns look completely different from what I would have designed on day one.

Probably complementary — you'd want both. Pre-defined guardrails for the dangerous stuff (your approach), and pattern evolution for the quality/efficiency stuff (mine).


This approach sounds clean in theory, but in production you're building a black box. When your planning agent hands off to an implementation agent and that hands off to a review agent — where did the bug originate? Which agent's context was polluted? Good luck tracing that. I went the opposite direction: single agent per task, strict quality gates between steps, full execution logs. No sub-agents. Every decision is traceable to one context window. The governance layer (PR gates, staged rollouts, acceptance criteria) does the work that people expect sub-agents to do — but with actual observability.

After 6 months in production and 1100+ learned patterns: fewer moving parts, better debugging, more reliable output. Built a full production crawler this way (26 extractors, 405 tests) without sub-agents. The orchestrator acts as a gatekeeper that redispatches incomplete work.


> Every decision is traceable to one context window

There are no models that can do all the mentioned steps in a single usable context window. This is why subagents or multi-agent orchestrators exist in the first place.


You're right that no model handles everything in one context window — that's exactly why I built context rotation. Each task runs in a single agent context (one responsibility, clear scope), and when the window fills up, the system automatically rotates: writes a structured handover, clears, and resumes in a fresh window.
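
A toy Python model of that rotation, to show the shape of it. Everything here is an assumption: the `Context` fields, the word-count stand-in for tokens, and the JSON handover format are illustrative and not the actual implementation.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    task: str
    budget: int                                      # assumed token budget per window
    used: int = 0
    transcript: list = field(default_factory=list)   # everything in the window
    decisions: list = field(default_factory=list)    # what must survive a rotation


def rotate(ctx: Context) -> Context:
    """Write a structured handover, drop the transcript, resume in a fresh window."""
    handover = json.dumps({"task": ctx.task, "decisions": ctx.decisions})
    fresh = Context(task=ctx.task, budget=ctx.budget, decisions=list(ctx.decisions))
    fresh.transcript.append(handover)
    fresh.used = len(handover.split())  # crude stand-in for a token count
    return fresh


def step(ctx: Context, message: str, tokens: int,
         decision: Optional[str] = None) -> Context:
    """Add one message, rotating first if it would overflow the window."""
    if ctx.used + tokens > ctx.budget:
        ctx = rotate(ctx)
    ctx.transcript.append(message)
    ctx.used += tokens
    if decision is not None:
        ctx.decisions.append(decision)
    return ctx
```

The point of the sketch is that only the handover crosses the boundary: earlier decisions survive, the raw transcript does not.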

The key distinction: sub-agents run within a parent context with shared state (black box). My approach uses independent parallel agents (separate terminals, separate context windows) that report back to an orchestrator. Large tasks get split into smaller dispatches upfront — each scoped to fit a single context window. The orchestrator can dispatch research to 3 agents in parallel, collect their outputs, then dispatch a synthesis task to a single agent that merges the findings.

So it's not "one context window for everything" — it's right-sized tasks with full observability per agent, and a governance layer managing the sequence and merging results.
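
The fan-out and merge step can be sketched in a few lines of Python. Note the swap: this uses a thread pool in one process as a stand-in for separate terminals, and `research` and `synthesize` are hypothetical callables representing a dispatch to an agent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out_then_synthesize(topics: list[str],
                            research: Callable[[str], str],
                            synthesize: Callable[[list[str]], str]) -> str:
    """Run one research dispatch per topic in parallel, then one synthesis dispatch."""
    with ThreadPoolExecutor(max_workers=max(1, len(topics))) as pool:
        # map() preserves input order, so findings line up with topics.
        findings = list(pool.map(research, topics))
    return synthesize(findings)
```

Each call to `research` sees only its own topic, which mirrors the "no shared state between agents" property described above.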


That sounds interesting. I do hate how there's no observability into subagents and you just get a summary.

How do they report back to the orchestrator? Tmux?


Yes, tmux. The setup is a 2x2 grid:

T0 (orchestrator) | T1 (Track A)
T2 (Track B)      | T3 (Track C)

When a worker finishes, it writes a structured report to a shared unified_reports/ directory. A file watcher (receipt processor) detects it, parses the report into a structured NDJSON receipt (status, files changed, open items, git ref), and delivers it to T0's pane.

T0 then reviews the receipt, runs a quality advisory (automated pass/warn/hold verdict), and decides: close open items, complete the PR, or redispatch. Everything is filesystem-based — no API, no database, no shared memory between agents. Each terminal has its own context window, its own Claude Code (or Codex/Gemini) session, and the only communication channel is structured files on disk.

The receipt ledger is append-only NDJSON, so you can always trace: which agent did what, when, on which dispatch, with which git commit.
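
For a sense of the parse-and-append step, here is a hedged Python sketch. It assumes the worker report is a JSON file and that the dispatch ID is the file name; both are guesses, as are the function names. The file watcher and the delivery to T0's pane are left out.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def to_receipt(report: dict, dispatch_id: str) -> dict:
    """Keep the fields listed above for a receipt, plus bookkeeping."""
    return {
        "dispatch_id": dispatch_id,
        "status": report["status"],
        "files_changed": report.get("files_changed", []),
        "open_items": report.get("open_items", []),
        "git_ref": report["git_ref"],
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def append_receipt(ledger: Path, receipt: dict) -> None:
    """Append one JSON object per line; existing lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(receipt, sort_keys=True) + "\n")


def process_report(report_path: Path, ledger: Path) -> dict:
    """Parse one report file and record it in the ledger."""
    report = json.loads(report_path.read_text(encoding="utf-8"))
    receipt = to_receipt(report, dispatch_id=report_path.stem)
    append_receipt(ledger, receipt)
    return receipt
```

Opening the ledger in append mode is what keeps it append-only in this sketch; tracing a change back is then a matter of filtering lines by `git_ref` or `dispatch_id`.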

I open-sourced the setup recently if you want to dig into the details.


Great list. I've been running a multi-agent orchestration system (11 specialized AI agents) in production for 6 months and your #2 and #5 resonate hard.

What I'd add:

6. Confidence without evidence. Agents will report "task complete" with high confidence when the output is plausible but wrong. Without automated validation gates, you won't catch it until production breaks.
7. Context drift in long sessions. After 50+ tool calls, agents start losing track of earlier decisions. They'll contradict their own architecture choices from 20 minutes ago. Session length is an underrated failure vector.
8. The "almost right" problem. Agents rarely fail catastrophically; they fail subtly. Code that passes tests but misses edge cases. Docs that look complete but have wrong cross-references. This is worse than obvious failures because you trust the output.

What fixed most of these for me:

- Quality gates between agents: no agent's output moves forward without automated checks (tests, schema validation, consistency checks).
- Evidence-based confidence scores: not "how sure are you?" but "what specific evidence supports this output?"
- Human-in-the-loop at decision points, not everywhere. You can't review everything, so you design the system to surface the right moments for human judgment.
- Small scoped tasks: agents working on 150-300 line PRs with clear acceptance criteria fail way less than agents given open-ended goals.

Your #5 (implementation gap) is the one I see most people underestimate. The fix isn't better agents, it's better systems around the agents.

Happy to share more details about the architecture if anyone's interested


Great questions. I've been running ~2,400 multi-agent dispatches across 4 terminals (different AI models) for about 6 months, so I'll share what I've hit in practice rather than theory.

On RFC 3161 vs. multi-witness anchoring: For most production agent systems today, RFC 3161-style timestamping is overkill — and so is multi-witness anchoring. The practical threat model isn't a sophisticated adversary backdating entries. It's your own agents producing self-consistent but wrong output, and you not being able to reconstruct the sequence after the fact. The real defensibility problem is completeness, not tamper-proofing. Can you prove nothing was omitted? That's harder than proving nothing was altered. I use an external watcher process that observes agent output independently — the agent doesn't write to the log, so it can't selectively omit entries. That separation does more for defensibility than any cryptographic anchoring would at my scale.

On replayability: It breaks down the moment you involve external state. An agent that reads a file, calls an API, or queries a database — the inputs to that decision are gone unless you explicitly snapshot them. Git shows what changed, but not what the agent saw when it decided to change it. Chat sessions expire. CoT gets truncated. I capture pre-state and post-state per dispatch, but the reasoning trace between them is still the weakest link. Nobody I've seen solves this cleanly yet. The best workaround I've found: treat the agent's self-reported reasoning as one input, but verify it against an independent quality check (separate process the agent can't influence). You can't replay the reasoning, but you can independently verify the outcome.
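
A minimal Python sketch of pre-state and post-state capture for files on disk. It is an assumption about how this could be done, not a description of the actual capture: it hashes every file under a directory, and it does not cover API responses or database reads, which would need their own snapshots.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def diff_state(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(post.keys() - pre.keys()),
        "removed": sorted(pre.keys() - post.keys()),
        "modified": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }
```

Taking `snapshot` before the dispatch starts records what the agent could have seen, which is the part a git diff alone does not preserve.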

On the overengineering threshold: For me it became necessary after about 200 dispatches. One of my agents hallucinated a dependency, a second agent "fixed" the resulting test failures by creating the missing module, and by morning I had three rewritten files with clean commits built on something that should never have existed. The scary part wasn't the mistake — it was that I couldn't reconstruct why it happened. At single-agent scale, you can eyeball diffs. The moment you have agents responding to each other's output, forensic reconstruction stops being optional. I'd say the threshold is: if agent A's output can trigger agent B's action without human review, you need defensible logs. Not because of litigation risk, but because you literally can't debug cascading failures without them.


Interesting approach. Runtime enforcement is the part most people skip — they focus on logging what happened but don't prevent bad actions in the first place. The policy engine + kill switch combination makes sense for that.

I've been running ~2,400 multi-agent dispatches and came at this from the opposite direction: I started with staging gates (propose → human review → execute) as the runtime layer, then realized I also needed a forensic layer for when things slip through or when I need to understand patterns over time.

Curious about a few things:

- How granular are the JSON policies in practice? I found that "agent X can use tool Y" breaks down fast when agents chain tools in unexpected ways. The sequence matters more than individual permissions.
- The hash-chained audit trail: how do you handle schema evolution? After a few months of production, what you want to log changes significantly. Hash chains make adding fields tricky without breaking the chain.
- What happens when an agent crashes mid-action? With the hash chain, do you risk a corrupted tail entry that invalidates subsequent verification?
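
To make the last two questions concrete, here is a small Python sketch of a hash-chained log. It is not the design of the tool under discussion; the entry layout and function names are assumptions. It hashes each payload as serialized, so later entries can carry new fields, and its verifier reports a torn final line separately from a broken chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev"


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash plus a canonical serialization of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(lines: list[str], payload: dict) -> None:
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    lines.append(json.dumps({"prev": prev, "payload": payload,
                             "hash": entry_hash(prev, payload)}))


def verify(lines: list[str]) -> tuple[int, str]:
    """Return (number of valid leading entries, status)."""
    prev = GENESIS
    for i, line in enumerate(lines):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # A crash mid-write leaves an unparseable last line; earlier entries stay valid.
            status = "torn tail" if i == len(lines) - 1 else "corrupt entry"
            return i, status
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return i, "chain broken"
        prev = entry["hash"]
    return len(lines), "ok"
```

In this layout a torn tail costs only the last entry, since verification of everything before it still succeeds; whether a given product behaves this way is exactly what the question above is asking.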

The runtime vs. after-the-fact distinction is important. Ideally you want both — prevent what you can, reconstruct what you couldn't prevent. Nice to see someone tackling the prevention side seriously.

