Feels like a whole bunch of us are converging on very similar patterns right now.
I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.
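The wiring, not the prompt, enforcing the barrier is the key bit. Here's a minimal sketch of that idea (names like `review_phase` and the size-based complexity bucketing are my illustration, not the actual autoissue.py code):

```python
from dataclasses import dataclass

@dataclass
class Task:
    issue: str
    plan: str
    diff: str

def review_phase(diff: str) -> int:
    # Stand-in for the cold-review LLM call: rate complexity 1-5.
    # Bucketing by diff size here just to keep the sketch runnable.
    return min(5, 1 + len(diff) // 2000)

def pick_model(complexity: int) -> str:
    # Complexity drives model selection: simple work stays cheap.
    return "sonnet" if complexity <= 3 else "opus"

def run_review(task: Task) -> str:
    # The information barrier is structural: review_phase is handed
    # ONLY the diff. The issue and plan never enter its context.
    return pick_model(review_phase(task.diff))
```

The point is that no prompt injection or instruction-following failure can leak the plan into the review, because the review function never receives it.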
On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.
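The loop structure is simple to state even though the judge itself is an LLM. A sketch under my own naming (the `generate`/`judge` signatures and feedback string are illustrative, not the project's API):

```python
def attractor_loop(generate, judge, holdout, target=95.0, max_iters=10):
    """Generate -> score -> refine against a holdout set the generator
    never sees. Only the aggregate score signal flows back as feedback."""
    feedback, history = None, []
    code = None
    for _ in range(max_iters):
        code = generate(feedback)                   # generate or refine
        scores = [judge(code, s) for s in holdout]  # 0-100 per scenario
        mean = sum(scores) / len(scores)
        history.append(mean)
        if mean >= target:                          # converged
            return code, history
        # Crucially, feedback carries scores, not the scenarios themselves,
        # so the generator can't overfit to the holdout set.
        low = sum(1 for x in scores if x < target)
        feedback = f"satisfaction={mean:.0f}, {low} scenarios below target"
    return code, history
```

A plateau detector watching `history` would be where the wonder/reflect recovery hooks in.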
The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.
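For the curious, the truncation logic is roughly this shape (parameter names and exact ordering are my guesses from the description, not OctopusGarden's actual config):

```python
def truncate_diff(files, max_files=10, max_lines=3000, max_bytes=100_000):
    """Keep the N largest changed files, then cap total lines and bytes,
    so the reviewer sees the biggest changes rather than an arbitrary prefix."""
    biggest = sorted(files.items(), key=lambda kv: len(kv[1]), reverse=True)
    lines = []
    for name, patch in biggest[:max_files]:
        lines.extend(patch.splitlines())
    out = "\n".join(lines[:max_lines])
    return out.encode()[:max_bytes].decode(errors="ignore")
```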
Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
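The gating idea is worth making concrete. Here's my sketch of the shape of it, not Ouroboros's actual API (the `score_ambiguity` and `ask` callables stand in for its LLM-driven scoring and Socratic questioning):

```python
def refine_spec(spec, score_ambiguity, ask, threshold=0.2, max_rounds=5):
    """Refuse to proceed to code generation until the spec's ambiguity
    score drops below a threshold; each round tightens the spec."""
    for _ in range(max_rounds):
        if score_ambiguity(spec) < threshold:
            return spec  # clear enough to build against
        spec = ask(spec)  # Socratic round: answer a clarifying question
    raise ValueError("spec still too ambiguous; refusing to generate code")
```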
Appreciate the thoughtful comment. I think there's a key distinction though: this isn't a conversational agent pipeline where you need to trace reasoning chains.
The attractor loop is closer to gradient descent than to an agent conversation. Generated code is treated as opaque weights, and only externally observable behavior matters (scored 0-100 by an independent LLM judge against holdout scenarios). "Things going sideways" just means the satisfaction score is low on that iteration, which naturally feeds back as context for the next one. Build failures, test failures, partial correctness: they're all just points on a convergence curve rather than catastrophic failures requiring forensic replay.
So the observability you need shifts from "what did the agent think at step 12?" to "is the loss curve trending down?" We persist per-iteration satisfaction scores, failures, and token costs, which gives you the audit trail. But it's a pretty compact one: a number, a list of failing scenarios, and a cost.
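The "trending down" check can be a few lines over those persisted scores. A sketch (the `window` and `eps` knobs are illustrative, not the system's actual settings; the sense is inverted since satisfaction should trend up):

```python
def still_improving(scores, window=3, eps=1.0):
    """Has any of the last few iterations beaten the score that came
    before them by a meaningful margin? If not, the run has plateaued."""
    if len(scores) <= window:
        return True  # too few points to call a plateau
    return max(scores[-window:]) > scores[-window - 1] + eps
```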
The spec durability point is a good one to raise. In this case specs aren't documentation that drifts from code over time. They're the actual input to the system. If the spec is wrong, you fix the spec. The generated code is disposable by design.
You're absolutely right that multi-run observability becomes important as this scales though. Watching N specs converge simultaneously will need a proper dashboard. But it's N loss curves rather than N conversation traces, which should be fundamentally simpler to reason about.
Right now OctopusGarden logs every LLM call with token counts and cost, and the SQLite store records each run and iteration (spec hash, scores per scenario, generated code). So you get a full trace of what was generated, what it was tested against, and how it scored.
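In case the shape of that store isn't obvious, here's a minimal sketch of that kind of run/iteration schema (table and column names are illustrative, not OctopusGarden's actual schema):

```python
import sqlite3

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS runs(
            id INTEGER PRIMARY KEY, spec_hash TEXT, started TEXT);
        CREATE TABLE IF NOT EXISTS iterations(
            run_id INTEGER REFERENCES runs(id),
            n INTEGER, scenario TEXT, score REAL,
            code TEXT, tokens INTEGER, cost REAL);
    """)
    return db

def record_iteration(db, run_id, n, scenario, score, code, tokens, cost):
    # One row per (iteration, scenario): the whole audit trail is a
    # number, what it was tested against, and what it cost.
    db.execute("INSERT INTO iterations VALUES (?,?,?,?,?,?,?)",
               (run_id, n, scenario, score, code, tokens, cost))
    db.commit()
```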
For approvals, the current model is that the spec is the approval. If the spec is right and scenarios pass at 95%+ satisfaction, the code ships. There's no PR review step by design (the "code is opaque weights" philosophy).
That said, you could totally layer approvals on top. Gate on spec changes, require sign-off before a run kicks off, or add a human checkpoint between "converged" and "deployed." The tool doesn't enforce a deployment pipeline, so that's up to your org's workflow.
Worth noting: this is purely a hobby project at this point. It hasn't been used in any commercial setting. The guard rails and approval workflow stuff is where it would need the most work before anyone used it for real.
Cool site / good idea. Maybe I'm underestimating it (I probably am), but I don't think it's a huge leap from what I published today to that compliant vision you're tackling.
GDAL is dope. Everything geospatial is built on it. Props to osgeo.org for organizing its support and development for the past 2+(?) decades.
I have an M1 8GB MacBook Air. I keep wishing I'll run into a performance issue so I have an excuse to buy a new laptop, but it never happens. I also don't use Chrome, for all the reasons mentioned in these comments.