Hacker News | shubhamintech's comments

Runtime policies as an actual gate rather than prompt instructions is the right model. Most frameworks just bolt governance on as a wrapper and hope the model obeys. What I'd want on top of this: observability into why agents are hitting policy blocks, not just that they were blocked.

Yep, totally agree. And Orloj has this built in. It tracks the entire lifecycle of your tasks through real-time traces so you can audit why things happened, good or bad. During a task you can see how many tokens each call used (input/output) and the latency of each model/tool call.

The oracle problem is tractable when the output is code: you can compile it, run tests, diff the output. For conversational AI it's much harder. We've seen teams use LLM-as-judge as their validation layer and it works until the judge starts missing the same failure modes as the generator.
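To make the "tractable when the output is code" point concrete, here's a minimal sketch of a code oracle: compile the candidate, then run a known test against it. The `add` function and its test case are illustrative assumptions, not anything from the thread.

```python
def passes_oracle(candidate_src: str) -> bool:
    """Gate a generated snippet: it must parse, define `add`, and pass a test."""
    try:
        code = compile(candidate_src, "<candidate>", "exec")  # syntax gate
    except SyntaxError:
        return False
    ns: dict = {}
    exec(code, ns)  # execute the definitions into a scratch namespace
    try:
        assert ns["add"](2, 3) == 5  # behavioral gate: a known input/output pair
        return True
    except (AssertionError, KeyError, TypeError):
        return False

print(passes_oracle("def add(a, b): return a + b"))  # True
print(passes_oracle("def add(a, b): return a - b"))  # False
```

There's no equivalent for conversational output: there is no `compile()` step and no deterministic assertion, which is why teams fall back to LLM-as-judge.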

The MoE point matters here: sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.

We've seen this exact pattern. Most devtools assume a human will eventually log in and contextualize the data. When the 'user' is an agent, you need the surface to be machine-readable by default, not as an afterthought. The adapter approach mostly doesn't work: you end up with a translation layer that loses exactly the signal you needed.

It's like devtools are now agenttools.


4.4 tok/s with reliable structured output is a solid local benchmark, although the question is whether SSD streaming introduces per-token latency variance that messes up tool call parsing downstream. The gap between 400 GB/s unified memory bandwidth and 17.5 GB/s SSD reads means you're in the hot path pretty much every time an expert isn't cached.
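Back-of-envelope on that gap, using the bandwidths from the comment. The 2.5 GB expert size is a made-up assumption for illustration; the point is the ~20x latency penalty per cache miss, not the exact numbers.

```python
EXPERT_GB = 2.5      # assumed size of one expert's weights (illustrative)
SSD_GBPS = 17.5      # NVMe streaming read bandwidth, per the comment
MEM_GBPS = 400.0     # unified memory bandwidth, per the comment

ssd_ms = EXPERT_GB / SSD_GBPS * 1000  # time to pull the expert off SSD
mem_ms = EXPERT_GB / MEM_GBPS * 1000  # same read served from memory

print(f"SSD miss: {ssd_ms:.0f} ms, memory hit: {mem_ms:.2f} ms per expert")
```

Even one uncached expert per token puts a triple-digit-millisecond spike into your per-token latency, which is exactly the variance that breaks streaming tool-call parsers.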

We've started documenting every A/B decision we make on the tech side and storing it in our internal engineering docs! We've only gone back to it once in a while, but it usually helps keep us grounded.


IMO the under-discussed risk here is that sites will start serving different content to verified crawlers vs real users. You're already seeing it with known search bots getting sanitized views. If your agent's context comes from a crawl the site knows is going to an AI, you have no guarantee it matches what a human sees, and that data quality problem won't surface until your agent starts acting on selectively curated information.

This could go wrong on so many levels.


This already happens in the opposite direction. See: news websites that drop their paywall for Googlebot.


Hard limits are a good first layer but they don't tell you why the agent is looping. Retrying because it's confused, retrying because a dependency is flaky, and genuine planning loops are three different problems with different fixes. What helped us was logging the agent's intent at each step, and if it's asking the same underlying question three times in different syntax, that's the signal to bail early rather than burning through your iteration budget.


The BCG framing makes it sound like a cognitive load problem but I think it is more unreliability fatigue. When your AI does 8 things right and then confidently does the 9th wrong, you spend mental energy second-guessing everything. Supervising an unreliable system is more exhausting than just doing the task yourself.


Automation Bias is probably the thing you're trying to describe. :)

https://en.wikipedia.org/wiki/Automation_bias


Same mental model problem comes up in AI agent observability. Two conversation flows can produce identical user outcomes and look totally different at the message level, or vice versa. The normalization step that actually captures 'did behavior change' is the hard part in both domains.


That's a really sharp parallel. "Did behavior change" is exactly the question in both cases, and the surface-level representation lies to you in both. We normalize ASTs before hashing so reformatting or renaming a local variable doesn't register as a change. Curious what normalization looks like on the agent observability side, feels like a harder problem when the output is natural language instead of code.
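The AST-normalize-then-hash step looks roughly like this toy version (not our actual normalizer; the renaming here is naive and would also rename globals and builtins, which a real one has to handle):

```python
import ast
import hashlib

class _RenameNames(ast.NodeTransformer):
    """Replace every Name with a positional placeholder (v0, v1, ...)."""
    def __init__(self):
        self.names: dict = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        key = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=key, ctx=node.ctx), node)

def normalized_hash(src: str) -> str:
    # Parsing discards formatting/whitespace; renaming discards local names;
    # hashing the dump gives a stable fingerprint of the behavior-shape.
    tree = _RenameNames().visit(ast.parse(src))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

a = normalized_hash("def f(x):\n    total = x + 1\n    return total")
b = normalized_hash("def f(x):\n    t = x+1\n    return t")
assert a == b  # reformatting + renaming a local is not a "change"
```

On the agent side I'd guess the analogue is hashing a normalized tool-call trace rather than the surface text, but natural-language turns don't parse to anything as clean as an AST.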

