
It seemed to me the author was simply sharing his own lived experience, which happens to be a bit contrarian to the popular hype around LLMs. It may seem courageous to some, but I can see a world where the author didn't think twice about writing down his thoughts in 15 minutes and publishing them on his own personal site. Perhaps that comes naturally to people who have been around this industry longer.


everybody loves building agents, nobody likes debugging them. agents hit the classic llm app lifecycle problem: at first it feels magical. it nails the first few tasks, doing things you didn't even think were possible. you get excited and start pushing it further. then you run it and it fails on step 17, then step 41, then step 9.

now you can't reproduce it because it's probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.


That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
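A minimal sketch of that loop (my own illustration, not the parent's actual tooling), assuming Python asyncio and a hypothetical run_agent() stand-in for actually driving the agent:

  import asyncio
  import random

  async def run_agent(scenario: dict) -> bool:
      # Hypothetical stand-in: real tooling would replay the agent against
      # the captured context and return pass/fail.
      await asyncio.sleep(0)
      return random.random() > 0.1

  async def pass_rate(scenario: dict, n_runs: int) -> float:
      # Run the same scenario n_runs times concurrently; report the pass rate.
      results = await asyncio.gather(*(run_agent(scenario) for _ in range(n_runs)))
      return sum(results) / n_runs

  async def main(failing_case: dict, past_scenarios: list[dict]) -> None:
      # 1. Hammer the case you're trying to fix.
      print("case under repair:", await pass_rate(failing_case, 200))
      # 2. Re-run past scenarios to catch regressions (threshold is arbitrary).
      for s in past_scenarios:
          rate = await pass_rate(s, 50)
          if rate < 0.95:
              print("possible regression:", s.get("name"), rate)

  asyncio.run(main({"name": "step-17-failure"}, [{"name": "old-scenario"}]))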


In the event this comment is slathered in sarcasm:

  Well done!  :-D


Do you use a tool for this? Is there some sort of tool that collects evals from live inferences (especially the ones that fail)?


There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
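As a back-of-the-envelope illustration of that point (mine, not the parent's): repeated trials only give you a statistical bound on the pass probability, never a proof. A Wilson lower bound makes that concrete:

  import math

  def pass_rate_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
      # 95% Wilson score lower bound on the true pass probability.
      if runs == 0:
          return 0.0
      p = passes / runs
      denom = 1 + z ** 2 / runs
      centre = p + z ** 2 / (2 * runs)
      margin = z * math.sqrt(p * (1 - p) / runs + z ** 2 / (4 * runs ** 2))
      return (centre - margin) / denom

  # 97 passes out of 100 runs only supports claiming roughly 92% reliability
  # at 95% confidence, and it says nothing about inputs you never tested.
  print(pass_rate_lower_bound(97, 100))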


https://x.com/rerundotio/status/1968806896959402144

This is a use of Rerun that I haven't seen before!

This is pretty fascinating!!!

Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's fascinating here is that Adam, for his master's thesis, is using Rerun to visualize agent (as in software / LLM agent) state.

Interesting use of Rerun!

https://github.com/gustofied/P2Engine
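For anyone curious what that looks like in practice, here's a minimal sketch along those lines (my guess at the shape of it, not code from the repo or thesis), assuming the Rerun Python SDK's rr.init / rr.log / rr.TextLog:

  import rerun as rr

  rr.init("agent_state_viz", spawn=True)  # spawn=True opens the Rerun viewer

  # Hypothetical agent trace: log each step's state as a text entry so you can
  # scrub through the run in the viewer like any other recording.
  steps = [
      ("plan", "decompose the task into 3 subtasks"),
      ("tool_call", "search(query='rerun agent visualization')"),
      ("observe", "3 results returned"),
      ("respond", "draft final answer"),
  ]

  for i, (phase, detail) in enumerate(steps):
      rr.set_time_sequence("step", i)  # timeline API; exact name varies by SDK version
      rr.log(f"agent/{phase}", rr.TextLog(detail))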


For sure. For instance, Google has the ADK eval framework: you write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.


heya, building this. it's been used in prod for a month now and has saved my customer's ass while building general workflow automation agents. happy to chat if you're interested.

[email protected]

(gist: evals as a service)


That everybody seems to love building these things while people like you harbor deep skepticism about them is itself a reason to get your hands dirty with an agent: the cost is 30-45 minutes of your time, and doing so will arm you with an understanding you can use to make better arguments against them.

For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.

But the point of the article is that its arguments work both ways.


Thanks for pointing this out; we didn't consider how important Stack Overflow comments could be. We'll look into including them as a source.


Awesome product btw!


Thanks for trying it out!

We can definitely be clearer about our focus on programming-related queries. We usually don't display code snippets for non-programming questions, but we're definitely still tuning a couple of things there.

We're not focused on simple factoid answers like the population of cities, because that's not where people get the most value.

The AWS API is a bit tricky because it's a rather broad technology with SDKs in different languages, so the search results for a question return a mishmash of solutions that we then try to make sense of. If you share some sample queries you tried related to this, I'd be happy to look into them and improve our answers there.

Business model: we're currently just focused on building something developers want. Agreed that ads and dev tools aren't the most synergistic.


The answer is based on the contents of the websites returned by the search engine, plus some other sources.


What is the correct/expected answer to that query?


We think the solution is simply having good sources and answer transparency. If you mouse over part of the answer, we try to show you the source of that sentence. Obviously this system is early and will improve over time, but if you can easily check whether an answer is from, say, the official FastAPI documentation, then the false-confidence effect of these models becomes less of an issue.
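A toy version of that sentence-to-source mapping (purely illustrative, not how their system actually works) could be as simple as picking, for each answer sentence, the source passage with the highest word overlap:

  import re

  def tokens(text: str) -> set[str]:
      return set(re.findall(r"[a-z0-9]+", text.lower()))

  def attribute_sentences(answer: str, sources: dict[str, str]) -> dict[str, str]:
      # Map each answer sentence to the source URL with the highest Jaccard
      # word overlap, so hovering a sentence can show where it (likely) came from.
      mapping = {}
      for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
          if not sentence:
              continue
          toks = tokens(sentence)
          best = max(
              sources,
              key=lambda url: len(toks & tokens(sources[url])) / (len(toks | tokens(sources[url])) or 1),
          )
          mapping[sentence] = best
      return mapping

  sources = {
      "https://fastapi.tiangolo.com/": "FastAPI is a modern web framework for building APIs with Python.",
      "https://docs.python.org/3/": "The Python standard library reference.",
  }
  print(attribute_sentences("FastAPI is a web framework for building APIs.", sources))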


Interesting. We're trying to wrangle the nondeterminism, but sometimes it can't be helped, as Bing itself can return different results. We're always actively working on the model, though.

The source-citing feature can definitely be improved: right now it works at sentence granularity and insists on finding a best source even when that isn't appropriate. Thanks for pointing all of this out.


If you can't find the information yourself via a web search, it's going to be difficult for the model to find the answer too... for now ;)


You can leave feedback on individual search results via the check and X buttons below the answer. Leaving feedback on each code snippet (e.g., votes) is on the roadmap.


Right, but that's just a blanket "good" or "bad".

I hit the green check, but I wanted to leave a detailed comment.


Ah, we only put the option for a detailed comment on negative feedback (try clicking the X and a form will pop up). We'll also give that option for positive feedback in the future.


Oh, got it. Well, at least there's a way to leave it!

