Amazing. Some people who use LLMs for soft outcomes are so enamored with them that they disagree with me when I say to be careful, they're not perfect -- this is such a great non-technical way to explain the reality I'm seeing when using them on hard-outcome coding/logic tasks. "Hey this test is failing", LLM deletes test, "FIXED!"


Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.

What about when we don't know what it's supposed to look like?

Lately I've been wrestling with the fact that, unlike with, say, a generalized linear model fit to data with some inferential theory behind it, we don't have a theory or model for the uncertainty of what LLMs produce. We recognize when it's off about things we already know, but we don't have a way to estimate when it's off other than checking it against reality, which is probably the exception in how it's used rather than the rule.


I need to be delicate with wording here, but this is why it's a worry that all the least intelligent people you know could be using AI.

It's why non-coders think it's doing an amazing job at software.

But worryingly, it's also why using it for research, where you necessarily don't know what you don't know, is going to trip up even smarter people.


You are describing exactly the Dunning-Kruger Effect[0] in action. I’ve worked with some very bright yet less technical people who think the output is some sort of magic lamp and vastly overindex on it. It’s very hard as an engineer to explain this to them.

[0] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect


I built an ML classifier for product categories way back. As I added more classes/product types, the individual per-class precision/recall metrics improved--I kept adding more and more until I ended up with ~2,000 classes.

My intuition is that at the start, when the task was "choose one of these 10 or unknown", that unknown class left a big gray area, so as I added more classes the model could effectively say "I know it's not X, because it's more similar to Y".
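
For concreteness, here's a rough sketch of that kind of setup (the product titles and category names are made up, and scikit-learn is assumed); the per-class precision/recall numbers come out of classification_report:

  # Rough sketch: product-title classifier with an explicit "unknown" class.
  # Toy data; assumes scikit-learn. Evaluating on the training set here only
  # to show the report format, not to measure anything real.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report
  from sklearn.pipeline import make_pipeline

  titles = [
      "usb-c charging cable 6ft", "wireless bluetooth earbuds",
      "mens trail running shoes", "womens leather ankle boots",
      "stainless steel water bottle", "cast iron skillet 12 inch",
      "mystery grab bag assortment", "misc clearance item",
  ]
  labels = [
      "electronics", "electronics",
      "footwear", "footwear",
      "kitchen", "kitchen",
      "unknown", "unknown",
  ]

  model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
  model.fit(titles, labels)

  # Per-class precision/recall: the numbers that kept improving as classes were added.
  print(classification_report(labels, model.predict(titles), zero_division=0))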

I feel like in this case, though, the broken clocks are broken because they don't serve the purpose of visually transmitting information; they do still look like clocks. I'm sure if you fed the output back into the LLM and asked what time it is, it would say IDK, or more likely make something up and be wrong (at least for the egregious ones where the hands are flying everywhere).


Yeah it seems crazy to use an LLM on any task where the output can't be easily verified.


> Yeah it seems crazy to use an LLM on any task where the output can't be easily verified.

I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.


  > "Hey this test is failing", LLM deletes test, "FIXED!"
A nice continuation of the tradition of folk stories about supernatural entities like teapots or lamps that grant wishes and take them literally. "And that's why, kids, you should always review your AI-assisted commits."


To be fair I'd probably also delete the test.



