
It's so easy to ship completely broken AI features because you can't really unit test them, and unit tests have been the main standard for deciding whether code works for a long time now.

The most successful AI companies (OpenAI, Anthropic, Cursor) are all dogfooding their products as far as I can tell, and I don't really see any other reliable way to make sure the AI feature you ship actually works.





Tests are called "evals" (evaluations) in the AI product development world. Basically, you have humans review the LLM output, or you feed it to another LLM with instructions on how to evaluate it.

https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-...
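For anyone wondering what "feed it to another LLM" looks like in practice, here's a minimal LLM-as-judge sketch. This is not from the linked article; the model name, rubric, and pass threshold are illustrative assumptions, and it assumes an OpenAI-style chat completions client.

    # Minimal LLM-as-judge eval sketch. Model name, rubric, and the 4.0
    # pass threshold are illustrative assumptions, not anyone's real pipeline.
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading an AI assistant's answer.
    Question: {question}
    Answer: {answer}
    Score the answer from 1 (useless) to 5 (excellent) for correctness
    and helpfulness. Reply with only the number."""

    def judge(question: str, answer: str) -> int:
        """Ask a second model to grade the first model's output."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }],
        )
        return int(resp.choices[0].message.content.strip())

    def run_eval(cases: list[dict], threshold: float = 4.0) -> bool:
        """Grade every (question, answer) pair; pass if the mean clears the bar."""
        scores = [judge(c["question"], c["answer"]) for c in cases]
        mean = sum(scores) / len(scores)
        print(f"mean score: {mean:.2f} over {len(scores)} cases")
        return mean >= threshold

The point is that the "assertion" is an aggregate quality bar over many cases plus a grader, not an exact expected value.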


Interesting, I'd never really thought about it outside of this comment chain, but I'm guessing approaches like this undercut the typical automated testing that devs would do. And seeing how this is MSFT (which stopped having dedicated testing roles a good while ago, RIP SDET roles), I can only imagine the quality culture is even worse for "AI" teams.

Yes. Because why would there ever be a problem with a devqaops team objectively assessing their own work's effectiveness?

Traditional Microsoft devs are used to deterministic tests: assert result == expected. AI features, on the other hand, require probabilistic evals and quality monitoring in prod. I think Microsoft simply lacks the LLM Ops culture right now to build a quality evaluation pipeline before release; they are testing everything on users.
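Concretely, the difference looks something like this (a toy sketch; parse_price, summarize, and score are made-up stand-ins, and the 90% bar is arbitrary):

    # Deterministic unit test: an exact expected value, passes or fails outright.
    def parse_price(text: str) -> float:  # trivial stand-in function
        return float(text.replace("$", "").replace(",", ""))

    def test_parse_price():
        assert parse_price("$1,299.00") == 1299.00

    # Probabilistic eval: score many model outputs and assert an aggregate bar,
    # since no single LLM output is guaranteed. `summarize` stands in for the
    # model call, `score` for a grader (human or LLM judge); 90% is arbitrary.
    def eval_summarizer(cases, summarize, score, bar=0.9):
        scores = [score(c["doc"], summarize(c["doc"])) for c in cases]
        pass_rate = sum(s >= 4 for s in scores) / len(scores)
        assert pass_rate >= bar, f"pass rate {pass_rate:.0%} below {bar:.0%}"

The second kind only tells you anything if you keep running it on fresh, representative cases, which is why it bleeds into production monitoring.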

Microsoft: What? You want us to eat this slop? Are you crazy?!

50% of our code is being written by AI! Or at least, autocompleted by AI. And then our developers have to fix 50% of THAT code so that it does what they actually wanted it to do in the first place. But boy, it sure produces a lot of words!


