
The full-session evaluation framing is the right call - most teams don't realize the failure happened in turn 2 until they've spent 3 hours blaming the model. One thing worth thinking about as you grow: connecting caught regressions to production conversation data. When your simulation flags a new failure mode, being able to say "this pattern has already surfaced X times in prod this week" cuts the prioritization debate in half. Does Cekura currently let you correlate simulation failures back to real user sessions, or is that still a manual step?


We track failure modes directly in production instead of relying on simulation. If a failure mode suddenly starts showing up too often, we can alert on it promptly. My worry with going from simulation to monitoring is that the feedback would be delayed.

Tracking in production also lets you run simulations by replaying those production conversations, so you can confirm that regressions are actually handled.
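
To make the "alert when a failure mode pops up too often" idea concrete, here is a minimal sketch of a sliding-window counter. It is not Cekura's implementation; the class name `FailureModeMonitor`, the `window` and `max_count` parameters, and the label `"interrupted_user"` are all hypothetical, and it assumes each production session has already been tagged with at most one failure-mode label.

```python
from collections import deque


class FailureModeMonitor:
    """Hypothetical sketch: counts failure-mode labels over the last `window` sessions."""

    def __init__(self, window=100, max_count=5):
        # One entry per production session; None means the session had no failure.
        self.recent = deque(maxlen=window)
        self.max_count = max_count

    def record(self, failure_mode=None):
        """Record one session. Returns True if this session's failure mode
        has now appeared at least `max_count` times within the window."""
        self.recent.append(failure_mode)
        if failure_mode is None:
            return False
        occurrences = sum(1 for mode in self.recent if mode == failure_mode)
        return occurrences >= self.max_count


monitor = FailureModeMonitor(window=10, max_count=3)
alerts = [monitor.record(mode) for mode in
          ["interrupted_user", None, "interrupted_user", "interrupted_user"]]
```

In this run only the last call returns True, because that is when the third "interrupted_user" session lands inside the window. The sessions that triggered the alert are the natural candidates to replay as regression simulations.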



