Hacker News | kenmu's comments

Love your idea. We have timeout mechanisms, and originally we were pretty aggressive with timeouts based on both time and response length to balance accuracy and speed. There’s research that longer responses tend to be less accurate (compared to other responses to the same prompt), so we built an algorithm that optimized this trade-off very effectively. However, we eventually removed the mechanism to avoid losing any accuracy or comprehensiveness. We have other systems, including confidence scoring, that are pretty effective at judging long responses and weighting them accordingly.
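Sup's internals aren't public, so purely to illustrate the idea (every name and constant below is invented), a length-aware confidence weighting heuristic might look something like:

```python
# Toy sketch: down-weight longer responses when picking among candidates,
# reflecting the observation that longer answers to the same prompt tend
# to be less accurate. All names and constants are hypothetical.

def length_penalty(n_tokens: int, pivot: int = 800) -> float:
    """Shrink weight smoothly as a response grows past `pivot` tokens."""
    return 1.0 / (1.0 + max(0, n_tokens - pivot) / pivot)

def weighted_score(confidence: float, n_tokens: int) -> float:
    """Combine a model's confidence score with the length penalty."""
    return confidence * length_penalty(n_tokens)

responses = [
    {"text": "short answer", "confidence": 0.7, "tokens": 120},
    {"text": "long answer", "confidence": 0.8, "tokens": 2400},
]
best = max(responses, key=lambda r: weighted_score(r["confidence"], r["tokens"]))
```

Here the 2400-token answer loses despite its higher raw confidence, which is the kind of accuracy loss that presumably motivated removing the mechanism.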

We may reintroduce some of the above with user-configurable levers.


As of right now, we do not. I'm working on these other benchmarks, but unfortunately they cost quite a bit of money to run, which I'm hoping will come from many people using Sup :)

I mentioned in another comment that I make sure the cost/time is within 1.25x of the next best single-model run. So it's not perfect, but I think that aspect will only get better with time.
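To make the 1.25x bound concrete (the dollar figures below are invented, not Sup's actual pricing):

```python
def within_budget(multi_model_cost: float, best_single_cost: float,
                  factor: float = 1.25) -> bool:
    """Check that a multi-model run stays within `factor` of the cost
    of the next best single-model run."""
    return multi_model_cost <= factor * best_single_cost

# e.g. if the best single-model run costs $0.40, a $0.48 multi-model
# run is within budget, while a $0.52 run is not.
```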

Of course I'm biased, but using Sup has been great for me personally. Even disregarding the HLE score, having many different perspectives in the answers, and most importantly the combined answer, has been very helpful in feedback for architectural decisions I make for Sup, and many other questions I would normally ask ChatGPT/Gemini/Claude/Grok individually.


Love the idea of learning the skill of vibe coding. The subtitle of the book, "Idea → Prompt → AI → Edit → Ship" doesn't quite go with the premise, "There is no magic prompt that turns your idea into a finished product overnight."

Ooof and thank you. It didn’t even occur to me.

Well-written article. It does a great job walking through why any robust system will need what DSPy provides, though many libraries and frameworks cover the basics: RAG, exponential back-off, etc.
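For reference, the exponential back-off those frameworks bundle is only a few lines on its own; a minimal sketch (function and parameter names invented):

```python
import random
import time

def with_backoff(call, max_retries=5, base=0.5, cap=30.0):
    """Retry `call` with exponential back-off plus jitter.

    Sleeps roughly base * 2**attempt seconds (capped at `cap`) before
    each retry, and re-raises the error after the final attempt.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The jitter matters in practice: without it, many clients that failed together retry together and keep hammering the same overloaded endpoint.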

DSPy's real value is in its prompt-optimization framework, which the article barely mentions. And that comes with prerequisites, like datasets and well-defined tasks, which not every project has. This is probably the main reason its user base is smaller (and happier) than that of projects like LangChain.


Is their scaffold available? Does it do anything special beyond feeding the warmup, single challenge, and full problem to an LLM? Because it's interesting that GPT-5.2 Pro, arguably the best model until a few months ago, couldn't even solve the warmup. And now every frontier model can solve the full problem. Even the non-Pro GPT-5.4. Also strange that Gemini 3 Deep Think couldn't solve it, whereas Gemini 3.1 Pro could. I read that Deep Think is based on 3.1 Pro. Is that correct?

I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt, whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models unable to solve it without that help?

Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.

I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
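For problems with short, checkable answers, the crudest way to combine efforts is a cross-model majority vote. A toy sketch of that idea (not how any of these labs actually aggregate):

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common normalized answer across models.

    Ties go to whichever answer appeared first, since Counter preserves
    insertion order and most_common's sort is stable. Real aggregation
    would be far more careful about normalization and confidence.
    """
    normalized = [a.strip().lower() for a in answers]
    top, _count = Counter(normalized).most_common(1)[0]
    return top

# e.g. majority_answer(["42", " 42 ", "41"]) picks "42"
```

Plurality voting only helps when models fail independently, which is exactly the open question with frontier models trained on overlapping data.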

