Hacker News | shanev's comments

Congrats on the launch! It would be useful to have a matrix somewhere showing how this compares to Jina, Firecrawl, etc.



The GPT models, in my experience, have been much better for backend work than the Claude models. They're much slower, but they produce clearer logic and more maintainable code. A pattern I use: set up a GitHub issue with Claude in plan mode, then have Codex execute it, then come back to Claude to run custom code-review plugins. Then, of course, review it with my own eyes before merging the PR.

My only gripe is I wish they'd publish Codex CLI updates to Homebrew at the same time as npm :)


GPT-5 was the first model that occasionally produced code I could push without any changes.

Claude still tends to add "fluff" around the solution and over-engineer. Not that the code doesn't work; it's just ugly.


Interesting; I have consistently found that Codex does much better code reviews than Claude. Claude will occasionally find real issues, but will frequently bikeshed things I don't care about. Codex always finds things that I do actually care about and that clearly need fixing.


I'd have agreed with you until Opus 4.5.


Eh, Sonnet 4.5 was better at Rust for me.


I built a little Chrome extension that shows the flag for where the account is based, right on the profile page itself: https://grokify.app.


This is solvable at the level of an individual developer. Write your own benchmark from code problems you've already solved. Verify that tests pass and that it satisfies your metrics, like tok/s and TTFT. Create a harness that works with API keys or local models (if you're going that route).
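A minimal sketch of such a harness in Python, with a stand-in generator in place of a real backend (the `run_case`, `fake_model`, and `Result` names are all hypothetical, not from any library):

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Result:
    passed: bool
    ttft_s: float     # time to first token
    tok_per_s: float  # tokens per second over the whole response

def run_case(generate: Callable[[str], Iterable[str]],
             prompt: str,
             check: Callable[[str], bool]) -> Result:
    # Stream tokens from any backend (hosted API or local model),
    # timing TTFT and throughput, then score the output with a
    # task-specific check (e.g. "do the unit tests pass?").
    start = time.perf_counter()
    first = None
    tokens: list[str] = []
    for tok in generate(prompt):
        if first is None:
            first = time.perf_counter()
        tokens.append(tok)
    elapsed = time.perf_counter() - start
    text = "".join(tokens)
    ttft = (first - start) if first is not None else float("inf")
    tps = len(tokens) / elapsed if tokens and elapsed > 0 else 0.0
    return Result(passed=check(text), ttft_s=ttft, tok_per_s=tps)

# Stand-in "model" for illustration; swap in a real API or local backend.
def fake_model(prompt: str):
    for word in "def add(a, b): return a + b".split(" "):
        yield word + " "

r = run_case(fake_model, "Write add()", check=lambda t: "return a + b" in t)
print(r.passed)
```

Because the check is just a callable on the output text, "verify tests pass" can be plugged in by having the check write the generated code to a temp file and run your test suite against it.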


At the developer level all my LLM use is in the context of agentic wrappers, so my benchmark is fairly trivial:

Configure aider or Claude Code to use the new model and try to do some work. The benchmark is pass/fail: if after a little while I feel the performance is better than the last model I was using, it's a pass; otherwise it's a fail and I go back.

Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.


> Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.

I'd really caution against this approach, mainly because humans suck at removing emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see 77/100 vs. 91/100 as a percentage score over your own tasks, the ones you actually use LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
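As a toy illustration of why comparable numbers help (the model names and results below are made up; assume each task is scored pass/fail):

```python
# Hypothetical private eval: run each model over your own task set and
# turn per-task pass/fail results into comparable percentage scores.
def score(model_results: dict[str, list[bool]]) -> dict[str, float]:
    """Map model name -> list of per-task pass/fail into a 0-100 score."""
    return {m: 100 * sum(r) / len(r) for m, r in model_results.items()}

results = {
    "model-a": [True, True, False, True],   # 3/4 tasks passed
    "model-b": [True, False, False, True],  # 2/4 tasks passed
}
print(score(results))  # {'model-a': 75.0, 'model-b': 50.0}
```

The point is only that a fixed task set gives you a number you can track across releases, instead of a vibe.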


So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

At this point, anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.


> So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

Yeah, no, you're right: if consistency isn't important to you as a human, then it doesn't matter. Personally, I don't trust my "humanness", and correctness is the most important thing for me when working with LLMs, so that's what my benchmarks focus on.

> At this point anyone using these LLMs every day have seen those benchmark numbers go up without an appreciable improvement in the day to day experience.

Yes, this is exactly my point. The benchmarks from the makers of these LLMs always seem to show better and better scores, yet the top scores in my own benchmarks have been more or less the same for the last 1.5 years, and I try every LLM I come across. The "best LLM to date!" hardly ever actually is the best available LLM, and while you could reach that judgment just by playing around with LLMs, actually being able to point to specifically why is something I, at least, find useful. YMMV.


I think that's what this site is doing: https://aistupidlevel.info/


Well, OpenAI's GitHub is open for writing evaluations. Just add yours there, and it's guaranteed the next model will perform better on them.


We have to keep in mind that "solving" might just mean the LLM recognizing the pattern of a solution.


That's called evals, and yes, any serious AI project uses them.


Close. It won because of GitHub. Git was gaining on SVN slowly, but it was GitHub that really propelled it into widespread use.


DAOs are fixing this in the crypto world. You contribute to the protocol and get paid by the DAO. Everything is transparent and open. If you do this enough and earn the respect of the dev team it could even turn into a full-time role.


> He worked hard to enable software reuse. No one was interested in his idea of trying to monitor component use during runtime to pay developers

People are experimenting with doing this in blockchain smart contracts. It's transparent and supports micropayments as well.


Yupp! I made a little toy project for the EVM over a year ago with this exact concept, but never really did anything with it, sadly; life seems to find ways to get busy. Since you already need to send 'gas' to make function calls, it was a natural fit to add a call that sends a small portion of the value to an address before returning the computation's result.

I really loved the idea of being able to create libraries of code that could be called for a small fee, or copied for free if one didn't have the funds. I hope this idea continues to catch on; it seemed to me a perfect incentive fit for the open source world.
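For illustration only, the pay-per-call idea can be restated as plain Python rather than an actual EVM contract (the account names, fee, and `paid_call` helper are all made up for this sketch):

```python
# Hypothetical sketch of a pay-per-call library: each invocation moves a
# small fee from the caller to the library author before running the code.
from functools import wraps

balances = {"caller": 100, "lib_author": 0}  # stand-in for on-chain accounts

def paid_call(author: str, fee: int):
    """Deduct a per-call fee from the caller and credit the author."""
    def deco(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if balances["caller"] < fee:
                raise RuntimeError("insufficient funds for call fee")
            balances["caller"] -= fee
            balances[author] += fee
            return fn(*args, **kwargs)
        return wrapper
    return deco

@paid_call(author="lib_author", fee=1)
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10), balances["lib_author"])  # prints: 55 1
```

On the EVM, the transfer would be a real value send inside the function, but the shape is the same: the fee is collected as a side effect of the call itself, so usage and payment can't drift apart.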


I was mulling over this exact issue the other day but couldn't find the words to express it. Thank you for this.


I thought it was about Resque, the Ruby queuing library, switching from Redis to Kafka.

