Hacker News | lanthissa's comments

In NYC, car-based transportation is often slower than the subway, but everyone's so narrow-minded they just think about their own life. If you're old in NYC, cabs/Ubers/Waymo are a big deal; without them you're stuck walking to a bus stop or subway station, and that gets hard in your 70s and 80s.

did they not pay them enough to get good ratings on the other 3 models?

What's the logic in claiming it's a borked metric when everything listed is an Anthropic model?


There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge, and the competition is also way below it.

The market cap represents the cash flow the market estimates will be taken out of the business over the lifetime of the company, discounted to today.
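To make that concrete, here's a minimal discounted-cash-flow sketch. The function name `present_value`, the cash flows, and the 10% discount rate are all made up for illustration; real valuations use far longer horizons and contested assumptions.

```python
def present_value(cash_flows, discount_rate):
    """Sum of yearly cash flows, each discounted back to today.

    Assumes cash_flows[0] arrives one year from now.
    """
    return sum(
        cf / (1 + discount_rate) ** year
        for year, cf in enumerate(cash_flows, start=1)
    )


# Hypothetical numbers: 100 per year for 3 years at a 10% discount rate.
value = present_value([100, 100, 100], 0.10)
print(round(value, 2))  # 248.69
```

The 300 of nominal cash is worth less than 300 today, and the gap grows with the discount rate and with how far out the cash arrives.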

your suggestion makes no sense


opus to produce workflows, flash 3.5 to do them.

Chinese models probably work too, but I don't know since I can't use them at work.


Yeah, where I work the cap has been $150 a week, with a pretty generous override if you ask.

People self-limit when there are caps. If you give people unlimited usage, they won't even use Sonnet for easy things.


Yes? The future for any verifiable task is that the model attempts to verify an initial state and a goal, then decomposes its task into ever smaller verifiable subtasks, with /memory being the persistence between runs and /dreaming over those memory files plus run data to introduce new ideas.

I think that's the path to async AGI these labs are imagining. The only limits are the sensor data you have on the world or your system, how long you're willing to wait, and how much you're willing to spend to parallelize it.

Maybe once you start building out these verified workflows you can feed that back into training, and the model starts to get a feel for the world to the point that it can intuit things, since it has these subpaths built.
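The decompose-and-verify loop described above can be sketched roughly like this. Everything here is a toy assumption: `attempt`, `verify`, `split`, and `combine` stand in for model calls and real checkers, and `memory` is just a dict playing the role of persistence between runs (no /dreaming step is modeled).

```python
def solve(task, attempt, verify, split, combine, memory, depth=0, max_depth=8):
    """Try a task; if the result fails verification, split it and recurse."""
    if task in memory:                       # verified in an earlier run
        return memory[task]
    result = attempt(task)
    if not verify(task, result):
        if depth >= max_depth:
            raise RuntimeError(f"could not verify {task!r}")
        parts = [
            solve(sub, attempt, verify, split, combine, memory, depth + 1, max_depth)
            for sub in split(task)
        ]
        result = combine(parts)
        if not verify(task, result):
            raise RuntimeError(f"combined result failed for {task!r}")
    memory[task] = result                    # only verified results are stored
    return result


# Toy stand-ins: the "model" is only reliable on inputs of length <= 2.
def attempt(nums):
    return sum(nums) if len(nums) <= 2 else -1

def verify(nums, result):
    return result == sum(nums)

def split(nums):
    mid = len(nums) // 2
    return [nums[:mid], nums[mid:]]

memory = {}
total = solve((1, 2, 3, 4, 5), attempt, verify, split, sum, memory)
print(total)  # 15
```

After the run, `memory` holds every verified subtask, so a second run on an overlapping task skips the work it has already checked.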

My personal AGI test: can a model trained on video of someone knocking on a door and then opening it encounter a microwave for the first time and open it when the food's done, without knocking?


You ought to include a canary string if you are going to disclose your evals like that!

I used to use Opus for everything, but that's not an option once you move to a multi-agent system unless you're working on, like, high-end research. I could easily spend 3k a day if I was using Opus as just a normal dev.

As we build a better and better harness and better feedback/verifiers, we're switching more to 3.5 Flash. I think Chinese models would work too, but we can't use those at the moment.

Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and send feedback to the coordinator.
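A bare-bones sketch of that shape, assuming nothing about our actual setup: the class names, the `"strong-model"` / `"cheap-model"` labels, and the lambda "skills" are all placeholders for real model calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Subagent:
    skill: str
    model: str                     # placeholder label for a cheaper model
    run: Callable[[str], str]      # stand-in for the real model call


@dataclass
class Coordinator:
    model: str                     # placeholder label for the stronger model
    subagents: dict = field(default_factory=dict)
    feedback: list = field(default_factory=list)

    def register(self, agent):
        self.subagents[agent.skill] = agent

    def dispatch(self, skill, payload):
        """Send work to the subagent for this skill and record its feedback."""
        agent = self.subagents[skill]
        output = agent.run(payload)
        self.feedback.append((skill, agent.model, output))
        return output


coordinator = Coordinator(model="strong-model")
coordinator.register(Subagent("lint", "cheap-model", lambda code: f"linted:{code}"))
coordinator.register(Subagent("test", "cheap-model", lambda code: f"tested:{code}"))

for skill in ("lint", "test"):
    coordinator.dispatch(skill, "patch-1")
```

The expensive model only sees the collected `feedback`; the token-heavy work happens in the cheaper subagents.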

I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.


This is the finance team doing a fantastic job. Keep in mind they're raising this cash right before 3 major IPOs in their sector, which people will need to raise money for and which will fight against them in the narrative.

If I was a Google CFO and was trading at a premium to my peers before that, I'd want to raise the cash now. Look at MSFT: they're trading at 25 forward P/E and were buying back shares at 40. If they have to issue equity over the next few years, the spread between the performance of the 2 CFOs could be 40-50b on that alone.


Just as a Google shareholder: this company bought back shares hand over fist at a low P/E for a few years, issued 100-year debt at low rates, and is selling equity when it's at a premium to its peers, right before 2-3 major IPOs of competitors put selling pressure on the stock for a while.

I don't know who's going to win the LLM battle, but Google's finance team has been doing their job fantastically.


Flash 3.5 is the best price/performance model for what I'm doing. I had been using Opus for everything, but as we started running many agents at once, and then eventually agents managing subagents, frontier is not an option.

We started testing the cost/performance of each model on our skills and agents, and Flash 3.5 wins in most things.
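One simple way to frame that kind of test, as a sketch: for each skill, pick the cheapest model that clears a pass-rate bar. The model names, pass counts, and costs below are invented for illustration, and `pick_model` is a hypothetical helper, not something from our harness.

```python
def pick_model(results, min_pass_rate=0.9):
    """Return the cheapest model whose pass rate meets the bar, else None.

    results maps model name -> (passed, total, cost_per_run).
    """
    good_enough = [
        (cost, name)
        for name, (passed, total, cost) in results.items()
        if total and passed / total >= min_pass_rate
    ]
    return min(good_enough)[1] if good_enough else None


# Made-up eval results for one skill: (passed, total, cost per run in dollars).
results = {
    "frontier-model": (49, 50, 0.40),
    "mid-model": (46, 50, 0.05),
    "small-model": (38, 50, 0.01),
}
choice = pick_model(results)
print(choice)  # mid-model
```

Raising `min_pass_rate` pushes the choice back toward the frontier model, which is the tradeoff better verifiers let you relax.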

As people develop harnesses for their codebases, I think the intelligence required comes down a lot.

