Hacker News | gck1's comments

I think pi wants you to write your own extensions, adapted to your needs.

I haven't had a need for any extensions though. Maybe subagents, but I solved that with tmux. For all the rest, I just use "skills".


I gave a fairly complex reverse engineering task to DS-4 xhigh and GPT-5.5 xhigh today.

After about 6 hours, both ultimately failed to fully RE it. However, there were some drastic differences:

DS stopped every 30 minutes or so, saying it did full RE and it should all work now, while in fact, it didn't complete even 1% of it. It also looked for shortcuts again and again, despite me prompting heavily that the specific shortcut may not be used. It was a complete and utter failure.

GPT-5.5, on the other hand, blew me away. It just did the right things, didn't jump to next steps until it was sure it completed the initial layers and had a full understanding of what's required. The only time I prompted it during the 6 hours was when I saw it going in the right direction and I could nudge it slightly towards an even better way. I never felt I was fighting it. Okay, maybe a little bit - after compaction, it sometimes would go on a "no I'm not helping you with reverse engineering" tangent, but it would resolve in a clean session.

I cancelled my Claude subscription a month ago, so I haven't tested that, but DeepSeek has reminded me a lot of how I worked with Opus 4.6/4.7. Which perhaps could be a positive sign to some, but GPT-5.5 showed me that the way claude/ds work is just way too annoying.


The GPT models are heavily biased toward a more incremental, empirical, evidence-based approach. Sometimes to a fault. I prefer them for this reason, but it requires coaxing or strategic use of /goal to break it out of its highly staged, one-piece-at-a-time approach, if you don't like it.

I suspect for people doing more... website ... type development, the more "yeet this into existence" style of Opus feels preferable.

With Claude I was constantly jamming my finger on the escape key "wait, you did what?! based on what proof?!"


You make it sound as if Codex is for people who know what they want and Claude Code is for people who don’t know what they’re doing.

I was trying to not sound that biased, but ok ;-)

> DS stopped every 30 minutes or so, saying it did full RE and it should all work now, while in fact, it didn't complete even 1% of it. It also looked for shortcuts again and again, despite me prompting heavily that the specific shortcut may not be used. It was a complete and utter failure.

This is my experience with non-SOTA models across the board. When you try them on little tasks and they work it feels amazing, but then you go deeper and you're back to going in loops and fighting the model for hours.

Switching back to a SOTA model immediately yields progress again.

When I read all of the comments from people saying they can't tell a difference between Opus and <insert open weight model here> I don't know if they haven't really used it much yet, or if they're just not doing anything complicated.


Did you read the OP, where he's chiding exactly the model you're glazing?

Did you intentionally miss the point of my comment? Substitute Opus for GPT-5.5 if you will. I use both as well as locally hosted models using some of your branches, even.

Fair enough. I agree with you - although DS4 Pro is a GPT 5 class model which scores 46% on ARC-AGI-2[^1]. It's behind by maybe 9 months, I think it's still good enough for a lot of complex tasks as well. They definitely need to work on a "just fucking works" harness like CC/Codex. Also thanks!

[^1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...


What you’re experiencing is the difference in model intelligence. Most models can seem pretty good at simple stuff over short time horizons. Complex work requires that more intelligence be stuffed into those trillion-dimensional spaces.

I'm from one of the countries on the list. Not only is there no way to legally immigrate to the US anymore, but just visiting the US once requires me to give an interest-free loan of up to $15k to the US government. Yeah, no, thank you.

I never considered illegal immigration, nor will I ever - I value predictable outcomes.

But looking at these new rules, I can't help but think that it really punishes people who want to play by the rules and sets the price for ones that don't to approximately $15k.


My country is not on the list (Mexico, not that we need to be... Americans hate us), but I just cannot comprehend why people would go through all the pain of the immigration process in the US.

Actually, it kind of makes sense why only the most desperate try to get into the US: people who have something to lose are naturally repelled by the bureaucracy.


The average American, at least in my state (Washington), does not hate Mexicans. The people running the federal government seem to, however.

We love to paint the US in broad brushstrokes of color, but it's more of a muddy brown across the entire country. Washington State doesn't have huge expat communities of Mexicans, but what if I'm Chinese going to school in Spokane? Or Somali in St Paul, MN? Or Pakistani in Chicago? Some "average Americans" seem to hate these people in every locale.

EDIT: Wash. is actually a top 8 destination in the US for Mexican immigrants, with an estimated population of 250-300K people, so not huge but definitely sizeable!


I dunno. The southern parts of SW WA can be pretty racist (Lewis County and south). Rural, much more red, but without the extensive farming more pervasive on the east side.

Yeah, I'm brown. I'm not even going to visit the US while the turd reich is in power. I had always wanted to visit the Smithsonian.

>Americans hate us

I don't mean to minimize any negative experiences you've had. But as a lifelong American, I've never heard anyone make a negative comment about Mexicans. Even in online spaces like X where there is a lot of racism, it's usually not directed at Mexicans.

If you look at Trump's famous comment about Mexicans in his speech from 2015, he actually points to Mexicans in the audience and refers to them as Mexico's best people. The media cut that part out, of course. (I'm not a Trump supporter.) https://www.youtube.com/watch?v=apjNfkysjbM#t=3m25s


Here is the actual quote you are defending:

    TRUMP: "When Mexico sends its people, they're not sending their best -- they're not sending you [points at unidentified people off-camera] -- they're not sending you -- they're sending people that have lots of problems, and they're bringing those problems with us. They're bringing drugs; they're bringing crime; they're rapists and some, I assume, are good people."

There is no apparent indication in that video that the people he's pointing at are Mexican.

Among other issues, countries are not generally 'sending people' to immigrate to other countries. Most countries are in general keen to avoid emigration.

>There is no apparent indication in that video that the people he's pointing at are Mexican.

But it seems like the most natural interpretation.

The most obvious, good-faith interpretation of this quote is that Mexico has a mix of good and bad people, like every country, and the ones immigrating illegally tend to be bad. The fact that the media didn't even consider this common-sense interpretation contributed a lot to Trump's popularity.


> But as a lifelong American, I've never heard anyone make a negative comment about Mexicans.

What an absurd statement on its face that comes from a place of extreme privilege. I am a brown-skinned man in America and I lost count of all of the times people that look like me have been denigrated and lambasted in this country.


>What an absurd statement on its face that comes from a place of extreme privilege.

Oh boy, here we go again. I even said "I don't mean to minimize any negative experiences you've had" and I'm still getting the privilege discourse. You are really determined to prevent the Democrats from winning elections, aren't you?

Even assuming I am privileged, then what I'm telling you is that privileged white people like myself aren't shit-talking Mexicans behind their back. Wouldn't that be relevant information? Why would it be an absurd statement?

>brown-skinned

That's not the same as Mexican.

When was the most recent time this happened to you in person? A recent, representative concrete example would be a lot more compelling than performative outrage.

It can't be online. White people catch lots of crap online too: https://pbs.twimg.com/media/G88oMOwWAAIqBUp?format=jpg&name=...


I've always thought I'd end up in the US at some point, but as someone who prefers to make things rather than spend years at some faceless megacorp (writing up cover sheets for TPS reports), it never seemed hugely viable, even starting out from a first-world country.

Now it doesn't seem viable at all. Meanwhile, anyone who shows up illegally is merely "undocumented", and half of US politics consists in coddling them (the other half in enforcing existing immigration laws capriciously). Even for someone who's quite pro-immigration like myself, that's just bizarre. There's no way this is a functional system.

Most of the people in my circles don't want to go to the US anymore. I suppose I'll ride it out and see what comes next (after 2028 at minimum). If I ever make it, I'll have spent many of my productive years outside the US, since I wasn't welcome during those. Weird system.


The next time someone says "stop crying about usage limits, they're losing money on your subscription", I'm going to link to this comment.

Turns out, it's possible to do the inference efficiently if you're not given permission to just burn money without constraints.


> haven't been disclosed, because they're not fixed.

That's convenient.

But wait, don't they have this amazing AI that can fix all the issues itself with a single /goal command? What's the holdup?


You should really read the article, every question asked so far in this thread has been very clearly answered.

I miss the days when HN would RTFA.


He doesn’t want to read the article. He just wants to LLM bad.

From the article

> As we noted above, the bottleneck in fixing bugs like these is the human capacity to triage, report, and design and deploy patches for them.

...

> To begin, we’ve released Claude Security in public beta for Claude Enterprise customers. It’s a tool that helps teams scan their codebases for vulnerabilities, and which can generate proposed fixes for them. In the three weeks since launch, Claude Opus 4.7 has been used to patch over 2,100 vulnerabilities. (This is faster than the open-source patching described above in large part because enterprises are fixing their own code, whereas open-source fixes usually require volunteer maintainers who work through coordinated disclosure.)

Your critique of the article would likely land much better if you engaged with it.


It's the same as the origin of "Codex/Opus subscription usage is heavily subsidized" - the sales departments equipped with AI agents with the prompt: "use anonymous accounts on the internet to make it easy for me to sell it at $price".

> This is why I believe mythos will remain private for the foreseeable future. There's such a large surface that needs to be secured and so much to triage, fix, deploy.

sigh I remember the GPT-2 days - when it was the first time OpenAI restricted access to the models citing "humanity is not ready for it". The model was good at writing poetry or something.

Since then, I don't remember a single model announcement from OAI/ANT that didn't use similar wording.

The so-called leak of the model announcement was marketing, it being dangerous is marketing, the world not being ready for it is marketing. And yes, the people who were given access saying "oh wow" is, believe it or not, also marketing.

It's all marketing. You can get the same results from any of the top-5/10 models that are generally available already.

Mythos is Anthropic's way to sell the new idea, because the previous one has been democratized.


Writing "marketing" 10 times doesn't invalidate the (many) claims from many respectable sources that the model is a step change in cybersec. There's also the report [1] from the Brits, who have tracked cyber capabilities since '22 or '23, and they've also confirmed it's a step change (together with 5.5 cyber or whatever they call it).

Marketing is like propaganda. It doesn't need to be based on false facts. Of course they're gonna milk it, keep it private and so on. But that doesn't mean the model is bad. Or that others are as good (apparently they're not there yet).

[1] - https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...


Please don't misrepresent the article. It clearly says "a step up in cyber performance over previous frontier models" and that gpt-5.5 is, on their tests, slightly better than mythos.

Scroll to the graph labeled "Completed steps..."

If that doesn't convince you that both mythos and 5.5 are a step up (several steps, hah) nothing will.


I think you just aren't reading the post, or any of the Glasswing partners' posts. You have this view in your head of what Mythos is, and nobody can say anything to dissuade you from it.

"Partners" is the important word in your comment. I am reading all of it, but I have a huge barrel of salt to consume along with everything that I read, because I see conflicts of interest everywhere I go, with fancy words and no means to verify.

If I was given free access to any frontier model to use on my projects, equivalent of millions of dollars in AI credits, I sure hope people didn't trust anything that came out of my mouth until they were able to verify my claims themselves.

The AI industry has even produced a new term - benchmaxing - which essentially means we can't even trust the data anymore until we can touch the model ourselves. So this is not at all surprising to me. What's surprising is that I'm in the minority here, and that trusting authorities with obvious conflicts of interest has become normal.


I don't think Firefox or The Linux Foundation have conflicts of interest here. They've said in their contracts that they get the tokens irrespective of what they say about Mythos. Additionally, the findings speak for themselves.

This just seems overly conspiratorial to me. I don't remember Anthropic ever lying in their blog posts. They've been about as consistent as Apple when it comes to product claims.


It’s still not clear to me that humanity was ready for GPT-2! Quite a lot of people claim to hate and fear LLMs. https://www.kcl.ac.uk/news/one-in-five-britons-think-ai-will... or https://yougov.com/en-us/articles/54762-most-americans-say-a... for example.

Agreed, also amazing citations in the parent comment ^^

It's also possible that this will become (or already has) a 'standard' writing style of humans. A style has to come from somewhere, and it would be most influenced by the style that you consume often.

People read this style for every single question that they ask nowadays.

Who knows, maybe watercooler chats will also start having a higher occurrence of the phrase "You're absolutely right" in the not so distant future.


I sit next to my 4U server with all enterprise components apart from fans - these are consumer grade.

I had to mod the chassis slightly (with just pliers, tape and random inserts) to fit these fans in there, and add fans in front to push the air in. The PSU that came with it was obnoxiously loud, but thankfully, Supermicro has a quiet version that I can't even hear. Even if SM didn't have this PSU, I could have easily modified the PSU and fit some noctuas in there without any issue or safety concerns - like I did with my enterprise grade Mikrotik switch that also had obnoxious fans by default.

I even have an enterprise grade UPS that is dead silent when it's not running on battery power (I swapped the fans there too).

I essentially try to buy enterprise gear whenever possible. Not only is it usually much better than the consumer alternative, but it is also frequently much cheaper because of the second-hand market. Before AI sucked the soul out of the hardware market in general, you could have bought enterprise SSDs that had life expectancy - TBW - measured in petabytes, and MTBF - practically never - for half the price of the top consumer SSD that had TBW measured in tens of TB and MTBF of yesterday.

And the entire rack is just slightly louder than the PC I was using.

The only consumer grade computers at my home are my MacBook and my phone.


Enterprise SSDs are all that. Just make sure you power it up. For data retention without power the requirements are 3mo for enterprise vs 1yr for consumer grade.


As someone who went full circle prompt-enforcement > deterministic flow > prompt-enforcement, I disagree.

The reason why "DO NOT SKIP" fails is because your agent is responsible for too many things and there's things in context that are taking away the attention from this guidance.

But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision-making logic in your deterministic control flow, you either make it too rigid to work well, or you make it so complex that, at that point, you might as well just use the agent; it will be cheaper to set up and maintain.

You essentially need 3 base agents:

- Supervisor that manages the loop and kicks right things into gear if things break down

- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate

- Workers that execute units of work. These may take many shapes.
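A minimal sketch of how those three roles could fit together, under loud assumptions: an "agent" here is just a function from a prompt to text, and every name (`Orchestrator`, `supervise`, `GuardrailViolation`) is hypothetical and not taken from any real framework. It only illustrates the split of responsibilities, not a production loop.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: an agent is any callable that takes a task prompt and returns text.
Agent = Callable[[str], str]
# Assumption: a guardrail is a predicate over a worker's output.
Guardrail = Callable[[str], bool]


class GuardrailViolation(Exception):
    """Raised when a worker's output fails one of the orchestrator's checks."""


@dataclass
class Orchestrator:
    """Delegates a unit of work to the matching worker and enforces guardrails."""
    workers: dict[str, Agent]
    guardrails: list[Guardrail] = field(default_factory=list)

    def run(self, kind: str, task: str) -> str:
        result = self.workers[kind](task)
        if not all(check(result) for check in self.guardrails):
            raise GuardrailViolation(f"output of {kind!r} worker was rejected")
        return result


def supervise(orchestrator: Orchestrator,
              plan: list[tuple[str, str]],
              max_retries: int = 2) -> list[str]:
    """Supervisor: walks the plan and re-kicks a step when a guardrail trips."""
    results = []
    for kind, task in plan:
        for attempt in range(max_retries + 1):
            try:
                results.append(orchestrator.run(kind, task))
                break
            except GuardrailViolation:
                if attempt == max_retries:
                    raise  # give up and surface the failure to a human
    return results
```

In a real setup the workers would wrap model calls and the guardrails could themselves be small agents; plain functions keep the sketch runnable.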


Exactly, just keep adding more agents


I can't tell if this is satire or not. Well done!


It's a Heisenberg satire, because more agents going wild is indeed horrible, but agents restricting and counterbalancing each other can be useful (token cost ignored!).


I think the key question is: How can you be sure the supervisor/orchestrator agents are reliable? You are just pushing the complexity down into another layer.


You can't be sure, but the point is you can be more sure, since agent 2 ("agent" is really just a fancy way of saying some code that calls Anthropic's API) has only the context to look for a violation of a single rule.

