Meanwhile I’d like a reverse filter, like don’t show me anything posted by people under 25.
And I’m sure lots of young people would love to filter out anything posted by people over 40.
Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed the Opus autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe.
Any static benchmark older than 12-18 months is basically worthless, because the content will have spread all over the internet and have found its way into the latest model's training set.
I did some study on Verified, not Pro, but Mythos's number there raises a lot of questions on my end.
If you look at the SWE-bench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter to all models after Sonnet 4, and aggregate ALL models' submissions across the 500 problems, what I found is that the aggregated resolution rate is 93% (sharp).
Mythos gets 93.7%, meaning it solves problems that no other model could ever solve. I took a look at those problems and became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve the issues without looking at the test patch ahead of time, because the solution deviates so drastically from the problem statement that it almost feels like it is solving a different problem.
Not that I am saying Mythos is cheating, but it might be capable enough of remembering all states of said repos that it can reverse-engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomenon of evaluation awareness. Otherwise I genuinely couldn't think of how it could be this precise in deciphering such unspecific problem statements.
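For anyone who wants to reproduce the aggregation described above, it boils down to taking the union of resolved instance IDs across all submissions. Here's a minimal sketch; the data layout and model names are invented for illustration, not the actual structure of the SWE-bench experiments repo:

```python
# Rough sketch of the union-resolution-rate computation described above.
# `submissions` maps a model name to the set of benchmark instance IDs
# it resolved; the names and numbers here are toy data, not real results.

def aggregate_resolution_rate(submissions, total_problems):
    """Fraction of problems resolved by at least one submission."""
    solved_by_any = set()
    for resolved_ids in submissions.values():
        solved_by_any |= resolved_ids
    return len(solved_by_any) / total_problems

# Toy example: three hypothetical submissions over a 4-problem benchmark.
toy = {
    "model-a": {"p1", "p2"},
    "model-b": {"p2", "p3"},
    "model-c": {"p1"},
}
print(aggregate_resolution_rate(toy, total_problems=4))  # 0.75 -- p4 unsolved by all
```

Run that over every post-Sonnet-4 submission's resolved list with total_problems=500 and you get the ~93% ceiling I mentioned; the interesting set is the complement.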
Yep, I read this blog. What confuses me is that Anthropic doesn't seem to be bothered by this study and keeps publishing Verified results.
That is what got me curious in the first place. The fact that Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible-to-solve problems.
Without alleging cheating, which I don't think ANT is doing, it would have to be doing some fortune telling/future reading to score that high at all.
The "bug" discussed in the article is only part of the problem.
The main problem, which is that notification text is stored in a DB on the phone outside of Signal, is not addressed. To avoid that you have to change your settings.
In this case, the defendant had deleted the Signal app completely, which likely marks that app's notifications for deletion from the DB internally. The bug fixed here is that notifications were not being removed from the local database when the app that generated them was removed; now they are.
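To make the shape of the bug concrete (purely speculative; Signal and the OS notification store are not actually written like this, and every name below is invented):

```python
# Speculative sketch of the bug described above -- NOT actual Signal/OS code.
# A notification store keeps per-app notification text in a local DB; the bug
# was that removing an app did not purge that app's rows from the DB.

class NotificationStore:
    def __init__(self):
        self._db = {}  # app_id -> list of notification texts

    def add(self, app_id, text):
        self._db.setdefault(app_id, []).append(text)

    def on_app_removed(self, app_id, purge):
        # purge=False models the old behavior: rows linger after uninstall.
        # purge=True models the fixed behavior: rows deleted with the app.
        if purge:
            self._db.pop(app_id, None)

store = NotificationStore()
store.add("org.example.messenger", "hi there")
store.on_app_removed("org.example.messenger", purge=False)
print("org.example.messenger" in store._db)  # True -- the retained-text bug
store.on_app_removed("org.example.messenger", purge=True)
print("org.example.messenger" in store._db)  # False -- post-fix behavior
```

The point is just that "marked for deletion" and "actually deleted" were two different states, and uninstalling the app only triggered the first.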
Impact: Notifications marked for deletion could be unexpectedly retained on the device
Description: A logging issue was addressed with improved data redaction.
CVE-2026-28950
They classify this as a "logging issue", so it sounds like the notifications were not actually in the database itself but ended up in some log.
I'll speculate further: it could've been in the dismiss-notification code path; when you delete an app, the OS dismisses the removed app's notifications, triggering the same code path.
In this case, as per the reporting, the defendant removed the app. It's unclear if they first dismissed the notifications.
I find it a good surprise. It signals they probably don't want to become an advertising company, which is what every other tech business ended up becoming. And that isn't a stance against having ads as a core revenue driver, as companies in other sectors do.
And with cashback through GCP usage!