
A $10B insurance policy on Google's business sounds like a bargain?

And with cashback through GCP usage!


Meanwhile I’d like a reverse filter, like don’t show me anything posted by people under 25. And I’m sure lots of young people would love to filter out anything posted by people over 40.

Satya is 58, and adding his years at MS, the total is over 70; maybe he takes the buyout : )

                           Mythos   5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench, but quite comparable otherwise. Source for the Mythos values: https://www.anthropic.com/glasswing

Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed Opus's autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe.

They mentioned on their release page that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.

Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...


Any static benchmark older than 12-18 months is basically worthless, because the content will have spread all over the internet and found its way into the latest model's training set.

Good luck arguing with SWE benchmark purists

I did some study on Verified, not Pro, and the Mythos number there raises a lot of questions on my end.

If you look at the SWE-bench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter to all models after Sonnet 4, and aggregate ALL models' submissions across the 500 problems, what I found is that the aggregated resolution rate is 93% (sharp).

Mythos gets 93.7%, meaning it solves problems that no other model could ever solve. I took a look at those problems and became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve the issues without looking at the test patch ahead of time, because the solution deviates so drastically from the problem statement that it almost feels like it is solving a different problem.
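To make the aggregation above concrete, here is a toy sketch of the calculation: take the union of resolved instances across every submission and divide by the total problem count. The model names, instance IDs, and counts here are invented for illustration; the real data lives in the SWE-bench/experiments repo.

```python
# Each submission is the set of benchmark instance IDs that model resolved.
# (Made-up names and IDs; real submissions list instances like "django__django-12345".)
submissions = {
    "model_a": {"proj-1", "proj-2"},
    "model_b": {"proj-2", "proj-3"},
    "model_c": {"proj-1", "proj-3"},
}
total_problems = 4  # pretend the benchmark has 4 instances in total

# Union across all models: a problem counts if ANY model solved it.
union_resolved = set().union(*submissions.values())
aggregate_rate = len(union_resolved) / total_problems
print(f"aggregate resolution rate: {aggregate_rate:.0%}")  # 75% here
```

A single model scoring above this union rate would mean it solved instances no submitted model ever solved, which is the anomaly being described.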

I am not saying Mythos is cheating, but it might be capable enough of remembering all states of said repos that it can reverse-engineer the TRUE problem statement by diffing within its own internal memory. It could be a unique phenomenon of evaluation awareness. Otherwise I genuinely can't think of how it could be this precise in deciphering such unspecific problem statements.


OpenAI wrote a couple months ago that they do not consider SWE Bench Verified a meaningful benchmark anymore (and they were the ones who published it in the first place): https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

Yep, I read this blog. What confuses me is that Anthropic doesn't seem to be bothered by this study and keeps publishing Verified results.

That is what got me curious in the first place. The fact that Mythos scored so high, IMO, exposes an issue with this model: it is able to solve seemingly impossible-to-solve problems.

Without alleging cheating, which I don't think ANT is doing, the model has to be doing some fortune-telling/future-reading to score that high at all.


A single benchmark is meaningless; you always get quirky results on some benchmarks.

"Never attribute to malice that which is adequately explained by stupidity."

The "bug" discussed in the article is only part of the problem.

The main problem, which is that notification text is stored in a DB on the phone outside of Signal, is not addressed. To avoid that, you have to change your settings.

In this case, the defendant had deleted the Signal app completely, which likely internally marks that app's notifications for deletion from the DB. The bug fixed here is that notifications were not being removed from the local database when the app that generated them was removed; now they are.
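As a minimal sketch of the fix described above: when an app is removed, the system should purge that app's rows from its local notification store. The table, column names, and bundle IDs below are entirely hypothetical; this is just to illustrate the code path, not Apple's actual implementation.

```python
import sqlite3

def purge_app_notifications(con, bundle_id):
    """Delete every stored notification belonging to a removed app.
    (Hypothetical schema: notifications(bundle_id TEXT, body TEXT).)"""
    con.execute("DELETE FROM notifications WHERE bundle_id = ?", (bundle_id,))
    con.commit()

# Demo with an in-memory stand-in for the system notification DB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notifications (bundle_id TEXT, body TEXT)")
con.execute("INSERT INTO notifications VALUES ('org.example.signal', 'hi')")
con.execute("INSERT INTO notifications VALUES ('com.example.other', 'yo')")
purge_app_notifications(con, "org.example.signal")  # app was uninstalled
```

The bug being described is the absence of any such purge step: uninstalling the app left its notification text behind in the system database.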

  Impact: Notifications marked for deletion could be unexpectedly retained on the device
  Description: A logging issue was addressed with improved data redaction.
  CVE-2026-28950
They classify this as a "logging issue," so it sounds like the notifications were not actually in the database itself but ended up in some log.

This tweet seems to imply it's logs, JSON, plists, and a SQLite DB:

1. Biome — /private/var/mobile/Library/Biome/streams/.../Notification/segments/ — the raw title/body logs

2. BulletinBoard + UserNotificationsCore — /var/mobile/Library/{BulletinBoard,UserNotificationsCore}/.{json,plist} — delivered + dismissed state

3. CoreDuet — /var/mobile/Library/CoreDuet/coreduetdClassD.db — SQLite that re-ingests Biome events

https://x.com/zeroxjf/status/2047081983449178128?s=46


I don’t think they are correct

You're speculating. "Marked for deletion" could mean after you dismiss it, not just after you delete the whole app.

I'll speculate further: it could've been in the dismiss-notification code, and when you delete the app the OS dismisses the removed app's notifications, triggering the same code path.

In this case, as per reporting, the defendant removed the app. It's unclear if they first dismissed the notifications.


SQLite WAL?

Why do you think they aren't the same thing?

In the case reported, the content did not leave the device; the feds retrieved it directly from the phone.


Link added to toptext. Thanks!

You're onto something: why pay for ads if I can pay a post-ads agency to ensure maximal product placement during training?

I find it a good surprise. It signals they probably don't want to become an advertising company, which is what every other tech business ended up becoming. And that doesn't rule out having ads as a core revenue driver, like companies in other sectors.
