1) By any source I can find, this is incorrect. Twitter had ~8,000 employees when Musk bought it. After layoffs that was trimmed to a low of around 1,500 employees (roughly 19% of the original headcount), and today it has around 2,800 employees.

Also worth mentioning that a lot of Twitter's products are built on X.ai, which has 1,200 core employees working on Grok and 3,000+ on the datacenter build-out side.


Reading such obvious LLM-isms in the announcement just makes me cringe a bit too, e.g.:

> We optimize for speed users actually feel: responsiveness in the moments users experience — p95 latency under high concurrency, consistent turn-to-turn behavior, and stable throughput when systems get busy.


To think I used to log in to Facebook every day, scroll friends' posts until it said "You're caught up!", then leave.

That's almost unimaginable now, but I deeply wish I could return to that experience. Unfortunately, as the suggested content got turned up, friends stopped posting, so even with all the browser extensions in the world I can't get that same experience back.


Yup. And unfortunately, I realized I've benefitted from the algo too. I tried the friends feed, and I ended up with MORE politics!

If you're heating water, the heating is just "talk" while boiling is "action" -- but boiling takes a long time even once you've reached the boiling point!

Couldn't these tools be made to run in an OverlayFS-type filesystem that the user could review and apply changes to when they're done?

It would also be nice to have a second agent review every command to ensure nothing overly destructive is happening.

Is either of these things possible with Codex/CC?
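
For the first idea, plain Linux overlayfs seems workable. A rough sketch of what I mean (assuming Linux and root; "some-coding-agent" and the paths are placeholders, not real tools):

    import subprocess
    import tempfile

    project = "/home/me/project"                       # read-only lower layer
    upper = tempfile.mkdtemp(prefix="agent-upper-")    # agent's writes land here
    work = tempfile.mkdtemp(prefix="agent-work-")      # overlayfs scratch space
    merged = tempfile.mkdtemp(prefix="agent-merged-")  # the view the agent sees

    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={project},upperdir={upper},workdir={work}",
         merged],
        check=True,
    )
    try:
        # Point the agent at the merged view (placeholder command).
        subprocess.run(["some-coding-agent", "--workdir", merged], check=False)
    finally:
        subprocess.run(["umount", merged], check=True)

    # Everything the agent created or modified is isolated in `upper`;
    # review it file by file and copy over only what you approve.
    subprocess.run(["find", upper, "-type", "f"], check=True)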


Strange that you say that, because the general consensus (and my experience) seems to be the opposite, as does the AA-Omniscience Hallucination Rate benchmark, which puts 3.0 Pro among the higher-hallucinating models. 3.1 seems to be a noticeable improvement, though.

Google actually has the BEST ratings in the AA-Omniscience Index. From the benchmark's description:

> AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.

Gemini 3.1 holds the top spot, followed by 3.0 and then Opus 4.6 Max.
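
Based on that description, the scoring presumably works something like this (my reading of it; the equal +1/-1 weights are an assumption, not the official formula):

    def omniscience_index(correct: int, hallucinated: int, refused: int) -> float:
        # Reward correct answers, penalize hallucinations, no penalty
        # for refusals, per the description quoted above.
        total = correct + hallucinated + refused
        return 100 * (correct - hallucinated) / total

    # A model that refuses when unsure can outscore one that always guesses:
    print(omniscience_index(correct=60, hallucinated=10, refused=30))  # 50.0
    print(omniscience_index(correct=65, hallucinated=35, refused=0))   # 30.0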


This isn't actually correct.

Gemini 3.0 gets a very high score because it's very often correct, but it does not have a low hallucination rate.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

It looks like 3.1 is a big improvement in this regard; it hallucinates a lot less.


Yes and no. The hallucination rate shown there is the percentage of the time the model answers incorrectly when it should instead have admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it knows the correct answer much more often than, e.g., GPT 5.2, it does in fact give hallucinated answers much less often.

In short, its hallucination rate as a percentage of unknown answers is no better than most models', but its hallucination rate as a percentage of total answers is indeed better.
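
A toy example with made-up numbers, just to illustrate the two rates:

    questions = 100

    # Model A (think Gemini 3.0): knows a lot, guesses when it doesn't.
    a_known = 70
    a_hallucinations = (questions - a_known) * 0.90   # 27 of 100 answers

    # Model B: same 90% tendency to guess on unknowns, but knows less.
    b_known = 40
    b_hallucinations = (questions - b_known) * 0.90   # 54 of 100 answers

    # Identical hallucination rate *on unknowns* (90%), yet as a share of
    # all answers, Model A hallucinates half as often (27% vs 54%).
    print(a_hallucinations, b_hallucinations)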


I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high-value tasks, and that has been my consistent finding.

> the AA-Omniscience Hallucination Rate benchmark, which puts 3.0 Pro among the higher-hallucinating models. 3.1 seems to be a noticeable improvement, though.

As the sibling comment says, the AA-Omniscience Hallucination Rate benchmark puts Gemini 3.0 as the best-performing model aside from the Gemini 3.1 preview.

https://artificialanalysis.ai/evaluations/omniscience


You are misreading the benchmark.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

If you look at the results, 3.0 hallucinates an awful lot when it's wrong.

It's just not wrong that often.

(And it looks like 3.1 does better on both fronts)


I tried it with a few of my Suno prompts and it seems way behind Suno v5, which is especially surprising because it's 5 months newer!

That headline seems like a stretch after reading the article (it is Fox, after all).

> sexually inappropriate messages were sent to "~500k victims per DAY in English markets only."

This sounds like a total count of unsolicited sexual messages sent to all users every day.


Fun when these things hold a surprising amount of weight. Reminds me of when these two engineers on Lego Masters made a bridge:

https://www.youtube.com/watch?v=G9WT6TB15yE


wtf, why lego, whhhy? "The uploader has not made this video available in your country"

edit: What, they geoblocked a ~1min clip, wow.


This is from a Fox TV series, so it's probably a distribution-rights issue.

https://en.wikipedia.org/wiki/Lego_Masters_(American_TV_seri...


I live in the U.S.; I can watch it.

What is "your country?"


It's Lego Masters USA (Fox), rather than the Lego company itself, so I imagine they're being extra-careful with licensing.

I'm in the UK and it's geoblocked for me.


They are (in my mind) still the best models for fast general tasks when hosted on Groq / Cerebras.
