1) By any source I can find, this is incorrect. Twitter had ~8,000 employees when Musk bought it. After layoffs that was trimmed to a low of around 1,500 employees (roughly 19% of the original headcount), and today it has around 2,800 employees.

Also worth mentioning that a lot of Twitter's products are built on X.ai, which has 1,200 core employees working on Grok and 3,000+ on the datacenter build-out side.


Reading such obvious LLM-isms in the announcement just makes me cringe a bit too, e.g.:

> We optimize for speed users actually feel: responsiveness in the moments users experience — p95 latency under high concurrency, consistent turn-to-turn behavior, and stable throughput when systems get busy.


To think I used to log in to Facebook every day, scroll friends' posts until it said "You're caught up!", then leave.

That's almost unimaginable now, but I deeply wish I could return to that experience. Unfortunately, as the suggested content got turned up, friends stopped posting, so even with all the browser extensions in the world I can't get that same experience back.


Yup. And unfortunately, I realized I've benefitted from the algo too. I tried the friends feed, and I ended up with MORE politics!

If you're heating water, the heating is just "talk" while boiling is "action" -- but boiling takes a long time even once you've reached the boiling point!

Couldn't these tools be made to run in an OverlayFS-type filesystem that the user could review and apply changes to when they're done?

It would also be nice to have a second agent review every command to ensure nothing overly destructive is happening.

Is either of these things possible with Codex/CC?
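
For the first idea, plain Linux overlayfs seems workable. A rough sketch of what I mean (assuming Linux and root; "some-coding-agent" and the paths are placeholders, not real tools):

    import subprocess
    import tempfile

    project = "/home/me/project"                       # read-only lower layer
    upper = tempfile.mkdtemp(prefix="agent-upper-")    # agent's writes land here
    work = tempfile.mkdtemp(prefix="agent-work-")      # overlayfs scratch space
    merged = tempfile.mkdtemp(prefix="agent-merged-")  # the view the agent sees

    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={project},upperdir={upper},workdir={work}",
         merged],
        check=True,
    )
    try:
        # Point the agent at the merged view (placeholder command).
        subprocess.run(["some-coding-agent", "--workdir", merged], check=False)
    finally:
        subprocess.run(["umount", merged], check=True)

    # Everything the agent created or modified is isolated in `upper`;
    # review it file by file and copy over only what you approve.
    subprocess.run(["find", upper, "-type", "f"], check=True)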


Strange that you say that, because the general consensus (and my experience) seems to be the opposite, as does the AA-Omniscience Hallucination Rate benchmark, which puts 3.0 Pro among the higher-hallucinating models. 3.1 seems to be a noticeable improvement, though.

Google actually has the BEST ratings in the AA-Omniscience Index. From the benchmark's description:

> AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.

Gemini 3.1 holds the top spot, followed by 3.0 and then Opus 4.6 Max.
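
Based on that description, the scoring presumably works something like this (my reading of it; the equal +1/-1 weights are an assumption, not the official formula):

    def omniscience_index(correct: int, hallucinated: int, refused: int) -> float:
        # Reward correct answers, penalize hallucinations, no penalty
        # for refusals, per the description quoted above.
        total = correct + hallucinated + refused
        return 100 * (correct - hallucinated) / total

    # A model that refuses when unsure can outscore one that always guesses:
    print(omniscience_index(correct=60, hallucinated=10, refused=30))  # 50.0
    print(omniscience_index(correct=65, hallucinated=35, refused=0))   # 30.0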


This isn't actually correct.

Gemini 3.0 gets a very high score because it's very often correct, but it does not have a low hallucination rate.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

It looks like 3.1 is a big improvement in this regard; it hallucinates a lot less.


Yes and no. The hallucination rate shown there is the percentage of the time the model answers incorrectly when it should instead have admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it knows the correct answer much more often than, e.g., GPT 5.2, it does in fact give hallucinated answers much less often.

In short, its hallucination rate as a percentage of unknown answers is no better than most models', but its hallucination rate as a percentage of total answers is indeed better.
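
A toy example with made-up numbers, just to illustrate the two rates:

    questions = 100

    # Model A (think Gemini 3.0): knows a lot, guesses when it doesn't.
    a_known = 70
    a_hallucinations = (questions - a_known) * 0.90   # 27 of 100 answers

    # Model B: same 90% tendency to guess on unknowns, but knows less.
    b_known = 40
    b_hallucinations = (questions - b_known) * 0.90   # 54 of 100 answers

    # Identical hallucination rate *on unknowns* (90%), yet as a share of
    # all answers, Model A hallucinates half as often (27% vs 54%).
    print(a_hallucinations, b_hallucinations)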


I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high-value tasks, and that has been my consistent finding.

> the AA-Omniscience Hallucination Rate benchmark, which puts 3.0 Pro among the higher-hallucinating models. 3.1 seems to be a noticeable improvement, though.

As the sibling comment says, the AA-Omniscience Hallucination Rate benchmark puts Gemini 3.0 as the best-performing model aside from the Gemini 3.1 preview.

https://artificialanalysis.ai/evaluations/omniscience


You are misreading the benchmark.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

If you look at the results, 3.0 hallucinates an awful lot when it's wrong.

It's just not wrong that often.

(And it looks like 3.1 does better on both fronts)


I tried it with a few of my Suno prompts and it seems way behind Suno v5, which is especially surprising because it's 5 months newer!

That headline seems like a stretch after reading the article (it is Fox, after all).

> sexually inappropriate messages were sent to "~500k victims per DAY in English markets only."

This sounds like a total count of unsolicited sexual messages sent to all users every day.


Fun when these things hold a surprising amount of weight. Reminds me of when these two engineers on Lego Masters made a bridge:

https://www.youtube.com/watch?v=G9WT6TB15yE


wtf, why lego, whhhy? "The uploader has not made this video available in your country"

edit: What, they geoblocked a ~1min clip, wow.


This is from a Fox TV series, so it's probably a distribution-rights issue.

https://en.wikipedia.org/wiki/Lego_Masters_(American_TV_seri...


I live in the U.S.; I can watch it.

What is "your country?"


It's Lego Masters USA (Fox), rather than the Lego company itself, so I imagine they're being extra-careful with licensing.

I'm in the UK and it's geoblocked for me.


They are (in my mind) still the best models for fast general tasks when hosted on Groq / Cerebras.
