> Models were run with maximum available reasoning effort in our API (xhigh for GPT‑5.2 Thinking & Pro, and high for GPT‑5.1 Thinking), except for the professional evals, where GPT‑5.2 Thinking was run with reasoning effort heavy, the maximum available in ChatGPT Pro. Benchmarks were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.
Feels like a Llama 4 type release. Benchmarks are not apples to apples. Reasoning effort is higher across the board, thus uses more compute to achieve a higher score on benchmarks.
Also note that some results may not be reproducible.
Also, the vision benchmarks all use a Python tool harness, and they exclude scores that are low without the harness.
I am a huge proponent of renewables, but you cannot compare their capacity in GW with that of other energy sources, because their output is variable and rarely at maximum. To realistically get 100 GW of average output from solar you would need at least 500 GW of panels.
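The ~5x multiplier is just an assumed capacity factor inverted. A quick sketch (the ~20% factor is an illustrative assumption; real capacity factors vary by location and technology):

```python
# Back-of-envelope conversion from average output to nameplate capacity.
# The capacity factors used below are illustrative assumptions, not data.
def nameplate_needed_gw(average_output_gw, capacity_factor):
    """Peak (nameplate) capacity required to deliver a given average output."""
    return average_output_gw / capacity_factor

# Utility-scale solar typically runs around a 0.15-0.25 capacity factor
# depending on location; at 0.20, averaging 100 GW takes ~500 GW of panels.
print(nameplate_needed_gw(100, 0.20))  # → 500.0
```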
On the other hand, I think we will not actually need 100 GW of new installations, because capacity can be acquired by reducing current usage through efficiency. The term negawatt comes to mind. A lot of people are still in the stone age on this, even though it was demonstrated quite effectively by reduced gas use in the US after the oil crisis in the '70s, which basically only recovered to pre-crisis levels recently.
High gas prices caused people to use less and favor efficiency. The same thing will happen with electricity and we'll get more capacity. Let the market work.
Nope. Construction induces ECC-driven emergent modular manifolds in latent space during KVQ maths. You can't use any ole ECC; that's the crux of why it works. More in another reply.
Doesn't apply as long as the improvements obtained there scale with compute.
Now, are there actual meaningful improvements to obtain, and do they stick around all the way to frontier runs? Unclear, really. So far, it looks like opening a can of hyperparameters.
this is a bad example for claiming the bitter lesson applies; the lesson is about the fundamentals of optimization techniques, not about tying the solution space to hand-crafted things.
Sure, of course. Wasn't suggesting "are you beating a SOTA benchmark?" I'm floating the idea of an ablation that matches a realistic scenario for the dataset/task. Personally curious how manifold Muon performs compared to AdamW in a thoroughly explored context. This is the first time I've seen a 3-layer MLP on CIFAR-10.
I probably should have made the 9-layer ResNet part more front-and-center in my point.
> “That’s right outta ‘Goodfellas,’ that’s right out of a mafioso going into a bar saying, ‘Nice bar you have here, it’d be a shame if something happened to it,’” Cruz said, using the iconic New York accent associated with the Mafia.
It does go to the government and not to Trump's personal wallet (unlike the memecoins and lavish gifts); it's a tax that's just not being called a tax, and frankly it's a good idea. The current abuse of H1B doesn't work out positively for anyone but the companies making a boatload of money exploiting people.
Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It's not always DuckDB's fault when there's a clear I/O bottleneck…
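For context, here's the kind of query where partitioning matters. A minimal sketch, assuming a hypothetical `s3://my-bucket/events/` prefix and DuckDB's standard httpfs extension for S3 access; a glob over many partitioned files parallelizes far better than a single giant CSV:

```sql
-- Hypothetical bucket/prefix; requires DuckDB's httpfs extension.
INSTALL httpfs;
LOAD httpfs;

-- Reading one 500GB CSV serializes on a single stream; a glob over
-- partitioned files lets DuckDB parallelize the scan.
SELECT count(*)
FROM read_csv_auto('s3://my-bucket/events/part-*.csv');
```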
These just seem like over-engineered solutions trying to guarantee job security. When the dataflows are this straightforward, just replicate into your OLAP of choice and transform there.
I came from traditional engineering into data engineering by accident and had a similar view. But every time I tried to build a pipeline from first principles, it eventually turned into something like this for a reason. This is especially true when trying to bridge many teams and skillsets - everyone wants their favourite tool.
our approach wasn’t about over-engineering, we were trying to leverage our existing investments (like Confluent BYOC) while optimizing for flexibility, cost, and performance. We wanted to stay loosely coupled to adapt to cloud restrictions across multiple geographic deployments.
What is the current state of the art (open source) for OLTP-to-OLAP pipelines these days? I don't mean a one-off ETL-style load at night, but a continuous process with relatively low latency.
Idk what the state of the art is, but I've used change data capture with Debezium and Kafka, sinked into Snowflake. Not sure Kafka is the right tool since you don't need persistence, and having replication slots makes a lot of operations (e.g. DB engine upgrades) a lot harder though.
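For anyone who hasn't seen that setup: the Postgres side is mostly one connector config. A minimal sketch with hypothetical hostnames, table names, and credentials (the property names are standard Debezium PostgreSQL connector settings; `slot.name` is the replication slot mentioned above):

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "orders",
    "topic.prefix": "orders",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders"
  }
}
```

Debezium writes change events to Kafka topics, and a separate sink connector moves them into the warehouse; the replication slot is what makes DB maintenance trickier, since it pins WAL until the consumer catches up.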