I think that unlike Gluon/CuTe/ThunderKittens (which distinguish themselves from Triton by being lower level and giving you more control, at the cost of performance portability and ease of writing), Helion distinguishes itself from Triton by being higher level and easier to write.
IMO, this is something that makes sense for PyTorch to release, as "neutral ground" in the industry.
What's the point of Triton compared to Gluon? What's the point of PyTorch compared to Triton?
One of the main values of Triton is that it significantly expanded the scope of folks who can write kernels - I think Helion could expand the scope even more.
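To make the "higher level" point concrete, here's roughly what a trivial Helion kernel looks like, based on my memory of the announcement examples - treat the exact decorator and `hl.tile` usage as approximate rather than authoritative:

    import torch
    import helion
    import helion.language as hl

    @helion.kernel()  # autotuning picks block sizes, num_warps, etc. for you
    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        # You write the tiling loop at the PyTorch-tensor level of abstraction;
        # Helion lowers it to a Triton kernel underneath.
        for tile in hl.tile(out.size()):
            out[tile] = x[tile] + y[tile]
        return out

Compared to hand-written Triton you aren't managing program IDs, masks, or block pointers yourself, which is the sense in which it trades control for ease of writing.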
> Last year, OpenAI expected about $5 billion in losses on $3.7 billion in revenue. OpenAI’s annual recurring revenue is now on track to pass $20 billion this year, but the company is still losing money.
> “As long as we’re on this very distinct curve of the model getting better and better, I think the rational thing to do is to just be willing to run the loss for quite a while,” Altman told CNBC’s “Squawk Box” in an interview Friday following the release of GPT-5.
If you sell compute for less than it costs you, you can have as much revenue as you're willing to pay for.
Anthropic's founder described it as: if each model were its own company, it would be hugely profitable. The picture looks bad because, while the model you trained in 2024 is generating net-positive revenue, you're also training a more expensive model for 2025 that won't generate any revenue until then. So at any given moment they're burning more cash than they're bringing in, under the expectation that every model will increase revenue even more. Who knows how long that lasts, but it's working so far.
The paraphrase is from the podcast he did with the Stripe founder - Cheeky Pint, I think.
The problem for OpenAI, and the difference from the other FAANGs, is that they don't own the internet. Other companies are able to replicate their product, which prevents them from fully realizing profits.
Google doesn’t have this problem. They only run Google ads in their search results. Same thing for Facebook.
If I have the numbers right, OpenAI will burn more money this year alone than all of those prior companies did in their entire profitless phase of existence.
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.
If you compute out the MFU the author's numbers imply, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 PFLOPS per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
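A quick sanity check of that arithmetic (the ~2 PFLOPS peak below is my assumption for an H100-class GPU at dense FP8; the article doesn't state a peak figure):

    # Back-of-the-envelope check of the per-GPU throughput the article implies.
    input_tokens_per_s = 1.44e6      # article's implied input token rate
    active_params = 37e9             # active parameters per token
    flops_per_param = 2              # multiply + add (FMA)
    gpus_per_instance = 8

    flops_per_gpu = input_tokens_per_s * active_params * flops_per_param / gpus_per_instance
    print(f"required: {flops_per_gpu / 1e15:.1f} PFLOPS per GPU")                # ~13.3

    assumed_peak_pflops = 2.0        # assumption: roughly H100-class dense FP8 peak
    print(f"that is ~{flops_per_gpu / (assumed_peak_pflops * 1e15):.1f}x peak")  # ~6.7x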
There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming only 8 GPUs per instance rather than the more efficient and now-standard prefill/decode disaggregation setups, assuming that the attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.
Agree that the write-up is very wrong, especially for the output tokens. Anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of around 0.2 USD per million output tokens.
You can also look at the prices of open-source models on OpenRouter, which are a fraction of the cost of closed-source models. This is a heavily commoditized market, so I would expect it to reflect the true cost plus a small margin.
If you make careful calculations and estimate the theoretical margins for inference alone, the margins for most of the big open models on OpenRouter are typically crazy high if the providers served at scale (north of 800% for most of the large models). The listed prices probably also reflect salaries, investments, and the amortization of other expenses like free serving or periods of partial occupancy. It can be hard to keep load uniformly high, because users have other preferences that don't get covered at any price - maximal context length (which costs output throughput), latency, and time to first token, but also things like privacy guarantees, or simply switching quickly to the next best model. I have always thought that centralized inference is the real goldmine of AI, because at scale you get so much value for hardly any cost.
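For what it's worth, that back-of-the-envelope margin estimate has roughly this shape (every number below is a hypothetical placeholder, not a measured figure):

    # Hypothetical illustration of the margin calculation described above.
    listed_price_per_mtok = 0.90      # what a provider charges per million output tokens
    gpu_hour_cost = 2.0               # rented GPU $/hour
    gpus = 8                          # instance size
    instance_tokens_per_s = 20_000    # assumed sustained throughput at high batch sizes

    cost_per_mtok = gpus * gpu_hour_cost / 3600 / instance_tokens_per_s * 1e6
    margin = (listed_price_per_mtok - cost_per_mtok) / cost_per_mtok
    print(f"cost ≈ ${cost_per_mtok:.2f}/Mtok, margin ≈ {margin * 100:.0f}%")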
No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
Thanks for the correction (author here). I'll update the article - very fair point on the compute for input tokens, which I messed up. Tbh I'm pleased my napkin math was only 7x off from the laws of physics :).
Even rerunning the math on my use cases with way higher input token cost doesn't change much though.
The 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, both prefill and decode come out 8x cheaper in your calculations.
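A minimal sketch of why that assumption dominates, assuming decode is weight-bandwidth bound and ignoring KV-cache traffic (the bandwidth and price figures are illustrative assumptions, not from the article):

    # In the bandwidth-bound regime, each decode step streams the active weights
    # once no matter how many sequences share the step, so per-token cost scales
    # inversely with concurrency.
    active_weight_bytes = 37e9        # 37B active params at ~1 byte each (fp8)
    gpus = 8
    hbm_bytes_per_s = 3.35e12         # assumption: roughly H100-class HBM bandwidth
    instance_cost_per_hour = 8 * 2.0  # illustrative $2/GPU-hour

    step_time_s = (active_weight_bytes / gpus) / hbm_bytes_per_s  # each GPU streams its shard
    for batch in (32, 256):
        tokens_per_s = batch / step_time_s                        # one token per sequence per step
        usd_per_mtok = instance_cost_per_hour / 3600 / tokens_per_s * 1e6
        print(f"batch={batch:>3}: ~${usd_per_mtok:.3f} per million output tokens")

Going from 32 to 256 divides the cost per output token by exactly 8 here, which is the factor above.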
The claim that you need long context lengths for attention to be compute-bound is also quite misleading.
I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.
1. The academy has had a significant increase of young voters in the past 10 years or so. Generally speaking, young voters are more likely to take animation as a "serious" medium.
2. These interviews were always somewhat overstated. Of course some voters have stupid rationales, but I don't think this dominates the academy.
3. Disney's Inside Out 2 was nowhere close to winning the award this year - Flow's biggest competition was The Wild Robot, which grossed far more than Flow but far below Inside Out 2.
If you look at the past couple of years, The Boy and the Heron (Studio Ghibli) won over Across the Spider-Verse in 2023 (with Pixar's Elemental nowhere close), Guillermo del Toro's Pinocchio won in 2022 (with Pixar's Turning Red nowhere close), etc.
I'm curious what year you're thinking about above. Perhaps Toy Story 4 over Klaus in 2019?
4. The results can still be valid if there’s a lot of random noise in the sample. There are about 10,000 voters here. If 9,000 vote at random and 1,000 watch the films and vote on merit, there’s about a 2% chance of getting a different result than if all 10,000 watched and voted on merit.
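A rough Monte Carlo version of that argument (the five-nominee field and the 40/30/15/10/5 split among the merit voters are my assumptions, so the exact percentage moves with them):

    import random
    from collections import Counter

    NOMINEES = ["A", "B", "C", "D", "E"]
    MERIT_SPLIT = [0.40, 0.30, 0.15, 0.10, 0.05]   # hypothetical merit-voter preferences
    N_MERIT, N_RANDOM, TRIALS = 1_000, 9_000, 2_000

    merit_only_winner = "A"   # plurality winner if everyone watched and voted on merit
    flips = 0
    for _ in range(TRIALS):
        counts = Counter()
        # 1,000 voters who watched the films vote according to the merit split
        counts.update(random.choices(NOMINEES, weights=MERIT_SPLIT, k=N_MERIT))
        # 9,000 voters pick uniformly at random, adding roughly symmetric noise
        counts.update(random.choices(NOMINEES, k=N_RANDOM))
        if max(counts, key=counts.get) != merit_only_winner:
            flips += 1

    print(f"winner differed from the all-merit outcome in {100 * flips / TRIALS:.1f}% of trials")

The random ballots mostly cancel each other out, so the merit voters' plurality usually decides; the flip rate stays in the single digits for splits like this one.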
> Broadcom generates a 70% profit margin from its work on TPUs, said a person with direct knowledge of the internal analysis. SemiAnalysis, a chip research firm, earlier reported that figure.
Good point yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations are helpful for training or for throughput-optimised batched inference but not really for a batch size of one.
I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance, but can only work for certain operations like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding - makes sense that they don't help when it's all matrix-vector.
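A back-of-the-envelope way to see why: compare the arithmetic intensity of a batch-1 matvec to the GPU's balance point (the ~989 TFLOPS and 3.35 TB/s figures are my assumptions for an H100 SXM-class part, not from this thread):

    # Arithmetic intensity of an fp16 matrix-vector product (batch size 1)
    # versus the machine balance point of the GPU.
    n = 8192                               # hypothetical square weight matrix
    flops = 2 * n * n                      # one multiply-add per weight
    bytes_moved = 2 * n * n                # each fp16 weight read once (2 bytes)
    intensity = flops / bytes_moved        # ≈ 1 FLOP per byte

    peak_flops = 989e12                    # assumption: dense fp16 tensor-core peak
    hbm_bw = 3.35e12                       # assumption: HBM bandwidth in bytes/s
    balance = peak_flops / hbm_bw          # ≈ 295 FLOP/byte to become compute-bound

    print(f"matvec intensity ≈ {intensity:.0f} FLOP/byte, balance ≈ {balance:.0f} FLOP/byte")

At roughly one FLOP per byte you're hopelessly bandwidth-bound, so raising peak FLOPS with tensor cores doesn't move the needle; batching multiple sequences is what raises the intensity enough for them to matter.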