I think unlike Gluon/CuTe/ThunderKittens (which distinguish themselves from Triton by being lower level and giving you more control, and are thus less performance-portable and harder to write), Helion distinguishes itself from Triton by being higher level and easier to write.

IMO, this is something that makes sense for PyTorch to release, as "neutral ground" in the industry.


What's the point of Triton compared to Gluon? What's the point of PyTorch compared to Triton?

One of the main values of Triton is that it significantly expanded the scope of folks who can write kernels - I think Helion could expand the scope even more.


If you think of Triton as a "baseline", most other DSLs are lower-level than Triton, whereas this is higher-level.


Clearly not true anymore given OpenAI and Anthropic's revenue growth.


Revenue... yes. Profit is still an open question.

https://www.cnbc.com/2025/08/08/chatgpt-gpt-5-openai-altman-...

> Last year, OpenAI expected about $5 billion in losses on $3.7 billion in revenue. OpenAI’s annual recurring revenue is now on track to pass $20 billion this year, but the company is still losing money.

> “As long as we’re on this very distinct curve of the model getting better and better, I think the rational thing to do is to just be willing to run the loss for quite a while,” Altman told CNBC’s “Squawk Box” in an interview Friday following the release of GPT-5.

If you sell compute for less than it costs you, you can have as much revenue as you're willing to pay for.


Anthropic's founder described it as: if each model were a company, it would be hugely profitable. It looks bad because while the model you trained in 2024 is generating net positive revenue, you're also training a more expensive model for 2025 that won't generate revenue until then. So currently they're always burning more cash than they're bringing in, under the expectation that every model will increase revenue even more. Who knows how long that lasts, but it's working so far.

The paraphrase is from the podcast he did with the Stripe founder, Cheeky Pint I think.


Which is not a good comparison, because the LLMs are products, not companies. If they were companies, they would be competing against each other for revenue.

If I switch from Gemini Pro to Opus, that is good for Anthropic. If I switch from Opus 4 to 4.1, that’s not as good for Anthropic.

Sad that these CEOs can get away with this level of sophistry.


>Revenue... yes. Profit is still an open question.

You could have said the same thing about most FAANG companies at one point or another.


The problem for OpenAI and the difference with other FAANGs is that they don’t own the internet. Other companies are able to replicate their product, which prevents them from fully realizing profits.

Google doesn’t have this problem. They only run Google ads in their search results. Same thing for Facebook.


If I have the numbers right, OpenAI will burn more money this year alone than all of those prior companies did in their entire profitless phase of existence.


Their gross profits are very high even though they're not making operating profit.


This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you compute out the MFU the author's numbers imply, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 petaFLOP/s per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
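
For anyone who wants to check that arithmetic, here's a rough back-of-envelope version of it (the ~2 PFLOP/s peak is my assumption, roughly H100 FP8 dense peak):

    # Sanity check of the compute the article's prefill numbers imply.
    input_tokens_per_s = 1.44e6          # article's implied prefill throughput
    active_params = 37e9                 # active params per token
    flops_per_token = 2 * active_params  # one multiply + one add per param (FMA)
    gpus_per_instance = 8

    flops_per_gpu = input_tokens_per_s * flops_per_token / gpus_per_instance
    peak_flops = 2e15                    # assumed per-GPU peak (~H100 FP8 dense)

    print(f"{flops_per_gpu / 1e15:.1f} PFLOP/s per GPU")    # ~13.3
    print(f"{flops_per_gpu / peak_flops:.1f}x assumed peak") # ~6.7x, i.e. impossible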

There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.


Agree that the writeup is very wrong, especially for output tokens. Here is how anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of around 0.2 USD per million output tokens.

https://lmsys.org/blog/2025-05-05-large-scale-ep/

This has gotten significantly cheaper still since then, with additional code optimizations and with B200s.


You can also look at the prices of open-source models on OpenRouter, which are a fraction of the cost of closed-source models. This is a heavily commoditized market, so I would expect it to reflect the true cost plus a small margin.


If you make careful calculations and estimate the theoretical margins for inference alone of most of the big open models on OpenRouter, the margins are typically crazy high if the providers served at scale (north of 800% for most of the large models). The high prices probably reflect salaries, investments, and the amortization of other expenses like free serving or only partial serving occupancy. Sometimes it is hard to keep a uniformly high load because of user preferences that can't be met at any price, e.g. maximum context length (which costs output throughput), latency, and time to first token, but also things like privacy guarantees, or users simply switching quickly to the next best model. I have always thought that centralized inference is the real goldmine of AI, because you get so much value at scale for hardly any cost.
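
To make that concrete, here's the rough shape of the estimate with illustrative numbers only (the ~$0.2 per million output tokens serving cost echoes the figure mentioned earlier in the thread; the list price and utilization are just assumptions):

    # Shape of the margin estimate, with illustrative numbers only.
    serving_cost_per_mtok = 0.20   # ~$/1M output tokens at scale (figure cited above)
    list_price_per_mtok = 2.00     # assumed list price for a comparable open model
    utilization = 0.5              # assume GPUs are only half busy on paid traffic

    effective_cost = serving_cost_per_mtok / utilization
    margin = (list_price_per_mtok - effective_cost) / effective_cost
    print(f"~{margin:.0%} margin at {utilization:.0%} utilization")  # ~400%
    # At full utilization the same numbers give ~900%, i.e. "north of 800%".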


As much as I appreciate you saying the math is wrong, it doesn’t really help me adjust my expectations unless you provide correct numbers as well.


Right. Now I want to know if they're really losing money or not.


So, bottom line, do you think it’s probable that either OpenAI or Anthropic are “losing money on inference?”


No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.


Thanks for the correction (author here). I'll update the article - very fair point on compute on input tokens which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :).

Even rerunning the math on my use cases with way higher input token cost doesn't change much though.


The 32 parallel sequences is also an arbitrary choice that significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would make both prefill and decode roughly 8x cheaper in your calculations.
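
As a toy illustration (my simplification, not your exact model): if decode is bound by streaming the weights, the weights are read once per step regardless of how many sequences share that step, so cost per output token scales roughly as 1/batch (ignoring KV-cache traffic, which grows with batch):

    # Toy model: decode step time set by streaming the active weights from HBM.
    active_param_bytes = 37e9 * 1   # assume ~37B active params at 1 byte each (FP8)
    hbm_bw = 3.35e12                # assumed H100 HBM bandwidth, bytes/s
    step_time = active_param_bytes / hbm_bw   # ~11 ms per decode step

    for batch in (32, 256):
        tokens_per_s = batch / step_time
        print(batch, f"{tokens_per_s:,.0f} tok/s")  # 256 gives 8x the tokens
                                                    # for the same weight traffic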

The part about requiring long context lengths for attention to be compute-bound is also quite misleading.


Anyone up to publishing their own guess range?


I’m pretty sure input tokens are cheap because they want to ingest the data for training later no? They want huge contexts to slice up.


Afaik all the large providers flipped the default to contractually NOT train on your data. So no, training data context size is not a factor.


Even if it is, ignoring the biggest costs going into the product and then claiming they are profitable would be actual fraud.


As one of those people who doesn’t really understand llms, does anyone have any recommendations to better my understanding of them?


I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.


A couple things:

1. The academy has had a significant increase in young voters in the past 10 years or so. Generally speaking, young voters are more likely to treat animation as a "serious" medium.

2. These interviews were always somewhat overstated. Of course some voters have stupid rationales, but I don't think this dominates the academy.

3. Disney's Inside Out 2 was nowhere close to winning the award this year - Flow's biggest competition was The Wild Robot, which did gross far more than Inside Out 2, but far below Inside Out 2.

If you look at the past couple years, The Boy and the Heron (Studio Ghibli) won over Across the Spider-Verse (with Pixar's movie Elemental nowhere close) in 2023, Guillermo del Toro's Pinocchio won over Puss in Boots: The Last Wish (with Pixar's movie Turning Red nowhere close) in 2022, etc.

I'm curious what year you're thinking about above. Perhaps Toy Story 4 over Klaus in 2019?


> Flow's biggest competition was The Wild Robot, which did gross far more than Inside Out 2, but far below Inside Out 2.

Exactly the same as Inside Out 2 then?

(I'm guessing it was far more than Flow but less than Inside Out 2?)


4. The results can still be valid even if there's a lot of random noise in the sample. There are about 10,000 voters here. If 9,000 vote at random and 1,000 watch the films and vote on merit, there's only about a 2% chance of getting a different result than if all 10,000 had watched and voted on merit.
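
A quick Monte Carlo makes this easy to check under explicit assumptions (the 5-way merit split below is made up, and the number depends heavily on how close the top two films are among the voters who actually watch):

    import numpy as np

    rng = np.random.default_rng(0)
    merit_split = [0.45, 0.31, 0.12, 0.08, 0.04]  # assumed preferences of informed voters
    trials = 20_000

    merit_votes = rng.multinomial(1_000, merit_split, size=trials)
    random_votes = rng.multinomial(9_000, [0.2] * 5, size=trials)  # uniform noise
    winners = (merit_votes + random_votes).argmax(axis=1)

    # Film 0 wins if everyone votes on merit; count how often the noise flips that.
    print(f"{(winners != 0).mean():.1%} of simulated elections change the winner")
    # Comes out around 2% with this split; other splits give very different numbers.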


One of the big things this article misses is that Google pays Broadcom a significant amount for the actual chip design, also around a 70% margin.

Google certainly has infra/cost advantages, but it's nowhere near 10x.


Mind sharing your source on that? I’ve been trying to find one.

Edit: Specifically the nature and current status of the Broadcom/Google relationship as it relates to TPUs.


https://www.theinformation.com/articles/to-reduce-ai-costs-g...

Which takes it from

> Broadcom generates a 70% profit margin from its work on TPUs, said a person with direct knowledge of the internal analysis. SemiAnalysis, a chip research firm, earlier reported that figure.

https://semianalysis.com/2023/08/30/broadcoms-google-tpu-rev...


Thanks!


For latency-bound inference (i.e. one request) you don't need tensor cores, since all your operations are just matrix-vector multiplications.
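
A minimal sketch of why (shapes are illustrative): with a single sequence, every linear layer in decode is a matrix-vector product, so the time is dominated by streaming the weight matrix from memory rather than by matmul throughput:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    hidden = 8192
    W = torch.randn(hidden, hidden, dtype=torch.float16, device=device)  # ~128 MiB of weights
    x = torch.randn(1, hidden, dtype=torch.float16, device=device)       # one token's activations

    y = x @ W   # GEMV: ~2 FLOPs per weight element, ~2 bytes read per weight element
    # ~1 FLOP per byte of weight traffic is far below what tensor cores need to be
    # the bottleneck, so plain CUDA cores hit the same memory-bandwidth limit.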


Good point yes. That explains why he's getting performance similar to the leading frameworks. Those tensor operations are helpful for training or for throughput-optimised batched inference but not really for a batch size of one.


I actually didn't know that. I'm in the space as a hobbyist and I had a vague understanding that tensor cores are essential for reaching peak performance, but can only work for certain operations like dense matrix-matrix multiplication. It was on my list to investigate whether they could be used to further improve single-batch decoding - makes sense that they don't help when it's all matrix-vector.


The big issue with Strassen isn't performance - it's numerical stability.

