There are sequential dependencies, so you can't just arbitrarily increase speed by parallelizing over more GPUs. Every token depends on all previous tokens, and every layer depends on the layers before it. You can arbitrarily slow a model down by using fewer, slower GPUs (or none at all), though.
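The sequential dependency can be sketched with a toy autoregressive loop (pure Python, no real model; `next_token` is a stand-in for a full forward pass):

```python
# Autoregressive decoding is inherently sequential: token t is a
# function of all tokens < t, so the time dimension can't be
# parallelized the way batch or model width can.

def next_token(context):
    # Stand-in for a forward pass; any deterministic function of the
    # entire prefix illustrates the dependency.
    return sum(context) % 7

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        # Must wait for every prior token before producing the next one.
        tokens.append(next_token(tokens))
    return tokens
```

No amount of extra hardware removes the loop-carried dependency in `generate`; more GPUs only make each iteration of the loop cheaper.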
Yes, because speculation has NEVER bitten us in the ass before, right? *Coughs in Spectre*
Speculative decoding is just running extra hardware to get the same prediction faster. Essentially, setting more money on fire if you're the one paying for the compute.
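For context, a hypothetical sketch of the speculative-decoding control flow: a cheap "draft" model proposes K tokens, the expensive "target" model checks them (one batched pass in a real system), and the longest agreeing prefix is kept. Both models here are toy functions, not real LLMs:

```python
def draft_next(ctx):
    return (sum(ctx) + 1) % 5      # fast, approximate proposer (toy)

def target_next(ctx):
    return sum(ctx) % 5            # slow, authoritative model (toy)

def speculative_step(ctx, k=4):
    # Draft proposes k tokens sequentially (cheap to run).
    proposal = []
    tmp = list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    # Target verifies the proposals; in a real system this is one
    # batched forward pass. Accept up to the first disagreement.
    accepted = list(ctx)
    for t in proposal:
        correct = target_next(accepted)
        if t == correct:
            accepted.append(t)
        else:
            accepted.append(correct)  # target's token replaces the miss
            break
    return accepted
```

The extra spend is the draft model plus the wasted work whenever a proposal is rejected; the win is that accepted runs of tokens cost one target-model pass instead of several.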
While this guide covers roughly 80% of the material, it remains a high-level overview that lacks depth. I can't confirm whether it was LLM-generated, but the content is undeniably superficial. Real-world production environments are far more complex; for instance, despite other users mentioning hugepages and the TLB, there is no discussion of critical issues like TLB shootdown.
It's a bit ironic that the "soft" skills are becoming the hard skills nowadays. A lot of the AI buzz these days is around PMs, data scientists, etc. who now have the tools to code "well enough" and are attractive because of their people skills and/or other skillsets.
Not to say this is an objective analysis, just observing the subjective trends.