I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the author suggests (i.e. who was the president between X and Y). Similarly, they are a little too lightweight to be used for translation, either.
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
Sometimes code is definitely the bottleneck. For example some organizations have a very bureaucratic process guarding which projects get access to a development team and when. That's not needed if implementation is now faster/cheaper.
I'm also skeptical that development velocity is so separate from all those other things (context, stakeholder alignment, etc.). It's much easier to get actionable feedback when you have a prototype.
I'm curious where my understanding is wrong, but with how I understand speculative decoding to be used, I didn't think you necessarily got the exact same output. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.
I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot), just one the big model could have produced with whatever top-k and temperature settings.
It really is. This is because LLMs with a single output/user are strongly bandwidth limited. Although the hardware can generate multiple tokens simultaneously, it is slowed down if the tokens depend on each other, as is the case with regular text generation.
The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.
A poor draft model will simply slow down the process without affecting the output.
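To make the bandwidth point concrete, here's a back-of-envelope sketch (all numbers are illustrative assumptions, not measurements of any particular GPU or model):

    # Why single-stream decoding is memory-bandwidth-bound.
    # All numbers below are illustrative assumptions.
    model_params = 8e9        # assume an 8B-parameter model
    bytes_per_param = 2       # fp16/bf16 weights
    mem_bandwidth = 1.0e12    # assume ~1 TB/s of memory bandwidth

    weight_bytes = model_params * bytes_per_param

    # Each new token requires streaming roughly all the weights once,
    # so the upper bound on single-stream decode speed is:
    tokens_per_sec = mem_bandwidth / weight_bytes
    print(f"~{tokens_per_sec:.0f} tokens/s upper bound")  # ~62 tokens/s

    # At that rate the compute units are mostly idle, which is why
    # verifying several speculated tokens in one pass is nearly free.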
I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough.
How close it is to the same output (or the same distribution of outputs) you'd get from running the big model alone depends on temperature, top-k, top-p, and other inference parameters.
When running LLM inference, there is more compute available than memory bandwidth.
It's like branch prediction - the CPU predicts what branch you'll take and starts executing it. Later you find out exactly what branch you took. If the prediction was correct, the speculatively executed code is kept. If the prediction was wrong, it's thrown away, the pipeline is flushed, and the execution resumes from the branch point.
The same with this thing: 3 tokens, A-B-C, were "predicted", and you start computing all 3 of them at the same time, hoping that the prediction checks out. Because of the mathematical structure of the transformer, it costs you almost the same to compute 3 tokens at a time as just one: you are limited by bandwidth, not compute. But CRITICALLY, each token depends on all the previous ones, so if one of the tokens was predicted wrongly, you need to discard all tokens predicted after it (flush the pipeline). This is why a prediction is required and why you can't always just compute 3 tokens simultaneously: there is a serial dependency between consecutive tokens. If you were to start computing 3 tokens simultaneously without a prediction, then for token C you would need to assume exact values for tokens A and B, but those haven't been computed yet. If they were speculatively predicted, though, you can start and hope the prediction was correct.
The token is correct if it matches the one generated by the main model. It works like this:
The draft model quickly generates draft-token 1.
The main model then starts working on two tokens in parallel. It calculates token 1 based on the context, and token 2 based on the context + draft-token 1.
Once the two tokens have been generated, you can check whether the draft-token 1 from the draft model matches token 1 from the main model.
If they match, you have just calculated two tokens in the time it takes to generate one, because the calculation was done in parallel.
If they do not match, delete token 2 and generate it again. Since you have already generated the correct token 1 with the big model, you can use the context + token 1 (from the main model). This takes more time, but the result is always the same.
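A rough Python sketch of that loop, with hypothetical greedy draft_model and main_model next-token functions (real implementations do the two main-model calls as a single batched forward pass, which is where the speedup comes from):

    def generate(context, main_model, draft_model, max_new_tokens):
        tokens = list(context)
        while len(tokens) - len(context) < max_new_tokens:
            draft1 = draft_model(tokens)             # cheap guess for token 1

            # Conceptually these two run in parallel (one batched pass):
            token1 = main_model(tokens)              # real token 1
            token2 = main_model(tokens + [draft1])   # token 2, assuming the guess

            tokens.append(token1)
            if draft1 == token1:
                tokens.append(token2)  # guess was right: two tokens in one step
            # else: token2 was built on a wrong prefix, so throw it away and
            # let the next iteration recompute from the correct token1.
        return tokens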
Models do not generate tokens. They generate probabilities for each token.
Inference parameters select a token using those.
You can just select the top token all the time or you can do it probabilistically.
How you do that in both the speculative decoding and the main inference changes how likely you get the exact same tokens. And then you can choose to accept only if the token matches exactly, or you can choose to accept if it was reasonably likely to be chosen.
Let's say the main model picked the 2nd most likely token and the speculative model picked the most likely. You can reject that, but you get less speedup. Or you can accept it and get more speedup, but then you do change the output: you risk the distribution of your outputs not being what you hope.
In theory, you could do that and increase the speed at higher temperatures, but it would subtly alter your output based on the draft model preferences. Rather than picking randomly from the main model probabilities, you would have to accept a draft model pick if it is close enough.
As far as I know, this is not used in practice. Currently popular implementations always match the main model output, and the draft model only affects the speed.
Good analysis. That's surprising. I always heard that the draft model doesn't affect the output in any way. It seems they do it like this to achieve faster generation. It would be interesting to investigate how this affects the output.
Edit: I haven't gone through all the code, but they might do something like this: https://arxiv.org/abs/2211.17192 where a draft model is used and the output distribution is tweaked on rejection, resulting in the exact same distribution as the main model.
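For reference, the acceptance rule from that paper looks roughly like this (a minimal numpy sketch; p and q are the full next-token distributions from the main and draft models at the same position, and the function/variable names are mine):

    import numpy as np

    def accept_or_resample(p, q, x, rng=np.random.default_rng()):
        # x is the token the draft model sampled from q.
        # Accept it with probability min(1, p[x] / q[x]).
        if rng.random() < min(1.0, p[x] / q[x]):
            return x, True
        # Otherwise resample from the residual distribution max(0, p - q),
        # renormalized. This makes the overall output follow p exactly.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        return int(rng.choice(len(p), p=residual)), False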
By "lossless" I believe they mean "stays within the target distribution". Thats what their validation test says it tests. Maybe that means there is no loss in quality in practice. I don't think it means there is no change in output.
The paper they link to in that first paragraph says you compare logits to accept or reject.
Speculative decoding batches multiple completions covering all possible outcomes (0/1/2 draft tokens accepted) and sees if the big model deviates at any point, thus verifying each token. So there's no difference in output.
From the linked post, it didn't read like a separate KV cache was needed:
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
That's great news. That has not been the case with other MTP implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.
It's based on taking advantage of spare compute if you have it. A tiny model generates a few steps ahead first, then the large one runs batch inference on all of those at once as if you are at that point in time. If they all check out afterwards it jumps ahead, otherwise it discards and goes onto the next one.
Not sure about this implementation, but conceptually it only works well on very capable GPUs for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which seems ridiculous.
And if you don't have enough compute, then you get negative speedup from all the extra overhead.
I don't think this way because I like to collaborate. If a colleague can benefit from a tool I made I'm proud to save them time. I also think your attitude doesn't pass the golden rule: would you like to work on a team full of people like you?
I tend to agree with you - a rising tide lifts all boats and I want my team to be a rising tide. If I'm at a startup and I'm confident my tool is a good fit for what the rest of the team is doing and there's a genuine teamwork dynamic, oh absolutely I share things like this.
But when I've been stuck for a while in a dysfunctional team, I've definitely seen the flip side where other people will find ways to take a lot of credit for minor iterations on my work, where management will reward my productivity with high expectations and high pressure to continue the trajectory they perceive in a single idea, and when the tool becomes a support burden because too many people think it should solve all of their other problems too and I'm now perceived as being the owner of this thing they depend on.
It does seem like a highly antagonistic way of working or perhaps I'm just naive.
If your only goal is to maintain a performance lead on your peers, you either need to (1) gain and keep an advantage or (2) find ways to actively make your coworkers disadvantaged (or both). And if you're already doing (1), then (2) isn't a far stretch.
> would you like to work on a team full of people like you?
If their team is already like this, what choice do they have? It's a prisoner's dilemma where everyone else is defecting and you're the sole cooperator.
IMO the onus for solving this is on the business owner, either through establishing a knowledge sharing culture or more comprehensive performance evaluation that rewards these innovations.
Some parts of the anti-AI movement are becoming so unhinged that now any use of compute is considered an environmental threat. This degrowth mentality needs to die.
No need for unlimited growth, just normal sustainable progress like the one that allows you and me to communicate here after centuries of technological progress.
The "normal sustainable progress" has already pushed us to the brink of extinction. AI is rapidly accelerating our resource use, with nothing good to show for it.
How exactly are we "on the brink of extinction"? ("We" as in humans; many other species are obviously not as lucky.)
We are probably on the brink of very bad consequences for a significant fraction of all humans (up to and including all of them, to some extent), which is a huge problem that needs to be addressed.
But what do you gain by incorrectly labeling that as "extinction"? Because you do definitely lose credibility for it, similarly to everybody using hyperbolic language such as "boiling the oceans" etc.
This is literally why the EU mandates appliance energy efficiency.
It's never a binary thing. "Is using energy good or bad?" is a stupid question which can only provide stupid answers. It has to be placed in the context of whether it's proportionate to benefit.
Things which burn a lot of energy for little benefit - and in the case of AI, often negative benefit - end up more towards the "bad".
You're not seriously trying to explain that a kWh is equal to a kWh. Why not cut the crap? Are you trying to say washing clothes is of equal importance to convenience features in a browser, given that we can use each clean kWh only once? I can't tell what you really mean from this.
What do you mean you "disagree"? I pay for the electricity I use and I use it however I want.
Instead of trying to control other people, why can't you start with yourself? Throw away your phone/computer. Go live in a small hut. Practice what you preach.
>You are incurring debt and forcing it upon others.
You seem to have no problem whatsoever with using electricity yourself. So when do you get to tell me (or anyone else) how to live? And when does it stop? Btw, this is all bizarrely dramatic since we were talking about small local models anyway.
>future generations
Yeah, and some will also say (using the same arguments) that having children is harmful to the planet and we need "measures" to limit that too.
I’m not telling you to do one thing or another. I’m taking issue with your argument that because you pay an electric bill, it follows that you can do whatever you want.
That does not follow logically for me. As humans we disagree about many things, but we generally agree that things that we do often affect others, so one way or another, we need to come together and decide which things are agreed to be acceptable and which things are not.
And I'm not inclined to entertain this nonsense, not even as a hypothetical. I'm not giving up on my most basic and fundamental rights, doubly so when these draconian restrictions won't apply to the people who want to impose them.
The oceans are boiling [0], marine life is dying [1]. Land close to the water will be land under water soon [2]. The ice caps are melting and setting free all sorts of diseases. [3]
Large parts of our planet are on fire all the time now; here's one from Australia from this year [4], but I'm sure you've read about wildfires in Australia last year, California every year, Greece last year, etc.
What you're proposing is nothing short of a death cult. It's either degrowth or we all die, sacrificed at the altar of capitalism.
Why do you attribute to capitalism an issue that is much more fundamental than it? People want more stuff and better lives, it's as simple as that. Even hunter/gatherer societies brought themselves to extinction multiple times in the past, and I doubt the USSR would have fared better against climate change.
Technological progress is also societal progress. If we embraced degrowth in the 1800's (there was a ton of pollution back then, and a Malthusian belief in disaster!) we might not see slavery being abolished or women being able to vote.
> People want more stuff and better lives, it's as simple as that.
Not everyone wants this at the cost of others. It's not as simple as that, nor a necessary consequence of our desire to find clever solutions to everyday inconveniences.
Because capitalism ties together better lives and an ideological belief in unbounded growth.
Will people's lives really be better once they're drowning or choking on wildfire smoke? But hey, at least they had cheap junk!
It's possible to have better lives as well as societal progress without endless growth. Technological progress, too, doesn't have to mean burning our oceans. We just gotta actually think about the costs and consequences of our actions.
Not every technological development is inherently good. Sometimes the cost is not worth the result. I posit the cost of AI so far has been astronomical, higher than anything else in living memory. The results on the other hand have been rather middling.
This is my issue. A cost/benefit analysis, not a strict no to progress.
Have you ever made a decision to NOT download something, turn on your computer, experiment, etc based on your perceived impact on the planet?
I mean this should be (and is being) tackled at the source: zero/low-emission energy generation, rather than consumers having to think about these decisions. Sustainable data centers using renewables, etc. But that doesn't mean companies should evaluate bytes downloaded in terms of environmental impact.
>not consumer having to think about these decisions
Consumers vote and advocate for what they want and don't want. There are many who say it's not an individual problem and should be dealt with broadly through regulation, then also oppose any attempts at regulation.
> this should (and is) be tackled at the source: 0/low emission energy generation and not consumer having to think about these decisions.
Until we're at that point though, the 'winners' in this market society (who wield unimaginable amounts of money = resources), such as Google, could certainly think about the consequences of their choices. And they usually do to some extent; I'm not saying they don't, just that electricity supply and demand has two sides to it.
I'm going to assume you work in tech and know the issues that come with scale.
Me individually not doing something is absolutely going to be drowned out by the scale of many other people not thinking of it or being incentivized against it.
This is a systemic issue. A systemic issue needs a systemic solution, not a blame shift to the individual.
We didn't get rid of lead in gas or asbestos in walls by telling people it was bad for them. We did so by banning it.
> The NHS launders money the indebted government doesn’t have into terrible health outcomes. This feels like a benefit because it conceals from patients the true cost of their care, while its shortcomings relative to other countries are noticeable only to policy nerds. That’s how most of Europe’s welfare states work.
The UK has less debt than the US and much better average health outcomes, while spending less on health per capita. This is just intellectually dishonest framing of how welfare systems work, ironically in a piece about comparative poverty.
What happens on days when renewables can't produce enough energy? Or in the evenings when we don't have enough batteries (all evenings so far, and for the next decade at least)? You can call it base load or whatever you want, but that energy is coming either from hydro, nuclear, or a carbon-based source. And carbon-based fuels are hard to come by these days, so even if nuclear power is expensive, at least it is reliable.
It takes a decade at least for any new nuclear starting today to come online in the west. In that decade you’ve built an awful lot of batteries for the same amount of money.
No one wants to bet $10s of billions of nuke capex against the relentless progress of batteries and other tech over the next 10 years, and then the 30+ years of plant operations. It's a sucker's bet, so the only ones who can take it are nation states.
Given that we don't have nukes, and we won't for 10 years even if we started today, and we aren't going to start them because they're economic disasters...
In the medium term it's going to be batteries + solar/wind + gas backups for rare weather events. If we get the total annual use of gas down to a very achievable 10%, we're still massively winning climate-wise. California is getting there: 45% gas in 2022, 25% gas in 2025, and adding batteries at a massively increasing rate. Full coverage of an average night is within sight, using gas just for shortfalls.
We can hopefully transition the last peaking gas backup usage to something else in the long term (hydrogen? SMRs if they ever exist?) but it isn't _that_ important in the grand arc of saving the climate.
So now the discussion is not about whether base load is a thing or not, it is that you firmly believe that batteries are the answer to everything.
First it should be said that this thread is primarily about decomissioning existing nuclear power plants. It makes enormous sense to keep operating those plants until we have a world like the one you describe, regardless of how much newer plants would cost.
But more importantly, your assumptions about the future are very optimistic. I'm sure the Germans also thought they were being very smart when they decided that nuke capex was not worth it because gas was so cheap and easily available, and now we are finding out that this decision crippled their economy because it created a dependency. In my opinion, throwing all your chips into a technology that requires materials and production capacity you don't have, and in some cases that doesn't even exist yet, is a real sucker's bet. All your rosy scenarios would fall apart in one second if China decided to stop selling batteries to you.
> So now the discussion is not about whether base load is a thing or not, it is that you firmly believe that batteries are the answer to everything.
Nope, I'm still talking about the economics of base load. It exists insofar as there is base load _demand_, aka the minimum demand point the grid has. Base load _supply_ is not a thing: there is no rule of nature or economics that says you have to match that minimum demand with a static allocation of unvarying power sources like slow thermal (coal, nukes). That worked for a while as an economic optimization, but now, on grids with variable sources like wind, solar, and batteries, it doesn't work. If your plant has to run at 100% at all times to be profitable (nukes), your economic model is now broken.
> First it should be said that this thread is primarily about decomissioning existing nuclear power plants. It makes enormous sense to keep operating those plants until we have a world like the one you describe, regardless of how much newer plants would cost.
Yep, I have absolutely no objections to keeping existing plants running; that's a smart thing to do. It's building new ones that doesn't make economic sense anymore.
> All your rosy scenarios would fall apart in one second if China decides to stop selling batteries to you.
True, but it's easier to build a homegrown battery manufacturing industry than it is a nuclear industry.
Mistral has a very difficult scenario to navigate. Training models in Europe is difficult and expensive because of regulations and energy prices. Their own open models are lagging behind the Chinese ones. That means eventually they will turn into an inference-only enterprise running mostly Chinese open models, at which point any other European player could compete (Hetzner, OVHCloud, etc.)
The regulatory concerns are worldwide: the GDPR has restrictions on the territorial location of data, so you cannot move data anywhere other than the EU or "adequate" countries (in practice, the US). Since the real gold is in using data that users submitted to you (i.e. GDPR-protected data), they are kind of stuck in regards to where they can train.
Mistral's stack already heavily relies on American cloud providers and they have tons of American investors, so its sovereignty angle is dubious anyway.
Is this scenario far-fetched? Just as nations pay large companies to build a factory in their country, nations will similarly pay AI companies to build a national AI model for their own consumption, because AI is that beneficial.
It's a risk, but since they have training expertise they should be able to distill the best open-source models to reach at least approximate parity comfortably. Frontier-model territory looks increasingly out of reach for anyone without $100B for training, and then you have to serve inference to recoup the cost, which is an expensive proposition in the EU.
...OTOH the cost of not sponsoring this in Europe may be complete technological obsolescence. Rock and a hard place situation.
The US single-handedly dominating AI at this point probably means a handful of tech overlords in charge of a surveillance society which depends on AI for everything, with some vague promises that everyone else will get some sort of allowance if they feel benevolent enough. For all existential risks discussed about ASI or whatever, having an oligarchy in complete control of this tech is maybe even worse.
So, I guess we all have to hope that more money does not necessarily lead to a "victory" here.