People complaining about how hard it is to get a simple answer don't appreciate the complexity of figuring out the optimal model...
There are so many knobs to tweak; it's a non-trivial problem:
- Average/median length of your Prompts
- prompt eval speed (tok/s)
- token generation speed (tok/s)
- Image/media encoding speed for vision tasks
- Total amount of RAM
- Max bandwidth of RAM (DDR4, DDR5, etc.)
- Total amount of VRAM
- "-ngl" (amount of layers offloaded to GPU)
- Context size needed (you may need sub 16k for OCR tasks for instance)
- Model size (in billions of parameters)
- Number of active parameters for MoE models
- Acceptable level of Perplexity for your use case(s)
- How aggressive a quantization you're willing to accept (to keep perplexity low enough)
- even finer grain knobs: temperature, penalties etc.
Also, tok/s as a metric isn't enough, because there's also:
- thinking vs non-thinking: which mode do you need?
- models that are much more "chatty" than others in the same class (I remember testing a few models that maxed out my modest desktop's specs: Qwen 2.5 non-thinking was so much faster than the equivalent Ministral non-thinking even though they had equivalent tok/s... Qwen would respond to the point quickly)
In the end, the final questions are: are you satisfied with how long getting an answer took, and was the answer good enough?
The same exercise exists with paid APIs too. There are obviously fewer knobs, but depending on your use case there are still differences between providers and models. You can abstract away a lot of the knobs; just add "are you satisfied with how much it cost?" on top of the other two questions.
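If it helps, the time-related knobs above combine into a simple back-of-the-envelope latency estimate (a sketch with made-up numbers; swap in your own measured speeds):

```python
# Rough end-to-end latency from prompt length, prompt eval speed,
# and generation speed. All numbers below are illustrative
# assumptions, not benchmarks.

def time_to_answer(prompt_tokens, output_tokens,
                   prompt_eval_tps, gen_tps):
    """Seconds until the full answer arrives."""
    return prompt_tokens / prompt_eval_tps + output_tokens / gen_tps

# A "chatty" model with identical tok/s still feels slower,
# because it emits more output tokens for the same question.
terse = time_to_answer(2000, 150, prompt_eval_tps=300, gen_tps=25)
chatty = time_to_answer(2000, 600, prompt_eval_tps=300, gen_tps=25)
print(f"terse: {terse:.1f}s, chatty: {chatty:.1f}s")
```

This also makes the "chatty model" effect concrete: identical tok/s, very different wall-clock time.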
That's a big flaw of LLMs, not limited to RAGs: they lack a fundamental understanding of "good and bad", as Richard Sutton said in that Dwarkesh podcast.
So if you flood the Internet with "of course the moon landing didn't happen" or "of course the earth is flat" or "of course <latest 'scientific fact' lacking verifiable, definitive proof> is true", you then get a model that's repeating you the same lies.
This makes input data curation extremely important, but it also remains an unsolved problem for topics where there's no consensus.
> That's a big flaw of LLMs, not limited to RAGs: they lack a fundamental understanding of "good and bad", as Richard Sutton said in that Dwarkesh podcast.
After participating in social media since the beginning, I think this problem is not limited to LLMs.
There are certain things we can debunk all day, every day, and the only outcome is that it happens again the next day. This has been a problem since long before AI - and I personally think it started before social media as well.
> After participating in social media since the beginning, I think this problem is not limited to LLMs.
Yup, but for LLMs the problem is worse... many more people trust LLMs and their output much more than they trust Infowars. And with basic media literacy education, you can fix people trusting bad sources... but you fundamentally can't fix an LLM, it cannot use preexisting knowledge (e.g. "Infowars = untrustworthy") or cues (domain recently registered, no imprint, bad English) on its own, neither during training nor during inference.
"water is wet" kind of study, as tariffs are precisely supposed to increase price for consumers for imported goods... But the last 3 paragraphs are interesting:
- Importers raised prices more than needed (i.e. blaming tariffs to increase their profit margins)
- Price increases took one year to fully reach consumers, and persisted for nearly a year after the tariffs expired.
- chicken-tax-like loopholes implemented wherever possible (for wine apparently it's raising the ABV to more than 14%)
You remind me of the fact that humans do not in fact have sensors in the skin to detect specifically wetness.
I think given the amount of ideas floating around, it is occasionally good to revisit things that are "known", just in case some underlying assumption has changed - especially in economics, which is harder to get right as it deals a lot with what humans want and do.
I can't see how anyone can think "the exporters pay the tariff" makes any sense. TBH, we'll never know how many people thought it made sense because it didn't matter.
In the end, money moves around. If - for example - the government were to just give citizens the money from the tariffs in equal shares (not that I suggest they would, but it's technically possible), it would be like taking from the citizens that consume more and giving to the citizens that consume less.
So, yes, it is correct in a practical immediate sense that "the exporters pay the tariff", but that excludes many relevant issues, like how prices evolve (which consumers pay for), what the government does with the money (it could share it or not), and what others decide to produce (to avoid tariffs). But definitely many people didn't think of all that...
Your first 2 points make me extra bitter about COVID.
Less store hours. Higher prices. Inflation. People in school got a terrible education and it affected my workforce. (But hey 1% of people died, as predicted if we did nothing at all... )
It only reinforces the importance of competition over protectionism.
I used to be a walmart fan, but my local store is cheaper now. I didn't bother to look at prices until things were getting silly.
> But hey 1% of people died, as predicted if we did nothing at all
Nope. Compare the death rates of Sweden vs its neighbours in the Nordics (the closest comparisons we have with similar weather/culture/etc.). Or if you don't care about minimising variables, in the US between states that did lockdowns and mask mandates and those that didn't. In every comparable (e.g. excluding rural vs urban) case, there were more deaths in "doing nothing" than implementing the same basic public health axioms that have held true for centuries.
> Inflation
That was also helped by Russia invading Ukraine, which increased global prices of multiple important raw materials. But yes, inflation after a period of deflation/economic contraction/restricted travel and consumption was to be expected.
> People in school got a terrible education and it affected my workforce
It's definitely a bigger issue for them than it is for you. And yeah, it sucks for them. Would have been pretty terrible to tell teachers (who overwhelmingly skew older) they should risk their lives just to keep kids occupied too.
> It only reinforces the importance of competition over protectionism.
The thing too many forget is that if we didn't flatten the curve our entire medical system was going to collapse. It's insane that people don't yet understand this concept and can't even empathize with medical professionals. Yes, we all struggled, but try talking to medical professionals to see how they did.
When something doesn't happen because enough measures were taken, then it wasn't worth it because it didn't happen?
> The thing too many forget is that if we didn't flatten the curve our entire medical system was going to collapse
Yep, if things were going well there wouldn't have been makeshift morgues with refrigerated trucks, sick people having to be moved around to different countries, the military deploying field hospitals, corpses piling in the streets. Those examples are from a variety of countries, which shows how bad the situation was globally.
You had 6 weeks of staying at home, and then quarantines for international travellers after that. In return, you had no COVID-19 at all for several years. Seems a fair trade.
Norway had that too; without lockdown. Curfews would require a change in the constitution and the last time they happened was during WWII which makes them doubly unpopular.
Sweden all-cause mortality was indeed higher if an immediate pre-pandemic year is taken as a base. However, pre-pandemic years in Sweden show a substantial dip in all-cause mortality, something that neighboring countries did not see. It is not that simple.
On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
Quant choice depends on your VRAM, use case, need for speed, etc. For coding I would not go below Q4_K_M (though for Q4, unsloth XL or ik_llama IQ quants are usually better at the same size). Preferably Q5 or even Q6.
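A rough way to sanity-check whether a quant will fit, weights only (the bits-per-weight figures below are approximate averages for llama.cpp K-quants, not exact values, and you still need headroom for the KV cache and runtime overhead):

```python
# Weight size ≈ parameter count × bits-per-weight / 8.
# Bits-per-weight values are approximate averages; real GGUF
# files vary slightly by architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def weight_gb(params_billion, quant):
    """Approximate on-disk/in-memory size of the weights in GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in BITS_PER_WEIGHT:
    print(f"30B @ {q}: ~{weight_gb(30, q):.1f} GB")
```

On a 32GB machine that's why a 30B MoE at Q4_K_M (roughly 18 GB of weights) fits with room for context, while Q6 starts getting tight.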
> Basically, I was told to make it so that my phone's camera could see something on the screen and my desk at the same time without washing out
+1. The low-tech version of this I've heard and I've been doing is:
Hold a printed white sheet of paper right next to your monitor, and adjust the monitor's brightness until it matches the sheet.
This of course requires good overall room lighting, where the printed paper would be pleasant to read in the first place, whether it's daytime or evening/night.
I think this was what I was told the first time. The advantage of taking a picture with my phone's camera is that it made it obvious just how much brighter the screen was than the paper.
Which, fair, it may be obvious to others to just scan their eyes from screen to paper. I've been surprised by how readily people will just accept the time their eyes take to adjust to a super bright screen. Almost like it doesn't register with them.
There's some overlap with bias lighting here - good overall room lighting works if you've got good daylight, but it's much easier to get bright bias lighting at night than to light up the entire room.
I was talking about Germany's infrastructure. Last year I had 3x separate trips turn into chaos due to how broken their system is. Broken trains, broken track infrastructure etc. Think multiple hours on each trip rather than just 10 minutes delay.
- is there some standardized APIs each municipality provides, or do you go through the tedious task of building a per-municipality crawling tool?
- how often do you refresh the data? Checked a city, it has meeting minutes until 6/17, but the official website has more recent minutes (up to 12/2 at least)
- There is absolutely not a standardized API for nearly any of this. I build generalized crawlers when I can, and then build custom crawlers when I need.
- Can you let me know which city? The crawlers run for every municipality at least once every day, so that's probably a bug
I ran ollama first because it was easy, but now download source and build llama.cpp on the machine. I don't bother saving a file system between runs on the rented machine, I build llama.cpp every time I start up.
I am usually just running gpt-oss-120b or one of the qwen models. Sometimes gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on a single 80-ish GB GPU because those are cheap.
I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.
I know you say you don't use the paid APIs, but renting a GPU is something I've been thinking about, and I'd be really interested in knowing how this compares with paying by the token. I think gpt-oss-120b is $0.10 input / $0.60 output per million tokens on Azure. In my head this could go a long way, but I haven't used gpt-oss agentically long enough to really understand usage. Just wondering if you know/would be willing to share your typical usage/token spend on that dedicated hardware?
For comparison, here's my own usage with various cloud models for development:
* Claude in December: 91 million tokens in, 750k out
* Codex in December: 43 million tokens in, 351k out
* Cerebras in December: 41 million tokens in, 301k out
* (obviously those figures above are so far in the month only)
* Claude in November: 196 million tokens in, 1.8 million out
* Codex in November: 214 million tokens in, 4 million out
* Cerebras in November: 131 million tokens in, 1.6 million out
* Claude in October: 5 million tokens in, 79k out
* Codex in October: 119 million tokens in, 3.1 million out
In general, I'd say that for the stuff I do my workloads are extremely read heavy (referencing existing code, patterns, tests, build and check script output, implementation plans, docs etc.), but it goes about like this:
* most fixed cloud subscriptions will run out really quickly and will be insufficient (Cerebras being an exception)
* if paying per token, you *really* want the provider to support proper caching, otherwise you'll go broke
* if you have local hardware, that's great, but it will *never* compete with the cloud models; your best bet is to run something good enough to cover all of your autocomplete needs. With tools like KiloCode, an advanced cloud model can do the planning, a simpler local model the implementation, and the cloud model can then validate the output
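For a rough comparison with the per-token rates quoted upthread ($0.10/$0.60 per million tokens for gpt-oss-120b on Azure; a sketch that ignores cache discounts, which matter a lot for read-heavy workloads):

```python
# API cost from raw token counts at flat per-million rates.
# Ignores prompt-cache discounts, which matter a lot for
# read-heavy agentic workloads.
IN_RATE, OUT_RATE = 0.10, 0.60  # $ per million tokens (gpt-oss-120b on Azure)

def cost_usd(tokens_in, tokens_out):
    return tokens_in / 1e6 * IN_RATE + tokens_out / 1e6 * OUT_RATE

# A heavy month at these rates (214M in, 4M out, like the
# November Codex volume above, priced as if it were gpt-oss):
print(f"${cost_usd(214e6, 4e6):.2f}, vs ~$2/hr GPU rental")
```

Note the rates are assumed to apply uniformly; real bills for the Claude/Codex/Cerebras figures above would use each provider's own (much higher, cache-adjusted) pricing.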
Sorry, I don't much track or keep up with those specifics other than knowing I'm not spending much per week. My typical scenario is to spin up an instance that costs less than $2/hr for 2-4 hours. It's all just exploratory work really. Sometimes I'm running a script that is making a call to the LLM server api, other times I'm just noodling around in the web chat interface.
I don't suppose you have (or would be interested in writing) a blog post about how you set that up? Or maybe a list of links/resources/prompts you used to learn how to get there?
No, I don't blog. But I just followed the docs for starting an instance on lambda.ai and the llama.cpp build instructions. Both are pretty good resources. I had already set up an SSH key with Lambda, and the Lambda OS images are Linux pre-loaded with CUDA libraries on startup.
Here are my lazy notes + a snippet of the history file from the remote instance for a recent setup where I used the web chat interface built into llama.cpp.
I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.
connected from terminal on my box at home and setup the ssh tunnel.
ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>
Started building llama.cpp from source, history:
21 git clone https://github.com/ggml-org/llama.cpp
22 cd llama.cpp
23 which cmake
24 sudo apt list | grep libcurl
25 sudo apt-get install libcurl4-openssl-dev
26 cmake -B build -DGGML_CUDA=ON
27 cmake --build build --config Release
MISTAKE on 27: single-threaded and slow to build; see -j 16 below for a faster build
28 cmake --build build --config Release -j 16
29 ls
30 ls build
31 find . -name "llama.server"
32 find . -name "llama"
33 ls build/bin/
34 cd build/bin/
35 ls
36 ./llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja
MISTAKE, didn't specify the port number for the llama-server
I switched to qwen3 vl because I need a multimodal model for that day's experiment. Lines 38 and 39 show me not using the right name for the model. I like how llama.cpp can download and run models directly off of huggingface.
Then I pointed my browser at http://localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an OpenAI API-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.
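For scripting against that endpoint instead of using the web UI (a sketch; it assumes the SSH tunnel above is up, so llama.cpp's standard /v1/chat/completions route is reachable at localhost:22434):

```python
import json
import urllib.request

def chat_request(prompt, url="http://localhost:22434/v1/chat/completions"):
    """Build an OpenAI-style chat request for llama.cpp's server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # illustrative; tune per task
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the tunnel and server running:
# resp = json.load(urllib.request.urlopen(chat_request("Summarize: ...")))
# print(resp["choices"][0]["message"]["content"])
```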
> $1 and $2 coins in wide circulation (instead of worn-out $1 bills).
This has its own pros/cons...
One advantage of the $1 bill over the coin is that the majority of people in the US don't need a wallet with a zipper to hold coins. Five $1 bills are much less bulky and much lighter than five $1 CAD or five 1€ coins.
I would contend that 5 bills are more bulky than 5 coins. The only upside of dealing with US bills when travelling in the US is that you feel like a millionaire when you pull out the massive wad of bills from your pocket.