Is it, though? Hyperbolic praise of a vendor on the web is quite transparent, so a casual reader is hardly going to be fooled. LLM distortion would be just as severe BUT much harder to spot.
The other comment stating "Yann LeCun is the Paul Krugman of AI" does resonate with me. There is a lot to be criticized about his takes on AI in general, and the need for a "worldview" in particular.
Lokad.com | Full stack, Backend, Frontend, Compiler | REMOTE or ONSITE | Paris, France | Full-time | https://www.lokad.com
Lokad is a bootstrapped profitable software company - 60 employees and growing fast - that specializes in predictive supply chain optimization. We are based in France, but the majority of our clients are outside France.
Supply chains remain wasteful and poorly resilient to tail risks (as demonstrated by the present-day situation). We’re talking about roughly 15% of the worldwide economy: supply chains are vast, and double-digit improvements remain possible. We want to put supply chains on AI autopilot, and deliver above-human performance while doing so.
Technologies used: C#, F#, Typescript, .NET Core, Linux
While there is some obvious US-centric left wing bias in most major LLMs, I am not sure this is what is at play here. I routinely end up with similar behaviors on all sorts of subjects.
Do any of those LLM-as-a-service companies provide a mechanism to "save" a given input, paying only for the state storage and the extra input when continuing the completion from the snapshot?
Indeed, at 1M tokens and $15/M tokens, we are talking about $10+ API calls (per call) when maxing out the LLM capacity.
I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.
Right now, only ChatGPT (the webapp) seems to be using such snapshots.
> I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.
If you don't care about latency, or can wait to set up a batch of inputs in one go, there's an alternative method. I call it batch prompting, and pretty much everything we do at work with gpt-4 uses this now. If people are interested I'll do a proper writeup on how to implement it, but the general idea is very straightforward and works reliably. I also think this is a much better evaluation of context than needle-in-a-haystack.
Example: classifying game genres from descriptions.
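Roughly, the idea looks like this as a sketch; the openai client calls, the JSON output format, and the function name are just illustrative, not a fixed recipe:

    # Sketch of "batch prompting": pack many items into one completion call
    # and ask the model to answer per item. The prompt wording is illustrative.
    import json
    from openai import OpenAI

    client = OpenAI()

    def classify_genres(descriptions):
        # Number each description so the model keeps the instances separate.
        numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
        prompt = (
            "For each numbered game description below, give its genre.\n"
            "Answer with a JSON array of strings, one per item, in the same order.\n\n"
            + numbered
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Sketch only: real code should handle the model wrapping the JSON in prose.
        return json.loads(resp.choices[0].message.content)

    print(classify_genres([
        "You manage a medieval village, gather resources and defend its walls.",
        "A fast-paced arena shooter with rocket jumps and capture the flag.",
    ]))

The batch size per call is the main knob: bigger batches amortize the fixed instructions across more items, but increase the chance of items bleeding into each other.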
I attempted similar mechanics multiple times in the past, but always ditched them, as there was always a non-negligible amount of cross-contamination happening between the individual instances you are batching. That caused enough of a headache that it wasn't really worth it.
Yeah, that's definitely a risk with language models, but it doesn't seem to be too bad for my use cases. Can I ask what tasks you used it for?
I don't really intend for this method to be final. I'll switch everything over to finetunes at some point. But this works way better than I would have expected, so I kept using it.
One thing I tried using it for was a summarization/reformulation task, where it did RAG over ~3-4 smallish (~single-sentence) documents per instance, and each instance should end up as one coherent sentence. There, batching either caused one of the facts to slip into an adjacent instance, or two instances to be merged into one.
Another thing I used it for was data extraction, where I extracted units of measurement and other key attributes out of descriptions from classifieds listings (my SO and I were looking for a cheap used couch). Non-batched it performed very well, while in batched mode it either mixed up the dimensions of multiple listings, or gave the summary for the initial listing and then just nulls for all following listings.
Yes: That's essentially their fine-tuning offerings. They rewrite some weights in the top layers based on your input, and save+serve that for you.
It sounds like you would like a wrapped version tuned just for big context.
(As others write, RAG versions are also being supported, but they're less fundamentally similar. RAG is about preprocessing to cut the input down to the relevant bits. RAG + an agent framework does get closer again, though, by putting this into a reasoning loop.)
FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.
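For illustration, a bare-bones version of that flow; the embedding model name and the cosine-similarity top-k retrieval are just one common way to set it up, not the only one:

    # Minimal RAG sketch: embed the docs, embed the query, keep only the
    # top-k most similar docs, and send just those to the LLM.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    def answer(question, docs, k=4):
        doc_vecs = embed(docs)          # deterministic for a given input
        q_vec = embed([question])[0]
        sims = doc_vecs @ q_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
        )
        top = [docs[i] for i in np.argsort(-sims)[:k]]
        prompt = (
            "Answer using only this context:\n"
            + "\n---\n".join(top)
            + f"\n\nQ: {question}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

Only the selected top-k docs get re-sent per call, which is where the cost saving over shipping the whole knowledge base comes from.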
With 1M tokens, if snapshotting the LLM state is cheap, it would beat nearly all RAG setups out of the box, except the ones dealing with large datasets. 1M tokens is a lot of docs.
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC when Google showed their demos for this use case each request took over 1 minute for ~650k tokens.
The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelop guesstimate gives me 1GB to capture the 8bit state of 70B parameters model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per pay per "saved" prompt, plus maybe a fix per call fee to re-load the state.
My calculation of the KV cache gives 1GB per 3000 tokens for fp16. I am surprised OpenAI competitors haven't done this. This kind of feature has not-so-niche uses, where prefix data could be cached.
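For reference, the arithmetic with Llama-2-70B-like shapes (80 layers, 8 KV heads thanks to GQA, head dim 128, 2 bytes per value in fp16; other models will differ):

    # Rough KV-cache size per token, assuming Llama-2-70B-like dimensions.
    layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2  # fp16
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
    print(per_token)               # 327,680 bytes, i.e. ~320 KB per token
    print((1 << 30) // per_token)  # ~3,276 tokens per GiB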
That's a great idea! It would also open up the possibility of very long 'system prompts' on the provider's side, so they could better fine-tune their guardrails.
I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange they (as well as you) get to avoid many expensive inference rounds.
The problem is that it’s probably often not a lot cheaper. Most of the high-end GPUs have comparatively little bandwidth over PCIe (which you’d need to use to store the context on an NVMe drive, for example). The cost there would scale with length too, so you wouldn’t necessarily save more in that situation either. I think if you used a small enough GQA ratio and you knew for sure you would reuse the cached context it could work, but my suspicion is that in general it would just be cheaper to recalculate.
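To put rough numbers on it (assuming ~320 KB/token of fp16 KV cache for a GQA 70B-class model, and PCIe 4.0 x16 at a theoretical ~32 GB/s; real-world bandwidth is lower):

    # Back-of-the-envelope: paging a cached context in/out over PCIe.
    kv_bytes_per_token = 320 * 1024     # assumed, see the KV-cache estimate above
    pcie_bytes_per_s = 32e9             # PCIe 4.0 x16, theoretical peak
    for tokens in (100_000, 650_000, 1_000_000):
        gb = tokens * kv_bytes_per_token / 1e9
        print(f"{tokens:>9} tokens: ~{gb:.0f} GB, ~{gb * 1e9 / pcie_bytes_per_s:.1f} s over PCIe")
    # Both the storage and the transfer time scale linearly with context length,
    # so whether this beats recomputing the prefill depends on the model's
    # prefill throughput and how often the prefix is actually reused.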
Thanks! The interesting thing is that my casual observations indicate that GPT itself might already be good enough to act as its own arbiter, just like a human writer can improve their own writing by iterating over it. In a sense, having humans in the loop was what it took, in the past, to reach this self-arbitration capacity.
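As a toy illustration of what I mean by self-arbitration (the prompts and the number of rounds are arbitrary, only the shape of the loop matters):

    # Toy self-arbitration loop: the model drafts, critiques its own draft,
    # then revises. Assumes the openai Python client; prompts are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def self_arbitrated(task, rounds=2):
        draft = ask(task)
        for _ in range(rounds):
            critique = ask(f"Critique this answer to '{task}', listing concrete flaws:\n\n{draft}")
            draft = ask(f"Rewrite the answer to '{task}', fixing these flaws:\n\n{critique}\n\nOriginal answer:\n{draft}")
        return draft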