Is it, though? Hyperbolic praise of a vendor on the web is quite transparent, so a casual reader is hardly going to be fooled. LLM distortion would be just as severe BUT much harder to spot.
The other comment stating "Yann LeCun is the Paul Krugman of AI" does resonate with me. There is a lot to be criticized about his takes on AI in general, and the need for a "worldview" in particular.
Lokad.com | Full stack, Backend, Frontend, Compiler | REMOTE or ONSITE | Paris, France | Full-time | https://www.lokad.com
Lokad is a bootstrapped profitable software company - 60 employees and growing fast - that specializes in predictive supply chain optimization. We are based in France, but the majority of our clients are outside France.
Supply chains remain wasteful and poorly resilient to tail risks (as demonstrated by the present-day situation). We’re talking about roughly 15% of the worldwide economy: supply chains are vast, and double-digit improvements remain possible. We want to put supply chains on AI autopilot, and deliver above-human performance while doing so.
Technologies used: C#, F#, Typescript, .NET Core, Linux
While there is some obvious US-centric left wing bias in most major LLMs, I am not sure this is what is at play here. I routinely end up with similar behaviors on all sorts of subjects.
Do any of those LLM-as-a-service companies provide a mechanism to "save" a given input, paying only for the state storage and the extra input when continuing the completion from the snapshot?
Indeed, at 1M tokens and $15/M tokens, we are talking about $10+ API calls (per call) when maxing out the LLM capacity.
I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.
Right now, only ChatGPT (the webapp) seems to be using such snapshots.
> I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.
If you don't care about latency, or can wait to set up a batch of inputs in one go, there's an alternative method. I call it batch prompting, and pretty much everything we do at work with gpt-4 uses this now. If people are interested I'll do a proper writeup on how to implement it, but the general idea is very straightforward and works reliably. I also think this is a much better evaluation of context than needle-in-a-haystack.
Example: classifying game genres from descriptions.
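Roughly, the idea looks like this as a sketch; the openai client calls, the JSON output format, and the function name are just illustrative, not a fixed recipe:

    # Sketch of "batch prompting": pack many items into one completion call
    # and ask the model to answer per item. The prompt wording is illustrative.
    import json
    from openai import OpenAI

    client = OpenAI()

    def classify_genres(descriptions):
        # Number each description so the model keeps the instances separate.
        numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
        prompt = (
            "For each numbered game description below, give its genre.\n"
            "Answer with a JSON array of strings, one per item, in the same order.\n\n"
            + numbered
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Sketch only: real code should handle the model wrapping the JSON in prose.
        return json.loads(resp.choices[0].message.content)

    print(classify_genres([
        "You manage a medieval village, gather resources and defend its walls.",
        "A fast-paced arena shooter with rocket jumps and capture the flag.",
    ]))

The batch size per call is the main knob: bigger batches amortize the fixed instructions across more items, but increase the chance of items bleeding into each other.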
I attempted similar mechanics multiple times in the past, but always ditched them, as there was always a non-negligible amount of cross-contamination happening between the individual instances you are batching. That caused enough of a headache that it wasn't really worth it.
Yeah, that's definitely a risk with language models, but it doesn't seem to be too bad for my use cases. Can I ask what tasks you used it for?
I don't really intend for this method to be final. I'll switch everything over to finetunes at some point. But this works way better than I would have expected, so I kept using it.
One thing I tried using it for was a summarization/reformulation task, where it did RAG over ~3-4 smallish (~single-sentence) documents per instance, and each instance should end up as one coherent sentence. There, batching either caused one of the facts to slip into an adjacent instance, or two instances to be merged into one.
Another thing I used it for was data extraction, where I extracted units of measurement and other key attributes out of descriptions from classifieds listings (my SO and I were looking for a cheap used couch). Non-batched it performed very well, while in batched mode it either mixed up the dimensions of multiple listings, or gave the summary for the initial listing and then just nulls for all following listings.
Yes: That's essentially their fine-tuning offerings. They rewrite some weights in the top layers based on your input, and save+serve that for you.
It sounds like you would like a wrapped version tuned just for big context.
(As others write, RAG versions are also being supported, but they're less fundamentally similar. RAG is about preprocessing to cut the input down to the relevant bits. RAG + an agent framework does get closer again, though, by putting this into a reasoning loop.)
FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.
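For illustration, a bare-bones version of that flow; the embedding model name and the cosine-similarity top-k retrieval are just one common way to set it up, not the only one:

    # Minimal RAG sketch: embed the docs, embed the query, keep only the
    # top-k most similar docs, and send just those to the LLM.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    def answer(question, docs, k=4):
        doc_vecs = embed(docs)          # deterministic for a given input
        q_vec = embed([question])[0]
        sims = doc_vecs @ q_vec / (
            np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
        )
        top = [docs[i] for i in np.argsort(-sims)[:k]]
        prompt = (
            "Answer using only this context:\n"
            + "\n---\n".join(top)
            + f"\n\nQ: {question}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

Only the selected top-k docs get re-sent per call, which is where the cost saving over shipping the whole knowledge base comes from.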
With 1M tokens, if snapshotting the LLM state is cheap, it would beat nearly all RAG setups out of the box, except the ones dealing with large datasets. 1M tokens is a lot of docs.
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC when Google showed their demos for this use case each request took over 1 minute for ~650k tokens.
The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelop guesstimate gives me 1GB to capture the 8bit state of 70B parameters model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per pay per "saved" prompt, plus maybe a fix per call fee to re-load the state.
My calculation of the KV cache gives 1GB per 3000 tokens for fp16. I am surprised OpenAI competitors haven't done this. This kind of feature has not-so-niche uses, where prefix data could be cached.
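For reference, the arithmetic with Llama-2-70B-like shapes (80 layers, 8 KV heads thanks to GQA, head dim 128, 2 bytes per value in fp16; other models will differ):

    # Rough KV-cache size per token, assuming Llama-2-70B-like dimensions.
    layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2  # fp16
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
    print(per_token)               # 327,680 bytes, i.e. ~320 KB per token
    print((1 << 30) // per_token)  # ~3,276 tokens per GiB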
That's a great idea! It would also open up the possibility of very long 'system prompts' on the provider's side, so they could better fine-tune their guardrails.
I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange they (as well as you) get to avoid many expensive inference rounds.
The problem is that it’s probably often not a lot cheaper. Most of the high-end GPUs have comparatively little bandwidth over PCIe (which you’d need to use to store the context on an NVMe drive, for example). The cost there would scale with length too, so you wouldn’t necessarily save more in that situation either. I think if you used a small enough GQA ratio and you knew for sure you would reuse the cached context it could work, but my suspicion is that in general it would just be cheaper to recalculate.
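To put rough numbers on it (assuming ~320 KB/token of fp16 KV cache for a GQA 70B-class model, and PCIe 4.0 x16 at a theoretical ~32 GB/s; real-world bandwidth is lower):

    # Back-of-the-envelope: paging a cached context in/out over PCIe.
    kv_bytes_per_token = 320 * 1024     # assumed, see the KV-cache estimate above
    pcie_bytes_per_s = 32e9             # PCIe 4.0 x16, theoretical peak
    for tokens in (100_000, 650_000, 1_000_000):
        gb = tokens * kv_bytes_per_token / 1e9
        print(f"{tokens:>9} tokens: ~{gb:.0f} GB, ~{gb * 1e9 / pcie_bytes_per_s:.1f} s over PCIe")
    # Both the storage and the transfer time scale linearly with context length,
    # so whether this beats recomputing the prefill depends on the model's
    # prefill throughput and how often the prefix is actually reused.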
Thanks! The interesting thing is that my casual observations indicate that GPT itself might already be good enough to act as its own arbiter, just like a human writer can improve their own writing by iterating over it. In a sense, having humans in the loop was what it took, in the past, to reach this self-arbitration capacity.
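As a toy illustration of what I mean by self-arbitration (the prompts and the number of rounds are arbitrary, only the shape of the loop matters):

    # Toy self-arbitration loop: the model drafts, critiques its own draft,
    # then revises. Assumes the openai Python client; prompts are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def self_arbitrated(task, rounds=2):
        draft = ask(task)
        for _ in range(rounds):
            critique = ask(f"Critique this answer to '{task}', listing concrete flaws:\n\n{draft}")
            draft = ask(f"Rewrite the answer to '{task}', fixing these flaws:\n\n{critique}\n\nOriginal answer:\n{draft}")
        return draft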