OpenAI recently announced a Batch API [1] which lets you prepare all your prompts up front and run them as a single batch. This cuts costs, since it's priced at 50% of the normal rate. I've used it a lot with GPT-4o mini in the past and was able to prompt 3,000 items in under 5 minutes. Could be great for non-realtime applications.
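For anyone curious, the workflow is roughly this: write one JSONL line per request, upload the file, then create a batch job. A minimal sketch with the Python SDK, based on the docs in [1] (the item data and prompt here are made up):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder items to prompt in bulk.
items = ["first document", "second document", "third document"]

# Each line is a standalone request; custom_id lets you match responses
# back to inputs when the batch finishes.
with open("batch_input.jsonl", "w") as f:
    for i, text in enumerate(items):
        request = {
            "custom_id": f"item-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
                "max_tokens": 100,
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file and start the batch; results arrive as an output JSONL
# file you download once batch.status becomes "completed".
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

You then poll the batch (or check back later) and pull the output file; each output line carries the same custom_id as the request it answers.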
I hope some of the open-source inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support using the same file format; they just haven't gotten around to implementing it on the OpenAI-compatible endpoint yet.
The point of the endpoint is to be able to standardize my codebase around a provider-agnostic interface, so the same code works no matter which LLM backend is serving it.
Continuous batching is helpful for this type of thing, but it really isn't everything you need. Ideally you'd maintain a low-priority queue for the batch endpoint and a high-priority queue for your real-time chat/completions endpoint.
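To make the two-queue idea concrete, here's a toy sketch (the queue names, the fake_generate stand-in, and the overall structure are my own illustration, not how any real server schedules this; in practice it would live inside the continuous-batching scheduler rather than in front of it):

```python
# Real-time requests always drain first; batch requests fill idle capacity.
import queue
import threading
import time

realtime_q: "queue.Queue[str]" = queue.Queue()  # chat/completions traffic
batch_q: "queue.Queue[str]" = queue.Queue()     # batch-endpoint traffic

def fake_generate(prompt: str) -> str:
    time.sleep(0.1)  # stand-in for model inference
    return f"response to {prompt!r}"

def worker() -> None:
    while True:
        try:
            # High-priority work preempts batch work whenever any is waiting.
            prompt = realtime_q.get_nowait()
        except queue.Empty:
            try:
                prompt = batch_q.get(timeout=0.05)
            except queue.Empty:
                continue
        print(fake_generate(prompt))

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    batch_q.put(f"batch item {i}")
realtime_q.put("interactive user question")  # gets served ahead of remaining batch items
time.sleep(1)
```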
That's a great proposition by OpenAI.
I think, however, that it is still one to two orders of magnitude too expensive compared to traditional text extraction with very similar precision and recall.
Yeah, this was a phenomenal decision on their part. I wish some of the other cloud platforms like Azure would offer the same thing; it just makes so much sense!
[1] https://platform.openai.com/docs/guides/batch