OpenAI recently announced a Batch API [1] which lets you prepare all your prompts up front and run them as a single batch. This cuts costs, since it's priced at 50% of the normal rate. I've used it a lot with GPT-4o mini in the past and was able to prompt 3,000 items in under 5 minutes. Could be great for non-realtime applications.
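For anyone curious, the workflow is roughly this: write one JSONL line per request, upload the file, then create a batch job. A minimal sketch with the Python SDK, based on the docs in [1] (the item data and prompt here are made up):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder items to prompt in bulk.
items = ["first document", "second document", "third document"]

# Each line is a standalone request; custom_id lets you match responses
# back to inputs when the batch finishes.
with open("batch_input.jsonl", "w") as f:
    for i, text in enumerate(items):
        request = {
            "custom_id": f"item-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
                "max_tokens": 100,
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file and start the batch; results arrive as an output JSONL
# file you download once batch.status becomes "completed".
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

You then poll the batch (or check back later) and pull the output file; each output line carries the same custom_id as the request it answers.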
I hope some of the open-source inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support using the same file format; they just haven't gotten around to implementing it on the OpenAI-compatible endpoint yet.
The point of the endpoint is to be able to standardize my codebase around a provider-agnostic interface, so the same code works no matter which LLM backend is serving it.
Continuous batching is helpful for this type of thing, but it really isn't everything you need. Ideally you'd maintain a low-priority queue for the batch endpoint and a high-priority queue for your real-time chat/completions endpoint.
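To make the two-queue idea concrete, here's a toy sketch (the queue names, the fake_generate stand-in, and the overall structure are my own illustration, not how any real server schedules this; in practice it would live inside the continuous-batching scheduler rather than in front of it):

```python
# Real-time requests always drain first; batch requests fill idle capacity.
import queue
import threading
import time

realtime_q: "queue.Queue[str]" = queue.Queue()  # chat/completions traffic
batch_q: "queue.Queue[str]" = queue.Queue()     # batch-endpoint traffic

def fake_generate(prompt: str) -> str:
    time.sleep(0.1)  # stand-in for model inference
    return f"response to {prompt!r}"

def worker() -> None:
    while True:
        try:
            # High-priority work preempts batch work whenever any is waiting.
            prompt = realtime_q.get_nowait()
        except queue.Empty:
            try:
                prompt = batch_q.get(timeout=0.05)
            except queue.Empty:
                continue
        print(fake_generate(prompt))

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    batch_q.put(f"batch item {i}")
realtime_q.put("interactive user question")  # gets served ahead of remaining batch items
time.sleep(1)
```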
That's a great proposition by OpenAI.
I think, however, that it is still one to two orders of magnitude too expensive compared to traditional text extraction with very similar precision and recall.
Yeah, this was a phenomenal decision on their part. I wish some of the other cloud platforms like Azure would offer the same thing; it just makes so much sense!
[1] https://platform.openai.com/docs/guides/batch