Ask HN: Devs using LLMs, how are you keeping costs low for LLM calls locally?
5 points by spruce_tips on July 18, 2024 | 8 comments
My project has a multi-step LLM flow using gpt-4o.

While developing new features and testing locally, the LLM flow runs frequently and burns through a lot of tokens, so my OpenAI bill spikes.

I've made some efforts to stub LLM responses (roughly the approach sketched below), but it adds a decent bit of complexity and work. I don't want to run a model locally with ollama because I need the output to be high quality and fast.

Curious how others are handling similar situations.
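
For reference, a minimal sketch of the stubbing approach, assuming the OpenAI Python SDK; the STUB_LLM flag, step names, and canned responses are illustrative, not from the post:

    import os
    from openai import OpenAI

    client = OpenAI()

    # Canned responses per flow step (hypothetical step names).
    CANNED = {
        "extract": '{"items": []}',
        "summarize": "stub summary",
    }

    def chat(step: str, messages: list[dict]) -> str:
        # When STUB_LLM=1, skip the API entirely and return a canned response.
        if os.getenv("STUB_LLM") == "1":
            return CANNED.get(step, "stubbed response")
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        return resp.choices[0].message.content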




Options if I'm not using LangChain?



Options if I'm not using Cloudflare?


There's an open-source AI gateway: https://github.com/Portkey-AI/gateway
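
As a rough sketch, a gateway like this exposes an OpenAI-compatible endpoint, so the client just needs its base URL pointed at it. The local port (8787) and the x-portkey-provider header below are assumptions to verify against the repo's README:

    from openai import OpenAI

    # Point the OpenAI client at a locally running gateway instead of api.openai.com.
    # Port and provider header are assumptions; check the gateway README.
    client = OpenAI(
        base_url="http://localhost:8787/v1",
        default_headers={"x-portkey-provider": "openai"},
    )

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(resp.choices[0].message.content)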


You can do this easily with object caching / function memoization patterns in any modern language, which should fit your desired solution. Best of luck!
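
A minimal version of that idea in Python: memoize on a hash of the request and persist to disk so responses survive restarts. The cache location and helper name are illustrative:

    import hashlib
    import json
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    CACHE_DIR = Path(".llm_cache")  # illustrative location
    CACHE_DIR.mkdir(exist_ok=True)

    def cached_chat(model: str, messages: list[dict]) -> str:
        # Key the cache on the full request so identical dev runs reuse prior answers.
        key = hashlib.sha256(
            json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
        ).hexdigest()
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            return json.loads(path.read_text())
        resp = client.chat.completions.create(model=model, messages=messages)
        out = resp.choices[0].message.content
        path.write_text(json.dumps(out))
        return out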


Here's a mega guide on keeping costs low with LLMs - https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-u...

tl;dr:

- Keep prompts short, combine prompts, or write more detailed prompts and move to a smaller model
- Use simple and semantic cache lookups
- Classify tasks and route them to the best LLM using an AI gateway (rough routing sketch below)

Portkey.ai could help with a lot of this
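
A rough sketch of the routing idea without a gateway: classify each step up front and only send the hard ones to gpt-4o. The step names and model split are illustrative, not from the guide:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical router: cheap/simple steps go to a smaller model,
    # and gpt-4o is reserved for the steps that genuinely need it.
    CHEAP_STEPS = {"classify", "extract_keywords"}

    def route_model(step: str) -> str:
        return "gpt-4o-mini" if step in CHEAP_STEPS else "gpt-4o"

    def run_step(step: str, messages: list[dict]) -> str:
        resp = client.chat.completions.create(model=route_model(step), messages=messages)
        return resp.choices[0].message.content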


Came across this guide earlier; valuable insights. Thanks for sharing!



