Show HN: Running LLMs in one line of Python without Docker (lepton.ai)
68 points by jiayq84 on Oct 4, 2023 | 26 comments
Hello Hacker News! We're Yangqing, Xiang and JJ from lepton.ai. We are building a platform to run any AI model as easily as writing local code, and to get your favorite models up and running in minutes. It's like containers for AI, but without the hassle of actually building a Docker image.

We built and contributed to some of the world's most popular AI software - PyTorch 1.0, ONNX, Caffe, etcd, Kubernetes, etc. We also managed hundreds of thousands of computers in our previous jobs. And we found that the AI software stack is usually unnecessarily complex - and we want to change that.

Imagine you are a developer who sees a good model on GitHub or HuggingFace. To make it a production-ready service, the current solution usually requires you to build a Docker image. But think about it: you have a bit of Python code and a few Python dependencies. Building an image for that sounds like a huge overhead, right?

lepton.ai is really a pythonic way to free you from such difficulties. You write a simple Python scaffold around your PyTorch / TensorFlow code, and lepton launches it as a full-fledged service callable via Python, JavaScript, or any language that understands OpenAPI. We use containers under the hood, but you don't need to worry about the infrastructure nuts and bolts.
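As a rough illustration, the scaffold looks something like the toy sketch below (based on the open-source leptonai library; treat the exact class and decorator names as approximate):

from leptonai.photon import Photon

class Echo(Photon):
    def init(self):
        # in a real photon you would load your PyTorch / TensorFlow model here
        self.prefix = "echo: "

    @Photon.handler
    def run(self, text: str) -> str:
        # each handler becomes an HTTP endpoint in the generated OpenAPI spec
        return self.prefix + text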

One of the biggest challenges in AI is that it's really "all-stack": in addition to a plethora of models, AI applications usually involve GPUs, cloud infra, web services, DevOps, and SysOps. We want you to focus on your job - we take care of the "boring but essential" rest.

We're really excited we get to show this to you all! Please let us know your thoughts and questions in the comments.



To show some actual coding examples, we have made the Python library open source at https://github.com/leptonai/leptonai/. With it, launching a common HuggingFace model is a one-liner. For example, if you have a GPU, Stable Diffusion XL is as simple as:

pip install -U leptonai

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

And you have a local OpenAPI server that runs it! Go to http://0.0.0.0:8080/docs, or use your favorite OpenAPI client.
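For example, you can poke at it with plain HTTP - a rough sketch, assuming a FastAPI-style server (which is what the /docs page suggests); the /run path and payload below are only illustrative, since the real endpoint names come from the photon's handlers:

import requests

# the machine-readable spec lists the endpoints this photon exposes
spec = requests.get("http://0.0.0.0:8080/openapi.json").json()
print(list(spec["paths"].keys()))

# call an endpoint; "/run" and its payload here are placeholders
resp = requests.post("http://0.0.0.0:8080/run", json={"prompt": "a cat in space"})
print(resp.status_code, resp.headers.get("content-type"))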

We've been building AI API services using these tools ourselves. The easiest way to try out Lepton is to head to https://lepton.ai/playground and use our API service for popular models: Stable Diffusion, LLaMA, WhisperX, and other interesting showcases.

We are proud of our performance. For example, we have probably the fastest LLaMA 7B and 70B model APIs, and inference costs $0.80 per 1 million tokens - we believe that's the most affordable on the market. In addition, during the open beta phase, calling these services is free when you sign up for the Lepton AI platform.

Under the hood, we built a platform that lets you run things on the cloud with ease. For example, if you find Pygmalion to be a great conversation model but you don't have a GPU, use lepton's Remote() capability to launch a service:

from leptonai import Remote

pygmalion = Remote("hf:PygmalionAI/pygmalion-2-7b", resource_shape="gpu.a10")

Wait a few minutes for the model to download and start, and you can then use it as if it were a standard Python function:

print(pygmalion.run(inputs="Once upon a time", max_new_tokens=128))

If you are interested in the operational details, the fully managed platform at https://dashboard.lepton.ai/ gives you fine-grained controls - we also support BYOC (bring your own compute) if you are an enterprise that needs more autonomy over infrastructure.


congrats yangqing et al! i was really impressed by your llama2 demo https://llama2.lepton.run/ where you showed that you were the "fastest llama runners" (https://twitter.com/swyx/status/1695183902770614724). definitely needed for model hosting infra.


Thanks so much for the warm words!


How is data secured and protected?

Is the company collecting all my prompts and responses?

I don’t see a privacy policy or anything like that linked on the main page.


Thanks - the policies are listed here: https://www.lepton.ai/policies

we'll put a link on our homepage.

In short - we do not collect, record, or log any of your prompts and responses. They are computed in memory, returned and discarded on the fly.


Looks interesting! Your llama 2 demo is unfortunately down: https://imgur.com/a/MLw6dAk


Great catch! Our cloud machine encountered a CUDA error (the GPU fell off the PCIe bus) - we had to restart it. It's back to normal now.

All the more reason to have a managed version of services :)


Wow, I love the QR code feature. That is so cool. I run events for parties and the thought of having cool QR code tickets is sick. Great work!


But how can I use Python without Docker?


Mind elaborating a bit? It feels like the competition between Python env management tools is quite stiff: venv, conda, mamba, poetry (and Docker for sure, but at a different level).


Sure. Building a reproducible image of a Python app for another platform basically can't be done without Docker. With lots of devs using macOS and deploying to Linux servers, this is a common scenario. Sadly, the dependency management tools are not built for this use case. Docker is the only cross-compilation story in town.


llama.cpp (and derivative projects) is quickly becoming SOTA for many use cases, and it basically has zero dependencies.

Kobold.cpp, for example, provides an entire web UI and API with Python and three Python packages (numpy, sentencepiece, and gguf, which is the llama.cpp library). The LLM itself is a single file you can get with curl or whatever. It takes less than a minute to compile against the native CPU/accelerator architecture, with nothing but the GPU libs themselves, which nets better performance than a generic binary distribution.

...It's not "one line", I guess, but I can hardly imagine a simpler setup. It doesn't really need Docker or a fancy container.


Thanks - we definitely agree that llama.cpp is great. Big fan of their optimizations. We are more or less orthogonal to the engines though, in the sense that we serve as the infra/platform to run and manage those implementations easily, and we support a wider range of models - for example, SDXL is a single line too:

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

It's really about productizing a wide range of models as easily as possible.


SDXL is indeed a monster to install and set up. The UIs are even worse.

IDK if the GPL license is compatible with your business, but I wonder if you could package Fooocus or Fooocus-MRE into a window? It's a hairy monster to install and run, but I've never gotten such consistently amazing results from a single prompt box + style dropdown (including native HF diffusers and other diffusers-based frontends). The automatic augmentations to the SDXL pipeline are amazing:

https://github.com/MoonRide303/Fooocus-MRE


Oh wow yeah, that is a beast. Let me give it a shot.


lepton is at a different layer compared to llama.cpp; in fact, for LLM model files in GGUF format, it uses llama.cpp (ctransformers, to be precise) as the execution engine.
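For reference, using ctransformers directly looks roughly like this (the HuggingFace repo and file names below are just illustrative examples, not lepton defaults):

from ctransformers import AutoModelForCausalLM

# load a GGUF model through llama.cpp via ctransformers; repo/file are examples
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    model_file="llama-2-7b.Q4_K_M.gguf",
    model_type="llama",
)
print(llm("Once upon a time", max_new_tokens=32))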


How does this compare to ollama.ai?


Ollama.ai focuses on making it as easy as possible to run models locally. We aim to provide a seamless experience that feels the same whether you're developing locally or deploying remotely for production.

disclaimer: I work at Lepton AI.


I've been using a remote ollama server with a local jupyter notebook. The langchain configuration allows me to specify the ollama host. So I can develop locally with remote models. I guess I still don't see the difference. Does lepton decouple the HTTP server from the model backend?
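Roughly, my setup looks like this (the host name is a placeholder, and I'm sketching the langchain wrapper from memory):

from langchain.llms import Ollama

# point langchain at the remote ollama server instead of localhost
llm = Ollama(base_url="http://my-gpu-box:11434", model="llama2")
print(llm("Why is the sky blue?"))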


Llama 70B but no Falcon 180B?


Are there example prompts where Falcon 180B performs better than Llama 70B?


Hardly anyone can even run a 70B model, let alone 180B. Any anecdata will be extremely rare.


In theory one can have 640 GB = 8 x 80 GB of A100 memory and launch it. Falcon 180B in fp16 is about 360 GB of weights, so there would be enough memory. It's definitely going to be very expensive, though.
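Quick back-of-the-envelope, counting weights only (activations and KV cache come on top):

# fp16 weight memory for Falcon 180B vs. an 8x A100-80GB node
params = 180e9
bytes_per_param = 2                      # fp16
print(params * bytes_per_param / 1e9)    # 360.0 GB of weights
print(8 * 80)                            # 640 GB of total A100 memory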


Llama.cpp can run quantized Falcon on a top end Mac Studio, which is only five grand: https://twitter.com/ggerganov/status/1699791226780975439

If I'm paying a third party a hundred bucks a month, I'd at least want them to be able to match the capacities of consumer hardware.


Is not having to build a Docker image worth $100 a month? I do find server setup to be a pain, but if I'm going to use a model for a year, I can take the time (3-4 hours) to set it up. Only with constant switching of models would I use a service like this.

I've never set up bigger models like LLaMA on servers, though. Other Hacker News people can chime in.


It's not only about "building a Docker image" but also about maintaining multiple models, multiple environments, and a lot of users. Imagine a group of engineers each needing to deploy their own models: one needs TensorFlow 1.x, one needs TensorFlow 2.x, one needs PyTorch, and one needs a very strange combination of dependencies. Trust me, things get complex very easily:

https://github.com/leptonai/examples/blob/main/advanced/whis...
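For example, each photon can pin its own dependencies - a rough sketch, assuming the requirement_dependency attribute from our open-source examples (the package list and handler below are illustrative):

from leptonai.photon import Photon

class TranscribeLike(Photon):
    # each photon declares its own pip dependencies, so one engineer's service
    # can pin a different stack than another's (illustrative list)
    requirement_dependency = ["torch", "transformers"]

    @Photon.handler
    def run(self, text: str) -> str:
        # stand-in for real model inference
        return text.upper()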

I definitely agree that for a fixed use case, building a Docker image once and for all is probably the simplest and best approach. However, once you have more models and use cases, it quickly gets complex and out of hand.

Also, the basic plan is free for independent developers. You don't need to pay more than you would for equivalent EC2 instances, and you get the platform convenience - we definitely hope it's worth it!


[deleted]



