Show HN: Running LLMs in one line of Python without Docker (lepton.ai)
68 points by jiayq84 on Oct 4, 2023 | 26 comments
Hello Hacker News! We're Yangqing, Xiang and JJ from lepton.ai. We are building a platform to run any AI model as easily as writing local code, and to get your favorite models up and running in minutes. It's like containers for AI, but without the hassle of actually building a Docker image.

We built and contributed to some of the world's most popular AI software - PyTorch 1.0, ONNX, Caffe, etcd, Kubernetes, etc. We also managed hundreds of thousands of computers in our previous jobs. And we found that the AI software stack is usually unnecessarily complex - and we want to change that.

Imagine you are a developer who sees a good model on GitHub or HuggingFace. To make it a production-ready service, the current solution usually requires you to build a Docker image. But think about it: you have a bit of Python code and a few Python dependencies. Building an image for that sounds like a huge overhead, right?

lepton.ai is really a pythonic way to free you from such difficulties. You write a simple Python scaffold around your PyTorch / TensorFlow code, and lepton launches it as a full-fledged service callable via Python, JavaScript, or any language that understands OpenAPI. We use containers under the hood, but you don't need to worry about the infrastructure nuts and bolts.
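As a rough illustration, the scaffold looks something like the toy sketch below (based on the open-source leptonai library; treat the exact class and decorator names as approximate):

from leptonai.photon import Photon

class Echo(Photon):
    def init(self):
        # in a real photon you would load your PyTorch / TensorFlow model here
        self.prefix = "echo: "

    @Photon.handler
    def run(self, text: str) -> str:
        # each handler becomes an HTTP endpoint in the generated OpenAPI spec
        return self.prefix + text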

One of the biggest challenges in AI is that it's really "all-stack": in addition to a plethora of models, AI applications usually involve GPUs, cloud infra, web services, DevOps, and SysOps. We want you to focus on your job - we take care of the "boring but essential" rest.

We're really excited we get to show this to you all! Please let us know your thoughts and questions in the comments.



To show some actual coding examples, we have made the Python library open source at https://github.com/leptonai/leptonai/. With it, launching a common HuggingFace model is a one-liner. For example, if you have a GPU, Stable Diffusion XL is as simple as:

pip install -U leptonai

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

And you have a local OpenAPI server that runs it! Go to http://0.0.0.0:8080/docs, or use your favorite OpenAPI client.
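For example, you can poke at it with plain HTTP - a rough sketch, assuming a FastAPI-style server (which is what the /docs page suggests); the /run path and payload below are only illustrative, since the real endpoint names come from the photon's handlers:

import requests

# the machine-readable spec lists the endpoints this photon exposes
spec = requests.get("http://0.0.0.0:8080/openapi.json").json()
print(list(spec["paths"].keys()))

# call an endpoint; "/run" and its payload here are placeholders
resp = requests.post("http://0.0.0.0:8080/run", json={"prompt": "a cat in space"})
print(resp.status_code, resp.headers.get("content-type"))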

We've been building AI API services using these tools ourselves. The easiest way to try out Lepton is to head to https://lepton.ai/playground and use our API service for popular models: Stable Diffusion, LLaMA, WhisperX, and other interesting showcases.

We are proud of our performance. For example, we have probably the fastest LLaMA 7B and 70B model APIs, and inference costs $0.80 per 1 million tokens - we believe that's the most affordable on the market. In addition, during the open beta phase, calling these services is free when you sign up for the Lepton AI platform.

Under the hood, we built a platform that lets you run things on the cloud with ease. For example, if you find Pygmalion to be a great conversation model but you don't have a GPU, use lepton's Remote() capability to launch a service:

from leptonai import Remote

pygmalion = Remote("hf:PygmalionAI/pygmalion-2-7b", resource_shape="gpu.a10")

Wait a few minutes for the model to download and start, and you can then use it as if it were a standard Python function:

print(pygmalion.run(inputs="Once upon a time", max_new_tokens=128))

If you are interested in the operational details, the fully managed platform at https://dashboard.lepton.ai/ gives you fine-grained controls - we also support BYOC (bring your own compute) if you are an enterprise that needs more autonomy over infrastructure.


congrats yangqing et al! i was really impressed by your llama2 demo https://llama2.lepton.run/ where you showed that you were the "fastest llama runners" (https://twitter.com/swyx/status/1695183902770614724). definitely needed for model hosting infra.


Thanks so much for the warm words!


How is data secured and protected?

Is the company collecting all my prompts and responses?

I don’t see a privacy policy or anything like that linked on the main page.


Thanks - the policies are listed here: https://www.lepton.ai/policies

we'll put a link on our homepage.

In short - we do not collect, record, or log any of your prompts and responses. They are computed in memory, returned and discarded on the fly.


Looks interesting! Your llama 2 demo is unfortunately down: https://imgur.com/a/MLw6dAk


Great catch! Our cloud machine encountered a CUDA error (the GPU fell off the PCIe bus) - we had to restart it. It's back to normal now.

All the more reason to have a managed version of services :)


Wow, I love the QR code feature. That is so cool. I run events for parties and the thought of having cool QR code tickets is sick. Great work!


But how can I use Python without Docker?


Mind elaborating a bit? It feels like the competition between Python env management tools is quite stiff: venv, conda, mamba, poetry (and Docker for sure, but at a different level).


Sure. Building a reproducible image of a Python app for another platform basically can't be done without Docker. With lots of devs using macOS and deploying to Linux servers, this is a common scenario. Sadly, the dependency management tools are not built for this use case. Docker is the only cross-compilation story in town.


llama.cpp (and derivative projects) is quickly becoming SOTA for many use cases, and it basically has zero dependencies.

Kobold.cpp, for example, provides an entire web UI and API with Python and three Python packages (numpy, sentencepiece, and gguf, which is the llama.cpp library). The LLM itself is a single file you can get with curl or whatever. It takes less than a minute to compile against the native CPU/accelerator architecture, with nothing but the GPU libs themselves, which nets better performance than a generic binary distribution.

...It's not "one line", I guess, but I can hardly imagine a simpler setup. It doesn't really need Docker or a fancy container.


Thanks - we definitely agree that llama.cpp is great. Big fan of their optimizations. We are more or less orthogonal to the engines though, in the sense that we serve as the infra/platform to run and manage those implementations easily, and we support a wider range of models - for example, SDXL is a single line too:

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

It's really about productizing a wide range of models as easily as possible.


SDXL is indeed a monster to install and set up. The UIs are even worse.

IDK if the GPL license is compatible with your business, but I wonder if you could package Fooocus or Fooocus-MRE into a window? It's a hairy monster to install and run, but I've never gotten such consistently amazing results from a single prompt box + style dropdown (including native HF diffusers and other diffusers-based frontends). The automatic augmentations to the SDXL pipeline are amazing:

https://github.com/MoonRide303/Fooocus-MRE


Oh wow yeah, that is a beast. Let me give it a shot.


lepton is at a different layer compared to llama.cpp; in fact, for LLM model files in GGUF format, it uses llama.cpp (ctransformers, to be precise) as the execution engine.
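For reference, using ctransformers directly looks roughly like this (the HuggingFace repo and file names below are just illustrative examples, not lepton defaults):

from ctransformers import AutoModelForCausalLM

# load a GGUF model through llama.cpp via ctransformers; repo/file are examples
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",
    model_file="llama-2-7b.Q4_K_M.gguf",
    model_type="llama",
)
print(llm("Once upon a time", max_new_tokens=32))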


How does this compare to ollama.ai?


Ollama.ai focuses on making it as easy as possible to run models locally. We aim to provide a seamless experience that feels the same whether you're developing locally or deploying remotely for production.

disclaimer: I work at Lepton AI.


I've been using a remote ollama server with a local jupyter notebook. The langchain configuration allows me to specify the ollama host. So I can develop locally with remote models. I guess I still don't see the difference. Does lepton decouple the HTTP server from the model backend?
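Roughly, my setup looks like this (the host name is a placeholder, and I'm sketching the langchain wrapper from memory):

from langchain.llms import Ollama

# point langchain at the remote ollama server instead of localhost
llm = Ollama(base_url="http://my-gpu-box:11434", model="llama2")
print(llm("Why is the sky blue?"))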


Llama 70B but no Falcon 180B?


Are there example prompts where Falcon 180B performs better than Llama 70B?


Hardly anyone can even run a 70B model, let alone 180B. Any anecdata will be extremely rare.


In theory one can have 640 GB = 8 x 80 GB of A100 memory and launch it. Falcon 180B in fp16 is about 360 GB of weights, so there would be enough memory. It's definitely going to be very expensive, though.
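Quick back-of-the-envelope, counting weights only (activations and KV cache come on top):

# fp16 weight memory for Falcon 180B vs. an 8x A100-80GB node
params = 180e9
bytes_per_param = 2                      # fp16
print(params * bytes_per_param / 1e9)    # 360.0 GB of weights
print(8 * 80)                            # 640 GB of total A100 memory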


Llama.cpp can run quantized Falcon on a top end Mac Studio, which is only five grand: https://twitter.com/ggerganov/status/1699791226780975439

If I'm paying a third party a hundred bucks a month, I'd at least want them to be able to match the capacities of consumer hardware.


Is not having to build a Docker image worth $100 a month? I do find server setup to be a pain, but if I'm going to use a model for a year, I can take the time (3-4 hours) to set it up. Only with constant switching of models would I use a service like this.

I've never set up bigger models like LLaMA on servers, though. Other Hacker News people can chime in.


It's not only about "building a Docker image" but also about maintaining multiple models, multiple environments, and a lot of users. Imagine a group of engineers each needing to deploy their own models: one needs TensorFlow 1.x, one needs TensorFlow 2.x, one needs PyTorch, and one needs a very strange combination of dependencies. Trust me, things get complex very easily:

https://github.com/leptonai/examples/blob/main/advanced/whis...
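For example, each photon can pin its own dependencies - a rough sketch, assuming the requirement_dependency attribute from our open-source examples (the package list and handler below are illustrative):

from leptonai.photon import Photon

class TranscribeLike(Photon):
    # each photon declares its own pip dependencies, so one engineer's service
    # can pin a different stack than another's (illustrative list)
    requirement_dependency = ["torch", "transformers"]

    @Photon.handler
    def run(self, text: str) -> str:
        # stand-in for real model inference
        return text.upper()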

I definitely agree that for a fixed use case, building a Docker image once and for all is probably the simplest and best approach. However, once you have more models and use cases, it quickly gets complex and out of hand.

Also, the basic plan is free for independent developers. You don't need to pay more than you would for equivalent EC2 instances, and you get the platform convenience - we definitely hope it's worth it!


[deleted]



