dzr0001's comments

I think there's a fundamental difference between programming language repos and package repositories like the official RPM, deb, and ports trees.

These operating system repos typically have oversight and are tested to work together within a set of versions. Repositories with public contribution and publishing offer no such compatibility guarantees, so the cruft described in the article must be kept indefinitely.

Unfortunately, I don't think abstracting those repositories to work within the OS package ecosystem would solve that problem and I suspect the package manager SAT solvers would have a hard time calculating dependencies.


I agree re: the fundamental difference when it comes to compiled languages. I wrote rashly and out of frustration without thinking about it too deeply.

re: interpreted languages, though, I think it's still a shit show. I don't want to run "composer" or "npm" or whatever the Ruby and Python equivalents are on my production environment. I just want packages analogous to binaries that I can cleanly deploy / remove with OS package management functionality.


I suspect there's some marketing component at play here. People who don't own these devices but hear them making seemingly unnecessary noises might perceive them as premium. Think of the various beeps when locking a car and arming the alarm, the startup sound that infotainment systems in some EVs play, or the "Twinkle, Twinkle, Little Star" of a fancy rice cooker.


I did a quick scan of the repo and didn't see any reference to Ray. Would this indicate that llm-d lacks support for pipeline parallelism?


We inherit any multi-host support from vLLM, so https://docs.vllm.ai/en/latest/serving/distributed_serving.h... would be the expected path.

We plan to publish examples of multi-host inference that leverage LeaderWorkerSet - https://github.com/kubernetes-sigs/lws - which helps run ranked serving workloads across hosts. LeaderWorkerSet is how Google supports both TPU and GPU multi-host deployments - see https://github.com/kubernetes-sigs/lws/blob/main/config/samp... for an example.

Edit: Here is an example Kubernetes configuration running DeepSeek-R1 on vLLM multi-host using LeaderWorkerSet https://github.com/kubernetes-sigs/wg-serving/blob/main/serv.... This work would be integrated into llm-d.


I believe this is a question you should ask about vLLM, not llm-d. It looks like vLLM does support pipeline parallelism via Ray: https://docs.vllm.ai/en/latest/serving/distributed_serving.h...

This project appears to make use of both vLLM and Inference Gateway (an official Kubernetes extension to the Gateway resource). The contribution of llm-d itself seems mostly to be a scheduling algorithm for load balancing across vLLM instances.
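
For intuition, here is an illustrative sketch (not vLLM's actual implementation) of what pipeline parallelism does: it assigns contiguous blocks of a model's layers to different nodes, whereas tensor parallelism splits individual layers across devices. The `partition_layers` helper and the 80-layer example are made up for illustration.

```python
def partition_layers(num_layers: int, pp_size: int) -> list[range]:
    """Split num_layers into pp_size contiguous stages, one per node."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for rank in range(pp_size):
        size = base + (1 if rank < extra else 0)  # early ranks absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages

# e.g. a hypothetical 80-layer model split across 4 nodes
print([(s.start, s.stop) for s in partition_layers(80, 4)])
# → [(0, 20), (20, 40), (40, 60), (60, 80)]
```

Each stage runs on its own node and passes activations to the next, which is why this mode matters when a model doesn't fit on one machine.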


It does if you need pipeline parallelism across multiple nodes.


What drive is this and does it need a trim? Not all NVMe devices are created equal, especially in consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise class drive that had that sort of latency in direct IO write benchmarks after proper pre-conditioning would have failed our validation.
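
To make that bar concrete: average latency alone hides tail behavior, which is why qualification usually looks at p99/p99.9 as well. A quick sketch (the sample numbers are invented for illustration, not actual thresholds):

```python
import statistics

# Hypothetical per-write latency samples in microseconds: mostly fast
# writes with a 1% slow tail -- the pattern that fails validation.
samples_us = [80] * 990 + [2000] * 10

avg = statistics.mean(samples_us)
p99 = sorted(samples_us)[int(len(samples_us) * 0.99)]
print(avg, p99)  # → 99.2 2000
```

A drive can look fine on average while its 99th percentile is an order of magnitude worse.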


My current drive is labeled SAMSUNG MZVL21T0HCLR-00BH1 and is built into a fairly new work laptop. I can't get below around 250us avg.

On my older system I had a WD_BLACK SN850X, but it was connected to an M.2 slot that may be limiting. This is where I measured 1-2ms latency.

Is there any good place to get numbers of what is possible with enterprise hardware today? I've struggled for some time to find a good source.


Unfortunately, this data is harder to find than it should be. For instance, just looking at Kioxia, which I've found to be very performant, their datasheets for the CD series drives don't mention write latency at all. Blocks and Files[1] mentions that they claim <255us average, so they must have published that somewhere. This is why we would extensively test multiple units ourselves, following proper preconditioning as defined by SNIA. Averaging 250us for direct writes is pretty good.

[1] https://blocksandfiles.com/2023/08/07/kioxias-rocketship-dat...


I have the original M1 Air that I got the day it was released in 2020. In a typical week, I let the battery discharge below 10% twice and recharge it to 100%. I've logged 458 cycles and lost 11% of my capacity. Not too bad.
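
As a rough sanity check on those numbers (assuming roughly two deep cycles per week and ignoring partial top-ups, which would add more):

```python
cycles = 458
capacity_lost_pct = 11.0

years_of_use = cycles / 2 / 52                        # two deep cycles per week
loss_per_100_cycles = capacity_lost_pct / cycles * 100

print(round(years_of_use, 1))         # → 4.4 (years, consistent with a 2020 machine)
print(round(loss_per_100_cycles, 1))  # → 2.4 (% capacity lost per 100 cycles)
```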


We used to have a pub rate of about 200k msgs/s from about 400 producers, all to a single exchange, and had similar issues. We were able to mitigate this by using lazy queues.

This worked fine until things got behind, at which point we couldn't keep up. We worked around that with a hashed exchange that spread messages across 4 queues, hashing on a timestamp inserted by a timestamp plugin. Since all operations for a queue happen in the same event loop, any sort of backup led to pub and sub operations fighting for CPU time. By spreading this across 4 queues we wound up with 4x the CPU capacity for this particular exchange. With 2000 queues you probably didn't run into that issue very often.
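
The effect of the hashed exchange can be sketched in plain Python (illustrative only, not RabbitMQ client code; the hash function and timestamp values here are invented stand-ins for whatever the broker-side plugins actually compute):

```python
import zlib

NUM_QUEUES = 4

def queue_for(timestamp_ms: int) -> int:
    """Stable hash of the publish timestamp picks one of the queues."""
    return zlib.crc32(str(timestamp_ms).encode()) % NUM_QUEUES

# Simulate 10,000 messages published over consecutive milliseconds.
counts = [0] * NUM_QUEUES
for ts in range(1_700_000_000_000, 1_700_000_000_000 + 10_000):
    counts[queue_for(ts)] += 1

print(counts)  # each queue (and its event loop) handles roughly 1/4 of the load
```

Because each queue runs in its own event loop, splitting the stream this way is what buys the extra CPU headroom.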

