Hacker News | anemll's comments


Thanks. Pure Swift was the design idea: since I found nothing I could use for my project (https://www.sharpai.org), I created a Swift version. Python is too heavy to ship with an application. A user mentioned they wanted to use MLX, so I've been working on it for 1-2 weeks, fixing bugs and testing; then TurboQuant was proposed and I did a quick integration. My 64GB M5 Pro is already good for my local security task, and now it can also use an M1/M2 Mini with 8GB of memory.

The 17B includes 10 experts plus one shared expert, so the actual size of each expert is much smaller.
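Rough arithmetic for what that split implies, a sketch assuming equal-size experts and ignoring attention/embedding weights (the 17B and 10+1 figures are from the comment; the even split is an assumption):

```python
# Back-of-envelope MoE sizing: 17B active parameters per token,
# spread across 10 routed experts plus 1 shared expert.
active_params = 17e9        # active parameters per token (from the comment)
experts_active = 10 + 1     # 10 routed + 1 shared

# Assumes experts are equal-size and dominate the parameter count.
per_expert = active_params / experts_active
print(f"~{per_expert / 1e9:.1f}B parameters per expert")
```

So each expert is on the order of ~1.5B parameters, which is why the per-token working set is so much smaller than the full model.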

Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes this usable. I think one paper that flew under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for Flash streaming.

That's exactly what I thought about. I'm getting my hands on an M5 Max this week and will see how Dan's experiment performs with faster I/O. I'm also going to experiment with running the active parameters at Q6 or Q8: since output is I/O-bottlenecked, there should be room for higher-accuracy compute.

Check my repo; I added some support for GGUF/unsloth and Q3/Q5/Q8: https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

To be fair, it's "possible" to run such a setup with llama.cpp and SSD offload. Token-generation speed is just abysmal. But it's possible.

SSD streaming to the compute units is new. The M4 Max can do 15 t/s with its 15 GB/s drives.

It was "new" in 2019. The PS5 and Xbox Series X both shipped with GPUDirect Storage, and even most dGPUs support it via ReBAR/RDMA nowadays.

Yes, though SSD speed is critical. The repo has macOS builds for the CLI and Desktop. It's still early stages: the M4 Max gets 10-15 TPS on a 400B model depending on quantization. Compute is an issue too, and a lot of the code is at PoC level.
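If generation is fully bound by streaming the active weights from SSD, a crude ceiling on tokens/s is drive bandwidth divided by bytes read per token. A sketch with assumed numbers (15 GB/s matches the M4 Max figure mentioned above; the bits-per-weight values are rough GGUF-style approximations, and real setups cache hot experts in RAM, so measured rates can beat this naive bound):

```python
def tps_ceiling(ssd_gb_per_s: float, active_params_billions: float,
                bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every token must stream all active weights."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return ssd_gb_per_s * 1e9 / bytes_per_token

# Assumed: ~17B active params per token, ~15 GB/s sequential SSD reads.
for name, bits in [("Q4", 4.5), ("Q6", 6.5), ("Q8", 8.5)]:
    print(f"{name}: <= {tps_ceiling(15, 17, bits):.2f} t/s (no caching)")
```

The gap between this worst-case bound and the observed 10-15 TPS is the point of the caching and expert-reuse tricks: only the cold experts actually hit the SSD.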

Multiple NAND chips; Apple already used that in the Mac Studio. Plus better cooling.

both, tbh

Probably 2x the speed for the Mac Studio this year if they double the NAND (or quadruple it?).

Tensor Parallel test with RDMA last week https://x.com/anemll/status/1996349871260107102

Note the fast-sync workaround.


In the macOS 26.2 (Tahoe) beta, Apple introduced a low-latency Thunderbolt 5 RDMA driver, enabling up to 80 Gb/s bidirectional bandwidth for Mac clustering, ideal for distributed ML on Apple Silicon. It's optimized for low latency, delivering ~14 Gbps throughput at 4K MTU.

My tests (M4 Pro to M3 Ultra):
- Stock ibv_uc_pingpong achieved ~14 µs round-trip for 4K packets (requires GID index setup).
- A custom C++ variant hit 6-13 µs/iter: https://x.com/anemll/status/1993192776897642942

Code and details:
https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun...
https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun...
(includes steps to enable RDMA from the macOS Recovery OS terminal)

Theoretically, this accelerates pipeline parallelism (faster layer handoffs) and tensor parallelism (low-overhead sharding) on GPUs, with potential extensions to ANE for real-time AI workflows.
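For context on those pingpong numbers: a stop-and-wait ping-pong benchmark's effective bandwidth is just message size over half the round-trip time, which sits well below the link's pipelined throughput since only one message is ever in flight. A quick estimator (the 4 KiB and ~14 µs figures are the measurements quoted above):

```python
def pingpong_bandwidth_gbps(msg_bytes: int, rtt_us: float) -> float:
    """Effective one-way bandwidth of a stop-and-wait ping-pong exchange."""
    one_way_s = rtt_us * 1e-6 / 2          # half the round trip per message
    return msg_bytes * 8 / one_way_s / 1e9  # bits per second, in Gb/s

# 4 KiB messages at ~14 us round trip (the ibv_uc_pingpong measurement)
print(f"~{pingpong_bandwidth_gbps(4096, 14):.1f} Gb/s effective")
```

That works out to roughly 4-5 Gb/s effective, which is consistent with the ~14 Gbps pipelined figure being reachable only once multiple messages are kept in flight.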


