Thanks! Pure Swift was the design goal, and since I found nothing I could use for my project https://www.sharpai.org, I created a Swift version.
Python is too heavy to ship with an application, and users mentioned they want to use MLX, so I've been working on this for 1-2 weeks of bug fixing and testing. Then TurboQuant was suddenly proposed, and I did a quick integration.
My 64GB M5 Pro was already good enough for my local security task; now it can also run on an M1/M2 Mini with 8GB of memory.
Thanks for posting this, that's how I first found out about Dan's experiment!
SSD speed doubled in the M5P/M generation, which makes it usable!
I think one paper that's flown under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for Flash streaming.
That’s exactly what I was thinking about. I'm getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. I'm also going to experiment with running the active parameters at Q6 or Q8: since output is I/O-bottlenecked, there should be room for higher-accuracy compute.
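A rough back-of-envelope for why higher-precision active weights can be "free" when decode is SSD-bound: the tokens/s ceiling is set by how fast the active weights stream off disk, not by compute. All numbers below (3B active params, ~6 GB/s SSD) are assumptions for illustration, not measurements from the experiment.

```python
# Upper bound on tokens/s when every token must stream the active weights
# from SSD. All inputs are hypothetical round numbers.
def max_tps(ssd_gb_per_s: float, active_params_billions: float,
            bytes_per_param: float) -> float:
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return ssd_gb_per_s * 1e9 / bytes_per_token

# Hypothetical MoE with 3B active params on a ~6 GB/s SSD:
for name, bpp in [("Q4", 0.5), ("Q6", 0.75), ("Q8", 1.0)]:
    print(f"{name}: {max_tps(6.0, 3.0, bpp):.2f} tok/s ceiling")
```

Under these assumptions Q8 only halves the I/O ceiling versus Q4, so if the achieved TPS is already well below that ceiling for other reasons, the precision bump may cost little.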
Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop.
It's early stages though. An M4 Max gets 10-15 TPS on a 400B model depending on quantization. Compute is an issue too, and a lot of the code is PoC-level.
In macOS 26.2 (Tahoe) beta, Apple introduced a low-latency Thunderbolt 5 RDMA driver, enabling up to 80 Gb/s bidirectional bandwidth for Mac clustering—ideal for distributed ML on Apple Silicon. It's optimized for low latency, delivering ~14 Gbps throughput at 4K MTU.
My tests (M4 Pro to M3 Ultra): stock ibv_uc_pingpong achieved ~14 µs round-trip for 4K packets (requires GID index setup). A custom C++ variant hit 6-13 µs/iter: https://x.com/anemll/status/1993192776897642942
Code and details:
https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun... (includes steps to enable RDMA in the macOS Recovery OS terminal)
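Worth noting how the latency and throughput figures above relate: a ping-pong keeps only one message in flight per round trip, so the per-message bandwidth it implies is well below the ~14 Gbps the link sustains with pipelined transfers. A quick sketch using the numbers from this thread:

```python
# Effective bandwidth of a ping-pong with one message in flight:
# msg_bytes per round trip of rtt_us microseconds.
def gbps(msg_bytes: int, rtt_us: float) -> float:
    return msg_bytes * 8 / (rtt_us * 1e-6) / 1e9

print(f"{gbps(4096, 14.0):.2f} Gb/s at 14 us (stock ibv_uc_pingpong)")
print(f"{gbps(4096, 6.0):.2f} Gb/s at 6 us (custom variant, best case)")
```

So reaching the quoted ~14 Gbps requires larger messages or multiple outstanding sends; the low latency matters most for the small, synchronization-heavy transfers in layer handoffs.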
Theoretically, this accelerates pipeline parallelism (faster layer handoffs) and tensor parallelism (low-overhead sharding) on GPUs, with potential extensions to ANE for real-time AI workflows.