
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ), so this one is reasonably accessible to run locally.
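
If you want to check the repo size programmatically, here's a rough sketch with the huggingface_hub Python client (repo id taken from the URL above; file sizes come back in bytes):

  from huggingface_hub import HfApi

  # Fetch per-file metadata and sum the sizes of everything in the repo.
  info = HfApi().model_info(
      "Qwen/Qwen3-Omni-30B-A3B-Instruct", files_metadata=True
  )
  total_bytes = sum(f.size or 0 for f in info.siblings)
  print(f"~{total_bytes / 1e9:.0f} GB")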

I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.



That's at BF16, so it should fit fairly well on 24GB GPUs after quantization to Q4, I'd think. (Much like the other 30B-A3B models in the family.)
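
Back-of-envelope, treating it as roughly 30B weights and ignoring the extra audio/vision components and the KV cache (a sketch, not exact numbers):

  # Rough weight-memory estimate for a ~30B-parameter model.
  params = 30e9
  bytes_per_param = {"bf16": 2.0, "fp8": 1.0, "q4": 0.5}
  for fmt, b in bytes_per_param.items():
      print(f"{fmt}: ~{params * b / 1e9:.0f} GB")
  # bf16: ~60 GB, fp8: ~30 GB, q4: ~15 GB - plus KV cache on top,
  # which is why a Q4 quant lands in 24GB-GPU territory.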

I'm pretty happy about that - I was worried it'd be another 200B+.


So like, 1x32GB is all you need for quite a while? Scrolling through the web makes me feel like I'm out unless I have at least 128GB of VRAM.


Are there any that would run on a 16GB Apple M1?


Not quite. The smallest Qwen3 A3B quants are ~12GB and use more like ~14GB depending on your context settings. You'll thrash the SSD pretty hard swapping it on a 16GB machine.
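
The arithmetic is roughly this (a sketch - the OS/app overhead figure is just a guess):

  total_ram_gb = 16
  os_and_apps_gb = 3        # rough guess for macOS + a browser, etc.
  model_plus_ctx_gb = 14    # ~12GB weights + KV cache / context
  headroom_gb = total_ram_gb - os_and_apps_gb - model_plus_ctx_gb
  print(headroom_gb)        # -1 -> the shortfall gets paged out to the SSD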


A fun project for somebody with more time than I have would be to see if they can get it working with the new Mojo stuff from yesterday for Apple. I don't know if the functionality is fully baked enough yet to actually do the port successfully, but it would be an interesting try.


New Mojo stuff from Apple?



Would it run on a 5090? Or is it possible to link multiple GPUs, or has NVIDIA locked that down?


It'd run on a 5090 with 32GB of VRAM at FP8 quantization, which is generally a very acceptable size/quality trade-off. (I run GLM-4.5-Air at 3-bit quantization!) The transformer architecture also lends itself quite well to having different layers of the model running in different places, so you can 'shard' the model across different compute nodes.
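
The low-effort way to spread it across multiple GPUs is Accelerate's automatic device map, which places different layers on different devices; rough sketch below (the Omni checkpoint probably needs its own model class rather than AutoModelForCausalLM, so treat the class and loading path as placeholders):

  import torch
  from transformers import AutoModelForCausalLM

  # device_map="auto" lets accelerate split the layers across every
  # visible GPU, and spill to CPU RAM if they don't all fit.
  model = AutoModelForCausalLM.from_pretrained(
      "Qwen/Qwen3-Omni-30B-A3B-Instruct",  # class/usage is an assumption here
      torch_dtype=torch.bfloat16,
      device_map="auto",
  )
  print(model.hf_device_map)  # shows which layers landed on which device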


Is there an inference engine for this on macOS?


Not yet, as far as I can tell - it might take a while for someone to pull that together, given the complexity involved in handling audio, image, text, and video at once.



