Also, what kind of models do you run with mlx and what do you use them for?
Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (I’m playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).
I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/
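For anyone who hasn't tried MLX yet, a minimal sketch of loading and prompting a model with the mlx-lm package looks roughly like this (the model name and settings below are just examples, not a specific recommendation):

    # Minimal mlx-lm sketch: load a quantized model from the mlx-community
    # Hugging Face org and generate a reply. Model name and settings are
    # illustrative; swap in whatever you want to try.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

    messages = [{"role": "user", "content": "Write a two-sentence story about a lighthouse keeper."}]
    # Apply the model's chat template so the instruct model sees the expected format
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    response = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)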
Very cool to hear your perspective on how you are using small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• llama.cpp (GGUFs; a minimal call against the local server is sketched just below this list)
• n8n for orchestration + automation (multi-stage LLM workflows)
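For concreteness, the LM Studio / llama.cpp half of that stack just exposes an OpenAI-compatible HTTP server, so a minimal call from Python looks roughly like this (the port and model identifier are assumptions; LM Studio defaults to port 1234, llama-server to 8080):

    from openai import OpenAI

    # Point the standard OpenAI client at the local server; no real API key needed
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="gemma-3-12b-it-qat",  # whatever identifier your local server lists
        messages=[
            {"role": "system", "content": "You are a concise narration-script assistant."},
            {"role": "user", "content": "Draft a 30-second narration hook about tide pools."},
        ],
        temperature=0.7,
    )
    print(resp.choices[0].message.content)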
My emerging use cases:
-Rapid narration scripting
-Roleplay agents with embedded prompt personas
-Reviewing image/video attachments + structuring copy for clarity
-Local RAG and eval pipelines (retrieval step sketched below)
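On the local RAG point, the retrieval step is nothing exotic; a minimal sketch (the embedding model, corpus, and query are all illustrative, and it assumes sentence-transformers for embeddings) looks like:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "MLX is Apple's array framework for machine learning on Apple silicon.",
        "GGUF is the quantized model file format used by llama.cpp.",
        "QAT checkpoints keep more fidelity at low bit widths.",
    ]

    # Embed the corpus once; normalize so cosine similarity is a plain dot product
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    query = "Which file format does llama.cpp load?"
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]

    # Rank chunks by similarity and keep the top two as context
    top_k = np.argsort(doc_vecs @ q_vec)[::-1][:2]
    context = "\n".join(docs[i] for i in top_k)

    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # `prompt` then goes to the local model (e.g. the chat call shown earlier)
    print(prompt)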
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
-GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp:
-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
Emerging / niche models tested:
-MedFound-7B-GGUF → early tests for narrative medicine tasks
-llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
-PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference.
The meta-trend: models are getting better, smaller, faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.
Any other resources like that you could share?