Also, what kind of models do you run with mlx and what do you use them for?
Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (I’m playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).
I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/
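For anyone who hasn't tried MLX yet, a minimal sketch of loading and prompting a model with the mlx-lm package looks roughly like this (the model name and settings below are just examples, not a specific recommendation):

    # Minimal mlx-lm sketch: load a quantized model from the mlx-community
    # Hugging Face org and generate a reply. Model name and settings are
    # illustrative; swap in whatever you want to try.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

    messages = [{"role": "user", "content": "Write a two-sentence story about a lighthouse keeper."}]
    # Apply the model's chat template so the instruct model sees the expected format
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    response = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)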
Very cool to hear your perspective on how you are using small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• llama.cpp (GGUFs; a minimal call against the local server is sketched just below this list)
• n8n for orchestration + automation (multi-stage LLM workflows)
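For concreteness, the LM Studio / llama.cpp half of that stack just exposes an OpenAI-compatible HTTP server, so a minimal call from Python looks roughly like this (the port and model identifier are assumptions; LM Studio defaults to port 1234, llama-server to 8080):

    from openai import OpenAI

    # Point the standard OpenAI client at the local server; no real API key needed
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="gemma-3-12b-it-qat",  # whatever identifier your local server lists
        messages=[
            {"role": "system", "content": "You are a concise narration-script assistant."},
            {"role": "user", "content": "Draft a 30-second narration hook about tide pools."},
        ],
        temperature=0.7,
    )
    print(resp.choices[0].message.content)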
My emerging use cases:
-Rapid narration scripting
-Roleplay agents with embedded prompt personas
-Reviewing image/video attachments + structuring copy for clarity
-Local RAG and eval pipelines (retrieval step sketched below)
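On the local RAG point, the retrieval step is nothing exotic; a minimal sketch (the embedding model, corpus, and query are all illustrative, and it assumes sentence-transformers for embeddings) looks like:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "MLX is Apple's array framework for machine learning on Apple silicon.",
        "GGUF is the quantized model file format used by llama.cpp.",
        "QAT checkpoints keep more fidelity at low bit widths.",
    ]

    # Embed the corpus once; normalize so cosine similarity is a plain dot product
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    query = "Which file format does llama.cpp load?"
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]

    # Rank chunks by similarity and keep the top two as context
    top_k = np.argsort(doc_vecs @ q_vec)[::-1][:2]
    context = "\n".join(docs[i] for i in top_k)

    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # `prompt` then goes to the local model (e.g. the chat call shown earlier)
    print(prompt)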
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
-GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp:
-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
Emerging / niche models tested:
-MedFound-7B-GGUF → early tests for narrative medicine tasks
-llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
-PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference.
The meta-trend: models are getting better, smaller, faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.
Any other resources like that you could share?