> The MoE version with 3b active parameters ~34 tok/s on a Radeon RX 7900 XTX un...

tgtweak · 2025-04-29T02:21:39 1745893299

And VRAM use?

genpfault · 2025-04-29T13:30:43 1745933443

~18.6 GiB, according to nvtop.

ollama 0.6.6 invoked with:

    # server
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

    # client
    ollama run --verbose qwen3:30b-a3b

~19.8 GiB with:

    /set parameter num_ctx 32768
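
For a rough sense of why raising `num_ctx` costs extra VRAM, here is a minimal back-of-the-envelope sketch of KV cache size with a `q8_0` cache. The model shape (48 layers, 4 KV heads, head dimension 128) and the 2048-token baseline are assumptions, not figures from this thread, and the helper names are made up for illustration. It also ignores compute buffers and other allocations that nvtop would include.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate size in bytes of the K and V caches for n_ctx tokens."""
    # Factor 2: one K vector and one V vector per layer, per KV head, per token.
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elems_per_token * bytes_per_elem


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# q8_0 packs 32 int8 values plus one fp16 scale per block: 34 bytes per 32 values.
Q8_0_BYTES_PER_ELEM = 34 / 32

# ASSUMPTION: model shape is not stated in the thread; adjust to the real config.
ASSUMED_SHAPE = dict(n_layers=48, n_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    # 2048 is an assumed baseline context; 32768 is the value set above.
    for n_ctx in (2048, 32768):
        size = kv_cache_bytes(n_ctx, bytes_per_elem=Q8_0_BYTES_PER_ELEM, **ASSUMED_SHAPE)
        print(f"num_ctx={n_ctx}: ~{gib(size):.2f} GiB KV cache")
```

Under those assumptions the estimate comes to roughly 0.1 GiB at 2048 tokens and 1.6 GiB at 32768, which is in the same ballpark as the ~1.2 GiB increase reported above. The remaining gap could come from the assumed shape or baseline being off.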

tgtweak · 2025-04-29T13:58:19 1745935099

Very nice, that should fit comfortably on a 3090 as well.

TY for this.

Update: wow, it's quite fast. I'm getting 70-80 tok/s in LM Studio, even with a few other applications using the GPU.