Early builds of Voice Pro used ONNX Runtime with the official Qwen ASR ONNX exports. That setup worked, but shipping it was painful. Here is what we ran into and why the app now runs on llama.cpp with GGUF-quantised models.
Problem 1: install size
A Windows build of Voice Pro with ONNX Runtime + the DirectML provider + the CUDA provider + the model weights came out to 430 MB. Linux was similar. That is a lot to ask someone to download for a dictation tool they have not tried yet.
The GGUF build is 83 MB on Windows. The compiled llama.cpp binary is tiny, the quantised base model is ~60 MB (vs ~200 MB as fp16 ONNX), and we no longer ship a separate execution-provider DLL per backend. llama.cpp handles CPU, CUDA, and Vulkan with one binary (it has no DirectML backend; Vulkan fills that role on Windows).
Problem 2: cold-start
ONNX Runtime with DirectML on a cold Windows install took 3–5 seconds to initialise the inference session. The first hotkey press after every reboot hit that delay. Unacceptable for a dictation tool where the whole point is "speak immediately".
llama.cpp loads the GGUF model in ~400 ms cold on the same hardware. The weights are memory-mapped, there is no graph compilation step, and there is no Python-side runtime to initialise.
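To make the memory-mapping point concrete, here is a minimal Python sketch that maps a GGUF file and decodes only its fixed header. It is an illustration, not how llama.cpp itself loads models: the function name read_gguf_header is made up for this example, and the layout assumed is the little-endian GGUF header from version 2 onward (magic, version, tensor count, metadata key/value count).

```python
import mmap
import struct

# Assumed layout (GGUF v2+, little-endian): 4-byte magic b"GGUF",
# uint32 version, uint64 tensor count, uint64 metadata key/value count.
_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Map the file and decode only its 24-byte header (hypothetical helper)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in only when touched,
        # so decoding the header does not read the tensor data.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            magic, version, n_tensors, n_kv = _HEADER.unpack_from(mm, 0)
    if magic != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Mapping is cheap because no weight bytes are copied up front, which is the property the cold-start number above relies on.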
Problem 3: packaging hell
ONNX Runtime ships as a Python wheel, which means different wheels per Python version, OS, and CPU architecture. Add GPU acceleration and you multiply by CUDA version and DirectML version. Nuitka (our packager) kept bundling the wrong variant. We had build scripts with six conditional branches.
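A rough sketch of why the variant count grows so fast. The axes below are illustrative only: the Python versions are hypothetical, and the platforms and accelerators are simply the ones named in this post. It is not our actual build matrix.

```python
from itertools import product

# Illustrative axes only; the Python versions are hypothetical.
PYTHON_VERSIONS = ["3.11", "3.12"]
PLATFORMS = ["win_amd64", "linux_x86_64"]
ACCELERATORS = ["cpu", "cuda", "directml"]


def wheel_variants():
    """Enumerate every (python, platform, accelerator) combination to handle."""
    return [
        (py, plat, acc)
        for py, plat, acc in product(PYTHON_VERSIONS, PLATFORMS, ACCELERATORS)
        # DirectML is a Windows-only API, so that pairing is dropped on Linux.
        if not (acc == "directml" and plat != "win_amd64")
    ]


variants = wheel_variants()
print(len(variants), "wheel variants vs", len(PLATFORMS), "native binaries")
```

Each extra axis multiplies the list, while a native binary only varies by platform.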
llama.cpp is a single C++ binary. We compile it once per platform (Windows x64, Linux x64) and ship it as a native executable that Voice Pro calls. No Python runtime dependency for inference at all — Python is just the glue.
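The "Python is just the glue" part can be sketched as a thin subprocess wrapper. This is a minimal sketch under assumptions, not Voice Pro's actual code: run_native is a hypothetical helper, and the executable path and its arguments are left to the caller because this post does not document the real command line.

```python
import subprocess


def run_native(cmd, timeout_s=30.0):
    """Run a native executable and return its stdout as stripped text.

    `cmd` is the full argument list, e.g. [path_to_binary, ...flags].
    The flags are whatever the shipped binary accepts; none are assumed here.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr instead of inheriting
        text=True,            # decode bytes to str
        timeout=timeout_s,    # raises subprocess.TimeoutExpired on a hang
        check=False,          # we inspect the return code ourselves
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout.strip()
```

Because all inference happens in the child process, the Python side needs no inference packages at all, only the standard library.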
Problem 4: quality at low sizes
The worry with quantisation is accuracy loss. In practice, Q5_K_M quantised Qwen3-ASR 0.6B (our base model) measures within 0.3% WER of the full fp16 ONNX version on our internal eval set. Q4_0 is noticeably worse, so we ship Q5_K_M as the default. Users who want maximum accuracy can download the fp16 tier from the model manager.
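For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is the textbook definition, not our eval harness (which this post does not describe); word_error_rate is a name chosen for the example, and real evaluations usually normalise case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Comparing two model tiers then means running both over the same reference transcripts and comparing the resulting rates.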
What we gave up
llama.cpp's ASR support is newer than ONNX Runtime's. Some exotic model architectures do not have GGUF converters yet. For now that does not matter — Qwen3-ASR converts cleanly — but if we want to try a radically different ASR model in the future we may need to keep the ONNX path as a fallback.
Net outcome: a 5× smaller install, ~10× faster cold start, one binary per platform to build instead of twelve. We should have done this from day one.
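The headline ratios follow directly from the figures quoted earlier. A quick check using only the numbers in this post (the variable names are ours for the example):

```python
# Figures quoted in this post.
onnx_install_mb = 430
gguf_install_mb = 83
onnx_cold_start_s = (3.0, 5.0)  # reported range
gguf_cold_start_s = 0.4         # ~400 ms

size_ratio = onnx_install_mb / gguf_install_mb
speedup_low = onnx_cold_start_s[0] / gguf_cold_start_s
speedup_high = onnx_cold_start_s[1] / gguf_cold_start_s

print(f"install size: {size_ratio:.1f}x smaller")
print(f"cold start: {speedup_low:.1f}x to {speedup_high:.1f}x faster")
```

The size ratio comes out just above 5, and the cold-start speedup spans a range whose midpoint is 10, which is where the rounded figures come from.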