Models, runtimes, UIs, integrations. Everything you can install once and run forever without an API key. Aggregating from real-world use, not feature lists.
75 curated, receipts-backed across 14 categories.
Fast inference for quantised LLMs on consumer NVIDIA GPUs. EXL2 format outperforms GGUF on the same hardware in many benchmarks.
Privacy-first desktop chat with curated quantised models. Strong CPU performance.
Open-source ChatGPT alternative. Bundles llama.cpp + a clean UI.
Single-binary llama.cpp wrapper with KoboldAI UI for chat, story-writing, RP.
Reference C++ implementation for running LLaMA-family and other transformer models with GGUF quantization. Powers most of the others in this section.
Polished desktop app for discovering, downloading, and running local LLMs. OpenAI-compatible server mode. Free for personal + commercial.
Self-hosted, OpenAI-compatible inference server. Text, image, audio, embeddings — all on your machine.
Rust LLM inference platform with quantization, vision, MoE, and speculative decoding.
Compile-once, deploy-anywhere LLM runtime. Targets WebGPU, Vulkan, CUDA, Metal, iOS, and Android from a single source.
Single-binary server with a built-in model library. Pull, run, and swap models with one command.
Fast LLM and VLM serving runtime with RadixAttention cache and structured-output support.
Gradio-based web UI for local LLMs. Supports GGUF, GPTQ, AWQ, EXL2.
High-throughput inference engine with PagedAttention. Designed for serving, not desktop chat — pair with Open WebUI or LiteLLM.
Workspace-style chat with built-in RAG. Works fully offline with a local LLM provider.
Local-only character / role-play chat. Bundles inference, no API key needed.
Fast desktop chat with branching conversations and parallel-model comparison. Free tier covers personal local use.
Self-hosted "ChatGPT clone" of the open-source world. Pair with Ollama or any OpenAI-compatible local server.
Desktop dictation app built on Qwen3-ASR + GGUF + llama.cpp. 36 languages, hotkey-anywhere transcription, file/microphone/system-audio input, LoRA personal voice training. 100% offline, no account required to transcribe. Disclosure: maintained by us.
CTranslate2-based reimplementation. ~4× faster than reference Whisper at the same accuracy.
Reference Python implementation. Accurate but slower than the C++ ports; useful when you need the exact research behaviour.
Low-latency streaming wrapper around faster-whisper for live dictation pipelines.
Lightweight offline speech recognizer with 20+ language models. Real-time on CPU.
C++ port of OpenAI Whisper with GGUF quantization. Runs on CPU, Metal, CUDA, Vulkan.
faster-whisper plus forced alignment, voice-activity detection, and speaker diarization.
Multilingual generative audio. Speech, sound effects, and music cues from text prompts.
Comprehensive TTS toolkit. Multiple architectures (Tacotron, VITS, XTTS) and voice cloning.
Tiny ~80M-param TTS model, surprisingly natural for the size. Suitable for low-end hardware.
Mycroft's neural TTS engine. Lightweight, multilingual.
Fast neural TTS. ONNX runtime, dozens of voices and languages. Designed for Raspberry Pi-class hardware.
High-fidelity expressive TTS with style transfer. Strong reference voice cloning.
The original ergonomic SD UI. Heavy plugin ecosystem.
Node-graph workflow editor for diffusion models. Powers most modern local image and video pipelines.
Image generator with sane defaults — minimal knobs for great results. Built on top of Stable Diffusion.
Performance-tuned A1111 fork by lllyasviel. Lower VRAM, faster on modern GPUs.
Pro-grade SD UI with strong canvas / inpainting tools. Enterprise tier; free local install remains open source.
All-in-one fork of A1111 with broader backend support (Diffusers, ONNX, ROCm).
Modular UI built on top of ComfyUI. User-friendly mode out of the box, full node-graph available when you need it.
ComfyUI nodes drive Lightricks LTX video models for text-to-video and image-to-video generation. The chunked-loop pattern (released in our [comfyui-workflows](https://github.com/BrethofAI/comfyui-workflows)) produces longer outputs than vanilla LTX allows.
Stripped-down Wan2.2 video pipeline for low-VRAM consumer GPUs.
Terminal pair-programming. Bring-your-own-LLM via LiteLLM — run with Ollama or any OpenAI-compatible local endpoint.
IDE assistant with first-class local-LLM support. Defaults can be set to Ollama / LM Studio. VS Code + JetBrains.
Vim plugin that streams llama.cpp completions inline. No cloud.
Self-hosted GitHub Copilot alternative. Local model serving with IDE plugins.
Free local AI extension for VS Code. Chat + autocomplete via Ollama.
Aider's planning mode separates "decide" and "edit" steps; works well with strong local reasoning models.
Agentic editing flow inside Continue. Pair with a local model for fully-offline coding agents.
Code-execution agent that runs Python/shell on your machine. Local-LLM friendly.
Embedding database designed for local-first usage. SQLite-style single-file or client/server.
Library for similarity search. The retrieval engine inside many of the others.
End-to-end vector search; OSS core, paid hosted version.
High-performance vector DB. Self-host the open-source binary.
Hybrid (vector + keyword) DB. Self-host the OSS distribution; cloud is optional.
BAAI's BGE family. Strong English + multilingual variants. Run via llama.cpp, sentence-transformers, or fastembed.
Lightweight CPU-friendly embedding library by Qdrant.
Reference Python library for sentence + paragraph embeddings.
Config-driven fine-tuning framework. LoRA, QLoRA, full fine-tunes.
Pipeline-parallel trainer for diffusion models. Multi-GPU LoRA on large image / video models.
Apple's native ML framework for Apple Silicon. Train and infer on M-series Macs without CUDA workarounds.
LoRA training UI for Flux, SD3, SDXL, LTX. Works on consumer hardware.
Fine-tune LLMs 2× faster with 70% less VRAM than reference HuggingFace pipelines.
Self-hosted workspace tool with integrated RAG. Listed twice intentionally — strong both as a chat app and a RAG layer.
Toolkit for building RAG pipelines. Works fully offline with local models + vector DBs.
Open-source AI search powered by SearXNG + your local LLM.
Ingest documents locally and query them with an offline LLM.
Self-hosted meta-search engine. Pair with a local LLM for an offline Perplexity-style assistant.
Container-native gaming and AI distro. Steam Deck-friendly, latest drivers, easy CUDA.
Fedora-based, atomic, container-first. Good "drop you in a known state" workstation for AI work.
Arch-based desktop distro with a tuned kernel and recent NVIDIA / AMD drivers. Sane out-of-the-box for new GPUs (Blackwell, RDNA 4).
Reproducible system config. Best when you need identical CUDA + ML toolchain across machines.
System76's NVIDIA-friendly desktop distro. ISO ships with proprietary drivers for plug-and-play GPU work.
Apple Silicon-native ML library. Already listed under training; it also ships an inference runtime competitive with llama.cpp on M-series.
NVIDIA's optimised LLM runtime for their data-center and consumer GPUs. Closed-weights binary; fastest CUDA path for many models.
Intel's inference toolkit. CPU, iGPU, dGPU (Arc), and NPU support for Intel laptops.
AMD GPU inference path. Llama.cpp's HIP backend now reaches CUDA parity on RDNA 3/4 in many workloads.
Open a PR — or email us the tool and we'll check it. Real VRAM/disk numbers, not vibes.
Local speech-to-text that learns your voice. Perpetual licence. Our flagship.
PAID · flagship
Local long-term memory for Claude Code — full-text + vector + graph, on SurrealDB. MIT.
FREE · open source
Print-ready digital models. STL/3MF/OBJ included. Lifetime access.
PAID · digital catalog
Our printed designs, shipped across Europe. Buy the object, not the file.
PAID · physical objects
Cyber-tiger AI host. Privacy-first AI explained without the corporate filter.
CHANNEL · live
Curated GitHub lists for AI, MCP, local AI, Linux for AI, and more. Receipts, not vibes.
FREE · curated
Long-form how-tos for local AI on Linux, Windows, macOS. Real configs, not marketing.
FREE · coming soon
Production-tested ComfyUI graphs — LTX chunked-loop, the Nova pipeline, and more.
FREE · workflows landing
Negative-curation: practices and tools that waste your time, ranked. Receipts required.
FREE · coming soon
Who we are, why we build local-first AI, and what we won't do.