Local LLM Inference Server
4.0k 2026-04-18
Michael-A-Kuykendall/shimmy
A Python-free, Rust-based inference server providing an OpenAI-compatible API for local GGUF and SafeTensors models.
Core Features
100% OpenAI-API-compatible endpoints for local LLMs.
Single binary, Python-free, built in Rust for lightweight, dependency-free operation.
Automatic model discovery (Hugging Face, Ollama, local directories) and hot model swapping.
Advanced MoE (Mixture of Experts) support for running large models on consumer hardware.
Includes all GPU backends in a single download, no compilation needed.
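As a sketch of the OpenAI-compatible surface, a chat-completion request against a locally running instance might look like the following. The port (11435) and the model name are assumptions, not taken from this listing; the model names shimmy actually discovered can be checked with `./shimmy list`:

```shell
# Build the request body for an OpenAI-style chat completion.
# The model name is a placeholder; substitute one reported by `./shimmy list`.
BODY='{"model":"your-local-model","messages":[{"role":"user","content":"Say hello"}]}'

# With `./shimmy serve` running (port 11435 assumed; adjust if your instance
# is configured differently), the request is a plain chat-completions call:
#   curl -s http://127.0.0.1:11435/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
echo "$BODY"
```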
Quick Start
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy && ./shimmy serve &
Detailed Introduction
Shimmy is a high-performance, dependency-free inference server built in Rust that brings large language models (LLMs) to local environments. It exposes a 100% OpenAI-API-compatible endpoint, so developers can point existing AI tools and SDKs at local GGUF and SafeTensors models without code changes. Its single-binary distribution, automatic configuration, and advanced MoE support make it well suited to private, efficient, and scalable local AI inference, enabling powerful models even on consumer-grade hardware.
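Because the endpoint mirrors the OpenAI API, existing SDKs and tools can usually be redirected with nothing more than a base-URL override. A minimal sketch, assuming a default bind address of 127.0.0.1:11435 (an assumption; adjust to your setup) and using the environment variables the official OpenAI SDKs read; the key value is a placeholder, since no authentication happens locally, but many clients insist that one is set:

```shell
# Point OpenAI-compatible tooling at the local shimmy instance.
export OPENAI_BASE_URL="http://127.0.0.1:11435/v1"  # assumed default port
export OPENAI_API_KEY="sk-local"                    # placeholder; unused locally
echo "$OPENAI_BASE_URL"
```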