LLM Inference Server
9.9k 2026-04-14
jundot/omlx
An optimized LLM inference server for Apple Silicon, featuring continuous batching, tiered KV caching, and macOS menu bar management for efficient local AI.
Core Features
Optimized LLM inference for Apple Silicon (M-series chips).
Continuous batching and tiered KV caching (in-memory & SSD).
Convenient management via a macOS menu bar application.
Supports various AI models: LLMs, VLMs, OCR, embeddings, and rerankers.
OpenAI-compatible API and a web-based Admin Dashboard for monitoring and chat.
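Because the server exposes an OpenAI-compatible API, any standard OpenAI-style client or HTTP request should work against it. The sketch below, using only the Python standard library, shows the shape of such a request; the host, port, and model name are illustrative assumptions, not values documented by oMLX.

```python
import json
import urllib.request

# Hypothetical endpoint: the actual host and port depend on your oMLX configuration.
API_URL = "http://localhost:8000/v1/chat/completions"

# Standard OpenAI-style chat-completions payload; the model name is illustrative.
payload = {
    "model": "mlx-community/Qwen2.5-7B-Instruct-4bit",
    "messages": [
        {"role": "user", "content": "Explain KV caching in one sentence."}
    ],
    "temperature": 0.7,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the server is running:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the request format matches OpenAI's, existing SDKs can typically be pointed at the local server by overriding their base URL.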
Quick Start
omlx serve --model-dir ~/models
Detailed Introduction
oMLX addresses the challenge of running large language models locally on macOS, offering both convenience and control. It provides an inference server optimized specifically for Apple Silicon, combining continuous batching with a tiered KV cache that spans RAM and SSD. This design keeps past context cached and reusable across requests, making local LLMs practical for demanding tasks such as coding. Managed directly from the macOS menu bar, oMLX simplifies deploying and running a variety of AI models for developers and everyday users alike.
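The idea behind a RAM-and-SSD tiered KV cache can be illustrated with a small sketch: hot entries stay in memory, and the least-recently-used entries spill to disk when a capacity limit is hit. This is a conceptual toy, not oMLX's actual implementation; the class name, capacity policy, and pickle-based spill format are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: hot entries in RAM, cold entries spilled to disk.

    A conceptual sketch of the RAM-plus-SSD idea, not oMLX's real design.
    """

    def __init__(self, ram_capacity=2, spill_dir=None):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # hot tier, kept in LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="kvcache-")

    def _disk_path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)  # mark as most recently used
        # Evict least-recently-used entries to the disk tier.
        while len(self.ram) > self.ram_capacity:
            old_key, old_val = self.ram.popitem(last=False)
            with open(self._disk_path(old_key), "wb") as f:
                pickle.dump(old_val, f)

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        path = self._disk_path(key)
        if os.path.exists(path):
            # Promote a spilled entry back into RAM on access.
            with open(path, "rb") as f:
                value = pickle.load(f)
            os.remove(path)
            self.put(key, value)
            return value
        return None  # never cached
```

The same access pattern is what makes reusing past context cheap: a conversation's KV entries are evicted to SSD rather than recomputed, then promoted back to RAM when the conversation resumes.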