mostlygeek/llama-swap
llama-swap enables seamless hot-swapping and management of multiple local generative AI models, acting as a unified API gateway compatible with OpenAI and Anthropic standards.
Core Features
Quick Start
docker run -it --rm --runtime nvidia -p 9292:8080 -v /path/to/models:/models -v /path/to/custom/config.yaml:/app/config.yaml ghcr.io/mostlygeek/llama-swap:cudaDetailed Introduction
llama-swap is a high-performance, Go-built tool designed to streamline local generative AI workflows. It acts as a robust proxy, allowing users to run multiple AI models (like those powered by llama.cpp or vllm) on their machine and hot-swap between them on demand. By providing a unified API endpoint compatible with both OpenAI and Anthropic standards, it simplifies interaction with diverse local inference servers, offering features like a real-time web UI, API key management, and flexible model configuration, making local AI development and testing significantly more efficient.