Tags: #llm-inference
vllm-project/vllm
A high-throughput, memory-efficient open-source engine for fast and cost-effective serving of large language models.
xlite-dev/Awesome-LLM-Inference
A comprehensive, curated list of research papers and associated code for optimizing Large Language Model (LLM) and Vision Language Model (VLM) inference.
jundot/omlx
An optimized LLM inference server for Apple Silicon, featuring continuous batching, tiered KV caching, and macOS menu bar management for efficient local AI.
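Continuous batching, the scheduling technique omlx advertises, can be illustrated with a toy scheduler: instead of draining a whole batch before starting the next one, finished sequences are evicted and queued requests are admitted at every decode step. A minimal sketch (no real model; request token counts are invented):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching loop: each request is (id, tokens_to_generate).
    Finished sequences leave the batch mid-flight and waiting ones join,
    rather than waiting for the whole batch to drain (static batching)."""
    waiting = deque(requests)
    running = {}   # request id -> tokens still to generate
    trace = []     # which requests shared the batch at each decode step
    while waiting or running:
        # Admit waiting requests into free batch slots at every step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        trace.append(sorted(running))
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # evict immediately, freeing a slot

    return trace

# "A" needs 3 tokens, "B" needs 1, "C" needs 2; the batch holds 2 sequences.
# "C" joins as soon as "B" finishes, without waiting for "A".
print(continuous_batching([("A", 3), ("B", 1), ("C", 2)]))
```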
Michael-A-Kuykendall/shimmy
A Python-free, Rust-based inference server exposing an OpenAI-compatible API for local GGUF and SafeTensors LLM models.
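Because shimmy speaks the OpenAI chat-completions wire format, existing clients only need their base URL repointed at the local server. A hedged sketch of such a request: the endpoint URL and model name below are placeholders, and no server is contacted here, only the JSON body is built.

```python
import json

def chat_request(model, user_message, temperature=0.7):
    """Build an OpenAI-style /v1/chat/completions request body.
    Any OpenAI-compatible server (shimmy, vLLM, cortex.cpp, ...) accepts
    this shape; only the base URL differs between them."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

# Hypothetical local endpoint; the real port and model name depend on setup.
BASE_URL = "http://localhost:11435/v1/chat/completions"
body = chat_request("local-gguf-model", "Hello!")
print(json.dumps(body))
```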
lyogavin/airllm
AirLLM cuts LLM inference memory by loading model layers on demand, enabling 70B models on a single 4GB GPU without quantization, and Llama 3.1 405B on 8GB of VRAM.
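The trick behind that claim is layer-by-layer execution: only one transformer layer's weights need to be resident on the GPU at a time. Back-of-envelope arithmetic (assuming an fp16 70B model with 80 layers, roughly Llama-2-70B's shape; activations and KV cache are ignored) shows why a 4GB GPU can hold one layer:

```python
def per_layer_gib(total_params, n_layers, bytes_per_param=2):
    """Approximate GPU memory needed for one layer's weights in fp16.
    Rough sketch only: embeddings, activations, and KV cache are ignored,
    and parameters are assumed evenly spread across layers."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / n_layers / 2**30

# ~70B params across ~80 layers in fp16 -> roughly 1.6 GiB per layer,
# well under a 4GB GPU's capacity even with overhead.
print(round(per_layer_gib(70e9, 80), 2))
```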
predibase/lorax
A multi-LoRA inference server designed to efficiently serve thousands of fine-tuned Large Language Models on a single GPU, drastically cutting serving costs while maintaining high throughput and low latency.
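The multi-LoRA idea behind LoRAX is that one shared base model serves many small adapters, with the adapter selected per request. A toy dispatcher sketches the routing (no real model; the adapter math is reduced to a string tag, and the model/adapter names are made up):

```python
class MultiLoRAServer:
    """Toy multi-LoRA dispatcher: one base model, many cheap adapters.
    Real servers such as LoRAX batch requests for different adapters
    together on one GPU; here we only model the per-request routing."""
    def __init__(self, base_model):
        self.base_model = base_model
        self.adapters = {}

    def load_adapter(self, adapter_id):
        # Stand-in for loading a small set of LoRA weight deltas.
        self.adapters[adapter_id] = f"lora:{adapter_id}"

    def generate(self, prompt, adapter_id=None):
        tag = self.adapters.get(adapter_id, "base")
        return f"[{self.base_model}+{tag}] {prompt}"

server = MultiLoRAServer("llama-3-8b")   # hypothetical base model name
server.load_adapter("support-bot")
print(server.generate("Hi", adapter_id="support-bot"))
print(server.generate("Hi"))             # no adapter -> base model answers
```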
janhq/cortex.cpp
A local AI API platform designed to run various AI models (vision, speech, language) on local hardware with an OpenAI-compatible API.