LLM Inference and Serving Engine
78.1k 2026-04-25
vllm-project/vllm
vLLM is a high-throughput and memory-efficient open-source library designed for fast and easy serving of large language models.
Core Features
State-of-the-art serving throughput with PagedAttention
Efficient memory management and continuous batching
Broad support for various quantization techniques (FP8, INT4, GPTQ/AWQ)
Seamless integration with 200+ Hugging Face models and architectures
Flexible distributed inference and OpenAI-compatible API server
Quick Start
uv pip install vllmDetailed Introduction
vLLM, originating from UC Berkeley's Sky Computing Lab, is a leading open-source library for LLM inference and serving. It achieves state-of-the-art throughput and memory efficiency through innovations like PagedAttention, continuous batching, and advanced quantization. Designed for flexibility, vLLM integrates seamlessly with over 200 Hugging Face models, supports diverse hardware (NVIDIA, AMD, CPUs, TPUs), and offers features like distributed inference and an OpenAI-compatible API, making LLM deployment easy, fast, and cost-effective for a wide range of applications.