LLM Inference and Serving Engine
★ 76.3k · Updated 2026-04-13
vllm-project/vllm
A high-throughput and memory-efficient open-source engine designed for fast, easy, and cost-effective serving of large language models.
Core Features
State-of-the-art serving throughput with PagedAttention
Efficient memory management and continuous batching
Broad support for quantization techniques (FP8, INT4, GPTQ/AWQ)
Flexible distributed inference (tensor, pipeline, data parallelism)
Seamless integration with Hugging Face models and OpenAI-compatible API
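The PagedAttention idea behind the features above can be sketched in a few lines: each sequence's KV cache grows in fixed-size blocks drawn from a shared free pool, so memory is claimed on demand rather than reserved up front for the maximum sequence length. This is an illustrative toy, not vLLM's actual implementation; the `BlockAllocator` class and its methods are invented for this sketch (only the default block size of 16 tokens matches vLLM).

```python
BLOCK_SIZE = 16  # tokens per cache block (vLLM's default block size)

class BlockAllocator:
    """Toy PagedAttention-style block allocator (illustrative only)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Map token `position` of a sequence to a physical block id."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # all current blocks are full
            table.append(self.free.pop())    # grab a new block on demand
        return table[position // BLOCK_SIZE]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=8)
for pos in range(20):                        # a 20-token sequence
    alloc.append_token("req-0", pos)
print(len(alloc.tables["req-0"]))            # 2 blocks used (ceil(20/16))
print(len(alloc.free))                       # 6 blocks still free
```

Because blocks are freed the moment a request finishes, many concurrent sequences can share one fixed pool, which is what makes continuous batching memory-efficient.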
Quick Start
uv pip install vllm
Detailed Introduction
vLLM is a powerful open-source library for LLM inference and serving, originating from UC Berkeley. It excels in delivering state-of-the-art throughput and memory efficiency through innovations like PagedAttention and continuous batching. Designed for flexibility, vLLM supports a wide array of quantization methods, distributed inference strategies, and integrates seamlessly with over 200 Hugging Face models. It provides an easy-to-use platform for deploying LLMs, making advanced AI serving accessible and cost-effective for various hardware and applications.
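As a minimal sketch of the OpenAI-compatible API mentioned above: once a server is running (e.g. `vllm serve <model>`), any OpenAI-style client can query its `/v1/chat/completions` endpoint. The model name, port, and helper function names below are illustrative assumptions, not vLLM defaults you must use.

```python
import json
import urllib.request

def build_chat_request(prompt, model="meta-llama/Llama-3.1-8B-Instruct"):
    """Build an OpenAI-style chat completion request body.

    The model name is an assumption for illustration; use whichever
    model your vLLM server was started with.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    """Send a chat request to a locally running vLLM server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint mirrors the OpenAI API schema, existing OpenAI SDK clients can be pointed at the server simply by changing the base URL.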