AI/ML Deployment Toolkit
3.7k 2026-04-26
PaddlePaddle/FastDeploy
A high-performance inference and deployment toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs) based on PaddlePaddle.
Core Features
Load-balanced PD decomposition for optimized resource utilization
Unified high-performance KV cache transfer
OpenAI API service and vLLM compatibility
Comprehensive quantization format support (W8A16, W4A16, FP8, etc.)
Advanced acceleration techniques like speculative decoding and multi-token prediction
Extensive multi-hardware support (NVIDIA, Kunlunxin, Hygon, etc.)
Detailed Introduction
FastDeploy is a production-grade inference and deployment toolkit built on PaddlePaddle, designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It offers out-of-the-box solutions with core features like load-balanced PD decomposition, unified KV cache transfer, and compatibility with OpenAI API services and vLLM. The toolkit supports various quantization formats and advanced acceleration techniques, ensuring high-performance and efficient deployment across a wide range of hardware platforms, including NVIDIA, Kunlunxin, and Hygon GPUs/XPUs.