Tags: #gpu-acceleration
sgl-project/sglang
A high-performance serving framework designed to accelerate inference for large language models and multimodal AI models.
vllm-project/vllm
A high-throughput and memory-efficient open-source engine designed for fast, easy, and cost-effective serving of large language models.
containers/ramalama
RamaLama simplifies local serving and production inference of AI models from any source by using familiar container workflows, eliminating the need for complex host-system configuration.
hpcaitech/ColossalAI
Colossal-AI makes training and deploying large AI models cheaper, faster, and more accessible through advanced distributed training techniques.
NVIDIA-NeMo/Curator
A GPU-accelerated, scalable toolkit for multimodal data preprocessing and curation, designed to train better AI models faster.
ModelCloud/GPTQModel
A toolkit for quantizing (compressing) Large Language Models (LLMs) with hardware acceleration across various GPUs and CPUs, integrating with popular inference frameworks.
alibaba/ROLL
An efficient, user-friendly library for scaling reinforcement learning with large language models across large GPU clusters.
vllm-project/vllm-omni
vLLM-Omni is an efficient, flexible, and easy-to-use framework extending vLLM to serve omni-modality models (text, image, video, audio) with high throughput and an OpenAI-compatible API.
withcatai/node-llama-cpp
A Node.js library providing bindings for llama.cpp, enabling local AI model inference with advanced features like JSON schema enforcement and function calling.
edwko/OuteTTS
A versatile interface for OuteTTS models, providing flexible text-to-speech generation capabilities across various AI inference backends and hardware platforms.