Tags: #inference
unslothai/unsloth
Unsloth Studio is a web UI for efficient local training and inference of open-source large language models and other AI models, with optimizations that reduce VRAM use and improve speed.
vllm-project/vllm
vLLM is a high-throughput and memory-efficient open-source library designed for fast and easy serving of large language models.
xorbitsai/inference
A unified, production-ready inference API for deploying and serving open-source language, speech, and multimodal AI models on various infrastructures.
meta-llama/llama-cookbook
An official guide and collection of recipes for building applications with the Llama model family, covering inference, fine-tuning, and RAG.
containers/ramalama
RamaLama is an open-source developer tool that simplifies the local serving and production inference of AI models by leveraging familiar container technology.
PaddlePaddle/FastDeploy
A high-performance inference and deployment toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs) based on PaddlePaddle.
LMCache/LMCache
LMCache is an LLM serving engine extension that reduces time-to-first-token (TTFT) and increases throughput by reusing KV caches across storage tiers and serving instances.
stas00/ml-engineering
An open collection of methodologies, tools, and step-by-step instructions for training, fine-tuning, and inference of large language models and multimodal models.
mnfst/awesome-free-llm-apis
A comprehensive list of large language model (LLM) APIs that offer permanent free tiers for text inference, covering both model providers and third-party inference services.
mlc-ai/web-llm
A high-performance, in-browser LLM inference engine with OpenAI API compatibility, leveraging WebGPU for local, private AI.
cheahjs/free-llm-api-resources
A comprehensive list of free and trial-based LLM inference resources accessible via API.
beam-cloud/beta9
An ultrafast, open-source Pythonic runtime for deploying and scaling serverless GPU inference, sandboxes, and background jobs with zero infrastructure overhead.
vllm-project/vllm-omni
A framework for efficient, fast, and cheap serving of omni-modality (text, image, video, audio) AI models.
vitoplantamura/OnnxStream
A lightweight C++ inference library designed to run large ONNX-based AI models like Stable Diffusion XL and Mistral 7B on resource-constrained devices with minimal memory footprint.