Tags: #llm-inference
skyzh/tiny-llm
A hands-on course for systems engineers to build an efficient LLM inference serving system from scratch on Apple Silicon using MLX, mimicking vLLM's core techniques.
xlite-dev/Awesome-LLM-Inference
A comprehensive, curated list of research papers and associated code implementations focused on optimizing Large Language Model (LLM) and Vision-Language Model (VLM) inference.
jundot/omlx
An LLM inference server optimized for Apple Silicon, featuring continuous batching, tiered KV caching, and macOS menu bar management for efficient local AI.
Michael-A-Kuykendall/shimmy
A Python-free Rust inference server that provides a 100% OpenAI-compatible API for running local LLMs with zero dependencies.
lyogavin/airllm
Optimizes large language model inference to run 70B models on a single 4GB GPU without quantization, enabling efficient deployment on resource-constrained hardware.
predibase/lorax
A multi-LoRA inference server designed to serve thousands of fine-tuned LLMs on a single GPU, significantly reducing serving costs while maintaining high throughput and low latency.
janhq/cortex.cpp
A local AI API platform for running various AI models (vision, speech, language) on diverse hardware with an OpenAI-compatible API.
GeeeekExplorer/nano-vllm
A lightweight, optimized Python library for fast offline large language model inference, offering performance comparable to or better than vLLM's with a more readable codebase.
vllm-project/vllm-ascend
A community-maintained hardware plugin that enables vLLM to run efficiently on Ascend NPUs for large language model inference.