Tags: #llm-inference

Educational Project / LLM Inference Serving Course

4.1k

skyzh/tiny-llm

A hands-on course for systems engineers to build an efficient LLM inference serving system from scratch on Apple Silicon using MLX, mimicking vLLM's core techniques.

llm-inference mlx apple-silicon

Curated Resource List

Python

5.2k

xlite-dev/Awesome-LLM-Inference

A comprehensive, curated list of research papers and associated code implementations focused on optimizing Large Language Model (LLM) and Vision-Language Model (VLM) inference.

llm inference vlm inference optimization

LLM Inference Server & Desktop Utility

macOS

11.7k

jundot/omlx

An LLM inference server optimized for Apple Silicon, featuring continuous batching, tiered KV caching, and macOS menu bar management for efficient local AI.

llm-inference apple-silicon kv-caching

AI Inference Server

4.7k

Michael-A-Kuykendall/shimmy

Shimmy is a Python-free Rust inference server that provides a 100% OpenAI-compatible API for running local Large Language Models (LLMs) with zero dependencies.

llm-inference openai-api-compatible rust

Replaces: OpenAI API

LLM Inference Optimization Library

Python

17.0k

lyogavin/airllm

Optimizes large language model inference to run 70B models on a single 4GB GPU without quantization, enabling efficient deployment on resource-constrained hardware.

llm-inference gpu-optimization memory-efficiency

AI/ML Inference Server

Docker

3.8k

predibase/lorax

A multi-LoRA inference server designed to serve thousands of fine-tuned LLMs on a single GPU, significantly reducing serving costs while maintaining high throughput and low latency.

llm-inference lora model-serving

Local AI API Platform / CLI Tool

llama.cpp

2.8k

janhq/cortex.cpp

A local AI API platform for running various AI models (vision, speech, language) on diverse hardware with an OpenAI-compatible API.

local-ai api-platform llm-inference

Replaces: OpenAI

LLM Inference Engine

Python

13.1k

GeeeekExplorer/nano-vllm

A lightweight and optimized Python library for fast offline large language model inference, offering comparable or better performance than vLLM with a more readable codebase.

llm-inference deep-learning python

Hardware Plugin

vLLM

2.0k

vllm-project/vllm-ascend

A community-maintained hardware plugin that enables vLLM to run seamlessly and efficiently on Ascend NPUs for large language model inference.

vllm ascend llm-inference
