skyzh/tiny-llm
A hands-on course for systems engineers to build an efficient LLM inference serving system from scratch on Apple Silicon using MLX, mimicking vLLM's core techniques.
Detailed Introduction
The `tiny-llm` project is a course that teaches systems engineers how an LLM inference serving system works by having them build one. It uses Apple's MLX framework, focusing on its low-level array and matrix APIs, so that the serving infrastructure, a simplified take on vLLM, is constructed from scratch. The course covers attention mechanisms, KV caching, continuous batching, and flash attention, specifically using Qwen2 models. Because everything runs on Apple Silicon under macOS, learners get hands-on, in-depth experience with LLM serving optimizations without needing NVIDIA GPUs.
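To give a feel for one of the techniques listed above, here is a minimal sketch of KV caching for a single attention head. It is not taken from the course and does not use MLX: it is plain Python so it runs anywhere, and the names `KVCache`, `attend`, `decode_step`, and `full_causal_attention` are hypothetical. It also assumes the query, key, and value vectors are already projected. The point it illustrates is that appending each new token's key and value to a cache gives the same output as recomputing causal attention over the whole prefix.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCache:
    """Hypothetical cache: stores the key and value of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Process one new token: extend the cache, then attend over all cached entries."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


def full_causal_attention(queries, keys, values):
    """Reference without a cache: token t attends over positions 0..t."""
    return [attend(q, keys[: t + 1], values[: t + 1]) for t, q in enumerate(queries)]


# Toy inputs (assumed already projected into query/key/value space).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_outputs = [decode_step(cache, q, k, v) for q, k, v in zip(qs, ks, vs)]
reference_outputs = full_causal_attention(qs, ks, vs)
```

With the cache, each decode step only handles the new token's query, key, and value and reuses the stored ones, whereas the reference recomputes attention over every prefix. In a real serving system the cached entries would be tensors per layer and per head rather than Python lists.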