skyzh/tiny-llm
An educational course for systems engineers to learn LLM inference serving on Apple Silicon by building a simplified vLLM-like system with MLX from scratch.
Detailed Introduction
The tiny-llm project is an educational course for systems engineers who want to master large language model (LLM) inference serving. It takes a hands-on approach: you build a simplified vLLM-like serving system from the ground up using Apple's MLX framework. The curriculum works directly with MLX's low-level array and matrix APIs, bypassing high-level neural network abstractions so that the underlying mechanisms stay visible. Using the Qwen2 model as a practical case study, it covers the essential components of an LLM as well as advanced serving optimizations such as the KV cache, continuous batching, and flash attention, all targeted at efficient deployment on Apple Silicon.
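To give a flavor of the "low-level array APIs only" approach, here is a minimal sketch of scaled dot-product attention, the kind of building block the course has you implement by hand. This is not code from the tiny-llm repository; it uses NumPy as a stand-in, since MLX's array API closely mirrors NumPy's.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention from raw array ops, with no high-level NN abstractions.

    q, k, v: arrays of shape (seq_len, d_k). mask: boolean array
    (seq_len, seq_len) where True means the position may be attended to.
    """
    d_k = q.shape[-1]
    # similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # push masked-out positions toward -inf before the softmax
        scores = np.where(mask, scores, -1e9)
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Causal mask for decoding: position i may only attend to positions <= i.
L, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
causal = np.tril(np.ones((L, L), dtype=bool))
out = scaled_dot_product_attention(q, k, v, mask=causal)
```

With a causal mask, the first token can only attend to itself, so its output row equals `v[0]` exactly; later optimizations in the course (KV cache, flash attention) change how this computation is scheduled and cached, not what it computes.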