
skyzh/tiny-llm

A hands-on course for systems engineers to build an efficient LLM inference serving system from scratch on Apple Silicon using MLX, mimicking vLLM's core techniques.

Core Features

Implement LLM serving components from scratch using MLX.

Explore advanced inference optimizations like KV cache and continuous batching.

Runs on Apple Silicon Macs through MLX, with no NVIDIA GPU required.

Step-by-step curriculum covering Qwen2 model integration.
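To make the continuous batching feature concrete, here is a minimal scheduling sketch in plain Python. It is not taken from the course material: the `Request` class, the `tokens_needed` field (a stand-in for "the model emitted an end-of-sequence token"), and `run_continuous_batching` are all hypothetical names chosen for illustration, and no model is actually run.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_needed: int  # hypothetical stand-in for "generation hit end-of-sequence"
    generated: int = 0


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode steps; return the step at which each request finishes."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free batch slots before the next decode step,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        step += 1
        # One decode step produces one token for every running request.
        for req in running:
            req.generated += 1

        # Retire finished requests so their slots free up immediately.
        still_running = []
        for req in running:
            if req.generated >= req.tokens_needed:
                finished_at[req.name] = step
            else:
                still_running.append(req)
        running = still_running
    return finished_at


result = run_continuous_batching(
    [Request("A", 3), Request("B", 1), Request("C", 2)],
    max_batch_size=2,
)
print(result)
```

In this toy run, request C takes over B's slot as soon as B finishes, so C completes at step 3. A static batch of A and B would hold the slot until A finished, and C would not complete until step 5.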

Detailed Introduction

The `tiny-llm` project is a course that teaches systems engineers how an LLM inference serving system works by having them build one. It uses Apple's MLX framework and relies mainly on its low-level array and matrix APIs, constructing the serving infrastructure from scratch as a simplified version of vLLM. The course covers attention, KV caching, continuous batching, and flash attention, using Qwen2 models throughout. Its main value is hands-on experience with LLM serving optimizations on Apple Silicon Macs, without the need for NVIDIA GPUs.
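As an illustration of the KV caching idea mentioned above, the following sketch uses plain Python lists rather than MLX so that it runs anywhere. It is an assumption-laden toy, not the course's implementation: `KVCache`, `attend`, and `decode_step` are hypothetical names, there is a single attention head, and the query, key, and value vectors are given directly instead of being computed by model projections.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class KVCache:
    """Hypothetical cache holding keys and values for all earlier tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(cache, q, k, v):
    # Only the new token's key and value are added; earlier ones are reused.
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


# Toy per-token vectors (assumed values, not produced by a real model).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
outs = [decode_step(cache, q, k, v) for q, k, v in zip(qs, ks, vs)]

# Reference: recompute attention for the last token over the full sequence.
reference = attend(qs[-1], ks, vs)
print(outs[-1], reference)
```

The cached path and the full recomputation give the same output for the last token. The saving is that each decode step adds one key and one value, instead of recomputing them for the whole prefix.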