Technical Guide & Knowledge Base
17.8k 2026-04-27
stas00/ml-engineering
An open collection of methodologies, tools, and step-by-step instructions for successful training, fine-tuning, and inference of large language and multi-modal models.
Core Features
Comprehensive guides on LLM/VLM training and inference
Detailed insights into hardware (compute, storage, network) for ML
Guidance on orchestration systems like SLURM and container management
Practical debugging and troubleshooting techniques for ML applications
Collection of scripts and commands for quick problem-solving
Detailed Introduction
This project is an open book and technical guide to Machine Learning Engineering, aimed at engineers and operators who train and run inference on Large Language Models (LLMs) and multi-modal models (VLMs). It draws on extensive experience from projects such as BLOOM-176B and IDEFICS-80B and offers practical methodologies, tools, and ready-to-use scripts. The resource covers hardware selection, orchestration, training, inference, and debugging, with the aim of providing quick, proven solutions to complex ML problems.