LLM Alignment Framework
OpenLMLab/MOSS-RLHF
An open-source framework providing code, models, and insights for stable Reinforcement Learning from Human Feedback (RLHF) training of Large Language Models, with a focus on the PPO algorithm and reward modeling.
Core Features
Open-source PPO-max algorithm for stable RLHF training.
Pre-trained Chinese and English reward models.
Annotated HH-RLHF dataset with preference strength.
Released SFT and RLHF-aligned policy models.
Comprehensive technical reports on RLHF and Reward Modeling.
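The reward models listed above are trained on pairwise preference data such as the annotated HH-RLHF set. As a rough, self-contained illustration (not the project's actual training code; the function name here is hypothetical), the standard Bradley-Terry pairwise loss pushes the reward model to score the human-preferred response above the rejected one:

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the chosen
    (human-preferred) response above the rejected one, and grows
    as the margin moves in the wrong direction.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correct-direction margin lowers the loss relative to a tie;
# a wrong-direction margin raises it.
print(pairwise_reward_loss(2.0, 0.0) < pairwise_reward_loss(0.0, 0.0))  # True
print(pairwise_reward_loss(0.0, 2.0) > pairwise_reward_loss(0.0, 0.0))  # True
```

Annotating preference strength, as this dataset does, makes it possible to weight or margin-scale this loss per example rather than treating all preferences as equally confident.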
Detailed Introduction
MOSS-RLHF addresses key challenges in applying Reinforcement Learning from Human Feedback (RLHF) to Large Language Models, such as the complexity of reward design and the instability of PPO training. It aims to lower the barrier to entry for AI researchers by providing a robust, reproducible framework. The project introduces the PPO-max algorithm for stable training, provides competitive pre-trained reward models in both Chinese and English, and releases annotated preference datasets, enabling better human alignment for LLMs.
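The clipped PPO surrogate at the core of such training can be sketched as follows. This is a minimal single-action illustration under assumed names, not the PPO-max implementation itself; PPO-max layers additional stabilizing techniques on top of standard PPO, for which the project's technical reports are the authoritative source:

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for a single token/action.

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - clip_eps, 1 + clip_eps] keeps each update close to the old
    policy, one of the main sources of PPO's training stability.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # Take the pessimistic (minimum) surrogate, as in standard PPO.
    return min(ratio * advantage, clipped * advantage)

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, beta: float = 0.1) -> float:
    """RLHF-style reward shaping (a common stabilizer, assumed here):
    penalize per-token drift of the policy away from the SFT/reference
    model via a KL term subtracted from the reward-model score."""
    return reward - beta * (logp_policy - logp_ref)
```

For example, with `advantage = 1.0` and a log-probability increase of 1.0, the raw ratio `e ≈ 2.72` is clipped to `1.2`, capping how much that sample can move the policy in one step.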