om-ai-lab/VLM-R1
A stable and generalizable R1-style Large Vision-Language Model (VLM) framework that enhances visual understanding tasks through reinforcement learning, outperforming SFT models in generalization.
Core Features
Detailed Introduction
VLM-R1 is a stable and generalizable R1-style Large Vision-Language Model framework that builds on the ideas of DeepSeek-R1 to address visual understanding tasks. It uses reinforcement learning to achieve stronger generalization than traditional SFT models, especially on out-of-domain data for tasks such as Referring Expression Comprehension (REC). VLM-R1 has achieved top performance on the OpenCompass Math Leaderboard (among models under 4B parameters) and state-of-the-art results on OVDEval. The project also continuously optimizes model deployment and inference efficiency on Huawei Ascend hardware, supports a variety of VLMs and flexible fine-tuning strategies, and provides a robust framework for developing high-performance, generalizable VLMs.
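The "R1-style" reinforcement learning referenced above generally follows the group-relative policy optimization (GRPO) recipe popularized by DeepSeek-R1: several responses are sampled per prompt, and each response's reward is normalized against its group's statistics to form an advantage. The sketch below illustrates only that core advantage computation; the function name and reward values are illustrative, and VLM-R1's exact training recipe may differ.

```python
# Minimal sketch of GRPO-style group-relative advantages, the core idea
# behind R1-style RL training. Names and numbers here are illustrative,
# not VLM-R1's actual API.
import statistics


def group_relative_advantages(rewards):
    """Normalize each sampled response's reward against the group's
    mean and standard deviation (the group-relative baseline in GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Example: rewards for four sampled answers to one visual query,
# e.g. 1.0 if the predicted REC bounding box matches, 0.0 otherwise.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average receive positive advantages and are reinforced; below-average responses are penalized, without needing a learned value function as a baseline.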