AI Model Framework
6.0k 2026-05-01
om-ai-lab/VLM-R1
VLM-R1 is a stable and generalizable R1-style Large Vision-Language Model that uses reinforcement learning to improve performance on visual understanding tasks.
Core Features
Full and LoRA Fine-tuning for GRPO
Support for multi-node training and multi-image input
Compatibility with various VLMs like QwenVL and InternVL
Optimized inference with xllm and vllm-ascend frameworks
Detailed Introduction
VLM-R1 is an R1-style Large Vision-Language Model designed for visual understanding. Building on the DeepSeek-R1 approach, it trains with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), which the project reports to be more stable and to generalize better than supervised fine-tuning (SFT), especially on out-of-domain data. The project reports state-of-the-art results on tasks such as Referring Expression Comprehension (REC) and Open-Vocabulary Detection (OVD), and it offers full and LoRA fine-tuning along with broad hardware support, including Huawei Ascend platforms.
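To make the GRPO idea concrete, here is a minimal sketch of its group-relative advantage step applied to an REC-style task. It is an illustration under stated assumptions, not VLM-R1's actual code: the box-overlap (IoU) reward, the function names `iou` and `group_relative_advantages`, the use of population standard deviation, and the sample boxes are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three sampled box predictions for one prompt.
target = (10, 10, 50, 50)
candidates = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)]

# Assumed reward for illustration: overlap with the ground-truth box.
rewards = [iou(c, target) for c in candidates]
advantages = group_relative_advantages(rewards)

for box, r, a in zip(candidates, rewards, advantages):
    print(f"box={box} reward={r:.3f} advantage={a:+.3f}")
```

Because each advantage is measured against the other samples drawn for the same prompt, no separate value network is needed: predictions that beat the group average are reinforced and the rest are discouraged.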