Large Vision-Language Model Framework

om-ai-lab/VLM-R1

A stable and generalizable R1-style Large Vision-Language Model (VLM) framework that enhances visual understanding tasks through reinforcement learning, outperforming SFT models in generalization.

Core Features

Full Fine-tuning for GRPO
LoRA Fine-tuning for GRPO
Multi-node and Multi-image Input Training
Support for various VLMs (QwenVL, InternVL)
Optimized inference with xllm and vllm-ascend on Huawei Ascend hardware
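To make the LoRA feature above concrete, here is a minimal, hand-rolled sketch of the low-rank adaptation idea in plain Python. This is illustrative only and is not VLM-R1's implementation: the frozen weight W is augmented with a low-rank delta B @ A scaled by alpha / r, and only the small matrices A and B are trained.

```python
# Hand-rolled LoRA sketch (illustrative; not VLM-R1's actual code).
# Effective weight at merge time: W + (alpha / r) * (B @ A).

def matmul(M, N):
    """Plain-Python matrix product."""
    return [[sum(m * n for m, n in zip(row, col)) for col in zip(*N)]
            for row in M]

def lora_effective_weight(W, A, B, alpha=16, r=1):
    """Merge a LoRA adapter (A: r x d_in, B: d_out x r) into W."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# B is initialized to zeros, so the adapter starts as a no-op:
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.2]]        # r x d_in, r = 1
B = [[0.0], [0.0]]      # d_out x r
print(lora_effective_weight(W, A, B, alpha=16, r=1))  # equals W
```

Because only A and B receive gradients, this keeps the trainable parameter count small, which is what makes LoRA fine-tuning of large VLMs tractable on modest hardware.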

Detailed Introduction

VLM-R1 is a stable and generalizable R1-style Large Vision-Language Model framework that builds on the ideas of DeepSeek-R1 to address visual understanding tasks. It leverages reinforcement learning to achieve stronger generalization than traditional SFT models, especially on out-of-domain data for tasks such as Referring Expression Comprehension (REC). VLM-R1 has achieved top performance on the OpenCompass math leaderboard among models under 4B parameters, along with state-of-the-art results on OVDEval. The project also continuously optimizes model deployment and inference efficiency on Huawei Ascend hardware, and it supports various VLMs and flexible fine-tuning strategies, providing a robust framework for developing high-performance, generalizable VLMs.
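The GRPO-style training mentioned above centers on a group-relative advantage: for each prompt, several responses are sampled and scored with a task reward, and each response's advantage is its reward normalized against the group's mean and standard deviation. The following plain-Python sketch illustrates that computation; the function name and the IoU reward values are hypothetical, not taken from the VLM-R1 codebase.

```python
# Sketch of the group-relative advantage used in GRPO-style RL
# (illustrative; not VLM-R1's actual implementation).

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's sampled responses."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one REC prompt, each scored by the IoU
# between its predicted box and the ground truth (hypothetical values).
print(grpo_advantages([0.9, 0.1, 0.5, 0.5]))
```

Normalizing within the group rather than against a learned value model is what lets GRPO drop the critic network, which is one reason R1-style training is comparatively cheap to run.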
