Tags: #benchmarking
AI Agent Benchmarking Platform
VMware Workstation Pro
2.8k
xlang-ai/OSWorld
OSWorld is a comprehensive benchmarking platform designed to evaluate multimodal AI agents on open-ended tasks within real computer environments.
AI/ML Evaluation Framework
python
4.0k
EvolvingLMMs-Lab/lmms-eval
A unified, reproducible, and efficient evaluation toolkit for large multimodal models across text, image, video, and audio tasks.