Tags: #evaluation
comet-ml/opik
An open-source platform for comprehensive tracing, evaluation, and optimization of LLM applications, RAG systems, and agentic workflows.
promptfoo/promptfoo
A CLI and library for testing, evaluating, and red-teaming LLM applications for security, reliability, and performance across various models.
tensorzero/tensorzero
An open-source LLMOps platform unifying LLM gateway, observability, evaluation, optimization, and experimentation for robust AI application development.
coze-dev/coze-loop
Coze Loop is an open-source platform for full-lifecycle management of AI agents, covering development, debugging, evaluation, and monitoring.
mlflow/mlflow
An open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing production-quality AI applications, including agents, LLMs, and ML models.
evalstate/fast-agent
A flexible CLI-first framework for building, evaluating, and interacting with sophisticated multimodal LLM agents and workflows, offering comprehensive model and skill support.
xlang-ai/OSWorld
OSWorld is a benchmark and environment for evaluating multimodal AI agents on open-ended tasks within real computer operating systems.
Arize-ai/phoenix
An open-source platform for debugging, evaluating, and monitoring AI/ML models and pipelines.
Kiln-AI/Kiln
A free, all-in-one platform for building, evaluating, and optimizing AI systems, offering tools for RAG, agents, fine-tuning, and synthetic data generation.
Giskard-AI/giskard-oss
An open-source Python library for comprehensive testing, evaluation, and red teaming of LLM agents and AI systems, designed for dynamic, multi-turn interactions.
oumi-ai/oumi
An end-to-end platform for fine-tuning, evaluating, and deploying open-source Large Language Models (LLMs) and Vision Language Models (VLMs).
promptslab/Promptify
A Python library for structured NLP tasks using LLMs, offering Pydantic outputs, multi-provider support, and built-in evaluation.
embeddings-benchmark/mteb
MTEB is a benchmark and evaluation framework for assessing text embedding models and retrieval systems across a wide range of tasks.