Tags: #evaluation

AI Observability and MLOps Platform

19.1k

comet-ml/opik

An open-source platform for comprehensive tracing, evaluation, and optimization of LLM applications, RAG systems, and agentic workflows.

llm observability evaluation


LLM Evaluation & Red Teaming Tool

Node.js

20.5k

promptfoo/promptfoo

A CLI and library for testing, evaluating, and red-teaming LLM applications to ensure security, reliability, and performance across various models.

llm evaluation red-teaming


LLMOps Platform

Rust

11.3k

tensorzero/tensorzero

An open-source LLMOps platform unifying LLM gateway, observability, evaluation, optimization, and experimentation for robust AI application development.

llmops llm-gateway observability


AI Agent Development and Operations Platform

Docker

5.4k

coze-dev/coze-loop

Coze Loop is an open-source platform for full-lifecycle management of AI agents, covering development, debugging, evaluation, and monitoring.

ai-agent llm-ops prompt-engineering


AI Engineering Platform

Python

25.8k

mlflow/mlflow

An open-source AI engineering platform for debugging, evaluating, monitoring, and optimizing production-quality AI applications, including agents, LLMs, and ML models.

ai engineering llm mlops


AI Agent Development Framework & CLI Tool

Python

3.8k

evalstate/fast-agent

A flexible CLI-first framework for building, evaluating, and interacting with sophisticated multimodal LLM agents and workflows, offering comprehensive model and skill support.

llm-agent ai-development cli-tool


AI Agent Benchmarking Platform

Python

2.8k

xlang-ai/OSWorld

OSWorld is a benchmark and environment for evaluating multimodal AI agents on open-ended tasks within real computer operating systems.

ai agents benchmarking multimodal


AI/ML Observability Platform

Python

9.4k

Arize-ai/phoenix

An open-source platform for debugging, evaluating, and monitoring AI/ML models and pipelines.

ai ml observability

Replaces: Commercial AI Observability Platforms


AI Development Platform

Ollama

4.8k

Kiln-AI/Kiln

A free, all-in-one platform for building, evaluating, and optimizing AI systems, offering tools for RAG, agents, fine-tuning, and synthetic data generation.

ai development llm ops evaluation


AI/ML Testing and Evaluation Framework

5.3k

Giskard-AI/giskard-oss

An open-source Python library for comprehensive testing, evaluation, and red teaming of LLM agents and AI systems, designed for dynamic, multi-turn interactions.

llm testing evaluation


MLOps Platform

Python

9.2k

oumi-ai/oumi

An end-to-end platform for fine-tuning, evaluating, and deploying open-source Large Language Models (LLMs) and Vision Language Models (VLMs).

llm vlm finetuning


LLM Orchestration Library / NLP Framework

Python 3.9+

4.6k

promptslab/Promptify

A Python library for structured NLP tasks using LLMs, offering Pydantic outputs, multi-provider support, and built-in evaluation.

llm nlp prompt-engineering


Benchmarking and Evaluation Framework

Python

3.2k

embeddings-benchmark/mteb

MTEB is a comprehensive benchmark and evaluation framework designed to assess the performance of text embedding models and retrieval systems across a wide range of tasks.

embeddings benchmark nlp


RAG Optimization Framework

Python

4.7k

Marker-Inc-Korea/AutoRAG

An open-source framework that applies AutoML-style automation to evaluate and optimize Retrieval-Augmented Generation (RAG) pipelines for a specific dataset.

rag automl llm
