Tags: #evaluation
langfuse/langfuse
An open-source LLM engineering platform for developing, monitoring, evaluating, and debugging AI applications.
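A minimal tracing sketch for Langfuse, assuming its Python SDK's @observe decorator (exported from the top-level package in recent SDK versions; from langfuse.decorators in v2) and credentials supplied via environment variables:

```python
# Hedged sketch: assumes the Langfuse Python SDK (pip install langfuse)
# with LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY set in the environment.
from langfuse import observe  # SDK v2 instead: from langfuse.decorators import observe

@observe()  # records this call (args, return value, timing) as a trace
def answer(question: str) -> str:
    # Placeholder for a real LLM call; the return value is logged as the output.
    return f"stub answer to: {question!r}"

answer("What does Langfuse record?")
```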
comet-ml/opik
An open-source platform for debugging, evaluating, and monitoring LLM applications, RAG systems, and agentic workflows from prototype to production.
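As a hedged sketch, Opik's Python SDK provides a @track decorator that logs function inputs, outputs, and call nesting as traces (assumes the opik package and a configured workspace or local deployment):

```python
# Hedged sketch: assumes `pip install opik` and a configured Opik
# workspace (e.g. via `opik configure`) or a self-hosted instance.
from opik import track

@track  # logs this call's inputs, outputs, and nesting as a trace
def summarize(text: str) -> str:
    # Placeholder for a real LLM call; kept as a stub so the sketch runs.
    return text[:60]

summarize("Opik captures nested calls in an LLM pipeline as a single trace.")
```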
promptfoo/promptfoo
Test, evaluate, and red-team your LLM applications to ensure security, reliability, and optimal performance across various models.
tensorzero/tensorzero
An open-source LLMOps platform unifying LLM gateway, observability, evaluation, optimization, and experimentation for robust AI application development.
coze-dev/coze-loop
Coze Loop is an open-source, full-lifecycle management platform for AI agent development, debugging, evaluation, and monitoring.
evalstate/fast-agent
A flexible, CLI-first framework for building, evaluating, and interacting with sophisticated AI agents and workflows, with support for many LLM providers and built-in debugging tools.
Arize-ai/phoenix
An open-source platform for comprehensive AI/ML model observability, evaluation, and debugging.
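A minimal sketch of Phoenix's local mode, assuming the arize-phoenix package; launch_app() starts the UI in-process and returns a session object:

```python
# Hedged sketch: assumes `pip install arize-phoenix` and a free local port.
import phoenix as px

# Start the Phoenix UI locally; traces sent via OpenTelemetry show up here.
session = px.launch_app()
print(session.url)  # open this URL in a browser to inspect traces and evals
```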
Kiln-AI/Kiln
A free, comprehensive platform for building, evaluating, and optimizing AI systems, offering tools for RAG, fine-tuning, agents, and synthetic data generation.
Giskard-AI/giskard-oss
An open-source Python library for comprehensive evaluation, testing, and red teaming of LLM agents and agentic systems.
oumi-ai/oumi
An end-to-end platform for fine-tuning, evaluating, and deploying open-source Large Language Models (LLMs) and Vision Language Models (VLMs).
promptslab/Promptify
A Python library for task-based NLP with LLMs, offering structured outputs, support for multiple LLM backends, and built-in evaluation.
embeddings-benchmark/mteb
MTEB (Massive Text Embedding Benchmark) is a benchmark and evaluation framework for assessing text embedding models and retrieval systems across a wide range of tasks.
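A minimal sketch of running one MTEB task against a sentence-transformers model (assumes the mteb and sentence-transformers packages; the task name is one example from the MTEB registry):

```python
# Hedged sketch: assumes `pip install mteb sentence-transformers`.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
tasks = mteb.get_tasks(tasks=["Banking77Classification"])  # example task
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # scores written to disk
```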