Tags: #evaluation

LLM Engineering Platform
ClickHouse
24.8k

langfuse/langfuse

An open-source LLM engineering platform for developing, monitoring, evaluating, and debugging AI applications.
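
A minimal tracing sketch with the Langfuse Python SDK, shown purely as an illustration (the import path differs between SDK versions, and the API keys are assumed to be set via LANGFUSE_* environment variables):

```python
# Minimal tracing sketch with the Langfuse Python SDK (assumes `pip install langfuse`
# and LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY set in the environment).
from langfuse import observe  # older SDKs: from langfuse.decorators import observe

@observe()  # records this function call as a trace in Langfuse
def answer(question: str) -> str:
    # Placeholder for a real LLM call so the example stays self-contained.
    return f"Echo: {question}"

answer("What does Langfuse record?")
```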

AI Observability and LLM Development Platform
python
18.8k

comet-ml/opik

An open-source platform for debugging, evaluating, and monitoring LLM applications, RAG systems, and agentic workflows from prototype to production.
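
A minimal usage sketch assuming the Opik Python SDK's @track decorator (run `opik configure` first to point the SDK at your Opik instance):

```python
# Minimal tracing sketch with the Opik Python SDK (assumes `pip install opik`).
from opik import track

@track  # logs inputs, outputs, and timing of this call to Opik
def summarize(text: str) -> str:
    # Placeholder for a real LLM call so the example stays self-contained.
    return text[:80]

summarize("Opik traces LLM applications from prototype to production.")
```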

LLM Evaluation & Red Teaming Tool
Node.js
20.0k

promptfoo/promptfoo

Test, evaluate, and red-team your LLM applications to ensure security, reliability, and optimal performance across various models.

LLMOps Platform
openai-sdk
11.2k

tensorzero/tensorzero

An open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation for building robust AI applications.

AI Agent Development and Operations Platform
Docker
5.4k

coze-dev/coze-loop

Coze Loop is an open-source, full-lifecycle management platform for AI agent development, debugging, evaluation, and monitoring.

AI Agent Development Framework and CLI Tool
python
3.7k

evalstate/fast-agent

A flexible, CLI-first framework for building, evaluating, and interacting with AI agents and workflows, with support for a wide range of LLM providers and built-in debugging features.

AI/ML Observability Platform
python
9.3k

Arize-ai/phoenix

An open-source platform for comprehensive AI/ML model observability, evaluation, and debugging.
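
A minimal sketch of starting Phoenix locally to inspect traces, assuming the arize-phoenix package (the API may shift slightly across releases):

```python
# Launch the Phoenix UI locally (assumes `pip install arize-phoenix`).
import phoenix as px

session = px.launch_app()  # starts the local Phoenix server and UI
print(session.url)         # open this URL in a browser to explore traces and evals
```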

AI Development Platform
4.7k

Kiln-AI/Kiln

A free, comprehensive platform for building, evaluating, and optimizing AI systems, offering tools for RAG, fine-tuning, agents, and synthetic data generation.

LLM Evaluation and Testing Framework
python
5.3k

Giskard-AI/giskard-oss

An open-source Python library for comprehensive evaluation, testing, and red teaming of LLM agents and agentic systems.

AI MLOps Framework
python
9.2k

oumi-ai/oumi

An end-to-end platform for fine-tuning, evaluating, and deploying open-source Large Language Models (LLMs) and Vision Language Models (VLMs).

LLM Orchestration Library
python
4.6k

promptslab/Promptify

A Python library for task-based NLP using LLMs, providing structured outputs, support for multiple LLM backends, and built-in evaluation.

Benchmarking and Evaluation Framework
python
3.2k

embeddings-benchmark/mteb

MTEB is a comprehensive benchmark and evaluation framework designed to assess the performance of text embedding models and retrieval systems across a wide range of tasks.
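
A minimal sketch of evaluating a SentenceTransformers model on a single MTEB task (the task and model names here are illustrative choices; newer MTEB releases may prefer selecting tasks via mteb.get_tasks):

```python
# Evaluate an embedding model on one MTEB task
# (assumes `pip install mteb sentence-transformers`).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
```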

AI/ML Optimization Framework
python
4.7k

Marker-Inc-Korea/AutoRAG

An open-source framework that automates the evaluation and optimization of Retrieval-Augmented Generation (RAG) pipelines, using AutoML-style techniques to find the best-performing pipeline for a given dataset.
