Tags: #data-pipeline
vectordotdev/vector
A high-performance, end-to-end observability data pipeline for collecting, transforming, and routing logs and metrics, intended to give users more control over their observability data and its cost.
microsoft/graphrag
GraphRAG is a modular, graph-based Retrieval-Augmented Generation (RAG) system that uses LLMs to extract structured data from unstructured text, with the aim of improving reasoning over private datasets.
google-gemini/genai-processors
A lightweight Python library for building modular, asynchronous, and composable AI pipelines, enabling efficient, parallel, and multimodal content processing for Generative AI applications.
apache/airflow
A robust open-source platform for programmatically authoring, scheduling, and monitoring data workflows.
dagster-io/dagster
A cloud-native data pipeline orchestrator designed for the development, production, and observation of data assets, featuring integrated lineage, observability, and a declarative programming model.
towhee-io/towhee
Towhee is a framework for simplifying and accelerating neural data processing pipelines, particularly for unstructured multimodal data and LLM orchestration.
bespokelabsai/curator
A Python library for generating and curating high-quality synthetic data for AI model training and structured data extraction.
astronomer/astronomer-cosmos
Integrates dbt Core projects into Apache Airflow DAGs and Task Groups, so that dbt transformations can be orchestrated from Airflow.
rom1504/img2dataset
A command-line tool that downloads, resizes, and packages large collections of image URLs into ready-to-use datasets for machine learning.
apache/dolphinscheduler
Apache DolphinScheduler is a low-code data orchestration platform for quickly building high-performance workflows and managing complex data pipeline dependencies.