Tags: #etl

Data Integration Platform

9.3k

apache/seatunnel

SeaTunnel is a high-performance, distributed data integration tool designed to synchronize massive amounts of multimodal data from diverse sources with efficiency and stability.

data integration, etl, big data

Data Processing Library

Python

14.6k

Unstructured-IO/unstructured

An open-source ETL solution for transforming complex documents into clean, structured data formats, optimized for language models.

etl, document processing, data extraction

Workflow Orchestration Platform

Python

45.2k

apache/airflow

A platform to programmatically author, schedule, and monitor data workflows.

workflow orchestration, data pipeline, scheduler

Data Pipeline Framework

Python

10.9k

kedro-org/kedro

A Python framework for building reproducible, maintainable, and modular data engineering and data science pipelines using software engineering best practices.

data science, data engineering, mlops

Dataflow Orchestration Library

Python

2.5k

apache/hamilton

Apache Hamilton is a lightweight Python library that enables data scientists and engineers to define testable, modular, and self-documenting dataflows (DAGs) with built-in lineage and metadata, portable across any Python environment.

python, dataflow, dag

Data Orchestration Platform

Python

15.4k

dagster-io/dagster

A cloud-native data pipeline orchestrator designed for the development, production, and observation of data assets, featuring integrated lineage, observability, and a declarative programming model.

orchestration, data pipeline, etl

Replaces:

Talend Data Integration, Informatica PowerCenter...

Workflow Orchestration Framework

Python

22.3k

PrefectHQ/prefect

Prefect is a Python-based workflow orchestration framework designed to build resilient, dynamic data pipelines that automate processes and recover from unexpected changes.

python, workflow orchestration, data pipelines

Data Orchestration Platform

Java

1.4k

apache/hop

An open-source platform designed to facilitate all aspects of data and metadata orchestration, enabling efficient data integration and pipeline management.

data orchestration, etl, data integration

Replaces:

Talend, Informatica PowerCenter

Airflow Extension for dbt Orchestration

apache airflow

1.2k

astronomer/astronomer-cosmos

Integrate dbt Core projects seamlessly into Apache Airflow DAGs and Task Groups, enabling robust data transformation orchestration.

dbt, apache airflow, etl

LLM-powered Data Processing Framework

Python

3.7k

ucbepic/docetl

DocETL is an agentic LLM-powered framework designed for building and executing complex data processing and ETL pipelines, especially for documents.

llm, etl, data processing

AI Agent Context Management Framework

Python

9.6k

cocoindex-io/cocoindex

CocoIndex is an incremental data indexing framework that provides continuously fresh context from diverse enterprise data sources for AI agents and LLM applications.

ai-agents, llm, rag

Tags: #etl

apache/seatunnel

Unstructured-IO/unstructured

apache/airflow

kedro-org/kedro

apache/hamilton

dagster-io/dagster

PrefectHQ/prefect

apache/hop

astronomer/astronomer-cosmos

ucbepic/docetl

cocoindex-io/cocoindex