Tags: #data-processing
qax-os/excelize
A pure Go library for reading and writing Microsoft Excel spreadsheet files, supporting multiple spreadsheet formats and providing a streaming API for large datasets.
argoproj/argo-workflows
An open-source, container-native workflow engine for Kubernetes, designed to orchestrate parallel jobs and complex multi-step tasks.
huggingface/datasets
A lightweight library providing one-line dataloaders and efficient pre-processing tools for a vast hub of AI datasets, supporting various ML frameworks.
Eventual-Inc/Daft
A high-performance data engine for AI and multimodal workloads, processing diverse data types at scale with Python and Rust.
vortex-data/vortex
An extensible, open-source columnar file format and toolkit designed for high-performance data processing and storage, especially on object storage.
pditommaso/awesome-pipeline
A curated list of pipeline toolkits, frameworks, and libraries for workflow management and data processing.
rom1504/img2dataset
A highly efficient command-line tool to download, resize, and package large sets of image URLs into machine learning datasets.
mbloch/mapshaper
A JavaScript-based tool for editing and transforming geospatial data in formats such as Shapefile, GeoJSON, and TopoJSON, offering both command-line and interactive web interfaces.
ucbepic/docetl
An agentic, LLM-powered framework for building and executing complex data processing and ETL pipelines, especially for documents.