Tags: #document-processing
docling-project/docling
A Python library for parsing diverse document formats and preparing them for use in generative AI applications.
Unstructured-IO/unstructured
An open-source ETL solution for transforming complex documents into clean, structured data formats, optimized for language models.
microsoft/markitdown
A Python utility for converting various file formats and office documents into structured Markdown, optimized for LLM consumption and text analysis.
katanaml/sparrow
A production-ready platform for structured data extraction and instruction calling from documents and images using ML, LLM, and Vision LLM technologies.
emcf/thepipe
A Python library for extracting clean Markdown, multimodal media, and structured data from complex documents using vision-language models.