Document Processing Library for AI
57.6k 2026-04-12

docling-project/docling

A Python library for processing and parsing diverse document formats and preparing them for use in generative AI ecosystems.

Core Features

Parsing of multiple document formats including PDF, DOCX, HTML, images, and audio.
Advanced PDF understanding, extracting layout, table structure, code, and formulas.
Unified DoclingDocument representation and various export options like Markdown and JSON.
Plug-and-play integrations with leading AI frameworks such as LangChain, LlamaIndex, and Haystack.
Extensive OCR support for scanned documents and images, plus audio processing with automatic speech recognition (ASR).

Quick Start

pip install docling
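
Once the package is installed, a first conversion might look like the sketch below. It assumes Docling's DocumentConverter class and the export_to_markdown() method of the resulting document. The helper name convert_to_markdown and its optional converter parameter are hypothetical, added only so the function can be exercised without loading Docling's models.

```python
from pathlib import Path
from typing import Any, Optional, Union


def convert_to_markdown(
    source: Union[str, Path], converter: Optional[Any] = None
) -> str:
    """Convert one document to Markdown.

    Hypothetical helper, not part of Docling itself. `converter` is any
    object with a `convert(source)` method whose result exposes
    `.document.export_to_markdown()`.
    """
    if converter is None:
        # Imported lazily so this module loads even without Docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # Run the conversion, then export the unified document as Markdown.
    result = converter.convert(str(source))
    return result.document.export_to_markdown()


if __name__ == "__main__":
    import sys

    # Usage: python convert.py path/or/url/to/document
    if len(sys.argv) > 1:
        print(convert_to_markdown(sys.argv[1]))
```

Passing the converter in as an argument is a design choice for this sketch: one converter instance can then be reused across many documents instead of being rebuilt on every call.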

Detailed Introduction

Docling is an open-source Python library for preparing unstructured and semi-structured documents for use with generative AI models. It parses a wide range of formats, from PDFs and office documents to images and audio, and extracts structural and semantic information such as layout, table structure, code, and formulas. A unified document representation (DoclingDocument) and integrations with AI frameworks such as LangChain, LlamaIndex, and Haystack let developers feed diverse data sources into AI workflows in a consistent form.
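
To illustrate the unified representation, the sketch below writes one converted document out in both export formats named above, Markdown and JSON. It assumes the document object offers export_to_markdown() and export_to_dict(), and that the latter returns JSON-serializable data. The helper write_exports and its parameters are hypothetical names chosen for this example.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_exports(document: Any, out_dir: Path, stem: str) -> Dict[str, Path]:
    """Write Markdown and JSON exports of one converted document.

    Hypothetical helper. `document` is assumed to behave like a
    DoclingDocument: it must provide `export_to_markdown()` and
    `export_to_dict()`.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"

    # Markdown export: readable text suited to prompts and previews.
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")

    # JSON export: the structured form, serialized from a plain dict.
    payload = json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2)
    json_path.write_text(payload, encoding="utf-8")

    return {"markdown": md_path, "json": json_path}
```

With a real conversion result, the call would be along the lines of write_exports(result.document, Path("out"), "report"), where result comes from a converter's convert() call.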
