Tags: #data-extraction
firecrawl/firecrawl
A web data API that provides clean, LLM-ready data for AI agents, supporting scalable search, scraping, and interaction with the web.
docling-project/docling
A Python library that parses diverse document formats and prepares them for integration with generative AI ecosystems.
D4Vinci/Scrapling
An adaptive Python web scraping framework designed to handle everything from single requests to full-scale crawls, featuring anti-bot bypass and self-healing parsers.
Unstructured-IO/unstructured
An open-source ETL solution for transforming complex documents into clean, structured data formats, optimized for language models.
run-llama/llama_index
An open-source framework for building agentic applications, specializing in document processing, OCR, parsing, and indexing for LLMs.
opendataloader-project/opendataloader-pdf
An open-source PDF parser for AI-ready data extraction and automated PDF accessibility compliance.
neo4j-labs/llm-graph-builder
A tool that transforms unstructured data into structured Neo4j knowledge graphs using LLMs and LangChain.
ScrapeGraphAI/Scrapegraph-ai
A Python library that leverages LLMs and graph logic to simplify web scraping and data extraction from various sources.
emcf/thepipe
A Python library for extracting clean markdown, multimodal media, and structured data from complex documents using vision-language models.