Tags: #data-extraction
firecrawl/firecrawl
An API for AI agents to reliably search, scrape, and interact with the web, providing clean, LLM-ready data at scale.
docling-project/docling
Docling simplifies document processing by parsing diverse formats, with advanced PDF understanding, and integrates with the generative AI ecosystem.
Unstructured-IO/unstructured
An open-source ETL solution for transforming complex documents into clean, structured data formats, optimized for language models.
opendataloader-project/opendataloader-pdf
An open-source PDF parser for AI-ready data extraction and automated PDF accessibility remediation, offering benchmark-leading accuracy.
mishushakov/llm-scraper
A TypeScript library that leverages Large Language Models to extract structured data from any webpage.
neo4j-labs/llm-graph-builder
An application that transforms diverse unstructured data sources into structured Neo4j knowledge graphs using Large Language Models (LLMs) and LangChain.
katanaml/sparrow
A production-ready platform for structured data extraction and instruction calling using ML, LLM, and Vision LLM technologies.
ScrapeGraphAI/Scrapegraph-ai
A Python library that leverages LLMs and graph logic to simplify web scraping and data extraction from various sources.
emcf/thepipe
A Python library for extracting clean markdown, multimodal media, and structured data from complex documents using vision-language models.