MarkItDown
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to textract, but with a focus on preserving important document structure and content as Markdown (including headings, lists, tables, links, etc.). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools, and may not be the best option for high-fidelity document conversions for human consumption.
MarkItDown currently supports the conversion from:
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- YouTube URLs
- EPubs
- … and more!
Why Markdown?
Markdown is extremely close to plain text, with minimal markup or formatting, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI’s GPT-4o, natively “speak” Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions are also highly token-efficient.
Prerequisites
MarkItDown requires Python 3.10 or higher. It is recommended to use a virtual environment to avoid dependency conflicts.
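As a quick sanity check before installing, the version requirement above can be verified from Python itself. This is a minimal sketch: the helper name `python_is_supported` and the constant `MIN_PYTHON` are illustrative and not part of MarkItDown.

```python
import sys

# Minimum interpreter version stated in the prerequisites above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None):
    """Return True if the given (or current) interpreter meets MIN_PYTHON.

    Hypothetical helper for illustration; MarkItDown does not ship it.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuple comparison is lexicographic.
    return tuple(version_info[:2]) >= MIN_PYTHON


print("Supported interpreter" if python_is_supported() else "Python 3.10+ required")
```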
With the standard Python installation, you can create and activate a virtual environment using the following commands:
python -m venv .venv
source .venv/bin/activate
If using uv, you can create a virtual environment with:
uv venv --python=3.12 .venv
source .venv/bin/activate
# NOTE: Be sure to use 'uv pip install' rather than just 'pip install' to install packages in this virtual environment
If you are using Anaconda, you can create a virtual environment with:
conda create -n markitdown python=3.12
conda activate markitdown
Installation
To install MarkItDown, use pip: pip install 'markitdown[all]'. Alternatively, you can install it from the source:
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
Usage
Command-Line
markitdown path-to-file.pdf > document.md
Or use -o to specify the output file:
markitdown path-to-file.pdf -o document.md
You can also pipe content:
cat path-to-file.pdf | markitdown
Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the [all] option. However, you can also install them individually for more control. For example:
pip install 'markitdown[pdf, docx, pptx]'
will install only the dependencies for PDF, DOCX, and PPTX files.
At the moment, the following optional dependencies are available:
- [all] Installs all optional dependencies
- [pptx] Installs dependencies for PowerPoint files
- [docx] Installs dependencies for Word files
- [xlsx] Installs dependencies for Excel files
- [xls] Installs dependencies for older Excel files
- [pdf] Installs dependencies for PDF files
- [outlook] Installs dependencies for Outlook messages
- [az-doc-intel] Installs dependencies for Azure Document Intelligence
- [audio-transcription] Installs dependencies for audio transcription of wav and mp3 files
- [youtube-transcription] Installs dependencies for fetching YouTube video transcription
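When installs are scripted, it can help to validate the requested feature groups before handing them to pip. The sketch below builds a requirement string from the extras listed above; `KNOWN_EXTRAS` and `install_spec` are hypothetical names introduced for illustration, and the set simply mirrors this list, so it may drift out of date.

```python
# Extras copied from the list above (assumption: the list is current).
KNOWN_EXTRAS = {
    "all", "pptx", "docx", "xlsx", "xls", "pdf", "outlook",
    "az-doc-intel", "audio-transcription", "youtube-transcription",
}


def install_spec(extras):
    """Build a pip requirement string such as 'markitdown[pdf,docx]'.

    Hypothetical helper; raises ValueError for extras not in KNOWN_EXTRAS.
    """
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"Unknown extras: {', '.join(unknown)}")
    # Drop duplicates while keeping the order in which extras were given.
    ordered = list(dict.fromkeys(extras))
    if not ordered:
        return "markitdown"
    return f"markitdown[{','.join(ordered)}]"


print(install_spec(["pdf", "docx", "pptx"]))
```

The resulting string can be passed to pip as a single argument, for example `subprocess.run([sys.executable, "-m", "pip", "install", install_spec(["pdf"])], check=True)`.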
Plugins
MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
markitdown --list-plugins
To enable plugins use:
markitdown --use-plugins path-to-file.pdf
To find available plugins, search GitHub for the hashtag #markitdown-plugin. To develop a plugin, see packages/markitdown-sample-plugin.
Azure Document Intelligence
To use Azure Document Intelligence for conversion:
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
More information about how to set up an Azure Document Intelligence Resource can be found here
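The command-line flags shown so far (-o, --use-plugins, -d, -e) can also be assembled from Python when the CLI is driven by a script. This is a rough sketch rather than an official wrapper: `build_markitdown_command` and `run_markitdown` are hypothetical helpers, and the argument order simply follows the examples above.

```python
import subprocess


def build_markitdown_command(source, output=None, use_plugins=False,
                             docintel_endpoint=None):
    """Return an argv list for the markitdown CLI (hypothetical helper)."""
    cmd = ["markitdown"]
    if use_plugins:
        cmd.append("--use-plugins")  # plugins are disabled by default
    cmd.append(str(source))
    if output is not None:
        cmd += ["-o", str(output)]  # write to a file instead of stdout
    if docintel_endpoint is not None:
        # -d enables Document Intelligence, -e supplies the endpoint.
        cmd += ["-d", "-e", docintel_endpoint]
    return cmd


def run_markitdown(source, **options):
    """Run the CLI; assumes `markitdown` is installed and on PATH."""
    return subprocess.run(build_markitdown_command(source, **options), check=True)


print(build_markitdown_command("path-to-file.pdf", output="document.md"))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.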
Python API
Basic usage in Python:
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("test.xlsx")
print(result.text_content)
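The same pattern extends to converting a whole folder. The following is a minimal sketch that relies only on the `convert()` method and `text_content` attribute shown above; the function `convert_directory`, its parameters, and the `.md` output naming are assumptions made for illustration.

```python
from pathlib import Path


def convert_directory(converter, src_dir, dst_dir, pattern="*"):
    """Convert every matching file in src_dir and write Markdown to dst_dir.

    `converter` is any object with a convert(path) method whose result has
    a text_content attribute, e.g. a MarkItDown instance.
    Returns the list of written paths. Hypothetical helper.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src_dir.glob(pattern)):
        if not path.is_file():
            continue  # skip sub-directories
        result = converter.convert(str(path))
        # Keep the original file name and append ".md" to avoid collisions
        # between, say, report.docx and report.pdf.
        out_path = dst_dir / (path.name + ".md")
        out_path.write_text(result.text_content, encoding="utf-8")
        written.append(out_path)
    return written
```

A typical call would be `convert_directory(MarkItDown(), "docs", "out", pattern="*.xlsx")` after `from markitdown import MarkItDown`. Errors from individual files are not caught here, so one unsupported file stops the run.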
Document Intelligence conversion in Python:
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
To use Large Language Models for image descriptions, provide llm_client and llm_model:
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
Docker
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Tip
MarkItDown now offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See markitdown-mcp for more information.
Important
Breaking changes from 0.0.1 to 0.1.0:
- Dependencies are now organized into optional feature-groups (see Optional Dependencies). Use pip install 'markitdown[all]' to have backward-compatible behavior.
- convert_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, like io.StringIO.
- The DocumentConverter class interface has changed to read from file-like streams rather than file paths. No temporary files are created anymore. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
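For code that still holds text streams, one possible migration path for the convert_stream() change is to re-encode the text into an io.BytesIO before converting. This is a sketch under assumptions: `as_binary_stream` and `convert_text_stream` are hypothetical helpers, UTF-8 is an assumed encoding, and convert_stream() is called with only the stream, leaving format detection to MarkItDown.

```python
import io


def as_binary_stream(stream, encoding="utf-8"):
    """Return a binary file-like object for a text or binary stream.

    Text streams (io.StringIO, files opened in text mode) are read fully
    and re-encoded; binary streams are returned unchanged.
    """
    if isinstance(stream, io.TextIOBase):
        return io.BytesIO(stream.read().encode(encoding))
    return stream


def convert_text_stream(text_stream):
    """Convert a text stream with MarkItDown (requires markitdown >= 0.1.0)."""
    # Imported lazily so the helper above is usable without markitdown.
    from markitdown import MarkItDown

    md = MarkItDown()
    return md.convert_stream(as_binary_stream(text_stream)).text_content
```

Reading the whole text stream into memory is acceptable for small inputs; for large files it is simpler to reopen the file in binary mode.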