19 Apr 2025
Branching out my simple converter script into a full-fledged project.
The goal is to build a powerful Markdown converter that can handle various input formats and turn them into the cleanest Markdown possible, testing several approaches along the way.
Possibly leveraging AI for post-processing.
This project is important because it is the foundation for feeding clean data into AI workflows.
GitHub
Project code here:

Libraries
Here are some libraries to test.
Microsoft's MarkItDown
Currently implemented.
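
For reference, a minimal sketch of what a MarkItDown-based conversion step can look like. This is not the project's actual code: the helper names `markdown_path_for` and `convert_with_markitdown` are made up for illustration, and it assumes the `markitdown` package is installed.

```python
from pathlib import Path


def markdown_path_for(source: str) -> Path:
    """Return the sibling .md path for a source file (report.docx -> report.md)."""
    return Path(source).with_suffix(".md")


def convert_with_markitdown(source: str) -> Path:
    """Convert one file with MarkItDown and write the Markdown next to the source."""
    # Imported lazily so this module still loads when markitdown is not installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(source)
    target = markdown_path_for(source)
    # text_content holds the converted Markdown as a string.
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

Calling `convert_with_markitdown("report.docx")` would write `report.md` beside the input file. Writing next to the source is just one possible choice for a quick script.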

QuivrHQ/MegaParse

Pandoc - CLI document converter
https://pandoc.org/

Pandoc is a universal document converter that can convert files between various markup formats, including Markdown, HTML, LaTeX, and more. It's useful for converting documents from one format to another, and it also includes features like automatic citations and bibliographies, as well as customization options through templates and filters.
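
Since Pandoc is a command-line tool, one way to try it from a Python script is through `subprocess`. The sketch below is only an assumption about how that could be wired up: the function names are hypothetical, `gfm` (GitHub-Flavored Markdown) is picked as the output format, and Pandoc is left to infer the input format from the file extension.

```python
import shutil
import subprocess


def build_pandoc_command(source: str, target: str, to_format: str = "gfm") -> list[str]:
    """Build the argument list that converts `source` into Markdown at `target`."""
    # -t sets the output format, --wrap=none disables hard line wrapping,
    # -o sets the output file.
    return ["pandoc", source, "-t", to_format, "--wrap=none", "-o", target]


def run_pandoc(source: str, target: str) -> None:
    """Run the conversion, failing early if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source, target), check=True)
```

Keeping the command construction separate from the call that runs it makes the arguments easy to inspect without having Pandoc installed.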
ppt2desc: Convert PowerPoint files into semantically rich text using vision language models

MinerU: A High-Quality PDF-to-Markdown/JSON Converter Worth Checking Out

markpdfdown: A high-quality PDF to Markdown tool based on large language model visual recognition.
27 Jul 2025

E2M API, converting everything to Markdown (an LLM-friendly format).

Google LangExtract
It does not extract directly to Markdown, but might be worth exploring for trustworthy, higher-quality structured output in JSONL, which could be converted to Markdown later.
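
To illustrate the "convert later" step, here is a rough sketch of turning JSONL records into Markdown. The record shape is an assumption: the `label` and `text` fields are invented for this example and are not LangExtract's actual output schema, so the field names would need adapting to the real files.

```python
import json


def jsonl_to_markdown(jsonl_text: str) -> str:
    """Group JSONL records by "label" and render each group as a bullet list."""
    groups: dict[str, list[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        # "label" and "text" are hypothetical field names for this sketch.
        groups.setdefault(record["label"], []).append(record["text"])

    parts = []
    for label, texts in groups.items():
        parts.append(f"## {label}")
        parts.extend(f"- {text}" for text in texts)
        parts.append("")  # blank line after each group
    return "\n".join(parts).strip() + "\n"
```

Each distinct label becomes a heading, with its extracted texts listed underneath in the order they appeared.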

Dots OCR - Multilingual Document Text Extraction
A state-of-the-art image/PDF-to-Markdown vision language model for intelligent document processing.
