Markdownee

19 Apr 2025

Branching out my simple converter script to a full-fledged project.

The goal is to build a powerful Markdown converter that handles a variety of input formats and turns them into the cleanest Markdown possible, testing several approaches along the way and possibly leveraging AI for post-processing.

This project matters because it is the foundation for feeding clean data into AI workflows.

GitHub

Project code here:

Libraries

Here are some libraries to test.

Microsoft's MarkItDown

Currently implemented.
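As a rough sketch of what a MarkItDown-based conversion step can look like (not necessarily how this project's script does it): the `MarkItDown().convert(...)` call and the `text_content` attribute come from the library, while the function names `markdown_path_for` and `convert_with_markitdown` and the choice to write the output next to the source file are assumptions made for illustration.

```python
from pathlib import Path


def markdown_path_for(src: Path) -> Path:
    """Return the output path: same folder and stem, with a .md extension."""
    return src.with_suffix(".md")


def convert_with_markitdown(src: Path) -> Path:
    """Convert one file to Markdown with MarkItDown and write it next to the source."""
    # Imported here so the path helper above works without the package installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(src))
    dst = markdown_path_for(src)  # hypothetical output convention
    dst.write_text(result.text_content, encoding="utf-8")
    return dst
```

Calling `convert_with_markitdown(Path("deck.pptx"))` would then write `deck.md` in the same folder, assuming the `markitdown` package is installed.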

QuivrHQ/MegaParse

Pandoc - CLI document converter

https://pandoc.org/

Pandoc is a universal document converter that converts files between markup formats, including Markdown, HTML, LaTeX, and more. It also supports automatic citations and bibliographies, plus customization through templates and filters.
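Since Pandoc is a CLI tool, one way to test it alongside the Python libraries is to call it through `subprocess`. The sketch below is only an illustration: the helper names are hypothetical, and the choice of `gfm` (GitHub-Flavored Markdown) as the target and `--wrap=none` to avoid hard line wraps are assumptions about what "cleanest Markdown" means here.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(src: Path, dst: Path, target: str = "gfm") -> list[str]:
    """Build the pandoc argument list; pandoc infers the input format from the extension."""
    return ["pandoc", str(src), "--to", target, "--wrap=none", "--output", str(dst)]


def convert_with_pandoc(src: Path, target: str = "gfm") -> Path:
    """Convert src to Markdown next to the original file and return the new path."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    dst = src.with_suffix(".md")  # hypothetical output convention
    subprocess.run(build_pandoc_command(src, dst, target), check=True)
    return dst
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact command before running it.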

ppt2desc: Convert PowerPoint files into semantically rich text using vision language models

MinerU: A High-Quality PDF-to-Markdown/JSON Converter Worth Checking Out

markpdfdown: A high-quality PDF to Markdown tool based on large language model visual recognition.

27 Jul 2025

E2M API: converts everything to Markdown (an LLM-friendly format).

Google LangExtract

It does not extract directly to Markdown, but it might be worth exploring for trustworthy, higher-quality structured output in JSONL, which could later be converted to Markdown.
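The JSONL-to-Markdown step could be as small as the sketch below. It uses only the standard library, and the field names `class` and `text` are hypothetical placeholders, not LangExtract's actual output schema, so they would need to be adapted to whatever the real records contain.

```python
import json


def jsonl_to_markdown(jsonl_text: str) -> str:
    """Render each JSONL record as a Markdown bullet grouped under a class heading."""
    groups: dict[str, list[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "class" and "text" are assumed field names, not a documented schema.
        groups.setdefault(record["class"], []).append(record["text"])

    parts = []
    for name, texts in groups.items():
        parts.append(f"## {name}")
        parts.extend(f"- {text}" for text in texts)
        parts.append("")  # blank line between groups
    return "\n".join(parts).rstrip() + "\n"
```

Groups appear in the order their class is first seen, which keeps the Markdown close to the order of the source document.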

Dots OCR - Multilingual Document Text Extraction

A state-of-the-art image/pdf-to-markdown vision language model for intelligent document processing.
