14 Dec 2024
Very surprising (and exciting) to see Microsoft release an open-source Python utility for converting various files to Markdown!
My own attempts at writing scripts for this have produced messy results, so I'm excited to test this out.
"The MarkItDown library is a utility tool for converting various files to Markdown (e.g., for indexing, text analysis, etc.)
It presently supports:
PDF (.pdf)
PowerPoint (.pptx)
Word (.docx)
Excel (.xlsx)
Images (EXIF metadata, and OCR)
Audio (EXIF metadata, and speech transcription)
HTML (special handling of Wikipedia, etc.)
Various other text-based formats (csv, json, xml, etc.)"
Install with:
pip install markitdown
The API is simple:
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
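Since the point for me is replacing my own messy scripts, here is a rough sketch of how the call above might be wrapped to convert a whole folder. The names convert_folder and markitdown_convert are my own (hypothetical, not part of the library), the suffix list is just an assumption drawn from the formats quoted above, and the only MarkItDown calls used are the ones already shown: MarkItDown(), .convert(path) and .text_content.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert,
                   suffixes=(".pdf", ".pptx", ".docx", ".xlsx")):
    """Run `convert` on each matching file in src_dir and save the Markdown.

    `convert` is any callable that takes a file path (str) and returns
    Markdown text. Output files keep the full original name plus ".md"
    (report.pdf -> report.pdf.md) so report.pdf and report.docx don't collide.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src_dir.iterdir()):
        # Skip subfolders and anything outside the chosen suffix list.
        if path.is_file() and path.suffix.lower() in suffixes:
            target = out_dir / (path.name + ".md")
            target.write_text(convert(str(path)), encoding="utf-8")
            written.append(target)
    return written


def markitdown_convert(path):
    """Hypothetical adapter around the API shown in this post."""
    # Imported lazily so convert_folder itself works without markitdown installed.
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content
```

Usage would then be something like convert_folder("docs", "docs_md", markitdown_convert). Passing the converter in as a plain callable keeps the folder loop easy to try out with a stand-in function before pointing it at real files.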