
What it does
LiteParse is an open-source document parser from the LlamaIndex team. It performs spatial text parsing with bounding boxes entirely locally, with no cloud dependencies. It is built on PDF.js for fast native PDF parsing, with optional Tesseract.js OCR for scanned documents.
Language
TypeScript (71.9%) with Python components (26.5%) for the OCR server.
Install
# npm
npm i -g @llamaindex/liteparse
# Homebrew (macOS/Linux)
brew install llamaindex-liteparse
It can also be built from source.
Key features
- Fast text parsing using PDF.js
- Flexible OCR with built-in Tesseract.js support
- Multiple output formats - JSON and text
- Precise bounding boxes for text positioning
- Screenshot generation for LLM agents
- Multi-platform - Linux, macOS, Windows
- No cloud required - runs fully standalone
Supported formats
Beyond native PDFs, LiteParse automatically converts these formats:
- Office documents (Word, PowerPoint, spreadsheets)
- Images (JPG, PNG, GIF, TIFF, WebP)
Value
LiteParse is a solid alternative to cloud-based document parsers such as LlamaParse (also from the LlamaIndex team, but cloud-hosted). Because it runs locally, it suits privacy-sensitive workflows and air-gapped environments. The bounding box output is useful for layout-aware RAG pipelines, where you need to know where text sits on the page, not just what it says.
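To make the layout-aware point concrete, here is a minimal TypeScript sketch of what a pipeline might do with bounding boxes: keep only the text inside a page region, then rebuild lines in reading order. The `TextItem` and `Region` shapes, the top-left coordinate origin, and the helper names `itemsInRegion` and `toLines` are all assumptions made for illustration. They are not LiteParse's actual JSON schema or API, so check the real output before adapting this.

```typescript
// Hypothetical shape of one parsed text item (an assumption for this sketch;
// LiteParse's real JSON fields may be named or structured differently).
interface TextItem {
  text: string;
  x: number; // left edge, in page units
  y: number; // top edge, in page units (origin assumed at top-left)
  width: number;
  height: number;
}

// Hypothetical rectangular page region, same coordinate system as TextItem.
interface Region {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Keep only items whose bounding box lies fully inside the region,
// e.g. to drop headers and footers before chunking for retrieval.
function itemsInRegion(items: TextItem[], region: Region): TextItem[] {
  return items.filter(
    (it) =>
      it.x >= region.x &&
      it.y >= region.y &&
      it.x + it.width <= region.x + region.width &&
      it.y + it.height <= region.y + region.height
  );
}

// Group items into lines when their top edges are within `tolerance`
// of the first item on the line, then read each line left to right.
function toLines(items: TextItem[], tolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last[0].y - item.y) <= tolerance) {
      last.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines.map((line) =>
    line
      .sort((a, b) => a.x - b.x)
      .map((it) => it.text)
      .join(" ")
  );
}
```

A pipeline could call `toLines(itemsInRegion(items, bodyRegion))` per page to get body text in reading order, keeping the original boxes alongside each chunk so answers can point back to a location on the page. The line tolerance is a tuning knob and depends on the units the parser reports.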