
LangExtract is an open-source Python library from Google that uses LLMs to pull structured data out of unstructured text. You define what you want to extract with a few examples, and it handles the rest. It is powered by Gemini but works with other models too.
What it does
- Source grounding - maps every extraction to its exact location in the source text. You can visually highlight where each piece of data came from.
- Structured outputs - enforces a consistent output schema based on your few-shot examples. Uses controlled generation in Gemini for guaranteed structured results.
- Long document handling - chunks text, processes in parallel, and runs multiple passes for high recall on large documents.
- Flexible LLM support - works with Gemini-family models, or with local open-source models through the built-in Ollama interface.
- Domain adaptable - define extraction tasks for any domain with just a few examples. No fine-tuning needed.
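To make the source grounding and long-document ideas above concrete, here is a minimal standard-library sketch of the general technique. It is not LangExtract's implementation or API: the names GroundedSpan, chunk_with_offsets, and ground are hypothetical, and the fixed-size chunking and exact-substring matching are simplifying assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GroundedSpan:
    """Hypothetical record tying an extraction to its place in the source."""
    label: str
    text: str
    start: int  # inclusive character offset into the full source
    end: int    # exclusive character offset into the full source


def chunk_with_offsets(source: str, size: int) -> list[tuple[int, str]]:
    """Split the source into fixed-size chunks, keeping each chunk's start offset.

    Assumption: naive fixed-size splitting, which can cut a phrase in half
    at a chunk boundary. A real chunker would likely split on sentences.
    """
    return [(i, source[i:i + size]) for i in range(0, len(source), size)]


def ground(label: str, extracted: str, chunk_start: int, chunk: str) -> GroundedSpan | None:
    """Find the extracted text in its chunk and convert to absolute offsets."""
    pos = chunk.find(extracted)
    if pos == -1:
        return None  # text not found verbatim, so there is nothing to anchor to
    start = chunk_start + pos
    return GroundedSpan(label, extracted, start, start + len(extracted))


source = "Patient was given 250 mg amoxicillin. Follow-up in two weeks."
chunks = chunk_with_offsets(source, 40)

# Pretend a model returned "250 mg" while reading the first chunk.
span = ground("dose", "250 mg", *chunks[0])
if span is not None:
    # Slicing the original source with the stored offsets recovers the text,
    # which is what makes highlighting and auditing possible.
    print(span.label, repr(source[span.start:span.end]))
```

Because every span stores offsets into the full source rather than into its chunk, results from chunks processed in parallel can be merged without losing track of where each extraction came from.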
Install
pip install langextract
Value
Good for extracting medical info from clinical text, parsing legal documents, processing customer feedback, or working in any other domain where you need structured data from messy text. Source grounding is the killer feature: you can trace every extraction back to where it came from.
