Searching for PDFs by content using the command line

The commands below find PDF files under the user's home directory whose contents include a given search string.

Prerequisites

  • Ensure pdfgrep is installed via Homebrew (brew install pdfgrep).
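A quick way to confirm the prerequisite before running a long search is to check that the command is on the PATH. The sketch below is only an illustration: the function name require_tools is hypothetical and not part of pdfgrep or Homebrew.

```shell
# Hypothetical helper: report which of the named commands are missing from PATH.
# Returns 0 when every tool is available, 1 otherwise.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running require_tools pdfgrep before the search prints "missing: pdfgrep" on standard error if the tool still needs to be installed.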

Searching Inside Text-Based PDFs

find ~/ -type f -name "*.pdf" -exec pdfgrep -il "MY_SEARCH_STRING" {} + 2>/dev/null

Explanation

  • find ~/ - Searches within the user's home directory.
  • -type f - Looks for regular files.
  • -name "*.pdf" - Filters for files with a .pdf extension.
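  • -exec pdfgrep -il "MY_SEARCH_STRING" {} + - Runs pdfgrep on the PDF files found. The trailing + passes many file names to each pdfgrep invocation instead of starting one process per file.
      • -i makes the search case-insensitive.
      • -l prints only the names of matching files.
  • 2>/dev/null - Discards all error output. This hides the permission errors from directories the user cannot access, but also any warnings pdfgrep prints about PDFs it cannot read.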

Notes

  • If the search needs to be limited to a specific folder, replace ~/ with the desired directory path.
  • This method does not search inside image-based PDFs that require OCR processing.
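If the search is run often, or against different folders, the command above can be wrapped in a small shell function. This is a sketch under stated assumptions: the name pdf_search and its argument order are hypothetical, and the body is the same find and pdfgrep invocation shown earlier with the search term and directory turned into parameters.

```shell
# Hypothetical helper: pdf_search TERM [DIR]
# Lists PDFs under DIR (default: the home directory) whose text contains TERM.
pdf_search() {
    local term="$1"
    local dir="${2:-$HOME}"
    # Same command as above; quoting "$dir" keeps paths with spaces intact.
    find "$dir" -type f -name "*.pdf" -exec pdfgrep -il "$term" {} + 2>/dev/null
}
```

For example, pdf_search "MY_SEARCH_STRING" ~/Documents restricts the search to the Documents folder, while pdf_search "MY_SEARCH_STRING" alone searches the whole home directory.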

Searching Inside Image-Based PDFs

If the PDFs contain scanned images rather than selectable text, use OCR to extract text before searching:

Convert PDFs to Text with OCR

First, install the necessary tools. tesseract is the OCR engine; poppler provides pdftotext and pdftoppm:

brew install tesseract poppler

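Then loop over the PDFs and search the text extracted from each one for "MY_SEARCH_STRING":

find ~/ -type f -name "*.pdf" -print0 2>/dev/null | while IFS= read -r -d '' file; do
    pdftotext "$file" - 2>/dev/null | grep -qi "MY_SEARCH_STRING" && echo "$file"
done

Explanation

  • find ... -print0 | while IFS= read -r -d '' file - Passes file names to the loop separated by NUL bytes, so paths containing spaces are handled correctly. read -d requires bash or zsh.
  • pdftotext "$file" - - Extracts the PDF's embedded text layer and writes it to standard output (the trailing -). It does not perform OCR itself.
  • grep -qi "MY_SEARCH_STRING" - Searches the extracted text case-insensitively. -q prints nothing and only sets the exit status, so echo prints the file name when there is a match.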

Notes

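  • Ensure poppler, which provides pdftotext, is installed. tesseract comes into play only when pages are actually run through OCR; pdfgrep is not used by this loop.
  • If the search needs to be limited to a specific folder, replace ~/ with the desired directory path.
  • This method finds text-based PDFs and scanned PDFs that already carry an OCR text layer. pdftotext performs no OCR itself, so a scan with no text layer produces no text to search until its pages have been rendered to images and recognized with tesseract.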
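For scans that carry no text layer, pdftotext has nothing to extract, so the pages have to be rendered to images and recognized before grep can match anything. The following is a minimal sketch of that step using pdftoppm (from poppler) and tesseract. The function name ocr_pdf_contains is hypothetical, and the 300 dpi resolution is an assumption rather than a requirement.

```shell
# Hypothetical helper: ocr_pdf_contains FILE TERM
# Exit status 0 if TERM appears in the OCR text of any page of FILE, 1 otherwise.
ocr_pdf_contains() {
    local file="$1"
    local term="$2"
    local workdir img found=1
    workdir=$(mktemp -d) || return 2
    # Render every page to a PNG named page-<n>.png (300 dpi is an assumed setting).
    pdftoppm -r 300 -png "$file" "$workdir/page" 2>/dev/null
    for img in "$workdir"/page*.png; do
        [ -e "$img" ] || continue
        # "stdout" makes tesseract print the recognized text instead of writing a file.
        if tesseract "$img" stdout 2>/dev/null | grep -qiF -- "$term"; then
            found=0
            break
        fi
    done
    rm -rf "$workdir"
    return "$found"
}
```

In the loop above, the pdftotext pipeline could be swapped for ocr_pdf_contains "$file" "MY_SEARCH_STRING" && echo "$file". OCR is far slower than reading a text layer, so limiting the search to a specific folder is worthwhile here.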
