Setup

Key Libraries

| Library | Rationale | Notes |
| --- | --- | --- |
| pdfplumber | pdf to img to str | Requires Wand, Pillow, and pdfminer.six. Wand in turn depends on ImageMagick: libmagickwand-dev via APT on Debian/Ubuntu, imagemagick via Homebrew on macOS. |
| opencv-python | img manipulation | Wrapper around OpenCV, used to modify PDF-derived images so that they are ready for OCR. |
| pytesseract | from img to str | From the repo: "Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file." |
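
To make the division of labor in the table concrete, here is a minimal sketch of how the three libraries could be chained. The function name `ocr_first_page` and the choice of an Otsu threshold are assumptions for illustration, not the project's actual pipeline; the library imports are deferred so the module loads even before ImageMagick and Tesseract are installed.

```python
from pathlib import Path


def ocr_first_page(pdf_path: Path, resolution: int = 300) -> str:
    """Render page 1 of a PDF, binarize it, and return the OCR text.

    Hypothetical helper: the preprocessing step (grayscale + Otsu) is an
    assumption, chosen only to show where opencv-python fits.
    """
    # Deferred imports: these need the native libraries described below.
    import cv2
    import numpy as np
    import pdfplumber
    import pytesseract

    with pdfplumber.open(pdf_path) as pdf:
        # pdfplumber: pdf -> img (PageImage.original is a PIL image)
        pil_img = pdf.pages[0].to_image(resolution=resolution).original

        # opencv-python: grayscale, then Otsu threshold to prepare for OCR
        rgb = np.array(pil_img.convert("RGB"))
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        _, binary = cv2.threshold(
            gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
        )

        # pytesseract: img -> str
        return pytesseract.image_to_string(binary)
```

Calling `ocr_first_page(Path("sample.pdf"))` would only succeed once the setup steps in the rest of this page are complete.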

macOS

Local Device

Install the common libraries on macOS with Homebrew:

Bash
brew install tesseract
brew install imagemagick
brew info imagemagick # check version

The last command shows the local folder where imagemagick is installed.

Bash
==> imagemagick: stable 7.1.1-17 (bottled), HEAD # note the version number
Tools and libraries to manipulate images in many formats
https://imagemagick.org/index.php
/opt/homebrew/Cellar/imagemagick/7.1.1-17 (807 files, 31MB) * # first part is the local folder
x x x
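
If the folder needs to be extracted by a script rather than read by eye, something along these lines could work. It is a sketch that assumes the `brew info` output keeps the layout shown above; `cellar_path` is a hypothetical helper name.

```python
import re
from typing import Optional

SAMPLE = """\
==> imagemagick: stable 7.1.1-17 (bottled), HEAD
Tools and libraries to manipulate images in many formats
https://imagemagick.org/index.php
/opt/homebrew/Cellar/imagemagick/7.1.1-17 (807 files, 31MB) *
"""


def cellar_path(brew_info: str) -> Optional[str]:
    """Return the install folder from `brew info` output, or None."""
    # The install line starts with an absolute path and is followed by
    # a parenthesised summary such as "(807 files, 31MB)".
    match = re.search(
        r"^(/\S*/Cellar/imagemagick/\S+) \(", brew_info, re.MULTILINE
    )
    return match.group(1) if match else None


print(cellar_path(SAMPLE))  # /opt/homebrew/Cellar/imagemagick/7.1.1-17
```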

Virtual Environment

Update .env whenever imagemagick changes

Wand locates the ImageMagick shared library through the MAGICK_HOME folder. Python (at least 3.11) does not seem to find this folder on its own, so its location must be declared explicitly. The folder changes whenever a new version is installed via brew upgrade imagemagick.

Create a .env file and set the folder as the environment variable MAGICK_HOME:

Text Only
MAGICK_HOME=/opt/homebrew/Cellar/imagemagick/7.1.1-17

This configuration allows pdfplumber to detect imagemagick.
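
Since the folder changes with every ImageMagick upgrade, the .env entry has to be rewritten each time. A minimal sketch of that update, assuming a plain `KEY=value` file with one variable per line; `set_magick_home` is a hypothetical helper, not part of this project.

```python
from pathlib import Path


def set_magick_home(env_file: Path, folder: str) -> None:
    """Replace (or append) the MAGICK_HOME entry in a .env file."""
    lines = env_file.read_text().splitlines() if env_file.exists() else []
    # Drop any stale entry and keep every other variable untouched.
    kept = [ln for ln in lines if not ln.startswith("MAGICK_HOME=")]
    kept.append(f"MAGICK_HOME={folder}")
    env_file.write_text("\n".join(kept) + "\n")
```

For example, after an upgrade one could call `set_magick_home(Path(".env"), new_folder)` with the folder reported by `brew info imagemagick`.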

Effect of not setting MAGICK_HOME:

Python
>>> import pdfplumber
>>> pdfplumber.open(<testpath>).pages[0].to_image(resolution=300) # ERROR
Text Only
OSError: cannot find library; tried paths: []

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
...
ImportError: MagickWand shared library not found.
You probably had not installed ImageMagick library.
Try to install:
  brew install freetype imagemagick

With MAGICK_HOME:

Python
>>> import pdfplumber
>>> pdfplumber.open(<testpath>).pages[0].to_image
PIL.Image.Image # image library and type detected
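
The ImportError above appears only once an image is requested, so a small preflight check can surface a stale or missing MAGICK_HOME earlier and with a clearer message. This is a sketch under that assumption; `magick_home_problem` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def magick_home_problem(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of what is wrong with MAGICK_HOME, or None."""
    value = environ.get("MAGICK_HOME")
    if not value:
        return "MAGICK_HOME is not set"
    if not Path(value).is_dir():
        # Typical after `brew upgrade imagemagick` moved the folder.
        return f"MAGICK_HOME points to a missing folder: {value}"
    return None


problem = magick_home_problem()
if problem:
    print(f"warning: {problem}")
```

Note that the check only confirms the folder exists; it does not verify that the MagickWand library inside it can be loaded.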

Create the environment with poetry install, using these dependencies in pyproject.toml:

TOML
[tool.poetry.dependencies]
python = "^3.11"
python-dotenv = "^1.0"
pdfplumber = "^0.9"
pillow = "^9.5"
opencv-python = "^4.7"
pytesseract = "^0.3.10"

pytest

Ensure inclusion of pytest-env

Add to pyproject.toml:

TOML
[tool.pytest.ini_options]
env = ["MAGICK_HOME=/opt/homebrew/Cellar/imagemagick/7.1.1-17"]

Dockerfile

See resources:

  1. ImageMagick
  2. Stack Overflow, Ankur
  3. nickferrando Gist
  4. Stack Overflow, Shreyesh Desai
Docker
# IM_VER and IM_POLICY are assumed to be defined earlier in the Dockerfile (not shown here)
ENV PYTHONDONTWRITEBYTECODE=1 \
  PYTHONUNBUFFERED=1 \
  MAGICK_HOME=/usr/local/lib/ImageMagick-$IM_VER

RUN apt update \
  && apt install -y \
    build-essential wget pkg-config \
    libxml2-dev zlib1g-dev \
    ghostscript tesseract-ocr tesseract-ocr-fra \
    libjpeg62-turbo-dev libtiff-dev libpng-dev libsm6 libxext6 ffmpeg libfontconfig1 libxrender1 libgl1-mesa-glx libfreetype6-dev \
  && apt clean

RUN mkdir -p /tmp/distr && \
  cd /tmp/distr && \
  wget https://download.imagemagick.org/ImageMagick/download/releases/ImageMagick-$IM_VER.tar.xz && \
  tar xvf ImageMagick-$IM_VER.tar.xz && \
  cd ImageMagick-$IM_VER && \
  ./configure --enable-shared=yes --disable-static --without-perl && \
  make && \
  make install && \
  ldconfig /usr/local/lib && \
  cd /tmp && \
  rm -rf distr

RUN if [ -f "$IM_POLICY" ] ; then sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<policy domain="coder" rights="read|write" pattern="PDF" \/>/g' "$IM_POLICY" ; else echo "did not see file $IM_POLICY" ; fi

The Dockerfile is intended for testing purposes:

Bash
docker build --tag ocr . && docker run ocr # will run pytest
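
The final RUN step of the Dockerfile is dense, so here is the same substitution written out in Python purely as an illustration of what the sed expression does to policy.xml. `allow_pdf` is a hypothetical name; the Dockerfile itself does not use this code.

```python
OLD = '<policy domain="coder" rights="none" pattern="PDF" />'
NEW = '<policy domain="coder" rights="read|write" pattern="PDF" />'


def allow_pdf(policy_xml: str) -> str:
    """Mirror the sed substitution: grant read|write on the PDF coder."""
    # The sed pattern contains no regex metacharacters, so a literal
    # string replacement is equivalent.
    return policy_xml.replace(OLD, NEW)


before = f"<policymap>\n  {OLD}\n</policymap>"
print(allow_pdf(before))
```

A policy file without the restrictive PDF line passes through unchanged, which matches sed's behavior when the pattern does not occur.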

GitHub Actions

Both tesseract and imagemagick are also installed as prerequisites in .github/workflows/main.yaml:

.github/workflows/main.yaml
steps:
  # see https://github.com/madmaze/pytesseract/blob/master/.github/workflows/ci.yaml
  - name: Install tesseract
    run: sudo apt-get -y update && sudo apt-get install -y tesseract-ocr tesseract-ocr-fra
  - name: Print tesseract version
    run: echo $(tesseract --version)

  # see https://github.com/jsvine/pdfplumber/blob/stable/.github/workflows/tests.yml
  - name: Install ghostscript & imagemagick
    run: sudo apt update && sudo apt install ghostscript libmagickwand-dev
  - name: Remove policy.xml
    run: sudo rm /etc/ImageMagick-6/policy.xml # this needs to be removed or the test won't run