start-ocr Docs

Caveat

ImageMagick setups vary:

  1. Homebrew-based installation
  2. Dockerfile compiles from source
  3. GitHub Actions workflow (main.yml)

This makes some tests flaky, so they need to be written with some tolerance, e.g.:

Sample docstring
>>> contours = get_contours(im, (10,10))
>>> len(contours) in [222, 223]  # one installation outputs 222, the other 223
True

Ideally, the output would be consistent across installations; this is left to revisit another time.

Sample PDF

A simple file is included in the /tests folder to demonstrate certain functions:

Screenshot of the test pdf found in tests/data/test.pdf

fetch.get_page_and_img()

Each page of a PDF file can be opened and cropped via pdfplumber. To parse, it's necessary to convert the PDF to an opencv-compatible image format (i.e. np.ndarray). This function converts a Path object into a pair of objects:

  1. the first part is a pdfplumber.Page
  2. the second part is an openCV image, i.e. np.ndarray

Examples:

Python Console Session
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0) # 0 marks the first page
>>> page.page_number # the first page
1
>>> isinstance(page, Page)
True
>>> isinstance(im, np.ndarray)
True
>>> page.pdf.close()

Parameters:

  - pdfpath (str | Path, required): Path to the PDF file.
  - index (int, required): Zero-based index that determines the page number.

Returns:

  - tuple[Page, np.ndarray]: Page identified by index, with an image of the page (in np format) that can be manipulated.

Source code in src/start_ocr/fetch.py
Python
def get_page_and_img(pdfpath: str | Path, index: int) -> tuple[Page, np.ndarray]:
    """Each page of a PDF file, can be opened and cropped via `pdfplumber`.
    To parse, it's necessary to convert the pdf to an `opencv` compatible-image format
    (i.e. `np.ndarray`). This function converts a `Path` object into a pair of objects:

    1. the first part is a `pdfplumber.Page`
    2. the second part is an openCV image, i.e. `np.ndarray`

    Examples:
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0) # 0 marks the first page
        >>> page.page_number # the first page
        1
        >>> isinstance(page, Page)
        True
        >>> isinstance(im, np.ndarray)
        True
        >>> page.pdf.close()

    Args:
        pdfpath (str | Path): Path to the PDF file.
        index (int): Zero-based index that determines the page number.

    Returns:
        tuple[Page, np.ndarray]: Page identified by `index` with image of the
            page (in np format) that can be manipulated.
    """  # noqa: E501
    with pdfplumber.open(pdfpath) as pdf:
        page = pdf.pages[index]
        img = get_img_from_page(page)
        return page, img

slice.get_contours()

Mental Model

flowchart LR
    pdf --> im[image]
    im --> c[contours]
    c --> c1[contour 1: the header]
    c --> c2[contour 2: the line below the header]
    c --> c3[contour 3: the footer]

Conversion

Converting the PDF to an image format enables get_contours(). Contours can be thought of as tiny fragments within the document that delineate where certain objects in the document are located.

Show Contours

To demonstrate get_contours, I created a helper show_contours which just prints out where the contours are found given a rectangle size that we want to use for the image.

100 x 100 yields 7 contours

100 x 100
>>> from start_ocr import get_page_and_img, show_contours
>>> page, img = get_page_and_img(pdfpath=p, index=0)
>>> rectangle_size_lg = (100,100)
>>> contours = show_contours(img, rectangle_size_lg) # runs get_contours()
dilated contours
Screenshot of applying dilation on the test pdf found in tests/data/test.pdf
Screenshot of applying get_contours() on the test pdf found in tests/data/test.pdf

10 x 10 yields 285 contours

10 x 10
>>> from start_ocr import get_page_and_img, show_contours
>>> page, img = get_page_and_img(pdfpath=p, index=0)
>>> rectangle_size_sm = (10,10)
>>> contours = show_contours(img, rectangle_size_sm) # runs get_contours()
dilated 285 contours
Screenshot of applying dilation on the test pdf found in tests/data/test.pdf
Screenshot of applying get_contours() on the test pdf found in tests/data/test.pdf

get_contours()

Explaining dilation and contours.

The function follows the strategy outlined in Python Tutorials for Digital Humanities. A good explanation of how dilation works is found in this Stack Overflow answer by @nathancy.

Using a tiny rectangle_size of the format (width, height), the function creates a dilated version of the image im and returns the contours found.

Examples:

Python Console Session
>>> from pathlib import Path
>>> from start_ocr.fetch import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> contours = get_contours(im, (50,50))
>>> len(contours)
15
>>> contours = get_contours(im, (10,10))
>>> len(contours) in [222,223]
True

Parameters:

  - im (ndarray, required): The opencv-formatted image.
  - rectangle_size (tuple[int, int], required): The width and height of the structuring element used to dilate the image.
  - test_dilation (bool, optional): If True, a file will be created at the path given by test_dilated_image to illustrate what the dilated image looks like. Defaults to False.
  - test_dilated_image (str | None, optional): Path where the dilated image is written when test_dilation is True. Defaults to "temp/dilated.png".

Returns:

  - list: The contours found based on the specified structuring element.

Source code in src/start_ocr/slice.py
Python
def get_contours(
    im: np.ndarray,
    rectangle_size: tuple[int, int],
    test_dilation: bool = False,
    test_dilated_image: str | None = "temp/dilated.png",
) -> list:
    """Using tiny `rectangle_size` of the format `(width, height)`, create a dilated version
    of the image `im`. The contours found are outputed by this function.

    Examples:
        >>> from pathlib import Path
        >>> from start_ocr.fetch import get_page_and_img
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
        >>> contours = get_contours(im, (50,50))
        >>> len(contours)
        15
        >>> contours = get_contours(im, (10,10))
        >>> len(contours) in [222,223]
        True

    Args:
        im (np.ndarray): The opencv formatted image
        rectangle_size (tuple[int, int]): The width and height of the structuring element used to dilate the image
        test_dilation (bool, optional): If `test_dilation` is `True`, a file will be created at the path given by `test_dilated_image` to illustrate what the dilated image looks like. Defaults to False.
        test_dilated_image (str | None, optional): Path where the dilated image is written when `test_dilation` is True. Defaults to "temp/dilated.png".

    Returns:
        list: The contours found based on the specified structuring element

    """  # noqa: E501
    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (7, 7), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, rectangle_size)
    dilate = cv2.dilate(thresh, kernel, iterations=1)
    if test_dilation and test_dilated_image:
        cv2.imwrite(test_dilated_image, dilate)
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    return sorted(cnts, key=lambda x: cv2.boundingRect(x)[1])

Filtering contours

Each contour can be filtered further to arrive at rectangles that meet the filter criteria:

flowchart TB
    c1[contour 1] --> f[filter]
    c2[contour 2] --> f
    c3[contour 3] --> f
    f --> c2x[filtered contour 1]

For instance, we can try looking for a long horizontal line matching some criteria:

  1. If we imagine the page divided into three equal vertical slices, the line must start in the first slice and end in the third slice.
  2. If we imagine the page to have width X, the line must be longer than X/2, i.e. more than half the page width.

Let's say we want to look for a line longer than half the page width (criterion 2), with its edges positioned in the first and third slices (criterion 1):

Filtering mechanism in practice
imgs = []
page, img = get_page_and_img(pdfpath=p, index=0)
contours = get_contours(img, (10, 10), test_dilation=True)
_, im_w, _ = img.shape
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)  # unpack each contour
    filtering_criteria = [
        w > im_w / 2,  # width greater than half
        x < im_w / 3,  # edge of line on first third vertical slice
        (x + w) > im_w * (2 / 3),  # edge of line on last third vertical slice
    ]
    if all(filtering_criteria):
        obj = CoordinatedImage(img, x, y, w, h)
        obj.greenbox()
        imgs.append(obj)
cv2.imwrite("temp/boxes.png", img)

CoordinatedImage is a data structure that computes values related to a contour's x, y, w, and h.