start-ocr Docs

Caveat

ImageMagick setups vary:

  1. Homebrew-based installation
  2. Dockerfile compiles from source
  3. GitHub Actions workflow (main.yml)

This makes some tests flaky, so they need to be written with some tolerance, e.g.:

Sample docstring
>>> contours = get_contours(im, (10,10))
>>> len(contours) in [222, 223]  # one installation outputs 222, the other 223
True

Ideally, the output would be consistent across installations; this is left to revisit another time.

Sample PDF

A simple file is included in the /tests folder to demonstrate certain functions:

Screenshot of the test pdf found in tests/data/test.pdf

fetch.get_page_and_img()

Each page of a PDF file can be opened and cropped via pdfplumber. To parse, it's necessary to convert the PDF to an opencv-compatible image format (i.e. np.ndarray). This function converts a Path object into a pair of objects:

  1. the first part is a pdfplumber.Page
  2. the second part is an openCV image, i.e. np.ndarray

Examples:

Python Console Session
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0) # 0 marks the first page
>>> page.page_number # the first page
1
>>> isinstance(page, Page)
True
>>> isinstance(im, np.ndarray)
True
>>> page.pdf.close()

Parameters:

  - pdfpath (str | Path, required): Path to the PDF file.
  - index (int, required): Zero-based index that determines the page number.

Returns:

  - tuple[Page, np.ndarray]: Page identified by index, with an image of the page (in np format) that can be manipulated.

Source code in src/start_ocr/fetch.py
Python
def get_page_and_img(pdfpath: str | Path, index: int) -> tuple[Page, np.ndarray]:
    """Each page of a PDF file, can be opened and cropped via `pdfplumber`.
    To parse, it's necessary to convert the pdf to an `opencv` compatible-image format
    (i.e. `np.ndarray`). This function converts a `Path` object into a pair of objects:

    1. the first part is a `pdfplumber.Page`
    2. the second part is an openCV image, i.e. `np.ndarray`

    Examples:
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0) # 0 marks the first page
        >>> page.page_number # the first page
        1
        >>> isinstance(page, Page)
        True
        >>> isinstance(im, np.ndarray)
        True
        >>> page.pdf.close()

    Args:
        pdfpath (str | Path): Path to the PDF file.
        index (int): Zero-based index that determines the page number.

    Returns:
        tuple[Page, np.ndarray]: Page identified by `index` with image of the
            page (in np format) that can be manipulated.
    """  # noqa: E501
    with pdfplumber.open(pdfpath) as pdf:
        page = pdf.pages[index]
        img = get_img_from_page(page)
        return page, img

slice.get_contours()

Mental Model

flowchart LR
    pdf --> im[image]
    im --> c[contours]
    c --> c1[contour 1: the header]
    c --> c2[contour 2: the line below the header]
    c --> c3[contour 3: the footer]

Conversion

Converting the PDF to an image format enables get_contours(). Contours can be thought of as tiny fragments within the document that delineate where certain objects in the document are located.

Show Contours

To demonstrate get_contours, I created a helper show_contours which just prints out where the contours are found given a rectangle size that we want to use for the image.

100 x 100 yields 7 contours

100 x 100
>>> from start_ocr import get_page_and_img, show_contours
>>> page, img = get_page_and_img(pdfpath=p, index=0)
>>> rectangle_size_lg = (100,100)
>>> contours = show_contours(img, rectangle_size_lg) # runs get_contours()
dilated contours
Screenshot of applying dilation on the test pdf found in tests/data/test.pdf
Screenshot of applying get_contours() on the test pdf found in tests/data/test.pdf

10 x 10 yields 285 contours

10 x 10
>>> from start_ocr import get_page_and_img, show_contours
>>> page, img = get_page_and_img(pdfpath=p, index=0)
>>> rectangle_size_sm = (10,10)
>>> contours = show_contours(img, rectangle_size_sm) # runs get_contours()
dilated 285 contours
Screenshot of applying dilation on the test pdf found in tests/data/test.pdf
Screenshot of applying get_contours() on the test pdf found in tests/data/test.pdf

get_contours()

Explaining dilation and contours.

The function follows the strategy outlined in Python Tutorials for Digital Humanities. A good explanation of how dilation works is found in this Stack Overflow answer by @nathancy.

Using a tiny rectangle_size of the format (width, height), the function creates a dilated version of the image im and returns the contours found.

Examples:

Python Console Session
>>> from pathlib import Path
>>> from start_ocr.fetch import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> contours = get_contours(im, (50,50))
>>> len(contours)
15
>>> contours = get_contours(im, (10,10))
>>> len(contours) in [222,223]
True

Parameters:

  - im (ndarray, required): The opencv-formatted image.
  - rectangle_size (tuple[int, int], required): The width and height of the structuring element used to dilate the image.
  - test_dilation (bool, optional): If True, a file will be created at the path given by test_dilated_image to illustrate what the dilated image looks like. Defaults to False.
  - test_dilated_image (str | None, optional): Path where the dilated image is written when test_dilation is True. Defaults to "temp/dilated.png".

Returns:

  - list: The contours found based on the specified structuring element.

Source code in src/start_ocr/slice.py
Python
def get_contours(
    im: np.ndarray,
    rectangle_size: tuple[int, int],
    test_dilation: bool = False,
    test_dilated_image: str | None = "temp/dilated.png",
) -> list:
    """Using tiny `rectangle_size` of the format `(width, height)`, create a dilated version
    of the image `im`. The contours found are outputed by this function.

    Examples:
        >>> from pathlib import Path
        >>> from start_ocr.fetch import get_page_and_img
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
        >>> contours = get_contours(im, (50,50))
        >>> len(contours)
        15
        >>> contours = get_contours(im, (10,10))
        >>> len(contours) in [222,223]
        True

    Args:
        im (np.ndarray): The opencv formatted image
        rectangle_size (tuple[int, int]): The width and height of the structuring element used to dilate the image
        test_dilation (bool, optional): If `test_dilation` is `True`, a file will be created at the path given by `test_dilated_image` to illustrate what the dilated image looks like. Defaults to False.
        test_dilated_image (str | None, optional): Path where the dilated image is written when `test_dilation` is True. Defaults to "temp/dilated.png".

    Returns:
        list: The contours found based on the specified structuring element

    """  # noqa: E501
    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (7, 7), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, rectangle_size)
    dilate = cv2.dilate(thresh, kernel, iterations=1)
    if test_dilation and test_dilated_image:
        cv2.imwrite(test_dilated_image, dilate)
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    return sorted(cnts, key=lambda x: cv2.boundingRect(x)[1])

Filtering contours

Each contour can be filtered further to arrive at rectangles that meet the filter criteria:

flowchart TB
    c1[contour 1] --> f[filter]
    c2[contour 2] --> f
    c3[contour 3] --> f
    f --> c2x[filtered contour 1]

For instance, we can try looking for a long horizontal line matching some criteria:

  1. If we imagine the page divided into three equal vertical slices, the line must start in the first slice and end in the third slice.
  2. If we imagine the page to have width X, the line must be longer than X/2, i.e. more than half the page width.

Let's say we want to look for a line longer than half the page width (criterion 2), with its edges positioned in the first and third slices (criterion 1):

Filtering mechanism in practice
imgs = []
page, img = get_page_and_img(pdfpath=p, index=0)
contours = get_contours(img, (10, 10), test_dilation=True)
_, im_w, _ = img.shape
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)  # unpack each contour
    filtering_criteria = [
        w > im_w / 2,  # width greater than half
        x < im_w / 3,  # edge of line on first third vertical slice
        (x + w) > im_w * (2 / 3),  # edge of line on last third vertical slice
    ]
    if all(filtering_criteria):
        obj = CoordinatedImage(img, x, y, w, h)
        obj.greenbox()
        imgs.append(obj)
cv2.imwrite("temp/boxes.png", img)

CoordinatedImage is a data structure that computes values related to a contour's x, y, w, and h.