Page Components

Header

Page Y-Axis Start

The header represents non-title page content above the main content.

The terminating header line is a non-visible line that separates the decision's header from its main content. We'll use a typographic bottom of the header to signify this line.

Examples:

Python Console Session

>>> from pathlib import Path
>>> from start_ocr import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
>>> int(get_header_line(im, page)) in [76, 77]
True
>>> page.pdf.close()

Parameters:

Name	Type	Description	Default
`im`	`ndarray`	The full page image	required
`page`	`Page`	The pdfplumber page	required

Returns:

Type	Description
`int \| float \| None`	Y-axis point (pdfplumber point) at bottom of header, or `None` if no header is detected

Source code in src/start_ocr/components.py

Python
def get_header_line(im: np.ndarray, page: Page) -> int | float | None:
    """The header represents non-title page content above the main content.

    The terminating header line is a non-visible line that separates the
    decision's header from its main content. We'll use a typographic bottom
    of the header to signify this line.

    Examples:
        >>> from pathlib import Path
        >>> from start_ocr import get_page_and_img
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
        >>> int(get_header_line(im, page)) in [76, 77]
        True
        >>> page.pdf.close()

    Args:
        im (numpy.ndarray): The full page image
        page (Page): The pdfplumber page

    Returns:
        float | None: Y-axis point (pdfplumber point) at bottom of header
    """  # noqa: E501
    im_h, im_w, _ = im.shape
    if hd := get_header_upper_right(im):
        _, y, _, h = hd
        header_end = (y + h) / im_h
        terminal = header_end * page.height
        return terminal
    return None
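
The conversion at the end of `get_header_line()` is a proportional rescaling from image pixels to pdfplumber points. The following is a minimal sketch of that arithmetic only; the helper name `pixel_y_to_points` and the sample numbers are hypothetical and are not taken from the test PDF.

```python
def pixel_y_to_points(y_px: float, im_h: int, page_height: float) -> float:
    """Scale a y-coordinate measured in image pixels to PDF points."""
    return (y_px / im_h) * page_height


# Hypothetical values: a header box starting at y=180 px with a height of
# 40 px, on a 2200 px tall image of a page that is 842 points tall.
header_bottom = pixel_y_to_points(180 + 40, im_h=2200, page_height=842)
print(round(header_bottom, 1))  # 84.2
```

Because only the ratio `(y + h) / im_h` is used, the result should not depend on the resolution at which the page image was rendered.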

Upper Right

The header represents non-title page content above the main content.

It usually consists of three items:

Item	Label	Test PDF
1	Indicator text	`Indicator`
2	Page number	`1`
3	Some other detail	`xyzabc123`

This detects Item (3), on the assumption that it sits in the upper right quarter of the document:

Python

x > im_w / 2  # ensures that it is on the right side of the page
y <= im_h * 0.25  # ensures that it is on the top quarter of the page

Item (3) is the only one above that is likely to run onto a second line, hence choosing this as the typographic bottom for the header makes sense.

Examples:

Python Console Session

>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
>>> isinstance(get_header_upper_right(im), tuple)
True
>>> page.pdf.close()

Parameters:

Name	Type	Description	Default
`im`	`ndarray`	The full page image	required

Returns:

Type	Description
`tuple[int, int, int, int] \| None`	The `(x, y, w, h)` coordinates of the docket, if found.

Source code in src/start_ocr/components.py

Python
def get_header_upper_right(
    im: np.ndarray,
) -> tuple[int, int, int, int] | None:
    """The header represents non-title page content above the main content.

    It usually consists of three items:

    Item | Label | Test PDF
    --:|:--|:--
    1 | Indicator text | `Indicator`
    2 | Page number | `1`
    3 | Some other detail | `xyzabc123`

    This detects Item (3), on the assumption that it sits in the upper right
    quarter of the document:

    ```py
    x > im_w / 2  # ensures that it is on the right side of the page
    y <= im_h * 0.25  # ensures that it is on the top quarter of the page
    ```

    Item (3) is the only one above that is likely to run onto a second line,
    hence choosing this as the typographic bottom for the header makes sense.

    Examples:
        >>> from start_ocr import get_page_and_img
        >>> from pathlib import Path
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
        >>> isinstance(get_header_upper_right(im), tuple)
        True
        >>> page.pdf.close()

    Args:
        im (numpy.ndarray): The full page image

    Returns:
        tuple[int, int, int, int] | None: The coordinates of the docket, if found.
    """  # noqa: E501
    im_h, im_w, _ = im.shape
    for cnt in get_contours(im, (50, 50)):
        x, y, w, h = cv2.boundingRect(cnt)
        if x > im_w / 2 and y <= im_h * 0.25 and w > 200:
            return x, y, w, h
    return None
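
To see the filter in isolation, here is a small sketch that applies the same three conditions to plain `(x, y, w, h)` tuples instead of OpenCV contours. The function name `first_upper_right_box` and the sample boxes are hypothetical; the thresholds mirror the source above.

```python
from typing import Optional

Box = tuple[int, int, int, int]  # (x, y, w, h), the shape cv2.boundingRect returns


def first_upper_right_box(
    boxes: list[Box], im_w: int, im_h: int, min_w: int = 200
) -> Optional[Box]:
    """Return the first box on the right half, in the top quarter, wider than min_w."""
    for x, y, w, h in boxes:
        if x > im_w / 2 and y <= im_h * 0.25 and w > min_w:
            return x, y, w, h
    return None


boxes = [
    (100, 120, 400, 40),    # left half of the page: rejected
    (1200, 1500, 300, 40),  # right half but too far down: rejected
    (1100, 130, 150, 40),   # upper right but too narrow: rejected
    (1000, 140, 350, 60),   # upper right and wide enough: accepted
]
print(first_upper_right_box(boxes, im_w=1700, im_h=2200))  # (1000, 140, 350, 60)
```

Note that the first qualifying box wins, so the result depends on the order in which the contours are produced.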

Page Number

Aside from the first page, which should always be 1, this function gets the first token of at most three digits in the header's text. If no such token is found, return 0.

Examples:

Python Console Session

>>> import pdfplumber
>>> from pathlib import Path
>>> from start_ocr import get_img_from_page
>>> x = Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf"
>>> pdf = pdfplumber.open(x)
>>> page = pdf.pages[1] # page 2
>>> im = get_img_from_page(page)
>>> header_line = get_header_line(im, page)
>>> get_page_num(page, header_line)
2
>>> pdf.close()

Parameters:

Name	Type	Description	Default
`page`	`Page`	The pdfplumber page	required
`header_line`	`int \| float`	The value retrieved from `get_header_line()`	required

Returns:

Type	Description
`int`	The page number, or `0` if none is found in the header

Source code in src/start_ocr/components.py

Python
def get_page_num(page: Page, header_line: int | float) -> int:
    """Aside from the first page, which should always be `1`,
    this function gets the first token of at most three digits in the
    header's text. If no such token is found, return 0.

    Examples:
        >>> import pdfplumber
        >>> from pathlib import Path
        >>> from start_ocr import get_img_from_page
        >>> x = Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf"
        >>> pdf = pdfplumber.open(x)
        >>> page = pdf.pages[1] # page 2
        >>> im = get_img_from_page(page)
        >>> header_line = get_header_line(im, page)
        >>> get_page_num(page, header_line)
        2
        >>> pdf.close()

    Args:
        page (Page): The pdfplumber page
        header_line (int | float): The value retrieved from `get_header_line()`

    Returns:
        int: The page number, or 0 if none is found in the header
    """
    if page.page_number == 1:
        return 1  # The first page should always be page 1

    box = (0, 0, page.width, header_line)
    header = page.crop(box, relative=False, strict=True)
    texts = header.extract_text(layout=True, keep_blank_chars=True).split()
    for text in texts:
        if text.isdigit() and len(text) <= 3:
            return int(text)  # Subsequent pages shall be based on the header

    return 0  # 0 implies that no page number was found in the header
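
The token scan can be tried without a PDF. This is a minimal sketch on plain strings; `first_page_number` is a hypothetical stand-in for the loop above, and the header strings are made up.

```python
def first_page_number(header_text: str, max_digits: int = 3) -> int:
    """Return the first all-digit token of at most max_digits characters, else 0."""
    for token in header_text.split():
        if token.isdigit() and len(token) <= max_digits:
            return int(token)
    return 0


print(first_page_number("Indicator        2        xyzabc123"))  # 2
print(first_page_number("Indicator        xyzabc123"))  # 0
print(first_page_number("Decision 20231 17"))  # 17
```

The three-character limit skips longer numeric tokens such as years or docket numbers, at the cost of not supporting page numbers above 999.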

Lines

Bodyline

Bases: NamedTuple

Each page may be divided into lines which, for our purposes, will refer to an arbitrary segmentation of text based on regex parameters.

Field	Type	Description
`page_num`	int	Page the line belongs to
`order_num`	int	Order in the page
`line`	str	The text found based on segmentation

Source code in src/start_ocr/components.py

Python
class Bodyline(NamedTuple):
    """Each page may be divided into lines which, for our purposes,
    will refer to an arbitrary segmentation of text based on regex parameters.

    Field | Type | Description
    --:|:--:|:--
    `page_num` | int | Page the line belongs to
    `order_num` | int | Order in the page
    `line` | str | The text found based on segmentation
    """

    page_num: int
    order_num: int
    line: str

    @classmethod
    def split(cls, prelim_lines: list[str], page_num: int) -> list[Self]:
        """Get paragraphs using regex `\\s{10,}(?=[A-Z])`
        implying many spaces before a capital letter then
        remove new lines contained in non-paragraph lines.

        Args:
            prelim_lines (list[str]): Previously split text
            page_num (int): Page number assigned to each resulting line

        Returns:
            list[Self]: Bodylines of segmented text
        """
        lines = []
        for order_num, par in enumerate(prelim_lines, start=1):
            obj = cls(
                page_num=page_num,
                order_num=order_num,
                line=line_break.sub(" ", par).strip(),
            )
            lines.append(obj)
        lines.sort(key=lambda obj: obj.order_num)
        return lines

Functions

`split(prelim_lines, page_num)` `classmethod`

Get paragraphs using regex `\s{10,}(?=[A-Z])`, which implies many spaces before a capital letter, then remove new lines contained in non-paragraph lines.

Parameters:

Name	Type	Description	Default
`prelim_lines`	`list[str]`	Previously split text	required
`page_num`	`int`	Page number assigned to each resulting line	required

Returns:

Type	Description
`list[Self]`	Bodylines of segmented text

Source code in src/start_ocr/components.py

Python
@classmethod
def split(cls, prelim_lines: list[str], page_num: int) -> list[Self]:
    """Get paragraphs using regex `\\s{10,}(?=[A-Z])`
    implying many spaces before a capital letter then
    remove new lines contained in non-paragraph lines.

    Args:
        prelim_lines (list[str]): Previously split text
        page_num (int): Page number assigned to each resulting line

    Returns:
        list[Self]: Bodylines of segmented text
    """
    lines = []
    for order_num, par in enumerate(prelim_lines, start=1):
        obj = cls(
            page_num=page_num,
            order_num=order_num,
            line=line_break.sub(" ", par).strip(),
        )
        lines.append(obj)
    lines.sort(key=lambda obj: obj.order_num)
    return lines
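
The following sketch shows the two steps end to end: splitting on the paragraph regex quoted in the docstring, then collapsing line breaks inside each piece. The `line_break` pattern used here is an assumption, since the real one is not shown on this page, and `Line` / `split_lines` are hypothetical stand-ins for `Bodyline` and `Bodyline.split()`.

```python
import re
from typing import NamedTuple

paragraph_break = re.compile(r"\s{10,}(?=[A-Z])")  # regex quoted in the docstring
line_break = re.compile(r"\s*\n+\s*")  # assumption: the real pattern is not shown


class Line(NamedTuple):
    page_num: int
    order_num: int
    line: str


def split_lines(text: str, page_num: int) -> list[Line]:
    """Split text into paragraphs, then flatten newlines within each one."""
    prelim_lines = paragraph_break.split(text)
    return [
        Line(page_num, order_num, line_break.sub(" ", par).strip())
        for order_num, par in enumerate(prelim_lines, start=1)
    ]


text = "First paragraph that wraps\nonto a second line." + " " * 12 + "Second paragraph."
for item in split_lines(text, page_num=1):
    print(item)
```

Since `enumerate()` already yields the lines in order, the sketch omits the final sort by `order_num`.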

Footnote

Bases: NamedTuple

Each page may contain an annex which consists of footnotes. Note that this is based on an imperfect use of regex to detect the footnote number `fn_id` and its corresponding text `note`.

Field	Type	Description
`page_num`	int	Page the footnote belongs to
`fn_id`	int	Footnote number
`note`	str	The text found based on segmentation of footnotes

Source code in src/start_ocr/components.py

Python
class Footnote(NamedTuple):
    """Each page may contain an annex which consists of footnotes. Note
    that this is based on an imperfect use of regex to detect the footnote
    number `fn_id` and its corresponding text `note`.

    Field | Type | Description
    --:|:--:|:--
    `page_num` | int | Page the footnote belongs to
    `fn_id` | int | Footnote number
    `note` | str | The text found based on segmentation of footnotes
    """

    page_num: int
    fn_id: int
    note: str

    @classmethod
    def extract_notes(cls, text: str, page_num: int) -> list[Self]:
        """Get footnote digits using regex `\\n\\s+(?P<fn>\\d+)(?=\\s+[A-Z])`.
        For each matching span, the start of the span becomes the anchor
        that cuts off the balance of the text for each remaining footnote in
        the while loop. The while loop extraction must use `.pop()` so that
        the last match is removed first.

        Args:
            text (str): Text that should be convertible to footnotes based on regex
            page_num (int): Page number assigned to each resulting footnote

        Returns:
            list[Self]: Footnotes separated by digits.
        """
        notes = []
        while True:
            matches = list(footnote_nums.finditer(text))
            if not matches:
                break
            note = matches.pop()  # start from the last
            footnote_num = int(note.group("fn"))
            digit_start, digit_end = note.span()
            footnote_body = text[digit_end:].strip()
            obj = cls(
                page_num=page_num,
                fn_id=footnote_num,
                note=footnote_body,
            )
            notes.append(obj)
            text = text[:digit_start]
        notes.sort(key=lambda obj: obj.fn_id)
        return notes

Functions

`extract_notes(text, page_num)` `classmethod`

Get footnote digits using regex `\n\s+(?P<fn>\d+)(?=\s+[A-Z])`. For each matching span, the start of the span becomes the anchor that cuts off the balance of the text for each remaining footnote in the while loop. The while loop extraction must use `.pop()` so that the last match is removed first.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text that should be convertible to footnotes based on regex	required
`page_num`	`int`	Page number assigned to each resulting footnote	required

Returns:

Type	Description
`list[Self]`	Footnotes separated by digits.

Source code in src/start_ocr/components.py

Python
@classmethod
def extract_notes(cls, text: str, page_num: int) -> list[Self]:
    """Get footnote digits using regex `\\n\\s+(?P<fn>\\d+)(?=\\s+[A-Z])`.
    For each matching span, the start of the span becomes the anchor
    that cuts off the balance of the text for each remaining footnote in
    the while loop. The while loop extraction must use `.pop()` so that
    the last match is removed first.

    Args:
        text (str): Text that should be convertible to footnotes based on regex
        page_num (int): Page number assigned to each resulting footnote

    Returns:
        list[Self]: Footnotes separated by digits.
    """
    notes = []
    while True:
        matches = list(footnote_nums.finditer(text))
        if not matches:
            break
        note = matches.pop()  # start from the last
        footnote_num = int(note.group("fn"))
        digit_start, digit_end = note.span()
        footnote_body = text[digit_end:].strip()
        obj = cls(
            page_num=page_num,
            fn_id=footnote_num,
            note=footnote_body,
        )
        notes.append(obj)
        text = text[:digit_start]
    notes.sort(key=lambda obj: obj.fn_id)
    return notes
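
A small worked example may make the back-to-front extraction easier to follow. This sketch uses the regex quoted in the docstring but walks the matches once in reverse rather than re-running the search after each cut, which should give the same result for typical input. `Note` and `extract_notes_sketch` are hypothetical names, and the sample text is made up.

```python
import re
from typing import NamedTuple

footnote_nums = re.compile(r"\n\s+(?P<fn>\d+)(?=\s+[A-Z])")  # regex from the docstring


class Note(NamedTuple):
    page_num: int
    fn_id: int
    note: str


def extract_notes_sketch(text: str, page_num: int) -> list[Note]:
    """Cut the text at each footnote marker, starting from the last marker."""
    notes = []
    end = len(text)
    for match in reversed(list(footnote_nums.finditer(text))):
        body = text[match.end():end].strip()  # text between this marker and the next
        notes.append(Note(page_num, int(match.group("fn")), body))
        end = match.start()  # everything after this point is already consumed
    notes.sort(key=lambda n: n.fn_id)
    return notes


sample = "\n  1 First note spanning\n     two lines.\n  2 Second note."
for n in extract_notes_sketch(sample, page_num=3):
    print(n.fn_id, repr(n.note))
```

Working from the last marker backwards means each footnote body runs from its own marker to the start of the one already extracted, so a wrapped second line stays with its footnote.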

Footer

Annex Existence as Page Y-Axis End/s

Given an im, detect footnote line of annex and return relevant points in the y-axis as a tuple.

Scenario	Description	y0	y1
Footnote line exists	Page contains footnotes	int or float	int or float signifying end of page
Footnote line absent	Page does not contain footnotes	int or float signifying end of page	`None`

Examples:

Python Console Session

>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> res = get_page_end(im, page)
>>> isinstance(res, tuple)
True
>>> int(res[0])
822
>>> int(res[1])
879

Parameters:

Name	Type	Description	Default
`im`	`ndarray`	the openCV image that may contain a footnote line	required
`page`	`Page`	the pdfplumber.page.Page based on `im`	required

Returns:

Type	Description
`tuple[float, float \| None]`	Annex line's y-axis (if it exists) and the page's end content line.

Source code in src/start_ocr/components.py

Python
def get_page_end(im: np.ndarray, page: Page) -> tuple[float, float | None]:
    """Given an `im`, detect footnote line of annex and return relevant points in the y-axis as a tuple.

    Scenario | Description | y0 | y1
    :--:|:-- |:--:|:--:
    Footnote line exists | Page contains footnotes | int or float | int or float signifying end of page
    Footnote line absent | Page does not contain footnotes | int or float signifying end of page | `None`

    Examples:
        >>> from start_ocr import get_page_and_img
        >>> from pathlib import Path
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
        >>> res = get_page_end(im, page)
        >>> isinstance(res, tuple)
        True
        >>> int(res[0])
        822
        >>> int(res[1])
        879

    Args:
        im (numpy.ndarray): the openCV image that may contain a footnote line
        page (Page): the pdfplumber.page.Page based on `im`

    Returns:
        tuple[float, float | None]: Annex line's y-axis (if it exists) and the page's end content line.
    """  # noqa: E501
    y1 = PERCENT_OF_MAX_PAGE * page.height
    im_h, _, _ = im.shape
    if lines := footnote_lines(im):
        fn_line_end = lines[0].y / im_h
        y0 = fn_line_end * page.height
        return y0, y1
    return y1, None
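
The two scenarios in the table can be reproduced with plain numbers. In this sketch the value of `PERCENT_OF_MAX_PAGE` is an assumption (the real constant is not shown on this page), and `page_end` is a hypothetical stand-in that takes the footnote line's pixel position directly instead of detecting it.

```python
from typing import Optional

PERCENT_OF_MAX_PAGE = 0.94  # assumption: the real constant's value is not shown


def page_end(
    footnote_line_y_px: Optional[float], im_h: int, page_height: float
) -> tuple[float, Optional[float]]:
    """Return (y0, y1) in points, following the scenario table above."""
    y1 = PERCENT_OF_MAX_PAGE * page_height  # end of page content
    if footnote_line_y_px is not None:
        y0 = (footnote_line_y_px / im_h) * page_height  # pixels to points
        return y0, y1
    return y1, None


# Hypothetical page: 2000 px tall image of a 1000 pt tall page.
print(page_end(1800, im_h=2000, page_height=1000.0))  # footnote line present
print(page_end(None, im_h=2000, page_height=1000.0))  # footnote line absent
```

Callers need to check whether the second element is `None` before treating the first element as the footnote line, because its meaning changes between the two scenarios.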

Page Width Lines

Filter long horizontal lines:

1. Edges of lines must be:
   - on the left of the page; and
   - on the right of the page
2. Each line must be at least 1/2 the page width

Examples:

Python Console Session

>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> res = page_width_lines(im)
>>> len(res)  # number of lines that match the filter
3

Source code in src/start_ocr/components.py

Python
def page_width_lines(im: np.ndarray) -> list[CoordinatedImage]:
    """Filter long horizontal lines:

    1. Edges of lines must be:
        - on the left of the page; and
        - on the right of the page
    2. Each line must be at least 1/2 the page width

    Examples:
        >>> from start_ocr import get_page_and_img
        >>> from pathlib import Path
        >>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
        >>> res = page_width_lines(im)
        >>> len(res)  # number of lines that match the filter
        3
    """  # noqa: E501
    _, im_w, _ = im.shape
    results = []
    contours = get_contours(
        im=im,
        rectangle_size=(100, 100),
        test_dilation=True,
        test_dilated_image="temp/dilated.png",
    )
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        contoured = CoordinatedImage(im, x, y, w, h)
        contoured.redbox()
        filtering_criteria = [
            w > im_w / 2,  # width greater than half
            x < im_w / 3,  # edge of line on first third
            (x + w) > im_w * (2 / 3),  # edge of line on last third
        ]
        if all(filtering_criteria):
            obj = CoordinatedImage(im, x, y, w, h)
            obj.greenbox()
            results.append(obj)
    cv2.imwrite("temp/boxes.png", im)
    return results
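
The filtering criteria can be checked on plain bounding boxes. This is a minimal sketch that leaves out contour detection and the debug images; `is_page_width_line` and the sample boxes are hypothetical, while the three conditions mirror `filtering_criteria` above.

```python
Box = tuple[int, int, int, int]  # (x, y, w, h)


def is_page_width_line(box: Box, im_w: int) -> bool:
    """Apply the three width and position checks to one bounding box."""
    x, _, w, _ = box
    return all(
        [
            w > im_w / 2,  # width greater than half the page
            x < im_w / 3,  # left edge within the first third
            (x + w) > im_w * (2 / 3),  # right edge within the last third
        ]
    )


boxes = [
    (150, 300, 1400, 6),  # spans most of the page: kept
    (150, 1900, 500, 6),  # starts on the left but too short: dropped
    (700, 2000, 900, 6),  # wide enough but starts past the first third: dropped
]
print([is_page_width_line(b, im_w=1700) for b in boxes])  # [True, False, False]
```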
