Page Components
Header
Page Y-Axis Start
The header represents non-title page content above the main content.
The terminating header line is a non-visible line that separates the decision's header from its main content. We'll use a typographic bottom of the header to signify this line.
Examples:
>>> from pathlib import Path
>>> from start_ocr import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
>>> int(get_header_line(im, page)) in [76, 77]
True
>>> page.pdf.close()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
im | ndarray | The full page image | required |
page | Page | The pdfplumber page | required |
Returns:
Type | Description |
---|---|
int, float, or None | Y-axis point (pdfplumber point) at the bottom of the header |
Source code in src/start_ocr/components.py
Upper Right
The header represents non-title page content above the main content.
It usually consists of three items:
Item | Label | Test PDF |
---|---|---|
1 | Indicator text | Indicator |
2 | Page number | 1 |
3 | Some other detail | xyzabc123 |
This detects Item (3), which is expected to sit in the upper right portion of the document:
x > im_w / 2 # ensures that it is on the right side of the page
y <= im_h * 0.2 # ensures that it is in the top 20% of the page
Item (3) is the only one of the above that is likely to occupy a second line stacked vertically, so choosing it as the typographic bottom of the header makes sense.
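The two conditions above can be sketched as a small predicate. This is only an illustration: the helper name `in_upper_right` and the sample bounding boxes are hypothetical and are not part of `start_ocr`.

```python
def in_upper_right(x: float, y: float, im_w: int, im_h: int) -> bool:
    """Return True if the point (x, y) lies in the upper right region.

    Hypothetical helper: it only restates the two conditions from the text.
    """
    on_right_side = x > im_w / 2   # right half of the page
    near_top = y <= im_h * 0.2     # top 20% of the page
    return on_right_side and near_top


# Made-up bounding boxes as (x, y, w, h) on a 1000 x 1400 pixel image.
boxes = [(100, 50, 200, 30), (700, 60, 150, 40), (650, 900, 100, 30)]
candidates = [b for b in boxes if in_upper_right(b[0], b[1], 1000, 1400)]
```

Only the second box passes both checks: the first is on the left half of the page and the third is too far down.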
Examples:
>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 1)
>>> isinstance(get_header_upper_right(im), tuple)
True
>>> page.pdf.close()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
im | ndarray | The full page image | required |
Returns:
Type | Description |
---|---|
tuple[int, int, int, int] or None | The coordinates of the docket, if found. |
Page Number
Aside from the first page, which should always be 1, this function gets the first matching digit in the header's text.
If no such digit is found, return 0.
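The rule described above might look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: the helper name `first_page_number`, the zero-based `page_index` argument, and the reading of "first matching digit" as the first run of digits are all assumptions.

```python
import re


def first_page_number(header_text: str, page_index: int) -> int:
    """Hypothetical helper: pick a page number out of header text.

    Assumes page_index is zero-based, so index 0 is the first page.
    """
    if page_index == 0:
        return 1  # the first page is always treated as page 1
    match = re.search(r"\d+", header_text)  # first run of digits, if any
    return int(match.group()) if match else 0  # 0 signals "not found"


page_two = first_page_number("Indicator 2 xyzabc123", page_index=1)
no_digits = first_page_number("Indicator only", page_index=3)
```

Because the search stops at the first match, the digits inside `xyzabc123` are ignored when a page number appears earlier in the header.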
Examples:
>>> import pdfplumber
>>> from pathlib import Path
>>> from start_ocr import get_img_from_page
>>> x = Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf"
>>> pdf = pdfplumber.open(x)
>>> page = pdf.pages[1] # page 2
>>> im = get_img_from_page(page)
>>> header_line = get_header_line(im, page)
>>> get_page_num(page, header_line)
2
>>> pdf.close()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The pdfplumber page | required |
header_line | int or float | The value retrieved from | required |
Returns:
Type | Description |
---|---|
int | The page number, if found |
Lines
Bodyline
Bases: NamedTuple
Each page may be divided into lines which, for our purposes, will refer to an arbitrary segmentation of text based on regex parameters.
Field | Type | Description |
---|---|---|
num | int | Order in the page |
line | str | The text found based on segmentation |
Functions
split(prelim_lines, page_num)
classmethod
Get paragraphs using the regex \s{10,}(?=[A-Z]), which matches a run of ten or more whitespace characters before a capital letter, then remove new lines contained in non-paragraph lines.
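To see how that pattern behaves, here is a minimal sketch on made-up text. The constant name `PARAGRAPH_BREAK` and the whitespace-collapsing cleanup step are assumptions for illustration; the actual `split` method may clean lines differently.

```python
import re

# The pattern quoted above: ten or more whitespace characters,
# followed (without consuming it) by a capital letter.
PARAGRAPH_BREAK = re.compile(r"\s{10,}(?=[A-Z])")

text = (
    "First paragraph ends here."
    + " " * 12
    + "Second paragraph\nwraps onto a new line."
)

parts = PARAGRAPH_BREAK.split(text)

# Assumed cleanup: collapse any remaining newlines and runs of spaces.
cleaned = [" ".join(part.split()) for part in parts]
```

The lookahead keeps the capital letter at the start of the next segment, so no text is lost at the split point.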
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prelim_lines | list[str] | Previously split text | required |
Returns:
Type | Description |
---|---|
list[Self] | Bodylines of segmented text |
Footnote
Bases: NamedTuple
Each page may contain an annex which consists of footnotes. Note that this is based on an imperfect use of regex to detect the footnote number fn_id and its corresponding text note.
Field | Type | Description |
---|---|---|
fn_id | int | Footnote number |
note | str | The text found based on segmentation of footnotes |
Functions
extract_notes(text, page_num)
classmethod
Get footnote digits using the regex \n\s+(?P<fn>\d+)(?=\s+[A-Z]). For each matching span, the start of the span becomes the anchor for the balance of the text for each remaining footnote in the while loop. The while loop extraction must use .pop(), where the last item is removed first.
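The back-to-front extraction can be sketched as follows. This is an illustration of the idea rather than the library's code: the function name `split_footnotes`, the plain tuples it returns, and the sample text are assumptions.

```python
import re

# The pattern quoted above: a newline, indentation, the footnote digits,
# then whitespace and a capital letter (checked but not consumed).
FN_PATTERN = re.compile(r"\n\s+(?P<fn>\d+)(?=\s+[A-Z])")


def split_footnotes(text: str) -> list[tuple[int, str]]:
    """Hypothetical helper returning (footnote number, note text) pairs."""
    matches = list(FN_PATTERN.finditer(text))
    notes = []
    while matches:
        m = matches.pop()  # take the last footnote first
        notes.append((int(m.group("fn")), text[m.end():].strip()))
        text = text[: m.start()]  # the balance belongs to earlier footnotes
    notes.reverse()  # restore reading order
    return notes


sample = "\n  1 See the first source.\n  2 Compare the second source."
footnotes = split_footnotes(sample)
```

Working from the last match backward means each truncation of `text` leaves the offsets of the earlier matches valid, which is why the last item has to be popped first.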
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text that should be convertible to footnotes based on regex | required |
Returns:
Type | Description |
---|---|
list[Self] | Footnotes separated by digits. |
Footer
Annex Existence as Page Y-Axis End/s
Given an im, detect the footnote line of the annex and return the relevant points on the y-axis as a tuple.
Scenario | Description | y0 | y1 |
---|---|---|---|
Footnote line exists | Page contains footnotes | int or float | int or float signifying end of page |
Footnote line absent | Page does not contain footnotes | int or float signifying end of page | None |
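A caller might interpret the two scenarios in the table like this. The helper name `describe_page_end` and the dictionary keys are hypothetical; only the meaning of y0 and y1 is taken from the table.

```python
from typing import Optional


def describe_page_end(y0: float, y1: Optional[float]) -> dict:
    """Hypothetical helper that labels the (y0, y1) pair from the table."""
    if y1 is None:
        # Footnote line absent: y0 is the end of the page content.
        return {"has_footnotes": False, "body_end": y0, "annex": None}
    # Footnote line exists: the body stops at y0, the annex runs y0..y1.
    return {"has_footnotes": True, "body_end": y0, "annex": (y0, y1)}


with_annex = describe_page_end(822, 879)
without_annex = describe_page_end(879, None)
```

In both scenarios y0 marks where the main content stops, so a caller can crop the body at y0 without first checking whether footnotes exist.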
Examples:
>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> res = get_page_end(im, page)
>>> isinstance(res, tuple)
True
>>> int(res[0])
822
>>> int(res[1])
879
Parameters:
Name | Type | Description | Default |
---|---|---|---|
im | ndarray | The OpenCV image that may contain a footnote line | required |
page | Page | The pdfplumber.page.Page based on | required |
Returns:
Type | Description |
---|---|
tuple[float, float or None] | Annex line's y-axis (if it exists) and the page's end content line. |
Page Width Lines
Filter long horizontal lines:
- Edges of lines must be:
  - on the left of the page; and
  - on the right of the page
- Each line must be at least 1/2 the page width
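The filter criteria above can be sketched on plain x-coordinates. This is a sketch under assumptions: the helper name `is_page_width_line` is hypothetical, "on the left" and "on the right" are read as "left of the page midpoint" and "right of the page midpoint", and the segments are made-up values rather than detected contours.

```python
def is_page_width_line(x0: float, x1: float, im_w: int) -> bool:
    """Hypothetical check for a horizontal segment running from x0 to x1."""
    starts_left = x0 < im_w / 2           # left edge on the left half
    ends_right = x1 > im_w / 2            # right edge on the right half
    long_enough = (x1 - x0) >= im_w / 2   # at least half the page width
    return starts_left and ends_right and long_enough


# Made-up segments as (x0, x1) on a 1000 pixel wide image.
segments = [(50, 950), (400, 600), (520, 990), (100, 700)]
kept = [s for s in segments if is_page_width_line(*s, im_w=1000)]
```

The second segment straddles the midpoint but is too short, and the third starts on the right half, so only the first and last are kept.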
Examples:
>>> from start_ocr import get_page_and_img
>>> from pathlib import Path
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> res = page_width_lines(im)
>>> len(res) # only one image matches the filter
3