Positions
PositionMeta
Bases: NamedTuple
Metadata required to determine the true start and end pages of a given PDF path. A collection of pages has a logical start and end (page 1 and the final page of the document), but in court documents these do not always correspond to the actual start and end of the content.
Field | Type | Description |
---|---|---|
start_index | int | The zero-based integer x, i.e. get specific pdfplumber.pages[x] |
start_page_num | int | The 1-based integer to describe human-readable page number signifying the true content start |
start_indicator | PositionDecisionCategoryWriter or PositionNotice | Marking the start of the content proper |
end_page_num | int | The 1-based integer to describe human-readable page number signifying the true content end |
end_page_pos | float, int | y-axis position in the end_page_num |
Source code in corpus_unpdf/_positions.py
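The field table can be read as the following NamedTuple. This is only a sketch built from the documented fields: the two marker classes are empty hypothetical stand-ins (the real ones live in the library), and the sample values are assumptions chosen to show how the zero-based index relates to the 1-based page number.

```python
from typing import NamedTuple, Union


class PositionNotice:
    """Hypothetical stand-in for the library's notice marker."""


class PositionDecisionCategoryWriter:
    """Hypothetical stand-in for the library's category/writer marker."""


class PositionMeta(NamedTuple):
    start_index: int  # zero-based, usable as an index into the pdfplumber pages
    start_page_num: int  # 1-based, human-readable start page
    start_indicator: Union[PositionDecisionCategoryWriter, PositionNotice]
    end_page_num: int  # 1-based, human-readable end page
    end_page_pos: Union[float, int]  # y-axis position on end_page_num


# Sample values only: content starts on the first page and ends on page 5.
meta = PositionMeta(
    start_index=0,
    start_page_num=1,
    start_indicator=PositionNotice(),
    end_page_num=5,
    end_page_pos=80.88,
)
```

Keeping both start_index and start_page_num avoids off-by-one mistakes: the first is for indexing into the list of pages, the second is for anything shown to a reader.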
Terminal Start: Page, Position
The actual start of content depends on the detection of either a Notice or a Category. This requires searching the pages from start to finish, via start_ocr.get_pages_and_imgs().
Examples:
>>> from pathlib import Path
>>> x = Path().cwd() / "tests" / "data" / "notice.pdf"
>>> res = get_start_page_pos(x)
>>> type(res[0])
<class 'int'>
>>> res[0]
0
>>> type(res[1])
<class 'corpus_unpdf._markers.PositionNotice'>
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | Path to the PDF file. | required |
Returns:
Type | Description |
---|---|
tuple[int, PositionNotice \| PositionDecisionCategoryWriter \| None] \| None | The zero-based index of the page (i.e. 0 = page 1) and the marker found that signifies the start of the content |
Source code in corpus_unpdf/_positions.py
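The start-to-finish search described for the terminal start can be sketched as a plain forward scan. This is not the library's implementation: pages are reduced to strings instead of OCR images, and find_start, detect_notice and detect_category are hypothetical helpers with made-up matching rules, used only to show the "first page with a marker wins" logic.

```python
from typing import Callable, Iterable, Optional, Tuple

# A detector returns a marker for the page, or None (hypothetical interface).
Detector = Callable[[str], Optional[str]]


def detect_notice(text: str) -> Optional[str]:
    # Assumed rule: a notice page contains the word NOTICE.
    return "notice" if "NOTICE" in text else None


def detect_category(text: str) -> Optional[str]:
    # Assumed rule: a category page contains the word DECISION.
    return "category" if "DECISION" in text else None


def find_start(
    pages: Iterable[str], detectors: Iterable[Detector]
) -> Optional[Tuple[int, str]]:
    """Return (zero-based page index, marker) for the first page with a marker."""
    detector_list = list(detectors)
    for index, text in enumerate(pages):
        for detect in detector_list:
            marker = detect(text)
            if marker is not None:
                return index, marker
    return None  # no marker found on any page


pages = ["cover sheet", "NOTICE\nSirs/Mesdames:", "body text"]
start = find_start(pages, [detect_notice, detect_category])
```

Here start is (1, "notice"): the index is zero-based, so the content begins on the second page.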
Terminal End: Page Number, Position
The actual end of content depends on either of two pieces of text: the Ordered clause or By Authority of the Court. This requires searching the pages in reverse, via get_reverse_pages_and_imgs(), since the above pieces of text indicate the end of the content.
Examples:
>>> from pdfplumber.page import Page
>>> from pathlib import Path
>>> import pdfplumber
>>> x = Path().cwd() / "tests" / "data" / "notice.pdf"
>>> get_end_page_pos(x) # page 5, y-axis 80.88
(5, 80.88)
Also see snippets for debugging:
print(f"{x=}, {y=}, {w=}, {h=}, {y_pos=} {candidate=}")  # debug each candidate
cv2.rectangle(im, (x, y), (x + w, y + h), (36, 255, 12), 3)  # for each mark
cv2.imwrite("temp/sample_boxes.png", im)  # at the end of the for loop; see cv2.rectangle
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | Path to the PDF file. | required |
Returns:
Type | Description |
---|---|
tuple[int, int] \| None | The page number from pdfplumber.pages and the y-axis position on that page (the position may be a float, e.g. 80.88) |
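The reverse search described for the terminal end can be sketched in the same spirit. This is an illustration under assumptions, not the library's code: each page is modelled as a list of (text, y position) pairs rather than an image, find_end is a hypothetical helper, and the case-insensitive substring match on the two phrases is an assumed rule.

```python
from typing import Optional, Sequence, Tuple

# The two end-of-content phrases named in the description.
END_PHRASES = ("Ordered", "By Authority of the Court")

# Hypothetical page layout: lines as (text, y-axis position), top to bottom.
Line = Tuple[str, float]


def find_end(pages: Sequence[Sequence[Line]]) -> Optional[Tuple[int, float]]:
    """Scan pages last to first; return (1-based page number, y position)."""
    for offset, lines in enumerate(reversed(pages)):
        page_num = len(pages) - offset  # convert reverse offset to 1-based
        for text, y_pos in reversed(lines):  # bottom of the page first
            if any(phrase.lower() in text.lower() for phrase in END_PHRASES):
                return page_num, y_pos
    return None  # neither phrase appears anywhere


pages = [
    [("NOTICE", 10.0), ("body text", 50.0)],
    [("SO ORDERED.", 80.88), ("Associate Justice", 120.0)],
    [("annex", 15.0)],
]
end = find_end(pages)
```

Here end is (2, 80.88): the trailing annex page is skipped because the scan stops at the first match found from the back.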