Content
Metadata of a single content page.
| Field | Description |
|---|---|
| page_num | Page number |
| body | Main content above an annex, if existing |
| segments | Segments of the body's text in the given page_num |
| annex | Portion of page containing the footnotes; some pages are annex-free |
| footnotes | Each footnote item in the annex's text in the given page_num |
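To make the relationship between these fields concrete, here is a minimal sketch of a container with the same field names. The class name `ContentSketch`, the plain `str` types, and the `has_annex` helper are assumptions for illustration only; the actual class in src/start_ocr/content.py may use different types.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ContentSketch:
    """Hypothetical stand-in for the documented fields (types are assumed)."""

    page_num: int
    body: str  # main content above the annex
    segments: list[str] = field(default_factory=list)  # pieces of the body's text
    annex: str | None = None  # None models an annex-free page
    footnotes: list[str] = field(default_factory=list)  # items found in the annex

    @property
    def has_annex(self) -> bool:
        # Hypothetical helper: a page without an annex has no footnotes to read.
        return self.annex is not None
```

An annex-free page is modeled here by leaving `annex` as `None`, in which case `footnotes` stays empty.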
Source code in src/start_ocr/content.py
Functions
set(page, start_y=None, end_y=None)
classmethod
This function uses two local variables: a header_line (related to start_y) and a page_line (related to end_y).
The header_line is the imaginary line at the top of the page. If start_y is supplied, the header_line no longer needs to be calculated.
The page_line is the imaginary line at the bottom of the page. If end_y is supplied, it replaces the calculated page_line.
Together, the header_line and the page_line determine what to extract as content from a given page.
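The override behavior described above can be sketched as a small standalone function. This is not the library's implementation: `resolve_bounds` and its `computed_header_y` / `computed_page_end_y` arguments are hypothetical names standing in for whatever values the function calculates internally.

```python
from __future__ import annotations


def resolve_bounds(
    computed_header_y: float,
    computed_page_end_y: float,
    start_y: float | int | None = None,
    end_y: float | int | None = None,
) -> tuple[float, float]:
    """Return the (top, bottom) y-coordinates that bound the content area."""
    # A supplied start_y means the header line does not need to be calculated.
    top = float(start_y) if start_y is not None else float(computed_header_y)
    # A supplied end_y replaces the calculated line at the bottom of the page.
    bottom = float(end_y) if end_y is not None else float(computed_page_end_y)
    return top, bottom
```

The sketch compares against `None` explicitly rather than relying on truthiness, so a caller can pass `start_y=0` to mean the very top of the page.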
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| page | `Page` | The pdfplumber page to evaluate | required |
| start_y | `float \| int \| None` | If present, the y-axis point of the starter page. | `None` |
| end_y | `float \| int \| None` | If present, the y-axis point of the ender page. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| Self | `Self` | Page with individual components mapped out. |