Content
Metadata of a single content page.
Field | Description |
---|---|
page_num | Page number |
body | Main content above an annex, if existing |
segments | Segments of the body's text in the given page_num |
annex | Portion of page containing the footnotes; some pages are annex-free |
footnotes | Each footnote item in the annex's text in the given page_num |
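To make the relationship between these fields concrete, here is a rough sketch of a container with the same field names. The types are assumptions made for illustration (plain strings stand in for whatever objects the library actually stores), `PageSketch` is a hypothetical name rather than the class defined in src/start_ocr/content.py, and splitting the body on newlines is only a placeholder for real segmentation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PageSketch:
    """Hypothetical stand-in mirroring the documented field names."""

    page_num: int
    body: str  # main content above the annex, if an annex exists
    segments: list[str] = field(default_factory=list)  # pieces of the body text
    annex: str | None = None  # None models an annex-free page
    footnotes: list[str] = field(default_factory=list)  # items of the annex text


# An annex-free page: no annex, hence no footnotes.
page = PageSketch(page_num=1, body="First paragraph.\nSecond paragraph.")
# Placeholder segmentation (assumption): one segment per line of the body.
page.segments = page.body.split("\n")
```

In this sketch, an annex-free page simply leaves annex as None and footnotes empty, which matches the note that some pages have no annex.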
Source code in src/start_ocr/content.py
Functions
set(page, start_y=None, end_y=None)
classmethod
A header_line (related to start_y) and a page_line (related to end_y) are used as local variables in this function.
The header_line is the imaginary line at the top of the page. If start_y is supplied, the header_line no longer needs to be calculated.
The page_line is the imaginary line at the bottom of the page. If end_y is supplied, it replaces the calculated page_line.
Together, the header_line and the page_line determine what to extract as content from a given page.
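The override rule for the two lines can be sketched as follows. This is a minimal illustration, not the code in src/start_ocr/content.py: `resolve_bounds`, `detected_header_y`, and `detected_page_y` are hypothetical names, and the detected values stand in for whatever the function calculates from the page. The sketch also assumes that y grows downward from the top of the page, as pdfplumber's `top` coordinates do.

```python
from __future__ import annotations


def resolve_bounds(
    detected_header_y: float,
    detected_page_y: float,
    start_y: float | int | None = None,
    end_y: float | int | None = None,
) -> tuple[float, float]:
    """Return (header_line, page_line) after applying optional overrides."""
    # A supplied start_y means the header line need not be calculated.
    header_line = detected_header_y if start_y is None else start_y
    # A supplied end_y replaces the calculated page line.
    page_line = detected_page_y if end_y is None else end_y
    # Assumption: y grows downward, so the header must sit above the page line.
    if header_line >= page_line:
        raise ValueError("header_line must be above page_line")
    return float(header_line), float(page_line)


bounds_default = resolve_bounds(50, 700)
bounds_custom_start = resolve_bounds(50, 700, start_y=120)
bounds_custom_end = resolve_bounds(50, 700, end_y=650)
```

The overrides are compared against None rather than tested for truthiness, so a start_y of 0 still counts as supplied.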
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | The pdfplumber page to evaluate | required |
start_y | float \| int \| None | If present, the y-axis point of the starter page. Defaults to None. | None |
end_y | float \| int \| None | If present, the y-axis point of the ender page. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
Self | Self | Page with individual components mapped out. |