Markers
Metadata
Each document will have:
Page type | Note |
---|---|
start | there will be a deliberate start y-axis position affected by markers. |
content | see start-ocr "primitives" Bodyline for content segments, Footnote for discovered footnote partials. |
end | there will be a deliberate end y-axis position. |
Y-axis cutting
The y-axis matters for the start and end positions because the header and the footer are cut out to arrive at the meat of each page. Each page can then be dissected into segments and footnotes.
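As a rough illustration of the cut, the sketch below turns a relative start and end position into a bounding box. The function name `crop_box` and its parameters are hypothetical and not part of corpus_unpdf; the `(x0, top, x1, bottom)` ordering is chosen to match what pdfplumber's `Page.crop` accepts.

```python
def crop_box(
    page_width: float,
    page_height: float,
    start_pct: float,
    end_pct: float,
) -> tuple[float, float, float, float]:
    """Return (x0, top, x1, bottom) for the band between two relative y positions.

    start_pct and end_pct are fractions of the page height, measured from the top.
    """
    if not 0.0 <= start_pct < end_pct <= 1.0:
        raise ValueError("expected 0 <= start_pct < end_pct <= 1")
    # The full width is kept; only the header (above start) and footer (below end) go.
    return (0.0, start_pct * page_height, page_width, end_pct * page_height)


# Keep the middle half of a 600 x 800 page.
box = crop_box(600, 800, 0.25, 0.75)
```

With pdfplumber, a box like this could be passed as `page.crop(box)` after computing it from `page.width` and `page.height`.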
Warning
Not all markers are found on the preliminary page, so an anchoring start of content still needs to be found.
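One way to cope with missing markers is sketched below. It assumes that each marker, when found, yields a fraction of the page height, and that content starts below the lowest marker found. Both the helper name `anchoring_start` and that selection rule are assumptions made for illustration, not the library's documented behavior.

```python
from typing import Optional


def anchoring_start(*marker_pcts: Optional[float], default: float = 0.0) -> float:
    """Pick the lowest marker that was found; fall back to `default` if none were.

    Each argument is a marker's bottom edge as a fraction of the page height,
    or None when that marker was not detected on the page.
    """
    found = [pct for pct in marker_pcts if pct is not None]
    # A larger fraction means a position further down the page.
    return max(found, default=default)


# Composition found at 12%, notice missing, writer found at 18%.
start = anchoring_start(0.12, None, 0.18)
```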
Court Composition
Composition Choices
Bases: Enum
How the Philippine Supreme Court sits. At present, this includes four options: en banc and three divisions. TODO: Might need to add cases for special divisions.
Source code in corpus_unpdf/_markers.py
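A minimal sketch of what a four-option enum like this could look like is given below. The class name `CompositionSketch`, the member names, the string values, and the `match_composition` helper are all assumptions made for illustration; the actual definitions live in the source file referenced above.

```python
from enum import Enum
from typing import Optional


class CompositionSketch(Enum):
    """Hypothetical stand-in for the composition enum: en banc plus three divisions."""

    ENBANC = "En Banc"
    DIV1 = "First Division"
    DIV2 = "Second Division"
    DIV3 = "Third Division"


def match_composition(text: str) -> Optional[CompositionSketch]:
    """Match OCR text against the enum values, ignoring case and extra whitespace."""
    normalized = " ".join(text.split()).casefold()
    for choice in CompositionSketch:
        if choice.value.casefold() == normalized:
            return choice
    # Unrecognized text, e.g. a special division, is not covered by this sketch.
    return None


found = match_composition("  EN   BANC ")
```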
Extract Composition
Bases: NamedTuple
Should be present as the top-centered element on the first page of a Decision PDF file.
Field | Type | Description |
---|---|---|
element | CourtCompositionChoices | Presently four choices |
coordinates | tuple[int, int, int, int] | The opencv rectangle found in the page where the composition is found |
composition_pct_height | float | The y + height h of the coordinates over the im_h image height; used so the pdfplumber can utilize its cropping mechanism. |
Source code in corpus_unpdf/_markers.py
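The percentage-height field can be read as a simple ratio. The sketch below assumes the coordinates follow the usual OpenCV `(x, y, w, h)` rectangle layout; the names `MarkerSketch`, `pct_height`, and `make_marker` are hypothetical and only mirror the shape of the table above.

```python
from typing import NamedTuple


class MarkerSketch(NamedTuple):
    """Hypothetical record shaped like the fields described in the table."""

    label: str
    coordinates: tuple[int, int, int, int]  # assumed OpenCV order: (x, y, w, h)
    pct_height: float


def pct_height(coordinates: tuple[int, int, int, int], im_h: int) -> float:
    """Bottom edge of the rectangle (y + h) as a fraction of the image height."""
    _, y, _, h = coordinates
    return (y + h) / im_h


def make_marker(
    label: str, coordinates: tuple[int, int, int, int], im_h: int
) -> MarkerSketch:
    return MarkerSketch(label, coordinates, pct_height(coordinates, im_h))


# A rectangle 50 px tall starting at y=150 on a 2000 px tall page image.
marker = make_marker("En Banc", (100, 150, 400, 50), im_h=2000)
```

Because the value is a fraction rather than a pixel count, it can be multiplied by a pdfplumber page's height to find the same position in PDF units, whatever resolution the image was rendered at.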
Decision Category & Writer
Category Choices
Bases: Enum
The classification of a decision issued by the Supreme Court, i.e. a decision or a resolution.
Source code in corpus_unpdf/_markers.py
Extract Category
Bases: NamedTuple
Should be present as the top-centered element on the first page of a Decision PDF file.
Field | Type | Description |
---|---|---|
element | DecisionCategoryChoices | Presently two choices: decision or resolution |
coordinates | tuple[int, int, int, int] | The opencv rectangle found in the page where the category element is found |
writer | str | The string found indicating the name of the writer |
category_pct_height | float | The y + height h of the coordinates over the im_h image height; used so the pdfplumber can utilize its cropping mechanism. |
writer_pct_height | float | The writer's coordinates are found below the category coordinates. This can then be used to signify the anchoring start of the document. |
Source code in corpus_unpdf/_markers.py
Notice
Bases: NamedTuple
When present, signifies issuance by authority of the Court.
Field | Type | Description |
---|---|---|
element | NoticeChoices | Only a single choice (for now) |
coordinates | tuple[int, int, int, int] | The opencv rectangle found in the page where the notice is found |
position_pct_height | float | The y + height h of the coordinates over the im_h image height; used so the pdfplumber can utilize its cropping mechanism. |