corpus-unpdf Docs
Apply image processing and OCR to the main and separate opinions published on the Philippine Supreme Court website.
MainOpinionPages
Bases: Collection, FrontpageMeta
The main opinion of a Decision or Resolution, specifically its front and last pages, is formatted differently from separate opinions. The following metadata must be parsed:
- Composition, i.e. whether En Banc or by division.
- Category, i.e. whether a Decision or a Resolution.
- Writer, i.e. who penned the main opinion.
- Notice, i.e. whether it is of a particular category of decisions.
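As a rough illustration of the four fields listed above, the sketch below models them with standard-library enums and a dataclass. It is not the library's own FrontpageMeta: the class names `Category`, `Composition` and `FrontpageSketch` are hypothetical, only the `RESO` / 'Resolution' and `DIV2` / 'Second Division' pairs appear in this page's examples, and treating the notice as a boolean is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    # RESO / 'Resolution' mirrors the doctest output on this page;
    # DECISION is an assumed counterpart, not a confirmed member name.
    DECISION = "Decision"
    RESO = "Resolution"


class Composition(Enum):
    # DIV2 / 'Second Division' mirrors the doctest output on this page;
    # the remaining members are assumptions for illustration.
    ENBANC = "En Banc"
    DIV1 = "First Division"
    DIV2 = "Second Division"
    DIV3 = "Third Division"


@dataclass
class FrontpageSketch:
    """Hypothetical stand-in for the metadata parsed from a front page."""

    composition: Composition
    category: Category
    writer: Optional[str] = None  # raw OCR text of the ponente line
    notice: bool = False  # assumption: modeled here as a simple flag


# Enum lookup by value turns OCR'd text into a typed choice.
meta = FrontpageSketch(
    composition=Composition("Second Division"),
    category=Category("Resolution"),
    writer="CARPIO. J.:",
)
```

Looking up an enum by value raises `ValueError` for unrecognized text, which makes OCR misreads surface early instead of passing through silently.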
Given a PDF file, use MainOpinionPages.set(<path-to-pdf>) to extract the content pages and the above metadata of the file. The fields of this data structure are inherited from start_ocr's Collection, together with a custom FrontpageMeta.
Source code in corpus_unpdf/main.py
Functions
set(path)
classmethod
From a *.pdf file found in path, extract the relevant metadata to generate a decision with content pages. Each page will contain a body and, likely, an annex for footnotes.
Examples:
>>> from pathlib import Path
>>> x = Path.cwd() / "tests" / "data" / "decision.pdf"
>>> decision = MainOpinionPages.set(x)
>>> decision.category
<DecisionCategoryChoices.RESO: 'Resolution'>
>>> decision.composition
<CourtCompositionChoices.DIV2: 'Second Division'>
>>> decision.writer
'CARPIO. J.:'
>>> len(decision.pages) # total page count
5
>>> from start_ocr import Bodyline, Footnote, Content
>>> isinstance(decision.pages[0], Content) # first page
True
>>> isinstance(decision.segments[0], Bodyline)
True
>>> isinstance(decision.footnotes[0], Footnote)
True
>>> len(decision.footnotes) # TODO: limited number detected; should be 15
10
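The TODO above notes that fewer footnotes are detected than the document actually has. A minimal sketch of a check for that gap follows; `footnote_shortfall` is a hypothetical helper, not part of corpus-unpdf, and it assumes only that the extracted object exposes a `footnotes` sequence as in the example. A stand-in object is used so the sketch runs without the OCR dependencies.

```python
from types import SimpleNamespace
from typing import Any


def footnote_shortfall(collection: Any, expected: int) -> int:
    """Return how many footnotes are missing relative to a known count.

    Hypothetical helper: relies only on `collection.footnotes` having a length.
    """
    detected = len(collection.footnotes)
    return max(expected - detected, 0)


# Stand-in for an extracted decision: 10 detected footnotes, as in the
# example above, where the expected total is 15.
fake_decision = SimpleNamespace(footnotes=[object()] * 10)
shortfall = footnote_shortfall(fake_decision, expected=15)
print(shortfall)  # prints 5
```

With a real extraction, the same call would take the object returned by MainOpinionPages.set in place of the stand-in.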
SeparateOpinionPages
Bases: Collection
Handles the content and metadata of separate opinions, i.e. the concurring and dissenting opinions to the main opinion of a Decision or Resolution.
Given a PDF file, use SeparateOpinionPages.set(<path-to-pdf>) to extract (only) the content pages of the file. The fields of this data structure are inherited from start_ocr's Collection.
Source code in corpus_unpdf/main.py
Functions
set(path)
classmethod
Limited extraction: only the content is of interest, unlike decisions, where metadata is also relevant. This also assumes that the first page will always be the logical start.
Examples:
>>> from pathlib import Path
>>> x = Path.cwd() / "tests" / "data" / "opinion.pdf"
>>> opinion = SeparateOpinionPages.set(x)
>>> len(opinion.pages) # total page count
10
>>> from start_ocr import Bodyline, Footnote, Content
>>> isinstance(opinion.pages[0], Content) # first page
True
>>> isinstance(opinion.segments[0], Bodyline)
True
>>> isinstance(opinion.footnotes[0], Footnote)
True
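Since both classes expose the same `set(path)` classmethod, several files can be processed with one loop. The sketch below is one possible way to do that, not something provided by corpus-unpdf: `extract_many` is a hypothetical helper that accepts any callable taking a path, so that one unreadable PDF does not stop the rest. A stand-in extractor is used here so the example runs without OCR dependencies.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_many(
    paths: Iterable[Path], extractor: Callable[[Path], Any]
) -> Tuple[Dict[Path, Any], Dict[Path, Exception]]:
    """Apply `extractor` to each path, collecting results and failures."""
    done: Dict[Path, Any] = {}
    failed: Dict[Path, Exception] = {}
    for path in paths:
        try:
            done[path] = extractor(path)
        except Exception as exc:  # record the error and continue with the rest
            failed[path] = exc
    return done, failed


def fake_extractor(path: Path) -> str:
    """Stand-in for a real extractor; fails on one file for demonstration."""
    if path.name == "bad.pdf":
        raise ValueError("unreadable file")
    return path.stem


done, failed = extract_many([Path("opinion.pdf"), Path("bad.pdf")], fake_extractor)
```

With the package installed, the call might instead look like `extract_many(Path("tests/data").glob("*.pdf"), SeparateOpinionPages.set)`.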