Markers

Metadata

Each document will have:

Page type | Note
--:|:--
`start` | There will be a deliberate start y-axis position, affected by markers.
`content` | See the start-ocr "primitives": Bodyline for content segments, Footnote for discovered footnote partials.
`end` | There will be a deliberate end y-axis position.

Y-axis cutting

The y-axis is relevant for start and end, since the header and the footer are cut out to arrive at the meat of each page. Each page can then be dissected into segments and footnotes.
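
A minimal sketch of the cut, assuming `start` and `end` are expressed as fractions of the page height, as the `*_pct_height` fields in this page are. The helper name `crop_box` is hypothetical; the returned tuple follows the `(x0, top, x1, bottom)` order that pdfplumber's `Page.crop` accepts.

```python
def crop_box(
    page_width: float,
    page_height: float,
    start_pct: float,
    end_pct: float,
) -> tuple[float, float, float, float]:
    """Return (x0, top, x1, bottom) for the band between two y-axis fractions."""
    if not 0.0 <= start_pct < end_pct <= 1.0:
        raise ValueError("expected 0 <= start_pct < end_pct <= 1")
    return (0.0, start_pct * page_height, page_width, end_pct * page_height)


# Hypothetical 612 x 1008 page: header ends at 25%, footer starts at 75%.
box = crop_box(612, 1008, 0.25, 0.75)
```

With pdfplumber, `page.crop(box)` would then keep only the body of the page.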

Warning

Not all markers are found on the preliminary page, so the anchoring start of the content still needs to be located.
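
One way to pick the anchoring start when only some markers are detected is to take the lowest detected marker on the page. This is a sketch under that assumption, not necessarily the library's logic; `anchor_start` is a hypothetical helper, and its inputs stand in for the optional `*_pct_height` values described in this page.

```python
from typing import Optional


def anchor_start(*pct_heights: Optional[float], default: float = 0.0) -> float:
    """Return the largest y-axis fraction among the markers that were found.

    Markers that were not detected are passed as None and ignored. If no
    marker was found at all, fall back to `default` (top of the page).
    """
    found = [pct for pct in pct_heights if pct is not None]
    return max(found, default=default)


# Composition found at 12% of the page, notice missing, writer found at 31%.
start = anchor_start(0.12, None, 0.31)
```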

Court Composition

Composition Choices

Bases: Enum

How the Philippine Supreme Court sits. At present, this includes four options: en banc + 3 divisions. TODO: Might need to add cases for special divisions.

Source code in corpus_unpdf/_markers.py
Python
class CourtCompositionChoices(Enum):
    """How Philippine Supreme Court sits. At present, this includes four options: en banc + 3 divisions. TODO: Might need to add cases for _special_ divisions."""  # noqa: E501

    ENBANC = "En Banc"
    DIV1 = "First Division"
    DIV2 = "Second Division"
    DIV3 = "Third Division"
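
Because each member's value is the text printed on the page, a member can be looked up from a recognized string. The sketch below repeats the enum so that it runs on its own; `parse_composition` is a hypothetical helper, and the case and whitespace normalization is an assumption about what OCR output may look like.

```python
from enum import Enum
from typing import Optional


class CourtCompositionChoices(Enum):
    ENBANC = "En Banc"
    DIV1 = "First Division"
    DIV2 = "Second Division"
    DIV3 = "Third Division"


def parse_composition(text: str) -> Optional[CourtCompositionChoices]:
    """Match text against the enum values, ignoring case and extra whitespace."""
    cleaned = " ".join(text.split()).casefold()
    for member in CourtCompositionChoices:
        if member.value.casefold() == cleaned:
            return member
    return None


# Noisy input still resolves to a member; unknown text resolves to None.
found = parse_composition("  EN   BANC ")
missing = parse_composition("Special Division")
```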

Extract Composition

Bases: NamedTuple

Should be present as top centered element in the first page of a Decision PDF file.

Field | Type | Description
--:|:--:|:--
`element` | CourtCompositionChoices | Presently four choices
`coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the composition is found
`composition_pct_height` | float | The `y` + height `h` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.

Source code in corpus_unpdf/_markers.py
Python
class PositionCourtComposition(NamedTuple):
    """Should be present as top centered element in the first page of a Decision PDF file.

    Field | Type | Description
    --:|:--:|:--
    `element` | [CourtCompositionChoices][composition-choices] | Presently four choices
    `coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the composition is found
    `composition_pct_height` | float | The `y` + height `h` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.
    """  # noqa: E501

    element: CourtCompositionChoices
    coordinates: tuple[int, int, int, int]
    composition_pct_height: float

    @classmethod
    def extract(cls, im: np.ndarray) -> Self | None:
        im_h, _, _ = im.shape
        for member in CourtCompositionChoices:
            if xywh := get_likelihood_centered_coordinates(im, member.value):
                y, h = xywh[1], xywh[3]
                return cls(
                    element=member,
                    coordinates=xywh,
                    composition_pct_height=(y + h) / im_h,
                )
        return None

    @classmethod
    def from_pdf(cls, pdf: PDF) -> Self:
        page_one_im = get_img_from_page(pdf.pages[0])
        court_composition = cls.extract(page_one_im)
        if not court_composition:
            raise Exception("Could not detect court composition in page 1.")
        return court_composition
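
To make the `composition_pct_height` arithmetic concrete, the sketch below computes the same ratio from an `(x, y, w, h)` rectangle and then scales it to a page height in PDF units. The numbers and the helper names `pct_height` and `to_page_y` are made up for illustration.

```python
def pct_height(xywh: tuple[int, int, int, int], im_h: int) -> float:
    """Bottom edge of the rectangle (y + h) as a fraction of the image height."""
    _, y, _, h = xywh
    return (y + h) / im_h


def to_page_y(pct: float, page_height: float) -> float:
    """Scale an image-relative fraction to a y position on the PDF page."""
    return pct * page_height


# Hypothetical detection: a box 50px tall whose top is 450px down a 2000px image.
pct = pct_height((600, 450, 400, 50), im_h=2000)  # (450 + 50) / 2000
y_on_page = to_page_y(pct, page_height=1008)
```

Working with a fraction rather than pixels is what lets the image-based detection and the PDF-based cropping use different coordinate scales.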

Decision Category & Writer

Category Choices

Bases: Enum

The classification of a decision issued by the Supreme Court, i.e. a decision or a resolution.

Source code in corpus_unpdf/_markers.py
Python
class DecisionCategoryChoices(Enum):
    """The classification of a decision issued by the Supreme Court, i.e.
    a decision or a resolution."""

    CASO = "Decision"
    RESO = "Resolution"

Extract Category

Bases: NamedTuple

Should be present as top centered element in the first page of a Decision PDF file.

Field | Type | Description
--:|:--:|:--
`element` | DecisionCategoryChoices | Presently two choices
`coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the category element is found
`writer` | str | The string found indicating the name of the writer
`category_pct_height` | float | The `y` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.
`writer_pct_height` | float | The writer's coordinates are found below the category coordinates. This can then be used to signify the anchoring start of the document.

Source code in corpus_unpdf/_markers.py
Python
class PositionDecisionCategoryWriter(NamedTuple):
    """Should be present as top centered element in the first page of a Decision PDF file.

    Field | Type | Description
    --:|:--:|:--
    `element` | [DecisionCategoryChoices][category-choices] | Presently two choices
    `coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the `category` element is found
    `writer` | str | The string found indicating the name of the writer
    `category_pct_height` | float | The `y` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.
    `writer_pct_height` | float | The writer's coordinates are found below the category coordinates. This can then be used to signify the anchoring start of the document.
    """  # noqa: E501

    element: DecisionCategoryChoices
    coordinates: tuple[int, int, int, int]
    writer: str
    category_pct_height: float
    writer_pct_height: float

    @classmethod
    def extract(cls, im: np.ndarray) -> Self | None:
        im_h, _, _ = im.shape
        for member in DecisionCategoryChoices:
            if xywh := get_likelihood_centered_coordinates(im, member.value):
                _, y, _, h = xywh
                y0, y1 = y + h, y + 270
                writer_box = im[y0:y1]
                writer = pytesseract.image_to_string(writer_box).strip()
                return cls(
                    element=member,
                    coordinates=xywh,
                    writer=writer,
                    category_pct_height=y / im_h,
                    writer_pct_height=y1 / im_h,
                )
        return None
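
The writer is read from a horizontal strip that starts just below the category rectangle and ends 270 pixels below the rectangle's top. A minimal sketch of those bounds follows; `writer_strip` is a hypothetical helper, and clamping to the image height is an added assumption (the original slices `im[y0:y1]` directly, which NumPy truncates silently at the array edge).

```python
def writer_strip(
    xywh: tuple[int, int, int, int],
    im_h: int,
    offset: int = 270,
) -> tuple[int, int]:
    """Return the (y0, y1) row range scanned for the writer's name."""
    _, y, _, h = xywh
    y0 = y + h                  # first row below the category text
    y1 = min(y + offset, im_h)  # fixed offset from the top of the category box
    if y1 <= y0:
        raise ValueError("category box leaves no room for a writer strip")
    return y0, y1


# Hypothetical category box at y=400 with height 60 on a 2000px-tall image.
y0, y1 = writer_strip((700, 400, 300, 60), im_h=2000)
```

In these terms, `writer_pct_height` corresponds to `y1 / im_h`.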

Notice

Bases: NamedTuple

When present, signifies issuance by authority of the Court.

Field | Type | Description
--:|:--:|:--
`element` | NoticeChoices | Only a single choice (for now)
`coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the notice is found
`position_pct_height` | float | The `y` + height `h` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.

Source code in corpus_unpdf/_markers.py
Python
class PositionNotice(NamedTuple):
    """When present, signifies issuance by authority of the Court.

    Field | Type | Description
    --:|:--:|:--
    `element` | NoticeChoices | Only a single choice (for now)
    `coordinates` | tuple[int, int, int, int] | The opencv rectangle found in the page where the notice is found
    `position_pct_height` | float | The `y` + height `h` of the `coordinates` over the `im_h` image height; used so the pdfplumber can utilize its cropping mechanism.
    """  # noqa: E501

    element: NoticeChoices
    coordinates: tuple[int, int, int, int]
    position_pct_height: float

    @classmethod
    def extract(cls, im: np.ndarray) -> Self | None:
        im_h, _, _ = im.shape
        for member in NoticeChoices:
            if xywh := get_likelihood_centered_coordinates(im, member.value):
                y, h = xywh[1], xywh[3]
                return cls(
                    element=member,
                    coordinates=xywh,
                    position_pct_height=(y + h) / im_h,
                )
        return None