Content

Metadata of a single content page.

Field | Description
--:|:--
`page_num` | Page number
`body` | Main content above an annex, if existing
`segments` | Segments of the `body`'s text in the given `page_num`
`annex` | Portion of page containing the footnotes; some pages are annex-free
`footnotes` | Each footnote item in the `annex`'s text in the given `page_num`
Source code in src/start_ocr/content.py
Python
@dataclass
class Content:
    """Metadata of a single content page.

    Field | Description
    --:|:--
    `page_num` | Page number
    `body` | Main content above an annex, if existing
    `segments` | Segments of the `body`'s text in the given `page_num`
    `annex` | Portion of page containing the footnotes; some pages are annex-free
    `footnotes` | Each footnote item in the `annex`'s text in the given `page_num`
    """  # noqa: E501

    page_num: int
    body: CroppedPage
    body_text: str
    annex: CroppedPage | None = None
    annex_text: str | None = None
    segments: list[Bodyline] = field(default_factory=list)
    footnotes: list[Footnote] = field(default_factory=list)

    def __post_init__(self):
        alpha = paragraph_break.split(self.body_text)
        beta = self.body_text.split("\n\n")
        candidates = alpha or beta
        self.segments = Bodyline.split(candidates, self.page_num)
        if self.annex and self.annex_text:
            self.footnotes = Footnote.extract_notes(self.annex_text, self.page_num)

    def __repr__(self) -> str:
        return f"<Content Page: {self.page_num}>"

    @classmethod
    def set(
        cls,
        page: Page,
        start_y: float | int | None = None,
        end_y: float | int | None = None,
    ) -> Self:
        """
        A `header_line` (related to `start_y`) and a `page_line` (related to `end_y`) are used as local variables in this function.

        The `header_line` is the imaginary line at the top of the page. If `start_y` is supplied, the `header_line` is not calculated.

        The `page_line` is the imaginary line at the bottom of the page. If `end_y` is supplied, it replaces the calculated `page_line`.

        Together, the `header_line` and the `page_line` determine what to extract as content from a given `page`.

        Args:
            page (Page): The pdfplumber page to evaluate
            start_y (float | int | None, optional): If present, the y-axis point at which content starts on the starter page; used instead of a calculated `header_line`. Defaults to None.
            end_y (float | int | None, optional): If present, the y-axis point at which content ends on the ender page; used instead of the calculated `page_line`. Defaults to None.

        Returns:
            Self: Page with individual components mapped out.
        """  # noqa: E501
        im = get_img_from_page(page)

        header_line = start_y or get_header_line(im, page)
        if not header_line:
            raise Exception(f"No header line in {page.page_number=}")

        end_of_content, e = get_page_end(im, page)
        page_line = end_y or end_of_content

        body = PageCut.set(page=page, y0=header_line, y1=page_line)
        body_text = paged_text(body) or imaged_text(body)
        annex = None
        annex_text = None

        if e:
            annex = PageCut.set(page=page, y0=end_of_content, y1=e)
            annex_text = paged_text(annex) or imaged_text(annex)

        return cls(
            page_num=get_page_num(page, header_line),
            body=body,
            body_text=body_text,
            annex=annex,
            annex_text=annex_text,
        )
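
The paragraph splitting in `__post_init__` can be tried in isolation. The sketch below is illustrative only: the real `paragraph_break` pattern is not shown on this page, so the regex here is an assumed stand-in, and `candidate_segments` is a hypothetical helper name that mirrors the `alpha or beta` selection.

```python
import re

# Assumption: the real `paragraph_break` pattern is defined elsewhere in the
# library and is not shown here; this stand-in treats a blank line (possibly
# holding spaces or tabs) as a paragraph break.
paragraph_break = re.compile(r"\n[ \t]*\n")


def candidate_segments(body_text: str) -> list[str]:
    """Pick paragraph candidates the way `__post_init__` does."""
    alpha = paragraph_break.split(body_text)  # regex-based split
    beta = body_text.split("\n\n")  # plain fallback split
    return alpha or beta  # `beta` is used only if `alpha` is an empty list


segments = candidate_segments("First paragraph.\n \nSecond paragraph.")
print(segments)
```

If `paragraph_break` is a compiled `re` pattern, its `split` returns at least one element even for empty text, so the `beta` fallback would only matter with a different kind of splitter.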

Functions

set(page, start_y=None, end_y=None) classmethod

A `header_line` (related to `start_y`) and a `page_line` (related to `end_y`) are used as local variables in this function.

The `header_line` is the imaginary line at the top of the page. If `start_y` is supplied, the `header_line` is not calculated.

The `page_line` is the imaginary line at the bottom of the page. If `end_y` is supplied, it replaces the calculated `page_line`.

Together, the `header_line` and the `page_line` determine what to extract as content from a given `page`.
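
The override rule described above comes down to the `value or fallback` idiom used for both lines. A minimal sketch, assuming nothing beyond that idiom (`resolve_line` and `resolve_line_strict` are hypothetical names, not part of the library):

```python
from typing import Optional


def resolve_line(override: Optional[float], detected: Optional[float]) -> Optional[float]:
    """Mirror `override or detected`: a supplied value wins over the detected one."""
    # Note: `or` tests truthiness, so an override of 0 is treated as missing.
    return override or detected


def resolve_line_strict(override: Optional[float], detected: Optional[float]) -> Optional[float]:
    """Alternative that only falls back when the override is None."""
    return override if override is not None else detected


header_line = resolve_line(120.5, 80.0)  # explicit start_y is used
page_line = resolve_line(None, 700.0)  # no end_y, detected value is used
```

The two helpers differ only for an override of `0`: the first falls back to the detected value, the second keeps the `0`.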

Parameters:

Name | Type | Description | Default
:--|:--|:--|:--
`page` | `Page` | The pdfplumber page to evaluate | required
`start_y` | `float \| int \| None` | If present, the y-axis point at which content starts on the starter page; used instead of a calculated `header_line`. | `None`
`end_y` | `float \| int \| None` | If present, the y-axis point at which content ends on the ender page; used instead of the calculated `page_line`. | `None`

Returns:

Name | Type | Description
:--|:--|:--
`Self` | `Self` | Page with individual components mapped out.
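
How the returned components relate to the y values can be sketched without pdfplumber. This is a simplified model, not the library's code: `Band` is a hypothetical stand-in for a cropped page region, and `split_page` only reproduces the arithmetic of `set` (the body runs from `header_line` to `page_line`; the annex, when detected, starts at the detected end of content).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Band:
    """Hypothetical stand-in for a cropped page region between two y values."""

    y0: float
    y1: float


def split_page(
    header_line: float,
    end_of_content: float,
    annex_end: Optional[float] = None,
    end_y: Optional[float] = None,
) -> tuple[Band, Optional[Band]]:
    # An explicit end_y replaces the detected end of the body.
    page_line = end_y or end_of_content
    body = Band(header_line, page_line)
    # The annex exists only when an annex end was detected; it always starts
    # at the detected end of content, even when end_y overrides the body.
    annex = Band(end_of_content, annex_end) if annex_end else None
    return body, annex


body, annex = split_page(header_line=50, end_of_content=600, annex_end=750)
```

A page without footnotes corresponds to `annex_end=None`, in which case the second element is `None`, matching the `annex: CroppedPage | None` field.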
