Slicing
PageCut
Bases: NamedTuple
Fields:
field | type | description |
---|---|---|
page | pdfplumber.page.Page | The page to cut |
x0 | float or int | The x-coordinate where the slice starts |
x1 | float or int | The x-coordinate where the slice ends |
y0 | float or int | The y-coordinate where the slice starts |
y1 | float or int | The y-coordinate where the slice ends |
When the above fields are populated, the @slice property describes the area of the page from which text will be extracted.
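To illustrate how four coordinate fields can describe a crop area, here is a minimal sketch. `BoxCut` is a hypothetical stand-in, not the library's `PageCut`: it leaves out the pdfplumber page and only shows how the fields could be arranged into the `(x0, top, x1, bottom)` bounding box that pdfplumber's `Page.crop` accepts. Whether `PageCut.slice` builds its crop this way is an assumption.

```python
from typing import NamedTuple


class BoxCut(NamedTuple):
    """Hypothetical stand-in for PageCut that omits the pdfplumber page."""

    x0: float
    x1: float
    y0: float
    y1: float

    @property
    def bbox(self) -> tuple[float, float, float, float]:
        # pdfplumber's Page.crop takes a box ordered as (x0, top, x1, bottom).
        return (self.x0, self.y0, self.x1, self.y1)

    @property
    def height(self) -> float:
        # Vertical extent of the cut, in the same units as the fields (points).
        return self.y1 - self.y0


cut = BoxCut(x0=100, x1=200, y0=100, y1=200)
print(cut.bbox)
print(cut.height)
```

With a real page, the resulting box could be passed along as `page.crop(cut.bbox)`.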
Source code in src/start_ocr/slice.py
Attributes
slice: CroppedPage
property
Unlike slicing an image backed by an np.ndarray, a page cut implies a page derived from pdfplumber. The former is measured in pixels; the latter in points.
Examples:
>>> from pathlib import Path
>>> from start_ocr.fetch import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0) # page 1
>>> page.height
936
>>> cutpage = PageCut(page=page, x0=100, x1=200, y0=100, y1=200).slice
>>> cutpage.height
100
>>> page.pdf.close()
Returns:
Name | Type | Description |
---|---|---|
CroppedPage | CroppedPage | The cropped page from which to extract text. |
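The pixel versus point distinction noted for this property can be made concrete with a small conversion sketch. It relies only on the fact that one PDF point is 1/72 of an inch; the function names and the `resolution` parameter are hypothetical and not part of start_ocr.

```python
POINTS_PER_INCH = 72  # one PDF point is 1/72 of an inch


def points_to_pixels(points: float, resolution: int = 72) -> float:
    """Convert a length in PDF points to pixels at the given resolution (DPI)."""
    return points * resolution / POINTS_PER_INCH


def pixels_to_points(pixels: float, resolution: int = 72) -> float:
    """Convert a length in pixels at the given resolution (DPI) back to points."""
    return pixels * POINTS_PER_INCH / resolution


# A page 936 points tall, rendered at 300 DPI:
print(points_to_pixels(936, resolution=300))
```

At 72 DPI the two units coincide, so coordinates found on an image rendered at any other resolution would need to be rescaled before being used in a page cut.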
Functions
set(page, y0, y1)
classmethod
Supply the page so that its width can be used to apply preset, uniform margins on the x-axis. The y0 and y1 fields determine where the page is sliced vertically.
Examples:
>>> import pdfplumber
>>> from pathlib import Path
>>> from start_ocr.fetch import get_img_from_page
>>> pdf = pdfplumber.open(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf")
>>> page = pdf.pages[1] # page 2
>>> im = get_img_from_page(page)
>>> crop = PageCut.set(page, y0=0, y1=page.height * 0.1)
>>> crop.extract_text()
'ALorem IpsumDocument 2 June1,2023'
>>> pdf.close()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page | Page | pdfplumber Page object | required |
y0 | float or int | Top y-axis | required |
y1 | float or int | Bottom y-axis | required |
Returns:
Name | Type | Description |
---|---|---|
CroppedPage | CroppedPage | The cropped page from which to extract text. |
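The idea of deriving the horizontal bounds from the page width and a uniform margin might look like the sketch below. The function name and the default margin of 50 points are hypothetical; the preset value that `PageCut.set` actually uses is not stated in this page.

```python
def horizontal_bounds(page_width: float, margin: float = 50) -> tuple[float, float]:
    """Return (x0, x1) after trimming the same margin from both sides.

    The default margin is an assumed value chosen for illustration only.
    """
    if margin < 0 or 2 * margin >= page_width:
        raise ValueError("margin must be non-negative and leave some width")
    return (margin, page_width - margin)


# A US Letter page is 612 points wide.
x0, x1 = horizontal_bounds(612)
print(x0, x1)
```

Only y0 and y1 then need to be supplied by the caller, which matches the shape of the `set(page, y0, y1)` signature.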
get_likelihood_centered_coordinates(im, text_to_match)
Given an image im, find all contours in the center of the image and, for those containing text that resembles text_to_match, extract the coordinates.
Examples:
>>> from pathlib import Path
>>> from start_ocr.fetch import get_page_and_img
>>> page, im = get_page_and_img(Path().cwd() / "tests" / "data" / "lorem_ipsum.pdf", 0)
>>> get_likelihood_centered_coordinates(im, 'Decision') # None found
>>> res = get_likelihood_centered_coordinates(im, 'Memo')
>>> isinstance(res, tuple)
True
>>> page.pdf.close()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
im | ndarray | The base image to look for text | required |
text_to_match | str | The words that should match | required |
Returns:
Type | Description |
---|---|
tuple[int, int, int, int] or None | (x, y, w, h) pixels representing |
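One way to decide whether a contour's bounding box sits "in the center" of an image is to compare horizontal midpoints, as sketched below. This is an assumption about what centering means here; the function name and the `tolerance` parameter are hypothetical, and the library's own test may differ.

```python
def is_horizontally_centered(
    box: tuple[int, int, int, int],
    image_width: int,
    tolerance: float = 0.1,
) -> bool:
    """Check whether an (x, y, w, h) box is centered along the x-axis.

    The box counts as centered when its midpoint lies within
    `tolerance * image_width` pixels of the image's midpoint.
    """
    x, _, w, _ = box
    box_center = x + w / 2
    image_center = image_width / 2
    return abs(box_center - image_center) <= tolerance * image_width


print(is_horizontally_centered((450, 10, 100, 30), image_width=1000))
print(is_horizontally_centered((0, 10, 100, 30), image_width=1000))
```

Boxes that pass a check like this would then be the candidates whose text is compared against text_to_match.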
is_match_text(sliced_im, text_to_match, likelihood=0.7)
Test whether the text in the image sliced_im resembles text_to_match to the degree given by the likelihood percentage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sliced_im | ndarray | Slice of a larger image containing text | required |
text_to_match | str | How to match the text slice in | required |
likelihood | float | Allowed percentage expressed in decimals | 0.7 |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | Whether or not the |
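The comparison of recognized text against a target with a likelihood threshold can be sketched with the standard library alone. This is a stand-in, not the library's implementation: it assumes the text has already been read from the image, treats `likelihood` as a minimum similarity, and uses `difflib.SequenceMatcher`, which may not be what `is_match_text` uses. The name `is_similar_text` is hypothetical.

```python
from difflib import SequenceMatcher


def is_similar_text(candidate: str, text_to_match: str, likelihood: float = 0.7) -> bool:
    """Return True when the two strings are at least `likelihood` similar.

    Similarity is the SequenceMatcher ratio (0.0 to 1.0) computed on
    lower-cased, stripped strings, so case and outer whitespace are ignored.
    """
    a = candidate.strip().lower()
    b = text_to_match.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= likelihood


print(is_similar_text("Mem0", "Memo"))      # one misread character
print(is_similar_text("Decision", "Memo"))  # unrelated words
```

A threshold below 1.0 leaves room for small recognition errors, such as a letter read as a digit, while still rejecting unrelated text.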