Measurements

Because pdfplumber and opencv report coordinates in different units, keep track of which unit each measurement uses.

Unit

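| Library | Unit | Description | Maximum |
| --- | --- | --- | --- |
| pdfplumber | point | PDF unit | page.width and page.height give the size of the page |
| opencv | pixel | Graphical unit | im.shape returns a tuple of image dimensions (height, width, and channels for color images) |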

Warning

To convert image pixels to page points, first express the pixel coordinate as a ratio of the image's width or height, then multiply that ratio by the page's width or height.

Python
>>> import cv2
>>> from corpus_unpdf.src.common import get_contours  # custom shortcut function
>>> im_h, im_w, im_d = im.shape  # height, width, channels of a color image, in pixels
>>> test = next(cv2.boundingRect(c) for c in get_contours(im, (50, 10)))
>>> x, y, w, h = test  # see Slicing opencv below
>>> ratio = y / im_h  # `y` over `im_h` gives the vertical position as a ratio
>>> page_point = ratio * page.height  # equivalent vertical position in PDF points

See related discussion.
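
The conversion above can be wrapped in a small helper. This is only a sketch: the function name `pixels_to_points` and the sample numbers are hypothetical, and it assumes the image is a render of the whole page, so that the same ratio applies in both units.

```python
def pixels_to_points(pixel_value, image_extent, page_extent):
    """Convert one pixel coordinate to PDF points along a single axis.

    image_extent is the image's height or width in pixels;
    page_extent is the matching page height or width in points.
    """
    ratio = pixel_value / image_extent  # position as a fraction of the image
    return ratio * page_extent          # same fraction of the page, in points


# Hypothetical values: y = 550 px in a 2200 px tall image, on a 792 pt tall page.
print(pixels_to_points(550, 2200, 792))  # 198.0
```

Pass the image height with the page height for y-coordinates, and the image width with the page width for x-coordinates.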

Boxes

Slicing opencv

Rectangles for opencv

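| Reference | Expectation | Format | Unit |
| --- | --- | --- | --- |
| cv2.boundingRect() | Returns a tuple of four values | (x, y, w, h) | pixels |

| Fields | Meaning |
| --- | --- |
| x | x-coordinate of the top-left corner |
| y | y-coordinate of the top-left corner |
| w | width |
| h | height |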
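
As an illustration of how such a rectangle can be used for slicing, the sketch below turns an (x, y, w, h) tuple into a pair of slices. The helper name `rect_to_slices` is hypothetical and not part of opencv.

```python
def rect_to_slices(rect):
    """Turn an (x, y, w, h) rectangle into (row_slice, column_slice)."""
    x, y, w, h = rect
    # Rows run along the y-axis and columns along the x-axis,
    # so the y range comes first when indexing an image array.
    return slice(y, y + h), slice(x, x + w)


rows, cols = rect_to_slices((10, 20, 30, 40))
print(rows, cols)  # slice(20, 60, None) slice(10, 40, None)
```

With an image held as a NumPy array `im`, `im[rows, cols]` would then select the region inside the rectangle.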

Slicing pdfplumber

Rectangles for pdfplumber

| Reference | Expectation | Format | Unit |
| --- | --- | --- | --- |
| pdfplumber._typing.T_bbox | A tuple of four coordinates | (x0, y0, x1, y1) | points |

| Fields | Meaning |
| --- | --- |
| x0 | left-most point in x-axis |
| x1 | right-most point in x-axis |
| y0 | top-most point in y-axis |
| y1 | bottom-most point in y-axis |
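
Putting the two formats together, an opencv rectangle in pixels can be mapped to a pdfplumber-style bounding box in points. The sketch below is a hypothetical helper, not part of either library. It assumes the image is a render of the entire page and that both coordinate systems start at the top-left corner.

```python
def rect_to_bbox(rect, im_w, im_h, page_width, page_height):
    """Map an (x, y, w, h) pixel rectangle to an (x0, y0, x1, y1) bbox in points."""
    x, y, w, h = rect
    # Each edge becomes a ratio of the image size, then is scaled to the page size.
    x0 = (x / im_w) * page_width
    y0 = (y / im_h) * page_height
    x1 = ((x + w) / im_w) * page_width
    y1 = ((y + h) / im_h) * page_height
    return (x0, y0, x1, y1)


# Hypothetical values: a 1224 x 1584 px image of a 612 x 792 pt page.
bbox = rect_to_bbox((306, 396, 612, 792), im_w=1224, im_h=1584,
                    page_width=612, page_height=792)
print(bbox)  # (153.0, 198.0, 459.0, 594.0)
```

The resulting tuple has the (x0, y0, x1, y1) shape described above, so it should be usable wherever pdfplumber expects a bounding box, for example `page.crop(bbox)`.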