Measurements

Because pdfplumber and opencv report coordinates in different units, check which unit a measurement uses before combining values from the two libraries.

Unit

Library	Unit	Description	Maximum
pdfplumber	point	PDF unit (1/72 inch)	`page.width` and `page.height` give the size of the page
opencv	pixel	Graphical unit	`im.shape` returns a tuple of image dimensions, with height first

Warning

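To convert an image's pixel coordinate to page points, first compute the ratio of that coordinate to the image's full dimension (for example `y / im_h`), then multiply the ratio by the page's corresponding maximum (`page.height` or `page.width`). This assumes the image is a rendering of the whole page.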

Python

>>> import cv2
>>> from corpus_unpdf.src.common import get_contours  # shortcut custom function
>>> im_h, im_w, im_d = im.shape  # height, width, channels; im_h is the maximum image height
>>> test = next(cv2.boundingRect(c) for c in get_contours(im, (50, 10)))  # first bounding box
>>> x, y, w, h = test  # see Slicing below
>>> ratio = y / im_h  # `y` coordinate over `im_h` gives a pixel-based ratio
>>> page_point = ratio * page.height  # equivalent point in PDF page

See related discussion.
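
The conversion described in this warning can be wrapped in a small helper. The following is a minimal sketch, assuming the image is a rendering of the whole page; `px_to_pt` and the sample numbers are hypothetical and not part of pdfplumber or opencv.

```python
def px_to_pt(value_px: float, image_extent_px: float, page_extent_pt: float) -> float:
    """Scale a pixel coordinate to PDF points along a single axis."""
    if image_extent_px <= 0:
        raise ValueError("image_extent_px must be positive")
    ratio = value_px / image_extent_px  # fraction of the image covered
    return ratio * page_extent_pt  # the same fraction of the page


# Hypothetical numbers: a 2200 px tall image of a 792 pt tall page.
y_pt = px_to_pt(550, 2200, 792)
```

With real objects, the call would look like `px_to_pt(y, im_h, page.height)` for vertical values and `px_to_pt(x, im_w, page.width)` for horizontal ones.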

Boxes

Slicing opencv

Rectangles for opencv

Reference	Expectation	Format	Unit
cv2.boundingRect()	Returns a tuple of four integers	(`x`,`y`,`w`,`h`)	pixels

Fields	Meaning
`x`	x-coordinate of the rectangle's top-left corner
`y`	y-coordinate of the rectangle's top-left corner
`w`	width
`h`	height
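
An opencv image is a NumPy array indexed by rows first and columns second, so a rectangle maps to the slice `im[y:y+h, x:x+w]`. The sketch below builds those slices with a hypothetical helper, `rect_to_slices`, and uses a nested list as a stand-in for an image so that it runs without any dependencies.

```python
from typing import Tuple


def rect_to_slices(x: int, y: int, w: int, h: int) -> Tuple[slice, slice]:
    """Return (row_slice, column_slice) for an (x, y, w, h) rectangle."""
    return slice(y, y + h), slice(x, x + w)


# Stand-in for an image: 4 rows by 5 columns, each cell holds (row, col).
grid = [[(r, c) for c in range(5)] for r in range(4)]

rows, cols = rect_to_slices(x=1, y=2, w=3, h=2)
crop = [line[cols] for line in grid[rows]]  # with a NumPy image: im[rows, cols]
```

Note the order: `y` and `h` select rows and come first, while `x` and `w` select columns and come second.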

Slicing pdfplumber

Rectangles for pdfplumber

Reference	Expectation	Format	Unit
pdfplumber._typing.T_bbox	A tuple of four coordinates	(`x0`, `y0`, `x1`, `y1`)	points

Fields	Meaning
`x0`	left-most point in x-axis
`y0`	top-most point in y-axis
`x1`	right-most point in x-axis
`y1`	bottom-most point in y-axis
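
The two rectangle formats can be bridged by turning an opencv `(x, y, w, h)` rectangle into corner coordinates and scaling each axis from pixels to points. This is a sketch only, assuming the image is a rendering of the whole page; `rect_to_bbox` is a hypothetical helper and the sizes are illustrative.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (x, y, w, h) in pixels
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def rect_to_bbox(
    rect: Rect,
    image_size: Tuple[int, int],  # (im_w, im_h) in pixels
    page_size: Tuple[float, float],  # (page.width, page.height) in points
) -> BBox:
    """Convert a pixel rectangle to a point-based bounding box."""
    x, y, w, h = rect
    im_w, im_h = image_size
    page_w, page_h = page_size
    scale_x = page_w / im_w  # points per pixel, horizontally
    scale_y = page_h / im_h  # points per pixel, vertically
    return (x * scale_x, y * scale_y, (x + w) * scale_x, (y + h) * scale_y)


# Hypothetical sizes: a 1224 x 1584 px image of a 612 x 792 pt page.
bbox = rect_to_bbox((100, 200, 300, 50), (1224, 1584), (612, 792))
```

A tuple of this shape is what pdfplumber's `page.crop()` takes as its bounding box, so `page.crop(bbox)` would be the natural next step.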