pagemeta docs
InspectedURL
InspectedURL is a dataclass that extracts relevant metadata via SiteHeaders and PageMeta.
Source code in pagemeta/main.py
SiteHeaders
SiteHeaders is a dataclass that takes an httpx.Response and extracts relevant metadata from its headers (last-modified, content-type, etc.).
Source code in pagemeta/headers.py
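As a rough illustration of the kind of header parsing described for SiteHeaders, the sketch below reads the last-modified and content-type values from a header mapping. It is not the library's implementation: the helper names `parse_last_modified` and `parse_content_type` are hypothetical, and a plain dict with lower-case keys stands in for the headers of an httpx.Response (which also behave as a string mapping).

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified value as a datetime, or None if absent or malformed."""
    raw = headers.get("last-modified")
    if raw is None:
        return None
    try:
        # HTTP dates use the RFC 5322 style format handled by email.utils.
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None


def parse_content_type(headers: Mapping[str, str]) -> Optional[str]:
    """Return the media type without parameters such as charset."""
    raw = headers.get("content-type")
    if raw is None:
        return None
    return raw.split(";", 1)[0].strip().lower()


# Hypothetical header values, standing in for a real response.
sample = {
    "last-modified": "Wed, 21 Oct 2015 07:28:00 GMT",
    "content-type": "text/html; charset=UTF-8",
}
modified = parse_last_modified(sample)
media_type = parse_content_type(sample)
```

Returning None for missing or malformed headers keeps the sketch in line with the None defaults used elsewhere in these dataclasses; whether SiteHeaders behaves the same way is an assumption.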
PageMeta
PageMeta is a dataclass that fetches a given URL with httpx.get and parses its metadata (title, description, Open Graph image) with BeautifulSoup.
It extracts generic website metadata based on a URL fetched on a certain date.
All of the fields, except the date, default to None.
Field | Type | Description |
---|---|---|
title | str | First matching title parsed from <meta> CSS selectors (and the <title> tag) |
description | str | First matching description parsed from <meta> CSS selectors |
author | str | The author or, if the author is absent, the creator |
image | str | URL of the detected Open Graph (OG) image |
category | str | Type detected from the OG ("og:type") value |
Source code in pagemeta/meta.py
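The field layout in the table above can be pictured with a minimal dataclass. This is a sketch and not the code in pagemeta/meta.py: the class name `PageMetaSketch` and the date field name `fetched_on` are hypothetical, since the page does not name the date field.

```python
import datetime
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageMetaSketch:
    """Hypothetical stand-in for PageMeta; metadata fields follow the table."""

    # Assumed name for the fetch date, the only field without a None default.
    fetched_on: datetime.date = field(default_factory=datetime.date.today)
    title: Optional[str] = None
    description: Optional[str] = None
    author: Optional[str] = None
    image: Optional[str] = None
    category: Optional[str] = None


# A page with no detectable metadata leaves every field except the date as None.
empty = PageMetaSketch()
# A page that only exposed a title and an og:type value.
partial = PageMetaSketch(title="Hello World", category="article")
```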
Functions
select(soup, selectors)
classmethod
The order of the CSS selectors matters: the first selector that matches supplies the content, if any is found.
The selectors currently used to extract content are:
TITLE = (
'meta[name="twitter:title"]',
'meta[property="og:title"]',
"title",
)
DESC = (
'meta[name="twitter:description"]',
'meta[property="og:description"]',
'meta[name="description"]',
)
IMG = (
'meta[name="twitter:image"]',
'meta[property="og:image"]',
)
AUTHOR = (
'meta[name="author"]',
'meta[name="twitter:creator"]',
)
TYPE = ('meta[property="og:type"]',)
Note the special rule for title as a selector: it targets the <title> tag rather than a <meta> tag.
Examples:
>>> from pathlib import Path
>>> html = Path(__file__).parent.parent / "tests" / "data" / "test.html"
>>> soup = BeautifulSoup(html.read_text(), "html.parser")
>>> PageMeta.select(soup, TITLE)
'Hello World From Twitter Title!'
>>> PageMeta.select(soup, DESC)
'this is a description from twitter:desc'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
soup | BeautifulSoup | HTML content converted into a soup object | required |
selectors | Iterable[str] | CSS selectors as a tuple | required |
Returns:
Type | Description |
---|---|
str \| None | The text value if found, otherwise None. |
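The first-match rule described for select can be sketched roughly as follows. This is an illustration under stated assumptions, not the method's actual source: `select_first` is a hypothetical name, and the choices to read `<meta>` values from the `content` attribute, to read `<title>` from its element text, and to skip empty values are all assumptions.

```python
from typing import Iterable, Optional

from bs4 import BeautifulSoup

TITLE = (
    'meta[name="twitter:title"]',
    'meta[property="og:title"]',
    "title",
)


def select_first(soup: BeautifulSoup, selectors: Iterable[str]) -> Optional[str]:
    """Return the value for the first selector that yields non-empty text."""
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is None:
            continue
        # Assumption: <meta> tags carry their value in the content attribute,
        # while <title> carries it as element text.
        value = tag.get_text() if tag.name == "title" else tag.get("content")
        if value and value.strip():
            return value.strip()
    return None


# Hypothetical page with no twitter:title, so og:title wins over <title>.
html = """
<html><head>
  <title>Plain title</title>
  <meta property="og:title" content="OG title">
</head></html>
"""
soup = BeautifulSoup(html, "html.parser")
result = select_first(soup, TITLE)
```

Because the tuple is walked in order, placing the more specific social-card selectors before the bare title selector makes them take precedence whenever they are present.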