corpus-pax Docs
Using a sqlpyd-fashioned database, create tables for generic users, organizations, and articles, sourcing the data from Github repositories.
```mermaid
flowchart TB
  subgraph dev env
    pax[corpus-pax]
    pax--run setup_pax--->db[(sqlite.db)]
  end
  subgraph /corpus-entities
    1(members)--github api---pax
    2(orgs)--github api---pax
  end
  subgraph /lawsql-articles
    3(articles)--github api---pax
  end
  pax--cloudflare api-->cf(cloudflare images)
```
Each corpus entity in the /corpus-entities repository will contain 2 files, a details.yaml and an avatar.jpeg, organized according to the following structure:
```yaml
/<gh-repo> # github repository
  /members
    /<id-of-individual-1>
      - details.yaml
      - avatar.jpeg
  /orgs
    /<id-of-org-1>
      - details.yaml
      - avatar.jpeg
```
The details.yaml file should contain the key-value pairs for the individual or org represented by the id.
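A minimal details.yaml might look like the following. The field names mirror the `RegisteredMember` validator documented below; the values are purely illustrative:

```yaml
email: jane@example.com
display_name: Jane Doe
caption: Lawyer and Programmer
description: Example profile used to illustrate the expected key-value pairs.
twitter: janedoe
github: janedoe
```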
Rationale
Why Github
The names and profiles of individuals and organizations are stored in Github. These are pulled into the application via an API call requiring the use of a personal access token.
Why Cloudflare Images
Individuals and organizations have images stored in Github. To persist and optimize images for the web, I use Cloudflare Images to take advantage of modern image formats and customizable variants.
Why sqlite
The initial data is simple. This database however will be the foundation for a more complicated schema. Sqlite seems a better fit for experimentation and future app use (Android and iOS rely on sqlite).
Run
```python
>>> from corpus_pax import setup_pax
>>> setup_pax("x.db")
```
Since it's hard to correct the m2m tables, setup_pax() drops all the tables first before adding content. setup_pax() is a collection of 3 functions:
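The drop-then-recreate pattern can be sketched with the standard library alone. `setup_pax` itself works through sqlpyd, so the table name and helper function below are illustrative stand-ins, not the package's actual internals:

```python
import sqlite3


def reset_table(conn: sqlite3.Connection, name: str, ddl: str) -> None:
    """Drop a table if it exists, then recreate it from its DDL.

    Dropping first avoids stale rows in m2m (link) tables that are
    hard to reconcile after upstream records change.
    """
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(ddl)


conn = sqlite3.connect(":memory:")
ddl = "CREATE TABLE pax_articles (id TEXT PRIMARY KEY)"
reset_table(conn, "pax_articles", ddl)
conn.execute("INSERT INTO pax_articles VALUES ('hello-world')")

# Running the reset again wipes previously loaded content:
reset_table(conn, "pax_articles", ddl)
count = conn.execute("SELECT COUNT(*) FROM pax_articles").fetchone()[0]
```

Every run therefore starts from an empty slate, which is why re-running setup_pax() is safe but also why the database should not hold manual edits.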
Add individuals
Add/replace records of individuals from an API call.
Source code in corpus_pax/__main__.py
```python
def add_individuals_from_api(c: Connection, replace_img: bool = False):
    """Add/replace records of individuals from an API call."""
    for entity_individual in Individual.list_members_repo():
        Individual.make_or_replace(c, entity_individual["url"], replace_img)
```
Add organizations
Add/replace records of organizations from an API call.
Source code in corpus_pax/__main__.py
```python
def add_organizations_from_api(c: Connection, replace_img: bool = False):
    """Add/replace records of organizations from an API call."""
    for entity_org in Org.list_orgs_repo():
        Org.make_or_replace(c, entity_org["url"], replace_img)
```
Add articles
Add/replace records of articles from an API call.
Source code in corpus_pax/__main__.py
```python
def add_articles_from_api(c: Connection):
    """Add/replace records of articles from an API call."""
    for extracted_data in Article.extract_articles():
        Article.make_or_replace(c, extracted_data)
```
Prerequisites
Repositories
Different repositories involved:
| repository | status | type | purpose |
| --- | --- | --- | --- |
| lawsql-articles | private | data source | used by corpus-pax; markdown-styled articles with frontmatter |
| corpus-entities | private | data source | used by corpus-pax; yaml-formatted member and org files |
| corpus-pax | public | sqlite i/o | functions to create pax-related tables |
Since data concerning members will be pulled from these repositories, make sure the individual / org fields in resources match the data pulled from corpus-entities. Each avatar image should be named avatar.jpeg so that it can be uploaded to Cloudflare.
.env
Create an .env file to create/populate the database. See env.example, which highlights the following variables:

```text
CF_ACCT_ID=op://dev/cloudflare/acct_id
CF_IMG_TOKEN=op://dev/cloudflare/images/token
CF_IMG_HASH=op://dev/cloudflare/images/hash
GH_TOKEN=op://dev/gh/pat-public/token
```
Note the workflow (main.yml) where the secrets are included for Github Actions. Ensure these are set in the repository's <url-to-repo>/settings/secrets/actions, making the proper replacements when the tokens for Cloudflare and Github expire.
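A fail-fast check for these variables can be sketched with the standard library; the helper below is a hypothetical convenience, not part of corpus-pax itself:

```python
import os

# The four variables env.example highlights:
REQUIRED_VARS = ("CF_ACCT_ID", "CF_IMG_TOKEN", "CF_IMG_HASH", "GH_TOKEN")


def check_env(required: tuple[str, ...] = REQUIRED_VARS) -> list[str]:
    """Return the names of required variables missing from the environment."""
    return [name for name in required if not os.getenv(name)]


missing = check_env()
if missing:
    print(f"Set these before running setup_pax: {missing}")
```

Running this before setup_pax() surfaces a missing token immediately instead of failing mid-way through the Github or Cloudflare calls.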
Articles
Bases: TableConfig
Source code in corpus_pax/articles.py
```python
class Article(TableConfig):
    __prefix__ = "pax"
    __tablename__ = "articles"

    url: HttpUrl = Field(col=str)
    id: str = Field(col=str)
    title: str = Field(col=str, fts=True)
    description: str = Field(col=str, fts=True)
    date: datetime.date = Field(..., col=datetime.date, index=True)
    created: float = Field(col=float)
    modified: float = Field(col=float)
    content: str = Field(col=str, fts=True)
    tags: list[str] = Field(
        default_factory=list,
        title="Subject Matter",
        description="Itemized strings, referring to the topic tag involved.",
        exclude=True,
    )
    authors: list[EmailStr] = Field(default_factory=list, exclude=True)

    @classmethod
    def extract_articles(cls):
        """Based on entries from a Github folder, ignore files
        not formatted in .md and extract the Pydantic-model;
        the model is based on the frontmatter metadata of each
        markdown article.
        """
        articles = []
        for entry in fetch_articles():
            if filename := entry.get("name"):
                if filename.endswith(".md"):
                    if url := entry.get("url"):
                        id = filename.removesuffix(".md")
                        modified = fetch_article_date_modified(filename)
                        details = cls.extract_markdown_postmatter(url)
                        article = cls(id=id, modified=modified, **details)
                        articles.append(article)
        return articles

    @classmethod
    def extract_markdown_postmatter(cls, url: str) -> dict:
        """Convert the markdown/frontmatter file fetched via url to a dict."""
        mdfile = gh.get(url)
        post = frontmatter.loads(mdfile.content)
        d = parser.parse(post["date"]).astimezone(ZoneInfo("Asia/Manila"))
        return {
            "url": url,
            "created": d.timestamp(),
            "date": d.date(),
            "title": post["title"],
            "description": post["summary"],
            "content": post.content,
            "authors": post["authors"],
            "tags": post["tags"],
        }

    @classmethod
    def make_or_replace(cls, c: Connection, extract: Any):
        tbl = c.table(cls)
        row = tbl.insert(extract.dict(), replace=True, pk="id")  # type: ignore
        if row.last_pk:
            for author_email in extract.authors:
                tbl.update(row.last_pk).m2m(
                    other_table=Individual.__tablename__,
                    lookup={"email": author_email},
                    pk="id",
                )
            for tag in extract.tags:
                tbl.update(row.last_pk).m2m(
                    other_table=Tag.__tablename__,
                    lookup=Tag(**{"tag": tag}).dict(),
                )
```
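The m2m linking that make_or_replace performs can be approximated with plain sqlite3. The table and column names below are simplified stand-ins for what sqlpyd generates, not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pax_articles (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE pax_tags (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
    -- link table: one row per (article, tag) pair
    CREATE TABLE pax_articles_pax_tags (
        article_id TEXT REFERENCES pax_articles (id),
        tag_id INTEGER REFERENCES pax_tags (id),
        PRIMARY KEY (article_id, tag_id)
    );
    """
)


def link_tag(article_id: str, tag: str) -> None:
    """Upsert the tag, then connect it to the article via the link table."""
    conn.execute("INSERT OR IGNORE INTO pax_tags (tag) VALUES (?)", (tag,))
    tag_id = conn.execute(
        "SELECT id FROM pax_tags WHERE tag = ?", (tag,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO pax_articles_pax_tags VALUES (?, ?)",
        (article_id, tag_id),
    )


conn.execute("INSERT INTO pax_articles VALUES ('hello-world', 'Hello World')")
for t in ("sqlite", "python", "sqlite"):  # the duplicate tag is ignored
    link_tag("hello-world", t)
links = conn.execute("SELECT COUNT(*) FROM pax_articles_pax_tags").fetchone()[0]
```

The composite primary key on the link table is what makes re-runs idempotent per pair; sqlpyd's `.m2m()` handles the lookup-or-create step shown in `link_tag`.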
Functions
extract_articles()
classmethod
Based on entries from a Github folder, ignore files not formatted in .md and extract the Pydantic model; the model is based on the frontmatter metadata of each markdown article.
Source code in corpus_pax/articles.py
```python
@classmethod
def extract_articles(cls):
    """Based on entries from a Github folder, ignore files
    not formatted in .md and extract the Pydantic-model;
    the model is based on the frontmatter metadata of each
    markdown article.
    """
    articles = []
    for entry in fetch_articles():
        if filename := entry.get("name"):
            if filename.endswith(".md"):
                if url := entry.get("url"):
                    id = filename.removesuffix(".md")
                    modified = fetch_article_date_modified(filename)
                    details = cls.extract_markdown_postmatter(url)
                    article = cls(id=id, modified=modified, **details)
                    articles.append(article)
    return articles
```
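The filename filtering above can be exercised on its own; the entries below only mimic the shape of a Github contents API response, with hypothetical names and urls:

```python
def pick_markdown(entries: list[dict]) -> list[tuple[str, str]]:
    """Keep only .md entries that carry a url; return (id, url) pairs,
    where the id is the filename minus its .md suffix."""
    picked = []
    for entry in entries:
        if filename := entry.get("name"):
            if filename.endswith(".md") and (url := entry.get("url")):
                picked.append((filename.removesuffix(".md"), url))
    return picked


pairs = pick_markdown(
    [
        {"name": "hello-world.md", "url": "https://example.com/hello-world.md"},
        {"name": "avatar.jpeg", "url": "https://example.com/avatar.jpeg"},
        {"name": "draft.md"},  # dropped: no url
    ]
)
```

Non-markdown files (such as the avatar images stored alongside articles) and incomplete entries are skipped silently, matching the walrus-guarded chain in `extract_articles`.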
extract_markdown_postmatter(url)
classmethod
Convert the markdown/frontmatter file fetched via url to a dict.
Source code in corpus_pax/articles.py
```python
@classmethod
def extract_markdown_postmatter(cls, url: str) -> dict:
    """Convert the markdown/frontmatter file fetched via url to a dict."""
    mdfile = gh.get(url)
    post = frontmatter.loads(mdfile.content)
    d = parser.parse(post["date"]).astimezone(ZoneInfo("Asia/Manila"))
    return {
        "url": url,
        "created": d.timestamp(),
        "date": d.date(),
        "title": post["title"],
        "description": post["summary"],
        "content": post.content,
        "authors": post["authors"],
        "tags": post["tags"],
    }
```
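The date handling above (dateutil plus ZoneInfo) can be illustrated with the standard library alone; `fromisoformat` stands in for dateutil's more permissive parser, and the assume-UTC fallback for naive dates is an assumption of this sketch, not documented corpus-pax behavior:

```python
import datetime
from zoneinfo import ZoneInfo


def to_manila(raw_date: str) -> dict:
    """Parse an ISO date string, normalize it to Asia/Manila, and return
    both the POSIX timestamp and the local calendar date."""
    d = datetime.datetime.fromisoformat(raw_date)
    if d.tzinfo is None:  # assume UTC for naive inputs (sketch-only choice)
        d = d.replace(tzinfo=datetime.timezone.utc)
    local = d.astimezone(ZoneInfo("Asia/Manila"))
    return {"created": local.timestamp(), "date": local.date()}


result = to_manila("2023-01-01T00:00:00+00:00")
# Midnight UTC is 8:00 a.m. in Manila (UTC+8), so the local date stays Jan 1.
```

Storing the timestamp (`created`) alongside the localized calendar date (`date`) is what lets the articles table sort globally while displaying Manila-local dates.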
Entities
RegisteredMember
Bases: BaseModel
Common validator for corpus entities: Individuals and Orgs.
Note that the col attribute is for use in sqlpyd.
Source code in corpus_pax/resources.py
```python
class RegisteredMember(BaseModel):
    """Common validator for corpus entities: Individuals and Orgs.
    Note that the `col` attribute is for use in `sqlpyd`."""

    id: str = Field(col=str)
    created: float = Field(col=float)
    modified: float = Field(col=float)
    search_rank: RankStatus | None = Field(
        RankStatus.Ordinary,
        title="Search Rank",
        description="Can use as a means to determine rank in SERP",
        col=int,
    )
    email: EmailStr = Field(col=str)
    img_id: str | None = Field(
        None,
        title="Cloudflare Image ID",
        description=(
            "Based on email, upload a unique avatar that can be called via"
            " Cloudflare Images."
        ),
        col=str,
    )
    display_url: HttpUrl | None = Field(
        title="Associated URL",
        description=(
            "When visiting the profile of the member, what URL is associated"
            " with the latter?"
        ),
        col=str,
    )
    display_name: str = Field(
        ...,
        title="Display Name",
        description="Preferred way of being designated in the platform.",
        min_length=5,
        col=str,
        fts=True,
    )
    caption: str | None = Field(
        None,
        description=(
            "For individuals, the way by which a person is to be known, e.g."
            " Lawyer and Programmer; if an organization, it's motto or quote,"
            " i.e. 'just do it'."
        ),
        col=str,
    )
    description: str | None = Field(
        None,
        title="Member Description",
        description="Useful for both SEO and for contextualizing the profile object.",
        min_length=10,
        col=str,
        fts=True,
    )
    twitter: str | None = Field(None, title="Twitter username", col=str)
    github: str | None = Field(None, title="Github username", col=str)
    linkedin: str | None = Field(None, title="LinkedIn username", col=str)
    facebook: str | None = Field(None, title="Facebook page", col=str)
    areas: list[str] | None = Field(
        default_factory=list,
        title="Practice Areas",
        description=(
            "Itemized strings, referring to specialization of both natural and"
            " artificial persons, that will be mapped to a unique table"
        ),
        exclude=True,
    )
    categories: list[str] | None = Field(
        default_factory=list,
        title="Entity Categories",
        description=(
            "Itemized strings, referring to type of entity of both natural"
            " (e.g. lawyer) and artificial (e.g. law firm) persons, that will"
            " be mapped to a unique table"
        ),
        exclude=True,
    )
    members: list[dict[str, int | str | EmailStr]] | None = Field(
        default_factory=list, exclude=True
    )

    class Config:
        use_enum_values = True

    @classmethod
    def extract_details(cls, url: str) -> dict:
        """Convert the yaml file in the repository to a dict."""
        if details_resp := gh.get(f"{url}/{DETAILS_FILE}"):
            return yaml.safe_load(details_resp.content)
        raise Exception(f"Could not get details from {url=}")

    @classmethod
    def from_url(cls, url: str, set_img: bool = False):
        """Each member url can be converted to a fully validated object
        via a valid Github `url`; if `set_img` is set to true,
        an `img_id` is created on Cloudflare."""
        obj = MemberURL.setter(url, set_img)
        return cls(
            **cls.extract_details(obj.target_url),
            id=obj.id,
            img_id=obj.img_id,
            created=datetime.datetime.now().timestamp(),
            modified=datetime.datetime.now().timestamp(),
        )
```
Functions
extract_details(url)
classmethod
Convert the yaml file in the repository to a dict.
Source code in corpus_pax/resources.py
```python
@classmethod
def extract_details(cls, url: str) -> dict:
    """Convert the yaml file in the repository to a dict."""
    if details_resp := gh.get(f"{url}/{DETAILS_FILE}"):
        return yaml.safe_load(details_resp.content)
    raise Exception(f"Could not get details from {url=}")
```
from_url(url, set_img=False)
classmethod
Each member url can be converted to a fully validated object via a valid Github url; if set_img is set to true, an img_id is created on Cloudflare.
Source code in corpus_pax/resources.py
```python
@classmethod
def from_url(cls, url: str, set_img: bool = False):
    """Each member url can be converted to a fully validated object
    via a valid Github `url`; if `set_img` is set to true,
    an `img_id` is created on Cloudflare."""
    obj = MemberURL.setter(url, set_img)
    return cls(
        **cls.extract_details(obj.target_url),
        id=obj.id,
        img_id=obj.img_id,
        created=datetime.datetime.now().timestamp(),
        modified=datetime.datetime.now().timestamp(),
    )
```
Individual
Bases: RegisteredMember, IndividualBio, TableConfig
Source code in corpus_pax/entities.py
```python
class Individual(RegisteredMember, IndividualBio, TableConfig):
    __prefix__ = "pax"
    __tablename__ = "individuals"

    @validator("id", pre=True)
    def lower_cased_id(cls, v):
        return v.lower()

    class Config:
        use_enum_values = True

    @classmethod
    def list_members_repo(cls):
        return fetch_entities("members")

    @classmethod
    def make_or_replace(
        cls,
        c: Connection,
        url: str,
        replace_img: bool = False,
    ):
        indiv_data = cls.from_url(url, replace_img)
        tbl = c.table(cls)
        row = tbl.insert(indiv_data.dict(), replace=True, pk="id")  # type: ignore # noqa: E501
        if pk := row.last_pk:
            if indiv_data.areas:
                PracticeArea.associate(tbl, pk, indiv_data.areas)
            if indiv_data.categories:
                PersonCategory.associate(tbl, pk, indiv_data.categories)
```
Org
Bases: RegisteredMember, TableConfig
Source code in corpus_pax/entities.py
```python
class Org(RegisteredMember, TableConfig):
    __prefix__ = "pax"
    __tablename__ = "orgs"

    official_name: str = Field(None, max_length=100, col=str, fts=True)

    @classmethod
    def list_orgs_repo(cls):
        return fetch_entities("orgs")

    def set_membership_rows(self, c: Connection) -> Table | None:
        member_list = []
        if self.members:
            for member in self.members:
                email = member.pop("account_email", None)
                if email and (acct := EmailStr(email)):
                    obj = OrgMember(
                        org_id=self.id,
                        individual_id=None,
                        rank=member.get("rank", 10),
                        role=member.get("role", "Unspecified"),
                        account_email=acct,
                    )
                    member_list.append(obj)
        if member_list:
            return c.add_cleaned_records(OrgMember, member_list)
        return None

    @classmethod
    def make_or_replace(
        cls,
        c: Connection,
        url: str,
        replace_img: bool = False,
    ):
        org_data = cls.from_url(url, replace_img)
        tbl = c.table(cls)
        row = tbl.insert(org_data.dict(), replace=True, pk="id")  # type: ignore # noqa: E501
        if pk := row.last_pk:
            if org_data.areas:
                PracticeArea.associate(tbl, pk, org_data.areas)
            if org_data.categories:
                PersonCategory.associate(tbl, pk, org_data.categories)
        org_data.set_membership_rows(c)
```
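The defaulting logic in set_membership_rows (rank 10, role "Unspecified", skip entries without an email) can be sketched independently; `OrgMemberRow` below is a plain dataclass stand-in for the real OrgMember model:

```python
from dataclasses import dataclass


@dataclass
class OrgMemberRow:  # simplified stand-in for the real OrgMember model
    org_id: str
    account_email: str
    rank: int
    role: str


def build_rows(org_id: str, members: list[dict]) -> list[OrgMemberRow]:
    """Skip entries without an email; fall back to the default rank/role."""
    rows = []
    for member in members:
        email = member.get("account_email")
        if not email:
            continue
        rows.append(
            OrgMemberRow(
                org_id=org_id,
                account_email=email,
                rank=member.get("rank", 10),
                role=member.get("role", "Unspecified"),
            )
        )
    return rows


rows = build_rows(
    "acme-law",
    [
        {"account_email": "jane@example.com", "role": "Partner"},
        {"nickname": "no-email"},  # dropped: no account_email
    ],
)
```

Note that individual_id is left unset at this stage; membership rows are keyed by email, which is why the details.yaml files must use the same email addresses as the individuals they reference.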