
corpus-pax Docs

Using a sqlpyd-fashioned database, create tables for generic users, organizations, and articles, sourcing the data from Github repositories.

flowchart TB
subgraph dev env
  pax[corpus-pax]
  pax--run setup_pax--->db[(sqlite.db)]
end
subgraph /corpus-entities
  1(members)--github api---pax
  2(orgs)--github api---pax
end
subgraph /lawsql-articles
  3(articles)--github api---pax
end
pax--cloudflare api-->cf(cloudflare images)

Each corpus entity in the /corpus-entities repository is represented by two files, a details.yaml and an avatar.jpeg, organized according to the following structure:

YAML
/<gh-repo> # github repository
  /members
    /<id-of-individual-1>
      - details.yaml
      - avatar.jpeg
  /orgs
    /<id-of-org-1>
      - details.yaml
      - avatar.jpeg

The details.yaml file should contain the key-value pairs for the entity (individual or org) represented by the id.
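
As a hedged illustration only (the real keys are dictated by the RegisteredMember fields documented below, so actual files may differ), a member's details.yaml might look like this:

YAML
# Hypothetical values; field names follow RegisteredMember, documented below.
display_name: Jane Doe
email: jane@example.com
display_url: https://example.com
caption: Lawyer and Programmer
description: Writes about legal data pipelines and statutory interpretation.
twitter: janedoe
github: janedoe
areas:
  - Legal Tech
categories:
  - lawyer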

Rationale

Why Github

The names and profiles of individuals and organizations are stored in Github. These are pulled into the application via an API call requiring the use of a personal access token.

Why Cloudflare Images

Individuals and organizations have images stored in Github. To persist and optimize images for the web, I use Cloudflare Images to take advantage of modern image formats and customizable variants.

Why sqlite

The initial data is simple. This database, however, will be the foundation for a more complicated schema. Sqlite seems a better fit for experimentation and for future app use (both Android and iOS rely on sqlite).

Run

Python
>>> from corpus_pax import setup_pax
>>> setup_pax("x.db")

Since it is difficult to correct the many-to-many (m2m) tables in place, setup_pax() drops all the tables first before adding content.

setup_pax() combines three functions:

Add individuals

Add/replace records of individuals from an API call.

Source code in corpus_pax/__main__.py
Python
def add_individuals_from_api(c: Connection, replace_img: bool = False):
    """Add/replace records of individuals from an API call."""
    for entity_individual in Individual.list_members_repo():
        Individual.make_or_replace(c, entity_individual["url"], replace_img)

Add organizations

Add/replace records of organizations from an API call.

Source code in corpus_pax/__main__.py
Python
def add_organizations_from_api(c: Connection, replace_img: bool = False):
    """Add/replace records of organizations from an API call."""
    for entity_org in Org.list_orgs_repo():
        Org.make_or_replace(c, entity_org["url"], replace_img)

Add articles

Add/replace records of articles from an API call.

Source code in corpus_pax/__main__.py
Python
def add_articles_from_api(c: Connection):
    """Add/replace records of articles from an API call."""
    for extracted_data in Article.extract_articles():
        Article.make_or_replace(c, extracted_data)
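
Taken together, a minimal sketch of what setup_pax() does is shown below. This is only an illustration of the sequence described above (drop the tables, then re-add individuals, organizations, and articles), not the literal source; the constructor arguments and import paths are assumptions.

Python
# Illustrative sketch only; the real setup_pax() lives in corpus_pax/__main__.py.
from sqlpyd import Connection  # assumed import path for the sqlpyd Connection

from corpus_pax.__main__ import (  # the three functions documented above
    add_articles_from_api,
    add_individuals_from_api,
    add_organizations_from_api,
)


def setup_pax_sketch(db_path: str, replace_img: bool = False) -> None:
    c = Connection(DatabasePath=db_path)  # constructor args are an assumption
    # Drop any existing pax tables first (reset mechanism not shown here),
    # since the m2m tables are hard to correct in place.
    add_individuals_from_api(c, replace_img)
    add_organizations_from_api(c, replace_img)
    add_articles_from_api(c)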

Prerequisites

Repositories

Different repositories involved:

| repository      | status  | type        | purpose                                                       |
|-----------------|---------|-------------|---------------------------------------------------------------|
| lawsql-articles | private | data source | used by corpus-pax; markdown-styled articles with frontmatter |
| corpus-entities | private | data source | used by corpus-pax; yaml-formatted member and org files       |
| corpus-pax      | public  | sqlite i/o  | functions to create pax-related tables                        |

Since data concerning members is pulled from these repositories, make sure the individual / org fields declared in corpus_pax/resources.py match the data pulled from corpus-entities.

Each avatar image should be named avatar.jpeg so that it can be uploaded to Cloudflare.

.env

Create a .env file to create / populate the database. See env.example, which highlights the following variables:

Text Only
CF_ACCT_ID=op://dev/cloudflare/acct_id
CF_IMG_TOKEN=op://dev/cloudflare/images/token
CF_IMG_HASH=op://dev/cloudflare/images/hash
GH_TOKEN=op://dev/gh/pat-public/token

Note the workflow (main.yml) where these secrets are included for Github Actions. Ensure they are set in the repository's <url-to-repo>/settings/secrets/actions, and make the proper replacements when the Cloudflare and Github tokens expire.
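
For local runs outside Github Actions, the same variable names are read from the environment. A minimal sketch of checking for them (how corpus-pax loads its settings internally is not shown here) is:

Python
import os

# Sketch only: these are the variable names listed above; how corpus-pax
# consumes them internally is not documented in this section.
CF_ACCT_ID = os.environ["CF_ACCT_ID"]      # Cloudflare account id
CF_IMG_TOKEN = os.environ["CF_IMG_TOKEN"]  # Cloudflare Images API token
CF_IMG_HASH = os.environ["CF_IMG_HASH"]    # Cloudflare Images hash
GH_TOKEN = os.environ["GH_TOKEN"]          # Github personal access token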


Articles

Bases: TableConfig

Source code in corpus_pax/articles.py
Python
class Article(TableConfig):
    __prefix__ = "pax"
    __tablename__ = "articles"

    url: HttpUrl = Field(col=str)
    id: str = Field(col=str)
    title: str = Field(col=str, fts=True)
    description: str = Field(col=str, fts=True)
    date: datetime.date = Field(..., col=datetime.date, index=True)
    created: float = Field(col=float)
    modified: float = Field(col=float)
    content: str = Field(col=str, fts=True)
    tags: list[str] = Field(
        default_factory=list,
        title="Subject Matter",
        description="Itemized strings, referring to the topic tag involved.",
        exclude=True,
    )
    authors: list[EmailStr] = Field(default_factory=list, exclude=True)

    @classmethod
    def extract_articles(cls):
        """Based on entries from a Github folder, ignore files
        not formatted in .md and extract the Pydantic-model;
        the model is based on the frontmatter metadata of each
        markdown article.
        """
        articles = []
        for entry in fetch_articles():
            if filename := entry.get("name"):
                if filename.endswith(".md"):
                    if url := entry.get("url"):
                        id = filename.removesuffix(".md")
                        modified = fetch_article_date_modified(filename)
                        details = cls.extract_markdown_postmatter(url)
                        article = cls(id=id, modified=modified, **details)
                        articles.append(article)
        return articles

    @classmethod
    def extract_markdown_postmatter(cls, url: str) -> dict:
        """Convert the markdown/frontmatter file fetched via url to a dict."""
        mdfile = gh.get(url)
        post = frontmatter.loads(mdfile.content)
        d = parser.parse(post["date"]).astimezone(ZoneInfo("Asia/Manila"))
        return {
            "url": url,
            "created": d.timestamp(),
            "date": d.date(),
            "title": post["title"],
            "description": post["summary"],
            "content": post.content,
            "authors": post["authors"],
            "tags": post["tags"],
        }

    @classmethod
    def make_or_replace(cls, c: Connection, extract: Any):
        tbl = c.table(cls)
        row = tbl.insert(extract.dict(), replace=True, pk="id")  # type: ignore
        if row.last_pk:
            for author_email in extract.authors:
                tbl.update(row.last_pk).m2m(
                    other_table=Individual.__tablename__,
                    lookup={"email": author_email},
                    pk="id",
                )
            for tag in extract.tags:
                tbl.update(row.last_pk).m2m(
                    other_table=Tag.__tablename__,
                    lookup=Tag(**{"tag": tag}).dict(),
                )

Functions

extract_articles() classmethod

Based on entries from a Github folder, ignore files not formatted in .md and extract the Pydantic-model; the model is based on the frontmatter metadata of each markdown article.

Source code in corpus_pax/articles.py
Python
@classmethod
def extract_articles(cls):
    """Based on entries from a Github folder, ignore files
    not formatted in .md and extract the Pydantic-model;
    the model is based on the frontmatter metadata of each
    markdown article.
    """
    articles = []
    for entry in fetch_articles():
        if filename := entry.get("name"):
            if filename.endswith(".md"):
                if url := entry.get("url"):
                    id = filename.removesuffix(".md")
                    modified = fetch_article_date_modified(filename)
                    details = cls.extract_markdown_postmatter(url)
                    article = cls(id=id, modified=modified, **details)
                    articles.append(article)
    return articles

extract_markdown_postmatter(url) classmethod

Convert the markdown/frontmatter file fetched via url to a dict.

Source code in corpus_pax/articles.py
Python
@classmethod
def extract_markdown_postmatter(cls, url: str) -> dict:
    """Convert the markdown/frontmatter file fetched via url to a dict."""
    mdfile = gh.get(url)
    post = frontmatter.loads(mdfile.content)
    d = parser.parse(post["date"]).astimezone(ZoneInfo("Asia/Manila"))
    return {
        "url": url,
        "created": d.timestamp(),
        "date": d.date(),
        "title": post["title"],
        "description": post["summary"],
        "content": post.content,
        "authors": post["authors"],
        "tags": post["tags"],
    }
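
The keys read above (date, title, summary, authors, tags) come from each article's frontmatter. As a hypothetical illustration (the real files in lawsql-articles may differ), the parsing can be reproduced like this; dateutil's parser is assumed:

Python
import frontmatter
from dateutil import parser

# Hypothetical article; keys mirror those read by extract_markdown_postmatter().
sample = """---
title: Sample article
summary: A short description of the piece.
date: "2023-01-15"
authors:
  - jane@example.com
tags:
  - legal-tech
---
Body of the article in markdown.
"""
post = frontmatter.loads(sample)
d = parser.parse(post["date"])
print(post["title"], d.date(), post["authors"], post["tags"])
# -> Sample article 2023-01-15 ['jane@example.com'] ['legal-tech']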

Entities

RegisteredMember

Bases: BaseModel

Common validator for corpus entities: Individuals and Orgs. Note that the col attribute is for use in sqlpyd.

Source code in corpus_pax/resources.py
Python
class RegisteredMember(BaseModel):
    """Common validator for corpus entities: Individuals and Orgs.
    Note that the `col` attribute is for use in `sqlpyd`."""

    id: str = Field(col=str)
    created: float = Field(col=float)
    modified: float = Field(col=float)
    search_rank: RankStatus | None = Field(
        RankStatus.Ordinary,
        title="Search Rank",
        description="Can use as a means to determine rank in SERP",
        col=int,
    )
    email: EmailStr = Field(col=str)
    img_id: str | None = Field(
        None,
        title="Cloudflare Image ID",
        description=(
            "Based on email, upload a unique avatar that can be called via"
            " Cloudflare Images."
        ),
        col=str,
    )
    display_url: HttpUrl | None = Field(
        title="Associated URL",
        description=(
            "When visiting the profile of the member, what URL is associated"
            " with the latter?"
        ),
        col=str,
    )
    display_name: str = Field(
        ...,
        title="Display Name",
        description="Preferred way of being designated in the platform.",
        min_length=5,
        col=str,
        fts=True,
    )
    caption: str | None = Field(
        None,
        description=(
            "For individuals, the way by which a person is to be known, e.g."
            " Lawyer and Programmer; if an organization, it's motto or quote,"
            " i.e. 'just do it'."
        ),
        col=str,
    )
    description: str | None = Field(
        None,
        title="Member Description",
        description="Useful for both SEO and for contextualizing the profile object.",
        min_length=10,
        col=str,
        fts=True,
    )
    twitter: str | None = Field(None, title="Twitter username", col=str)
    github: str | None = Field(None, title="Github username", col=str)
    linkedin: str | None = Field(None, title="LinkedIn username", col=str)
    facebook: str | None = Field(None, title="Facebook page", col=str)
    areas: list[str] | None = Field(
        default_factory=list,
        title="Practice Areas",
        description=(
            "Itemized strings, referring to specialization of both natural and"
            " artificial persons, that will be mapped to a unique table"
        ),
        exclude=True,
    )
    categories: list[str] | None = Field(
        default_factory=list,
        title="Entity Categories",
        description=(
            "Itemized strings, referring to type of entity of both natural"
            " (e.g. lawyer) and artificial (e.g. law firm) persons, that will"
            " be mapped to a unique table"
        ),
        exclude=True,
    )
    members: list[dict[str, int | str | EmailStr]] | None = Field(
        default_factory=list, exclude=True
    )

    class Config:
        use_enum_values = True

    @classmethod
    def extract_details(cls, url: str) -> dict:
        """Convert the yaml file in the repository to a dict."""
        if details_resp := gh.get(f"{url}/{DETAILS_FILE}"):
            return yaml.safe_load(details_resp.content)
        raise Exception(f"Could not get details from {url=}")

    @classmethod
    def from_url(cls, url: str, set_img: bool = False):
        """Each member url can be converted to a fully validated object
        via a valid Github `url`; if `set_img` is set to true,
        an `img_id` is created on Cloudflare."""
        obj = MemberURL.setter(url, set_img)
        return cls(
            **cls.extract_details(obj.target_url),
            id=obj.id,
            img_id=obj.img_id,
            created=datetime.datetime.now().timestamp(),
            modified=datetime.datetime.now().timestamp(),
        )

Functions

extract_details(url) classmethod

Convert the yaml file in the repository to a dict.

Source code in corpus_pax/resources.py
Python
@classmethod
def extract_details(cls, url: str) -> dict:
    """Convert the yaml file in the repository to a dict."""
    if details_resp := gh.get(f"{url}/{DETAILS_FILE}"):
        return yaml.safe_load(details_resp.content)
    raise Exception(f"Could not get details from {url=}")

from_url(url, set_img=False) classmethod

Each member url can be converted to a fully validated object via a valid Github url; if set_img is set to true, an img_id is created on Cloudflare.

Source code in corpus_pax/resources.py
Python
@classmethod
def from_url(cls, url: str, set_img: bool = False):
    """Each member url can be converted to a fully validated object
    via a valid Github `url`; if `set_img` is set to true,
    an `img_id` is created on Cloudflare."""
    obj = MemberURL.setter(url, set_img)
    return cls(
        **cls.extract_details(obj.target_url),
        id=obj.id,
        img_id=obj.img_id,
        created=datetime.datetime.now().timestamp(),
        modified=datetime.datetime.now().timestamp(),
    )

Individual

Bases: RegisteredMember, IndividualBio, TableConfig

Source code in corpus_pax/entities.py
Python
class Individual(RegisteredMember, IndividualBio, TableConfig):
    __prefix__ = "pax"
    __tablename__ = "individuals"

    @validator("id", pre=True)
    def lower_cased_id(cls, v):
        return v.lower()

    class Config:
        use_enum_values = True

    @classmethod
    def list_members_repo(cls):
        return fetch_entities("members")

    @classmethod
    def make_or_replace(
        cls,
        c: Connection,
        url: str,
        replace_img: bool = False,
    ):
        indiv_data = cls.from_url(url, replace_img)
        tbl = c.table(cls)
        row = tbl.insert(indiv_data.dict(), replace=True, pk="id")  # type: ignore # noqa: E501
        if pk := row.last_pk:
            if indiv_data.areas:
                PracticeArea.associate(tbl, pk, indiv_data.areas)
            if indiv_data.categories:
                PersonCategory.associate(tbl, pk, indiv_data.categories)

Org

Bases: RegisteredMember, TableConfig

Source code in corpus_pax/entities.py
Python
class Org(RegisteredMember, TableConfig):
    __prefix__ = "pax"
    __tablename__ = "orgs"
    official_name: str = Field(None, max_length=100, col=str, fts=True)

    @classmethod
    def list_orgs_repo(cls):
        return fetch_entities("orgs")

    def set_membership_rows(self, c: Connection) -> Table | None:
        member_list = []
        if self.members:
            for member in self.members:
                email = member.pop("account_email", None)
                if email and (acct := EmailStr(email)):
                    obj = OrgMember(
                        org_id=self.id,
                        individual_id=None,
                        rank=member.get("rank", 10),
                        role=member.get("role", "Unspecified"),
                        account_email=acct,
                    )
                    member_list.append(obj)
        if member_list:
            return c.add_cleaned_records(OrgMember, member_list)
        return None

    @classmethod
    def make_or_replace(
        cls,
        c: Connection,
        url: str,
        replace_img: bool = False,
    ):
        org_data = cls.from_url(url, replace_img)
        tbl = c.table(cls)
        row = tbl.insert(org_data.dict(), replace=True, pk="id")  # type: ignore # noqa: E501
        if pk := row.last_pk:
            if org_data.areas:
                PracticeArea.associate(tbl, pk, org_data.areas)
            if org_data.categories:
                PersonCategory.associate(tbl, pk, org_data.categories)
        org_data.set_membership_rows(c)
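
Based on the keys read by set_membership_rows() above (account_email, with optional rank and role), a hypothetical members block in an org's details.yaml might look like this:

YAML
# Hypothetical entries; keys mirror those read by set_membership_rows().
members:
  - account_email: jane@example.com
    rank: 1
    role: Founder
  - account_email: john@example.com  # rank defaults to 10, role to "Unspecified"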