Storage and database adapters available in project Thoth

Project description

thoth-storages is a library used in project Thoth. It exposes core queries and methods for the PostgreSQL database as well as adapters for working with Ceph via its S3-compatible API.

Installation and Usage

The library can be installed via pip or Pipenv from PyPI:

pipenv install thoth-storages

The library does not provide any CLI; it is a low-level library supporting other parts of Thoth.

You can run the prepared test suite with the following commands:

pipenv install --dev
pipenv run python3 setup.py test

# To generate docs:
pipenv run python3 setup.py build_sphinx

Running PostgreSQL locally

You can use the docker-compose.yaml present in this repository to run a local PostgreSQL instance (make sure you have podman-compose installed):

$ podman-compose up

After running the command above, you should be able to access a local PostgreSQL instance at localhost:5432. This is also the default configuration of the PostgreSQL adapter, so you don’t need to provide GRAPH_SERVICE_HOST explicitly. The default configuration uses a database named postgres, which can be accessed using the postgres user and the postgres password (SSL is disabled).
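
As a quick sanity check of these defaults, the sketch below assembles the connection URL a generic client would use for the local instance. The helper name local_postgres_url is hypothetical and not part of thoth-storages; only GRAPH_SERVICE_HOST is taken from this README, and the remaining values are the documented defaults, hard-coded for illustration.

```python
import os


def local_postgres_url(env=None):
    """Build a connection URL for the local PostgreSQL instance.

    Hypothetical helper for illustration; not part of thoth-storages.
    """
    env = os.environ if env is None else env
    # GRAPH_SERVICE_HOST is the only variable named in this README;
    # everything else is hard-coded to the documented defaults.
    host = env.get("GRAPH_SERVICE_HOST", "localhost")
    user, password, port, database = "postgres", "postgres", 5432, "postgres"
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


if __name__ == "__main__":
    print(local_postgres_url())
```

The resulting URL can be handed to any PostgreSQL client to confirm that the container is reachable before using the adapter itself.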

The provided docker-compose.yaml also has PGweb enabled to provide a UI for the database content. To access it, visit http://localhost:8081/.

The provided docker-compose.yaml does not use any volume. After your containers restart, the content will no longer be available.

If you would like to experiment with PostgreSQL programmatically, you can use the following code snippet as a starting point:

from thoth.storages import GraphDatabase

graph = GraphDatabase()
graph.connect()
# To clear database:
# graph.drop_all()
# To initialize schema in the graph database:
# graph.initialize_schema()

Generating migrations and schema adjustment in deployment

If you make any changes to the data model of the main PostgreSQL database, you need to generate migrations. These migrations state how to adjust an already existing database with data in deployments. Alembic migrations are used for this purpose. Alembic can (partially) automatically detect what has changed and how to adjust an already existing database in a deployment.

Alembic uses incremental version control, where each migration is versioned and states how to migrate from the previous state of the database to the desired next state. These versions are present in the alembic/versions directory and are automatically generated with the procedure described below.

If you make any changes, follow these steps to generate a new version:

  • Make sure your local PostgreSQL instance is running (follow the instructions in Running PostgreSQL locally above):

    $ podman-compose up
  • Run Alembic CLI to generate versions for you:

    # Make sure you have your environment set up:
    # pipenv install --dev
    # Make sure you are running the most recent version of schema:
    $ PYTHONPATH=. pipenv run alembic upgrade head
    # Actually generate a new version:
    $ PYTHONPATH=. pipenv run alembic revision --autogenerate -m "Added row to calculate sum of sums which will be divided by 42"
  • Review migrations generated by Alembic. Note that NOT all changes are automatically detected by Alembic.

  • Make sure generated migrations are part of your pull request so changes are propagated to deployments:

    $ git add thoth/storages/data/alembic/versions/
  • In a deployment, use Management API and its /graph/initialize endpoint to propagate database schema changes (Management API has to have the recent schema changes present; these are populated with new thoth-storages releases).

  • If you are running locally and would like to propagate changes, run the following Alembic command to upgrade the database schema to the latest version:

    $ PYTHONPATH=. pipenv run alembic upgrade head

    If you would like to update the schema programmatically, run the following Python code:

    from thoth.storages import GraphDatabase

    graph = GraphDatabase()
    graph.connect()
    graph.initialize_schema()
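
As a mental model of the incremental versioning described in this section, the sketch below walks a chain of revisions from the current database state to the most recent one. It is a toy model and not Alembic's implementation: Alembic migration files do carry revision and down_revision identifiers, but the Revision class, the upgrade_path function, and the revision ids used here are hypothetical, and a strictly linear history is assumed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """One migration: how to get from down_revision to revision."""

    revision: str
    down_revision: Optional[str]  # None marks the very first migration
    message: str


def upgrade_path(revisions: List[Revision], current: Optional[str]) -> List[str]:
    """Return revision ids to apply, oldest first, to reach the head.

    Assumes a linear history (every revision has at most one child).
    """
    by_parent: Dict[Optional[str], Revision] = {r.down_revision: r for r in revisions}
    path: List[str] = []
    cursor = current
    while cursor in by_parent:
        following = by_parent[cursor]
        path.append(following.revision)
        cursor = following.revision
    return path


if __name__ == "__main__":
    history = [
        Revision("a1", None, "Initial schema"),
        Revision("b2", "a1", "Add a column"),
        Revision("c3", "b2", "Add an index"),
    ]
    # A database currently at revision "a1" still needs the two later migrations.
    print(upgrade_path(history, "a1"))
```

This is why the upgrade to head is run before autogenerating: a new revision is recorded as the successor of whatever state the local database is currently in.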

Generate schema images

You can use the shipped thoth-storages CLI to automatically generate schema images from the current models:

# First, make sure you have dev packages installed:
pipenv install --dev
PYTHONPATH=. pipenv run python3 ./thoth-storages generate-schema

The command above will produce two images named schema.png and schema_cache.png. The first PNG file shows the schema of the main PostgreSQL instance and the second one, as the name suggests, shows what the cache schema looks like.

If the command above fails with the following exception:

FileNotFoundError: [Errno 2] "dot" not found in path.

make sure you have the graphviz package installed:

dnf install -y graphviz
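
If you would rather detect the missing dot binary before running the schema generation, a small pre-flight check along these lines may help. This is a sketch built on the standard library only; the helper name missing_executables is hypothetical and not part of thoth-storages.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(["dot"])
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("Graphviz 'dot' is available.")
```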

Creating your own performance indicators

You can create your own performance indicators. To do so, create a script which tests the desired functionality of a library. An example is the matrix multiplication script present in the performance repository. This script can be supplied to Dependency Monkey to validate a certain combination of libraries in the desired runtime and buildtime environment, or directly to Amun API, which will run the given script using the desired software and hardware configuration. Please follow the instructions on how to create a performance script shown in the README of the performance repo.

To create relevant models, adjust the thoth/storages/graph/models_performance.py file and add your model. Describe the parameters (reported in the @parameters section of the performance indicator result) and the result (reported in @result). The name of the class should match the name reported by the performance indicator run.

from sqlalchemy import Column, Float, Integer, String

# Base, BaseExtension and PerformanceIndicatorBase are assumed to be already
# available in thoth/storages/graph/models_performance.py.


class PiMatmul(Base, BaseExtension, PerformanceIndicatorBase):
    """A class for representing a matrix multiplication micro-performance test."""

    # Device used during performance indicator run - CPU/GPU/TPU/...
    device = Column(String(128), nullable=False)
    matrix_size = Column(Integer, nullable=False)
    dtype = Column(String(128), nullable=False)
    reps = Column(Integer, nullable=False)
    elapsed = Column(Float, nullable=False)
    rate = Column(Float, nullable=False)

All the models use SQLAlchemy. See its docs for more info.
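
To make the relationship between a performance indicator result and the model columns more concrete, here is a minimal sketch that flattens the @parameters and @result sections into keyword arguments for a model such as PiMatmul. The document layout (a top-level "name" key next to "@parameters" and "@result"), the split of fields between the two sections, the sample values, and the function name model_kwargs_from_result are all assumptions made for illustration; check an actual performance indicator output before relying on them.

```python
def model_kwargs_from_result(document):
    """Flatten a performance indicator result into model keyword arguments.

    The document layout ("name", "@parameters", "@result") is an assumption
    made for this sketch, not a documented format.
    """
    parameters = document.get("@parameters", {})
    result = document.get("@result", {})
    # A column name can only come from one section; refuse ambiguous input.
    overlap = set(parameters) & set(result)
    if overlap:
        raise ValueError(f"Keys present in both sections: {sorted(overlap)}")
    return document["name"], {**parameters, **result}


if __name__ == "__main__":
    # Made-up sample values, shaped after the PiMatmul columns.
    sample = {
        "name": "PiMatmul",
        "@parameters": {"matrix_size": 512, "dtype": "float32", "reps": 100},
        "@result": {"device": "cpu", "elapsed": 1.5, "rate": 2.0},
    }
    class_name, kwargs = model_kwargs_from_result(sample)
    print(class_name, sorted(kwargs))
```

Under these assumptions, the returned name selects the model class and the merged dictionary supplies one value per column.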

Online debugging of queries

You can log all the queries that are performed against a PostgreSQL instance. To do so, set the following environment variable:

export THOTH_STORAGES_DEBUG_QUERIES=1

Memory usage statistics

You can print information about the PostgreSQL adapter, together with statistics on the graph cache and memory cache usage, to the logger (it has to have at least level INFO set). To do so, set the following environment variable:

export THOTH_STORAGES_LOG_STATS=1

These statistics will be printed once the database adapter is destructed.
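
When the adapter is used from a Python script, the same setup can be done in code. The sketch below sets the variable and makes sure INFO messages are not filtered out. It configures the root logger because this README does not name the logger the library uses, and it sets the variable before any adapter is created because the README does not say when the variable is read; both are assumptions. The function name enable_storage_stats is hypothetical.

```python
import logging
import os


def enable_storage_stats(environ=os.environ):
    """Set THOTH_STORAGES_LOG_STATS and make sure INFO records get through."""
    environ["THOTH_STORAGES_LOG_STATS"] = "1"
    root = logging.getLogger()
    if not root.handlers:
        # Attach a default stderr handler so records are actually printed.
        logging.basicConfig()
    # Only lower the threshold; a logger already at DEBUG stays at DEBUG.
    if not root.isEnabledFor(logging.INFO):
        root.setLevel(logging.INFO)
    return root


if __name__ == "__main__":
    enable_storage_stats()
    logging.getLogger(__name__).info("INFO logging is enabled")
```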

Automatic backups of Thoth deployment

In each deployment, an automatic knowledge graph backup cronjob is run, usually once a day. Results of automatic backups are stored on Ceph - you can find them in s3://<bucket-name>/<prefix>/<deployment-name>/graph-backup/pg_dump-<timestamp>.sql. Refer to deployment configuration for expansion of parameters in the path.
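
The path template above can be expanded with a small helper like the one sketched here. The function name backup_location and the example values are hypothetical; the timestamp is treated as an opaque string because its exact format depends on the deployment configuration.

```python
def backup_location(bucket_name, prefix, deployment_name, timestamp):
    """Expand the backup path template from this README."""
    return (
        f"s3://{bucket_name}/{prefix}/{deployment_name}"
        f"/graph-backup/pg_dump-{timestamp}.sql"
    )


if __name__ == "__main__":
    # Hypothetical values; take the real ones from your deployment configuration.
    print(backup_location("my-bucket", "my-prefix", "my-deployment", "<timestamp>"))
```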

To create a database instance out of this backup file, run a fresh local PostgreSQL instance and fill it from the backup file:

$ cd thoth-station/storages
$ aws s3 --endpoint <ceph-s3-endpoint> cp s3://<bucket-name>/<prefix>/<deployment-name>/graph-backup/pg_dump-<timestamp>.sql pg_dump-<timestamp>.sql
$ podman-compose up
$ psql -h localhost -p 5432 --username=postgres < pg_dump-<timestamp>.sql
password: <type password "postgres" here>
<logs will show up>

Manual backups of Thoth deployment

You can use the pg_dump and psql utilities to create dumps and restore the database content from dumps. These tools are pre-installed in the container image which runs PostgreSQL, so the only thing you need to do is execute pg_dump in a PostgreSQL container in Thoth’s deployment to create a dump, use oc cp to retrieve the dump (or directly use oc exec and create the dump from the cluster), and subsequently use psql to restore the database content. The prerequisite is access to the running container (edit rights).

# Execute the following commands from the root of this Git repo:
# List PostgreSQL pods running:
$ oc get pod -l name=postgresql
NAME                 READY     STATUS    RESTARTS   AGE
postgresql-1-glwnr   1/1       Running   0          3d
# Open remote shell to the running container in the PostgreSQL pod:
$ oc rsh -t postgresql-1-glwnr bash
# Perform dump of the database:
(cluster-postgres) $ pg_dump > pg_dump-$(date +"%s").sql
(cluster-postgres) $ ls pg_dump-*.sql   # Remember the current dump name
(cluster-postgres) pg_dump-1569491024.sql
(cluster-postgres) $ exit
# Copy the dump to the current dir:
$ oc cp thoth-test-core/postgresql-1-glwnr:/opt/app-root/src/pg_dump-1569491024.sql  .
# Start local PostgreSQL instance:
$ podman-compose up --detach
<logs will show up>
$ psql -h localhost -p 5432 --username=postgres < pg_dump-1569491024.sql
password: <type password "postgres" here>
<logs will show up>

You can ignore error messages related to an owner error like this:

STATEMENT:  ALTER TABLE public.python_software_stack OWNER TO thoth;
ERROR:  role "thoth" does not exist

The PostgreSQL container uses user “postgres” by default which is different from the one run in the cluster (“thoth”). The role assignment will simply not be created but data will be available.
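
The manual dump names used above embed the Unix timestamp produced by date +"%s". If you keep several dumps around, a helper like the following sketch can turn a file name back into a readable creation time. The function name dump_created_at is hypothetical, and the pattern only covers dumps named as in the commands above.

```python
import re
from datetime import datetime, timezone

# Matches names such as pg_dump-1569491024.sql (epoch seconds).
_DUMP_NAME = re.compile(r"^pg_dump-(\d+)\.sql$")


def dump_created_at(file_name):
    """Return the UTC creation time encoded in a manual dump file name."""
    match = _DUMP_NAME.match(file_name)
    if match is None:
        raise ValueError(f"Not a manual dump file name: {file_name!r}")
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    print(dump_created_at("pg_dump-1569491024.sql").isoformat())
    # prints 2019-09-26T09:43:44+00:00
```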

Syncing results of jobs run in the cluster

Each job in the cluster reports a JSON document which states necessary information about the job run (metadata) and the actual job results. These results are stored on Ceph object storage via its S3-compatible API and are later synced to the knowledge graph by graph syncs. The component responsible for graph syncs is graph-sync-job, which is written to be generic enough to sync any data and report metrics about synced data, so you don’t need to provide such logic for each new workload registered in the system.

To sync results of your own jobs (workloads) run in the cluster, implement the related syncing logic in sync.py and register a handler in _HANDLERS_MAPPING in the same file. The mapping maps a prefix of the document id to the handler (function) responsible for syncing data into the knowledge base. Please mind the signatures of existing syncing functions so that your handler integrates automatically with the sync_documents function, which is called from graph-sync-job.
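
The prefix-based dispatch described above can be pictured with the following sketch. It only illustrates the idea of mapping a document id prefix to a handler function: the handler names, their signatures, the prefixes, and the find_handler helper are hypothetical and do not mirror the real _HANDLERS_MAPPING or sync_documents in sync.py.

```python
def sync_solver_document(document_id):
    """Hypothetical handler; a real one would store data in the knowledge graph."""
    return f"solver:{document_id}"


def sync_adviser_document(document_id):
    """Hypothetical handler for another kind of workload."""
    return f"adviser:{document_id}"


# Hypothetical prefixes; the real mapping lives in sync.py.
HANDLERS = {
    "solver-": sync_solver_document,
    "adviser-": sync_adviser_document,
}


def find_handler(document_id, handlers):
    """Pick the handler whose prefix matches the document id.

    Longer prefixes are tried first so that overlapping prefixes
    resolve to the most specific handler.
    """
    for prefix in sorted(handlers, key=len, reverse=True):
        if document_id.startswith(prefix):
            return handlers[prefix]
    raise KeyError(f"No handler registered for document {document_id!r}")


if __name__ == "__main__":
    handler = find_handler("adviser-1234", HANDLERS)
    print(handler("adviser-1234"))
```

Registering a new workload then amounts to writing one handler and adding one entry to the mapping, which is what keeps the sync job itself generic.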
