
Command-line program that scans the NMDC MongoDB database for referential integrity violations


refscan

refscan is a command-line tool people can use to scan the NMDC MongoDB database for referential integrity violations.

%% This is the source code of a Mermaid diagram, which GitHub will render as a diagram.
%% Note: PyPI does not render Mermaid diagrams, and instead displays their source code.
%%       Reference: https://github.com/pypi/warehouse/issues/13083
graph LR
    schema[LinkML<br>schema]
    database[(MongoDB<br>database)]
    script[["refscan.py"]]
    violations["List of<br>violations"]
    references["List of<br>references"]:::dashed_border
    schema --> script
    database --> script
    script -.-> references
    script --> violations
    
    classDef dashed_border stroke-dasharray: 5 5


How it works

refscan does its job in two stages:

  1. It uses the LinkML schema to determine "what to scan"; that is, all of the document-to-document references that can exist in a database that conforms to the schema.

    Example: The schema might say that, for each document in the biosample_set collection that has a field named associated_studies, that field must contain a list of ids of documents in the study_set collection.

  2. It scans the MongoDB database to check the integrity of all of the references that do exist.

    Example: For each document in the biosample_set collection that has a field named associated_studies, for each value in that field, confirm there is a document having that id in the study_set collection.
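The two stages above can be sketched in plain Python. This is only an illustration, not refscan's actual implementation: the REFERENCE_RULES list stands in for what stage 1 would derive from the schema, the DATABASE dictionary stands in for MongoDB, and the find_violations function and all sample ids are hypothetical.

```python
# Hypothetical stand-in for the output of stage 1: each rule says that a field
# in a source collection holds ids of documents in a target collection.
REFERENCE_RULES = [
    # (source collection, source field, target collection)
    ("biosample_set", "associated_studies", "study_set"),
]

# Hypothetical stand-in for the MongoDB database.
DATABASE = {
    "study_set": [{"id": "study-1"}],
    "biosample_set": [
        {"id": "biosample-1", "associated_studies": ["study-1"]},
        {"id": "biosample-2", "associated_studies": ["study-1", "study-404"]},
        {"id": "biosample-3"},  # has no references, so there is nothing to check
    ],
}


def find_violations(database, rules):
    """Stage 2: return one tuple per reference whose target document is missing."""
    violations = []
    for source_collection, field, target_collection in rules:
        # Collect the ids that exist in the target collection.
        target_ids = {doc["id"] for doc in database.get(target_collection, [])}
        for doc in database.get(source_collection, []):
            if field not in doc:
                continue
            values = doc[field]
            # Treat a single value the same way as a one-item list.
            if not isinstance(values, list):
                values = [values]
            for value in values:
                if value not in target_ids:
                    violations.append((source_collection, doc["id"], field, value))
    return violations


print(find_violations(DATABASE, REFERENCE_RULES))
```

In this sketch, only the reference from biosample-2 to study-404 is reported, because no document with that id exists in study_set.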

Limitations

refscan was designed under the assumption that every document in every collection described by the schema has a field named type, whose value is the class_uri of the schema class of which the document is an instance. refscan uses that class_uri value to determine the name of the schema class, which it then uses to determine which fields of the document can contain references.
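To make that lookup chain concrete, here is a minimal sketch. It is not refscan's code: the two lookup tables, their contents, and the get_reference_field_names function are hypothetical, and in practice the equivalent information would come from the LinkML schema.

```python
# Hypothetical lookup tables; the real information would come from the schema.
CLASS_NAME_BY_CLASS_URI = {
    "nmdc:Biosample": "Biosample",
    "nmdc:Study": "Study",
}
REFERENCE_FIELDS_BY_CLASS_NAME = {
    "Biosample": ["associated_studies"],
    "Study": [],
}


def get_reference_field_names(document):
    """Return the names of the fields of this document that can contain references."""
    class_uri = document.get("type")
    if class_uri is None:
        # Without a type field, the schema class cannot be determined.
        raise ValueError("Document has no 'type' field.")
    class_name = CLASS_NAME_BY_CLASS_URI[class_uri]
    return REFERENCE_FIELDS_BY_CLASS_NAME[class_name]


print(get_reference_field_names({"id": "biosample-1", "type": "nmdc:Biosample"}))
```

The sketch also shows why the assumption matters: a document without a type field gives the lookup nothing to start from.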

Usage

Install

Assuming you have pipx installed, you can install the tool by running the following command:

pipx install refscan

pipx is a tool people can use to download and install Python scripts that are hosted on PyPI. You can install pipx by running $ python -m pip install pipx.

Run

You can display the tool's --help snippet by running:

refscan --help

At the time of this writing, the tool's --help snippet is:

 Usage: refscan [OPTIONS]

 Scans the NMDC MongoDB database for referential integrity violations.

╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ *  --schema                               FILE  Filesystem path at which the YAML file │
│                                                 representing the schema is located.    │
│                                                 [default: None]                        │
│                                                 [required]                             │
│    --database-name                        TEXT  Name of the database.                  │
│                                                 [default: nmdc]                        │
│    --mongo-uri                            TEXT  Connection string for accessing the    │
│                                                 MongoDB server. If you have Docker     │
│                                                 installed, you can spin up a temporary │
│                                                 MongoDB server at the default URI by   │
│                                                 running: $ docker run --rm --detach -p │
│                                                 27017:27017 mongo                      │
│                                                 [env var: MONGO_URI]                   │
│                                                 [default: mongodb://localhost:27017]   │
│    --verbose                                    Show verbose output.                   │
│    --skip-source-collection,--skip        TEXT  Name of collection you do not want to  │
│                                                 search for referring documents. Option │
│                                                 can be used multiple times.            │
│                                                 [default: None]                        │
│    --reference-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its reference      │
│                                                 report.                                │
│                                                 [default: references.tsv]              │
│    --violation-report                     FILE  Filesystem path at which you want the  │
│                                                 program to generate its violation      │
│                                                 report.                                │
│                                                 [default: violations.tsv]              │
│    --version                                    Show version number and exit.          │
│    --help                                       Show this message and exit.            │
╰────────────────────────────────────────────────────────────────────────────────────────╯

Note: The above snippet was captured from a terminal window whose width was 90 characters.

The MongoDB connection string (--mongo-uri)

As documented in the --help snippet above, you can provide the MongoDB connection string to the tool via either (a) the --mongo-uri option; or (b) an environment variable named MONGO_URI. The latter can come in handy when the MongoDB connection string contains information you don't want to appear in your shell history, such as a password.

Here's how you could create that environment variable:

export MONGO_URI='mongodb://username:password@localhost:27017'
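If the username or password contains characters such as @, :, or /, MongoDB requires them to be percent-encoded in the connection string. The following sketch shows one way to build such a string with Python's standard library. The build_mongo_uri function name and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host="localhost", port=27017):
    """Build a MongoDB connection string with percent-encoded credentials."""
    # quote_plus encodes reserved characters such as '@', ':' and '/'.
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}"


# Hypothetical credentials, used only to show the encoding.
uri = build_mongo_uri("username", "p@ss:word/1")
```

You could then use the resulting string as the value of the MONGO_URI environment variable.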

The schema (--schema)

As documented in the --help snippet above, you can provide the path to a YAML-formatted LinkML schema file to the tool via the --schema option.

Tips for getting a schema file

If you have curl installed, you can download a YAML file from GitHub by running the following command (after replacing the {...} placeholders and customizing the path):

# Download the raw content of https://github.com/{user_or_org}/{repo}/blob/{branch}/path/to/schema.yaml
curl -o schema.yaml https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/path/to/schema.yaml

For example:

# Download the raw content of https://github.com/microbiomedata/berkeley-schema-fy24/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml
curl -o schema.yaml https://raw.githubusercontent.com/microbiomedata/berkeley-schema-fy24/main/nmdc_schema/nmdc_materialized_patterns.yaml
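The mapping between the two URL forms shown in the comments above can also be expressed as a small Python helper. This is a sketch, not part of refscan. The to_raw_url function is hypothetical, and it assumes the branch name contains no slashes.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url):
    """Convert a github.com "blob" URL into its raw.githubusercontent.com form."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: {user_or_org}/{repo}/blob/{branch}/{path to file}
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    user_or_org, repo, _, branch, *path = segments
    return f"https://raw.githubusercontent.com/{user_or_org}/{repo}/{branch}/{'/'.join(path)}"


print(
    to_raw_url(
        "https://github.com/microbiomedata/berkeley-schema-fy24/blob/main/nmdc_schema/nmdc_materialized_patterns.yaml"
    )
)
```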

Output

While refscan is running, it will display console output indicating what it's currently doing.


Once the scan is complete, the reference report and the violation report (both TSV files) will be available in the current directory as references.tsv and violations.tsv, or at the paths you specified via the --reference-report and --violation-report options.
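Because the reports are TSV files, you can load them with Python's built-in csv module. The sketch below assumes that the first row of each report is a header row, which this document does not confirm. It makes no assumptions about the column names, and the parse_report and load_report functions are hypothetical helpers.

```python
import csv


def parse_report(lines):
    """Parse tab-separated lines into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(lines, delimiter="\t"))


def load_report(path):
    """Read a TSV report file from disk."""
    with open(path, newline="", encoding="utf-8") as f:
        return parse_report(f)


# Example usage, assuming a scan has already produced violations.tsv:
#   rows = load_report("violations.tsv")
#   print(f"{len(rows)} violation(s) found")
```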

Update

You can update the tool to the latest version available on PyPI by running:

pipx upgrade refscan

Uninstall

You can uninstall the tool from your computer by running:

pipx uninstall refscan

Development

We use Poetry to both (a) manage dependencies and (b) build distributable packages that can be published to PyPI.

  • pyproject.toml: Configuration file for Poetry and other tools (was initialized via $ poetry init)
  • poetry.lock: List of dependencies, both direct and indirect/transitive

Clone repository

git clone https://github.com/microbiomedata/refscan.git
cd refscan

Create virtual environment

Create a Poetry virtual environment and attach to its shell:

poetry shell

You can see information about the Poetry virtual environment by running: $ poetry env info

You can detach from the Poetry virtual environment's shell by running: $ exit

From now on, we'll refer to the Poetry virtual environment's shell as the "Poetry shell."

Install dependencies

At the Poetry shell, install the project's dependencies:

poetry install

Make changes

Edit the tool's source code and documentation however you want.

While editing the tool's source code, you can run the tool as you normally would to test your changes. For example:

refscan --help

Run tests

We use pytest as the testing framework for refscan.

You can run the tests by running the following command from the root directory of the repository:

poetry run pytest

Tests are defined in the tests directory.

Format code

We use black as the code formatter for refscan.

We do not use its default options. Instead, we pass an option that raises the maximum line length from the default of 88 characters to 120 characters.

You can format all the Python code in the repository by running this command from the root directory of the repository:

poetry run black --line-length 120 .

Check format

You can check the format of the Python code by including the --check option, like this:

poetry run black --line-length 120 --check .

Building and publishing

Build for production

Whenever someone publishes a GitHub Release in this repository, a GitHub Actions workflow will automatically build a package and publish it to PyPI. That package will have a version identifier that matches the name of the Git tag associated with the Release.

Test the build process locally

In case you want to test the build process locally, you can do so by running:

poetry build

That will create both a source distribution file (whose name ends with .tar.gz) and a wheel file (whose name ends with .whl) in the dist directory.
