Skip to main content

Looks for new Illumina sequencing runs and tars them up into GCP storage

Project description

A tool that archives new Illumina sequencing runs to Google Cloud Storage

Use case

You have one or more Illumina sequencers that are writing to a mounted filesystem such as NFS. You need a way to detect when there is a new, completed sequencing run directory and then relocate it to redundant storage. Downstream tools need to be able to know when a tarred run directory is available to post-processing (i.e. demultiplexing, QC, read alignment, etc.).

How it works

Sequencing Runs Monitor solves the aforementioned challenges through the use of Google Cloud Platform services and by tracking workflow state. Sequencing runs are tarred with gzip compression and then uploaded to Google Cloud Storage. Workflow state is tracked both locally via SQLite and in the NoSQL database Google Firestore for redundancy and to allow downstream clients to query sequencing run records.

Note: while you don’t need to use a Google compute instance to run the monitor script, the documentation here assumes that you are since it is the recommended way. That’s due to the fact that the monitor must interact with certain GCP services, and hence must be running with proper Google credentials (i.e. a service account).

The monitor script is named launch_monitor.py; when running it you must provide it with the path to a JSON configuration file, described in detail further below. You should set up your compute instance to run this script as a daemon service.

The workflow is fitted into two tasks: the tar task and the upload task. When the monitor detects a new sequencing run, it executes the workflow in a child process. The workflow is smart enough to detect which task to begin with, thanks to the local SQLite database. This database has a record for each sequencing run and tracks which workflow tasks have been completed, and whether the workflow is running.

Tar task

Creates a tarball with gzip compression. The process ID is stored in the local run record in the SQLite database.

Upload task

Uploads the tarfile to a Google bucket. This task fetches the run record from the local database to get the path to the local tarfile.

The configuration file

This is a small JSON file that lets the monitor know things such as which GCP bucket and Firestore collection to use, for example. The possible keys are:

  • watchdir: (Required) The directory to monitor for new sequencing runs.

  • completed_runs_dir: The directory to move a run directory to after it has completed the workflow. Defaults to a folder by the name ‘SRM_COMPLETED` that resides within the same directory as the one being watched. Note that at present, there isn’t a means to clean out the completed runs directory, but that will come in a future release.

  • sqlite_db: The name of the local SQLite database to use for tracking workflow state. Defaults to sruns.db if not specified.

  • firestore_collection: (Required) The name of the Google Firestore collection to use for persistent workflow state that downstream tools can query. If it doesn’t exist yet, it will be created.

  • gcp_bucket_name: (Required) The name of the Google Cloud Storage bucket to which tarred run directories will be uploaded.

  • gcp_bucket_basedir: The directory in gcp_bucket_name in which to store all uploaded files. Defaults to the root directory.

  • cycle_pause_sec: The number of seconds to wait in-between scans of watchdir. Defaults to 60.

  • task_runtime_limit_sec: The number of seconds a child process is allowed to run before being killed. This is meant to serve as a safety mechanism to prevent errant child processes from consuming resources in the event that this does happen due to unforeseen circumstances. An email notification will be sent out in this case to alert about the errant process and the sequencing run it was associated with. The number of seconds you set for this depends on several factors, such as run size and network speed. It is suggested to use two days (172800 seconds) at least to be conservative.

Workflow state

The state of the workflow for a given run directory is tracked both locally in a SQLite database as well as Google Firestore - a NoSQL database. Local state is tracked for the purpose of being able to restart workflows if a child process ever crashes, or if the node goes down. Firestore is used to enable downstream applications to query the collection (whose name is specified in your configuration file) to do their own post-processing as desired. For example, an external tool could query the collection and ask if a given run is completed yet. Completed in this sense means that the run was tarred and uploded to a Google bucket. Then, the tool could tell where the tarfile blob is located.

SQLite

There is a record for every sequencing run, which is stored in the tasks table - the only table. The possible fields are:

  • name: The name of the sequencing run.

  • pid: The process ID of the workflow that is running or that already ran.

  • tarfile: The path to the local tarfil that was generated by the tar task.

  • gcp_tarfile: The blob object path in the Google bucket, stored as $bucket_name/$blob_name.

Firestore

There is a record in the collection for each sequencing run. The possible fields are:

  • name: The name of the sequencing run. This mirrors the value of the same attribute in the analagous SQLite database record.

  • storage: Bucket storage object path for the tarred run directory in the form $bucket_name/path/to/run.tar.gz

  • workflow_status: The overall status of the worklfow. Possible values are:

    • new

    • starting

    • tarring

    • tarring_complete

    • uploading

    • uploading_complete

    • complete

    • not_running

Installation and setup

This works in later versions of Python 3 only:

pip3 install sruns-monitor

It is recommended to start your compute instance (that will run the monitor) using a service account with the following roles:

  • roles/storage.objectAdmin

  • roles/datastore.owner

Running Test Cases

Each module has associated test cases. There are both unit tests and functional tests.

Unit Tests

Run the following command from within the tests package directory:

python3 -m unittest

There are two unit test modules:

  • test_sqlite_utils.py: Tests methods in the sqlite_utils.Db class. These tests make sure that the methods that interface with the local SQLite database function as expected.

  • test_utils.py: Tests general utility functions in utils.py, such as tarring a run directory, uploading an object to Google Storage, and checking child process state.

Functional Tests

Running the functional tests are especially helpful in letting you know that your environment is set up correctly and that the monitor can access your Firestore database and Google bucket.

The test module is named monitor_integration_tests.py. It is testing logic in the class sruns_monitor.Monitor. Because this class requires a configuration file in JSON format during instantiation, you must create such a file in order to run these tests. The file must be named as conf.json and must reside within the calling directory. The following config parameters should not be specified, however:

  • watchdir

  • completed_runs_dir

  • sqlite_db

That is because within the tests package directory, it includes its own watch directory with mock run directories. The parameters you should provide in the conf.json file for testing are:

  • firestore_collection

  • gcp_bucket_name

  • gcp_bucket_basedir

Then, you run the tests like so:

monitor_integration_tests.py

Note that you should be using a Google service account as described above.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sruns-monitor-0.2.1.tar.gz (21.1 kB view details)

Uploaded Source

Built Distribution

sruns_monitor-0.2.1-py3-none-any.whl (25.7 kB view details)

Uploaded Python 3

File details

Details for the file sruns-monitor-0.2.1.tar.gz.

File metadata

  • Download URL: sruns-monitor-0.2.1.tar.gz
  • Upload date:
  • Size: 21.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for sruns-monitor-0.2.1.tar.gz
Algorithm Hash digest
SHA256 b95934a842e07b66d8d6f3006b5a867ae17b90270b3f351d40eda186a7b27655
MD5 240e3639fa359449bf90ccbacf5a0443
BLAKE2b-256 8379375bdb90228809ef85e2e8fd7718ace5bfc1e7aac1079a3a25bc5f6908f8

See more details on using hashes here.

File details

Details for the file sruns_monitor-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: sruns_monitor-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 25.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for sruns_monitor-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 14f65b4c77fef6313a8481169c1e2acad3fe560c64527f2dde0168848be498f3
MD5 1aede1642e9dd696bbca2ec4348676ef
BLAKE2b-256 1e01c442b90b33a973b9ae5d00d7014acfaba43e0cc95ba483f2a8ab91ab6496

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page