Synthetic data generator for snail mutation survey

Snailz

These data generators model genomic analysis of snails in the Pacific Northwest that are growing to unusual size as a result of exposure to pollution.

  • One or more surveys are conducted at one or more sites.
  • Each survey collects genomes and sizes of snails.
  • A grid at each site is marked out to show the presence or absence of pollution.
  • Laboratory staff perform assays of the snails' genetic material.
  • Each assay plate has a design showing the material applied and readings showing the measured response.
  • A plate may be invalidated after the fact if a staff member believes it is contaminated.

Figure: survey sites

Usage

  1. Create a fresh Python environment: mamba create -y -n snailz python=3.12
  2. Activate that environment: mamba activate snailz
  3. Build development version of package: pip install -e .
  4. View available commands: snailz --help
  5. Copy default parameter files: snailz params --outdir .
  6. See how to regenerate datasets: python -c 'import snailz; help(snailz)'

This project also includes a Makefile that will re-execute commands as needed. To see available commands, run:

make DATA=data PARAMS=snailz/params commands

(To keep the Makefile simple, DATA and PARAMS must be defined even for commands that don't need them.) To regenerate all of the datasets, run:

make DATA=data PARAMS=snailz/params datasets

command    action
dev        rebuild development version of package
lint       check code using ruff
clean      remove datafiles
commands   show available commands
datasets   make all datasets
survey     generate survey map
mangled    create inconsistent plate files
db         generate database
plates     generate plate files
assays     generate assay files
samples    sample snails from survey sites
genomes    synthesize genomes
grids      synthesize pollution grids

Database

The final database data/lab.db is structured as shown below. Note that the data from the file assays.json is split between several tables. Note also that the SQLite database file is not included in this repository because its binary representation changes each time it is regenerated (even though the values it contains stay the same). The map of survey locations in data/survey.png is not included in the repository for the same reason, but a duplicate is manually saved in img/survey.png.

Figure: database schema

  • site: survey site
    • site_id: primary key (text)
    • lon: longitude of site reference marker (float deg)
    • lat: latitude of site reference marker (float deg)
  • survey
    • survey_id: primary key (text)
    • site_id: foreign key of site where survey was conducted (text)
    • date: date that survey was conducted (date, YYYY-MM-DD)
  • sample: sample taken from survey
    • sample_id: primary key (int, 1-1 with experiment.sample_id)
    • survey_id: foreign key of survey (text)
    • lon: longitude of sample site (float deg)
    • lat: latitude of sample site (float deg)
    • sequence: genome sequence of sample (text)
    • size: snail size (float)
  • experiment: experiment done on sample
    • sample_id: primary key (int, 1-1 with sample.sample_id)
    • kind: kind of experiment (text, either 'ELISA' or 'JESS')
    • start: start date (date, YYYY-MM-DD)
    • end: end date (date, YYYY-MM-DD, null if experiment is ongoing)
  • staff
    • staff_id: primary key (int)
    • personal: personal name (text)
    • family: family name (text)
  • performed: join table showing which staff members performed which experiments
    • staff_id: foreign key of staff member
    • sample_id: foreign key of sample/experiment
  • plate: information about single assay plate
    • plate_id: primary key (int)
    • sample_id: foreign key of sample/experiment (int)
    • date: date that plate was run (date, YYYY-MM-DD)
    • filename: filename of design/results file (text)
  • invalidated: invalidated plates
    • plate_id: foreign key of plate (int)
    • staff_id: foreign key of staff member who did invalidation (int)
    • date: date that plate was invalidated (date, YYYY-MM-DD)
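
As a rough illustration of how these tables relate, the sketch below builds a small in-memory SQLite database containing only the staff, plate, and invalidated tables, then queries it. The column names follow the schema above, but the rows, the SQL column types, and the helper names valid_plates and invalidations are invented for this example; the generated lab.db may declare its tables differently.

```python
import sqlite3

# Three of the documented tables, with invented rows (not real snailz output).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table staff (staff_id integer primary key, personal text, family text);
    create table plate (plate_id integer primary key, sample_id integer,
                        date text, filename text);
    create table invalidated (plate_id integer, staff_id integer, date text);

    insert into staff values (1, 'Ana', 'Silva'), (2, 'Lee', 'Wong');
    insert into plate values
        (10, 100, '2023-01-05', 'aaaa.csv'),
        (11, 100, '2023-01-06', 'bbbb.csv'),
        (12, 101, '2023-01-07', 'cccc.csv');
    insert into invalidated values (11, 2, '2023-02-01');
    """
)


def valid_plates(connection):
    """Return (plate_id, filename) for plates with no invalidation record."""
    query = """
        select plate.plate_id, plate.filename
        from plate
        left join invalidated on plate.plate_id = invalidated.plate_id
        where invalidated.plate_id is null
        order by plate.plate_id
    """
    return connection.execute(query).fetchall()


def invalidations(connection):
    """Return (plate_id, personal, family, date) for each invalidated plate."""
    query = """
        select invalidated.plate_id, staff.personal, staff.family, invalidated.date
        from invalidated
        join staff on staff.staff_id = invalidated.staff_id
        order by invalidated.plate_id
    """
    return connection.execute(query).fetchall()


print(valid_plates(conn))
print(invalidations(conn))
```

To try the same queries on generated data, replace ":memory:" with the path to lab.db and drop the executescript call.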

Data Files

./data contains a generated dataset for reference. As noted above, it does not contain the SQLite database file lab.db; run snailz db to regenerate it. (See help(snailz) for an example invocation.)

  • Genomes: genomes.json
    • length: number of base pairs (int > 0)
    • reference: the unmutated reference genome (text)
    • individuals: sequences for individuals (list of text)
    • locations: locations of mutations (list of int)
    • susceptible_loc: location of mutation of interest (int >= 0)
    • susceptible_base: mutated base responsible for size change (char)
  • Grids: grids/*.csv (one file per site)
    • 1/0: presence/absence of contamination at sample location
  • Samples: grids/samples.csv
    • sample_id: unique ID for genetic sample (text)
    • survey_id: which survey it was taken in (text)
    • lon: longitude of sample site (float)
    • lat: latitude of sample site (float)
    • sequence: sampled gene sequence (text)
    • size: snail weight (float, grams)
  • Assays: assays.json
    • staff:
      • staff_id: unique staff member identifier (int > 0)
      • personal: personal name (text)
      • family: family name (text)
    • experiment: experiment details
      • sample_id: sample that experiment used (int > 0)
      • kind: "ELISA" or "JESS" (text)
      • start: start date (date, YYYY-MM-DD)
      • end: end date (date, YYYY-MM-DD or None if experiment incomplete)
    • performed: join table showing who performed which experiments
      • staff_id: foreign key to staff
      • sample_id: foreign key to experiment
    • plate: details of assay plates used in experiments
      • plate_id: unique plate identifier (int > 0)
      • sample_id: foreign key to sample (int)
      • date: date plate was run (date, YYYY-MM-DD)
      • filename: name of design and results files (text)
    • invalidated: which plates have been invalidated
      • plate_id: foreign key to plate (int)
      • staff_id: foreign key to staff member responsible (int)
      • date: invalidation date (date, YYYY-MM-DD)
  • Plates are represented by matching files in the designs and readings directories
    • designs/*.csv: assay plate designs
      • header: machine type, file type ("design" or "readings"), staff ID
      • blank line
      • table with column and row titles showing material in each well
    • readings/*.csv: assay plate readings
      • header: machine type, file type ("design" or "readings"), staff ID
      • blank line
      • table with column and row titles showing reading from each well
  • To simulate the messiness of real experimental data, the tidy assay plate files in readings/*.csv are copied to mangled/*.csv with random changes:
    • Some files have a staff member's name added in the first row.
    • Some have an extra header row containing the experiment date.
    • Some have a footer with the staff member's ID.
    • In some, the values are offset one column to the right.
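
The tidy plate layout described above (a header row, a blank line, then a table with column and row titles) can be read with the standard csv module. The following is a minimal sketch, not the package's own reader: the sample text, the machine name "ExampleReader 100", the well titles, and the function read_plate are all hypothetical, and mangled files would need extra handling.

```python
import csv
import io

# Hypothetical file content; real header values and well titles may differ.
SAMPLE = """\
ExampleReader 100,readings,7

,A,B,C
1,0.12,3.40,0.08
2,2.95,0.11,3.10
"""


def read_plate(text):
    """Split a tidy plate file into header fields and a dict of wells."""
    rows = list(csv.reader(io.StringIO(text)))
    machine, kind, staff_id = rows[0][:3]
    # Drop the header row and any blank separator rows.
    body = [row for row in rows[1:] if any(cell.strip() for cell in row)]
    columns = body[0][1:]  # column titles; the first cell is the empty corner
    wells = {}
    for row in body[1:]:
        row_title = row[0]
        for col_title, value in zip(columns, row[1:]):
            # Values stay as text because design files hold material labels.
            wells[(row_title, col_title)] = value
    return {
        "machine": machine,
        "kind": kind,
        "staff_id": int(staff_id),
        "wells": wells,
    }


plate = read_plate(SAMPLE)
print(plate["kind"], plate["staff_id"], plate["wells"][("2", "B")])
```

Keeping well values as strings lets the same function read both design and readings files; callers can convert readings with float() as needed.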

Workflow

The workflow used to generate the database and data files is shown below:

  • snailz or snailz --help: show available commands
  • snailz all: make all datasets
  • snailz map: generate SVG map of sample locations (in progress)
  • snailz mangle: create mangled plate reading files
  • snailz db: generate database
  • snailz plates: generate plate files
  • snailz assays: generate assay files
  • snailz samples: sample snails from survey sites
  • snailz genomes: synthesize genomes
  • snailz grids: synthesize pollution grids
  • snailz clean: remove all datasets

Figure: data generation workflow

Parameters

./snailz/params contains the parameter files used to control generation of the reference dataset. These are included in the package and can be copied into the current directory using snailz params --outdir . (replace . with another directory name as desired). snailz params also copies a Makefile that can re-run commands with appropriate parameters; see the table of commands given earlier for options.

  • Sites: sites.csv
    • site_id: unique label for site (text)
    • lon: longitude of site reference marker (deg)
    • lat: latitude of site reference marker (deg)
  • Surveys: surveys.csv
    • survey_id: unique label for survey (text)
    • site_id: ID of site where survey was conducted (text)
    • date: date that survey was conducted (date, YYYY-MM-DD)
    • spacing: spacing of measurement points (float, meters)
  • Genomes: genomes.json
    • length: number of base pairs in sequences (int > 0)
    • num_genomes: how many individuals to generate (int > 0)
    • num_snp: number of single nucleotide polymorphisms (int > 0)
    • prob_other: probability of non-significant mutations (float in 0..1)
    • seed: RNG seed (int > 0)
    • snp_probs: probability of selecting various bases (list of 4 float summing to 1.0)
  • Grids: grids.json
    • depth: range of random values per cell (int > 0)
    • height: number of cells on Y axis (int > 0)
    • seed: RNG seed (int > 0)
    • width: number of cells on X axis (int > 0)
  • Assays: assays.json
    • assay_duration: range of days for each assay (ordered pair of int >= 0)
    • assay_plates: range of plates per assay (ordered pair of int >= 1)
    • assay_staff: range of staff in each assay (ordered pair of int > 0)
    • assay_types: types of assays (list of text)
    • control_val: nominal reading value for control wells (float > 0)
    • controls: labels to use for control wells (list of text)
    • enddate: end of all experiments
    • filename_length: length of stem of design/readings filenames (int > 0)
    • invalid: probability of plate being invalidated (float in 0..1)
    • locale: locale to use when generating staff names (text)
    • seed: RNG seed (int > 0)
    • staff: number of staff (int > 0)
    • startdate: start of all experiments
    • stdev: standard deviation on readings (float > 0)
    • treated_val: nominal reading value for treated wells (float > 0)
    • treatment: label to use for treated wells (text)
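
The constraints listed above for genomes.json can be checked before running a generator. The sketch below is one possible check written for illustration; the function check_genome_params and its error messages are not part of snailz, and the package may validate parameters differently or not at all.

```python
import math


def check_genome_params(params):
    """Return a list of problems found in a genomes.json-style dict."""
    problems = []

    # length, num_genomes, num_snp, and seed are all documented as int > 0.
    for key in ("length", "num_genomes", "num_snp", "seed"):
        value = params.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be an int > 0")

    # prob_other is documented as a float in 0..1.
    prob_other = params.get("prob_other")
    if not isinstance(prob_other, (int, float)) or not 0.0 <= prob_other <= 1.0:
        problems.append("prob_other must be a float in 0..1")

    # snp_probs is documented as a list of 4 floats summing to 1.0.
    snp_probs = params.get("snp_probs")
    snp_ok = (
        isinstance(snp_probs, list)
        and len(snp_probs) == 4
        and all(isinstance(p, (int, float)) for p in snp_probs)
        and math.isclose(sum(snp_probs), 1.0)
    )
    if not snp_ok:
        problems.append("snp_probs must be 4 floats summing to 1.0")

    return problems


# Invented parameter values, used only to exercise the checks.
good = {
    "length": 20,
    "num_genomes": 5,
    "num_snp": 3,
    "prob_other": 0.01,
    "seed": 1234,
    "snp_probs": [0.5, 0.25, 0.125, 0.125],
}
bad = dict(good, num_snp=0, snp_probs=[0.5, 0.5])

print(check_genome_params(good))
print(check_genome_params(bad))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a parameter file at once.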
