A Python reader (and eventually writer) for ms files

msssPy

An ms/msms file reader for Python.

Pronounced “Mississippi”

This reader improves on basic ms file readers by keeping a cache of indices for each file it reads, which significantly speeds up random access to individual samples within a multiple-replicate ms file.

This can be especially useful for machine learning tasks, during which ms files need to be randomly accessed multiple times. Files already seen (by the same process) are read much more quickly than the first time they are accessed within that process. A future version will also add cache persistence.
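To illustrate why an index cache helps, here is a minimal sketch of the general technique, not mssspy's actual implementation. It relies only on the ms output convention that each replicate starts with a line beginning with "//". The function names build_offset_index and read_replicate are hypothetical and are not part of the mssspy API.

```python
import io


def build_offset_index(stream):
    """Scan a binary ms stream once, recording the byte offset of each '//' line."""
    offsets = []
    stream.seek(0)
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:  # end of file
            break
        if line.startswith(b"//"):
            offsets.append(pos)
    return offsets


def read_replicate(stream, offsets, i):
    """Seek straight to replicate i and return its non-blank lines as text."""
    stream.seek(offsets[i])
    stream.readline()  # skip the '//' marker itself
    lines = []
    while True:
        line = stream.readline()
        if not line or line.startswith(b"//"):
            break  # reached end of file or the next replicate
        text = line.decode("ascii").strip()
        if text:
            lines.append(text)
    return lines


# A tiny two-replicate ms file held in memory (made-up demo data).
DEMO = (
    b"ms 2 2 -t 1.0\n"
    b"1 2 3\n"
    b"\n"
    b"//\n"
    b"segsites: 2\n"
    b"positions: 0.25 0.75\n"
    b"01\n"
    b"10\n"
    b"\n"
    b"//\n"
    b"segsites: 1\n"
    b"positions: 0.5\n"
    b"1\n"
    b"0\n"
)

stream = io.BytesIO(DEMO)
offsets = build_offset_index(stream)          # one full scan, done once
second = read_replicate(stream, offsets, 1)   # later reads jump directly
print(second)
```

Once the offsets are known, fetching any replicate costs one seek plus the size of that replicate, rather than a scan from the start of the file.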

Additionally, mssspy adds the ability to plug in different "reader" implementations that use different parsing algorithms. Two built-in readers are currently included: the "slow" reader, which is more fault-tolerant and provides better error reporting, and a "faster" reader, which assumes correctly formatted ms files and sacrifices more careful validation for speed.
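The trade-off between the two kinds of reader can be sketched as follows. This is an illustration under assumptions, not mssspy's code: the function names, the READERS registry, and the way a reader is selected are all hypothetical, and the sketch ignores corner cases such as replicates with zero segregating sites.

```python
def parse_fast(lines):
    """Trust the input: no validation, minimal work."""
    positions = [float(x) for x in lines[1].split()[1:]]
    haplotypes = [[int(c) for c in row] for row in lines[2:]]
    return positions, haplotypes


def parse_strict(lines):
    """Validate every line and raise descriptive errors on malformed input."""
    if len(lines) < 2 or not lines[0].startswith("segsites:"):
        raise ValueError(f"expected a 'segsites:' line, got {lines[:1]!r}")
    segsites = int(lines[0].split(":", 1)[1])
    if not lines[1].startswith("positions:"):
        raise ValueError(f"expected a 'positions:' line, got {lines[1]!r}")
    positions = [float(x) for x in lines[1].split()[1:]]
    if len(positions) != segsites:
        raise ValueError(
            f"segsites is {segsites} but {len(positions)} positions were given"
        )
    haplotypes = []
    for n, row in enumerate(lines[2:], start=1):
        if len(row) != segsites or set(row) - {"0", "1"}:
            raise ValueError(f"haplotype {n} is malformed: {row!r}")
        haplotypes.append([int(c) for c in row])
    return positions, haplotypes


# Hypothetical registry mapping a reader name to a parsing function.
READERS = {"slow": parse_strict, "faster": parse_fast}

good = ["segsites: 2", "positions: 0.25 0.75", "01", "10"]
print(READERS["faster"](good))
print(READERS["slow"](good))
```

Both functions return the same result for well-formed input. Given a truncated haplotype row, the strict version raises a ValueError naming the offending row, while the fast version silently returns ragged data.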

Basic Usage

To read an ms file, the main high-level interface is the MSFile class. Simply open a file like:

>>> from mssspy import MSFile
>>> msf = MSFile('path/to/simulations.ms')

You can then access the individual replicates in the file, or "samples" using index notation:

>>> msf[0]
Sample(haplotypes=array([[0, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]], dtype=uint8), positions=array([0.283, 0.55 , 0.589, 0.715, 0.988]))

Index notation is required even if the file contains only one sample, which is then accessed as msf[0].

If you intend to read multiple samples from the same file while it's open, it is also more efficient to use MSFile in a with statement, e.g.:

>>> with MSFile('path/to/simulations.ms') as msf:
...     all_samples = list(msf)

Note: There is currently no way to get the length of the file in samples, so len(msf) does not work. This is because it would require scanning through the entire file to count the number of samples, which would be inefficient. However, this capability will be added in a future release.

In the meantime, you can still iterate over the MSFile, which tries each possible index starting from 0 until an IndexError is raised. This is why list(msf) works.
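The behaviour described above matches Python's fallback iteration protocol: an object that defines __getitem__ but not __iter__ is iterated by calling it with 0, 1, 2, ... until IndexError is raised. The sketch below demonstrates this with a stand-in class. IndexedSamples and count_samples are hypothetical names used for illustration, and whether MSFile relies on this fallback or defines its own iterator is an assumption.

```python
class IndexedSamples:
    """Stand-in container that supports indexing but defines no __len__ or __iter__."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __getitem__(self, index):
        # Indexing a list past its end raises IndexError, which ends iteration.
        return self._samples[index]


def count_samples(container):
    """Count items by iterating to exhaustion; this costs one full pass."""
    return sum(1 for _ in container)


samples = IndexedSamples(["rep0", "rep1", "rep2"])
print(list(samples))           # iteration works without __iter__
print(count_samples(samples))  # a workaround while len() is unavailable
```

Counting this way reads every sample once, so with a real ms file it is best done once with the result kept, rather than repeated inside a loop.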

And that's basically it!

Advanced Usage

TODO

TODO List for Future Releases

  • Add "fast" reader written in C(ython) and compare its performance to the existing "faster" reader.

  • More thorough parsing (e.g. support for time: and tree data parsing).

  • Support for writing.

  • More thorough documentation including API documentation.
