fast dbscan clustering on peptide strings

# fast_dbscan
A lightweight, fast DBSCAN implementation for clustering peptide strings. The
distance calculations and clustering are written in pure C and wrapped in
Python.

*Note*: as implemented, the software assumes all sequences have the same length.

### Installation

#### pip
```
pip3 install fast_dbscan
```

#### Development version
```
git clone https://github.com/harmslab/fast_dbscan
cd fast_dbscan
sudo python3 setup.py install
```

### Usage

#### Stand-alone
Installing the package places a convenience program called `fast_dbscan` on the
path. It can be invoked on the command line:

```
fast_dbscan filename epsilon [dl]
```

where `filename` is a file containing sequences of identical length, one per
line; `epsilon` is the neighborhood distance cutoff (see below); and the
optional argument `dl` selects the Damerau-Levenshtein distance function
instead of the simple distance function.
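
Because the input must hold sequences of identical length, it can help to check a file before clustering it. The following is a minimal sketch of such a check; `read_equal_length_sequences` is a hypothetical helper written for illustration and is not part of `fast_dbscan`. It also skips blank lines, which is an assumption rather than documented behavior of the program.

```python
from pathlib import Path


def read_equal_length_sequences(path):
    """Read one sequence per line and check that all share the same length.

    Hypothetical helper for illustration; not part of fast_dbscan.
    """
    sequences = [
        line.strip()
        for line in Path(path).read_text().splitlines()
        if line.strip()  # skip blank lines (an assumption of this sketch)
    ]
    lengths = {len(s) for s in sequences}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return sequences
```

If the check raises `ValueError`, the file mixes sequence lengths and should be fixed before it is passed to `fast_dbscan`.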

#### As library
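```
import fast_dbscan

# Path to a text file with one sequence per line (placeholder name)
file_with_sequences = "sequences.txt"

d = fast_dbscan.DBScanWrapper(distance_function='dl')
d.read_file(file_with_sequences)
d.run(epsilon=1, min_neighbors=12)

# Dictionary keying cluster id to sequences
clusters = d.results
```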
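
Once the results dictionary is available, a quick summary of cluster sizes is often the first thing to look at. The sketch below works on any plain dictionary keyed by cluster id; it assumes each value is a list (or other sized collection) of sequences, which the description above suggests but does not spell out. `summarize_clusters` is a hypothetical helper, not part of the library.

```python
def summarize_clusters(clusters):
    """Return (cluster_id, size) pairs, largest cluster first.

    Assumes `clusters` maps each cluster id to a sized collection of
    sequences. Hypothetical helper, not part of fast_dbscan.
    """
    sizes = [(cluster_id, len(members)) for cluster_id, members in clusters.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)
```

For example, `summarize_clusters({0: ["AAAA", "AAAT"], 1: ["GGGG", "GGGC", "GGCG"]})` lists cluster `1` (three sequences) before cluster `0` (two sequences).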

### Distance functions

+ `simple`: add up entries in a distance matrix based on the identities of the letters
at each column in the alignment. Currently, the software uses the Hamming
distance. This could easily be modified to use other matrices, provided
distances can be calculated as integers. The matrix is populated in
`DBScanWrapper.__init__`.
+ `dl`: Damerau-Levenshtein distance, allowing deletion, insertion, substitution,
and transposition.
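
To make the two distances concrete, here is a pure-Python sketch of each. It is for illustration only and is not the package's C code. The function names are hypothetical, and `osa_distance` implements the optimal string alignment variant of Damerau-Levenshtein; which variant the package uses is not stated here, so treat that choice as an assumption.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a, b):
    """Damerau-Levenshtein distance, optimal string alignment variant."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

The two functions disagree when adjacent letters are swapped: `hamming_distance("ACDE", "ADCE")` counts two mismatched columns, while `osa_distance("ACDE", "ADCE")` counts a single transposition.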

### Other parameters
+ `epsilon`: the maximum distance between two samples for them to be considered
within the same neighborhood.
+ `min_neighbors`: the minimum number of sequence neighbors required to define
a cluster.
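
How the two parameters interact is easiest to see in a small brute-force DBSCAN. The following is a sketch under stated assumptions, not the package's C implementation: a sequence's neighborhood excludes the sequence itself, "within" means distance `<= epsilon`, and noise is labeled `-1`. The names `dbscan` and `hamming` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(sequences, distance, epsilon, min_neighbors):
    """Return one label per sequence: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(sequences)
    # Neighborhood of i: every other sequence within epsilon (brute force, O(n^2)).
    neighbors = [
        [j for j in range(n)
         if j != i and distance(sequences[i], sequences[j]) <= epsilon]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = cluster_id  # i is a core point and seeds a new cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_neighbors:
                frontier.extend(neighbors[j])  # j is a core point, keep expanding
        cluster_id += 1
    return labels


sequences = ["AAAA", "AAAT", "AATA", "ATAA", "GGGG", "GGGC", "GGCG", "CCTT"]
labels = dbscan(sequences, hamming, epsilon=1, min_neighbors=2)
# labels == [0, 0, 0, 0, 1, 1, 1, -1]
```

In this toy input, `AAAA` and `GGGG` each have at least two sequences within distance 1, so each seeds a cluster. Their one-mismatch variants join as border points, and `CCTT` has no neighbors and is left as noise. Raising `min_neighbors` to 4 would leave every sequence as noise.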


Source Distribution

fast_dbscan-0.0.1.tar.gz (6.2 kB view details)

Uploaded Source

File details

Details for the file fast_dbscan-0.0.1.tar.gz.

File metadata

  • Download URL: fast_dbscan-0.0.1.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for fast_dbscan-0.0.1.tar.gz
Algorithm Hash digest
SHA256 f583be3b98827f8ad2731a5b6e3b96aa26b7a6b6162da97e32768be30d578a6f
MD5 d708442065e8203c96f61dc24feaecb8
BLAKE2b-256 dfd007e882a76977bb2508eb785c0e8789cc3b16705a2b7292a47e2bbd653509
