fast dbscan clustering on peptide strings
# fast_dbscan
A lightweight, fast DBSCAN implementation for clustering peptide strings. The
distance calculations and clustering are written in pure C, which is then
wrapped in Python.
*Note*: as implemented, the software assumes all sequences have the same length.
### Installation
#### pip
```
pip3 install fast_dbscan
```
#### Development version
```
git clone https://github.com/harmslab/fast_dbscan
cd fast_dbscan
sudo python3 setup.py install
```
### Usage
#### Stand-alone
Installation places a convenience program called `fast_dbscan` in the path. It
can be invoked on the command line:
```
fast_dbscan filename epsilon [dl]
```
where `filename` is a file that contains sequences of identical length, one per
line, `epsilon` is the neighborhood distance cutoff (see below), and the
optional argument `dl` selects the Damerau-Levenshtein distance function
instead of the simple distance function.
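
Because the software assumes that every sequence has the same length, it can help to check an input file before clustering. The following is a minimal sketch of such a check; `read_sequences` and `check_equal_length` are hypothetical helper names and are not part of `fast_dbscan`, and skipping blank lines is an assumption of this sketch.

```python
from pathlib import Path


def read_sequences(path):
    """Read one sequence per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_equal_length(sequences):
    """Return the shared sequence length, or raise ValueError if lengths differ."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError(f"expected one sequence length, found {sorted(lengths)}")
    return lengths.pop()
```

Calling `check_equal_length(read_sequences(filename))` before running the program fails early with a readable message when the file mixes sequence lengths.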
#### As library
```
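import fast_dbscan

# Placeholder path: a file with one equal-length sequence per line
file_with_sequences = "sequences.txt"

# Cluster using the Damerau-Levenshtein distance
d = fast_dbscan.DBScanWrapper(distance_function='dl')
d.read_file(file_with_sequences)
d.run(epsilon=1, min_neighbors=12)
# Dictionary keying cluster id to sequences
clusters = d.results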
```
### Distance functions
+ `simple`: add up entries in a distance matrix based on the identities of letters
at each column in the alignment. Currently, the software uses Hamming
distance. This could be easily modified to use other matrices, provided
distances can be calculated as integers. The matrix is populated in
`DBScanWrapper.__init__`.
+ `dl`: Damerau-Levenshtein distance, allowing deletion, insertion, substitution,
and transposition.
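
To make the two distance functions concrete, here is a pure-Python sketch of both. It is an illustration only, not the package's C code. The second function implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which is an assumption on my part; the package may implement the unrestricted form. The names `hamming` and `osa_distance` are hypothetical.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent letters counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

For the peptides `"ACDE"` and `"CADE"`, `hamming` returns 2 because two columns differ, while `osa_distance` returns 1 because a single adjacent transposition converts one into the other.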
### Other parameters
+ `epsilon`: the maximum distance between two samples for them to be considered
within the same neighborhood.
+ `min_neighbors`: the minimum number of sequence neighbors required to define
a cluster.
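
The roles of `epsilon` and `min_neighbors` can be seen in a small pure-Python sketch of the textbook DBSCAN procedure. This is not the package's C implementation. It assumes that a neighbor is any other sequence at distance `<= epsilon` and that a sequence does not count as its own neighbor; the library may draw these boundaries differently. The function names are hypothetical.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(sequences, distance, epsilon, min_neighbors):
    """Return one cluster label per sequence; -1 marks noise."""
    n = len(sequences)
    # Neighborhoods exclude the sequence itself (an assumption of this sketch)
    neighbors = [
        [j for j in range(n)
         if j != i and distance(sequences[i], sequences[j]) <= epsilon]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1  # i is a core point, so it seeds a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_neighbors:
                queue.extend(neighbors[j])  # j is also a core point: keep expanding
    return labels
```

With `epsilon=1` and `min_neighbors=2`, the sequences `["AAAA", "AAAB", "AABA", "ABAA", "CCCC", "CCCD", "CCDC", "WWWW"]` fall into two clusters seeded by `"AAAA"` and `"CCCC"`, and `"WWWW"` is labeled as noise. Raising `epsilon` merges more distant sequences into the same neighborhood; raising `min_neighbors` requires denser groups before a cluster forms.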