CREPE pitch tracker

CREPE is a monophonic pitch tracker based on a deep convolutional neural network operating directly on the time-domain waveform input. As of 2018, CREPE is state-of-the-art, outperforming popular pitch trackers such as pYIN and SWIPE.

Further details are provided in the following paper:

CREPE: A Convolutional Representation for Pitch Estimation
Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello.
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

We kindly request that academic publications making use of CREPE cite the aforementioned paper.

Installing CREPE

CREPE is hosted on PyPI. To install, run the following command in your Python environment:

$ pip install --upgrade tensorflow  # if you don't already have tensorflow >= 1.6.0
$ pip install crepe

To install the latest version from source, clone the repository and, from the top-level crepe folder, run:

$ python setup.py install

Using CREPE

Using CREPE from the command line

This package includes a command line utility crepe and a pre-trained version of the CREPE model for easy use. To estimate the pitch of audio_file.wav, run:

$ crepe audio_file.wav

or

$ python -m crepe audio_file.wav

The resulting audio_file.f0.csv contains 3 columns: the first contains the timestamps (a 10 ms hop size is used), the second contains the predicted fundamental frequency in Hz, and the third contains the voicing confidence, i.e. the confidence in the presence of a pitch:

time,frequency,confidence
0.00,185.616,0.907112
0.01,186.764,0.844488
0.02,188.356,0.798015
0.03,190.610,0.746729
0.04,192.952,0.771268
0.05,195.191,0.859440
0.06,196.541,0.864447
0.07,197.809,0.827441
0.08,199.678,0.775208
...
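
The CSV output can be read back with Python's standard library. The following is a minimal sketch, assuming the three-column header shown above; read_f0_csv is a hypothetical helper for illustration and is not part of CREPE.

```python
import csv
import io

def read_f0_csv(fileobj):
    """Parse CREPE-style CSV rows into (time, frequency, confidence) float tuples."""
    reader = csv.DictReader(fileobj)
    return [
        (float(row["time"]), float(row["frequency"]), float(row["confidence"]))
        for row in reader
    ]

# A few rows copied from the example output above, used in place of a real file.
sample = io.StringIO(
    "time,frequency,confidence\n"
    "0.00,185.616,0.907112\n"
    "0.01,186.764,0.844488\n"
    "0.02,188.356,0.798015\n"
)
rows = read_f0_csv(sample)
hop = rows[1][0] - rows[0][0]  # spacing between consecutive timestamps, in seconds
```

For a file on disk, the same helper can be called inside `with open("audio_file.f0.csv", newline="") as f:`.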

Timestamps

Following the convention adopted by popular audio processing libraries such as Essentia and Librosa, from v0.0.5 onwards CREPE will pad the input signal such that the first frame is zero-centered (the center of the frame corresponds to time 0) and generally all frames are centered around their corresponding timestamp, i.e. frame D[:, t] is centered at audio[t * hop_length]. This behavior can be changed by specifying the optional --no-centering flag, in which case the first frame will start at time zero and generally frame D[:, t] will begin at audio[t * hop_length]. Sticking to the default behavior (centered frames) is strongly recommended to avoid misalignment with features and annotations produced by other common audio processing tools.
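
The difference between the two conventions can be made concrete with a little index arithmetic. The sketch below is illustrative only: frame_bounds is a hypothetical helper, the 160-sample hop follows from a 10 ms hop at 16 kHz, and the 1024-sample frame length is an assumption not stated in this document.

```python
def frame_bounds(t, hop_length=160, frame_length=1024, center=True):
    """Return (start, end) sample indices of frame t relative to the unpadded signal.

    With centering, indices can fall outside the signal; those samples
    correspond to the padding added around the input.
    """
    anchor = t * hop_length
    start = anchor - frame_length // 2 if center else anchor
    return start, start + frame_length

centered = frame_bounds(0)                  # frame 0 straddles time zero
uncentered = frame_bounds(0, center=False)  # frame 0 starts at time zero
```

In the centered case the midpoint of each frame sits at t * hop_length, which is why the timestamps line up with those of other tools that center their frames.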

Temporal smoothing

By default CREPE does not apply temporal smoothing to the pitch curve, but Viterbi smoothing is supported via the optional --viterbi command line argument.

Saving the activation matrix

The script can also optionally save the output activation matrix of the model to an npy file (--save-activation), where the matrix dimensions are (n_frames, 360) using a hop size of 10 ms (there are 360 pitch bins covering 20 cents each).
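
To relate the 360 columns of the activation matrix to frequencies, each bin index can be mapped to cents and then to Hz. The sketch below is a hedged illustration: the 20-cent spacing comes from the description above, while the offset of bin 0 and the 10 Hz reference are assumptions, and bin_to_hz and peak_frequencies are hypothetical helper names.

```python
CENTS_PER_BIN = 20
N_BINS = 360
# Assumed position of bin 0, in cents relative to 10 Hz (not stated in this document).
BIN0_CENTS = 1997.3794084376191

def bin_to_hz(index):
    """Convert a pitch-bin index to its center frequency in Hz."""
    cents = BIN0_CENTS + CENTS_PER_BIN * index
    return 10.0 * 2.0 ** (cents / 1200.0)

def peak_frequencies(activation):
    """For each frame (row), return the frequency of the strongest bin."""
    result = []
    for row in activation:
        values = list(row)
        peak = max(range(len(values)), key=values.__getitem__)
        result.append(bin_to_hz(peak))
    return result

# Toy activation with two frames; a saved matrix could be loaded with numpy.load instead.
toy = [[0.0] * N_BINS for _ in range(2)]
toy[0][0] = 1.0
toy[1][N_BINS - 1] = 1.0
low, high = peak_frequencies(toy)
```

Taking the strongest bin alone limits the resolution to 20 cents, so this is only a coarse reading of the matrix.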

The script can also output a plot of the activation matrix (--save-plot), saved to audio_file.activation.png, including an optional visual representation of the model's voicing detection (--plot-voicing). Here's an example plot of the activation matrix (without the voicing overlay) for an excerpt of male singing voice:

[Figure: activation matrix (salience) plot]

Batch processing

For batch processing of files, you can provide a folder path instead of a file path:

$ crepe audio_folder

The script will process all WAV files found inside the folder.

Additional usage information

For more information on the usage, please refer to the help message:

$ crepe --help

Using CREPE inside Python

CREPE can be imported as a module and used directly in Python. Here's a minimal example:

import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('/path/to/audiofile.wav')
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
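
A common follow-up step is to discard frames where the voicing confidence is low. The following sketch shows one way to do this on the returned frequency and confidence values; mask_unvoiced and the threshold value are hypothetical choices for illustration, not something CREPE prescribes.

```python
def mask_unvoiced(frequency, confidence, threshold=0.5):
    """Replace frequencies whose confidence is below threshold with NaN."""
    return [
        float(f) if c >= threshold else float("nan")
        for f, c in zip(frequency, confidence)
    ]

# The first three rows of the example CSV output shown earlier.
frequency = [185.616, 186.764, 188.356]
confidence = [0.907112, 0.844488, 0.798015]
masked = mask_unvoiced(frequency, confidence, threshold=0.8)
```

A suitable threshold depends on the material, so it is worth checking a few values against the audio before settling on one.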

Argmax-local Weighted Averaging

This release of CREPE uses a weighted averaging formula that differs slightly from the one in the paper. It considers only the neighborhood around the maximum activation, which has been shown to further improve pitch accuracy.
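
The idea can be sketched as an activation-weighted mean of the bin positions (in cents), restricted to a small window around the strongest bin. In the sketch below, the window of 4 bins on either side of the peak is an assumption, local_average_cents is a hypothetical name, and the toy cents axis starts at 0 for simplicity.

```python
def local_average_cents(activation, cents, radius=4):
    """Activation-weighted mean of `cents` over bins within `radius` of the peak."""
    peak = max(range(len(activation)), key=activation.__getitem__)
    lo = max(0, peak - radius)
    hi = min(len(activation), peak + radius + 1)
    weight_sum = sum(activation[lo:hi])
    weighted = sum(a * c for a, c in zip(activation[lo:hi], cents[lo:hi]))
    return weighted / weight_sum

# Toy frame: 360 bins spaced 20 cents apart, with a symmetric peak around bin 100.
cents = [20.0 * i for i in range(360)]
frame = [0.0] * 360
frame[99], frame[100], frame[101] = 0.5, 1.0, 0.5
estimate = local_average_cents(frame, cents)
```

Because the average is taken over neighboring bins, the estimate can fall between bin centers, which gives finer resolution than the 20-cent bin spacing. The sketch does not guard against a frame whose activations are all zero.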

Please Note

  • The current version only supports WAV files as input.
  • The model is trained on 16 kHz audio, so if the input audio has a different sample rate, it will first be resampled to 16 kHz using resampy.
  • Due to subtle numerical differences between frameworks, Keras should be configured to use the TensorFlow backend for the best performance. The model was trained using Keras 2.1.5 and TensorFlow 1.6.0, and newer versions of TensorFlow seem to work as well.
  • Prediction is significantly faster if Keras (and the corresponding backend) is configured to run on GPU.
  • The provided model was trained on the following datasets, composed of vocal and instrumental audio, and is therefore expected to work best on these types of audio signal.
    • MIR-1K [1]
    • Bach10 [2]
    • RWC-Synth [3]
    • MedleyDB [4]
    • MDB-STEM-Synth [5]
    • NSynth [6]

References

[1] C.-L. Hsu et al. "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset", IEEE Transactions on Audio, Speech, and Language Processing. 2009.

[2] Z. Duan et al. "Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions", IEEE Transactions on Audio, Speech, and Language Processing. 2010.

[3] M. Mauch et al. "pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2014.

[4] R. M. Bittner et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research", Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference. 2014.

[5] J. Salamon et al. "An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets", Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference. 2017.

[6] J. Engel et al. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", arXiv preprint: 1704.01279. 2017.
