A utility for validation of peptide identification results in proteomics using amino acid counting.

These details have not been verified by PyPI

Project description

AA_stat

An open source software for amino acid residue modification analyses in proteomics

Overview

AA_stat is a tool for shotgun proteomics that uncovers unexpected peptide modifications, as well as possible artefacts of data acquisition or processing, in the results of proteome analysis. AA_stat calculates and visualizes amino acid occurrence frequencies for the identified peptides. AA_stat processes the results of open search and composes a list of identified mass shifts that can be attributed to modifications.

The processing involves Gaussian fitting of potential peaks, group-specific FDR filtering, amino acid counting, and appearance frequency normalization for the observed mass shifts.

How to cite

The recent version of AA_stat is described in:

AA_stat: Intelligent profiling of in vivo and in vitro modifications from open search results. Journal of Proteomics, Volume 248, 30 September 2021, 104350. https://doi.org/10.1016/j.jprot.2021.104350

A preprint of this article is available on biorXiv: https://www.biorxiv.org/content/10.1101/2020.09.07.286161v2.full.

An initial version of AA_stat was described earlier in:

Validation of Peptide Identification Results in Proteomics Using Amino Acid Counting. Proteomics, Volume 18, Issue 23, December 2018, 1800117. https://doi.org/10.1002/pmic.201800117

Installation

AA_stat requires Python 3.6 or newer. AA_stat GUI requires Python 3.7 or newer.

Install from PyPI (all platforms):

pip install AA_stat

Alternatively, you can install directly from GitHub:

pip install git+https://github.com/SimpleNumber/aa_stat

Create a desktop shortcut

To create a desktop shortcut for AA_stat_GUI, call another command after pip install:

AA_stat_GUI --create-shortcut

Quickstart

Command line

Just call AA_stat with your open search results. Pass spectrum files, as well, if they are available:

AA_stat --pepxml *.pepXML --mzml *.mzML

terminal animation

To check results, open the file called report.html in your browser. Example reports are shown here.

GUI

Run AA_stat_GUI on the command line or do AA_stat_GUI --create-shortcut once and use the desktop shortcut.

Open search results and spectrum files

AA_stat deals with open search results in pepXML or CSV formats and is compatible with the search results obtained using most existing search engines. By default, it is recommended to use MSFragger search engine, available from Nesvizhskii lab website. For details of its operation, see MSFragger User manual.

Most new features of AA_stat require that you also provide spectrum files. AA_stat supports MGF and mzML formats. Note that you must provide the same files that are used for open search. Note: If you run MSFragger on RAW files, you can convert RAW files to mzML and specify those for AA_stat.

Examples

An example of the open search parameters file can be found in the repository here.

Example of MSFragger usage:

java -Xmx8G MSFragger.jar open_search.params HeLa_run1.mzML HeLa_run2.mzML

Example of using AA_stat:

AA_stat --pepxml *.pepXML --mzml *.mzML

User manual

Command line options

usage: AA_stat [-h] [--params PARAMS] [--dir DIR] [-v {0,1,2,3}] [--mgf MGF [MGF ...] | --mzml MZML [MZML ...]] (--pepxml PEPXML [PEPXML ...] | --csv CSV [CSV ...]) [--fmods FMODS] [--vmods VMODS]
               [--enzyme ENZYME] [-n PROCESSES]

options:
  -h, --help            show this help message and exit
  --params PARAMS       CFG file with parameters. If there is no file, AA_stat uses default one. An example can be found at https://github.com/SimpleNumber/aa_stat
  --dir DIR             Directory to store the results. Default value is current directory.
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Output verbosity.
  --mgf MGF [MGF ...]   MGF files to localize modifications.
  --mzml MZML [MZML ...]
                        mzML files to localize modifications.
  --pepxml PEPXML [PEPXML ...]
                        List of input files in pepXML format.
  --csv CSV [CSV ...]   List of input files in CSV format.
  --fmods FMODS         Fixed modifications specified in the search (needed with CSV input). Example: +57.0215 @ C, +229.1630 @ N-term
  --vmods VMODS         Variable modifications specified in the search (needed with CSV input). Example: 15.9959 @ M, 42.0106 @ N-term
  --enzyme ENZYME       Enzyme specificity set in the search (needed with CSV input).
  -n PROCESSES, --processes PROCESSES
                        Maximum number of processes to use.

Configuration file

Configuration parameters can be set in a config file (default values and comments are in default.cfg).

AA_stat supports the following parameters:

Name	Default value	Description
	[data]
decoy prefix		Prefix that is used to indicate decoy sequences in database. If not set, all prefixes in "decoy prefix list" will be tried
decoy prefix list	DECOY_, rev_	A comma-separated list of values. For each input file, the prefix will be selected from this list, unless "decoy prefix" is set explicitly.
FDR	0.02	PSM false discovery rate, that will be used for each mass shift interval.
labels	M D Q C L H S R A W Y P F V N T E G I K	All amino acid residues. It may be helpful if your data contains unusual amino acid residues.
cleavage rule	trypsin	Specify a name from pyteomics.parser.expasy_rules or a valid Python regex.
	[csv input]
delimiter	, (comma)	Delimiter used in CSV input files.
proteins column	protein	Name of column with protein IDs.
proteins delimiter	;	The delimiter of proteins in proteins column.
peptides column	peptide	Name of column with peptide sequences.
measured mass column	precursor_neutral_mass	Name of column with measured peptide masses.
calculated mass column	calc_neutral_pep_mass	Name of column with theorectical peptide masses.
retention time column	retention_time_sec	Name of column with peptide retention times.
next aa column	peptide_next_aa	Name of column with next amino acid in protein sequence.
previous aa column	peptide_prev_aa	Name of column with previous amino acid in protein sequence.
spectrum column	spectrum	Name of column with spectrum IDs.
charge column	assumed_charge	Name of column with assumed charges.
score ascending	yes	"Yes" means smaller scores are better.
	[general]
width of bin in histogram	0.001	Bin width in Da that will be used to make mass shift distribution histogram.
mass shift tolerance	0.005	Tolerance used when comparing mass shifts. This may be smaller than "precursor mass tolerance" in searches, because mass shifts are determined by careful averaging and are more accurate than individual measurements.
open search range	-500, 500	Open search range, in Da.
shifting window	0.03	Mass window, in Da, that will be used for Gaussian fit. At least one mass shift peak should be in this window.
zero peak window	0.05	Mass window, in Da, for initial selection and fit of the zero-shift peak.
threshold for bins	200	The minimal number of PSMs that should be in the interval to start Gaussian fitting.
FDR correction	yes	Use FDR correction when filtering each mass interval.
use specific mass shift window	no	Focusing on specific mass shift (yes/no).
specific mass shift window	15.975, 16.005	Specifying mass range, in Da, which the User wants to focus on.
figure size in inches	5.0, 3.5	Specifying the size of the output figures, in inches (L, H).
zero shift mass tolerance	0.05	Within this accuracy of zero mass shift all input files will be calibrated to 0.0.
zero shift minimum intensity	0.05	Criterion for selection as reference mass shift bin. Relative to the most abundant mass
minimum peptides for mass calibration	100	Minimum amount of unmodified peptides with configured FDR to use them for mass calibration.
mass calibration	gauss_frequency	Which values to use for Gauss fitting. Can be 'gauss_frequency', 'gauss', 'gauss_relative', 'simple' or 'off'
	[clustering]
use clustering	yes	Apply clustering to unmodified peptides when doing Gauss fit calibration. This helps recover partial calibration errors, when measured masses are shifted for some part of the run.
dbscan eps factor	0.2	eps parameter of DBSCAN will be proportional to zero peak window and this factor.
dbscan min_samples	5	Value of DBSCAN min_samples parameter.
cluster span percentage minimum	0.1	Minimum fraction of run duration that a single cluster should cover.
total clustered peptide percentage minimum	0.5	Minimum fraction of all considered peptides that belong to large clusters.
	[fit]
standard deviation threshold for center of peak	15	Threshold value for the standard error of peak center as determined by the Gaussian fit algorithm. This value is expressed in histogram bins (the bin width is configured in [general]).
standard deviation threshold for sigma	0.1	Threshold value for the standard error of sigma, relative to sigma, as determined by the Gaussian fit algorithm.
standard deviation threshold for height	0.15	Threshold value for the standard error of peak height, relative to peak height, as determined by the Gaussian fit algorithm.
shift error	10	Minimal distance between fitted Gaussian peaks (in histogram bins). Closer peaks will be merged.
batch	900	Number of mass shifts to give to each worker process.
	[localization]
ion type	b, y	Ion types to be considered for theoretical peptide spectrum.
fragment ion mass tolerance	0.01
frequency threshold	1.5	Minimum normalized AA frequency to be considered as localization candidate.
minimum matched peaks	4	Minimum peaks to be matched.
always try terminal localization	yes	If enabled, terminal positions are tried for all mass shifts during localization.
try all localizations	no	If enabled, all localizations are possible. Otherwise, localization sites are determined based on occurrence frequencies and Unimod.
	[modifications]
recommend variable modifications	5	Number of modifications recommended for closed search.
recommend multiple modifications on residue	yes	Allows several modifications on one AA residue.
fixed modification intensity threshold	3	Maximum % of peptides containing AA at zero shift to consider a fixed modification.
isotope error abundance threshold	10	Minimum % of isotope error to justify recommendation of isotope error.
minimum localization count	10	Minimum absolute localization count to recommend a variable modification.
configured fixed modifications		for CSV input, specify here the fixed modifications used in your search. Example: `+57.0215 @ C, +229.1629 @ K, +229.1630 @ N-term`.
configured variable modifications		for CSV input, specify here the variable modifications used in your search. Example: `15.9959 @ M, 42.0106 @ N-term`.

Output files

An example of AA_stat output can be found here.

AA_stat produces the following files:

A. Gaussian fit report (gauss_fit.pdf).

B. Summary histogram (summary.png).

C. Charts (PNG and SVG files) of normalized frequencies for each significant mass shift interval. If MGF or mzML files are provided, tables with modifies peptide sequences and localization scores are generated.

D. Summary table (aa_statistics_table.csv) of amino acid frequencies for all mass shifts with Unimod.org links for possible modifications.

E. Summary table (p_values.csv) with p-values for each amino acid frequencies in each mass shift.

F. HTML file (report.html) aggregates and illustrates all results.

G. If MGF or mzML files are provided, a table with localization results is created (localization_statistics.csv).

A. Gaussian fit file shows PSM distributions in intervals that were considered as mass shift peaks. Subplot titles correspond to mass shifts (interval center). Peaks are classified into 3 groups: PASSED - mass shift with good fit, which are considered for subsequent analysis; NO FIT - mass shifts for which the algorithm could not find Gaussian function; FAILED - mass shift with a fit not passing the configured filtering criteria.


Figure 1. Examples of Gaussian fit results in gauss_fit.pdf.

B. Summary histogram shows numbers of filtered PSMs in all mass shift intervals. Number on top of the bin indicates the percentage of all identified PSMs. Each mass shift interval is filtered separately to the user-specified FDR level, using target-decoy approach.


Figure 2. Example of Summary histogram.

C. Charts of normalized frequencies for each significant mass shift. Each chart is named according to the mass shift. Each bar in the chart denotes the normalized occurrence frequency of a specific amino acid residue in the given mass shift interval. The normalized frequency is calculated by:

Counting all amino acids in all non-redundant peptides identified with the given mass shifts;
Dividing the count for the given residue by the total amino acid count for the interval to obtain the occurrence frequency of the residue;
Normalizing the occurrence frequency of the residue by the occurrence frequency of the same residue for the zero mass shift interval.

If the normalized frequency of a residue significantly exceeds 1, that means that this residue is "enriched" in the peptides identified with the corresponding mass shift, suggesting that there is a connection between this residue and the cause of the mass shift. In the simplest case, this residue is modified:


Figure 3. Example of a normalized frequency chart for 15.9943 mass shift (green). Orange -- the percentage of peptides that contain at least one AA residue of a certain kind. Blue -- successful localizations of mass shifts at each AA (only if MS/MS spectra are provided). The counts are not normalized.

D. Summary table (aa_statistics_table.csv) of amino acid frequencies for all mass shifts with peptide counts and Unimod.org links for possible modifications.

E. Summary table (p_values.csv) with p-values for all amino acid frequencies in all mass shifts that indicates the significant deviation amino acid frequency from zero mass shift peak.

F. HTML file (report.html) aggregates and illustrates all results. Examples can be found here.

G. A summary of localization shows the number of peptides in each bin for which a modification was successfully localized in MS/MS. Localization is done by generating theoretical spectra of possible isoforms and scoring them against the experimental spectrum. If there is a clear winner in terms of score, the spectrum is considered localized.

Column name	Description
mass shift	Considered mass shift mass.
# peptides in bin	Number of peptides that passed all filtering procedures.
is isotope	Boolean. True if mass shift is a potential isotope of some other mass shift.
isotope index	If 'is isotope' is True, this column contains the mass of monoisotopic peak.
sum of mass shifts	Shows all possible pairs of mass shifts that produce the given mass shift.
unimod candidates	Localization candidates retrieved from Unimod.org database for given mass shift.
aa_stat candidates	Localization candidates from AA_stat statistics.
candicates for loc	Combination of amino acid candidates from all sources: Unimod, AA_stat results, isotope clusters, sum of modifications, considered for localization of mass shift using MS/MS spectra.
localization	Localization statistics for given mass shift using MS/MS spectra provided in a format "AminoAcid_MassShift:number-of-peptides-with-this-localization". If there is no clear leader for a specific peptide's mass shift localization, it counts as 'non-localized'.

AA_search option

If AA_stat used in "AA_search" mode, MSFragger is run prior to AA_stat. AA_search can also optimize fixed modifications that are used for open search, and repeat open search if needed. This is enabled with -x option (or --optimize-fixed-mods). Full list of supported command-line options:

AA_search [-h] [--params PARAMS] [--MSFragger MSFRAGGER] [--dir DIR] [-v {0,1,2,3}] (--mgf MGF [MGF ...] | --mzml MZML [MZML ...]) [-db FASTA] [--os-params OS_PARAMS] [-x] [-s [SKIP]] [-je JAVA_EXECUTABLE] [-ja JAVA_ARGS]

optional arguments:
  -h, --help            show this help message and exit
  --params PARAMS       CFG file with parameters. If there is no file, AA_stat uses default one. An example can be found at https://github.com/SimpleNumber/aa_stat
  --MSFragger MSFRAGGER
                        Path to MSFragger .jar file. If not specified, MSFRAGGER environment variable is used.
  --dir DIR             Directory to store the results. Default value is current directory.
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Output verbosity.
  --mgf MGF [MGF ...]   MGF files to search.
  --mzml MZML [MZML ...]
                        mzML files to search.
  -db FASTA, --fasta FASTA
                        FASTA file with decoys for open search. None: with included MSFragger parameters, the database is expected to contain decoys. Default decoy prefix is "rev_". If it differs, do not forget to specify it in
                        AA_stat params file.
  --os-params OS_PARAMS
                        Custom open search parameters.
  -x, --optimize-fixed-mods
                        Run multiple searches, automatically determine which fixed modifications to apply.
  -s [SKIP], --skip [SKIP]
                        Skip search if pepXML files exist already. If not specified, no steps are skipped. If specified without value, first step may be skipped. Value is number of steps to skip. Only works with "-x".
  -je JAVA_EXECUTABLE, --java-executable JAVA_EXECUTABLE
  -ja JAVA_ARGS, --java-args JAVA_ARGS

Example of using AA_search:

AA_search --MSFragger /path/to/MSFragger/MSFragger-2.4.jar -x -db fasta_file.fasta --mzml mzml_files.mzML --dir ./save_dir

All searches are saved in separate folders: "step 1", "step 2", etc. In these folders, pepXML files and open search parameters file (os.params) are saved, together with AA_stat results.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.6

Sep 5, 2024

2.5.6

Nov 2, 2023

This version

2.5.5

Jun 26, 2023

2.5.4

May 13, 2022

2.5.3

May 12, 2022

2.5.2

Oct 22, 2021

2.5

Jul 15, 2021

2.4

Jul 12, 2021

2.3

Apr 8, 2021

2.2.4

Apr 5, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

AA_stat-2.5.5.tar.gz (244.7 kB view details)

Uploaded Jun 26, 2023 Source

Built Distribution

AA_stat-2.5.5-py3-none-any.whl (254.6 kB view details)

Uploaded Jun 26, 2023 Python 3

File details

Details for the file AA_stat-2.5.5.tar.gz.

File metadata

Download URL: AA_stat-2.5.5.tar.gz
Upload date: Jun 26, 2023
Size: 244.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for AA_stat-2.5.5.tar.gz
Algorithm	Hash digest
SHA256	`8e848ce5befa3395b669973211f9dbdc9e18f00a967a9eef31bbca831a78bada`
MD5	`8fac094b710498d077b8c857bb20c731`
BLAKE2b-256	`d2f494407c65449e7cf61906b4c3871dccc92c9119c9a2c642080efdb36a98ac`

See more details on using hashes here.

File details

Details for the file AA_stat-2.5.5-py3-none-any.whl.

File metadata

Download URL: AA_stat-2.5.5-py3-none-any.whl
Upload date: Jun 26, 2023
Size: 254.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for AA_stat-2.5.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`06db7b7c83d8b99b705e49c139ded5208d2cda1ff8dc450de441df45693f3e4b`
MD5	`96d7f6ebbabf6593fc42d03e841839ae`
BLAKE2b-256	`0e0548c5cba543c6be77701d24d15fb0fe293be1cc3fae99e5a4879a8efac94d`