Why do we include only masses of prefix and suffix subpeptides in the ideal spectrum of a peptide (e.g., RED or DCA for REDCA)? Why don't we include masses of other subpeptides (like EDC)?
Internal subpeptides like EDC of REDCA require two bonds (before E and after C) to be broken to generate the fragment ion. In contrast, prefix and suffix subpeptides require only one bond to be broken. As a result, although internal subpeptides correspond to some fragment ions, we ignore them because they are much less common than the fragment ions generated by prefix and suffix subpeptides.
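To make the distinction concrete, here is a minimal sketch that builds the ideal spectrum of REDCA from prefix and suffix masses only. The function name `ideal_spectrum`, the table `INTEGER_MASS`, and the choice to include the mass of the full peptide are assumptions made for illustration; the masses are the standard integer-rounded residue masses.

```python
# Integer-rounded residue masses (in daltons) for the amino acids in REDCA.
INTEGER_MASS = {"R": 156, "E": 129, "D": 115, "C": 103, "A": 71}


def ideal_spectrum(peptide, mass_table=INTEGER_MASS):
    """Return the sorted masses of all non-empty prefixes and suffixes.

    Assumption: the full peptide counts as both a prefix and a suffix,
    so its total mass appears once in the result.
    """
    masses = [mass_table[aa] for aa in peptide]
    total = sum(masses)
    spectrum = set()
    running = 0
    for m in masses:
        running += m
        spectrum.add(running)              # mass of the current prefix
        if running < total:
            spectrum.add(total - running)  # mass of the complementary suffix
    return sorted(spectrum)


spectrum = ideal_spectrum("REDCA")
edc_mass = sum(INTEGER_MASS[aa] for aa in "EDC")
print(spectrum)              # includes RED (400) and DCA (289)
print(edc_mass in spectrum)  # the internal subpeptide EDC is not represented
```

Each prefix and its complementary suffix come from breaking a single bond, which is why one pass over the peptide is enough to produce both.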
How do we infer the charges of annotated peaks in a spectrum?
For example, how did we annotate the tall peak y12++ as having charge +2 in the figure below (one of the annotations of DinosaurSpectrum)?
As described in the main text, mass spectrometers measure the mass-to-charge ratio rather than the mass of fragment ions. Thus, a peak in a spectrum with a given mass-to-charge ratio m/z corresponds to a different candidate mass for each possible value of its (unknown) charge z. If one of the resulting masses matches a mass in the theoretical spectrum, we may infer that the peak has charge z.
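The following sketch illustrates this kind of charge inference. It is not the procedure used to annotate DinosaurSpectrum; the function name `infer_charge`, the tolerance, the maximum charge, and the assumption that each unit of charge comes from one added proton (so that the neutral mass is z times m/z minus z proton masses) are all choices made for illustration.

```python
PROTON_MASS = 1.00728  # daltons; assumes each charge is carried by one proton


def infer_charge(mz, theoretical_masses, max_charge=3, tolerance=0.5):
    """Return (charge, matched_mass) pairs consistent with an observed m/z.

    For every candidate charge z, convert m/z into a candidate neutral mass
    and keep the charges whose mass lands within `tolerance` daltons of a
    mass in the theoretical spectrum.
    """
    matches = []
    for z in range(1, max_charge + 1):
        candidate_mass = z * (mz - PROTON_MASS)
        for mass in theoretical_masses:
            if abs(candidate_mass - mass) <= tolerance:
                matches.append((z, mass))
    return matches


# Hypothetical numbers: a peak at m/z 601.00728 and two theoretical masses.
print(infer_charge(601.00728, [574.0, 1200.0]))
```

If more than one charge yields a match, the peak is ambiguous and additional evidence is needed to choose among the candidates.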
How does the use of rounded (integer) masses of amino acids diminish our ability to accurately sequence peptides?
Why do the heights of some peaks in DinosaurSpectrum correlate so poorly with the corresponding amplitudes in the spectral vector?
DinosaurSpectrum is reproduced below (top), along with its spectral vector (bottom).
The transformation of a spectrum into a spectral vector is a complex process that takes into account many factors in addition to the heights of peaks. For details of this transformation, see Kim et al., 2008.
If we know the proteome, isn’t it always better to use peptide identification instead of peptide sequencing?
If a spectrum that we analyze originated from a peptide in a known proteome, then it makes sense to identify this peptide via database search. However, our knowledge of proteomes remains incomplete, even in the case of the well-studied human proteome. Biologists therefore sometimes use de novo peptide sequencing to discover peptides that do not appear in the currently known (still incomplete) proteome.
Why do we generate the decoy database by assuming that all amino acids have the same frequency (1/20), despite the fact that many proteomes have widely varying amino acid frequencies?
In practice, proteomics researchers do typically generate decoy databases by taking into account amino acid frequencies in the proteome under study. This is often achieved by randomly shuffling the amino acids in real proteins to generate a decoy database.
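A minimal sketch of the shuffling approach is shown below. The function name `shuffled_decoys` and the fixed random seed are assumptions for illustration; real pipelines may shuffle in other ways or use different decoy constructions.

```python
import random


def shuffled_decoys(proteins, seed=0):
    """Return one decoy per protein by randomly permuting its amino acids.

    Each decoy has exactly the same amino acid composition and length as
    the protein it was derived from; only the order of residues changes.
    """
    rng = random.Random(seed)  # seeded so that the decoys are reproducible
    decoys = []
    for protein in proteins:
        residues = list(protein)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys


# Hypothetical toy "proteome" of two short sequences.
proteome = ["REDCAREDCA", "GGGAAALLLKK"]
for original, decoy in zip(proteome, shuffled_decoys(proteome)):
    print(original, "->", decoy)
```

Because shuffling preserves the composition of every protein, the decoy database automatically reflects the amino acid frequencies of the proteome under study.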
What is the running time of the dynamic programming algorithm for computing the size of a spectral dictionary?