Decoding Life: Introducing the π‑PrimeNovo Breakthrough for Peptide Sequencing

Muhammad Abdul-Mageed
9 min read · Feb 24, 2025

In this post, I introduce our new Nature Communications paper “π-PrimeNovo: an accurate and efficient non-autoregressive deep learning model for de novo peptide sequencing” (Link). This exciting work was led by my brilliant student Xiang Zhang, in collaboration with a group of colleagues including Prof. Laks V. S. Lakshmanan of UBC. For background on proteins, peptides, and peptide sequencing, refer to my earlier article. (Note: This is still a draft article.)

Protein Identification

Proteomics is the large-scale study of proteins, which are the primary functional molecules in living organisms. This field focuses on identifying, characterizing, and understanding the structure, function, and interactions of all proteins (the proteome) in a cell, tissue, or organism. By examining how proteins change in different conditions or in response to various stimuli, proteomics helps reveal the complex biological processes that drive health and disease.

Protein identification lies at the heart of proteomics, and shotgun proteomics via mass spectrometry has emerged as the primary technique in this field. In this process, proteins are first broken down into smaller peptides through enzymatic digestion, and then these peptides are analyzed using tandem mass spectrometry (MS/MS). The resulting spectra serve as fingerprints, revealing both the sequences and structures of the peptides.

Traditionally, researchers have relied on database searching to decode these spectra. However, because these methods depend heavily on the availability of comprehensive sequence databases, they face limitations in areas where established databases may not exist.

Tandem mass spectrometry (MS/MS) is an analytical technique that uses two stages of mass analysis to provide detailed structural information about molecules. In the first stage, ions generated from a sample are separated by their mass-to-charge ratios. Selected ions are then fragmented in a collision cell, and the resulting fragments are analyzed in the second stage. This two-step process allows researchers to decipher the sequence and structure of complex molecules — such as peptides in proteomics — by providing a “fingerprint” of their constituent parts.

Two Decades of Innovation, Yet Peptide Recall Still Eludes Us

Over the past 20 years, many tools have emerged to decode protein sequences directly from mass spectrometry data — a process known as de novo peptide sequencing. These methods work by reading the mass differences between broken fragments of peptides to figure out their amino acid makeup and any modifications. Early tools, like PepNovo and PEAKS, used mathematical approaches such as graph theory and dynamic programming. Later, deep learning models like DeepNovo, which combined CNNs and LSTMs, and PointNovo, which improved precision with order-invariant techniques, pushed the field forward. More recently, Transformer-based models such as Casanovo and its upgraded version, Casanovo V2, have reframed sequencing as a translation task, training on massive datasets to boost performance. Innovations like PepNet and GraphNovo have further improved speed and addressed challenges like missing data. Despite all these advancements, current deep learning methods still struggle with peptide recall rates, often correctly identifying only 30–50% of peptides in standard tests.
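To make the idea of "reading the mass differences between fragments" concrete, here is a minimal Python sketch. It is a toy illustration, not how any of the tools above are implemented: it assumes a perfectly clean, complete ladder of singly charged b-ions (including a notional b0 peak at the proton mass, which is not observed in practice), and the function names `ideal_b_ladder` and `read_ladder` are hypothetical. The residue masses are the standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276


def ideal_b_ladder(peptide):
    """Idealized singly charged b-ion m/z values b0..bn (b0 is notional)."""
    ladder = [PROTON]
    for aa in peptide:
        ladder.append(ladder[-1] + RESIDUE_MASS[aa])
    return ladder


def read_ladder(peaks, tol=0.01):
    """Map each gap between consecutive peaks to the residue(s) it could be."""
    calls = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        hits = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol)
        # Several residues can share a mass (I and L); no hit means a missing peak.
        calls.append("/".join(hits) if hits else "?")
    return calls


ladder = ideal_b_ladder("PEPTIDE")
calls = read_ladder(ladder)
```

On this idealized input, isoleucine and leucine cannot be told apart because they have identical masses, and dropping a single peak from the ladder leaves a gap that no single residue explains. Real spectra are noisy and incomplete, which is part of what makes de novo sequencing hard.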

Despite advancements in the field, current approaches struggle with peptide recall rates, capped at 30–50%.

Pitfalls of Autoregressive Approaches

Currently, most deep learning methods for de novo peptide sequencing are autoregressive. That is, these methods generate one amino acid at a time in a single, fixed order, so each prediction relies heavily on the previous ones. But in peptide sequencing, every amino acid is connected to its neighbors both before and after it. As a result, if an early prediction is wrong, that mistake can snowball and affect the entire sequence. Moreover, techniques like beam search, which keep only the most likely partial sequences at each step, cannot go back and correct earlier errors or adjust the overall mass of the sequence. In short, these autoregressive models are limited because they build the sequence in a one-way, step-by-step manner, making error correction and mass control challenging.
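The snowball effect can be seen in a deliberately tiny Python example. The probability tables below are made up for illustration and do not come from any real model; the decoder is plain greedy search (a beam of width one), which commits to the first token and never revisits it.

```python
# Hypothetical toy model over two-residue peptides (numbers are invented).
FIRST = {"A": 0.6, "G": 0.4}            # P(first residue)
NEXT = {                                # P(second residue | first residue)
    "A": {"K": 0.55, "R": 0.45},
    "G": {"K": 0.90, "R": 0.10},
}


def greedy_decode():
    """Pick the locally best token at each step, left to right."""
    first = max(FIRST, key=FIRST.get)
    second = max(NEXT[first], key=NEXT[first].get)
    return first + second, FIRST[first] * NEXT[first][second]


def exhaustive_decode():
    """Score every complete sequence and return the globally best one."""
    scored = {a + b: FIRST[a] * NEXT[a][b] for a in FIRST for b in NEXT[a]}
    best = max(scored, key=scored.get)
    return best, scored[best]


greedy_seq, greedy_score = greedy_decode()
best_seq, best_score = exhaustive_decode()
```

Here the greedy decoder locks in "A" because it looks best in isolation, and ends up with a lower-scoring peptide than the one an exhaustive search finds. Wider beams reduce this risk but do not remove it, since discarded prefixes are never reconsidered.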

π‑PrimeNovo, a Breakthrough in Peptide Sequencing

An illustration of how PrimeNovo works. The model uses MS/MS spectral data to predict peptide sequences with two main components. First, a non-autoregressive Transformer backbone predicts all amino acids at once using a method called CTC loss. Second, a precise mass control unit fine-tunes the prediction to ensure the overall mass of the peptide is correct.

In our new work, we introduce π‑PrimeNovo — a breakthrough model that moves away from the traditional one-at-a-time sequence generation. Instead of relying on previous predictions, π‑PrimeNovo predicts the entire peptide sequence simultaneously, giving it a full view of the amino acids both before and after each position. This bidirectional approach is a major shift from conventional methods. Additionally, π‑PrimeNovo includes a precise mass control unit that uses precursor mass information to fine-tune the sequence, ensuring greater accuracy. Together, these innovations lead to significantly improved performance in peptide sequencing.

Aligning Complex Spectral Data to Amino Acid Sequences: Connectionist Temporal Classification

In addition to its novel architecture, our approach leverages two advanced techniques to further enhance accuracy and reliability. First, we integrate Connectionist Temporal Classification (CTC) into the model. Traditional peptide sequencing methods struggle with aligning complex spectral data to amino acid sequences because the exact correspondence between input features and the sequence is often unclear. CTC overcomes this by allowing the model to predict the entire peptide sequence without requiring a strict one-to-one alignment. It does so by summing over all possible alignments, effectively handling variations such as repeated or missing tokens. This flexibility enables the model to robustly decode peptide sequences from noisy, complex MS/MS data.

Connectionist Temporal Classification (CTC) is a loss function and decoding framework designed for tasks where the exact alignment between the input data and the target sequence is unknown. In our peptide sequencing method, CTC plays a crucial role by allowing the model to predict the entire peptide sequence from the mass spectrometry data without needing a one-to-one alignment between specific parts of the spectrum and individual amino acids.

Flexible Alignment. In many sequence-to-sequence problems, such as speech recognition, the exact timing of each output element relative to the input isn’t known. CTC addresses this by considering all possible alignments between the input (the mass spectrum) and the output (the peptide sequence). Instead of forcing the model to decide which specific spectrum feature corresponds to a particular amino acid, CTC sums over all potential alignments that could result in the correct sequence.

Blank Token and Repetition Handling. CTC introduces a special “blank” token that helps in handling cases where multiple input elements correspond to a single output token or where no output should be produced at a particular step. This mechanism allows the model to output sequences with repeated or skipped tokens and then collapse them into the final sequence.

Efficient Learning. By using CTC loss during training, the model learns to output probability distributions over all possible tokens at each position without needing a precise frame-by-frame correspondence. The training process then maximizes the likelihood of the correct sequence over all these alignments. This is particularly useful for peptide sequencing, where the relationship between spectral peaks and amino acids can be complex and non-linear.
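For readers who prefer code, the sketch below illustrates the two CTC ideas described here in plain Python: collapsing a raw path (merge repeats, then drop blanks) and summing the probability of every path that collapses to a target. It is a generic textbook-style illustration, not the π‑PrimeNovo implementation; the function names and the toy probabilities are assumptions made for the example.

```python
from itertools import product

BLANK = "-"


def ctc_collapse(path):
    """Merge consecutive repeats, then drop blanks: 'AA-T-TT' -> 'ATT'."""
    out, prev = [], None
    for tok in path:
        if tok != BLANK and tok != prev:
            out.append(tok)
        prev = tok
    return "".join(out)


def ctc_prob_brute_force(probs, target):
    """Sum the probability of every path that collapses to target."""
    total = 0.0
    for path in product(list(probs[0]), repeat=len(probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for step, tok in zip(probs, path):
                p *= step[tok]
            total += p
    return total


def ctc_prob_forward(probs, target):
    """Same quantity via the standard CTC forward recursion."""
    ext = [BLANK]                      # target interleaved with blanks
    for tok in target:
        ext += [tok, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for step in probs[1:]:
        new = [0.0] * size
        for s in range(size):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping the blank is allowed only between different tokens.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * step[ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Toy per-position distributions over {blank, A, T} for three output slots.
TOY_PROBS = [
    {"-": 0.1, "A": 0.7, "T": 0.2},
    {"-": 0.3, "A": 0.3, "T": 0.4},
    {"-": 0.2, "A": 0.1, "T": 0.7},
]
```

With these toy numbers, five paths collapse to "AT" (AAT, ATT, A-T, -AT, AT-), and both functions return their combined probability of 0.567. The brute-force version grows exponentially with sequence length, which is why the dynamic programming recursion is used in practice.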

CTC details visualized (from our supplementary materials).

Precision Mass Matching

Complementing CTC, our method also incorporates a Precise Mass Control (PMC) unit exploiting a dynamic programming algorithm. While CTC efficiently decodes the sequence, PMC ensures that the predicted peptide’s total mass exactly matches the measured precursor mass — a critical factor in validating peptide identification. By using the precursor mass as a constraint, PMC fine-tunes the sequence predictions, correcting any deviations and ensuring that the final output is chemically accurate. Together, CTC and PMC empower π‑PrimeNovo to deliver fast, accurate, and reliable peptide sequencing, marking a significant step forward from conventional autoregressive approaches.
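As a small illustration of what "matching the precursor mass" means in practice, the Python sketch below compares a candidate peptide's theoretical mass with the neutral mass implied by a measured precursor m/z and charge. This is a generic check, not the PMC unit itself. The 20 ppm tolerance, the function names, and the example m/z value are assumptions for illustration, and the mass table covers only the residues used in the example.

```python
# Monoisotopic residue masses (Da) for the residues in the example only.
RESIDUE_MASS = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def precursor_mass(mz, charge):
    """Neutral mass implied by a precursor m/z and charge state."""
    return (mz - PROTON) * charge


def mass_error_ppm(seq, mz, charge):
    observed = precursor_mass(mz, charge)
    return (peptide_mass(seq) - observed) / observed * 1e6


def matches_precursor(seq, mz, charge, tol_ppm=20.0):
    """True if the candidate's mass is within tol_ppm of the precursor."""
    return abs(mass_error_ppm(seq, mz, charge)) <= tol_ppm


# Hypothetical measurement: doubly charged precursor at m/z 400.687249.
ok = matches_precursor("PEPTIDE", 400.687249, 2)
bad = matches_precursor("PEPTIDD", 400.687249, 2)
```

A candidate that differs by a single residue (here E replaced by D, about 14 Da lighter) fails the check by a wide margin, which is why the precursor mass is such a useful constraint during decoding.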

Knapsack-like dynamic programming decoding algorithm for precise mass control. Imagine you’re trying to pack a bag with items of different weights so that the total weight matches a specific target exactly; this is the classic “knapsack problem.” In peptide sequencing, our goal is similar: we want to select amino acids (each with its own mass) so that the sum of their masses exactly matches the measured mass of the peptide. The “knapsack-like dynamic programming decoding algorithm” does just that. Here are the details:

(1) Breaking Down the Problem. The algorithm views each amino acid as an “item” with a specific mass. The challenge is to find the right combination of these items that adds up to the target mass (the precursor mass measured by the instrument).

(2) Dynamic Programming Approach. Instead of trying every possible combination (which would be incredibly slow), the algorithm breaks the problem into smaller, manageable parts. It builds up a table of possible mass sums using different combinations of amino acids. This process is efficient because it reuses calculations from earlier steps, much like solving a puzzle piece by piece.

(3) Precise Mass Control. As the model predicts the sequence, this decoding algorithm checks and adjusts the choices so that the total mass of the selected amino acids is as close as possible to the target mass. This ensures that the final peptide sequence is not only plausible but also matches the physical reality measured in the lab.
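The three steps can be sketched in a few lines of Python. This is a heavily simplified stand-in, not the algorithm from the paper: it uses rounded integer masses for a handful of residues, requires an exact mass match rather than a tolerance window, treats each output position as emitting either a blank or one residue (ignoring CTC repeat merging), and runs on the CPU. The names `MASS`, `mass_constrained_decode`, and the toy probabilities are all hypothetical.

```python
import math

# Toy integer (nominal) masses; "-" is a blank that adds no mass.
MASS = {"-": 0, "G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def mass_constrained_decode(log_probs, target_mass):
    """Best-scoring output whose total mass equals target_mass, or None.

    best maps a running mass to (log score, sequence) for the best partial
    output reaching that mass, so each step reuses earlier results.
    """
    best = {0: (0.0, "")}
    for step in log_probs:
        nxt = {}
        for mass, (score, seq) in best.items():
            for tok, lp in step.items():
                new_mass = mass + MASS[tok]
                if new_mass > target_mass:
                    continue  # overshoots the knapsack capacity
                cand = (score + lp, seq + ("" if tok == "-" else tok))
                if new_mass not in nxt or cand[0] > nxt[new_mass][0]:
                    nxt[new_mass] = cand
        best = nxt
    return best.get(target_mass)


def greedy_decode(log_probs):
    """Unconstrained baseline: take the most likely token at each position."""
    return "".join(max(s, key=s.get) for s in log_probs).replace("-", "")


# Invented per-position probabilities for three output slots.
TOY = [
    {"G": 0.8, "A": 0.1, "-": 0.1},
    {"A": 0.7, "G": 0.2, "-": 0.1},
    {"V": 0.6, "S": 0.4},
]
TOY_LOG = [{t: math.log(p) for t, p in step.items()} for step in TOY]
```

In this toy case the unconstrained decoder outputs GAV (mass 227), but if the measured mass is 215 the constrained decoder returns GAS instead, the most probable sequence that actually fits the target. An unreachable target mass yields no sequence at all.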

PrimeNovo in Numbers

PrimeNovo delivers remarkable results in peptide sequencing, achieving an average peptide recall of 64% on the standard nine-species benchmark and significantly outperforming the previous best of 54%. Across various MS/MS datasets, it not only consistently outperforms other models, sometimes doubling their accuracy, but also operates much faster. By predicting the entire sequence simultaneously instead of one amino acid at a time, and using dynamic programming with CUDA acceleration, PrimeNovo runs up to 89 times faster than autoregressive models. This speed enables it to analyze large-scale spectrum data in days rather than months, making it especially valuable in metaproteomic studies. Additionally, its capability to accurately identify post-translational modifications (PTMs) underscores its potential as a transformative tool in proteomics research.
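Since peptide recall is the headline metric here, a short Python sketch may help clarify what it measures: the fraction of spectra for which the whole predicted peptide is correct. This is a simplified version that assumes exact string comparison; benchmark implementations typically compare residues by mass, so details such as the treatment of isoleucine versus leucine may differ. The function name and example data are hypothetical.

```python
def peptide_recall(predictions, references):
    """Fraction of spectra whose predicted peptide equals the reference.

    A prediction of None (no output for that spectrum) counts as wrong.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    if not references:
        raise ValueError("no spectra to evaluate")
    correct = sum(
        pred is not None and pred == ref
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Invented example: two of four spectra are fully correct.
preds = ["PEPTIDE", "GASP", None, "AAK"]
refs = ["PEPTIDE", "GASV", "KR", "AAK"]
recall = peptide_recall(preds, refs)
```

Note how strict the metric is: a prediction with a single wrong residue (GASP versus GASV) earns no credit, which helps explain why peptide-level recall has been hard to push up.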

Average performance of PrimeNovo alongside four other top-performing models on a widely utilized nine-species benchmark dataset (93,750 tested spectrum samples across 9 species).
One comparison between Casanovo V2 (previous SoTA model) and PrimeNovo (p. 9 from our paper).

Versatility Across Applications

PrimeNovo’s versatility is evident in its robust performance across a diverse range of proteomic applications. The model not only excels on the standard nine-species benchmark, where it achieves an impressive 64% average peptide recall, but it also generalizes effectively to other MS/MS datasets that vary in quality and complexity. For example, in zero-shot evaluations on datasets such as PT, IgG1-Human-HC, and HCC, PrimeNovo consistently outperforms leading autoregressive models, with improvements in peptide recall ranging from 13% to over 40%. This adaptability is further highlighted by its ability to maintain high accuracy even when fine-tuned with additional, domain-specific data, demonstrating a robust capacity for transfer learning across different sample types and experimental conditions.

Beyond conventional proteomic analysis, PrimeNovo proves particularly valuable in metaproteomic research. Its rapid inference speed and precise mass control enable it to identify significantly more species-specific peptides compared to earlier methods, reducing processing times from months to days. Moreover, the model’s advanced handling of post-translational modifications (PTMs) further broadens its applicability, allowing for detailed taxon-resolved peptide annotation and functional insights that are crucial for understanding complex biological systems. These capabilities underscore PrimeNovo’s potential as a transformative tool in both routine and specialized proteomics studies.

PrimeNovo’s capabilities extend to downstream tasks and offer valuable insights for various biological investigations.

Conclusion

π‑PrimeNovo represents a breakthrough in de novo peptide sequencing. By combining a non-autoregressive Transformer framework with a dedicated mass control module and leveraging CTC loss for effective training, this model not only improves prediction accuracy but also drastically reduces inference times. For academics and researchers focused on proteomics, we hope π‑PrimeNovo opens new avenues for analyzing complex proteomic data and addressing biological questions with unprecedented precision.

For those interested in the technical details, the full paper provides an in-depth analysis of the model architecture, training strategies, and comprehensive benchmarking across diverse datasets. This work not only sets a new standard in peptide sequencing but also highlights the transformative potential of AI in biological research. Very exciting!

Written by Muhammad Abdul-Mageed

Canada Research Chair in Natural Language Processing and Machine Learning, The University of British Columbia; Director of UBC Deep Learning & NLP Group
