Abnormal post-translational modifications (PTMs) of proteins are often a cause or consequence of many neurodegenerative disorders and diseases such as cancer. PTMs are the changes in the protein's chain of amino acids that extend its range of functions. Identification of PTMs is very challenging, given their complexity and diversity – there are more than 300 PTMs that are known to occur physiologically in humans. They also generate complex fragmentation patterns in tandem mass spectrometry (MS/MS), which further complicates identification and subsequent data analysis.
To address these challenges, Shi-Jian Ding and colleagues from the University of Nebraska Medical Center in Omaha, US, have come up with a new strategy for data analysis they are calling “iterative search for identifying PTMs” (ISPTM). Ding (now affiliated with Sanford-Burnham Medical Research Institute in Lake Nona, Florida, US) is a proteomics researcher. Proteomics is the large-scale study of proteins and their functions. In particular, Ding develops methods to address biological and biomedical problems through the identification and quantification of proteins and their PTMs in diverse biological systems.
Other search engines (X!Tandem, Mascot, and SEQUEST) have applied iterative search, but with a double-pass strategy. The ISPTM approach differs from those by refining the MS/MS spectra instead of refining the database. For a recent study published in the Journal of Proteome Research, Ding and his colleagues compared ISPTM with a multi-blind search using a tool called MODa, which can perform fast and unrestrictive searches against large-scale databases of the human proteome. They performed the ISPTM and MODa searches using the computing resources of the University of Nebraska-Lincoln Holland Computing Center (HCC), which is a member of the Open Science Grid (OSG).
Computational requirements are a significant concern for large-scale PTM identification in complex proteome data. In the study, ISPTM analysis of the nuclear matrix (NM) datasets took 4,535 cumulative CPU hours, while MODa analysis of the same datasets took 834 cumulative CPU hours. Because the work can be spread across many cores/CPUs in parallel, an ISPTM search of complex proteome datasets could be completed in a few hours, depending on how many are available. In this study, the research team used a Linux cluster with 1,151 nodes because they were analyzing up to 207 modifications.
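To make the "few hours" claim concrete, here is a back-of-the-envelope sketch in Python. It assumes the workload splits evenly across cores with an optional efficiency factor. The function name, the core counts, and the efficiency parameter are illustrative assumptions, not figures from the study. Only the two CPU-hour totals come from the article.

```python
def wall_clock_hours(cpu_hours: float, cores: int, efficiency: float = 1.0) -> float:
    """Estimate elapsed hours for a workload that parallelizes across cores.

    efficiency is a hypothetical fudge factor (1.0 = perfect scaling).
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return cpu_hours / (cores * efficiency)


# Cumulative CPU hours reported in the article for the NM datasets.
ISPTM_CPU_HOURS = 4535
MODA_CPU_HOURS = 834

# Hypothetical core counts, chosen only to illustrate the scaling.
for cores in (100, 500, 1000):
    isptm = wall_clock_hours(ISPTM_CPU_HOURS, cores)
    moda = wall_clock_hours(MODA_CPU_HOURS, cores)
    print(f"{cores:>5} cores: ISPTM ~{isptm:.1f} h, MODa ~{moda:.1f} h")
```

Under perfect scaling, 4,535 CPU hours spread over 1,000 cores comes to roughly four and a half hours of elapsed time. Real runs would take longer because of scheduling, queue waits, and uneven job sizes.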
The test results for 13 modifications show that the ISPTM approach is superior to the all-in-one search. Resources like the OSG provide the necessary computing power. A user interface for implementing ISPTM is currently under development.
As far as Ding can tell, he and his colleagues are the first to simultaneously and accurately identify the localization of multiple modifications from complex proteome samples. Using the Open Mass Spectrometry Search Algorithm (OMSSA), an open source search engine for analyzing and identifying peptides, they are also able to evaluate the confidence level of each modification.
Thanks to HCC integration with the OSG, the team has refined OMSSA so they can now search as many modifications as needed – but this scale wouldn’t be possible without the resources of the OSG. Searching for modifications globally poses a significant computational problem. Much like gene sequencing, mass spectrometry instrumentation generates big data, and analyzing that data effectively is a significant challenge in its own right.
“The Open Science Grid allows us to ask what the global protein PTM changes are,” says Ding. “Previously, we couldn't ask certain questions because of limited computing power. Now, with essentially unlimited computing power, we can perform this kind of research.”
The researchers use Python scripts for refining and filtering tasks, generating the commands for OMSSA searches, and collecting the results. The scripts automate each step, minimizing the need for intervention. As a result, the ISPTM analyses of the synthetic peptides and the NM data finished in less than 48 hours using the computing resources available at HCC and OSG.
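The automation described above might look roughly like the following Python sketch. It is one plausible reading of an iterative search that refines the set of spectra between passes, not the team's actual scripts. The names `iterative_search` and `toy_search`, the spectrum IDs, and the peptide strings are all hypothetical. In a real pipeline the `search` callable would wrap a call to the search engine and a confidence filter.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: given spectrum IDs and one modification name,
# return {spectrum_id: peptide} for spectra that passed the confidence filter.
SearchFn = Callable[[List[str], str], Dict[str, str]]


def iterative_search(
    spectra: List[str], modifications: List[str], search: SearchFn
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Search one modification per pass, dropping identified spectra each time."""
    remaining = list(spectra)
    identified: Dict[str, Tuple[str, str]] = {}
    for mod in modifications:
        if not remaining:
            break  # nothing left to search
        hits = search(remaining, mod)
        for spectrum_id, peptide in hits.items():
            identified[spectrum_id] = (mod, peptide)
        # Refine the spectra: only unidentified ones go to the next pass.
        remaining = [s for s in remaining if s not in hits]
    return identified, remaining


def toy_search(spectra: List[str], mod: str) -> Dict[str, str]:
    """Stand-in for a real search engine call; the data is made up."""
    table = {
        "phospho": {"s1": "PEPS[phospho]IDE"},
        "acetyl": {"s1": "K[acetyl]PEPTIDE", "s3": "PEPTK[acetyl]IDE"},
    }
    return {s: p for s, p in table.get(mod, {}).items() if s in spectra}


found, unmatched = iterative_search(["s1", "s2", "s3"], ["phospho", "acetyl"], toy_search)
print(found)
print(unmatched)
```

Because each pass only sees spectra that earlier passes left unidentified, the search space per pass stays small, and the passes for different datasets can run as independent jobs on a cluster or grid.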
This approach shows that identifying peptides with various (either chemical or biological) modifications in a sample can increase the spectral identification rate and the chances of identifying key protein regulators and their possible PTMs. Proteins play important roles in how cells behave, both normally and abnormally. With in-depth studies of proteins, researchers like Ding can find biomarkers for early diagnosis and prognosis of disease – and further their ability to predict whether a patient will respond to a given drug treatment. This work has tremendous implications for discovering new drug targets, overcoming drug resistance in existing drug targets, and developing better therapeutic approaches.