  • Research article
  • Open access

Improved residue contact prediction using support vector machines and a large feature set

Abstract

Background

Predicting protein residue-residue contacts is an important 2D prediction task. It is useful for ab initio structure prediction and for understanding protein folding. In spite of steady progress over the past decade, contact prediction remains largely unsolved.

Results

Here we develop a new contact map predictor (SVMcon) that uses support vector machines to predict medium- and long-range contacts. SVMcon integrates profiles, secondary structure, relative solvent accessibility, contact potentials, and other useful features. On the same test data set, SVMcon's accuracy is 4% higher than the latest version of the CMAPpro contact map predictor. SVMcon recently participated in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7) experiment and was evaluated along with seven other contact map predictors. SVMcon was ranked as one of the top predictors, yielding the second best coverage and accuracy for contacts with sequence separation >= 12 on 13 de novo domains.

Conclusion

We describe SVMcon, a new contact map predictor that uses SVMs and a large set of informative features. SVMcon yields good performance on medium- to long-range contact predictions and can be modularly incorporated into a structure prediction pipeline.

Background

Predicting protein inter-residue contacts is an important 2D structure prediction problem [1]. Contact prediction can help improve analogous fold recognition [2, 3] and ab initio 3D structure prediction [4]. Several algorithms for reconstructing 3D structure from contacts have been developed in both the structure prediction and determination (NMR) literature [5–8]. Contact map prediction is also useful for inferring protein folding rates and pathways [9, 10].

Due to its importance, contact prediction has received considerable attention over the last decade. For instance, contact prediction methods have been evaluated in the fifth, sixth, and seventh editions of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment [11–15]. A number of different methods for predicting contacts have been developed. These methods can be classified roughly into two non-exclusive categories: (1) statistical correlated mutations approaches [16–22]; and (2) machine learning approaches [23–34]. The former use correlated mutations of residues to predict contacts. The latter use machine learning methods such as neural networks, self-organizing maps, hidden Markov models, and support vector machines to predict 2D contacts from the primary sequence, as well as from other 1D features such as relative solvent accessibility and secondary structure.

In spite of steady progress, contact map prediction remains a largely unsolved challenge. Here we describe a method that uses support vector machines together with a large set of informative features to improve contact map prediction. On the same data set, SVMcon outperforms the latest version of the CMAPpro contact map predictor [28, 35] and is ranked as one of the top predictors in the blind and independent CASP7 experiment.

Results and Discussion

We first compare SVMcon with the latest version of CMAPpro on the same benchmark dataset. Then we describe the performance of SVMcon along with several other predictors during the CASP7 experiment.

Comparison with CMAPpro on the same Benchmark

SVMcon is trained to predict medium- to long-range contacts (sequence separation >= 6) as in [36], which are not captured by local secondary structure. We train SVMcon on the same dataset used to train CMAPpro [28, 35] and test both programs on the same test dataset. The training dataset contains 485 proteins and the test dataset contains 48 proteins. The sequence identity between the training and testing datasets is below 25%, a common threshold for ab initio prediction [36].

We use sensitivity and specificity to evaluate the performance of SVMcon and CMAPpro. Sensitivity is the percentage of native contacts that are predicted to be contacts. Specificity is the percentage of predicted contacts that are present in the native structure. The contact threshold is set at 8 Å between Cα atoms. The sensitivity and specificity of a predictor also depend on the threshold used to separate 'contact' from 'non-contact' predictions. To compare SVMcon and CMAPpro fairly, we choose to evaluate them at their break-even point, where sensitivity is equal to specificity as in [37]. At the break-even point, the sensitivity and specificity of SVMcon are both 27.1%, 4% higher than those of CMAPpro. Thus on the same benchmark dataset, SVMcon yields a sizable improvement.
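The break-even evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `break_even` and the toy inputs are assumptions. With the definitions used here (sensitivity = correct / native contacts, specificity = correct / predicted contacts), the two measures coincide when the number of predicted contacts equals the number of native contacts, so the sketch simply keeps that many top-scoring pairs. Ties in the scores are ignored.

```python
def break_even(scores, labels):
    """Evaluate a ranked predictor at its break-even point.

    scores: predicted contact scores, one per residue pair (higher = more likely)
    labels: 1 for a native contact, 0 for a non-contact
    Returns (score_threshold, sensitivity, specificity).
    """
    n_native = sum(labels)
    if n_native == 0:
        raise ValueError("no native contacts to evaluate against")
    # Rank residue pairs from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    # Predict exactly as many contacts as there are native contacts.
    top = ranked[:n_native]
    true_positives = sum(label for _, label in top)
    sensitivity = true_positives / n_native   # fraction of native contacts recovered
    specificity = true_positives / len(top)   # fraction of predictions that are native
    return top[-1][0], sensitivity, specificity
```

For example, `break_even([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 0, 1, 1, 0, 0])` keeps the three highest-scoring pairs, two of which are native contacts.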

We also compare the accuracy of SVMcon with a random uniform baseline algorithm that uses independent random coin flips to decide whether each residue pair is in contact or not. Since medium- and long-range contacts account for 2.8% of the total number of residue pairs with linear separation >= 6, the probability for the coin to produce a contact is set to 2.8%. As a result, the sensitivity and specificity of the random baseline algorithm are both 2.8% at the break-even point. Thus SVMcon yields a more than nine-fold improvement over the random baseline (27.1% versus 2.8%).
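To see why the coin-flip baseline has sensitivity and specificity both equal to the contact frequency, the baseline can be simulated. The sketch below is an illustration under assumptions (the function name, the seed, and the synthetic label vector are not from the paper): each pair is called a contact with probability `p_contact`, independently of its true label.

```python
import random

def random_baseline(labels, p_contact, seed=0):
    """Simulate independent coin flips and return (sensitivity, specificity)."""
    rng = random.Random(seed)
    preds = [1 if rng.random() < p_contact else 0 for _ in labels]
    true_positives = sum(1 for p, y in zip(preds, labels) if p and y)
    n_pred = sum(preds)
    n_native = sum(labels)
    # Sensitivity: share of native contacts that the coin happened to call.
    sensitivity = true_positives / n_native if n_native else 0.0
    # Specificity (as defined in the text): share of called contacts that are native.
    specificity = true_positives / n_pred if n_pred else 0.0
    return sensitivity, specificity
```

Because the coin ignores the labels, the expected sensitivity equals `p_contact` and the expected specificity equals the fraction of native contacts; setting both to 2.8% puts the baseline at its break-even point.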

Since the contact prediction accuracy varies significantly with individual proteins and their structure classes [29], we calculate the contact prediction specificity (also called accuracy) and sensitivity (also called coverage) for each test protein (Table 1). For each protein, we select up to L (the protein length) predicted contacts ranked by SVM score, because the total number of true contacts is approximately linear in the protein length [24]. The results show that in many cases (e.g. 1QJPA, 1DZOA, 1MAIA, 1LSRA, 1F4PA, 1MSCA, 1IG5A, 1ELRA, 1J75A), the prediction accuracy and coverage are > 30%.
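The per-protein evaluation can be sketched as below. This is an assumed, simplified implementation rather than the authors' code: the function name, the `(i, j, score)` tuple layout, and the `n_select` parameter (L here, or another cutoff such as 2L or L // 5) are illustrative choices.

```python
def top_ranked_accuracy_coverage(predictions, native_contacts, n_select, min_separation=6):
    """Accuracy and coverage of the n_select highest-scoring predicted contacts.

    predictions: iterable of (i, j, score) for residue pairs
    native_contacts: iterable of (i, j) pairs in contact in the native structure
    """
    # Keep only pairs at or beyond the required sequence separation.
    eligible = [(i, j, s) for i, j, s in predictions if abs(i - j) >= min_separation]
    eligible.sort(key=lambda t: t[2], reverse=True)
    selected = eligible[:n_select]
    # Store native pairs with the smaller index first so lookups are order-independent.
    native = {(min(i, j), max(i, j)) for i, j in native_contacts
              if abs(i - j) >= min_separation}
    correct = sum(1 for i, j, _ in selected if (min(i, j), max(i, j)) in native)
    accuracy = correct / len(selected) if selected else 0.0
    coverage = correct / len(native) if native else 0.0
    return accuracy, coverage
```

Passing `min_separation=12` or `24` gives the other separation thresholds reported in Table 1.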

However, for some proteins such as 1SKNP, the prediction accuracy is low. We observe that the contact prediction accuracy is related to the quality of the multiple sequence alignment, the prediction accuracy of secondary structure, and the proportion of beta-sheets. Consistent with previous research [29, 37], the contacts within beta-sheets in beta, alpha+beta, and alpha/beta proteins are predicted with higher accuracy than the contacts between an alpha helix and a beta strand or between alpha helices. We think that the strong restraints between beta-strands, such as hydrogen bonding, probably contribute to the improved accuracy.

Table 1 Detailed Contact Prediction Results on 48 Test Proteins for Sequence Separation >= 6, 12, and 24 respectively.

Figures 1 and 2 show the native 3D structure and the predicted contact map of a good example (protein 1DZOA), respectively. In this case, 2L (240) predicted contacts with sequence separation >= 6 are selected. The contact prediction accuracy and coverage are 39.2% and 42.5%, respectively. The predicted contact clusters (Figure 2) recall most native beta-sheet pairing patterns of the protein (Figure 1). Interestingly, most false positive contacts are also clustered around the true contacts. Thus, this noise may not be very harmful when reconstructing the protein structure from the contacts.

Figure 1

3D Structure of Protein 1DZOA. Protein 1DZOA is an alpha+beta protein. It consists of two alpha helices and two beta sheets. Beta strands 1 and 2 form a parallel beta sheet. Beta strands 3, 4, 5, 6 form an anti-parallel beta sheet. Most non-local contacts involve the pairing interactions between beta strands and the packing interactions between helices and beta sheets. (Figure rendered using Molscript [63]).

Figure 2

Predicted and True Contact Maps of 1DZOA. The upper triangle shows the true contacts of protein 1DZOA. The lower triangle shows the predicted contacts of protein 1DZOA. The 2L (240) top-ranked contacts are selected. The key contacts within anti-parallel strand pairs (3,4), (4,5), and (5,6) are recalled. A few contacts within the parallel strand pair (1,2) are also predicted correctly. However, very long-range contacts between alpha helices and beta sheets are not predicted, and there are some false positives. Interestingly, most false positives are close to the true contacts. Thus, they may not be very harmful when used as distance restraints to reconstruct the protein 3D structure.

Furthermore, to investigate the relationship between the SVM contact map predictions and the structure classes, we compute the average accuracy and coverage of contact predictions in the six SCOP [38] structure classes (Table 2). The contact prediction accuracy of proteins having beta-sheets (alpha+beta, alpha/beta, beta) is higher than that of alpha helical proteins, which is consistent with previous observations [29]. According to Table 2, the average coverage is about 20% and the accuracy ranges from 21 to 37%. This level of accuracy is probably good enough (or at least helpful) for constructing an ab initio low-resolution structure, since previous experiments show that only L/5 or even fewer residue contacts are required to reconstruct a low-resolution structure for a small protein [5, 8, 39–42], taking into account the inherent physical restraints of protein structures. However, the challenge is to develop algorithms to reconstruct a protein structure from a noisy predicted contact map, where contact restraints are much less reliable than the experimental contacts determined by NMR techniques.

Table 2 Contact Prediction Results of Proteins in the Six SCOP Structure Classes.

Comparison with seven other Predictors during CASP7

SVMcon participated in the CASP7 experiment in 2006 and was evaluated along with seven other contact map predictors. The CASP7 evaluation procedure focuses on inter-residue contact predictions with linear sequence separation >= 12 and >= 24 respectively [15]. Up to L/5 of the top predicted contacts were assessed, where L is the length of the target protein. These evaluation metrics are also similar to those used in the past editions of the Critical Assessment of Fully Automated Structure Prediction Methods [43–45] and in the EVA contact evaluation server [46]. We use a similar procedure to compute accuracy (specificity) and coverage (sensitivity) for all server predictors.

We compare SVMcon with the other contact map predictors on the 13 out of 15 CASP7 de novo domains whose structures have been released. The contact map predictors participating in CASP7 include SVMcon, BETApro [37], SAM-T06 [47], PROFcon [32], GajdaPairings, Distill [34, 48], Possum [19], and GPCPRED [29]. Their contact predictions were downloaded from the CASP7 website.

Table 3 reports the performance of the eight automated contact map predictors in the CASP7 experiment. The accuracy and coverage of SVMcon at a sequence separation threshold of 12 are 27.7% and 4.7% respectively, corresponding to the second best ranking behind our other predictor BETApro. The accuracy and coverage of SVMcon at a sequence separation threshold of 24 are 13.1% and 2.8% respectively, overall slightly behind SAM-T06 and BETApro. Its coverage at a sequence separation threshold of 24 is higher than that of Distill, Possum, GPCPRED, and GajdaPairings. Since PROFcon only made predictions for 11 out of 13 domains, its performance cannot be directly compared with the other methods. We include its results here for completeness.

Table 3 CASP7 Results of Inter-Residue Contact Predictions of Eight Predictors.

Another caveat is that the evaluation dataset and scheme we used may be slightly different from the official CASP7 evaluation. Thus, here we only try to evaluate the current state of the art of contact predictors instead of ranking them. For the official contact evaluation scheme and results, readers are advised to check the assessment paper of the CASP7 contact predictions published in the upcoming supplement issue of the journal Proteins.

Overall, these results on the CASP7 dataset show that the accuracy and coverage of protein contact prediction are still low. However, these results are an important step towards reaching the milestone of an accuracy level of about 30%, required for deriving moderately accurate (low resolution) 3D protein structures from scratch [5, 8, 39–42] (also Dr. Yang Zhang, personal communication at the CASP7 conference). In particular, these predictors tend to predict different correct contacts. Thus, a consensus combination of contact map predictors may yield more accurate contact map predictions, which in turn could significantly improve 3D structure reconstruction. Since the stakes of contact map prediction are high, a community-wide effort for improving contact map prediction should be worthwhile (Dr. Burkhard Rost's lecture slides at Columbia University).

It is also worth pointing out that the CASP7 de novo dataset is too small to reliably estimate the accuracy of the predictors, so one should not over-interpret these results. Indeed, when we use a larger CASP de novo dataset of 24 domains classified by Dr. Dylan Chivian from Dr. David Baker's group to evaluate the predictors (results not shown), the accuracies of SVMcon and BETApro are very close for both sequence separations >= 12 and 24, and both remain among the top predictors.

Conclusion

We have described a new contact map predictor (SVMcon) that uses support vector machines to integrate a large amount of useful information including profiles, secondary structure, solvent accessibility, contact potentials, residue types, segment window information [24, 32], and protein-level information [32]. The method yields a 4% improvement over the state-of-the-art contact map predictor CMAPpro. In the blind CASP7 experiment, SVMcon is ranked as one of the top contact predictors. The method represents an effort toward a good 2D structure prediction. It can be used to improve ab initio structure prediction [4] and analogous fold recognition [2, 3]. The web server, software, and source code are available at the SVMcon website [49].

Methods

Data Sets

In the comparison with CMAPpro [28, 35], we use the same training and testing datasets. The datasets are redundancy reduced. The pairwise sequence identity of any two sequences is less than 25%. The training and testing datasets contain 485 sequences and 48 sequences respectively.

We use PSI-BLAST to search each sequence against the NCBI non-redundant database and generate a multiple sequence alignment. The number of PSI-BLAST iterations is set to 3. The e-value for selecting a sequence is set to 0.001. The e-value for including a sequence into the construction of a profile is set to $10^{-10}$. Multiple sequence alignments are converted into profiles, where each position is associated with a vector denoting the probability of each residue type.
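The conversion from an alignment to a profile can be sketched as follows. This is only an illustration: it assumes unweighted column frequencies over 20 amino acids plus gap and maps any other symbol to the gap state, whereas the authors' pipeline may weight sequences or handle unknown residues differently. The names `ALPHABET` and `alignment_to_profile` are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap (21 symbols)

def alignment_to_profile(alignment):
    """Turn a list of equal-length aligned sequences into per-column probabilities.

    Returns one 21-element list per alignment column, ordered as in ALPHABET.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    profile = []
    for col in range(n_cols):
        counts = dict.fromkeys(ALPHABET, 0)
        for seq in alignment:
            # Assumption: symbols outside the alphabet (e.g. X) are counted as gaps.
            symbol = seq[col] if seq[col] in counts else "-"
            counts[symbol] += 1
        profile.append([counts[a] / n_seqs for a in ALPHABET])
    return profile
```

Each column vector sums to 1 and supplies the 21 profile inputs used by the window features described later.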

We extract only medium- and long-range residue pairs with sequence separation >= 6 as in [32], which are not captured by local secondary structures. We use an 8 Å distance threshold between Cα atoms to classify residue pairs into two categories: contact (positive, < 8 Å) or non-contact (negative, >= 8 Å). Since the majority of residue pairs are negative examples, to balance the number of positive and negative examples in the training set we randomly sample and retain only 5% of the negative examples while keeping all positive examples. In total, there are 220,994 negative examples and 94,110 positive examples in the training data set. We keep all negative and positive examples in the test data set. The test data set has 10,498 positive examples and 367,299 negative examples.
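The labelling and subsampling steps above can be sketched as follows. The function names, the coordinate layout (one `(x, y, z)` tuple per residue's Cα atom), and the fixed random seed are assumptions made for illustration, not details from the paper.

```python
import math
import random

def label_pairs(ca_coords, min_separation=6, threshold=8.0):
    """Label every residue pair (i, j) with j - i >= min_separation.

    Returns a list of (i, j, label), where label is 1 for a contact
    (C-alpha distance < threshold in Angstroms) and 0 otherwise.
    """
    examples = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            distance = math.dist(ca_coords[i], ca_coords[j])
            examples.append((i, j, 1 if distance < threshold else 0))
    return examples

def subsample_negatives(examples, keep_fraction=0.05, seed=0):
    """Keep all positive examples and a random fraction of the negatives."""
    rng = random.Random(seed)
    return [e for e in examples if e[2] == 1 or rng.random() < keep_fraction]
```

With `keep_fraction=0.05` the negatives are thinned to roughly one in twenty, as in the training set; the test set would be passed through `label_pairs` only.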

Input Features

We extract five categories of features for each pair of residues at positions i and j to evaluate their contact likelihood. In addition to the new features (e.g. pairwise information features), the choice of most features combines ideas from our previous contact map predictors, disulfide bond predictors [33, 50], and beta sheet topology predictors [37], and from PROFcon [32], the best predictor in CASP6.

Local window feature

We extract local features using a 9-residue window centered at each residue in each residue pair. For each position in the window, we use 21 inputs for the probabilities of the 20 amino acids plus gap, computed from multiple sequence alignments, 3 binary inputs for secondary structure (helix: 100, sheet: 010, coil: 001), 2 binary inputs for relative solvent accessibility (exposed: 10, buried: 01) at a 25% threshold, and one input for the entropy ($-\sum_k p_k \log p_k$) as a measure of local conservation. Here $p_k$ is the probability of occurrence of the k-th residue (or gap) at the position under consideration. Secondary structure and relative solvent accessibility are predicted using the SSpro and ACCpro programs in the SCRATCH suite [27, 35, 51]. Thus the two local windows produce 2 × 9 × 27 features.
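A minimal sketch of the 27 per-position inputs and the 9-residue window follows. Several details are assumptions because the text does not specify them: the natural logarithm in the entropy, the single-letter state codes (`H`/`E`/`C` and `e`/`b`), and zero padding for window positions that fall beyond the sequence termini.

```python
import math

SS_CODE = {"H": [1, 0, 0], "E": [0, 1, 0], "C": [0, 0, 1]}  # helix, sheet, coil
ACC_CODE = {"e": [1, 0], "b": [0, 1]}                        # exposed, buried

def column_entropy(probs):
    """Shannon entropy -sum_k p_k log p_k, skipping zero-probability symbols."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def position_features(profile_col, ss, acc):
    """21 profile + 3 secondary structure + 2 accessibility + 1 entropy = 27 inputs."""
    return list(profile_col) + SS_CODE[ss] + ACC_CODE[acc] + [column_entropy(profile_col)]

def window_features(profile, ss, acc, center, size=9):
    """Concatenate the 27 inputs of every position in a window around `center`."""
    half = size // 2
    feats = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            feats += position_features(profile[pos], ss[pos], acc[pos])
        else:
            feats += [0.0] * 27  # assumption: pad positions outside the sequence with zeros
    return feats
```

Calling `window_features` once for residue i and once for residue j yields the 2 × 9 × 27 = 486 local inputs.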

Pairwise information features

For each pair of positions (i, j) in a multiple sequence alignment, we compute the following features. One input corresponds to the mutual information of the profiles of the two positions, $\sum_{k,l} p_{kl} \log \frac{p_{kl}}{p_k p_l}$, where $p_{kl}$ is the empirical probability of residues (or gap) k and l appearing at the two positions i and j simultaneously. Two other pairwise inputs are computed using the cosine, $\frac{x \cdot y}{|x|\,|y|}$, and correlation, $\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, measures on the profiles at positions i and j. Thus some information about correlated mutations is used in the inputs. We also use three inputs to represent Levitt's contact potential [52], Jernigan's pairwise potential [53], and Braun's pairwise potential [54] for the residue pairs in the target sequence.
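The three alignment-derived pairwise measures can be sketched in plain Python. This is an illustrative reading of the formulas, not the authors' code: mutual information is computed from two alignment columns (strings with one symbol per aligned sequence), while cosine and correlation take the two profile vectors x and y. The natural logarithm and the zero return value for degenerate vectors are assumptions.

```python
import math

def mutual_information(col_i, col_j):
    """sum_kl p_kl log(p_kl / (p_k p_l)) from two aligned columns of symbols."""
    n = len(col_i)
    joint, count_i, count_j = {}, {}, {}
    for a, b in zip(col_i, col_j):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        count_i[a] = count_i.get(a, 0) + 1
        count_j[b] = count_j.get(b, 0) + 1
    # p_kl / (p_k p_l) simplifies to c * n / (count_k * count_l).
    return sum((c / n) * math.log(c * n / (count_i[a] * count_j[b]))
               for (a, b), c in joint.items())

def cosine(x, y):
    """x . y / (|x| |y|) for two profile vectors."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / norm if norm else 0.0

def correlation(x, y):
    """Pearson correlation between two profile vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den if den else 0.0
```

Two columns that always change together, such as `"AAGG"` and `"CCDD"`, give the maximal mutual information for two equally likely states, while independent columns give zero.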

Residue type features

We classify residues into four categories: non-polar (G, A, V, L, I, P, M, F, W), polar (S, T, N, Q, C, Y), acidic (D, E), basic (K, R, H). These four residue types induce 10 different combinations: non-polar/non-polar, non-polar/polar, non-polar/acidic, non-polar/basic, polar/polar, polar/acidic, polar/basic, acidic/acidic, acidic/basic, and basic/basic. We use 10 binary inputs to encode the type of a residue pair.
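The 10-input encoding can be sketched as below. The class memberships come from the text; the names `CLASSES`, `PAIR_TYPES`, and `residue_type_features`, and the order of the 10 inputs, are illustrative assumptions.

```python
from itertools import combinations_with_replacement

CLASSES = {
    "non-polar": "GAVLIPMFW",
    "polar": "STNQCY",
    "acidic": "DE",
    "basic": "KRH",
}
# Map each of the 20 amino acids to its class name.
RESIDUE_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}
# The 10 unordered class combinations, in the order the classes are listed.
PAIR_TYPES = list(combinations_with_replacement(CLASSES, 2))

def residue_type_features(aa_i, aa_j):
    """One-hot vector of length 10 for the (unordered) class pair of two residues."""
    order = list(CLASSES)
    pair = tuple(sorted((RESIDUE_CLASS[aa_i], RESIDUE_CLASS[aa_j]), key=order.index))
    return [1 if pair == t else 0 for t in PAIR_TYPES]
```

Because the pair is sorted before lookup, `residue_type_features("D", "K")` and `residue_type_features("K", "D")` produce the same acidic/basic encoding.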

Central segment window feature

A central segment window, corresponding to a window centered at position (i + j)/2, has been shown to be useful for predicting whether the residues at positions i and j are in contact or not [24, 32]. We use a central segment window of size 5. For each position in the window, we use the same 27 features as the local window features. So the total number of features for the central window is 5 × 27. We also compute the amino acid composition (21 features), secondary structure composition (3 features), and relative solvent accessibility composition (2 features) in the central segment window. The sequence separation (|i - j + 1|) for residue pair (i, j) is classified into one of 16 length bins (< 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, < 19, < 24, <= 29, <= 39, <= 49, >= 50) using a binary vector of length 16, as in [32].
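The 16-bin separation encoding can be sketched as follows. The bin boundaries are read directly from the list above (for example, "< 19" is taken to cover separations 15 to 18 and "< 24" to cover 19 to 23); this reading, and the function names, are assumptions. The function takes the separation value itself and leaves its computation from i and j to the caller.

```python
def separation_bin(separation):
    """Index (0-15) of the length bin for a sequence separation value."""
    if separation < 6:
        return 0
    if separation <= 14:
        return separation - 5          # 6 -> bin 1, ..., 14 -> bin 9
    # Remaining bins by inclusive upper bound: <19, <24, <=29, <=39, <=49.
    for index, upper in enumerate((18, 23, 29, 39, 49), start=10):
        if separation <= upper:
            return index
    return 15                          # >= 50

def separation_features(separation):
    """Binary vector of length 16 with a single 1 at the separation's bin."""
    chosen = separation_bin(separation)
    return [1 if k == chosen else 0 for k in range(16)]
```

Since SVMcon only considers pairs with separation >= 6, the first bin would stay empty in practice under this reading.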

Protein information features

We also compute the global amino acid composition (21 features), secondary structure composition (3 features), and relative solvent accessibility composition (2 features) of the target sequence. In addition, we classify sequence lengths into four bins (<= 50, <= 100, <= 150, and > 150) using a binary vector of length 4, as in [32].
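The protein-level inputs can be sketched in the same style. The helper names are hypothetical, and computing composition as simple symbol frequencies over a string is an assumption (for instance, the amino acid composition could instead be averaged from the profile).

```python
def composition(symbols, alphabet):
    """Fraction of positions occupied by each symbol of `alphabet`, in order."""
    return [symbols.count(a) / len(symbols) for a in alphabet]

def length_bin_features(length):
    """Binary vector of length 4 for the bins <= 50, <= 100, <= 150, > 150."""
    bounds = (50, 100, 150)
    chosen = next((k for k, upper in enumerate(bounds) if length <= upper), len(bounds))
    return [1 if k == chosen else 0 for k in range(4)]
```

For example, `composition("HHEC", "HEC")` gives the three secondary structure fractions, and `length_bin_features(120)` sets the third of the four length inputs.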

The detailed methods of generating features are described in the additional files [see Additional file 1, 2, 3].

Feature Selection

Feature selection is useful to improve the performance of machine learning methods, particularly when there is a large number of features as in this study. However, since there are more than 310,000 training data points, it takes about 12 days to conduct a round of training and testing on a Pentium-IV computer. Thus a thorough feature selection is currently not feasible. So we tried only to remove some features (pairwise profile correlation, pairwise mutual information, residue type, and protein information features) one at a time to test how they affect the prediction accuracy. We find that removing these features slightly improves the accuracy, by about 0.2%. However, it is not clear whether the improvement is due to random variation or to the removal of the features. At the least, these features are either not essential or are compensated for by other similar features. Thus, a more thorough feature selection should be conducted to improve the performance when more computing power is available.

SVM Learning

For an input feature vector associated with a pair of residues, we use Support Vector Machines (SVMs) to predict if the two residues are in contact (positive) or not (negative). SVMs provide a non-linear classifier model by non-linearly mapping the input vectors into a feature space and using linear methods for classification in the feature space [55–58]. Thus SVMs, and more generally kernel methods, attempt to combine the advantages of both linear and nonlinear methods by first embedding the data into a feature space equipped with a dot product and then using linear methods in the feature space to perform classification or regression tasks based on the Gram matrix of dot products between data points. A key property of kernel methods is that the embedding does not need to be given in explicit form: the Gram matrix of dot products $K(x, y) = \phi(x) \cdot \phi(y)$ between data points is all that is needed to proceed with classification or regression. Here x and y are input data points, $\phi$ is the mapping from input space to feature space, and K is the kernel or similarity measure between the original data points. Given a set of training data points $S = S^+ \cup S^-$, where $S^+$ (resp. $S^-$) represents the positive (resp. negative) examples, using the theory of structural risk minimization [55–58], SVMs learn a classification function f(x) of the form

$$f(x) = \sum_{x_i \in S^+} \alpha_i K(x, x_i) - \sum_{x_i \in S^-} \alpha_i K(x, x_i) + b$$

where $\alpha_i$ are non-negative weights assigned to the training data points $x_i$ during training by minimizing a quadratic objective function and b is the bias. Thus the function f(x) can be viewed as a weighted linear combination of similarities between the training data points $x_i$ and the target data point x. Only data points with strictly positive weight $\alpha_i$ in the training dataset affect the final solution. The corresponding data points $x_i$ are called the support vectors. For contact map prediction, a new data point x is predicted to be positive or negative by taking the sign of f(x).

We use SVM-light [59–61] to implement SVM classification on our data. We experimented with several common kernels including linear kernels, Gaussian radial basis kernels (RBF), polynomial kernels, and sigmoidal kernels. In our experience, on this data the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$ (or $e^{-\|x - y\|^2 / \sigma^2}$) works best. Using the RBF kernel, f(x) is a weighted sum of Gaussians centered on the support vectors. Almost any separating boundary or regression function can be obtained with such a kernel [62]; thus it is important to tune the SVM parameters carefully in order to achieve good generalization performance and avoid overfitting.

We adjust only the width parameter γ of the RBF kernel, leaving all other parameters at their default values. γ is the inverse of the variance (σ²) of the RBF and controls the width of the Gaussian functions centered on the support vectors. The larger γ is, the more peaked the Gaussians are, and the more complex the resulting decision boundaries are [62]. After experimenting with several values of γ, we selected γ = 0.025.
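The decision function and the RBF kernel can be written out directly in plain Python. This sketch only evaluates f(x) for given support vectors and weights; it does not perform SVM training and does not use SVM-light. The function names, the `(alpha, vector)` layout, and the toy support vectors in the usage note are assumptions.

```python
import math

def rbf_kernel(x, y, gamma=0.025):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma = 0.025 is the value selected above."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def decision_value(x, positives, negatives, bias, kernel=rbf_kernel):
    """f(x) = sum over S+ of alpha_i K(x, x_i) - sum over S- of alpha_i K(x, x_i) + b.

    positives / negatives: lists of (alpha_i, x_i) support vectors from S+ and S-.
    """
    pos = sum(alpha * kernel(x, sv) for alpha, sv in positives)
    neg = sum(alpha * kernel(x, sv) for alpha, sv in negatives)
    return pos - neg + bias

def predict_contact(x, positives, negatives, bias):
    """Predict contact (True) when f(x) is positive, following the sign rule."""
    return decision_value(x, positives, negatives, bias) > 0
```

With one positive support vector at `[0.0, 0.0]` and one negative support vector at `[10.0, 10.0]`, both with weight 1 and zero bias, a query at the origin is classified as a contact and a query at `[10.0, 10.0]` is not. Increasing `gamma` makes each Gaussian narrower, so a support vector influences only closer query points.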

References

  1. Rost B, Liu J, Przybylski D, Nair R, Wrzeszczynski K, Bigelow H, Ofran Y: Prediction of protein structure through evolution. In Handbook of Chemoinformatics – From Data to Knowledge. Edited by: Gasteiger J, Engel T. New York: Wiley; 2003:1789–1811.

  2. Olmea O, Rost B, Valencia A: Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 1999, 295: 1221–1239.

  3. Cheng J, Baldi P: A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics 2006, 22: 1456–1463.

  4. Bonneau R, Ruczinski I, Tsai J, Baker D: Contact order and ab initio protein structure prediction. Protein Sci 2002, 11: 1937–1944.

  5. Aszodi A, Gradwell M, Taylor W: Global fold determination from a small number of distance restraints. J Mol Biol 1995, 251: 308–326.

  6. Vendruscolo M, Kussell E, Domany E: Recovery of protein structure from contact maps. Folding and Design 1997, 2: 295–306.

  7. Skolnick J, Kolinski A, Ortiz A: MONSSTER: a method for folding globular proteins with a small number of distance restraints. J Mol Biol 1997, 265: 217–241.

  8. Zhang Y, Skolnick J: Automated structure prediction of weakly homologous proteins on a genomic scale. P.N.A.S 2004, 101: 7594–7599.

  9. Plaxco K, Simons K, Baker D: Contact order, transition state placement and the refolding rates of single domain proteins. Journal of Molecular Biology 1998, 277: 985–994.

  10. Punta M, Rost B: Protein folding rates estimated from contact predictions. J Mol Biol 2005, 507–512.

  11. Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT: Critical assessment of methods of protein structure prediction (CASP): round II. Proteins Suppl 1997, 1: 2–6.

  12. Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT: Critical assessment of methods of protein structure prediction (CASP): round III. Proteins Suppl 1999, (3):22–29.

  13. Moult J, Fidelis K, Zemla A, Hubbard T: Critical assessment of methods of protein structure prediction (CASP) – round V. Proteins 2003, 53(Suppl 6):334–339.

  14. Moult J, Fidelis K, Tramontano A, Rost B, Hubbard T: Critical assessment of methods of protein structure prediction (CASP) – round VI. Proteins 2005, 61(S7):3–7.

  15. Grana O, Baker D, MacCallum R, Meiler J, Punta M, Rost B, Tress M, Valencia A: CASP6 assessment of contact prediction. Proteins 2005, 61: 214–224.

  16. Goebel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins 1994, 18: 309–317.

  17. Olmea O, Valencia A: Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 1997, 2: s25-s32.

  18. Shindyalov I, Kolchanov N, Sander C: Can three-dimensional contacts in protein structure be predicted by analysis of correlated mutation? Protein Eng 1994, 7: 349–358.

  19. Hamilton N, Burrage K, Ragan M, Huber T: Protein contact prediction using patterns of correlation. Proteins 2004, 56: 679–684.

  20. Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struc Biol 2002, 12: 368–373.

  21. Halperin I, Wolfson HJ, Nussinov R: Correlated mutations: Advances and limitations. A Study on fusion proteins and on the Cohesin-Dockerin families. Proteins 2006.

  22. Kundrotas PJ, Alexov EG: Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives. BMC Bioinformatics 2006, 7: 503.

  23. Fariselli P, Olmea O, Valencia A, Casadio R: Prediction of contact maps with neural networks and correlated mutations. Protein Engineering 2001, 13: 835–843.

  24. Lund O, Frimand K, Gorodkin J, Bohr H, Bohr J, Hansen J, Brunak S: Protein distance constraints predicted by neural networks and probability density functions. Prot Eng 1997, 10(11):1241–1248.

  25. Fariselli P, Casadio R: Neural network based predictor of residue contacts in proteins. Protein Engineering 1999, 12: 15–21.

  26. Fariselli P, Olmea O, Valencia A, Casadio R: Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins 2001, (Suppl 5):157–162.

  27. Pollastri G, Baldi P, Fariselli P, Casadio R: Improved prediction of the number of residue contacts in proteins by recurrent neural networks. Bioinformatics 2001, 17: S234-S242. [Proceedings of the ISMB 2001 Conference].

  28. Pollastri G, Baldi P: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 2002, 18(Suppl 1):S62-S70. [Proceedings of the ISMB 2002 Conference].

  29. MacCallum R: Striped Sheets and Protein Contact Prediction. Bioinformatics 2004, 20(Supplement 1):i224-i231. [Proceedings of the ISMB 2004 Conference].

  30. Shao Y, Bystroff C: Predicting inter-residue contacts using templates and pathways. Proteins 2003, 53(Supplement 6):497–502.

  31. Zhao Y, Karypis G: Prediction of Contact Maps Using Support Vector Machines. Proc of the IEEE Symposium on Bioinformatics and BioEngineering 2003, 26–36.

  32. Punta M, Rost B: PROFcon: novel prediction of long-range contacts. Bioinformatics 2005, 21: 2960–2968.

  33. Cheng J, Saigo H, Baldi P: Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins: Structure, Function, Bioinformatics 2006, 62(3):617–629.

  34. Vullo A, Walsh I, Pollastri G: A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics 2006, 7: 180.

  35. Cheng J, Randall A, Sweredoski M, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Research 2005, 33(Web Server issue):W72–W76.

  36. Rost B, Eyrich V: EVA: large-scale analysis of secondary structure prediction. Proteins 2001, 45(S5):192–199.

  37. Cheng J, Baldi P: Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics 2005, 21(suppl 1):i75-i84.

  38. Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247: 536–540.

  39. Skolnick J, Kolinski A, Ortiz A: MONSTER: A method for folding globular Proteins with a small number of distance restraints. J Mol Biol 1997, 265: 217–241.

  40. Ortiz A, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J: Ab initio folding of proteins using restraints derived from evolutionary information. Proteins 1999, (Suppl 3):177–185.

  41. Ortiz A, Kolinski A, Skolnick J: Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J Mol Biol 1998, 227: 419–448.

  42. Zhang Y, Kolinski A, Skolnick J: TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophysical Journal 2003, 85: 1145–1164.

  43. Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus K, Kelley L, MacCallum R, Pawlowski K, Rost B, Rychlewski L, Sternberg M: CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins 1999, (Suppl 3):209–217.

  44. Lesk A, Conte LL, Hubbard T: Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins 2001, 45(S5):98–118.

  45. Fischer D, Elofsson A, Rychlewski L, Pazos F, Valencia A, Godzik A, Rost B, Ortiz A, Dunbrack R: CAFASP-2: the second critical assessment of fully automated structure prediction methods. Proteins 2001, 45(S5):171–183.

  46. Grana O, Eyrich V, Pazos F, Rost B, Valencia A: EVAcon: a protein contact prediction evaluation. Nucleic Acids Res 2005, 33: W347-W351.

  47. Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14(10):846–56.

  48. Bau D, Martin A, Mooney C, Vullo A, Walsh I, Pollastri G: Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins. BMC Bioinformatics 2006, 7: 402.

  49. SVMcon [http://www.bioinfotool.org/svmcon.html]

  50. Baldi P, Cheng J, Vullo A: Large-scale prediction of disulphide bond connectivity. In Advances in Neural Information Processing Systems (NIPS04 Conference). Volume 17. Edited by: Saul LK, Weiss Y, Bottou L. Cambridge, MA: MIT Press; 2005:97–104.

  51. Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002, 47: 228–235.

  52. Huang E, Subbiah S, Tsai J, Levitt M: Using a Hydrophobic Contact Potential to Evaluate Native and Near-Native Folds Generated by Molecular Dynamics Simulations. J Mol Biol 1996, 257: 716–725.

  53. Miyazawa S, Jernigan R: An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins 1999, 36: 357–369.

  54. Zhu H, Braun W: Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting. Protein Sci 1999, 8: 326–342.

  55. Vapnik V: Statistical Learning Theory. New York, NY: Wiley; 1998.

  56. Vapnik V: The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag; 1995.

  57. Drucker H, Burges C, Kaufman L, Smola A, Vapnik V: Support Vector Regression Machines. In Advances in Neural Information Processing Systems. Volume 9. Edited by: Mozer MC, Jordan MI, Petsche T. Cambridge, MA: MIT Press; 1997:155–161.

  58. Schölkopf B, Smola A: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press; 2002.

  59. Joachims T: Making large-scale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press; 1999.

  60. Joachims T: Learning to Classify Text Using Support Vector Machines. Dissertation. Springer; 2002.

  61. SVM-light [http://svmlight.joachims.org]

  62. Vert J, Tsuda K, Schölkopf B: A Primer on Kernel Methods. In Kernel Methods in Computational Biology. Edited by: Schölkopf B, Tsuda K, Vert J. Cambridge, MA: MIT Press; 2004:55–72.

  63. Kraulis P: MOLSCRIPT: A program to produce both detailed and schematic plots of protein structure. Journal of Applied Crystallography 1991, 24: 946–950.

Acknowledgements

Work supported by NIH grant LM-07443-01 and NSF grants EIA-0321390 and IIS-0513376 to PB. JC is currently supported by a UCF faculty start-up grant.

Author information

Corresponding author

Correspondence to Jianlin Cheng.

Additional information

Authors' contributions

JC designed the features, implemented the algorithm, and carried out the experiment. JC and PB authored the manuscript. Both authors approved the manuscript.

Electronic supplementary material

12859_2006_1485_MOESM1_ESM.pl

Additional file 1: The main Perl script to predict a contact map. It is a text file that can be viewed by any text viewer/editor. (PL)

12859_2006_1485_MOESM2_ESM.pl

Additional file 2: The Perl script to generate input features for the support vector machine. It is a text file that can be viewed by any text viewer/editor. (PL)

12859_2006_1485_MOESM3_ESM.pl

Additional file 3: The Perl script to compute pairwise contact potentials. It is a text file that can be viewed by any text viewer/editor. (PL)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cheng, J., Baldi, P. Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8, 113 (2007). https://doi.org/10.1186/1471-2105-8-113

  • DOI: https://doi.org/10.1186/1471-2105-8-113

Keywords