NEC LABS AMERICA, 4 Independence Way, Princeton, NJ, USA

Center for Computational Learning Systems, Columbia University, Interchurch Center, 475 Riverside Dr., New York, USA

Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, 1705 NE Pacific Street, Seattle, WA, USA

Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY, USA

Abstract

Background

Biologists regularly search DNA or protein databases for sequences that share an evolutionary or functional relationship with a given query sequence. Traditional search methods, such as BLAST and PSI-BLAST, focus on detecting statistically significant pairwise sequence alignments and often miss more subtle sequence similarity. Recent work in the machine learning community has shown that exploiting the global structure of the network defined by these pairwise similarities can help detect more remote relationships than a purely local measure.

Methods

We review RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges. The original RankProp algorithm is unsupervised. Here, we describe a semi-supervised version of the algorithm that uses labeled examples. Three possible ways of incorporating label information are considered: (i) as a validation set for model selection, (ii) to learn a new network, by choosing which transfer function to use for a given query, and (iii) to estimate edge weights, which measure the probability of inferring structural similarity.

Results

Benchmarked on a human-curated database of protein structures, the original RankProp algorithm provides significant improvement over local network search algorithms such as PSI-BLAST. Furthermore, we show here that labeled data can be used to learn a network without any need for estimating parameters of the transfer function, and that diffusion on this learned network produces better results than the original RankProp algorithm with a fixed network.

Conclusion

In order to gain maximal information from a network, labeled and unlabeled data should be used to extract both local and global structure.

Background

Pairwise sequence comparison is the "killer app" of bioinformatics. Algorithms like BLAST and PSI-BLAST are used daily by biologists searching for sequences related to a query of interest.

Early methods for detecting subtle sequence similarities were designed explicitly with respect to a simple model of molecular evolution. They measure similarities between protein pairs by computing the cost of mutation, insertion and deletion. The Smith-Waterman algorithm is the classic example, using dynamic programming to compute an optimal local alignment between two sequences; BLAST provides a fast heuristic approximation.

More sophisticated solutions to this problem involve learning from data. In an example of this approach, an HMM or other generative model is constructed from a training set, which is accumulated iteratively from the target database. SAM-T98 and PSI-BLAST both follow this iterative, profile-based strategy.

RankProp

In this article we review RankProp, an unsupervised protein ranking algorithm. We also discuss some extensions to the algorithm that leverage the use of labeled data to make it a semi-supervised method.

In semi-supervised learning, one is interested in using a large unlabeled set of data to help build a classification rule. Common scenarios for this situation include text and web page classification. These algorithms work by making the so-called cluster assumption: the classification rule does not change in regions of input space that are densely populated. In other words, the algorithm chooses a classification decision boundary that lies in a region of low density (see Figure 1).

**An example of the use of semi-supervised classification**. (A) If only one example of each class is given (the large cross and circle), then one may choose an incorrect decision boundary between the two classes. (B) Given extra unlabeled data (small crosses), the decision boundary can be placed in a sparse region of the space, which can result in a more accurate classifier.

**Semi-supervised learning for a ranking problem**. (A) The query is a single point in input space, and the remaining points comprise the database one wishes to rank. (B) The ranking induced by Euclidean distance; marker sizes are proportional to the rank of each point. (C) The ideal ranking. Clearly, to find the optimal ranking we need to find the cluster/manifold structure in the data.

An analogy can also be made between a protein database search, where one submits a query protein and is returned a ranked list of proteins, and a web search, where one enters a query word or phrase and retrieves a ranked list of web pages. RankProp is similar to PageRank, the algorithm used by the Google search engine, which ranks web pages by exploiting the global structure of the hyperlink graph rather than local link counts alone.

RankProp with labeled data

We hypothesize that extending RankProp to make use of both labeled and unlabeled data will further improve its rankings.

Results

Basic approach

The RankProp algorithm requires a protein similarity network as input. The protein similarity network represents the similarity between pairs of proteins by assigning a weight to each edge. The degree of similarity between two sequences is commonly summarized in an E-value, which is the expected number of times that this degree of sequence similarity would occur in a random database of the given size. RankProp bases its edge weights on E-values returned from PSI-BLAST searches, using a radial basis transfer function

W_{ij }= exp(-E_{ij}/σ)     (1)

where E_{ij }is the E-value between proteins i and j, and W_{ij }is the corresponding weight. In this way, edges between similar sequences are assigned large weights. The transfer function introduces a hyper-parameter σ, the radial basis width.
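As a concrete illustration, the transfer function (1) can be sketched as follows (a minimal sketch; the E-values and the value of σ in the example are invented for illustration, not taken from the experiments):

```python
import math

def edge_weight(e_value, sigma):
    """Radial basis transfer function: maps a PSI-BLAST E-value to an
    edge weight. Small E-values (strong hits) yield weights near 1;
    large E-values decay toward 0."""
    return math.exp(-e_value / sigma)

# Illustrative E-values (hypothetical, not from the paper's data).
for e in [1e-10, 0.01, 1.0, 100.0]:
    print(e, edge_weight(e, sigma=100.0))
```

Note that the weight depends on the E-value only through the ratio E_{ij}/σ, which is why the choice of σ matters: it sets the E-value scale at which edges effectively vanish.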

We evaluate RankProp output using a 3D structure-based gold standard, the SCOP database of protein structural domains (see Methods).

We use receiver operating characteristic (ROC) curves to measure performance. The ROC score is the normalized area under a curve that plots true positives as a function of false positives; the ROC_{n }score computes this area up to the n-th false positive. For a random ranking of a large database, the ROC_{50 }score is close to 0 because most of the sequences are not related to the query.
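The ROC_{n} computation can be sketched as follows (our own illustration; the function name and label convention are not from the paper, and the sketch assumes at least n unrelated sequences appear in the ranking):

```python
def roc_n(ranked_labels, n):
    """ROC_n: area under the ROC curve up to the n-th false positive,
    normalized so that a perfect ranking scores 1.
    `ranked_labels` lists the true labels (1 = related to the query,
    0 = unrelated) in ranked order, best hit first."""
    total_pos = sum(ranked_labels)
    if total_pos == 0 or n == 0:
        return 0.0
    tp = fp = area = 0
    for label in ranked_labels:
        if label == 1:
            tp += 1
        else:
            fp += 1
            area += tp      # true positives seen before this false positive
            if fp == n:
                break
    return area / (n * total_pos)

# A ranking that places 3 of 4 homologs above the first false positive.
print(roc_n([1, 1, 1, 0, 1, 0, 0], n=2))   # → 0.875
```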

Our experiments suggest that RankProp's ranking is superior to the ranking induced by the direct links in the original network, i.e., the ranking given by the PSI-BLAST algorithm, as shown in Figure 3 and Table 1.

**Comparison of PSI-BLAST and RankProp variants**. Each panel plots the number of test set queries for which a given method achieves at least a given ROC_{n }score threshold. Panels (A), (B) and (C) use ROC_{50}, ROC_{10 }and ROC_{1 }scores, respectively. Corresponding mean ROC_{n }scores are given in Table 1. The variants of RankProp are described in the text. Previous work [5] used RankProp with a single fixed value of σ.

Table 1. Comparison of PSI-BLAST and RankProp variants. The table lists ROC_{1}, ROC_{10 }and ROC_{50 }scores, averaged across 2899 SCOP domains in the test set, for PSI-BLAST and five variants of the RankProp algorithm.

Method                   ROC_{1}    ROC_{10}   ROC_{50}
PSI-BLAST                0.5973     0.6167     0.6406
RankProp                 0.5964     0.6658     0.7169
RankProp                 0.5924     0.6671     0.7249
RankProp                 0.6040     0.6661     0.6999
Probabilistic RankProp   0.6145     0.6660     0.7195
Adaptive RankProp        0.6510     0.7000     0.7420

So far, we have described RankProp as a purely unsupervised approach. However, we would also like to make use of the labeled data available, which can be done by learning some aspect of the network with the available labels. In the following subsections, we consider three ways to achieve this goal.

Model selection of radial basis width

RankProp takes as input a weighted protein similarity network. Clearly, the quality of the rankings produced by RankProp depends critically on the quality of the initial weights. If we can parameterize the weights in some way, then these parameters can be inferred using labeled data.

Our first method for making use of labeled data simply learns the radial basis width parameter σ of the transfer function. We run RankProp on a set of labeled training queries using several candidate values of σ; the ROC_{n }scores of the resulting rankings can then be used to select an optimal value for σ.
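This grid-search model selection can be sketched as follows (a hypothetical sketch: `run_rankprop` and `roc1` stand in for the search and scoring routines, and the toy quality numbers are invented):

```python
def select_sigma(candidate_sigmas, validation_queries, run_rankprop, roc1):
    """Pick the radial basis width that maximizes mean ROC_1 over a
    labeled validation set. `run_rankprop(query, sigma)` returns a
    ranking; `roc1(ranking, query)` scores it against the labels."""
    best_sigma, best_score = None, float("-inf")
    for sigma in candidate_sigmas:
        score = sum(roc1(run_rankprop(q, sigma), q)
                    for q in validation_queries) / len(validation_queries)
        if score > best_score:
            best_sigma, best_score = sigma, score
    return best_sigma

# Toy stand-ins: pretend ranking quality peaks at sigma = 100.
fake_quality = {10: 0.55, 100: 0.62, 1000: 0.58}
best = select_sigma([10, 100, 1000], ["q1", "q2"],
                    lambda q, s: s, lambda r, q: fake_quality[r])
print(best)   # → 100
```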

The probability of homology network

The RankProp algorithm with the transfer function (1) requires that the user specify the radial basis width parameter σ. An alternative is to assign edge weights that have a direct probabilistic interpretation, which removes this free parameter entirely.

Our second method for making use of labeled data uses this probabilistic formulation. In particular, we suggest an empirical approach for estimating edge probabilities from labeled data. In order to perform superfamily detection by network propagation, the most natural weight to assign to an edge between proteins i and j is the probability that they belong to the same superfamily.

**Comparison of transfer functions for converting PSI-BLAST E-values to edge weights**. The figure plots edge weight as a function of E-value for various transfer functions. The original RankProp algorithm assigns edge weights using the function exp(-E_{ij}/σ); the probabilistic network instead maps each E-value to an empirically estimated probability of homology.

Overall, the probabilistic network provides performance comparable or superior to RankProp with all tested values of σ, as measured by ROC_{1}, ROC_{10 }and ROC_{50 }scores (Table 1). However, the improvement is less convincing at the ROC_{10 }and ROC_{1 }levels, i.e., when the very highest ranked proteins become increasingly important. It is not surprising that RankProp has similar ROC_{1 }performance to PSI-BLAST, because examples very close to the query under the original similarity metric are usually already highly ranked.

Although probabilistic RankProp does not always outperform the original RankProp run with the best choice of σ, it has the advantage of requiring no free parameter, and its edge weights can be interpreted directly as probabilities.

Adaptive model selection of radial basis width

We observed that the optimal value of σ is query-dependent: it varies with the number of close homologs that the query has in the database.

**The optimal radial basis width σ depends upon the number of close homologs to the query**. In the figure, each point corresponds to one query sequence from the training set. For each query, the y-axis coordinate is the number of target sequences to which PSI-BLAST assigns an E-value less than 0.01, and the x-axis coordinate is the difference in ROC score obtained by running RankProp with different values of σ.

This observation suggests our third strategy for making use of labeled data: for each candidate value of σ, train a regression model that predicts the resulting ROC_{n }score given a histogram of the number of PSI-BLAST hits with E-value less than a set of thresholds. In our experiments we predict the ROC_{1 }score, but in principle we could optimize for any error measure. Then, given a query, we choose the value of σ that is assigned the best predicted ROC_{1 }score by the regressions. We call this method "adaptive RankProp." The results, given in Figure 3 and Table 1, show improved mean ROC_{1}, ROC_{10 }and ROC_{50 }scores. We note that the supplementary material of previous work contains a related observation.

Discussion

In this article we reviewed the RankProp algorithm and suggested some ways of using labeled data to further improve the basic algorithm. Based on our experiments, we advocate making use of all available information — in this case using both labeled and unlabeled data — to achieve the best possible results.

The basic way to use labeled data with the ranking problem is to optimize the parameters of the model of choice, which in this case is the protein similarity network. In the previous sections, we defined three possible parameterizations and then optimized them, in each case yielding good results. However, many other parameterizations are possible. For example, one could build a network based upon several measures of similarity, including pairwise sequence similarity metrics (BLAST, PSI-BLAST), common motif occurrences (MotifProp), and other measures of protein similarity.

An implementation of RankProp is now available on the Santa Cruz Gene Sorter, part of the UCSC Genome Browser.

Finally, one important difference between RankProp and existing methods such as BLAST and PSI-BLAST is that RankProp does not return an E-value or other confidence measure along with each ranked protein. Defining such a confidence measure is the subject of our current research.

Conclusion

The RankProp algorithm uses global network information to give improved protein rankings by performing diffusion on a graph built with PSI-BLAST similarity scores. PSI-BLAST improves upon BLAST by incorporating unlabeled data into its search algorithm, but advanced machine learning techniques appear to extract extra information useful for this task. In this article, we showed how labeled data can be used to further improve the unsupervised diffusion technique by learning various parameters of the similarity network. These results may have implications for other ranking problems in bioinformatics as well, as long as a suitable similarity network can be defined.

Methods

Data Preparation

We tested the quality of the protein rankings produced by RankProp, using as a gold standard the human-annotated SCOP database of protein 3D structural domains.

The SCOP database is organized hierarchically into classes, folds, superfamilies and families. For the purposes of this experiment, two domains that come from the same superfamily are assumed to be homologous, and two domains from different folds are assumed to be unrelated. For pairs of proteins in the same fold but different superfamilies, their relationship is uncertain, and so these pairs are not used in evaluating the algorithm.
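The pair-labeling convention just described can be encoded directly (a minimal sketch; `pair_label` and the tuple representation of a domain are our own illustration):

```python
def pair_label(domain_a, domain_b):
    """Evaluation labels for a pair of SCOP domains, each given as a
    (fold, superfamily) tuple: same superfamily -> homologous (1),
    different fold -> unrelated (0), same fold but different
    superfamily -> uncertain (None, excluded from scoring)."""
    fold_a, sf_a = domain_a
    fold_b, sf_b = domain_b
    if sf_a == sf_b:
        return 1
    if fold_a != fold_b:
        return 0
    return None

# Hypothetical domains: same superfamily, different fold, same fold only.
print(pair_label(("f1", "sf1"), ("f1", "sf1")))   # → 1
print(pair_label(("f1", "sf1"), ("f2", "sf2")))   # → 0
print(pair_label(("f1", "sf1"), ("f1", "sf2")))   # → None
```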

In all the experiments reported here, the SCOP database was concatenated with 101,602 proteins from Swiss-Prot version 40. Using this larger database benefits both PSI-BLAST and RankProp.

PSI-BLAST

PSI-BLAST (v 2.2.2) was used for comparison with RankProp. PSI-BLAST was run with default parameters, including the BLOSUM 62 matrix, but with an E-value threshold of 10,000 for reporting results. PSI-BLAST was allowed to run a maximum of six iterations, which previous work indicates is sufficient for good performance

RankProp

The protein similarity network for RankProp was built using the same version of PSI-BLAST as above. In the network, the weight K_{ij }associated with a directed edge from protein j to protein i is derived from E_{j}(i), the E-value that PSI-BLAST assigns to protein i when protein j is used as the query; note that the network is not symmetric in general, because E_{j}(i) need not equal E_{i}(j).

Given the similarity network, the RankProp algorithm can then be described as follows:

Given a set of objects (in this case, proteins) X = {x_{1}, ..., x_{m}}, let x_{1 }be the query and x_{2}, ..., x_{m }be the database (the targets to be ranked). K_{ij }gives a similarity score between x_{i }and x_{j}, with K_{1i }= K_{i1 }for all i, and y_{i}(t) denotes the ranking score of the i-th point at iteration t.

**1. Initialization**: y_{1}(0) = 1; y_{i}(0) = 0, for i = 2, ..., m

2. **for** t = 1, 2, ... **do**

3. **for** i = 2 **to** m **do**

4. y_{i}(t) = K_{1i }+ α Σ_{j = 2, ..., m }K_{ji }y_{j}(t - 1)

5. **end for**

6. **until convergence**

7. **Termination**: Let y*_{i }denote the converged score of the i-th point. Rank the database by sorting the y*_{i }(largest ranked first).
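The diffusion above can be sketched in Python (a minimal sketch: the toy network weights and the choice α = 0.5 are invented for illustration, and convergence is assumed to follow from α and a suitably scaled K, which the sketch does not enforce):

```python
def rankprop(K, alpha=0.95, tol=1e-8, max_iter=1000):
    """RankProp diffusion on a similarity network.
    K[j][i] is the weight of the directed edge from protein j to
    protein i; index 0 is the query.  Implements the update
        y_i <- K[0][i] + alpha * sum_j K[j][i] * y_j   (for i >= 1),
    with the query's score fixed at 1.  Returns the database indices
    sorted by converged score (best first) and the score vector."""
    m = len(K)
    y = [1.0] + [0.0] * (m - 1)
    for _ in range(max_iter):
        y_new = [1.0] + [
            K[0][i] + alpha * sum(K[j][i] * y[j] for j in range(m))
            for i in range(1, m)
        ]
        if max(abs(a - b) for a, b in zip(y, y_new)) < tol:
            y = y_new
            break
        y = y_new
    order = sorted(range(1, m), key=lambda i: -y[i])
    return order, y

# Tiny toy network (weights invented): proteins 1 and 2 are strongly
# linked, so diffusion ranks protein 2 above protein 3 even though
# their direct links to the query are equal.
K = [[0.0, 0.5, 0.1, 0.1],
     [0.5, 0.0, 0.8, 0.0],
     [0.1, 0.8, 0.0, 0.0],
     [0.1, 0.0, 0.0, 0.0]]
order, scores = rankprop(K, alpha=0.5)
print(order)   # → [1, 2, 3]
```

The toy example shows the point of diffusion: the direct-link ranking ties proteins 2 and 3, but the global structure of the network breaks the tie in favor of protein 2.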

RankProp with probability of homology network

It is possible to define a similarity network for RankProp without resorting to the adjustment of free parameters. This is accomplished by making use of labeled data. To perform superfamily detection using RankProp (performing network propagation), the most natural weight to assign to an edge from protein j to protein i is the probability that x_{i }belongs to the same superfamily as x_{j}. We estimate this probability empirically from the training set. Given the training queries and a matrix of similarity scores K_{ij }= E_{j}(i), we discretize the E-values into bins centered at values b_{k}, and compute N_{k}, the number of times K_{ij }falls into the bin centered at b_{k}, and n_{k}, the number of times that this occurs when x_{i }and x_{j }belong to the same superfamily. The ratio n_{k}/N_{k }is then the empirical probability of belonging to the superfamily of interest for the bin. The mapping was estimated using bins centered at (10^{-20}, 10^{-15}, 10^{-10}, 10^{-9.5}, ..., 10^{-4.5}, 10^{-4}, 10^{-3.75}, ..., 10^{3}).

The resulting map is shown in the transfer function comparison figure. Each E-value E_{j}(i) is then converted to an edge weight by applying this empirical mapping.
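The binning estimate can be sketched as follows (our own illustration; the bin edges and training pairs are invented, and the function simply computes the n_{k}/N_{k} ratios described above):

```python
import bisect

def fit_probability_map(evalue_label_pairs, bin_edges):
    """Estimate P(same superfamily | E-value) empirically.
    For each bin we count N_k (all edges whose E-value falls in the
    bin) and n_k (those connecting same-superfamily pairs), and the
    mapping returns the list of ratios n_k / N_k, one per bin.
    `bin_edges` must be sorted ascending; values below the first edge
    fall in bin 0, etc."""
    counts = [[0, 0] for _ in range(len(bin_edges) + 1)]   # [n_k, N_k]
    for e, same_superfamily in evalue_label_pairs:
        k = bisect.bisect_left(bin_edges, e)
        counts[k][1] += 1
        if same_superfamily:
            counts[k][0] += 1
    return [n / N if N else 0.0 for n, N in counts]

# Toy training pairs (E-value, same-superfamily?): invented data.
pairs = [(1e-12, True), (1e-6, True), (1e-6, False), (5.0, False)]
print(fit_probability_map(pairs, bin_edges=[1e-10, 1e-3, 10.0]))
# → [1.0, 0.5, 0.0, 0.0]
```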

RankProp with adaptive training

In adaptive RankProp, one chooses a different value of σ for each query. The choice is made by a set of regression models, one per candidate value of σ, each trained to predict ranking quality from simple features of the query's PSI-BLAST results.

The input to each regression problem is a 5-dimensional vector, where the features count the number of E-values returned by PSI-BLAST using the given query that are less than 1e-10, 1e-5, 0.1, 1, and 10, respectively. The regression output is the predicted ROC_{1 }score on a validation set when RankProp is run with the given value of σ.

We subtracted the mean from the outputs and normalized the inputs to have mean zero and standard deviation one, and used linear least squares to learn the regression.
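The feature extraction and per-query selection steps can be sketched as follows (a hypothetical sketch: the toy regression models are invented stand-ins for the fitted linear least-squares predictors, which are omitted here):

```python
def evalue_features(evalues, thresholds=(1e-10, 1e-5, 0.1, 1.0, 10.0)):
    """5-dimensional query descriptor: how many PSI-BLAST hits have
    an E-value below each threshold."""
    return [sum(e < t for e in evalues) for t in thresholds]

def choose_sigma(evalues, models):
    """Adaptive RankProp's selection step: `models` maps each
    candidate sigma to a fitted regression, represented here as a
    plain function from feature vector to predicted ROC_1 score.
    Returns the sigma with the best predicted score."""
    x = evalue_features(evalues)
    return max(models, key=lambda sigma: models[sigma](x))

# Invented toy models: prefer small sigma when the query has many
# strong hits (large first feature), large sigma otherwise.
models = {10: lambda x: 0.05 * x[0], 1000: lambda x: 0.3 - 0.01 * x[0]}
print(choose_sigma([1e-12, 1e-11, 1e-3], models))   # → 1000
```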

Authors' contributions

JW developed and implemented the RankProp algorithms, and drafted the manuscript. RK helped design and implement some of the experiments, and revised part of the manuscript. WSN and CL coordinated the research, helped design the experiments and assisted with drafting the manuscript.

Acknowledgements

We would like to thank Andre Elisseeff and Dengyong Zhou for their help with this work, and Jim Kent and Mark Diekhans for their help with implementing the RankProp search capability for the human genome browser. This work is supported by the NSF (awards EIA-0312706 and DBI-0243257) and the NIH (GM74257-01). WSN is an Alfred P. Sloan Research Fellow.