Department of Biological Sciences, Columbia University, 1212 Amsterdam Avenue, MC 2441, New York, NY 10027, USA

Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA

Swammerdam Institute for Life Sciences, University of Amsterdam, BioCentrum Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands

Abstract

Background

The genomewide pattern of changes in mRNA expression measured using DNA microarrays is typically a complex superposition of the response of multiple regulatory pathways to changes in the environment of the cells. The use of prior information, either about the function of the protein encoded by each gene, or about the physical interactions between regulatory factors and the sequences controlling its expression, has emerged as a powerful approach for dissecting complex transcriptional responses.

Results

We review two different approaches for combining the noisy expression levels of multiple individual genes into robust pathway-level differential expression scores. The first is based on a comparison between the distribution of expression levels of genes within a predefined gene set and those of all other genes in the genome. The second starts from an estimate of the strength of genomewide regulatory network connectivities based on sequence information or direct measurements of protein-DNA interactions, and uses regression analysis to estimate the activity of gene regulatory pathways. The statistical methods used are explained in detail.

Conclusion

By avoiding the thresholding of individual genes, pathway-level analysis of differential expression based on prior information can be considerably more sensitive to subtle changes in gene expression than gene-level analysis. The methods are technically straightforward and yield results that are easily interpretable, both biologically and statistically.

Introduction

Many of the popular methods for analyzing DNA microarray expression data, from clustering

To understand this, assume that we are comparing two conditions and that the measurement error for the fold-change of individual genes is 20%. Now consider a specific pathway consisting of 100 genes that are all upregulated by 10%. This level of differential expression is well within the noise for individual genes, none of which will therefore be classified as significantly induced. However, the error in the

In recent years, two distinct classes of methods have been developed that use prior information about how genes can be viewed as belonging to different regulatory or functional pathways (Figure

Scoring pathway activity: gene sets versus regression

**Scoring pathway activity: gene sets versus regression**. Two types of prior information, categorical and quantitative, may be combined with non-thresholded genome-wide expression data to derive a statistical measure of pathway-level activity. In (A) a pre-defined gene set (gray), such as those annotated by the Gene Ontology project, is scored using a t-test for its expression response (red = positive, green = negative) compared to all other genes. In (B) estimated interaction strengths (shades of gray), such as those derived from regulatory sequence analysis or ChIP-chip experiments, are correlated with the expression response of all genes. In both instances the result is a t-value (yellow = positive, blue = negative) that measures the change in mRNA expression associated with a category (A) or interaction (B).

In this review, we describe how such pathway-level analyses can be implemented mathematically. It is helpful to understand that, in general, information about genes comes in two different types:

When to use which statistical test.

**First Feature**

**Second Feature**

**(Non)-Parametric?**

**APPROPRIATE TEST**

categorical

categorical

non-parametric

hypergeometric

quantitative

quantitative

parametric

Pearson

quantitative

quantitative

non-parametric

Spearman, Kendall

categorical

quantitative

parametric

two-sample t-test

categorical

quantitative

non-parametric

Wilcoxon-Mann-Whitney, Kolmogorov-Smirnov

When analyzing the statistical association between two features across the genome, the choice of statistical test depends on whether the features are categorical or quantitative, and whether or not a parametric method can be used. For each case the appropriate test is listed.

The traditional approach: scoring over-representation of predefined gene sets

Suppose that we want to know whether a specific set of genes of interest is statistically enriched for genes with a specific annotation in Gene Ontology. In this case, both features (namely, "does the gene belong to the set of genes of interest" and "is the gene associated with GO term X") are categorical, and the appropriate statistic is the

This makes sense: if a fraction

where

and

It is also possible to have significant under-representation (

This use of the cumulative hypergeometric distribution is also known as "Fisher's exact test." The test is by nature non-parametric because both input features are non-parametric. Under specific conditions the hypergeometric distribution may be approximated by the binomial or chi-square distribution. Several implementations of this approach are reviewed by Khatri and Draghici

An alternative: scoring the distribution of expression levels for predefined gene sets

An early example of the use of predefined gene sets to analyze differential expression at the pathway level can be found in Lascaris et al.

Here

Here _{S }and

with _{S }and _{S' }the standard deviation of the expression values of the genes within set

Figure

Scoring GO categories: Fisher's exact test versus two-sample t-test

**Scoring GO categories: Fisher's exact test versus two-sample t-test**. We analyzed gene expression data for the response to the ergosterol biosynthesis inhibitor Lovastatin as measured by Hughes et al. [27]. The two-sample t-test reveals that the mean expression level of genes in the GO category "ergosterol biosynthesis" is significantly higher than expected (dotted line; ^{-13}). Fisher's exact test can be used to score over-representation of the same GO category in the set of most induced genes. However, this requires one to first define a threshold for the expression fold-change of individual genes. The solid line shows how the P-value from Fisher's exact test depends on this threshold.

Other statistical tests have also been used to detect differential expression of gene sets based on the distribution of expression values. The original version of the "gene set enrichment analysis" (GSEA) method

Beyond gene sets: approaches based on regression analysis

The assignment of genes to gene sets is categorical: Either the gene belongs to the set, or it does not. However, gene sets are often a proxy for regulatory pathways. This is most obvious in the case of the gene sets based on ChIP-chip data

_{g }= _{g}

where _{g }is the mRNA expression log-ratio of gene _{g }represents the regulatory network connectivity between the TF and the promoter region of gene _{g }and _{g}, the deviance

is minimized. The solution is given by

and

where <_{g }_{g }denotes an average over all genes and _{g }≡ _{g }- <^{2}> equals the variance of

can be directly related to the slope

It can furthermore be shown that, in the univariate case, ^{2}, defined as the fraction of the variance in expression that can be explained by the linear model, is given by the square of Pearson correlation:

A transformation of

yields a statistic

There are many ways in which the regulatory network connectivities _{g }can be chosen. The first application of regression analysis to microarray data, by Bussemaker et al. ^{2 }obtained with such sequence-based predictors are typically in the range of 1–5%. Another possible choice for ^{2 }observed in this case are usually smaller (< 1%).

Conclusion

In this work, rather than providing a comprehensive review of all relevant literature, we have outlined two conceptually different approaches for scoring differential expression at the pathway level. These methods use prior information about how different genes relate to each other to reduce the dimensionality of the problem. This obviates the need to first obtain gene clusters or modules from expression data over multiple conditions, and thereby makes it possible to analyze each differential expression profile by itself in a condition-specific fashion.

Authors' contributions

HJB drafted the paper, which was edited and proofread by all authors. LDW and AB prepared Figure

Acknowledgements

We thank members of the Bussemaker Lab for valuable discussions, and Barrett Foat for a critical reading of the manuscript. This work was supported by grants HG003008, CA121852, and GM074105 from the National Institutes of Health and grant APB.5504 from the Netherlands Foundation for Technical Research (STW).

This article has been published as part of