Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

IBM T J Watson Research Center, Department of Mathematical Sciences, Yorktown Heights, NY 10598, USA

Abstract

Background

Time-course gene expression analysis has become increasingly important as more experimental data become available. The detection of periodically expressed genes is an important step that allows us to study the regulatory mechanisms associated with the cell cycle.

Results

In this work, we present the Laplace periodogram, which employs the least absolute deviation criterion to provide a more robust detection of periodic gene expression in the presence of outliers. The Laplace periodogram is shown to perform comparably to existing methods for the

Conclusion

Time-course gene expression data are often noisy due to the limitations of current technology, and may include outliers. These artifacts corrupt the available data and often make the detection of periodicity difficult. The Laplace periodogram is shown to perform well on data both with and without outliers, as well as on data that are non-uniformly sampled.

Background

In the past decade, time-course gene expression datasets have become increasingly available, enabling the study of the dynamic behavior of gene expression and the related regulatory mechanisms, as well as the analysis of the relationships between genes and cellular processes. Of particular interest are the genes that regulate, or are regulated by, the cell-division cycle. A cell-division cycle is a series of sequential steps that are repeated throughout the lifetime of a eukaryotic cell, and it consists of four distinct phases: G_{1} phase, S phase, G_{2} phase, and M phase. The cell-division cycle is regulated by a complex interaction of mechanisms that include genes such as cyclins and cyclin-dependent kinases (CDKs). These genes are known to be expressed periodically with respect to the cell-division cycle.

In recent years, many periodic signal detection algorithms have been proposed to detect periodically expressed genes from their time-course gene expression data. It is well known that for uniformly spaced samples, the classical periodogram can be used to estimate the angular frequency spectrum of the sampled signal. Given a time sequence y_{1}, ..., y_{N}, the classical periodogram is computed as

where _{i }= _{i }+ _{i }_{i }- _{i}) + _{i}, where

While the algorithms listed above have achieved varying degrees of success, they are often limited by factors such as having been developed for a specific dataset, being unable to rank genes, and the nature of time-course gene expression data. In particular, one problem that plagues the detection of periodic signals in time-course gene expression data is that the samples are typically non-uniformly spaced, a consequence of the cell-arrest and measurement methods used in the experiments. One approach to handling non-uniform sampling is to interpolate a continuous signal from the available samples and then resample it at uniformly spaced points. Various works have explored this option, such as linear interpolation

While least-squares fitting to sinusoidal functions allows the treatment of non-uniformly spaced samples, it is well known that the least-squares method is not robust to heavy-tailed noise and outliers, because it assumes that the noise is independent and identically distributed Gaussian.
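The sensitivity of least-squares sinusoidal fitting to a single outlier can be illustrated with a minimal sketch. The sampling times, the frequency, the amplitude, and the helper name `ls_sinusoid_fit` are all assumptions chosen for illustration; none of them come from the datasets used in this paper.

```python
import numpy as np

def ls_sinusoid_fit(t, y, omega):
    """Least-squares fit of y ~ b1*cos(omega*t) + b2*sin(omega*t).

    Works for non-uniformly spaced sampling times t.
    """
    X = np.column_stack([np.cos(omega * t), np.sin(omega * t)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical non-uniformly spaced sampling times (illustrative only).
t = np.array([0.0, 0.7, 1.1, 2.4, 3.0, 4.2, 5.5, 6.1, 7.3, 8.0, 9.4, 10.2])
omega = 2 * np.pi / 5.0           # assumed angular frequency of interest
y_clean = 2.0 * np.cos(omega * t)  # noiseless periodic signal

# Fit on clean data, then on the same data with one large impulse added.
beta_clean = ls_sinusoid_fit(t, y_clean, omega)
y_outlier = y_clean.copy()
y_outlier[3] += 20.0
beta_outlier = ls_sinusoid_fit(t, y_outlier, omega)

print("clean fit:  ", beta_clean)
print("outlier fit:", beta_outlier)
```

On the clean signal the fit recovers the generating coefficients, whereas the single impulse shifts the estimated coefficients noticeably. This is the behavior that motivates a more robust criterion.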

In this paper, we propose the use of the Laplace periodogram

Results and discussion

Periodic gene detection without outliers

In this section, we compare the periodicity detection performance of the Fourier-score-based algorithm

We use the same three sets of benchmarks described in

For each experiment, we ranked the time-course expression of the genes at the normalized cell-division-cycle frequency using the p-values of the scores computed by each of the three algorithms. Since we have only a very small number of samples, we estimated the p-values of the scores at the normalized cell-division-cycle frequency by a bootstrap method similar to
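As a rough illustration of resampling-based p-value estimation, the following sketch uses a generic permutation test: the time series is shuffled repeatedly and the score at the frequency of interest is recomputed each time. This is not necessarily the procedure used in this work. The score function `ls_score`, the function name `permutation_p_value`, and the synthetic data are all hypothetical.

```python
import numpy as np

def ls_score(t, y, omega):
    """Hypothetical score: squared norm of least-squares sinusoid coefficients."""
    X = np.column_stack([np.cos(omega * t), np.sin(omega * t)])
    beta, *_ = np.linalg.lstsq(X, y - np.mean(y), rcond=None)
    return float(beta @ beta)

def permutation_p_value(t, y, omega, score_fn, n_perm=1000, seed=0):
    """Estimate a p-value by comparing the observed score with scores
    obtained after randomly permuting the order of the samples."""
    rng = np.random.default_rng(seed)
    observed = score_fn(t, y, omega)
    count = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # destroys any temporal structure
        if score_fn(t, y_perm, omega) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)

# Synthetic example: 18 samples of a noisy periodic signal (assumed values).
rng = np.random.default_rng(1)
t = np.arange(18, dtype=float)
omega = 2 * np.pi / 9.0
y_periodic = np.cos(omega * t) + 0.1 * rng.standard_normal(t.size)
p_periodic = permutation_p_value(t, y_periodic, omega, ls_score, n_perm=500)
print("estimated p-value:", p_periodic)
```

Any score that can be evaluated at a single frequency, including a robust one, can be passed as `score_fn` in this sketch.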

Since the three benchmark sets discussed above include genes that are known or suspected to be periodically expressed, we evaluate the performance of the algorithms by searching for the genes in these benchmark sets among the highly ranked genes. We search within the top

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the Alpha dataset with no random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the CDC15 dataset with no random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the CDC28 dataset with no random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

The figures plot the ratio of periodic genes, as indicated by the B1, B2, and B3 benchmark sets, that are discovered in a subset of the top scoring genes as scored by the three algorithms. As the number of genes in the subset of top scoring genes increases, the ratio of benchmark periodic genes contained in the subset also increases.
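The quantity plotted can be sketched as follows. The sketch assumes that the detection rate is the fraction of benchmark genes recovered among the top-ranked genes; the gene identifiers, the benchmark set, and the function name `detection_rate` are hypothetical.

```python
def detection_rate(ranked_genes, benchmark, top_k):
    """Fraction of benchmark genes found among the top_k ranked genes.

    ranked_genes is assumed to be ordered from best to worst score.
    """
    benchmark = set(benchmark)
    if not benchmark:
        raise ValueError("benchmark set must not be empty")
    found = benchmark.intersection(ranked_genes[:top_k])
    return len(found) / len(benchmark)

# Hypothetical ranking (best first) and benchmark set.
ranked = ["g7", "g2", "g9", "g1", "g5", "g3"]
benchmark = {"g2", "g5", "g8"}

# Detection rate for growing subsets of top scoring genes.
rates = [detection_rate(ranked, benchmark, k) for k in (2, 4, 6)]
print(rates)
```

Because a larger subset can only contain more benchmark genes, the rate computed this way never decreases as the subset grows.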

From Figures

For this comparison, we use the experimental data provided by

Comparison similar to the

**Detection rate in the top scoring genes by the Fourier-score-based algorithm and the Laplace periodogram for the Arabidopsis dataset**.

Periodic gene detection in the presence of outliers

We now compare the detection performances of the Fourier score, Laplace periodogram, and M-estimator on the same

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the Alpha dataset with random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the CDC15 dataset with random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

**Detection rate in the top scoring genes by the Fourier-score-based algorithm, M-estimator, and the Laplace periodogram for the CDC28 dataset with random impulse added for (a) B1, (b) B2, and (c) B3 benchmark sets**.

From these figures we can see that, with the addition of impulse noise, the Laplace periodogram on average gives better detection performance than the Fourier score for most of the combinations; only in the CDC28-B3 combination does the Laplace periodogram achieve worse detection accuracy than the Fourier score. However, it should be noted that many of the genes in benchmark set B3 are not involved in transcriptional regulation, so only a very small number of genes in B3 are expected to be periodic

where _{N }a scaling factor. The Tukey's biweight function is given as

where

Conclusion

Our simulation results have shown that the Laplace periodogram is a useful tool for detecting periodic time-course gene expression, particularly when the dataset contains outliers and when the sampling intervals are highly uneven. The Laplace periodogram achieves better performance for the

Methods

Laplace periodogram

For time series samples ** y **= [

for the frequency range

where

With ** y **to the regressor

To overcome the weakness of the classical and Lomb-Scargle periodograms in dealing with outliers and heavy-tailed noise, it has been proposed that the ℓ_{2} norm in (6) be replaced with the ℓ_{1} norm, thus replacing least squares with least absolute deviation (LAD). Thus, the Laplace periodogram can be computed for

where we replace the least squares coefficient

Note that the magnitude at each angular frequency can be computed independently of the other frequencies, meaning that if we know exactly the periodicity we are looking for, there is no need to compute the LAD coefficients for the entire frequency spectrum. In

An implementation of the proposed algorithm in MATLAB can be found at

Method for LAD approximation

To solve for the LAD coefficients, we can convert (8) into a set of equations and constraints to be solved using linear programming. Let β = [β_{1} β_{2}]^{T} and x_{t} = [x_{t,1} x_{t,2}]^{T}, where x_{t,1} = cos(ωt) and x_{t,2} = sin(ωt).

where _{t }and _{t }are non-negative variables. By setting _{j }= _{j }- _{j}, where _{j }and _{j }are non-negative variables, we can obtain the best _{1 }approximation by solving the following linear programming problem:

To solve the LAD approximation for non-uniformly spaced samples, we follow the same steps to solve for the LAD coefficient in the following,

where [_{1}, _{2}, ..., _{N }] are the

In this formulation, the LAD coefficients can be readily computed using standard algorithms for solving linear programming problems. For our implementation in MATLAB, we used the LINPROG function in the Optimization Toolbox. In terms of computational time, for the Alpha experiment, which consists of 6075 genes with 18 samples each, computing 1000 permutations for the p-value analysis takes approximately 24 hours on a Pentium Core 2 CPU at 2.66 GHz. This is similar to the time taken by the M-estimator-based method, which is also implemented in MATLAB using the ROBUSTFIT function in the Statistics Toolbox.
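The linear programming approach to the LAD fit can be sketched in Python. This is an illustrative analogue and not the MATLAB implementation described above. It assumes SciPy 1.6 or later for the `"highs"` solver. The function names `lad_sinusoid_fit` and `laplace_score`, the slack-variable layout, and the test data are assumptions. The returned score is the squared norm of the LAD coefficients without any scaling constant, so it may differ from the Laplace periodogram as defined in this section by a constant factor.

```python
import numpy as np
from scipy.optimize import linprog

def lad_sinusoid_fit(t, y, omega):
    """LAD fit of y ~ b1*cos(omega*t) + b2*sin(omega*t) via linear programming.

    Each residual is split as y_t - x_t^T beta = u_t - v_t with u_t, v_t >= 0,
    and sum(u_t + v_t) is minimized. Sampling times t may be non-uniform.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = t.size
    X = np.column_stack([np.cos(omega * t), np.sin(omega * t)])
    # Decision vector: [beta (2 entries), u (n entries), v (n entries)].
    c = np.concatenate([np.zeros(2), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    # beta is left unbounded here instead of being split into two
    # non-negative parts; the solver handles free variables directly.
    bounds = [(None, None)] * 2 + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:2]

def laplace_score(t, y, omega):
    """Unscaled squared magnitude of the LAD coefficients at one frequency."""
    beta = lad_sinusoid_fit(t, y, omega)
    return float(beta @ beta)

# Hypothetical non-uniform samples with one large impulse added.
t = np.array([0.0, 0.7, 1.1, 2.4, 3.0, 4.2, 5.5, 6.1, 7.3, 8.0, 9.4, 10.2])
omega = 2 * np.pi / 5.0
y = 2.0 * np.cos(omega * t)
y[3] += 20.0

beta_lad = lad_sinusoid_fit(t, y, omega)
print("LAD coefficients:", beta_lad)
print("unscaled score:  ", laplace_score(t, y, omega))
```

In this example the LAD coefficients stay at the values that generated the signal despite the impulse, because the remaining samples are fitted exactly and outweigh the single outlier under the absolute-deviation criterion.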

Authors' contributions

KL implemented the Laplace periodogram in MATLAB, performed the simulations and comparisons, and contributed to the writing of the draft. TL developed the Laplace periodogram in his earlier work. Both XW and TL conceived of the project and coordinated its implementation.

Acknowledgements

We would like to thank Harri Lähdesmäki for generously providing us with the source code for their robust regression methods, and Cyclebase.org for the datasets used in the simulations in this work.