Department of Statistics, Duksung Women’s University, Seoul, South Korea

Department of Biostatistics, Columbia University, New York, USA

Biochemicals and Synthetic Biology Research Center, Korea Research Institute of Bioscience & Biotechnology, Daejeon, South Korea

Abstract

Background

Time course gene expression experiments are an increasingly popular method for exploring biological processes. Temporal gene expression profiles provide an important characterization of gene function, as biological systems are both developmental and dynamic. With such data it is possible to study gene expression changes over time and thereby to detect differential genes. Much of the early work on analyzing time series expression data relied on methods developed originally for static data and thus there is a need for improved methodology. Since time series expression is a temporal process, its unique features such as autocorrelation between successive points should be incorporated into the analysis.

Results

This work aims to identify genes that show different gene expression profiles across time. We propose a statistical procedure to discover gene groups with similar profiles using a nonparametric representation that accounts for the autocorrelation in the data. In particular, we first represent each profile in terms of a Fourier basis, and then we screen out genes that are not differentially expressed based on the Fourier coefficients. Finally, we cluster the remaining gene profiles using a model-based approach in the Fourier domain. We evaluate the screening results in terms of sensitivity, specificity, FDR and FNR, compare with the Gaussian process regression screening in a simulation study and illustrate the results by application to yeast cell-cycle microarray expression data with alpha-factor synchronization.

The key elements of the proposed methodology: (i) representation of gene profiles in the Fourier domain; (ii) automatic screening of genes based on the Fourier coefficients and taking into account autocorrelation in the data, while controlling the false discovery rate (FDR); (iii) model-based clustering of the remaining gene profiles.

Conclusions

Using this method, we identified a set of cell-cycle-regulated time-course yeast genes. The proposed method is general and can be potentially used to identify genes which have the same patterns or biological processes, and help facing the present and forthcoming challenges of data analysis in functional genomics.

Background

Time-course gene expression data are often measured to study dynamic biological systems and gene regulatory networks. Array technologies have made it straightforward to monitor the expression pattern of thousands of genes simultaneously. The challenge now is to interpret such massive data sets. The first step is to extract the fundamental patterns of gene expression inherent in the data. Gene-expression levels can be monitored with cDNA or oligonucleotide chips over a time-course for a temporal process. Following a microarray time series experiment, a key challenge is to extract the continuous representation of all genes throughout the time-course. Identifying significant or differentially expressed genes is challenging because different genes may have different profiles, and because of the noise present in time series expression data. A comprehensive review about time series expression data analysis and the related computational challenges may be found in

Microarrays have recently been used for the purpose of monitoring expression levels of thousands of genes simultaneously and for identifying genes that are differentially expressed. With the number of inferences made in the analysis of microarray data, it is natural to be concerned about multiple testing. This problem of multiplicity can be dealt with by controlling the false discovery rate (FDR)

In the past decade, many approaches to gene selection have been considered including a two sample t-test

There has been considerable research about discovering patterns using clustering and testing including clustering after transformation and smoothing as a technique for nonparametrically estimating and clustering a large number of curves

Smoothing away noise-induced wiggles of gene expression data with Fourier series for microarray data has been considered including an improved Fourier method with irregular or monotonic components of cell-cycle expression

Model-based hierarchical clustering was proposed in character recognition problems using a multivariate normal model

There has been much work done on clustering microarray data, mostly on grouping common expression patterns. However, less attention has been paid to time-course gene studies. Currently the analysis of GETS (gene expression time-series) is commonly performed using a GP (Gaussian process)

In this research, we propose a new method for gene screening using Fourier coefficients to cluster time-courses of genes that exhibit similar patterns.

This paper introduces a methodology for gene selection based on time-course data. The first step is screening, in which we seek to isolate the inactive genes from the active ones, while properly taking into account the serial dependence in the time course data and controlling the FDR, all in the Fourier domain. The second step involves a model-based clustering of the “active” genes, also in the Fourier domain. We evaluate the performance of the methodology using both simulated data and yeast cell cycle data.

Results and discussion

Simulated data

Since real expression data sets are generally noisy and their clusters may not be fully reflective of the class information, we first evaluate the performance of our method with simulated data, for which the “true” classes are known.

We simulate data according to the regression model

where _{
iu
}) = 0 and _{
iu
}’s from an autoregressive AR(1) process with a variety of values of the AR parameter. The regression functions for

Each simulated dataset consists of 800 curves originating from the 6 functions: 400 _{1}, s and 80 curves of each _{2}, ⋯, _{6}, to reflect typical gene expression data. Thus, there are 5 sets of differential genes and 1 set of non-differential genes. The standard deviation of the innovation process was set to

The cosine system

The number of clusters is determined according to the Bayesian Information Criterion (BIC).

Let

Regarding the estimation error, the clustering estimation error rate

where {_{1}, ⋯, _{
n
}} denote the true curves and

We compare our screening method with the recently proposed GPR (Gaussian process regression) screening

To summarize the performance across the 500 replications of 800 curves in each data set, we compute four performance measures to evaluate the procedures:

i. Sensitivity: proportion of differentially expressed genes that are declared significant

ii. Specificity: proportion of non-differentially expressed genes that are not declared significant

iii. False discovery rate (FDR): proportion of genes declared significant that are not differentially expressed

iv. False non-discovery rate (FNR): proportion of genes not declared significant that are differentially expressed

Tables

**Without screening**

**With screening**

* FC: proposed method with Fourier coefficients, **GPR: Gaussian process regression.

Comparison of estimation error rate (E), Silhouette width (S) and Adjusted Rand Index (ARI) values of model-based clustering without screening vs with screening with

AR(1) parameter

Method

Error

Sil

ARI

Error

Sil

ARI

Sensitivity

Specificity

FDR

FNR

FC*

2

.037

.509

.909

.015

.560

.918

.878

.723

.121

.276

3

.020

.471

.921

.016

.484

.932

.860

.783

.139

.216

4

.015

.430

.963

.017

.438

.937

.863

.842

.136

.157

5

.015

.388

.964

.014

.403

.944

.854

.798

.145

.201

8

.015

.305

.964

.017

.317

.940

.851

.836

.148

.163

GPR**

.855

.779

.220

.144

FC

2

.052

.471

.871

.021

.523

.875

.871

.722

.128

.277

3

.036

.423

.912

.026

.443

.888

.846

.783

.153

.217

4

.029

.386

.931

.028

.398

.895

.847

.839

.152

.160

5

.027

.348

.935

.022

.366

.906

.837

.798

.162

.205

8

.028

.274

.936

.029

.287

.895

.830

.836

.169

.163

GPR

.826

678

.321

.173

FC

2

.073

.430

.822

.030

.487

.815

.863

.723

.136

.276

3

.056

.380

.865

.042

.402

.814

.828

.783

.171

.217

4

.052

.339

.875

.045

.356

.825

.827

.834

.172

.165

5

.049

.306

.883

.036

.326

.845

.817

.790

.182

.209

8

.047

.244

.888

.049

.257

.823

.803

.832

.196

.167

GPR

.798

.571

.428

.201

FC

2

.159

.340

.610

.056

.414

.633

.835

.717

.165

.201

3

.139

.287

.663

.093

.329

.591

.775

.768

.224

.231

4

.124

.255

.702

.113

.280

.578

.766

.811

.233

.188

5

.132

.226

.682

.093

.259

.615

.762

.773

.237

.226

8

.143

.181

.649

.134

.205

.562

.730

.815

.269

.184

GPR

.756

.410

.589

.244

FC

2

.266

.287

.345

.088

.357

.351

.755

.704

.244

.295

3

.264

.224

.347

.153

.272

.314

.682

.738

.317

.261

4

.258

.190

.370

.186

.230

.303

.668

.771

.331

.228

5

.258

.171

.375

.161

.211

.317

.676

.745

.324

.255

8

.267

.137

.339

.220

.172

.287

.641

.769

.358

.230

GPR

.731

.335

.664

.268

**Without screening**

**With screening**

* FC: proposed method with Fourier coefficients, **GPR: Gaussian process regression

Comparison of estimation error rate (E), Silhouette width (S) and Adjusted Rand Index (ARI) values of model-based clustering without screening vs with screening with

AR(1) parameter

Method

Error

Sil

ARI

Error

Sil

ARI

Sensitivity

Specificity

FDR

FNR

FC*

2

.326

.259

.200

.235

.292

.149

.589

.708

.411

.291

3

.321

.192

.199

.298

.214

.151

.561

.716

.439

.283

4

.321

.151

.194

.320

.166

.156

.553

.722

.447

.278

5

.323

.125

.185

.269

.142

.156

.571

.714

.428

.285

8

.324

.084

.175

.324

.094

.148

.552

.722

.448

.277

GPR**

.483

.779

.221

.517

FC

2

.343

.253

.164

.234

.291

.117

.567

.697

.432

.302

3

.339

.185

.155

.287

.208

.117

.545

.703

.454

.296

4

.338

.149

.151

.306

.160

.120

.539

.708

.461

.291

5

.337

.125

.146

.261

.138

.120

.555

.702

.445

.297

8

.338

.086

.132

.307

.094

.113

.538

.706

.462

.293

GPR

.536

.677

.323

.463

FC

2

.359

.248

.128

.284

.290

.090

.546

.681

.453

.318

3

.351

.185

.119

.329

.208

.089

.531

.683

.468

.316

4

.350

.148

.115

.347

.159

.092

.526

.685

.473

.314

5

.350

.127

.108

.304

.137

.088

.537

.681

.462

.318

8

.357

.083

.089

.351

.091

.079

.526

.685

.474

.314

GPR

.584

.572

.427

.415

FC

2

.383

.246

.073

.330

.284

.053

.517

.632

.482

.367

3

.375

.183

.066

.356

.198

.051

.512

.634

.488

.365

4

.369

.151

.062

.365

.158

.052

.510

.633

.490

.366

5

.369

.126

.056

.338

.137

.046

.514

.633

.485

.367

8

.370

.086

.046

.370

.092

.042

.509

.634

.490

.365

GPR

.646

.409

.590

.353

FC

2

.395

.248

.035

.356

.275

.030

.505

.504

.495

.422

3

.384

.186

.034

.368

.193

.028

.503

.503

.496

.424

4

.383

.148

.031

.373

.155

.026

.502

.502

.497

.421

5

.381

.125

.028

.358

.134

.024

.504

.504

.496

.419

8

.377

.092

.027

.370

.097

.023

.503

.502

.497

.415

GPR

.679

.337

.662

.320

As shown in Table

Yeast cell cycle data analysis

We have used alpha synchronized yeast cell expression data

Following

**Number. of Fourier coeff**

**Number of clusters without-screening**

**Number of clusters with screening**

**No. of sig. genes**

**Without-screening**

**Screening**

**Med sil**

**Avg. sil**

**Med sil**

**Avg. sil**

Median and average silhouette values for model-based clustering with Fourier coefficients using Euclidean distance with screening vs. without screening.

J = 2

5

4

1715

.160

.112

.451

.388

J = 3

5

4

1735

.204

.146

.505

.426

J = 4

5

4

2227

.174

.119

.552

.485

J = 5

5

4

2792

.029

.015

.048

.028

J = 6

5

4

3071

.196

.119

.041

.043

J = 8

5

4

3050

.136

.055

.032

.024

J = 10

5

4

3142

.153

.092

.031

.010

Judging from the silhouette value, the model-based with 4 Fourier coefficients and 4 clusters was considered most appropriate. Therefore it should be noted that silhouette values of Euclidean distance between two clustering models may not be the only criterion for model comparison. Rather, as in the following gene ontology analysis, clustering should be evaluated based on biological interpretation of results. With

Means of J = 4 sample Fourier coefficients with yeast data

**Means of J = 4 sample Fourier coefficients with yeast data.** The mean profiles of Fourier coefficients in the four clusters and one cluster with genes screened out.

Average gene curves in four clusters and one screened-out cluster

**Average gene curves in four clusters and one screened-out cluster.** The mean curves of gene curves in 4 clusters and one mean curve with genes screened out.

We also applied GPR screening for yeast data with the critical value log (1.5) and identified 1,620 non-quiet differentially expressed genes. A Gaussian mixture model clustering algorithm was applied to these genes indicating 4 clusters. The clusters have 209, 477, 598, and 336 genes, respectively. For this clustering the median silhouette value is 0.508 and the average silhouette value is 0.370, each less than the corresponding silhouette summaries for the proposed procedure.

Owing to noise and the high dimensionality of data, careful consideration of statistical and biological validity is needed when analyzing microarray data. From our review we have found that without plausible interpretation and biological validation, the number of partitions produced by numerical analysis is highly unreliable, and sometimes even misleading.

In order to evaluate our clustering analysis, Gene Ontology (GO) is applied to the clustered genes

**Screened out**

**Cluster 1**

**Cluster 2**

**Cluster 3**

**Cluster 4**

The number of genes screened, screened out and significant genes with respect to GO.

Number of genes

2262

51

29

2077

70

Filtered genes

2041

46

28

1881

64

Significant GO terms

1

36

17

6

37

Even though the number of genes affects to the hypergeometric test, cluster 3 and the screened-out group have only one and six significantly overrepresented GO terms, respectively. This means that all the genes (or some of them) in those clusters rarely share their biological processes and these clusters can be considered as random collections of genes. So, we focused on the three clusters 1, 2, and 4 to find their biological meanings.

In order to identify the relationships among the selected GO terms, we construct a GO graph of each cluster. This graph is constructed by locating the selected terms as leaf nodes and linking all the nodes to their ancestors until their root term. Note that whatever GO terms are selected in a cluster, the cluster has only one GO relationship graph because every GO terms are eventually ended up at the root term if we follow their ancestors. Figure

GO graph of each cluster of yeast data

**GO graph of each cluster of yeast data.**

Among the selected GO terms, we looked into the terms which are connected sequentially from the top node to the bottom (root). This cascade form of GO terms provides more enhanced evidences that the corresponding genes are more closely related in their biological processes than that of a parent–child relationship. These sequentially connected terms are labeled by their subset number in the Figure

**C**

**S**

**GO ID**

**Terms**

**S**

**GO ID**

**Terms**

Cluster 1, 2 and 4 have biologically meaningful genes as shown in the table.

1

1

GO:0006334

Nucleosome assembly

2

GO:0000280

Nuclear division

GO:0031497

Chromatin assembly

GO:0000087

M phase of mitotic cell cycle

GO:0006323

DNA packaging

GO:0048285

Organelle fission

GO:0034728

Nucleosome organization

GO:0022402

Cell cycle process

GO:0006333

Chromatin assembly or disassembly

GO:0000278

Mitotic cell cycle

GO:0016043

Cellular component organization

GO:0007049

Cell cycle

GO:0051276

Chromosome organization

GO:0007067

Mitosis

GO:0006325

Chromatin organization

GO:0000279

M phase

GO:0006996

Crganelle organization

GO:0022403

Cell cycle phase

2

1

GO:0007109

Cytokinesis, completion of separation

2

GO:0071554

Cell wall organization or biogenesis

GO:0000920

Cell separation during cytokinesis

GO:0007047

Cellular cell wall organization

GO:0032506

Cytokinetic process

GO:0071555

Cell wall organization

GO:0000910

Cytokinesis

GO:0070882

Cellular cell wall organization or biogenesis

GO:0031505

Cungal-type cell wall organization

4

1

GO:0051716

Cellular response to stimulus

2

GO:0006302

Couble-strand break repair

GO:0050896

Response to stimulus

GO:0006281

DNA repair

GO:0033554

Cellular response to stress

GO:0000724

Couble-strand break repair via homologous recombination

GO:0006950

Response to stress

GO:0006974

Cesponse to DNA damage stimulus

GO:0034605

Cellular response to heat

3

GO:0006260

DNA replication

GO:0009408

Response to heat

GO:0006273

Cagging strand elongation

GO:0009628

Response to abiotic stimulus

GO:0006261

DNA-dependent DNA replication

GO:0009266

Response to temperature stimulus

GO:0006271

DNA strand elongation during DNA replication

GO:0022616

DNA strand elongation

GO:0006259

DNA metabolic process

Conclusions

A number of recent studies in this field have focused on the analysis of time series of gene expression data, collected by performing DNA microarray experiments at several or more points in time. We have proposed a significance method to identify differentially expressed genes in time course microarray gene expression data using time series screening based on Fourier coefficients controlling FDR and model based clustering with the sample genewise Fourier coefficients, and have compared our screening method with GP screening. Recently spectral mixture kernels

We demonstrated the effectiveness of our approach using model-based clustering of gene profiles. Although we assumed that the residuals follow an AR process, we found that it is not necessary to assume a specific correlation structure since the sample Fourier coefficient estimates themselves do not depend heavily on the underlying covariance structure. The most commonly used techniques are clustering (unsupervised) techniques, which are particularly well suited for an exploratory investigation of this kind of data. The main advantage of the model-based methods is their reliance on a highly structured theoretical basis. Model-based clustering methods are based on the assumption that the data were generated by some underlying model and attempt to infer these models from data. Data generated by the same model is then considered to be “similar” and clustered together. Also, the choice of the optimal number of clusters and the selection of the best model can be performed using sound statistical criteria.

The proposed method is able to identify a set of cell-cycle-regulated genes in yeast and time-course genes. The proposed method is general and can be potentially used to identify genes which have the same patterns or biological processes, and help facing the present and forthcoming challenges of data analysis in functional genomics.

Methods

The Fourier representation model

We observe data _{
iu
},

where _{
iu
}) = 0 and the _{
iu
} values arise from a covariance-stationary process with mean zero and covariance function _{
i
}, _{
i
}(_{
iu
}
_{
i,u + k
}) for all _{
iu
} is the log gene expression of gene

We assume further that the curve _{
i
} belongs to a class of smooth functions as defined below:

where {_{
j
}} is an orthonormal basis system and

We can estimate each _{
i
} using its empirical Fourier coefficients:

which is the projection onto the first

The empirical Fourier coefficients can be computed as

with _{
r
}

Screening out flat genes

Many microarray experiments are aimed at finding ‘active’ genes that vary significantly in expression. Differential expression indicates the changing of transcription levels across different time points, and it is thought that these transcription changes might be responsible for the change in phenotype. For example, the genes responsible for the presence of a certain disease will be transcribed at a different rate than when the disease is absent. Cluster analysis often fails to detect differentially expressed genes that belong to clusters for which most genes do not change because most of the other genes in their clusters do not change significantly.

The problem can be formulated as hypothesis-testing for individual genes as follows:

_{0} : _{
i
}(·) = _{1} : _{
i
}(·) ≠

_{
0
}: Gene

_{
1
}: Gene

In a microarray setting, it is typical to consider thousands of tests simultaneously. In this situation the familywise error rate (FWER) or FDR (false discovery rate), the average proportion of inactive genes among those that were declared active, should be controlled. The FDR procedure

Let

Then reject all _{(g)0}, for _{(g)0} is the associated null hypothesis and _{(g)} is the

where

and

where

The p-values of test statistics

where

Clustering differentially expressed genes

All genes for which the null hypothesis of no change has been rejected will undergo clustering analysis, and this will operate on the Fourier domain representation of each expression profile. The sample Fourier coefficient

By the Central Limit Theorem for dependent data

With the large number of genes monitored in these studies, clustering is a key task for microarray data analysis. It seeks to identify groups of genes with similar expression profiles across samples. Clustering can reduce the effort of studying individual genes and more importantly it can unmask the functional groups among genes.

Considering that the empirical Fourier coefficients of the gene profiles have an asymptotic multivariate normal distribution enables the use of an efficient algorithm to compute the posterior probability that a gene belongs to a certain cluster.

The geometric features (shape, volume, orientation) of each group _{
k
} of the Fourier coefficients. A general framework for exploiting the representation of the covariance matrix is done in terms of its eigenvalue composition

Performance metrics

For evaluating the performance of clustering algorithms, the adjusted Rand Index

Suppose

The silhouette width

where

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

JK initiated this research, outlined the general idea and did the numerical work. RTO improved the method and the simulation scheme. HK accomplished the gene ontology and provided the biological interpretation of the yeast cell cycle data. All authors read and approved the final manuscript.

Acknowledgements

JK’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF 2012–0001879). HK’s research was supported by the KRIBB Research Initiative Program (2012–0111).