Department of Electrical Engineering, Columbia University, New York, NY 10027, USA

The inference of gene regulatory network from expression data is an important area of research that provides insight to the inner workings of a biological system. The relevance-network-based approaches provide a simple and easily-scalable solution to the understanding of interaction between genes. Up until now, most works based on relevance network focus on the discovery of direct regulation using correlation coefficient or mutual information. However, some of the more complicated interactions such as interactive regulation and coregulation are not easily detected. In this work, we propose a relevance network model for gene regulatory network inference which employs both mutual information and conditional mutual information to determine the interactions between genes. For this purpose, we propose a conditional mutual information estimator based on adaptive partitioning which allows us to condition on both discrete and continuous random variables. We provide experimental results that demonstrate that the proposed regulatory network inference algorithm can provide better performance when the target network contains coregulated and interactively regulated genes.

1. Introduction

The prediction of the functions of genes and the elucidation of the gene regulatory mechanisms have been an important topic of genomic research. The advances in microarray technology over the past decade have provided a wealth of information by allowing us to observe the expression levels of thousands of genes at once. With the increasing availability of gene expression data, the development of tools that can more accurately predict gene-to-gene interactions and uncover more complex interactions between genes has become an intense area of research.

1.1. Background

Some of the first attempts at determining gene regulations are based on the gene expression clustering algorithms. These algorithms determine genes that are likely to be coregulated by grouping genes that exhibit similar gene expressions under the same conditions. Different clustering algorithms differ in the metric used to measure similarity between gene expressions, and how the metric is used to cluster into groups similarly expressed genes [1]. In [2], a hierarchical clustering algorithm using a correlation coefficient metric is proposed. The K-means algorithm has also been applied to partition genes into different clusters [3]. Other clustering algorithms such as self-organizing map (SOM) [4], mutual-information-based algorithms [56], and graph-theory-based algorithms [7] have also been proposed.

While gene clustering algorithms allow us to discover genes that are coregulated, they do not reveal much of the underlying biological mechanism such as the regulatory pathways. In recent years, many models have been proposed attempting to understand how individual genes interact with each other to govern the diverse biological processes in the cell. In [8–10], gene regulatory network inference based on graphical models is proposed. A graphical model depicts the relationships among nodes in a graph which are considered as random variables. Links between nodes represent dependence of the two variables. For network inference based on the graphical Gaussian model [1112], the nodes with corresponding random variables

Another method that is related to graphical model is called relevance network. Relevance networks are based on the idea of "covariance graph" where a link exists between genes

While relevance-network-based methods such as ARACNE perform well when the interactions in the gene regulatory network are between pairs of genes, they are unable to completely discover interactions that are results of the joint regulation of the target gene by two or more genes. The XOR interactive regulation is one such interaction that can be recognized only by exploiting the conditional dependence between variables of interest. Using conditional mutual information (CMI), it is possible to detect the XOR and other nonlinear interactive regulation by two genes.

Several recent works have attempted to incorporate information theoretic measures for more than two variables in regulatory network discovery. In [20], a CMI measure where the conditioning variable takes discrete values in two states (high and low) is proposed to discovery the transcriptional interactions in the human B lymphocytes. In [2122], methods based on both MI and CMI have also been proposed to decrease the false positive rate for the detection of the interactions. In [23], the conditional coexpression model is introduced, and the CMI is used as a measure of conditional coexpression. In [24], an extension of the context likelihood of relatedness (CLR) algorithm [25], called "synergy augmented CLR" is proposed. The technique uses the recently developed information theoretic concept of synergy [26] to define a numerical score for a transcriptional interaction by identifying the most synergistic partner gene for the interaction. In this work, we propose a relevance-network-based gene regulatory network inference algorithm similar to [24], using information theoretic measure to determine the relationship between triplets of genes.

1.2. Objective

Here, we make use of both mutual information and conditional mutual information as measures of dependence between gene expressions. The main focus of this work is to discover the potential interactions between genes by adapting the relevance network model, which is also used in [1718]. The inference of the connectivity, or the "wiring" of the network, is also an important aspect of biological network inference. The proposed network inference algorithm uses an adaptive partitioning scheme to estimate the mutual information between

The remainder of the paper is organized as follows. In Section 2, we present the system model for regulatory network inference. In Section 3, we present the adaptive partitioning algorithms for estimating mutual information and conditional mutual information as well as our proposed network inference algorithm based on MI-CMI. In Section 4, we present experimental results. Section 5 concludes the paper.

2. System Model

Suppose that the given set of genes

In a network inference problem, our primary goal is to correctly identify the links representing direct regulation and reduce the false negative and false positive links. A false negative can be due to the incorrect estimation of the metric that measures the interaction between the expressions of two genes. When interactive regulation is introduced into the network, false negatives may occur for certain interactive regulations due to that no significant interaction is detected between the regulated gene and any one of the regulating genes, but rather the regulation is only detectable when the regulated gene and all of the regulating genes are considered together. For example, in Figure

(a) XOR interactive regulation of

**(a) XOR interactive regulation of **. (b) Coregulation of

In the relevance network approach, two nodes are connected when they exhibit high degrees of interaction according to the chosen metric. Using metrics such as correlation coefficient and mutual information, high degrees of interaction between two genes typically indicate that one of the genes is directly or indirectly regulating the other gene, or the two genes are being coregulated by another gene. In relevance networks, indirect regulation and coregulated genes often are the cause of false positive links. ARACNE, as discussed in the previous section, removes indirect regulation by the application of DPI. However, ARACNE and other network inference algorithms based only on correlation coefficient or mutual information are unable to identify genes that are being coregulated, particularly if they are coregulated by the same mechanism. For example, in Figure

The insufficiencies of using only mutual information or correlation coefficient as discussed above naturally lead us to the use of conditional mutual information as the metric of choice in our proposed regulatory network inference algorithm. For Figure

From the above discussion, in the next section, we develop a relevance-network-based regulatory network inference algorithm that utilizes both mutual information and conditional mutual information to predict interactions between genes from the observed gene expression data. It is clear that we need efficient estimators that can accurately compute mutual information and conditional mutual information from data. Moreover, the conditional mutual information estimator should be able to support both discrete and continuous conditioning variables to allow for wider ranging uses.

3. MI-CMI Regulatory Network Inference Algorithm

There are several mutual information estimators such as the Gaussian kernel estimator and the equipartition estimator [29] but each has its weakness. The Gaussian kernel estimator requires a smoothing window that needs to be optimized for different underlying distributions, thus increasing the estimator complexity. While the equipartition estimator is simple in nature, the different grids in a partition often have variable efficiency in terms of contribution to the mutual information estimate due to the underlying sample distribution. In this section, we make use of an adaptive partitioning mutual information estimator proposed in [30] and extend it to estimating conditional mutual information. These estimators are then employed in building our MI-CMI-based relevance network for regulatory network inference.

3.1. Adaptive Partitioning Mutual Information Estimator

Let us consider a pair of random variables

For mutual information estimators that partition the samples according to equal length or equiprobable partition, many of the grids may turn out to be inefficient due to the distribution of the samples. For example, let

In the adaptive partitioning scheme, the sample space

Let us denote

and their densities

respectively, where

where

We define a sequence of the partitioning of the sample space

where

Given the pairs of samples

that is, the frequency of the samples falling into the grid

Furthermore, in [30] it is shown that the residual diversity approaches zero as

Thus, mutual information can be estimated by computing the relative sample frequency on appropriately placed rectangular grids.

We now give the adaptive partitioning algorithm that constructs an asymptotic sufficient sequence of partitions for mutual information estimation.

Algorithm 1 (Adaptive partitioning algorithm for mutual information estimation).

(i) Initialization: Partition

that is,

(ii) Partitioning

Denote

If the sample distribution of the quadrants passes the uniform test, that is, (11) holds,

(iii) Repeat step (ii) for all grids in

(iv) Repeat steps (ii) and (iii) until

(v) Using the partition

Here, we give an example of how to adaptively partition a given set of sampled data. In this example, we sampled 100 times

Quadrant sample counts in each grid after second-level partition, and result of chi-squared test.

**Quadrant 1**

**Quadrant 2**

**Quadrant 3**

**Quadrant 4**

**Pass?**

Grid 1

18

7

6

11

8.4762

no

Grid 2

0

0

7

1

17.0000

no

Grid 3

1

6

1

0

11.0000

no

Grid 4

12

6

18

6

9.4286

no

Example of adaptive partitioning steps for pairwise mutual information

**Example of adaptive partitioning steps for pairwise mutual information**. (a) 100 samples of

In Figure

Quadrant sample counts in each grid after third-level partition, and result of chi-squared test.

**Quadrant 1**

**Quadrant 2**

**Quadrant 3**

**Quadrant 4**

**Pass?**

Grid 1

7

3

5

3

2.4444

yes

Grid 2

2

1

1

3

1.5714

yes

Grid 3

2

3

0

1

3.3333

yes

Grid 4

3

3

4

1

1.7273

yes

Grid 5

2

0

2

3

2.7143

yes

Grid 6

0

0

1

0

3.0000

yes

Grid 7

0

1

0

0

3.0000

yes

Grid 8

2

2

1

1

0.6667

yes

Grid 9

4

2

5

1

3.3333

yes

Grid 10

2

0

1

3

3.3333

yes

Grid 11

1

0

0

0

3.0000

yes

Grid 12

3

2

1

0

3.3333

yes

Grid 13

2

5

5

6

2.0000

yes

3.2. Conditional Mutual Information Estimator

Works in various fields have utilized conditional mutual information to test for conditional independence. However, in most cases, they are often limited to conditioning on a discrete, often binary, random variable [3132]. When conditioning on a discrete random variable, the conditional mutual information can be computed as

This is done by simply dividing the samples into

Let us consider a triplet of random variables

Suppose that the space

and its density

Similar to (3), we can write

where **Q** denotes **R** denotes

We can rewrite

Notice that this is simply a weighted sum for the restricted diversity as computed in (3) for samples grouped according to the

and it can be estimated as

Following the proof in [3033],

We can see from (15) and (17) that

and the integral

in the definition of

We now present the algorithm for estimating the conditional mutual information with continuous conditioning variable.

Algorithm 2 (Adaptive partitioning algorithm for conditional mutual information estimation).

(i) Initialization: partition

that is,

(ii) Partitioning

Denote

If the sample distribution passes the uniform test, that is, if (24) holds, the cuboid

are added to

(iii) Repeat step (ii) for all cuboids in

(iv) Repeat steps (ii) and (iii) until

(v) Using the partition

Figures

Adaptive partition of

**Adaptive partition of **.

Adaptive partition of

**Adaptive partition of **.

Compared to the estimation of conditional mutual information for discrete conditioning variable, we can see that instead of grouping samples into subsets where samples belonging in the same subset have the same values for the discrete-valued conditioning variable, here we group samples based on the adaptively determined partitioning of

Note that the complexity of the Gaussian kernel estimator is known to be

3.3. Gene Regulatory Network Inference Algorithm

To infer a gene regulatory network that has various interactive regulations and coregulations, we propose a strategy of using both mutual information and conditional mutual information to reconstruct the regulatory network. In our proposed algorithm, we first use mutual information as metric to build regulatory network similarly to [17] to capture most of the direct regulations. To decrease the complexity of the algorithm by avoiding computing conditional mutual information for all triplets, while still allowing us to detect most of the causes for false positives and false negatives, we only compute the CMI for triplets of genes where either all three genes are connected, or all three genes are not connected. The decrease in complexity would depend on several factors. Once the pairwise MI threshold is chosen, the triplets that have one or two connections between the three genes indicate that the pairwise MI is sufficient for the determination of the interaction between the three genes, and the use of CMI is not necessary. Thus, instead of computing the CMI for all triplets of genes, CMI needs to be computed only for those triplets that are completely connected or completely unconnected. The amount of decrease in complexity would then depend on the ratio of triplets that have only one or two connections, which would depend on the actual connectivities between the genes, and the threshold selected for the pairwise mutual information phase of the algorithm.

The MI-CMI gene regulatory network inference algorithm is as follows.

(i) For a gene expression dataset containing

(ii) Initialize the graph

(iii) Detecting indirect regulation and coregulation: for any triplet of genes

(a) If

this means that

(b) If

and

(iv) Detecting interactive regulation: for any triplet of genes

(a) If one or two of the CMI estimates is greater than

(b) If all three of the CMI estimates are greater than

4. Experimental Results

In this section, we present simulation results to demonstrate the performance of the algorithms discussed in Section 3. We first illustrate the performance of Algorithm 2 for estimating the conditional mutual information of jointly Gaussian random variables. Next, we consider the performance of Algorithm 1 for estimating mutual information, by implementing the regulatory network inference algorithm in [18], but replacing the Gaussian kernel mutual information estimator employed there with Algorithm 1. Finally, we compare the network inference performance of Algorithm 3 with that of ARACNE [18] and BANJO [11] on synthetic networks.

4.1. Conditional Mutual Information of Jointly Gaussian Random Variables

To assess the accuracy of Algorithms 1 and 2 for the estimation of gene regulatory networks, we consider estimating the pairwise and conditional mutual information of multivariate Gaussian distributions. In our simulation, we compare the MI and CMI estimates of Algorithms 1 and 2 with those of the b-spline estimators. A b-spline MI estimator is proposed in [34] which divides the sample range into a number of bins. Contrary to the approach in the classical histogram estimators, where each sample contributes only to the bin it is in, for the b-spline estimator, the weight of a sample is spread to the bins. In the case of a third-order b-spline estimator, for a sample located in bin

For MI estimation, we generated bivariate Gaussian samples with correlation coefficients 0, 0.3, and 0.6. For each coefficient, we generated

where

For CMI estimation, we generated samples trivariate Gaussian distributions with the following covariance matrices:

For each Gaussian distribution, we generated

where

where

Comparison of the estimated MI of bivariate Gaussian distribution with different correlation coefficient using Algorithm 1 and b-spline algorithm.

Correlation coefficient

Algorithm

100

200

300

400

500

Analytical

0

Adaptive

0.0080

0.0036

0.0022

0.0022

0.0015

0

b-spline 10

0.0912

0.0443

0.0288

0.0210

0.0166

0.3

Adaptive

0.0280

0.0287

0.0305

0.0319

0.0330

0.0472

b-spline 10

0.1248

0.0789

0.0640

0.0562

0.0515

0.6

Adaptive

0.1371

0.1730

0.1916

0.1999

0.2052

0.2231

b-spline 10

0.2471

0.2029

0.1879

0.1781

0.1719

Comparison of the estimated CMI of trivariate Gaussian distribution with different covariance matrices using Algorithm 2 and the modified b-spline algorithm.

Cond. Corr.

Algorithm

100

200

300

400

500

Analytical

0

Adaptive

0.0263

0.0215

0.0171

0.0175

0.0113

0

b-spline 10

0.1899

0.1039

0.0711

0.0536

0.0429

b-spline 20

0.7888

0.5592

0.4330

0.3497

0.2943

0.1612

Adaptive

0.0310

0.0278

0.0253

0.0249

0.0187

0.0132

b-spline 10

0.1899

0.1065

0.0759

0.0603

0.0495

0.3035

Adaptive

0.0497

0.0510

0.0534

0.0565

0.0582

0.0483

b-spline 10

0.2251

0.1377

0.1032

0.0855

0.0761

0.7408

Adaptive

0.2294

0.3050

0.3234

0.3444

0.3784

0.3979

b-spline 10

0.2773

0.2390

0.2190

0.2092

0.2029

b-spline 20

0.6387

0.5323

0.4719

0.4378

0.4121

As a comparison, we performed CMI estimation of covariance matrices 1 and 4 using b-spline estimator with 20 bins. In [34], it is shown that the b-spline method has similar performance to that of the kernel density estimator (KDE), and the MI computed has the same level of significance. However, the KDE is shown to be

Looking more closely at CMI estimation, for small sample size and large CMI value, Algorithm 2 has a negative bias. As the sample size increases, the bias quickly reduces. On the other hand, when the true CMI value is small, Algorithm 2 tends to overestimate. It should be noted that estimating the CMI from a finite number of samples for a distribution with zero conditional correlation coefficient will typically result in a nonzero value. Nevertheless, the estimation results are still reasonably accurate, even for only 100 samples, so that conditional independence can be easily detected.

4.2. Regulatory Networks with Only Direct Regulation

Next, we implemented the algorithm described in [18] by replacing the Gaussian kernel MI estimator there with Algorithm 1. The modified algorithm is then compared with the original ARACNE algorithm in [18]. The purpose of this comparison is to show that the adaptive partitioning MI estimator is a valid alternative for the Gaussian kernel estimator. Specifically, we constructed 25 synthetic regulatory networks, each with 20 one-to-one gene regulations, using NetBuilder [35]. To compare the network inference performance, we adopt the same metrics as used in [18]—recall and precision. Recall, defined as

In Figure

Comparison of ARACNE and relevance-network-based algorithm with adaptive partitioning MI estimator and DPI

**Comparison of ARACNE and relevance-network-based algorithm with adaptive partitioning MI estimator and DPI**.

4.3. Regulatory Networks with Coregulation and Interactive Regulation

We now compare the performance of Algorithm 3, ARACNE, and BANJO for regulatory network inference in the presence of coregulated and interactively regulated genes. We again use the synthetic network modeling software NetBuilder to generate random networks. NetBuilder allows modeling of gene-to-gene interactions such as activation by transcription factor combination (AND and OR), repression (NOT), and other combinatory interactions. We generated 50 synthetic networks, each containing 15 to 25 nodes with 20 links. For each node, we generated 100 steady-state expression data samples. To compare the effects of interactive regulation and coregulation on the performance of the three algorithms, two sets of synthetic networks are constructed: one set contains 25 networks where 30% of the interactions involve interactive regulation and coregulation, the other set contains 25 networks where 60% of the interactions involve interactive regulation and coregulation. In Figures

Precision versus recall for datasets with 30% coregulated or interactively regulated links

**Precision versus recall for datasets with 30% coregulated or interactively regulated links**.

Precision versus recall for datasets with 60% coregulated or interactively regulated links

**Precision versus recall for datasets with 60% coregulated or interactively regulated links**.

In Figure

Figure

True underlying network configuration inferred in Figure 9

**True underlying network configuration inferred in Figure 9**.

(a) Synthetic network inferred by MI-CMI algorithm

**(a) Synthetic network inferred by MI-CMI algorithm**. (b) Synthetic network inferred by ARACNE. (c) Synthetic network inferred by BANJO.

5. Conclusions

We have proposed a new gene regulatory network inference algorithm that employs both mutual information and conditional information to discover possible direct and interactive regulations between genes, and to eliminate false links due to indirect regulations and coregulation. The mutual information and conditional mutual information are estimated from the expression data using an adaptive partitioning estimator. We have shown that the proposed network inference method outperforms BANJO and ARACNE when the underlying regulatory network contains coregulated or interactively regulated genes. In this work, we have focused on the discovery of the joint regulation of a gene by two other genes. It is possible to extend this work to joint regulation by multiple genes by modifying the proposed conditional mutual information estimator to a higher order. However, doing so would pose several computational problems. As the dimension of the CMI increases, increasing number of samples is needed to maintain the same level of accuracy. Also, as the dimension of the CMI increases, the number of sets of genes to be tested also increases, thus rendering this method impractical for brute force computation of all possible sets of genes. One possibility to reduce the amount of computations needed is to take into consideration the constraints placed on the possible connectivities from known biochemical reactions between the genes involved. This can be a future direction for research in this area.