Translational and Molecular Imaging Institute, Icahn School of Medicine at Mount Sinai, New York NY 10029, USA

Electrical Engineering Department, Columbia University, New York NY 10027, USA

Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10027, USA

Abstract

Background

DNA pooling constitutes a cost-effective alternative to individual genotyping in genome-wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample, and the frequency of each allele at each position is observed in a single genotyping experiment. Estimating haplotype frequencies from pooled data, in addition to single-locus analysis, is of separate interest within these studies, as haplotypes can increase statistical power and provide additional insight.

Results

We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios with noisy or even missing data are also presented. The resulting method is first applied to simulated data based on the haplotypes, and their associated frequencies, of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7Mb of the HapMap CEU population. Noise and missing data were introduced into these datasets in order to test the extensions of the proposed method. HIPPO and HAPLOPOOL were applied to the same datasets to compare performance.

Conclusions

We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.

Background

In recent years, large genome-wide association studies have been considered a promising approach to identifying disease genes. In these studies, which typically include thousands of individuals, an allele frequency difference for a specific marker between cases and controls can indicate an association between the marker and the disease.

Allele frequencies for cases and controls can be obtained either through individual genotyping or through DNA pooling. Although individual genotyping provides more accurate estimates of individual allele frequencies, as well as haplotypes which enable the study of genetic interactions, DNA pooling has been widely used as it can be more cost-effective in genome-wide association studies.

One of the main concerns with the use of genotype pooling is genotyping error: for a given pooled DNA sample, the standard deviation (SD) of the estimated allele frequency is between 1% and 4%.

Even though the main purpose of pooling is to screen alleles for potential discrepancies between cases and controls, estimating haplotype frequencies across a number of markers is also of interest with the pooled data as it can improve the power of detecting associations with disease. To facilitate haplotype-based association analysis it is necessary to estimate haplotype frequencies from pooled DNA data.

It has been claimed in the literature that pooling DNA samples is efficient for estimating haplotype frequencies. However, the results presented within the context of haplotype frequency estimation algorithms are largely numerical and do not address the statistical properties and efficiency of the estimates being computed. In a recent study, Kuk et al. showed that pooling is advantageous only when the population contains a small number of haplotypes with appreciable frequency. Factors such as linkage disequilibrium (LD) and rare alleles reduce the number of non-rare haplotypes appearing in the population, and pooling could still remain more efficient either for a larger number of loci or when the pool size is kept considerably small, as suggested by Barratt et al.

A variety of haplotype estimation methods from pooled genomic data have been proposed in the literature that fall into two large categories. The first category consists of methods that focus on a small number of markers but allow for considerably larger pool sizes while the second category of methods allows for a larger number of markers but for a small number of individuals per pool.

As haplotype frequency estimation from pooled genomic data can be seen as a missing data problem, it comes as no surprise that the majority of methods focusing on small pool sizes use the expectation-maximization (EM) algorithm to maximize the multinomial likelihood.

Haplotype frequency estimation from large genotype pools was first addressed by Zhang et al.

There is also an equivalence between haplotype frequency estimation and the inference of the relative abundances of species in metagenomics studies, a problem addressed by Kessner et al.

In this study we present an algorithm for haplotype frequency estimation based on the maximum parsimony principle. A mathematical framework is presented in which this principle is translated into a **joint** sparsity requirement, and the frequency inference is performed using the alternating direction method of multipliers (ADM). Our method focuses on datasets that have a small number of individuals per pool and a considerably large number of markers. We compare our method with the best performing methods from the two pooling algorithm categories presented above, namely HIPPO and HAPLOPOOL, over a variety of marker and dataset sizes. All our comparisons represent scenarios for which, based on the guidelines of Kuk et al., pooling is expected to be beneficial relative to individual genotyping.

Results and discussion

In this section we first describe the datasets and figures of merit used to evaluate the method, and then present the results of comparing our method, ADM, with HIPPO and HAPLOPOOL.

All our comparisons were performed in scenarios where the use of pooling is potentially beneficial relative to individual genotyping according to the guidelines of Kuk et al.

In real applications, it is very often the case that studies are performed on datasets for which partial knowledge of the existing haplotypes is already available (for example, datasets from HapMap-studied populations). This information can be used as a basis for an accurate definition of the haplotype dictionary matrix **H**, as defined in the Methods section, so that the number of possible haplotypes considered remains manageable.

The presented method is based on the augmented Lagrangian expansion of a constrained optimization problem, which has an associated penalty parameter. As a convergence criterion, we compare the ℓ_{2}-norm of the difference between the solutions at two consecutive steps with the ℓ_{2}-norm of the solution at the current step multiplied by a small tolerance (10^{−20} in our experiments). If the first term is smaller, the algorithm is considered to have converged.

Datasets

To examine the performance of our methodology, we considered in our experiments real datasets for which estimates of the haplotype frequencies were already available and which cover a variety of dataset sizes.

We first simulated data using the 10-loci haplotypes of the AGT gene and their associated frequencies considered in Yang et al., listed in the table below.

| **Haplotype** | **Frequency** |
| --- | --- |
| 1 1 1 1 0 1 1 0 0 0 | 0.033 |
| 1 1 0 1 0 1 1 1 1 0 | 0.016 |
| 1 1 0 1 0 0 1 0 0 1 | 0.017 |
| 1 0 0 1 0 1 1 0 0 1 | 0.017 |
| 1 1 0 1 0 1 1 0 0 1 | 0.017 |
| 1 1 1 1 0 1 1 1 0 1 | 0.507 |
| 0 1 0 1 1 0 0 1 1 1 | 0.017 |
| 1 1 0 0 0 0 1 1 1 1 | 0.033 |
| 0 1 0 1 0 0 1 1 1 1 | 0.1 |
| 1 1 0 1 0 1 1 1 1 1 | 0.193 |
| 1 1 1 1 1 1 1 1 1 1 | 0.05 |
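The pooled observations used in the simulations can be generated directly from this table: for each pool, draw 2n haplotypes from the tabulated distribution and record the per-locus relative allele frequencies. A minimal sketch (the pool size below is an arbitrary illustration, not a value from the study):

```python
import numpy as np

rng = np.random.default_rng(0)

# AGT haplotypes (rows) and their population frequencies, from the table above.
H = np.array([
    [1,1,1,1,0,1,1,0,0,0],
    [1,1,0,1,0,1,1,1,1,0],
    [1,1,0,1,0,0,1,0,0,1],
    [1,0,0,1,0,1,1,0,0,1],
    [1,1,0,1,0,1,1,0,0,1],
    [1,1,1,1,0,1,1,1,0,1],
    [0,1,0,1,1,0,0,1,1,1],
    [1,1,0,0,0,0,1,1,1,1],
    [0,1,0,1,0,0,1,1,1,1],
    [1,1,0,1,0,1,1,1,1,1],
    [1,1,1,1,1,1,1,1,1,1],
])
freq = np.array([0.033,0.016,0.017,0.017,0.017,0.507,0.017,0.033,0.1,0.193,0.05])
freq = freq / freq.sum()  # guard against rounding in the published values

def simulate_pool(n_individuals):
    """Draw 2n haplotypes from the population and return the observed
    per-locus relative allele frequencies of the pool."""
    draws = rng.choice(len(freq), size=2 * n_individuals, p=freq)
    return H[draws].mean(axis=0)

a = simulate_pool(5)  # one pool of 5 individuals (10 haplotypes)
print(a)              # 10 allele frequencies, each a multiple of 0.1
```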

The second dataset consisted of SNPs from the first 7Mb (742 kb to 7124.8 kb) of the HapMap CEU population (HapMap 3 release 2, phasing data).

Performance criteria

Assume first that **g** = [g_{1} ⋯ g_{M}]^{T} is the gold standard haplotype frequency vector observed in the population for a given dataset, and **f** = [f_{1} ⋯ f_{M}]^{T} is the estimated haplotype frequency vector. The following figures of merit were used:

χ^{2} distance: the χ^{2} distance between the two distributions **g** and **f**.

ℓ_{1} distance: the ℓ_{1} distance between the two distributions, defined as ∥**f** − **g**∥_{1} = ∑_{m} |f_{m} − g_{m}|.
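Both figures of merit are straightforward to compute. The sketch below assumes the common Pearson-style convention for the χ^{2} distance; the paper's exact normalization is not reproduced here and may differ:

```python
import numpy as np

def l1_distance(g, f):
    """l1 distance between two frequency vectors: sum of |f_m - g_m|."""
    return np.abs(np.asarray(g, float) - np.asarray(f, float)).sum()

def chi2_distance(g, f, eps=1e-12):
    """Pearson-style chi-square distance (one common convention);
    eps guards against zero gold-standard frequencies."""
    g, f = np.asarray(g, float), np.asarray(f, float)
    return np.sum((f - g) ** 2 / (g + eps))

g = np.array([0.5, 0.3, 0.2])  # gold standard frequencies (toy values)
f = np.array([0.4, 0.4, 0.2])  # estimated frequencies (toy values)
d1 = l1_distance(g, f)
print(d1)  # 0.1 + 0.1 + 0.0 = 0.2
```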

Frequency estimation

We have examined the accuracy of our method and compared it against HIPPO and HAPLOPOOL on the AGT gene and HapMap datasets described in the previous subsection. The figures report the performance of the methods in terms of the χ^{2} and ℓ_{1} distances over 100 simulation experiments. ADM demonstrated superior performance for both figures of merit.

**Measures of performance of HAPLOPOOL, HIPPO and ADM applied to the AGT gene dataset.**

**Measures of performance of HAPLOPOOL and ADM applied to the HapMap dataset.**

For the HapMap dataset, ADM again outperformed HAPLOPOOL for both figures of merit.

From our experiments we can also see that the number of pools affected accuracy: all algorithms demonstrated improved performance as the number of pools in the dataset increased.

Noise and missing data

We have further evaluated the performance of our method in the presence of measurement error. We simulated genotyping error by adding Gaussian noise with standard deviation σ to each observed allele frequency: if a_{ij} denotes the true allele frequency, the perturbed allele frequency is drawn from N(a_{ij}, σ^{2}). To obtain the allele counts, we discretize each perturbed allele frequency to the closest allowed frequency, which depends on the number of individuals per pool, and obtain the allele counts accordingly.
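The perturb-and-discretize procedure can be sketched as follows; the noise level and pool size below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb_and_call(a, sigma, n_individuals):
    """Add N(0, sigma^2) noise to each allele frequency, then snap each
    value to the nearest allowed frequency k/(2n), k = 0..2n."""
    noisy = a + rng.normal(0.0, sigma, size=len(a))
    grid = 2 * n_individuals                      # number of haplotypes in pool
    counts = np.clip(np.round(noisy * grid), 0, grid)
    return counts / grid

a = np.array([0.2, 0.5, 0.9])                     # true allele frequencies (toy)
p = perturb_and_call(a, sigma=0.02, n_individuals=5)
print(p)                                          # frequencies snapped to k/10
```

Note that for a small pool (e.g. n = 5, grid spacing 0.1) the discretization absorbs most perturbations, which is why small pools are more robust to genotyping noise.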

We selected the values of σ in accordance with the genotyping error levels reported in the literature.

**Measures of performance of HAPLOPOOL and ADM applied to the HapMap dataset with noisy observations.**

We can see that ADM demonstrates superior performance compared to competing methods and, as expected, its performance deteriorates with increasing noise levels. The results also demonstrate that pooling errors affect pools containing a large number of individuals more severely. The reason, as noted before, is that in smaller pools the gap between allowable frequency calls is much larger, resulting in a smaller percentage of miscalled allele counts and thus in better frequency estimates.

We further set a realistic percentage of SNPs to be missing (1% and 2% per dataset) and demonstrated the accuracy of our modified methodology. As the results show, the method remains accurate in the presence of missing data.

**Measures of performance of ADM applied to the HapMap dataset with missing observations.**

Conclusions

In this study we have presented a method for estimating haplotype frequencies from pooled data based on the maximum parsimony principle. A novel mathematical framework is introduced where this principle is translated to finding a sparse representation of the observed DNA pools in a dictionary of haplotypes. This leads to an optimization problem that is solved with the alternating direction method of multipliers. The proposed method is also extended to scenarios where noisy and missing data is present in the considered DNA pools, and is able to process pools with a large number of SNPs.

Numerical experiments using synthetic and real data have shown improved performance with respect to the best existing haplotype frequency inference methods. In particular, the proposed ADM method is efficient and performs better than HIPPO and HAPLOPOOL on the considered datasets, which consist of pools with a small number of individuals and a large number of markers.

Methods

Overview

This section provides a description of the proposed method for haplotype frequency inference based on the maximum-parsimony principle. The method seeks to discover the frequencies of the haplotypes present in a population given the observed relative frequencies of each allele in each DNA pool. In order to obtain a biologically meaningful estimate, the proposed method makes use of the maximum-parsimony principle, which attempts to minimize the total number of haplotypes observed in the sample.

Each pool has an associated vector of observed relative frequencies that, within the proposed mathematical framework, can be expressed as a linear combination of the haplotypes of a dictionary. This dictionary of haplotypes can be constructed using information from external databases. Since all pools are drawn from the same population, the sparse representations of all pools are sought **jointly**.

This framework for haplotype frequency estimation leads to a joint constrained sparse optimization problem. This kind of optimization problem has been studied in the compressed sensing literature, where the alternating direction method of multipliers (ADM) has been proposed to find the corresponding solution. The proposed method adapts the ADM to the haplotype frequency estimation problem.

Maximum-parsimony haplotype frequency inference framework

The proposed method estimates the frequencies of haplotypes consisting of K SNPs. We define the haplotype dictionary **H** as a K×M binary matrix whose M columns are the candidate haplotypes; to construct **H**, we can use information from external databases. Pool i contains n_{i} individuals (and therefore 2n_{i} haplotypes). Moreover, we define **p**^{i} as the vector of relative haplotype frequencies in pool i, whose entries satisfy

∥**p**^{i}∥_{1} = 1,

where ∥·∥_{1} denotes the ℓ_{1}-norm of the vector **p**^{i}. Since **p**^{i} ≽ **0**, we have ∥**p**^{i}∥_{1} = **1**^{T}**p**^{i} = 1.

Then the haplotype frequency estimation problem can be stated as follows: given the observed relative allele frequencies **a**^{i} of each pool, find frequency vectors **p**^{i} such that **H****p**^{i} = **a**^{i}, where **a**^{i} is K×1 and **p**^{i} is M×1. Following the maximum parsimony principle, the number of distinct haplotypes used to explain the observations **a**^{i} should be as small as possible. Therefore, the maximum parsimony haplotype inference problem is stated as follows: given the set {**a**^{i}, i = 1,…,P} of observed frequency vectors for P pools of n_{i} subjects each, find frequency vectors **p**^{i}, i = 1,…,P, that use as few columns of **H** as possible to explain all the observed frequency vectors **a**^{i}.

Haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA

We define the matrices **X** = [**p**^{1} ⋯ **p**^{P}] and **A** = [**a**^{1} ⋯ **a**^{P}], so that the per-pool constraints become **H****X** = **A**. The maximum parsimony principle then requires that all pools jointly use as few columns of **H** as possible, which is equivalent to requiring the inferred solution **X** to be row-sparse. Let **x**^{i} and **x**_{j} be the i-th row and j-th column of **X**, respectively, and define a vector **e**(**X**) containing the energy of each row of matrix **X**, i.e., e_{i}(**X**) = ∥**x**^{i}∥_{2}. Then row-sparsity implies finding the solution to the following optimization problem:

min_{**X**} ∥**e**(**X**)∥_{0} subject to **H****X** = **A**, **X** ≽ **0**, **1**^{T}**X** = **1**^{T}, (3)

where ∥**e**(**X**)∥_{0} is the ℓ_{0} norm of the vector **e**(**X**) and corresponds to the number of non-zero components of the vector. This means that the solution will have as many all-zero rows as possible.

However, minimizing an ℓ_{0} norm is computationally intractable, as it involves solving a combinatorial problem. One option, well studied in the compressed sensing literature, is to replace the ℓ_{0} norm with the ℓ_{1} norm, as it promotes sparsity and leads to tractable solutions. The inference then becomes the solution to

min_{**X**} ∥**e**(**X**)∥_{1} subject to **H****X** = **A**, **X** ≽ **0**, **1**^{T}**X** = **1**^{T}. (4)

This matrix problem lies within the convex optimization framework. In the most general case, where there is no prior information regarding the possible haplotypes to be considered, the size of the matrix **H** grows exponentially with the number of SNPs. In what follows we present an efficient method to find the solution to (4) by means of the alternating direction method of multipliers (ADM). The ADM solves small local problems in order to uncover the global solution, with proven convergence; that is, the ADM is guaranteed to find the optimal solution to (4).
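To make the objective and constraints of the relaxed problem concrete, the row-energy vector **e**(**X**) and a feasibility check can be written in a few lines; the tiny dictionary below is a hypothetical illustration, not data from the study:

```python
import numpy as np

def row_energies(X):
    """e(X): the l2 norm of each row of X."""
    return np.linalg.norm(X, axis=1)

def l21_objective(X):
    """||e(X)||_1, the convex surrogate for the number of used haplotypes."""
    return row_energies(X).sum()

def feasible(H, X, A, tol=1e-8):
    """Check the constraints: H X = A, X >= 0, each column of X sums to 1."""
    return (np.allclose(H @ X, A, atol=tol)
            and np.all(X >= -tol)
            and np.allclose(X.sum(axis=0), 1.0, atol=tol))

# Toy instance: 2 haplotypes over 3 SNPs, 2 pools, exact mixture.
H = np.array([[1, 0], [1, 1], [0, 1]])   # columns are haplotypes
X = np.array([[0.5, 0.8], [0.5, 0.2]])   # per-pool haplotype frequencies
A = H @ X                                # observed allele frequencies
ok = feasible(H, X, A)
obj = l21_objective(X)
print(ok, obj)
```

Here both rows of **X** are non-zero, so both haplotypes are "used"; a row-sparser feasible **X**, if one existed, would achieve a smaller objective.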

Alternating direction method of multipliers

Given two convex functions f and g, consider the generic problem

min_{**x**,**z**} f(**x**) + g(**z**) subject to A**x** + B**z** = **c**, (5)

with associated augmented Lagrangian

L_{ρ}(**x**, **z**, **y**) = f(**x**) + g(**z**) + **y**^{T}(A**x** + B**z** − **c**) + (ρ/2)∥A**x** + B**z** − **c**∥_{2}^{2}, (6)

where ρ > 0 is a penalty parameter and **y** is the dual variable.

Minimizing (6) with respect to **x** and **z** jointly is usually not tractable. Instead, the alternating direction method of multipliers iterates a minimization of (6) over **x** for a fixed **z**, followed by a minimization of (6) with respect to **z** for a fixed **x** and a dual variable update; that is:

1 Set the penalty parameter ρ

2 Set the convergence tolerance

3 Initialize **x**^{0}, **z**^{0} and the dual variable **y**^{0}

4 Repeat

5 **x**^{k+1} = arg min_{**x**} L_{ρ}(**x**, **z**^{k}, **y**^{k})

6 **z**^{k+1} = arg min_{**z**} L_{ρ}(**x**^{k+1}, **z**, **y**^{k})

7 **y**^{k+1} = **y**^{k} + ρ(A**x**^{k+1} + B**z**^{k+1} − **c**)

8 until convergence
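The generic x-step / z-step / dual-step pattern above can be illustrated on a standard problem with closed-form updates. The sketch below applies the iteration to a lasso-type objective, min ½∥C**x**−**d**∥² + λ∥**z**∥₁ s.t. **x** − **z** = **0** (scaled dual form); this is not the haplotype problem itself, only a self-contained demonstration of the ADM pattern:

```python
import numpy as np

def soft(v, k):
    """Component-wise soft-thresholding, the prox of k*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(C, d, lam=0.1, rho=1.0, iters=200):
    """ADM for min 0.5||Cx - d||^2 + lam*||z||_1  s.t. x - z = 0."""
    n = C.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u: scaled dual variable
    M = np.linalg.inv(C.T @ C + rho * np.eye(n))      # x-step matrix, precomputed
    for _ in range(iters):
        x = M @ (C.T @ d + rho * (z - u))             # minimize over x, z fixed
        z = soft(x + u, lam / rho)                    # minimize over z, x fixed
        u = u + x - z                                 # dual variable update
    return z

rng = np.random.default_rng(2)
C = rng.normal(size=(30, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -1.5]                          # 2-sparse ground truth
d = C @ x_true                                        # noiseless observations
x_hat = admm_lasso(C, d, lam=0.05)
print(np.round(x_hat, 2))  # estimate should be sparse, supported on entries 1 and 4
```

Precomputing the x-step matrix outside the loop mirrors the precomputation of **U**_{4} in the algorithms below.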

Joint constrained sparse haplotype frequency estimation algorithm

Introducing the auxiliary variable **Z**, (4) can be restated in order to apply the ADM and obtain closed-form expressions for the local minimization steps as follows.

This optimization problem can be restated in the framework of (5) by defining suitable block matrices and vectors, where ⊗ denotes the Kronecker product and **0** the all-zero vector. The function g is taken to be the indicator function I_{S} of the constraint set S, where I_{S}(**x**) = 0 if **x** ∈ S and I_{S}(**x**) = ∞ otherwise, and the components of **z** correspond to the entries of the auxiliary matrix **Z**.

With these definitions, the steps of the ADM specialize to the following algorithm:

1 Set the penalty parameter

2 Set the convergence tolerance

3 Set **X**^{0} = **0** and initialize the auxiliary and dual variables

4 Precompute **U**_{4} = (**I** − **H**^{T}(**I** + **H****H**^{T})^{−1}**H**)

5 Repeat

6-11 closed-form updates of **X**, **Z** and the dual variables, where the max operation of the shrinkage step is component-wise

12 until convergence

Extensions

Noisy data

Measurement errors in determining allele frequencies are considerable in DNA pools, presenting a variance between 0.02 and 0.04. In this setting, the constraint **H****X** = **A** is too restrictive and can be relaxed in order to take the measurement noise into account. In particular, we introduce a parameter ε that bounds the allowed deviation between **H****X** and the observations **A**.

Introducing the auxiliary variables **Z**_{1} and **Z**_{2}, the ADM can be used to solve (10) by solving an equivalent problem in which the relaxed data-fidelity constraint is imposed on **Z**_{2}. This simple transformation allows us to obtain closed-form expressions for the local minimization steps of the ADM.

The maximum parsimony solution to the haplotype frequency inference problem with noisy observations can be found by following these steps, where the updates of **X**^{k+1} and the auxiliary variables again have closed form:

1 Set the penalty parameter

2 Set the convergence tolerance

3 Set **X**^{0} = **0** and initialize the auxiliary and dual variables

4 Precompute **U**_{4} = (**I** − **H**^{T}(**I** + **H****H**^{T})^{−1}**H**)

5 Repeat

6-12 closed-form updates of **X**, **Z**_{1}, **Z**_{2} and the dual variables, including the projection of **Z**_{2} onto the relaxed data-fidelity constraint

13 until convergence

Missing data

Errors often occur during the genotyping process, and the data at some loci might not have been observed. We present modifications to the algorithms to perform haplotype inference in the presence of missing data. We assume that it is known a priori at which loci the genotype information is missing for each pool.

The presence of missing data in the genotypes of a given pool implies a smaller number of constraints. Let **H**_{i} be the matrix obtained from **H** by removing all the rows corresponding to the missing loci of pool i. Notice that different pools present missing information at different loci, making the matrix dependent on the considered pool.
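Constructing the reduced per-pool system amounts to deleting the rows of **H** (and the corresponding entries of **a**^{i}) at the missing loci. A minimal sketch, with a hypothetical observation mask:

```python
import numpy as np

def reduced_system(H, a_i, observed_mask):
    """Drop the rows of H (and entries of a_i) at loci where the pool's
    genotype call is missing; observed_mask[k] is False at missing loci."""
    obs = np.asarray(observed_mask, bool)
    return H[obs, :], np.asarray(a_i)[obs]

H = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])   # 4 loci, 2 haplotypes (columns)
a = np.array([0.5, 1.0, 0.5, 1.0])               # observed allele frequencies
mask = [True, True, False, True]                 # locus 3 missing in this pool
H_i, a_i = reduced_system(H, a, mask)
print(H_i.shape)  # (3, 2): one constraint row removed
```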

The solution to the haplotype inference problem can be found by solving

The ADM is also used to find the solution to this optimization problem; the resulting steps to find the haplotype frequency estimates are:

1 Set the penalty parameter

2 Set the convergence tolerance

3 Set **X**^{0} = **0** and initialize the auxiliary and dual variables

4 Precompute **U**_{4} = (**I** − **H**^{T}(**I** + **H****H**^{T})^{−1}**H**)

5 Repeat

6-15 closed-form updates of **X**, the auxiliary variables and the dual variables, where the updates involving the data-fidelity constraints are performed in a for-loop over the pools, each pool using its own reduced matrix **H**_{i}

16 until convergence

Large number of SNPs

When the number of SNPs is large, the size of the matrix **H** increases dramatically. One approach in this case is to partition the data into blocks and process one block at a time. After all blocks are processed, a ligation step is performed to obtain the final result. We adopt the partition-ligation (PL) method.

The PL method starts with the partition phase: the vectors of observed relative frequencies **a**^{i} are split into consecutive blocks of loci, and the haplotype frequencies are estimated independently within each block.

Then, the PL proceeds to a ligation phase, where adjacent blocks are merged to obtain a new partition of the data; the dictionary **H** of a merged block contains every possible concatenation of the retained haplotypes of the two blocks being merged.

In order to use the PL method, we need to determine an initial partition of the data; therefore, we need to specify the number of partitions, or equivalently the number of loci per block.
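The ligation step, in which the merged dictionary enumerates all concatenations of the retained block haplotypes, can be sketched as follows (toy dictionaries, with haplotypes as columns):

```python
import numpy as np

def ligate(H_left, H_right):
    """Merge two adjacent blocks: the dictionary of the merged block
    contains every concatenation of a retained left haplotype with a
    retained right haplotype (columns are haplotypes)."""
    ml = H_left.shape[1]
    mr = H_right.shape[1]
    top = np.repeat(H_left, mr, axis=1)      # each left column repeated mr times
    bottom = np.tile(H_right, (1, ml))       # right columns cycled ml times
    return np.vstack([top, bottom])          # (k_left + k_right) x (ml * mr)

H1 = np.array([[0, 1], [1, 1]])              # block 1: 2 loci, 2 haplotypes
H2 = np.array([[0, 1], [1, 0]])              # block 2: 2 loci, 2 haplotypes
H12 = ligate(H1, H2)
print(H12.shape)  # (4, 4): all 2*2 concatenations over the 4 loci
```

The multiplicative growth of the merged dictionary (ml·mr columns) is exactly why only the haplotypes retained with appreciable frequency in each block are carried into the ligation phase.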

Availability of supporting data

Our method is available for download at

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XW and DA conceived of the study. GHJ, AI, DA and XW participated in the design of the study. GHJ and AI performed the computer experiments and wrote the first draft of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the research grant DBI-0850030 from the National Science Foundation.