Center for Computational Biology and Bioinformatics and Department of Electrical Engineering, Columbia University, New York, NY, USA

Abstract

Background

Typically, the first phase of a genome wide association study (GWAS) includes genotyping across hundreds of individuals and validation of the most significant SNPs. Allelotyping of pooled genomic DNA is a common approach to reduce the overall cost of the study. Knowledge of haplotype structure can provide additional information to single locus analyses. Several methods have been proposed for estimating haplotype frequencies in a population from pooled DNA data.

Results

We introduce a technique for haplotype frequency estimation in a population from pooled DNA samples focusing on datasets containing a small number of individuals per pool (2 or 3 individuals) and a large number of markers. We compare our method with the publicly available state-of-the-art algorithms HIPPO and HAPLOPOOL on datasets of varying number of pools and marker sizes. We demonstrate that our algorithm provides improvements in terms of accuracy and computational time over competing methods for large number of markers while demonstrating comparable performance for smaller marker sizes. Our method is implemented in the "Tree-Based Deterministic Sampling Pool" (TDSPool) package which is available for download at

Conclusions

Using a tree-based determinstic sampling technique we present an algorithm for haplotype frequency estimation from pooled data. Our method demonstrates superior performance in datasets with large number of markers and could be the method of choice for haplotype frequency estimation in such datasets.

Background

In recent years large genetic association studies involving hundreds or thousands of individuals have become increasingly available, providing opportunities for biological and medical discoveries. In these studies, hundreds of thousands of SNPs are genotyped for the cases and the controls, and discrepancies between the haplotype distributions indicate an association between a genetic region and the disease. Typically, the first phase of a GWAS includes genotyping across hundreds of individuals and validation of the most significant SNPs. One possible approach to reducing the overall cost of GWAS is to replace individual genotyping in phase I with allelotyping of pooled genomic DNA

Rather than examining SNPs independent of each other, simultaneously considering the values of multiple SNPs within haplotypes (combinations of alleles at multiple loci in individual chromosomes) can improve the power of detecting associations with disease and is also of general interest with the pooled data. To facilitate haplotype-based association analysis it is necessary to estimate haplotype frequencies from pooled DNA data.

A variety of algorithms have been suggested to estimate haplotype frequencies from pooled data. Available methods fall into two large categories. The first category consists of methods that focus on accurate solutions for small pool sizes (2 or 3 individuals per pool) and considerably large genotype segments. Many well known approaches that focus on small pool sizes use an expectation-maximization (EM) algorithm for maximizing the multinomial likelihood

Naturally, pooling techniques are more prone to errors and offer less possibilities for assessing the quality of the data than individual genotyping. As argued and discussed by Kirkpatrick et al.

In a recent study Kuk et al. ^{3} haplotypes) and up to four individuals per pool, whereas accuracy decreases with increasing pool size and number of loci. Rare alleles or linkage disequilibrium (LD) (or both) decrease the number of haplotypes that appear with non-negligible frequencies and thus pooling could remain efficient for larger haplotype blocks. In general, pooling could still remain more efficient in the case where only a small number of haplotypes can occur with appreciable frequency, as also suggested in Barratt et al.

In this paper we propose a new tree-based deterministic sampling method (TDSPool) for haplotype frequency estimation from pooled DNA data. Our method specifically focuses on small pool sizes and can handle arbitrarily large block sizes. In our study, we examine real data focusing on dense SNP areas, in which only a small number of haplotypes appear with appreciable frequency, so that our scenarios are within the limits of Kuk et al.

Results

In order to compare the accuracy of frequency estimation between the different methods and under the different scenarios examined, we compared the predicted haplotype frequencies from a given method, ^{2} distance between the two distributions which is simply the result of the ^{2} statistic, where ^{2}(_{
i=1}
^{
d
}(_{
i
} − _{
i
})^{2}/_{
i
} and

Datasets

To examine the performance of our methodology we have considered in our experiments real datasets for which estimates of the haplotype frequencies were already available and which cover a variety of dataset sizes.

We have first simulated using the three loci haplotypes and their associated frequencies from the dataset of Jain et al.

**Haplotype**

**Frequency**

1 0 0

0.082

0 0 1

0.525

1 0 1

0.283

1 1 1

0.106

Next, we considered two more cases with larger number of loci. In the second case which has

**Haplotype**

**Frequency**

1 1 1 1 0 1 1 0 0 0

0.033

1 1 0 1 0 1 1 1 1 0

0.016

1 1 0 1 0 0 1 0 0 1

0.017

1 0 0 1 0 1 1 0 0 1

0.017

1 1 0 1 0 1 1 0 0 1

0.017

1 1 1 1 0 1 1 1 0 1

0.507

0 1 0 1 1 0 0 1 1 1

0.017

1 1 0 0 0 0 1 1 1 1

0.033

0 1 0 1 0 0 1 1 1 1

0.1

1 1 0 1 0 1 1 1 1 1

0.193

1 1 1 1 1 1 1 1 1 1

0.05

The third dataset consisted of SNPs from the first 7Mb (742 kb to 7124.8 kb) of the HapMap CEU population (HapMap 3 release 2- Phasing data). This chromosomal region was partitioned based on physical distance into disjoint blocks of 15 kb. The resulting blocks had a varying number of markers ranging from 2–28. For our purposes we have considered only the datasets that had more than 10 SNPs and less than 20 (which was the maximum number of loci so that HAPLOPOOL could produce estimates within a reasonable amount of time) which resulted in selecting a total of 80 blocks. On each block the parental haplotypes and their estimated frequencies were used as the true haplotype distribution. As in the previous cases, in each block two different pool sizes, 2 and 3 individuals per pool, were considered and four different number of pools per dataset.

Frequency estimation

We have examined the accuracy of our method and compared it against HIPPO and HAPLOPOOL on the three datasets described in our previous subsection. In all experiments considered in this subsection the DNA pools were simulated assuming no missing data or measurement error. The performance of the methods is shown in Figure

Accuracy of haplotype frequency estimates

**Accuracy of haplotype frequency estimates.** Estimating χ^{2} distance for 3 loci, 10 loci and HapMap dataset for 50,75, 100 and 150 pools with HAPLOPOOL, TDSPool and HIPPO.

For the 3 and 10 loci datasets the result presented is the average χ^{2} distance from a 100 simulation experiments, whereas in the HapMap dataset the result presented is the average χ^{2} distance on the 80 datasets considered. For the 3 loci dataset it can be seen that TDSPool and HAPLOPOOL produced similar accuracy. For the remaining two datasets with larger number of loci TDSPool demonstrated superior performance. For the HapMap dataset only TDSPool and HAPLOPOOL were evaluated since the maximum number of loci HIPPO can handle without prior knowledge of the major haplotypes in the population is 10. At the same time even though HAPLOPOOL can in principle handle larger datasets, due to excessive computational time for datasets with 24 and 28 loci we restricted our comparisons to datasets between 10 and 20 loci. We note here as well that since HIPPO is based on a central limit theorem it is likely to be a better approximation in large pools as opposed to small ones that we focus in our study.

From our experiments we can also see that the number of pools also affected accuracy. All algorithms demonstrated improved performance with increasing number of pools in the dataset.

Noise and missing data

In the previous subsection we have evaluated the performance of our method by simulating DNA pools without missing data and measurement errors. However, in allelotyping pooled DNA, allele frequencies may not be estimated properly in some practical situations and the data are consequently missing or have measurement errors.

In order to measure the effect of genotype error on the accuracy of the haplotype frequency estimation and evaluate the performance of our method under such scenarios, we have simulated genotyping error by adding a Gaussian error with SD σ to each called allele frequency. Suppose we denote the correct allele frequency at SNP _{ij}. The perturbed allele frequency is given by ^{2}). After simulating these perturbed haplotype frequencies, we discretize the resulting frequencies to produce perturbed allele counts that are consistent with the number of haplotypes in each pool. We have considered a variety of values for σ, ranging from 0 to 0.06 similar to Kirkpatrik et al.

Accuracy of haplotype frequency estimates with genotyping errors

**Accuracy of haplotype frequency estimates with genotyping errors.** Estimating χ^{2} distance for 3 loci, 10 loci and HapMap datasets when noise is added on the pooled allele frequencies.

For small number of loci, HAPLOPOOL achieves the best performance. However, for larger datasets TDSPool outperforms all competing methods.

Furthermore, we have evaluated the performance of our methodology using missing data. We have randomly masked 1 and 2% of the SNPs respectively on the 10 loci datasets and estimated the accuracy. As shown in Figure

Accuracy of haplotype frequency estimates with missing data

**Accuracy of haplotype frequency estimates with missing data.** Estimating χ^{2} distance for 10 loci dataset with 0,1 and 2% of missing SNPs.

Timing results

The computational times for all datasets are displayed in Table

**Number of pools**

**50**

**75**

**100**

**150**

For each dataset in each algorithm the first line corresponds to the case that each pool has 2 individuals whereas the second line to the case that each pool has three individuals. Time is given in seconds.

3-loci Dataset

TDSPool

0.4458

0.4331

0.4743

0.4861

0.4260

0.4772

0.5346

0.5350

HaploPool

0.0697

0.0642

0.0607

0.0674

0.0593

0.0681

0.0607

0.0691

HIPPO

2.3593

3.0793

3.8856

5.3911

2.4182

3.2047

4.1161

5.5873

10-loci Dataset

TDSPool

0.8094

0.7778

1.0367

1.1259

1.0269

1.0805

1.1804

1.3920

HaploPool

0.5136

0.7381

0.9554

1.4012

0.8531

1.2331

1.6247

2.4078

HIPPO

59.5605

62.7163

64.1563

71.0505

58.8816

64.6515

64.5386

73.9019

HapMap

Dataset

TDSPool

1.0189

1.1660

1.1765

1.5455

1.8760

2.0830

2.1848

3.2719

HaploPool

0.6737

0.9577

1.2679

1.8489

1.1636

1.6928

2.2006

3.2905

Discussion

We have introduced a new algorithm for estimating haplotype frequencies from datasets with pooled DNA samples and we have compared it with existing available packages. We have shown that for datasets with small number of loci our algorithm has comparable performance to state-of-the-art methods in terms of accuracy and computational time but it demonstrates superior performance for datasets with larger number of loci.

Our method specifically focuses on small pool sizes and we have demonstrated the performance on pools of two or three individuals. In our experiments we have partitioned pooled genotype vectors in blocks of 4 SNPs as described in the "Partition-Ligation" subsection. We have chosen to partition the pooled genotypes every 4 SNPs so that computations are performed fast and we avoid cases with huge number of solutions. Partitioning the dataset every 3 SNPs had negligible impact on the accuracy of our results (results not shown) whereas partitioning every 5 SNPs in general can produce block pool genotypes with thousands of solutions, especially when missing data occur.

In the framework developed by Pirinen

Conclusions

We have introduced a new algorithm for estimating haplotype frequencies from pooled DNA samples using a Tree-Based Deterministic sampling scheme. Algorithms for haplotype frequency estimation from pooled data fall into two categories. The first category consists of algorithms that focus on accurate solutions and allow for considerably large genotype segments and the second category of algorithms that focus on small segments but allow for a large number of individuals per pool. We have compared our methodology with state-of-the-art algorithms from each category, namely HAPLOPOOL and HIPPO. We have focused on scenarios and datasets in which the use of pooling data is suggested for haplotype frequency estimation according to the study of Kuk et al.

Methods

In the beginning of the section we introduce some notation. We then present the prior and posterior distribution given the data and derive the state update equations for the TDSPool estimator. We further present the modified partition-ligation procedure adjusted for the pooled data so that we are able to handle larger haplotype vectors and we finally give a summary of the proposed procedure.

Definitions and notation

Suppose we are given a set of pooled DNA measurements on

Suppose that we have _{
j
}
_{
t
} = {_{
t
}
^{1}, …_{
t
}
^{
L
}} to be the pool genotype of the _{
j
}
^{
i
} ∈ {0, …, 2_{
t
}}. Suppose also that _{
t
} = {_{1}, …, _{
t
}} is a set of pool genotypes of pools up to and including pool _{
t
} = {_{
t,1}, …, _{
t,2Nt
}} where _{
t,i
} ∈ {0, 1}^{
L
} is a binary string of length _{
t,i,j
} = 0. We further define _{
t
} = {_{1}, …, _{
t
}}, similarly to _{
t
} as the set of haplotypes for each genotype pool up to and including pool

Schematic representation of the notation used in our methodology

**Schematic representation of the notation used in our methodology.** For each pool genotype (_{t}) and at each locus, the value of the pool genotype at that locus _{t}^{j} is the sum of the values on that loci across all haplotypes in that pool i.e.

Let us also define _{1}, …_{
M
}} , where _{
m
} ∈ {0, 1}^{
L
} is a binary string of length _{
i
} the subset _{
i
} = {_{
i
}
^{1}, …, _{
i
}
^{
Y
}} _{
i
}. The set _{
i=1}
^{
T
}
_{
i
} . A set of population haplotype frequencies _{1}, …, _{
M
}} is also associated with the set _{
m
} is the probability with which the haplotype _{
m
} occurs in the total population.

Probabilistic model

Assuming random mating in the population it is clear that the number of each unique haplotype in _{1}, …, _{
M
})

With mean

Before we calculate the posterior distribution for

and similarly

Calculating the posterior distribution for

where we denote _{
m
}(_{
m
} − _{
t,i
}) with _{
t
} is the indicator function which equals 1 when _{
m
} − _{
t,i
} is a vector of zeros, and 0 otherwise.

We have shown that the posterior distribution for _{
t
} = {_{
m
}(_{
t−1}, _{
t
}, _{
t
} as given by (1) i.e. _{
t
} = _{
t
}(_{
t−1}, _{
t
}, _{
t
}).

Inference problem

Following the notation we used in our previous subsections we can summarize the frequency estimation problem as follows: Given _{1}, …, _{
T
}} the set of observed pool genotype vectors and _{1}, …, _{
M
}} the set of haplotypes compatible to the pool genotypes in A we wish to infer _{1}, …, _{
T
}} the unknown haplotypes in each pool and _{1}, …, _{
M
}} the haplotype frequencies of all the haplotypes occurring in the population.

Computational algorithm (TDSPool)

Similar to traditional Sequential Monte Carlo (SMC) methods, we assume that by the time we have processed pool genotype _{
t-1
} we have K sets of solution streams (i.e. sets of candidate haplotypes for pools 1,…, t-1) and their associated weights _{
t−1}|_{
t−1}).

Given the set of solution streams and the associated weights we approximate the distribution _{
t−1}|_{
t−1}) as follows:

where

When we process the pool genotype _{
t
} based on the pool genotypes _{
t
}. Let us further assume that there are ^{
ext
} possible haplotype solutions compatible with the genotype of the _{
t
}
^{
i
}, ^{
ext
} .

Before we move to the derivation of the state update equation we note here that in the following we will use the fact that for the unknown parameters θ, as we have shown in "Probabilistic Model" subsection, under certain assumptions the prior and posterior distribution are Dirichlet and depend only on a set of sufficient statistics _{
t
} = _{
t
}(_{
t−1}, _{
t
}, _{
t
})

Therefore, from Bayes’ theorem we have:

where

Assuming that we have approximated _{
t−1}|_{
t−1}) as in (2), we can approximate _{
t
}|_{
t
}) using (3) as_{.}

The weight update formula is given by

Partition-Ligation

In the partition phase the dataset is divided into small segments of consecutive loci. Once the blocks are phased, they are ligated together using a modified extension of the Partition-Ligation (PL) method

In our current implementation to be able to derive all possible solution combinations for each pool genotype efficiently we have decided to keep the maximum block length to 4 SNPs. Clearly the more SNPs are included in a block the more information about the LD patterns we can capture but at the same time the number of possible combinations increases and becomes prohibitive for more than 5 SNPs. For our experiments in a dataset with L loci we have considered

The result of phasing for each block is a set of haplotype solutions for each pool genotype. Two neighbouring blocks are ligated by creating merged solutions for each pool genotype from all combinations of the block solutions, one from each block. When creating a merged solution for a pool genotype from the two separate solutions (one from each block), since we do not know which haplotypes belong to the same chromosome, all different possible assignments are examined. The TDSPool algorithm is then repeated in the same manner as it was for the individual blocks.

Furthermore, the order in which the individual blocks are ligated is not predetermined. We first ligate the blocks that would produce in each step the minimum entropy ligation. This procedure allows us to ligate first the most homogeneous blocks so that we have more certainty in the solutions that we produce while moving in the ligation procedure.

Summary of the proposed algorithm

Routine 1

● Set the current number of streams m = 1. Define K as the maximum number of streams allowed. Define _{0}
^{1} ={}.

● For

○ Find the ^{
ext
} possible haplotype configurations compatible with the pool genotype of the

○ For ^{
ext
}

▪ Enumerate all possible particle extensions

▪_{
t
}
^{(k,j)} according to (4)

○ Select and preserve ^{
ext
}
_{
t
}
^{(k)}, _{
t
}
^{(k)}, _{
t
}
^{(k,j)}, _{
t
}
^{(k,j)}, ^{
ext
} }

○ Update the number of counts of each encountered haplotype in each stream

○ Set

TDSPool ALGORITHM

● Partition the genotype dataset

● For

● Until all blocks are ligated, repeat the following

○ Find the blocks that if ligated would produce the minimum entropy

○ Ligate the blocks, following the procedure described in the Partition-Ligation section

Authors’ contributions

All authors contributed equally to this work. All authors read and approved the final manuscript.