Department of Electrical Engineering, Columbia University, 500 W 120th St, New York, 10027 NY, USA

Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, 10029 NY, USA

Abstract

Background

Xor-genotype is a cost-effective alternative to the genotype sequence of an individual. Recent methods developed for haplotype inference have aimed at finding the solution based on xor-genotype data. Given the xor-genotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual with the aid of a small number of regular genotypes.

Results

We propose a framework of maximum parsimony inference of haplotypes based on the search of a sparse dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xor-genotypes augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic data sets with different numbers of individuals and SNPs, and compare the performance with the state-of-the-art xor-haplotyping methods PPXH and XOR-HAPLOGEN.

Conclusions

Experimental results show good inference qualities for the proposed method under all circumstances, especially on large data sets. Results on a real database, CFTR, also demonstrate significantly better performance. The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors.

Background

A human genome is a sequence of nucleotides that can differ from one individual to another (approximately 0.1% difference between any two individuals) for various reasons, such as insertions/deletions of fractions of the sequence on the genome or, most commonly, the substitution/mutation of single nucleotides at commonly observed sites called single nucleotide polymorphisms (SNPs).

The entire human genome consists of 23 distinct chromosomes, each appearing in two copies: the autosomes (chromosomes 1 to 22) and the allosome (chromosome 23), which consists of two copies of chromosome-X in females or one chromosome-X and one chromosome-Y in males. Each chromosome is a pair of two distinct sequences, the haplotypes, inherited from the parents, i.e., one from the maternal genome and the other from the paternal genome. The genotype is sequenced by identifying the types of alleles, i.e., nucleotide variants, across the SNP locations (loci) of the chromosomes. At a particular locus of a chromosome, if both haplotypes carry the same allele we call this site in the genotype homozygous; otherwise, it is heterozygous.

Methods for solving the haplotype inference problem given regular genotypes can be grouped into two categories: combinatorial methods, which usually state an explicit objective function and propose ways of optimizing it, and statistical methods, which rely on statistical modeling of the problem. Various methods have been published for the haplotype inference problem

On the other hand, it is known that in a population of individuals certain haplotypes are frequently found in certain genomic regions

In

For the xor-haplotyping problem, there is increased ambiguity due to the XOR operation between haplotypes, i.e., the process of xor-genotyping, which determines only whether the alleles of the two haplotypes differ at a particular site (heterozygous) or are the same (homozygous). However, this ambiguity can be resolved with the assistance of regular genotypes. Regular genotypes can either be used as post-processing inputs for eliminating set-equivalent solutions of a particular inference, or they can be used to refine the inference while constructing the solution.

Tractability of the maximum parsimony haplotyping problem in the xor-genotype case is still open

The remainder of the paper is organized as follows. In

Preliminaries

In an SNP locus only 2 nucleotides are observed, and a single bit is sufficient for the representation of nucleotide variants such that 0 encodes the major allele and 1 encodes the minor allele. The haplotype of an individual can thereby be represented with a binary vector that shows the SNP variants across the individual’s chromosome. The genotype can then be thought of as a ternary vector where a 0 (2) indicates that the site is homozygous and both haplotypes have major 0/0 (minor 1/1) alleles, and 1 indicates that the site is heterozygous and the haplotypes have different alleles 0/1 or 1/0. Notice that when encoding homozygous and heterozygous sites we used a different notation from the literature in order to express a genotype vector as the sum of two haplotypes: a minor-homozygous SNP is encoded with 2 and a heterozygous SNP is encoded with 1, so that a 2 in the genotype is given by (the sum of) two minor alleles, and a 1 in the genotype is given by (the sum of) one major and one minor allele.
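As a concrete illustration of the encodings above, consider the following minimal Python sketch (the variable names are ours, purely illustrative): the genotype is the element-wise sum of the two haplotype vectors, and the xor-genotype is their element-wise XOR.

```python
# Two haplotypes over 4 SNP loci: 0 = major allele, 1 = minor allele.
h1 = [0, 1, 1, 0]
h2 = [0, 1, 0, 1]

# Genotype: element-wise sum -> 0 (major-homozygous), 2 (minor-homozygous),
# 1 (heterozygous), matching the encoding described above.
genotype = [a + b for a, b in zip(h1, h2)]

# Xor-genotype: element-wise XOR -> 0 (homozygous), 1 (heterozygous).
xor_genotype = [a ^ b for a, b in zip(h1, h2)]

print(genotype)       # [0, 2, 1, 1]
print(xor_genotype)   # [0, 0, 1, 1]
```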

In general, a genotype with k heterozygous sites can be resolved by 2^{k-1} distinct haplotype pairs: one haplotype can be chosen freely over the k heterozygous sites, and the other haplotype is then its complement (inverted values) at those sites. Therefore, solving a genotype with k heterozygous sites amounts to choosing among 2^{k-1} distinct pairs of binary vectors.
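This counting can be checked by brute force. The sketch below (the helper `compatible_pairs` is ours, not from the paper) enumerates all unordered haplotype pairs whose element-wise sum equals a given ternary genotype; a genotype with k heterozygous sites admits 2^{k-1} such pairs.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate haplotype pairs (h1, h2) with h1 + h2 == genotype,
    up to the order of the pair."""
    het = [i for i, g in enumerate(genotype) if g == 1]  # heterozygous loci
    pairs = []
    for bits in product([0, 1], repeat=len(het)):
        h1 = [g // 2 for g in genotype]   # 0 -> 0, 2 -> 1 at homozygous loci
        h2 = h1[:]
        for locus, b in zip(het, bits):
            h1[locus], h2[locus] = b, 1 - b
        if (h1, h2) not in pairs and (h2, h1) not in pairs:
            pairs.append((h1, h2))
    return pairs

# A genotype with k = 3 heterozygous sites admits 2**(3-1) = 4 distinct pairs.
print(len(compatible_pairs([1, 0, 1, 2, 1])))  # 4
```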

In the xor-haplotyping problem, on the other hand, the conflated data further increase the ambiguity: for an xor-genotype of length L, any of the 2^{L} distinct binary vectors of length L is a compatible haplotype.

Besides being NP-hard, the xor-haplotyping problem also has no unique solution. The nature of the XOR operation results in a phenomenon called **set equivalence**: given a solution set **H** consisting of length-L haplotypes, flipping the bits at any subset of loci across all haplotypes of **H** yields another set that explains the same xor-genotypes.
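Set equivalence is easy to verify numerically. In this illustrative sketch (the data are arbitrary), flipping the bits at a fixed subset of loci in every haplotype of a solution set leaves every individual's xor-genotype unchanged.

```python
def xor(h1, h2):
    return [a ^ b for a, b in zip(h1, h2)]

# A solution set H of length-4 haplotype pairs for two individuals.
H = {"ind1": ([0, 1, 1, 0], [0, 1, 0, 1]),
     "ind2": ([1, 1, 0, 0], [0, 1, 0, 1])}

flip_loci = {0, 3}  # flip these loci in every haplotype

def flip(h):
    return [b ^ 1 if i in flip_loci else b for i, b in enumerate(h)]

for h1, h2 in H.values():
    assert xor(h1, h2) == xor(flip(h1), flip(h2))  # same xor-genotypes
print("flipped set is set-equivalent to H")
```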

Problem definition

For each SNP

where _{
i
}(

where **
x
**

In regular haplotyping, a putative haplotype **z** ∈ {0, 1}^{L} can explain a genotype only if it agrees with the genotype at every homozygous site; hence the homozygous sites are unambiguous.

In xor-haplotyping, on the other hand, it is trivial to see that any haplotype **z** ∈ {0, 1}^{L} is compatible with a given xor-genotype, since pairing **z** with its XOR against the xor-genotype always reproduces the observation.

Because of this compatibility between xor-genotypes and candidate haplotypes, an SNP site can always be explained by either of the two alleles, and thus unambiguous SNPs no longer exist. Notice that, in particular, an xor-genotype with all-homozygous SNPs is still ambiguous and can only be solved up to bit flipping. However, we know that such an xor-genotype is always explained by a pair of identical haplotypes, which correspond to the same column of **Z**. On the other hand, if there is at least one heterozygous SNP in the xor-genotype, then its phasing haplotypes are not identical and correspond to different columns of **Z**.
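This compatibility is a one-line consequence of the XOR operation, as the following toy example shows (the vectors are arbitrary): for any candidate haplotype z, the mate z ⊕ x reproduces the observed xor-genotype x.

```python
x = [0, 1, 1, 0, 1]            # an observed xor-genotype
z = [1, 1, 0, 0, 1]            # an arbitrary candidate haplotype
mate = [a ^ b for a, b in zip(z, x)]   # the pair (z, mate) explains x

assert [a ^ b for a, b in zip(z, mate)] == x
print(mate)  # [1, 0, 1, 0, 0]
```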

The xor-genotype of

where (.)_{2} represents the component-wise **
v
**

Given **
Z
**, finding the indicator vector

Methods

Xor Haplotyping by Sparse Dictionary Selection (XHSD)

If an (all-homozygous) xor-genotype is explained by only one haplotype, i.e., **
x
**

where **
v
**

For each observed xor-genotype **
x
**

To solve the xor-haplotyping problem, we choose the sparse dictionary

For an observed xor-genotype the reconstruction accuracy can be interpreted as the Euclidean distance between the observation and its closest approximation, i.e.,

where **
Z
** used to approximate

The individual cost function in (5) is then translated into a fitness function associated with a given dictionary

Finally, the fitness value of

For a given cardinality (sparsity) of

and the sparsest dictionary that is sufficient to reconstruct all observed xor-genotypes is determined by

Notice that determining both

For xor-haplotype inference, on the other hand, the problem is fundamentally different. That is, the submodularity property may not hold for the cost function in (5) due to the XOR operation, and thereby the theoretical guarantee does not hold for the greedy method either. Nonetheless, we still use a greedy heuristic similar to that of SHSD in

In our algorithm Xor Haplotyping by Sparse Dictionary Selection (XHSD), we start with an empty dictionary set

Computing (10) requires solving (5) and (6) for each **
x
**

Notice that in the XHSD algorithm the number of compatible haplotypes |

• Initialization.

–

–

–

• Iterate until all xor-genotypes are explained, i.e.,

– Perform the greedy search.

∗ For

∗ Let

Set

∗ Check if any xor-genotype is explained by the addition of the new element

–

Given the xor-genotypes of a set of individuals, this algorithm finds the haplotypes of each individual based on the maximum parsimony principle.
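The loop above can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the authors' implementation: it replaces the fitness function of (5)-(10) with a naive "explained by some pair in the dictionary" count, and it scans all 2^L candidate haplotypes, which is feasible only for toy lengths.

```python
from itertools import product

def greedy_xhsd(xor_genotypes, length):
    """Simplified greedy sketch of sparse dictionary selection:
    repeatedly add the candidate haplotype that lets the dictionary
    explain the most still-unexplained xor-genotypes.  An xor-genotype
    x is explained by D if some pair d, d' in D satisfies d XOR d' == x."""
    def explained(x, dictionary):
        return any([a ^ b for a, b in zip(d, x)] in dictionary
                   for d in dictionary)

    candidates = [list(c) for c in product([0, 1], repeat=length)]
    D, unexplained = [], list(xor_genotypes)
    while unexplained:
        best, best_gain = None, -1
        for z in candidates:
            gain = sum(explained(x, D + [z]) for x in unexplained)
            if gain > best_gain:
                best, best_gain = z, gain
        D.append(best)
        unexplained = [x for x in unexplained if not explained(x, D)]
    return D

X = [[0, 1, 1], [1, 1, 0], [1, 0, 1]]
print(len(greedy_xhsd(X, 3)))  # 3
```

On these three toy xor-genotypes of length 3, the greedy loop terminates with a three-element dictionary.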

As an example, consider the following demonstration. Let **
x
**

The set of compatible haplotypes for these individuals will consist of all length-3 binary vectors, i.e.,

After initializing **
Z
**, and starting with empty dictionary

This simple example demonstrates how the proposed greedy approach can efficiently construct sparse solutions: three xor-genotypes are explained by only three haplotypes within three iterations. Nonetheless, due to the bit-flip degree of freedom, the solution set may be any one of the sets equivalent to the true solution, and this ambiguity must still be resolved.

Resolving bit flip degree of freedom

In **
X
** ∈ {0, 1}

By bit flipping on a given solution **
H
**, one attempts at choosing among the set-equivalent solutions

Ambiguity resolution for PPXH method

**Ambiguity resolution for PPXH method.** Informative regular genotypes are used to post-process the inferred solution **H**.

However, this post-processing method has certain limitations. Notice that, for large **
H
**, e.g., for a given set of xor-genotypes it is very likely that any two different inferences

Furthermore, notice that flipping the bits on some loci across all the haplotypes in **
H
** does not affect the parsimony of the solution. The final solution

Therefore, instead of using regular genotypes to post-process a solution, a more intuitive approach is to resolve the bit-flip degree of freedom while constructing the solution. In particular, regular genotypes can be used as constraints when solving the homozygous sites of an xor-genotype. In this sense, given a set of individuals' xor-genotypes, we determine the individuals that have the most informative regular genotypes and pre-process the data set by replacing those individuals' xor-genotypes with their regular genotypes. The MTI algorithm
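As a sketch of constraint-style ambiguity removal, the following hedged Python fragment (the helper `resolve_flips` and the data are ours) uses one individual's regular genotype: its homozygous sites (0 or 2) pin down the alleles at those loci, and any disagreement with the inferred pair is corrected by flipping that locus across the entire solution set, which preserves every xor-genotype.

```python
def resolve_flips(H, individual, g):
    """H maps individual -> inferred haplotype pair (h1, h2); g is that
    individual's ternary regular genotype.  Flip every locus where the
    inferred pair disagrees with g at a homozygous site (g == 0 or 2)."""
    h1, h2 = H[individual]
    flip = [int(g[i] in (0, 2) and h1[i] + h2[i] != g[i])
            for i in range(len(g))]
    return {k: ([a ^ f for a, f in zip(p[0], flip)],
                [b ^ f for b, f in zip(p[1], flip)])
            for k, p in H.items()}

H = {"A": ([1, 1, 0], [1, 0, 1]), "B": ([1, 1, 0], [1, 1, 0])}
resolved = resolve_flips(H, "A", g=[0, 1, 1])  # locus 0 must be major/major
print(resolved["A"])  # ([0, 1, 0], [0, 0, 1])
```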

Ambiguity resolution in XHSD or in XOR-HAPLOGEN (XHAP)

**Ambiguity resolution in XHSD or in XOR-HAPLOGEN (XHAP).** Informative regular genotypes

In most cases the xor-genotypes in **
X
** have an empty intersection, and for each run MTI outputs 2 or 3 individuals, i.e.,

Next we explain the necessary modifications to the XHSD algorithm for utilizing the regular genotypes.

XHSD with regular genotypes

The information provided by regular genotypes is used to reveal the type of allele in homozygous sites of an individual so that we can improve the reconstruction accuracy in (5), and build the dictionary **
g
**

where **
Z
** is the set of haplotypes that are compatible with the

To exploit this fact, we can introduce a weight for each individual in the cost function

The weight parameter could be set proportional to the average rate of homozygous SNPs per genotype, assuming that the more homozygous sites the regular genotype contains, the more informative it will be. We experimentally set the weight to 4, as this yielded good performance with both synthetic and real databases.

Extensions

Long xor-genotypes

Note that the size of **Z** grows exponentially with the length of the xor-genotypes.

The haplotype diversity of a given block is measured by its Shannon entropy. The block partitioning by minimizing the total Shannon entropy proceeds as follows. Let

The entropy of the haplotype block

and the total entropy of a partition is the sum of the entropies of its blocks, subject to each block's length not exceeding a preset maximum block size.

To determine the initial and ending loci of each block
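The entropy of a candidate block is the empirical Shannon entropy of the distinct sub-haplotypes observed in it; a minimal sketch (the function name is ours) follows.

```python
from collections import Counter
from math import log2

def block_entropy(haplotypes, start, end):
    """Empirical Shannon entropy of the sub-haplotypes spanning
    loci [start, end) over the given haplotype list."""
    counts = Counter(tuple(h[start:end]) for h in haplotypes)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

haps = [[0, 1, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1]]
print(round(block_entropy(haps, 0, 2), 3))  # 0.811: patterns (0,1) x3, (1,0) x1
print(block_entropy(haps, 0, 4))            # 1.5: three distinct haplotypes
```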

Missing data

Genotyping errors often occur when the observed genotype of an individual differs from the original sequence for various reasons

Let **
g
**

where **
Z
** with the rows corresponding to the missing loci of the

where

Different weight functions could be employed to exploit the distribution of missing sites. Since, in our experiments, the missing sites are uniformly distributed across the SNPs and individuals, the function in (14) gave good performance.

The proposed method does not account for the direct inference of the missing sites, i.e., imputing missing genotypes

**Matlab implementation.** This file includes the Matlab code of the proposed algorithm, and an implementation with the example database, CFTR.


Results and discussion

We tested the performance of several xor-haplotyping methods with a number of metrics. First we measured the individual error rate, i.e., the percentage of individuals whose inferred pair of haplotypes differs from the original pair. This measure is sensible for assessing inference quality in the regular haplotyping problem, since the alleles corresponding to homozygous loci are known and only the heterozygous loci are ambiguous, so performance depends on the inference accuracy at heterozygous loci. Nonetheless, in xor-haplotyping there are a large number of solutions equivalent to the original one up to bit flipping, and it is therefore very likely that a solution set differs from the original phasing in at least one SNP. In particular, for a given xor-genotype, even a single-SNP difference (namely, a bit flip) between the pair of inferred haplotypes and the pair of haplotypes that originally gave rise to that xor-genotype is counted as a mis-inference. A more sensible metric, therefore, takes into account the percentage of such SNPs where the inference differs from the true phasing, i.e.,

Moreover, to assess the accuracy on homozygous sites, we employ an additional error measure
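The individual-level measure can be sketched as follows (the function name and toy data are ours): a pair counts as mis-inferred even if it differs from the true pair by a single SNP.

```python
def individual_error_rate(inferred, truth):
    """Fraction of individuals whose inferred unordered haplotype pair
    differs from the true pair in any way."""
    wrong = sum({tuple(a), tuple(b)} != {tuple(c), tuple(d)}
                for (a, b), (c, d) in zip(inferred, truth))
    return wrong / len(truth)

truth    = [([0, 1], [1, 0]), ([1, 1], [0, 0])]
inferred = [([1, 0], [0, 1]),      # same pair, order swapped: correct
            ([1, 1], [0, 1])]      # one SNP off: counted as an error
print(individual_error_rate(inferred, truth))  # 0.5
```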

We performed xor-haplotyping on various data sets, with and without missing information on loci: synthetic data sets with different recombination rates simulated by a coalescence-based program of

Synthetic data

Based on different recombination rates three different scenarios are considered in synthetic data sets: no recombination (

In Figure

Potential inference quality on short (L < 14) synthetic data

**Potential inference quality on short (L < 14) synthetic data.**

To evaluate the inference quality when regular genotype data are available, we first determined only a limited number of regular genotypes by the MTI method, i.e., the smallest set of regular genotypes that have an empty intersection on the heterozygous SNPs, then resolved the ambiguity by bit flipping on the initial inference according to these regular genotypes. This achieves lower error rates despite the small augmentation of the data by only 2 regular genotypes, compared to using them in post-processing, i.e.,

Performance on long (5 ≤ L ≤ 46) synthetic data by bit flipping via 2 regular genotypes

**Performance on long (5 ≤ L ≤ 46) synthetic data by bit flipping via 2 regular genotypes.**

It is worth noting that algorithms based on segmentation may deteriorate when processing long xor-genotype sequences, especially with increasing recombination rates, where the detection of haplotype blocks is complicated

For more practical results, we added regular genotypes to each method for different percentages of the population and allowed the methods to remove the ambiguity on their own, except for PPXH. Since PPXH cannot make use of regular genotypes directly, we applied bit flipping using the MTI solver to remove the ambiguity for this method. To regularly genotype a given percentage of the population, the regular genotypes are determined by running the MTI method several times until the number of distinct regular genotypes obtained reaches the given percentage of the total number of individuals.

Figure _{
p
}) and heterozygous sites (

Performance on long (5 ≤ L ≤ 46) synthetic data from 50 individuals by employing different numbers of regular genotypes

**Performance on long (5 ≤ L ≤ 46) synthetic data from 50 individuals by employing different numbers of regular genotypes.**

Missing data

We investigated the ability of the various methods to deal with missing data under different circumstances. Since the methods performed similarly under a zero recombination rate, we used the same data sets with no recombination to generate the database with missing entries. An SNP site of an individual is defined as “missing” with a given probability, and the data sets for different percentages of missing SNPs are generated accordingly. The PPXH method is excluded since it cannot handle missing data. In XHSD the block partitioning is applied as before with a maximum block size of

Figures

Performance under low rates of missing data, long (5 ≤ L ≤ 46) synthetic data

**Performance under low rates of missing data, long (5 ≤ L ≤ 46) synthetic data.**

Performance under high rates of missing data, long (5 ≤ L ≤ 46) synthetic data

**Performance under high rates of missing data, long (5 ≤ L ≤ 46) synthetic data.**

We examined the dependency of the methods on the missing data rate for a population with a large number of individuals. That is, we used the xor-genotypes from 50 individuals, replaced 30% and 50% of the population with regular genotypes, and performed xor-haplotype inference under different missing data rates starting from 0.5%. XOR-HAPLOGEN requires a larger fraction of regular genotypes to reach the same error level as XHSD, e.g., regular genotyping by 30% in XHSD is comparable to that of 50% in XOR-HAPLOGEN.

Performance under different percentages of missing data

**Performance under different percentages of missing data.**

CFTR gene database

Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator (CFTR) protein.

We tested the performance of each method on this database with/without missing sites {0,5

Performance on CFTR gene database with different population sizes with/without missing data

**Performance on CFTR gene database with different population sizes with/without missing data.**

Figure

Running times on CFTR gene database with different population sizes with/without missing data

**Running times on CFTR gene database with different population sizes with/without missing data.**

Typing errors

Combinatorial optimization techniques are known for their sensitivity to genotyping errors. We introduced a typing error at each SNP site with probability P_{err}, and typed the erroneous site as either homozygous or heterozygous with equal probabilities. We then ran the algorithms without providing the knowledge of the erroneous site positions. We excluded the PPXH method due to its low performance on the CFTR database. The corresponding figure shows the performance for P_{err} = 2%.

Performance on CFTR gene database for different population sizes, with P_{err} = 2%, with/without missing data

**Performance on CFTR gene database for different population sizes, with P_{err} = 2%, with/without missing data.**

ANRIL database

The performance of haplotyping methods can deteriorate on databases with decreasing linkage disequilibrium (LD) rates. An SNP database with low pairwise-LD scores is investigated in an association study given in

Performance on ANRIL gene database with different population sizes with/without missing data

**Performance on ANRIL gene database with different population sizes with/without missing data.**

Notice that the algorithms cannot mitigate the error rates with an increasing number of individuals. This can be explained by the very high haplotype diversity in the corresponding low-LD SNP regions. The number of distinct haplotypes explaining the given number of individuals presumably remains highly diverse as the number of individuals grows, whereas methods based on the maximum parsimony principle fail to incorporate this fact. They tend to find parsimonious (low-diversity) solutions for all population sizes, with a decreasing ratio of inferred haplotypes to individuals. For the inference with no missing data, we observed that this ratio decreases as **ρ** = [1.3, 0.95, 0.83, 0.72, 0.66] with respect to the populations with 10, 20, 30, 40, 50 individuals, whereas the same ratio for the true phasing (ground-truth data) is in fact much higher, i.e.,

Conclusions

In this paper, we have presented a new xor-haplotyping method, XHSD, based on the maximum parsimony principle, which infers the haplotype pairs for each member of a group of unrelated individuals from their xor-genotypes. A dictionary selection method is utilized to find the smallest set of haplotypes, selected from a candidate set, that can explain the given set of xor-genotypes. The proposed approach requires regular genotypes from only a small percentage of individuals to remove the ambiguity across all SNPs of the inferred haplotypes. The smallest subgroup of individuals having the most informative regular genotypes is efficiently determined by the minimum tree intersection algorithm. Although the inference accuracy was proportional to the percentage of individuals provided with regular genotypes, XHSD shows less dependency on regular genotypes than other methods. Experimental results have demonstrated that XHSD is a reliable method for xor-haplotyping under all circumstances, including missing data and typing errors. Low rates of missing values (≤ 10

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XW and GJ conceived of the project. AE, GJ and XW participated in the design of the method. AE performed the computer experiments and contributed in the writing of the draft. All authors read and approved the final manuscript.