Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520 USA

Department of Biostatistics, Department of Preventive Medicine, University of Medicine and Dentistry of New Jersey, Newark, NJ 07101, USA

Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032 USA

Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA

Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520 USA

Department of Genetics, Yale University, New Haven, CT 06520 USA

Abstract

Common human disorders, such as alcoholism, may result from interactions among many genes as well as environmental risk factors. Therefore, it is important to incorporate gene × gene and gene × environment interactions in complex disease gene mapping. In this study, we applied a robust Bayesian genome screening method that can incorporate interaction effects to map genes underlying alcoholism, using the data of the Collaborative Study on the Genetics of Alcoholism provided by Genetic Analysis Workshop 14. Our Bayesian genome screening method uses regression-based stochastic variable selection, coupled with the new Haseman-Elston method, to identify markers linked to phenotypes of interest. Compared to traditional linkage methods based on single-gene disease models, our method allows multilocus disease models to be screened simultaneously for both main and interaction (epistatic) effects. It is conceptually simple and computationally efficient through the use of the Gibbs sampler. We conducted genome-wide analyses and compared scans based on microsatellites and single-nucleotide polymorphisms. A total of 328 microsatellites and 11,560 single-nucleotide polymorphisms (by Affymetrix) on the 22 autosomal chromosomes and the sex chromosome were used.

Background

Alcohol dependence is a complex disorder that is influenced by many genetic and environmental factors. Identifying genes associated with alcohol dependence is critical to understanding its etiology and to developing efficient methods for prevention and treatment. However, this effort has been hampered by the complexity underlying alcohol dependence: rather than one or a few major genes affecting alcohol dependence, it is likely that multiple genes interact with each other, together with environmental factors, to affect susceptibility. In this paper, we describe analyses of the Collaborative Study on the Genetics of Alcoholism (COGA) data (Problem 1), using the self-reported "maximum number of drinks consumed in a 24-hour period" (denoted by M) as a quantitative trait, to map genes underlying alcohol dependence. The measure M is closely related to alcoholism diagnosis and provides a quantitative measure of alcohol dependence. For genome screens for this trait, we use the modified Haseman-Elston regression method

The Haseman-Elston method and its derivatives allow one to apply linear regression methods in linkage analysis. For each sibling pair, these methods use the number of alleles shared identical by descent (IBD) at each marker as the explanatory variable and a statistic measuring the similarity of the pair's quantitative traits, either the squared difference or the cross-product, as the response variable.
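The regression idea can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration: the single additive locus, the simulation scheme, and the helper names `simulate_sib_pairs` and `ols_fit` are hypothetical and are not taken from the study. Under linkage, the squared trait difference should decrease with IBD sharing and the mean-corrected cross-product should increase with it.

```python
import random


def simulate_sib_pairs(n_pairs, allele_sd=1.0, noise_sd=1.0, seed=1):
    """Simulate (ibd, trait_1, trait_2) for sib pairs at one additive locus."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # Each parent carries two alleles with random additive effects.
        father = [rng.gauss(0.0, allele_sd) for _ in range(2)]
        mother = [rng.gauss(0.0, allele_sd) for _ in range(2)]
        # Each sib receives one paternal and one maternal allele at random.
        picks = [(rng.randrange(2), rng.randrange(2)) for _ in range(2)]
        traits = [father[f] + mother[m] + rng.gauss(0.0, noise_sd)
                  for f, m in picks]
        # IBD count: one for a shared paternal allele, one for a shared maternal allele.
        ibd = int(picks[0][0] == picks[1][0]) + int(picks[0][1] == picks[1][1])
        pairs.append((ibd, traits[0], traits[1]))
    return pairs


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


pairs = simulate_sib_pairs(2000)
ibd = [p[0] for p in pairs]
# Trait mean estimated from all simulated individuals.
mu = sum(y1 + y2 for _, y1, y2 in pairs) / (2 * len(pairs))

# Two candidate response variables for the sib-pair regression.
squared_diff = [(y1 - y2) ** 2 for _, y1, y2 in pairs]
cross_product = [(y1 - mu) * (y2 - mu) for _, y1, y2 in pairs]

_, slope_squared_diff = ols_fit(ibd, squared_diff)
_, slope_cross_product = ols_fit(ibd, cross_product)
print("slope, squared difference:", round(slope_squared_diff, 3))
print("slope, cross-product:", round(slope_cross_product, 3))
```

In this toy setting a negative slope for the squared difference, or a positive slope for the cross-product, is the signal of linkage at the marker.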

In practice, the number of markers together with their possible epistatic effects is often larger than the number of observations (patients or sib-pairs), in which case the design model is referred to as "supersaturated". Because we often have hundreds of markers to consider, we must deal with the problem of multiple testing. Moreover, if one would like to take epistasis into account, the number of tests can easily exceed tens of thousands. Performing hypothesis tests for linkage for all of these possibilities without appropriate adjustment for multiple comparisons can lead to the identification of spurious genetic effects or the masking of real effects. The supersaturated nature of the design model also makes the conventional best subset model selection methods

In this study, we apply the method developed by Oh

Methods

Haseman-Elston method

The original Haseman-Elston method regresses the squared trait difference ($Y_D^2 = (y_1 - y_2)^2$) in pairs of siblings on the number of alleles shared IBD between each sib pair at a given marker. Although the original Haseman-Elston method is simple, robust, and computationally inexpensive, it may ignore information contained in the observed bivariate data. In fact, the squared mean-corrected trait sum of sib pairs ($Y_S^2 = (y_1 + y_2 - 2\mu)^2$) may provide additional information on the genetic effect, so combining $Y_D^2$ and $Y_S^2$ in linkage analysis can improve statistical power. One of the simplest methods was proposed by Elston et al., who used the mean-corrected cross-product, $CP = (Y_S^2 - Y_D^2)/4 = (y_1 - \mu)(y_2 - \mu)$, as the response variable; it combines $Y_D^2$ and $Y_S^2$ with equal weights. An additional advantage of using CP as the response variable is that it may be more normally distributed than $Y_D^2$ and $Y_S^2$.
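The equal-weight combination can be checked directly. In the short derivation below, $y_1$ and $y_2$ are the two siblings' trait values and $\mu$ is the trait mean; the shorthand $a = y_1 - \mu$ and $b = y_2 - \mu$ is introduced here only for the check, so that the mean-corrected sum is $a + b$ and the trait difference is $a - b$.

```latex
\begin{align*}
\frac{(a+b)^2 - (a-b)^2}{4}
  &= \frac{(a^2 + 2ab + b^2) - (a^2 - 2ab + b^2)}{4} \\
  &= \frac{4ab}{4} = ab = (y_1 - \mu)(y_2 - \mu).
\end{align*}
```

One quarter of the squared sum minus the squared difference is therefore exactly the mean-corrected cross-product of the two sibling traits.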

Bayesian genome screening

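Assume that we observe a response $Y$ and $k$ candidate explanatory variables $X_1, \dots, X_k$, where $X_j$ is the $j$th candidate explanatory variable. Then the observed $Y$ can be described by the linear model

$$Y = X\beta + \varepsilon,$$

where the errors $\varepsilon \sim N(0, \sigma^2)$ are assumed to be independent. A subset model is represented by a binary vector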

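The choice of the values $(p_{00}, p_{01}, p_{10}, p_{11})$ represents the belief about the relationship between factors. For example, $(p_{00}, p_{01}, p_{10}, p_{11}) = (0, 0, 0, p)$ admits an interaction only when both of its main effects are in the model, whereas one may choose $(p_{00}, p_{01}, p_{10}, p_{11}) = (0, p_1, p_2, p_3)$ to relax these conditions.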
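A small sketch may help make the role of the four values concrete. It assumes, as an interpretation not spelled out in the text above, that the value indexed by (a, b) is the prior probability that a two-way interaction enters the model when its two main effects have inclusion indicators a and b. The function name `interaction_prior` and all numeric values are hypothetical.

```python
def interaction_prior(a_included, b_included, p):
    """Prior probability that the A x B interaction is in the model.

    `p` maps the inclusion indicators (a, b) of the two main effects
    to a probability; this reading of the four values is an assumption.
    """
    return p[(int(a_included), int(b_included))]


# Hypothetical numbers, not the values used in the study.
# Strict setting: an interaction is possible only if both main effects are in.
strict = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.25}
# Relaxed setting: one included main effect already allows the interaction.
relaxed = {(0, 0): 0.0, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.25}

for a in (0, 1):
    for b in (0, 1):
        print(a, b, interaction_prior(a, b, strict), interaction_prior(a, b, relaxed))
```

Under the strict setting the sampler never proposes an interaction whose main effects are both absent or only half present, which sharply reduces the effective model space.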

With an appropriate prior distribution on $\sigma^2$, one can obtain the posterior distribution of

We considered 328 microsatellite markers and 7,826 SNP markers. Therefore, to consider epistatic effects, we needed to include 53,957 factors for microsatellites and about 30 million factors for SNPs in the models. We set the prior as $(p_{00}, p_{01}, p_{10}, p_{11})$ = (0,
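The size of the model space follows from simple counting: each marker contributes one main effect, and each unordered pair of markers contributes one two-way interaction. The sketch below, with the hypothetical helper `n_candidate_factors`, counts markers only. For 328 markers it gives 53,956, one fewer than the figure quoted above; which additional factor accounts for the difference is not stated here.

```python
from math import comb


def n_candidate_factors(n_markers):
    """Main effects plus all pairwise (two-way) interactions among markers."""
    return n_markers + comb(n_markers, 2)


microsatellite_factors = n_candidate_factors(328)
snp_factors = n_candidate_factors(7826)
print(microsatellite_factors)  # 53956
print(snp_factors)             # 30627051
```

The SNP count of roughly 30.6 million agrees with the "about 30 million" quoted in the text.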

Results

In the full sample, the quantitative trait M ranges from 0 to 160, and five individuals fall in the highest threshold class (M > 128 drinks). Using the log transformation reduces the impact of these extreme values on the regression analysis, which can be inflated by self-report

In each analysis, the Markov chain Monte Carlo (MCMC) sampler was run for 100,000 cycles after discarding the first 2,000 cycles as the burn-in period. Because MCMC samplers arise from recursive draws, they produce correlated samples from the posterior. Therefore, the chains are thinned (one iteration in every 10 cycles is saved) to reduce serial correlation in the stored samples. The total number of samples kept for the post-Bayesian analysis is 10,000. It takes ~2 hours for microsatellites and ~6 hours for SNPs to generate each sample with Java programs on a Linux cluster using 2.4-GHz Intel processors. Table
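The bookkeeping for burn-in, thinning, and marginal posterior probabilities can be sketched as follows. This is not the sampler used in the study: the "chain" below is a stand-in of independent 0/1 draws with made-up inclusion rates, used only to show how stored inclusion indicators are turned into marginal posterior probabilities. All names are hypothetical.

```python
import random

BURN_IN = 2_000      # cycles discarded at the start
RUN_CYCLES = 100_000  # cycles run after burn-in
THIN = 10             # keep one cycle in every 10

rng = random.Random(0)
# Stand-in inclusion rates for three candidate factors (illustrative only).
stand_in_rates = [0.20, 0.05, 0.01]

# One row per cycle; each entry is the 0/1 inclusion indicator of a factor.
chain = [[int(rng.random() < rate) for rate in stand_in_rates]
         for _ in range(BURN_IN + RUN_CYCLES)]

# Drop the burn-in, then keep every THIN-th cycle.
stored = chain[BURN_IN::THIN]


def marginal_inclusion(draws):
    """Fraction of stored draws in which each factor is included."""
    n = len(draws)
    return [sum(column) / n for column in zip(*draws)]


posterior_probs = marginal_inclusion(stored)
print(len(stored))
print(posterior_probs)
```

With 100,000 post-burn-in cycles and a thinning interval of 10, the number of stored draws is 10,000, matching the count given above.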

**Comparisons between microsatellites and SNPs on chromosome 4**. Results from both microsatellites and SNPs show similar patterns for markers with evidence of linkage to the disease genes.

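Comparisons of SNPs and microsatellites for main effect and two-way interaction effect screening. Both microsatellite and SNP analyses show a strong and frequent main effect in chromosome 4, whereas epistatic effects are located differently.

| Marker type | Ranking | Chromosome | Marginal posterior probabilities |
|---|---|---|---|
| Microsatellites | 1 | Chr 4 | 0.21133 |
| Microsatellites | 2 | Chr 6, Chr 13, Chr 16 | 0.15433 |
| Microsatellites | 3 | Chr 4 (2 markers^{a}), Chr 10 | 0.13922 |
| Microsatellites | 4 | Chr 23, Chr 17, Chr 7 | 0.09066 |
| Microsatellites | 5 | Chr 23, Chr 2 | 0.08533 |
| Microsatellites | 6 | Chr 1 | 0.08466 |
| Microsatellites | 7 | Chr 13 | 0.07577 |
| Microsatellites | 8 | Chr 16 | 0.07422 |
| Microsatellites | 9 | Chr 14 (2 markers) | 0.07266 |
| Microsatellites | 10 | Chr 7 | 0.06944 |
| Microsatellites | 11 | Chr 17 | 0.06922 |
| Microsatellites | 12 | Chr 3 | 0.06622 |
| Microsatellites | 13 | Chr 20 | 0.05766 |
| Microsatellites | 14 | Chr 8 × Chr 15^{b} | 0.05644 |
| Microsatellites | 15 | Chr 10 × Chr 17^{b} | 0.03244 |
| SNPs | 1 | Chr 4 | 0.2068 |
| SNPs | 2 | Chr 4 (2 markers^{a}) | 0.1793 |
| SNPs | 3 | Chr 23 | 0.1786 |
| SNPs | 4 | Chr 4 | 0.1725 |
| SNPs | 5 | Gender | 0.1703 |
| SNPs | 6 | Chr 3, Chr 13 | 0.1563 |
| SNPs | 7 | Chr 23 | 0.1516 |
| SNPs | 8 | Chr 23 × sex^{b} | 0.1461 |
| SNPs | 9 | Chr 23 | 0.1295 |
| SNPs | 10 | Chr 4, Chr 6 | 0.128 |
| SNPs | 11 | Chr 6, Chr 23 | 0.1256 |
| SNPs | 12 | Chr 3 | 0.1247 |
| SNPs | 13 | Chr 16 | 0.1237 |
| SNPs | 14 | Chr 7, Chr 4 | 0.1208 |
| SNPs | 15 | Chr 14 | 0.0209 |

^{a} Two markers are ranked.

^{b} Epistatic effect between the two chromosomes.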

Discussion

In this study, we compared genome-wide linkage analyses based on microsatellites and SNPs. Our method located main effects of markers in both the microsatellite and SNP scans, and the two scans produced similar patterns. However, the results of the epistatic effect screening are less consistent and less revealing. This may simply be because these epistatic effects are weak in nature; further research in this area is warranted.

Conclusion

Bayesian genome screening methods provide a powerful and efficient tool for identifying potential markers and their epistatic effects. They are effective because they can search over the entire model space, whereas the frequentist best subset model selection procedure is constrained by the computing power required to examine all candidate models. In addition, Bayesian genome screening methods can handle problems with many more candidate variables, which is essential when epistatic effects are studied. When one tries to locate epistatic effects, the number of covariates (factors) easily outnumbers the sample size by a wide margin. Most traditional linkage methods do not work under this condition because they often assume a single-gene model and test effects one at a time. By using prior structures that reflect the relationships among the candidate variables, our general approach can accommodate a large number of candidate markers as well as their epistatic effects by evaluating all factors simultaneously. In both the microsatellite and SNP scans, we located markers on chromosome 4 that show strong evidence of linkage with the alcoholism-related quantitative phenotype "maximum number of drinks consumed in a 24-hour period", and we found weak evidence for epistatic effects.

Abbreviations

COGA: Collaborative Study on the Genetics of Alcoholism

GAW14: Genetic Analysis Workshop 14

IBD: Identical-by-descent

MCMC: Markov chain Monte Carlo

SNP: Single-nucleotide polymorphism

SSVS: Stochastic search variable selection

Authors' contributions

CO participated in the design of the study, performed the analysis, and drafted the manuscript. SW helped to obtain IBD values for linkage analysis. SW, NL, LC, and HZ participated in the design and the discussion of the study, and the preparation of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Supported in part by NIH grant R01 GM59507 and NSF grant DMS 0241160.