Department of Biostatistics, School of Public Health, Columbia University, 722 West 168th Street, New York, NY 10032, USA

Department of Mathematics and Statistics, Georgia State University, 750 COE, 7th Floor, 30 Pryor Street, Atlanta, GA 30303, USA

Abstract

Single-locus analysis is often used to analyze genome-wide association (GWA) data, but such analysis is subject to severe multiple-comparisons adjustment. Multivariate logistic regression has been proposed to fit a multi-locus model for case-control data. However, when the sample size is much smaller than the number of single-nucleotide polymorphisms (SNPs) or when correlation among SNPs is high, traditional multivariate logistic regression breaks down. To accommodate the scale of data from a GWA study while controlling for collinearity and overfitting in a high-dimensional predictor space, we propose a variable selection procedure using Bayesian logistic regression. We explored a connection between Bayesian regression with certain priors and L1- and L2-penalized logistic regression. After analyzing a large number of SNPs simultaneously in a Bayesian regression, we selected important SNPs for further consideration. With far fewer SNPs of interest, the problems of multiple comparisons and collinearity are less severe. We conducted simulation studies to examine the probability of correctly selecting disease-contributing SNPs and applied the developed methods to analyze the Genetic Analysis Workshop 16 North American Rheumatoid Arthritis Consortium data.

Background

Single-locus analysis is a widely used approach to analyze genome-wide association (GWA) data, but it may not be adequate to capture the complex patterns of disease etiology.

To accommodate the large number of SNPs from a GWA study while controlling for collinearity and overfitting in a high-dimensional predictor space, we propose a variable selection procedure using Bayesian logistic regression. We explored a connection between certain priors and penalized logistic regression. After analyzing a large number of SNPs simultaneously in a Bayesian logistic regression, we selected important SNPs for further consideration. With far fewer selected SNPs of interest, the problems of multiple comparisons and collinearity are less severe. We conducted simulation studies to examine the probability of correctly selecting disease-contributing SNPs. Finally, we applied the methods to analyze Genetic Analysis Workshop (GAW) 16 Problem 1 chromosome 9 data.

Methods

Logistic regression is commonly used to model dichotomous dependent variables. The general form of logistic regression is:

logit(p_i) = log[p_i / (1 − p_i)] = β_0 + Σ_{j=1}^{p} β_j x_{ij},   (1)

where p_i is the probability that subject i is a case and x_{ij} codes the genotype of SNP j for subject i.

Maximum likelihood is used to estimate the parameters of the model. When the number of predictors exceeds the sample size, traditional logistic regression breaks down. In addition, when the predictors are highly correlated, the maximum likelihood estimate from Eq. (1) is of poor quality.

Gaussian prior and L2 penalty

In a Bayesian logistic regression, the coefficients β_j in Eq. (1) follow some prior distribution. There is a connection between the Gaussian prior and L2-penalized logistic regression. Specifically, if we assume the β_j are independent and each follows a Gaussian distribution with mean 0 and variance σ_j², then finding the posterior mode is equivalent to fitting an L2-penalized logistic regression. The variance σ_j² represents the prior belief about whether β_j will be near zero: a small value of σ_j² indicates that β_j is close to zero, and a large value indicates a less informative prior. Here we assume all σ_j² have a common value σ². L2-penalized logistic regression has been proposed to deal with overfitting and collinearity when the number of predictors is large; it penalizes the negative log-likelihood by the squared L2-norm of the coefficients, that is, it minimizes

−l(β) + λ Σ_{j=1}^{p} β_j²,   (2)

where l(β) is the log-likelihood from Eq. (1). Choosing the prior variance σ² is equivalent to choosing the smoothing parameter λ = 1/(2σ²).
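The Gaussian-prior/L2 correspondence can be sketched with off-the-shelf software. The snippet below is illustrative only (the paper itself used BBRBMR, not scikit-learn); note that scikit-learn weighs the L2 penalty as (1/(2C))·||β||², so a prior variance σ² maps to C = σ² under the λ = 1/(2σ²) relation. The simulated data and chosen σ² are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated dichotomous predictors and case-control status (illustrative only)
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)
true_beta = np.zeros(p)
true_beta[:5] = 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta - 0.5))))

# In the objective -l(beta) + lambda * sum(beta_j^2), the Gaussian prior
# variance sigma^2 corresponds to lambda = 1/(2*sigma^2); scikit-learn's
# L2 penalty is (1/(2C))*||beta||^2, so C = sigma^2.
sigma2 = 1.0
ridge = LogisticRegression(penalty="l2", C=sigma2, solver="lbfgs", max_iter=1000)
ridge.fit(X, y)
coef = ridge.coef_[0]  # posterior-mode (MAP) estimates under the Gaussian prior
```

A smaller σ² (smaller C) pulls all coefficient estimates harder toward zero, mirroring a more informative prior.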

Laplace prior and L1 penalty

If we assume that the β_j are independent and each follows a Laplace prior, p(β_j | λ_j) = (λ_j/2) exp(−λ_j |β_j|), in a Bayesian logistic regression, then finding the posterior mode is equivalent to fitting a logistic regression with an L1 penalty, which minimizes

−l(β) + λ Σ_{j=1}^{p} |β_j|.   (3)

While L2-penalized regression shrinks coefficients towards zero, it does not force any of them to be exactly zero. In contrast, L1-penalized regression yields sparse solutions in which a large number of coefficients are exactly zero. Here we assume the prior parameters λ_j take a common value λ.
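The sparsity contrast between the two priors can be seen directly by fitting both penalties to the same data. This sketch uses scikit-learn (not the paper's BBRBMR software), where the Laplace rate λ maps to C = 1/λ; the data and C value are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 200, 100
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)
true_beta = np.zeros(p)
true_beta[:5] = 1.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta - 1.0))))

# Laplace prior rate lambda corresponds to C = 1/lambda in scikit-learn;
# a small C (strong prior) drives many coefficients to exactly zero.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
ridge = LogisticRegression(penalty="l2", C=0.1, solver="liblinear")
lasso.fit(X, y)
ridge.fit(X, y)
n_zero_l1 = int((lasso.coef_[0] == 0).sum())  # many exact zeros
n_zero_l2 = int((ridge.coef_[0] == 0).sum())  # shrunk but nonzero
```

With identical data and penalty strength, the L1 fit zeroes out most coefficients while the L2 fit keeps them all small but nonzero.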

Selecting prior parameters

Choosing the prior variance of the parameters in a Bayesian regression, or equivalently, the regularization parameter in a penalized regression, is important for variable selection. A small prior variance provides more shrinkage towards zero or favors more coefficients being zero; a large prior variance reflects more uncertainty in the prior information. The prior variance was chosen by 10-fold cross-validation. The sample was split randomly into 10 parts; the model was fit on 9 of the 10 parts, and the log-likelihood was computed on the remaining part. This procedure was repeated for each of the 10 parts and the average log-likelihood was calculated. The prior variance was chosen as the value that maximizes the "cross-validated" average log-likelihood.
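The cross-validation procedure above can be sketched as follows. This is a minimal re-implementation under assumed data and a hypothetical candidate grid, again expressing the prior variance through scikit-learn's inverse-penalty parameter C rather than the paper's software.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_avg_loglik(X, y, C, n_splits=10, seed=0):
    """Average held-out log-likelihood across folds for one candidate value
    of the prior variance (expressed through the inverse penalty C)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in kf.split(X):
        m = LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=1000)
        m.fit(X[train], y[train])
        prob = np.clip(m.predict_proba(X[test])[:, 1], 1e-12, 1 - 1e-12)
        scores.append(np.mean(y[test] * np.log(prob)
                              + (1 - y[test]) * np.log(1 - prob)))
    return float(np.mean(scores))

rng = np.random.default_rng(2)
X = rng.binomial(1, 0.4, size=(300, 20)).astype(float)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1]))))

grid = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate prior variances
best_C = max(grid, key=lambda C: cv_avg_loglik(X, y, C))
```

The selected value is simply the grid point with the highest cross-validated average log-likelihood.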

Simulations

We performed simulation studies to examine the effectiveness of Bayesian logistic regression as a variable selection procedure. We simulated 100 dichotomous predictors from Bernoulli distributions; the probability of each predictor being one was generated from a uniform distribution.

We fit Bayesian logistic regressions with Gaussian and Laplace priors using the BBRBMR software.
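A minimal sketch of this simulate-fit-select pipeline is given below, using scikit-learn in place of BBRBMR. The uniform range for the predictor probabilities, the odds-ratio range for the 10 disease-contributing predictors, and the intercept are assumptions chosen to resemble the settings described in the Results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p, n_causal = 250, 100, 10

# Each dichotomous predictor is Bernoulli with its own success probability
# drawn from a uniform distribution (the exact range here is assumed).
probs = rng.uniform(0.1, 0.5, size=p)
X = rng.binomial(1, probs, size=(n, p)).astype(float)

# 10 disease-contributing predictors with odds ratios between 2 and 2.5
beta = np.zeros(p)
beta[:n_causal] = np.log(rng.uniform(2.0, 2.5, size=n_causal))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta - 1.0))))

# Fit the penalized (Gaussian-prior) regression and rank predictors by the
# magnitude of their coefficients, keeping the top 20.
fit = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
fit.fit(X, y)
top20 = np.argsort(-np.abs(fit.coef_[0]))[:20]
n_correct = int(np.isin(top20, np.arange(n_causal)).sum())
```

Repeating this over many simulated data sets gives the average number of correctly identified predictors and the per-predictor selection consistency reported in the Results.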

NARAC data analysis

All analyses were performed on the GAW16 Problem 1 North American Rheumatoid Arthritis Consortium (NARAC) data. We analyzed 2705 SNPs on chromosome 9, spanning positions 91,730,970 to 138,303,776 bp, with minor allele frequency greater than 0.01 and no missing genotypes. This region covers the location of the most significant SNP (rs3761847) reported by Plenge et al.

We divided the sample into a discovery sample, on which the Bayesian logistic regression was fit, and an independent testing sample of 1031 subjects, on which single-SNP analyses were performed.

Results

Simulations

For the Gaussian prior with sample size 250 and high odds ratios (odds ratios ranging from 2 to 2.5), the average number of correctly identified SNPs among the top 20 SNPs selected by the magnitude of the regression coefficients is 8.3 (out of the 10 disease-associated SNPs). For the same prior and sample size but with moderate odds ratios (odds ratios ranging from 1.5 to 2), the average number of correctly identified SNPs is 6.7. When the sample size is decreased to 150, the average numbers of correctly identified SNPs in the high and moderate odds-ratio models are 7.4 and 6.4, respectively. The consistencies (the average percentage of times each disease-contributing variable was selected across simulated data sets) in the above four settings range from 0.73 to 0.97, 0.53 to 0.77, 0.60 to 0.87, and 0.57 to 0.73, respectively. For the Laplace prior, the average numbers of correctly identified SNPs in the four settings were 6.7, 4.5, 4.2, and 4.0, respectively, and the consistencies were lower than those for the Gaussian prior.

NARAC data analysis

For the Bayesian logistic regression with 2705 SNPs, the number of iterations in the Markov chain Monte Carlo calculation was 250 for the additive model and 187 for the dominant model. Convergence was reached with threshold 0.005. For the dominant model, the highest absolute z-score was 7.16 (rs7864653; Table 1).

**Figure 1. p-Values for 2705 SNPs in the Bayesian logistic regression (Gaussian prior): additive model (left panel) and dominant model (right panel).**

**Table 1. Bayesian logistic regression of 2705 SNPs on chromosome 9**

| Rank | SNP (additive model) | Position | abs(z-score) | SNP (dominant model) | Position | abs(z-score) |
|------|----------------------|----------|--------------|----------------------|----------|--------------|
| 1 | rs1407869 | 101353456 | 8.02 | rs7864653 | 100860678 | 7.16 |
| 2 | rs4437724 | 113188649 | 7.76 | rs10989329 | 100794635 | 7.16 |
| 3 | rs10120479 | 111426956 | 6.97 | rs4237190 | 97922972 | 6.58 |
| 4 | rs9697192 | 116879138 | 6.97 | rs6478644 | 123942505 | 6.42 |
| 5 | rs3824535 | 122763410 | 6.90 | rs1407869 | 101353456 | 6.03 |
| 6 | rs10491578 | 116463442 | 6.39 | rs2229594 | 101204219 | 5.97 |
| 7 | rs10121681 | 111718477 | 6.37 | rs10820559 | 103716588 | 5.87 |
| 8 | rs694428 | 117692812 | 6.13 | rs1536705 | 126851425 | 5.86 |
| 9 | rs2900180 | 120785936 | 5.96 | rs2564362 | 123365200 | 5.74 |
| 10 | rs11243755 | 132287257 | 5.96 | rs10978456 | 106155366 | 5.73 |

We selected the top 300 SNPs and performed single-SNP analysis using the independent testing set of 1031 subjects. An LD plot revealed that the selected SNPs had lower intermarker LD than the total marker map (not shown). Table 2 lists the top-ranked SNPs from this analysis.
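A single-SNP association test of the kind applied to the testing set can be sketched as follows. This is an illustrative chi-square test under a dominant genotype coding on simulated data; the paper's exact single-SNP test statistic is not specified here, so this is an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

def single_snp_pvalue(genotype, status):
    """Chi-square test of association for one SNP under a dominant coding
    (carrier vs. non-carrier) against case-control status."""
    carrier = (np.asarray(genotype) > 0).astype(int)
    status = np.asarray(status)
    # 2x2 table: rows = carrier status, columns = case-control status
    table = np.array([[np.sum((carrier == g) & (status == s)) for s in (0, 1)]
                      for g in (0, 1)])
    _, pval, _, _ = chi2_contingency(table)
    return float(pval)

# Simulated genotypes (coded 0/1/2) and disease status for 1031 subjects
rng = np.random.default_rng(4)
geno = rng.integers(0, 3, size=1031)
status = rng.binomial(1, 0.5, size=1031)
p_value = single_snp_pvalue(geno, status)
```

Running such a test on only the 300 pre-selected SNPs, rather than all 2705, is what keeps the multiple-comparisons burden manageable in the two-stage design.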

**Figure 2. p-Values of the top 300 SNPs selected by β or z-scores (single-SNP analysis): additive model (left panel) and dominant model (right panel).**

**Table 2. Single-SNP analysis of the top 300 selected SNPs**

| Rank | SNP (additive model) | Position | p-Value | SNP (dominant model) | Position | p-Value |
|------|----------------------|----------|---------|----------------------|----------|---------|
| 1 | rs2900180 | 120785936 | 6.24 × 10^{-9} | rs2900180 | 120785936 | 6.24 × 10^{-9} |
| 2 | rs1953126 | 120720054 | 2.76 × 10^{-8} | rs11787779 | 114820894 | 6.89 × 10^{-5} |
| 3 | rs942152 | 121031239 | 3.94 × 10^{-6} | rs17148869 | 132180015 | 1.00 × 10^{-4} |
| 4 | rs7858974 | 91959665 | 1.26 × 10^{-5} | rs7862566 | 117133575 | 2.00 × 10^{-4} |
| 5 | rs11787779 | 114820894 | 6.89 × 10^{-5} | rs4978629 | 107708375 | 3.00 × 10^{-4} |
| 6 | rs6478300 | 117115323 | 7.12 × 10^{-5} | rs4978890 | 110046695 | 3.00 × 10^{-4} |
| 7 | rs989980 | 106309592 | 1.00 × 10^{-4} | rs1333914 | 119662788 | 4.00 × 10^{-4} |
| 8 | rs17148869 | 132180015 | 1.00 × 10^{-4} | rs1332408 | 122271713 | 4.00 × 10^{-4} |
| 9 | rs7862566 | 117133575 | 2.00 × 10^{-4} | rs2095069 | 94782055 | 0.001 |
| 10 | rs945246 | 119953710 | 2.00 × 10^{-4} | rs4743420 | 100567644 | 0.0011 |

Discussion

We propose a Bayesian logistic regression procedure to select important SNPs based on the magnitude of the estimated regression coefficients or their z-scores.

Among the top 300 selected SNPs, the most significant single-SNP associations reach p-values below 10^{-6}. The other five SNPs are not in the candidate region and are not in LD with SNPs in the region. The significance of these SNPs deserves further investigation in an independent sample.

An alternative one-step approach would be reporting permutation p-values.

We only analyzed SNPs with no missing data because the BBRBMR software cannot handle missing covariate data. One solution is to first impute the missing genotypes and then run the Bayesian regression on the imputed data. An alternative is to handle the missing data directly in a Bayesian analysis by data augmentation.

Here the priors are assumed to be independent with a common variance, chosen by cross-validation. An alternative strategy would be to specify a hyper-prior distribution (such as a non-informative prior). To incorporate prior knowledge such as the physical distance between SNPs, one could specify a prior distribution with distance-based correlation. How to specify such a correlation for a large-scale regression deserves further attention.

Conclusion

Large-scale Bayesian logistic regression is useful for analyzing genome-wide case-control data with a large number of SNPs. Coefficient estimates or z-scores from the regression can be used to rank SNPs and select a manageable subset for follow-up analysis.

List of abbreviations used

GAW: Genetic Analysis Workshop; GWA: Genome-wide association; HWE: Hardy-Weinberg equilibrium; LD: Linkage disequilibrium; NARAC: North American Rheumatoid Arthritis Consortium; SNP: Single-nucleotide polymorphism

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YW designed the study, performed the statistical analysis, and drafted the manuscript. NS performed the statistical analysis. YF participated in the study design and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. YW (PI) and YF (subcontract PI) are supported by NIH grant AG031113-01A2.

This article has been published as part of