Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation

Allele-specific DNA methylation (ASM) is a hallmark of imprinted genes, but ASM in the larger nonimprinted fraction of the genome is less well characterized. Using methylation-sensitive SNP analysis (MSNP), we surveyed the human genome at 50K and 250K resolution, identifying ASM as recurrent genotype call conversions from heterozygosity to homozygosity when genomic DNAs were predigested with the methylation-sensitive restriction enzyme HpaII. Using independent assays, we confirmed ASM at 16 SNP-tagged loci distributed across various chromosomes. At 12 of these loci (75%), the ASM tracked strongly with the sequence of adjacent SNPs. Further analysis showed allele-specific mRNA expression at two loci from this methylation-based screen—the vanin and CYP2A6-CYP2A7 gene clusters—both implicated in traits of medical importance. This recurrent phenomenon of sequence-dependent ASM has practical implications for mapping and interpreting associations of noncoding SNPs and haplotypes with human phenotypes.

Allele-specific CpG methylation (ASM) is a hallmark of genomic imprinting, but the characteristics of ASM at nonimprinted loci are less well understood. DNA methylation tracking as a genetic trait was found in early studies of human VNTR loci 1,2 , and a recent study using genome fractionation uncovered a CpG island with parent of origin-independent ASM on chromosome 21 (ref. 3). We have described a microarray-based approach, called MSNP, for genetic and epigenetic profiling 4 . Genomic DNA is digested with a nonmethylation-sensitive restriction enzyme with or without a second, methylation-sensitive, restriction enzyme, such as HpaII. Probes are synthesized by linker ligation and PCR and hybridized to Affymetrix SNP arrays. Fragments with internal unmethylated HpaII sites drop out from the plus-HpaII representations. Thus, by comparing the results in the minus-and plus-HpaII representations, we can measure net methylation and also detect ASM, which appears as a conversion from an AB genotype call to BB or AA.
Here, we report genome-wide surveys for ASM using MSNP at 50K and 250K resolution. The 50K dataset consisted of single XbaI and XbaI + HpaII representations from 12 normal tissue samples: peripheral blood leukocyte (PBL) DNA from four adults, kidney DNA from one adult, brain and lung DNA from one adult, DNA from three placentas (sampled at the fetal surface), and buccal cell DNA from two adults. On this 50K array, there are 39,100 class 1 SNPs (noninformative for DNA methylation, because of absence of HpaII sites within the XbaI fragment), 15,057 class 2 SNPs (informative, because of the presence of nonpolymorphic HpaII sites) and 4,328 class 3 SNPs (not reliably informative, because of polymorphic HpaII sites). In assessing the SNP-tagged loci for ASM, we first eliminated those with weak signals-below the 10th percentile in the XbaI representations-in Z5 samples. We next asked for allele call conversions in Z3 samples in the plus-HpaII representations, producing a list of 74 SNP-tagged loci. Lastly, we examined the hybridization quality and the selective reduction in intensity of one of the two alleles for each of these loci using the perfect match/mismatch (PM/MM) view in dChip 5 .
We brought forward two of the most convincing candidate loci for independent validations. SNP rs1042073, in exon 13 of the lactotransferrin (LTF) gene, showed AB-AA conversions with HpaII predigestion in five tissues (PBL, brain, kidney, lung and placenta), each obtained from a different individual, whereas no AB-BB conversions were seen (Fig. 1a). These results predicted that the C allele would be consistently more methylated than the T allele at one or both of the two HpaII sites in this XbaI fragment. As shown for several tissues from different individuals (Fig. 1), PCR/RFLP and sequencing consistently showed a reduced representation of the T allele when the DNA had been predigested with HpaII, and the same genotype dependence was seen in an expanded series ( Table 1 and Supplementary Table 1 online). To validate this correlation between ASM and SNP genotype using a non-PCR-based method, we did DNA blotting of genomic DNAs from 21 individuals with the 3 possible rs1042073 genotypes. ASM was readily scored as a doublet band pattern, biallelic methylation as a single upper band, and biallelic lack of methylation as a single lower band. As shown in Figure 1d, there was an invariant relationship between the rs1042073 genotype and the pattern of methylation, with C-allele homozygotes showing biallelic methylation, CT heterozygotes showing ASM and TT homozygotes showing biallelic lack of methylation at this site. Given this notably constant relationship, we sought artifactual explanations, such as the presence of a polymorphic HpaII site created by or in linkage disequilibrium with the index SNP in the same restriction fragment. However, sequencing excluded this possibility. As ASM is often found at imprinted loci, we also sought to exclude imprinting as an explanation. We addressed this question by using trios of DNA from maternal and paternal PBL and placenta (fetal surface) from the offspring. Among the informative trios, the methylated allele in the LTF exon 13 region was paternal in origin in four but maternal in origin in one, thus excluding parental imprinting as an explanation for the ASM (Supplementary Fig. 1 online).
We obtained similar results for a second locus from the 50K survey, tagged by SNP rs9258170 in the upstream region of the HLA-F gene. There were AB-AA call conversions with HpaII predigestion in each of five samples (one kidney, one lung, one brain, one placenta, and two buccal epithelial cell preparations), and no AB-BB conversions. As shown by pre-digestion followed by PCR, ASM in the HLA-F upstream region correlated strongly, although not invariably, with the local DNA sequence, as scored by SNPs and indels ( Table 1 and Supplementary Fig. 2 online). To ask whether the ASM at the HpaII sites in this region might reflect more widespread ASM at other CpG sites, we carried out bisulfite sequencing of a region near the index SNP. This procedure showed a patch of differential methylation affecting at least 17 CpG dinucleotides ( Supplementary Fig. 2).
We next carried out MSNP using 250K StyI SNP arrays, containing 164,561 class 1 (69%), 57,583 class 2 (24%), and 15,906 class 3 SNPs (7%). Our 250K survey included six PBL samples, two normal bone marrow samples, one sample of CD34 + Linhematopoietic cells, and three placental samples. We analyzed duplicate technical replicates for StyI and StyI + HpaII representations, and single determinations for StyI + MspI (60 microarrays). We asked for call conversions from AB in StyI to AA or BB in StyI + HpaII in both technical replicates in at least 2 of the 12 biological samples. To exclude loci with biallelic hypomethylation, we additionally required that the average StyI + HpaII intensity be reduced not more than 70% from the average StyIonly intensity, and that the intensity in the StyI + MspI representation be at least 40% less than the mean signal with StyI + HpaII. This procedure produced a list of 58 loci as candidates for ASM. As expected, this list was depleted in class 1 SNPs (0, 0%) and enriched in class 2 (31, 53%) and class 3 (27, 47%) SNPs. The opposite analysis, asking for all class 2 SNPs with persistent AB calls and preserved intensity of both alleles in the plus-HpaII representations (reduced not more than 70% from the minus-HpaII intensity) in all heterozygous samples (requiring at least two of the samples to be heterozygous), with a signal intensity in the StyI + MspI representation at least 40% less than the mean signal with StyI + HpaII, yielded 15,454 class 2 SNPs. So within the sensitivity of this method and given the validations and determination of the true-positive rate below, we estimate that at least 0.16% of the informative SNP-tagged loci queried by the array have recurrent ASM in the tissues examined.
As internal controls in this 250K dataset, we found consistent call conversions at SNPs in three imprinted domains: in a PBL sample at SNP rs11161318 located 0.5 kb downstream of the 3¢ end of the MAGEL2 gene, in a bone marrow sample at SNP rs231907 in the KCNQ1 gene, and in the CD34 + stem cells at SNP rs2107425 upstream of the H19 gene. Monoallelic expression of MAGEL2 mRNA as a result of parental imprinting has been shown previously 6,7 , and DNA methylation occurs preferentially on the maternal allele at multiple sites over this imprinted domain on chromosome 15 (ref. 8), but ASM has not been previously examined in the MAGEL2 downstream region. We therefore validated the MSNP findings for this SNP-tagged StyI fragment using pre-digestion followed by PCR and RFLP. As expected for an imprinted locus, in which parent-of-origin, not DNA sequence, dictates which allele  Figure 1 Genotype-dependent ASM in LTF intron 13. (a) Hybridization data from 50K XbaI MSNP. Perfect match (pm) and mismatch (mm) intensities are shown, and the genotype calls from XbaI and XbaI + HpaII genomic representations are indicated. With addition of HpaII, there is a call conversion from AB to AA at SNP rs1042073. (b) Predigestion followed by PCR and RFLP assay validating ASM in the XbaI fragment containing SNP rs1042073. Genomic DNA was digested with XbaI or with XbaI + HpaII, followed by PCR with the indicated primers (arrows). PCR products were digested with BsrDI to score the RFLP corresponding to SNP rs1042073 (asterisk). In each of these samples and in additional individuals ( Table 1), the T allele drops out with HpaII predigestion, indicating relative hypomethylation. The methylation status of the two HpaII sites between the primers, based on this assay and genomic DNA (Southern) blotting, is indicated by the circles (methylated, black; ASM, half-black). The image of the ethidium-stained gel is inverted to show the bands as dark on a light background. (c) Confirmation of ASM by predigestion/PCR/sequencing. The double peak indicating heterozygosity at SNP rs1042073 in the XbaI-digested samples is nearly a single peak in PCR products from the XbaI + HpaII pre-digested samples. (d) Genotype-dependent ASM downstream of LTF exon 13 indicated by DNA blotting. In a series of 21 individuals, 10 with rs1042073 genotype CC, 8 with genotype CT and 3 with genotype TT, the ASM at HpaII site 2, seen as the doublet in the RsaI + HpaII lane, was only observed in tissues from individuals who were heterozygous: individuals homozygous for the C allele at this SNP showed full methylation of this HpaII site, whereas those homozygous for the T allele lacked methylation at this site. Overall P ¼ 1.72 Â 10 À8 , Fisher's exact test with null hypothesis of random allelic methylation. B, BsrDI; H, HpaII; M, MspI; R, RsaI; X, XbaI.
becomes methylated, ASM in the MAGEL2 region did not correlate with the SNP genotype ( Supplementary Fig. 3 online).
We next selected a group of loci reported as having ASM in the 250K data, outside of known imprinted regions, for independent validations in larger series of individuals. These assays validated ASM for 14/18 candidate loci examined (true-positive rate 78%). For the minority of loci that could not be validated, it was evident from PCR that biallelic hypomethylation, with relatively greater sensitivity of the SNP array probe sets for one allele, accounted for the false-positive call conversions. The MSNP data and validations by pre-digestion followed by PCR and bisulfite sequencing are shown for the CYP2A7 and VNN1 genes in Figures 2 and 3, and examples of validations of ASM at additional SNP-tagged loci are shown in Supplementary  Figures 3-5 online. In the combined 50K and 250K data, the ASM correlated strongly with the allele in cis at the index SNP or closely adjacent informative SNPs for 12/16 (75%) of the validated loci ( Table 1 and Supplementary Table 1).
The known relationship between DNA methylation and gene expression raised the question of whether this sequence-dependent ASM might be associated with allele-specific mRNA expression (ASE).
Among four loci that we tested by comparing allelic representation in cDNA versus genomic DNA, only rare individuals showed ASE of the LTF and HLA-F genes, whereas many showed strong ASE of the CYP2A7 and VNN1 genes. Both of these positive examples are in gene clusters, on chromosome 19 and 6, respectively, which lack classical upstream CpG islands and show tissue specific expression. Although our MSNP screen was done using hematopoietic tissues, we directly assayed ASM and ASE for CYP2A7 in the main expressing organ, the liver. Comparison of allelic representation in genomic and cDNA PCR products from the CYP2A7 gene showed strong ASE in 10 of 17 heterozygous human liver samples examined, with relative overexpression of the same allele in each of these 10 individuals, and sequencing across multiple intragenic SNPs revealed that the presence or absence of ASE was strongly dependent on genotype (Fig. 2b and Supplementary Table 2 online). For VNN1, we confirmed ASM by predigestion followed by RFLP analysis near the index SNP and by bisulfite sequencing across a CG-rich region, not dense enough to qualify as a CpG island, between the index SNP and the first exon (Fig. 3a,c,d). We found a consistent difference in methylation patterns between the two alleles, already detectable early in the  Fig. 3d). By comparing cDNA and genomic PCR products in PBL, we documented a domain-wide effect, with ASE strongest for VNN1 but also evident for the other two genes (VNN3 and VNN2) in this cluster (Fig. 3b).
Both of these examples are medically relevant: prior genotypephenotype correlations for CYP2A6 (located immediately downstream of CYP2A7 and 93% identical in amino acid sequence) have shown that genes in this cytochrome P450 cluster control blood nicotine levels and may thereby influence smoking behavior 9 , and a recent report described total VNN1 mRNA expression as a quantitative trait linked to haplotypes, and in turn to HDL cholesterol levels 10 . Future methylation-based screens using the methods developed here will likely uncover additional examples of sequence-dependent ASM linked to important human traits. This dependence of ASM on the local DNA sequence in cis is distinct from other epigenetic phenomena such as imprinting, allelic exclusion, random monoallelic methylation and expression [11][12][13] , and epimutation (which may be haplotype-dependent in some cases [14][15][16][17]. Mechanisms leading to sequence-dependent ASM will likely differ among loci: allele-specific affinity for DNA-binding proteins with downstream effects on DNA methylation is one obvious possibility and our findings, and in conjunction with prior data 10,[18][19][20] , suggests that this model may apply to the CYP2 and vanin gene clusters. Direct effects of the DNA sequence on propensity for methylation are another   Table 1 for sequence dependence of the ASM at this location in the CYP2 gene cluster, queried by SNPs rs3844442 and rs34724660. (b) ASE of CYP2A7 mRNA in human livers. Liver DNAs were screened for heterozygosity at SNP rs3815710, and 17 heterozygous samples from different individuals were analyzed for ASE: 10 showed strongly preferential expression of the C allele (Liv2, Liv3 and Liv4, for example) and 7 showed equal biallelic expression or a slight preferential expression of the A allele (Liv1). As shown in Supplementary Table 2, the presence or absence of ASE was significantly associated with the genotype at SNP rs3815706 in CYP2A7 intron 3. Sequencing of cloned partial cDNAs showed that the two common haplotypes in this highly polymorphic gene differ by Z6 amino acids in the region spanning exon 2a to exon 6, suggesting that the allelic expression bias may have functional consequences (Supplementary Table 2 and data not shown). (c) Confirmation of ASM in the StyI fragment containing the index SNP in human livers by predigestion/PCR/sequencing. The double peak indicating heterozygosity at SNP rs34724660 becomes nearly a single peak in PCR products from the HpaII predigested samples. (d) ASM shown by bisulfite sequencing at two positions (gray rectangles) in the CYP2 gene cluster. The bisulfite clones were assigned to each allele in the index region using SNP rs34724660; the clones from the intragenic CpG island overlapping exon 2b of the CYP2A7 gene were assigned using SNP rs3815710. Dashes in the bisulfite sequences indicate polymorphic CpG sites present on one allele and absent on the other. Of 47 PBL samples screened for the rs2294757 SNP in VNN1 exon 1, 24 were heterozygous. Of these, 17 were analyzed for ASE: 16 showed strongly preferential expression of the A allele (PBL4 and PBL5, for example; sequence chromatograms are in the reverse direction), whereas only one sample showed equal biallelic expression. In a smaller series of heterozygotes, ASE was also observed for VNN3 and VNN2: of seven heterozygous samples examined for ASE of VNN2, four showed preferential expression of the A allele (SNP rs1883613), whereas one showed preferential expression of the G allele and two showed equal biallelic expression; of six heterozygous samples examined for ASE of VNN3, all showed preferential expression of the A allele (SNP rs2294759; sequence chromatograms are in the reverse direction). (c) Predigestion followed by PCR and RFLP and sequencing validating ASM at a HpaII site in the StyI fragment containing SNP rs12205095. Genomic DNA was mock-digested or digested with HpaII, followed by PCR with the indicated primers (arrows). PCR products were digested with TaqI to score the RFLP corresponding to SNP rs6569826 (asterisk). In PBL3 and in additional individuals ( possibility 21 . In practical terms, the phenomenon of sequence-dependent ASM extends the concept of haplotypes to include epigenetic marks ('epihaplotypes') and, as an indicator for the presence of nearby regulatory polymorphisms that confer the allelic asymmetry, it will likely be useful for fine mapping and interpreting associations of noncoding SNPs with complex genetic diseases.

METHODS
MSNP procedure. Genomic DNA from normal human tissues, obtained with institutional review board approval and anonymous to individual identifiers, was purified using the DNeasy Blood and Tissue Kit (Qiagen). For both 50K XbaI and 250K StyI MSNP, we followed the recommended protocol of the SNP array manufacturer (Affymetrix), but with the addition of an initial HpaII or MspI digestion for the plus-HpaII and plus-MspI probes. Genomic DNA (250 ng per representation) was digested with HpaII or MspI in a total volume of 12 ml (Buffer 1; New England Biolabs) for 3 h, followed by adjustment of the buffer by addition of 15.8 ml of water and 3.2 ml of 10Â Buffer 3 (New England Biolabs), and a further digestion with StyI for 3 h. StyI linkers were ligated to the restriction fragments, and PCR was carried out for 30 cycles with linker primers. We carried out PCR purification using the DNA Amplification Clean-Up Kit (Clontech), and probe fragmentation (brief exposure to DNase), biotinlabeling and hybridization as described in the Affymetrix protocol. We determined intensity values by normalization and model-based expression (PM-only method) using dChip 5 . The main determinant of the data quality is the starting material: the genomic DNA must be high molecular weight. Call rates were in the range of 89-95% with StyI and StyI + HpaII, with lower call rates of 76-88%, as predicted, in the StyI + MspI representations, in which many of the class 2 SNPs dropped out as a result of complete digestion of internal CCGG sites.
Independent assays for ASM. For the HpaII predigestion followed by PCR, RFLP and sequencing assays, genomic DNA, 1 mg in 80 ml, was digested with 5 U of HpaII overnight, followed by an additional 5 U of HpaII for 1 h or incubated in parallel with buffer alone. Of this digested DNA, 25 ng was used for each PCR (primers in Supplementary Table 3 online). In this predigestion/ PCR assay, we did a control with MspI predigestion, and verified that this procedure markedly reduce the amount of PCR product obtained under the same cycling conditions. For bisulfite conversion followed by PCR, cloning and sequencing, we treated genomic DNA with sodium bisulfite and purified the converted DNA using the CpGenome DNA modification kit (Chemicon) according to the protocol of the manufacturer. The bisulfite/PCR primers and SNP ID numbers for distinguishing the alleles in the HLA-F upstream region, two positions in the CYP2 gene cluster, and in the promoter region of the VNN1 gene are listed in Supplementary Table 3. Bisulfite PCR products were cloned in the pCR 2.1-TOPO vector (Invitrogen), and multiple clones were sequenced. For methylation-sensitive DNA (Southern) blotting of the LTF exon 13 region, we digested 3 mg genomic DNA with the indicated restriction enzymes overnight, followed by electrophoresis on 1% agarose gels, transfer to Nytran membranes (Schleicher and Schull) and hybridization of the blots with a 32 P-labeled LTF probe. The probe DNA was synthesized by genomic PCR; the primer sequences are in Supplementary Table 3.
Assays for allele-specific mRNA expression. Reverse transcription PCR was carried out with cDNAs prepared from total PBL RNA (Protoscript First Strand cDNA Synthesis Kit, New England Biolabs), using primers specific for VNN1, VNN2, VNN3 and CYP2A7, designed to span at least one intron and include at least one exonic SNP with a high minor allele frequency. We avoided SNPs within the primer sequences and sequences that were identical among family members, conditions which put strong constraints on primer placement in these highly polymorphic gene clusters. The controls for these analyses were genomic PCR products spanning these same SNPs. The primer sequences are listed in Supplementary Table 3.
Accession codes. NCBI GEO: data have been deposited with accession code GSE11409.
Note: Supplementary information is available on the Nature Genetics website.