Genome-Wide Association Studies – Karen Mohlke (2012)


Dr. Andy Baxevanis:
Okay, good morning everyone, and thank you for joining us on this absolutely beautiful
day here on the Bethesda Campus. Our lecture today is devoted to genome-wide association
studies. And as you all know, these kinds of studies really help us separate genetic
variations that are biologically insignificant from those that do produce some sort of change
that might ultimately be detrimental or advantageous to a particular individual. And the study of these
variations is also critical to identifying what genes are responsible for a particular
genetic or genomic disorder, as you heard about during last week’s lecture by Lynn Jorde. There’s also a much more practical reason
to study these genetic variations, particularly the single nucleotide polymorphisms, or SNPs,
that give rise to all of those subtle differences between each and every one of us in this hall
since a very thorough understanding of these variations might provide a way for us to know
in advance how well someone will respond to a particular drug or to a particular treatment
regimen. And we’ll hear much more about the pharmacogenomic implications of having this
kind of knowledge in next week’s lecture in this hall by Howard McLeod. This week I am very pleased to introduce to
you Dr. Karen Mohlke, who will be presenting today’s lecture on genome-wide association
studies. Dr. Mohlke is an NHGRI alumna, having done her post-doctoral work in Francis Collins’
lab where she used genome-wide approaches to localize diabetes susceptibility genes.
She is currently an associate professor in the Department of Genetics at the University
of North Carolina, a member of the Carolina Center for Genome Sciences, and a member of
the Lineberger Comprehensive Cancer Center at UNC. Her lab studies complex traits with
complex inheritance patterns using many of the approaches that she will be describing
to you today to study conditions such as type 2 diabetes and obesity. As always, it’s a
pleasure to have you here with us today, Karen. And so please join me in welcoming Dr. Karen
Mohlke back to the NIH campus. Okay.

[applause]

Dr. Karen Mohlke:
All right. All right, thank you very much. It’s always a pleasure to be here, and no,
no difference. It’s always a pleasure to be here. So, as Andy said, I’m going to be talking
today about genome-wide association studies, and these are especially relevant for complex
traits. Oh. And I have no relevant financial relationships to describe. So, complex traits are traits that have both
genetic and environmental contributions to them. There may be many genetic factors, many
environmental factors, and these factors may interact. That is, that there’s not necessarily
a single gene responsible for these traits. And some of the genetic factors have rather
subtle effects. As we investigate genome-wide association
studies, these are especially good at identifying common genetic factors that may be responsible
for common variation in complex traits. And by common factors I mean that when looking
at a stretch of DNA sequence, and looking at several copies of this stretch of DNA sequence,
of course many of the alleles are, many of the nucleotides are identical between those
sequences, but sometimes there are differences. For example, here is a “T” but in some copies
of this sequence there is an “A”. That is a relatively common variant. Three out of
10 times in that representation it’s an “A” allele so an allele frequency of 30 percent.
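
[Editor’s illustration, not part of the lecture. A minimal Python sketch of the allele frequency arithmetic just described. The ten single-letter alleles are made-up stand-ins for the ten sequence copies on the slide, and the function name is hypothetical.]

```python
from collections import Counter

def allele_frequencies(alleles):
    """Return {allele: frequency} for the alleles observed at one position."""
    counts = Counter(alleles)
    total = len(alleles)
    return {allele: count / total for allele, count in counts.items()}

# Ten copies of the same position: seven carry "T", three carry "A" (illustrative data).
copies = ["T", "A", "T", "T", "A", "T", "T", "A", "T", "T"]
freqs = allele_frequencies(copies)
print(freqs["A"])  # 0.3, i.e. an allele frequency of 30 percent
```
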
There are also DNA variants that are less common, or rare. So, for example, over, later
in the sequence there is only one copy of a “G” allele where there may be a hundred,
or a thousand other copies of that “C” allele. When we think about the genetic architecture
of genes influencing common complex traits, we can consider the different power of various
approaches to identify the underlying genetic variation. We consider the frequency of the
variants up here being more common variants, and common is often defined as the frequency
of an allele greater than about 5 percent. Moving on down to the very rare alleles, those
that might be present in only one person or one family. And considering the effect of
the allele, how strongly that variant acts to cause disease, or to increase risk to disease.
And so a very strong effect allele shown high in the Y axis compared to ones that have a
relatively modest effect, low on this axis. So, genome-wide association studies are especially
well suited to identifying common variants implicated in common diseases, in contrast
to, say, the rare alleles causing Mendelian disease, which were more easily identified
using linkage approaches or candidate gene approaches. There have been relatively few
examples of high-effect common variants that influence common diseases. And as genomic
technologies advance, we’re moving more from the common variants into the lower frequency
variants. And so lower frequencies may be from 5 percent down to about a half a percent,
and as sequencing technologies develop, and more individuals are sequenced, then we’re
moving into the, identifying more of the rare variants that will be identified to play a
role in both common and Mendelian type disorders. So today, as we talk about genome-wide associations,
I’m going to talk first about what the goals of these studies are; how these studies are
performed; what can be learned from the associated regions that are identified by the studies;
and then what the findings tell us about disease. So genome-wide association studies, the first
ones were done seven years ago now perhaps. There became many more done sort of in the
five, three years ago and continuing on today. The benefits of doing a genome-wide association
study compared to classical approaches such as linkage analysis or candidate gene, genetic
association studies were that genome-wide association studies are more powerful than
linkage to identify common, low-penetrance variants, and provide better resolution than linkage.
So that the variants identified are closer to the underlying causal genes and/or variants
than linkage analysis approaches. And they can be performed in an unbiased approach.
There is no need to select candidate genes and know the underlying biology ahead of time.
These can be used to discover completely novel pathways involved in a disease or trait that
were not previously known. Now why were they only started several years
ago? There were requirements to perform a genome-wide association study. We need to know the catalog
of human genetic variants so the genome could be sequenced and genetic variants across the
genome identified. There is a need for low-cost, accurate methods of genotyping, and technology
advances have enabled this to be, this to be possible so now hundreds of thousands or
millions of variants can be identified in a single reaction. Need large studies of people,
large numbers of informative samples, and along the way, efficient statistical design
and analysis methods to handle the large number of variants being analyzed. So the goals of
a genome-wide association study are to test a large proportion of the common single nucleotide
genetic variants for association with a disease or for a variation in a quantitative trait,
and doing all this without having to have any prior hypothesis of how the genes may
act, or what their functions might be. I’ll talk through many of the steps in a genome-wide
association study, so starting with ascertainment and collection of the individuals, the samples,
the methods for performing genotyping, steps of quality control using that genotyping data,
some of the methods of statistical analysis using this data, and the importance of replication.
So as we start thinking about the phenotype that is being studied, this can either be
a disease or a quantitative trait. So a disease such as type 2 diabetes or prostate cancer,
or it could be a quantitative trait: height, cholesterol levels, something that is not
discrete but has a continuous distribution of phenotype across the individuals. Disease could be rare or could be common,
although the common disorders are perhaps more appropriate for a genome-wide association
study. Quantitative traits have the advantage of being easy to measure, things like weight
and height. Some of them require careful approaches to measurement, and getting an accurate measurement.
Genome-wide association studies can also be performed using traits such as
gene expression level of all of the genes across the genome. The accuracy with which
a phenotype is assigned is an important step in analysis. The more well defined the phenotype
is the more likely one will be able to identify the genetic variants responsible for it. The
more heterogeneous the phenotype, if it’s really a mixture of many different causes
that create that disease, then those will be sort of mixed together, and harder to identify
the underlying causes. When selecting the individuals to perform
analysis, the strategy, so, one strategy is to perform a case control analysis, meaning
ascertaining cases affected with disease, and then also ascertaining controls who do
not have disease. Another approach would be to do a population survey: collect many, many
individuals across the population and then determine which ones of those are affected
with disease. Using the population survey yields a smaller proportion of individuals affected
with the disease, but they may be more representative of that disease in the population than if
you ascertain the cases that are severely affected with disease. That might
be less representative, although they might lead to greater possibility of identifying
the genetic variants responsible for them. So in a case control analysis, the methods
to, or the approaches used to define the case are relevant and important to consider when
interpreting the results of a case control association study. So were cases defined with
extreme phenotype, were they — how were they collected? Is there some special subset of
phenotype that may be especially enriched in that particular set of cases? Similarly
with controls: Are the controls selected to be random members of the population that are
not yet affected with disease? But then some of them, if it’s an adult onset disorder,
perhaps will become affected with the disease next month or next year. Those are perhaps less good
controls when seeking to have a greater difference. So consideration of these approaches
is important for how those results are interpreted. So potential criteria that one could use when
selecting cases would be to choose individuals that are more severely affected with the disease.
These might be individuals that have a greater genetic load, then, and so provide a greater
opportunity to identify the underlying genetic factors. One could require other family members
to have the disease. This is more evidence of a genetic factor responsible as opposed
to more of an environmental contribution. Choosing — for an adult onset disorder — choosing
individuals with a younger age of disease onset also could enrich for genetic factors. When considering criteria for selecting controls,
one could enrich the genetic effect by choosing individuals with a lower risk of disease rather
than population-based samples. It’s important to keep the ancestry of the controls and the
cases matched as well as possible, and to try to match the controls to cases based on
age, sex, and other demographic factors that may influence disease. To show a bit of an example about a matched
ancestry: If the cases are collected from the population, but have different underlying
ancestry represented here by the different shadings of the different symbols here. So
maybe solid filled symbols, and these two different categories. If that different ancestry
is differently represented, if the proportions of those are differently represented between
the cases and the controls and there are genetic factors or genetic variants that are more
common within some of those subsets than others, then those genetic variants may appear to
be associated with disease when truly they are associated with being part of that subpopulation. When performing an association study in a
set of samples that have not previously been analyzed genetically, you may have inadequate
ancestry information prior to performing the genotyping. Ascertaining individuals from
a particular area may assume that the ancestry is similar between individuals. After performing
genotyping with hundreds of thousands of markers across the genome, one can look at the frequency
of different alleles, and identify perhaps subsets of individuals that are, create subpopulations
within the sets of cases and controls. So this subpopulations is, that I’ve been
talking about, another word for this is population stratification. So the issue being that population
stratification can produce false positive association results in case control studies.
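
[Editor’s illustration, not part of the lecture. A minimal Python sketch of how stratification alone can make an allele look associated with disease. All numbers are invented: two subpopulations, A and B, have different frequencies of an allele, and the allele has no effect on disease in either one. The only imbalance is that cases come mostly from A and controls mostly from B.]

```python
def pooled_allele_frequency(groups):
    """groups: list of (n_individuals, allele_frequency) pairs, one per subpopulation."""
    total_people = sum(n for n, _ in groups)
    # Weight each subpopulation's allele frequency by how many people it contributes.
    return sum(n * freq for n, freq in groups) / total_people

# Hypothetical allele frequencies; within each subpopulation they are
# identical for cases and controls, so there is no true association.
FREQ_A, FREQ_B = 0.6, 0.2

# Cases are drawn mostly from subpopulation A, controls mostly from B.
cases = [(800, FREQ_A), (200, FREQ_B)]
controls = [(200, FREQ_A), (800, FREQ_B)]

case_freq = pooled_allele_frequency(cases)
control_freq = pooled_allele_frequency(controls)
print(round(case_freq, 2), round(control_freq, 2))  # 0.52 0.28
```

[The pooled frequencies differ sharply between cases and controls even though the allele does nothing within either subpopulation. That gap is the spurious signal.]
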
In addition, individuals that are cryptically related, that you don’t know are related but
that are, say, cousins, a relationship not known in the collection of individuals,
can enrich for particular alleles within samples, and that can also create a false positive
association. Ways to account for or avoid stratification
and relatedness: one is to perform genomic control. So this is a correction that
evaluates the average excess association identified and adjusts
the results of the association study by this average measure, to sort of alter the threshold
that you use to define what a significant result is. Another approach is to use the
allele frequencies of variants across the genome to identify principal components of,
say, subpopulations or of substructure within the samples, and then include those principal
components of substructure as covariates in the analysis to account or adjust for them. Another
approach to avoid population stratification would be to perform a family-based study
design, where instead of selecting cases and controls the association analysis is performed
within families, considering the relationships between the individuals. Given
a set genotyping budget, however, there is reduced power for identifying variants
when individuals are related and part of those families. So the genotyping process, now genotyping
panels are available with as few as ten thousand SNPs, single nucleotide polymorphisms, as
many as five million SNPs now. Two main companies provide a number of fixed content panels available,
meaning that the genotyping arrays or chips are available with set SNPs that are being
evaluated on them. The approaches used to select the SNPs for these panels, some of
them are random SNPs. Some of them are selected to be haplotype tag SNPs and Lynn Jorde talked
about this, and I’ll show a slide about this as well. Some of the variants, or some of
the nucleotides chosen to be on these panels are not nucleotides that vary, that have different
alleles in the population, but ones for which the intensity of the signal differs because of
a copy number variation. And some of the arrays that are now available have fixed
content but the user is allowed to add on an additional 10,000, 50,000 single nucleotide
variants. So if you were to perform a genome-wide association study today you may choose a panel
and then say, “Oh, but these particular variants are missing from that panel.” Perhaps if you
know of some less common or rare variants that are not on the panel, or some particular
functional variants, or variants that you think really play a role, those could be added
onto the panel. Higher density genotyping, higher density SNPs in special regions of
interest could be added onto those arrays. So I talk about selecting haplotype tag SNPs,
example shown here. So in this example now there are four copies of a particular chromosome.
Again most of the nucleotides are the same, this is representing three single nucleotide
variants in this region. When combined together with variants that are both upstream and downstream
of this, the variants can be shown, represented as haplotypes. And the, given the history
of human populations, and the non-random recombination events that have occurred during human demographic
history, there are clusters or sets of SNPs that are being inherited together in most
members of the population. And so selecting SNPs that are representative of variation
of other SNPs allows a more efficient, fewer SNPs to be genotyped to represent a larger
proportion of the variation. So for example, these haplotypes of 20 variants
can be represented by just choosing three SNPs within this set, and there are other
variants that could be chosen as well. This is sort of an example: “T-C-T-C” variant here
could also easily be represented by this variant here, “C-T-C-T,” but the set of three variants
represents the variation present. So this also means that when interpreting the results
of an association study, although a single variant might be described or reported in
a paper, say this variant is described as showing strong evidence of association, it’s
important to remember that there are other variants located nearby that are in linkage
disequilibrium with that variant. They’re inherited together in the same pattern as
that variant that may — that would also show similar or identical evidence of association
with that trait. So I’ll talk through a few of the methods
of allelic discrimination that are used in these genome-wide genotyping panels. One of
them is this Illumina Infinium assay. In the Illumina assay, DNA is amplified to generate
larger amounts of DNA, and then the DNA is captured on oligonucleotides that are bound
to bead arrays. An allele-specific extension or a mini-sequencing assay is then performed.
So here is the genomic DNA target. It’s hybridizing to sequence that is on an oligo
that is bound to a bead, and a sequencing reaction happens. So that if the allele provided
is a perfect match then the polymerase can continue on with that sequencing reaction.
If there is a mismatch of the end nucleotide then no continuing sequencing reaction can
occur. There are a few different forms of this assay
that Illumina provides: the Infinium 1 assay and the Infinium 2 assay. In this case there
are two different bead types used to represent a single SNP, and one color of
detectable label is used. In this form, a single base extension reaction
happens, so a single bead type is used, and two different colors of detector are involved.
So when Illumina describes the number of SNPs that are available on a panel and the number
of, say, custom-designed SNPs that could be added to a panel, they talk about bead types,
because some SNPs are assayed well with the single bead type, and some SNPs are assayed
better with two bead types. Okay, Affymetrix has a genotyping platform
called their GeneChip Array. In this strategy, the genomic complexity of the DNA is reduced
by performing restriction enzyme digestion and size selection of the fragments.
Adaptors are added, then amplification steps, fragmentation, and end labeling, and the allelic discrimination
happens based on hybridization of one allele, two sets of oligos on the array. So in their
GeneChip Probe Array there are millions of copies of a specific oligo probe bound, so
in a given, in a given region here are DNA probes in sort of one part of the array, and
there are multiple copies of this same sequence with the same variant allele present. A given SNP can be represented by many different
probes. Say the SNP allele, the variant allele, may be in the center of an oligonucleotide,
and there could be as many as four different sequences represented on the probe, representing
all four possible alleles that could be bound there. And then the, the variant could be
offset by a nucleotide or two not precisely in the middle, but moved over or the probe
could be a little bit longer, a little bit shorter. With time, the choice of which, which
probes are the most efficient at discriminating between the two alleles improves, and that’s
what allows Affymetrix to add on additional variants to be able to fit more variants onto
an array, and allows the discrimination to be optimized for given variants. Affymetrix also has a newer platform, their
Axiom Array. In this case, the DNA is amplified and fragmented into, say, 25- to 125-base
pair fragments enzymatically, and then the fragmented amplicons are loaded onto the array
to hybridize to oligos. And after selection, a solution of random 9-mer oligos
that are labeled are hybridized to the array, and they’re hybridized such that if the alleles
match then a ligation reaction can be performed. And so the discrimination, the allele discrimination
is based on ligation which requires the alleles of the adjacent nucleotides to be, to be matched
and to hybridize well. And that provides greater allelic discrimination a little bit better
than, say, hybridization would, would provide. And then the labels that are present are stained
and imaged. So here is a representation of what some of
the sort of coverage of common variants is for a set of arrays that are available. These
are a little bit some of the older arrays. And so coverage is calculated by looking at
the set, some defined set of common variants that are present, and when you interpret what,
what the coverage is of a particular array, you want to consider what that set of variants
is. Often HapMap variants will be defined, or 1000 Genomes variants. The more sequencing
that happens, the more variants that are identified, so knowing what that reference set is, is
valuable, and then looking at the linkage disequilibrium between a given variant and
the other variants that are present in that set is used to estimate what that coverage
is for the given chips. And the coverage is going to differ based on the population of
the individuals being assayed, because allele frequencies differ, and linkage disequilibrium
relationships differ between populations. So some of the newer arrays that have more
variants present in them do a better job, have higher coverage of common variants than,
say, some of the older arrays. Now, the most recent generation of SNP arrays
that are available are improving coverage of the lower frequency variants. So the initial
arrays were covering the variants, say five, frequencies of 5 percent and greater. Now
the frequencies covered are moving down into the, say, less common ranges. So here is a
slide from Illumina. One of the newer arrays that they have available is specifically chosen
for the Chinese population, so this particular chip was designed to select variants based
on individuals from Chinese ancestry. And so they show that the coverage on the Y axis
here of variants with an allele frequency greater than 5 percent is sort of shown here
on this particular array compared to one of their other genome-wide association arrays.
So here is a more general array, and this is the one that is chosen to be specific for
the Chinese population. And you can see that they’re also improving the coverage of their,
of the less frequent variants, those with a minor allele frequency greater than 2.5
percent increases with this specific chip. To be fair, here is also a slide showing one
example of an array from Affymetrix, and they too, in their latest arrays that are available,
show that they are, well they have good coverage of the common variants. They’re also trying
to have improved coverage of the less common variants in this little bit lower frequency,
sort of that 2 to 5 percent allele frequency range. Okay, so genotyping of samples, cases and
controls, members of a population is performed, genotyping data comes back. There are a number
of quality control steps that are important to do in a genome-wide association, prior
to performing the association analysis. One is to look for and detect poor quality samples.
The samples that had a success rate less than some level, maybe where fewer than 95 percent of the
SNPs are successful. The more SNPs that fail, the more the SNPs that succeed are called
into question as perhaps generating inaccurate genotypes. So if most of the
samples are working very, very well and some of them are not as well then it could be that
heterozygotes are being miscalled as homozygotes for particular alleles. And so identifying
and excluding poor quality samples is valuable. An excess of heterozygous genotypes might
suggest that a DNA sample is really a mixture of two DNA samples. One can use the genotype data to evaluate
whether any sample switches have happened in that process from when the DNA sample was
collected from the individuals and then that, say that the tube of blood was collected.
It was processed into DNA. It probably changed hands many times. It was moved from a tube
onto a plate, and a plate that was then genotyped, and that whole process. Sample switches can
happen, and one way to identify whether that has happened is to look at the sex of
the individual based on markers on the X and Y chromosomes, and evaluate whether it matches
the sex expected in that individual. If DNA samples are around a lab for a while, then
particular genotypes known from one set of genotyping
reactions can be compared to those done with another assay
at another time point, to see whether any sample switches have happened in the intervening
time. One can use the genetic data to look for unexpected
related individuals. So again, when analyzing a cohort or a population sample for
the first time, one can use pair-wise comparisons of genotype similarity and look for, say,
unexpected duplicates, which might turn out to be monozygotic twins, or people who participated
in the sample collection more than once with different identifiers. And you can also use
the allele frequencies of variants across the genome to look for individuals who have
ancestry that may be a little bit different from the rest of the sample and then consider
that, and either exclude them or account for that, those differences when performing the
later analysis. In addition to looking for poor quality samples,
one can look for poor quality SNPs. So shown here are a few examples of raw data of genotyping
of a set of individuals. So shown over on the left here now, it’s signal intensity
of one marker, say the X marker. We’ll call it the A allele. Signal intensity of another
marker, it’s labeled the Y marker; let’s call it the C allele. So this is a lovely
looking marker where the allele intensity is very high on the A axis for these samples and
relatively low on the C axis, so these would be the AA homozygotes. These similarly are very
high on the C allele axis. These would be the CC genotype, and these would be the heterozygotes.
It’s an ideal genotyping plot. When doing hundreds of thousands and millions
of markers, software is used to assign the genotypes to various clusters. Occasionally,
the software might not detect that these two clusters are distinct. It might call them
together as heterozygotes, so erroneously assigning heterozygous genotypes to these
individuals. One tries to look for cases when that happens and fix them, or exclude
those markers. Some assays for given SNPs don’t work all that well and there is not
much discrimination, or the discrimination is not clean between the clusters. And so
the individuals that are especially close between these two clusters may be more likely
to be miscalled with an incorrect genotype. And those genotypes can either be excluded
or, or it’s at least helpful to recognize the marker, and perhaps exclude the entire
marker to avoid having errors in the data that might lead to false positive or a false
negative associations. Other ways to, so that often happens at the
genotyping level, the individuals performing the genotyping analysis are those who are
looking at that raw data, evaluating some of those characteristics. One can also detect
SNPs that are of poor quality by looking for a genotyping success rate less than 95 percent.
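
[Editor’s illustration, not part of the lecture. A minimal Python sketch of a per-SNP success-rate (call-rate) filter at the 95 percent threshold mentioned here. The data layout, function names, and SNP names are assumptions made for the example; real pipelines use dedicated software.]

```python
def snp_call_rates(genotypes):
    """genotypes: dict mapping SNP id -> list of calls, with None for a failed call."""
    return {
        snp: sum(call is not None for call in calls) / len(calls)
        for snp, calls in genotypes.items()
    }

def failing_snps(genotypes, min_call_rate=0.95):
    """Return the ids of SNPs whose call rate falls below the threshold."""
    rates = snp_call_rates(genotypes)
    return sorted(snp for snp, rate in rates.items() if rate < min_call_rate)

# Toy data: 20 samples per SNP; the SNP names are invented for illustration.
genotypes = {
    "snp_good": ["AA"] * 10 + ["AC"] * 8 + ["CC"] * 2,    # 20 of 20 called
    "snp_borderline": ["AA"] * 12 + ["AC"] * 7 + [None],  # 19 of 20 called
    "snp_poor": ["AA"] * 9 + ["AC"] * 6 + [None] * 5,     # 15 of 20 called
}
print(failing_snps(genotypes))  # ['snp_poor']
```

[The same calculation run per sample instead of per SNP gives the sample success rate used to flag poor quality samples.]
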
So now this is a SNP that worked in less than 95 percent of the samples. It’s sort of an
arbitrary threshold but a commonly-used one. Might suggest that there is some problem in
that assay that the, perhaps it’s not discriminating well between the clusters. Perhaps the genotypes
that continue to exist are inaccurate and therefore excluding the marker would be more
prudent. Often these analyses are done using a small percentage of samples are duplicated,
present twice within the set of samples being genotyped. So then the genotypes from those
duplicate samples can be compared, and finding mismatches or discrepancies between those
identical samples is a bad characteristic for a SNP. I’d want to exclude those particular
markers. One can also do a test for Hardy-Weinberg equilibrium,
looking for whether the observed proportions of genotypes, or genotype frequencies, are not consistent
with those expected from the observed allele frequencies. This also suggests that the marker perhaps has
a problem, that the — perhaps heterozygotes are more often being called homozygotes incorrectly,
and so statistical tests can be used to identify that kind of an error. If there are related
individuals within samples such as a mom, dad, and a child, trios, then one can look
for Mendelian inheritance of alleles from the parents to the child. Some groups will
add additional quality control samples to their genotyping, to their sets of samples
to allow this kind of SNP error to be detected. And then it’s also important that if, say,
a set of cases are going to be compared to a set of controls, that the genotyping be
done as similarly as possible between those two groups. If the cases are genotyped entirely
separately from the controls, then it’s possible that there is different allele missingness
or that there’s different accuracy in the calls between the cases and the controls.
And this can lead to false positive associations, so it’s important to try to intermingle the
cases and controls as much as possible to account for any differences in plates or arrays
or any of the technical steps in doing the genotyping to detect any sort of potential
errors. Okay, so once the genotype data is cleaned,
meaning that the, you know, poor quality samples, poor quality SNPs have been removed, then
one can go test for association. So, in a case control study, now looking for differences
between the cases and controls in terms of their allele frequency, genotype frequency,
lots of things. So, for example, one could perform a test for trend looking at the frequencies
between those different sets. So look at the counts of individuals in these, with different
genotypes within the cases and controls. It’s valuable if there are covariates that are
also associated with disease, so if the disease prevalence increases with age or if it’s more
common in males than females then covariates representing all these factors should be included
in the analysis to account for them to improve the opportunity for the genetic variants’
contribution to disease risk, or the quantitative trait to be identified. Often tests are done looking for an additive
effect of the alleles on the trait, meaning that having one allele has an effect and having
two alleles has more of an effect. Other tests can be done looking for evidence of dominant
or recessive models; however, the additional number of tests performed in doing
an analysis like this would need to be considered when deciding what the threshold of significance
of the overall results is. So, for example, in a case control study when
looking, when looking for the effect of an allele on risk of developing disease one could
calculate an odds ratio. So if these are counts of individuals, cases and controls that have
counts of the alleles A and C represented in those individuals, then one can calculate
an odds ratio as the odds of having a C allele given case status over the odds of having
a C allele given the control status, and this would form an odds ratio. And so a value that
is greater than one shows increased risk of disease for that particular allele. And an
odds ratio that is significantly less than one is evidence of decreased risk of disease. When performing association analysis on a
genome-wide scale, many, many tests are done. So if 300,000 to five million SNPs are being
analyzed, then one would want to correct for that number of multiple tests when defining
what a significant result is, and what a sort of spurious chance result could be. One approach
for doing this is to take a commonly used threshold of significance, say 5 percent.
So one in 20 times you might see a result, a difference between cases and controls that
is at this level of significance, and divide that by the number of statistical tests being
performed. So, a commonly-used threshold assumes that the number of common variants being tested
across the population, this was designed based on a Caucasian population, was approximately
a million tests. And so taking a P value threshold of .05 dividing it by a million creates a
new threshold of 5 times 10 to the minus 8. So this is a commonly-used threshold for declaring
that a particular result is significant and not likely to have occurred by chance. Achieving
a threshold like this requires either a large effect of that particular variant or a large
sample size to detect a more modest effect. Question? Male Speaker:
Just a quick question. So, is there any preference as to which multiple testing procedure
is used in GWAS studies, whether it’s [inaudible] or Benjamini-Hochberg
or? Dr. Karen Mohlke:
So the question is, are there different strategies
one could use, such as a false discovery rate, as opposed to this Bonferroni correction for multiple
tests. Different approaches are used. I would say that declaring a threshold of 5 times
10 to the minus 8 is very commonly used within the literature, although people will argue
whether that is an appropriate threshold to be used. Often there are signals that
do not reach that threshold due to limited power, and when sample size increases
in the next round of study, those variants become significant, and so it is a valuable
thing to consider. So I show here an example of what results
would look like from an association test. This is from an early test for type 2 diabetes
association, comparing not quite 1,200 type 2 diabetes cases to not quite 1,200 normal
glucose tolerant controls. This is work of the FUSION study. And the results shown here
are for the genome with the chromosomes lined up end to end. So chromosome 1 on the left,
all the way down to chromosome 22 and then the X chromosome. With each dot representing
a single nucleotide variant that was tested for association, and this analysis was done
using logistic regression with an additive model, and adjusting for age, sex, and birth
province even within Finland to account for a potential stratification. And then on the
Y axis is this minus log 10 of the P value, so a P value threshold of .05 would be about
there. So you can see when doing this many tests that is not an appropriate threshold
for defining what is significant. Many, many variants have a P value smaller
than that threshold. The threshold for accounting for the number of tests done here would be
in the sort of 10 to the minus 7 or that 10 to the minus 8 range. You’ll notice that the
maximum scale here is six, so none of the results from this initial study reached that
threshold of genome-wide significance. That makes it difficult to figure out
what variants might represent true positives. At the time that this study was done, sort
of before genome-wide association studies were available, there were three variants,
or three loci, that had a well-established role in genetic contribution to type 2 diabetes.
And so we looked for the location of those variants within this data. One of them
was at the TCF7L2 locus, and it was gratifying to see that variants at that locus were present within the top 10 SNPs of this
association analysis. So that suggested
that it would be possible to identify genetic factors. Another of the variants was
at the PPAR gamma locus; this was within maybe the top 300 variants. And another of the variants
with an established role was around 3,000th on the list of 300,000 variants analyzed. One way to evaluate whether there
is an excess of significant results at a given threshold is to plot the P values that result
from the test of association against the P values from a uniform distribution. So shown
here on the X axis is minus log 10 of a uniform distribution, and on the Y axis minus log 10 of the
P value from the test of association. So there is a black line showing sort of the expected
right along the edge here, and the blue dots represent the data that I just showed
you. So you can see that there is sort of a slight movement off of this line, but very
much falls along the line. So this is good from the perspective of there is no excess
of associations that might represent population stratification or some sort of excess relatedness
within the individuals. But it’s bad from the perspective of there are no variants that
showed strongly significant excess evidence of association in the true analysis compared
to the uniform distribution. If one was doing an association analysis in
a population that had evidence of substructure or stratification, then a plot, similar plot
might show that the variants in these dark blue dots show an excess of significance sort
of all the way through the scale. If this population stratification is
adjusted for then the P values that result from the association test are more in line
with that expected distribution. And so correcting for population stratification can reduce the
excess associations that are false positives, not due to true genetic
signals. So, after performing an association analysis and
doing all that work and not identifying significant results, a frequent next step is to try to
gain statistical power by increasing sample size. Larger sample sizes will have a greater
possibility of identifying genetic factors that have a more modest effect. So
the common way that this is performed is that each group does their own genome-wide association
analysis, and then the data from several studies is combined together by performing a meta-analysis
of the results for each genetic variant. Now, potential issues for performing a meta-analysis
across studies: one is that different genotyping platforms may be used, and different analysis
strategies might have been used in the beginning; and also that the definition of cases and
controls may differ. So there is some heterogeneity that is introduced by the fact that different
studies are performed in different ways. Generally the strategy that has been applied is that
larger sample size is more valuable and more powerful in the face of these, say, differences
in sample collection, and so results need to be considered with some caution
about what heterogeneity might underlie them, but generally larger sample size
is identifying additional variants. To address the different genotyping platforms
that may be used by different groups, several strategies for imputing, or predicting,
the missing genetic variants between platforms have been developed. So, in imputation one
might have in your study sample genetic variants typed at, say a position here, a position
here, a position here, but that the other genetic variants in the intervening regions
were not typed. They were not selected for that genotyping platform. The study samples
can be compared to some sort of a dense genotyping platform, or dense set of genotypes. So HapMap is a commonly used set of variants.
So this is on samples that were chosen to try to be representative of some particular
populations and that were analyzed at a much denser set of genetic variants. Now more recently
the 1000 Genomes Project has generated data on an even denser set of variants, and so one
could take the genotyping data from a particular study, and impute the variants from the 1000
Genomes Project, and fill in many more of the genetic variants. So instead of analyzing,
say, 500,000 variants that were genotyped on the array, one could analyze 2.5 million
variants that are present on some of these reference panels. So the strategy for doing imputation is that
a probabilistic search for mosaics of chromosomes that match each individual is performed. So,
for example, the top chromosome from this individual is represented by this haplotype
within the reference panel. The lower chromosome of this study individual is best represented
by a mosaic of, say, one portion of a chromosome and another portion someplace else, suggesting
that between the portions of these two different haplotypes
a recombination event has occurred sometime in the past. So then the genotypes can be
sort of filled in from those phased chromosomes. There are several different approaches to
performing imputation, and often the analysis provides some evidence of the likelihood
that filling in that genotype was correct. And so thresholds for quality can be used.
If a variant is, you know, part of a chromosome that has been seen many, many times with exactly
that same set of variants, in many copies of that haplotype, one might have
a lot of confidence filling in the intervening genotypes, whereas if it’s a region of lots
of recombination, and it’s unclear exactly which haplotypes match best, then the filled-in
genotypes may have less accuracy and be less likely to be correct. And so analysis can
be performed choosing a threshold and not including genotypes that are imputed
with a low likelihood of accuracy. The advantage of doing imputation is that
it allows studies done on many different genotyping
platforms to be combined together. So here is an example of one of the arrays that, say,
perhaps genotyped these particular markers, whereas a different array genotyped these
particular markers, and when both sets of data were used to impute markers from the
HapMap Project, the markers shown in blue were able to be analyzed in both studies.
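Once two studies have results at the same imputed markers, the per-variant meta-analysis mentioned earlier can be run on the shared set. The sketch below is a minimal illustration under stated assumptions: each study is assumed to report an effect estimate and standard error per SNP, the combination uses fixed-effect inverse-variance weighting (the lecture does not say which meta-analysis method was used), and the SNP identifiers and numbers are made up.

```python
import math

def meta_analyze(study_a, study_b):
    """Fixed-effect inverse-variance meta-analysis over SNPs present in both studies.

    study_a, study_b: dict mapping SNP id -> (beta, standard_error).
    Returns dict mapping SNP id -> (pooled_beta, pooled_se, two_sided_p).
    The input format and the weighting scheme are illustrative assumptions.
    """
    results = {}
    # Only markers available in both studies (e.g. after imputation) can be combined.
    for snp in sorted(set(study_a) & set(study_b)):
        weighted = [(1.0 / se ** 2, beta) for beta, se in (study_a[snp], study_b[snp])]
        total_weight = sum(w for w, _ in weighted)
        pooled_beta = sum(w * beta for w, beta in weighted) / total_weight
        pooled_se = math.sqrt(1.0 / total_weight)
        z = pooled_beta / pooled_se
        # Two-sided p-value from the standard normal distribution.
        p_value = math.erfc(abs(z) / math.sqrt(2.0))
        results[snp] = (pooled_beta, pooled_se, p_value)
    return results

# Hypothetical per-study results; "rs1" is the only marker shared by both.
study_a = {"rs1": (0.10, 0.05), "rs2": (0.02, 0.05)}
study_b = {"rs1": (0.10, 0.05), "rs3": (0.30, 0.05)}
combined = meta_analyze(study_a, study_b)
```

With two equally sized studies showing the same effect, the pooled standard error shrinks by a factor of the square root of two, which is the gain in power that motivates combining studies.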
So while the overlap between the sets of data available from one platform or the other,
you know, the directly-genotyped markers that were shared, was relatively small, the total
number of markers that were able to be analyzed is a much larger set. Imputation doesn’t require that the variants
be perfectly in linkage disequilibrium with the variants that are tested. It’s a haplotype-based
approach, and so it’s possible to identify variants that have a different frequency than
the variants that were typed. So there are examples at least in the early stages where
variants were identified to show association only when imputation was done; none of
the markers on the genotyping panel themselves showed association. So in this particular
plot, this is a zoomed in region of a portion of chromosome 19, with some genes shown below,
and the minus log 10 P value for LDL cholesterol levels shown on the Y axis. And the dots that
are shown in red are the markers that were directly genotyped on the particular genotyping
array. And the dots shown in blue were the ones that were imputed based on using the
genotypes from the Affy array, and imputing the variants present in the HapMap sample. And so you can see that none of the red dots
showed strong evidence of association in this region, however, at least one of the blue
dots gets up into a more significant P value showing evidence of association. This is the
low-density lipoprotein receptor locus, associated with LDL cholesterol,
a result that was known prior to this kind of analysis, but it goes to show that imputing
can identify variants that were not present on the genotyping panel. So shown here is an example of the structure
of a meta-analysis, where seven different groups got together. Each one performed their
own genome-wide association analysis using a shared analysis plan for what method to
use, and what model to use, and what covariates to use. And then a meta-analysis of those
seven studies was performed, and the top SNPs, the most strongly associated SNPs from that
study, or representative ones of those results, were selected to follow up in additional
samples. So some studies, some cohorts have genome-wide genotypes available. Some do not,
but are able to genotype, say, 50 SNPs to go follow up on results. And so in this particular example, around
40 to 60 SNPs were selected and different groups in these replication cohorts genotyped
those variants separately using a different genotyping platform. And then the data from
those replication cohorts was analyzed to determine which of the initial variants showed
significant evidence of association. So in this particular example the genome-wide association
analysis was done in around 20,000 individuals, and then some of the top variants were followed
up in around 20,000 individuals. The results of that particular analysis are
shown here. Now there are three genome-wide association plots, because there were three
phenotypes analyzed with that set of data: LDL cholesterol, HDL cholesterol, and triglyceride
levels. When phenotypes are measured in the same people, once the genotype data is available,
looking at the range of all phenotypes present is relatively quick. So shown here are three
genome-wide association plots and these three quantile-quantile plots. So let me zoom in and show a portion of one
of these. So here is a portion of the genome-wide association plot. These are often called “Manhattan
plots” because the tall buildings show up out of the background of shorter buildings
there. In this analysis, this was sort of the — not the first round of genome-wide
association studies for these traits, but a later round. So they show the results on
this q-q plot here. The grey line represents the expectation if none of the variants show
significant association, and this is shown now with a 95 percent confidence interval
on that line. So black represents the set of all variants identified in this particular
trait, LDL. When removing the variants that were known previously, then the blue symbols
are representative of the data being reported in this particular study. So they still showed
an excess of significant results. There are still novel signals; evidence of association
being identified. If we remove the effects of those variants you can see that there’re
still a little bit of excess association present, but none of the variants in particular reached
the genome-wide significant level. So, meta-analysis is useful, and follow-up
and replication of initial association results, especially ones that don’t reach genome-wide
significance levels yet, can allow for increased power and increased opportunity to identify
novel signals associated with a disease or a trait. When performing meta-analysis, however,
one has to be concerned about heterogeneity between the studies. So one example to demonstrate this: when The
Wellcome Trust Case-Control Consortium performed a genome-wide association of type 2 diabetes,
they showed strong evidence of association of variants at the FTO locus with type 2 diabetes.
However, a couple of other studies that were doing association analysis of type 2 diabetes
at the same time didn’t really see evidence of association with FTO at all. It turns out
that the Wellcome Trust cases were more obese than the controls in that study, whereas,
the other diabetes studies, their case control selection had been more balanced with respect
to body mass index — body size. So the identification of this source of heterogeneity between the
studies led to identification of FTO as a gene that plays a strong role in obesity. Some of that data is shown here. This is a
plot showing odds ratios and a 95 percent confidence interval of the odds ratio. So
the X axis is odds ratio; 1.0 would mean that there’s no increased risk or decreased risk
for a given variant. Here is the A allele of this marker representing the FTO
locus. The initial set of Wellcome Trust cases of type 2 diabetes showed a strong odds ratio for
obesity. Here are the controls that were
used in the type 2 diabetes analysis. So you can see the effect on obesity is larger in
these type 2 diabetes cases than in those type 2 diabetes controls. That’s why it looked
like evidence of association with type 2 diabetes at first. When
they went in and collected other sets of cases, other sets of controls, and then, valuably,
samples that were from population-based collections, so not disease status ascertainments, and
evaluated the effect of this particular allele, you can see that it consistently shows an
increased risk of obesity. So this odds ratio is 1.3 and the confidence interval around
it is quite narrow because it’s a very large sample size, and this was sort of
the definitive evidence showing that these variants are associated with obesity. Okay, so genome-wide association studies have
been performed now for at least 237 traits. These are results cataloged by the NHGRI in
a catalog of genome-wide association studies. The slide shows the various chromosomes
with some colored dots representing positions of some of these loci, and as of the most recent
summary here, there are about 1,449 published genome-wide association signals
with P values less than 5 times 10 to the minus 8 representing 237 traits. So many genome-wide
association studies have been performed and many, many loci have been identified where
genetic factors are associated with the trait or disease. As would be expected, more loci are found
with larger sample sizes. So, in this recent review, the — a number of different results
are summarized with the number of cases shown here on the X axis, 1,000, 10,000, and 100,000.
And the number of genome-wide association hits or signals represented on the Y axis
at 1, 10, and 100. And the different symbols represent different studies that were performed
with different sample sizes. And here is a subset of case control studies that were done
for Crohn’s disease. And you can see that generally the larger
the sample size and the larger the number of cases, the more genome-wide
association hits are identified, showing that many signals exist and that the effects for
many of them are relatively modest and that large sample sizes are needed to identify
them. So let’s look at some of the examples of types
of results that are identified in genome-wide association studies. I’m going to look at
a few plots of particular loci, so zooming in on the genome to particular regions.
So here is a portion of chromosome 19 and about 400 kilobases are shown on the X axis.
Each of these represents a gene in this gene-dense region, and the P value from the test of association over
here has its strongest signal here, with a P value smaller than one times 10 to the minus 25.
This is replicating a known association, one that’s been known for a very long time, of
a variant at the APOE locus associated with LDL cholesterol levels. Now this is not the
variant itself that has been shown more strongly to play a functional role at the
locus, but it’s inherited in a similar pattern. This example also lets me highlight that,
in this particular case, this variant is close enough to a known gene that
this gene might be the one highlighted in a report of a genome-wide association study.
However, if this was a novel signal, then the evidence, the decision about what gene
label to use in a report might be a little bit arbitrary, might be a little bit driven
by what the biology of those underlying genes might be. But it’s important to know that,
when reading a paper of a genome-wide association study, that the gene label assigned is often
just the nearest gene to that SNP that happens to be the top signal, and might not be
a gene that is contributing to variation at that locus. Also, even though a single
gene might be provided in that label, there could be genetic variants that are affecting
more than one gene at a given locus; there may be multiple true causal underlying
variants, and they could be affecting different genes
at that locus. So, interpret with caution. Okay, so then some signals that are identified
can be novel signals. In this particular case, the strongest evidence of association was
found within an intron of a gene, meaning that, shown down here, these little tiny boxes
representing exons here and all of the variants that show the strongest evidence of association
here are localized within an intron. So perhaps underlying causal variants are not shown on
the plot but are in linkage disequilibrium with variants in the plot and could be playing
a role in the protein sequence or perhaps underlying variants are influencing gene expression
of this gene or of some other gene nearby. Some novel signals are found at a distance
from known protein-coding genes. So these are identifying possible novel biology or
possible novel mechanisms. So variants that are found at a distance from protein-coding
genes perhaps are affecting other sequences in the genome, RNA sequences, non-protein-coding
genes that may be present; not all of these are annotated in the genome yet.
Or there could be regulatory effects, say,
as enhancers or repressors of transcription of genes that are hundreds of kilobases away. More and more, multiple signals of association
are identified in a given region. This makes sense with what’s known about genetic variation
and allelic heterogeneity for Mendelian disorders. There’s more than one way to influence a gene;
there’s more than one way to alter a gene. So there’s often more than one common variant
or signal that can play a role in association at a given locus. So, shown here are two
plots; really it is the same data shown twice, but it is colored based on the relationship
of the variants to one another. So there are really two signals here: one that’s localized
quite close to the promoter of this particular gene, and another signal
that is independently inherited from the first and is located tens of kilobases
upstream of this particular locus. One way to look for independent signals is
to include a given single nucleotide polymorphism in a regression analysis to adjust
away the effect of that one variant and then see what the results of the other variants are
in the region. So in this particular case, each dot here is representing the evidence
of association with the trait. If one were to perform this test and include one of these
variants in that test of association, one sees that at this locus the signals are independent. And so,
by including this variant, the evidence of association for any of these other variants would
essentially go away. However, the
association of these variants remains unaffected by that other signal. So this is really strong evidence of independent
signals influencing association. Now there may be more variants that are not necessarily
independent of each other. There could be two causal functional variants that share
some haplotypes but not all haplotypes with each other. And so, when going into the functional
biology, trying to figure out what the mechanisms are, what the underlying variants are, it’s
not just independent signals, but the multiple signals that might be present that might help
indicate how these DNA variants are leading to changes in gene expression or function
leading to disease. Here is evidence of association that shows
that you can obtain different results in different populations and that populations that are
older and that have more evidence of recombination events, and so have narrower regions of linkage
disequilibrium, can provide greater resolution to the signal and can show a narrower region
of association than in other populations. So shown here is some evidence of association
with height for a set of variants across a region. And then shown below are
pair-wise linkage disequilibrium plots for sets of variants in this region
from the CEU HapMap sample and the YRI HapMap sample. And you can see
that this evidence of association, which is from samples of European ancestry,
shows evidence of association across this region, and that there’s a relatively
wide linkage disequilibrium block in this region, whereas in the YRI samples, there
are narrower sets of variants inherited together; these are more inherited together, and these are more inherited
together, but these and these show less association with each other. The signal from the Caucasian sample was quite
broad; the signal in African-American individuals was strong in this region, but was not strong
in this region, suggesting that the more likely location of a potentially functional underlying
variant was restricted to this region and not in this one. In this particular case,
the variant that showed the stronger association in the African-Americans was also one that
had been shown previously to have an effect on gene expression of one of the nearby genes
perhaps providing some support for it having a functional role. The more genome-wide association studies that
are done with a range of traits, the more that the same variants and the same genes
are being identified as associated with two or more traits. Sometimes these signals
are associated with traits for which one can recognize what the underlying
mechanism might be. Sometimes the relationship between the different diseases or
traits that show evidence of association helps provide some biological clues as to
what those pathways might be that are responsible for a particular trait. So, there are variants that are being identified,
for example, for both diabetes and cancer. And in at least one case, the same DNA variant
was associated with increased risk of prostate cancer and decreased risk of type 2 diabetes.
Examples like this are suggesting perhaps the role of cell cycle genes and that variants
can end up having different sorts of effects. Looking at the collections of traits and associations
might help us understand what the driving biology is underlying a signal and which association
is coming, you know, sort of as a result of that initial trait. So in this analysis of genome-wide association
signals, the authors took the set of SNPs that had shown evidence of association with
the trait or disease, and then looked at annotation classes of where those variants were found
in the genome, and looked at annotation classes such as non-synonymous sites, regions around
promoters, regions in introns, regions that are intergenic, and compared randomly selected
sets of variants on genome-wide association panels to those that showed evidence of association,
and looked to see whether there’s an excess of variants in particular classes that had
been found to be associated with disease. So, in this particular analysis, here’s the
odds ratio of one, so anything crossing an odds ratio of one is not significant at the
5 percent level. But these classes here of non-synonymous variants, and promoter regions
at sort of 1KB and 5KB definitions all showed that the trait associated SNPs were over-represented
in these classes compared to just random variants on the genome-wide arrays. And even though
there are many variants identified in
introns and intergenic regions that show evidence of association, there are also many more variants
on the arrays that have these characteristics. So, taken together, the genome-wide associated
variants are being identified that explain some of the population variation for the various
traits. Shown here is a subset of traits, a partial table from a recent review. And
it shows a set of traits and the heritability from pedigree studies expected for these particular
traits. So, some traits are more highly heritable than others, and they show in comparison
the genome-wide association hits, the ones that are defined at genome-wide
significance, and what proportion of this heritability they are explaining.
And so, in many cases we’re looking at, say, approximately 10 percent of the heritability
as explained by the genome-wide association hits. Now analyses are being done to evaluate what
the effect of all common SNPs might be, not just the ones that have reached that threshold
to define significance, but the ones that maybe have not reached it yet, that with greater
sample size and more power might reach it in the future, to estimate what the heritability
might be of all SNPs that are being analyzed. And you can see, for example, that the heritability
that may be attributed to such common SNPs could increase a fair bit, still not likely
to be representing all of the variation that may be present. Genome-wide association
studies are largely restricted to some of the common variants, and so this suggests
that there are other genetic factors that are playing a role in heritability. The use of this information to prevent disease
is really dependent on the disease and heritability, and I should also say that in this particular
case with type 1 diabetes, there were variants known prior to the — they included some variants
known prior to the GWAS era that had a very strong effect when looking at that heritability
number. One way that people are characterizing individuals is based on the number of risk
alleles that they have. You could see some evidence of differences in groups of individuals.
So while the variants might not be very predictive for a given person, one can count them up. So in this particular case, there were more
than eight SNPs available that had shown evidence of association. So for each individual they
counted up how many height-increasing alleles that person had, and then grouped
them. So here’s a block of individuals that had eight or fewer height-increasing
alleles and plotted their average height, and compared it to, in these other regions,
the individuals that had at least 16 height-increasing alleles, and plotted their average height.
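The counting-and-grouping step just described can be sketched as follows. This is only an illustration: the genotype coding (0, 1, or 2 copies of the height-increasing allele at each SNP), the function names, and the toy genotypes and heights are assumptions, and only the two cut-offs (eight or fewer, 16 or more) come from the talk.

```python
from statistics import mean

def allele_count(genotypes):
    """Total number of height-increasing alleles carried across all SNPs.

    genotypes: list of 0/1/2 copies per SNP (an assumed coding).
    """
    return sum(genotypes)

def mean_height_by_group(people, low_cutoff=8, high_cutoff=16):
    """Group people by allele count and return the mean height per non-empty group.

    people: list of (genotypes, height_cm) pairs.
    """
    groups = {"low": [], "middle": [], "high": []}
    for genotypes, height_cm in people:
        count = allele_count(genotypes)
        if count <= low_cutoff:
            groups["low"].append(height_cm)
        elif count >= high_cutoff:
            groups["high"].append(height_cm)
        else:
            groups["middle"].append(height_cm)
    return {name: mean(heights) for name, heights in groups.items() if heights}

# Made-up toy data: 12 SNPs per person, heights in centimeters.
toy_people = [
    ([0] * 6 + [1] * 6, 167.0),   # 6 alleles  -> low group
    ([0] * 4 + [1] * 8, 169.0),   # 8 alleles  -> low group
    ([1] * 12, 171.0),            # 12 alleles -> middle group
    ([2] * 8 + [1] * 4, 173.0),   # 20 alleles -> high group
]
group_means = mean_height_by_group(toy_people)
```

The comparison is between group averages; the count for any one person says little about that person's height, which is the point made in the talk about individual predictability.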
And so between the individuals that had the lowest and the highest number of height-increasing
alleles, there is a difference of a few centimeters in how tall they are. However, most
individuals fall in the middle of this plot; these are common SNPs and the individual predictability
of the variants is relatively low. The value in clinical translation, then, of
these genome-wide association studies largely is starting with the novel biological insights.
These hundreds, more than a thousand signals identified in the past few years provide
hundreds and thousands of novel biological signals to go investigate and evaluate,
to determine what role those variants and those genes play in disease, which would
then in time lead to clinical advances, particular drugs, or biomarkers that represent the disease
better potentially leading towards prevention. There may be some improved measures of individual
genetic approaches, and I think you’ll learn more about those especially with respect to
drug development and drug response next week. So, in summary, when performing genome-wide
association studies, or interpreting them, it’s important to pay attention to design and
quality control. Large sample sizes are needed to identify signals with modest effects. There
are more than 1,400 signals and counting across the genome-wide association studies done to
date. Finding any signal doesn’t immediately provide information on the underlying biology or
clinical utility, but it sets off lots of follow-up analysis that can lead to these
discoveries. The time to changes in medical care based on some of these results might be
years, but the biology is really advancing quickly.

As we progress with genome-wide association
studies, more and more loci are being identified; larger meta-analyses are being done; groups
are gathering together more and more sets of samples; there is deeper follow-up of genome-wide
association signals, so groups are creating custom arrays of not just 50 variants to follow
up, but thousands of variants to follow up, to identify additional signals; population-specific
panels are being developed to increase the range of genetic variants that can be analyzed
in a given study; more diverse populations are being used to identify variants; other
types of sequence variants, not just single nucleotide variants, are being incorporated;
analyses are being done with multiple traits, looking at the relationships between
those traits; these are beginning to allow gene-gene and gene-environment
interactions to be analyzed and evaluated; and finally, the data are generating evidence and
spawning much future analysis to figure out the molecular and biological mechanisms underlying
the signals.

So, thank you very much for your attention. [applause]

11 thoughts on “Genome-Wide Association Studies – Karen Mohlke (2012)”

  1. Duy Ngoc says:

    thank, but it's too long

  2. Knockout Investing says:

    I believe @16:30 the Professor is talking about the Bonferroni correction.

  4. Zibraz313 says:

    She is worrying and so nervous while doing this.

  5. Thomas Xu says:

    She talks so much, makes everything difficult to understand.

  6. Muhammad Sulaman Nawaz says:

    Excellent presentation. Except little focus on statistical methods for GWAS. She did cover everything so nicely!
    I would say you do need prior knowledge steps for GWAS before this presentation!

  7. Arya A says:

    very interesting

  8. Alexander Gorelick says:

    Fantastic talk! Very comprehensive and well-structured overview. I found it mostly approachable with a basic knowledge of genetics/gene expression and a decent stats background (she only really mentions ChiSq, pvals, quantiles and logistic regression). Occasionally I googled some topics and definitions I did not know, but that was enough to follow the presentation. I really learned a lot from this. Thanks!

  9. Tamara al Janabi says:

    Really enjoyed this and found it v useful – thank you for putting this online as it's really useful to all kinds of researchers

  10. kiri says:

    hi biol2036

  11. 강윤경 says:

    Thanks for your wonderful lecture.
