Comprehensive lecture notes for the Human Genome Association studies module covered in MCB3026F. These notes cover all content taught in lectures as well as additional materials (powerpoints, textbooks) required to succeed. These notes were created by a student who achieved a distinction in this co...
Lecture 1A: Introduction
How to find a disease-causing gene (in pre-genomic era)
- find a family in which the disease appears in every generation and in multiple members
- track the disease through the family by taking a pedigree (for peas or Arabidopsis, one would instead cross them), then do linkage analysis using LOD scores and map distances in centimorgans
- this can come very close to mapping where the gene associated with that disease is
- sequence the region using chromatograms and translate the DNA sequence into a protein sequence
- through mutation detection, find the gene and where it is located on the chromosome
- this only worked for large families with a disease that was dominant
How to find a disease-causing gene (post-genomic era)
- the complete human genome sequence signaled the start of the genetics gold rush in the search for disease-related genes using Genome Wide Association Studies (GWAS)
- the genome was sequenced by taking DNA fragments, cloning them and assembling the sequences; now there is no need to map, clone and do pedigrees
- researchers tried to identify the gene for adult-onset diabetes, but there is no single gene for diabetes; it is a complex disorder
Genetic Association Studies (GWAS) identify genes associated with disease
- in a linkage study, one goes through a pedigree in which everyone's relatedness is known and the number of individuals is small, whereas in gene association studies the family relationships are not known: there is a group with the disease and a group without the disease (a case-control design, where the cases have the phenotype of interest and the controls are people who apparently do not have the disease or phenotype)
- in GWA studies, trying to find the genetic association of genes that contribute
to the risk of disease (not finding disease-causing genes but rather the risk
alleles contributing to the disease)
- for example: diabetes is caused by environmental influences and unknown genes
- each gene contributes to diabetes to a different degree
- the aim is to find these small genetic influences; thousands of patients (cases) and healthy people (controls) are screened for thousands of gene variants (SNPs)
GWAS try to identify genes associated with disease
- Q: how are gene association studies different from linkage studies that find disease-associated genes?
1) Gene association studies are not pedigree based
2) Instead, very large numbers of unrelated people are examined (case vs. control groups)
3) the diseases examined are complex diseases where many interacting genes are responsible for the phenotype
Genome-wide association studies (GWAS)
- it is a genome-wide association study
- it does not depend on traditional pedigree analysis
- basically, one is looking for an association between genotype and phenotype (usually disease)
- but thousands of individuals make up a study group, and hundreds of thousands to millions of DNA variants are used
- a gene is a known piece of sequence that has a start codon and is turned into a protein, but it is also a locus in the genome; GWAS looks at loci associated with the disease (these can be coding or non-coding)
- GWAS compares millions of differences (e.g. SNPs) across the DNA of hundreds to thousands of individuals to test whether variants are associated with a known trait or disease
- computing which variants occur with disease symptoms allows statistical estimates of the level of increased risk associated with each variant (these variants are assayed across a chip array)
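The case-control comparison described above can be sketched for a single SNP as follows. This is a minimal illustration, not the method of any particular GWAS package: the function name and the allele counts are made up for the example, and the test shown (an allelic chi-square test on a 2x2 table, with the odds ratio as the estimate of increased risk) is one common choice among several.

```python
import math


def allelic_association(case_counts, control_counts):
    """Compare allele counts between cases and controls for one SNP.

    Each argument is a tuple (risk_allele_count, other_allele_count).
    Returns (odds_ratio, chi_square, p_value).
    """
    a, b = case_counts      # cases: risk allele, other allele
    c, d = control_counts   # controls: risk allele, other allele
    n = a + b + c + d

    # Odds ratio: how much more common the risk allele is in cases.
    odds_ratio = (a * d) / (b * c)

    # Pearson chi-square statistic for a 2x2 table (1 degree of freedom).
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


# Hypothetical counts: 1000 cases and 1000 controls, two alleles each.
odds_ratio, chi_square, p_value = allelic_association((600, 1400), (500, 1500))
print(f"odds ratio = {odds_ratio:.2f}, chi-square = {chi_square:.2f}, p = {p_value:.2g}")
```

In a real study this test would be repeated for every SNP on the array, so the p-value for any one SNP has to be read against the very large number of tests performed.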
Lecture 2B: SNPs
Rapid increase in GWAS publications
- there has been a rapid increase in GWAS publications because of the availability of the sequenced human genome
- one of the first genomes that was sequenced was that of an anonymous person, used as a reference genome
Insights from the human genome (the numbers)
- the repeats are not fully known: people differ greatly in copy number variation, so sequences often cannot be aligned properly, and up to 1% to 2% of the apparent non-coding difference between any two individuals is error
- the genome has about 3 billion base pairs and more than 18 million SNPs have been found; the more people are added, the more variation is found
- a SNP (single-nucleotide polymorphism) is a germline substitution of a single nucleotide at a specific position in the genome, and SNPs can act as biological markers
The human reference genome
1) DNA sequence representative of a species' set of genes
2) Assembled from DNA from a number of people
3) Does not accurately represent the set of genes of any single person
4) Provides a haploid mosaic of different DNA sequences from each donor
5) GRCh37 (Genome Reference Consortium human genome, build 37) is derived from thirteen anonymous volunteers from New York, USA
Insights from human genome
- contains about 25,000 genes in distinct domains of gene organization
- combinatorial strategies lead to gene amplification and diversity
- genome sequence studies affirm evolution from a common ancestor (gene order is highly conserved)
Extensive allelic variation among individuals of same species
- first big insight is that there is a lot of inter-individual variation
- Individual variation known in protein sequences since 1950s
- DNA sequencing of individual genomes revealed that humans have a huge number of sequence variants
- the figure shows SNP variation (a single-nucleotide polymorphism is a base substitution at any site between two individuals)
- any two individuals differ at almost a million single base pair positions (SNPs)
- the SNP locus could be in a coding region of a gene, in an intron or elsewhere within a gene; it could be a synonymous change, which does not change the amino acid, or it could change the amino acid
- SNP variation can be translated into amino acid variation, which will be smaller because not all SNPs change amino acids: some are synonymous and some lie in non-coding regions
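The difference between a synonymous and an amino-acid-changing SNP can be checked directly against the standard genetic code. The sketch below is illustrative only: the function name is hypothetical, and it assumes the two codons are given on the coding strand and differ at exactly one base.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Return (kind, ref_amino_acid, alt_amino_acid) for a one-base change."""
    differences = sum(r != a for r, a in zip(ref_codon, alt_codon))
    if len(ref_codon) != 3 or len(alt_codon) != 3 or differences != 1:
        raise ValueError("expected two codons differing at exactly one base")
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# A third-base change that leaves the amino acid (E, glutamate) unchanged.
print(classify_coding_snp("GAA", "GAG"))
# A second-base change that swaps glutamate (E) for valine (V).
print(classify_coding_snp("GAG", "GTG"))
```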
Example of the SNPs in a 400 kb region of human chromosome 7 containing the CFTR gene
- shows SNPs in two individual genomes
- can see that Venter differs from the reference at different places than Watson does, but there are also some places where they are the same
Categories of genetic variation
- many ways that genomes differ from each other
- the most obvious is a simple DNA base substitution, which is called a SNP
- the mutation rate (how often a new SNP arises in the DNA) is a function of DNA errors and repair enzymes; it is measured by looking at the endpoint, i.e. how many differences two people have and how closely related they are
- the average mutation rate for SNPs is known, but certain areas of DNA mutate faster than others, so more SNPs are found in non-coding regions and introns; in exons they often hit the 3rd base of a codon and so do not change the function (synonymous SNPs)
- other mutations are important too: insertions and deletions, both small and large, where some DNA is added or missing
- copy number variation polymorphisms are where there is more than one copy of a stretch of DNA; these typically involve larger fragments
- the figure tells you how often each kind of variation is found across the genome landscape and how many there are
- can see the most numerous kind is SNPs
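As a rough sketch of how a mutation rate could be estimated from the endpoint comparison described above: if two people share a common ancestor g generations back, each lineage has had g generations to accumulate mutations, so the observed differences are spread over 2g generations. The function name, the simple model (no selection, no site mutating twice) and the example numbers are illustrative assumptions, not values from the lecture.

```python
def mutation_rate_per_site_per_generation(differences, sites_compared,
                                          generations_to_ancestor):
    """Estimate the per-site, per-generation mutation rate.

    differences: number of single-base differences between two individuals
    sites_compared: number of aligned sites that were compared
    generations_to_ancestor: generations back to their common ancestor
    """
    # Both lineages mutate independently, hence the factor of 2.
    total_generations = 2 * generations_to_ancestor
    return differences / (total_generations * sites_compared)


# Made-up example: 60 differences over 1,000,000 sites,
# with a common ancestor 2,500 generations ago.
rate = mutation_rate_per_site_per_generation(60, 1_000_000, 2_500)
print(f"estimated rate: {rate:.2e} mutations per site per generation")
```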
Single nucleotide polymorphisms (SNPs)
- SNP (pronounced snip) is a single nucleotide polymorphism
- locations within the genome where a nucleotide substitution occurs
- most common type of genetic variation
- at least 1% of a population must have the alternative nucleotide variant to be considered a SNP (has to have a measurable
frequency whereas just a mutation is where it is very, very rare)
- account for majority of variation among human genomes (> 18 million human SNPs have been identified)
- there is an element of compounding complexity depending on what population is being looked at as populations are
geographically isolated and only exchange genes with people in the area
- if part of an African population with lots of SNPs migrated, there would be a bottleneck
- there is much more genetic variation in African populations, in larger populations and in older populations (because mutations happen at a steady rate, older populations have had more time to accumulate mutations)
- SNPs are useful for understanding migration patterns; how many and what kinds of SNPs are present gives an idea of population structure
- SNPs can be homozygous or heterozygous in an
individual
- comparison of human and chimpanzee genomes
revealed the SNPs that occurred since divergence of
these two species
- The G in the human and A in the chimp is not a SNP in the human population but a difference from another primate
- The T and G in the human is a SNP
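The 1% rule and the homozygous/heterozygous distinction above can be illustrated with a small allele-frequency calculation. This is a sketch under stated assumptions: genotypes are written as two-letter strings such as "TG", the 1% cut-off is applied to the frequency of the second most common allele, and the function names and sample data are made up.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes like 'TT' or 'TG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_snp(genotypes, threshold=0.01):
    """True if the second most common allele reaches the frequency threshold."""
    frequencies = sorted(allele_frequencies(genotypes).values(), reverse=True)
    return len(frequencies) > 1 and frequencies[1] >= threshold


def zygosity(genotype):
    """Classify one individual's genotype at a site."""
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


# Hypothetical site typed in 100 people: G is the less common allele.
common_site = ["TT"] * 90 + ["TG"] * 9 + ["GG"] * 1
# Hypothetical site where G is seen once in 1000 people: a rare mutation.
rare_site = ["TT"] * 999 + ["TG"] * 1

print(allele_frequencies(common_site), is_snp(common_site))
print(allele_frequencies(rare_site), is_snp(rare_site))
print(zygosity("TG"), zygosity("GG"))
```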
Lecture 2A: Arrays and CNV
Example of the SNPs in a 400 kb region of human chromosome 7 containing the CFTR gene
- NCBI sequence viewer and UCSC genome
browser:
- 3 tracks of SNPs compared against RefSeq genome,
SNPs in two individual genomes (Watson and Venter) and SNPs in sequences from all European subjects (CEU)
- block patterns of SNP similarity and dissimilarity
- these provide the basis for GWAS
- for GWAS it is already known that a site carries, say, an A or a T; to analyse that site in a population, one can either resequence the population (even with just a small PCR primer pair for that region, where there are only 2 SNPs) or look at all the SNPs
- resequencing: in exome sequencing experiments only the exons are sequenced, because there is less to sequence and it finds the SNPs that affect protein-coding regions; costs are much lower
- GWAS experiments do not sequence the DNA; the two alleles of each SNP are already known and are put on an array
Genomics is failing in representing diversity
- databases contain mostly genomes from European subjects
- many populations are left behind on the road to precision medicine
- especially African populations, which have the greatest diversity
- an analysis indicated that many populations are still being left behind
- it is a societal problem as well: governments pay for the sequencing, and many countries cannot afford this
SNPs can be genotyped using different molecular methods
1) RFLP (restriction fragment length polymorphism), where a SNP changes a restriction enzyme site
2) DNA microarrays, which detect SNP alleles at > 1 million loci
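The RFLP idea in point 1 can be sketched in a few lines: if a SNP destroys a restriction site, the enzyme no longer cuts there and the fragment lengths change. The sequences and the function name below are made up for illustration. EcoRI (recognition site GAATTC, cutting after the first G) is used as the example enzyme, and only one strand of a linear fragment is modelled.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of a recognition site.

    cut_offset is the position of the cut within the site
    (1 for EcoRI, which cuts G^AATTC). Returns the fragment lengths.
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Hypothetical alleles differing by one base inside the EcoRI site.
allele_with_site = "ACGTACGT" + "GAATTC" + "TTGGCCAA"
allele_without_site = "ACGTACGT" + "GAGTTC" + "TTGGCCAA"  # A -> G SNP

print(digest(allele_with_site, "GAATTC", 1))     # two fragments
print(digest(allele_without_site, "GAATTC", 1))  # one uncut fragment
```

On a gel, the first allele would give two shorter bands and the second a single longer band, so a heterozygote would show all three.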