Part 1. Automation in R : loops, lists and functions
Introduction
The folder contains the 4 datasets with results of a study into the genetic causes of age-related
hearing impairment (ARHI) : connexins.txt, kcnq4.txt, oxidStress.txt and monogenic.txt.
ARHI is the gradual decline of hearing with ageing. In some persons this decline is worse
than in others, and there are reasons to believe that part of this variation between individuals
is genetic. Unlike monogenic disorders, where one single gene causes a disease phenotype,
ARHI is complex – due to multiple genes with a small effect, plus environmental risk factors.
In the ARHI study, we searched for genetic variants (single nucleotide polymorphisms, SNPs)
that were associated with the hearing phenotype. If a gene contains several SNPs showing a
strong association with the hearing phenotype, this can indicate a role of this gene in hearing
impairment.
The statistical analysis involves an association test between the phenotype and the genotype.
The phenotype is described by the Z-score. This is an age- and gender corrected score that
describes how well a person can hear, given age and gender. Good-hearing persons have a
low (negative) Z-score, whereas bad hearing persons have a high (positive) Z-score. The Z-
score is a numeric variable with an approximately normal distribution.
Each SNP has two alleles and, hence, 3 genotypes: aa, ab and bb. Association tests are
commonly performed in two ways:
1) using an ANOVA : treating the 3 genotypes as 3 completely distinct categories
(with no ordering)
2) using regression, coding the genotypes like this : aa=0 , ab=1 and bb=2. This
analysis treats the heterozygous as the intermediate between the two homozygous
genotypes. This is called testing under an additive model. :
Automation 1: Looping
Read in the dataset connexins.txt. This dataset contains genotyping results from SNPs within
several genes of the connexin family (gap junction proteins expressed in the inner ear.).
First: setwd(“/”)
inputData <- read.table("connexins.txt", header=TRUE, sep="\t",
na.strings=c("0",""),stringsAsFactors=TRUE)
The latter argument specifies that missing values are indicated by either an empty field ("") or
a zero. (Yes, R allows more than one missing value indicator).
How does the dataset look like?
str(inputData)
1
,The first column is the subject-identifier (ID). The Zscore is the numeric phenotype. The
remainder of the columns represent the genotypes of the SNPs of interest. They have been
read in as factors (=categorical variables), with 3 levels.
When we test the first SNP (Cx26_SNP1) for association with the phenotype (Zscore), we can
use the ANOVA or the regression approach. Remember that both are carried out by the lm
function in R, and the type of X-variable determines whether an ANOVA or regression is
performed:
For ANOVA, with genotype categorical :
modelANOVA<-lm(Zscore~Cx26_SNP1,data=inputData)
summary(modelANOVA)
p.anova <- anova(modelANOVA)[1,5]
2
, The last command retrieves the overall p-value. If we want the regression approach, we need
to convert the genotype to a number (1,2,3). In R, we can automatically change the categorical
genotypes aa, ab, bb to numbers using the co-ercion formula as.numeric(). The lm function
will then carry out regression:
modelRegr<-lm(Zscore~as.numeric(Cx26_SNP1),data=inputData)
summary(modelRegr)
p.regres <- anova(modelRegr)[1,5]
This gives us the association test for the first SNP. However, the study involved genotyping
several SNPs – nowadays such studies involve up to a million SNPs. We therefore need to
automate the analysis.
If we want to run the above code for all SNPs, the only argument that changes is the
independent variable (after the tilde). The Z-score and the rest of the code will always be the
same. First we create a dataframe with only the SNPs. That is, the original with the first two
columns removed.
allSNPs<-inputData[,-c(1,2)]
In the allSNPs dataframe, the ith column contains the genotype from the ith SNP. How many
SNPs are these…? The dim() function gives us the dimensions of the dataframe : the number
of rows and columns. The SNP names can be retrieved using the names() function.
n.columns<-dim(inputData)[2]
n.snps<-n.columns-2
SNPnames<-names(allSNPs)
3