GWAS on tetraploid genotypes using GAPIT

Packages needed to run GAPIT

Install the following packages if you don’t have them already.

install.packages(c("gplots", "LDheatmap", "genetics", "ape", "EMMREML",
                   "scatterplot3d", "BiocManager", "ggplot2", "viridis"))
BiocManager::install("multtest")

Now we will load all of the packages as described in the GAPIT manual.

library(multtest)
library(gplots)
library(LDheatmap)
library(genetics)
library(ape)
library(EMMREML)
library(compiler) #this library is already installed in R
library(scatterplot3d)

And more for our own visualizations.

library(ggplot2)
library(viridis)

And finally load the source code with the GAPIT and EMMA functions.

source("http://zzlab.net/GAPIT/gapit_functions.txt")
source("http://zzlab.net/GAPIT/emma.txt")

Data for this tutorial

Here we load the GAPIT files that were exported from polyRAD, containing posterior mean genotypes. See the previous tutorial to understand how these were generated.

myGD <- read.csv("Msa_tetraploids_Chr05_GD.csv", stringsAsFactors = FALSE)
myGM <- read.csv("Msa_tetraploids_Chr05_GM.csv", stringsAsFactors = FALSE)

myGD[1:10,1:4]

##              taxa S05_51928_TTA S05_51928_TCC S05_51928_ATC
## 1         PMS-458  2.250853e-03  4.624893e-04    0.51356418
## 2  KMS-widespread  5.117938e-01  3.804286e-07    0.54778187
## 3      UI10-00117  4.479319e-02  1.716671e-02    0.05933096
## 4      UI11-00005  5.012321e-02  5.222333e-05    1.01539608
## 5      UI11-00027  1.163206e-04  6.286932e-02    0.13186265
## 6    IGR-2011-005  1.818289e-05  1.818315e-05    0.02728319
## 7         PMS-457  8.085649e-04  1.339165e-04    0.50462159
## 8           JY186  1.410903e-01  9.146026e-04    0.40938347
## 9           JY192  6.475757e-02  5.004677e-04    0.18882859
## 10          JY182  1.554136e-01  2.000000e-04    0.41823570

head(myGM)

##            Name Chromosome Position
## 1 S05_51928_TTA          5    51928
## 2 S05_51928_TCC          5    51928
## 3 S05_51928_ATC          5    51928
## 4  S05_81981_GG          5    81981
## 5  S05_81981_AC          5    81981
## 6 S05_132813_GC          5   132813

Next we will import some phenotypic data. The taxa are listed in the same order that they are in for the genotype file.

myY <- read.csv("Msa_tetraploids_pheno.csv", stringsAsFactors = FALSE)

The phenotypes file contains least-squared means, Box-Cox transformed, for stem diameter.

hist(myY$Stem_diameter)

sum(is.na(myY$Stem_diameter))

## [1] 38

Impact of population structure on the phenotype

GAPIT will perform a PCA, but we can also do the PCA ourselves to evaluate to what degree the phenotype is associated with population structure.

mypca <- prcomp(as.matrix(myGD[,-1]), scale. = TRUE)

We can plot the marker variance explained by each axis to get a sense of how many axes are biologically meaningful.

plot(mypca$sdev[1:50] ^ 2)

It looks like the first nine should suffice. Now we can see whether their association with the phenotype is significant.

pca_model <- lm(myY$Stem_diameter ~ mypca$x[,1:9])
summary(pca_model)

##
## Call:
## lm(formula = myY$Stem_diameter ~ mypca$x[, 1:9])
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01787 -0.13323 -0.00425  0.14116  1.07171
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.0462611  0.0173360 118.036  < 2e-16 ***
## mypca$x[, 1:9]PC1 -0.0025183  0.0006115  -4.118 5.41e-05 ***
## mypca$x[, 1:9]PC2 -0.0111238  0.0007219 -15.410  < 2e-16 ***
## mypca$x[, 1:9]PC3 -0.0052087  0.0007632  -6.825 8.43e-11 ***
## mypca$x[, 1:9]PC4  0.0023454  0.0009276   2.529   0.0122 *  
## mypca$x[, 1:9]PC5  0.0009586  0.0009692   0.989   0.3237    
## mypca$x[, 1:9]PC6  0.0001994  0.0011145   0.179   0.8582    
## mypca$x[, 1:9]PC7  0.0010986  0.0011355   0.967   0.3344    
## mypca$x[, 1:9]PC8  0.0003912  0.0017202   0.227   0.8203    
## mypca$x[, 1:9]PC9 -0.0011942  0.0012338  -0.968   0.3341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2604 on 220 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5639
## F-statistic:  33.9 on 9 and 220 DF,  p-value: < 2.2e-16

Based on this, we want to use the first four PCs as covariates in GWAS. We can put them into a data frame.

myCV <- data.frame(Taxa = myY$Taxa,
                   PC1 = mypca$x[,1],
                   PC2 = mypca$x[,2],
                   PC3 = mypca$x[,3],
                   PC4 = mypca$x[,4])

We can also visualize how the axes are associated with the phenotype.

gp <- ggplot(myCV, aes(col = myY$Stem_diameter)) +
  scale_color_viridis()
gp + geom_point(aes(x = PC1, y = PC2))

gp + geom_point(aes(x = PC3, y = PC4))

Running GAPIT

GAPIT generates a lot of output files, so we will want to put them in a separate folder. Make a folder called GAPIT_output and then change your working directory to that one. Then we can run GAPIT.

setwd("GAPIT_output")
my_gapit <- GAPIT(Y = myY, GD = myGD, GM = myGM, CV = myCV)
setwd("..")

Evaluating the results

Many files were output from GAPIT. If you open one of the manhattan plots, as well as the QQ plot, you should see that there was one highly significant marker. We can import the results table to find out which it was.

gwas_results <- read.csv("GAPIT_output/GAPIT.MLM.DBI.year3.BC.GWAS.Results.csv",
                         stringsAsFactors = FALSE)
head(gwas_results)

##             SNP Chromosome Position    P.value        maf nobs
## 1 S05_51928_ATC          1    51928 0.41298877 0.12328428  230
## 2 S05_51928_TCC          1    51928 0.59161215 0.03967660  230
## 3 S05_51928_TTA          1    51928 0.70552296 0.04533799  230
## 4  S05_81981_AC          1    81981 0.32211425 0.09000310  230
## 5  S05_81981_GG          1    81981 0.53008783 0.40364946  230
## 6 S05_132813_GC          1   132813 0.03273092 0.29255000  230
##   Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP
## 1                    0.5274219                 0.5288100
## 2                    0.5274219                 0.5280176
## 3                    0.5274219                 0.5277173
## 4                    0.5274219                 0.5294538
## 5                    0.5274219                 0.5282379
## 6                    0.5274219                 0.5369475
##   FDR_Adjusted_P.values      effect
## 1             0.9747857  0.03585107
## 2             0.9921247 -0.04872719
## 3             0.9948205 -0.04753831
## 4             0.9680404 -0.03408207
## 5             0.9832450 -0.08948270
## 6             0.9153636 -0.11704214

gwas_results[gwas_results$FDR_Adjusted_P.values < 0.05, ]

##                   SNP Chromosome Position      P.value        maf nobs
## 2532 S05_13267268_AAA          1 13267268 5.125402e-08 0.04786535  230
##      Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP
## 2532                    0.5274219                 0.5929617
##      FDR_Adjusted_P.values    effect
## 2532          0.0006749129 0.0418769

What was the relationship between this allele and the phenotype? We will multiply the genotype by two so that it scales from zero to four rather than zero to two, which will make it more intuitive in terms of allele copy number.

gp2 <- ggplot(mapping = aes(x = myGD$S05_13267268_AAA * 2,
                            y = myY$Stem_diameter)) +
  scale_color_viridis() +
  xlab("Allele copy number of AAA at S05_13267268") +
  ylab("Stem diameter")
gp2 + geom_point(aes(col = myCV$PC1)) + labs(col = "PC1")

We can see the three populations more or less by three color groups. Within each population, there seems to be a negative relationship between number of copies of this allele and stem diameter.

This visualization is helpful for understanding posterior mean genotypes and how they can be useful. The individuals clustered around zero on the x-axis probably have zero copies of the allele. At 0.5 on the x-axis, individuals have about a 50% chance of having one copy of the allele and a 50% chance of having zero copies. Since some of those individuals will have the allele and some won’t, for the sake of linear regression it makes sense to have them positioned intermediate between zero and one.

Also notice that the three populations differ in terms of their median genotype value, even though the median is always less than 0.5 (i.e probably having zero copies). The allele is most common in the yellow/yellow-green population. Therefore, at low read depth, even if there were no reads for the allele, there is some probability that the individual has the allele, and that probability is higher than it is for the other populations. Likewise, the allele is rarest in the purple population, hence why the median genotype value for that population is much closer to zero.

If you have time, you can go back to the RADdata object for this marker and see how allelic read depth and genotype prior probabilities led to these posterior mean genotypes.