Packages for this tutorial
We will use the VariantAnnotation
package from Bioconductor to read and filter VCFs. We will also use ape
to make a phylogenetic tree. If you don’t have these packages already, install them like so:
install.packages("BiocManager", "ape")
BiocManager::install("VariantAnnotation")
Then load them.
library(VariantAnnotation)
library(ape)
Data for this tutorial
We will use the VCF that was generated during the GBS-SNP-CROP tutorial, which we named Msa_GSCdemo.vcf
. It contains data for about 38K markers on eight tetraploid individuals of Miscanthus sacchariflorus.
It is generally easier to work with VCFs in VariantAnnotation
if they have been zipped and indexed, so we’ll do that first.
mybg <- bgzip("Msa_GSCdemo.vcf")
indexTabix(mybg, format = "vcf")
We can take a look at some information from the file header.
hdr <- scanVcfHeader(mybg)
hdr
## class: VCFHeader
## samples(8): JM-2014-H-28 JM-2014-K-8 ... KMS257 RU2012-055
## meta(4): fileDate fileformat phasing source
## fixed(0):
## info(5): AC AF DP AV NS
## geno(2): GT AD
samples(hdr)
## [1] "JM-2014-H-28" "JM-2014-K-8" "JM-2014-M-5" "JM-2014-S-5"
## [5] "JY134" "JY185" "KMS257" "RU2012-055"
meta(hdr)$source
## DataFrame with 1 row and 1 column
## Value
## <character>
## source GBS-SNP-CROP
geno(hdr)
## DataFrame with 2 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## AD R Integer Allele Depth
And confirm this matches the text in the file.
cat(readLines(mybg, 20), sep = "\n")
## ##fileformat=VCFv4.2
## ##fileDate=Thu Apr 25 10:55:57 2019
## ##source=GBS-SNP-CROP
## ##phasing=partial
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
## ##INFO=<ID=AF,Number=A,Type=Integer,Description="Allele Frequency">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
## ##INFO=<ID=AV,Number=1,Type=Integer,Description="Average Depth">
## ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5 JY134 JY185 KMS257 RU2012-055
## MockRefGenome 1731 . A G . PASS AC=8;AF=0.500;DP=7403;AV=925.38;NS=8 GT:AD 0/1:189,76 0/1:511,419 0/1:238,191 0/1:845,644 0/1:607,600 0/1:233,204 0/1:726,480 0/1:779,661
## MockRefGenome 1851 . T A . PASS AC=7;AF=0.438;DP=6180;AV=871.14;NS=7 GT:AD 0/1:646,167 0/1:720,102 0/1:240,48 0/1:520,103 0/1:898,188 ./.:.,. 0/1:679,170 0/1:1342,275
## MockRefGenome 3311 . A G . PASS AC=3;AF=0.188;DP=4447;AV=555.88;NS=8 GT:AD 0/0:23,0 0/0:349,0 0/0:464,0 0/1:521,165 0/1:620,121 0/1:311,104 0/0:898,0 0/0:871,0
## MockRefGenome 3833 . G T . PASS AC=8;AF=0.500;DP=4091;AV=511.38;NS=8 GT:AD 0/1:460,130 0/1:359,230 0/1:283,104 0/1:335,174 0/1:607,178 0/1:160,61 0/1:329,86 0/1:496,99
## MockRefGenome 5102 . T C . PASS AC=8;AF=0.500;DP=4108;AV=513.50;NS=8 GT:AD 0/1:203,154 0/1:299,118 0/1:181,52 0/1:616,209 0/1:524,191 0/1:253,58 0/1:445,210 0/1:440,155
## MockRefGenome 5104 . C G . PASS AC=8;AF=0.500;DP=4107;AV=513.38;NS=8 GT:AD 0/1:203,155 0/1:299,118 0/1:181,50 0/1:596,230 0/1:514,200 0/1:248,63 0/1:430,226 0/1:439,155
## MockRefGenome 5127 . G A . PASS AC=8;AF=0.500;DP=4107;AV=513.38;NS=8 GT:AD 0/1:203,155 0/1:298,118 0/1:181,52 0/1:595,230 0/1:514,200 0/1:245,63 0/1:431,227 0/1:440,155
## MockRefGenome 5327 . A G . PASS AC=8;AF=0.500;DP=3290;AV=411.25;NS=8 GT:AD 0/1:155,104 0/1:384,91 0/1:125,68 0/1:251,41 0/1:291,89 0/1:161,35 0/1:478,98 0/1:734,185
Making a prefilter
I know that Miscanthus has a whole genome duplication (even before the autotetraploidy in these M. sacchariflorus accessions) and so even with the paralog filtering that was performed by GBS-SNP-CROP, I’m suspicious that some paralogs made it through. Of those first few markers that I printed out, I can see that most of them are heterozygous for all individuals. Those markers are probably merged paralogs, and definitely not informative, so I would like to get rid of them.
One option from VariantAnnotation
is to make a prefilter. This is a function that treats each line of the VCF genotype table as a text string, and returns TRUE
or FALSE
incidating whether or not that line should pass the filter. One very handy function for building prefilter functions is grepl
, since it searches for a particular string or pattern, and returns TRUE
if it finds it and FALSE
otherwise. Another handy function is gregexpr
if you want to know how many times you find a particular pattern in each line.
If all eight samples are either heterozygous or missing, I want to discard the marker. In the sense of the file format, this means that the GT
field for every sample is either 0/1
or ./.
. Another way of looking at it is that there would be no homozygotes, i.e. no 0/0
and no 1/1
. Here’s a function to perform that check.
NotAllHet <- function(lines){
found00 <- grepl("0/0:", lines)
found11 <- grepl("1/1:", lines)
return(found00 | found11)
}
We can test it with the first few lines of the file, ignoring the header lines.
mylines <- readLines(mybg, 30)
# skip header lines starting with "#"
mylines <- mylines[!grepl("^#", mylines)]
keepem <- NotAllHet(mylines)
cat(mylines[keepem], sep = "\n")
## MockRefGenome 3311 . A G . PASS AC=3;AF=0.188;DP=4447;AV=555.88;NS=8 GT:AD 0/0:23,0 0/0:349,0 0/0:464,0 0/1:521,165 0/1:620,121 0/1:311,104 0/0:898,0 0/0:871,0
## MockRefGenome 6284 . G A . PASS AC=5;AF=0.312;DP=3185;AV=398.00;NS=8 GT:AD 0/0:292,1 0/0:713,0 0/1:235,60 0/1:474,172 0/1:234,219 0/1:124,128 0/1:153,44 0/0:336,0
## MockRefGenome 6333 . A T . PASS AC=5;AF=0.312;DP=3169;AV=396.12;NS=8 GT:AD 0/0:291,0 0/0:711,0 0/1:223,69 0/1:437,203 0/1:234,219 0/1:125,126 0/1:118,78 0/0:335,0
## MockRefGenome 7285 . C T . PASS AC=3;AF=0.188;DP=3123;AV=423.43;NS=7 GT:AD 0/0:388,0 0/0:552,0 ./.:.,. 0/1:372,98 0/1:489,334 0/1:138,123 0/0:235,1 0/0:235,0
cat(mylines[!keepem], sep = "\n")
## MockRefGenome 1731 . A G . PASS AC=8;AF=0.500;DP=7403;AV=925.38;NS=8 GT:AD 0/1:189,76 0/1:511,419 0/1:238,191 0/1:845,644 0/1:607,600 0/1:233,204 0/1:726,480 0/1:779,661
## MockRefGenome 1851 . T A . PASS AC=7;AF=0.438;DP=6180;AV=871.14;NS=7 GT:AD 0/1:646,167 0/1:720,102 0/1:240,48 0/1:520,103 0/1:898,188 ./.:.,. 0/1:679,170 0/1:1342,275
## MockRefGenome 3833 . G T . PASS AC=8;AF=0.500;DP=4091;AV=511.38;NS=8 GT:AD 0/1:460,130 0/1:359,230 0/1:283,104 0/1:335,174 0/1:607,178 0/1:160,61 0/1:329,86 0/1:496,99
## MockRefGenome 5102 . T C . PASS AC=8;AF=0.500;DP=4108;AV=513.50;NS=8 GT:AD 0/1:203,154 0/1:299,118 0/1:181,52 0/1:616,209 0/1:524,191 0/1:253,58 0/1:445,210 0/1:440,155
## MockRefGenome 5104 . C G . PASS AC=8;AF=0.500;DP=4107;AV=513.38;NS=8 GT:AD 0/1:203,155 0/1:299,118 0/1:181,50 0/1:596,230 0/1:514,200 0/1:248,63 0/1:430,226 0/1:439,155
## MockRefGenome 5127 . G A . PASS AC=8;AF=0.500;DP=4107;AV=513.38;NS=8 GT:AD 0/1:203,155 0/1:298,118 0/1:181,52 0/1:595,230 0/1:514,200 0/1:245,63 0/1:431,227 0/1:440,155
## MockRefGenome 5327 . A G . PASS AC=8;AF=0.500;DP=3290;AV=411.25;NS=8 GT:AD 0/1:155,104 0/1:384,91 0/1:125,68 0/1:251,41 0/1:291,89 0/1:161,35 0/1:478,98 0/1:734,185
## MockRefGenome 5498 . A C . PASS AC=7;AF=0.438;DP=3112;AV=400.14;NS=7 GT:AD ./.:.,. 0/1:344,71 0/1:96,53 0/1:338,95 0/1:418,91 0/1:247,81 0/1:279,80 0/1:481,127
## MockRefGenome 6514 . G C . PASS AC=7;AF=0.438;DP=3970;AV=566.86;NS=7 GT:AD 0/1:360,189 0/1:323,121 0/1:122,111 0/1:227,89 0/1:478,177 ./.:.,. 0/1:337,246 0/1:738,450
## MockRefGenome 6832 . A T . PASS AC=7;AF=0.438;DP=10029;AV=1204.14;NS=7 GT:AD 0/1:1479,226 0/1:1216,210 0/1:641,105 ./.:.,. 0/1:651,87 0/1:325,55 0/1:1333,238 0/1:1580,283
## MockRefGenome 6835 . G T . PASS AC=7;AF=0.438;DP=9750;AV=1339.14;NS=7 GT:AD 0/1:1447,240 0/1:1127,269 0/1:602,112 0/1:1290,274 0/1:531,154 ./.:.,. 0/1:1289,222 0/1:1546,271
## MockRefGenome 7510 . C A . PASS AC=8;AF=0.500;DP=2853;AV=356.62;NS=8 GT:AD 0/1:203,67 0/1:262,77 0/1:131,48 0/1:222,102 0/1:515,111 0/1:225,57 0/1:407,71 0/1:253,102
## MockRefGenome 8478 . C G . PASS AC=8;AF=0.500;DP=4276;AV=534.50;NS=8 GT:AD 0/1:363,115 0/1:493,131 0/1:196,84 0/1:317,104 0/1:562,122 0/1:253,44 0/1:439,128 0/1:756,169
## MockRefGenome 8490 . G C . PASS AC=8;AF=0.500;DP=4272;AV=534.00;NS=8 GT:AD 0/1:251,227 0/1:368,254 0/1:154,126 0/1:256,165 0/1:398,286 0/1:206,91 0/1:346,220 0/1:614,310
Here is how I would apply that filter to my VCF file, exporting a filtered VCF.
filter1_file <- "Msa_GSCdemo_filter1.vcf"
filterVcf(mybg, genome = "MsaMockReference",
destination = filter1_file,
prefilters = FilterRules(list(NotAllHet)))
And I can confirm that it looks filtered.
cat(readLines(filter1_file, 20), sep = "\n")
## ##fileformat=VCFv4.2
## ##fileDate=Thu Apr 25 10:55:57 2019
## ##source=GBS-SNP-CROP
## ##phasing=partial
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
## ##INFO=<ID=AF,Number=A,Type=Integer,Description="Allele Frequency">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
## ##INFO=<ID=AV,Number=1,Type=Integer,Description="Average Depth">
## ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5 JY134 JY185 KMS257 RU2012-055
## MockRefGenome 3311 . A G . PASS AC=3;AF=0.188;DP=4447;AV=555.88;NS=8 GT:AD 0/0:23,0 0/0:349,0 0/0:464,0 0/1:521,165 0/1:620,121 0/1:311,104 0/0:898,0 0/0:871,0
## MockRefGenome 6284 . G A . PASS AC=5;AF=0.312;DP=3185;AV=398.00;NS=8 GT:AD 0/0:292,1 0/0:713,0 0/1:235,60 0/1:474,172 0/1:234,219 0/1:124,128 0/1:153,44 0/0:336,0
## MockRefGenome 6333 . A T . PASS AC=5;AF=0.312;DP=3169;AV=396.12;NS=8 GT:AD 0/0:291,0 0/0:711,0 0/1:223,69 0/1:437,203 0/1:234,219 0/1:125,126 0/1:118,78 0/0:335,0
## MockRefGenome 7285 . C T . PASS AC=3;AF=0.188;DP=3123;AV=423.43;NS=7 GT:AD 0/0:388,0 0/0:552,0 ./.:.,. 0/1:372,98 0/1:489,334 0/1:138,123 0/0:235,1 0/0:235,0
## MockRefGenome 8822 . T A . PASS AC=4;AF=0.250;DP=5573;AV=718.86;NS=7 GT:AD 0/0:264,0 ./.:.,. 0/1:300,63 0/1:670,87 0/0:963,0 0/1:199,29 0/0:872,0 0/1:1308,277
## MockRefGenome 11589 . C A . PASS AC=3;AF=0.188;DP=3951;AV=493.88;NS=8 GT:AD 0/1:281,103 0/1:429,334 0/1:202,109 0/0:541,0 0/0:583,0 0/0:225,0 0/0:576,0 0/0:568,0
## MockRefGenome 12355 . C T . PASS AC=2;AF=0.125;DP=1681;AV=222.00;NS=7 GT:AD 0/0:287,1 0/1:215,55 0/0:57,0 0/1:176,32 0/0:169,0 ./.:.,. 0/0:263,0 0/0:300,0
## MockRefGenome 12663 . T C . PASS AC=2;AF=0.125;DP=1807;AV=226.29;NS=7 GT:AD 0/0:118,0 ./.:.,. 0/0:76,0 0/1:327,192 0/0:270,0 0/1:50,9 0/0:223,0 0/0:319,0
Then zip and index it.
filter1_bg <- bgzip(filter1_file)
indexTabix(filter1_bg, format = "vcf")
Importing data from a VCF
This is a pretty small VCF, so let’s first explore what happens when we import the whole thing.
all_my_data <- readVcf(filter1_bg)
all_my_data
## class: CollapsedVCF
## dim: 29294 8
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 5 columns: AC, AF, DP, AV, NS
## info(header(vcf)):
## Number Type Description
## AC A Integer Allele Count
## AF A Integer Allele Frequency
## DP 1 Integer Total Depth
## AV 1 Integer Average Depth
## NS 1 Integer Number of Samples With Data
## geno(vcf):
## SimpleList of length 2: GT, AD
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## AD R Integer Allele Depth
We kept 29K markers (pretty good given how heavily filtered the first few lines were). We also see a bunch of slots in the VCF object that we can explore.
We can get to the header, with the same data we found using scanVcfHeader
on the file.
header(all_my_data)
## class: VCFHeader
## samples(8): JM-2014-H-28 JM-2014-K-8 ... KMS257 RU2012-055
## meta(4): fileDate fileformat phasing source
## fixed(0):
## info(5): AC AF DP AV NS
## geno(2): GT AD
Here’s the data from the leftmost columns of the genotype table. This is a GRanges
object, which is a special kind of table used in Bioconductor for representing genomic coordinates.
rowRanges(all_my_data)
## GRanges object with 29294 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## MockRefGenome:3311_A/G MockRefGenome 3311 * | <NA>
## MockRefGenome:6284_G/A MockRefGenome 6284 * | <NA>
## MockRefGenome:6333_A/T MockRefGenome 6333 * | <NA>
## MockRefGenome:7285_C/T MockRefGenome 7285 * | <NA>
## MockRefGenome:8822_T/A MockRefGenome 8822 * | <NA>
## ... ... ... ... . ...
## MockRefGenome:39977669_A/G MockRefGenome 39977669 * | <NA>
## MockRefGenome:40101335_G/T MockRefGenome 40101335 * | <NA>
## MockRefGenome:40259299_T/C MockRefGenome 40259299 * | <NA>
## MockRefGenome:40949663_A/T MockRefGenome 40949663 * | <NA>
## MockRefGenome:40949695_A/G MockRefGenome 40949695 * | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## MockRefGenome:3311_A/G A G <NA>
## MockRefGenome:6284_G/A G A <NA>
## MockRefGenome:6333_A/T A T <NA>
## MockRefGenome:7285_C/T C T <NA>
## MockRefGenome:8822_T/A T A <NA>
## ... ... ... ...
## MockRefGenome:39977669_A/G A G <NA>
## MockRefGenome:40101335_G/T G T <NA>
## MockRefGenome:40259299_T/C T C <NA>
## MockRefGenome:40949663_A/T A T <NA>
## MockRefGenome:40949695_A/G A G <NA>
## FILTER
## <character>
## MockRefGenome:3311_A/G PASS
## MockRefGenome:6284_G/A PASS
## MockRefGenome:6333_A/T PASS
## MockRefGenome:7285_C/T PASS
## MockRefGenome:8822_T/A PASS
## ... ...
## MockRefGenome:39977669_A/G PASS
## MockRefGenome:40101335_G/T PASS
## MockRefGenome:40259299_T/C PASS
## MockRefGenome:40949663_A/T PASS
## MockRefGenome:40949695_A/G PASS
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The INFO column from the VCF also got unpacked into its own table.
info(all_my_data)
## DataFrame with 29294 rows and 5 columns
## AC AF DP AV
## <IntegerList> <IntegerList> <integer> <integer>
## MockRefGenome:3311_A/G 3 0 4447 555
## MockRefGenome:6284_G/A 5 0 3185 398
## MockRefGenome:6333_A/T 5 0 3169 396
## MockRefGenome:7285_C/T 3 0 3123 423
## MockRefGenome:8822_T/A 4 0 5573 718
## ... ... ... ... ...
## MockRefGenome:39977669_A/G 2 0 235 29
## MockRefGenome:40101335_G/T 2 0 198 27
## MockRefGenome:40259299_T/C 6 0 303 41
## MockRefGenome:40949663_A/T 10 0 353 44
## MockRefGenome:40949695_A/G 8 0 251 31
## NS
## <integer>
## MockRefGenome:3311_A/G 8
## MockRefGenome:6284_G/A 8
## MockRefGenome:6333_A/T 8
## MockRefGenome:7285_C/T 7
## MockRefGenome:8822_T/A 7
## ... ...
## MockRefGenome:39977669_A/G 8
## MockRefGenome:40101335_G/T 7
## MockRefGenome:40259299_T/C 7
## MockRefGenome:40949663_A/T 8
## MockRefGenome:40949695_A/G 8
We can get the definitions of those columns from the header.
info(header(all_my_data))
## DataFrame with 5 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## AC A Integer Allele Count
## AF A Integer Allele Frequency
## DP 1 Integer Total Depth
## AV 1 Integer Average Depth
## NS 1 Integer Number of Samples With Data
Last but certainly not least, there are the data from the genotypes table. This includes strings representing the genotypes (GT
) and allelic read depths (AD
).
geno(all_my_data)$GT[1:10,]
## JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5
## MockRefGenome:3311_A/G "0/0" "0/0" "0/0" "0/1"
## MockRefGenome:6284_G/A "0/0" "0/0" "0/1" "0/1"
## MockRefGenome:6333_A/T "0/0" "0/0" "0/1" "0/1"
## MockRefGenome:7285_C/T "0/0" "0/0" "./." "0/1"
## MockRefGenome:8822_T/A "0/0" "./." "0/1" "0/1"
## MockRefGenome:11589_C/A "0/1" "0/1" "0/1" "0/0"
## MockRefGenome:12355_C/T "0/0" "0/1" "0/0" "0/1"
## MockRefGenome:12663_T/C "0/0" "./." "0/0" "0/1"
## MockRefGenome:13346_C/A "0/0" "0/1" "0/0" "0/1"
## MockRefGenome:13563_C/T "0/0" "0/0" "0/0" "0/0"
## JY134 JY185 KMS257 RU2012-055
## MockRefGenome:3311_A/G "0/1" "0/1" "0/0" "0/0"
## MockRefGenome:6284_G/A "0/1" "0/1" "0/1" "0/0"
## MockRefGenome:6333_A/T "0/1" "0/1" "0/1" "0/0"
## MockRefGenome:7285_C/T "0/1" "0/1" "0/0" "0/0"
## MockRefGenome:8822_T/A "0/0" "0/1" "0/0" "0/1"
## MockRefGenome:11589_C/A "0/0" "0/0" "0/0" "0/0"
## MockRefGenome:12355_C/T "0/0" "./." "0/0" "0/0"
## MockRefGenome:12663_T/C "0/0" "0/1" "0/0" "0/0"
## MockRefGenome:13346_C/A "./." "0/0" "0/1" "0/0"
## MockRefGenome:13563_C/T "./." "0/0" "0/1" "0/1"
geno(all_my_data)$AD[1:10,]
## JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5
## MockRefGenome:3311_A/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:6284_G/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:6333_A/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:7285_C/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:8822_T/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:11589_C/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:12355_C/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:12663_T/C Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:13346_C/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:13563_C/T Integer,2 Integer,2 Integer,2 Integer,2
## JY134 JY185 KMS257 RU2012-055
## MockRefGenome:3311_A/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:6284_G/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:6333_A/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:7285_C/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:8822_T/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:11589_C/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:12355_C/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:12663_T/C Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:13346_C/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:13563_C/T Integer,2 Integer,2 Integer,2 Integer,2
geno(all_my_data)$AD[[1,1]]
## [1] 23 0
This is about the stage where I’d like to do a sanity check to confirm that my samples show the pattern of relatedness that I’m expecting. I will convert the diploidized genotypes to numbers and make a neighbor-joining tree.
numgen <- matrix(NA_integer_, nrow = dim(all_my_data)[1],
ncol = dim(all_my_data)[2],
dimnames = dimnames(geno(all_my_data)$GT))
numgen[geno(all_my_data)$GT == "0/0"] <- 0L
numgen[geno(all_my_data)$GT == "0/1"] <- 1L
numgen[geno(all_my_data)$GT == "1/1"] <- 2L
mydist <- dist(t(numgen))
mynj <- nj(mydist)
plot(mynj, type = "unrooted")
We can see a clustering of Japan vs. the mainland, and northern vs. southern Japan, as expected.
Making a filter
Before, we did prefiltering based on text processing alone. However, we can also make functions to filter the data after it has been imported to R.
We had some rough cutoffs for depth using GBS-SNP-CROP, but we can narrow those further. What is the distribution of total depth among markers in the dataset?
hist(info(all_my_data)$DP)
hist(log(info(all_my_data)$DP))
How about we say we want markers with a total depth below 1000. We can make a filter function for that.
DPbelow1000 <- function(vcf){
return(info(vcf)$DP < 1000)
}
mean(DPbelow1000(all_my_data))
## [1] 0.9479416
94.8% of markers would pass that threshold.
We also allowed 25% missing data when we ran GBS-SNP-CROP. We can see that when we look at NS
, the number of samples with data.
table(info(all_my_data)$NS)
##
## 7 8
## 16473 12821
Or perhaps it was <25% missing data. If we wanted no missing data at all, we could make a filter function.
NoMissing <- function(vcf){
return(info(vcf)$NS == 8)
}
mean(NoMissing(all_my_data))
## [1] 0.4376664
Then we could use both of those functions as filters. We could even combine them with the prefilter we made before.
filter2_file <- "Msa_GSCdemo_filter2.vcf"
filterVcf(mybg, genome = "MsaMockReference",
destination = filter2_file,
prefilters = FilterRules(list(NotAllHet)),
filters = FilterRules(list(DPbelow1000, NoMissing)))
Working with big VCFs
That’s all well and good for this little 5 Mb VCF, where we could import the whole thing and explore it before deciding how we wanted to filter it. What about a 50 Gb VCF?
In VariantAnnotation
, we can create a TabixFile
object, which contains a pointer to the VCF file that we’re working with. When we do that, we can set yieldSize
to indicate how many SNPs to import at once. This TabixFile
object is a type of file connection.
mycon <- TabixFile(mybg, yieldSize = 1000)
We open the connection to read from the beginning of the file.
open(mycon)
Then when we run readVcf
, we will only get the first 1000 SNPs.
first1000 <- readVcf(mycon)
first1000
## class: CollapsedVCF
## dim: 1000 8
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 5 columns: AC, AF, DP, AV, NS
## info(header(vcf)):
## Number Type Description
## AC A Integer Allele Count
## AF A Integer Allele Frequency
## DP 1 Integer Total Depth
## AV 1 Integer Average Depth
## NS 1 Integer Number of Samples With Data
## geno(vcf):
## SimpleList of length 2: GT, AD
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## AD R Integer Allele Depth
If we read again, we would get the next 1000. Here I am assigning it to another object, but depending on what you are doing, this is a great opportunity to use a loop.
next1000 <- readVcf(mycon)
next1000
## class: CollapsedVCF
## dim: 1000 8
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 5 columns: AC, AF, DP, AV, NS
## info(header(vcf)):
## Number Type Description
## AC A Integer Allele Count
## AF A Integer Allele Frequency
## DP 1 Integer Total Depth
## AV 1 Integer Average Depth
## NS 1 Integer Number of Samples With Data
## geno(vcf):
## SimpleList of length 2: GT, AD
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## AD R Integer Allele Depth
Compare to see that they are really different.
rowRanges(first1000)
## GRanges object with 1000 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## MockRefGenome:1731_A/G MockRefGenome 1731 * | <NA>
## MockRefGenome:1851_T/A MockRefGenome 1851 * | <NA>
## MockRefGenome:3311_A/G MockRefGenome 3311 * | <NA>
## MockRefGenome:3833_G/T MockRefGenome 3833 * | <NA>
## MockRefGenome:5102_T/C MockRefGenome 5102 * | <NA>
## ... ... ... ... . ...
## MockRefGenome:189686_C/A MockRefGenome 189686 * | <NA>
## MockRefGenome:190184_C/T MockRefGenome 190184 * | <NA>
## MockRefGenome:190190_T/C MockRefGenome 190190 * | <NA>
## MockRefGenome:190326_G/A MockRefGenome 190326 * | <NA>
## MockRefGenome:190895_C/T MockRefGenome 190895 * | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## MockRefGenome:1731_A/G A G <NA>
## MockRefGenome:1851_T/A T A <NA>
## MockRefGenome:3311_A/G A G <NA>
## MockRefGenome:3833_G/T G T <NA>
## MockRefGenome:5102_T/C T C <NA>
## ... ... ... ...
## MockRefGenome:189686_C/A C A <NA>
## MockRefGenome:190184_C/T C T <NA>
## MockRefGenome:190190_T/C T C <NA>
## MockRefGenome:190326_G/A G A <NA>
## MockRefGenome:190895_C/T C T <NA>
## FILTER
## <character>
## MockRefGenome:1731_A/G PASS
## MockRefGenome:1851_T/A PASS
## MockRefGenome:3311_A/G PASS
## MockRefGenome:3833_G/T PASS
## MockRefGenome:5102_T/C PASS
## ... ...
## MockRefGenome:189686_C/A PASS
## MockRefGenome:190184_C/T PASS
## MockRefGenome:190190_T/C PASS
## MockRefGenome:190326_G/A PASS
## MockRefGenome:190895_C/T PASS
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
rowRanges(next1000)
## GRanges object with 1000 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## MockRefGenome:190913_A/G MockRefGenome 190913 * | <NA>
## MockRefGenome:191217_G/A MockRefGenome 191217 * | <NA>
## MockRefGenome:191479_G/A MockRefGenome 191479 * | <NA>
## MockRefGenome:191569_C/T MockRefGenome 191569 * | <NA>
## MockRefGenome:191604_G/T MockRefGenome 191604 * | <NA>
## ... ... ... ... . ...
## MockRefGenome:324994_G/T MockRefGenome 324994 * | <NA>
## MockRefGenome:325077_T/A MockRefGenome 325077 * | <NA>
## MockRefGenome:325099_C/T MockRefGenome 325099 * | <NA>
## MockRefGenome:325368_A/G MockRefGenome 325368 * | <NA>
## MockRefGenome:325552_C/A MockRefGenome 325552 * | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## MockRefGenome:190913_A/G A G <NA>
## MockRefGenome:191217_G/A G A <NA>
## MockRefGenome:191479_G/A G A <NA>
## MockRefGenome:191569_C/T C T <NA>
## MockRefGenome:191604_G/T G T <NA>
## ... ... ... ...
## MockRefGenome:324994_G/T G T <NA>
## MockRefGenome:325077_T/A T A <NA>
## MockRefGenome:325099_C/T C T <NA>
## MockRefGenome:325368_A/G A G <NA>
## MockRefGenome:325552_C/A C A <NA>
## FILTER
## <character>
## MockRefGenome:190913_A/G PASS
## MockRefGenome:191217_G/A PASS
## MockRefGenome:191479_G/A PASS
## MockRefGenome:191569_C/T PASS
## MockRefGenome:191604_G/T PASS
## ... ...
## MockRefGenome:324994_G/T PASS
## MockRefGenome:325077_T/A PASS
## MockRefGenome:325099_C/T PASS
## MockRefGenome:325368_A/G PASS
## MockRefGenome:325552_C/A PASS
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
When you are all done reading, close the connection.
close(mycon)
Keeping only specific fields or genomic regions
So far we have read all of the information from the VCF, even though there are some fields that we haven’t looked at at all. The ScanVcfParam
function is really useful if we want to save memory by only importing fields that we want.
For fixed fields, let’s ignore QUAL
and FILTER
since GBS-SNP-CROP didn’t put useful information there. We plan to re-call the genotypes as tetraploid, so we can also drop the diploidized genotypes in the GT
field. And in the INFO
column, maybe we only care about DP
and NS
.
myparam <- ScanVcfParam(fixed = "ALT", info = c("DP", "NS"), geno = "AD")
open(mycon)
first1000slim <- readVcf(mycon, param = myparam)
close(mycon)
first1000slim
## class: CollapsedVCF
## dim: 1000 8
## rowRanges(vcf):
## GRanges with 3 metadata columns: paramRangeID, REF, ALT
## info(vcf):
## DataFrame with 2 columns: DP, NS
## info(header(vcf)):
## Number Type Description
## DP 1 Integer Total Depth
## NS 1 Integer Number of Samples With Data
## geno(vcf):
## SimpleList of length 1: AD
## geno(header(vcf)):
## Number Type Description
## AD R Integer Allele Depth
rowRanges(first1000slim)
## GRanges object with 1000 ranges and 3 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## MockRefGenome:1731_A/G MockRefGenome 1731 * | <NA>
## MockRefGenome:1851_T/A MockRefGenome 1851 * | <NA>
## MockRefGenome:3311_A/G MockRefGenome 3311 * | <NA>
## MockRefGenome:3833_G/T MockRefGenome 3833 * | <NA>
## MockRefGenome:5102_T/C MockRefGenome 5102 * | <NA>
## ... ... ... ... . ...
## MockRefGenome:189686_C/A MockRefGenome 189686 * | <NA>
## MockRefGenome:190184_C/T MockRefGenome 190184 * | <NA>
## MockRefGenome:190190_T/C MockRefGenome 190190 * | <NA>
## MockRefGenome:190326_G/A MockRefGenome 190326 * | <NA>
## MockRefGenome:190895_C/T MockRefGenome 190895 * | <NA>
## REF ALT
## <DNAStringSet> <DNAStringSetList>
## MockRefGenome:1731_A/G A G
## MockRefGenome:1851_T/A T A
## MockRefGenome:3311_A/G A G
## MockRefGenome:3833_G/T G T
## MockRefGenome:5102_T/C T C
## ... ... ...
## MockRefGenome:189686_C/A C A
## MockRefGenome:190184_C/T C T
## MockRefGenome:190190_T/C T C
## MockRefGenome:190326_G/A G A
## MockRefGenome:190895_C/T C T
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
ScanVcfParam
can also be used to subset samples.
myparam2 <- ScanVcfParam(fixed = "ALT", info = c("DP", "NS"), geno = "AD",
samples = c("JM-2014-H-28", "JM-2014-K-8", "JM-2014-M-5", "JM-2014-S-5"))
open(mycon)
first1000slim2 <- readVcf(mycon, param = myparam2)
close(mycon)
first1000slim2
## class: CollapsedVCF
## dim: 1000 4
## rowRanges(vcf):
## GRanges with 3 metadata columns: paramRangeID, REF, ALT
## info(vcf):
## DataFrame with 2 columns: DP, NS
## info(header(vcf)):
## Number Type Description
## DP 1 Integer Total Depth
## NS 1 Integer Number of Samples With Data
## geno(vcf):
## SimpleList of length 1: AD
## geno(header(vcf)):
## Number Type Description
## AD R Integer Allele Depth
geno(first1000slim2)$AD[1:10,]
## JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5
## MockRefGenome:1731_A/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:1851_T/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:3311_A/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:3833_G/T Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:5102_T/C Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:5104_C/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:5127_G/A Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:5327_A/G Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:5498_A/C Integer,2 Integer,2 Integer,2 Integer,2
## MockRefGenome:6284_G/A Integer,2 Integer,2 Integer,2 Integer,2
A third handy use for ScanVcfParam
is specifying a genomic region with the which
parameter. The one catch is that you can’t use it with TabixFile
s that have a yieldSize
other than NA
. Here is how I would read from 10000 to 11000 bp on MockRefGenome, assuming it were a real chromosome.
myparam3 <- ScanVcfParam(which = GRanges("MockRefGenome", IRanges(10000, 11000)))
specific_range <- readVcf(mybg, genome = "MockRefGenome", param = myparam3)
rowRanges(specific_range)
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## MockRefGenome:10163_C/G MockRefGenome 10163 * | <NA>
## MockRefGenome:10168_C/A MockRefGenome 10168 * | <NA>
## REF ALT QUAL
## <DNAStringSet> <DNAStringSetList> <numeric>
## MockRefGenome:10163_C/G C G <NA>
## MockRefGenome:10168_C/A C A <NA>
## FILTER
## <character>
## MockRefGenome:10163_C/G PASS
## MockRefGenome:10168_C/A PASS
## -------
## seqinfo: 1 sequence from MockRefGenome genome; no seqlengths
ScanVcfParam
is not only useful for importing data, but also filtering.
filter3_file <- "Msa_GSCdemo_filter3.vcf"
filterVcf(mybg, genome = "MsaMockReference",
destination = filter3_file,
prefilters = FilterRules(list(NotAllHet)),
filters = FilterRules(list(DPbelow1000, NoMissing)),
param = myparam2)
cat(readLines(filter3_file, 20), sep = "\n")
## ##fileformat=VCFv4.2
## ##fileDate=20190425
## ##phasing=partial
## ##source=GBS-SNP-CROP
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
## ##INFO=<ID=AF,Number=A,Type=Integer,Description="Allele Frequency">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
## ##INFO=<ID=AV,Number=1,Type=Integer,Description="Average Depth">
## ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">
## ##contig=<ID=MockRefGenome,assembly="MsaMockReference">
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JM-2014-H-28 JM-2014-K-8 JM-2014-M-5 JM-2014-S-5
## MockRefGenome 92975 MockRefGenome:92975_C/A C A . . DP=982;NS=8 AD 92,0 143,0 40,19 134,33
## MockRefGenome 93207 MockRefGenome:93207_C/G C G . . DP=965;NS=8 AD 114,0 144,40 60,0 123,42
## MockRefGenome 95964 MockRefGenome:95964_C/T C T . . DP=955;NS=8 AD 99,0 115,17 92,0 154,0
## MockRefGenome 98167 MockRefGenome:98167_C/A C A . . DP=889;NS=8 AD 0,32 69,31 52,0 42,0
## MockRefGenome 101036 MockRefGenome:101036_G/T G T . . DP=966;NS=8 AD 137,0 140,0 71,0 121,44
## MockRefGenome 101076 MockRefGenome:101076_G/C G C . . DP=967;NS=8 AD 137,0 140,0 72,0 121,44
## MockRefGenome 103076 MockRefGenome:103076_C/A C A . . DP=961;NS=8 AD 83,0 217,0 76,0 146,0