Variant Calling without a Reference

In this tutorial, I use GBS-SNP-CROP to identify variants in a population of tetraploid individuals without a reference genome, starting with the FASTQ files. I then export the allelic read depth to VCF for genotype calling downstream.

Preparing the sequencing data

This section shows how I prepared the data for this tutorial. I did it a little differently from how you might, partly because my lab uses custom adapters and partly because I only wanted to use a few samples in order to create a small dataset that would process quickly. You might instead follow the GBS-SNP-CROP tutorial directly for your own data.

My RAD-seq library was sequenced using single-end reads on a HiSeq 4000, with inline barcodes (i.e. the barcode is at the beginning of each read, followed by the remainder of the restriction cut site). There are many tools available for demultiplexing such data, including Fastx Toolkit, Stacks, and GBS-SNP-CROP. I used my own software TagDigger because it has the custom adapters that my lab uses built into it, and will trim them out while performing the demultiplexing. The CSV passed to the -b argument lists input file, barcode, and output file for every sample. The argument -a indicates what adapter set was used. TagDigger and all of the other tools mentioned generate one FASTQ file for each barcode in the original file, with the barcode sequence removed.

I ran the code below using Windows Command Prompt, but it would also work in bash.

python barcode_splitter_script.py -b 190419remainingMsa_barcodekey_Hiseq4000.csv -a PstI-MspI-Clark

Out of the files that were generated, we will use the following ones for further analysis. These are all tetraploid, non-hybrid Miscanthus sacchariflorus accessions. If you wish to download the FASTQ files in order to reproduce this entire tutorial, the run accession numbers on NCBI-SRA are provided.

File Sample Provenance SRA
DOEMsa56a_CCTCCAGA.fq JM-2014-H-28 Japan SRR8997766
DOEMsa56a_CCATAAG.fq JM-2014-K-8 Japan SRR8997767
DOEMsa56a_GGTGGCCA.fq JM-2014-M-5 Japan SRR8997511
DOEMsa56a_GCGGTCCA.fq JM-2014-S-5 Japan SRR8997617
DOEMsa56a_TTGTTGTCTA.fq JY134 China SRR8997716
DOEMsa56a_CCACCATCAG.fq JY185 China SRR8997496
DOEMsa56a_GGTGCCA.fq KMS257 South Korea SRR8997613
DOEMsa56a_TTCTGACA.fq RU2012-055 Russia SRR8991820

The sequences then need to all be trimmed to the same length, in particular because the barcodes were variable length and we don’t want to call SNPs at the end of the read where there will be a lot of missing data. We also want to get rid of any low-quality base calls to reduce the number of sequencing errors called as SNPs. Trimmomatic can accomplish this trimming, and we’ll use it here also because it is what is used internally by GBS-SNP-CROP. We will trim the sequences to a maximum of 80 nucleotides, then clip of low-quality sequence at the beginning or end, then discard reads with fewer than 36 nucleotides.

Below is a bash script designed to run on a Linux cluster using Lmod to manage software. On your own Linux server, you might have to run java -jar /path/to/trimmomatic-0.38.jar rather than just trimmomatic.

#!/bin/bash

module load Trimmomatic

infiles=("DOEMsa56a_CCTCCAGA.fq" "DOEMsa56a_CCATAAG.fq"
         "DOEMsa56a_GGTGGCCA.fq" "DOEMsa56a_GCGGTCCA.fq"
         "DOEMsa56a_TTGTTGTCTA.fq" "DOEMsa56a_CCACCATCAG.fq"
         "DOEMsa56a_GGTGCCA.fq" "DOEMsa56a_TTCTGACA.fq")
outfiles=("JM-2014-H-28.R1.fq" "JM-2014-K-8.R1.fq"
          "JM-2014-M-5.R1.fq" "JM-2014-S-5.R1.fq"
          "JY134.R1.fq" "JY185.R1.fq" "KMS257.R1.fq" "RU2012-055.R1.fq")

for ((i=0;i<${#infiles[@]};++i)); do
    trimmomatic SE -threads 1 -phred64 ${infiles[i]} ${outfiles[i]} CROP:80 LEADING:3 TRAILING:3 MINLEN:36
done

gzip demultiplexed/*.R1.fq

The gzip line is necessary because GBS-SNP-CROP expects gzipped files.

Assembling the mock reference

We’ll make a tab-delimited text file, that we’ll call Msa_barcodes.txt, which will indicate the samples in this experiment. The last column with YES and NO indicates which samples to use for building the mock reference. Two of the files are smaller than the rest, which is why they are excluded.

CCTCCAGA	JM-2014-H-28	YES
CCATAAG	JM-2014-K-8	YES
GGTGGCCA	JM-2014-M-5	NO
GCGGTCCA	JM-2014-S-5	YES
TTGTTGTCTA	JY134	YES
CCACCATCAG	JY185	NO
GGTGCCA	KMS257	YES
TTCTGACA	RU2012-055	YES

Now we will use GBS-SNP-CROP to build the mock reference.

#!/bin/bash

module load PEAR
module load VSEARCH
module load Perl

perl GBS-SNP-CROP-4.pl -pr pear -vs vsearch -d SE -b Msa_barcodes.txt -t 4 -rl 80 -MR Msa.MR

Again, we are loading pear and vsearch using Lmod, so we can use the program names directly in place of specifying a path to them. The argument -d SE incidates that we have single end reads. We will specify four processing cores using -t 4, and that our reads are 80 nt long using -rl 80. Finally, we specify the file prefix for our mock reference genome using -MR Msa.MR. (“Msa” being an abbreviation that the Sacks lab commonly uses for Miscanthus sacchariflorus.)

Once this is done running, there are some new files in your main directory, including Msa.MR.Genome.fa, Msa.MR.Clusters.fa, and PosToMask.txt. We can use less to inspect them.

less Msa.MR.Genome.fa

Beginning of Msa.MR.Genome.fa

Msa.MR.Genome.fa is what will be used as the reference genome when we align to Msa. It contains all of the tag clusters separated using strings of A for spacers.

less Msa.MR.Clusters.fa

Beginning of Msa.MR.Clusters.fa

Msa.MR.Clusters.fa isn’t strictly needed for GBS-SNP-CROP, but is there in case downstream software requires a reference genome.

less PosToMask.txt

Beginning of PosToMask.txt

PosToMask.txt contains positions of all spacer nucleotides in Msa.MR.Genome.fa.

Aligning reads to the mock reference and identifying SNPs

The next script in GBS-SNP-CROP uses the alignment software BWA to align every sequence read to the mock reference that we just created. It then uses SAMtools to sort and index the alignments, then identify potential SNPs. The -q, -Q, -f, -F, and -Opt parameters get passed to SAMtools.

#!/bin/bash

module load BWA
module load SAMtools
module load Perl

perl GBS-SNP-CROP-5.pl -bw bwa -st samtools -d SE -b Msa_barcodes.txt -ref Msa.MR.Genome.fa -Q 30 -q 0 -f 0 -F 2308 -t 4 -Opt 0

The most important output from this script is the set of files ending in .mpileup, which contain information about how reads from each sample correspond to alleles and loci.

Mpileup output for JM-2014-H-28

At position 1 in the mock reference, where the reference base is a T, there were 8000 reads that aligned for JM-2014-H-28. The rest of the information is not particularly human-readable, however.

Another script then converts the information from the mpileup files into a more readable format for further processing. For the sake of having a smaller dataset to look at in this tutorial, we will only look at SNPs and not indels (-p snp).

#!/bin/bash

module load Perl

perl GBS-SNP-CROP-6.pl -b Msa_barcodes.txt -out Msa.MasterMatrix.txt -p snp -t 4

Count output for JM-2014-H-28

Here we see how many reads corresponded to A, C, G, and T at each position. Scrolling through the file we can see how depth varied from tag to tag.

Count output for JM-2014-H-28

Additionally, one file was created combining this information from all samples.

Master matrix of counts

Filtering markers

Now we will filter the markers and output them to a genotype matrix. Although GBS-SNP-CROP only makes diploid gentoype calls, which we will not be using, we still want to tweak the parameters so that it can make these calls, because it will throw out SNPs below a given call rate. Here are all the parameters and why I set them as I did:

  • -mnHoDepth0 20: The default value of 5 assumes diploidy, where a heterozygote will have alleles in a 1:1 ratio, so the probability of a genotype with five reads having only reads of the first allele is 3%. In tetraploids we will frequently see a 3:1 ratio, so we would need twelve reads to have that same 3% probability. Since we seem to have pretty high depth in this dataset, I increased the number from 12 to 20.
  • -mnHoDepth1 50: When there is just one read of an allele, that read might be due to contamination, or it might indicate a true heterozygote. Again, because we expect heterozygotes that are 3:1 rather than 1:1, we’ll set a higher threshold than the default for determining that a read represents contamination.
  • -mnHetDepth 3: I didn’t see a particular reason to change this from the default. If we see at least three reads, we can start to believe that the allele is really there.
  • -altStrength 0.8: Another one that I kept at the default. If there are multiple alternative alleles, we want most of them to belong to one allele so that GBS-SNP-CROP can convert it to a biallelic marker.
  • -mnAlleleRatio 0.125: This is used for catching paralogous loci that were collapsed into single loci. In a diploid system, we expect a depth ratio of 0.5, but because it can vary due to random chance, the default is to allow it to be as low as 0.25. Since we have a tetraploid and are expecting 0.25, I’ll lower it to 0.125 to allow sampling error. If we see allele depth ratios much below that, it could indicate that a locus isn’t really just one locus.
  • -mnCall 0.75: The default is fine; wanting 75% of samples to have calls for a given SNP is typical.
  • -mnAvgDepth 20 : Since we have very high depth, we will set this value higher than the default of 3.
  • -mxAvgDepth 3000: Since we have very high depth, we will set this value higher than the default of 200.
#!/bin/bash

module load Perl

perl GBS-SNP-CROP-7.pl -in Msa.MasterMatrix.txt -out Msa.GenoMatrix.txt -p snp -mnHoDepth0 20 -mnHoDepth1 50 -mnHetDepth 3 -altStrength 0.8 -mnAlleleRatio 0.125 -mnCall 0.75 -mnAvgDepth 20 -mxAvgDepth 3000

We can see a log of what happened, showing that we got almost 38K SNPs.

Log from step 7

This is what the output genotype matrix looks like. It contains genotype calls and allelic read depth.

Genotype matrix

Exporting to VCF

Finally, we’ll convert the genotype matrix to a VCF.

#!/bin/bash

module load Perl

perl GBS-SNP-CROP-8.pl -in variants/Msa.GenoMatrix.txt -out Msa_GSCdemo -b Msa_barcodes.txt -formats vcf

Genotypes as VCF

This is the file that we will keep and use for further analysis with other software.