In this tutorial, we will learn how to identify short variants (e.g. SNPs and indels) with GATK version 4, following the New York University pipeline. We will use DNA-Seq data that sequenced from five strains of Arabidopsis thaliana in Cao et al study, and detect the short variants among these strains.
In order to prevent the comtamination of system environemnt, we build a virtual analysis environment with Anaconda. We use conda
command to build an environment named atgwasenv, and then install some bioinformatics tools.
x
# build an analysis environment
conda create --name atgwasenv
# activate the analysis environemnt
conda activate atgwasenv
# install bioinformatics tools
conda install -c bioconda fastqc trimmomatic bwa samtools vcftools
conda install -c bioconda perl-vcftools-vcf gatk4 tabix snpeff
Then, we create some directories to save DNA-Seq data, gneome data, intermediate results, and the final results. In addititon, we set an environment variable PROJECTT_PATHT
to remember the root path of this project, just for convenience. Note that you should change lecturer01
to your user name.
xxxxxxxxxx
# project path
PROJECT_PATH=/home/lecturer01/projects/atgwas
mkdir -p ${PROJECT_PATH}
cd ${PROJECT_PATH}
pwd
## /home/lecturer01/projects/atgwas
mkdir genome
mkdir fastq
mkdir cleaned_fastq
mkdir bam
mkdir bqsr
mkdir vcf
The resequencing data by Cao et al. can be downloaded from EMBL-EBI with accession SRA029270. There are more than 50 strains used in this study, while we only use the 5 strains in this workshop. Here, we use wget
command to download FASTQ datasets from EMBL-EBI directly.
xxxxxxxxxx
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094979/SRR094979_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR094/SRR094979/SRR094979_2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095680/SRR095680_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095680/SRR095680_2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095786/SRR095786_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095786/SRR095786_2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095851/SRR095851_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095851/SRR095851_2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095698/SRR095698_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR095/SRR095698/SRR095698_2.fastq.gz
Note that if you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/fastq/atgwas
with the following commands.
x
cd ${PROJECT_PATH}/fastq
#cp /data/workshop1/hts/fastq/atgwas/*.fastq.gz .
cp /data/workshop1/hts/fastq/atgwasmini/*.fastq.gz .
We will use TAIR10 genome of A. Thaliana as a reference for variant calling. The TAIR10 genome sequences and annotations can be downloaded from Ensembl website. Here, we use wget
command to download both from Ensembl directly.
x
cd ${PROJECT_PATH}/genome
# reference sequence
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
# gene annotation
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-40/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.40.gff3.gz
# decompress
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.40.gff3.gz
If you use AI super computer, you can copy the FASTQ files from /data/workshop1/hts/genome/TAIR10
with the following commands.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa .
cp /data/workshop1/hts/genome/TAIR10/Arabidopsis_thaliana.TAIR10.40.gff3 .
The resequencing data in FASTQ files may contain low quality parts, such as artificial contaminations, poly-A tails, and so on. To determine the detailed these erros, we use FastQC (fastqc
) to check all FASTQ files.
xxxxxxxxxx
cd ${PROJECT_PATH}/fastq
fastqc *.gz
FastQC generates a quality report for each library. Considering the reports, we then determine the thresholds to perform quality control (QC) against all FASTQ files by using Trimmomatic (trimmomatic
). Since we will perfrom the same QC processes agains all FASTQ files, we will use for
syntax to processs all FASTQ files one by one. Note that, DO NOT ADD SPACES NEAR BY =
, and DO NOT ADD ANY CHARACTERS AFTER \
including a space.
x
cd ${PROJECT_PATH}/fastq
for fpath in `ls *_1.fastq.gz`
do
fname=${fpath%_1.fastq.gz}
trimmomatic PE -threads 4 \
${fname}_1.fastq.gz ${fname}_2.fastq.gz \
${fname}_cleaned_1.fastq.gz ${fname}_unpaired_1.fastq.gz \
${fname}_cleaned_2.fastq.gz ${fname}_unpaired_2.fastq.gz \
LEADING:20 TRAILING:20 MINLEN:20
done
mv *_cleaned_*.gz ${PROJECT_PATH}/cleaned_fastq/
To make sure the QC was succesfully performed, we check qualities of the all cleaned FASTQ files with FastQC.
xxxxxxxxxx
cd ${PROJECT_PATH}/cleaned_fastq
fastqc *.gz
For variant calling, STAR is recommended for mapping of RNA-Seq reads and BWA is recommended for DNA-Seq reads. In this tutorial, we use BWA to map DNA-Seq reads to TAIR10 genome sequences. To generate the index for mapping, we use bwa
command with index
option. We name the index as TAIR10_bwa here.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
bwa index -p TAIR10_bwa Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
Then, we use bwa
command to map all sequencing data to TAIR10_bwa.
x
cd ${PROJECT_PATH}/cleaned_fastq
for fpath in `ls *_cleaned_1.fastq.gz`
do
fname=${fpath%_cleaned_1.fastq.gz}
bamRG="@RG\tID:"${fname}"\tPL:ILLUMINA\tSM:"${fname}
bwa mem -t 2 -R ${bamRG} ${PROJECT_PATH}/genome/TAIR10_bwa \
${fname}_cleaned_1.fastq.gz ${fname}_cleaned_2.fastq.gz > ${fname}.sam
done
mv *.sam ${PROJECT_PATH}/bam/
Since the reads are randomly stored in SAM file, we use samtools
command with sort
opition to convert SAM to BAM, and then sort reads according to chromosome positions.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *.sam`
do
samtools sort -@ 4 ${fpath} > ${fpath%.sam}.bam
done
GATK analyzes alignments in BAM files to call variants. Like to other tools, STAR and BWA, GATK also requires index file for data analysis. We use CreateSequenceDictionary
function to build a index. The index name should be same as FASTA file, but the extension should be .dict. In addition, we will use the reference and index many times in the downstream analysis, we here create an environment variable REF
to remeber the path to the reference file.
xxxxxxxxxx
cd ${PROJECT_PATH}/genome
REF=${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
samtools faidx ${REF}
gatk CreateSequenceDictionary -R ${REF} -O ${REF%.fa}.dict
Note that, if GATK command dose not work, add --java-options
to upgrade the memory for Java.
xxxxxxxxxx
gatk --java-options "-Xmx4G" CreateSequenceDictionary -R ${REF} -O ${REF%.fa}.dict
Resequencing reads may contains PCR duplication. The duplicated reads affect the accuracy of variant calling. We use GATK to mark up the duplicated reads in BAM file. The new BAM file generated in this step (*_markdup.bam) still contains the duplicated reads, but the duplicated reads are marked up which enable GATK to ignore these reads.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *.bam`
do
gatk MarkDuplicates -I ${fpath} \
-O ${fpath%.bam}_markdup.bam \
-M ${fpath%.bam}_markdup_metrics.txt
samtools index ${fpath%.bam}_markdup.bam
done
Then, we use GATK to check the quality of mapping results, such as alignment scores and the insertion sizes of the paired-end reads.
xxxxxxxxxx
cd ${PROJECT_PATH}/bam
for fpath in `ls *.bam | grep -v _markdup`
do
fname=${fpath%.bam}
gatk CollectAlignmentSummaryMetrics -I ${fname}_markdup.bam \
-R ${REF} \
-O ${fname}_alignment_metrics.txt
gatk CollectInsertSizeMetrics -I ${fname}_markdup.bam \
-O ${fname}_insert_metrics.txt \
-H ${fname}_insert_size_histogram.pdf
samtools depth -a ${fname}_markdup.bam > ${fname}_depth.txt
done
The original read quality may not reflect the errors of base calling in HTS platform and PCR errors. These errors may affect the accuracy during variant calling. To reduce the errors, we recalculate the base scores and update the quality scores in BAM files. This process is called as Base Quality Score Recalibration (BQSR) in GATK. To perform BQSR, we need tell GATK the prior-information of variants. If we can obtain these information from public databases or previous studies, we can use that. If there is no prior-information, we can perform variant calling and use that as prior-information.
If we have prior-information of variants, we can use the information for BQSR. We perform BaseRecalibrator
against input.bam
and get the updated BAM input_sqrt.bam
.
x
gatk BaseRecalibrator -R ${REF} -I input.bam \
--known-sites dbsnp_146.b38.vcf.gz \
--known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites resources_broad_hg38_v0_1000G.phase3.integrated.sites_only.hg38.vcf \
-O ${fname}_recal_data.table
gatk ApplyBQSR -R ${REF} -I input.bam -bqsr ${fname}_recal_data.table -O input_bqsr.bam
If we do not have prior-information of variants, we can perform variant calling to generate the temporal variant information. We will run GATK with HaplotypeCaller
option to call short variants. This step will generate the intermediate results. To distinguish between the intermediate and final results, we use .g.gvf
as an suffix for these intermedaite results, and usually we call these files as gVCF files. The process takes more than 1 hour to process 1 GB BAM file.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *_markdup.bam`
do
fname=${fpath%_markdup.bam}
gatk HaplotypeCaller -R ${REF} --emit-ref-confidence GVCF \
-I ${fname}_markdup.bam -O ${fname}.g.vcf
done
mv *.g.vcf ../bqsr/
mv *.g.vcf.idx ../bqsr/
After we run HaplotypeCaller
, i.e., perform haplotype calling and generate the itermediate result for each sample, we will merge these intermediate results into one file (merged.g.vcf
). Then, we perform genotyping with the merged file.
x
cd ${PROJECT_PATH}/bqsr
# list up all gVCF files
gvcf_files=""
for gvcf_file in `ls *.g.vcf`
do
gvcf_files=${gvcf_files}"-V ${gvcf_file} "
done
gatk CombineGVCFs -R ${REF} ${gvcf_files} -O merged.g.vcf
gatk GenotypeGVCFs -R ${REF} -V merged.g.vcf -O merged.vcf
Next, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf
file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS
, otherwise tagged with filteration name which causes failure of filteration.
x
cd ${PROJECT_PATH}/bqsr
gatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include SNP -O merged_snps.vcf
gatk VariantFiltration -R ${REF} -V merged_snps.vcf -O merged_snps_filtered.vcf \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 4.0" --filter-name "SOR4" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"
Same to SNPs, we additionaly perform the same processses to indels to get the final result.
x
cd ${PROJECT_PATH}/bqsr
gatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include INDEL -O merged_indels.vcf
gatk VariantFiltration -R ${REF} -V merged_indels.vcf -O merged_indels_filtered.vcf \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "SOR > 10.0" -filter-name "SOR10" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20"
So far, we get merged_snps_filtered.vcf
and merged_indels_filtered.vcf
files that contain SNPs and indels information, respectively. We use these information as prior-information for BQSR.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *_markdup.bam`
do
fname=${fpath%_markdup.bam}
gatk BaseRecalibrator -R ${REF} -I ${fname}_markdup.bam \
--known-sites ${PROJECT_PATH}/bqsr/merged_snps_filtered.vcf \
--known-sites ${PROJECT_PATH}/bqsr/merged_indels_filtered.vcf \
-O ${fname}_recal_data.table
gatk ApplyBQSR -R ${REF} -I ${fname}_markdup.bam -bqsr ${fname}_recal_data.table -O ${fname}_bqsr.bam
done
To compare the effections between before and after BQSR, we need run BaseRecalibrator
again with new BAM file and then use AnalyzeCovariantes
to comparison. This step is not required for variant calling, just for checking the effections of BQSR.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *_bqsr.bam`
do
fname=${fpath%_bqsr.bam}
gatk BaseRecalibrator -R ${REF} -I ${fname}_bqsr.bam \
--known-sites ${PROJECT_PATH}/bqsr/merged_snps_filtered.vcf \
--known-sites ${PROJECT_PATH}/bqsr/merged_indels_filtered.vcf \
-O ${fname}_recal_data.table.2
gatk AnalyzeCovariates -before ${fname}_recal_data.table -after ${fname}_recal_data.table.2 \
-plots ${fname}_recalibration_plots.pdf
done
After BQSR, We will call variant again with GATK HaplotypeCaller
against each updated BAM file.
x
cd ${PROJECT_PATH}/bam
for fpath in `ls *_bqsr.bam`
do
fname=${fpath%_bqsr.bam}
gatk HaplotypeCaller -R ${REF} --emit-ref-confidence GVCF \
-I ${fname}_bqsr.bam -O ${fname}.g.vcf
done
mv *.g.vcf ../vcf/
mv *.g.vcf.idx ../vcf/
Then, we merge all gVCF files into one gVCF file with CombineGVCFs
function and convert gVCF to VCF file.
x
cd ${PROJECT_PATH}/vcf
# list up all gVCF files
gvcf_files=""
for gvcf_file in `ls *.g.vcf`
do
gvcf_files=${gvcf_files}"-V ${gvcf_file} "
done
gatk CombineGVCFs -R ${REF} ${gvcf_files} -O merged.g.vcf
gatk GenotypeGVCFs -R ${REF} -V merged.g.vcf -O merged.vcf
Next, we extract SNPs from the merged VCF file and perform filteration to tag high confidence SNPs. Finally, we will retrive SNPs in *_snps_filtered.vcf
file. This file contain all SNPs, SNPs passed the filteration are tagged with PASS
, otherwise tagged with filteration name which causes failure of filteration.
x
cd ${PROJECT_PATH}/vcf
gatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include SNP -O merged_snps.vcf
gatk VariantFiltration -R ${REF} -V merged_snps.vcf -O merged_snps_filtered.vcf \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 4.0" --filter-name "SOR4" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8"
Same to SNPs, we additionaly perform the same processses to indels to get the final result.
x
cd ${PROJECT_PATH}/vcf
gatk SelectVariants -R ${REF} -V merged.vcf --select-type-to-include INDEL -O merged_indels.vcf
gatk VariantFiltration -R ${REF} -V merged_indels.vcf -O merged_indels_filtered.vcf \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "SOR > 10.0" -filter-name "SOR10" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20"
Variants in VCF files are recorded by chromosome name and positions. We use gene annotation to anontate variants to clear that which variant are belong to what genes. We will use snpEff here. To use snpEff, we need to generate index file. Here we use the downloaded annotation file to generate index.
xxxxxxxxxx
# change path to your environmen
which snpEff
snpeff=/home/lecturer01/anaconda3/envs/atgwasenv/share/snpeff-4.5covid19-1
# go to snpEff dicretory
cd ${snpeff}
# make tair10 directory and save the sequences and annotations
mkdir -p data/tair10
cp ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa data/tair10/sequences.fa
cp ${PROJECT_PATH}/genome/Arabidopsis_thaliana.TAIR10.40.gff3 data/tair10/genes.gff
# change the configure file and build a database
echo "tair10.genome : tair10" >> snpEff.config
snpEff build -gff3 -v tair10
Then, we use the index to annotate the VCF files of the high confidence SNPs and indels.
xxxxxxxxxx
cd ${PROJECT_PATH}/vcf
snpEff tair10 merged_snps_filtered.vcf > merged_snps_filtered_ann.vcf
mv snpEff_genes.txt merged_snps_snpEff_genes.txt
mv snpEff_summary.html merged_snps_snpEff_summary.txt
bgzip merged_snps_filtered_ann.vcf
tabix merged_snps_filtered_ann.vcf.gz
snpEff tair10 merged_indels_filtered.vcf > merged_indels_filtered_ann.vcf
mv snpEff_genes.txt merged_indels_snpEff_genes.txt
mv snpEff_summary.html merged_indels_snpEff_summary.txt
bgzip merged_indels_filtered_ann.vcf
tabix merged_indels_filtered_ann.vcf.gz