1.2 - Tutorial material and case study
In this tutorial, we use data from a variant-analysis case study published by Trilla-Fuertes et al. (2020), based on whole-exome NGS data sequenced from samples of human anal cancer. The tutorial material consists of five formalin-fixed, paraffin-embedded (FFPE) samples from patients diagnosed with localized anal squamous cell carcinoma (ASCC). These samples were analyzed by Illumina paired-end whole-exome sequencing (NextSeq 500). The sample names are provided in Table 1.
Table 1: Exome samples
SRA accession | Library Names |
---|---|
SRR10164002 | CAN2 |
SRR10163991 | CAN3 |
SRR10163980 | CAN4 |
SRR10163969 | CAN5 |
SRR10163960 | CAN12 |
The five fastq files can be downloaded from NCBI at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA573670. For questions about downloading this material, contact us through our support forum at https://forum.biotechvana.com.
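One possible way to script the download is with the SRA Toolkit. The sketch below only builds the command lines and prints them. It assumes that `fasterq-dump` from the SRA Toolkit is installed and on the PATH; the output directory name `fastq` and the function name are illustrative placeholders, not part of the tutorial material.

```python
import shlex

# SRA accession -> library name, copied from Table 1.
SAMPLES = {
    "SRR10164002": "CAN2",
    "SRR10163991": "CAN3",
    "SRR10163980": "CAN4",
    "SRR10163969": "CAN5",
    "SRR10163960": "CAN12",
}


def build_download_commands(samples, outdir="fastq"):
    """Return one fasterq-dump argument list per accession.

    --split-files writes the two mates of a paired-end run to separate files.
    The default outdir is a hypothetical placeholder.
    """
    return [
        ["fasterq-dump", "--split-files", "--outdir", outdir, accession]
        for accession in samples
    ]


if __name__ == "__main__":
    for cmd in build_download_commands(SAMPLES):
        # Printed rather than executed; to actually download, pass cmd to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before starting what can be a long download.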
RefSeq material
In this tutorial, we used the GATK Resource Bundle, which is based on the hg19 release of the human genome, as the source of RefSeq. As training material, we used an interval file based on the seqCap VCRome V2 for human exome. The interval file can be downloaded by clicking seqCap_VCRome_V2_intervals_list.intervals.
For the cancer variant analysis, you need a panel of normals (PON) to filter out possible germline variants. The PON files can be downloaded by clicking HPON.vcf and HPON.vcf.idx.
This PON was created using 11 human Iberian exome samples, sequenced with Illumina technology (Illumina HiSeq 20), from the Spanish HapMap population provided by the 1000 Genomes Project (1000 Genomes whole-exome sequencing of the IBS population). The 11 samples can be downloaded from the NCBI SRA archive (http://www.ncbi.nlm.nih.gov/sra/) under the following accessions: SRR768531, SRR768530, SRR768529, SRR766062, SRR766027, SRR766011, SRR766005, SRR765982, SRR765992, SRR764760, SRR764761.
- Reference genome
For this experiment, you need the following training material from the hg19 release:
- ucsc.hg19.dict.gz
- ucsc.hg19.fasta.fai.gz
- ucsc.hg19.fasta.gz
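For orientation, the `.fai` file in the list above, once decompressed, is a plain tab-separated table in the samtools faidx layout: sequence name, sequence length, byte offset, bases per line, and bytes per line. The following is a minimal sketch of reading it to obtain contig lengths; the function name `read_fai` is an illustrative assumption and is not part of the tutorial software.

```python
import csv
import sys


def read_fai(path):
    """Parse a samtools-style .fai index into {contig_name: length_in_bases}."""
    lengths = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            # Column 1 is the contig name, column 2 its length in bases.
            lengths[row[0]] = int(row[1])
    return lengths


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example call: python read_fai.py ucsc.hg19.fasta.fai
    for contig, length in read_fai(sys.argv[1]).items():
        print(f"{contig}\t{length}")
```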
The fastq libraries must be mapped to a reference genome (ucsc.hg19.fasta) used as the RefSeq sequence. The additional .dict and .fai files are, respectively, the dictionary and index files associated with that sequence.
- Training sets and known site files
- dbsnp_138.hg19.vcf.gz
- dbsnp_138.hg19.vcf.idx.gz
- hapmap_3.3.hg19.sites.vcf.gz
- hapmap_3.3.hg19.sites.vcf.idx.gz
- 1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
- 1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz
- Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
- af-only-gnomad.raw.sites.hg19.vcf
- af-only-gnomad.raw.sites.hg19.vcf.idx
The material (reference genome and training sets) can be downloaded via the GATK Resource Bundle page of the Broad Institute at https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle. In particular, you will need the files listed above from the hg19 release folder.
The training sets and known site resources are files containing lists of variants that machine-learning algorithms use to model the properties of true variation vs. artifacts. They are required in several steps of the SPMI protocol to help the caller distinguish true variants from false positives. For more details, see the GATK documentation at https://gatk.broadinstitute.org/hc/en-us/articles/360035890831-Known-variants-Training-resources-Truth-sets
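Before starting the analysis, it can help to confirm that every file listed above is present locally. The sketch below is one way to do that, under the assumption that all files were saved into a single directory; the function names are hypothetical. Because the bundle files are gzip-compressed and are often decompressed after download, a file counts as present under either name (with or without `.gz`).

```python
from pathlib import Path

# File names copied from the lists above (reference genome, then known sites).
REQUIRED_FILES = [
    "ucsc.hg19.dict.gz",
    "ucsc.hg19.fasta.fai.gz",
    "ucsc.hg19.fasta.gz",
    "dbsnp_138.hg19.vcf.gz",
    "dbsnp_138.hg19.vcf.idx.gz",
    "hapmap_3.3.hg19.sites.vcf.gz",
    "hapmap_3.3.hg19.sites.vcf.idx.gz",
    "1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz",
    "1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz",
    "Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz",
    "af-only-gnomad.raw.sites.hg19.vcf",
    "af-only-gnomad.raw.sites.hg19.vcf.idx",
]


def candidate_names(name):
    """Return the compressed and decompressed spellings of a file name."""
    if name.endswith(".gz"):
        return (name, name[:-3])
    return (name, name + ".gz")


def missing_resources(bundle_dir, required=REQUIRED_FILES):
    """List required files found under neither spelling in bundle_dir."""
    root = Path(bundle_dir)
    return [
        name
        for name in required
        if not any((root / candidate).is_file() for candidate in candidate_names(name))
    ]


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory holding the downloads.
    for name in missing_resources("."):
        print("missing:", name)
```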