Course: VariantSeq: tutorial for usage with case study., Topic: 1.2

1.2 - Tutorial material and case study

In this tutorial, we will use data from a case study of variant analysis published by Trilla-Fuertes et al. (2020), based on whole-exome NGS data sequenced from samples of human anal cancer. The tutorial material consists of five formalin-fixed, paraffin-embedded (FFPE) samples from patients diagnosed with localized anal squamous cell carcinoma (ASCC). These samples were analyzed by Illumina paired-end whole-exome sequencing (NextSeq 500). The sample names are provided in Table 1.

NGS Data: Five fastq files with the following SRA Accessions (Table 1):

Table 1: Exome samples

SRA accession	Library Names
SRR10164002	CAN2
SRR10163991	CAN3
SRR10163980	CAN4
SRR10163969	CAN5
SRR10163960	CAN12

The five fastq files can be downloaded from NCBI at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA573670. For questions about how to download this material, contact us for support in our forum at https://forum.biotechvana.com.
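As an alternative to downloading through the BioProject page, the accessions in Table 1 can be fetched from the command line. The sketch below only builds the commands and does not run them. It assumes the NCBI SRA Toolkit (`fasterq-dump`) is installed, which this tutorial does not require; the output folder name `fastq` and the helper name `build_download_commands` are illustrative choices, not part of VariantSeq.

```python
from pathlib import Path

# Accession-to-library mapping copied from Table 1.
SAMPLES = {
    "SRR10164002": "CAN2",
    "SRR10163991": "CAN3",
    "SRR10163980": "CAN4",
    "SRR10163969": "CAN5",
    "SRR10163960": "CAN12",
}


def build_download_commands(samples, outdir="fastq"):
    """Return one fasterq-dump argument list per accession (nothing is executed).

    Assumption: the SRA Toolkit is on the PATH. --split-files asks for
    separate files per read, as is usual for paired-end runs.
    """
    commands = []
    for accession in samples:
        commands.append(
            ["fasterq-dump", "--split-files", "--outdir", str(Path(outdir)), accession]
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run in a shell.
    for cmd in build_download_commands(SAMPLES):
        print(" ".join(cmd))
```

Each printed line can be pasted into a terminal, or passed to `subprocess.run` once the toolkit is confirmed to be available.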

RefSeq material

In this tutorial, we use the GATK Resource Bundle, which is based on the hg19 release of the human genome, as the source of RefSeq. As training material, we use an interval file based on the seqCap VCRome V2 for human exome. The interval file can be downloaded by clicking seqCap_VCRome_V2_intervals_list.intervals.

For the cancer variant analysis, you need a panel of normals (PON) to filter out possible germline variants. It can be downloaded by clicking HPON.vcf and HPON.vcf.idx.

This PON was created using 11 human Iberian exome samples sequenced with Illumina technology (Illumina HiSeq 20). The samples belong to the Spanish population (HapMap) data provided by the 1000 Genomes Project (1000 Genomes whole-exome sequencing of the IBS population). The 11 samples can be downloaded from the NCBI SRA archive (http://www.ncbi.nlm.nih.gov/sra/) under the following accessions: SRR768531, SRR768530, SRR768529, SRR766062, SRR766027, SRR766011, SRR766005, SRR765982, SRR765992, SRR764760, SRR764761.

- Reference genome

For this experiment, you need the following reference genome files from the hg19 release:

ucsc.hg19.dict.gz
ucsc.hg19.fasta.fai.gz
ucsc.hg19.fasta.gz
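Before mapping, it can help to confirm that the reference FASTA and its two companion files sit in the same folder. The sketch below derives the expected `.fai` and `.dict` names from the FASTA name, following the naming used in the list above. It assumes the three `.gz` files have already been decompressed; the function names and the `reference/` folder are hypothetical.

```python
from pathlib import Path


def companion_files(fasta):
    """Return the .fai index and .dict dictionary paths expected next to a FASTA."""
    fasta = Path(fasta)
    fai = fasta.with_name(fasta.name + ".fai")  # e.g. ucsc.hg19.fasta.fai
    seq_dict = fasta.with_suffix(".dict")       # e.g. ucsc.hg19.dict
    return fai, seq_dict


def missing_reference_files(fasta):
    """List whichever of the FASTA, .fai and .dict files are absent on disk."""
    fai, seq_dict = companion_files(fasta)
    return [p for p in (Path(fasta), fai, seq_dict) if not p.exists()]


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the bundle was unpacked.
    for path in missing_reference_files("reference/ucsc.hg19.fasta"):
        print("missing:", path)
```

An empty result means all three files are present; it does not verify that the index and dictionary were actually built from that FASTA.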

- Training sets and known site files

The fastq libraries must be mapped to the reference genome (ucsc.hg19.fasta), which serves as the RefSeq sequence. The additional .dict and .fai files listed above are the dictionary and index files, respectively, associated with that sequence. The training sets and known site files are:

dbsnp_138.hg19.vcf.gz
dbsnp_138.hg19.vcf.idx.gz
hapmap_3.3.hg19.sites.vcf.gz
hapmap_3.3.hg19.sites.vcf.idx.gz
1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
af-only-gnomad.raw.sites.hg19.vcf
af-only-gnomad.raw.sites.hg19.vcf.idx
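Every VCF in the list above comes with an index file, so a quick pairing check can catch an incomplete download. The following is a minimal sketch that works on file names only. It assumes the naming pattern seen in the list (`X.vcf.gz` pairs with `X.vcf.idx.gz`, and `X.vcf` pairs with `X.vcf.idx`); the function name `unpaired_vcfs` is illustrative.

```python
# File names copied from the list above.
KNOWN_SITES = [
    "dbsnp_138.hg19.vcf.gz",
    "dbsnp_138.hg19.vcf.idx.gz",
    "hapmap_3.3.hg19.sites.vcf.gz",
    "hapmap_3.3.hg19.sites.vcf.idx.gz",
    "1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz",
    "1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz",
    "Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz",
    "af-only-gnomad.raw.sites.hg19.vcf",
    "af-only-gnomad.raw.sites.hg19.vcf.idx",
]


def unpaired_vcfs(filenames):
    """Return the VCF names in the list that lack a matching index entry."""
    names = set(filenames)
    missing = []
    for name in filenames:
        if name.endswith(".idx") or name.endswith(".idx.gz"):
            continue  # skip the index files themselves
        if name.endswith(".vcf.gz"):
            # Compressed VCF: the index name has ".idx" before the ".gz".
            index = name[: -len(".gz")] + ".idx.gz"
        else:
            index = name + ".idx"
        if index not in names:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(unpaired_vcfs(KNOWN_SITES))
```

To check a real download folder, the same function can be given `os.listdir(folder)` instead of the fixed list.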

The material (reference genome and training sets) can be downloaded from the Broad Institute FTP site, as described at https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle. In particular, you will need the files listed above from the hg19 release folder.

The training sets and known site resources are files containing lists of variants that are used to train machine-learning algorithms to model the properties of true variation vs. artifacts. They are required in several steps of the SPMI protocol to help the caller distinguish true variants from false positives. For more details, see the GATK documentation at https://gatk.broadinstitute.org/hc/en-us/articles/360035890831-Known-variants-Training-resources-Truth-sets

Biotechvana

This platform is part of: IVACE PROJECT IMDIGB/2020/56