Output files
PCGR generates multiple output files with annotations of molecular aberrations, including an interactive report, an Excel workbook, and pure text-based annotation files (TSV).
HTML report - quarto-based
An interactive and structured HTML report that shows the most relevant findings in the query cancer genome has the following naming convention:
-
<sample_id>.pcgr.<genome_assembly>.html
- The sample_id is provided as input by the user, and reflects a unique identifier of the tumor-normal sample pair to be analyzed.
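Since every output file documented on this page shares the `<sample_id>.pcgr.<genome_assembly>` prefix, the expected file names for a run can be derived programmatically. The following is a minimal sketch, not part of PCGR: the helper name `expected_outputs`, the output directory, and the lowercase assembly string `grch38` are assumptions for illustration. Which files are actually produced depends on the input provided and the analyses enabled.

```python
from pathlib import Path

# Suffixes copied from the naming conventions documented on this page.
# The tumor-only file (snv_indel_filtered.ann.tsv.gz) is left out on purpose.
PCGR_OUTPUT_SUFFIXES = [
    "html",
    "vcf.gz",
    "snv_indel_ann.tsv.gz",
    "tmb.tsv",
    "msigs.tsv.gz",
    "cna_segment.tsv.gz",
    "cna_gene_ann.tsv.gz",
    "expression.tsv.gz",
    "expression_similarity.tsv.gz",
    "expression_outliers.tsv.gz",
    "xlsx",
]


def expected_outputs(output_dir, sample_id, genome_assembly):
    """Return candidate output paths for one sample (hypothetical helper)."""
    prefix = f"{sample_id}.pcgr.{genome_assembly}"
    return [Path(output_dir) / f"{prefix}.{suffix}" for suffix in PCGR_OUTPUT_SUFFIXES]


# Example with made-up values; the casing of the assembly string is an assumption.
paths = expected_outputs("pcgr_out", "tumor_sample_A", "grch38")
for path in paths:
    print(path.name)
```

Checking `path.exists()` for each returned path is one simple way to see which optional outputs a given run produced.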
The report is structured in various sections, depending on the input provided by the user. The following sections may be included in the report:
-
Settings
- Lists key configurations for the analysis, including the genome assembly, type of sequencing assay (WES/WGS/TARGETED), the cancer type (as provided by the user), and the tumor purity and ploidy.
-
Somatic SNVs/InDels
- Provides an overview of the somatic SNVs and InDels detected in the tumor sample
- Includes a global distribution of allelic support, statistics with respect to variant types and consequences
- Variants are classified with respect to oncogenicity
(ClinGen/CGC/VICC standard operating procedures)
- Somatic mutations can also be explored through interactive filtering along several dimensions (variant sequencing depth/support, variant consequence etc.)
- Variants are classified with respect to actionability
(AMP/ASCO/CAP guidelines)
- individual evidence items linked to actionable variants can be explored, indicating strength of evidence, tumor type and therapeutic context, and clinical significance
-
Somatic CNAs
- Aberrations are classified with respect to actionability (AMP/ASCO/CAP guidelines)
- individual evidence items linked to actionable variants can be explored, indicating strength of evidence, tumor type and therapeutic context, and clinical significance
- Other potentially oncogenic aberrations are listed: proto-oncogenes subject to copy number amplifications, and tumor suppressor genes subject to homozygous deletions
- MSI status
- Indicates predicted microsatellite stability from the somatic mutation profile and supporting evidence (details of the underlying MSI statistical classifier can be found here)
- The MSI classifier was trained on TCGA exome samples.
-
Tumor mutational burden (TMB)
- given a coding target region size specified by the user (ideally the callable target size), an estimate of the mutational burden is provided
- The estimated TMB is shown in the context of TMB distributions from different primary sites in TCGA
- Mutational signatures
- Estimation of relative contribution of known mutational signatures in tumor sample (using MutationalPatterns as the underlying framework)
- Datatable with signatures found and proposed underlying etiologies
- RNA expression analysis
- Datatable with expression outliers - as compared to distribution in reference cohorts
- Datatable with correlation between gene expression in query sample and other reference cohorts (TCGA, TreeHouse, DepMap)
- Immune contexture profiling
- Documentation
- Annotation resources - databases with version and licensing information
- Report contents - brief description of the main sections in the report
- References - supporting scientific literature (key report elements)
SNVs/InDels
1. Variant call format - VCF
A VCF file containing annotated somatic calls (single nucleotide variants and insertions/deletions) is generated with the following naming convention:
-
<sample_id>.pcgr.<genome_assembly>.vcf.gz
- The sample_id is provided as input by the user, and reflects a unique identifier of the tumor-normal sample pair to be analyzed. Following common standards, the annotated VCF file is compressed with bgzip and indexed with tabix. Below follows a description of all annotations/tags present in the VCF INFO column after processing with the PCGR annotation pipeline:
VEP consequence annotations
Tag | Description |
---|---|
CSQ |
Complete consequence annotations from VEP. Format (separated by a
| ): Allele , Consequence ,
IMPACT , SYMBOL , Gene ,
Feature_type , Feature , BIOTYPE ,
EXON , INTRON , HGVSc ,
HGVSp , cDNA_position ,
CDS_position , Protein_position ,
Amino_acids , Codons ,
Existing_variation , ALLELE_NUM ,
DISTANCE , STRAND , FLAGS ,
PICK , VARIANT_CLASS ,
SYMBOL_SOURCE , HGNC_ID ,
CANONICAL , MANE_SELECT ,
MANE_PLUS_CLINICAL , TSL , APPRIS ,
CCDS , ENSP , SWISSPROT ,
TREMBL , UNIPARC , UNIPROT_ISOFORM ,
RefSeq , DOMAINS , HGVS_OFFSET ,
AF , AFR_AF , AMR_AF ,
EAS_AF , EUR_AF , SAS_AF ,
gnomAD_AF , gnomAD_AFR_AF ,
gnomAD_AMR_AF , gnomAD_ASJ_AF ,
gnomAD_EAS_AF , gnomAD_FIN_AF ,
gnomAD_NFE_AF , gnomAD_OTH_AF ,
gnomAD_SAS_AF , CLIN_SIG , SOMATIC ,
PHENO , CHECK_REF , MOTIF_NAME ,
MOTIF_POS , HIGH_INF_POS ,
MOTIF_SCORE_CHANGE , TRANSCRIPTION_FACTORS ,
NearestExonJB
|
Consequence |
Consequence type of the variant (picked by VEP’s
--flag_pick_allele option) |
Gene |
Ensembl stable ID of affected gene (picked by VEP’s
--flag_pick_allele option) |
Feature_type |
Type of feature. Currently one of Transcript, RegulatoryFeature,
MotifFeature (picked by VEP’s --flag_pick_allele
option) |
Feature |
Ensembl stable ID of feature (picked by VEP’s
--flag_pick_allele option) |
cDNA_position |
Relative position of base pair in cDNA sequence (picked by VEP’s
--flag_pick_allele option) |
CDS_position |
Relative position of base pair in coding sequence (picked by VEP’s
--flag_pick_allele option) |
CDS_RELATIVE_POSITION |
Ratio of variant coding position to length of coding sequence |
CDS_CHANGE |
Coding, transcript-specific sequence annotation (picked by VEP’s
--flag_pick_allele option) |
ALTERATION |
HGVSp/HGVSc identifier |
AMINO_ACID_START |
Protein position indicating absolute start of amino acid altered
(fetched from Protein_position ) |
AMINO_ACID_END |
Protein position indicating absolute end of amino acid altered
(fetched from Protein_position ) |
Protein_position |
Relative position of amino acid in protein (picked by VEP’s
--flag_pick_allele option) |
Amino_acids |
Only given if the variant affects the protein-coding sequence
(picked by VEP’s --flag_pick_allele option) |
Codons |
The alternative codons with the variant base in upper case (picked
by VEP’s --flag_pick_allele option) |
IMPACT |
Impact modifier for the consequence type (picked by VEP’s
--flag_pick_allele option) |
VARIANT_CLASS |
Sequence Ontology variant class (picked by VEP’s
--flag_pick_allele option) |
SYMBOL |
Gene symbol (picked by VEP’s --flag_pick_allele
option) |
SYMBOL_SOURCE |
The source of the gene symbol (picked by VEP’s
--flag_pick_allele option) |
STRAND |
The DNA strand (1 or -1) on which the transcript/feature lies
(picked by VEP’s --flag_pick_allele option) |
ENSP |
The Ensembl protein identifier of the affected transcript (picked by
VEP’s --flag_pick_allele option) |
FLAGS |
Transcript quality flags: cds_start_NF : CDS 5’ incomplete,
cds_end_NF : CDS 3’ incomplete (picked by VEP’s
--flag_pick_allele option) |
SWISSPROT |
Best match UniProtKB/Swiss-Prot accession of protein product (picked
by VEP’s --flag_pick_allele option) |
TREMBL |
Best match UniProtKB/TrEMBL accession of protein product (picked by
VEP’s --flag_pick_allele option) |
UNIPARC |
Best match UniParc accession of protein product (picked by VEP’s
--flag_pick_allele option) |
HGVSc |
The HGVS coding sequence name (picked by VEP’s
--flag_pick_allele option) |
HGVSc_RefSeq |
The HGVSc coding sequence name using RefSeq transcript identifiers
(MANE select) (picked by VEP’s --flag_pick_allele
option) |
HGVSp |
The HGVS protein sequence name (picked by VEP’s
--flag_pick_allele option) |
HGVSp_short |
The HGVS protein sequence name, short version (picked by VEP’s
--flag_pick_allele option) |
HGVS_OFFSET |
Indicates by how many bases the HGVS notations for this variant have
been shifted (picked by VEP’s --flag_pick_allele
option) |
NearestExonJB |
VEP plugin that finds the nearest exon junction for a coding sequence variant. Format: Ensembl exon identifier+distance to exon boundary+boundary type(start/end)+exon length |
MOTIF_NAME |
The source and identifier of a transcription factor binding profile
aligned at this position (picked by VEP’s
--flag_pick_allele option) |
MOTIF_POS |
The relative position of the variation in the aligned TFBP (picked
by VEP’s --flag_pick_allele option) |
HIGH_INF_POS |
A flag indicating if the variant falls in a high information
position of a transcription factor binding profile (TFBP) (picked by
VEP’s --flag_pick_allele option) |
MOTIF_SCORE_CHANGE |
The difference in motif score of the reference and variant sequences
for the TFBP (picked by VEP’s --flag_pick_allele
option) |
CELL_TYPE |
List of cell types and classifications for regulatory feature
(picked by VEP’s --flag_pick_allele option) |
CANONICAL |
A flag indicating if the transcript is denoted as the canonical
transcript for this gene (picked by VEP’s
--flag_pick_allele option) |
CCDS |
The CCDS identifier for this transcript, where applicable (picked by
VEP’s --flag_pick_allele option) |
INTRON |
The intron number (out of total number) (picked by VEP’s
--flag_pick_allele option) |
EXON |
The exon number (out of total number) (picked by VEP’s
--flag_pick_allele option) |
EXON_AFFECTED |
The exon affected by the variant (picked by VEP’s
--flag_pick_allele option) |
LAST_EXON |
Logical indicator for last exon of transcript (picked by VEP’s
--flag_pick_allele option) |
LAST_INTRON |
Logical indicator for last intron of transcript (picked by VEP’s
--flag_pick_allele option) |
INTRON_POSITION |
Relative position of intron variant to nearest exon/intron junction (NearestExonJB VEP plugin) |
EXON_POSITION |
Relative position of exon variant to nearest intron/exon junction (NearestExonJB VEP plugin) |
DISTANCE |
Shortest distance from variant to transcript (picked by VEP’s
--flag_pick_allele option) |
BIOTYPE |
Biotype of transcript or regulatory feature (picked by VEP’s
--flag_pick_allele option) |
TSL |
Transcript support level (picked by VEP’s
--flag_pick_allele option) |
PUBMED |
PubMed ID(s) of publications that cite existing variant - VEP |
PHENO |
Indicates if existing variant is associated with a phenotype, disease or trait - VEP |
GENE_PHENO |
Indicates if overlapped gene is associated with a phenotype, disease or trait - VEP |
ALLELE_NUM |
Allele number from input; 0 is reference, 1 is first alternate etc - VEP |
REFSEQ_MATCH |
The RefSeq transcript match status; contains a number of flags
indicating whether this RefSeq transcript matches the underlying
reference sequence and/or an Ensembl transcript (picked by VEP’s
--flag_pick_allele option) |
PICK |
Indicates if this block of consequence data was picked by VEP’s
--flag_pick_allele option |
VEP_ALL_CSQ |
All transcript consequences
(Consequence:SYMBOL:Feature_type:Feature:BIOTYPE ) -
VEP |
EXONIC_STATUS |
Indicates if variant consequence type is ‘exonic’ or ‘nonexonic’. We
here define ‘exonic’ as any variant with one of the following
consequences: stop_gained / stop_lost ,
start_lost , frameshift_variant ,
missense_variant , splice_donor_variant ,
splice_acceptor_variant ,
inframe_insertion / inframe_deletion ,
synonymous_variant , start_retained ,
stop_retained , protein_altering
|
CODING_STATUS |
Indicates if primary variant consequence type is ‘coding’ or ‘noncoding’ (wrt. protein-alteration). ‘coding’ variants are here defined as those with an ‘exonic’ status, with the exception of synonymous variants |
NULL_VARIANT |
Primary variant consequence type is frameshift or
stop_gained /stop_lost
|
LOSS_OF_FUNCTION |
Loss-of-function variant |
LOF_FILTER |
Loss-of-function filter |
SPLICE_DONOR_RELEVANT |
Logical indicating if variant is located at a particular location
near the splice donor site (+3A/G , +4A or
+5G ) |
REGULATORY_ANNOTATION |
Comma-separated list of all variant annotations of
Feature_type , RegulatoryFeature , and
MotifFeature . Format (separated by a | ):
<Consequence> , <Feature_type> ,
<Feature> , <BIOTYPE> ,
<MOTIF_NAME> , <MOTIF_POS> ,
<HIGH_INF_POS> ,
<MOTIF_SCORE_CHANGE> ,
<TRANSCRIPTION_FACTORS>
|
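The EXONIC_STATUS and CODING_STATUS definitions in the table above can be sketched as a small classifier. This is an illustration only, not PCGR's implementation: treating the first `&`-separated term as the primary consequence, and matching the listed terms by prefix (so that `stop_retained` also matches `stop_retained_variant`), are assumptions.

```python
# Consequence terms listed under EXONIC_STATUS in the table above.
EXONIC_TERMS = (
    "stop_gained", "stop_lost", "start_lost", "frameshift_variant",
    "missense_variant", "splice_donor_variant", "splice_acceptor_variant",
    "inframe_insertion", "inframe_deletion", "synonymous_variant",
    "start_retained", "stop_retained", "protein_altering",
)


def primary_consequence(consequence):
    """Assumption: the first '&'-separated term is the primary consequence."""
    return consequence.split("&")[0]


def exonic_status(consequence):
    """Return 'exonic' if the primary consequence is one of the listed terms."""
    primary = primary_consequence(consequence)
    is_exonic = any(primary.startswith(term) for term in EXONIC_TERMS)
    return "exonic" if is_exonic else "nonexonic"


def coding_status(consequence):
    """'coding' = exonic, with the exception of synonymous variants."""
    primary = primary_consequence(consequence)
    if exonic_status(consequence) == "exonic" and not primary.startswith("synonymous_variant"):
        return "coding"
    return "noncoding"


for csq in ("missense_variant&splice_region_variant", "synonymous_variant", "intron_variant"):
    print(csq, exonic_status(csq), coding_status(csq))
```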
Gene information
Tag | Description |
---|---|
ENTREZGENE |
Entrez gene identifier |
APPRIS |
Principal isoform flags according to the APPRIS principal isoform database |
MANE_SELECT |
Indicating if the transcript is the MANE Select for the gene (picked
by VEP’s --flag_pick_allele_gene option) |
MANE_PLUS_CLINICAL |
Indicating if the transcript is MANE Plus Clinical, as required for
clinical variant reporting (picked by VEP’s
--flag_pick_allele_gene option) |
UNIPROT_ID |
UniProt identifier |
UNIPROT_ACC |
UniProt accession(s) |
ENSEMBL_GENE_ID |
Ensembl gene identifier for VEP’s picked transcript (ENSGXXXXXXX) |
ENSEMBL_TRANSCRIPT_ID |
Ensembl transcript identifier for VEP’s picked transcript (ENSTXXXXXX) |
ENSEMBL_PROTEIN_ID |
Ensembl corresponding protein identifier for VEP’s picked transcript (ENSPXXXXXX) |
REFSEQ_TRANSCRIPT_ID |
Corresponding RefSeq transcript(s) identifier for VEP’s picked transcript (NM_XXXXX) |
MANE_SELECT2 |
MANE Select transcript identifier: one high-quality representative transcript per protein-coding gene that is well-supported by experimental data and represents the biology of the gene - provided through BioMart |
MANE_PLUS_CLINICAL2 |
transcripts chosen to supplement MANE Select when needed for clinical variant reporting - provided through BioMart |
GENCODE_TAG |
tag for gencode transcript (basic etc) |
GENCODE_TRANSCRIPT_TYPE |
type of transcript (protein-coding etc.) |
TSG |
Flag indicating whether gene is predicted as a tumor suppressor gene, from Cancer Gene Census, Network of Cancer Genes (NCG) & the CancerMine text-mining resource |
TSG_SUPPORT |
Underlying evidence for gene being a tumor suppressor. Format:
CGC_TIER<1/2>&NCG&CancerMine:num_citations
|
ONCOGENE |
Flag indicating whether gene is predicted as an oncogene, from Cancer Gene Census, Network of Cancer Genes (NCG) & the CancerMine text-mining resource. |
ONCOGENE_SUPPORT |
Underlying evidence for gene being an oncogene. Format:
CGC_TIER<1/2>&NCG&CancerMine:num_citations
|
INTOGEN_DRIVER |
Gene is predicted as a cancer driver in the IntoGen Cancer Drivers Database |
TCGA_DRIVER |
Gene is predicted as a cancer driver in the TCGA pan-cancer analysis of cancer driver genes and mutations |
PROB_EXAC_LOF_INTOLERANT |
dbNSFP_gene : the probability of being loss-of-function
intolerant (intolerant of both heterozygous and homozygous lof variants)
based on ExAC r0.3 data |
PROB_EXAC_LOF_INTOLERANT_HOM |
dbNSFP_gene : the probability of being intolerant of
homozygous, but not heterozygous lof variants based on ExAC r0.3
data |
PROB_EXAC_LOF_TOLERANT_NULL |
dbNSFP_gene : the probability of being tolerant of both
heterozygous and homozygous lof variants based on ExAC r0.3 data |
PROB_EXAC_NONTCGA_LOF_INTOLERANT |
dbNSFP_gene : the probability of being loss-of-function
intolerant (intolerant of both heterozygous and homozygous lof variants)
based on ExAC r0.3 nonTCGA subset |
PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM |
dbNSFP_gene : the probability of being intolerant of
homozygous, but not heterozygous lof variants based on ExAC r0.3 nonTCGA
subset |
PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL |
dbNSFP_gene : the probability of being tolerant of both
heterozygous and homozygous lof variants based on ExAC r0.3 nonTCGA
subset |
PROB_GNOMAD_LOF_INTOLERANT |
dbNSFP_gene : the probability of being loss-of-function
intolerant (intolerant of both heterozygous and homozygous lof variants)
based on gnomAD 2.1 data |
PROB_GNOMAD_LOF_INTOLERANT_HOM |
dbNSFP_gene : the probability of being intolerant of
homozygous, but not heterozygous lof variants based on gnomAD 2.1
data |
PROB_GNOMAD_LOF_TOLERANT_NULL |
dbNSFP_gene : the probability of being tolerant of both
heterozygous and homozygous lof variants based on gnomAD 2.1 data |
PROB_HAPLOINSUFFICIENCY |
dbNSFP_gene : Estimated probability of
haploinsufficiency of the gene (from http://dx.doi.org/10.1371/journal.pgen.1001154) |
ESSENTIAL_GENE_CRISPR |
dbNSFP_gene : Essential (E) or Non-essential
phenotype-changing (N) based on large scale CRISPR experiments (from http://dx.doi.org/10.1126/science.aac7041)
|
ESSENTIAL_GENE_CRISPR2 |
dbNSFP_gene : Essential (E), context-Specific essential
(S), or Non-essential phenotype-changing (N) based on large scale CRISPR
experiments (from http://dx.doi.org/10.1016/j.cell.2015.11.015)
|
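The TSG_SUPPORT and ONCOGENE_SUPPORT tags pack several evidence sources into one string (`CGC_TIER<1/2>&NCG&CancerMine:num_citations`). A minimal sketch of unpacking such a value follows. The function name and the example value are made up for illustration, and the sketch assumes that sources without support are simply absent from the string.

```python
def parse_gene_support(value):
    """Split a TSG_SUPPORT / ONCOGENE_SUPPORT string into its evidence sources.

    Hypothetical helper; assumes '&'-separated items as in the documented format.
    """
    support = {"cgc_tier": None, "ncg": False, "cancermine_citations": None}
    for item in value.split("&"):
        if item.startswith("CGC_TIER"):
            support["cgc_tier"] = int(item[len("CGC_TIER"):])
        elif item == "NCG":
            support["ncg"] = True
        elif item.startswith("CancerMine:"):
            support["cancermine_citations"] = int(item.split(":", 1)[1])
    return support


# Made-up example value following the documented format.
print(parse_gene_support("CGC_TIER1&NCG&CancerMine:25"))
```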
Variant effect and protein-coding information
Tag | Description |
---|---|
MUTATION_HOTSPOT |
mutation hotspot codon in cancerhotspots.org. Format:
gene_symbol | codon | q-value
|
MUTATION_HOTSPOT_TRANSCRIPT |
hotspot-associated transcripts (Ensembl transcript ID) |
MUTATION_HOTSPOT_CANCERTYPE |
hotspot-associated cancer types (from cancerhotspots.org) |
PFAM_DOMAIN |
Pfam domain identifier (from VEP) |
INTOGEN_DRIVER_MUT |
Indicates if existing variant is predicted as driver mutation from IntoGen Catalog of Driver Mutations |
EFFECT_PREDICTIONS |
All predictions of effect of variant on protein function and
pre-mRNA splicing from database of
non-synonymous functional predictions - dbNSFP v4.8/dbscSNV.
Predicted effects are provided by different sources/algorithms
(separated by & ), T = Tolerated,
N = Neutral, D = Damaging: 1.SIFT, 2.MutationTaster (data release
Nov 2015), 3.MutationAssessor (release 3), 4.FATHMM (v2.3),
5.PROVEAN (v1.1 Jan 2015), 6.FATHMM_MKL, 7.PRIMATEAI, 8.DEOGEN2,
9.DBNSFP_CONSENSUS_RNN (ensemble/consensus prediction, based on deep
learning), 10.SPLICE_SITE_EFFECT_ADA (ensemble/consensus prediction of
splice-altering SNVs, based on adaptive boosting),
11.SPLICE_SITE_EFFECT_RF (ensemble/consensus prediction of
splice-altering SNVs, based on random forest), 12.M-CAP, 13.MutPred,
14.GERP, 15.BayesDel, 16.LIST-S2, 17.ALoFT, 18.AlphaMissense,
19.ESM1b, 20.PHACTboost, 21.MutFormer |
DBNSFP_BAYESDEL_ADDAF |
Predicted effect from BayesDel (dbNSFP) |
DBNSFP_LIST_S2 |
Predicted effect from LIST-S2 (dbNSFP) |
DBNSFP_SIFT |
Predicted effect from SIFT (dbNSFP) |
DBNSFP_PROVEAN |
Predicted effect from PROVEAN (dbNSFP) |
DBNSFP_MUTATIONTASTER |
Predicted effect from MutationTaster (dbNSFP) |
DBNSFP_MUTATIONASSESSOR |
Predicted effect from MutationAssessor (dbNSFP) |
DBNSFP_M_CAP |
Predicted effect from M-CAP (dbNSFP) |
DBNSFP_ALOFTPRED |
Predicted effect from ALoFT (dbNSFP) |
DBNSFP_MUTPRED |
Score from MutPred (dbNSFP) |
DBNSFP_FATHMM |
Predicted effect from FATHMM (dbNSFP) |
DBNSFP_PRIMATEAI |
Predicted effect from PrimateAI (dbNSFP) |
DBNSFP_DEOGEN2 |
Predicted effect from DEOGEN2 (dbNSFP) |
DBNSFP_PHACTBOOST |
Predicted effect from PHACTboost (dbNSFP) |
DBNSFP_ALPHA_MISSENSE |
Predicted effect from AlphaMissense (dbNSFP) |
DBNSFP_MUTFORMER |
Predicted effect from MutFormer (dbNSFP) |
DBNSFP_ESM1B |
Predicted effect from ESM1b (dbNSFP) |
DBNSFP_GERP |
Evolutionary constraint measure from GERP (dbNSFP) |
DBNSFP_FATHMM_MKL |
Predicted effect from FATHMM-mkl (dbNSFP) |
DBNSFP_META_RNN |
Predicted effect from ensemble prediction (deep learning - dbNSFP) |
DBNSFP_SPLICE_SITE_RF |
Predicted effect of splice site disruption, using random forest (dbscSNV) |
DBNSFP_SPLICE_SITE_ADA |
Predicted effect of splice site disruption, using boosting (dbscSNV) |
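The MUTATION_HOTSPOT tag uses the pipe-separated layout `gene_symbol | codon | q-value`. As a rough sketch (the `Hotspot` type, the function name, and the example value are all hypothetical, and the sketch assumes exactly one hotspot per value), such a string could be unpacked as follows:

```python
from typing import NamedTuple


class Hotspot(NamedTuple):
    gene_symbol: str
    codon: str
    q_value: float


def parse_hotspot(value):
    """Parse 'gene_symbol|codon|q-value' (assumes a single hotspot per value)."""
    gene_symbol, codon, q_value = (part.strip() for part in value.split("|"))
    return Hotspot(gene_symbol, codon, float(q_value))


# Made-up example value following the documented format.
print(parse_hotspot("BRAF|V600|1e-10"))
```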
Variant frequencies/annotations in germline/somatic databases
Tag | Description |
---|---|
gnomADe_AF |
Adjusted global germline allele frequency (gnomAD release 2) |
gnomADe_AFR_AF |
African/African American germline allele frequency (gnomAD release 2) |
gnomADe_AMR_AF |
Admixed American germline allele frequency (gnomAD release 2) |
gnomADe_SAS_AF |
South Asian germline allele frequency (gnomAD release 2) |
gnomADe_EAS_AF |
East Asian germline allele frequency (gnomAD release 2) |
gnomADe_FIN_AF |
Finnish germline allele frequency (gnomAD release 2) |
gnomADe_NFE_AF |
Non-Finnish European germline allele frequency (gnomAD release 2) |
gnomADe_OTH_AF |
Other germline allele frequency (gnomAD release 2) |
gnomADe_ASJ_AF |
Ashkenazi Jewish allele frequency (gnomAD release 2) |
DBSNP_RSID |
dbSNP reference ID, as provided by VEP |
COSMIC_MUTATION_ID |
Mutation identifier in Catalog of somatic mutations in cancer database, as provided by VEP |
TCGA_PANCANCER_COUNT |
Raw variant count across all TCGA tumor types |
TCGA_FREQUENCY |
Frequency of variant across TCGA tumor types. Format:
tumortype| percent affected|affected cases|total cases
|
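The TCGA_FREQUENCY format (`tumortype|percent affected|affected cases|total cases`) lends itself to a small parser. The sketch below assumes that entries for several tumor types are comma-separated, which the table does not state. The function name and the example values are invented for illustration.

```python
def parse_tcga_frequency(value):
    """Parse TCGA_FREQUENCY into one record per tumor type.

    Assumption: multiple tumor types are separated by commas.
    """
    records = []
    for entry in value.split(","):
        tumor_type, percent, affected, total = (p.strip() for p in entry.split("|"))
        records.append({
            "tumor_type": tumor_type,
            "percent_affected": float(percent),
            "affected_cases": int(affected),
            "total_cases": int(total),
        })
    return records


# Made-up example: two tumor types with invented counts.
for record in parse_tcga_frequency("COAD|1.25|5|400,SKCM|0.5|2|400"):
    print(record)
```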
Clinical associations
Tag | Description |
---|---|
CLINVAR_MSID |
ClinVar Measure Set/Variant ID |
CLINVAR_ALLELE_ID |
ClinVar allele ID |
CLINVAR_PMID |
Associated Pubmed IDs for variant in ClinVar - germline state-of-origin |
CLINVAR_HGVSP |
Protein variant expression using HGVS nomenclature |
CLINVAR_PMID_SOMATIC |
Associated Pubmed IDs for variant in ClinVar - somatic state-of-origin |
CLINVAR_CLNSIG |
Clinical significance for variant in ClinVar - germline state-of-origin |
CLINVAR_CLNSIG_SOMATIC |
Clinical significance for variant in ClinVar - somatic state-of-origin |
CLINVAR_MEDGEN_CUI |
Associated MedGen concept identifiers (CUIs) - germline state-of-origin |
CLINVAR_MEDGEN_CUI_SOMATIC |
Associated MedGen concept identifiers (CUIs) - somatic state-of-origin |
CLINVAR_VARIANT_ORIGIN |
Origin of variant (somatic, germline, de novo etc.) for variant in ClinVar |
CLINVAR_REVIEW_STATUS_STARS |
Rating of the ClinVar variant (0-4 stars) with respect to level of review |
CLINVAR_KNOWN_ONCOGENIC |
Variant matches with known oncogenic variants in ClinVar, through
ClinGen/CGC/VICC SOP. Format: |
Other
Tag | Description |
---|---|
BIOMARKER_MATCH |
Variant matches with biomarker evidence in CIViC/CGI. Format:
|
ONCOGENICITY |
Oncogenicity annotation - ClinGen/CGC/VICC SOP implementation |
ONCOGENICITY_CODE |
Oncogenicity code - ClinGen/CGC/VICC SOP implementation |
ONCOGENICITY_SCORE |
Oncogenicity score - ClinGen/CGC/VICC SOP implementation |
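All tags described in the tables above live in the INFO column of the annotated VCF. For quick inspection without a dedicated VCF library, the column can be split into a dictionary. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the example tag values are made up. Reading a bgzip-compressed file with `gzip` works because bgzip output is gzip-compatible.

```python
import gzip


def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    tags = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        tags[key] = value if sep else True
    return tags


def iter_info(vcf_path):
    """Yield (chrom, pos, info_dict) for each record in a (b)gzipped VCF."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), parse_info(fields[7])


# Made-up INFO string; the tag values are for illustration only.
example = "ONCOGENICITY=Likely_Oncogenic;ONCOGENICITY_SCORE=7;EXAMPLE_FLAG"
print(parse_info(example))
```

For anything beyond ad hoc inspection, a dedicated VCF parser is the safer choice, since it also honors the types declared in the VCF header.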
2. Tab-separated values (TSV)
We provide a tab-separated values file with the most important annotations for SNVs/InDels. The file has the following naming convention:
<sample_id>.pcgr.<genome_assembly>.snv_indel_ann.tsv.gz
The following variables are included in the TSV file (VCF tags specified by the user through --retained_info_tags will be appended at the end):
Variable | Description |
---|---|
1. SAMPLE_ID
|
Sample identifier |
2. GENOMIC_CHANGE
|
Identifier for variant at the genome (VCF) level,
e.g. 1:g.152382569A>G . Format:
<chrom>:g.<position><ref_allele><alt_allele>
|
3. GENOME_VERSION
|
Assembly version, e.g. GRCh37 |
4. VARIANT_CLASS
|
Variant type, e.g. SNV/insertion/deletion/indel |
5. SYMBOL
|
Gene symbol |
6. ENTREZGENE
|
Entrez gene identifier |
7. ENSEMBL_GENE_ID
|
Ensembl gene identifier |
8. GENENAME
|
Gene name |
9. ALTERATION
|
Combined HGVSp/HGVSc annotation |
10. PROTEIN_CHANGE
|
Protein change |
11. CONSEQUENCE
|
Variant consequence - from VEP |
12. PFAM_DOMAIN_NAME
|
Pfam domain name |
13. LOSS_OF_FUNCTION
|
Loss of function flag |
14. LOF_FILTER
|
Loss of function filter |
15. CDS_CHANGE
|
Coding sequence change |
16. CODING_STATUS
|
Coding status - flag indicating if consequence is protein-altering/affecting splice sites |
17. EXONIC_STATUS
|
Exonic status - flag indicating if consequence is silent/protein-altering/affecting splice sites |
18. DP_TUMOR
|
Depth of coverage at variant position in tumor sample |
19. VAF_TUMOR
|
Variant allele fraction at variant position in tumor sample |
20. DP_CONTROL
|
Depth of coverage at variant position in control sample |
21. VAF_CONTROL
|
Variant allele fraction at variant position in control sample |
22. MUTATION_HOTSPOT
|
Mutation hotspot annotation |
23. MUTATION_HOTSPOT_CANCERTYPE
|
Mutation hotspot-associated cancer types (from cancerhotspots.org) |
24. ACTIONABILITY_TIER
|
Actionability tier - AMP/ASCO/CAP implementation |
25. ACTIONABILITY
|
Actionability annotation - AMP/ASCO/CAP implementation |
26. ACTIONABILITY_FRAMEWORK
|
Actionability framework - AMP/ASCO/CAP implementation |
27. ONCOGENICITY
|
Oncogenicity annotation - ClinGen/CGC/VICC SOP implementation |
28. ONCOGENICITY_CODE
|
Oncogenicity code - ClinGen/CGC/VICC SOP implementation |
29. ONCOGENICITY_SCORE
|
Oncogenicity score - ClinGen/CGC/VICC SOP implementation |
30. HGVSc
|
HGVS coding sequence name |
31. HGVSc_RefSeq
|
HGVS coding sequence name (RefSeq) |
32. HGVSp
|
HGVS protein sequence name |
33. CANONICAL
|
Flag indicating if transcript is canonical |
34. CCDS
|
CCDS identifier |
35. UNIPROT_ACC
|
UniProt accession |
36. ENSEMBL_TRANSCRIPT_ID
|
Ensembl transcript identifier |
37. ENSEMBL_PROTEIN_ID
|
Ensembl protein identifier |
38. REFSEQ_TRANSCRIPT_ID
|
RefSeq transcript identifier |
39. REFSEQ_PROTEIN_ID
|
RefSeq protein identifier |
40. MANE_SELECT
|
MANE Select transcript |
41. MANE_PLUS_CLINICAL
|
MANE Plus Clinical transcript |
42. CGC_TIER
|
Cancer Gene Census tier |
43. CGC_GERMLINE
|
Cancer Gene Census germline annotation |
44. CGC_SOMATIC
|
Cancer Gene Census somatic annotation |
45. ONCOGENE
|
Flag indicating if gene is oncogene (CGC/CancerMine/NCG) |
46. ONCOGENE_SUPPORT
|
Oncogene annotation support (CGC/CancerMine/NCG) |
47. TUMOR_SUPPRESSOR
|
Flag indicating if gene is tumor suppressor (CGC/CancerMine/NCG) |
48. TUMOR_SUPPRESSOR_SUPPORT
|
Tumor suppressor annotation support (CGC/CancerMine/NCG) |
49. TARGETED_INHIBITORS2
|
Targeted inhibitors |
50. EFFECT_PREDICTIONS
|
Variant effect predictions - from dbNSFP |
51. REGULATORY_ANNOTATION
|
Regulatory annotation |
52. VEP_ALL_CSQ
|
VEP consequence - all transcripts |
53. gnomADe_AF
|
gnomAD exomes allele frequency - globally |
54. DBSNP_RSID
|
dbSNP identifier |
55. COSMIC_ID
|
COSMIC identifier |
56. TCGA_FREQUENCY
|
Frequency of variant across TCGA tumor types. Format:
tumortype | percent affected | affected cases | total cases
|
57. TCGA_PANCANCER_COUNT
|
Raw variant count across all TCGA tumor types |
58. CLINVAR_MSID
|
ClinVar Measure Set/Variant ID |
59. CLINVAR_CLASSIFICATION
|
ClinVar variant classification |
60. CLINVAR_VARIANT_ORIGIN
|
ClinVar variant origin |
61. CLINVAR_NUM_SUBMITTERS
|
ClinVar number of submitters |
62. CLINVAR_REVIEW_STATUS_STARS
|
ClinVar number of review status stars |
63. CLINVAR_CONFLICTED
|
ClinVar variant classification is conflicted |
64. BIOMARKER_MATCH
|
Biomarker match |
65. CALL_CONFIDENCE
|
Call confidence |
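The GENOMIC_CHANGE column follows the pattern `<chrom>:g.<position><ref_allele>><alt_allele>`, as in the example `1:g.152382569A>G`. A regular expression can split such identifiers back into their components. This is a sketch only: the function name is hypothetical, and the pattern assumes alleles consist of the letters A, C, G, T and N.

```python
import re

# Pattern derived from the documented example 1:g.152382569A>G.
GENOMIC_CHANGE_RE = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGTN]+)>(?P<alt>[ACGTN]+)$"
)


def parse_genomic_change(value):
    """Return (chrom, pos, ref, alt) for a GENOMIC_CHANGE identifier."""
    match = GENOMIC_CHANGE_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected GENOMIC_CHANGE format: {value!r}")
    return match["chrom"], int(match["pos"]), match["ref"], match["alt"]


print(parse_genomic_change("1:g.152382569A>G"))
```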
For tumor-only runs, we provide a similarly formatted tab-separated values file that includes both filtered (i.e. likely germline events) and unfiltered (deemed somatic) variants. The file has the following naming convention:
<sample_id>.pcgr.<genome_assembly>.snv_indel_filtered.ann.tsv.gz
In this TSV file, an additional column SOMATIC_CLASSIFICATION indicates for each variant if it is classified as somatic or germline.
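To get a quick overview of how many variants fall into each class, the SOMATIC_CLASSIFICATION column can be tallied with the standard library. The sketch below assumes the first line of the file is the column header; the function name is hypothetical, and no particular classification labels are assumed.

```python
import csv
import gzip
from collections import Counter


def count_by_column(tsv_gz_path, column="SOMATIC_CLASSIFICATION"):
    """Count the distinct values of one column in a gzipped TSV file."""
    with gzip.open(tsv_gz_path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[column] for row in reader)
```

Calling `count_by_column(path)` on the tumor-only TSV returns a `Counter` keyed by whatever classification labels the file contains.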
Tumor mutational burden (TSV)
We provide a tab-separated values (TSV) file with information about mutational burden detected in the tumor sample. The file has the following naming convention:
<sample_id>.pcgr.<genome_assembly>.tmb.tsv
The format of the TSV file is the following:
Variable | Description |
---|---|
1. sample_id
|
sample identifier |
2. n_somatic_variants
|
number of somatic variants in total for sample |
3. tmb_measure
|
TMB measure - type of variants included |
4. tmb_csq_regex
|
VEP consequence regex for variants included in TMB calculation |
5. tmb_target_size_mb
|
target size in megabases |
6. tmb_dp_min
|
minimum depth of coverage for variant to be included in TMB calculation |
7. tmb_af_min
|
minimum allele frequency for variant to be included in TMB calculation |
8. tmb_n_variants
|
number of variants included in TMB calculation |
9. tmb_estimate
|
TMB estimate |
10. tmb_unit
|
TMB unit (i.e. mutations/Mb) |
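The columns above suggest how the estimate relates to its inputs: variants passing the depth and allele-frequency thresholds are counted and divided by the target size in megabases. The sketch below illustrates that relationship under this assumption. It is not PCGR's implementation, it omits the consequence filter (`tmb_csq_regex`), and the function name and input layout are hypothetical.

```python
def estimate_tmb(variants, target_size_mb, dp_min=0, af_min=0.0):
    """Return (n_variants_included, mutations_per_mb).

    variants: iterable of (depth, allele_fraction) tuples.
    Assumption: TMB = number of qualifying variants / target size in Mb.
    """
    if target_size_mb <= 0:
        raise ValueError("target_size_mb must be positive")
    n_included = sum(1 for dp, af in variants if dp >= dp_min and af >= af_min)
    return n_included, n_included / target_size_mb


# Made-up variants: (sequencing depth, variant allele fraction).
calls = [(100, 0.30), (8, 0.40), (50, 0.02)]
print(estimate_tmb(calls, target_size_mb=1.5))
print(estimate_tmb(calls, target_size_mb=2.0, dp_min=10, af_min=0.05))
```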
Mutational signature contributions (TSV)
We provide a tab-separated values (TSV) file with information about mutational signatures detected in the tumor sample. The file has the following naming convention:
<sample_id>.pcgr.<genome_assembly>.msigs.tsv.gz
The format of the TSV file is the following:
Variable | Description |
---|---|
1. sample_id
|
sample identifier |
2. signature_id
|
identifier for signature |
3. n_bs_iterations
|
number of bootstrap iterations |
4. prop_signature
|
relative contribution of mutational signature |
5. prop_signature_ci_lower
|
lower bound of confidence interval for relative contribution of mutational signature |
6. prop_signature_ci_upper
|
upper bound of confidence interval for relative contribution of mutational signature |
7. aetiology
|
underlying aetiology of mutational signature |
8. comments
|
additional comments regarding aetiology |
9. group
|
keyword for signature aetiology |
10. all_reference_signatures
|
logical indicating if all reference signatures were used for reconstruction/inference |
11. tumor_type
|
tumor type (used for retrieval of reference signatures) |
12. reference_collection
|
collection used for reference signatures |
13. reference_signatures
|
signatures present in reference collection |
14. fitting_accuracy
|
accuracy of mutational signature fitting |
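As a usage illustration, the signature table can be ranked by `prop_signature` to list the dominant signatures. The sketch below works on the decompressed text of the file; the function name and the example rows (signature identifiers and proportions) are made up.

```python
import csv
import io


def top_signatures(tsv_text, n=3):
    """Return the n signatures with the largest relative contribution."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row["prop_signature"]), reverse=True)
    return [(row["signature_id"], float(row["prop_signature"])) for row in rows[:n]]


# Made-up example content with only three of the documented columns.
example = (
    "sample_id\tsignature_id\tprop_signature\n"
    "s1\tSBS1\t0.25\n"
    "s1\tSBS5\t0.5\n"
    "s1\tSBS40\t0.125\n"
)
print(top_signatures(example, n=2))
```

For the actual output file, the text can be obtained with `gzip.open(path, "rt").read()`.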
Copy number aberrations
1. Tab-separated values (TSV)
Copy number segments are intersected with the genomic coordinates of all transcripts from GENCODE’s basic gene annotation. In addition, PCGR attaches cancer-relevant annotations for the affected transcripts. The naming conventions of the compressed TSV files are as follows:
-
<sample_id>.pcgr.<genome_assembly>.cna_segment.tsv.gz
- segment level information only
-
<sample_id>.pcgr.<genome_assembly>.cna_gene_ann.tsv.gz
- This file is organized according to the affected transcripts (i.e. one line/record per affected transcript; segments that do not overlap any transcript are thus not included in this file).
The format of the compressed cna_gene_ann.tsv.gz file is the following:
Variable | Description |
---|---|
1. SAMPLE_ID
|
Sample identifier |
2. VAR_ID
|
Variant identifier. Format:
<chromosome>:<segment_start>-<segment_end>:<major_cn>:<minor_cn>
|
3. CN_MAJOR
|
Major copy number |
4. CN_MINOR
|
Minor copy number |
5. SEGMENT_LENGTH_MB
|
Length of segment in Mb |
6. CYTOBAND
|
Associated cytoband |
7. EVENT_TYPE
|
Focal or broad (covering more than 25% of chromosome arm) |
8. VARIANT_CLASS
|
gain: total copy number >= user-defined threshold; loss: total copy number equal to zero; undefined: other copy number states |
9. SYMBOL
|
Gene symbol |
10. ENTREZGENE
|
Entrez gene identifier |
11. GENENAME
|
Gene name |
12. ENSEMBL_GENE_ID
|
Ensembl gene identifier |
13. TUMOR_SUPPRESSOR
|
Flag indicating if gene is tumor suppressor (CGC/CancerMine/NCG) |
14. TUMOR_SUPPRESSOR_SUPPORT
|
Tumor suppressor annotation support (CGC/CancerMine/NCG) |
15. ONCOGENE
|
Flag indicating if gene is oncogene (CGC/CancerMine/NCG) |
16. ONCOGENE_SUPPORT
|
Oncogene annotation support (CGC/CancerMine/NCG) |
17. TRANSCRIPT_OVERLAP
|
Comma-separated list of associated transcripts, including percentage of transcript overlap |
18. ACTIONABILITY_TIER
|
Actionability tier - AMP/ASCO/CAP |
19. ACTIONABILITY
|
Actionability - AMP/ASCO/CAP |
20. ACTIONABILITY_FRAMEWORK
|
Actionability framework - AMP/ASCO/CAP |
21. BIOMARKER_MATCH
|
Biomarker match |
22. TARGETED_INHIBITORS_ALL2
|
Molecularly targeted inhibitors - indicated for any tumor type |
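The VAR_ID format and the VARIANT_CLASS rule described in the table can be combined in a short sketch: parse the identifier, sum the major and minor copy numbers, and compare the total against the gain threshold. The function name, the example coordinates, and the threshold value are hypothetical; the labels returned simply mirror the wording of the table and may differ from the literal values in the file.

```python
import re

# <chromosome>:<segment_start>-<segment_end>:<major_cn>:<minor_cn>
VAR_ID_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+):(?P<major>\d+):(?P<minor>\d+)$"
)


def classify_segment(var_id, gain_threshold):
    """Return (total_copy_number, variant_class) for a CNA VAR_ID."""
    match = VAR_ID_RE.match(var_id)
    if match is None:
        raise ValueError(f"unexpected VAR_ID format: {var_id!r}")
    total_cn = int(match["major"]) + int(match["minor"])
    if total_cn == 0:
        return total_cn, "loss"       # total copy number equal to zero
    if total_cn >= gain_threshold:
        return total_cn, "gain"       # at or above the user-defined threshold
    return total_cn, "undefined"      # any other copy number state


# Made-up segments; a gain threshold of 6 is an arbitrary example value.
print(classify_segment("17:39688094-39728658:6:1", gain_threshold=6))
print(classify_segment("9:21967752-21995301:0:0", gain_threshold=6))
```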
Gene expression data
If users provide bulk RNA-seq expression data as input, PCGR will attach basic gene annotations for the affected transcripts, and perform similarity analysis and outlier detection if configured by the user. The naming conventions of the compressed TSV files are as follows:
-
<sample_id>.pcgr.<genome_assembly>.expression.tsv.gz
- NOTE: This file is organized according to the affected transcripts (i.e. one line/record per affected transcript). Contains basic annotations of the affected transcripts.
-
<sample_id>.pcgr.<genome_assembly>.expression_similarity.tsv.gz
- NOTE: This file is organized according to the samples of other gene expression cohorts (i.e. similarity level, one line/record per sample).
-
<sample_id>.pcgr.<genome_assembly>.expression_outliers.tsv.gz
- NOTE: This file is organized according to how the expression levels of genes/transcripts compare to the distribution of expression levels found in reference cohorts. The file contains various statistics in this respect (e.g. z-scores, IQR, Q1, Q2, Q3, percentile etc.), enabling the detection of expression outliers.
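To make the outlier statistics concrete, the sketch below computes a z-score, quartiles, IQR and percentile for one expression value against a reference distribution, using the Python standard library. It illustrates the general idea only: PCGR's exact definitions (e.g. quantile method, transformation of expression values) are not documented here and may differ, and the function name and numbers are made up.

```python
import statistics


def outlier_stats(value, reference):
    """Compare one expression value with a reference distribution."""
    q1, q2, q3 = statistics.quantiles(reference, n=4)  # quartile cut points
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)                    # sample standard deviation
    percentile = 100 * sum(1 for x in reference if x <= value) / len(reference)
    return {
        "z_score": (value - mean) / sd,
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "iqr": q3 - q1,
        "percentile": percentile,
    }


# Made-up reference cohort values and query value.
print(outlier_stats(8.0, [1.0, 2.0, 3.0, 4.0, 5.0]))
```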
Excel workbook (XLSX)
The Excel workbook contains multiple sheets with data tables, mostly
self-explanatory, with annotated datasets depending on the analysis
performed (assay/sample data, SNVs/InDels, CNAs, biomarker evidence,
TMB, MSI, mutational signatures, immune contexture profiling etc). The
naming convention of the Excel workbook is as follows:
<sample_id>.pcgr.<genome_assembly>.xlsx
Note: To reduce the size of the SNV/InDel sheets in the Excel
workbook, we only include the clinically actionable variants as well as
other exonic variants (including splice site variants, silent variants,
and protein-altering variants).