Basic variant consequence annotation

  • VEP - Variant Effect Predictor release 105 (GENCODE v39 as gene reference database (v19 for grch37))

In silico predictions of the effect of coding variants

  • dbNSFP - database of non-synonymous functional predictions (v4.2, March 2021)

Variant frequency databases

  • gnomAD - germline variant frequencies exome-wide (r2.1, October 2018)
  • dbSNP - database of short genetic variants (build 154)
  • Cancer Hotspots - a resource for statistically significant mutations in cancer (v2, 2017)
  • TCGA - somatic mutations discovered across 33 tumor type cohorts (release 31.0, October 2021)
  • ICGC-PCAWG - ICGC Pan-Cancer Analysis of Whole Genomes (release 28, March 17th, 2019)

Variant databases of clinical utility

  • ClinVar - database of clinically related variants (February 2022)
  • DoCM - database of curated mutations (v3.2, April 2016)
  • CIViC - clinical interpretations of variants in cancer (February 1st 2022)
  • CGI - Cancer Genome Interpreter Cancer Biomarkers Database (January 17th 2018)
  • ChEMBL - database of drugs, drug-like small molecules and their targets (ChEMBL_28, February 2021)

Protein domains/functional features

  • UniProt/SwissProt KnowledgeBase - resource on protein sequence and functional information (2021_04, November 2021)
  • Pfam - database of protein families and domains (v35.0, November 2021)

Knowledge resources on gene and protein targets

  • CancerMine - Literature-mined database of tumor suppressor genes/proto-oncogenes (v42, December 2021)
  • Open Targets Platform - Database on disease-target associations, molecularly targeted drugs and tractability aggregated from multiple sources (literature, pathways, mutations) (2021.11)
  • TCGA driver genes - predicted cancer driver genes based on application of multiple driver gene prediction tools on TCGA pan-cancer cohort

Pathway databases

Notes on variant annotation datasets

Genome mapping

A requirement for PCGR variant annotation datasets is that variants have been mapped unambiguously to the human reference genome. For most datasets (e.g. dbSNP, ClinVar) this requirement is not an issue. However, a fraction of the variants in the datasets related to clinical interpretation, CIViC and the CGI Cancer Biomarkers Database (CBMDB), have not been mapped to the genome. Whenever possible, we have used TransVar to identify the genomic variants (e.g. g.chr7:140453136A>T) that correspond to variants reported at the amino acid level (e.g. p.V600E) or with other HGVS nomenclature.

For variants that have been mapped to only one genome build (GRCh37 or GRCh38), we have used CrossMap to lift the datasets over to the other build.

Data quality

Clinical biomarkers

Clinical biomarkers included in PCGR are limited to the following:

  • Evidence items for specific markers in CIViC must be accepted (submitted evidence items are not considered)
  • Markers reported at the variant level (e.g. BRAF p.V600E)
  • Markers reported at the codon level (e.g. KRAS p.G12)
  • Markers reported at the exon level (e.g. KIT exon 11 mutation)
  • Within the Cancer bioMarkers database (CGI), only markers collected from FDA/NCCN guidelines, scientific literature, and clinical trials are included (markers collected from conference abstracts etc. are not included)
  • Copy number gains/losses

See also the comment on a closed GitHub issue

IMPORTANT NOTE: The variant consequence reported by CIViC may deviate from what is reported by PCGR. PCGR picks the variant consequence according to VEP’s pick option (depending on a ranked list of criteria that can be configured by the user), and this particular transcript consequence may differ from what has been reported in the literature.

Molecularly targeted drugs

  • For targeted drugs extracted from Open Targets Platform, we distinguish between drugs in late clinical development (phase 3-4), versus those in early clinical development (phase 1-2).
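The phase distinction above can be sketched as a small helper. This is only an illustration: the function name and the integer `max_phase` argument are hypothetical, and the sketch does not reflect how PCGR or the Open Targets Platform actually store clinical phase information.

```python
def development_stage(max_phase):
    """Classify a drug by its maximum clinical phase (hypothetical input).

    Returns "late" for phase 3-4, "early" for phase 1-2, and None for
    anything else (e.g. missing or preclinical).
    """
    if max_phase in (3, 4):
        return "late"
    if max_phase in (1, 2):
        return "early"
    return None


# Hypothetical drug names, used only to show the grouping.
drugs = {"drug_a": 4, "drug_b": 2, "drug_c": 0}
stages = {name: development_stage(phase) for name, phase in drugs.items()}
```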

Gene-disease associations

  • Cancer phenotype associations retrieved from the Open Targets Platform are largely based on the association score developed by the Open Targets Platform, with a couple of extra post-processing steps:
    • Phenotype associations in Open Targets Platform are assembled from a variety of different data sources. Target-disease associations included in PCGR must be supported by at least two distinct sources
    • The weakest associations, here defined as those with an association score < 0.4 (scale from 0 to 1), are omitted
    • As is done within the Open Targets Platform, association scores (for genes) are represented with varying shades of blue: the darker the blue, the stronger the association. Variant hits in tier 3/4 and the noncoding section are arranged according to this association score. If several disease subtypes are associated with a gene, the maximum association score is chosen.
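The post-processing rules above can be sketched as follows. The record layout (dicts with `gene`, `disease`, `score` and `sources` keys) is an assumption made for illustration, as is the choice to apply both filters before taking the per-gene maximum; neither is taken from the PCGR code base.

```python
MIN_SCORE = 0.4    # associations scoring below this are omitted
MIN_SOURCES = 2    # number of distinct supporting data sources required


def filter_associations(records):
    """Return {gene: maximum association score} for records passing both filters.

    Each record is a hypothetical dict with the keys 'gene', 'disease',
    'score' (0 to 1) and 'sources' (an iterable of data source names).
    """
    best = {}
    for rec in records:
        if len(set(rec["sources"])) < MIN_SOURCES:
            continue  # supported by fewer than two distinct sources
        if rec["score"] < MIN_SCORE:
            continue  # too weak an association
        gene = rec["gene"]
        # Several disease subtypes per gene: keep the maximum score.
        best[gene] = max(best.get(gene, 0.0), rec["score"])
    return best


# Made-up records; scores and source names are illustrative only.
records = [
    {"gene": "BRAF", "disease": "melanoma", "score": 0.9,
     "sources": ["literature", "somatic_mutation"]},
    {"gene": "BRAF", "disease": "colorectal carcinoma", "score": 0.7,
     "sources": ["literature", "pathways"]},
    {"gene": "GENE_X", "disease": "melanoma", "score": 0.8,
     "sources": ["literature"]},
    {"gene": "GENE_Y", "disease": "glioma", "score": 0.3,
     "sources": ["literature", "pathways"]},
]
gene_scores = filter_associations(records)
```

In this example only BRAF survives: GENE_X has a single supporting source and GENE_Y falls below the score threshold.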

Tumor suppressor genes/proto-oncogenes

  • Status as oncogene and/or tumor suppressor gene is assigned according to the following scheme in PCGR:
    • Five or more publications in the biomedical literature that suggest an oncogenic/tumor suppressor role for a given gene (as collected from the CancerMine text-mining resource), OR
    • At least two publications from CancerMine that suggest an oncogenic/tumor suppressor role for a given gene AND an existing record for the same gene as a tumor suppressor/oncogene in the Network of Cancer Genes (NCG)
    • Status as oncogene is ignored if a given gene has three times as much literature support for a role as a tumor suppressor gene (and vice versa)
    • Oncogenes/tumor suppressor candidates from CancerMine/NCG that are found in the curated list of false positive cancer drivers compiled by Bailey et al. (Cell, 2018) have been excluded
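A minimal sketch of this scheme is shown below. The function and argument names are hypothetical, and "three times as much" is interpreted here as "at least three times as many publications", which is an assumption rather than something stated in the rules above.

```python
def classify_gene(n_onc, n_tsg, in_ncg_onc=False, in_ncg_tsg=False,
                  false_positive=False):
    """Return (is_oncogene, is_tumor_suppressor) for one gene.

    n_onc / n_tsg: CancerMine publication counts for each role.
    in_ncg_onc / in_ncg_tsg: whether NCG records the gene in that role.
    false_positive: whether the gene is on the curated false-positive list.
    """
    if false_positive:
        return (False, False)

    def supported(n_pubs, in_ncg):
        # Five or more publications, OR at least two plus an NCG record.
        return n_pubs >= 5 or (n_pubs >= 2 and in_ncg)

    is_onc = supported(n_onc, in_ncg_onc)
    is_tsg = supported(n_tsg, in_ncg_tsg)

    # Drop a role when the opposite role has (assumed) at least three
    # times as much literature support.
    if is_onc and n_tsg >= 3 * n_onc:
        is_onc = False
    if is_tsg and n_onc >= 3 * n_tsg:
        is_tsg = False
    return (is_onc, is_tsg)
```

For instance, a gene with 5 oncogene publications and 15 tumor suppressor publications would keep only its tumor suppressor status under these assumptions.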

TCGA somatic calls

  • TCGA employs four different variant callers for detection of somatic variants (SNVs/InDels): MuTect2, VarScan2, SomaticSniper and MuSE. In the TCGA dataset bundled with PCGR, somatic SNVs are restricted to those detected by at least two independent callers (i.e. calls found by a single algorithm are considered low-confidence and are disregarded).
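The consensus rule can be illustrated with a short sketch. The input layout, a list of (variant key, caller name) pairs, and the variant keys themselves are made up for the example and do not correspond to the TCGA file format.

```python
MIN_CALLERS = 2  # a call must be supported by at least two distinct callers


def consensus_calls(calls):
    """Return the set of variants reported by at least MIN_CALLERS callers.

    calls: iterable of (variant_key, caller_name) tuples (hypothetical layout).
    """
    callers_by_variant = {}
    for variant, caller in calls:
        callers_by_variant.setdefault(variant, set()).add(caller)
    return {
        variant
        for variant, callers in callers_by_variant.items()
        if len(callers) >= MIN_CALLERS
    }


# Made-up calls: the first variant is seen by two callers, the second by one.
calls = [
    ("chr1:1000:A>T", "mutect2"),
    ("chr1:1000:A>T", "muse"),
    ("chr2:2000:G>C", "varscan2"),
    ("chr2:2000:G>C", "varscan2"),  # duplicate from the same caller
]
kept = consensus_calls(calls)
```

Using a set per variant means repeated calls from the same caller are counted once, so only the first variant is retained.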