Resources - overview
Data harvested and integrated from the following resources form the backbone of oncoEnrichR:
- Open Targets Platform - human drug-target associations and comprehensive disease-target associations
- CancerMine - literature-mined database of drivers, proto-oncogenes and tumor suppressors in cancer
- Network of Cancer Genes - a web resource to analyze duplicability, orthology and network properties of cancer genes
- IntOGen - Compendium of mutational driver genes in cancer
- UniProt - Comprehensive and freely accessible resource of protein sequence and functional information
- Cancer Gene Census
- DoRothEA - gene set resource containing signed transcription factor (TF) - target (regulatory) interactions
- CellChatDB - ligand-receptor interaction resource
- The Cancer Genome Atlas (TCGA) - gene aberration frequencies, recurrent point mutations/indels, and co-expression patterns in ~9,500 primary tumor samples
- The Human Protein Atlas - expression data for healthy human tissues (GTEx)/cell types, and prognostic gene expression associations in cancer
- Molecular Signatures Database (MSigDB) - collection of annotated (e.g. towards pathways) genesets for enrichment/over-representation analysis. This includes genesets from Gene Ontology, Reactome, KEGG, WikiPathways, BIOCARTA, as well as curated immunologic and cancer-specific signatures.
- NetPath - signal transduction pathways
- Pfam - protein domain database
- STRING - protein-protein interaction database
- BioGRID - biomedical interaction repository with data compiled through comprehensive curation efforts
- CORUM - protein complex database
- Compleat - protein complex resource
- ComplexPortal - manually curated, encyclopaedic resource of macro-molecular complexes
- hu.MAP2 - human protein complex map
- COMPARTMENTS - subcellular compartment database
- DepMap/Project Score - Databases on the effects on cancer cell line viability elicited by CRISPR-Cas9 mediated gene inactivation
- Genetic determinants of survival in cancer - Resource on the relative prognostic impact of gene mutation, expression, methylation, and CNA in human cancers
- Predicted synthetic lethality interactions - Resource on predicted synthetic lethality interactions between human gene paralogs (using human cancer cell lines)
Resources - annotation types
This section describes the underlying annotation types of oncoEnrichR in detail, and to what extent they are subject to quality control/filtering for use within the package.
Target-disease associations
- Associations between genes and diseases/tumor types are harvested from the Open Targets Platform. To increase the confidence of target-disease associations shown, we include only associations with support from at least two data types. Very weak associations, i.e. those with an overall score less than 0.05, have been excluded (complete range 0-1).
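The two filters described above can be sketched as a simple predicate. This is a minimal illustration, not oncoEnrichR's actual code: the `Association` record and its field names are hypothetical, and only the thresholds (at least two data types, overall score of at least 0.05) come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
MIN_OVERALL_SCORE = 0.05  # associations scoring below this are dropped
MIN_DATATYPES = 2         # number of supporting Open Targets data types


@dataclass(frozen=True)
class Association:
    """Hypothetical record for one target-disease association."""
    gene: str
    disease: str
    overall_score: float  # Open Targets overall score, range 0-1
    n_datatypes: int      # number of data types supporting the association


def keep_association(assoc: Association) -> bool:
    """Return True if the association passes both quality filters."""
    return (
        assoc.n_datatypes >= MIN_DATATYPES
        and assoc.overall_score >= MIN_OVERALL_SCORE
    )


def filter_associations(associations):
    """Keep only associations that pass the filters, preserving order."""
    return [a for a in associations if keep_association(a)]
```

An association with a high score but support from a single data type is dropped, as is one with broad support but a score below 0.05.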
Target-drug associations
- Target-drug associations are primarily harvested from the Open Targets Platform. Through the use of our own phenOncoX R package, a dedicated resource on phenotype ontology mapping in cancer, we ensure that target-drug associations included in oncoEnrichR are exclusively for cancer conditions.
Tumor suppressor/proto-oncogene/cancer driver annotation
- We assign three confidence levels with respect to genes having oncogenic and/or tumor suppressive roles:
  - Very strong
    - Strong support (> 20 citations) in the biomedical literature (CancerMine), no curated entry in Network of Cancer Genes (NCG) or Cancer Gene Census (CGC), OR
    - Manually curated entry in NCG or CGC, with significant support (> 10 citations) in the biomedical literature (CancerMine)
  - Strong
    - Curated entry in NCG or CGC, with moderate support (> 5 citations, <= 10 citations) in the biomedical literature (CancerMine), OR
    - Significant support (> 10 citations, <= 20 citations) in the biomedical literature (CancerMine), no entry in CGC/NCG
  - Moderate
    - Curated entry in NCG/CGC, with no or weak support (<= 5 citations) in the biomedical literature (CancerMine), OR
    - Moderate support (> 5 citations, <= 10 citations) in the biomedical literature (CancerMine), no entry in CGC/NCG
  - Literature support for a tumor suppressive role is ignored if the gene has 2.5 times as much literature support for an oncogenic role (and vice versa)
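The confidence rules above can be condensed into a small decision function. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the level is computed per role (oncogene or tumor suppressor), and the 2.5-fold rule is read as "at least 2.5 times", which the text does not state explicitly.

```python
from typing import Optional, Tuple


def confidence_level(n_citations: int, curated: bool) -> Optional[str]:
    """Map CancerMine citation count and NCG/CGC curation to a level.

    Returns None when the evidence does not reach the 'Moderate' level.
    """
    if curated:
        if n_citations > 10:
            return "Very strong"
        if n_citations > 5:
            return "Strong"
        return "Moderate"  # curated entry with no or weak literature support
    if n_citations > 20:
        return "Very strong"
    if n_citations > 10:
        return "Strong"
    if n_citations > 5:
        return "Moderate"
    return None


def effective_citations(n_tsg: int, n_oncogene: int,
                        ratio: float = 2.5) -> Tuple[int, int]:
    """Drop literature support for the role that is clearly dominated.

    Assumption: 'dominated' means the other role has at least `ratio`
    times as many citations.
    """
    if n_oncogene > 0 and n_oncogene >= ratio * n_tsg:
        return 0, n_oncogene
    if n_tsg > 0 and n_tsg >= ratio * n_oncogene:
        return n_tsg, 0
    return n_tsg, n_oncogene
```

For example, a gene with 4 tumor suppressor citations and 30 oncogene citations would keep only its oncogene support, which alone is enough for the "Very strong" level.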
- Classification of genes as potential cancer drivers requires support from at least two of the following sources:
- Network of Cancer Genes - canonical cancer drivers
- IntOGen - predicted cancer driver genes
- Cancer Gene Census - TIER1/2 - curated cancer genes
- TCGA’s predicted cancer driver genes
- CancerMine - literature mining resource, considering only entries with support from > 5 citations
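The "at least two sources" rule for potential cancer drivers can be sketched as follows. The function and argument names are hypothetical. Only the list of sources, the CancerMine citation cutoff (> 5), and the minimum of two sources come from the text.

```python
def supporting_sources(*, ncg: bool, intogen: bool, cgc: bool, tcga: bool,
                       cancermine_citations: int) -> list:
    """List the sources that support a gene as a cancer driver."""
    flags = {
        "NCG": ncg,                # canonical cancer drivers
        "IntOGen": intogen,        # predicted cancer driver genes
        "CGC": cgc,                # Cancer Gene Census, TIER1/2
        "TCGA": tcga,              # TCGA's predicted cancer driver genes
        "CancerMine": cancermine_citations > 5,  # literature mining
    }
    return sorted(name for name, supported in flags.items() if supported)


def is_potential_driver(min_sources: int = 2, **evidence) -> bool:
    """Apply the rule: support from at least `min_sources` sources."""
    return len(supporting_sources(**evidence)) >= min_sources
```

A gene listed only in IntOGen would not qualify, while a gene listed in IntOGen with 6 or more CancerMine citations would.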
Gene cancer hallmark associations
- The hallmarks of cancer comprise six biological capabilities acquired during the multistep development of human tumors. Genes attributed to the hallmarks of cancer have been retrieved from the Open Targets Platform/COSMIC
Gene copy number aberrations in tumors (somatic amplifications/homozygous deletions)
- Somatic copy number amplifications and homozygous deletions have been retrieved from The Cancer Genome Atlas (TCGA), focusing on gene-level calls processed with their copy number variation analysis pipeline. Copy number events included in oncoEnrichR are limited to high-level amplifications (GISTIC2 CNA value: 2) and homozygous deletions (GISTIC2 CNA value: -2)
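As a small illustration of the restriction to high-level events, the sketch below keeps only gene-level calls with a GISTIC2 value of 2 or -2. The input layout, tuples of (sample, gene, value), is an assumption made for this example.

```python
# GISTIC2 gene-level values retained, as described above.
CNA_LABELS = {2: "amplification", -2: "homozygous deletion"}


def high_level_events(calls):
    """Keep only high-level amplifications and homozygous deletions.

    `calls` is an iterable of (sample_id, gene, gistic_value) tuples,
    a hypothetical layout used for illustration only.
    """
    return [
        (sample, gene, CNA_LABELS[value])
        for sample, gene, value in calls
        if value in CNA_LABELS
    ]
```

Low-level gains (1), shallow deletions (-1) and neutral calls (0) are discarded.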
Recurrent gene mutations in tumors (somatic SNVs/InDels)
- Recurrent somatic SNVs/InDels (occurring in more than one tumor sample) have been retrieved from The Cancer Genome Atlas (TCGA). SNVs/InDels have been annotated further with information on mutation hotspots in cancer (cancerhotspots.org), protein domains from PFAM, and their predicted status as loss-of-function variants (LOFTEE). Only protein-altering and canonical splice site mutations are included.
Gene co-expression in tumors (RNA-seq)
- Co-expression correlation coefficients between genes in tumors have been pre-calculated, using RNA-seq data available in The Cancer Genome Atlas (TCGA). We calculate separate correlation coefficients for each tumor type, and furthermore limit pairs of co-expressed genes to those that include either a tumor suppressor, an oncogene, or a predicted cancer driver (using the classification outlined above). Only strong correlations (Spearman’s rank correlation coefficient >= 0.7 or <= -0.7) are included.
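The two co-expression criteria can be expressed as a single predicate. This is a minimal sketch with hypothetical names. It assumes the set of tumor suppressors, oncogenes and predicted drivers has already been assembled, and that correlation coefficients are computed per tumor type upstream.

```python
def keep_coexpression_pair(gene_a: str, gene_b: str, rho: float,
                           cancer_genes: set,
                           min_abs_rho: float = 0.7) -> bool:
    """Return True for strongly correlated pairs involving a cancer gene.

    `rho` is the Spearman rank correlation coefficient for one tumor type.
    `cancer_genes` holds tumor suppressors, oncogenes and predicted drivers.
    """
    strong = abs(rho) >= min_abs_rho
    involves_cancer_gene = gene_a in cancer_genes or gene_b in cancer_genes
    return strong and involves_cancer_gene
```

Both strongly positive and strongly negative correlations pass, but a pair of two genes outside the cancer gene set is dropped regardless of the coefficient.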
Molecular gene signatures / pathways
- A comprehensive set of pathway annotations and molecular gene signatures forms the basis for the functional enrichment module offered by oncoEnrichR. We include a diverse set of molecular signatures from MSigDB, annotations with respect to molecular function, biological processes and cellular components from Gene Ontology (GO), and pathway annotations from KEGG, WikiPathways, NetPath, and Reactome.
Gene fitness effects
- Gene fitness effects are retrieved from an integration of data in DepMap & Project Score. We include data on genes with a statistically significant effect on cell fitness in cancer cell lines (fitness score is here considered a quantitative measure of the reduction of cell viability elicited by a gene inactivation, via CRISPR/Cas9 targeting).
Subcellular compartment annotation
- Subcellular localization/compartment annotations have been retrieved from COMPARTMENTS. When using oncoEnrichR, users can configure the stringency of the target compartment annotations retrieved:
  - Confidence score for each target compartment annotation (range from 3 to 5)
  - The type of channels that support annotations (Knowledge, Text mining or Experimental), as well as the required minimum number of channels that should support a given target compartment annotation
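The stringency settings above can be pictured as a filter over per-channel annotation records. The sketch below is an assumption-laden illustration: the record layout, the function name and its parameter names are hypothetical (they are not oncoEnrichR arguments), and the confidence threshold is assumed to apply to each channel's record individually.

```python
from collections import defaultdict

ALL_CHANNELS = ("Knowledge", "Text mining", "Experimental")


def filter_compartments(records, min_confidence=3,
                        channels=ALL_CHANNELS, min_channels=1):
    """Filter (gene, compartment, channel, confidence) records.

    Returns a dict mapping (gene, compartment) to the sorted list of
    supporting channels, keeping only annotations supported by at least
    `min_channels` of the allowed channels at the required confidence.
    """
    if not 3 <= min_confidence <= 5:
        raise ValueError("min_confidence must be between 3 and 5")
    support = defaultdict(set)
    for gene, compartment, channel, confidence in records:
        if channel in channels and confidence >= min_confidence:
            support[(gene, compartment)].add(channel)
    return {
        key: sorted(found)
        for key, found in support.items()
        if len(found) >= min_channels
    }
```

Raising `min_channels` to 2 keeps only annotations confirmed independently by two channel types, which trades coverage for confidence.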
Protein complexes and protein domains
- Protein complex annotations are retrieved from multiple resources, including CORUM, Compleat, ComplexPortal, and hu.MAP2. Protein domains are retrieved from Pfam
Protein-protein interactions
- Protein-protein interactions for members of the target set are explored using data from two different resources:
  - STRING - a database of known and predicted protein-protein interactions, including direct (physical) and indirect (functional) associations. In oncoEnrichR, the publicly available STRING API is used for network retrieval, and users can configure the following:
    - A required minimum interaction score, limiting the interactions retrieved to those with an interaction score greater than or equal to this threshold. Interaction scores are indicators of confidence, i.e. how likely STRING judges an interaction to be true.
    - The type of network to retrieve, either physical or functional
    - The addition of proteins (maximum 50) that are not part of the query set, but that interact most confidently with proteins in the query set
  - BioGRID - a biomedical interaction repository with data compiled through comprehensive curation efforts. Users can configure the following:
    - A required minimum number of evidence items that support each interaction (i.e. from multiple types of low or high-throughput experiments, e.g. affinity capture-MS, or affinity capture-Western). Default: 3, maximum: 10
    - The addition of proteins (maximum 50) that are not part of the query set, but that interact most confidently with proteins in the query set
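To make the three STRING settings concrete, the sketch below builds a request URL for the public STRING `network` endpoint using its documented `required_score`, `network_type` and `add_nodes` parameters. It only constructs the URL and does not contact the server. The helper name is hypothetical, and how oncoEnrichR itself issues the request is not described here, so treat this as an illustration rather than the package's implementation. Note that the STRING API expects the score threshold on a 0-1000 scale.

```python
from urllib.parse import urlencode

STRING_NETWORK_URL = "https://string-db.org/api/tsv/network"


def string_network_url(genes, required_score=900,
                       network_type="functional", add_nodes=0):
    """Build a STRING API network request for human proteins."""
    if network_type not in ("functional", "physical"):
        raise ValueError("network_type must be 'functional' or 'physical'")
    if not 0 <= required_score <= 1000:
        raise ValueError("required_score must be between 0 and 1000")
    if not 0 <= add_nodes <= 50:
        raise ValueError("add_nodes must be between 0 and 50")
    params = {
        # STRING separates multiple identifiers with a carriage return.
        "identifiers": "\r".join(genes),
        "species": 9606,  # NCBI taxonomy identifier for Homo sapiens
        "required_score": required_score,
        "network_type": network_type,
        "add_nodes": add_nodes,
    }
    return STRING_NETWORK_URL + "?" + urlencode(params)
```

For instance, `string_network_url(["TP53", "MDM2"], required_score=700, network_type="physical", add_nodes=10)` requests the physical subnetwork at a confidence of 0.7 or higher, extended with up to 10 of the most confident interactors outside the query set.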
Gene regulatory interactions
- Regulatory interactions (TF-target) have been retrieved from the DoRothEA resource. Two collections of interactions are subject to overlap against the query set in oncoEnrichR: a global set of regulatory interactions, many of which are inferred from gene expression patterns in normal tissues (GTEx), and a cancer-focused set, the latter inferred from gene expression patterns in tumors (TCGA).
Prognostic gene associations
- Prognostic associations (gene expression versus survival) are collected from The Human Pathology Atlas, which has correlated mRNA expression levels of human genes in tumor tissue with clinical outcome (survival) for ~8,000 cancer patients (underlying data from TCGA). These analyses identified more than 10,000 prognostic genes. We show prognostic genes with regard to cancer type and whether they are favorable or unfavorable in terms of clinical outcome. Furthermore, we show prognostic associations established by tcga-survival.com, which relates various types of gene aberrations (expression or methylation levels, CNA/mutation status) in tumor samples to patient survival.
Ligand-receptor interactions
- Ligand-receptor interactions have been collected from the CellChatDB resource. In the oncoEnrichR module for these interactions, only interactions where both ligand and receptor are found in the query set are shown.
Synthetic lethality interactions
- Predicted synthetic lethality interactions between human gene paralogs have been collected from De Kegel et al., Cell Systems, 2021. We show interactions for which both members are part of the query set, as well as interactions for which only a single member is part of the query set.
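The two groups of interactions described above can be separated with a simple membership count. This is a minimal sketch with hypothetical names. The predicted paralog pairs are assumed to be available as (gene A, gene B) tuples.

```python
def split_sl_pairs(pairs, query_set):
    """Split predicted synthetic lethality pairs by overlap with the query.

    Returns two lists: pairs with both members in the query set, and
    pairs with exactly one member in the query set. Pairs with no
    member in the query set are discarded.
    """
    query = set(query_set)
    both_members, single_member = [], []
    for gene_a, gene_b in pairs:
        n_hits = (gene_a in query) + (gene_b in query)
        if n_hits == 2:
            both_members.append((gene_a, gene_b))
        elif n_hits == 1:
            single_member.append((gene_a, gene_b))
    return both_members, single_member
```

Keeping the two groups apart lets the first be read as interactions within the query set and the second as candidate partners outside it.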