The PCGR software comes with a range of options to configure the analyses performed and to customize the output report.
Key settings
Below, we will outline key settings that are important with respect to the sequencing assay that has been conducted.
Whole-genome, whole-exome or targeted sequencing assay
By default, PCGR expects that the input VCF comes from whole-exome
sequencing. This can be adjusted with the assay
option,
which takes three alternative values:
--assay <WGS|WES|TARGETED>
If the input VCF comes from a targeted sequencing assay/panel
(i.e. --assay TARGETED
), the target size of the assay/panel
should be adjusted accordingly, which is critical for tumor mutational
burden estimation. Assay target size, reflecting the protein-coding
portion of the assayed genomic region in megabases, can be
configured through the following option:
--effective_target_size_mb <value>
Ideally, this should reflect the callable target size of the assay, i.e. the size of the target regions for which coding variants is sufficiently covered (i.e. with a certain sequencing depth) by the sequencing assay. This can be estimated by the user based on the design of the sequencing assay, or by using tools such as mosdepth.
Tumor-control vs. tumor-only
By default, PCGR expects that the input VCF contains somatic variants identified from a tumor-control sequencing setup. This implies that the VCF contains information with respect to variant allelic depth/support both for the tumor sample and the corresponding control sample.
If the input VCF comes from a tumor-only assay, turn
on the --tumor_only
option. In this mode, PCGR conducts a
set of successive filtering steps on the raw input set of variants,
aiming to exclude the majority of germline variants from the tumor-only
input set. In addition to default filtering applied against variants
found in the gnomAD database (population-specific minor allele frequency
thresholds can be configured, see below), additional filtering
procedures can be explicitly set, i.e.:
--exclude_dbsnp_nonsomatic
--exclude_likely_het_germline
--exclude_likely_hom_germline
--exclude_nonexonic
Users should note that these filters may occasionally remove true somatic variants, and should thus be used with caution.
If the user has provided a panel-of-normals VCF file as
input(--pon_vcf
), one may also opt to remove variants found
in that variant collection using a dedicated option:
--exclude_pon
IMPORTANT NOTE 1: A number of the analyses available in PCGR are not particularly well-suited for tumor-only input, e.g. mutational signature analysis and MSI status prediction. IMPORTANT NOTE 2: If you run PCGR on tumor-only WGS assays, we strongly recommend that you pre-filter your VCF against the exome, since we have frequently encountered that the large unfiltered variant set from such assays will cause a crash in PCGR. See also this issue.
Sample properties
A few selected properties of the tumor sample that should be determined prior to PCGR analysis can be provided, currently only used for display/reporting:
-
--tumor_ploidy <value>
- ploidy estimate -
--tumor_purity <value>
- purity estimate
Tumor site
The primary tissue/site of the tumor sample that is analyzed can and should be be provided, through the following option:
--tumor_site <value>
This option takes a value between 0 and 30, reflecting the main primary sites/tissues for human cancers. This information is used e.g. to discriminate between on and off-label targeted therapies from the actionable biomarkers detected in the sample. Note that the default value is 0 (implies Any tissue/site), which essentially prohibits the presence of tier 1 variants occurring in the report.
The following lists the possible encodings, and the corresponding primary tissues/sites:
0 = Any
1 = Adrenal Gland
2 = Ampulla of Vater
3 = Biliary Tract
4 = Bladder/Urinary Tract
5 = Bone
6 = Breast
7 = Cervix
8 = CNS/Brain
9 = Colon/Rectum
10 = Esophagus/Stomach
11 = Eye
12 = Head and Neck
13 = Kidney
14 = Liver
15 = Lung
16 = Lymphoid
17 = Myeloid
18 = Ovary/Fallopian Tube
19 = Pancreas
20 = Peripheral Nervous System
21 = Peritoneum
22 = Pleura
23 = Prostate
24 = Skin
25 = Soft Tissue
26 = Testis
27 = Thymus
28 = Thyroid
29 = Uterus
30 = Vulva/Vagina
(default: 0 - any tumor type)
If you want to check the nature of cancer subtypes attributed to each of the primary tissue/sites, take a look at the table with phenotype terms and associated primary sites.
Allelic support
Proper designation of INFO tags in the input VCF that denote variant depth/allelic fraction is critical to make the report as comprehensive as possible. See the input section on how to accomplish this.
Several options are used to inform PCGR about tags in the VCF INFO field that encodes this information:
--tumor_dp_tag <value>
--tumor_af_tag <value>
--control_dp_tag <value>
--control_af_tag <value>
If these tags are set correctly, one may set thresholds (sequencing depth and/or allelic fraction) on the variants to be included in the report, i.e. through:
--tumor_dp_min <value>
--tumor_af_min <value>
--control_dp_min <value>
--control_af_max <value>
Tumor mutational burden (TMB)
If tags for allelic support is provided in the VCF and configured by the user, users can configure the TMB calculation by setting minimum requirements for sequencing coverage and allelic fraction, i.e.:
--tmb_dp_min <value>
--tmb_af_min <value>
Large input sets (VCF)
For input VCF files with > 500,000 variants, note that these will be subject to filtering prior to the final step in PCGR (step 4: reporting). For such scenarios, raw input sets will be filtered according to consequence type, in the sense that intergenic/intronic/upstream/downstream gene variants will be excluded prior to reporting. This is a necessary step to avoid memory issues etc. during processing with the PCGR R package.
If you have a large input VCF, and have sufficient memory capacity on
your compute platform, we also recommend to increase the VEP
buffer size (option --vep_buffer_size
), as this will
speed up the VEP processing significantly.
Configuration of output files
If you only want the Excel/TSV output (with all variant
classifications and auxiliary analyses (e.g. MSI classifications, TMB
estimates, mutational signatures predictions) of PCGR, you may turn off
the HTML output by using the --no_html
option. This will
speed up the analysis, and could be a useful option if you foresee that
the sample input datasets are too large for the HTML report generation
to work properly.
If you only want the annotated VCF (also converted to TSV), use the
option --no_reporting
. This will skip the final steps of
the PCGR workflow, and only generate the annotated VCF and TSV
files.
All options
A tumor sample report is generated by running the pcgr command, which takes the following arguments and options:
usage:
pcgr -h [options]
--input_vcf <INPUT_VCF>
--vep_dir <VEP_DIR>
--refdata_dir <REFDATA_DIR>
--output_dir <OUTPUT_DIR>
--genome_assembly <GENOME_ASSEMBLY>
--sample_id <SAMPLE_ID>
Personal Cancer Genome Reporter (PCGR) workflow for clinical translation of tumor omics data (SNVs/InDels, CNA, RNA expression)
Required arguments:
--input_vcf INPUT_VCF
VCF input file with somatic variants in tumor sample, SNVs/InDels
--vep_dir VEP_DIR Directory of VEP cache, e.g. $HOME/.vep
--refdata_dir REFDATA_DIR
Directory where PCGR reference data bundle was downloaded and unpacked
--output_dir OUTPUT_DIR
Output directory
--genome_assembly {grch37,grch38}
Human genome assembly build: grch37 or grch38
--sample_id SAMPLE_ID
Tumor sample/cancer genome identifier - prefix for output files
Sequencing assay options:
--assay {WGS,WES,TARGETED}
Type of DNA sequencing assay performed for input data (VCF), default: WES
--effective_target_size_mb EFFECTIVE_TARGET_SIZE_MB
Effective target size in Mb (potentially limited by read depth) of sequencing assay (for TMB analysis) (default: 34 (WES/WGS))
--tumor_only Input VCF comes from tumor-only sequencing, calls will be filtered for variants of germline origin, (default: False)
Tumor sample options:
--tumor_site TSITE Optional integer code to specify primary tumor type/site of query sample,
choose any of the following identifiers:
0 = Any
1 = Adrenal Gland
2 = Ampulla of Vater
3 = Biliary Tract
4 = Bladder/Urinary Tract
5 = Bone
6 = Breast
7 = Cervix
8 = CNS/Brain
9 = Colon/Rectum
10 = Esophagus/Stomach
11 = Eye
12 = Head and Neck
13 = Kidney
14 = Liver
15 = Lung
16 = Lymphoid
17 = Myeloid
18 = Ovary/Fallopian Tube
19 = Pancreas
20 = Peripheral Nervous System
21 = Peritoneum
22 = Pleura
23 = Prostate
24 = Skin
25 = Soft Tissue
26 = Testis
27 = Thymus
28 = Thyroid
29 = Uterus
30 = Vulva/Vagina
(default: 0 - any tumor type)
--tumor_purity TUMOR_PURITY
Estimated tumor purity (between 0 and 1) (default: None)
--tumor_ploidy TUMOR_PLOIDY
Estimated tumor ploidy (default: None)
Allelic support options:
--tumor_dp_tag TUMOR_DP_TAG
Specify VCF INFO tag for sequencing depth (tumor, must be Type=Integer, default: _NA_
--tumor_af_tag TUMOR_AF_TAG
Specify VCF INFO tag for variant allelic fraction (tumor, must be Type=Float, default: _NA_
--control_dp_tag CONTROL_DP_TAG
Specify VCF INFO tag for sequencing depth (control, must be Type=Integer, default: _NA_
--control_af_tag CONTROL_AF_TAG
Specify VCF INFO tag for variant allelic fraction (control, must be Type=Float, default: _NA_
--call_conf_tag CALL_CONF_TAG
Specify VCF INFO tag for somatic variant call confidence (must be categorical, e.g. Type=String, default: _NA_
--tumor_dp_min TUMOR_DP_MIN
If VCF INFO tag for sequencing depth (tumor) is specified and found, set minimum required depth for inclusion in report (default: 0)
--tumor_af_min TUMOR_AF_MIN
If VCF INFO tag for variant allelic fraction (tumor) is specified and found, set minimum required AF for inclusion in report (default: 0)
--control_dp_min CONTROL_DP_MIN
If VCF INFO tag for sequencing depth (control) is specified and found, set minimum required depth for inclusion in report (default: 0)
--control_af_max CONTROL_AF_MAX
If VCF INFO tag for variant allelic fraction (control) is specified and found, set maximum tolerated AF for inclusion in report (default: 1)
Tumor-only filtering options:
--pon_vcf PON_VCF VCF file with germline calls from Panel of Normals (PON) - i.e. blacklisted variants, (default: None)
--maf_gnomad_nfe MAF_GNOMAD_NFE
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - European (non-Finnish), default: 0.002)
--maf_gnomad_asj MAF_GNOMAD_ASJ
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - Ashkenazi Jewish, default: 0.002)
--maf_gnomad_fin MAF_GNOMAD_FIN
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - European (Finnish), default: 0.002)
--maf_gnomad_oth MAF_GNOMAD_OTH
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - Other, default: 0.002)
--maf_gnomad_amr MAF_GNOMAD_AMR
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - Latino/Admixed American, default: 0.002)
--maf_gnomad_afr MAF_GNOMAD_AFR
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - African/African-American, default: 0.002)
--maf_gnomad_eas MAF_GNOMAD_EAS
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - East Asian, default: 0.002)
--maf_gnomad_sas MAF_GNOMAD_SAS
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - South Asian, default: 0.002)
--maf_gnomad_global MAF_GNOMAD_GLOBAL
Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF > pct, (gnomAD - global population, default: 0.002)
--exclude_pon Exclude variants occurring in PoN (Panel of Normals, if provided as VCF (--pon_vcf), default: False)
--exclude_likely_hom_germline
Exclude likely homozygous germline variants (allelic fraction of 1.0 for alternate allele in tumor - very unlikely somatic event), default: False)
--exclude_likely_het_germline
Exclude likely heterozygous germline variants (0.4-0.6 allelic fraction, AND presence in dbSNP + gnomAD, AND not existing as somatic record in COSMIC OR TCGA, default: False)
--exclude_clinvar_germline
Exclude variants found in ClinVar (germline variant origin), defult: False)
--exclude_dbsnp_nonsomatic
Exclude variants found in dbSNP (except for those present in ClinVar (somatic origin) OR TCGA OR COSMIC), defult: False)
--exclude_nonexonic Exclude non-exonic variants, default: False)
VEP options:
--vep_n_forks VEP_N_FORKS
Number of forks (VEP option '--fork'), default: 4
--vep_buffer_size VEP_BUFFER_SIZE
Variant buffer size (variants read into memory simultaneously, VEP option '--buffer_size')
- set lower to reduce memory usage, default: 500
--vep_pick_order VEP_PICK_ORDER
Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence
(option '--pick_order' in VEP), default:
mane_select,mane_plus_clinical,canonical,biotype,ccds,rank,tsl,appris,length
--vep_no_intergenic Skip intergenic variants during variant annotation (VEP option '--no_intergenic' in VEP), default: False
--vep_regulatory Add VEP regulatory annotations (VEP option '--regulatory') or non-coding interpretation, default: False
--vep_gencode_basic Consider basic GENCODE transcript set only with Variant Effect Predictor (VEP) (VEP option '--gencode_basic').
Tumor mutational burden (TMB) and MSI options:
--estimate_tmb Estimate tumor mutational burden from the total number of somatic mutations and target region size, default: False
--tmb_display {coding_and_silent,coding_non_silent,missense_only}
Type of TMB measure to show in report, default: coding_and_silent
--tmb_dp_min TMB_DP_MIN
If VCF INFO tag for sequencing depth (tumor) is specified and found, set minimum required sequencing depth for TMB calculation: default: 0
--tmb_af_min TMB_AF_MIN
If VCF INFO tag for allelic fraction (tumor) is specified and found, set minimum required allelic fraction for TMB calculation: default: 0
--estimate_msi Predict microsatellite instability status from patterns of somatic mutations/indels, default: False
Mutational signature options:
--estimate_signatures
Estimate relative contributions of reference mutational signatures in query sample (re-fitting), default: False
--min_mutations_signatures MIN_MUTATIONS_SIGNATURES
Minimum number of SNVs required for re-fitting of mutational signatures (SBS) (default: 200, minimum n = 100)
--all_reference_signatures
Use _all_ reference mutational signatures (SBS) during signature re-fitting rather than only those already attributed to the tumor type (default: False)
--include_artefact_signatures
Include sequencing artefacts in the collection of reference signatures (default: False
--prevalence_reference_signatures PREVALENCE_REFERENCE_SIGNATURES
Minimum tumor-type prevalence (in percent) of reference signatures to be included in refitting procedure (default: 0.1)
Somatic copy number alteration (CNA) data options:
--input_cna INPUT_CNA
Somatic copy number alteration segments (tab-separated values)
--n_copy_gain N_COPY_GAIN
Minimum number of total copy number for segments considered as gains/amplifications (default: 6)
--cna_overlap_pct CNA_OVERLAP_PCT
Mean percent overlap between copy number segment and gene transcripts for reporting of gains/losses in tumor suppressor genes/oncogenes, (default: 50)
Bulk RNA-seq and RNA fusion data options:
--input_rna_expression INPUT_RNA_EXP
File with bulk RNA expression counts (TPM) of transcripts in tumor (tab-separated values)
--expression_sim Compare expression profile of tumor sample to known expression profiles (default: False)
--expression_sim_db EXPRESSION_SIM_DB
Comma-separated string of databases for used in RNA expression similarity analysis, default: tcga,depmap,treehouse
Germline variant options:
--input_cpsr INPUT_CPSR
CPSR-classified germline calls (file '<cpsr_sample_id>.cpsr.<genome_assembly>.classification.tsv.gz')
--input_cpsr_yaml INPUT_CPSR_YAML
CPSR YAML configuration file (file '<cpsr_sample_id>.cpsr.<genome_assembly>.conf.yaml')
--cpsr_ignore_vus Do not show variants of uncertain significance (VUS) in the germline section of the HTML report (default: False)
Other options:
--vcf2maf Generate a MAF file for input VCF using https://github.com/mskcc/vcf2maf (default: False)
--vcfanno_n_proc VCFANNO_N_PROC
Number of vcfanno processes (option '-p' in vcfanno), default: 4
--ignore_noncoding Ignore non-coding (i.e. non protein-altering) variants in report, default: False
--retained_info_tags RETAINED_INFO_TAGS
Comma-separated string of VCF INFO tags from query VCF that should be kept in PCGR output TSV file
--force_overwrite By default, the script will fail with an error if any output file already exists. You can force the overwrite of existing result files by using this flag, default: False
--version show program's version number and exit
--no_reporting Run functional variant annotation on VCF through VEP/vcfanno, omit other analyses (i.e. Tier assignment/MSI/TMB/Signatures etc. and report generation (STEP 4), default: False
--no_html Do not generate HTML report (default: False)
--debug Print full commands to log
--pcgrr_conda PCGRR_CONDA
pcgrr conda env name (default: pcgrr)
Example run
The examples folder contains input VCF files from two tumor samples sequenced within TCGA. A molecular interpretation report for a colorectal tumor sample can be generated by running the following command in your terminal window (this assumes you have installed PCGR through Conda, and that you have downloaded the necessary reference data files, VEP cache, see Installation):
$ (base) conda activate pcgr
$ (pcgr)
pcgr \
--refdata_dir /Users/you/dir2/data \
--vep_dir /Users/you/dir3/vep/.vep \
--output_dir /Users/you/dir4/pcgr_outputs \
--sample_id T001-COAD \
--genome_assembly grch37 \
--input_vcf /Users/you/pcgr/examples/T001-COAD.grch37.vcf.gz \
--tumor_dp_tag TDP \
--tumor_af_tag TVAF \
--tumor_site 9 \
--input_cna /Users/you/pcgr/examples/T001-COAD.cna.tsv \
--tumor_purity 0.9 \
--tumor_ploidy 2.0 \
--assay WES \
--vcf2maf \
--estimate_signatures \
--estimate_msi \
--estimate_tmb \
--force_overwrite
This command will run the Conda-based PCGR workflow and produce the following files in the /Users/you/dir4/pcgr_outputs folder:
N | File | Description |
---|---|---|
1 | <sample_id>.pcgr.grch37.html | An interactive HTML report for clinical interpretation (quarto-based) |
2 | <sample_id>.pcgr.grch37.xlsx | An excel workbook with multiple sheets of annotations (Assay & sample info/SNVs & InDels/CNAs/biomarkers/TMB/MSI), suitable for aggregation analysis across multiple samples |
3 | <sample_id>.pcgr.grch37.vcf.gz (.tbi) | Bgzipped VCF file with rich set of variant annotations to support interpretation |
4 | <sample_id>.pcgr.grch37.pass.vcf.gz (.tbi) | Bgzipped VCF file with rich set of variant annotations to support interpretation (PASS variants only) |
5 | <sample_id>.pcgr.grch37.pass.tsv.gz | Compressed vcf2tsv-converted file with rich set of variant annotations to support interpretation |
6 | <sample_id>.pcgr.grch37.conf.yaml | PCGR configuration data file (YAML), as generated by pre-reporting annotation workflow |
7 | <sample_id>.pcgr.grch37.cna_segments.tsv.gz | Tab-separated values file with raw copy number segments, per affected transcript |
8 | <sample_id>.pcgr.grch37.cna_gene_ann.tsv.gz | Tab-separated values file with genes subject to copy number alteration, including actionability assessment |
9 | <sample_id>.pcgr.grch37.tmb.tsv | Tab-separated values file with information on tumor mutational burden (TMB) estimates |
10 | <sample_id>.pcgr.grch37.maf | Annotated SNVs/InDels, converted to the Mutation Annotation Format (MAF) |
11 | <sample_id>.pcgr.grch37.msigs.tsv.gz | Tab-separated values file with information on the contribution of mutational signatures in the tumor sample |
12 | <sample_id>.pcgr.grch37.snv_indel_ann.tsv.gz | Tab-separated values file with key SNV/InDel variant annotations, including oncogenicity and clinical actionability assessment |