Cancer genomes contain large numbers of somatic alterations but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most of current approaches either identify driver genes based on mutational recurrence or using estimated scores predicting the functional consequences of mutations.
driveR
is a tool for personalized or batch analysis of
genomic data for driver gene prioritization by combining genomic
information and prior biological knowledge. As features, driveR uses
coding impact metaprediction scores, non-coding impact scores, somatic
copy number alteration scores, hotspot gene/double-hit gene condition,
‘phenolyzer’ gene scores and memberships to cancer-related KEGG
pathways. It uses these features to estimate cancer-type-specific
probabilities for each gene of being a cancer driver using the related
task of a multi-task learning classification model.
The package supports GRCh37 (default) and GRCh38 genomic positions, which can be specified via the
build
argument.
The method is described in detail in Ülgen E, Sezerman OU. driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics. 2021 May 24;22(1):263.https://doi.org/10.1186/s12859-021-04203-7
Below are some example usage cases for driveR
:
For scoring the impact of coding variants, we devised a metapredictor model that utilizes impact prediction scores from 12 different tools: SIFT, PolyPhen-2 (HumDiv scores), LRT, MutationTaster, Mutation Assessor, FATHMM, GERP++, PhyloP, CADD, VEST, SiPhy and DANN. Annotations for all these tools are available in dbNFSP via ANNOVAR. We provide with the package 2 example (shortened) ANNOVAR outputs (see next sections):
library(driveR)
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
package = "driveR")
We can calculate impact scores for all coding variants in this
ANNOVAR file via predict_coding_impact()
:
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv)
head(metaprediction_df)
#> gene_symbol metaprediction_score
#> 531 MUTYH 1.000
#> 2144 AMPD3 1.000
#> 6609 RPIA 1.000
#> 8510 GPAT3 1.000
#> 10040 GUSB 1.000
#> 854 SELENBP1 0.998
By default, predict_coding_impact()
keeps only the
maximal score per gene. This behavior can be altered by setting
keep_highest_score = FALSE
:
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv,
keep_highest_score = FALSE)
Also by default, predict_coding_impact()
keeps only the
first gene symbol for genes where multiple symbols were annotated. This
behavior can be altered by setting
keep_single_symbol = FALSE
:
metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv,
keep_single_symbol = FALSE)
The default na.string
(string that was used to indicate
when a score is not available during annotation with ANNOVAR) is “.”,
this can be altered via:
Below, a step-by-step work flow for driveR
for an
individual tumor sample is provided. Note that some steps require
operations outside of R. The example data provided within the package
are for a lung adenocarcinoma patient studied in Imielinski et al. 1
As input, create_features_df()
, the function used to
create the features table for driver gene prioritization, requires the
following:
annovar_csv_path
: the
/path/to/ANNOVAR/csv/output/file
scna_df
: a data frame containing SCNA segments,
containing the columns “chr”, “start”, “end” and “log2ratio”phenolyzer_annotated_gene_list_path
: the
/path/to/phenolyzer/annotated_gene_list/output/file
We first run ANNOVAR. An example command provided below:
table_annovar.pl example.avinput ~/annovar/humandb/ -buildver hg19 -out /path/to/output -remove -protocol refGene,cytoBand,exac03,avsnp150,dbnsfp30a,cosmic92_coding,cosmic92_noncoding -operation gx,r,f,f,f,f,f -nastring . -csvout -polish
The required filters are, as listed in the above command,
refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding
.
With the package, an example (modified) ANNOVAR csv output file is available:
Next, we prepare a SCNA data frame (GRCh37 or GRCh38). Again, an example data frame (GRCh37) is provided within the package:
Finally, for the phenolyzer annotated_gene_list output file, we first
obtain the genes to be scored using create_features_df
and
setting prep_phenolyzer_input=TRUE
:
phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
scna_df = example_scna_table,
prep_phenolyzer_input = TRUE,
build = "GRCh37")
Next, we save these genes to be used as input for phenolyzer:
write.table(x = data.frame(gene = phenolyzer_genes),
file = "input_genes.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
We create another text file, named
phenolyzer_disease.txt
, containing the phenotype
(i.e. cancer type. in this case “lung adenocarcinoma”) to be used for
scoring with phenolyzer. An example command for running phenolyzer is
provided below:
perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example
This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file is provided within the package:
After creating the necessary input data (as detailed above), we
simply run create_features_df()
to obtain the features data
frame:
Finally, we can prioritize cancer driver genes using
prioritize_driver_genes()
. For this function,
cancer_type
can be of the short names for the 21 different
cancer types that was used to train the multi-task learning model
utilized in this package:
short_name | description |
---|---|
BLCA | Bladder Urothelial Cancer |
BRCA | Breast Cancer |
CESC | Cervical Squamous Cell Carcinoma |
COAD | Colon Adenocarcinoma |
GBM | Brain Glioblastoma Multiforme |
HNSC | Head and Neck Squamous Cell Carcinoma |
KIRC | Kidney Renal Clear Cell Carcinoma |
KIRP | Kidney Renal Papillary Cell Carcinoma |
LAML | Acute Myeloid Leukemia |
LGG | Brain Lower Grade Glioma |
LIHC | Liver Hepatocellular carcinoma |
LUAD | Lung Adenocarcinoma |
LUSC | Lung Squamous Cell Carcinoma |
OV | Ovarian Serous Cystadenocarcinoma |
PAAD | Pancreatic Cancer |
PRAD | Prostate Adenocarcinoma |
READ | Rectum Adenocarcinoma |
SKCM | Skin Cutaneous melanoma |
STAD | Gastric Adenocarcinoma |
THCA | Head and Neck Thyroid Carcinoma |
UCEC | Uterine Corpus Endometrial Carcinoma |
Below is the example run for the lung adenocarcinoma patient:
driver_prob_df <- prioritize_driver_genes(features_df = features_df,
cancer_type = "LUAD")
head(driver_prob_df, 10)
#> gene_symbol driverness_prob prediction
#> 842 TP53 0.9614544 driver
#> 3211 CCND3 0.8604087 driver
#> 3654 EGFR 0.7485849 non-driver
#> 1966 EP300 0.5799741 non-driver
#> 679 ATM 0.5753917 non-driver
#> 510 KDR 0.5012796 non-driver
#> 386 IFNA10 0.4526894 non-driver
#> 740 IL7R 0.4271572 non-driver
#> 3572 PIK3R2 0.3780803 non-driver
#> 2300 ATR 0.3666499 non-driver
The function returns a data frame of genes with probabilities of being cancer driver genes (using the selected sub-model of a multi-task learning model trained using 21 different cancer types) and the prediction of driver genes based on a cancer type specific probability threshold. In the above example, the top 10 genes contain recognizable cancer driver genes.
Below, a step-by-step work flow for driveR
for a cohort
of tumor samples is provided. Note that some steps require operations
outside of R. The example data provided within the package are for 10
randomly-selected samples from The Cancer Genome Atlas (TCGA) program’s
LAML (Acute Myeloid Leukemia) cohort.
As before, as input, create_features_df()
, the function
used to create the features table for driver gene prioritization,
requires the following:
annovar_csv_path
: the
/path/to/ANNOVAR/csv/output/file
scna_df
: a data frame containing SCNA segments,
containing the columns “chr”, “start”, “end”, “log2ratio” and
“tumor_id”phenolyzer_annotated_gene_list_path
: the
/path/to/phenolyzer/annotated_gene_list/output/file
The required filters are again, as listed in the example ANNOVAR
command above,
refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding
.
Additionally, for cohort-level analysis, a column named
tumor_id
is required, containing tumor id of each
variant.
With the package, an example cohort-level (modified) ANNOVAR csv output file is available:
Next, we prepare a SCNA data frame (GRCh37 is required). Again, an example data frame is provided within the package:
head(example_cohort_scna_table)
#> chr start end log2ratio tumor_id
#> 29597 22 25922342 42892582 -0.0085 DO21131
#> 29598 15 20581451 20586957 0.8688 DO21131
#> 29599 1 72768418 72809431 0.3540 DO21131
#> 29600 13 58814227 62806981 0.0023 DO21131
#> 29601 2 208354955 208359336 -0.8954 DO21131
#> 29602 5 117388963 120417166 -0.0114 DO21131
Again, for cohort-level analysis, it’s crucial to have a column named
tumor_id
is required, containing tumor id for each SCNA
segment.
Finally, for the phenolyzer annotated_gene_list output file, we first
obtain the genes to be scored using create_features_df
and
setting prep_phenolyzer_input=TRUE
:
phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
scna_df = example_cohort_scna_table,
prep_phenolyzer_input = TRUE,
batch_analysis = TRUE)
Next, we save these genes to be used as input for phenolyzer:
write.table(x = data.frame(gene = phenolyzer_genes),
file = "input_genes.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
We create another text file, named
phenolyzer_disease.txt
, containing the phenotype
(i.e. cancer type. in this case “acute myeloid leukemia”) to be used for
scoring with phenolyzer. An example command for running phenolyzer is
provided below:
perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example
This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file for the cohort-level data is provided within the package:
After creating the necessary input data (as detailed above), we
simply run create_features_df()
to obtain the features data
frame. Notice that we set batch_analysis = TRUE
. When
batch_analysis = TRUE
, both the ANNOVAR output and the SCNA
table should have a column named ‘tumor_id’.
Finally, we can prioritize cancer driver genes using
prioritize_driver_genes()
.
Below is the example run for the acute myeloid leukemia cohort:
driver_prob_df <- prioritize_driver_genes(features_df = features_df,
cancer_type = "LAML")
head(driver_prob_df, 10)
#> gene_symbol driverness_prob prediction
#> 49 FLT3 0.9922470 driver
#> 5 TP53 0.9569072 driver
#> 291 EP300 0.9451127 driver
#> 338 CDKN2A 0.5800416 non-driver
#> 11 SMC1A 0.3474567 non-driver
#> 258 ANAPC4 0.2596510 non-driver
#> 311 IFNGR1 0.2538748 non-driver
#> 67 MDM4 0.2396303 non-driver
#> 324 TRAF3 0.2383674 non-driver
#> 359 PIEZO1 0.2371796 non-driver
Once again, in the above example, the top 10 genes contain recognizable cancer driver genes.
Imielinski M, Greulich H, Kaplan B, et al. Oncogenic and sorafenib-sensitive ARAF mutations in lung adenocarcinoma. J Clin Invest. 2014;124(4):1582-6.↩︎