1 Introduction

Somatic mutations play an important role in cancer development and disease progression, and mutational profiling is used far more commonly than other ‘omics’ analyses in clinical practice. With the development of PD-1 immunotherapy technology, many experiments show that patients with high TMB have been shown more clinical benefits of immunotherapy (PD-1/PD-L1). However, to date, WES remains confined to research settings, due to the high cost of the large genomic space sequenced. In the clinical setting, instead, targeted enrichment panels (gene panels) of various genomic sizes are emerging as the routine technology for TMB assessment. This package attempts to develop a new pathway-based gene panel for TMB assessment (pathway-based tumor mutational burden, PTMB), using somatic mutations files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format. Besides, we develop a multiple machine learning method using sample PTMB profiles to identify cancer-specific dysfunction pathways, which can be a biomarker of prognostic and predictive for cancer immunotherapy.

2 MAF field requirements

MAF files contain many fields ranging from chromosome names to cosmic annotations. However, most of the analysis in PTMB uses the following fields.

Mandatory fields:
Hugo_Symbol,
Variant_Classification,
Variant_Type,
Tumor_Sample_Barcode.

Complete specification of MAF files can be found on NCI GDC documentation page.

3 Installation

install.packages("pathwayTMB")
library(pathwayTMB)

4 Overview of the package

The pathwayTMB package is a Bioinformatics tool to identify Cancer-specific pathways. And pathwayTMB functions can be categorized into mainly Visualization and Analysis modules. Each of these functions and a short description is summarized as shown below:

1.Obtain survival benefit mutations frequency matrix.
2.Calculate the length of encoded genes.
3.Infer pathway-based tumor mutational burden profiles (PTMB).
4.Identify Cancer-specific dysfunction pathways.
5.Visualization results:
5.1 Plot Patients’ Kaplan-Meier Survival Curves.
5.2 Plot patient-specific dysfunction pathways and user-interested geneset mutually exclusive and co-occurrence plots.
5.3 Plot patient-specific dysfunction pathways’ waterfall plots.
5.4 Plot the ROC curve.

4.1 Obtain survival-related genes’ mutation frequency matrix

We downloaded patients’ mutation data from cBioPortal or other open source databases in MAF format. For each sample’s mutation, we converted MAF format data into a mutation matrix, in which every row represents the gene and every column represents the sample. In our study, we only extract the nonsilent somatic mutations (nonsense mutation, missense mutation, frame-shift indels, splice site, nonstop mutation, translation start site, inframe indels) in protein-coding regions. Then a univariate Cox regression test is performed to identify the mutant genes with survival benefit using the sample’s mutant and survival data. The function get_mut_matrix in the pathwayTMB package can implement the above process. Take simulated melanoma data as an example, the command lines are as follows:

#get the path of the mutation annotation file and samples' survival data
maf<-system.file("extdata","data_mutations_extended.txt",package = "pathwayTMB")
sur_path<-system.file("extdata","sur.csv",package = "pathwayTMB")
sur<-read.csv(sur_path,header=TRUE,row.names = 1)
#perform the function 'get_mut_matrix'
#mut_matrix<-get_mut_matrix(maffile=maf,is.TCGA=FALSE,mut_fre=0,nonsynonymous = TRUE,   cut_fisher.pval=1,cut_oddsRatio=1,sur=sur)
#view the first six lines of mutation matrix
head(mut_matrix)[1:6,1:6]
#>         Pat03 Pat06 Pat07 Pat08 Pat100 Pat101
#> DNAH5       1     0     0     0      0      0
#> DOCK2       1     0     0     0      0      0
#> COL12A1     1     0     0     0      0      0
#> COL11A1     1     0     0     0      0      0
#> RSBN1       1     0     0     0      0      0
#> SMG5        1     0     0     0      0      0

4.2 Calculate coding genes’ length using gene transfer format (GTF) file.

The function get_gene_length can extract coding genes’ length from gene transfer format (GTF) file. The GTF file can download from GENCODE.

#calculate coding genes' length, filepath--the path of the GTF file
#get_gene_length(filepath)

4.3 Infer pathway-based tumor mutational burden profiles–PTMB.

To identify cancer-specific pathways, PTMB requires sample mutation data with survival information. For each pathway, PTMB is defined as pathway tumor mutational burden corrected by genes’ length and number, the calculation formula is as follows:

\[PTMB_{ij}=\frac{\sum_{k=1}^N\frac{mut_{ijk}}{length_{jk}}}{N}\tag{1}\]
where \(mut_{ij}\) is the \(i\) th sample’s number of mutations of the \(k\) th gene in the \(j\) th pathway; \(length_{jk}\) is the length of \(k\) th gene in the \(j\) th pathway; \(N\) is the total number of genes in pathway \(j\).The \(PTMB_{ij}\) denotes the pathway \(j\) tumor mutational burden for sample \(i\), which was calculated in the context of survival-related genes’ mutation frequency matrix.

#perform the function `get_PTMB`
#PTMB_matrix<-get_PTMB(freq_matrix=mut_matrix,genesmbol=genesmbol,gene_path=gene_path)
#show the first six lines of PTMB matrix
head(PTMB_matrix)[1:6,1:6]
#>                                           Pat03   Pat06     Pat07 Pat08
#> Hematopoietic cell lineage            0.0000000 0.00000 0.0000000     0
#> Complement and coagulation cascades   0.0000000 0.00000 4.1867281     0
#> Platelet activation                   0.0000000 0.00000 2.1018828     0
#> Toll-like receptor signaling pathway  0.0000000 0.00000 0.8930421     0
#> NOD-like receptor signaling pathway   0.4252838 2.25137 2.8148339     0
#> RIG-I-like receptor signaling pathway 0.0000000 0.00000 2.6852846     0
#>                                         Pat100   Pat101
#> Hematopoietic cell lineage            0.000000 0.000000
#> Complement and coagulation cascades   3.145643 0.000000
#> Platelet activation                   3.938654 0.000000
#> Toll-like receptor signaling pathway  0.000000 0.000000
#> NOD-like receptor signaling pathway   1.962122 0.000000
#> RIG-I-like receptor signaling pathway 0.000000 3.281809

4.4 Identify cancer-specific pathways

In order to identify cancer-specific dysfunction pathways, we develop a multiple machine learning method using pathway-based tumor mutational burden profiles. The multiple machine learning method consists of three sequential steps, as follows:

Wilcoxon test is used to tentatively identify differentially PTMB pathways between death and alive sample groups.
Then random forest regression model is performed for further dysfunction pathways screening.
Lasso+cox regression model is also performed for feature selection.

The function get_final_signature in the pathwayTMB package can implement the above process.

# filter the survival-related pathways
#set.seed(1)
#final_character<-get_final_signature(PTMB=PTMB_matrix,sur=sur)
#view the final_character
final_character
#>                                          a1 
#> "Natural killer cell mediated cytotoxicity" 
#>                                          a4 
#>       "Antigen processing and presentation"

4.5 Visualization results

The function ‘plotKMcurves’ uses to draw Kaplan-Meier survival curves based on PTMB-related risk score. The risk score is generated by the signature’s PTMB and the coefficient of ‘Univariate’ or ‘Multivariate’ cox regression. The command lines are as follows:

#Drawing Kaplan Meier Survival Curves using the final survival-related PTMB.
plotKMcurves(t(PTMB_matrix[final_character,]),sur=sur,returnAll = FALSE,risk.table = TRUE)

The function ‘plotMutInteract’ uses Fisher Exact tests to detect mutually exclusive, co-occurring, and altered genesets or pathways, then visualize the results.

#a mutually exclusive co-occurrence chart showing the top 20 genes for mutation rates
gene_fre<-apply(mut_matrix,1,function(x){length(which(x!=0))/length(x)})
genes<-names(sort(gene_fre,decreasing = TRUE))[1:20]
plotMutInteract(freq_matrix=mut_matrix, genes=genes,returnAll = FALSE)

#pathways' mutually exclusive co-occurrence chart
plotMutInteract(freq_matrix=PTMB_matrix,genes=final_character,
                nShiftSymbols =0.3,returnAll = FALSE)

The function ‘GenePathwayOncoplots’ takes output generated by read.maf and draws a GenePathwayOncoplots.

#calculate the PTMB-related riskscore
riskscore<-plotKMcurves(t(PTMB_matrix[final_character,]),sur=sur,plots = FALSE)$risk_score
cut_off<-median(riskscore)
#draw an GenePathwayOncoplots
GenePathwayOncoplots(maffile=maf,freq_matrix =mut_matrix,risk_score=riskscore,    
cut_off=cut_off,final_character=final_character,gene_path = gene_path,removeNonMutated = FALSE)
#> Natural killer cell mediated cytotoxicity not found in MAF. Excluding it..
#> Antigen processing and presentation not found in MAF. Excluding it..

#> -Reading
#> -Validating
#> -Silent variants: 26 
#> -Summarizing
#> --Possible FLAGS among top ten genes:
#>   MUC16
#>   USH2A
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 0.190s elapsed (0.170s cpu) 
#> -Reading
#> -Validating
#> -Silent variants: 26 
#> -Summarizing
#> --Possible FLAGS among top ten genes:
#>   MUC16
#>   USH2A
#> -Processing clinical data
#> -Finished in 0.170s elapsed (0.170s cpu) 
#> --Possible FLAGS among top ten genes:
#>   MUC16
#>   USH2A
#> -Processing clinical data

The function ‘plotROC’ uses to plot a ROC curve.

#get the path of samples' immunotherapy response data
res_path<- system.file("extdata","response.csv",package = "pathwayTMB")
response<-read.csv(res_path,header=TRUE,stringsAsFactors =FALSE,row.name=1)
plotROC(riskscore=riskscore,response=response,main="Objective Response",print.auc=TRUE,grid = TRUE)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls > cases

pathwayTMB:Pathway Based Tumor Mutational Burden

Xiangmei Li,Junwei Han

2022-08-01