Analyzing Proteomics UPS1 Spike-in Experiments (Example Ramus 2016 Dataset)

Introduction

This vignette complements the more basic vignette ‘Getting started with wrProteo’ from this package (wrProteo) and shows in more detail how UPS1 spike-in experiments may be analyzed using wrProteo.

Furthermore, wrMisc, wrGraph and RColorBrewer from CRAN as well as the Bioconductor package limma (for its moderated statistical testing) will be used internally.

So, to get started on a fresh session of R, you might have to install the following packages:

## This is R code, you can run this to redo all analysis presented here.
install.packages("wrMisc")
## These packages are used for the graphics
install.packages("wrGraph")
install.packages("RColorBrewer")
if(!requireNamespace("knitr", quietly=TRUE)) install.packages("knitr")

## Installation of limma from Bioconductor
if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
BiocManager::install("limma")

## now all dependencies are installed...
install.packages("wrProteo")

## You can also see all vignettes for this package by typing :
browseVignettes("wrProteo")    #  ... and then select the html output

As you will see in the interactive window from browseVignettes(), this package has 2 vignettes, a more general introductory vignette (mentioned above) and this UPS1 dedicated vignette.

Now let’s load the packages needed :

## Let's assume this is a fresh R-session
library(knitr)
library(wrMisc)
library(wrGraph)
library(wrProteo)

# Version number for wrProteo :
packageVersion("wrProteo")
#> [1] '2.0.0.2'

Experimental Setup For Benchmark Tests

The main aim of the experimental setup using heterologous spike-in experiments is to provide a framework to test identification and quantitation procedures in proteomics. The overall idea is to provide samples where the amount of a few proteins varies in a very controlled way, ie it is exactly known in advance which proteins vary and by how much. The easiest way to obtain such samples takes advantage of the fact that proteins from different species usually differ enough in sequence so that mass spectrometry experiments can distinguish their species of origin.

The experiment reanalyzed here used a base (‘matrix’) of yeast protein extract that is constant in all samples. Then, a commercial collection of 48 purified human proteins was added in different, well documented amounts to the constant yeast protein extract. For this purpose the UPS1 preparation, commercially available from Sigma-Aldrich (www.sigmaaldrich.com), is frequently used.

In terms of ROC curves (see also ROC on Wikipedia) the spike-in proteins are expected to show up as true positives (TP). In contrast, since all yeast proteins were added in the same quantity to the same samples, they should be observed as constant, ie as true negatives (TN) when looking for proteins changing abundance.
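The TP/TN terminology above can be illustrated with a minimal base R sketch. All values below (the spike labels, the adjusted p-values and the 0.05 cut-off) are invented for illustration and are not taken from the Ramus dataset.

```r
## Hypothetical toy data: 6 proteins, the first 2 are spiked-in (UPS1), the others are yeast
isSpike <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
adjP    <- c(0.001, 0.20, 0.03, 0.40, 0.70, 0.90)    # invented adjusted p-values
called  <- adjP < 0.05                                # proteins declared as changing

confus <- c(TP = sum(called & isSpike),     # spike-in protein detected as changing
            FP = sum(called & !isSpike),    # yeast protein wrongly called as changing
            FN = sum(!called & isSpike),    # spike-in protein missed
            TN = sum(!called & !isSpike))   # yeast protein correctly seen as constant
confus
```

Repeating such counts over a range of cut-off values gives the points of a ROC curve.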

The specific dataset used here (see also the next section, The Ramus Data-Set) is not so recent and better performing mass spectrometers have become available in the meantime. Thus, for addressing scientific questions concerning the comparison and choice of quantification software it is suggested to perform similar comparisons on more recent datasets. The main aim of this vignette is to show how such comparisons can be performed using wrProteo.

The Ramus Data-Set

The data used in this vignette was published with the article : Ramus et al 2016 “Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset” in J Proteomics 2016 Jan 30;132:51-62.

This dataset is available on PRIDE as PXD001819 (and on ProteomeXchange).

Briefly, this experiment aims to evaluate and compare various quantification approaches using the heterologous spike-in UPS1 (available from Sigma-Aldrich) in yeast protein extract as constant matrix. 9 different concentrations of the heterologous spike-in (UPS1) were run in triplicate. The proteins were initially digested by trypsin and then analyzed by LC-MS/MS in DDA mode.

As described in more detail in the reference, this dataset was generated using an LTQ-Orbitrap; in the meantime more powerful and more precise mass spectrometers have become available. Thus, scientific questions about the comparison and choice of quantification software may be better addressed using more recent datasets.

Meta-Data Describing The Experiment (sdrf)

The project Proteomics Sample Metadata Format aims to provide a uniform format for documenting experimental meta-data (sdrf-format). The meta-data of experiments that are already integrated can be read directly using wrProteo.

Either you download the meta-data as file ‘sdrf.tsv’ from Pride/PXD001819, or you may read file ‘PXD001819.sdrf.tsv’ directly from github/bigbio.

## Read meta-data from  github.com/bigbio/proteomics-metadata-standard/
pxd001819meta <- readSdrf("PXD001819")
#> readSdrf : Successfully read 27 annotation columns for 27 samples

## The concentration of the UPS1 spike-in proteins in the samples
if(length(pxd001819meta) >0) {
  UPSconc <- sort(unique(as.numeric(wrMisc::trimRedundText(pxd001819meta$characteristics.spiked.compound.))))  # trim to get to 'essential' info
} else {
  UPSconc <- c(50, 125, 250, 500, 2500, 5000, 12500, 25000, 50000)       # in case access to github failed
}
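Since the spike-in concentrations are known, the expected log2 ratio of every pair-wise comparison follows directly from them. The base R sketch below uses the fallback concentrations listed above; the object names (UPSconcEx, concPairs, expLog2FC) are introduced here for illustration only.

```r
## Expected log2 ratios for all pair-wise comparisons of UPS1 concentrations
UPSconcEx <- c(50, 125, 250, 500, 2500, 5000, 12500, 25000, 50000)   # same values as the fallback above
concPairs <- utils::combn(UPSconcEx, 2)                # all pairs, lower concentration in row 1
expLog2FC <- log2(concPairs[2, ] / concPairs[1, ])     # higher vs lower concentration
names(expLog2FC) <- paste(concPairs[2, ], concPairs[1, ], sep = "/")
length(expLog2FC)       # number of pair-wise comparisons
range(expLog2FC)        # smallest and largest expected log2 ratio
```

With 9 concentrations there are 36 pair-wise comparisons, ranging from a 2-fold change up to the 1000-fold change between 50 and 50000.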

The import-functions used later in this vignette can directly download the spike-in metadata if the associated PXD-accession-number is provided.

Key Elements And Additional Functions

## A few elements and functions we'll need later on
methNa <- c("ProteomeDiscoverer","MaxQuant","Proline")
names(methNa) <- c("PD","MQ","PL")

In this project the old version of the accession-number of UBB (which has been withdrawn by the database in the meantime) and the protein-sequence as originally cited by Sigma-Aldrich (www.sigmaaldrich.com/) have been used. So we’ll ‘retrograde’ the output from ‘getUPS1acc()’.

spikeType <- "UPS1"            ## information about spike used 
matrixType <- "Saccharomyces cerevisiae"   ## information about matrix used  (user-provided)
matrixType2 <- wrProteo::inspectSpeciesIndic(matrixType)               # check & make uniform
sdrfNa <- "PXD001819"                     # name of sdrf to use

## The accession numbers for the UPS1 proteins
UPS1 <- getUPS1acc(updated=FALSE)
## global information about what could be contaminants
contaInf <- c("Bos tauris|Gallus", MQ="CON_|LYSC_CHICK")

## additional functions
replSpecType <- function(x, annCol="SpecType", replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")), silent=TRUE) {
  ## rename $annot[,"SpecType"] to more specific names
  fxNa <- "replSpecType"
  chCol <- annCol[1] %in% colnames(x$annot)
  if(chCol) { chCol <- which(colnames(x$annot)==annCol[1])
    chIt <- replBy[,1] %in% unique(x$annot[,chCol])    # check items to replace if present
    if(any(chIt)) for(i in which(chIt)) {
      useLi <- which(x$annot[,chCol] %in% replBy[i,1])
      if(!silent) message(fxNa," replacing '",replBy[i,1],"' in lines ",paste(utils::head(useLi), collapse=" ")," ...")   # diagnostic output only when not silent
      x$annot[useLi,chCol] <- replBy[i,2] }
  } else if(!silent) message(fxNa," 'annCol' not found in x$annot !")
  x }

plotConcHist <- function(mat, ref, refColumn=3:4, matCluNa="cluNo", lev=NULL, ylab=NULL, tit=NULL) {
  ## plot histogram like counts of UPS1 concentrations
  if(is.null(tit)) tit <- "Frequency of UPS1 Concentrations Appearing in Cluster"
  gr <- unique(mat[,matCluNa])
  ref <- ref[,refColumn]
  if(length(lev) <2) lev <- sort(unique(as.numeric(as.matrix(ref))))
  if(length(ylab) !=1) ylab <- "Frequency"
  tbl <- table(factor( as.numeric(ref[which(rownames(ref) %in% rownames(mat)),]), levels=lev))
  graphics::barplot(tbl, las=1, beside=TRUE, main=paste(tit,gr), col=grDevices::gray(0.8), ylab=ylab)
}

plotMultRegrPar <- function(dat, methInd, tit=NULL, useColumn=c("logp","slope","medAbund","startFr"), lineGuide=list(v=c(-12,-10),h=c(0.7,0.75),col="grey"), xlim=NULL,ylim=NULL,subTit=NULL) {
  ## scatter plot logp (x) vs slope (y) for all UPS proteins, symbol by useColumn[4], color by hist of useColumn[3]
  ## dat (array) UPS1 data
  ## useColumn (character) 1st as 'logp', 2nd as 'slope', 3rd as median abundance, 4th as starting best regression from this point
  fxNa <- "plotMultRegrPar"
   #fxNa <- wrMisc::.composeCallName(callFrom,newNa="plotMultRegrPar")
  if(length(dim(dat)) !=3) stop("invalid input, expecting as 'dat' array with 3 dimensions (proteins,Softw,regrPar)")
  if(any(length(methInd) >1, methInd > dim(dat)[2], !is.numeric(methInd))) stop("invalid 'methInd'")
  chCol <- useColumn %in% dimnames(dat)[[3]]
  if(any(!chCol)) stop("argument 'useColumn' does not fit to 3rd dim dimnames of 'dat'")
  useCol <- colorAccording2(dat[,methInd,useColumn[3]], gradTy="rainbow", revCol=TRUE, nEndOmit=14)
  graphics::plot(dat[,methInd,useColumn[1:2]], main=tit, type="n",xlim=xlim,ylim=ylim)   #col=1, bg.col=useCol, pch=20+lmPDsum[,"startFr"],
  graphics::points(dat[,methInd,useColumn[1:2]], col=1, bg=useCol, pch=20+dat[,methInd,useColumn[4]])
  graphics::legend("topright",paste("best starting from ",1:5), text.col=1, pch=21:25, col=1, pt.bg="white", cex=0.9, xjust=0.5, yjust=0.5)
  if(length(subTit)==1) graphics::mtext(subTit,cex=0.9)
  if(is.list(lineGuide) && length(lineGuide) >0) {if(length(lineGuide$v) >0) graphics::abline(v=lineGuide$v,lty=2,col=lineGuide$col)
    if(length(lineGuide$h) >0) graphics::abline(h=lineGuide$h,lty=2,col=lineGuide$col)}
  hi1 <- graphics::hist(dat[,methInd,useColumn[3]], plot=FALSE)
  wrGraph::legendHist(sort(dat[,methInd,useColumn[3]]), colRamp=useCol[order(dat[,methInd,useColumn[3]])][cumsum(hi1$counts)],
    cex=0.5, location="bottomleft", legTit="median raw abundance")  #
}

Protein Identification and Initial Quantification

Multiple algorithms and software implementations have been developed for quantifying label-free proteomics experiments, in particular based on extracted ion chromatograms (XIC). For background information you may look at Wikipedia label-free Proteomics. Here, the use of the output of 3 such implementations for extracting peptide/protein quantifications is shown. These 3 software implementations were run individually using equivalent settings, ie identification based on the same fasta-database, starting at a single peptide with 1% FDR, MS mass tolerance for ion precursors at 0.7 ppm, oxidation of methionines and N-terminal acetylation as fixed as well as carbamidomethylation of cysteines as variable modifications.

Since in this context it is crucial to recognize all UPS1 proteins as such (see also this data-set), the import-functions make use of the specPref argument, which allows defining custom tags. Indeed, simply separating proteins by their species of origin is not sufficient, since common contaminants like human keratin might get considered as UPS1 by error. Most additional arguments to the various import-functions have been kept common for convenient use and for generating output structured the same way.
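The keratin problem can be sketched in a few lines of base R. The headers and the tag patterns below are hypothetical and only mimic the idea of custom tags; they are not the patterns used internally by the wrProteo import-functions.

```r
## Hypothetical fasta-like headers (invented for illustration)
hdr <- c("sp|P00330|ADH1_YEAST Alcohol dehydrogenase 1 OS=Saccharomyces cerevisiae",
         "P02768ups|ALBU_HUMAN_UPS Serum albumin (Chain 26-609) - Homo sapiens",
         "CON__P04264 Keratin, type II cytoskeletal 1 OS=Homo sapiens")

specType <- rep(NA_character_, length(hdr))
specType[grepl("OS=Saccharomyces cerevisiae", hdr, fixed = TRUE)] <- "matrix"
specType[grepl("ups\\||_UPS ", hdr)] <- "spike"     # assumed UPS1 tag in the accession/entry name
specType[grepl("^CON_", hdr)] <- "conta"            # contaminant tag is assigned last and thus wins

## Compare with a naive separation by organism only
data.frame(specType, isHuman = grepl("Homo sapiens", hdr))
```

In this toy example two entries are human, but only one of them is a spike-in protein; the keratin entry is kept apart thanks to its contaminant tag.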

MaxQuant

MaxQuant is free software provided by the Max-Planck-Institute, see also Tyanova et al 2016. Later in this document data from MaxQuant will frequently be abbreviated as MQ.

Typically MaxQuant exports quantitation data at the level of consensus-proteins by default to a folder called txt, as a file called “proteinGroups.txt”. So in a standard case (when the file name has not been changed manually) it is sufficient to provide the path to this file. Of course, you can explicitly point to a specific file, as shown below. The data presented here were processed using MaxQuant version 1.6.10. Files compressed as .gz can be read, too (like in the example below).

path1 <- system.file("extdata", package="wrProteo")
fiNaMQ <- "proteinGroups.txt.gz"

## We need to define the setup of species
specPrefMQ <- list(conta=contaInf[2], matrix=paste0("OS=",matrixType2), spike=spikeType, sampleNames="sdrf") #fasta=fastaFi, 
dataMQ <- readMaxQuantFile(file=fiNaMQ, path=path1, refLi="mainSpe", specPref=specPrefMQ, 
  sdrf=c(sdrfNa,"max",sdrfOrder=TRUE,skipCol=1), suplAnnotFile=TRUE, plotGraph=FALSE, silent=TRUE)   # , refLi=useRefLi
#> .readCsvTxt Importing table:  nCol= 1, 1 and 52   ie, best import : 52 cols

The data were imported, log2-transformed and median-normalized; the protein annotation was parsed to automatically extract IDs, protein-names and species information. The species annotation was extracted out of the fasta-headers, as given in the specPref argument (MaxQuant specific setting). As explained in more detail in the general vignette wrProteoVignette1, in this example we use only proteins annotated as the main species (ie the yeast matrix, Saccharomyces cerevisiae) for determining the normalization-factors via the argument refLi (here refLi="mainSpe").
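The principle of median-normalization relative to a reference subset of proteins can be sketched in base R as follows. This is a simplified illustration on simulated values; it is not the code run by the import-functions, and the choice of rows 1 to 4 as reference proteins is an assumption made for the example.

```r
set.seed(1)
## Hypothetical raw intensities: 6 proteins x 3 samples; sample 3 was loaded 2x more
raw <- matrix(2^rnorm(18, mean = 20), nrow = 6,
              dimnames = list(paste0("prot", 1:6), paste0("s", 1:3)))
raw[, 3] <- raw[, 3] * 2
refRows <- 1:4                                     # assumed reference (eg matrix/yeast) proteins

logDat  <- log2(raw)
refMed  <- apply(logDat[refRows, ], 2, median, na.rm = TRUE)   # per-sample median of reference proteins
normDat <- sweep(logDat, 2, refMed - mean(refMed))             # subtract per-sample shift
refMedNorm <- apply(normDat[refRows, ], 2, median)             # identical across samples after normalization
refMedNorm
```

All proteins of a sample are shifted by the same amount, but the shift is determined only from the reference proteins, so that the varying spike-in proteins do not influence the normalization-factors.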

If you wish to inspect the graphs for the distribution of abundance values for each sample before and after median-normalization, please set the argument plotGraph=TRUE (default). Please note that in the example above we directly added information about the experimental setup from the sdrf repository and asked for arranging the order of samples as they appear in the sdrf.

## The number of lines and columns
dim(dataMQ$quant)
#> [1] 1104   27
## A quick summary of some columns of quantitation data
summary(dataMQ$quant[,1:7])                # the first 7 cols
#>   12500amol_1     12500amol_2     12500amol_3      125amol_1    
#>  Min.   :17.54   Min.   :15.66   Min.   :14.93   Min.   :15.18  
#>  1st Qu.:22.51   1st Qu.:22.49   1st Qu.:22.50   1st Qu.:22.39  
#>  Median :23.46   Median :23.46   Median :23.46   Median :23.43  
#>  Mean   :23.69   Mean   :23.65   Mean   :23.67   Mean   :23.60  
#>  3rd Qu.:24.81   3rd Qu.:24.76   3rd Qu.:24.77   3rd Qu.:24.82  
#>  Max.   :30.29   Max.   :30.27   Max.   :30.32   Max.   :30.26  
#>  NA's   :98      NA's   :100     NA's   :104     NA's   :114    
#>    125amol_2       125amol_3      25000amol_1   
#>  Min.   :14.85   Min.   :14.93   Min.   :15.82  
#>  1st Qu.:22.36   1st Qu.:22.40   1st Qu.:22.53  
#>  Median :23.42   Median :23.44   Median :23.53  
#>  Mean   :23.59   Mean   :23.62   Mean   :23.74  
#>  3rd Qu.:24.81   3rd Qu.:24.79   3rd Qu.:24.94  
#>  Max.   :30.26   Max.   :30.29   Max.   :30.27  
#>  NA's   :106     NA's   :114     NA's   :109
table(dataMQ$annot[,"SpecType"], useNA="always")
#> 
#> mainSpecies       spike        <NA> 
#>        1047          48           9
table(dataMQ$annot[,"Species"], useNA="always")
#> 
#>            Gallus gallus             Homo sapiens             Mus musculus 
#>                        1                       49                        1 
#> Saccharomyces cerevisiae               Sus scrofa                     <NA> 
#>                     1047                        1                        5

Now we can summarize the presence of UPS1 proteins after treatment by MaxQuant : In sum, 47 UPS1 proteins were found, 1 is missing.

ProteomeDiscoverer

ProteomeDiscoverer is commercial software from ThermoFisher (www.thermofisher.com). Later in this document data from ProteomeDiscoverer will frequently be abbreviated as PD.

With the data (see also this data-set) used here, the identification was performed using the XCalibur module of ProteomeDiscoverer version 2.4. Quantitation data at the level of consensus-proteins can be exported to tabulated text files, which can be treated by the function shown below. The resultant data were exported in tabulated format and the file was automatically named ‘_Proteins.txt’ by ProteomeDiscoverer (the option R-headers was checked, however this option is not mandatory). Files compressed as .gz can be read, too (like in the example below).

Note, since ProteomeDiscoverer frequently does not provide customized column-names, we’ll use the information from the sdrf-file (see argument sdrf=…). However, in this case the 1st suitable column of the sdrf doesn’t allow extracting useful names either, so we’ll tell the import-function to skip this 1st column of the sdrf when looking for a column indicating a maximum number of groups. Since the order of samples may differ between quantification-software, we’ll also ask the import-function to adjust the order of samples to the sdrf (see argument sdrf=c(sdrfNa,"max",sdrfOrder=TRUE,skipCol=1)).

path1 <- system.file("extdata", package="wrProteo")
fiNaPd <- "pxd001819_PD24_Proteins.txt.gz"
## Next, we define the setup of species
specPrefPD <- list(conta=contaInf[1], mainSpecies=matrixType2, spike=spikeType, sampleNames="sdrf", lowNumberOfGroups=FALSE)   # fasta=fastaFi,
dataPD <- readProteomeDiscovererFile(file=fiNaPd, path=path1,  refLi="mainSpe", specPref=specPrefPD, sdrf=c(sdrfNa,"max",sdrfOrder=TRUE,skipCol=1), suplAnnotFile=TRUE, plotGraph=FALSE, silent=TRUE)   # refLi=useRefLi
#> .readCsvTxt Importing table:  nCol= 1, 1, 10 and 10   ie, best import : 10 cols

The data were imported, log2-transformed and median-normalized; the protein annotation was parsed to automatically extract IDs, protein-names and species information. Please note that quantitation data exported from ProteomeDiscoverer frequently have very generic column-names (increasing numbers). When calling the import-function they can be replaced by more meaningful names either using the argument sampNa, or from reading the default annotation in the file ‘InputFiles.txt’ or, finally, from the sdrf-annotation. In the example above both the default annotation as file ‘InputFiles.txt’ and the sdrf annotation are available and were integrated into the object produced by the import-function.

The species annotation was extracted as given in the specPref argument. In this example we use only proteins annotated as the main species (ie the yeast matrix, Saccharomyces cerevisiae) for determining the normalization-factors via the argument refLi (here refLi="mainSpe").

If you wish to inspect the graphs for the distribution of abundance values for each sample before and after median-normalization, please set the argument plotGraph=TRUE (default).

## The number of lines and columns
dim(dataPD$quant)
#> [1] 1296   27
## A quick summary of some columns of quantitation data
summary(dataPD$quant[,1:7])        # the first 7 cols
#>   12500amol_R1    12500amol_R2    12500amol_R3     125amol_R1   
#>  Min.   :10.86   Min.   :11.33   Min.   :11.57   Min.   :10.90  
#>  1st Qu.:18.32   1st Qu.:18.31   1st Qu.:18.31   1st Qu.:18.22  
#>  Median :19.48   Median :19.46   Median :19.45   Median :19.37  
#>  Mean   :19.62   Mean   :19.62   Mean   :19.60   Mean   :19.52  
#>  3rd Qu.:20.84   3rd Qu.:20.83   3rd Qu.:20.80   3rd Qu.:20.84  
#>  Max.   :26.36   Max.   :26.36   Max.   :26.39   Max.   :26.43  
#>  NA's   :69      NA's   :59      NA's   :67      NA's   :88     
#>    125amol_R2      125amol_R3     25000amol_R1  
#>  Min.   :10.48   Min.   :11.16   Min.   :11.55  
#>  1st Qu.:18.20   1st Qu.:18.20   1st Qu.:18.35  
#>  Median :19.40   Median :19.40   Median :19.53  
#>  Mean   :19.51   Mean   :19.53   Mean   :19.67  
#>  3rd Qu.:20.82   3rd Qu.:20.83   3rd Qu.:21.01  
#>  Max.   :26.39   Max.   :26.46   Max.   :26.40  
#>  NA's   :99      NA's   :86      NA's   :64
table(dataPD$annot[,"SpecType"], useNA="always")
#> 
#> mainSpecies       spike        <NA> 
#>        1239          48           9
table(dataPD$annot[,"Species"], useNA="always")
#> 
#>            Gallus gallus             Homo sapiens Saccharomyces cerevisiae 
#>                        1                       44                     1239 
#>                     <NA> 
#>                       12

Confirming the presence of UPS1 proteins by ProteomeDiscoverer:

Now we can summarize the presence of UPS1 proteins after treatment by ProteomeDiscoverer : In sum, 47 UPS1 proteins were found, 1 is missing.

Proline

Proline is open-source software provided by the Profi-consortium (see also proline-core on github), published by Bouyssie et al 2020. Later in this document data from Proline will frequently be abbreviated as PL.

Protein identification in Proline gets performed by SearchGUI, see also Vaudel et al 2015. In this case X!Tandem (see also Duncan et al 2005) was used as search engine.

Quantitation data at the level of consensus-proteins can be exported from Proline as .xlsx or tabulated text files, both formats can be treated by the import-functions shown below. Here, Proline version 1.6.1 was used with addition of Percolator (via MS-Angel from the same authors).

path1 <- system.file("extdata", package="wrProteo")
fiNaPl <- "pxd001819_PL.xlsx"

specPrefPL <- list(conta=contaInf[1], mainSpecies=matrixType2, spike=spikeType, sampleNames="sdrf")   # fasta=fastaFi,    # same as PD
dataPL <- readProlineFile(file=fiNaPl, path=path1, specPref=specPrefPL, sdrf=c(sdrfNa,"max",sdrfOrder=TRUE,skipCol=1), suplAnnotFile=TRUE, plotGraph=FALSE, silent=TRUE)   # refLi=useRefLi

The (log2-transformed) data were imported and median-normalized; the protein annotation was parsed to automatically extract IDs, protein-names and species information. The species annotation was extracted out of protein annotation columns, as specified with the specPref argument. As explained in more detail in the general vignette wrProteoVignette1, the argument refLi allows using only a subset of proteins (in this vignette the main species, ie Saccharomyces cerevisiae) for determining the normalization-factors.

## The number of lines and columns
dim(dataPL$quant)
#> [1] 1186   27
## A quick summary of some columns of quantitation data
summary(dataPL$quant[,1:8])        # the first 8 cols
#>     25fmol_1        25fmol_2        25fmol_3       250amol_1    
#>  Min.   :14.17   Min.   :14.03   Min.   :13.16   Min.   :14.49  
#>  1st Qu.:19.90   1st Qu.:19.90   1st Qu.:19.90   1st Qu.:19.81  
#>  Median :21.70   Median :21.70   Median :21.70   Median :21.70  
#>  Mean   :21.76   Mean   :21.77   Mean   :21.74   Mean   :21.80  
#>  3rd Qu.:23.59   3rd Qu.:23.59   3rd Qu.:23.57   3rd Qu.:23.75  
#>  Max.   :29.38   Max.   :29.39   Max.   :29.35   Max.   :29.53  
#>  NA's   :43      NA's   :43      NA's   :45      NA's   :46     
#>    250amol_2       250amol_3        50fmol_1        50fmol_2    
#>  Min.   :13.61   Min.   :14.82   Min.   :14.03   Min.   :14.70  
#>  1st Qu.:19.83   1st Qu.:19.82   1st Qu.:19.87   1st Qu.:19.82  
#>  Median :21.70   Median :21.70   Median :21.70   Median :21.70  
#>  Mean   :21.80   Mean   :21.78   Mean   :21.75   Mean   :21.72  
#>  3rd Qu.:23.75   3rd Qu.:23.70   3rd Qu.:23.59   3rd Qu.:23.57  
#>  Max.   :29.54   Max.   :29.47   Max.   :29.30   Max.   :29.32  
#>  NA's   :46      NA's   :48      NA's   :49      NA's   :48
table(dataPL$annot[,"SpecType"], useNA="always")
#> 
#> mainSpecies       spike        <NA> 
#>        1137          48           1
table(dataPL$annot[,"Species"], useNA="always")
#> 
#>             Homo sapiens Saccharomyces cerevisiae               Sus scrofa 
#>                       48                     1137                        1 
#>                     <NA> 
#>                        0

Now we can summarize the presence of UPS1 proteins after treatment by Proline : In sum, 47 UPS1 proteins were found, 1 is missing.

Further Preparation Of Data

For easy and proper comparisons we need to make sure all columns are in the same order. Since we have forced the import-functions to use the initial order of the sdrf, this is already the case.

Next, we’ll replace some missing protein-names:

## Need to address missing ProteinNames (UPS1) due to missing tags in Fasta
#dataPD <- replMissingProtNames(dataPD)
#dataMQ <- replMissingProtNames(dataMQ)
#dataPL <- replMissingProtNames(dataPL)
    table(dataMQ$annot[,"SpecType"])
#> 
#> mainSpecies       spike 
#>        1047          48
    table(dataPD$annot[,"SpecType"])
#> 
#> mainSpecies       spike 
#>        1239          48
    table(dataPL$annot[,"SpecType"])
#> 
#> mainSpecies       spike 
#>        1137          48

## synchronize order of groups
(grp9 <- dataMQ$sampleSetup$level)
#> 12500 amol 12500 amol 12500 amol   125 amol   125 amol   125 amol 25000 amol 
#>          1          1          1          2          2          2          3 
#> 25000 amol 25000 amol  2500 amol  2500 amol  2500 amol   250 amol   250 amol 
#>          3          3          4          4          4          5          5 
#>   250 amol 50000 amol 50000 amol 50000 amol  5000 amol  5000 amol  5000 amol 
#>          5          6          6          6          7          7          7 
#>   500 amol   500 amol   500 amol    50 amol    50 amol    50 amol 
#>          8          8          8          9          9          9
#dataPL$sampleSetup$groups <- dataMQ$sampleSetup$groups <- dataPD$sampleSetup$groups <- grp9  # synchronize order of groups

## extract names of quantified UPS1-proteins
NamesUpsPD <- dataPD$annot[which(dataPD$annot[,"SpecType"]=="spike"), "Accession"]
NamesUpsMQ <- dataMQ$annot[which(dataMQ$annot[,"SpecType"]=="spike"), "Accession"]
NamesUpsPL <- dataPL$annot[which(dataPL$annot[,"SpecType"]=="spike"), "Accession"]

tabS <- mergeVectors(PD=table(dataPD$annot[,"SpecType"]), MQ=table(dataMQ$annot[,"SpecType"]), PL=table(dataPL$annot[,"SpecType"]))
tabT <- mergeVectors(PD=table(dataPD$annot[,"Species"]), MQ=table(dataMQ$annot[,"Species"]), PL=table(dataPL$annot[,"Species"]))
tabS[which(is.na(tabS))] <- 0
tabT[which(is.na(tabT))] <- 0
kable(cbind(tabS[,2:1], tabT), caption="Number of proteins identified, by custom tags, species and software")

Number of proteins identified, by custom tags, species and software
	spike	mainSpecies	Gallus gallus	Homo sapiens	Mus musculus	Saccharomyces cerevisiae	Sus scrofa
PD	48	1239	1	44	0	1239	0
MQ	48	1047	1	49	1	1047	1
PL	48	1137	0	48	0	1137	1

The initial fasta file also contained the yeast strain number, this has been stripped off when using default parameters.

Basic Data Treatment

Structure of Experiment

The global structure of experiments can be provided as sdrf-file and/or from meta-data stored with the experimental data read. For convenience, this information about the groups of replicates was already deduced and can be found (for example) in dataMQ$sampleSetup. Below, a few columns of the sdrf (from dataMQ$sampleSetup$sdrfDat) are shown.

kable(cbind(dataMQ$sampleSetup$sdrfDat[,c(23,7,19,22)], groups=dataMQ$sampleSetup$groups))

comment.data.file.	characteristics.biological.replicate.	comment.technical.replicate.	comment.proteomics.data.acquisition.method.	groups
UPS1_12500amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	12500 amol
UPS1_12500amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	12500 amol
UPS1_12500amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	12500 amol
UPS1_125amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	125 amol
UPS1_125amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	125 amol
UPS1_125amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	125 amol
UPS1_25000amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	25000 amol
UPS1_25000amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	25000 amol
UPS1_25000amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	25000 amol
UPS1_2500amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	2500 amol
UPS1_2500amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	2500 amol
UPS1_2500amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	2500 amol
UPS1_250amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	250 amol
UPS1_250amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	250 amol
UPS1_250amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	250 amol
UPS1_50000amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	50000 amol
UPS1_50000amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	50000 amol
UPS1_50000amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	50000 amol
UPS1_5000amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	5000 amol
UPS1_5000amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	5000 amol
UPS1_5000amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	5000 amol
UPS1_500amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	500 amol
UPS1_500amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	500 amol
UPS1_500amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	500 amol
UPS1_50amol_R1.raw	1	1	NT=Data-dependent acquisition;AC=PRIDE:0000627	50 amol
UPS1_50amol_R2.raw	1	2	NT=Data-dependent acquisition;AC=PRIDE:0000627	50 amol
UPS1_50amol_R3.raw	1	3	NT=Data-dependent acquisition;AC=PRIDE:0000627	50 amol

Normalization

To get more general information about normalization, please refer also to the vignette “Getting started with wrProteo” from this package.

No additional normalization is needed with this particular data-set. All data were already median normalized directly at import to the host proteins (ie Saccharomyces cerevisiae) after importing the initial quantification-output using ‘readMaxQuantFile()’, ‘readProlineFile()’ and ‘readProteomeDiscovererFile()’.

Presence of NA-values

As mentioned in the general vignette of this package, ‘wrProteoVignette1’, it is important to investigate the nature of NA-values. In particular, checking the hypothesis that NA-values originate from very low abundance instances is very important for deciding how to treat NA-values further on.

## Let's inspect NA values from ProteomeDiscoverer as graphic
matrixNAinspect(dataPD$quant, gr=grp9, tit="ProteomeDiscoverer")
#> stableMode : Method='density',  length of x =976, 'bandw' has been set to 44

## Let's inspect NA values from MaxQuant as graphic
matrixNAinspect(dataMQ$quant, gr=grp9, tit="MaxQuant")
#> stableMode : Method='density',  length of x =1142, 'bandw' has been set to 47

## Let's inspect NA values from Proline as graphic
matrixNAinspect(dataPL$quant, gr=grp9, tit="Proline")
#> stableMode : Method='density',  length of x =413, 'bandw' has been set to 28

A key element for understanding the nature of NA-values is to investigate their NA-neighbours. If a given protein has an NA for just one of the 3 replicates, the other two valid quantifications can be considered as NA-neighbours. In the figures above all NA-neighbours are shown in the histogram and their mode is marked by an arrow. One can see that NA-neighbours are predominantly (but not exclusively) part of the lower quantitation values. This supports the hypothesis that NAs occur most frequently with low abundance proteins.
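The notion of NA-neighbours can be made concrete with a small base R sketch. The matrix and the grouping below are invented; the sketch collects the valid values of every protein that is only partially missing within a group of replicates (here any protein with at least one NA and at least one valid value per group, which is an assumption of this example), and it is not the implementation behind matrixNAinspect().

```r
## Hypothetical log2 abundances: 4 proteins x 6 samples (2 groups of 3 replicates)
quant <- rbind(p1 = c(20.1, 20.3, 19.9, 21.0, 21.2, 20.8),
               p2 = c(15.2,   NA, 15.6, 16.1, 16.0, 15.9),
               p3 = c(  NA,   NA, 14.8, 15.0,   NA, 15.3),
               p4 = c(  NA,   NA,   NA, 18.0, 18.2, 18.1))
grp <- rep(c("A", "B"), each = 3)

naNeighbours <- unlist(lapply(unique(grp), function(g) {
  datG <- quant[, grp == g, drop = FALSE]                     # replicates of one group
  nNA  <- rowSums(is.na(datG))
  part <- datG[nNA > 0 & nNA < ncol(datG), , drop = FALSE]    # partially missing proteins only
  part[!is.na(part)]                                          # their valid values = NA-neighbours
}))
sort(naNeighbours)
median(naNeighbours) < median(quant, na.rm = TRUE)   # NA-neighbours sit in the lower range here
```

Proteins that are entirely missing in a group (like p4 in group A) have no NA-neighbours and thus contribute nothing.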

NA-Imputation and Statistical Testing for Changes in Abundance

NA-values represent a challenge for statistical testing. In addition, techniques like PCA don’t allow NAs either.

The number of NAs varies between samples : Indeed, very low concentrations of UPS1 are difficult to detect and contribute largely to the NAs (as we will see later in more detail). Since the amount of yeast proteins (ie the matrix in this setup) stays constant across all samples, yeast proteins should always get detected the same way.

## Let's look at the number of NAs. Is there an accumulated number in lower UPS1 samples ?
tabSumNA <- rbind(PD=sumNAperGroup(dataPD$raw, grp9), MQ=sumNAperGroup(dataMQ$raw, grp9), PL=sumNAperGroup(dataPL$raw, grp9) )
kable(tabSumNA, caption="Number of NAs per group of samples", align="r")

Number of NAs per group of samples
	1	2	3	4	5	6	7	8	9
PD	195	273	209	205	257	220	207	234	272
MQ	302	334	330	282	323	322	297	337	318
PL	131	140	141	137	157	131	139	140	124
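The kind of per-group NA count tabulated above can be sketched with base R only. The small matrix and the group labels are invented for illustration, and this sketch does not reproduce the internals of sumNAperGroup().

```r
## Hypothetical abundance matrix: 3 proteins x 4 samples, 2 groups of 2 replicates
quant <- cbind(a1 = c(20, NA, 15), a2 = c(21, NA, NA),
               b1 = c(20, 18, 15), b2 = c(NA, 18, 16))
grp  <- c("low", "low", "high", "high")
grpF <- factor(grp, levels = unique(grp))             # keep groups in their order of appearance

nNAperGrp <- tapply(colSums(is.na(quant)), grpF, sum) # NAs per sample, then summed per group
nNAperGrp
```

In this toy example the group with the lower spike-in amount accumulates more NAs, which is the pattern one would look for in the table above.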

In the section above we investigated the circumstances of NA-instances and provided evidence that NA-values typically represent proteins with low abundance which frequently ended up as non-detectable (NA). Thus, we hypothesize that (in most cases) proteins reported as NA would have been detected in quantities similar to their NA-neighbours. In consequence, we will model a normal distribution based on the NA-neighbours and use it for substituting the NA-values.

The function testRobustToNAimputation() from this package (wrProteo) allows performing NA-imputation and subsequent statistical testing (after repeated imputation) between all groups of samples (see also the general vignette). One of the advantages of this implementation is that multiple rounds of imputation are run, so that final results (including pair-wise testing) get stabilized against (rare) stochastic effects. For this reason one may also speak of stabilized NA-imputations.
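The idea of drawing replacement values from a normal distribution modelled on the NA-neighbours, and of repeating this several times, can be sketched as follows. This is a deliberately simplified base R illustration with invented numbers; the number of rounds and the use of plain mean and sd are assumptions of the sketch and do not describe the actual algorithm of testRobustToNAimputation().

```r
set.seed(2024)
naNeigh <- c(14.8, 15.0, 15.2, 15.3, 15.6)      # hypothetical NA-neighbour values (log2)
quant   <- c(20.1, NA, 15.6, NA, 18.2)          # hypothetical measurements containing NAs
isNA    <- is.na(quant)
nRounds <- 50                                   # assumed number of repeated imputations

imp <- replicate(nRounds, {
  x <- quant
  x[isNA] <- rnorm(sum(isNA), mean = mean(naNeigh), sd = sd(naNeigh))   # random draws replace NAs
  x })                                          # matrix: values (rows) x rounds (columns)

rowMeans(imp)     # measured values are unchanged, imputed ones are averaged over all rounds
```

A single round depends on the random draws, whereas a summary over many rounds is much less sensitive to an occasional extreme draw.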

The statistical tests used underneath make use of the shrinkage-procedure provided by the empirical Bayes procedure as implemented in the Bioconductor package limma, see also Ritchie et al 2015. In addition, various formats of multiple testing correction can be added to the results : Benjamini-Hochberg FDR (later on referred to as BH or BH-FDR, see FDR on Wikipedia, see also Benjamini and Hochberg 1995), local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008), or modified testing by ROTS, etc … In this vignette we will make use of the BH-FDR.
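
As a small reminder of what the BH-FDR does, here is a sketch using p.adjust() from base R (package stats) on made-up p-values. The p-values are hypothetical and not taken from the datasets of this vignette.

```r
## Hypothetical raw p-values for 8 proteins from one pairwise comparison
pVal <- c(0.0004, 0.003, 0.012, 0.020, 0.041, 0.20, 0.55, 0.90)
bhFdr <- p.adjust(pVal, method="BH")     # Benjamini-Hochberg adjusted p-values
cbind(pVal, bhFdr=round(bhFdr, 4))
sum(bhFdr < 0.05)                        # number of proteins passing 5% BH-FDR
```

Here 5 proteins have a raw p-value below 0.05, but only 4 of them remain below 0.05 after BH-correction.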

We are ready to launch the NA-imputation and testing for data from ProteomeDiscoverer. Please note that the procedure including repetitive NA-imputations may take several seconds.

testPD <- testRobustToNAimputation(dataPD, imputMethod="informed")     # ProteomeDiscoverer

Then for MaxQuant …

testMQ <- testRobustToNAimputation(dataMQ, imputMethod="informed")      # MaxQuant , ok

And finally for Proline :

testPL <- testRobustToNAimputation(dataPL, imputMethod="informed")      # Proline

From these results we’ll use i) the NA-imputed version of our datasets for plotting principal components (PCA) and ii) the (stabilized) testing results for counting TP, FP, etc and to construct ROC curves.

Let’s add the NA-imputed data to our main object :

dataPD$datImp <- testPD$datImp       # recuperate imputed data to main data-object
dataMQ$datImp <- testMQ$datImp
dataPL$datImp <- testPL$datImp

Analysis Using All Proteins Identified (Matrix + UPS1)

In this section we’ll consider all proteins identified and quantified in a pair-wise fashion, using the t-tests already run in the previous section. As mentioned, the experimental setup is very special, since all proteins that are truly changing are known in advance (the UPS1 spike-in proteins). Tables get constructed by counting based on various thresholds for considering given protein abundances as differential or not. A traditional 5 percent FDR cut-off is used for Volcano-plots, while ROC-curves allow inspecting the entire range of potential cut-off values.

Pairwise Testing Summary

A very universal and simple way to analyze data is to check several pairwise comparisons, in particular if the experimental setup does not include complete multifactorial plans.

This UPS1 spike-in experiment (see also Experimental Setup) has 27 samples organized (according to meta-information) as 9 groups. Thus, one obtains in total 36 pair-wise comparisons, which makes displaying all comparisons very crowded. The original publication by Ramus et al 2016 focussed on 3 pairwise comparisons only. In this vignette it is shown how all of them can get considered.
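
The number of 36 comparisons and the expected log2-ratios follow directly from the 9 concentration levels. The short sketch below uses base R only; the vector conc simply lists the 9 UPS1 concentrations (in amol) and the object names are chosen for this example.

```r
## the 9 UPS1 concentrations (in amol)
conc <- c(50, 125, 250, 500, 2500, 5000, 12500, 25000, 50000)
choose(length(conc), 2)                        # number of pairwise comparisons
concPairs <- t(combn(conc, 2))                 # one line per pair (lower concentration first)
colnames(concPairs) <- c("conc1","conc2")
log2rat <- round(log2(concPairs[,"conc2"] / concPairs[,"conc1"]), 3)
## the 3 pairs with the widest expected log2-ratio
head(cbind(concPairs, log2rat)[order(log2rat, decreasing=TRUE), ], 3)
```

The widest comparison (50 vs 50000 amol) corresponds to a 1000-fold difference, ie a log2-ratio of 9.966.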

Now, we’ll construct a table showing all possible pairwise-comparisons. Using the function numPairDeColNames() we can easily extract the UPS1 concentrations as numeric content and show the (log-)ratio of the pairwise comparisons (column ‘log2rat’), the final concentrations (columns ‘conc1’ and ‘conc2’, in amol) and the number of differentially abundant proteins passing 5% FDR, using either classical Benjamini-Hochberg FDR (columns ‘sig.xx.BH’) or lfdr (Strimmer 2008, columns ‘sig.xx.lfdr’).

## The number of differentially abundant proteins passing 5% FDR (ProteomeDiscoverer, MaxQuant and Proline)
signCount <- cbind( sig.PD.BH=colSums(testPD$BH < 0.05, na.rm=TRUE), sig.PD.lfdr=if("lfdr" %in% names(testPD)) colSums(testPD$lfdr < 0.05, na.rm=TRUE),
  sig.MQ.BH=colSums(testMQ$BH < 0.05, na.rm=TRUE), sig.MQ.lfdr=if("lfdr" %in% names(testMQ)) colSums(testMQ$lfdr < 0.05, na.rm=TRUE),
  sig.PL.BH=colSums(testPL$BH < 0.05, na.rm=TRUE), sig.PL.lfdr=if("lfdr" %in% names(testPL)) colSums(testPL$lfdr < 0.05, na.rm=TRUE)  )

  
  
table1 <- numPairDeColNames(testPD$BH, stripTxt="amol", sortByAbsRatio=TRUE)
table1 <- cbind(table1, signCount[table1[,1],])
rownames(table1) <- colnames(testMQ$BH)[table1[,1]]

kable(table1, caption="All pairwise comparisons and number of significant proteins", align="c")

All pairwise comparisons and number of significant proteins
	index	log2rat	conc1	conc2	sig.PD.BH	sig.PD.lfdr	sig.MQ.BH	sig.MQ.lfdr	sig.PL.BH	sig.PL.lfdr
50 amol-50000 amol	33	9.966	50	50000	553	470	340	274	746	675
25000 amol-50 amol	27	8.966	50	25000	595	539	382	343	733	675
125 amol-50000 amol	8	8.644	125	50000	367	285	227	166	721	655
12500 amol-50 amol	12	7.966	50	12500	528	490	316	283	615	557
125 amol-25000 amol	4	7.644	125	25000	348	299	179	167	698	653
250 amol-50000 amol	21	7.644	250	50000	385	317	311	264	703	620
125 amol-12500 amol	1	6.644	125	12500	242	180	105	78	543	486
250 amol-25000 amol	17	6.644	250	25000	233	192	179	138	619	567
50 amol-5000 amol	32	6.644	50	5000	588	524	358	302	551	470
500 amol-50000 amol	35	6.644	500	50000	333	279	203	148	704	639
12500 amol-250 amol	9	5.644	250	12500	138	102	58	39	416	387
2500 amol-50 amol	23	5.644	50	2500	553	529	308	255	490	424
25000 amol-500 amol	28	5.644	500	25000	243	173	143	94	638	582
125 amol-5000 amol	7	5.322	125	5000	267	239	128	94	291	251
12500 amol-500 amol	13	4.644	500	12500	158	99	57	33	423	414
125 amol-2500 amol	3	4.322	125	2500	195	155	89	77	163	112
250 amol-5000 amol	20	4.322	250	5000	167	105	125	89	138	105
2500 amol-50000 amol	26	4.322	2500	50000	340	244	211	168	713	629
250 amol-2500 amol	16	3.322	250	2500	121	91	56	45	106	72
2500 amol-25000 amol	22	3.322	2500	25000	59	44	16	10	599	530
50 amol-500 amol	31	3.322	50	500	437	366	232	182	373	326
500 amol-5000 amol	34	3.322	500	5000	145	108	65	36	133	95
5000 amol-50000 amol	36	3.322	5000	50000	279	215	125	83	609	552
12500 amol-2500 amol	10	2.322	2500	12500	29	19	11	9	330	375
250 amol-50 amol	18	2.322	50	250	360	284	192	144	312	253
2500 amol-500 amol	24	2.322	500	2500	116	78	38	24	96	62
25000 amol-5000 amol	29	2.322	5000	25000	45	30	17	15	446	394
125 amol-500 amol	6	2.000	125	500	22	17	6	2	23	7
12500 amol-50000 amol	15	2.000	12500	50000	221	150	98	61	339	298
125 amol-50 amol	5	1.322	50	125	287	243	150	109	209	179
12500 amol-5000 amol	14	1.322	5000	12500	24	17	4	2	109	88
125 amol-250 amol	2	1.000	125	250	15	10	0	0	3	2
12500 amol-25000 amol	11	1.000	12500	25000	12	9	0	3	87	60
250 amol-500 amol	19	1.000	250	500	4	2	2	0	3	2
2500 amol-5000 amol	25	1.000	2500	5000	5	3	1	0	14	8
25000 amol-50000 amol	30	1.000	25000	50000	145	116	66	58	90	62

resMQ1 <- extractTestingResults(testMQ, compNo=1, thrsh=0.05, FCthrs=2)
resPD1 <- extractTestingResults(testPD, compNo=1, thrsh=0.05, FCthrs=2)
resPL1 <- extractTestingResults(testPL, compNo=1, thrsh=0.05, FCthrs=2)

You can see that in numerous cases many more than the 48 UPS1 proteins showed up as significant, ie yeast proteins supposed to remain constant also showed up in part as ‘significantly changing’. However, some proteins with very low FDR values have a very small log-FC amplitude and will be removed by filtering in the following steps.

par(mar=c(5.5, 4.7, 4, 1))
imageW(table1[,c("sig.PD.BH","sig.MQ.BH","sig.PL.BH" )], col=(RColorBrewer::brewer.pal(9,"YlOrRd")),
  transp=FALSE, tit="Number of BH.FDR passing proteins by the quantification approaches")
mtext("Dark red for high number signif proteins", cex=0.75)

In the original paper by Ramus et al 2016 only 3 pairwise comparisons were further analyzed :

## Selection in Ramus paper
    # c("125amol-250amol","500amol-50amol","5000amol-50amol") %in% rownames(table1) 
kable(table1[which(rownames(table1) %in% c("125 amol-250 amol","50 amol-500 amol","50 amol-5000 amol")),], caption="Selected pairwise comparisons (as in Ramus et al)", align="c")

Selected pairwise comparisons (as in Ramus et al)
	index	log2rat	conc1	conc2	sig.PD.BH	sig.PD.lfdr	sig.MQ.BH	sig.MQ.lfdr	sig.PL.BH	sig.PL.lfdr
50 amol-5000 amol	32	6.644	50	5000	588	524	358	302	551	470
50 amol-500 amol	31	3.322	50	500	437	366	232	182	373	326
125 amol-250 amol	2	1.000	125	250	15	10	0	0	3	2

Here we’ll consider all possible pairwise comparisons, as shown below.

Volcano Plots

Volcano-plots offer additional insight into how statistical test results relate to the log-fold-change of pair-wise comparisons. In addition, we can mark the different protein-groups (or species) by different symbols, see also the general vignette ‘wrProteoVignette1’ (from this package) and the vignette to the package wrGraph. Counting the number of proteins passing a classical threshold for differential expression combined with a filter for minimum log-fold-change is a good way to start.

As mentioned, the dataset from Ramus et al 2016 contains 9 different levels of UPS1 concentrations (Ramus Data Set); in consequence, 36 pair-wise comparisons are possible. Again, plotting all possible Volcano plots would be far too crowded. Instead we’ll try to summarize (see ROC curves), cluster into groups and finally plot only a few representative ones.

ROC for Multiple Pairs

Receiver operating characteristic (ROC) curves display sensitivity (True Positive Rate) versus 1-Specificity (False Positive Rate). They are typically used to illustrate and compare the discriminative capacity of a yes/no decision system (here: differential abundance or not), see eg also the original publication Hand and Till 2001.

The data get constructed by sliding through a panel of threshold-values for the statistical tests instead of just using 0.05. Due to the experimental setup we know that all yeast proteins should stay constant and only UPS1 proteins (see also Experimental Setup) are expected to change. For each of these threshold values one counts the number of true positives (TP), false positives (FP) etc, allowing then to calculate sensitivity and specificity.
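
To make the counting more concrete, here is a minimal base R sketch on a toy example. The adjusted p-values and the spike annotation are hypothetical; within this vignette the corresponding work is done by summarizeForROC(), whose internals may differ.

```r
## Toy example (hypothetical): adjusted p-values and known truth for 10 proteins
fdrVal  <- c(0.001, 0.004, 0.03, 0.20, 0.002, 0.06, 0.30, 0.45, 0.70, 0.95)
isSpike <- c(TRUE,  TRUE,  TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

## panel of threshold values to slide through (including the common 0.05)
thresholds <- sort(unique(c(0, fdrVal, 0.05, 1)))
rocTab <- t(sapply(thresholds, function(thr) {
  called <- fdrVal < thr                        # proteins called 'differential' at this threshold
  TP <- sum(called & isSpike);   FP <- sum(called & !isSpike)
  FN <- sum(!called & isSpike);  TN <- sum(!called & !isSpike)
  c(thr=thr, TP=TP, FP=FP, sens=TP/(TP+FN), spec=TN/(TN+FP)) }))
rocTab[rocTab[,"thr"] == 0.05, ]                # counts at the commonly used 5% threshold
```

At the 5% threshold this toy example gives 3 true positives and 1 false positive, ie a sensitivity of 0.75. Plotting the column ‘sens’ against 1 - ‘spec’ for all thresholds gives the ROC curve.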

In the case of benchmarking quantitation efforts, ROC curves are used to judge how well heterologous spike-in UPS1 proteins can be recognized as differentially abundant while constant yeast matrix proteins should not get classified as differential. Finally, ROC curves let us also gain some additional insights into the question which cutoff may be optimal or if the commonly used 5-percent FDR threshold allows getting the best out of the testing system.

The next step consists in calculating the area under the curve (AUC) for the individual profiles of each pairwise comparison. Below, these calculations of summarizeForROC() are run in batch.
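
For illustration, the area under a ROC curve can be approximated with the trapezoidal rule. The sketch below is written in base R on hypothetical ROC points; the helper aucTrapez() is made up for this example and is not the function AucROC() used in this vignette, which may calculate the area differently.

```r
## Hypothetical ROC points: false positive rate (1 - specificity) and sensitivity
fpr  <- c(0, 0,    0.2, 0.2,  0.6, 1)
sens <- c(0, 0.5,  0.5, 0.75, 1,   1)

## trapezoidal rule: sum of (width x mean height) over consecutive ROC points
aucTrapez <- function(x, y) {
  o <- order(x, y)
  x <- x[o]; y <- y[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2) }
aucTrapez(fpr, sens)
aucTrapez(c(0, 1), c(0, 1))                     # diagonal, ie random classification
```

An AUC close to 1 indicates good discrimination between spike-in and matrix proteins, while the diagonal (random classification) gives 0.5.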

## calculate AUC for each ROC
layout(1)
  #aaa <- summarizeForROC(testPD, useComp=table1[1,1], annotCol="SpecType", spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,debug=TRUE)

rocPD <- lapply(table1[,1], function(x) summarizeForROC(testPD, useComp=x, annotCol="SpecType", 
  spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocMQ <- lapply(table1[,1], function(x) summarizeForROC(testMQ, useComp=x, annotCol="SpecType", 
  spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocPL <- lapply(table1[,1], function(x) summarizeForROC(testPL, useComp=x, annotCol="SpecType", 
  spec=c("mainSpecies","spike"), tyThr="BH", plotROC=FALSE,silent=TRUE))

# we still need to add the names for the pair-wise groups:
names(rocPD) <- names(rocMQ) <- names(rocPL) <- rownames(table1)

AucAll <- cbind(ind=table1[match(names(rocPD), rownames(table1)),"index"], clu=NA,
  PD=sapply(rocPD, AucROC), MQ=sapply(rocMQ, AucROC), PL=sapply(rocPL, AucROC) )

To provide a quick overview, the AUC values are displayed as PCA :

try(biplot(prcomp(AucAll[,names(methNa)]), cex=0.7, main="PCA of AUC from ROC Curves"))

On this PCA one can see the three software types in red. We can see that AUC values from MaxQuant correlate somewhat less with those from Proline and ProteomeDiscoverer (red arrows). The pair-wise ratios constructed from the different concentrations are shown in black. They form a compact area with mostly wide ratios (one rather high and one low concentration of UPS1 proteins). Besides, there are a number of dispersed points, typically containing the concentrations of 125 and/or 250 amol. These dispersed points do not replicate well and follow their own characteristics captured by PC2.

Now we are ready to inspect the 5 clusters in detail :

Grouping of ROC Curves to Display Representative Ones

As mentioned, there are too many pair-wise combinations available for plotting and inspecting all ROC-curves. So we can try to group similar pairwise comparison AUC values into clusters and then easily display representative examples for each cluster/group. Again, we (pre)define that we want to obtain 5 groups (like customer-ratings from 5 to 1 stars); a k-Means clustering approach was chosen.

## number of groups for clustering
nGr <- 5
## K-Means clustering
kMAx <- stats::kmeans(standardW(AucAll[,c("PD","MQ","PL")]), nGr)$cluster
   table(kMAx)
#> kMAx
#>  1  2  3  4  5 
#>  6  6 14  7  3
AucAll[,"clu"] <- kMAx

AucAll <- reorgByCluNo(AucAll, cluNo=kMAx, useColumn=c("PD","MQ","PL"))
AucAll <- cbind(AucAll, iniInd=table1[match(rownames(AucAll), rownames(table1)), "index"])
colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)] <- paste("Auc",colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)], sep=".")
order(tapply(AucAll[,5],AucAll[,6], mean))
#> [1] 5 4 3 2 1

AucAll[,"cluNo"] <- rep(order(tapply(AucAll[,5],AucAll[,6], mean)), table(AucAll[,"cluNo"]))        # make cluNo descending (to mean of geometric mean of AUCs)

kMAx <- AucAll[,"cluNo"]      # update
  table(AucAll[,"cluNo"])
#> 
#>  1  2  3  4  5 
#>  6  6  3  7 14
 ## note : column 'index' is relative to table1, iniInd to ordering inside objects from clustering
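
Please note that k-Means starts from randomly chosen centers, so cluster numbers (and occasionally the assignment of border cases) may vary between runs. The following sketch on a hypothetical toy matrix shows how fixing the seed makes a run reproducible and how clusters may be renumbered by their mean AUC. Here scale() is used as a simple stand-in for standardization; that it behaves like standardW() is an assumption of this example.

```r
## Toy matrix (hypothetical AUC values) : 8 pairwise comparisons x 3 methods
toyAuc <- cbind(PD=c(0.97,0.96,0.95,0.80,0.82,0.55,0.50,0.52),
                MQ=c(0.99,0.98,0.99,0.85,0.83,0.40,0.42,0.45),
                PL=c(0.98,0.97,0.96,0.90,0.91,0.50,0.55,0.52))
set.seed(1)                                     # fix the random start of k-means
clu1 <- stats::kmeans(scale(toyAuc), centers=3, nstart=10)$cluster
set.seed(1)
clu2 <- stats::kmeans(scale(toyAuc), centers=3, nstart=10)$cluster
identical(clu1, clu2)                           # same seed, same cluster assignment

## renumber clusters by their mean AUC so that numbers carry a meaning (1 = weakest)
cluMean <- tapply(rowMeans(toyAuc), clu1, mean)
cluRank <- rank(cluMean)[as.character(clu1)]
cluRank
```

The argument nstart repeats the clustering from several random starts and keeps the best solution, which further reduces the dependence on a single start.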

To graphically summarize the AUC values, the clustered AUC values are plotted accompanied by the geometric mean:

try(profileAsClu(AucAll[,c(1:length(methNa),(length(methNa)+2:3))], clu="cluNo", meanD="geoMean", tit="Pairwise Comparisons as Clustered AUC from ROC Curves",
  xlab="Comparison number", ylab="AUC", meLty=1, meLwd=3))

From this figure we can see clearly that there are some pairwise comparisons where all initial analysis-software results yield high AUC values, while other pairwise comparisons show less discriminative power.
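
As a reminder, the geometric mean of the AUC values of one pairwise comparison can be obtained in base R as shown below. The three AUC values are those of the comparison ‘25000 amol-500 amol’; the object names are chosen for this example only.

```r
## AUC values of one pairwise comparison from the three quantification approaches
aucVal <- c(PD=0.966, MQ=0.997, PL=0.965)
geoMean <- exp(mean(log(aucVal)))               # geometric mean via the mean of logs
round(c(geoMean=geoMean, ariMean=mean(aucVal)), 4)
```

The geometric mean never exceeds the arithmetic mean, so a single weak AUC value pulls it down a little more.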

Again, now we can select a representative pairwise-comparison for each cluster (from the center of each cluster):

AucRep <- table(AucAll[,"cluNo"])[rank(unique(AucAll[,"cluNo"]))]   # representative for each cluster
AucRep <- round(cumsum(AucRep) -AucRep/2 +0.1)

## select representative for each cluster
kable(round(AucAll[AucRep,c("Auc.PD","Auc.MQ","Auc.PL","cluNo")], 3), caption="Selected representative for each cluster ", align="c")

Selected representative for each cluster
	Auc.PD	Auc.MQ	Auc.PL	cluNo
25000 amol-500 amol	0.966	0.997	0.965	5
12500 amol-2500 amol	0.944	1.000	0.914	4
25000 amol-50000 amol	0.813	0.912	0.925	3
250 amol-2500 amol	0.899	0.789	0.942	2
125 amol-250 amol	0.540	0.421	0.510	1
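
The arithmetic behind this selection can be followed with the cluster sizes from above: since the table is sorted by cluster, the last line of each cluster minus half of the cluster size points to a line from the middle of that cluster. The sketch below repeats just this calculation; the object names are chosen for this example.

```r
## cluster sizes in the order the clusters appear in the sorted table (clusters 5 to 1)
cluSize <- c(14, 7, 3, 6, 6)
## last line of each cluster minus half of the cluster size gives a central line
repLine <- round(cumsum(cluSize) - cluSize/2 + 0.1)
repLine
#> [1]  7 18 23 27 33
```

The small offset of 0.1 avoids exact halves : R rounds halves to the even digit (eg round(22.5) gives 22), which would otherwise make the selection less predictable.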

Now we can check if some experimental UPS1 log-fold-changes have a bias for some clusters.

ratTab <- sapply(5:1, function(x) { y <- table1[match(rownames(AucAll),rownames(table1)),]
  table(factor(signif(y[which(AucAll[,"cluNo"]==x),"log2rat"],1), levels=unique(signif(table1[,"log2rat"],1))) )})
colnames(ratTab) <- paste0("\nclu",5:1,"\nn=",rev(table(kMAx)))
layout(1)
imageW(ratTab, tit="Frequency of rounded log2FC in the 5 clusters", xLab="log2FC (rounded)", 
  yLab="log2 Fold-Change", col=c("grey95", RColorBrewer::brewer.pal(8,"YlOrRd")), las=1)
mtext("Dark red for enrichment of given pair-wise ratio", cex=0.7)

We can see that the cluster of best ROC-curves (cluster 5) covers practically all UPS1 log-ratios from this experiment without being restricted just to the high ratios.

Plotting ROC Curves for the Best Cluster (the ‘+++++’)

colPanel <- 2:5
gr <- 5
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all pairwise comparisons in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")

AUC details for best pairwise-comparisons
	cluNo	Auc.PD	Auc.MQ	Auc.PL	log2rat	sig.PD.BH	sig.MQ.BH	sig.PL.BH
250 amol-50000 amol	5	0.981	0.998	0.979	7.644	385	311	703
12500 amol-250 amol	5	0.967	0.998	0.982	5.644	138	58	416
250 amol-25000 amol	5	0.973	0.999	0.974	6.644	233	179	619
500 amol-50000 amol	5	0.971	0.998	0.976	6.644	333	203	704
125 amol-50000 amol	5	0.970	0.998	0.968	8.644	367	227	721
50 amol-50000 amol	5	0.981	0.998	0.953	9.966	553	340	746
25000 amol-500 amol	5	0.966	0.997	0.965	5.644	243	143	638
2500 amol-50000 amol	5	0.976	0.999	0.945	4.322	340	211	713
25000 amol-50 amol	5	0.969	0.996	0.952	8.966	595	382	733
12500 amol-500 amol	5	0.959	0.989	0.968	4.644	158	57	423
12500 amol-50 amol	5	0.962	0.987	0.965	7.966	528	316	615
125 amol-25000 amol	5	0.956	0.993	0.955	7.644	348	179	698
250 amol-5000 amol	5	0.948	0.957	0.989	4.322	167	125	138
500 amol-5000 amol	5	0.948	0.935	0.983	3.322	145	65	133

## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

## This requires package 'wrGraph' at version 1.2.5 (or higher)
if(packageVersion("wrGraph")  >= "1.2.5") {
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...

ROC Curves for 2nd Best Cluster (the ‘++++’)

gr <- 4
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all pairwise comparisons in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '++++' pairwise-comparisons ", align="c")

AUC details for cluster ‘++++’ pairwise-comparisons
	cluNo	Auc.PD	Auc.MQ	Auc.PL	log2rat	sig.PD.BH	sig.MQ.BH	sig.PL.BH
2500 amol-25000 amol	4	0.971	1.000	0.907	3.322	59	16	599
5000 amol-50000 amol	4	0.945	0.991	0.935	3.322	279	125	609
125 amol-12500 amol	4	0.917	0.982	0.961	6.644	242	105	543
12500 amol-2500 amol	4	0.944	1.000	0.914	2.322	29	11	330
12500 amol-50000 amol	4	0.913	0.983	0.942	2.000	221	98	339
25000 amol-5000 amol	4	0.939	0.997	0.886	2.322	45	17	446
50 amol-5000 amol	4	0.923	0.932	0.949	6.644	588	358	551

## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...

ROC Curves for the 3rd Best Cluster (the ‘+++’)

gr <- 3
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all pairwise comparisons in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '+++' pairwise-comparisons ", align="c")

AUC details for cluster ‘+++’ pairwise-comparisons
	cluNo	Auc.PD	Auc.MQ	Auc.PL	log2rat	sig.PD.BH	sig.MQ.BH	sig.PL.BH
12500 amol-5000 amol	3	0.886	0.962	0.871	1.322	24	4	109
25000 amol-50000 amol	3	0.813	0.912	0.925	1.000	145	66	90
12500 amol-25000 amol	3	0.846	0.944	0.827	1.000	12	0	87

## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]],rocMQ[[jR]],rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...

ROC Curves for the 4th Best Cluster (the ‘++’)

gr <- 2
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all pairwise comparisons in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '++' pairwise-comparisons ", align="c")

AUC details for cluster ‘++’ pairwise-comparisons
	cluNo	Auc.PD	Auc.MQ	Auc.PL	log2rat	sig.PD.BH	sig.MQ.BH	sig.PL.BH
125 amol-5000 amol	2	0.903	0.836	0.969	5.322	267	128	291
2500 amol-500 amol	2	0.875	0.827	0.958	2.322	116	38	96
250 amol-2500 amol	2	0.899	0.789	0.942	3.322	121	56	106
125 amol-2500 amol	2	0.795	0.835	0.933	4.322	195	89	163
2500 amol-5000 amol	2	0.813	0.854	0.887	1.000	5	1	14
2500 amol-50 amol	2	0.860	0.770	0.889	5.644	553	308	490

## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...

ROC Curves for the Weakest Cluster 1 (the ‘+’)

gr <- 1
j <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))

## table of all pairwise comparisons in cluster
useLi <- which(AucAll[,"cluNo"]==gr)
tmp <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
  as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for cluster '+' pairwise-comparisons ", align="c")

AUC details for cluster ‘+’ pairwise-comparisons
	cluNo	Auc.PD	Auc.MQ	Auc.PL	log2rat	sig.PD.BH	sig.MQ.BH	sig.PL.BH
250 amol-500 amol	1	0.555	0.434	0.607	1.000	4	2	3
125 amol-500 amol	1	0.506	0.367	0.733	2.000	22	6	23
125 amol-250 amol	1	0.540	0.421	0.510	1.000	15	0	3
50 amol-500 amol	1	0.546	0.374	0.524	3.322	437	232	373
125 amol-50 amol	1	0.510	0.306	0.468	1.322	287	150	209
250 amol-50 amol	1	0.352	0.456	0.439	2.322	360	192	312

## frequent concentrations :
layout(matrix(1:2, ncol=1), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)

## representative ROC
jR <- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
  txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4,ncol=2))
  try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
  try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in min(x): no non-missing arguments to max; returning -Inf
#> UNABLE TO PLOT !!  check plotting device ...

Analysis Focussing on UPS1 Spike-In Proteins Only

We know from the experimental setup that there were 48 UPS1 proteins (see also Experimental Setup) present in the commercial mix added to a constant background of yeast-proteins. The lowest concentrations are extremely challenging and it is no surprise that many of them were not detected at the lowest concentration(s). In order to choose among the various concentrations of UPS1, let’s look at how many NAs are in each group of replicates (ie before NA-imputation), and in particular, the number of NAs among the UPS1 proteins.

Previously we’ve looked at the total number of NAs, now let’s focus just on the UPS1 proteins. Obviously, instances of non-quantified UPS1 proteins make the following comparisons using these samples rather uncertain, since NA-imputation is just an ‘educated guess’.

tab1 <- rbind(PD=sumNAperGroup(dataPD$raw[which(dataPD$annot[,"SpecType"]=="spike"),], grp9),
  MQ=sumNAperGroup(dataMQ$raw[which(dataMQ$annot[,"SpecType"]=="spike"),], grp9),
  PL= sumNAperGroup(dataPL$raw[which(dataPL$annot[,"SpecType"]=="spike"),], grp9)  )
kable(tab1, caption="The number of NAs in the UPS1 proteins", align="c")

The number of NAs in the UPS1 proteins
	1	2	3	4	5	6	7	8	9
PD	0	73	0	1	69	0	0	43	79
MQ	3	113	4	19	109	1	10	98	112
PL	0	32	0	0	34	0	0	20	27

One can see that, starting from the 5th level of UPS1 concentrations, almost all UPS1 proteins were found in nearly all samples. In consequence, we will not use all concentration levels for every analysis; this choice should rather depend on the given protein and quantification method.

Let’s look graphically at the number of NAs in each of the UPS1 proteins along the quantification methods :

countRawNA <- function(dat, newOrd=UPS1$ac, relative=FALSE) {  # count number of NAs per UPS protein and order as UPS
  out <- rowSums(is.na(dat$raw[match(newOrd,rownames(dat$raw)),]))
  if(relative) out/ncol(dat$raw) else out }     # relative : fraction of samples (columns) with NA

sumNAperMeth <- cbind(PD=countRawNA(dataPD), MQ=countRawNA(dataMQ), PL=countRawNA(dataPL) )
UPS1na <- sub("_UPS","", dataPL$annot[(rownames(dataPL$annot) %in% UPS1$acFull),"EntryName"])
par(mar=c(6.8, 3.5, 4, 1))
#imageW(sumNAperMeth, rowNa=UPS1na, tit="Number of NAs in UPS  proteins", xLab="", yLab="",
imageW(sumNAperMeth, tit="Number of NAs in UPS  proteins", xLab="", yLab="",
  transp=FALSE, col=RColorBrewer::brewer.pal(9,"YlOrRd"))
mtext("Dark red for high number of NAs",cex=0.7)

Typically the number of NAs is similar when comparing the different quantitation approaches, though it tends to be a bit higher with MaxQuant. This means that some UPS1 proteins are easier to (detect and) quantify than others. We can conclude that the capacity to successfully quantify a given protein depends on its abundance and its composition.

Similarity by PCA (UPS1 proteins only)

Plotting the principal components (PCA) typically allows gaining an overview of how samples are related to each other. This type of experiment is special for the fact that the majority of proteins is expected to remain constant (yeast matrix), while only the UPS1 proteins (see also Experimental Setup) vary. Since we are primarily interested in the UPS1 proteins, the regular plots of PCA are not shown here, but rather the PCA of the lines identified as UPS1.

Principal component analysis (PCA) cannot handle NA-values. Either all lines with any NAs have to be excluded, or data after NA-imputation have to be used. Here, the option of plotting data after NA-imputation was chosen (in the context of filtering UPS1 lines only, one would lose too many lines, ie proteins). The plots below are made using the function plotPCAw() from the package wrGraph. Via indexing we choose only the lines/proteins with the annotation ‘spike’ (ie UPS1).

PCA of UPS1 for ProteomeDiscoverer

try(plotPCAw(testPD$datImp[which(testPD$annot[,"SpecType"]=="spike"),], sampleGrp=grp9, tit="PCA on ProteomeDiscoverer, UPS1 only (NAs imputed)", rowTyName="proteins", useSymb2=0, silent=TRUE), silent=TRUE)

PCA of UPS1 for MaxQuant

try(plotPCAw(testMQ$datImp[which(testMQ$annot[,"SpecType"]=="spike"),], sampleGrp=grp9, tit="PCA on MaxQuant, UPS1 only (NAs imputed)", rowTyName="proteins", useSymb2=0, silent=TRUE), silent=TRUE)

PCA of UPS1 for Proline

try(plotPCAw(testPL$datImp[which(testPL$annot[,"SpecType"]=="spike"),], sampleGrp=grp9, tit="PCA on Proline, UPS1 only (NAs imputed)", rowTyName="proteins", useSymb2=0, silent=TRUE), silent=TRUE)

Based on the PCA plots one can see that the concentrations 125 - 500 amol are very much alike and detecting differences may perform better when not combining them, as also confirmed by the ROC part later. In the screeplot we can see that the first principal component captures almost all variability. Thus, displaying the 3rd principal component (as done above) finally has no importance.
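
The information shown in a screeplot, ie the proportion of variance captured by each principal component, can also be calculated directly with prcomp() from base R. The sketch below uses simulated toy data (all names and values are hypothetical) where all proteins follow one common concentration trend, similar in spirit to the UPS1 proteins.

```r
set.seed(7)                                      # toy data only
## hypothetical log2-abundances : 40 proteins x 9 samples, all following one common trend
trend <- rep(c(10, 13, 16), each=3)              # 3 groups of 3 replicates
toyImp <- t(sapply(1:40, function(i) trend + rnorm(1, 0, 2) + rnorm(9, 0, 0.2)))
pca <- prcomp(t(toyImp))                         # samples as observations
varExpl <- pca$sdev^2 / sum(pca$sdev^2)          # proportion of variance per component
round(varExpl[1:3], 3)
```

When a single factor (here the simulated concentration trend) dominates, nearly all variance ends up in the first component and further components mainly reflect noise.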

CV of Replicates

In order to have more data available for linear regression modelling it was decided to use UPS1 abundance values after NA-imputation for linear regressions. Previously it was shown that NA values originate predominantly from absent or very low abundance quantitations, which justified replacing NA values by low abundance values in a shrinkage-like fashion.

As a general indicator for data-quality and -usability let’s look at the intra-replicate variability. Here we plot all intra-group CVs (groups defined by UPS1-concentration), either the CVs for all quantified proteins or for the UPS1 proteins only.

In the figure below the complete series (including yeast) is shown on the left side, the human UPS1 proteins only on the right side. Briefly, vioplots show a kernel-estimate for the distribution, in addition, a box-plot is also integrated (see vignette to package wrGraph).

## combined plot : all data (left), Ups1 (right)
layout(1:3)
sumNAinPD <- vector("list", 2*length(unique(grp9)))     # empty list, 2 elements per group (all data, UPS1)
sumNAinPD[2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testPD$datImp, grp9))))
sumNAinPD[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testPD$datImp[which(testPD$annot[,"SpecType"]=="spike"),], grp9))))
names(sumNAinPD)[2*(1:length(unique(grp9))) -1] <-  sub("amol","",unique(grp9))
names(sumNAinPD)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".")
try(vioplotW(sumNAinPD, halfViolin="pairwise", tit="CV Intra Replicate, ProteomeDiscoverer", cexNameSer=0.6))
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)

sumNAinMQ <- vector("list", 2*length(unique(grp9)))     # empty list, 2 elements per group (all data, UPS1)
sumNAinMQ[2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testMQ$datImp, grp9))))
sumNAinMQ[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testMQ$datImp[which(testMQ$annot[,"SpecType"]=="spike"),], grp9))))
names(sumNAinMQ)[2*(1:length(unique(grp9))) -1] <- sub("amol","",unique(grp9))                        # paste(unique(grp9),"all",sep=".")
names(sumNAinMQ)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".")      #paste(unique(grp9),"Ups1",sep=".")
try(vioplotW(sumNAinMQ, halfViolin="pairwise", tit="CV intra replicate, MaxQuant",cexNameSer=0.6))
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)

sumNAinPL <- vector("list", 2*length(unique(grp9)))     # empty list, 2 elements per group (all data, UPS1)
sumNAinPL[2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testPL$datImp, grp9))))
sumNAinPL[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testPL$datImp[which(testPL$annot[,"SpecType"]=="spike"),], grp9))))
names(sumNAinPL)[2*(1:length(unique(grp9))) -1] <-  sub("amol","",unique(grp9))
names(sumNAinPL)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".")
try(vioplotW(sumNAinPL, halfViolin="pairwise", tit="CV Intra Replicate, Proline", cexNameSer=0.6))
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)

The distribution of intra-group CV-values showed (without major surprise) that the highest UPS1 concentrations replicated best. This phenomenon also correlates with the content of NAs in the original data. When imputing NA-values it is a challenge to respect the variability of the respective data (NA-neighbours) before NA-imputation. Many NA-values can be observed at very low UPS1-doses, and too few initial quantitation values may remain for meaningful comparisons. Of course, with an elevated content of NAs the mechanism of NA-substitution will also contribute to masking (in part) the true variability.

In consequence, pair-wise comparisons using one of the higher UPS1-concentration groups are expected to have a decent chance of rather specifically revealing a high number of UPS1 proteins.

One can see that lower concentrations of UPS1 usually have a worse CV (coefficient of variation) in the respective samples.
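For readers less familiar with intra-group CVs, here is a minimal base-R sketch of the underlying calculation (standard deviation divided by mean, per protein and per group of replicates). It is meant as a conceptual illustration only and does not claim to reproduce the internals of rowGrpCV(); the toy values and the names `toy`, `grp` and `grpCV` are hypothetical.

```r
## Toy abundance data (hypothetical, linear scale): 3 proteins, 2 groups of 3 replicates
toy <- matrix(c(100,110, 90,  200,205,195,
                 50, 80, 20,  150,160,140,
                 10, 10, 10,   40, 44, 36), nrow=3, byrow=TRUE,
  dimnames=list(paste0("prot", 1:3), paste0(rep(c("low","high"), each=3), 1:3)))
grp <- rep(c("low","high"), each=3)

## CV = sd / mean, calculated per line (protein) and per group of replicates
grpCV <- sapply(unique(grp), function(g) {
  x <- toy[, grp == g, drop=FALSE]
  apply(x, 1, stats::sd) / rowMeans(x) })
round(grpCV, 3)
```

In this toy example prot2 replicates poorly in the 'low' group (CV of 0.6) but well in the 'high' group, the same type of pattern as described above for low UPS1 concentrations.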

Testing All Individual UPS1 Proteins By Linear Regression

First, we construct a container for storing various measures and results which we will look at later on.

## prepare object for storing all results
datUPS1 <- array(NA, dim=c(length(UPS1$ac),length(methNa),7), dimnames=list(UPS1$ac,c("PD","MQ","PL"),
  c("sco","nPep","medAbund", "logp","slope","startFr","cluNo")))

Now we’ll calculate the linear models and extract slope and p-value for each UPS1 protein. The functions used also allow plotting the regression results, but plotting each UPS1 protein would make very crowded figures. Instead, we’ll plot representative examples only, after clustering the regression-results.
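To make clear which two values are extracted from each regression, here is a minimal base-R sketch on a single simulated dose-response series. The concentrations, the noise level and all object names are hypothetical; the sketch only illustrates that the slope is the estimate in row 2, column 1 of the coefficient table and the p-value (for H0: slope=0) is in row 2, column 4.

```r
## Toy dose-response (hypothetical): 5 concentrations in triplicate, log2 scale on both axes
set.seed(42)
log2Conc <- rep(log2(c(250, 500, 2500, 5000, 10000)), each=3)
log2Abund <- 3 + 1.0 * log2Conc + rnorm(length(log2Conc), sd=0.3)

fit <- stats::lm(log2Abund ~ log2Conc)
coefTab <- summary(fit)$coefficients
slope <- coefTab[2, 1]            # estimate of the slope (expected close to 1)
logP  <- log10(coefTab[2, 4])     # log10 of p-value for H0: slope=0
c(slope=slope, logP=logP)
```

On log-log scale an ideal quantitation gives a slope of 1, ie doubling the amount of spike-in doubles the measured abundance.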

Linear Regression for each UPS1 : ProteomeDiscoverer

lmPD <- vector("list", length(NamesUpsPD))     # empty list, one element per UPS1 protein
doPl <- FALSE
lmPD[1:length(NamesUpsPD)] <- lapply(NamesUpsPD[1:length(NamesUpsPD)], wrMisc::linModelSelect, dat=dataPD,
  expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE)
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
names(lmPD) <- NamesUpsPD

## We make a little summary of regression-results (ProteomeDiscoverer)
tmp <- cbind(log10(sapply(lmPD, function(x) x$coef[2,4])), sapply(lmPD, function(x) x$coef[2,1]), sapply(lmPD, function(x) x$startLev))
datUPS1[,1,c("logp","slope","startFr")] <- tmp[match(rownames(datUPS1), names(lmPD)), ]
datUPS1[,1,"medAbund"] <- apply(wrMisc::.scale01(dataPD$datImp)[match(UPS1$ac, rownames(dataPD$datImp)),], 1,median,na.rm=TRUE)

Linear Regression for each UPS1 : MaxQuant

lmMQ <- vector("list", length(NamesUpsMQ))     # empty list, one element per UPS1 protein
lmMQ[1:length(NamesUpsMQ)] <- suppressWarnings(lapply(NamesUpsMQ[1:length(NamesUpsMQ)], linModelSelect, dat=dataMQ,
  expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE))
names(lmMQ) <- NamesUpsMQ

## We make a little summary of regression-results (MaxQuant)
tmp <- cbind(log10(sapply(lmMQ, function(x) x$coef[2,4])), sapply(lmMQ, function(x) x$coef[2,1]), sapply(lmMQ, function(x) x$startLev))
datUPS1[,2,c("logp","slope","startFr")] <- tmp[match(rownames(datUPS1), names(lmMQ)), ]
datUPS1[,2,"medAbund"] <- apply(wrMisc::.scale01(dataMQ$datImp)[match(UPS1$ac,rownames(dataMQ$datImp)),],1,median,na.rm=TRUE)

Linear Regression for each UPS1 : Proline

lmPL <- vector("list", length(NamesUpsPL))     # empty list, one element per UPS1 protein
lmPL[1:length(NamesUpsPL)] <- suppressWarnings(lapply(NamesUpsPL[1:length(NamesUpsPL)], linModelSelect, dat=dataPL,
  expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE))
names(lmPL) <- NamesUpsPL

tmp <- cbind(log10(sapply(lmPL, function(x) x$coef[2,4])), sapply(lmPL, function(x) x$coef[2,1]), sapply(lmPL, function(x) x$startLev))
datUPS1[,3,c("logp","slope","startFr")] <- tmp[match(rownames(datUPS1), names(lmPL)), ]
datUPS1[,3,"medAbund"] <- apply(wrMisc::.scale01(dataPL$datImp)[match(UPS1$ac,rownames(dataPL$datImp)),],1,median,na.rm=TRUE)

Frequency Of Starting Levels For Regression

To get a general view, let’s look at where regressions typically have their best starting-site (ie how many low-concentration points are usually better omitted):

## at which concentration of UPS1 did the best regression start ?
stTab <- sapply(1:5, function(x) apply(datUPS1[,,"startFr"], 2, function(y) sum(x==y, na.rm=TRUE)))
colnames(stTab) <- paste("lev", 1:5, sep="_")
kable(stTab, caption = "Frequency of starting levels for regression")

Frequency of starting levels for regression
	lev_1	lev_2	lev_3	lev_4	lev_5
PD	3	19	11	3	11
MQ	5	11	8	6	17
PL	7	14	12	7	7
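The idea of a 'best starting level' can be sketched in base R as follows. This is an illustration under stated assumptions and not the implementation of linModelSelect(): it assumes that one regression is fitted per candidate starting level (omitting all lower levels) and that the model with the lowest p-value for the slope is retained. The simulated data, where the two lowest levels are flat (below detection), are hypothetical.

```r
## Toy data (hypothetical): the 2 lowest of 6 levels are below detection, ie flat
set.seed(7)
lev <- rep(1:6, each=3)                          # dose levels (log-scale)
y <- ifelse(lev < 3, 3, lev) + rnorm(length(lev), sd=0.05)

## fit one regression per starting level, omitting all levels below
startLev <- 1:4
fits <- lapply(startLev, function(st) {
  use <- lev >= st
  summary(stats::lm(y[use] ~ lev[use]))$coefficients })
res <- cbind(startLev=startLev,
  slope=sapply(fits, function(x) x[2, 1]),
  logp=log10(sapply(fits, function(x) x[2, 4])))
res

## (assumed) criterion : retain the starting level with the lowest p-value
bestStart <- res[which.min(res[, "logp"]), "startLev"]
bestStart
```

Including the flat low-level points pulls the slope below 1 and inflates the residuals, so omitting them gives the better model, even though fewer points are used.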

Global Comparison Of Regression Models

Next, we’ll inspect the relation between regression-slopes and p-values (for H0: slope=0) :

layout(matrix(1:4,ncol=2))
subTi <- "fill according to median abundance (blue=low - green - red=high)"
xyRa <- apply(datUPS1[,,4:5], 3, range, na.rm=TRUE)

plotMultRegrPar(datUPS1, 1, xlim=xyRa[,1], ylim=xyRa[,2],tit="ProteomeDiscoverer UPS1, p-value vs slope",subTit=subTi)    # adj wr 9jan23
plotMultRegrPar(datUPS1, 2, xlim=xyRa[,1], ylim=xyRa[,2],tit="MaxQuant UPS1, p-value vs slope",subTit=subTi)
plotMultRegrPar(datUPS1, 3, xlim=xyRa[,1], ylim=xyRa[,2],tit="Proline UPS1, p-value vs slope",subTit=subTi)

We can observe that slope and (log)p-value of the resultant regressions do not necessarily correlate well. Thus, considering only one of these resultant values may not be sufficient.

Summarize Linear Regression Results

When judging results for individual spike-in proteins both the value of the slope as well as the p-value (for H0: slope=0) are important to consider. For example, there are some cases where the quantitations line up well giving a good p-value, however with slopes < 0.4, although a slope=1.0 is expected. This is definitely not the type of dose-response characteristics we are looking for.

In order to consider both characteristics (slope and p-value) at the same time, we’ll introduce a penalized objective score combining both elements. The overall model is :

score = sqrt( | log_pValue | ) + lambda * | slope - 1 |

In our case lambda is set to -5, ie any deviation of the slope from the expected value of 1.0 reduces the score.

for(i in 1:(dim(datUPS1)[2])) datUPS1[,i,"sco"] <- sqrt(abs(datUPS1[,i,"logp"])) - 5*abs(datUPS1[,i,"slope"] -1)    #
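As a small worked example, the score as computed in the line of code above can be wrapped into a helper function. The function name `lmScore` is hypothetical and only introduced for this illustration.

```r
## score as computed in the code above : sqrt(|log10(p)|) - 5 * |slope - 1|
lmScore <- function(logp, slope, lambda=-5) sqrt(abs(logp)) + lambda * abs(slope - 1)

## two cases with identical p-value (log10 p = -16), but different slopes
scoGood <- lmScore(logp=-16, slope=1.0)      # 4 , no penalty since slope as expected
scoPoor <- lmScore(logp=-16, slope=0.4)      # 1 , ie 4 - 5*0.6
c(scoGood, scoPoor)
```

Thus, a protein with an excellent p-value but a slope far from 1 gets a markedly lower score.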

Next, let’s bring together all linear-model scores, the number of peptides and the median protein abundance for each of the UPS1 proteins in one object to facilitate further steps.

datUPS1[,1,2] <- rowSums(dataPD$counts[match(UPS1$ac,dataPD$annot[,1]),,1], na.rm=TRUE)
datUPS1[,2,2] <- rowSums(dataMQ$counts[match(UPS1$ac,dataMQ$annot[,1]),,1], na.rm=TRUE)
datUPS1[,3,2] <- rowSums(dataPL$counts[match(UPS1$ac,dataPL$annot[,1]),], na.rm=TRUE)


nNApGrp <- array(dim=c(length(UPS1$ac), length(methNa), length(unique(grp9))), dimnames=list(UPS1$ac, names(methNa), unique(names(grp9)[order(grp9)])))
nNApGrp[,1,] <- wrMisc::rowGrpNA(dataPD$raw[match(UPS1$ac,dataPD$annot[,1]),], grp9)
nNApGrp[,2,] <- wrMisc::rowGrpNA(dataMQ$raw[match(UPS1$ac,dataMQ$annot[,1]),], grp9)
nNApGrp[,3,] <- wrMisc::rowGrpNA(dataPL$raw[match(UPS1$ac,dataPL$annot[,1]),], grp9)

layout(matrix(1:length(methNa), nrow=1))
for(i in 1:length(methNa )) { wrGraph::imageW(nNApGrp[,i,], tit=paste(names(methNa)[i],": Number Of NAs /Group"))
  mtext("Blue for low, dark red for high number of NAs", cex=0.75, adj=0)  }

Now we can explore the regression score and its context to other parameters, below it’s done graphically.

layout(matrix(1:4, ncol=2))
par(mar=c(5.5, 2.2, 4, 0.4))
col1 <- RColorBrewer::brewer.pal(9,"YlOrRd")
imageW(datUPS1[,,1], col=col1, tit="Linear regression score", xLab="",yLab="",transp=FALSE)
mtext("Dark red for elevated score", cex=0.75)

imageW(log(datUPS1[,,2]), tit="Number of peptides", xLab="",yLab="", col=col1, transp=FALSE)
mtext("Dark red for high number of peptides", cex=0.75)

## ratio : regression score vs no of peptides
imageW(datUPS1[,,1]/log(datUPS1[,,2]), col=rev(col1), tit="Regression score / Number of peptides", xLab="",yLab="", transp=FALSE)
mtext("Dark red for high lmScore/peptide ratio", cex=0.75)

## score vs abundance
imageW(datUPS1[,,1]/datUPS1[,,3], col=rev(col1), tit="Regression score / median Abundance", xLab="",yLab="", transp=FALSE)
mtext("Dark red for high lmScore/abundance ratio", cex=0.75)

From the heatmap-like plots we can see that some proteins are rather consistently quantified by any of the methods. Some of the variability may be explained by the number of peptides (in the case of MaxQuant ‘razor-peptides’ were used), see the plot of ‘regression score / number of peptides’. In contrast, UPS-protein median abundance does not correlate with or explain this phenomenon (see the last plot ‘regression score / median abundance’). So we cannot support the hypothesis that highly abundant proteins get quantified better.

Grouping of UPS1 Proteins to Display Representative Proteins

Using the linear regression score defined above we can rank UPS1 proteins and display representative ones in order to avoid crowded and repetitive figures.

Now, we can try to group the regression scores and easily display representative examples for each group. Here, we (pre)define that we want to obtain 5 groups (like ratings from 1 to 5 stars); a k-Means clustering approach was chosen.

## number of groups for clustering
nGr <- 5
chFin <- is.finite(datUPS1[,,"sco"])
if(any(!chFin)) datUPS1[,,"sco"][which(!chFin)] <- -1      # just in case..


## clustering using kMeans
kMx <- stats::kmeans(standardW(datUPS1[,,"sco"], byColumn=FALSE), nGr)$cluster
datUPS1[,,"cluNo"] <- matrix(rep(kMx, dim(datUPS1)[2]), nrow=length(kMx))

geoM <- apply(datUPS1[,,"sco"], 1, function(x) prod(x)^(1/length(x)))        # geometric mean across analysis soft
geoM2 <- lrbind(by(cbind(geoM,datUPS1[,,"sco"], clu=kMx), kMx, function(x) x[order(x[,1],decreasing=TRUE),]))  # organize by clusters
tmp <- tapply(geoM2[,"geoM"], geoM2[,"clu"], median)
geoM2[,"clu"] <- rep(rank(tmp, ties.method="first"), table(kMx))
geoM2 <- geoM2[order(geoM2[,"clu"],geoM2[,"geoM"],decreasing=TRUE),]         # order as decreasing median.per.cluster
geoM2[,"clu"] <- rep(1:max(kMx), table(geoM2[,"clu"])[rank(unique(geoM2[,"clu"]))])    # replace cluster-names to increasing

try(profileAsClu(geoM2[,2:4], geoM2[,"clu"], tit="Clustered Regression Results for UPS1 Proteins", ylab="Linear regression score"))

datUPS1 <- datUPS1[match(rownames(geoM2), rownames(datUPS1)),,]               # bring in new order
datUPS1[,,"cluNo"] <- geoM2[,"clu"]                                          # update cluster-names

### prepare annotation of UPS proteins
annUPS1 <- dataPL$annot[match(rownames(datUPS1), dataPL$annot[,1]), c(1,3)]
annUPS1[,2] <- substr(sub("_UPS","",sub("generic_ups\\|[[:alnum:]]+-{0,1}[[:digit:]]\\|","",annUPS1[,2])),1,42)

## index of representative for each cluster  (median position inside cluster)
UPSrep <- tapply(geoM2[,"geoM"], geoM2[,"clu"], function(x) ceiling(length(x)/2)) + c(0, cumsum(table(geoM2[,"clu"]))[-nGr])

Previously we organized all UPS1 proteins according to their regression characteristics into 5 clusters and each cluster was ordered for descending scores. Now we can use the median position within each cluster as representative example for this cluster.
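The logic of clustering, re-labelling clusters by their median score and picking the protein at the median position can be sketched on toy data with base R. This is a simplified illustration, not the code used above: it uses the arithmetic mean across methods instead of the geometric mean, 3 instead of 5 groups, and all data and object names are hypothetical.

```r
## Toy scores (hypothetical): 12 proteins x 3 methods, forming 3 well separated groups
set.seed(3)
sco <- rbind(matrix(rnorm(12, mean=1, sd=0.2), ncol=3),
             matrix(rnorm(12, mean=3, sd=0.2), ncol=3),
             matrix(rnorm(12, mean=5, sd=0.2), ncol=3))
dimnames(sco) <- list(paste0("prot", 1:12), c("PD","MQ","PL"))

nGr <- 3
clu <- stats::kmeans(sco, centers=nGr, nstart=10)$cluster

## k-means labels are arbitrary : rename so that the cluster with the highest median gets no 'nGr'
avSco <- rowMeans(sco)
cluMed <- tapply(avSco, clu, stats::median)
rk <- stats::setNames(rank(as.numeric(cluMed)), names(cluMed))
cluRank <- rk[as.character(clu)]
names(cluRank) <- rownames(sco)

## representative of each cluster : protein at median position after sorting by score
repProt <- sapply(split(avSco, cluRank), function(x)
  names(sort(x, decreasing=TRUE))[ceiling(length(x)/2)])
repProt
```

Choosing the median position avoids showing only the most flattering (or the worst) example of each cluster.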

Representative UPS1-protein of the Best Group (the ‘+++++’)

gr <- 5
useLi <- which(datUPS1[,1,"cluNo"]==gr)
colNa <- c("Protein",paste(colnames(datUPS1), rep(c("slope","logp"), each=ncol(datUPS1)), sep=" "))
try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
  caption=paste("Regression details for cluster of the",length(useLi),"best UPS1 proteins "), col.names=colNa, align="l"),silent=TRUE)

Regression details for cluster of the 15 best UPS1 proteins
	Protein	PD slope	MQ slope	PL slope	PD logp	MQ logp	PL logp
O76070	Gamma-synuclein (Chain 1-127)	1.06	1.04	0.668	-16.4	-15.2	-11.9
P12081	Histidyl-tRNA synthetase, cytoplasmic (Cha	0.884	1.28	0.814	-16.4	-15	-13.3
Q06830	Peroxiredoxin 1 (Chain 2-199)	0.782	0.89	0.668	-14.9	-12.2	-16.8
P10145	Interleukin-8, IL-8 (Chain 28-99)	0.886	1.09	0.787	-10.9	-8.34	-15.5
P02787	Serotransferrin (Chain 20-698)	1.1	1.46	0.764	-22.7	-17.7	-12.1
P63279	SUMO-conjugating enzyme UBC9 (Chain 1-158)	0.858	1.08	0.835	-12.8	-8.46	-11.6
P02144	Myoglobin (Chain 2-154)	1.18	1.22	0.55	-17.9	-19.1	-14
P02788	Lactotransferrin (Chain 20-710)	1.31	1.47	0.659	-16.2	-18.1	-16.4
P01112	GTPase HRas (Chain 1-189)	0.807	0.704	0.687	-12.8	-11.1	-14.6
P51965	Ubiquitin-conjugating enzyme E2 E1 (Chain	1.01	1.16	0.442	-18.8	-11.4	-14
P63165	Small ubiquitin-related modifier 1 (Chain	1.3	0.845	0.675	-9.1	-11	-13.2
P09211	Glutathione S-transferase P (Chain 2-210)	0.718	0.901	0.594	-10.5	-11	-12
P68871	Hemoglobin subunit beta (Chain 2-147)	1.06	1.51	0.593	-13.4	-13.9	-14.5
P06396	Gelsolin (Chain 28-782)	1.12	1.41	0.376	-21.8	-19	-13.2
P00441	Superoxide dismutase [Cu-Zn] (Chain 2-154)	0.759	1.18	0.475	-11.5	-7.49	-10.3

## Plotting the best regressions, this required package wrGraph version 1.2.5 (or higher)
if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4, ncol=2))
  tit <- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion

Representative UPS1-protein of the 2nd Best Group (the ‘++++’)

gr <- 4
useLi <- which(datUPS1[,1,"cluNo"]==gr)
try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
  caption=paste("Regression details for cluster of the",length(useLi),"2nd best UPS1 proteins "), col.names=colNa, align="l"),silent=TRUE)

Regression details for cluster of the 16 2nd best UPS1 proteins
	Protein	PD slope	MQ slope	PL slope	PD logp	MQ logp	PL logp
P55957	BH3-interacting domain death agonist (Chai	1.08	1.06	1.01	-19.9	-18.1	-18.5
P01375	Tumor necrosis factor, soluble form (Chain	1.16	0.948	1.11	-22.7	-15.2	-20.1
P41159	Leptin (Chain 22-167)	1.22	1.11	0.989	-21.7	-19	-17.5
P00167	Cytochrome b5 (Chain 1-134, N-terminal His	1.05	1.19	0.937	-22.1	-18.1	-16.6
P01344	Insulin-like growth factor II (Chain 25-91	1	0.892	0.853	-17.9	-18.1	-16.5
P00709	Alpha-lactalbumin (Chain 20-142)	0.996	1.2	0.968	-19.7	-15.9	-16.7
P00915	Carbonic anhydrase 1 (Chain 2-261)	1.08	1.39	0.991	-23.7	-17	-23.3
P01579	Interferon Gamma (Chain 23-166)	1.07	1.05	0.801	-17.9	-13.6	-19
P05413	Fatty acid-binding protein, heart (Chain 2	1.07	1.21	0.839	-18.3	-17.1	-19.3
P01133	Pro-Epidermal growth factor (EGF) (Chain 9	0.983	1.07	0.84	-17	-13.7	-12.3
P01008	Antithrombin-III (Chain 33-464)	0.96	1.25	0.834	-17.2	-16.7	-16.3
O00762	Ubiquitin-conjugating enzyme E2 C (Chain 1	1.09	1.2	0.868	-18.2	-12.9	-17.6
P08758	Annexin A5 (Chain 2-320)	1.09	1.18	0.671	-20.1	-17.2	-16.5
P00918	Carbonic anhydrase 2 (Chain 2-260)	1.17	1.29	0.813	-18.2	-16.9	-18.1
P04040	Catalase (Chain 2-527)	0.993	1.36	0.725	-13.5	-17.4	-18.7
P62937	Peptidyl-prolyl cis-trans isomerase A (Cha	0.98	1.26	0.737	-18.2	-12.3	-15.6

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4, ncol=2))
  tit <- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion

Representative UPS1-protein of the 3rd Group (the ‘+++’)

gr <- 3
useLi <- which(datUPS1[,1,"cluNo"]==gr)
try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
  caption="Regression details for 3rd cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)

Regression details for 3rd cluster UPS1 proteins
	Protein	PD slope	MQ slope	PL slope	PD logp	MQ logp	PL logp
P01031	Complement C5 (C5a anaphylatoxin) (Chain 6	0.467	0.434	1.14	-8.02	-6.33	-15.4
P69905	Hemoglobin subunit alpha (Chain 2-142)	0.508	0.312	0.965	-9.15	-1.51	-8.49
P02753	Retinol-binding protein 4 (Chain 19-201)	0.33	0.547	0.607	-10.4	-9.43	-16
NA	NA	NA	NA	NA	NA	NA	NA

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4, ncol=2))
  tit <- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion

Representative UPS1-protein of the 4th Group (the ‘++’)

gr <- 2
useLi <- which(datUPS1[,1,"cluNo"]==gr)
try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
  caption="Regression details for 4th cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)

Regression details for 4th cluster UPS1 proteins
	Protein	PD slope	MQ slope	PL slope	PD logp	MQ logp	PL logp
P02768	Serum albumin (Chain 26-609)	1.01	1.46	0.778	-16.5	-15.5	-20.2
P06732	Creatine kinase M-type (Chain 1-381)	1.03	1.55	0.854	-18.2	-18	-18.6
P61626	Lysozyme C (Chain 19-148)	1.17	0.663	1.13	-17.1	-11.7	-17.6
Q15843	NEDD8 (Chain 1-81)	1.17	0.849	0.955	-12.7	-7.13	-13.6
P01127	Platelet-derived growth factor B chain (Ch	1.1	1.48	1.02	-16.6	-11.1	-20.9
P61769	Beta-2-microglobulin (Chain 21-119)	1.16	0.623	1.03	-17.5	-8.52	-16.6
P16083	Ribosyldihydronicotinamide dehydrogenase [	1.24	1.75	0.858	-13.6	-16.3	-16.1
P10599	Thioredoxin (Chain 2-105, N-terminal His	0.939	0.337	0.951	-15.4	-5.22	-20.3


if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4, ncol=2))
  tit <- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion

Representative UPS1-protein of the 5th (And Last) Group (the ‘+’)

gr <- 1
useLi <- which(datUPS1[,1,"cluNo"]==gr)
try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
  caption="Regression details for 5th cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)

Regression details for 5th cluster UPS1 proteins
	Protein	PD slope	MQ slope	PL slope	PD logp	MQ logp	PL logp
P15559	NAD(P)H dehydrogenase [quinone] 1 (Chain 2	0.0754	1.05	0.088	-11.5	-14.8	-10
P99999	Cytochrome c (Chain 2-105)	0.523	1.32	0.415	-12	-12.8	-12.1
P62988	Ubiquitin (Chain 1-76, N-terminal His tag)	0.864	1.06	0.0383	-13	-12.1	-10.3
P08263	Glutathione S-transferase A1 (Chain 2-222)	0.405	1.13	0.187	-10.2	-16.4	-9.93
P02741	C-reactive protein (Chain 19-224)	0.266	1.03	0.674	-12.1	-7.18	-11.1

if(packageVersion("wrGraph")  >= "1.2.5"){
  layout(matrix(1:4, ncol=2))
  tit <- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
  try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=names(grp9), startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs introduced
#> by coercion

In some (less frequent) cases one can recognize unexpected characteristics of regression lines. This illustrates that not all proteins are quantified as perfectly as the initial quantitation data may suggest.

Additional Comments

The choice of the ‘best suited’ approach to quantify and compare proteomics data is not trivial at all. Particular attention has to be given to the choice of the numerous ‘small’ parameters which may have a very strong impact on the final outcome, as was experienced when preparing the data for this vignette or at other places (eg Chawade et al 2015). Thus, a good understanding of the chosen software/tools is of prime importance. Of course, this also concerns the protein-identification part/software.

The total number of proteins identified varies considerably between methods. This information may be very important to the user in real-world settings, but it is only partly taken into consideration in the comparisons presented here.

ROC curves allow us to gain more insight into the impact of cutoff values (alpha) for statistical testing. Frequently the ideal threshold maximizing sensitivity and specificity lies quite far from the common 5-percent threshold. This indicates that the common 5-percent threshold is often not the ‘optimal’ compromise for calling differentially abundant proteins. However, the optimal point varies very much between data-sets, and in a real-world setting with unknown samples this type of analysis is not possible.
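The notion of a threshold 'maximizing sensitivity and specificity' can be made concrete with a small base-R sketch. It relies on simulated p-values with known truth (which is exactly what a spike-in benchmark provides and what is missing with unknown samples). All values and object names are hypothetical, and maximizing the sum of sensitivity and specificity (equivalent to Youden's J) is only one of several possible criteria.

```r
## Toy test results (hypothetical): p-values for 10 true positives (spike) and 30 true negatives
set.seed(11)
isSpike <- rep(c(TRUE, FALSE), c(10, 30))
pVal <- c(stats::rbeta(10, 0.3, 4), stats::runif(30))

## sensitivity and specificity for a series of cutoff values (alpha)
alpha <- c(0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5)
roc <- t(sapply(alpha, function(a) {
  called <- pVal <= a
  c(alpha=a,
    sens=sum(called & isSpike) / sum(isSpike),
    spec=sum(!called & !isSpike) / sum(!isSpike)) }))
roc

## one possible compromise : cutoff maximizing sens + spec (Youden's J = sens + spec - 1)
bestAlpha <- roc[which.max(roc[, "sens"] + roc[, "spec"]), "alpha"]
bestAlpha
```

Raising alpha can only increase sensitivity and decrease specificity, so the chosen cutoff always reflects a trade-off between the two.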

As mentioned before, the dataset used in this vignette is not very recent; much better performing mass-spectrometers have been introduced since then. The main aim of this vignette is to show how to use wrProteo with a smaller example (allowing to limit the file-size of this package). Thus, for more rigorous scientific conclusions the user is encouraged to run the same procedure on data acquired with more recent mass-spectrometers.

	1	2	3	4	5	6	7	8	9
PD	195	273	209	205	257	220	207	234	272
MQ	302	334	330	282	323	322	297	337	318
PL	131	140	141	137	157	131	139	140	124
