SAVER (Single-cell Analyses Via Expression Recovery) is a method for denoising single-cell RNA sequencing data by borrowing information across genes and cells. Here, we demonstrate how to use the method. For more information, take a look at the Github page and the paper.
You can install the most recent updates of SAVER from github with:
To install the stable release of SAVER:
Once we have the package installed, we can load the package.
In this tutorial, we will run SAVER on the mouse cortex data from Zeisel (2015).
data.path <- "../data/expression_mRNA_17-Aug-2014.txt"
# Need to remove first 10 rows of metadata
raw.data <- read.table(data.path, header = FALSE, skip = 11, row.names = 1,
check.names = FALSE)
cortex <- as.matrix(raw.data[, -1])
cellnames <- read.table(data.path, skip = 7, nrows = 1, row.names = 1,
stringsAsFactors = FALSE)
colnames(cortex) <- cellnames[-1]
dim(cortex)
The input to SAVER is a matrix of gene expression counts with genes along the rows and cells along the columns. SAVER was developed for use on UMI count data but it seems to work well with non-UMI TPM data. Standard quality control such as removing low quality cells and filtering out genes with low expression should be performed prior to running SAVER.
Here, the Zeisel dataset has already been preprocessed.
The main function to run SAVER is the saver
function. Since SAVER is computationally intensive for large datasets, we recommend running it in parallel on a cluster either by creating your own parallel environment or by specifying ncores
. We will run SAVER on the Zeisel dataset across 12 cores.
cortex.saver
is a saver object with the following components:
estimate
gives the library size normalized SAVER estimates.se
gives the standard error of the estimatesinfo
gives the more information about the run.If you are only interested in the estimate, you can run saver
setting estimates.only = TRUE
. For example,
By default, SAVER takes in an unnormalized count matrix and performs library size normalization during the denoising step. However, if your data is already normalized or normalization is not desired, you can set size.factor = 1
. In addition, if you want to use custom cell size factors, you can set size.factor
to be a vector of size factors.
SAVER has two main steps: the first is the prediction step and the second is a shrinkage step. The prediction step is the time-consuming part of SAVER. If you are only interested in a handful of genes, you can specify those genes and generate predictions for those genes only and choose to return the entire data matrix or for those genes only. For example,
# Identify the indices of the genes of interest
genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19")
genes.ind <- which(rownames(cortex) %in% genes)
# Generate predictions for those genes and return entire dataset
cortex.saver.genes <- saver(cortex, pred.genes = genes.ind,
estimates.only = TRUE)
# Generate predictions for those genes and return only those genes
cortex.saver.genes.only <- saver(cortex, pred.genes = genes.ind,
pred.genes.only = TRUE, estimates.only = TRUE)
The main contribution of SAVER is the shrinkage step after generating the predictions. The shrinkage step allows the algorithm to adaptively evaluate the quality of the predictions and weigh the estimates either towards the predictions or the observed value. SAVER performs a Lasso Poisson regression for each gene using other genes as covariates to generate the predictions. However, if you want to use other predictions such as MAGIC or DCA, you can input the results in matrix format into the mu
parameter of the saver
function.
By default, if you specify ncores
without already having a parallel backend set up, saver
will make a default cluster using the makeCluster
function in the parallel package and register a parallel backend via the registerDoParallel
function in the doParallel package. However, if you want to set up a more complicated backend, such as using MPI across nodes, you can do so and run saver
without specifying the ncores
option.
Running SAVER on large datasets may run into memory issues, especially when parallelized across cores where each core has limited memory. One solution would be to register a parallel backend using SOCK or MPI to parallelize across nodes. Another solution would be to split the genes into multiple runs using pred.genes
and pred.genes.only
. For example, say we have a dataset x
with 10,000 genes. We can split this up into 4 runs on different machines and combine using the combine.saver
function. We recommend setting do.fast = FALSE
when splitting the dataset.
saver1 <- saver(x, pred.genes = 1:2500, pred.genes.only = TRUE,
do.fast = FALSE)
saver2 <- saver(x, pred.genes = 2501:5000, pred.genes.only = TRUE,
do.fast = FALSE)
saver3 <- saver(x, pred.genes = 5001:7500, pred.genes.only = TRUE,
do.fast = FALSE)
saver4 <- saver(x, pred.genes = 7501:10000, pred.genes.only = TRUE,
do.fast = FALSE)
saver.all <- combine.saver(list(saver1, saver2, saver3, saver4))
Oftentimes for downstream analysis, you may want to sample from the posterior distributions to account for estimation uncertainty. This is similar to performing multiple imputation to obtain multiple imputed datasets. To sample from the posterior distributions, we use the function sample.saver
, which takes in the saver
result and outputs sampled datasets. For example,
Because the SAVER estimates contain uncertainty, correlations between genes and cells cannot be directly calculated using the SAVER estimates. To adjust for the uncertainty, we provide the functions cor.genes
and cor.cells
to construct gene-to-gene and cell-to-cell correlation matrices respectively for the SAVER output. These functions take in the saver
result and outputs the gene-to-gene or cell-to-cell correlation matrix. For example,