Abstract
The package SLEMI is designed to estimate channel capacity between finite state input and multidimensional continuous output from experimental data. For efficient computations, it uses an iterative algorithm based on logistic regression. In addition, functions to estimate mutual information and calculate probabilities of correct discrimination between a pair of input values are implemented. The method is published in PLOS Computational Biology (Jetka et al. 2019).The main software requirement is the installation of the R environment (version: >= 3.2), which can be downloaded from R project website and is distributed for all common operating systems. We tested the package in R environment installed on Windows 7, 10; Mac OS X 10.11 - 10.13 and Ubuntu 18.04 with no significant differences in the performance. The use of a dedicated Integrated development environment (IDE), e.g. posit is recommended.
Apart from a base installation of R, SLEMI requires the following R packages:
Each of the above packages can be installed by executing
install.packages("name_of_a_package")
in the R console.
Importantly, during installation availability of the above packages will be verified and missing packages will be automatically installed.
The package can be directly installed from GitHub. For installation, open RStudio (or base R) and run following commands in the R console
install.packages("devtools") # run if 'devtools' is not installed
library(devtools)
install_github("sysbiosig/SLEMI")
Are required packages not found, they will be installed automatically.
The package implements methods published in PLOS Computational Biology, please cite:
Jetka T, Nienałtowski K, Winarski T, Błoński S, Komorowski M (2019) Information-theoretic analysis of multivariate single-cell signaling responses. PLoS Comput Biol 15(7): e1007132. https://doi.org/10.1371/journal.pcbi.1007132
All problems, issues and bugs can be reported here:
or directly via e-mail: t.jetka a t gmail.com.
The three functions listed below constitute the key wrapper (interface) functions of the package.
mi_logreg_main()
enables calculation of the mutual
information
capacity_logreg_main()
enables calculation of the
information capacity
prob_discr_pairwise()
serves to calculate
probabilities of correct discrimination between pairs of input
values
The function capacity_logreg_main()
triggers
preprocessing of the data
estimation of channel capacity
running diagnostic procedures
visualisation.
Each of the above steps is implemented within auxiliary functions as presented in the Figure 1 below.
The algorithm to compute the information capacity is implemented
within the function capacity_logreg_algorithm()
, which uses
logistic regression from the nnet
package.
Diagnostic procedures (significance and uncertainties of estimates)
are provided in an internal function
capacity_logreg_testing()
. These are based on data
bootstrapping and overfitting test.
For visualization, a set of graphs is created by an internal function
capacity_output_graphs()
and saved in a specified
directory. In addition, capacity_logreg_main()
returns a
list with capacity estimates, optimal input probability distribution,
diagnostic measures and other summary information about the
analysis.
The function mi_logreg_main()
serves to
calculate the mutual information. It initiates similar steps as the
function capacity_logreg_main()
but without performing the
optimization of the distribution of the input. Instead, it requires the
input distribution to be specified by the user as a function’s
argument.
Logistic regression and Monte Carlo methods, following an analogous
algorithm as within the capacity_logreg_algorithm()
function, are combined to estimate mutual information within a function
mi_logreg_algorithm()
. Visualisation and diagnostics are
carried out by the same set of auxiliary functions as for channel
capacity (internal functions capacity_output_graphs()
and
capacity_logreg_testing()
).
The prob_discr_pairwise()
allows to
estimate probabilities of correct discrimination between two different
values of the input. It implements estimation of probabilities of
correct classification by logistic regression (from nnet
package) for each pair of input values. The probabilities of correct
discrimination are visualized with a graph composed of pie charts.
SLEMI package is designed to estimate information-theoretic measures between a discrete-valued input, \(X\), and multivariate, continuous output, \(Y\). In a typical experiment aimed to quantify information flow a given signaling system, input values \(x_1\leq x_2 \ldots... \leq x_m\), ranging from 0 to saturation are considered.
Then, for each input level, \(x_i\),\(n_i\) observations are collected, which are represented as vectors \[y^i_j \sim P(Y|X = x_i)\] Within information theory the degree of information transmission is measured as the mutual information \[MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy\] where \(P(y)\) is the marginal distribution of the output. MI is expressed in bits and \(2^{MI}\) can be interpreted as the number of inputs that the system can resolve on average.
The maximization of mutual information with respect to the input distribution, \(P(X)\), defines the information capacity, \(C^*\). Formally, \[C^* = max_{P(X)} MI(X,Y)\] Information capacity is expressed in bits and \(2^{C^*}\) can be interpreted as the maximal number of inputs that the system can effectively resolve. For details regarding information theory or its application in systems biology please see Methods section and Supplementary Information of the corresponding paper (Jetka et al. 2019).
Functions mi_logreg_main()
,
capacity_logreg_main()
, prob_discr_pairwise()
require data in the form of the object data.frame
with a
specific structure of rows and columns. Responses \(y^i_j\) are assumed to be measured for a
finite set of stimuli levels \(x_1,x_2,\ldots,x_m\). The responses \(y^i_j\) can be multidimensional. Usually,
experimental dataset is represented as a table with rows and columns
organized as shown in Figure 2.
Therefore, the input data frame is expected to have the form represented by the above table, which can be formally described by the following conditions
numeric
; order
and number of outputs should be the same for all cells.An example of the input data.frame
, which contains the
measurements of the NfkB system presented in the MP is
available within the package under the variable data_nfkb
.
It has the following format
signal | response_0 | response_3 | response_21 | |
---|---|---|---|---|
1 | 0ng | 0.3840744 | 0.4252835 | 0.5643459 |
2 | 0ng | 0.4709216 | 0.5777821 | 0.2962640 |
3 | 0ng | 0.4274474 | 0.6696011 | 0.5696369 |
10001 | 8ng | 0.3120216 | 0.3475484 | 9.7036535 |
10002 | 8ng | 0.2544961 | 0.6611051 | 8.1628482 |
10003 | 8ng | 0.1807391 | 0.4336810 | 5.3928484 |
11540 | 100ng | 1.3534083 | 3.0158004 | 6.8983046 |
11541 | 100ng | 1.7007936 | 2.2224497 | 2.8677178 |
11542 | 100ng | 0.1997087 | 0.2886905 | 5.8193494 |
where each row represents measurements of a single-cell, the column
named signal
specifies the level of stimulation, while
response_T is the response of the NfkB system in an individual cell at
time point T. The above table can be shown in R by calling
Calculation of the information capacity with default settings is performed by the command
where the required arguments are
dataRaw
- data frame with column of type
factor
containing values of input (X) and columns of type
numeric
containing values of output (Y), where each row
represents a single observationsignal
- a character which indicates the name of the
column in dataRaw
with values of input (X)response
- a character vector which indicates names of
columns in dataRaw
with values of output (Y)output_path
- a character with the directory, to which
output should be savedThe function returns a list with the following elements
nnet
object describing fitted logistic
regression modeldata_out=TRUE
)By default, all returned elements are saved in
output_path
directory in a file output.rds
.
Along with the output data, results of the computations are visualised
as the graphs listed below
The function mi_logreg_main()
takes a similar list of
arguments and generates analogous plots to the function
capacity_logreg_main()
. The differences are listed
below.
Firstly, user must specify the distribution of input that should be
used for calculation of the mutual information. It is done by passing a
numeric vector via the argument pinput
of
mi_logreg_main()
function. Secondly, the returned list
stores the value of the computed mutual information (in bits) under the
element mi
.
Calculation of the probabilities of correct discrimination between pairs of input values is performed by running the following command
where the required arguments are analogous to the arguments of the
functions capacity_logreg_main()
and
mi_logreg_main()
. The probabilities of correct
discrimination are computed for each pair of unique input values and
returned as a list with the following elements
In addition, a plot of corresponding pie charts is created in
output_path
in the pdf format.
In addition to the sole calculation of the information capacity, the
function capacity_logreg_main()
can also be used to asses
accuracy of the channel capacity estimates resulting from potentially
insufficient sample size and potential over-fitting of the regression
model. Two test are implemented. Precisely, the function can perform
In order to perform diagnostic tests, that by default are turned off, user must set the value of the input argument
In addition, settings of the diagnostic test can be altered by changing the following parameters
doParallel
package) in parallel computing,capacity_logreg_main()
In addition, to the basic functionalities described above, the
function capacity_logreg_main()
allows to control several
other parameters of the algorithm that computes the information
capacity. These parameters and their effects are listed below.
model_out
(default=TRUE
) - logical,
specify if nnet
model object should be saved into output
fileplot_width
(default = 6
) - numeric, the
basic width of created plotsplot_height
(default = 4
) - numeric, the
basic height of created plotsscale
(default = TRUE
) - logical, value
indicating if the columns of dataRaw
are to be centered and
scaled, what is usually recommended for the purpose of stability of
numerical computations. From a purely theoretical perspective, such
transformation does not influence the value of channel capacity.lr_maxit
(default = 1000
) - a maximum
number of iterations of fitting step of logistic regression algorithm in
nnet
function. If a warning regarding lack of convergence
of logistic model occurs, should be set to a larger value (possible if
data is more complex or of a very high dimension).MaxNWts
(default = 5000
) - a maximum
number of parameters in logistic regression model. A limit is set to
prevent accidental over-loading the memory. It should be set to a larger
value in case of exceptionally high dimension of the output data or very
high number of input values. In principle, logistic model requires
fitting \((m-1)\cdot(d+1)\) parameters,
where \(m\) is the number of unique
input values and \(d\) is the dimension
of the output.The latter two parameters, i.e lr_maxit
and
MaxNWts
, allow to change the parameters of the logistic
regression model fitting within the dependent nnet
package.
Below, we present a minimal model that may serve as a quick introduction to computations within the package. Precisely, we consider a system
The example is analogous to the Test scenario 2 of the Supplementary Information of (Jetka et al. 2019) (Section 3.2).
Input data
Firstly, we generate a a synthetic dataset. The data corresponding to
the model can be generated, and represented as the data frame
tempdata
with columns input
and
output
, by running
xs=c(0,0.1,1,10) # concentration of input.
tempdata = data.frame(input = factor(c(t(replicate(1000,xs))),
levels=xs),
output = c(matrix(rnorm(4000, mean=10*(xs/(1+xs)),sd=c(1,1,1,1)),
ncol=4,byrow=TRUE) ))
The generated data.frame has the following structure
input | output | |
---|---|---|
1 | 0 | -0.6747968 |
2 | 0 | 0.9103099 |
2001 | 1 | 5.3381979 |
2002 | 1 | 5.0503969 |
3999 | 10 | 7.0037857 |
4000 | 10 | 10.2305361 |
Calculation of the information capacity
The Information capacity can be calculated using the
capacity_logreg_main()
function that takes the data frame
“tempdata” as dataRaw
argument. Column names “input” and
“output” are used as arguments signal
and
response
, respectively. The output_path
is set
as “minimal_example/”. Therefore, the function is run as follows
tempoutput <- capacity_logreg_main(dataRaw=tempdata,
signal="input", response="output",
output_path="minimal_example/")
Results of the computations are returned as a data structure
described before. In addition, results are presented in the form of the
following graph (by default saved as MainPlot.pdf in
minimal_example/
directory). It represents the input-output
data and gives the corresponding channel capacity.
Calculation of the mutual information
To compare mutual information of experimental data with its channel capacity, we can run (uniform distribution of input values is assumed, as default)
tempoutput_mi <- mi_logreg_main(dataRaw=tempdata,
signal="input", response="output",
output_path="minimal_exampleMI/",
pinput=rep(1/4,4))
and display results
print(paste("Mutual Information:", tempoutput_mi$mi,"; ",
"Channel Capacity:", tempoutput$cc, sep=" "))
## [1] "Mutual Information: 1.48828226407725 ; Channel Capacity: 1.54565042478486"
Alternatively, the distribution of the input can be defined with probabilities \((0.4,0.1,0.4,0.1)\)
tempoutput_mi <- mi_logreg_main(dataRaw=tempdata,
signal="input", response="output",
output_path="minimal_exampleMI/",
pinput=rc(0.4,0.1,0.4,0.1))
and display results
print(paste("Mutual Information:", tempoutput_mi$mi,"; ",
"Channel Capacity:", tempoutput$cc, sep=" "))
## [1] "Mutual Information: 1.33866963524711 ; Channel Capacity: 1.54565042478486"
Calculation of the probabilities of correct discrimination
Probabilities of correct discrimination between input values are calculated as follows
tempoutput_probs <- prob_discr_pairwise(dataRaw=tempdata,
signal="input", response="output",
output_path="minimal_exampleProbs/")
The above command generates graph shown in Figure 4 in the output directory
Diagnostics
The diagnostic test can be performed as follows
dir.create("example1_testing/")
outputCLR=capacity_logreg_main(dataRaw=data_example1,
signal="signal",response="response",
output_path="example1_testing/",
testing=TRUE, TestingSeed = 1234, testing_cores = 4,
boot_num = 40, boot_prob = 0.8,
traintest_num = 40,partition_trainfrac = 0.6)
It will run diagnostics with 40 re-sampling of the data, where bootstrap is calculated using 80% of the data, while the over-fitting test uses 60% of the original dataset.
Its results are provided in graph presented in Figure 5.
The top diagram shows the value of the capacity estimate (in black) obtained from the complete dataset and the mean value of bootstrap repetitions with indicated +/- standard deviation (in red). Plots that follow show histograms of calculated capacities for different diagnostic regimes. The black dot represents the estimate of the channel capacity based on the complete dataset. In addition, corresponding empirical p-values of both tests (left- and right-sided) are calculated to assess the randomness of obtained results (PV in the plots).
A reliable estimation of the information capacity should yield the following results of the bootstrap and overfitting tests.
The bootstrap test should yield distribution of the capacity estimates with small variance. In addition, the capacity estimated based on the complete dataset should not be an outlier (p-value>0.05). Otherwise, it would indicate that the sample size is too low for an accurate estimation of the channel capacity.
The over-fitting test should provide similar results. The capacity estimate obtained based on the complete dataset should lie within the distribution of capacities generated in the test. In the opposite case, it could mean that the logistic regression model does not fully grasp the essential aspects of input-output dependencies in the data.
Two step-by-step examples that further illustrate the applicability of the SLEMI package are provided in the Section 6 of the ‘Testing procedures’ pdf file that is added to the publication (Jetka et al. 2019) and can be found here.
To reproduce results of the NFkB analysis presented in the publication, see Section 7 of the ‘Testing procedures’ pdf file that is added to the publication (Jetka et al. 2019) and can be found here.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.24.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SLEMI_1.0.2 reshape2_1.4.4 stringr_1.5.0 gridExtra_2.3 ggplot2_3.4.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 xfun_0.40 bslib_0.5.1
## [4] recipes_1.0.8 lattice_0.21-9 vctrs_0.6.4
## [7] tools_4.3.1 generics_0.1.3 stats4_4.3.1
## [10] parallel_4.3.1 proxy_0.4-27 tibble_3.2.1
## [13] fansi_1.0.5 ModelMetrics_1.2.2.2 pkgconfig_2.0.3
## [16] Matrix_1.6-1.1 data.table_1.14.8 lifecycle_1.0.3
## [19] compiler_4.3.1 munsell_0.5.0 codetools_0.2-19
## [22] htmltools_0.5.6.1 class_7.3-22 sass_0.4.7
## [25] yaml_2.3.7 prodlim_2023.08.28 pillar_1.9.0
## [28] jquerylib_0.1.4 MASS_7.3-60 cachem_1.0.8
## [31] gower_1.0.1 iterators_1.0.14 rpart_4.1.21
## [34] foreach_1.5.2 nlme_3.1-163 parallelly_1.36.0
## [37] lava_1.7.2.1 tidyselect_1.2.0 digest_0.6.33
## [40] stringi_1.7.12 future_1.33.0 dplyr_1.1.3
## [43] purrr_1.0.2 listenv_0.9.0 splines_4.3.1
## [46] fastmap_1.1.1 colorspace_2.1-0 cli_3.6.1
## [49] magrittr_2.0.3 survival_3.5-7 utf8_1.2.3
## [52] e1071_1.7-13 future.apply_1.11.0 withr_2.5.1
## [55] scales_1.2.1 lubridate_1.9.3 timechange_0.2.0
## [58] rmarkdown_2.25 globals_0.16.2 nnet_7.3-19
## [61] timeDate_4022.108 evaluate_0.22 knitr_1.44
## [64] hardhat_1.3.0 caret_6.0-94 rlang_1.1.1
## [67] Rcpp_1.0.11 glue_1.6.2 pROC_1.18.4
## [70] ipred_0.9-14 jsonlite_1.8.7 R6_2.5.1
## [73] plyr_1.8.9
author and maintainer, please contact via t.jetka a t gmail.com↩︎