% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mi_logreg_main.R
\name{mi_logreg_main}
\alias{mi_logreg_main}
\title{Estimate mutual information between discrete input and continuous output}
\usage{
mi_logreg_main(dataRaw, signal = "input", response = NULL,
  output_path = NULL, side_variables = NULL, pinput = NULL,
  formula_string = NULL, lr_maxit = 1000, MaxNWts = 5000,
  testing = FALSE, model_out = TRUE, scale = TRUE,
  TestingSeed = 1234, testing_cores = 1, boot_num = 10,
  boot_prob = 0.8, sidevar_num = 10, traintest_num = 10,
  partition_trainfrac = 0.6, plot_width = 6, plot_height = 4,
  data_out = FALSE)
}
\arguments{
\item{dataRaw}{must be a data.frame object}

\item{signal}{is a character object with names of columns of dataRaw to be treated as channel's input.}

\item{response}{is a character vector with names of columns of dataRaw  to be treated as channel's output}

\item{output_path}{is the directory in which output will be saved}

\item{side_variables}{(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included}

\item{pinput}{is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL).}

\item{formula_string}{(optional) is a character object that includes a formula syntax to use in logistic regression model. 
If NULL, a standard additive model of response variables is assumed. Only for advanced users.}

\item{lr_maxit}{is the maximum number of iterations of the fitting algorithm of logistic regression. Default is 1000.}

\item{MaxNWts}{is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.}

\item{testing}{is the logical indicating if the testing procedures should be executed}

\item{model_out}{is the logical indicating if the calculated logistic regression model should be included in output list}

\item{scale}{is a logical indicating if the response variables should be scaled and centered before fitting logistic regression}

\item{TestingSeed}{is the seed for random number generator used in testing procedures}

\item{testing_cores}{- number of cores to be used in parallel computing (via doParallel package)}

\item{boot_num}{is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.}

\item{boot_prob}{is the proportion of initial size of data to be used in bootstrap}

\item{sidevar_num}{is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.}

\item{traintest_num}{is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.}

\item{partition_trainfrac}{is the fraction of data to be used as a training dataset}

\item{plot_width}{- the basic dimensions (width) of plots, in inches}

\item{plot_height}{- the basic dimensions (height) of plots, in inches}

\item{data_out}{is the logical indicating if the data should be included in output list}
}
\value{
a list with several elements:
\itemize{
\item output$regression - confusion matrix of logistic regression predictions
\item output$mi         - mutual information in bits
\item output$model      - nnet object describing logistic regression model (if model_out=TRUE)
\item output$params     - parameters used in algorithm
\item output$time       - computation time of calculations
\item output$testing    - a 2- or 4-element output list of testing procedures (if testing=TRUE)
\item output$testing_pv - one-sided p-values of testing procedures (if testing=TRUE)
\item output$data       - raw data used in analysis (if data_out=TRUE)
}
}
\description{
The main wrapper function for basic usage of the SLEMI package for estimation of mutual information. First, the data are pre-processed
(all arguments are checked, observations with NAs are removed, and variables are scaled and centered if scale=TRUE). Then the basic estimation is carried out
and, if testing=TRUE, diagnostic tests are computed. If an output directory path is given (output_path is not NULL), graphs visualising the data and the analysis
are saved there, together with a compressed output object (an .rds file) with the full estimation results.
}
\details{
In a typical experiment aimed at quantifying information flow through a given signaling system, input values \eqn{x_1 \leq x_2 \leq \ldots \leq x_m}, ranging from 0 to saturation, are considered.
Then, for each input level \eqn{x_i}, \eqn{n_i} observations are collected, which are represented as vectors
\deqn{y^i_j \sim P(Y|X = x_i)}
Within information theory the degree of information transmission is measured as the mutual information
\deqn{MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy,}
where \eqn{P(y)} is the marginal distribution of the output. MI is expressed in bits and \eqn{2^{MI}} can be interpreted as the number of 
inputs that the system can resolve on average.

In contrast to existing approaches, instead of estimating the possibly high-dimensional conditional output distributions \eqn{P(Y|X = x_i)}, we propose to estimate the discrete conditional input distribution
\eqn{P(x_i|Y = y)}, which is known to be a simpler problem. Estimation of the MI using estimates of \eqn{P(x_i|Y = y)}, denoted here as \eqn{\hat{P}(x_i|Y = y)}, is possible because the MI can be
alternatively written as
\deqn{MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(x_i|Y = y)}{P(x_i)}dy}
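The two expressions for the MI agree because, by Bayes' formula, \eqn{P(y|X = x_i)P(x_i) = P(x_i|Y = y)P(y)}, so the ratios inside the logarithms coincide:
\deqn{\frac{P(y|X = x_i)}{P(y)} = \frac{P(x_i|Y = y)}{P(x_i)}}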
The inner integral is an expected value with respect to the distribution \eqn{P(Y|X = x_i)} and can be approximated by the average over the observations \eqn{y^i_j} collected for input \eqn{x_i}
\deqn{MI(X,Y) \approx \sum_{i=1}^{m} P(x_i)\frac{1}{n_i} \sum_{j=1}^{n_i} log_2\frac{\hat{P}(x_i|Y = y^i_j)}{P(x_i)}}
Here, we propose to use logistic regression to obtain the estimates \eqn{\hat{P}(x_i|Y = y^i_j)}. Specifically,
\deqn{log\frac{P(x_i |Y = y)}{P(x_m|Y = y)} \approx \alpha_i +\beta_iy}
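Together with the constraint that the probabilities sum to one, these log-odds relations determine the estimated input distribution. Assuming that log above denotes the natural logarithm and taking \eqn{x_m} as the reference level, i.e. setting \eqn{\alpha_m = 0} and \eqn{\beta_m = 0}, one obtains
\deqn{\hat{P}(x_i|Y = y) = \frac{\exp(\alpha_i + \beta_i y)}{\sum_{k=1}^{m} \exp(\alpha_k + \beta_k y)}}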

Additional parameters: lr_maxit and MaxNWts are the same as in the definition of the multinom function from the nnet package. An alternative
model formula (using the formula_string argument) should be provided if the data are not suitable for description by logistic regression
(recommended only for advanced users). Preliminary scaling of the data (argument scale) should be used as in other
data-driven approaches: if the response variables are comparable, scaling can be omitted (scale=FALSE), while if they represent
different phenomena (varying in units and/or magnitude), scaling is recommended.
}
\section{References}{

[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M,  
Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI,
\emph{PLoS Comput Biol}, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
}

\examples{
# One response column
tempdata <- data_example1
outputCLR1 <- mi_logreg_main(dataRaw = tempdata, signal = "signal", response = "response")

# Three response columns
tempdata <- data_example2
outputCLR2 <- mi_logreg_main(dataRaw = tempdata, signal = "signal", response = c("X1", "X2", "X3"))

#For further details see vignette
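
# A minimal sketch of the estimator described in Details, written directly with
# nnet::multinom. It is NOT the package's internal code: the simulated data and
# every name below (sim, fit, post, p_in, mi_sketch) are illustrative assumptions.
# It assumes a uniform input distribution and equally many observations per
# input level, so the fitted posterior is consistent with the assumed P(x_i).
set.seed(1)
n_per_level <- 200
sim <- data.frame(
  signal   = factor(rep(c("0", "1", "2"), each = n_per_level)),
  response = c(rnorm(n_per_level, mean = 0),
               rnorm(n_per_level, mean = 2),
               rnorm(n_per_level, mean = 4))
)

# Logistic regression of the discrete input on the continuous output
fit <- nnet::multinom(signal ~ response, data = sim, maxit = 1000, trace = FALSE)

# Matrix of estimated P(x_i | Y = y), one row per observation, one column per level
post <- predict(fit, newdata = sim, type = "probs")

# Uniform input distribution P(x_i)
m <- nlevels(sim$signal)
p_in <- rep(1 / m, m)

# Estimated probability of the true input level of each observation
level_idx <- as.integer(sim$signal)
post_true <- post[cbind(seq_len(nrow(sim)), level_idx)]

# Average log2 ratio within each input level, then weight by P(x_i)
log_ratio <- log2(post_true / p_in[level_idx])
mi_sketch <- sum(p_in * tapply(log_ratio, sim$signal, mean))
mi_sketch  # in bits; bounded above by log2(m)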
}
