% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Fast_analysis.R
\name{fast_analysis}
\alias{fast_analysis}
\title{Efficiently analyze nucleotide recoding data}
\usage{
fast_analysis(
  df,
  pnew = NULL,
  pold = NULL,
  no_ctl = FALSE,
  read_cut = 50,
  features_cut = 50,
  nbin = NULL,
  prior_weight = 2,
  MLE = TRUE,
  lower = -7,
  upper = 7,
  se_max = 2.5,
  mut_reg = 0.1,
  p_mean = 0,
  p_sd = 1,
  StanRate = FALSE,
  Stan_data = NULL,
  null_cutoff = 0,
  NSS = FALSE,
  Chase = FALSE,
  BDA_model = FALSE
)
}
\arguments{
\item{df}{Dataframe in form provided by cB_to_Fast}

\item{pnew}{Labeled read mutation rate; the default of NULL means that the model estimates the rate from s4U fed data. If pnew is provided by the user, it must be a vector
of length == number of s4U fed samples. The 1st element corresponds to the s4U induced mutation rate estimate for the 1st replicate of the 1st
experimental condition; the 2nd element corresponds to the s4U induced mutation rate estimate for the 2nd replicate of the 1st experimental condition,
etc.}

\item{pold}{Unlabeled read mutation rate; the default of NULL means that the model estimates the rate from no-s4U fed data}

\item{no_ctl}{Logical; if TRUE, then -s4U control is not used for background mutation rate estimation}

\item{read_cut}{Minimum number of reads for a given feature-sample combo to be used for mut rate estimates}

\item{features_cut}{Number of features to estimate sample specific mutation rate with}

\item{nbin}{Number of bins for mean-variance relationship estimation. If NULL, max of 10 or (number of logit(fn) estimates)/100 is used}

\item{prior_weight}{Determines extent to which logit(fn) variance is regularized to the mean-variance regression line}

\item{MLE}{Logical; if TRUE then replicate logit(fn) is estimated using maximum likelihood; if FALSE more conservative Bayesian hypothesis testing is used}

\item{lower}{Lower bound for MLE with L-BFGS-B algorithm}

\item{upper}{Upper bound for MLE with L-BFGS-B algorithm}

\item{se_max}{Uncertainty given to those transcripts with estimates at the upper or lower bound. This prevents downstream errors caused by
abnormally high standard errors for transcripts with extreme kinetics}

\item{mut_reg}{If MLE has instabilities, empirical mut rate will be used to estimate fn, multiplying pnew by 1+mut_reg and pold by 1-mut_reg to regularize fn}

\item{p_mean}{Mean of normal distribution used as prior penalty in MLE of logit(fn)}

\item{p_sd}{Standard deviation of normal distribution used as prior penalty in MLE of logit(fn)}

\item{StanRate}{Logical; if TRUE, a simple 'Stan' model is used to estimate mutation rates for fast_analysis; this may add a couple minutes
to the runtime of the analysis.}

\item{Stan_data}{List; if StanRate is TRUE, then this is the data passed to the 'Stan' model to estimate mutation rates. If using the \code{bakRFit}
wrapper of \code{fast_analysis}, then this is created automatically.}

\item{null_cutoff}{bakR will test the null hypothesis of |effect size| < |null_cutoff|}

\item{NSS}{Logical; if TRUE, logit(fn)s are compared rather than log(kdeg) so as to avoid steady-state assumption.}

\item{Chase}{Logical; set to TRUE if analyzing a pulse-chase experiment. If TRUE, kdeg = -ln(fn)/tl where fn is the fraction of
reads that are s4U labeled (more properly referred to as the fraction old in the context of a pulse-chase experiment)}

\item{BDA_model}{Logical; if TRUE, variance is regularized with scaled inverse chi-squared model. Otherwise a log-normal
model is used.}
}
\value{
List with dataframes providing information about replicate-specific and pooled analysis results. The output includes:
\itemize{
\item Fn_Estimates; dataframe with estimates for the fraction new and fraction new uncertainty for each feature in each replicate.
The columns of this dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item Replicate; Numerical ID for replicate
\item logit_fn; logit(fraction new) estimate, unregularized
\item logit_fn_se; logit(fraction new) uncertainty, unregularized and obtained from Fisher Information
\item nreads; Number of reads mapping to the feature in the sample for which the estimates were obtained
\item log_kdeg; log of degradation rate constant (kdeg) estimate, unregularized
\item kdeg; degradation rate constant (kdeg) estimate
\item log_kd_se; log(kdeg) uncertainty, unregularized and obtained from Fisher Information
\item sample; Sample name
\item XF; Original feature name
}
\item Regularized_ests; dataframe with average fraction new and kdeg estimates, averaged across the replicates and regularized
using priors informed by the entire dataset. The columns of this dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item avg_log_kdeg; Weighted average of log(kdeg) from each replicate, weighted by sample and feature-specific read depth
\item sd_log_kdeg; Standard deviation of the log(kdeg) estimates
\item nreads; Total number of reads mapping to the feature in that condition
\item sdp; Prior standard deviation for fraction new estimate regularization
\item theta_o; Prior mean for fraction new estimate regularization
\item sd_post; Posterior uncertainty
\item log_kdeg_post; Posterior mean for log(kdeg) estimate
\item kdeg; exp(log_kdeg_post)
\item kdeg_sd; kdeg uncertainty
\item XF; Original feature name
}
\item Effects_df; dataframe with estimates of the effect size (change in logit(fn)) comparing each experimental condition to the
reference sample for each feature. This dataframe also includes p-values obtained from a moderated t-test. The columns of this
dataframe are:
\itemize{
\item Feature_ID; Numerical ID of feature
\item Exp_ID; Numerical ID for experimental condition (Exp_ID from metadf)
\item L2FC(kdeg); Log2 fold change (L2FC) kdeg estimate or change in logit(fn) if NSS TRUE
\item effect; LFC(kdeg)
\item se; Uncertainty in L2FC_kdeg
\item pval; P-value obtained using effect_size, se, and a z-test
\item padj; pval adjusted for multiple testing using Benjamini-Hochberg procedure
\item XF; Original feature name
}
\item Mut_rates; list of two elements. The 1st element is a dataframe of s4U induced mutation rate estimates, where the mut column
represents the experimental ID and the rep column represents the replicate ID. The 2nd element is the single background mutation
rate estimate used
\item Hyper_Parameters; vector of two elements, named a and b. These are the hyperparameters estimated from the uncertainties for each
feature, and represent the two parameters of a Scaled Inverse Chi-Square distribution. Importantly, a is the number of additional
degrees of freedom provided by the sharing of uncertainty information across the dataset, to be used in the moderated t-test.
\item Mean_Variance_lms; linear model objects obtained from the uncertainty vs. read count regression model. One model is run for each Exp_ID
}
}
\description{
\code{fast_analysis} analyzes nucleotide recoding data using maximum likelihood estimation with the L-BFGS-B algorithm
implemented by \code{stats::optim}, combined with analytic solutions to simple Bayesian models to perform
approximate partial pooling. Output includes kinetic parameter estimates in each replicate, kinetic parameter estimates
averaged across replicates, and log-2 fold changes in the degradation rate constant (L2FC(kdeg)).
Averaging takes into account uncertainties estimated using the Fisher Information and estimates
are regularized using analytic solutions of fully Bayesian models. The result is that kdegs are
shrunk towards population means and that uncertainties are shrunk towards a mean-variance trend estimated as part of the analysis.
}
\details{
Unless the user supplies estimates for pnew and pold, the first step of \code{fast_analysis} is to estimate the background
and metabolic label (will refer to as s4U for simplicity, though bakR is compatible with other metabolic labels such as s6G)
induced mutation rates. The former is best performed with a -s4U control sample, that is, a normal RNA-seq sample
that lacks an s4U feed or TimeLapse chemistry conversion of s4U to a C analog. If this sample is missing, both background and
s4U induced mutation rates are estimated from the s4U fed samples. For the s4U mutation rate, features with sufficient read depth,
as defined by the \code{read_cut} parameter, and the highest mutation rates are assumed to be completely labeled. Thus, the
average mutation rate in these features is taken as the estimate of the s4U induced mutation rate in that sample. s4U induced mutation
rates are estimated on a per-sample basis as there is often much more variability in these mutation rates than in the background
mutation rates.

If a -s4U control is included, the background mutation rate is estimated using all features in the control sample(s) with read depths
greater than \code{read_cut}. The average mutation rate among these features is taken as the estimated background mutation rate,
and that background is assumed to be constant for all samples. If a -s4U control is missing, then a strategy similar to that used
to estimate s4U induced mutation rates is used. In this case, the lowest mutation rate features with sufficient read depths are used,
and their average mutation rate is the background mutation rate estimate, as these features are assumed to be almost entirely unlabeled.
Another slightly more computationally intensive but more accurate strategy to estimate mutation rates is to set \code{StanRate} = TRUE.
This will fit a non-hierarchical mixture model to a small subset of transcripts using 'Stan'. The default in \code{bakRFit} is to use
25 transcripts. If \code{StanRate} is TRUE, then a data list must be passed to \code{Stan_data} of the form that appears in the
bakRFit object's Data_lists$Stan_data entry.
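
A minimal sketch of the two default strategies described above, applied to a hypothetical per-feature summary table for a single s4U fed sample. The column names \code{reads} and \code{mutrate} and the helper \code{est_rates} are illustrative assumptions and are not part of bakR:

\preformatted{
# Toy per-feature summary for one s4U fed sample (hypothetical data)
set.seed(1)
feats <- data.frame(
  reads   = rpois(1000, 200),
  mutrate = c(runif(900, 0.001, 0.01), runif(100, 0.04, 0.06))
)

# Hypothetical helper: average the most extreme mutation rates among
# features that pass the read depth filter
est_rates <- function(d, read_cut = 50, features_cut = 50) {
  keep <- d$mutrate[d$reads > read_cut]
  srt  <- sort(keep)
  list(
    pnew = mean(utils::tail(srt, features_cut)),  # assumed fully labeled
    pold = mean(utils::head(srt, features_cut))   # assumed unlabeled (no -s4U control)
  )
}

rates <- est_rates(feats)
}

With a -s4U control, the background rate would instead be the average over all control features that pass the read depth filter.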

Once mutation rates are estimated, the fraction new of each feature in each sample is estimated. The approach utilized is MLE
using the L-BFGS-B algorithm implemented in \code{stats::optim}. The assumed likelihood function is derived from a Poisson mixture
model with rates adjusted according to each feature's empirical U-content (the average number of Us present in sequencing reads mapping
to that feature in a particular sample). Fraction new estimates are then converted to degradation rate constant estimates using
a solution to a simple ordinary differential equation model of RNA metabolism.
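
As a sketch of the model described above (the notation is illustrative and not taken from the bakR source), let n be the number of mutations in a read, U the feature's empirical U-content, and tl the label time. One natural reading of the mixture likelihood for a single read is given first. The second expression is the conversion from fraction new to kdeg under the usual assumption of steady-state, single-exponential decay, which mirrors the pulse-chase formula given for the \code{Chase} argument:

\deqn{P(n) = \mathrm{fn} \cdot \mathrm{Poisson}(n; U \cdot \mathrm{pnew}) + (1 - \mathrm{fn}) \cdot \mathrm{Poisson}(n; U \cdot \mathrm{pold})}{P(n) = fn * Poisson(n; U * pnew) + (1 - fn) * Poisson(n; U * pold)}

\deqn{\mathrm{kdeg} = -\ln(1 - \mathrm{fn}) / \mathrm{tl}}{kdeg = -ln(1 - fn) / tl}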

Once fraction new and kdegs are estimated, the uncertainty in these parameters is estimated using the Fisher Information. In the limit of
large datasets, the variance of the MLE is inversely proportional to the Fisher Information evaluated at the MLE. Mixture models are
typically singular, meaning that the Fisher information matrix is not positive definite and asymptotic results for the variance
do not necessarily hold. As the mutation rates are estimated a priori and fixed to be > 0, these problems are eliminated. In addition, when assessing
the uncertainty of replicate fraction new estimates, the size of the dataset is the raw number of sequencing reads that map to a
particular feature. This number is often large (>100) which increases the validity of invoking asymptotics.
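
The estimation and uncertainty steps can be sketched for a single feature as follows. This is an illustration under stated assumptions rather than bakR's internal code: the data are simulated, the variable names are hypothetical, the mutation rates are treated as known, and the observed information (the Hessian returned by \code{stats::optim}) stands in for the Fisher Information.

\preformatted{
# Toy data (hypothetical): mutation counts for reads from one feature
set.seed(42)
n_reads <- 500
u_cont  <- 30      # assumed average number of Us per read
pnew    <- 0.05    # assumed s4U induced mutation rate
pold    <- 0.001   # assumed background mutation rate
fn_true <- 0.4
is_new  <- rbinom(n_reads, 1, fn_true)
muts    <- rpois(n_reads, u_cont * ifelse(is_new == 1, pnew, pold))

# Negative log-likelihood in theta = logit(fn), with a normal prior
# penalty analogous to the p_mean and p_sd arguments
nll <- function(theta, p_mean = 0, p_sd = 1) {
  fn  <- plogis(theta)
  lik <- fn * dpois(muts, u_cont * pnew) +
    (1 - fn) * dpois(muts, u_cont * pold)
  -sum(log(lik)) - dnorm(theta, p_mean, p_sd, log = TRUE)
}

# Bounded optimization, as with the lower and upper arguments
fit <- stats::optim(0, nll, method = "L-BFGS-B",
                    lower = -7, upper = 7, hessian = TRUE)

logit_fn    <- fit$par
logit_fn_se <- sqrt(1 / fit$hessian[1, 1])  # inverse observed information
}

Because the mutation rates are fixed and the only free parameter is logit(fn), the Hessian is a 1 x 1 matrix and its inverse gives the asymptotic variance directly.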

With kdegs and their uncertainties estimated, replicate estimates are pooled and regularized. There are two key steps in this
downstream analysis. 1st, the uncertainty for each feature is used to fit a linear ln(uncertainty) vs. log10(read depth) trend,
and uncertainties for individual features are shrunk towards the regression line. The uncertainty for each feature is a combination of the
Fisher Information asymptotic uncertainty as well as the amount of variability seen between estimates. Regularization of uncertainty
estimates is performed using the analytic results of a Normal distribution likelihood with known mean and unknown variance and conjugate
priors. The prior parameters are estimated from the regression and amount of variability about the regression line. The strength of
regularization can be tuned by adjusting the \code{prior_weight} parameter, with larger numbers yielding stronger shrinkage towards
the regression line. The 2nd step is to regularize the average kdeg estimates. This is done using the analytic results of a
Normal distribution likelihood model with unknown mean and known variance and conjugate priors. The prior parameters are estimated from the
population wide kdeg distribution (using its mean and standard deviation as the mean and standard deviation of the normal prior).
In the 1st step, the known mean is assumed to be the average kdeg, averaged across replicates and weighted by the number of reads
mapping to the feature in each replicate. In the 2nd step, the known variance is assumed to be that obtained following regularization
of the uncertainty estimates.
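
A sketch of the 2nd step for one feature, using the textbook normal-normal conjugate update. The variable names mirror the columns of \code{Regularized_ests}, but the numbers are made up. Treating the regularized standard deviation as the standard deviation of replicate estimates (so that the mean has variance sd^2/nreps) is an assumption of this sketch:

\preformatted{
# Hypothetical inputs for one feature
avg_log_kdeg <- -1.2   # read-weighted average across replicates
sd_reg       <- 0.30   # regularized replicate standard deviation
nreps        <- 3
theta_o      <- -2.0   # prior mean (population mean of log(kdeg))
sdp          <- 1.0    # prior sd (population sd of log(kdeg))

# Precision-weighted combination of prior and data
prec_prior <- 1 / sdp^2
prec_data  <- nreps / sd_reg^2
log_kdeg_post <- (prec_prior * theta_o + prec_data * avg_log_kdeg) /
  (prec_prior + prec_data)
sd_post <- sqrt(1 / (prec_prior + prec_data))
}

The posterior mean lies between the prior mean and the replicate average, and moves closer to the prior as the regularized uncertainty grows.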

Effect sizes (changes in kdeg) are obtained as the difference in log(kdeg) means between the reference and experimental
sample(s), and the log(kdeg)s are assumed to be independent so that the variance of the effect size is the sum of the
log(kdeg) variances. P-values assessing the significance of the effect size are obtained using a moderated t-test with number
of degrees of freedom determined from the uncertainty regression hyperparameters and are adjusted for multiple testing using the Benjamini-
Hochberg procedure to control false discovery rates (FDRs).
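
As an illustration of the comparison described above, with toy numbers for three features. The degrees of freedom formula is an assumption made for this sketch and is not taken from the bakR source:

\preformatted{
# Hypothetical posterior summaries of log(kdeg) for three features
ref_mean <- c(-1.5, -2.0, -0.8)
ref_sd   <- c(0.15, 0.20, 0.10)
exp_mean <- c(-1.4, -1.0, -0.8)
exp_sd   <- c(0.15, 0.25, 0.12)
nreps <- 3   # replicates per condition
a     <- 4   # extra degrees of freedom (hyperparameter a)

effect <- exp_mean - ref_mean
se     <- sqrt(ref_sd^2 + exp_sd^2)   # independence: variances add
df     <- 2 * (nreps - 1) + 2 * a     # assumed form, for illustration only
pval   <- 2 * stats::pt(-abs(effect / se), df = df)
padj   <- stats::p.adjust(pval, method = "BH")
}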

In some cases, the assumed ODE model of RNA metabolism will not accurately model the dynamics of a biological system being analyzed.
In these cases, it is best to compare logit(fraction new)s directly rather than converting fraction new to log(kdeg).
This analysis strategy is implemented when \code{NSS} is set to TRUE. Comparing logit(fraction new) is only valid
if a single metabolic label time has been used for all samples. For example, if a label time of 1 hour was used for NR-seq
data from WT cells and a 2 hour label time was used in KO cells, this comparison is no longer valid as differences in
logit(fraction new) could stem from differences in kinetics or label times.
}
\examples{
\donttest{

# Simulate small dataset
sim <- Simulate_bakRData(300, nreps = 2)

# Fit fast model to get fast_df
Fit <- bakRFit(sim$bakRData)

# Fit fast model with fast_analysis
Fast_Fit <- fast_analysis(Fit$Data_lists$Fast_df)
}
}
