% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mlim.R
\name{mlim}
\alias{mlim}
\title{missing data imputation with automated machine learning}
\usage{
mlim(
  data = NULL,
  load.mlim = NULL,
  algos = c("ELNET", "DRF"),
  preimpute = "rf",
  preimputed_df = NULL,
  ignore = NULL,
  init = TRUE,
  save.mlim = NULL,
  maxiter = 10L,
  miniter = 2L,
  cv = 10L,
  tuning_time = 180,
  max_models = NULL,
  matching = "AUTO",
  balance = NULL,
  ignore.rank = FALSE,
  weights_column = NULL,
  seed = NULL,
  verbosity = NULL,
  report = NULL,
  iteration_stopping_metric = "RMSE",
  iteration_stopping_tolerance = 0.005,
  stopping_metric = "AUTO",
  stopping_rounds = 3,
  stopping_tolerance = 0.001,
  cpu = -1,
  ram = NULL,
  flush = FALSE,
  shutdown = TRUE,
  sleep = 0.5,
  ...
)
}
\arguments{
\item{data}{a \code{data.frame} or \code{matrix} with missing data to be
imputed. if \code{load.mlim} is provided, this argument will be ignored.}

\item{load.mlim}{an object of class "mlim", which includes the data, arguments,
and settings for re-running the imputation, from where it was
previously stopped. the "mlim" object saves the current state of
the imputation and is particularly recommended for large datasets
or when the user specifies computationally intensive settings
(e.g. specifying several algorithms, increasing tuning time, etc.).}

\item{algos}{character. specify a vector of algorithms to be used
       in the process of auto-tuning. the supported main algorithms are
       \code{"ELNET"}, \code{"RF"},
       \code{"GBM"}, \code{"DL"}, \code{"XGB"} (available for Mac and Linux), and \code{"Ensemble"}.

       the default is \code{c("ELNET", "DRF")}, which tunes fast. note that the
       choice of algorithms to be trained can substantially increase the runtime.
       for advice on algorithm selection visit \url{https://github.com/haghish/mlim}.
       GBM, DL, XGB, and Ensemble take the full given "tuning_time" (see below) to
       tune the best model for imputing the given variable.
       if \code{load.mlim} is provided, this argument will be ignored.}

\item{preimpute}{character. specifies the procedure for handling the missing
data before the imputation begins. the default procedure
is "rf", which models the missing data with a parallel Random Forest
model. possible alternatives are \code{"knn"} or \code{"mm"}.}

\item{preimputed_df}{data.frame. if you have used another software for missing
data imputation, you can still optimize the imputation
by handing the data.frame to this argument, which will
bypass the "preimpute" procedure.}

\item{ignore}{character vector of column names or index of columns that
should be ignored in the process of imputation.}

\item{init}{logical. should h2o Java server be initiated? the default is TRUE.
however, if the Java server is already running, set this argument
to FALSE.}

\item{save.mlim}{filename. if a filename is specified, an \code{mlim} object is
saved after the end of each variable imputation. this object not only
includes the imputed dataframe and estimated cross-validation error, but also
includes the information needed for continuing the imputation,
which is a very useful feature for imputing large datasets with a
long runtime. this argument is activated by default and an
mlim object is stored in the local directory named \code{"mlim.rds"}.}

\item{maxiter}{integer. maximum number of iterations. the default value is \code{10},
but it can be reduced to \code{3} (not recommended, see below).}

\item{miniter}{integer. minimum number of iterations. the default value is
2.}

\item{cv}{integer. specify the number of folds for k-fold Cross-Validation (CV).
values of 10 or higher are recommended. default is 10.}

\item{tuning_time}{integer. maximum runtime (in seconds) for fine-tuning the
imputation model for each variable in each iteration. the default
time is 180 seconds but for a large dataset, you
might need to provide a larger model development
time. this argument also influences \code{max_models},
see below.}

\item{max_models}{integer. maximum number of models that can be generated in
the process of fine-tuning the parameters. this value
defaults to 100, meaning that for imputing each variable in
each iteration, up to 100 models can be fine-tuned. increasing
this value should be consistent with increasing
\code{tuning_time}, allowing the model to spend
more time in the process of individualized fine-tuning.
as a result, the better tuned the model, the more accurate
the imputed values are expected to be.}

\item{matching}{logical or "AUTO". if \code{TRUE}, imputed values are coerced to the
closest of the non-missing values of the variable.
if set to "AUTO", 'mlim' decides whether to match
or not, based on the variable classes. the default is "AUTO".}

\item{balance}{character vector, specifying variable names that should be
balanced before imputation. balancing the prevalence might
decrease the overall accuracy of the imputation, because it
attempts to ensure the representation of the rare outcome.
this argument is optional and intended for advanced users who
impute a severely imbalanced categorical (nominal) variable.}

\item{ignore.rank}{logical. if FALSE (default), ordinal variables
are imputed as continuous integers with regression plus matching
and are reverted to ordinal afterwards. this procedure is
recommended. if TRUE, the rank of the categories will be ignored
and the algorithm will try to optimize for classification accuracy.
WARNING: the latter often results in very high classification accuracy but at
the cost of higher rank error. see the "mlim.error" function
documentation to see how rank error is computed. therefore, if you
intend to carry out analysis on the rank data as numeric, it is
recommended that you set this argument to FALSE.}

\item{weights_column}{non-negative integer. a vector of observation weights
can be provided, which should have one weight for each row
of the dataframe. giving an observation a weight of
zero is equivalent to ignoring that observation in the
model. in contrast, a weight of 2 is equivalent to
repeating that observation twice in the dataframe.
the higher the weight, the more important an observation
becomes in the modeling process. the default is NULL.}

\item{seed}{integer. specify the random generator seed.}

\item{verbosity}{character. controls how much information is printed to console.
the value can be "warn" (default), "info", "debug", or NULL.}

\item{report}{filename. if a filename is specified (e.g. report = "mlim.md"), the \code{"md.log"} R
package is used to generate a Markdown progress report for the
imputation. the format of the report is adapted to the
\code{'verbosity'} argument. the higher the verbosity, the more
technical the report becomes. if verbosity equals "debug", then
a log file is generated, which includes time stamp and shows
the function that has generated the message. otherwise, a
reduced markdown-like report is generated. default is NULL.}

\item{iteration_stopping_metric}{character. specify the error metric used to
decide whether to proceed to the
following iteration or stop the imputation.
the default is \code{"RMSE"}; \code{"MAE"}
(Mean Absolute Error) can also be specified.
the minimum required improvement is set by
\code{iteration_stopping_tolerance}. this criterion is only
applied from the end of the fourth iteration.}

\item{iteration_stopping_tolerance}{numeric. the minimum rate of improvement
in the estimated error metric to qualify the
imputation for another round of iteration,
if the \code{maxiter} is not yet reached.
the default value is 0.005, meaning that
in each iteration, the error must be
reduced by at least 0.5\% of the previous
iteration.}

\item{stopping_metric}{character. the default is \code{"AUTO"}.}

\item{stopping_rounds}{integer. the default is \code{3}.}

\item{stopping_tolerance}{numeric. the default is \code{0.001}.}

\item{cpu}{integer. number of CPUs to be dedicated for the imputation.
the default takes all of the available CPUs.}

\item{ram}{integer. specifies the maximum size, in Gigabytes, of the
memory allocation.
large memory size is particularly advised, especially
for multicore processes. the more you give the more you get!}

\item{flush}{logical (experimental). if TRUE, after each model, the server is
cleaned to retrieve RAM. this feature is in testing mode.}

\item{shutdown}{logical. if TRUE, the h2o server is closed after the imputation.
the default is TRUE.}

\item{sleep}{numeric. number of seconds to wait after each interaction with the h2o
server. the default is 0.5 seconds. larger values might be needed
depending on your computation power or dataset size.}

\item{...}{Arguments passed to \code{h2o.automl()}.
For example, the following arguments are incompatible with \code{ranger}: \code{write.forest}, \code{probability}, \code{split.select.weights}, \code{dependent.variable.name}, and \code{classification}.}
}
\value{
a \code{data.frame} with the imputed data. the estimated imputation error
        from the cross-validation is stored in the data.frame's
        attributes
}
\description{
imputes data.frame with mixed variable types using automated
             machine learning (AutoML)
}
\examples{

\donttest{
data(iris)
irisNA <- mlim.na(iris, p = 0.1, stratify = TRUE, seed = 2022)

# run the default imputation (fastest imputation via 'mlim')
MLIM <- mlim(irisNA)
mlim.error(MLIM, irisNA, iris)
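
# a minimal sketch, assuming the arguments behave as documented above:
# fix the random seed for reproducibility, cap the number of iterations,
# and print more progress information to the console
MLIM <- mlim(irisNA, seed = 2022, maxiter = 5L, verbosity = "info")
mlim.error(MLIM, irisNA, iris)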

# run GBM model and allow 15 minutes of tuning for each variable
MLIM <- mlim(irisNA, algos = "GBM", tuning_time=60*15)
mlim.error(MLIM, irisNA, iris)
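
# a sketch of reusing one h2o server across two runs, assuming 'shutdown'
# and 'init' behave as documented above: keep the server alive after the
# first run, then skip the server initialization in the second run
MLIM <- mlim(irisNA, shutdown = FALSE)
MLIM <- mlim(irisNA, algos = "GBM", tuning_time = 60*5, init = FALSE)
mlim.error(MLIM, irisNA, iris)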
}
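
\donttest{
# a hedged sketch of checkpointing, not a confirmed workflow: it assumes the
# file written via 'save.mlim' is a regular RDS file that readRDS() returns
# as an object of class "mlim". 'checkpoint' is a name chosen for illustration
checkpoint <- tempfile(fileext = ".rds")
MLIM <- mlim(irisNA, save.mlim = checkpoint)

# resume from the saved state; 'data' is ignored when 'load.mlim' is provided
MLIM <- mlim(load.mlim = readRDS(checkpoint))
}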
}
\author{
E. F. Haghish
}
