Title: | Clustering and Prediction using Multi-Task Gaussian Processes with Common Mean |
Version: | 1.2.1 |
Description: | An implementation for the multi-task Gaussian processes with common mean framework. Two main algorithms, called 'Magma' and 'MagmaClust', are available to perform predictions for supervised learning problems, in particular for time series or any functional/continuous data applications. The corresponding articles have been respectively proposed by Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2022) <doi:10.1007/s10994-022-06172-1>, and Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2023) https://jmlr.org/papers/v24/20-1321.html. These approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). 'MagmaClust' is a generalisation of 'Magma' where the tasks are simultaneously clustered into groups, each being associated with a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features (classic kernels, training, prediction) of standard Gaussian processes are also implemented. |
License: | MIT + file LICENSE |
URL: | https://github.com/ArthurLeroy/MagmaClustR, https://arthurleroy.github.io/MagmaClustR/ |
BugReports: | https://github.com/ArthurLeroy/MagmaClustR/issues |
Imports: | broom, dplyr, ggplot2, magrittr, methods, mvtnorm, plyr, purrr, Rcpp, rlang, stats, tibble, tidyr, tidyselect |
Suggests: | gganimate, gifski, gridExtra, knitr, plotly, png, rmarkdown, testthat (≥ 3.0.0), transformr |
LinkingTo: | Rcpp |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Depends: | R (≥ 2.10) |
NeedsCompilation: | yes |
Packaged: | 2024-06-28 20:01:23 UTC; Arthur Leroy |
Author: | Arthur Leroy |
Maintainer: | Arthur Leroy <arthur.leroy.pro@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-06-28 20:20:02 UTC |
MagmaClustR : Clustering and Prediction using Multi-Task Gaussian Processes
Description
The MagmaClustR package implements two main algorithms, called Magma and MagmaClust, using a multi-task GP model to perform predictions for supervised learning problems. These approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). MagmaClust is a generalisation of Magma where the tasks are simultaneously clustered into groups, each being associated with a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features of standard GPs are also implemented.
Details
For a quick introduction to MagmaClustR, please refer to the README at https://github.com/ArthurLeroy/MagmaClustR
Author(s)
Arthur Leroy, Pierre Pathé and Pierre Latouche
Maintainer: Arthur Leroy - arthur.leroy.pro@gmail.com
References
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
MAGMA: Inference and Prediction with Multi-Task Gaussian Processes.
Machine Learning, 2022,
https://link.springer.com/article/10.1007/s10994-022-06172-1
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
Cluster-Specific Predictions with Multi-Task Gaussian Processes.
Journal of Machine Learning Research, 2023,
https://jmlr.org/papers/v24/20-1321.html
Examples
Simulate a dataset, train and predict with Magma
set.seed(4242)
data_magma <- simu_db(M = 11, N = 10, K = 1)
magma_train <- data_magma %>% subset(ID %in% 1:10)
magma_test <- data_magma %>% subset(ID == 11) %>% head(7)
magma_model <- train_magma(data = magma_train)
magma_pred <- pred_magma(data = magma_test, trained_model = magma_model,
grid_inputs = seq(0, 10, 0.01))
Simulate a dataset, train and predict with MagmaClust
set.seed(4242)
data_magmaclust <- simu_db(M = 4, N = 10, K = 3)
list_ID <- unique(data_magmaclust$ID)
magmaclust_train <- data_magmaclust %>% subset(ID %in% list_ID[1:11])
magmaclust_test <- data_magmaclust %>% subset(ID == list_ID[12]) %>%
head(5)
magmaclust_model <- train_magmaclust(data = magmaclust_train)
magmaclust_pred <- pred_magmaclust(data = magmaclust_test,
trained_model = magmaclust_model, grid_inputs = seq(0, 10, 0.01))
Author(s)
Maintainer: Arthur Leroy arthur.leroy.pro@gmail.com (ORCID)
Authors:
Pierre Latouche pierre.latouche@gmail.com
Other contributors:
Pierre Pathé pathepierre@gmail.com [contributor]
Alexia Grenouillat grenouil@insa-toulouse.fr [contributor]
Hugo Lelievre lelievre@insa-toulouse.fr [contributor]
See Also
Useful links:
Report bugs at https://github.com/ArthurLeroy/MagmaClustR/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Round a matrix to make it symmetric
Description
If a matrix is non-symmetric due to numerical errors, round with a decreasing number of digits until the matrix becomes symmetric.
Usage
check_symmetric(mat, digits = 10)
Arguments
mat |
A matrix, possibly non-symmetric. |
digits |
A number, the starting number of digits to round from if
|
Value
A matrix, rounded approximation of mat that is symmetric.
Examples
TRUE
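The rounding strategy described for check_symmetric can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the helper name round_until_symmetric and the stopping rule (giving up once the number of digits drops below zero) are assumptions.

```r
# Hypothetical helper illustrating the idea behind check_symmetric();
# the name and the stopping rule are assumptions made for this sketch.
round_until_symmetric <- function(mat, digits = 10) {
  # Round with fewer and fewer digits until mat equals its transpose exactly.
  while (!all(mat == t(mat)) && digits >= 0) {
    mat <- round(mat, digits)
    digits <- digits - 1
  }
  mat
}

# A matrix whose off-diagonal entries differ by a tiny numerical error.
mat <- matrix(c(1, 0.5 + 1e-12, 0.5, 2), nrow = 2)
sym <- round_until_symmetric(mat)
all(sym == t(sym))
```

Starting from a large number of digits keeps the rounded matrix as close as possible to the original one.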
Invert a matrix using an adaptive jitter term
Description
Invert a matrix using its Cholesky decomposition. If the matrix is (nearly) singular, increase the order of magnitude of the jitter term added to the diagonal until the matrix becomes non-singular.
Usage
chol_inv_jitter(mat, pen_diag)
Arguments
mat |
A matrix, possibly singular. |
pen_diag |
A number, a jitter term to add on the diagonal. |
Value
A matrix, inverse of mat plus an adaptive jitter term added on the diagonal.
Examples
TRUE
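The adaptive jitter idea behind chol_inv_jitter can be sketched with base R functions only. The helper name chol_inv_with_jitter, the multiplication of the jitter by 10 at each failure and the max_tries safeguard are assumptions of this sketch and may differ from what the package does.

```r
# Hypothetical helper: invert through a Cholesky factorisation and enlarge
# the jitter by one order of magnitude each time the factorisation fails.
chol_inv_with_jitter <- function(mat, pen_diag = 1e-10, max_tries = 30) {
  jitter <- pen_diag
  for (i in seq_len(max_tries)) {
    inv <- tryCatch(
      chol2inv(chol(mat + diag(jitter, nrow(mat)))),
      error = function(e) NULL  # chol() fails on non-positive-definite input
    )
    if (!is.null(inv)) {
      return(list(inv = inv, jitter = jitter))
    }
    jitter <- jitter * 10
  }
  stop("Matrix could not be inverted, even with a large jitter term.")
}

# An exactly singular matrix: the initial jitter is too small to help.
singular <- matrix(1, nrow = 2, ncol = 2)
res <- chol_inv_with_jitter(singular, pen_diag = 1e-20)
res$jitter

# A well-conditioned matrix is inverted at the first attempt.
res_ok <- chol_inv_with_jitter(diag(c(2, 4)))
```

Returning the jitter that was finally used makes it easy to check how much the matrix had to be perturbed.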
Allocate training data into the most probable cluster
Description
Allocate training data into the most probable cluster
Usage
data_allocate_cluster(trained_model)
Arguments
trained_model |
A list, containing the information coming from a
MagmaClust model, previously trained using the
|
Value
The original dataset used to train the MagmaClust model, with additional 'Cluster' and associated 'Proba' columns, indicating the most probable cluster for each individual/task at the end of the training procedure.
Examples
TRUE
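Allocating each individual to its most probable cluster amounts to a row-wise maximum over mixture probabilities. The following base R sketch uses made-up probabilities; the column names 'ID', 'K1' and 'K2' are assumptions chosen for the illustration, not necessarily the names produced by the package.

```r
# Toy mixture probabilities for three individuals and two clusters.
# Column names are assumptions made for this sketch.
mixture <- data.frame(
  ID = c("1", "2", "3"),
  K1 = c(0.9, 0.2, 0.5),
  K2 = c(0.1, 0.8, 0.5)
)

probs <- as.matrix(mixture[, c("K1", "K2")])
# Column index of the largest probability in each row (first one on ties).
best <- max.col(probs, ties.method = "first")

allocation <- data.frame(
  ID = mixture$ID,
  Cluster = colnames(probs)[best],
  Proba = probs[cbind(seq_len(nrow(probs)), best)]
)
allocation
```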
Compute the Multivariate Gaussian likelihood
Description
Modification of the function dmvnorm() from the package mvtnorm, providing an implementation of the Multivariate Gaussian likelihood. This version uses the inverse of the covariance matrix as argument instead of the traditional covariance.
Usage
dmnorm(x, mu, inv_Sigma, log = FALSE)
Arguments
x |
A vector, containing values the likelihood is evaluated on. |
mu |
A vector or matrix, specifying the mean parameter. |
inv_Sigma |
A matrix, specifying the inverse of covariance parameter. |
log |
A logical value, indicating whether we return the log-likelihood. |
Value
A number, corresponding to the Multivariate Gaussian likelihood (or log-likelihood if log = TRUE).
Examples
TRUE
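Working with the inverse covariance avoids inverting the matrix again at each evaluation. A minimal base R sketch of a Gaussian log-density parametrised this way is given below; the helper name log_density_from_precision is hypothetical and the code is not the package's dmnorm.

```r
# Hypothetical helper: Gaussian log-density from the inverse covariance.
log_density_from_precision <- function(x, mu, inv_Sigma) {
  z <- x - mu
  # log|Sigma| = -log|inv_Sigma|
  log_det_Sigma <- -as.numeric(determinant(inv_Sigma, logarithm = TRUE)$modulus)
  quad <- as.numeric(t(z) %*% inv_Sigma %*% z)
  -0.5 * (length(x) * log(2 * pi) + log_det_Sigma + quad)
}

# Diagonal covariance diag(1, 4), hence inverse diag(1, 1/4).
x <- c(0.5, -1)
mu <- c(0, 0)
inv_Sigma <- diag(c(1, 1 / 4))
log_lik <- log_density_from_precision(x, mu, inv_Sigma)
```

With a diagonal covariance, the result can be checked against a sum of univariate dnorm(..., log = TRUE) values.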
Draw a number
Description
Draw uniformly a number within a specified interval
Usage
draw(int)
Arguments
int |
An interval of values we want to draw uniformly in. |
Value
A random number, rounded to 2 decimals.
Examples
TRUE
E-Step of the EM algorithm
Description
Expectation step of the EM algorithm to compute the parameters of the hyper-posterior Gaussian distribution of the mean process in Magma.
Usage
e_step(db, m_0, kern_0, kern_i, hp_0, hp_i, pen_diag)
Arguments
db |
A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 |
A vector, corresponding to the prior mean of the mean GP. |
kern_0 |
A kernel function, associated with the mean GP. |
kern_i |
A kernel function, associated with the individual GPs. |
hp_0 |
A named vector, tibble or data frame of hyper-parameters
associated with |
hp_i |
A tibble or data frame of hyper-parameters
associated with |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements mean, a tibble containing the Input and associated Output of the hyper-posterior's mean parameter, and cov, the hyper-posterior's covariance matrix.
Examples
TRUE
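The hyper-posterior of the mean process is Gaussian, with a precision equal to the prior precision plus the sum of the individual precisions. The base R sketch below illustrates this update under strong simplifying assumptions that are not those of the package: all individuals are observed on the same inputs, they share a single covariance matrix Psi, and a textbook squared-exponential parametrisation is used. All names and numbers are made up for the illustration.

```r
# Textbook squared-exponential covariance (assumed parametrisation).
se_cov <- function(x, variance, lengthscale) {
  variance * exp(-0.5 * outer(x, x, "-")^2 / lengthscale^2)
}

inputs <- seq(0, 1, length.out = 5)
m_0 <- rep(0, length(inputs))                   # prior mean of the mean GP
K_0 <- se_cov(inputs, 1, 0.5) + diag(1e-6, 5)   # prior covariance of the mean GP
Psi <- se_cov(inputs, 0.5, 0.3) + diag(0.1, 5)  # individual covariance plus noise

# Toy outputs for M = 3 individuals (made-up numbers).
outputs <- list(
  c(0.1, 0.4, 0.8, 0.5, 0.2),
  c(0.0, 0.5, 0.9, 0.6, 0.1),
  c(0.2, 0.3, 0.7, 0.4, 0.3)
)

inv_K_0 <- solve(K_0)
inv_Psi <- solve(Psi)

# Posterior precision = prior precision + sum of individual precisions.
post_cov <- solve(inv_K_0 + length(outputs) * inv_Psi)
# Posterior mean = post_cov %*% (precision-weighted prior mean + data terms).
post_mean <- post_cov %*% (inv_K_0 %*% m_0 + inv_Psi %*% Reduce(`+`, outputs))
```

Observing more individuals adds precision, so the hyper-posterior variances are smaller than the prior ones.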
Penalised elbo for multiple mean GPs with common HPs
Description
Penalised elbo for multiple mean GPs with common HPs
Usage
elbo_GP_mod_common_hp_k(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A list of the K mean GPs at union of observed timestamps. |
kern |
A kernel function used to compute the covariance matrix at corresponding timestamps. |
post_cov |
A List of the K posterior covariance of the mean GP (mu_k). Used to compute correction term (cor_term). |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for the sum of the k mean GPs with common HPs.
Examples
TRUE
Evidence Lower Bound for a mixture of GPs
Description
Evidence Lower Bound for a mixture of GPs
Usage
elbo_clust_multi_GP(hp, db, hyperpost, kern, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost |
List of parameters for the K mean GPs. |
kern |
A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for a mixture of GPs
Examples
TRUE
Penalised elbo for multiple individual GPs with common HPs
Description
Penalised elbo for multiple individual GPs with common HPs
Usage
elbo_clust_multi_GP_common_hp_i(hp, db, hyperpost, kern, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost |
List of parameters for the K mean Gaussian processes. |
kern |
A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for the sum of the M individual GPs with common HPs.
Examples
TRUE
Evidence Lower Bound maximised in MagmaClust
Description
Evidence Lower Bound maximised in MagmaClust
Usage
elbo_monitoring_VEM(hp_k, hp_i, db, kern_i, kern_k, hyperpost, m_k, pen_diag)
Arguments
hp_k |
A tibble, data frame or named vector of hyper-parameters for each cluster. |
hp_i |
A tibble, data frame or named vector of hyper-parameters for each individual. |
db |
A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
kern_i |
Kernel used to compute the covariance matrix of individual GPs at corresponding inputs. |
kern_k |
Kernel used to compute the covariance matrix of the mean GPs at corresponding inputs. |
hyperpost |
A list of parameters for the variational distributions of the K mean GPs. |
m_k |
Prior value of the mean parameter of the mean GPs (mu_k). Length = 1 or nrow(db). |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
Value of the elbo that is maximised during the VEM algorithm used for training in MagmaClust.
Examples
TRUE
Expand a grid of inputs
Description
Expand a grid of inputs
Usage
expand_grid_inputs(Input, ...)
Arguments
Input |
A vector of inputs. |
... |
As many vectors of covariates as desired. We advise giving explicit names when using the function. |
Value
A tibble containing all combinations of values of the parameters.
Examples
TRUE
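The idea of expanding a grid over a reference input and named covariates can be illustrated with base R's expand.grid(). This is a stand-in for illustration only: it returns a data frame rather than a tibble, and the covariate name 'Covariate' is an assumption.

```r
# All combinations of a reference input and one explicitly named covariate.
grid <- expand.grid(
  Input = seq(0, 1, by = 0.5),
  Covariate = c(-1, 1)
)
grid
```

The number of rows is the product of the lengths of the vectors, so fine grids with several covariates grow quickly.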
Gradient of the logLikelihood of a Gaussian Process
Description
Gradient of the logLikelihood of a Gaussian Process
Usage
gr_GP(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GP at the reference inputs. |
kern |
A kernel function. |
post_cov |
(optional) A matrix, corresponding to covariance parameter of the hyper-posterior. Used to compute the hyper-prior distribution of a new individual in Magma. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters gradients for the Gaussian log-Likelihood (where the covariance can be the sum of the individual and the hyper-posterior's mean process covariances).
Examples
TRUE
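For a Gaussian log-likelihood with covariance K depending on a hyper-parameter, the gradient has the classical closed form 0.5 * tr((a a' - K^-1) dK), where a = K^-1 (y - mean). The base R sketch below checks this formula against a finite difference for the lengthscale of a textbook squared-exponential kernel with a zero mean. The function names, the parametrisation and the fixed noise term are assumptions of the sketch, not the package's code.

```r
# Gaussian log-likelihood of y under a zero-mean GP (assumed SE kernel plus noise).
gp_loglik <- function(l, x, y, v = 1, noise = 0.1) {
  D2 <- outer(x, x, "-")^2
  K <- v * exp(-0.5 * D2 / l^2) + diag(noise, length(x))
  log_det <- as.numeric(determinant(K, logarithm = TRUE)$modulus)
  quad <- as.numeric(t(y) %*% solve(K) %*% y)
  -0.5 * (length(x) * log(2 * pi) + log_det + quad)
}

# Analytical derivative of the log-likelihood with respect to the lengthscale l.
gp_loglik_grad <- function(l, x, y, v = 1, noise = 0.1) {
  D2 <- outer(x, x, "-")^2
  K_se <- v * exp(-0.5 * D2 / l^2)
  inv_K <- solve(K_se + diag(noise, length(x)))
  alpha <- inv_K %*% y
  dK <- K_se * D2 / l^3  # derivative of the kernel matrix with respect to l
  # sum(A * B) equals tr(A %*% B) when B is symmetric.
  0.5 * sum((alpha %*% t(alpha) - inv_K) * dK)
}

x <- c(0, 0.5, 1, 1.5)
y <- c(0.1, 0.4, -0.2, 0.3)
h <- 1e-5
numeric_grad <- (gp_loglik(0.7 + h, x, y) - gp_loglik(0.7 - h, x, y)) / (2 * h)
analytic_grad <- gp_loglik_grad(0.7, x, y)
```

Comparing analytical and numerical gradients in this way is a cheap sanity check before handing a gradient to an optimiser.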
Gradient of the modified logLikelihood for GPs in Magma
Description
Gradient of the modified logLikelihood for GPs in Magma
Usage
gr_GP_mod(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GPs at the reference inputs. |
kern |
A kernel function. |
post_cov |
A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters gradients for the modified Gaussian log-Likelihood involved in Magma.
Examples
TRUE
Gradient of the modified logLikelihood with common HPs for GPs in Magma
Description
Gradient of the modified logLikelihood with common HPs for GPs in Magma
Usage
gr_GP_mod_common_hp(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble or data frame containing hyper-parameters for all individuals. |
db |
A tibble containing the values we want to compute the logL on. Required columns: ID, Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GPs at the reference inputs. |
kern |
A kernel function. |
post_cov |
A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the modified Gaussian log-Likelihood involved in Magma with the 'common HP' setting.
Examples
TRUE
Gradient of the penalised elbo for multiple mean GPs with common HPs
Description
Gradient of the penalised elbo for multiple mean GPs with common HPs
Usage
gr_GP_mod_common_hp_k(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A list of the k means of the GPs at union of observed timestamps. |
kern |
A kernel function. |
post_cov |
A list of the k posterior covariance of the mean GP (mu_k). Used to compute correction term (cor_term). |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for the sum of the k mean GPs with common HPs.
Examples
TRUE
Gradient of the elbo for a mixture of GPs
Description
Gradient of the elbo for a mixture of GPs
Usage
gr_clust_multi_GP(hp, db, hyperpost, kern, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost |
List of parameters for the K mean Gaussian processes. |
kern |
A kernel function. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for a mixture of GPs
Examples
TRUE
Gradient of the penalised elbo for multiple individual GPs with common HPs
Description
Gradient of the penalised elbo for multiple individual GPs with common HPs
Usage
gr_clust_multi_GP_common_hp_i(hp, db, hyperpost, kern, pen_diag = NULL)
Arguments
hp |
A tibble, data frame or named vector of hyper-parameters. |
db |
A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost |
List of parameters for the K mean Gaussian processes. |
kern |
A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for the sum of the M individual GPs with common HPs.
Examples
TRUE
Gradient of the mixture of Gaussian likelihoods
Description
Compute the gradient of a sum of Gaussian log-likelihoods, weighted by their mixture probabilities.
Usage
gr_sum_logL_GP_clust(hp, db, mixture, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector of hyper-parameters. |
db |
A tibble containing data we want to evaluate the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mixture |
A tibble or data frame, indicating the mixture probabilities of each cluster for the new individual/task. |
mean |
A list of hyper-posterior mean parameters for all clusters. |
kern |
A kernel function. |
post_cov |
A list of hyper-posterior covariance parameters for all clusters. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the mixture of Gaussian log-likelihoods involved in the prediction step of MagmaClust.
Examples
TRUE
Generate random hyper-parameters
Description
Generate a set of random hyper-parameters, specific to the chosen type of kernel, under the format that is used in Magma.
Usage
hp(
kern = "SE",
list_ID = NULL,
list_hp = NULL,
noise = FALSE,
common_hp = FALSE
)
Arguments
kern |
A function, or a character string indicating the chosen type of kernel among:
In case of a custom kernel function, the argument |
list_ID |
A vector, associating an |
list_hp |
A vector of characters, providing the name of each
hyper-parameter, in case where |
noise |
A logical value, indicating whether a 'noise' hyper-parameter should be included. |
common_hp |
A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
Value
A tibble, providing a set of random hyper-parameters associated with the kernel specified through the argument kern.
Examples
TRUE
Compute the hyper-posterior distribution in Magma
Description
Compute the parameters of the hyper-posterior Gaussian distribution of the mean process in Magma (similarly to the expectation step of the EM algorithm used for learning). This hyper-posterior distribution, evaluated on a grid of inputs provided through the grid_inputs argument, is a key component for making predictions in Magma, and is required in the function pred_magma.
Usage
hyperposterior(
trained_model = NULL,
data = NULL,
hp_0 = NULL,
hp_i = NULL,
kern_0 = NULL,
kern_i = NULL,
prior_mean = NULL,
grid_inputs = NULL,
pen_diag = 1e-10
)
Arguments
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
data |
A tibble or data frame. Required columns: 'Input',
'Output'. Additional columns for covariates can be specified.
The 'Input' column should define the variable that is used as
reference for the observations (e.g. time for longitudinal data). The
'Output' column specifies the observed values (the response
variable). The data frame can also provide as many covariates as desired,
with no constraints on the column names. These covariates are additional
inputs (explanatory variables) of the models that are also observed at
each reference 'Input'. Recovered from |
hp_0 |
A named vector, tibble or data frame of hyper-parameters
associated with |
hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_0 |
A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE", "PERIO" and "RQ" are also available here). Recovered from
|
prior_mean |
Hyper-prior mean parameter of the mean GP. This argument can be specified under various formats, such as:
|
grid_inputs |
A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list gathering the parameters of the mean processes' hyper-posterior distributions, namely:
mean: A tibble, the hyper-posterior mean parameter evaluated at each training Input.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at Input for the mean process' hyper-posterior distribution under a format that allows the direct visualisation as a GP prediction.
Examples
TRUE
Compute the hyper-posterior distribution for each cluster in MagmaClust
Description
Recompute the E-step of the VEM algorithm in MagmaClust for a new set of reference Input. Once training is completed, it can be necessary to evaluate the hyper-posterior distributions of the mean processes at specific locations, for which we want to make predictions. This process is directly implemented in the pred_magmaclust function, but the user might want to use hyperposterior_clust for tailored control of the prediction procedure.
Usage
hyperposterior_clust(
trained_model = NULL,
data = NULL,
mixture = NULL,
hp_k = NULL,
hp_i = NULL,
kern_k = NULL,
kern_i = NULL,
prior_mean_k = NULL,
grid_inputs = NULL,
pen_diag = 1e-10
)
Arguments
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
data |
A tibble or data frame. Required columns: |
mixture |
A tibble or data frame, indicating the mixture probabilities
of each cluster for each individual. Required column: |
hp_k |
A tibble or data frame of hyper-parameters
associated with |
hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_k |
A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE", "LIN", "PERIO" and "RQ" are also available here). Recovered from
|
prior_mean_k |
The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. This argument can be specified under various formats, such as:
|
grid_inputs |
A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the hyper-posterior mean parameters evaluated at each Input.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
Examples
TRUE
Run a k-means algorithm to initialise clusters' allocation
Description
Run a k-means algorithm to initialise clusters' allocation
Usage
ini_kmeans(data, k, nstart = 50, summary = FALSE)
Arguments
data |
A tibble containing common Input and associated Output values to cluster. |
k |
A number, indicating how many clusters are assumed when running the kmeans algorithm. |
nstart |
A number, indicating how many re-starts of kmeans are set. |
summary |
A boolean, indicating whether we want an outcome summary. |
Value
A tibble containing the initial clustering obtained through kmeans.
Examples
TRUE
Mixture initialisation with kmeans
Description
Provide an initial kmeans allocation of the individuals/tasks in a dataset into a definite number of clusters, and return the associated mixture probabilities.
Usage
ini_mixture(data, k, name_clust = NULL, nstart = 50)
Arguments
data |
A tibble or data frame. Required columns: |
k |
A number, indicating the number of clusters. |
name_clust |
A vector of characters. Each element should correspond to the name of one cluster. |
nstart |
A number, indicating how many restarts are used in the underlying kmeans algorithm. |
Value
A tibble indicating, for each ID, the cluster to which it belongs after a kmeans initialisation.
Examples
TRUE
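A k-means initialisation followed by a conversion into mixture probabilities can be sketched with stats::kmeans(). The toy data, the row and column names and the one-hot encoding of the hard allocation are assumptions of this sketch; the package may format its output differently.

```r
set.seed(42)
# Toy data: 6 individuals observed on 4 common inputs, one row per individual.
# Rows 1-3 fluctuate around 0 and rows 4-6 around 10 (made-up numbers).
outputs <- rbind(
  matrix(rnorm(12, mean = 0, sd = 0.1), nrow = 3),
  matrix(rnorm(12, mean = 10, sd = 0.1), nrow = 3)
)
rownames(outputs) <- paste0("ID_", 1:6)

k <- 2
km <- stats::kmeans(outputs, centers = k, nstart = 50)

# One-hot encoding of the hard k-means allocation as mixture probabilities.
mixture <- matrix(
  0,
  nrow = nrow(outputs), ncol = k,
  dimnames = list(rownames(outputs), paste0("K", 1:k))
)
mixture[cbind(seq_len(nrow(outputs)), km$cluster)] <- 1
mixture
```

Several restarts (nstart) reduce the dependence of the initial clustering on the random starting centres.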
Create covariance matrix from a kernel
Description
kern_to_cov() creates a covariance matrix between input values (that could be either scalars or vectors) evaluated within a kernel function, which is characterised by specified hyper-parameters. This matrix is a finite-dimensional evaluation of the infinite-dimensional covariance structure of a GP, defined thanks to this kernel.
Usage
kern_to_cov(input, kern = "SE", hp, deriv = NULL, input_2 = NULL)
Arguments
input |
A vector, matrix, data frame or tibble containing all inputs for one individual. If a vector, the elements are used as reference; otherwise, one column should be named 'Input' to indicate that it represents the reference (e.g. 'Input' would contain the timestamps in time-series applications). The other columns are considered as being covariates. If no column is named 'Input', the first one is used by default. |
kern |
A kernel function. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hp |
A list, data frame or tibble containing the hyper-parameters used
in the kernel. The name of the elements (or columns) should correspond
exactly to those used in the kernel definition. If |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the covariance matrix. |
input_2 |
(optional) A vector, matrix, data frame or tibble under the
same format as |
Value
A covariance matrix, where elements are evaluations of the associated kernel for each pair of reference inputs.
Examples
TRUE
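Building a covariance matrix from a kernel means evaluating the kernel on every pair of inputs, where each input may be a vector made of a reference value and covariates. The base R sketch below illustrates this with a textbook squared-exponential kernel. The helper names se_kernel and cov_from_kernel, and the kernel parametrisation, are assumptions and may differ from the package's "SE" kernel and from kern_to_cov.

```r
# Textbook squared-exponential kernel between two input vectors (assumed form).
se_kernel <- function(x, y, variance = 1, lengthscale = 1) {
  variance * exp(-0.5 * sum((x - y)^2) / lengthscale^2)
}

# Hypothetical helper: evaluate a kernel on every pair of rows of two inputs.
cov_from_kernel <- function(input, kern, input_2 = input, ...) {
  input <- as.matrix(input)
  input_2 <- as.matrix(input_2)
  out <- matrix(NA_real_, nrow(input), nrow(input_2))
  for (i in seq_len(nrow(input))) {
    for (j in seq_len(nrow(input_2))) {
      out[i, j] <- kern(input[i, ], input_2[j, ], ...)
    }
  }
  out
}

# One reference column and one covariate column (made-up values).
inputs <- data.frame(Input = c(0, 1, 2), Covariate = c(5, 5, 6))
cov_mat <- cov_from_kernel(inputs, se_kernel, lengthscale = 2)

# Cross-covariance between all inputs and the first two of them.
cross_cov <- cov_from_kernel(inputs, se_kernel, input_2 = inputs[1:2, ],
                             lengthscale = 2)
```

The double loop favours readability; a vectorised kernel would be preferable for large inputs.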
Create inverse of a covariance matrix from a kernel
Description
kern_to_inv() creates the inverse of a covariance matrix between input values (that could be either scalars or vectors) evaluated within a kernel function, which is characterised by specified hyper-parameters. This matrix is a finite-dimensional evaluation of the infinite-dimensional covariance structure of a GP, defined thanks to this kernel.
Usage
kern_to_inv(input, kern, hp, pen_diag = 1e-10, deriv = NULL)
Arguments
input |
A vector, matrix, data frame or tibble containing all inputs for one individual. If a vector, the elements are used as reference; otherwise, one column should be named 'Input' to indicate that it represents the reference (e.g. 'Input' would contain the timestamps in time-series applications). The other columns are considered as being covariates. If no column is named 'Input', the first one is used by default. |
kern |
A kernel function. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hp |
A list, data frame or tibble containing the hyper-parameters used in the kernel. The name of the elements (or columns) should correspond exactly to those used in the kernel definition. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the inverse covariance matrix. |
Value
The inverse of a covariance matrix, whose elements are evaluations of the associated kernel for each pair of reference inputs.
Examples
TRUE
Linear Kernel
Description
Linear Kernel
Usage
lin_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x |
A vector (or matrix if vectorized = TRUE) of inputs. |
y |
A vector (or matrix if vectorized = TRUE) of inputs. |
hp |
A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'lin_slope' and 'lin_offset'. |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized |
A logical value, indicating whether the function provides a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
Examples
TRUE
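As a point of comparison, a textbook linear kernel can be written in one line of base R. The function name linear_kernel and the parametrisation through slope and offset are assumptions: the package's 'lin_slope' and 'lin_offset' hyper-parameters may be defined on a different scale.

```r
# Textbook linear kernel (assumed parametrisation): slope * <x, y> + offset.
linear_kernel <- function(x, y, slope = 1, offset = 0) {
  slope * sum(x * y) + offset
}

value <- linear_kernel(c(1, 2), c(3, 4), slope = 2, offset = 1)

# For scalar inputs, the whole covariance matrix follows from outer().
x <- c(0, 1, 2)
gram <- 2 * outer(x, x) + 1
```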
Compute a covariance matrix for multiple individuals
Description
Compute the covariance matrices associated with all individuals in the database, taking into account their specific inputs and hyper-parameters.
Usage
list_kern_to_cov(data, kern, hp, deriv = NULL)
Arguments
data |
A tibble or data frame of input data. Required column: 'ID'. Suggested column: 'Input' (for indicating the reference input). |
kern |
A kernel function. |
hp |
A tibble or data frame, containing the hyper-parameters associated with each individual. |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the list of covariance matrices. |
Value
A named list containing all of the covariance matrices.
Examples
TRUE
Compute an inverse covariance matrix for multiple individuals
Description
Compute the inverse covariance matrices associated with all individuals in the database, taking into account their specific inputs and hyper-parameters.
Usage
list_kern_to_inv(db, kern, hp, pen_diag, deriv = NULL)
Arguments
db |
A tibble or data frame of input data. Required column: 'ID'. Suggested column: 'Input' (for indicating the reference input). |
kern |
A kernel function. |
hp |
A tibble or data frame, containing the hyper-parameters associated with each individual. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the list of inverse covariance matrices. |
Value
A named list containing all of the inverse covariance matrices.
Examples
TRUE
Log-Likelihood function of a Gaussian Process
Description
Log-Likelihood function of a Gaussian Process
Usage
logL_GP(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector containing hyper-parameters. |
db |
A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GP at the reference inputs. |
kern |
A kernel function. |
post_cov |
(optional) A matrix, corresponding to covariance parameter of the hyper-posterior. Used to compute the hyper-prior distribution of a new individual in Magma. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the Gaussian log-likelihood (where the covariance can be the sum of the individual covariance and the covariance of the hyper-posterior mean process).
Examples
TRUE
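For reference, a multivariate Gaussian log-likelihood of this kind can be sketched in base R as below. The function name gauss_loglik and its arguments are hypothetical, and the package may differ in sign convention or parametrisation.

```r
# Log-density of y under N(mean, cov), computed through a Cholesky factor
gauss_loglik <- function(y, mean, cov, pen_diag = 1e-10) {
  n <- length(y)
  # chol() returns the upper-triangular R such that t(R) %*% R = cov
  chol_cov <- chol(cov + diag(pen_diag, n))
  # Solve t(R) z = (y - mean), so that sum(z^2) is the quadratic form
  z <- backsolve(chol_cov, y - mean, transpose = TRUE)
  # The log-determinant of cov is twice the sum of log(diag(R))
  -0.5 * (n * log(2 * pi) + 2 * sum(log(diag(chol_cov))) + sum(z^2))
}

# With a diagonal covariance, the result is a sum of univariate log-densities
ll <- gauss_loglik(c(0.3, -1.2), mean = c(0, 0), cov = diag(c(1, 4)), pen_diag = 0)
```

When a hyper-posterior covariance is supplied, the same computation would simply be applied to the sum of the two covariance matrices.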
Modified log-Likelihood function for GPs
Description
Log-Likelihood function involved in Magma during the maximisation step of the training. The log-Likelihood is defined as a simple Gaussian log-likelihood plus a correction trace term.
Usage
logL_GP_mod(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble, data frame or named vector of hyper-parameters. |
db |
A tibble containing values we want to compute logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GP at the reference inputs. |
kern |
A kernel function. |
post_cov |
A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the modified Gaussian log-Likelihood defined in Magma.
Examples
TRUE
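A minimal sketch of a Gaussian log-likelihood with a trace correction is given below. It assumes the correction takes the form -0.5 * tr(post_cov %*% solve(cov)); the function name mod_loglik is hypothetical, and the package's implementation may differ (for instance in sign, if the quantity is minimised).

```r
# Gaussian log-likelihood minus half the trace of post_cov times the inverse
# covariance (assumed form of the correction term)
mod_loglik <- function(y, mean, cov, post_cov) {
  n <- length(y)
  inv <- chol2inv(chol(cov))
  resid <- y - mean
  log_det <- as.numeric(determinant(cov, logarithm = TRUE)$modulus)
  gauss <- -0.5 * (n * log(2 * pi) + log_det + sum(resid * (inv %*% resid)))
  # For symmetric matrices, tr(post_cov %*% inv) equals sum(inv * post_cov)
  gauss - 0.5 * sum(inv * post_cov)
}

y <- c(0.3, -1.2)
cov <- diag(c(1, 4))
ll_plain <- mod_loglik(y, mean = c(0, 0), cov = cov, post_cov = diag(0, 2))
ll_mod <- mod_loglik(y, mean = c(0, 0), cov = cov, post_cov = diag(1, 2))
```

With a zero post_cov the correction vanishes and the plain Gaussian log-likelihood is recovered.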
Modified log-Likelihood function with common HPs for GPs
Description
Log-Likelihood function involved in Magma during the maximisation step of the training, in the particular case where the hyper-parameters are shared by all individuals. The log-Likelihood is defined as a sum over all individuals of Gaussian log-likelihoods, each with an added correction trace term.
Usage
logL_GP_mod_common_hp(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp |
A tibble or data frame of hyper-parameters. |
db |
A tibble containing the values we want to compute the logL on. Required columns: ID, Input, Output. Additional covariate columns are allowed. |
mean |
A vector, specifying the mean of the GP at the reference inputs. |
kern |
A kernel function. |
post_cov |
A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the modified Gaussian log-Likelihood with common hyper-parameters defined in Magma.
Examples
TRUE
Log-Likelihood for monitoring the EM algorithm in Magma
Description
Log-Likelihood for monitoring the EM algorithm in Magma
Usage
logL_monitoring(
hp_0,
hp_i,
db,
m_0,
kern_0,
kern_i,
post_mean,
post_cov,
pen_diag
)
Arguments
hp_0 |
A named vector, tibble or data frame, containing the hyper-parameters associated with the mean GP. |
hp_i |
A tibble or data frame, containing the hyper-parameters associated with the individual GPs. |
db |
A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 |
A vector, corresponding to the prior mean of the mean GP. |
kern_0 |
A kernel function, associated with the mean GP. |
kern_i |
A kernel function, associated with the individual GPs. |
post_mean |
A tibble, coming out of the E step, containing the Input and associated Output of the hyper-posterior mean parameter. |
post_cov |
A matrix, coming out of the E step, being the hyper-posterior covariance parameter. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, the expectation of the joint log-likelihood of the model. This quantity is supposed to increase at each step of the EM algorithm, and is thus used for monitoring the procedure.
Examples
TRUE
M-Step of the EM algorithm
Description
Maximisation step of the EM algorithm to compute hyper-parameters of all the kernels involved in Magma.
Usage
m_step(
db,
m_0,
kern_0,
kern_i,
old_hp_0,
old_hp_i,
post_mean,
post_cov,
common_hp,
pen_diag
)
Arguments
db |
A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 |
A vector, corresponding to the prior mean of the mean GP. |
kern_0 |
A kernel function, associated with the mean GP. |
kern_i |
A kernel function, associated with the individual GPs. |
old_hp_0 |
A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the mean GP. |
old_hp_i |
A tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the individual GPs. |
post_mean |
A tibble, coming out of the E step, containing the Input and associated Output of the hyper-posterior mean parameter. |
post_cov |
A matrix, coming out of the E step, being the hyper-posterior covariance parameter. |
common_hp |
A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements hp_0, a tibble containing the hyper-parameters associated with the mean GP, and hp_i, a tibble containing the hyper-parameters associated with the individual GPs.
Examples
TRUE
Periodic Kernel
Description
Periodic Kernel
Usage
perio_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x |
A vector (or matrix if vectorized = T) of inputs. |
y |
A vector (or matrix if vectorized = T) of inputs. |
hp |
A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'perio_variance', 'perio_lengthscale', and 'period'. |
deriv |
A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized |
A logical value, indicating whether the function provides
a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
Examples
TRUE
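For orientation, the textbook form of a periodic kernel can be sketched in base R as below. The function periodic_kernel and its arguments are hypothetical; the package's perio_kernel may parametrise 'perio_variance', 'perio_lengthscale' and 'period' differently (for example on a log scale), so this should not be read as its exact formula.

```r
# Standard periodic kernel:
# variance * exp(-2 * sin(pi * |x - y| / period)^2 / lengthscale^2)
periodic_kernel <- function(x, y, variance, lengthscale, period) {
  variance * exp(-2 * sin(pi * abs(x - y) / period)^2 / lengthscale^2)
}

# Inputs separated by exactly one period are perfectly correlated
k_same <- periodic_kernel(0, 0, variance = 2, lengthscale = 1, period = 3)
k_period <- periodic_kernel(0, 3, variance = 2, lengthscale = 1, period = 3)
k_half <- periodic_kernel(0, 1.5, variance = 2, lengthscale = 1, period = 3)
```

The covariance is lowest for inputs half a period apart, which is what makes this kernel suited to repeating patterns.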
Plot smoothed curves of raw data
Description
Display raw data under the Magma format as smoothed curves.
Usage
plot_db(data, cluster = FALSE, legend = FALSE)
Arguments
data |
A data frame or tibble with format: ID, Input, Output. |
cluster |
A boolean indicating whether data should be coloured by cluster. Requires a column named 'Cluster'. |
legend |
A boolean indicating whether the legend should be displayed. |
Value
Graph of smoothed curves of raw data.
Examples
TRUE
Create a GIF of Magma or GP predictions
Description
Create a GIF animation displaying how Magma or classic GP predictions evolve and improve when the number of data points increases.
Usage
plot_gif(
pred_gp,
x_input = NULL,
data = NULL,
data_train = NULL,
prior_mean = NULL,
y_grid = NULL,
heatmap = FALSE,
prob_CI = 0.95,
size_data = 3,
size_data_train = 1,
alpha_data_train = 0.5,
export_gif = FALSE,
path = "gif_gp.gif",
...
)
Arguments
pred_gp |
A tibble, typically coming from the |
x_input |
A vector of character strings, indicating which input should be displayed. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
data_train |
(Optional) A tibble or data frame, containing the training
data of the Magma model. The data set should have the same format as the
|
prior_mean |
(Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated 95% CI are displayed. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
export_gif |
A logical value indicating whether the animation should be exported as a .gif file. |
path |
A character string defining the path where the GIF file should be exported. |
... |
Any additional parameters that can be passed to the function
|
Value
Visualisation of a Magma or GP prediction (optional: display data points, training data points and the prior mean function), where data points are added sequentially for visualising changes in prediction as information increases.
Examples
TRUE
Plot Magma or GP predictions
Description
Display Magma or classic GP predictions. According to the dimension of the inputs, the graph may be a mean curve + Credible Interval or a heatmap of probabilities.
Usage
plot_gp(
pred_gp,
x_input = NULL,
data = NULL,
data_train = NULL,
prior_mean = NULL,
y_grid = NULL,
heatmap = FALSE,
samples = FALSE,
nb_samples = 50,
plot_mean = TRUE,
alpha_samples = 0.3,
prob_CI = 0.95,
size_data = 3,
size_data_train = 1,
alpha_data_train = 0.5
)
plot_magma(
pred_gp,
x_input = NULL,
data = NULL,
data_train = NULL,
prior_mean = NULL,
y_grid = NULL,
heatmap = FALSE,
samples = FALSE,
nb_samples = 50,
plot_mean = TRUE,
alpha_samples = 0.3,
prob_CI = 0.95,
size_data = 3,
size_data_train = 1,
alpha_data_train = 0.5
)
Arguments
pred_gp |
A tibble or data frame, typically coming from
|
x_input |
A vector of character strings, indicating which input should be displayed. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. This argument corresponds to the raw data on which the prediction has been performed. |
data_train |
(Optional) A tibble or data frame, containing the training
data of the Magma model. The data set should have the same format as the
|
prior_mean |
(Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
samples |
A logical value indicating whether the GP prediction should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean |
A logical value, indicating whether the mean prediction
should be displayed on the graph when |
alpha_samples |
A number, controlling transparency of the sample curves. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this argument is set to 1, the Credible Interval is not displayed. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
Value
Visualisation of a Magma or GP prediction (optional: display data points, training data points and the prior mean function). For 1-D inputs, the prediction is represented as a mean curve and its associated 95% Credible Interval, as a collection of samples drawn from the posterior if samples = TRUE, or as a heatmap of probabilities if heatmap = TRUE. For 2-D inputs, the prediction is represented as a heatmap, where each pair of inputs on the x-axis and y-axis is associated with a gradient of colours for the posterior mean values, whereas the uncertainty is indicated by the transparency (the narrower the Credible Interval, the more opaque the associated colour, and vice versa).
Examples
TRUE
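The link between the 'Mean' and 'Var' columns of a prediction and a Credible Interval of level prob_CI can be sketched as follows, assuming a Gaussian predictive distribution. The toy values and the column names CI_inf and CI_sup are hypothetical, and the plotting function's internal computation may differ.

```r
# Toy prediction in the 'Input', 'Mean', 'Var' format
pred <- data.frame(
  Input = c(0, 1, 2),
  Mean = c(0.5, 1.0, 0.2),
  Var = c(0.04, 0.25, 1)
)

prob_CI <- 0.95
# Two-sided Gaussian quantile for the requested level
z <- qnorm((1 + prob_CI) / 2)
pred$CI_inf <- pred$Mean - z * sqrt(pred$Var)
pred$CI_sup <- pred$Mean + z * sqrt(pred$Var)
```

As prob_CI approaches 1, the quantile z grows without bound, which is consistent with the interval no longer being displayed at that level.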
Plot MagmaClust predictions
Description
Display MagmaClust predictions. According to the dimension of the inputs, the graph may be a mean curve (dim inputs = 1) or a heatmap (dim inputs = 2) of probabilities. Moreover, MagmaClust can provide credible intervals only by visualising cluster-specific predictions (e.g. for the most probable cluster). When visualising the full mixture-of-GPs prediction, which can be multimodal, the user should choose between the simple mean function or the full heatmap of probabilities (more informative but slower).
Usage
plot_magmaclust(
pred_clust,
cluster = "all",
x_input = NULL,
data = NULL,
data_train = NULL,
col_clust = FALSE,
prior_mean = NULL,
y_grid = NULL,
heatmap = FALSE,
samples = FALSE,
nb_samples = 50,
plot_mean = TRUE,
alpha_samples = 0.3,
prob_CI = 0.95,
size_data = 3,
size_data_train = 1,
alpha_data_train = 0.5
)
Arguments
pred_clust |
A list of predictions, typically coming from
|
cluster |
A character string, indicating which cluster to plot from. If 'all' (default) the mixture of GPs prediction is displayed as a mean curve (1-D inputs) or a mean heatmap (2-D inputs). Alternatively, if the name of one cluster is provided, the classic mean curve + credible interval is displayed (1-D inputs), or a heatmap with colour gradient for the mean and transparency gradient for the Credible Interval (2-D inputs). |
x_input |
A vector of character strings, indicating which input should be displayed. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: |
data_train |
(Optional) A tibble or data frame, containing the training
data of the MagmaClust model. The data set should have the same format as
the |
col_clust |
A boolean indicating whether backward points are coloured
according to the individuals or to their most probable cluster. If one
wants to colour by clusters, a column |
prior_mean |
(Optional) A list providing, for each cluster, a
tibble containing prior mean parameters of the prediction. This argument
typically comes as an outcome |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP mixture should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
samples |
A logical value indicating whether the GP mixture should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean |
A logical value, indicating whether the mean prediction
should be displayed on the graph when |
alpha_samples |
A number, controlling transparency of the sample curves. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this argument is set to 1, the Credible Interval is not displayed. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
Value
Visualisation of a MagmaClust prediction (optional: display data points, training data points and the prior mean functions). For 1-D inputs, the prediction is represented as a mean curve (and its associated 95% Credible Interval for cluster-specific predictions), or as a heatmap of probabilities if heatmap = TRUE. In the case of MagmaClust, the heatmap representation should be preferred for clarity, although the default display remains the mean curve for quicker execution. For 2-D inputs, the prediction is represented as a heatmap, where each pair of inputs on the x-axis and y-axis is associated with a gradient of colours for the posterior mean values, whereas the uncertainty is indicated by the transparency (the narrower the Credible Interval, the more opaque the associated colour, and vice versa). As for 1-D inputs, Credible Interval information is only available for cluster-specific predictions.
Examples
TRUE
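To see why a heatmap can be more informative than a mean curve for a mixture of GPs, the sketch below evaluates a two-component Gaussian mixture density on a y-axis grid at a single input. All values and names (clust_mean, clust_var, proba, mixture_density) are hypothetical, and this is not the package's plotting code.

```r
# Hypothetical cluster-specific predictions at one input value
clust_mean <- c(K1 = -2, K2 = 3)
clust_var <- c(K1 = 0.5, K2 = 0.5)
proba <- c(K1 = 0.4, K2 = 0.6)

# Mixture density: probability-weighted sum of the Gaussian densities
mixture_density <- function(y) {
  sum(proba * dnorm(y, mean = clust_mean, sd = sqrt(clust_var)))
}

# Grid of 50 values on the y-axis, as in the default y_grid length
y_grid <- seq(-6, 8, length.out = 50)
dens <- sapply(y_grid, mixture_density)

# Mean of the mixture, which here falls between the two modes
mix_mean <- sum(proba * clust_mean)
```

In this toy case the mixture mean lies in a region of low density, which a single mean curve would hide but a heatmap of probabilities would reveal.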
Display realisations from a (mixture of) GP prediction
Description
Display samples drawn from the posterior of a GP, Magma or MagmaClust prediction. According to the dimension of the inputs, the graph may represent curves or a heatmap.
Usage
plot_samples(
pred = NULL,
samples = NULL,
nb_samples = 50,
x_input = NULL,
plot_mean = TRUE,
alpha_samples = 0.3
)
Arguments
pred |
A list, typically coming from |
samples |
A tibble or data frame, containing the samples generated from
a GP, Magma, or MagmaClust prediction. Required columns: |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
x_input |
A vector of character strings, indicating which 'column' should be displayed in the case of multidimensional inputs. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and the y-axis. |
plot_mean |
A logical value, indicating whether the mean prediction should be displayed on the graph. |
alpha_samples |
A number, controlling transparency of the sample curves. |
Value
Graph of samples drawn from a posterior distribution of a GP, Magma, or MagmaClust prediction.
Examples
TRUE
Magma prediction for plotting GIFs
Description
Generate a Magma or classic GP prediction under a format that is compatible with a further GIF visualisation of the results. For a Magma prediction, either the trained_model or hyperpost argument is required. Otherwise, a classic GP prediction is applied and the prior mean can be specified through the mean argument.
Usage
pred_gif(
data,
trained_model = NULL,
grid_inputs = NULL,
hyperpost = NULL,
mean = NULL,
hp = NULL,
kern = "SE",
pen_diag = 1e-10
)
Arguments
data |
A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
hyperpost |
A list, containing the elements 'mean' and 'cov', the
parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should come from a previous learning using
|
mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing Magma or GP predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. An additional 'Index' column is created for the sake of GIF creation using the function plot_gif.
TRUE
Gaussian Process prediction
Description
Compute the posterior distribution of a standard GP, using the formalism of Magma. By providing observed data, the prior mean and covariance matrix (by defining a kernel and its associated hyper-parameters), the mean and covariance parameters of the posterior distribution are computed on the grid of inputs that has been specified. This predictive distribution can be evaluated on any arbitrary inputs since a GP is an infinite-dimensional object.
Usage
pred_gp(
data = NULL,
grid_inputs = NULL,
mean = NULL,
hp = NULL,
kern = "SE",
get_full_cov = FALSE,
plot = TRUE,
pen_diag = 1e-10
)
Arguments
data |
A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. If NULL, the prior GP is returned. |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
get_full_cov |
A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing the GP predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. If the get_full_cov argument is TRUE, the function returns a list, in which the tibble described above is defined as 'pred' and the full posterior covariance matrix is defined as 'cov'.
Examples
TRUE
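The standard GP posterior computation described above can be sketched in base R as follows. The helpers se_kernel and gp_posterior are hypothetical and use a noise-free squared exponential kernel with fixed hyper-parameters; they only mimic the shape of the output ('pred' and 'cov') and are not the package's pred_gp.

```r
# Squared exponential kernel between two vectors of inputs
se_kernel <- function(x, y, variance = 1, lengthscale = 1) {
  variance * exp(-0.5 * outer(x, y, "-")^2 / lengthscale^2)
}

# Posterior mean and covariance of a GP on a grid of inputs
gp_posterior <- function(input, output, grid, prior_mean = 0, pen_diag = 1e-10) {
  k_oo <- se_kernel(input, input) + diag(pen_diag, length(input))
  k_go <- se_kernel(grid, input)
  k_gg <- se_kernel(grid, grid)
  inv_oo <- chol2inv(chol(k_oo))
  post_mean <- prior_mean + as.vector(k_go %*% inv_oo %*% (output - prior_mean))
  post_cov <- k_gg - k_go %*% inv_oo %*% t(k_go)
  list(
    pred = data.frame(Input = grid, Mean = post_mean, Var = diag(post_cov)),
    cov = post_cov
  )
}

res <- gp_posterior(
  input = c(0, 1, 3),
  output = c(0.2, 0.9, -0.4),
  grid = seq(0, 4, by = 0.5)
)
```

Because no noise term is included in this sketch, the posterior mean interpolates the observations and the posterior variance is close to zero at the observed inputs.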
Magma prediction
Description
Compute the posterior predictive distribution in Magma. Providing data of any new individual/task, its trained hyper-parameters and a previously trained Magma model, the predictive distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument.
Usage
pred_magma(
data = NULL,
trained_model = NULL,
grid_inputs = NULL,
hp = NULL,
kern = "SE",
hyperpost = NULL,
get_hyperpost = FALSE,
get_full_cov = FALSE,
plot = TRUE,
pen_diag = 1e-10
)
Arguments
data |
A tibble or data frame. Required columns: 'Input',
'Output'. Additional columns for covariates can be specified.
The 'Input' column should define the variable that is used as
reference for the observations (e.g. time for longitudinal data). The
'Output' column specifies the observed values (the response
variable). The data frame can also provide as many covariates as desired,
with no constraints on the column names. These covariates are additional
inputs (explanatory variables) of the models that are also observed at
each reference 'Input'. If NULL, the mean process from
|
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements 'mean' and 'cov', the
parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should come from a previous learning using
|
get_hyperpost |
A logical value, indicating whether the hyper-posterior distribution of the mean process should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov |
A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing Magma predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. If either the get_full_cov or get_hyperpost argument is TRUE, the function returns a list in which the tibble described above is defined as 'pred_gp', the full posterior covariance matrix as 'cov', and the hyper-posterior distribution of the mean process as 'hyperpost'.
Examples
TRUE
MagmaClust prediction
Description
Compute the posterior predictive distribution in MagmaClust. Providing data from any new individual/task, its trained hyper-parameters and a previously trained MagmaClust model, the multi-task posterior distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument. Due to the nature of the model, the prediction is defined as a mixture of Gaussian distributions. Therefore the present function computes the parameters of the predictive distribution associated with each cluster, as well as the posterior mixture probabilities for this new individual/task.
Usage
pred_magmaclust(
data = NULL,
trained_model = NULL,
grid_inputs = NULL,
mixture = NULL,
hp = NULL,
kern = "SE",
hyperpost = NULL,
prop_mixture = NULL,
get_hyperpost = FALSE,
get_full_cov = TRUE,
plot = TRUE,
pen_diag = 1e-10
)
Arguments
data |
A tibble or data frame. Required columns: |
trained_model |
A list, containing the information coming from a
MagmaClust model, previously trained using the
|
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
mixture |
A tibble or data frame, indicating the mixture probabilities
of each cluster for the new individual/task.
If NULL, the |
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements |
prop_mixture |
A tibble or a named vector of the mixture proportions.
Each name of column or element should refer to a cluster. The value
associated with each cluster is a number between 0 and 1. If both
|
get_hyperpost |
A logical value, indicating whether the hyper-posterior distributions of the mean processes should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov |
A logical value, indicating whether the full posterior covariance matrices should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list of GP prediction results composed of:
pred: A sub-list containing, for each cluster:
  pred_gp: A tibble, representing the GP predictions as two columns Mean and Var, evaluated on the grid_inputs. The column Input and additional covariates columns are associated with each predicted value.
  proba: A number, the posterior probability associated with this cluster.
  cov (if get_full_cov = TRUE): A matrix, the full posterior covariance matrix associated with this cluster.
mixture: A tibble, indicating the mixture probabilities of each cluster for the predicted individual/task.
hyperpost (if get_hyperpost = TRUE): A list, containing the hyper-posterior distributions information useful for visualisation purposes.
Examples
TRUE
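Since the prediction is a mixture of Gaussian distributions, its overall mean and variance can be obtained from the cluster-specific results. The sketch below does this in base R on hypothetical values, with a simplified structure (a list of data frames and a named vector of probabilities) rather than the exact object returned by the function.

```r
# Hypothetical cluster-specific predictions on a common grid of inputs
pred <- list(
  K1 = data.frame(Input = c(0, 1), Mean = c(1, 2), Var = c(0.1, 0.2)),
  K2 = data.frame(Input = c(0, 1), Mean = c(3, 0), Var = c(0.3, 0.1))
)
# Hypothetical posterior mixture probabilities for the new individual
mixture <- c(K1 = 0.25, K2 = 0.75)

# Mixture mean: probability-weighted sum of the cluster means
mix_mean <- Reduce(`+`, lapply(names(pred), function(k) {
  mixture[[k]] * pred[[k]]$Mean
}))

# Mixture variance: weighted second moment minus the squared mixture mean
mix_second <- Reduce(`+`, lapply(names(pred), function(k) {
  mixture[[k]] * (pred[[k]]$Var + pred[[k]]$Mean^2)
}))
mix_var <- mix_second - mix_mean^2
```

The mixture variance is larger than any weighted average of the cluster variances whenever the cluster means disagree, which reflects the uncertainty about cluster membership.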
Indicates the most probable cluster
Description
Indicates the most probable cluster
Usage
proba_max_cluster(mixture)
Arguments
mixture |
A tibble or data frame containing mixture probabilities. |
Value
A tibble, retaining only the most probable cluster. The column Cluster indicates the cluster's name whereas Proba refers to its associated probability. If ID is initially a column of mixture (optional), the function returns the most probable cluster for all the different ID values.
Examples
TRUE
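The selection described above can be sketched in base R as follows. The layout of the mixture table (one column per cluster, named K1 to K3, plus an ID column) is an assumption made for illustration, and this is not the package's implementation.

```r
# Hypothetical mixture probabilities: one row per individual, one column per cluster
mixture <- data.frame(
  ID = c("a", "b"),
  K1 = c(0.7, 0.1),
  K2 = c(0.2, 0.3),
  K3 = c(0.1, 0.6)
)

# Keep only the probability columns and find the largest one on each row
probas <- as.matrix(mixture[, setdiff(names(mixture), "ID")])
best <- max.col(probas, ties.method = "first")

most_probable <- data.frame(
  ID = mixture$ID,
  Cluster = colnames(probas)[best],
  Proba = probas[cbind(seq_len(nrow(probas)), best)]
)
```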
Regularise a grid of inputs in a dataset
Description
Modify the original grid of inputs to make it more 'regular' (in the sense that the interval between each observation is constant, or corresponds to a specific pattern defined by the user). In particular, this function can also be used to summarise several data points into one, at a specific location. In this case, the output values are averaged according to the 'summarise_fct' argument.
Usage
regularize_data(
data,
size_grid = 30,
grid_inputs = NULL,
summarise_fct = base::mean
)
regularise_data(
data,
size_grid = 30,
grid_inputs = NULL,
summarise_fct = base::mean
)
Arguments
data |
A tibble or data frame. Required columns: |
size_grid |
An integer, which indicates the number of equispaced points each column must contain. Each original input value will be collapsed to the closest point of the new regular grid, and the associated outputs are averaged using the 'summarise_fct' function. This argument is used when 'grid_inputs' is left to 'NULL'. Default value is 30. |
grid_inputs |
A data frame, corresponding to a pre-defined grid of
inputs according to which we want to regularise a dataset. Column names
must be similar to those appearing in |
summarise_fct |
A character string or a function. If several similar inputs are associated with different outputs, the user can choose the summarising function for the output among the following: min, max, mean, median. A custom function can be defined if necessary. Default is "mean". |
Value
A data frame, where input columns have been regularised as desired.
Examples
data = tibble::tibble(ID = 1, Input = 0:100, Output = -50:50)
## Define a 1D input grid of 10 points
regularize_data(data, size_grid = 10)
## Define a 1D custom grid
my_grid = tibble::tibble(Input = c(5, 10, 25, 50, 100))
regularize_data(data, grid_inputs = my_grid)
## Define a 2D input grid of 5x5 points
data_2D = cbind(ID = 1, expand.grid(Input=1:10, Input2=1:10), Output = 1:100)
regularize_data(data_2D, size_grid = 5)
## Define a 2D custom input grid
my_grid_2D = MagmaClustR::expand_grid_inputs(c(2, 4, 8), 'Input2' = c(3, 5))
regularize_data(data_2D, grid_inputs = my_grid_2D)
Rational Quadratic Kernel
Description
Rational Quadratic Kernel
Usage
rq_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x |
A vector (or matrix if vectorized = T) of inputs. |
y |
A vector (or matrix if vectorized = T) of inputs. |
hp |
A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'rq_variance', 'rq_lengthscale', and 'rq_scale'. |
deriv |
A character, indicating with respect to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized |
A logical value, indicating whether the function provides
a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
Examples
TRUE
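For readers unfamiliar with this kernel, here is a minimal sketch of the textbook rational quadratic form in base R. The function name rq_textbook and its natural-scale arguments are assumptions for illustration. The package's own parametrisation of 'rq_variance', 'rq_lengthscale' and 'rq_scale' may differ (for instance through a transformation of the hyper-parameters).

```r
# Textbook rational quadratic kernel (assumed parametrisation, natural scale):
# k(x, y) = variance * (1 + |x - y|^2 / (2 * scale * lengthscale^2))^(-scale)
rq_textbook <- function(x, y, variance, lengthscale, scale) {
  sq_dist <- sum((x - y)^2)
  variance * (1 + sq_dist / (2 * scale * lengthscale^2))^(-scale)
}

# Identical inputs give the variance; distant inputs give a small covariance
k_same <- rq_textbook(1, 1, variance = 2, lengthscale = 1, scale = 1)
k_far <- rq_textbook(0, 10, variance = 2, lengthscale = 1, scale = 1)
```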
Draw samples from a posterior GP/Magma distribution
Description
Draw samples from a posterior GP/Magma distribution
Usage
sample_gp(pred_gp, nb_samples = 50)
sample_magma(pred_gp, nb_samples = 50)
Arguments
pred_gp |
A list, typically coming from
|
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
Value
A tibble or data frame, containing the samples generated from a GP prediction. Format: Input, Sample, Output.
Examples
TRUE
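The idea of drawing samples from a Gaussian posterior can be sketched with base R alone, assuming a posterior mean vector and a full covariance matrix are available. The helper draw_gaussian_samples and its jitter argument are hypothetical and only mimic the Input, Sample, Output format described above. This is not the package's code.

```r
# Draw samples from a multivariate Gaussian N(mean, cov) with base R only.
# 'draw_gaussian_samples' is a hypothetical helper, not the package function.
draw_gaussian_samples <- function(input, mean, cov, nb_samples = 50, jitter = 1e-10) {
  n <- length(mean)
  # Upper-triangular Cholesky factor R, such that t(R) %*% R = cov + jitter * I
  chol_factor <- chol(cov + diag(jitter, n))
  # Each column of 'z' is a vector of independent standard normal draws
  z <- matrix(rnorm(n * nb_samples), nrow = n)
  draws <- mean + t(chol_factor) %*% z  # n x nb_samples matrix of samples
  # Long format: one row per (Input, Sample) pair
  data.frame(
    Input = rep(input, times = nb_samples),
    Sample = rep(seq_len(nb_samples), each = n),
    Output = as.vector(draws)
  )
}

set.seed(1)
input <- seq(0, 1, length.out = 5)
cov <- exp(-0.5 * outer(input, input, "-")^2)  # illustrative covariance matrix
samples <- draw_gaussian_samples(input, mean = rep(0, 5), cov = cov, nb_samples = 3)
```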
Draw samples from a MagmaClust posterior distribution
Description
Draw samples from a MagmaClust posterior distribution
Usage
sample_magmaclust(pred_clust, nb_samples = 50)
Arguments
pred_clust |
A list, typically coming from
|
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
Value
A tibble or data frame, containing the samples generated from a GP prediction. Format: Cluster, Proba, Input, Sample, Output.
Examples
TRUE
Squared Exponential Kernel
Description
Squared Exponential Kernel
Usage
se_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x |
A vector (or matrix if vectorized = T) of inputs. |
y |
A vector (or matrix if vectorized = T) of inputs. |
hp |
A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'se_variance', 'se_lengthscale'. |
deriv |
A character, indicating with respect to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized |
A logical value, indicating whether the function provides
a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
Examples
TRUE
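As an illustration only, the textbook squared exponential kernel can be written in a few lines of base R and used to build a covariance matrix over a grid. The name se_textbook and the natural-scale arguments are assumptions. The package's 'se_variance' and 'se_lengthscale' may be parametrised differently.

```r
# Textbook squared exponential kernel (assumed parametrisation, natural scale):
# k(x, y) = variance * exp(-|x - y|^2 / (2 * lengthscale^2))
se_textbook <- function(x, y, variance, lengthscale) {
  variance * exp(-sum((x - y)^2) / (2 * lengthscale^2))
}

# Covariance matrix over a small 1D grid of inputs
inputs <- c(0, 1, 2)
cov_mat <- outer(
  inputs, inputs,
  Vectorize(function(a, b) se_textbook(a, b, variance = 1, lengthscale = 1))
)
```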
Select the optimal number of clusters
Description
In MagmaClust, as for any clustering method, the number K of clusters has to be provided as a hypothesis of the model. This function implements a model selection procedure, by maximising a variational BIC criterion, computed for different values of K. A heuristic for a fast approximation of the procedure is proposed as well, although the corresponding models would not be properly trained.
Usage
select_nb_cluster(
data,
fast_approx = TRUE,
grid_nb_cluster = 1:10,
ini_hp_k = NULL,
ini_hp_i = NULL,
kern_k = "SE",
kern_i = "SE",
plot = TRUE,
...
)
Arguments
data |
A tibble or data frame. Columns required: |
fast_approx |
A boolean, indicating whether a fast approximation should
be used for selecting the number of clusters. If TRUE, each Magma or
MagmaClust model will perform only one E-step of the training, using
the same fixed values for the hyper-parameters ( |
grid_nb_cluster |
A vector of integers, corresponding to the grid of values that will be tested for the number of clusters. |
ini_hp_k |
A tibble or data frame of hyper-parameters associated with
|
ini_hp_i |
A tibble or data frame of hyper-parameters associated with
|
kern_k |
A kernel function associated to the mean processes. |
kern_i |
A kernel function associated to the individuals/tasks. |
plot |
A boolean indicating whether the plot of V-BIC values for all numbers of clusters should be displayed. |
... |
Any additional argument that could be passed to
|
Value
A list, containing the results of the model selection procedure for selecting the optimal number of clusters by maximising a V-BIC criterion. The elements of the list are:
best_k: An integer, indicating the resulting optimal number of clusters.
seq_vbic: A vector, corresponding to the sequence of the V-BIC values associated with the models trained for each number of clusters provided in grid_nb_cluster.
trained_models: A list, named by associated number of clusters, of Magma or MagmaClust models that have been trained (or approximated if fast_approx = T) during the model selection procedure.
Examples
TRUE
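The selection rule itself is simple once the criterion has been computed for each candidate value of K. The sketch below uses made-up V-BIC values (not produced by the package) to show how the best number of clusters would follow from a vector shaped like seq_vbic.

```r
# Illustrative V-BIC values for K = 1, ..., 5 (made-up numbers, not package output)
grid_nb_cluster <- 1:5
seq_vbic <- c(-520.3, -498.7, -471.2, -476.9, -489.4)

# The selected number of clusters is the one maximising the criterion
best_k <- grid_nb_cluster[which.max(seq_vbic)]
```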
Simulate a dataset tailored for MagmaClustR
Description
Simulate a complete training dataset, which may be representative of various applications. Several flexible arguments allow adjustment of the number of individuals, of observed inputs, and the values of many parameters controlling the data generation.
Usage
simu_db(
M = 10,
N = 10,
K = 1,
covariate = FALSE,
grid = seq(0, 10, 0.05),
grid_cov = seq(0, 10, 0.5),
common_input = TRUE,
common_hp = TRUE,
add_hp = FALSE,
add_clust = FALSE,
int_mu_v = c(4, 5),
int_mu_l = c(0, 1),
int_i_v = c(1, 2),
int_i_l = c(0, 1),
int_i_sigma = c(0, 0.2),
lambda_int = c(30, 40),
m_int = c(0, 10),
lengthscale_int = c(30, 40),
m0_slope = c(-5, 5),
m0_intercept = c(-50, 50)
)
Arguments
M |
An integer. The number of individuals per cluster. |
N |
An integer. The number of observations per individual. |
K |
An integer. The number of underlying clusters. |
covariate |
A logical value indicating whether the dataset should include an additional input covariate named 'Covariate'. |
grid |
A vector of numbers defining a grid of observations (i.e. the reference inputs). |
grid_cov |
A vector of numbers defining a grid of observations (i.e. the covariate reference inputs). |
common_input |
A logical value indicating whether the reference inputs are common to all individuals. |
common_hp |
A logical value indicating whether the hyper-parameters are common to all individuals. If TRUE and K>1, the hyper-parameters remain different between the clusters. |
add_hp |
A logical value indicating whether the values of hyper-parameters should be added as columns in the dataset. |
add_clust |
A logical value indicating whether the name of the clusters should be added as a column in the dataset. |
int_mu_v |
A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the mean process' kernel. |
int_mu_l |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the mean process' kernel. |
int_i_v |
A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the individual process' kernel. |
int_i_l |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the individual process' kernel. |
int_i_sigma |
A vector of 2 numbers, defining an interval of admissible values for the noise hyper-parameter. |
lambda_int |
A vector of 2 numbers, defining an interval of admissible values for the lambda parameter of the 2D exponential. |
m_int |
A vector of 2 numbers, defining an interval of admissible values for the mean of the 2D exponential. |
lengthscale_int |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale parameter of the 2D exponential. |
m0_slope |
A vector of 2 numbers, defining an interval of admissible values for the slope of m0. |
m0_intercept |
A vector of 2 numbers, defining an interval of admissible values for the intercept of m0. |
Value
A full dataset of simulated training data.
Examples
## Generate a dataset with 3 clusters of 4 individuals, observed at 10 inputs
data = simu_db(M = 4, N = 10, K = 3)
## Generate a 2-D dataset with an additional input 'Covariate'
data = simu_db(covariate = TRUE)
## Generate a dataset where input locations are different among individuals
data = simu_db(common_input = FALSE)
## Generate a dataset with an additional column indicating the true clusters
data = simu_db(K = 3, add_clust = TRUE)
Simulate a batch of data
Description
Simulate a batch of output data, corresponding to one individual, coming from a GP with the Squared Exponential kernel as covariance structure, and specified hyper-parameters and input.
Usage
simu_indiv_se(ID, input, mean, v, l, sigma)
Arguments
ID |
An identification code, whether numeric or character. |
input |
A vector of numbers. The input variable that is used as 'reference' for input and outputs. |
mean |
A vector of numbers. Prior mean values of the GP. |
v |
A number. The variance hyper-parameter of the SE kernel. |
l |
A number. The lengthscale hyper-parameter of the SE kernel. |
sigma |
A number. The noise hyper-parameter. |
Value
A tibble containing a batch of output data along with input and additional information for a simulated individual.
Examples
TRUE
Compute a mixture of Gaussian log-likelihoods
Description
During the prediction step of MagmaClust, an EM algorithm is used to compute the maximum likelihood estimator of the hyper-parameters along with mixture probabilities for the new individual/task. This function implements the quantity that is maximised (i.e. a sum of Gaussian log-likelihoods, weighted by their mixture probabilities). It can also be used to monitor the EM algorithm when providing the 'prop_mixture' argument, for proper penalisation of the full log-likelihood.
Usage
sum_logL_GP_clust(
hp,
db,
mixture,
mean,
kern,
post_cov,
prop_mixture = NULL,
pen_diag
)
Arguments
hp |
A tibble, data frame or named vector of hyper-parameters. |
db |
A tibble containing data we want to evaluate the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mixture |
A tibble or data frame, indicating the mixture probabilities of each cluster for the new individual/task. |
mean |
A list of hyper-posterior mean parameters for all clusters. |
kern |
A kernel function. |
post_cov |
A list of hyper-posterior covariance parameters for all clusters. |
prop_mixture |
A tibble or a named vector. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1, corresponding to the mixture proportions. |
pen_diag |
A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, expectation of mixture of Gaussian log-likelihoods in the prediction step of MagmaClust. This quantity is supposed to increase at each step of the EM algorithm, and can be used for monitoring the procedure.
Examples
TRUE
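The quantity described above can be sketched in base R as a sum of Gaussian log-likelihoods weighted by mixture probabilities. The helpers gaussian_loglik and weighted_loglik are hypothetical, and the cluster means and covariances are passed directly rather than being built from 'kern', 'hp' and 'post_cov' as the package presumably does. The optional 'prop_mixture' penalisation is left out.

```r
# Log-density of a multivariate Gaussian N(mean, cov), via a Cholesky factor
gaussian_loglik <- function(y, mean, cov) {
  chol_factor <- chol(cov)  # upper-triangular R with t(R) %*% R = cov
  # Solve t(R) z = y - mean, so that sum(z^2) is the quadratic form
  resid <- backsolve(chol_factor, y - mean, transpose = TRUE)
  -0.5 * (length(y) * log(2 * pi) + 2 * sum(log(diag(chol_factor))) + sum(resid^2))
}

# Hypothetical helper: sum over clusters of tau_k * log N(y; mean_k, cov_k)
weighted_loglik <- function(y, means, covs, mixture) {
  sum(mapply(
    function(m, C, tau) tau * gaussian_loglik(y, m, C),
    means, covs, mixture
  ))
}

# Illustrative values for two clusters (made-up numbers)
y <- c(0.5, -0.2)
means <- list(K1 = c(0, 0), K2 = c(1, 1))
covs <- list(K1 = diag(2), K2 = diag(2))
mixture <- c(K1 = 0.8, K2 = 0.2)
value <- weighted_loglik(y, means, covs, mixture)
```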
French swimmers performances data on 100m freestyle events
Description
A subset of data from reported performances of French swimmers during 100m freestyle competitions between 2002 and 2016. See https://link.springer.com/article/10.1007/s10994-022-06172-1 and https://www.mdpi.com/2076-3417/8/10/1766 for dedicated description and analysis.
Usage
swimmers
Format
swimmers
A data frame with 76,832 rows and 4 columns:
- ID
Identifying number associated to each swimmer
- Input
Age in years
- Output
Performance in seconds on a 100m freestyle event
- Gender
Competition gender
Source
https://ffn.extranat.fr/webffn/competitions.php?idact=nat
Learning hyper-parameters of a Gaussian Process
Description
Learning hyper-parameters of any new individual/task in Magma is required in the prediction procedure. This function can also be used to learn hyper-parameters of a simple GP (just leave the hyperpost argument set to NULL, and use prior_mean instead). When used within Magma, by providing data for the new individual/task, the hyper-posterior mean and covariance parameters, and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the hyper-parameters.
Usage
train_gp(
data,
prior_mean = NULL,
ini_hp = NULL,
kern = "SE",
hyperpost = NULL,
pen_diag = 1e-10
)
Arguments
data |
A tibble or data frame. Required columns: |
prior_mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
ini_hp |
A named vector, tibble or data frame of hyper-parameters
associated with the |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements 'mean' and 'cov',
the parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should come from a previous learning using
|
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, containing the trained hyper-parameters for the kernel of the new individual/task.
Examples
TRUE
Prediction in MagmaClust: learning new HPs and mixture probabilities
Description
Learning hyper-parameters and mixture probabilities of any new individual/task is required in MagmaClust in the prediction procedure. By providing data for the new individual/task, the hyper-posterior mean and covariance parameters, the mixture proportions, and initialisation values for the hyper-parameters, train_gp_clust uses an EM algorithm to compute maximum likelihood estimates of the hyper-parameters and hyper-posterior mixture probabilities of the new individual/task.
Usage
train_gp_clust(
data,
prop_mixture = NULL,
ini_hp = NULL,
kern = "SE",
hyperpost = NULL,
pen_diag = 1e-10,
n_iter_max = 25,
cv_threshold = 0.001
)
Arguments
data |
A tibble or data frame. Required columns: |
prop_mixture |
A tibble or a named vector. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1, corresponding to the mixture proportions. |
ini_hp |
A tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the EM algorithm to run if convergence is not reached. |
cv_threshold |
A number, indicating the threshold of the likelihood gain under which the EM algorithm will stop. |
Value
A list, containing the results of the EM algorithm used during the prediction step of MagmaClust. The elements of the list are:
hp: A tibble of optimal hyper-parameters for the new individual's GP.
mixture: A tibble of mixture probabilities for the new individual.
Examples
TRUE
Training Magma with an EM algorithm
Description
The hyper-parameters and the hyper-posterior distribution involved in Magma can be learned thanks to an EM algorithm implemented in train_magma. By providing a dataset, the model hypotheses (hyper-prior mean parameter and covariance kernels) and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the HPs as well as the mean and covariance parameters of the Gaussian hyper-posterior distribution of the mean process.
Usage
train_magma(
data,
prior_mean = NULL,
ini_hp_0 = NULL,
ini_hp_i = NULL,
kern_0 = "SE",
kern_i = "SE",
common_hp = TRUE,
grid_inputs = NULL,
pen_diag = 1e-10,
n_iter_max = 25,
cv_threshold = 0.001,
fast_approx = FALSE
)
Arguments
data |
A tibble or data frame. Required columns: |
prior_mean |
Hyper-prior mean parameter (m_0) of the mean GP. This argument can be specified under various formats, such as:
|
ini_hp_0 |
A named vector, tibble or data frame of hyper-parameters
associated with |
ini_hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_0 |
A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE", "PERIO" and "RQ" are also available here). |
common_hp |
A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
grid_inputs |
A vector, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the EM algorithm to run if convergence is not reached. |
cv_threshold |
A number, indicating the threshold of the likelihood gain
under which the EM algorithm will stop. The convergence condition is
defined as the difference of likelihoods between two consecutive steps,
divided by the absolute value of the last one
( |
fast_approx |
A boolean, indicating whether the EM algorithm should stop after only one iteration of the E-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
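The stopping rule described for cv_threshold can be sketched as follows, assuming that "the last one" refers to the most recent log-likelihood value. The helper relative_gain and the sequence of values are made up for illustration. The package may compute the condition differently.

```r
# Relative gain between two consecutive log-likelihood values
relative_gain <- function(new_loglik, old_loglik) {
  (new_loglik - old_loglik) / abs(new_loglik)
}

# Illustrative sequence of log-likelihoods (made-up numbers)
seq_loglik <- c(-250.0, -180.0, -150.5, -150.4)
cv_threshold <- 0.001

# Gains between iterations 1-2, 2-3 and 3-4
gains <- relative_gain(seq_loglik[-1], seq_loglik[-length(seq_loglik)])

# First iteration (as an index in seq_loglik) at which the gain is below the threshold
converged_at <- which(gains < cv_threshold)[1] + 1
```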
Details
The user can specify custom kernel functions for the arguments kern_0 and kern_i. The hyper-parameters used in the kernel should have explicit names, and be contained within the hp argument. hp should typically be defined as a named vector or a data frame. Although it is not mandatory for the train_magma function to run, gradients can be provided within the kernel function definition. See for example se_kernel to create a custom kernel function displaying an adequate format to be used in Magma.
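A custom kernel following these guidelines might look like the sketch below, which mirrors the argument list documented for se_kernel. The kernel itself (a shifted linear kernel) and the hyper-parameter names 'lin_variance' and 'lin_offset' are made up for illustration. Whether Magma accepts this function unchanged, and on which scale it expects the derivatives, are assumptions to check against the source of se_kernel.

```r
# Hypothetical custom kernel with the same arguments as se_kernel.
# 'lin_variance' and 'lin_offset' are made-up hyper-parameter names.
# The 'vectorized' argument is kept for signature compatibility but unused here.
linear_kernel <- function(x, y, hp, deriv = NULL, vectorized = FALSE) {
  variance <- hp[["lin_variance"]]
  offset <- hp[["lin_offset"]]
  inner <- sum((x - offset) * (y - offset))
  if (is.null(deriv)) {
    return(variance * inner)  # kernel evaluation
  }
  if (deriv == "lin_variance") {
    return(inner)  # derivative with respect to the variance
  }
  if (deriv == "lin_offset") {
    return(variance * sum(2 * offset - x - y))  # derivative with respect to the offset
  }
  stop("Unknown hyper-parameter: ", deriv)
}

# hp is a named vector with explicit hyper-parameter names
hp <- c(lin_variance = 2, lin_offset = 1)
k_val <- linear_kernel(3, 4, hp)
```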
Value
A list, gathering the results of the EM algorithm used for training in Magma. The elements of the list are:
hp_0: A tibble of the trained hyper-parameters for the mean process' kernel.
hp_i: A tibble of all the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list gathering the parameters of the mean process' hyper-posterior distribution, namely:
mean: A tibble, the hyper-posterior mean parameter (Output) evaluated at each training reference Input.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at Input for the mean process' hyper-posterior distribution under a format that allows the direct visualisation as a GP prediction.
ini_args: A list containing the initial function arguments and values for the hyper-prior mean and the hyper-parameters. In particular, if those arguments were set to NULL, ini_args allows us to retrieve the (randomly chosen) initialisations used during training.
seq_loglikelihood: A vector, containing the sequence of log-likelihood values associated with each iteration.
converged: A logical value indicating whether the EM algorithm converged or not.
training_time: Total running time of the complete training.
Examples
TRUE
Training MagmaClust with a Variational EM algorithm
Description
The hyper-parameters and the hyper-posterior distributions involved in MagmaClust can be learned thanks to a VEM algorithm implemented in train_magmaclust. By providing a dataset, the model hypotheses (hyper-prior mean parameters, covariance kernels and number of clusters) and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the HPs as well as the mean and covariance parameters of the Gaussian hyper-posterior distributions of the mean processes.
Usage
train_magmaclust(
data,
nb_cluster = NULL,
prior_mean_k = NULL,
ini_hp_k = NULL,
ini_hp_i = NULL,
kern_k = "SE",
kern_i = "SE",
ini_mixture = NULL,
common_hp_k = TRUE,
common_hp_i = TRUE,
grid_inputs = NULL,
pen_diag = 1e-10,
n_iter_max = 25,
cv_threshold = 0.001,
fast_approx = FALSE
)
Arguments
data |
A tibble or data frame. Columns required: |
nb_cluster |
A number, indicating the number of clusters of individuals/tasks that are assumed to exist among the dataset. |
prior_mean_k |
The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. This argument can be specified under various formats, such as:
|
ini_hp_k |
A tibble or data frame of hyper-parameters
associated with |
ini_hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_k |
A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. (See
details above in |
ini_mixture |
Initial values of the probability to belong to each
cluster for each individual ( |
common_hp_k |
A boolean indicating whether hyper-parameters are common among the mean GPs. |
common_hp_i |
A boolean indicating whether hyper-parameters are common among the individual GPs. |
grid_inputs |
A vector, indicating the grid of additional reference inputs on which the mean processes' hyper-posteriors should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the VEM algorithm to run if convergence is not reached. |
cv_threshold |
A number, indicating the threshold of the likelihood gain
under which the VEM algorithm will stop. The convergence condition is
defined as the difference of ELBO between two consecutive steps,
divided by the absolute value of the last one
( |
fast_approx |
A boolean, indicating whether the VEM algorithm should stop after only one iteration of the VE-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
Details
The user can specify custom kernel functions for the arguments kern_k and kern_i. The hyper-parameters used in the kernel should have explicit names, and be contained within the hp argument. hp should typically be defined as a named vector or a data frame. Although it is not mandatory for the train_magmaclust function to run, gradients can be provided within the kernel function definition. See for example se_kernel to create a custom kernel function displaying an adequate format to be used in MagmaClust.
Value
A list, containing the results of the VEM algorithm used in the training step of MagmaClust. The elements of the list are:
hp_k: A tibble containing the trained hyper-parameters for the mean process' kernel and the mixture proportions for each cluster.
hp_i: A tibble containing the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the hyper-posterior mean parameters evaluated at each Input.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
ini_args: A list containing the initial function arguments and values for the hyper-prior means and the hyper-parameters. In particular, if those arguments were set to NULL, ini_args allows us to retrieve the (randomly chosen) initialisations used during training.
seq_elbo: A vector, containing the sequence of ELBO values associated with each iteration.
converged: A logical value indicating whether the algorithm converged.
training_time: Total running time of the complete training.
Examples
TRUE
Update the mixture probabilities for each individual and each cluster
Description
Update the mixture probabilities for each individual and each cluster
Usage
update_mixture(db, mean_k, cov_k, hp, kern, prop_mixture, pen_diag)
Arguments
db |
A tibble or data frame. Columns required: |
mean_k |
A list of the K hyper-posterior mean parameters. |
cov_k |
A list of the K hyper-posterior covariance matrices. |
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the individual GPs. |
prop_mixture |
A tibble containing the hyper-parameters associated with each individual, indicating in which cluster it belongs. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
Compute the hyper-posterior multinomial distributions by updating mixture probabilities.
Examples
TRUE
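A generic mixture-model version of this update can be sketched in base R, assuming posterior probabilities proportional to the mixture proportion times the cluster likelihood. The helper posterior_mixture and the numbers below are made up for illustration. The package's computation, which relies on the hyper-posterior mean and covariance parameters, may include additional terms.

```r
# Normalise log-weights into probabilities with the log-sum-exp trick
posterior_mixture <- function(log_lik, prop_mixture) {
  log_w <- log(prop_mixture) + log_lik  # unnormalised log posterior weights
  log_w <- log_w - max(log_w)           # stabilise before exponentiating
  w <- exp(log_w)
  w / sum(w)
}

# Illustrative values for one individual and three clusters (made-up numbers)
log_lik <- c(K1 = -1050.2, K2 = -1043.7, K3 = -1061.9)
prop_mixture <- c(K1 = 0.3, K2 = 0.5, K3 = 0.2)
tau <- posterior_mixture(log_lik, prop_mixture)
```

Subtracting the maximum before exponentiating avoids numerical underflow when log-likelihoods are large in absolute value, which is common with many observations.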
E-Step of the VEM algorithm
Description
Expectation step of the Variational EM algorithm used to compute the parameters of the hyper-posterior distributions for the mean processes and mixture variables involved in MagmaClust.
Usage
ve_step(db, m_k, kern_k, kern_i, hp_k, hp_i, old_mixture, iter, pen_diag)
Arguments
db |
A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_k |
A named list of vectors, corresponding to the prior mean parameters of the K mean GPs. |
kern_k |
A kernel function, associated with the K mean GPs. |
kern_i |
A kernel function, associated with the M individual GPs. |
hp_k |
A named vector, tibble or data frame of hyper-parameters
associated with |
hp_i |
A named vector, tibble or data frame of hyper-parameters
associated with |
old_mixture |
A list of mixture values from the previous iteration. |
iter |
A number, indicating the current iteration of the VEM algorithm. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements mean, a tibble containing the Input and associated Output of the hyper-posterior mean parameters, cov, the hyper-posterior covariance matrices, and mixture, the probabilities to belong to each cluster for each individual.
Examples
TRUE
M-Step of the VEM algorithm
Description
Maximization step of the Variational EM algorithm used to compute hyper-parameters of all the kernels involved in MagmaClust.
Usage
vm_step(
db,
old_hp_k,
old_hp_i,
list_mu_param,
kern_k,
kern_i,
m_k,
common_hp_k,
common_hp_i,
pen_diag
)
Arguments
db |
A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
old_hp_k |
A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the mean GPs. |
old_hp_i |
A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the individual GPs. |
list_mu_param |
List of parameters of the K mean GPs. |
kern_k |
A kernel used to compute the covariance matrix of the mean GP at corresponding timestamps. |
kern_i |
A kernel used to compute the covariance matrix of the individual GPs at corresponding timestamps. |
m_k |
A named list of prior mean parameters for the K mean GPs. Length = 1 or nrow(unique(db$Input)) |
common_hp_k |
A boolean indicating whether hp are common among mean GPs (for each mu_k) |
common_hp_i |
A boolean indicating whether hp are common among individual GPs (for each y_i) |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements hp_k, a tibble containing the hyper-parameters associated with each cluster, hp_i, a tibble containing the hyper-parameters associated with the individual GPs, and prop_mixture_k, a tibble containing the hyper-parameters associated with each individual, indicating the probabilities to belong to each cluster.
Examples
TRUE
Weight follow-up data of children in Singapore
Description
A subset of data from the GUSTO project (https://www.gusto.sg/) collecting the weight over time of several children in Singapore. See https://arxiv.org/abs/2011.07866 for dedicated description and analysis.
Usage
weight
Format
weight
A data frame with 3,629 rows and 4 columns:
- ID
Identifying number associated to each child
- sex
Biological gender
- Input
Age in months
- Output
Weight in kilograms