Title: Machine Learning Method Based on Isolation Kernel Mean Embedding
Version: 1.0.6
Description: Incorporates Approximate Bayesian Computation to obtain a posterior distribution and to select an optimal model parameter for an observation point. Additionally, a meta-sampling heuristic algorithm is implemented for parameter estimation; it requires no model runs and is dimension-independent. A sampling scheme is also provided that allows model runs and uses meta-sampling for point generation. A predictor is implemented as meta-sampling for the model output. All the algorithms rely on a machine learning method that uses the maxima weighted Isolation Kernel approach, or 'MaxWiK'. The method transforms raw data to a Hilbert space (mapping) and measures the similarity between simulated points and the maxima weighted Isolation Kernel mapping corresponding to the observation point. Comprehensive details of the methodology can be found in the papers Iurii Nagornov (2024) <doi:10.1007/978-3-031-66431-1_16> and Iurii Nagornov (2023) <doi:10.1007/978-3-031-29168-5_18>.
License: GPL (≥ 3)
Depends: R (≥ 3.3.0)
Imports: methods, stats, utils, scales, parallel, abc, ggplot2
Suggests: rmarkdown, knitr
Encoding: UTF-8
RoxygenNote: 7.3.2
VignetteBuilder: knitr
LazyData: true
NeedsCompilation: no
Packaged: 2025-07-07 05:10:36 UTC; nagornov
Author: Yuri Nagornov [aut, cre, cph] (ORCID iD)
Maintainer: Yuri Nagornov <nagornov.yuri@gmail.com>
Repository: CRAN
Date/Publication: 2025-07-07 05:30:02 UTC

MaxWiK: Machine Learning Method Based on Isolation Kernel Mean Embedding

Description

Incorporates Approximate Bayesian Computation to obtain a posterior distribution and to select an optimal model parameter for an observation point. Additionally, a meta-sampling heuristic algorithm is implemented for parameter estimation; it requires no model runs and is dimension-independent. A sampling scheme is also provided that allows model runs and uses meta-sampling for point generation. A predictor is implemented as meta-sampling for the model output. All the algorithms rely on a machine learning method that uses the maxima weighted Isolation Kernel approach, or 'MaxWiK'. The method transforms raw data to a Hilbert space (mapping) and measures the similarity between simulated points and the maxima weighted Isolation Kernel mapping corresponding to the observation point. Comprehensive details of the methodology can be found in the papers Iurii Nagornov (2024) doi:10.1007/978-3-031-66431-1_16 and Iurii Nagornov (2023) doi:10.1007/978-3-031-29168-5_18.

Author(s)

Maintainer: Yuri Nagornov nagornov.yuri@gmail.com (ORCID) [copyright holder]


List of objects for the 2D example of using the MaxWiK methods

Description

A list containing input and output data for the 2D example of Approximate Bayesian Computation, including the sampling scheme, meta-sampling, and prediction. For full details of the dataset, see the package vignette.

Usage

Data.2D

Format

A list of:

X

Input data frame of the model

Y

Output data frame of the model

observation

Data frame with observation info

ABC

List of hyperparameters, the matrix of Voronoi sites, the posterior distribution, and the results of the MaxWiK algorithm

metasampling

List of results of the meta-sampling algorithm, and the network of points generated during meta-sampling

sampling

List of objects needed for the sampling algorithm, such as the simulation function, the parameters of the model, the MSE (mean squared error), and X12 (the generated points)

predictor

List of objects needed for the predictor algorithm, such as posteriori.MaxWiK, the result of the algorithm, and the network of points generated during meta-sampling


Function to get a subset of size psi for a Voronoi diagram

Description

Function to get a subset of size psi for a Voronoi diagram

Usage

GET_SUBSET(data_set, pnts)

Arguments

data_set

Data.frame of Voronoi diagram

pnts

Integer vector of indexes of columns of the data_set

Value

Subset of data_set with columns pnts

Examples

NULL


Functions to get the mean squared error values for statistics of simulations

Description

The function MSE_sim() computes the mean squared error values for statistics of simulations

The function MSE_parameters() computes the MSE for parameters when the true parameter is known

Usage

MSE_sim(stat.obs, stat.sim)

MSE_parameters(par.truth, par.top = NULL, par.best)

Arguments

stat.obs

Summary statistics of the observation point

stat.sim

Summary statistics of the simulations (model output)

par.truth

The true parameter

par.top

Parameters with the top similarity values from the get.MaxWiK() algorithm

par.best

The best parameter from the get.MaxWiK() algorithm

Value

The function MSE_sim() returns a numeric vector of the mean squared error values for statistics of simulations

The function MSE_parameters() returns a list of two numbers:

Examples

NULL
NULL 
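
As a rough illustration of the returned vector, the sketch below computes one mean squared error value per simulation in base R. It assumes that the error is averaged over the summary statistics within each simulation row; that averaging rule and the helper name mse_per_simulation() are assumptions made for illustration, not the package's MSE_sim() implementation.

```r
# Hypothetical helper (not part of MaxWiK): one MSE value per simulation row.
# Assumption: the squared differences are averaged over the statistics (columns).
mse_per_simulation <- function(stat_obs, stat_sim) {
  obs <- as.numeric(unlist(stat_obs[1, ]))
  sim <- as.matrix(stat_sim)
  # Subtract the observation from every row, then average squared differences per row
  rowMeans(sweep(sim, 2, obs, "-")^2)
}

stat_sim <- data.frame(s1 = c(1, 2, 3), s2 = c(0, 2, 4))
stat_obs <- data.frame(s1 = 2, s2 = 2)
mse_values <- mse_per_simulation(stat_obs, stat_sim)
print(mse_values)
```

The second simulation reproduces the observation exactly, so its value is zero, while the other two rows are equally far from the observation.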

Density plot

Description

Density plot

Usage

MaxWiK.ggplot.density(
  title = "",
  datafr1,
  datafr2,
  var.df,
  obs.true = NULL,
  best.sim = NULL,
  clrs = c("#a9b322", "#f9b3a2", "red", "blue"),
  alpha = c(0.1, 0.4),
  lw = c(0.7, 0.7),
  lt = c("dashed", "dotted")
)

Arguments

title

Title of the plot

datafr1

data frame 1

datafr2

data frame 2

var.df

Variables to show

obs.true

True observation, if available; NULL by default

best.sim

The best point from a simulation, if available; NULL by default

clrs

Colors to plot, by default it is c( "#a9b322", "#f9b3a2", 'red', 'blue' )

alpha

Transparency values for density plots

lw

Line widths

lt

Line types

Value

A ggplot object showing the densities of the data frames

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the templates and vignettes for usage. 
# Function 'MaxWiK.ggplot.density()' is used in the MaxWiK.ABC.R and 
# MaxWiK.Predictor.R templates.

Function to copy the templates from the extdata folder of the library to the /Templates/ folder in the working directory

Description

Function to copy the templates from the extdata folder of the library to the /Templates/ folder in the working directory

Usage

MaxWiK_templates(dir)

Arguments

dir

Folder where the files should be saved; by default dir = './'

Value

List of logical values, one for each copied file: TRUE on success, FALSE on failure

Examples

MaxWiK_templates( dir = tempdir() )

Function to restrict the values of the data according to the range for each dimension

Description

Function to restrict the values of the data according to the range for each dimension

Usage

apply_range(diapason, input.data)

Arguments

diapason

Vector of min and max values or data frame with two rows (min and max) for each dimension of input data

input.data

Data frame of input where values will be corrected

Value

The same data frame with corrected values according to the diapason

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the templates and vignettes for usage.
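
A minimal base R sketch of range restriction is given here for orientation. It assumes that diapason is a data frame with two rows (min, then max) and that out-of-range values are clipped to the nearest bound; the documentation only says the values are "corrected", so the clipping rule and the helper name clamp_to_range() are assumptions, not the behaviour of apply_range() itself.

```r
# Hypothetical helper (not part of MaxWiK): clip each column to [min, max].
# Assumption: row 1 of range_df holds minima, row 2 holds maxima.
clamp_to_range <- function(range_df, input_data) {
  out <- input_data
  for (j in seq_along(out)) {
    # pmax() lifts values below the minimum, pmin() lowers values above the maximum
    out[[j]] <- pmin(pmax(out[[j]], range_df[1, j]), range_df[2, j])
  }
  out
}

range_df   <- data.frame(x = c(0, 1), y = c(-5, 5))
input_data <- data.frame(x = c(-0.2, 0.5, 1.7), y = c(-9, 0, 3))
clamped    <- clamp_to_range(range_df, input_data)
print(clamped)
```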

Function to check DATA.FRAME

Description

Check that DATA.FRAME has numeric format for ALL the columns and it has NO 'NA' values

Usage

check_numeric_format(l)

Arguments

l

DATA.FRAME that should have data of numeric type

Value

TRUE if the data.frame has ONLY numeric data, and FALSE otherwise

Examples

NULL
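
The check described above can be approximated in base R as follows. The helper name is_all_numeric_no_na() is hypothetical, and the sketch only mirrors the documented conditions (all columns numeric, no NA values); it is not the package's check_numeric_format() code.

```r
# Hypothetical helper (not part of MaxWiK): TRUE only for a data frame
# whose columns are all numeric and contain no NA values.
is_all_numeric_no_na <- function(df) {
  is.data.frame(df) &&
    all(vapply(df, is.numeric, logical(1))) &&
    !any(vapply(df, anyNA, logical(1)))
}

ok_df      <- data.frame(a = c(1, 2), b = c(3.5, 4))
na_df      <- data.frame(a = c(1, NA), b = c(3.5, 4))
char_df    <- data.frame(a = c(1, 2), b = c("u", "v"))
check_ok   <- is_all_numeric_no_na(ok_df)
check_na   <- is_all_numeric_no_na(na_df)
check_char <- is_all_numeric_no_na(char_df)
print(c(check_ok, check_na, check_char))
```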

Check the installation of packages and attach them with corresponding functions

Description

Check the installation of packages and attach them with corresponding functions

Usage

check_packages(pkgs = NULL)

Arguments

pkgs

List of package names with related function names; by default (or when pkgs = NULL) the list of packages is the one described in the NAMESPACE file of the package or in the 'R/MaxWiK-package.R' file

Value

NULL if the packages are installed; otherwise an error message

Examples

NULL

Check the installation of a package for some functions

Description

Check the installation of a package for some functions

Usage

check_pkg(pkg)

Arguments

pkg

Package name

Value

NULL if the package is installed; otherwise an error message

Examples

NULL
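
One common way to implement such a check in base R is with requireNamespace(). The sketch below assumes that "returns error message" means raising an error with stop(); both that interpretation and the helper name check_pkg_sketch() are assumptions rather than the package's actual behaviour.

```r
# Hypothetical helper (not part of MaxWiK): NULL if the package is installed,
# otherwise an error. Assumption: the error is raised with stop().
check_pkg_sketch <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required but not installed.", call. = FALSE)
  }
  invisible(NULL)
}

res <- check_pkg_sketch("stats")  # 'stats' ships with every R installation
err <- tryCatch(
  check_pkg_sketch("aPackageThatDoesNotExist123"),
  error = function(e) conditionMessage(e)
)
print(err)
```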

Functions to get the inverse Gram matrix

Description

Function get_inverse_GRAM() computes the inverse Gram matrix for a given positive regularization constant lambda

Function check_positive_definite() returns a logical value based on n trials of the check 'is the Gram matrix positive definite or not?'. A single failed trial returns FALSE

Usage

get_inverse_GRAM(G, l = 1e-06, check_pos_def = FALSE)

check_positive_definite(G, n = 10)

Arguments

G

Gram matrix obtained via the GRAM_iKernel() function

l

Lambda parameter, a positive regularization constant

check_pos_def

Logical; if TRUE, check that the Gram matrix is positive definite

n

Number of iterations to check the positive definite property

Value

Function get_inverse_GRAM() returns the inverse Gram matrix for the given positive regularization constant lambda (argument l)

Function check_positive_definite() returns a logical value:
TRUE if the Gram matrix is positive definite, and FALSE if it is not

Examples

NULL
NULL
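
For orientation, a regularized inverse is often computed as the inverse of G + lambda * I. The sketch below assumes that form and swaps in an eigenvalue test for positive definiteness in place of the n-trial check described above; the helper names regularized_inverse() and is_positive_definite() are hypothetical and the code is not taken from the package.

```r
# Hypothetical helpers (not part of MaxWiK).
# Assumption: the regularized inverse is (G + l * I)^(-1).
regularized_inverse <- function(G, l = 1e-6) {
  solve(G + l * diag(nrow(G)))
}

# Swapped-in technique: eigenvalue test instead of the n-trial check.
is_positive_definite <- function(G, tol = 1e-10) {
  ev <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
  all(ev > tol)
}

G      <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # eigenvalues 1.5 and 0.5
G_bad  <- matrix(c(1, 2, 2, 1), nrow = 2)      # eigenvalues 3 and -1
GI     <- regularized_inverse(G, l = 1e-6)
pd_ok  <- is_positive_definite(G)
pd_bad <- is_positive_definite(G_bad)
print(GI)
```

The regularization constant keeps the matrix invertible even when the Gram matrix is close to singular, which is the usual reason for adding it.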

The function to calculate Maxima weighted kernel mean mapping for Isolation Kernel in RKHS related to parameters space

Description

The function to calculate Maxima weighted kernel mean mapping for Isolation Kernel in RKHS related to parameters space

Usage

get_kernel_mean_embedding(parameters_Matrix_iKernel, Hilbert_weights)

Arguments

parameters_Matrix_iKernel

Matrix of all the points represented in RKHS related to parameters space

Hilbert_weights

Maximal weights in RKHS to get related part of kernel mean embedding from parameters_Matrix_iKernel

Value

Maxima weighted kernel mean mapping as an integer vector of length t (the number of trees). Each element of the vector is the index of the Voronoi cell with the maximal weight in the corresponding Voronoi diagram

Examples

NULL
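
The shape of the returned value can be illustrated with a small base R sketch. It assumes one weight per data point and picks, for each tree, the Voronoi cell whose points have the largest total weight. How the package actually aggregates Hilbert_weights is not stated here, so the aggregation rule and the helper name max_weight_cells() are assumptions for illustration only.

```r
# Hypothetical helper (not part of MaxWiK): for each tree (column), return the
# index of the cell with the largest summed weight.
# Assumption: 'weights' holds one weight per data point (row).
max_weight_cells <- function(Matrix_iKernel, weights) {
  apply(Matrix_iKernel, 2, function(cells) {
    totals <- tapply(weights, cells, sum)
    as.integer(names(totals)[which.max(totals)])
  })
}

# 3 points (rows), 4 trees (columns); entries are Voronoi cell indexes
M <- matrix(c(1L, 2L, 1L, 3L,
              1L, 2L, 2L, 3L,
              2L, 1L, 2L, 1L), nrow = 3, byrow = TRUE)
w <- c(0.6, 0.3, 0.1)
embedding <- max_weight_cells(M, w)
print(embedding)
```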

The function to get subset of points based on feature mapping

Description

The function to get subset of points based on feature mapping

Usage

get_subset_of_feature_map(dtst, Matrix_Voronoi, iFeature_point)

Arguments

dtst

Dataset of all the original points

Matrix_Voronoi

Matrix of Voronoi diagrams based on the Isolation Kernel algorithm

iFeature_point

Feature mapping in RKHS for a point, which can be obtained via the add_new_point_iKernel() function

Value

The subset of dtst containing the points selected by the feature mapping of an observation point (iFeature_point)

Examples

NULL

The function to get feature representation in RKHS based on Voronoi diagram for WHOLE dataset

Description

The function to get feature representation in RKHS based on Voronoi diagram for WHOLE dataset

Usage

get_voronoi_feature(
  psi = 40,
  t = 350,
  data,
  talkative = FALSE,
  Matrix_Voronoi = NULL
)

add_new_point_iKernel(data, d1, Matrix_Voronoi, dissim, t, psi, nr)

Arguments

psi

Integer number related to the size of each Voronoi diagram

t

Integer number of trees in Isolation Kernel or dimension of RKHS

data

Dataset of points: rows are points, columns are dimensions of a point

talkative

Logical. If TRUE, print messages; if FALSE, run silently

Matrix_Voronoi

Matrix of Voronoi diagrams, if it is NULL then the function will calculate Matrix_Voronoi

d1

Data point - usually it is an observation data point

dissim

Matrix of dissimilarity or distances between all points.

nr

Integer number of rows in matrix of distances (dissim) and also the size of dataset

Value

Feature representation in RKHS based on Voronoi diagram for WHOLE dataset

RKHS mapping for a new point based on Isolation Kernel mapping

Examples

NULL
NULL
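
To make the feature representation concrete, the sketch below follows the general Isolation Kernel construction with Voronoi diagrams: each of t trees samples psi points as Voronoi sites, and every data point is labelled with the index of its nearest site. The helper name voronoi_feature_sketch(), the argument n_trees (standing in for t), and the use of squared Euclidean distance are assumptions; this is not the package's get_voronoi_feature() code.

```r
# Hypothetical helper (not part of MaxWiK): nearest-site labelling per tree.
voronoi_feature_sketch <- function(data, psi = 4, n_trees = 10) {
  data <- as.matrix(data)
  n <- nrow(data)
  sites <- matrix(0L, nrow = n_trees, ncol = psi)  # rows: trees, columns: site IDs
  cells <- matrix(0L, nrow = n, ncol = n_trees)    # rows: points, columns: trees
  for (k in seq_len(n_trees)) {
    sites[k, ] <- sample.int(n, psi)               # psi distinct points as sites
    s <- data[sites[k, ], , drop = FALSE]
    # Squared Euclidean distance from every point to every site (n x psi)
    d2 <- sapply(seq_len(psi), function(j) rowSums(sweep(data, 2, s[j, ], "-")^2))
    cells[, k] <- max.col(-d2, ties.method = "first")  # index of the nearest site
  }
  list(Matrix_Voronoi = sites, Matrix_iKernel = cells)
}

set.seed(1)
pts  <- data.frame(x = runif(30), y = runif(30))
feat <- voronoi_feature_sketch(pts, psi = 4, n_trees = 10)
print(dim(feat$Matrix_iKernel))
```

Each row of Matrix_iKernel in this sketch is the feature representation of one point: a vector of cell indexes, one per tree.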

The function to get feature representation in RKHS based on Voronoi diagram for PART of dataset

Description

get_voronoi_feature_PART_dataset() function returns the feature (mapping) representation in RKHS based on Voronoi diagram for NEW PART of dataset. The Matrix_Voronoi is based on the PREVIOUS dataset. The NEW PART of dataset will appear at the end of PREVIOUS dataset

Usage

get_voronoi_feature_PART_dataset(
  data,
  talkative = FALSE,
  start_row,
  Matrix_Voronoi
)

Arguments

data

Data.frame of new points

talkative

Logical; if TRUE, print messages

start_row

Row number from which the new data should be added

Matrix_Voronoi

Matrix of Voronoi diagrams based on the PREVIOUS dataset

Value

List of three matrices: Matrix_Voronoi, Matrix_iKernel and dissim

Examples

NULL

Function returns the value of similarity or Isolation KERNEL for TWO points

Description

iKernel() function returns the value of similarity, or Isolation KERNEL, for TWO points, which is a number in the range [0,1]

iKernel_point_dataset() function returns a vector of similarity values based on the Isolation Kernel between a new point and all the points of the dataset

get_weights_iKernel() function returns a list of two objects: the first is a numeric vector of weights for the RKHS space, and the second is a numeric vector of similarity weights for iFeature_point corresponding to the observation point

GRAM_iKernel() function calculates the Gram matrix for the Isolation Kernel method based on Voronoi diagrams

Usage

iKernel(Matrix_iKernel, pnt_1, pnt_2, t)

iKernel_point_dataset(Matrix_iKernel, t, nr, iFeature_point)

get_weights_iKernel(GI, Matrix_iKernel, t, nr, iFeature_point)

GRAM_iKernel(Matrix_iKernel, check_pos_def = FALSE)

Arguments

Matrix_iKernel

Matrix of indexes of Voronoi cells for each point and each tree based on Isolation Kernel calculation

pnt_1

The first point of dataset

pnt_2

The second point of dataset

t

Number of columns of Matrix_iKernel, or the dimension of Matrix_iKernel (corresponding to the number of trees t)

nr

Number of rows in Matrix_iKernel, or the size of the dataset

iFeature_point

Feature mapping in RKHS for a new point, which can be obtained via the add_new_point_iKernel() function

GI

The inverse Gram matrix

check_pos_def

Logical; if TRUE, check that the Gram matrix is positive definite

Value

The function iKernel() returns the value of similarity, or Isolation KERNEL, for TWO points

The function iKernel_point_dataset() returns a vector of Isolation Kernel values between a new point and all the points of the dataset represented via Matrix_iKernel

The function get_weights_iKernel() returns the list of weights for the RKHS space and the similarity weights for iFeature_point

The function GRAM_iKernel() returns the Gram matrix of the Isolation Kernel

Examples

NULL
NULL
NULL 
NULL
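
In the general Isolation Kernel formulation, the similarity of two points is the fraction of trees in which both points fall into the same Voronoi cell, which gives a number in [0,1]. The sketch below assumes that definition and builds a Gram matrix from it; the helper names ikernel_similarity() and gram_sketch() are hypothetical and the code is not the package's iKernel() or GRAM_iKernel().

```r
# Hypothetical helpers (not part of MaxWiK).
# Assumption: similarity = share of trees where both points share a cell.
ikernel_similarity <- function(Matrix_iKernel, pnt_1, pnt_2) {
  mean(Matrix_iKernel[pnt_1, ] == Matrix_iKernel[pnt_2, ])
}

gram_sketch <- function(Matrix_iKernel) {
  n <- nrow(Matrix_iKernel)
  G <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      G[i, j] <- ikernel_similarity(Matrix_iKernel, i, j)
    }
  }
  G
}

# 3 points (rows), 4 trees (columns); entries are Voronoi cell indexes
M <- matrix(c(1L, 2L, 1L, 3L,
              1L, 2L, 2L, 3L,
              2L, 1L, 2L, 1L), nrow = 3, byrow = TRUE)
G <- gram_sketch(M)
print(G)
```

Points 1 and 2 share a cell in three of the four trees, so their similarity is 0.75, while points 1 and 3 never share a cell and get similarity 0.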

Function to get Approximate Bayesian Computation based on Maxima Weighted Isolation Kernel mapping

Description

The function meta_sampling() iteratively generates tracer points based on a simple procedure:

The function MaxWiK.predictor() uses meta-sampling for prediction

The function get.MaxWiK() performs Approximate Bayesian Computation based on the Maxima Weighted Isolation Kernel mapping. Given a data frame of parameters, the statistics of the simulations, and an observation, and using the internal parameters psi and t, get.MaxWiK() returns the parameter estimate corresponding to the Maxima weighted Isolation Kernel ABC method.

Usage

meta_sampling(
  psi = 4,
  t = 35,
  param,
  stat.sim,
  stat.obs,
  talkative = FALSE,
  check_pos_def = FALSE,
  n_bullets = 16,
  n_best = 10,
  halfwidth = 0.5,
  epsilon = 0.001,
  rate = 0.1,
  max_iteration = 15,
  save_web = TRUE,
  use.iKernelABC = NULL
)

MaxWiK.predictor(
  psi = 4,
  t = 35,
  param,
  stat.sim,
  new.param,
  talkative = FALSE,
  check_pos_def = FALSE,
  n_bullets = 16,
  n_best = 10,
  halfwidth = 0.5,
  epsilon = 0.001,
  rate = 0.1,
  max_iteration = 15,
  save_web = TRUE,
  use.iKernelABC = NULL
)

get.MaxWiK(
  psi = 40,
  t = 350,
  param,
  stat.sim,
  stat.obs,
  talkative = FALSE,
  check_pos_def = TRUE,
  Matrix_Voronoi = NULL
)

Arguments

psi

Integer number. Size of each Voronoi diagram or number of areas/points in the Voronoi diagrams

t

Integer number of trees in the Isolation Forest

param

Data frame of parameters of the model (also referred to as par.sim)

stat.sim

Summary statistics of the simulations (model output)

stat.obs

Summary statistics of the observation point

talkative

Logical; if TRUE, print messages

check_pos_def

Logical; if TRUE, check that the Gram matrix is positive definite

n_bullets

Number of points generated between two points

n_best

Number of the best points used to construct the next web net

halfwidth

Parameter of the algorithm that deletes generated points

epsilon

Criterion to stop meta-sampling

rate

Rate at which points are renewed in the web net of generated points

max_iteration

Maximum number of iterations during meta-sampling

save_web

Logical; if TRUE, save all the generated points (web net)

use.iKernelABC

The iKernelABC object to use for meta-sampling. By default it is NULL, in which case the object is generated.

new.param

New parameter for the predictor input

Matrix_Voronoi

Predefined matrix of information about Voronoi trees (rows: trees, columns: IDs of Voronoi points/areas). By default it is NULL, in which case it is generated randomly.

Value

The function meta_sampling() returns the list of the next objects:

The function MaxWiK.predictor() returns the list of the next objects:

The function get.MaxWiK() returns the list of :

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the template 'MaxWiK.ABC.R' and 
# vignettes for usage.
MaxWiK::MaxWiK_templates(dir = tempdir()) # See the template 'MaxWiK.Predictor.R' 
# and vignettes for usage. 
MaxWiK::MaxWiK_templates(dir = tempdir()) # See the template 'MaxWiK.ABC.R' and 
# vignettes for usage.

The norm function for vector

Description

The norm function for vector

Usage

norm_vec(x)

norm_vec_sq(x)

Arguments

x

numeric vector

Value

The square root of the sum of squared elements of the vector x, i.e. the Euclidean length of the vector x

The squared Euclidean norm, i.e. the sum of squared elements of the vector x

Examples

NULL
NULL
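
The two quantities described above can be written in base R as shown below. The helper names euclid_norm() and euclid_norm_sq() are hypothetical stand-ins, used so that the sketch does not depend on the package being installed.

```r
# Hypothetical helpers (not part of MaxWiK) mirroring the documented values.
euclid_norm_sq <- function(x) sum(x^2)               # sum of squared elements
euclid_norm    <- function(x) sqrt(euclid_norm_sq(x)) # Euclidean length

x <- c(3, 4)
len    <- euclid_norm(x)
len_sq <- euclid_norm_sq(x)
print(c(len, len_sq))
```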

Function to read file

Description

Function to read file

Usage

read_file(file_name = "", stringsAsFactors = FALSE, header = TRUE)

Arguments

file_name

Name of file to read

stringsAsFactors

Parameter for read.table function, by default stringsAsFactors = FALSE

header

Logical; if TRUE, read the header of the file

Value

data.frame of data from a file

Examples

NULL

Function to read hyperparameters and their values from the file

Description

Function to read hyperparameters and their values from the file

Usage

read_hyperparameters(input)

Arguments

input

File name to input

Value

Parameters and their values

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the templates and vignettes for usage.

Function to restrict data in the size to accelerate the calculations

Description

restrict_data() uses the rejection ABC method to reduce the original dataset

Usage

restrict_data(par.sim, stat.sim, stat.obs, size = 300)

Arguments

par.sim

Data frame of parameters

stat.sim

Data frame of outputs of simulations

stat.obs

Data frame of observation point

size

Integer number of points to keep from the original dataset

Value

restrict_data() returns a list of:
par.sim - restricted parameters whose simulations are close to the observation point
stat.sim - restricted stat.sim values that are close to the observation point

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the templates and vignettes for usage.
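
The idea of rejection-based restriction can be sketched in base R: rank the simulations by their distance to the observation and keep the closest ones. The sketch assumes squared Euclidean distance on unscaled statistics and uses the hypothetical helper name restrict_sketch(); the distance and any scaling used by restrict_data() itself are not specified here.

```r
# Hypothetical helper (not part of MaxWiK): keep the 'size' simulations whose
# statistics are closest to the observation.
# Assumption: squared Euclidean distance, no rescaling of the statistics.
restrict_sketch <- function(par_sim, stat_sim, stat_obs, size = 300) {
  obs  <- as.numeric(unlist(stat_obs[1, ]))
  d2   <- rowSums(sweep(as.matrix(stat_sim), 2, obs, "-")^2)
  keep <- order(d2)[seq_len(min(size, length(d2)))]
  list(par.sim  = par_sim[keep, , drop = FALSE],
       stat.sim = stat_sim[keep, , drop = FALSE])
}

par_sim    <- data.frame(p = c(10, 20, 30, 40))
stat_sim   <- data.frame(s = c(5, 1, 3, 9))
stat_obs   <- data.frame(s = 2)
restricted <- restrict_sketch(par_sim, stat_sim, stat_obs, size = 2)
print(restricted$par.sim)
```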

Function to generate parameters and simulate a model based on MaxWiK algorithm

Description

Function to generate parameters and simulate a model based on MaxWiK algorithm

Usage

sampler_MaxWiK(
  stat.obs,
  stat.sim,
  par.sim,
  model,
  arg0 = list(),
  size = 500,
  psi_t,
  epsilon,
  nmax = 100,
  include_top = FALSE,
  slowly = FALSE,
  rate = 0.2,
  n_simulation_stop = NA,
  check_err = TRUE,
  include_web_rings = TRUE,
  number_of_nodes_in_ring = 2
)

sampler_MaxWiK_parallel(
  stat.obs,
  stat.sim,
  par.sim,
  model,
  arg0 = list(),
  size = 500,
  psi_t,
  epsilon,
  nmax = 100,
  include_top = FALSE,
  slowly = FALSE,
  rate = 0.2,
  n_simulation_stop = NA,
  check_err = TRUE,
  include_web_rings = TRUE,
  number_of_nodes_in_ring = 2,
  cores = 4
)

Arguments

stat.obs

Summary statistics of the observation point

stat.sim

Summary statistics of the simulations (model output)

par.sim

Data frame of parameters of the model

model

Function to get output of simulation during sampling

arg0

List of arguments for the model function; arg0 is NOT changed during sampling

size

Number of points in the simulation based on MaxWiK algorithm

psi_t

Vector of psi and t hyperparameters.

epsilon

Criterion to stop simulation when MSE_current - MSE_previous < epsilon

nmax

Maximal number of iterations

include_top

Logical; whether to include the top points (network) from the spider_web() function in the simulations

slowly

Logical that switches between the two sampling algorithms: the slow seeker (TRUE) and the fast seeker (FALSE)

rate

Rate value in the range [0,1] that defines how fast the original top of sampled points changes in the slow scheme (if slowly = TRUE)

n_simulation_stop

Maximal number of simulations after which sampling stops. If n_simulation_stop = NA (the default), there is no restriction

check_err

Logical; whether to check the epsilon criterion

include_web_rings

Logical; whether to include the cobweb rings in the simulations

number_of_nodes_in_ring

Number of points/nodes between two points in the web ring. By default number_of_nodes_in_ring = 2

cores

Number of cores for parallel calculations of a model (4 by default)

Value

sampler_MaxWiK() returns the list:

sampler_MaxWiK_parallel() returns the same output as in sampler_MaxWiK().

Examples

MaxWiK::MaxWiK_templates(dir = tempdir()) # See the template 'MaxWiK.Sampling.R' 
# and vignettes for usage.
MaxWiK::MaxWiK_templates(dir = tempdir()) # See the template 'MaxWiK.Sampling.R' 
# and vignettes for usage. For parallel implementation 
# change the function 'sampler_MaxWiK()' to 'sampler_MaxWiK_parallel()'.

The function to get the best tracer bullets related to kernel mean embedding

Description

The function sudoku() finds the best tracer bullets related to the kernel mean embedding. The calculation is performed ONLY for the parameters dataset DT = par.sim. This function runs a heuristic algorithm that searches for the space/area related to the feature mapping in Hilbert space for the dataset of the parameters.
The main idea of the algorithm is:

  1. Generate points between the centers of Voronoi diagrams related to the Maxima weighted feature mapping based on Isolation Kernel

  2. Following the strategy for solving a SUDOKU puzzle, delete all points that do not match the feature mapping

  3. Output: the remaining points should correspond to the feature mapping.

The function get_pairs_of_data_frame() is used to get pairs of points from the data frame that are the most distant from each other. In other words, for each point in the data frame, the algorithm finds the most distant point to pair it with

The function generate_points_between_two_points() is used to generate points between two given points

The function get_tracer_bullets() is used to get 'tracer bullets', or tracer points, generated between all the pairs of the most distant points

Usage

sudoku(DT, iKernelABC, n_bullets = 20, n_best = 10, halfwidth = 0.5)

get_pairs_of_data_frame(DF)

generate_points_between_two_points(pair, n = 10)

get_tracer_bullets(DF, n_bullets = 20)

Arguments

DT

Whole dataset of parameters

iKernelABC

Result of calculations based on Isolation Kernel ABC that can be gotten by the function get.MaxWiK()

n_bullets

Integer number of tracer points between each pair of points from DF

n_best

Integer number of the best tracer bullets / points to consider at the next algorithmic step

halfwidth

Criterion to choose the best tracer points:
if similarity_of_point >= halfwidth then the point is included in the pool of the best points

DF

Data frame of points that is used for generation of tracer points; it is usually a subset of points corresponding to Voronoi sites/seeds

pair

Data frame of two points

n

Integer number of points that should be located between two input points

Value

The function sudoku() returns the list of next objects:

The function get_pairs_of_data_frame() returns the list of the pairs of points

The function generate_points_between_two_points() returns data frame of generated points between two given points, including given points as the first and the last rows

The function get_tracer_bullets() returns data frame of generated tracer points

Examples

NULL

NULL 
NULL
NULL
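
The pairing and point-generation steps can be illustrated with a short base R sketch. It assumes Euclidean distance for finding the most distant partner and evenly spaced linear interpolation for the generated points; neither choice is confirmed by this documentation, and the helper names farthest_partner() and points_between() are hypothetical.

```r
# Hypothetical helpers (not part of MaxWiK).
# Assumption: 'most distant' means largest Euclidean distance.
farthest_partner <- function(DF) {
  D <- as.matrix(stats::dist(DF))
  unname(apply(D, 1, which.max))  # row index of the farthest point, per point
}

# Assumption: generated points are evenly spaced on the connecting segment.
# Returns n + 2 rows: the first given point, n generated points, the second point.
points_between <- function(pair, n = 10) {
  w <- seq(0, 1, length.out = n + 2)
  as.data.frame(sapply(pair, function(col) col[1] + w * (col[2] - col[1])))
}

DF       <- data.frame(x = c(0, 1, 4), y = c(0, 0, 0))
partners <- farthest_partner(DF)
pair     <- data.frame(x = c(0, 1), y = c(0, 2))
bullets  <- points_between(pair, n = 3)
print(partners)
print(bullets)
```

In this toy data the third point is the farthest partner of the first two, and the first point is the farthest partner of the third.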