Title: Correlation-Based and Model-Based Predictor Pruning
Version: 3.0.2
Description: Provides functions for predictor pruning using association-based and model-based approaches. Includes corrPrune() for fast correlation-based pruning, modelPrune() for VIF-based regression pruning, and exact graph-theoretic algorithms (Eppstein–Löffler–Strash, Bron–Kerbosch) for exhaustive subset enumeration. Supports linear models, GLMs, and mixed models ('lme4', 'glmmTMB').
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3
LinkingTo: Rcpp
Imports: Rcpp, methods, stats
Suggests: svglite, GO.db, WGCNA, preprocessCore, impute, energy, minerva, lme4, glmmTMB, MASS, caret, car, carData, microbenchmark, igraph, Boruta, glmnet, corrplot, knitr, rmarkdown, testthat (≥ 3.0.0), tibble
VignetteBuilder: knitr
URL: https://gillescolling.com/corrselect/
BugReports: https://github.com/gcol33/corrselect/issues
Depends: R (≥ 3.5)
LazyData: true
NeedsCompilation: yes
Packaged: 2025-11-28 19:34:15 UTC; Gilles Colling
Author: Gilles Colling [aut, cre]
Maintainer: Gilles Colling <gilles.colling051@gmail.com>
Repository: CRAN
Date/Publication: 2025-11-29 16:40:02 UTC

CorrCombo S4 class

Description

Holds the result of corrSelect or MatSelect: a list of valid variable combinations and their correlation statistics.

This class stores all subsets of variables that meet the specified correlation constraint, along with metadata such as the algorithm used, correlation method(s), variables forced into every subset, and summary statistics for each combination.

Usage

## S4 method for signature 'CorrCombo'
show(object)

Arguments

object

A CorrCombo object to be printed.

Slots

subset_list

A list of character vectors. Each vector is a valid subset (variable names).

avg_corr

A numeric vector. Average absolute correlation within each subset.

min_corr

A numeric vector. Minimum pairwise absolute correlation in each subset.

max_corr

A numeric vector. Maximum pairwise absolute correlation within each subset.

names

Character vector of all variable names used for decoding.

threshold

Numeric scalar. The correlation threshold used during selection.

forced_in

Character vector. Variable names that were forced into each subset.

search_type

Character string. One of "els" or "bron-kerbosch".

cor_method

Character string. Either a single method (e.g. "pearson") or "mixed" if multiple methods were used.

n_rows_used

Integer. Number of rows used for computing the correlation matrix (after removing missing values).
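
The three summary slots can be illustrated with a small base-R sketch. It assumes that the statistics are taken over the distinct off-diagonal pairs of a subset; subset_stats() is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper, not part of corrselect: summarise the absolute
# pairwise correlations among the variables of one subset.
# Assumption: each unordered pair is counted once, diagonal excluded.
subset_stats <- function(cmat, vars) {
  sub <- abs(cmat[vars, vars, drop = FALSE])
  vals <- sub[upper.tri(sub)]
  c(avg_corr = mean(vals), min_corr = min(vals), max_corr = max(vals))
}

vars <- c("A", "B", "C")
cmat <- matrix(c(1.0,  0.1,  0.3,
                 0.1,  1.0, -0.2,
                 0.3, -0.2,  1.0),
               nrow = 3, dimnames = list(vars, vars))

stats_abc <- subset_stats(cmat, vars)
stats_abc
```

The sketch does not handle single-variable subsets, for which no pairwise value exists.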

See Also

corrSelect, MatSelect, corrSubset

Examples

show(new("CorrCombo",
  subset_list = list(c("A", "B"), c("A", "C")),
  avg_corr = c(0.2, 0.3),
  min_corr = c(0.1, 0.2),
  max_corr = c(0.3, 0.4),
  names = c("A", "B", "C"),
  threshold = 0.5,
  forced_in = character(),
  search_type = "els",
  cor_method = "mixed",
  n_rows_used = as.integer(5)
))


Select Variable Subsets with Low Correlation or Association (Matrix Interface)

Description

Identifies all maximal subsets of variables from a symmetric matrix (typically a correlation matrix) such that all pairwise absolute values stay below a specified threshold. Implements exact algorithms such as Eppstein–Löffler–Strash (ELS) and Bron–Kerbosch (with or without pivoting).

Usage

MatSelect(mat, threshold = 0.7, method = NULL, force_in = NULL, ...)

Arguments

mat

A numeric, symmetric matrix with 1s on the diagonal (e.g. correlation matrix). Column names (if present) are used to label output variables.

threshold

A numeric scalar in (0, 1). Maximum allowed absolute pairwise value. Defaults to 0.7.

method

Character. Selection algorithm to use. One of "els" or "bron-kerbosch". If not specified, the function chooses automatically: "els" when force_in is provided, otherwise "bron-kerbosch".

force_in

Optional integer vector of 1-based column indices to force into every subset.

...

Additional arguments passed to the backend, e.g., use_pivot (logical) for enabling pivoting in Bron–Kerbosch (ignored by ELS).

Value

An object of class CorrCombo, containing all valid subsets and their correlation statistics.
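
To make the idea of "all maximal subsets" concrete, the following base-R sketch enumerates them with a plain Bron-Kerbosch recursion (no pivoting) on the graph whose edges join variables that may appear together. It is an illustration only, not the package's compiled backend; maximal_subsets() is a hypothetical name, and treating a value exactly equal to the threshold as a violation is an assumption.

```r
# Illustrative Bron-Kerbosch without pivoting (hypothetical helper).
# Two variables are "compatible" when |value| < threshold (assumed boundary).
maximal_subsets <- function(cmat, threshold) {
  adj <- abs(cmat) < threshold
  diag(adj) <- FALSE
  found <- list()
  expand <- function(R, P, X) {
    if (length(P) == 0 && length(X) == 0) {
      found[[length(found) + 1]] <<- sort(R)   # R is a maximal clique
      return(invisible(NULL))
    }
    for (v in P) {
      nb <- unname(which(adj[v, ]))
      expand(c(R, v), intersect(P, nb), intersect(X, nb))
      P <- setdiff(P, v)   # v has been fully explored
      X <- union(X, v)
    }
    invisible(NULL)
  }
  expand(integer(0), seq_len(ncol(cmat)), integer(0))
  lapply(found, function(idx) colnames(cmat)[idx])
}

# Toy matrix: V1/V2 and V3/V4 are strongly correlated, all other pairs are 0
cmat <- diag(4)
dimnames(cmat) <- list(paste0("V", 1:4), paste0("V", 1:4))
cmat["V1", "V2"] <- cmat["V2", "V1"] <- 0.9
cmat["V3", "V4"] <- cmat["V4", "V3"] <- 0.8

subsets <- maximal_subsets(cmat, threshold = 0.7)
subsets
```

Each returned subset takes one variable from each correlated pair, which is the behaviour the threshold constraint implies for this toy matrix.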

Examples

set.seed(42)
mat <- matrix(rnorm(100), ncol = 10)
colnames(mat) <- paste0("V", 1:10)
cmat <- cor(mat)

# Default method (Bron-Kerbosch)
res1 <- MatSelect(cmat, threshold = 0.5)

# Bron–Kerbosch without pivot
res2 <- MatSelect(cmat, threshold = 0.5, method = "bron-kerbosch", use_pivot = FALSE)

# Bron–Kerbosch with pivoting
res3 <- MatSelect(cmat, threshold = 0.5, method = "bron-kerbosch", use_pivot = TRUE)

# Force variable 1 into every subset (with warning if too correlated)
res4 <- MatSelect(cmat, threshold = 0.5, force_in = 1)


Coerce CorrCombo to a Data Frame

Description

Converts a CorrCombo object into a data frame of variable combinations.

Usage

## S3 method for class 'CorrCombo'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)

Arguments

x

A CorrCombo object.

row.names

Optional row names for the output data frame.

optional

Logical. Passed to data.frame().

...

Additional arguments passed to data.frame().

Value

A data frame where each row corresponds to a subset of variables. Columns are named VarName01, VarName02, ..., up to the size of the largest subset. Subsets shorter than the maximum length are padded with NA.
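
The padded layout described above can be reproduced by hand. The sketch below is a base-R illustration under the assumption that column names use two-digit zero padding, as the names VarName01 and VarName02 suggest; it does not call the package method.

```r
# Illustration only: pad subsets of unequal length with NA and bind them
# row-wise, mimicking the documented output layout.
subsets <- list(c("A", "B", "C"), c("A", "D"))

width <- max(lengths(subsets))
padded <- lapply(subsets, function(s) {
  c(s, rep(NA_character_, width - length(s)))
})

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- sprintf("VarName%02d", seq_len(width))  # assumed naming scheme
out
```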

See Also

CorrCombo

Examples

set.seed(1)
mat <- matrix(rnorm(100), ncol = 10)
colnames(mat) <- paste0("V", 1:10)
res <- MatSelect(cor(mat), threshold = 0.5)
as.data.frame(res)

Select Variable Subsets with Low Association (Mixed-Type Data Frame Interface)

Description

Identifies combinations of variables of any common data type (numeric, ordered factors, or unordered factors) whose pairwise association does not exceed a user-supplied threshold. The routine wraps MatSelect() and handles all preprocessing (type conversion, missing-row removal, constant-column checks) for typical data-frame/tibble/data-table inputs.

Usage

assocSelect(
  df,
  threshold = 0.7,
  method = NULL,
  force_in = NULL,
  method_num_num = c("pearson", "spearman", "kendall", "bicor", "distance", "maximal"),
  method_num_ord = c("spearman", "kendall"),
  method_ord_ord = c("spearman", "kendall"),
  ...
)

Arguments

df

A data frame (or tibble / data.table). May contain any mix of:

  • numeric / integer (treated as numeric)

  • ordered factors

  • unordered factors (character vectors are coerced to factors)

threshold

Numeric in (0,1). Maximum allowed pair-wise absolute association. Default 0.7.

method

Character; the subset-search algorithm. One of "els" or "bron-kerbosch". If NULL (default) the function selects automatically: ELS when force_in is supplied, otherwise Bron–Kerbosch.

force_in

Optional character vector or column indices specifying variables that must appear in every returned subset.

method_num_num

Association measure for numeric–numeric pairs. One of "pearson" (default), "spearman", "kendall", "bicor", "distance", or "maximal".

method_num_ord

Association measure for numeric–ordered pairs. One of "spearman" (default) or "kendall".

method_ord_ord

Association measure for ordered–ordered pairs. One of "spearman" (default) or "kendall".

...

Additional arguments passed unchanged to MatSelect() (e.g., use_pivot = TRUE for Bron–Kerbosch).

Details

A single call can screen a data set that mixes continuous and categorical features and return every subset whose internal associations are “sufficiently low” under the metric(s) you choose.

Rows containing NA are dropped with a warning; constant columns are treated as having zero association with every other variable.

The default association measure for each variable-type combination is:

numeric – numeric

method_num_num (default "pearson")

numeric – ordered

method_num_ord

numeric – unordered

"eta" (ANOVA η²)

ordered – ordered

method_ord_ord

ordered – unordered

"cramersv"

unordered – unordered

"cramersv"

All association measures are rescaled to [0,1] before thresholding. External packages are required for "bicor" (WGCNA), "distance" (energy), and "maximal" (minerva); an informative error is thrown if they are missing.
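
For readers unfamiliar with the two categorical measures named above, here is a minimal base-R sketch of their textbook definitions. The helper names eta_squared() and cramers_v() are hypothetical, and the package's own implementation (including any rescaling it applies) may differ.

```r
# Textbook definitions, written as hypothetical helpers.

# Eta squared: share of the variance of x explained by the grouping g
eta_squared <- function(x, g) {
  g <- droplevels(as.factor(g))
  grand <- mean(x)
  ss_total <- sum((x - grand)^2)
  group_means <- tapply(x, g, mean)
  ss_between <- sum(table(g) * (group_means - grand)^2)
  as.numeric(ss_between / ss_total)
}

# Cramer's V: chi-squared statistic scaled to [0, 1]
cramers_v <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(unname(chi2) / (sum(tab) * k))
}

x <- c(1, 2, 3, 11, 12, 13)
g <- c("a", "a", "a", "b", "b", "b")
eta_xg <- eta_squared(x, g)

v_perfect <- cramers_v(c("u", "u", "v", "v"), c("p", "p", "q", "q"))
v_none    <- cramers_v(c("u", "u", "v", "v"), c("p", "q", "p", "q"))
```

Both quantities already lie in [0, 1], which is consistent with applying a single threshold across variable types.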

Value

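A CorrCombo S4 object containing all valid subsets and their association statistics (see the CorrCombo class for the individual slots).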

The object’s show() method prints the association metrics that were actually used for this data set.

See Also

corrSelect(), MatSelect(), corrSubset()

Examples

set.seed(42)
df <- data.frame(
  height = rnorm(15, 170, 10),
  weight = rnorm(15, 70, 12),
  group  = factor(rep(LETTERS[1:3], each = 5)),
  score  = ordered(sample(c("low","med","high"), 15, TRUE))
)

## keep every subset whose internal associations <= 0.6
assocSelect(df, threshold = 0.6)

## use Kendall for all rank-based comparisons and force 'height' to appear
assocSelect(df,
            threshold       = 0.5,
            method_num_num  = "kendall",
            method_num_ord  = "kendall",
            method_ord_ord  = "kendall",
            force_in        = "height")


Example Bioclimatic Data for Ecological Modeling

Description

A simulated dataset with the 19 WorldClim bioclimatic variables (https://www.worldclim.org/data/bioclim.html) measured at 100 geographic locations, with species richness as the response variable. Variables are organized into correlated blocks representing temperature (BIO1-BIO11) and precipitation (BIO12-BIO19).

Usage

bioclim_example

Format

A data frame with 100 rows and 20 variables:

species_richness

Integer. Number of species observed (response variable)

BIO1

Numeric. Annual Mean Temperature

BIO2

Numeric. Mean Diurnal Range

BIO3

Numeric. Isothermality

BIO4

Numeric. Temperature Seasonality

BIO5

Numeric. Max Temperature of Warmest Month

BIO6

Numeric. Min Temperature of Coldest Month

BIO7

Numeric. Temperature Annual Range

BIO8

Numeric. Mean Temperature of Wettest Quarter

BIO9

Numeric. Mean Temperature of Driest Quarter

BIO10

Numeric. Mean Temperature of Warmest Quarter

BIO11

Numeric. Mean Temperature of Coldest Quarter

BIO12

Numeric. Annual Precipitation

BIO13

Numeric. Precipitation of Wettest Month

BIO14

Numeric. Precipitation of Driest Month

BIO15

Numeric. Precipitation Seasonality

BIO16

Numeric. Precipitation of Wettest Quarter

BIO17

Numeric. Precipitation of Driest Quarter

BIO18

Numeric. Precipitation of Warmest Quarter

BIO19

Numeric. Precipitation of Coldest Quarter

Details

This dataset demonstrates a common problem in ecological modeling: bioclimatic predictors are highly correlated within groups (temperature variables BIO1-BIO11 are highly correlated; precipitation variables BIO12-BIO19 are moderately correlated), leading to multicollinearity issues. The species richness response depends on a subset of predictors.

Use case: Demonstrating corrPrune() and modelPrune() for reducing correlated environmental predictors before fitting species distribution models.

Source

Simulated data based on the 19 WorldClim bioclimatic variables

See Also

corrPrune(), modelPrune()

Examples

data(bioclim_example)

# The 19 WorldClim bioclimatic variables (https://www.worldclim.org/data/bioclim.html)
# Many are highly correlated, making them ideal for pruning

# Remove highly correlated variables
pruned <- corrPrune(bioclim_example[, -1], threshold = 0.7)
ncol(pruned)  # Reduced from 19 to ~8 variables

# Model-based pruning with VIF
model_data <- modelPrune(species_richness ~ .,
                         data = bioclim_example,
                         limit = 5)
attr(model_data, "selected_vars")

Example Correlation Matrix with Block Structure

Description

A 20x20 correlation matrix with known block structure designed for demonstrating threshold selection, algorithm comparison, and visualization examples in vignettes.

Usage

cor_example

Format

A 20x20 numeric correlation matrix with row and column names V1-V20. The matrix has four distinct correlation blocks:

Block 1 (V1-V5)

High correlation: mean = 0.81, range = (0.75, 0.95)

Block 2 (V6-V10)

Moderate correlation: mean = 0.57, range = (0.5, 0.7)

Block 3 (V11-V15)

Low correlation: mean = 0.28, range = (0.2, 0.4)

Block 4 (V16-V20)

Minimal correlation: mean = 0.06, range = (0.0, 0.15)

Between-block correlations are low (range = (0.0, 0.3)). The matrix is guaranteed to be positive definite.

Details

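This dataset provides a controlled correlation structure useful for demonstrating threshold selection, algorithm comparison, and visualization.

Expected behavior with different thresholds follows from the block ranges listed under Format. For example, at threshold = 0.7 every pair within Block 1 (V1-V5) exceeds the threshold, so no valid subset can contain more than one of those variables, while all pairs within Blocks 3 and 4 stay below it.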

Source

Generated with data-raw/create_cor_example.R using seed 20250125.

Examples

data(cor_example)

# Matrix dimensions
dim(cor_example)

# Visualize structure
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_example, method = "color", type = "upper",
                     tl.col = "black", tl.cex = 0.7)
}

# Distribution of correlations
hist(cor_example[upper.tri(cor_example)],
     breaks = 30,
     main = "Distribution of Correlations in cor_example",
     xlab = "Correlation",
     col = "steelblue")

# Use with MatSelect
library(corrselect)
results <- MatSelect(cor_example, threshold = 0.7, method = "els")
show(results)


Association-Based Predictor Pruning

Description

corrPrune() performs model-free variable subset selection by iteratively removing predictors until all pairwise associations fall below a specified threshold. It returns a single pruned data frame with predictors that satisfy the association constraint.

Usage

corrPrune(
  data,
  threshold = 0.7,
  measure = "auto",
  mode = "auto",
  force_in = NULL,
  by = NULL,
  group_q = 1,
  max_exact_p = 100,
  ...
)

Arguments

data

A data.frame containing candidate predictors.

threshold

Numeric scalar. Maximum allowed pairwise association (default: 0.7). Must be non-negative.

measure

Character string specifying the association measure to use. Options: "auto" (default), "pearson", "spearman", "kendall", "cramersv", "eta", etc. When "auto", Pearson correlation is used for all-numeric data, and appropriate measures are selected for mixed-type data.

mode

Character string specifying the search algorithm. Options:

  • "auto" (default): uses exact search if number of predictors <= max_exact_p, otherwise uses greedy search

  • "exact": exhaustive search for maximal subsets (may be slow for large p)

  • "greedy": fast approximate search using iterative removal

force_in

Character vector of variable names that must be retained in the final subset. Default: NULL.

by

Character vector naming one or more grouping variables. If provided, associations are computed separately within each group, then aggregated using the quantile specified by group_q. Default: NULL (no grouping).

group_q

Numeric scalar in (0, 1]. Quantile used to aggregate associations across groups when by is provided. Default: 1 (maximum, ensuring threshold holds in all groups). Use 0.9 for 90th percentile, etc.

max_exact_p

Integer. Maximum number of predictors for which exact mode is used when mode = "auto". Default: 100.

...

Additional arguments (reserved for future use).

Details

corrPrune() identifies a subset of predictors whose pairwise associations are all below threshold. The function works in several stages:

  1. Variable type detection: Identifies numeric vs. categorical predictors

  2. Association measurement: Computes appropriate pairwise associations

  3. Grouping (optional): If by is specified, computes associations within each group and aggregates using the specified quantile

  4. Feasibility check: Verifies that force_in variables satisfy the threshold constraint

  5. Subset selection: Uses either exact or greedy search to find a valid subset

Grouped Pruning: When by is provided, the function ensures the selected predictors satisfy the threshold constraint across groups. For example, with group_q = 1 (default), the returned predictors will have pairwise associations below threshold in all groups. With group_q = 0.9, they will satisfy the constraint in at least 90% of groups.
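
The aggregation step can be sketched in base R as follows. This is an illustration under stated assumptions: Pearson correlation is used as the association measure, aggregate_assoc() is a hypothetical helper, and the built-in iris data stands in for grouped input. The package's internal aggregation may differ in detail.

```r
# Hypothetical helper: per-group absolute correlations, combined
# element-wise with the quantile given by group_q.
aggregate_assoc <- function(data, by, group_q = 1) {
  predictors <- setdiff(names(data), by)
  groups <- split(data[predictors], data[[by]], drop = TRUE)
  mats <- lapply(groups, function(g) abs(cor(g)))
  arr <- simplify2array(mats)               # p x p x number of groups
  apply(arr, c(1, 2), quantile, probs = group_q, names = FALSE)
}

agg_max <- aggregate_assoc(iris, by = "Species", group_q = 1)
agg_q90 <- aggregate_assoc(iris, by = "Species", group_q = 0.9)
round(agg_max, 2)
```

With group_q = 1 the aggregated matrix is the element-wise maximum over groups, so a pair passes the threshold only if it passes in every group.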

Mode Selection: Exact mode guarantees finding all maximal subsets and returns the largest one (with deterministic tie-breaking). Greedy mode is faster but approximate, using a deterministic removal strategy based on association scores.
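
A greedy removal strategy of this kind can be sketched in base R. The scoring rule below (most threshold violations first, ties broken by higher mean association) is an assumption made for illustration; the documentation does not spell out the exact rule, and greedy_prune() is a hypothetical helper rather than the package's implementation.

```r
# Hypothetical greedy pruning sketch: repeatedly drop the variable that
# takes part in the most violations until none remain.
greedy_prune <- function(cmat, threshold, force_in = character()) {
  a <- abs(cmat)
  diag(a) <- 0
  keep <- colnames(a)
  repeat {
    sub <- a[keep, keep, drop = FALSE]
    violations <- rowSums(sub >= threshold)
    if (all(violations == 0)) break
    candidates <- setdiff(keep[violations > 0], force_in)
    if (length(candidates) == 0) {
      stop("force_in variables violate the threshold among themselves")
    }
    # Assumed rule: most violations first, then higher mean association
    ord <- order(-violations[candidates], -rowMeans(sub)[candidates])
    keep <- setdiff(keep, candidates[ord[1]])
  }
  keep
}

cmat <- cor(mtcars)
kept <- greedy_prune(cmat, threshold = 0.7)
kept_forced <- greedy_prune(cmat, threshold = 0.7, force_in = "mpg")
```

Because each iteration removes exactly one variable, the loop always terminates, but the result is not guaranteed to be the largest valid subset.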

Value

A data.frame containing the pruned subset of predictors. The result has the following attributes:

selected_vars

Character vector of retained variable names

removed_vars

Character vector of removed variable names

mode

Character string indicating which mode was used ("exact" or "greedy")

measure

Character string indicating which association measure was used

threshold

The threshold value used

See Also

corrSelect for exhaustive subset enumeration, assocSelect for mixed-type data subset enumeration, modelPrune for model-based predictor pruning.

Examples

# Basic numeric data pruning
data(mtcars)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Force certain variables to be included
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")

# Use greedy mode for faster computation
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")


Select Variable Subsets with Low Correlation (Data Frame Interface)

Description

Identifies combinations of numeric variables in a data frame such that all pairwise absolute correlations fall below a specified threshold. This function is a wrapper around MatSelect() and accepts data frames, tibbles, or data tables with automatic preprocessing.

Usage

corrSelect(
  df,
  threshold = 0.7,
  method = NULL,
  force_in = NULL,
  cor_method = c("pearson", "spearman", "kendall", "bicor", "distance", "maximal"),
  ...
)

Arguments

df

A data frame. Only numeric columns are used.

threshold

A numeric value in (0, 1). Maximum allowed absolute correlation. Defaults to 0.7.

method

Character. Selection algorithm to use. One of "els" or "bron-kerbosch". If not specified, the function chooses automatically: "els" when force_in is provided, otherwise "bron-kerbosch".

force_in

Optional character vector or numeric indices of columns to force into all subsets.

cor_method

Character string indicating which correlation method to use. One of "pearson" (default), "spearman", "kendall", "bicor", "distance", or "maximal".

...

Additional arguments passed to MatSelect(), e.g., use_pivot.

Details

Only numeric columns are used for correlation analysis. Non‐numeric columns (factors, characters, logicals, etc.) are ignored, and their names and types are printed to inform the user. These can be optionally reattached later using corrSubset() with keepExtra = TRUE.

Rows with missing values are removed before computing correlations. A warning is issued if any rows are dropped.
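
The preprocessing described above (keep numeric columns, drop incomplete rows with a warning) can be sketched in base R. prep_numeric() is a hypothetical helper for illustration and uses the built-in airquality data; it is not the package's internal code.

```r
# Hypothetical preprocessing sketch: numeric columns only, complete rows only.
prep_numeric <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  num <- df[, is_num, drop = FALSE]
  complete <- stats::complete.cases(num)
  if (!all(complete)) {
    warning(sum(!complete), " row(s) with missing values removed")
  }
  num <- num[complete, , drop = FALSE]
  list(cmat = cor(num),
       n_rows_used = nrow(num),
       ignored = names(df)[!is_num])
}

aq <- airquality
aq$label <- factor(ifelse(aq$Month < 7, "early", "late"))  # non-numeric column
prep <- suppressWarnings(prep_numeric(aq))
prep$n_rows_used
prep$ignored
```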

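The cor_method controls how the correlation matrix is computed:

  • "pearson", "spearman", "kendall": standard correlations as computed by stats::cor()

  • "bicor": biweight midcorrelation (requires WGCNA)

  • "distance": distance correlation (requires energy)

  • "maximal": maximal information coefficient (requires minerva)

For "bicor", "distance", and "maximal", the corresponding package must be installed.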

Value

An object of class CorrCombo, containing selected subsets and correlation statistics.

See Also

assocSelect(), MatSelect(), corrSubset()

Examples

set.seed(42)
n <- 100

# Create 20 variables in 5 blocks of 4: block2 is highly correlated,
# the other blocks are independent noise
block1 <- matrix(rnorm(n * 4), ncol = 4)
block2 <- matrix(rnorm(n), ncol = 1)
block2 <- matrix(rep(block2, 4), ncol = 4) + matrix(rnorm(n * 4, sd = 0.1), ncol = 4)
block3 <- matrix(rnorm(n * 4), ncol = 4)
block4 <- matrix(rnorm(n * 4), ncol = 4)
block5 <- matrix(rnorm(n * 4), ncol = 4)

df <- as.data.frame(cbind(block1, block2, block3, block4, block5))
colnames(df) <- paste0("V", 1:20)

# Add a non-numeric column to be ignored
df$label <- factor(sample(c("A", "B"), n, replace = TRUE))

# Basic usage
corrSelect(df, threshold = 0.8)

# Try Bron–Kerbosch with pivoting
corrSelect(df, threshold = 0.6, method = "bron-kerbosch", use_pivot = TRUE)

# Force in a specific variable and use Spearman correlation
corrSelect(df, threshold = 0.6, force_in = "V10", cor_method = "spearman")


Extract Variable Subsets from a CorrCombo Object

Description

Extracts one or more variable subsets from a CorrCombo object as data frames. Typically used after corrSelect or MatSelect to obtain filtered versions of the original dataset containing only low‐correlation variable combinations.

Usage

corrSubset(res, df, which = "best", keepExtra = FALSE)

Arguments

res

A CorrCombo object returned by corrSelect or MatSelect.

df

A data frame or matrix. Must contain all variables listed in res@names. Columns not in res@names are ignored unless keepExtra = TRUE.

which

Subsets to extract. One of:

  • "best" (default) or 1: the top‐ranked subset.

  • A single integer (e.g. 2): the nth ranked subset.

  • A vector of integers (e.g. 1:3): multiple subsets.

  • "all": all available subsets.

Subsets are ranked by decreasing size, then increasing average correlation.

keepExtra

Logical. If TRUE, columns in df not in res@names (e.g., factors, characters) are retained. Defaults to FALSE.

Value

A data frame if a single subset is extracted, or a list of data frames if multiple subsets are extracted. Each data frame contains the selected variables (and optionally extras).
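
The ranking rule stated for the which argument (decreasing size, then increasing average correlation) corresponds to a two-key sort. The sketch below uses made-up subsets and averages purely for illustration.

```r
# Made-up example values, for illustration only
subset_list <- list(c("A", "B"), c("A", "C", "D"), c("B", "C", "D"))
avg_corr <- c(0.10, 0.30, 0.20)

# Larger subsets first; among equal sizes, lower average correlation first
rank_order <- order(-lengths(subset_list), avg_corr)
ranked <- subset_list[rank_order]
ranked[[1]]   # the subset that which = "best" would correspond to here
```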

Note

A warning is issued if any rows contain missing values in the selected variables.

See Also

corrSelect, MatSelect, CorrCombo

Examples

# Simulate input data
set.seed(123)
df <- as.data.frame(matrix(rnorm(100), nrow = 10))
colnames(df) <- paste0("V", 1:10)

# Compute correlation matrix
cmat <- cor(df)

# Select subsets from the correlation matrix using MatSelect
res <- MatSelect(cmat, threshold = 0.5)

# Extract the best subset (default)
corrSubset(res, df)

# Extract the second-best subset
corrSubset(res, df, which = 2)

# Extract the first three subsets
corrSubset(res, df, which = 1:3)

# Extract all subsets
corrSubset(res, df, which = "all")

# Extract best subset and retain additional numeric column
df$CopyV1 <- df$V1
corrSubset(res, df, which = 1, keepExtra = TRUE)


Example Gene Expression Data for Bioinformatics

Description

A simulated gene expression dataset with 200 genes measured across 100 samples, organized into co-expression modules with a binary disease outcome.

Usage

genes_example

Format

A data frame with 100 rows and 202 variables:

sample_id

Character. Unique sample identifier

disease_status

Factor. Disease status (Healthy, Disease)

GENE001, GENE002, GENE003, GENE004, GENE005, GENE006, GENE007, GENE008, GENE009, GENE010, GENE011, GENE012, GENE013, GENE014, GENE015, GENE016, GENE017, GENE018, GENE019, GENE020, GENE021, GENE022, GENE023, GENE024, GENE025, GENE026, GENE027, GENE028, GENE029, GENE030, GENE031, GENE032, GENE033, GENE034, GENE035, GENE036, GENE037, GENE038, GENE039, GENE040, GENE041, GENE042, GENE043, GENE044, GENE045, GENE046, GENE047, GENE048, GENE049, GENE050, GENE051, GENE052, GENE053, GENE054, GENE055, GENE056, GENE057, GENE058, GENE059, GENE060, GENE061, GENE062, GENE063, GENE064, GENE065, GENE066, GENE067, GENE068, GENE069, GENE070, GENE071, GENE072, GENE073, GENE074, GENE075, GENE076, GENE077, GENE078, GENE079, GENE080, GENE081, GENE082, GENE083, GENE084, GENE085, GENE086, GENE087, GENE088, GENE089, GENE090, GENE091, GENE092, GENE093, GENE094, GENE095, GENE096, GENE097, GENE098, GENE099, GENE100, GENE101, GENE102, GENE103, GENE104, GENE105, GENE106, GENE107, GENE108, GENE109, GENE110, GENE111, GENE112, GENE113, GENE114, GENE115, GENE116, GENE117, GENE118, GENE119, GENE120, GENE121, GENE122, GENE123, GENE124, GENE125, GENE126, GENE127, GENE128, GENE129, GENE130, GENE131, GENE132, GENE133, GENE134, GENE135, GENE136, GENE137, GENE138, GENE139, GENE140, GENE141, GENE142, GENE143, GENE144, GENE145, GENE146, GENE147, GENE148, GENE149, GENE150, GENE151, GENE152, GENE153, GENE154, GENE155, GENE156, GENE157, GENE158, GENE159, GENE160, GENE161, GENE162, GENE163, GENE164, GENE165, GENE166, GENE167, GENE168, GENE169, GENE170, GENE171, GENE172, GENE173, GENE174, GENE175, GENE176, GENE177, GENE178, GENE179, GENE180, GENE181, GENE182, GENE183, GENE184, GENE185, GENE186, GENE187, GENE188, GENE189, GENE190, GENE191, GENE192, GENE193, GENE194, GENE195, GENE196, GENE197, GENE198, GENE199, GENE200

Numeric. Gene expression values (log-transformed)

Details

This dataset simulates a high-dimensional, low-sample scenario common in genomics. Genes are organized into four co-expression modules.

Disease outcome depends on a subset of genes from Module 1.

Use case: Demonstrating corrPrune() with mode = "greedy" for handling high-dimensional data efficiently.

Source

Simulated data based on typical gene expression microarray structures

See Also

corrPrune()

Examples

data(genes_example)

# Greedy pruning for high-dimensional data
gene_data <- genes_example[, -(1:2)]  # Exclude ID and outcome
pruned <- corrPrune(gene_data, threshold = 0.8, mode = "greedy")
ncol(pruned)  # Reduced from 200 to ~50 genes

# Use pruned genes for classification
pruned_with_outcome <- data.frame(
  disease_status = genes_example$disease_status,
  pruned
)

Example Longitudinal Data for Clinical Research

Description

A simulated longitudinal study dataset with 50 subjects measured at 10 timepoints each, with 20 correlated predictors and nested random effects (subject and site).

Usage

longitudinal_example

Format

A data frame with 500 rows and 25 variables:

obs_id

Integer. Observation identifier (1-500)

subject

Factor. Subject identifier (1-50)

site

Factor. Study site identifier (1-5)

time

Integer. Measurement timepoint (1-10)

outcome

Numeric. Continuous outcome variable

x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20

Numeric. Correlated predictor variables

Details

This dataset represents a typical longitudinal study with repeated measures. Predictors are correlated both within and between subjects.

The outcome depends on time (linear trend), random effects (subject and site), and a subset of fixed-effect predictors (x1, x5, x15).

Use case: Demonstrating modelPrune() with mixed models (lme4 engine) to prune fixed effects while preserving random effects structure.

Source

Simulated data based on typical clinical trial designs

See Also

modelPrune()

Examples

data(longitudinal_example)

## Not run: 
# Prune fixed effects in mixed model (requires lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  pruned <- modelPrune(
    outcome ~ x1 + x2 + x3 + x4 + x5 + (1|subject) + (1|site),
    data = longitudinal_example,
    engine = "lme4",
    limit = 5
  )

  # Random effects preserved, only fixed effects pruned
  attr(pruned, "selected_vars")
}

## End(Not run)

Model-Based Predictor Pruning

Description

modelPrune() performs iterative removal of fixed-effect predictors based on model diagnostics (e.g., VIF) until all remaining predictors satisfy a specified threshold. It supports linear models, generalized linear models, and mixed models.

Usage

modelPrune(
  formula,
  data,
  engine = "lm",
  criterion = "vif",
  limit = 5,
  force_in = NULL,
  max_steps = NULL,
  ...
)

Arguments

formula

A model formula specifying the response and predictors. May include random effects for mixed models (e.g., y ~ x1 + x2 + (1|group)).

data

A data.frame containing the variables in the formula.

engine

Either a character string for built-in engines, or a list defining a custom engine.

Built-in engines (character string):

  • "lm" (default): Linear models via stats::lm()

  • "glm": Generalized linear models via stats::glm() (requires family argument)

  • "lme4": Mixed models via lme4::lmer() or lme4::glmer() (requires lme4 package)

  • "glmmTMB": Generalized linear mixed models via glmmTMB::glmmTMB() (requires glmmTMB package)

Custom engine (named list with required components):

  • fit: function(formula, data, ...) that returns a fitted model object

  • diagnostics: function(model, fixed_effects) that returns a named numeric vector of diagnostic scores (one per fixed effect, higher values = worse)

  • name (optional): character string used in error messages (default: "custom")

criterion

Character string specifying the diagnostic criterion for pruning. For built-in engines, only "vif" (Variance Inflation Factor) is supported. For custom engines, this parameter is ignored (diagnostics are computed by the engine's diagnostics function). Default: "vif".

limit

Numeric scalar. Maximum allowed value for the criterion. Predictors with diagnostic values exceeding this limit are iteratively removed. Default: 5 (common VIF threshold).

force_in

Character vector of predictor names that must be retained in the final model. These variables will not be removed during pruning. Default: NULL.

max_steps

Integer. Maximum number of pruning iterations. If NULL (default), pruning continues until all diagnostics are below the limit or no more removable predictors remain.

...

Additional arguments passed to the modeling function (e.g., family for glm/glmer, control parameters for lme4/glmmTMB).

Details

modelPrune() works by:

  1. Parsing the formula to identify fixed-effect predictors

  2. Fitting the initial model

  3. Computing diagnostics for each fixed-effect predictor

  4. Checking feasibility of force_in constraints

  5. Iteratively removing the predictor with the worst diagnostic value (excluding force_in variables) until all diagnostics <= limit

  6. Returning the pruned data frame

Random Effects: For mixed models (lme4, glmmTMB), only fixed-effect predictors are considered for pruning. Random-effect structure is preserved exactly as specified in the original formula.

VIF Computation: Variance Inflation Factors are computed from the fixed-effects design matrix. For categorical predictors, VIF represents the inflation for the entire factor (not individual dummy variables).

Determinism: The algorithm is deterministic. Ties in diagnostic values are broken by removing the predictor that appears last in the formula.

Force-in Constraints: If variables in force_in violate the diagnostic threshold, the function will error. This ensures that the constraint is feasible before pruning begins.
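
The pruning loop outlined above can be sketched in base R for the simplest case: a linear model with numeric predictors only. Here VIF for predictor j is computed as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. vif_values() and vif_prune() are hypothetical helpers; the sketch omits the up-front force_in feasibility check and does not handle factors or random effects.

```r
# Hypothetical helper: VIF of each numeric predictor via auxiliary regressions
vif_values <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

# Hypothetical helper: drop the worst predictor until all VIFs <= limit
vif_prune <- function(data, response, limit = 5, force_in = character()) {
  predictors <- setdiff(names(data), response)
  removed <- character()
  while (length(predictors) > 1) {
    v <- vif_values(data, predictors)
    if (all(v <= limit)) break
    removable <- v[!names(v) %in% force_in]
    if (length(removable) == 0) break
    # Ties: remove the predictor that appears last
    worst <- names(removable)[max(which(removable == max(removable)))]
    predictors <- setdiff(predictors, worst)
    removed <- c(removed, worst)
  }
  list(selected_vars = predictors, removed_vars = removed)
}

res <- vif_prune(mtcars, response = "mpg", limit = 5)
res$selected_vars
res$removed_vars
```

Refitting after every removal matters because dropping one collinear predictor usually lowers the VIFs of the others.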

Value

A data.frame containing only the retained predictors (and response). The result has the following attributes:

selected_vars

Character vector of retained predictor names

removed_vars

Character vector of removed predictor names (in order of removal)

engine

Character string indicating which engine was used (for custom engines, this is the engine's name field)

criterion

Character string indicating which criterion was used

limit

The threshold value used

final_model

The final fitted model object (optional)

See Also

corrPrune for association-based predictor pruning, corrSelect for exhaustive subset enumeration.

Examples

# Linear model with VIF-based pruning
data(mtcars)
pruned <- modelPrune(mpg ~ ., data = mtcars, engine = "lm", limit = 5)
names(pruned)

# Force certain predictors to remain
pruned <- modelPrune(mpg ~ ., data = mtcars, force_in = "drat", limit = 20)

# GLM example (requires family argument)
pruned <- modelPrune(am ~ ., data = mtcars, engine = "glm",
                     family = binomial(), limit = 5)

## Not run: 
# Custom engine example (INLA)
inla_engine <- list(
  name = "inla",
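  fit = function(formula, data, ...) {
    # Fall back to a Gaussian likelihood when no family is supplied
    family <- list(...)$family
    if (is.null(family)) family <- "gaussian"
    INLA::inla(formula = formula, data = data,
               family = family,
               control.compute = list(config = TRUE))
  },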
  diagnostics = function(model, fixed_effects) {
    scores <- model$summary.fixed[, "sd"]
    names(scores) <- rownames(model$summary.fixed)
    scores[fixed_effects]
  }
)

pruned <- modelPrune(y ~ x1 + x2 + x3, data = df,
                     engine = inla_engine, limit = 0.5)

## End(Not run)


Example Survey Data for Social Science Research

Description

A simulated questionnaire dataset with 30 Likert-scale items measuring three latent constructs (satisfaction, engagement, loyalty), plus demographic variables and an overall satisfaction score.

Usage

survey_example

Format

A data frame with 200 rows and 35 variables:

respondent_id

Integer. Unique respondent identifier

age

Integer. Respondent age (18-75 years)

gender

Factor. Gender (Male, Female, Other)

education

Ordered factor. Education level (High School, Bachelor, Master, PhD)

overall_satisfaction

Integer. Overall satisfaction score (0-100)

satisfaction_1, satisfaction_2, satisfaction_3, satisfaction_4, satisfaction_5, satisfaction_6, satisfaction_7, satisfaction_8, satisfaction_9, satisfaction_10

Ordered factor. Satisfaction items (1-7 Likert scale)

engagement_1, engagement_2, engagement_3, engagement_4, engagement_5, engagement_6, engagement_7, engagement_8, engagement_9, engagement_10

Ordered factor. Engagement items (1-7 Likert scale)

loyalty_1, loyalty_2, loyalty_3, loyalty_4, loyalty_5, loyalty_6, loyalty_7, loyalty_8, loyalty_9, loyalty_10

Ordered factor. Loyalty items (1-7 Likert scale)

Details

This dataset represents a common scenario in survey research: multiple items measuring similar constructs lead to redundancy and multicollinearity. Items within each construct are correlated (satisfaction, engagement, loyalty), and the constructs themselves are inter-correlated.

Use case: Demonstrating assocSelect() for identifying redundant questionnaire items in mixed-type data (ordered factors + numeric variables).

Source

Simulated data based on typical customer satisfaction survey structures

See Also

assocSelect(), corrPrune()

Examples

data(survey_example)

# This dataset has mixed types: numeric (age, overall_satisfaction),
# an unordered factor (gender), and ordered factors (education, Likert items)
str(survey_example[, 1:10])


# Use assocSelect() for mixed-type data pruning
# This may take a few seconds with 34 variables
res <- assocSelect(survey_example[, -1],  # Exclude respondent_id
                   threshold = 0.8,
                   method_ord_ord = "spearman")
length(res@subset_list)  # Number of valid subsets found