Title: | Fréchet Distance-Based K-Means and Extensions for Longitudinal Data |
Version: | 0.1.1 |
Description: | Implements shape-based clustering algorithms for multidimensional longitudinal data based on the Fréchet distance. It implements two main methods: MFKmL (Multidimensional Fréchet distance-based K-means for Longitudinal data), an extension of the K-means algorithm using the Fréchet distance originally developed in the 'kmlShape' package, adapted for multidimensional trajectories; and SFKmL (Sparse multidimensional Fréchet distance-based K-medoids for Longitudinal data), a K-medoids-based clustering algorithm that incorporates variable selection. These tools are designed to enhance clustering performance in high-dimensional longitudinal data settings, particularly those with time delays, variations in trajectory speed, irregular sampling intervals, and noise. This package implements methods derived from Kang et al. (2023) <doi:10.1007/s11222-023-10237-z>. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | dplyr, ggplot2, proxy, abind |
Depends: | R (≥ 4.3) |
NeedsCompilation: | no |
Packaged: | 2025-06-16 07:13:23 UTC; user |
Author: | Ji Hyun Park [aut, cre], Soon-Sun Kwon [aut], Ilsuk Kang [aut, ctb] |
Maintainer: | Ji Hyun Park <jhn1105@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-07-02 15:00:03 UTC |
Sparse Fréchet Distance-Based K-medoids for Longitudinal Data
Description
Performs clustering on longitudinal trajectories using a sparse feature weighting
scheme and the Fréchet distance. The method iteratively updates cluster assignments
and feature weights subject to an \ell_1 norm constraint.
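One common way to enforce an \ell_1 bound on nonnegative feature weights is the soft-thresholding update used in sparse K-means (Witten and Tibshirani, 2010). The sketch below assumes that form. It is not taken from this package's source, and `soft`, `l2n`, and `update_weights` are hypothetical helper names; the BCSS values are made up for illustration.

```r
# Soft-thresholding operator: shrink toward zero by d, truncating at zero.
soft <- function(x, d) sign(x) * pmax(abs(x) - d, 0)
l2n <- function(x) sqrt(sum(x^2))

# Assumed update: maximize sum(w * bcss) subject to ||w||_2 = 1, ||w||_1 <= l1bound.
update_weights <- function(bcss, l1bound, iters = 40) {
  stopifnot(l1bound >= 1, any(bcss > 0))
  a <- pmax(bcss, 0)
  w <- a / l2n(a)
  if (sum(w) <= l1bound) return(w)   # bound is not active
  # Fallback that always satisfies the bound: all weight on the largest BCSS.
  best <- as.numeric(seq_along(a) == which.max(a))
  lo <- 0
  hi <- max(a)
  # Binary search for the smallest threshold that meets the l1 bound.
  for (i in seq_len(iters)) {
    mid <- (lo + hi) / 2
    w <- soft(a, mid)
    w <- w / l2n(w)
    if (sum(w) > l1bound) {
      lo <- mid
    } else {
      hi <- mid
      best <- w
    }
  }
  best
}

bcss <- c(10, 5, 1, 0.5)   # made-up per-variable BCSS values
w <- update_weights(bcss, l1bound = 1.2)
print(round(w, 3))
```

A smaller `l1bound` drives more weights to exactly zero, which is how the constraint performs variable selection in this kind of scheme.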
Usage
SFclust(k, l1bound, dist.ary, maxIter = 20, eps = 1e-04)
Arguments
k |
The number of clusters. |
l1bound |
A bound on the \ell_1 norm of the feature weights. |
dist.ary |
A 3-dimensional array of pairwise Fréchet distances. The array should be of shape (n, n, p), where n is the number of trajectories and p is the number of variables. |
maxIter |
The maximum number of iterations before stopping if convergence is not reached. Default is 20. |
eps |
A small positive threshold for convergence. The algorithm stops when the change in weights becomes smaller than this threshold. Default is 1e-4. |
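To make the expected layout of dist.ary concrete, the sketch below builds a toy (n, n, p) array and collapses it into a single weighted distance matrix. Euclidean distances on random points stand in for Fréchet distances here, and the weighted sum is an assumption about how per-variable distances could be combined, not a description of the package internals.

```r
set.seed(1)
n <- 5   # number of trajectories
p <- 3   # number of variables

# One n x n distance matrix per variable, stacked along the third dimension.
dist.ary <- array(0, dim = c(n, n, p))
for (j in seq_len(p)) {
  x <- matrix(rnorm(n * 2), nrow = n)   # stand-in data, not real trajectories
  dist.ary[, , j] <- as.matrix(dist(x))
}

# Hypothetical feature weights; a zero weight removes a variable entirely.
w <- c(0.8, 0.6, 0)
combined <- apply(dist.ary, c(1, 2), function(d) sum(w * d))

dim(dist.ary)
```

Each slice `dist.ary[, , j]` should be symmetric with a zero diagonal, as any pairwise distance matrix is.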
Details
The function assumes that the input dist.ary
contains pairwise distances between trajectories for each variable,
computed with the generalized Fréchet distance. Clustering is performed via a k-medoids algorithm,
and feature weights are updated using the between-cluster sum of squares (BCSS) with sparsity control.
If there is only one variable, only clustering is performed and no variable weighting is applied.
This function involves random sampling internally. For reproducible results, set the random seed with set.seed() before calling the function.
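The k-medoids step and the role of set.seed() can be illustrated with a minimal base R sketch that works directly on a distance matrix. This is a generic alternating k-medoids, assuming random initial medoids; `kmedoids_dist` is a hypothetical name and the package's own implementation may differ.

```r
kmedoids_dist <- function(D, k, max_iter = 20) {
  n <- nrow(D)
  medoids <- sample(n, k)   # random start: the reason a seed matters
  for (it in seq_len(max_iter)) {
    # Assign every observation to its nearest medoid.
    clust <- apply(D[, medoids, drop = FALSE], 1, which.min)
    new_medoids <- medoids
    for (g in seq_len(k)) {
      members <- which(clust == g)
      if (length(members) > 0) {
        # New medoid: the member with the smallest total distance to the others.
        within <- colSums(D[members, members, drop = FALSE])
        new_medoids[g] <- members[which.min(within)]
      }
    }
    if (identical(new_medoids, medoids)) break
    medoids <- new_medoids
  }
  clust <- apply(D[, medoids, drop = FALSE], 1, which.min)
  list(clust = unname(clust), medoids = medoids)
}

x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)   # two well-separated toy groups
D <- as.matrix(dist(x))

set.seed(42)
res1 <- kmedoids_dist(D, k = 2)
set.seed(42)
res2 <- kmedoids_dist(D, k = 2)
identical(res1, res2)   # same seed, same result
```

Without the set.seed() calls, two runs can start from different medoids and may label or even partition the data differently.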
Value
A list containing the following components:
- clust
A vector of cluster assignments for each trajectory.
- final.weight
The final weight vector after the last iteration, reflecting the contribution of each variable to the clustering process.
- weight.history
A matrix of weight values at each iteration, showing how the feature weights evolved.
- criteria
A vector of convergence criteria values for each iteration, quantifying the change in weights.
- iteration
The number of iterations performed before convergence or before reaching maxIter.
Perform Permutation-Based Clustering Evaluation for SFclust
Description
Performs a permutation-based analysis to evaluate clustering results across different
values of the \ell_1 norm constraint (s). This function is designed to help determine the
most appropriate \ell_1 norm value by comparing the observed clustering outcome with those
obtained under random permutations.
The function computes gap statistics for each \ell_1 norm constraint value based on permuted
versions of the input distance array, and identifies the optimal s as the one
maximizing the gap statistic. Two ggplot objects are returned to visualize the gap patterns.
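The selection rule can be sketched as follows. The log-scale gap used here follows the sparse K-means convention and is an assumption, since the exact formula used by SFclust.permute is not stated above; `gap_select` is a hypothetical helper and all input numbers are made up.

```r
# obs:  observed objective, one value per candidate l1 bound
# perm: objective under permutation, one row per permutation, one column per bound
gap_select <- function(obs, perm, l1b) {
  stopifnot(length(obs) == length(l1b), ncol(perm) == length(l1b))
  log_perm <- log(perm)
  gaps <- log(obs) - colMeans(log_perm)   # observed minus permuted average
  sdgaps <- apply(log_perm, 2, sd)        # spread across permutations
  list(gaps = gaps, sdgaps = sdgaps, bestl1b = l1b[which.max(gaps)])
}

l1b  <- c(1.1, 1.5, 2.0)
obs  <- c(2, 6, 7)                         # toy values, larger is better
perm <- rbind(c(1.5, 2, 4), c(2.5, 2, 5))  # two toy permutations

res <- gap_select(obs, perm, l1b)
res$bestl1b
```

In this toy input the middle bound wins because the observed value stands out most from its permuted counterparts there.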
Usage
SFclust.permute(dist.ary, k, nperms, l1b)
Arguments
dist.ary |
A 3-dimensional distance array representing pairwise distances
between trajectories across multiple variables. Follows the same format used in SFclust. |
k |
An integer specifying the number of clusters. |
nperms |
An integer specifying the number of permutations to perform. |
l1b |
A numeric vector of \ell_1 norm constraint values to evaluate. |
Details
This function helps assess the robustness of clustering structure and select an optimal level of sparsity.
If any clustering attempt fails (e.g., due to convergence issues or weight update errors), the corresponding
l1b values are reported in failed_l1b and their indices in failed_j.
This function returns two ggplot objects (gapplot.l1b and gapplot.nnz) that can be used to visualize the
gap statistics. These are not automatically printed, allowing users to decide when and how to display them.
This function involves random sampling internally. For reproducible results, set the random seed with set.seed() before calling the function.
Value
A list containing the following components:
- totss
A numeric vector of total within-cluster sum of squared distances for each \ell_1 norm value.
- permtotss
A matrix of total sum of squared distances for each permutation and each \ell_1 norm value.
- nnonzerowss
A numeric vector of the number of nonzero weights for each \ell_1 norm value.
- gaps
A numeric vector of gap statistics: the difference between observed and permuted clustering results.
- sdgaps
A numeric vector of standard deviations of the gaps across permutations.
- l1bounds
A vector of \ell_1 norm constraint values that were successfully processed without error.
- bestl1b
The \ell_1 norm constraint value that yielded the largest gap.
- failed_j
Indices of l1b values that caused errors during the clustering process.
- failed_l1b
The actual \ell_1 norm values that caused errors.
- gapplot.l1b
A ggplot object showing the gap statistics plotted against \ell_1 norm constraint values.
- gapplot.nnz
A ggplot object showing the gap statistics plotted against the number of nonzero weights.
Compute Distance Array for Multidimensional Functional Data
Description
This function standardizes multidimensional functional data using provided scaling factors, computes pairwise Fréchet distances between trajectories for each variable, and returns a distance array (3-dimensional array of distance matrices).
Usage
dist.array(dt, time_scale, var_scales)
Arguments
dt |
A long-format data.frame containing the following columns in the specified order:
|
time_scale |
A single numeric value used to scale the Time column. |
var_scales |
A numeric vector of scaling factors for the measured variables.
Its length must be equal to the number of measured variables. |
Details
The dist.array function first applies scaling to Time and to each measured variable.
Then, it computes pairwise Fréchet distances between trajectories for each variable separately.
The output is a 3-dimensional array in which each slice corresponds to a variable-specific distance matrix.
Unlike the mfkml function, which requires at least three measurements across time for each trajectory,
SFKmL (Sparse multidimensional Fréchet distance-based K-medoids for Longitudinal data), which uses dist.array,
allows for trajectories with missing values, as long as each variable has at least three observed time points for each trajectory.
Therefore, dt may include missing values.
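The per-variable construction described above can be sketched in base R. Several things here are assumptions rather than documented behavior: the column order (ID, then Time, then the measured variables), scaling by division, and dropping missing observations per variable. `build_dist_array` is a hypothetical helper, and `mean_gap` is a deliberately simple placeholder that is not a Fréchet distance; any function taking two (time, value) matrices can be plugged in.

```r
build_dist_array <- function(dt, time_scale, var_scales, pair_dist) {
  ids <- unique(dt[[1]])
  vars <- names(dt)[-(1:2)]
  stopifnot(length(var_scales) == length(vars))
  n <- length(ids)
  p <- length(vars)
  out <- array(0, dim = c(n, n, p),
               dimnames = list(as.character(ids), as.character(ids), vars))
  for (v in seq_len(p)) {
    # One rescaled (time, value) matrix per trajectory; NA rows are dropped.
    trajs <- lapply(ids, function(id) {
      rows <- dt[[1]] == id & !is.na(dt[[vars[v]]])
      cbind(dt[[2]][rows] / time_scale, dt[[vars[v]]][rows] / var_scales[v])
    })
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        out[i, j, v] <- out[j, i, v] <- pair_dist(trajs[[i]], trajs[[j]])
      }
    }
  }
  out
}

# Placeholder distance (NOT Fréchet): gap between mean values.
mean_gap <- function(a, b) abs(mean(a[, 2]) - mean(b[, 2]))

dt <- data.frame(
  ID   = rep(c("a", "b", "c"), each = 4),
  Time = rep(1:4, 3),
  X    = c(1, 2, 3, 4,  2, 3, 4, 5,  1, NA, 3, 5),   # one missing value
  Y    = c(rep(0, 4), rep(10, 4), rep(20, 4))
)

A <- build_dist_array(dt, time_scale = 1, var_scales = c(1, 10), pair_dist = mean_gap)
dim(A)
```

Trajectory "c" still has three observed values of X after the missing one is dropped, which matches the minimum stated above.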
Value
A 3-dimensional array of pairwise distances with dimensions [n, n, p], where:
- n
Number of unique trajectories.
- p
Number of variables.
Each slice [, , k] is the distance matrix for variable k.
Compute the Generalized Fréchet Distance Between Two Trajectories
Description
Calculates the discrete Fréchet distance between two trajectories, which is used as the distance metric in clustering algorithms for longitudinal data.
Usage
fredist(traj1, traj2, form)
Arguments
traj1 |
A numeric matrix or data.frame representing the first trajectory. The first column must be time points, and the remaining columns should be one or more variables observed at each time point (e.g., Variable1, Variable2, ...). Each row corresponds to a single time point. |
traj2 |
A numeric matrix or data.frame representing the second trajectory.
The format should be the same as for traj1. |
form |
A character string specifying the return format.
Should be either "scalar" or "matrix". |
Details
This function is primarily used internally by clustering functions to evaluate
the similarity between trajectories based on the Fréchet distance.
It is used in the mfkml function and for generating the distance array
used by the SFclust function.
Value
A numeric value or matrix.
If form = "scalar", returns the Fréchet distance between the two trajectories as a single numeric value.
If form = "matrix", returns the dynamic programming matrix used to compute the distance.
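For readers unfamiliar with the dynamic programming matrix, the sketch below implements the textbook discrete Fréchet distance with Euclidean point distances. It illustrates the general technique only: fredist computes a generalized Fréchet distance whose details may differ, and `discrete_frechet` is a hypothetical name.

```r
discrete_frechet <- function(P, Q, form = c("scalar", "matrix")) {
  form <- match.arg(form)
  P <- as.matrix(P)
  Q <- as.matrix(Q)
  n <- nrow(P)
  m <- nrow(Q)
  # Euclidean distance between every point of P and every point of Q.
  d <- matrix(0, n, m)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) d[i, j] <- sqrt(sum((P[i, ] - Q[j, ])^2))
  }
  # ca[i, j]: smallest achievable maximum distance when coupling
  # the first i points of P with the first j points of Q.
  ca <- matrix(0, n, m)
  ca[1, 1] <- d[1, 1]
  if (n > 1) for (i in 2:n) ca[i, 1] <- max(ca[i - 1, 1], d[i, 1])
  if (m > 1) for (j in 2:m) ca[1, j] <- max(ca[1, j - 1], d[1, j])
  if (n > 1 && m > 1) {
    for (i in 2:n) {
      for (j in 2:m) {
        ca[i, j] <- max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
      }
    }
  }
  if (form == "scalar") ca[n, m] else ca
}

P <- cbind(Time = 1:3, V = c(0, 0, 0))
Q <- cbind(Time = 1:3, V = c(1, 1, 1))
discrete_frechet(P, Q)                    # two parallel lines one unit apart
discrete_frechet(P, Q, form = "matrix")
```

The scalar result is the bottom-right entry of the matrix, so the "matrix" form is mainly useful for inspecting how the coupling cost accumulates.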
Examples
# Example trajectories with 3 variables
traj1 <- data.frame(
Time = 1:4,
Variable1 = c(1.2, 1.4, 1.6, 1.8),
Variable2 = c(2.3, 2.1, 2.0, 1.9),
Variable3 = c(3.1, 3.3, 3.5, 3.7)
)
traj2 <- data.frame(
Time = 1:3,
Variable1 = c(2.0, 2.2, 2.4),
Variable2 = c(3.0, 2.9, 2.8),
Variable3 = c(1.0, 1.1, 1.2)
)
# Compute Fréchet distance (scalar output)
fredist(traj1, traj2, form = "scalar")
# Compute Fréchet distance matrix
fredist(traj1, traj2, form = "matrix")
Multidimensional Fréchet Distance-Based K-means for Longitudinal Data
Description
Extends kmlShape to multidimensional (p \ge 2) longitudinal data.
It performs scale adjustment and trajectory alignment across all variables prior to clustering
to reduce distortions caused by differences in time grids and amplitude scales.
When variables exhibit substantially different ranges, standardization is required to prevent any single variable
from disproportionately influencing the clustering outcome.
The clustering process follows an iterative K-means framework, where cluster assignments are updated based on Fréchet distances. Cluster centers are computed using the weighted Fréchet mean, which accounts for variable weights assigned to individual trajectories. This allows the mean to be adjusted according to the relative importance of each trajectory in the clustering process.
Usage
mfkml(dt, clt_n, scales, weight, maxIter = 50)
Arguments
dt |
A long-format data.frame containing the following columns in the specified order:
|
clt_n |
An integer specifying the number of clusters.
The number of unique trajectories must be greater than or equal to clt_n. |
scales |
A numeric vector used for scaling the time and variable columns. The length of scales must equal the number of variables plus one (for time). |
weight |
Specifies the weights used for calculating the weighted Fréchet mean. It can take one of the following forms:
|
maxIter |
The maximum number of iterations allowed before stopping if convergence is not reached. The default value is 50. |
Details
The input dataset (dt) must contain only numeric values (except for the ID column)
and must not include any missing values.
Each variable should be measured at least three times per trajectory,
since the method relies on trajectory shapes.
Two observations per trajectory are insufficient to capture shape trends (e.g., increasing, decreasing, or stable).
Because the Fréchet distance is sensitive to measurement units, proper scaling is essential when applying the mfkml function.
The scales vector contains scaling factors for time and each variable,
which are used to rescale the corresponding columns.
This scaling prevents distortion due to differences in the units of time and variables,
allowing for more accurate shape-based comparisons.
This function involves random sampling internally.
For reproducible results, set the random seed with set.seed() before calling the function.
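The input requirements above can be checked before clustering with a small helper. This is a sketch under assumptions: the ID is taken to be the first column, followed by time and the variables, and rescaling is done by division. `prepare_long_data` is a hypothetical name and is not part of the package.

```r
prepare_long_data <- function(dt, scales) {
  value_cols <- dt[-1]   # everything except the assumed ID column
  stopifnot(length(scales) == ncol(value_cols))
  if (anyNA(dt)) stop("dt must not contain missing values")
  if (!all(vapply(value_cols, is.numeric, logical(1)))) {
    stop("all non-ID columns must be numeric")
  }
  if (any(table(dt[[1]]) < 3)) {
    stop("each trajectory needs at least three observations")
  }
  # Rescale time and each variable by its own factor (division is assumed).
  for (j in seq_along(scales)) dt[[j + 1]] <- dt[[j + 1]] / scales[j]
  dt
}

dt <- data.frame(
  ID   = rep(1:2, each = 3),
  Time = rep(c(0, 6, 12), 2),            # toy values, e.g. months
  V1   = c(100, 120, 140, 90, 95, 110)
)

scaled <- prepare_long_data(dt, scales = c(12, 100))
scaled

# A trajectory with only two observations is rejected.
too_short <- tryCatch(prepare_long_data(dt[-1, ], scales = c(12, 100)),
                      error = function(e) conditionMessage(e))
too_short
```

Choosing scales so that time and each variable span comparable ranges keeps any single column from dominating the Fréchet distance.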
Value
A list with the following components:
Cluster
A data.frame containing the ID and Cluster columns, which indicate the final cluster assignment for each trajectory.
Center
A data.frame representing the final cluster centers, with columns for the cluster IDs, time points, and variable values.
Iteration
The number of iterations the algorithm performed before reaching convergence.