Title: | Scores of Nominal Outlyingness (SONO) |
Version: | 1.2 |
Description: | Computes scores of outlyingness for data sets consisting of nominal variables and includes various evaluation metrics for assessing performance of outlier identification algorithms producing scores of outlyingness. The scores of nominal outlyingness are computed based on the framework of Costa and Papatsouma (2025) <doi:10.48550/arXiv.2408.07463>. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | data.table (≥ 1.14.0), DescTools (≥ 0.99.0), ggplot2 (≥ 3.3.5), Rdpack (≥ 2.0), rje (≥ 0.9) |
RdMacros: | Rdpack |
Suggests: | knitr, rmarkdown |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-04-07 20:12:21 UTC; ec1917 |
Author: | Efthymios Costa [aut, cre] |
Maintainer: | Efthymios Costa <efthymios.costa17@imperial.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-04-07 20:30:02 UTC |
Estimate MAXLEN
Description
Function estimating the value of MAXLEN (stopping criterion) prior to running the SONO algorithm. The estimation is done using the ideas described in Costa and Papatsouma (2025), using simultaneous confidence intervals for Multinomial proportions, as done by Sison and Glaz (1995).
Usage
MAXLEN_est(data, probs, alpha = 0.01, frequent = FALSE)
Arguments
data |
Dataset; needs to be of class data.frame and consist of factor variables only. |
probs |
List of probability vectors for each variable. Each element of the list must include as many probabilities as the number of levels associated with it in the dataset. |
alpha |
Significance level for the simultaneous Multinomial confidence intervals constructed, determining what the frequency thresholds should be for itemsets of different length, used for outlier detection for discrete features. Must be a positive real, at most equal to 0.50. A greater value leads to a much more conservative algorithm. Default value is 0.01. |
frequent |
Logical determining whether highly frequent or highly infrequent itemsets are considered as outliers. Defaults to FALSE, treating highly infrequent itemsets as outlying. |
Value
Estimated MAXLEN value.
References
Costa E, Papatsouma I (2025). “A novel framework for quantifying nominal outlyingness.” doi:10.48550/arXiv.2408.07463, arXiv:2408.07463, http://arxiv.org/abs/2408.07463.
Sison CP, Glaz J (1995). “Simultaneous Confidence Intervals and Sample Size Determination for Multinomial Proportions.” Journal of the American Statistical Association, 90(429), 366–369. ISSN 0162-1459, doi:10.2307/2291162.
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
MAXLEN_est(data = dt, probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)), alpha = 0.01, frequent = FALSE)
Average Rank Of Outlier
Description
Function computing the average rank of the outliers given a vector with scores of outlyingness.
Usage
avg_rank_outs(scores, outs, ties = "min")
Arguments
scores |
Scores of (nominal) outlyingness. A higher score here implies an observation is more likely to be an outlier. |
outs |
Vector of outlier indices. |
ties |
A character string specifying how ties in scores are treated; can be one of "average", "first", "last", "random", "max" or "min". |
Value
Average rank of outliers.
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
sono_out <- sono(data = dt, probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
alpha = 0.01, r = 2, MAXLEN = 0, frequent = FALSE)
# Suppose observations 1 up to 5 are outliers
avg_rank_outs(scores = sono_out[[2]][, 2], outs = c(1:5), ties = "min")
Recall@K
Description
Function computing the recall based on the top K% scores of outlyingness.
Usage
recall_at_k(scores, outs, grid)
Arguments
scores |
Scores of (nominal) outlyingness. A higher score here implies an observation is more likely to be an outlier. |
outs |
Vector of outlier indices. |
grid |
Grid of K values over which the recall is computed. Must be between 0 and 1. |
Value
Recall values at the points of the provided grid.
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
sono_out <- sono(data = dt,
probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
alpha = 0.01,
r = 2,
MAXLEN = 0,
frequent = FALSE)
# Suppose observations 1 up to 5 are outliers
recall_at_k(scores = sono_out[[2]][, 2],
outs = c(1:5),
grid = c(1, 2.5, seq(5, 50, by = 5))/100)
ROC AUC function
Description
Function computing the ROC AUC given a vector with scores of outlyingness. The computation for this is based on Hanley and McNeil (1982).
Usage
roc_auc(scores, outs, grid)
Arguments
scores |
Scores of (nominal) outlyingness. A higher score here implies an observation is more likely to be an outlier. |
outs |
Vector of outlier indices. |
grid |
Grid of Top K values over which the ROC AUC is computed. Must be between 0 and 1. |
Value
ROC AUC at the points of the provided grid.
References
Hanley JA, McNeil BJ (1982). “The meaning and use of the area under a receiver operating characteristic (ROC) curve.” Radiology, 143(1), 29–36. ISSN 0033-8419, doi:10.1148/radiology.143.1.7063747.
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
sono_out <- sono(data = dt,
probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
alpha = 0.01,
r = 2,
MAXLEN = 0,
frequent = FALSE)
# Suppose observations 1 up to 5 are outliers
roc_auc(scores = sono_out[[2]][, 2], outs = c(1:5),
grid = c(1, 2.5, seq(5, 50, by = 5))/100)
SONO (Scores Of Nominal Outlyingness)
Description
Function used to compute scores of nominal outlyingness for datasets consisting of nominal features. The
computation is done using the score of Costa and Papatsouma (2025), defined as follows for an observation \boldsymbol{x}_i
:
s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin (\sigma_d, n], \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\sigma_d}{\text{supp}(d) \times \lvert d \rvert^r}, \\
r> 0, \ i=1,\dots,n,
for highly infrequent itemsets and:
s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin [0, \sigma_d), \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\text{supp}(d)}{\sigma_d \times \left( \text{MAXLEN} - \lvert d \rvert + 1 \right)^r}, \\
r> 0, \ i=1,\dots,n,
for highly frequent itemsets.
In the above, \text{supp}(d)
is the support of itemset d
, \sigma_d
is the the maximum/minimum support threshold and \text{MAXLEN}
is the maximum length of sequences considered, while r
is an exponent term to be determined by the user.
Usage
sono(
data,
probs,
alpha = 0.01,
r = 2,
MAXLEN = 0,
frequent = FALSE,
verbose = TRUE
)
Arguments
data |
Dataset; needs to be of class data.frame and consist of factor variables only. |
probs |
List of probability vectors for each variable. Each element of the list must include as many probabilities as the number of levels associated with it in the dataset. |
alpha |
Significance level for the simultaneous Multinomial confidence intervals constructed, determining what the frequency thresholds should be for itemsets of different length, used for outlier detection for discrete features. Must be a positive real, at most equal to 0.50. A greater value leads to a much more conservative algorithm. Default value is 0.01. |
r |
Exponent term in the computation of scores. Must be a non-negative number. The greater its value, the less contribution itemsets of greater length will have in the overall score. It is suggested that this is not much larger than 3. Default value is 2. |
MAXLEN |
Maximum itemset sequence length to be considered. Default value is 0 which calculates MAXLEN according to a criterion on the sparsity caused by the total combinations that can be encountered as sequences of greater length are taken into account. Otherwise, MAXLEN can take any value from 1 up to the total number of discrete variables included in the data set. If user-given MAXLEN is larger than the estimated value, MAXLEN will default to the latter and a warning message will be displayed, so that redunand computations are avoided. |
frequent |
Logical determining whether highly frequent or highly infrequent itemsets are considered as outliers. Defaults to FALSE, treating highly infrequent itemsets are outlying. |
verbose |
Defaults to TRUE to print progress messages. Change to FALSE to suppress. |
Value
A list with 4 elements. The first element is the value of MAXLEN. The second element corresponds to a data frame with 2 columns; one for the observation numbers and one with the final score of outlyingness. The third and fourth elements are the matrix of variable contributions and the nominal outlyingness depths vector, respectively.
References
Costa E, Papatsouma I (2025). “A novel framework for quantifying nominal outlyingness.” doi:10.48550/arXiv.2408.07463, arXiv:2408.07463, http://arxiv.org/abs/2408.07463.
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
sono(data = dt,
probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
alpha = 0.01,
r = 2,
MAXLEN = 0,
frequent = FALSE)
Visualise Contribution Matrix
Description
Function producing a visualisation of the matrix of variable contributions. The user can choose to plot just a subset of the data (for instance the outliers), as well as scale the scores if needed.
Arguments
contribs_mat |
Matrix of variable contributions. Must be of class |
subset |
Subset of observations for which the variable contribution matrix will be plotted. |
scale |
Optional scaling parameter; defaults to "none" for no scaling. Possible options are "row", with each row being divided by its sum or "max", where each element of the matrix is divided by the maximum element. |
Value
Plot
Examples
dt <- as.data.frame(sample(c(1:2), 100, replace = TRUE, prob = c(0.5, 0.5)))
dt <- cbind(dt, sample(c(1:3), 100, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
dt[, 1] <- as.factor(dt[, 1])
dt[, 2] <- as.factor(dt[, 2])
colnames(dt) <- c('V1', 'V2')
sono_out <- sono(data = dt,
probs = list(c(0.5, 0.5), c(1/3, 1/3, 1/3)),
alpha = 0.01,
r = 2,
MAXLEN = 0,
frequent = FALSE)
vis_contribs(contribs_mat = sono_out[[3]], subset = c(1:50), scale = "row")