This vignette introduces the main functions of the
PEAXAI package.
PEAXAI provides a methodology that integrates Data
Envelopment Analysis (DEA) with Machine Learning (ML) algorithms to
train a classifier that estimates the probability that each
decision-making unit (DMU) is (in)efficient.
Once the model is trained, the package offers several functions for post-hoc analysis, including global variable importance, counterfactual targets, efficiency rankings, and peer identification.
These tools facilitate benchmarking and policy simulations in a transparent and explainable framework.
To illustrate the functionality of the PEAXAI package,
we use a dataset of 917 companies in the food industry sector operating
in Spain. The data were retrieved from the SABI (Sistema de Análisis de
Balances Ibéricos) database for the year 2023 and include only firms
with more than 50 employees.
Each row corresponds to a single firm and includes both financial and operational variables relevant for productivity and efficiency analysis.
The output variable used to measure performance is:

- operating_income: Operating income (in millions of euros).

The input variables are:

- total_assets: Total assets (in millions of euros).
- employees: Number of employees.
- fixed_assets: Tangible fixed assets (in millions of euros).
- personnel_expenses: Personnel-related costs (in millions of euros).

Additionally, the variable autonomous_community indicates the geographical location of each firm within one of Spain’s 17 autonomous communities. This allows the analysis to reflect regional institutional and market heterogeneity.
The dataset exhibits a wide dispersion across firms, including both medium-sized and large enterprises. This heterogeneity poses a realistic challenge for evaluating efficiency and provides a rich testing ground for model explainability.
For illustration, this vignette focuses on firms located in the Comunidad Valenciana (Autonomous Community of Valencia). We filter the dataset to include only these firms and remove the autonomous_community column.
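A minimal sketch of this filtering step, assuming the full dataset is already loaded as data; the exact level label for the region is an assumption about how SABI encodes it:

```r
# Keep only firms located in the Comunidad Valenciana and drop the
# region column, which is constant after filtering
data <- data[data$autonomous_community == "Comunidad Valenciana", ]
data$autonomous_community <- NULL
```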
Quick structure and summary.
str(data)
#> 'data.frame': 97 obs. of 5 variables:
#> $ total_assets : num 259 185 129 104 153 ...
#> $ employees : num 70 261 354 968 1076 ...
#> $ fixed_assets : num 84.15 21.75 5.76 30.55 103.76 ...
#> $ personnel_expenses: num 16.7 19.9 13.7 36.8 31.8 ...
#> $ operating_income : num 461 358 357 294 268 ...
summary(data)
#> total_assets employees fixed_assets personnel_expenses
#> Min. : 1.537 Min. : 50.0 Min. : 0.1421 Min. : 1.037
#> 1st Qu.: 8.989 1st Qu.: 75.0 1st Qu.: 2.6797 1st Qu.: 2.167
#> Median : 24.555 Median : 98.0 Median : 6.2578 Median : 3.059
#> Mean : 41.030 Mean : 201.1 Mean : 15.2799 Mean : 6.757
#> 3rd Qu.: 52.409 3rd Qu.: 240.0 3rd Qu.: 20.0960 3rd Qu.: 8.244
#> Max. :258.825 Max. :1076.0 Max. :140.6890 Max. :36.789
#> operating_income
#> Min. : 2.382
#> 1st Qu.: 12.994
#> Median : 29.138
#> Mean : 62.307
#> 3rd Qu.: 72.688
#> Max. :460.578
Determine the input (x) and output (y) column indices.
Determine the technology assumptions.
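These two steps can be sketched as follows; the returns-to-scale value is an illustrative assumption, not prescribed by the package:

```r
# Input and output column positions in `data` (see str(data) above)
x <- 1:4  # total_assets, employees, fixed_assets, personnel_expenses
y <- 5    # operating_income

# Technology assumption for the DEA frontier; variable returns to
# scale ("vrs") is assumed here for illustration
RTS <- "vrs"
```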
To address class imbalance, we apply SMOTE based on the best-practice frontier (assuming a convex production technology). First, we identify the combinations of DMUs that define the frontier. Then, from these combinations, we generate two types of synthetic DMUs: efficient units (on the frontier) and near-frontier inefficient units (inside the frontier’s envelope).
When the efficient class is the minority, efficient-class SMOTE on the best-practice frontier helps the classifier learn the frontier more accurately. Optionally, if the observed imbalance still falls short of the target ratio, we also generate near-frontier inefficient synthetic units derived from the frontier geometry to sharpen the separation between efficient and inefficient regions.
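The target class balance is controlled by a single parameter; 0.4 matches the imbalance value reported in the tuning results of this vignette:

```r
# Target proportion of efficient observations after oversampling
imbalance_rate <- 0.4
```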
Define the machine learning methods and their tuning parameters. Here we configure a neural network (nnet) with a grid of hyperparameters:
methods <- list(
  "nnet" = list(
    tuneGrid = expand.grid(
      size  = c(1, 5, 10, 20),
      decay = 10^seq(-5, -1, by = 1)
    ),
    preProcess = c("center", "scale"),
    # --- arguments passed through to nnet ---
    entropy = TRUE,
    skip = TRUE,
    maxit = 1000,
    MaxNWts = 100000,
    trace = FALSE,
    weights = NULL
  )
)
Parameters for controlling model training (k-fold cross-validation).
Classification metrics to optimize during training (priority order in case of ties).
Define a hold-out sample to evaluate model performance on unseen data. These samples are excluded from the cross-validation and hyperparameter tuning process. By default, it is NULL.
Set a random seed for reproducibility (ensures consistent results across runs).
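A plausible sketch of these control objects, assuming caret-style resampling control; the metric names mirror the columns of the tuning results, and the concrete values (number of folds, seed) are illustrative assumptions:

```r
library(caret)  # assumed resampling backend

# 5-fold cross-validation, keeping class probabilities so that
# probability-based metrics (ROC_AUC, PR_AUC, cross-entropy) can be computed
trControl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

# Metrics to optimize, in priority order for breaking ties (assumed order)
metric_priority <- c("Balanced_Accuracy", "G_mean", "ROC_AUC")

hold_out <- NULL  # no hold-out sample in this example (the default)
seed <- 1234      # illustrative seed
```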
We now apply the full PEAXAI pipeline:
models <- PEAXAI_fitting(
data = data,
x = x,
y = y,
RTS = RTS,
imbalance_rate = imbalance_rate,
methods = methods,
trControl = trControl,
metric_priority = metric_priority,
hold_out = hold_out,
verbose = TRUE,
seed = seed
)
#> [1] "Computing maximal friends with 1 DMUs (step 1 of 14 )"
#> [1] "Computing maximal friends with 2 DMUs (step 2 of 14 )"
#> [1] "Computing maximal friends with 3 DMUs (step 3 of 14 )"
#> [1] "Computing maximal friends with 4 DMUs (step 4 of 14 )"
#> [1] "Computing maximal friends with 5 DMUs (step 5 of 14 )"
#> [1] "Computing maximal friends with 6 DMUs (step 6 of 14 )"
#> [1] "Computing maximal friends with 7 DMUs (step 7 of 14 )"
#> [1] "Computing maximal friends with 8 DMUs (step 8 of 14 )"
#> [1] "Computing maximal friends with 9 DMUs (step 9 of 14 )"
#> [1] "Computing maximal friends with 10 DMUs (step 10 of 14 )"
#> [1] "Computing maximal friends with 11 DMUs (step 11 of 14 )"
#> [1] "Computing maximal friends with 12 DMUs (step 12 of 14 )"
#> [1] "Computing maximal friends with 13 DMUs (step 13 of 14 )"
#> [1] "Computing maximal friends with 14 DMUs (step 14 of 14 )"
#> $nnet
#> Neural Network
#>
#> 137 samples
#> 5 predictor
#> 2 classes: 'efficient', 'not_efficient'
#>
#> Pre-processing: centered (5), scaled (5)
#> Resampling: None
#> $nnet
#> imbalance size decay Accuracy Kappa Recall Specificity Precision F1
#> 6 0.4 1 1e-05 0.9 0.8 0.91 0.9 0.88 0.89
#> Balanced_Accuracy G_mean ROC_AUC PR_AUC Cross_Entropy Cross_Entropy_Efficient
#> 6 0.91 0.9 0.95 0.4 1.15 2.2
#> AccuracySD KappaSD RecallSD SpecificitySD PrecisionSD F1SD
#> 6 0.04 0.08 0.09 0.1 0.1 0.04
#> Balanced_AccuracySD G_meanSD ROC_AUCSD PR_AUCSD Cross_EntropySD
#> 6 0.04 0.04 0.03 0.01 1
#> Cross_Entropy_EfficientSD diff_imbalance
#> 6 2.91 0.7546
This example uses the SA (Sensitivity Analysis) method from the rminer package, specifically the 1D perturbation scheme with the AAD measure, to compute global relative importance with PEAXAI_global_importance. For alternative XAI methods and options, see the package documentation.
If the method needs a reference distribution or to train a surrogate,
use background to choose that dataset. target
selects the dataset on which explanations are computed.
If target = train, relative importances are
computed on the training distribution and thus reflect what the model
actually learned during fitting.
If target = real, relative importances are
computed on a dataset intended to approximate the real-world
(deployment) distribution, yielding importances that better represent
the underlying problem as observed in practice.
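Before calling the function, we set the XAI method; the string value is an assumption about the package's naming convention, chosen to match the SA method described above:

```r
# Global XAI method: Sensitivity Analysis (rminer), 1D scheme, AAD measure
importance_method <- "SA"
```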
relative_importance <- PEAXAI_global_importance(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
background = "train",
target = "train",
importance_method = importance_method
)
#> total_assets employees fixed_assets personnel_expenses operating_income
#> 1 0.07029074 0.03154095 0.5186401 0.02248952 0.3570387
We evaluate results at three efficiency cutoffs: 0.75, 0.85, and 0.95.
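These cutoffs are collected in a vector that is passed to the post-hoc functions:

```r
# Efficiency probability cutoffs evaluated in this vignette
efficiency_thresholds <- c(0.75, 0.85, 0.95)
```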
We define a global, model-aware direction for improving inputs/outputs.
relative_importance guides the direction using the
model’s global importance.
scope = global applies a single direction
to all DMUs.
baseline = mean centers perturbations
around the training mean.
The argument relative_importance is the direction used for the counterfactual analysis. The directional vector can be custom or provided by the PEAXAI_global_importance function.
For example, suppose the directional vector is custom: if it is not possible (or we do not want) to change a specific input (like employees), we write:
relative_importance_custom <- as.data.frame(
  matrix(c(0.2, 0, 0.2, 0.2, 0.4), nrow = 1)
)
names(relative_importance_custom) <- names(data)[c(x, y)]
#> total_assets employees fixed_assets personnel_expenses operating_income
#> 1 0.2 0 0.2 0.2 0.4
But if we do not know which direction to define, we can use the relative importance from the fitted model:
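A minimal sketch, assuming the importance data frame returned by PEAXAI_global_importance can be used directly as the direction:

```r
# Use the model-based global importance as the directional vector
directional_vector <- relative_importance
```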
Given the thresholds and directional vector, we compute feasible improvement targets for each DMU. Key controls:
n_expand = 0.5: expansion factor for the search
neighborhood.
n_grid = 50: grid resolution for exploring
adjustments.
max_y = 2, min_x = 1: bounds that cap
output expansion and input contraction.
targets <- PEAXAI_targets(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
efficiency_thresholds = efficiency_thresholds,
directional_vector = directional_vector,
n_expand = 0.5,
n_grid = 50,
max_y = 2,
min_x = 1
)
head(targets[["0.85"]][["counterfactual_dataset"]], 10)
#> total_assets employees fixed_assets personnel_expenses operating_income
#> 1 258.53448 70.0000 84.1517800 16.70692 460.5779
#> 2 184.97800 261.0000 21.7550000 19.92400 358.1300
#> 3 128.93344 354.0000 5.7553500 13.70937 356.9468
#> 4 93.15078 943.6587 0.1433018 36.20645 379.2861
#> 5 136.80035 1039.8639 58.6211029 30.88775 394.9861
#> 6 150.82542 733.2535 50.9999216 33.69010 335.3168
#> 7 107.03092 746.8836 19.1893752 23.93526 292.2379
#> 8 237.70296 599.5353 82.6488643 26.75902 352.2977
#> 9 53.38326 441.9395 24.9143425 19.24621 170.3884
#> 10 91.75983 700.0000 23.2029300 21.75058 167.7228
head(targets[["0.85"]][["inefficiencies"]], 10)
#> betas probability
#> 1 0.000000000 0.8500000
#> 2 0.000000000 0.8500000
#> 3 0.000000000 0.8500000
#> 4 3.836734694 0.4454719
#> 5 5.695853768 0.8500000
#> 6 3.427734626 0.8500000
#> 7 4.431775876 0.8500000
#> 8 7.323870156 0.8500000
#> 9 0.009536843 0.8500000
#> 10 0.000000000 0.0000000
We compute efficiency rankings at each threshold using two bases:
Predicted: ranks by the model’s predicted probability of being efficient (rank_basis = "predicted").
Attainable: ranks by the attainable improvements/targets implied by the model (rank_basis = "attainable").
ranking <- PEAXAI_ranking(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
rank_basis = "predicted"
)
ranking2 <- PEAXAI_ranking(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
efficiency_thresholds = efficiency_thresholds,
targets = targets,
rank_basis = "attainable"
)
head(round(ranking, 4), 50)
#> Ranking DMU Probability_predicted
#> 1 1 1 1.0000
#> 2 2 2 1.0000
#> 3 3 3 1.0000
#> 4 4 18 1.0000
#> 5 5 17 1.0000
#> 6 6 46 0.9851
#> 7 7 20 0.9692
#> 8 8 62 0.9672
#> 9 9 9 0.8137
#> 10 10 26 0.7744
#> 11 11 75 0.7154
#> 12 12 44 0.6793
#> 13 13 36 0.5447
#> 14 14 56 0.4485
#> 15 15 85 0.4290
#> 16 16 59 0.3992
#> 17 17 83 0.3484
#> 18 18 93 0.3065
#> 19 19 92 0.2851
#> 20 20 70 0.2690
#> 21 21 71 0.2387
#> 22 22 25 0.2078
#> 23 23 22 0.0965
#> 24 24 31 0.0961
#> 25 25 97 0.0942
#> 26 26 12 0.0862
#> 27 27 88 0.0560
#> 28 28 82 0.0553
#> 29 29 91 0.0436
#> 30 30 48 0.0370
#> 31 31 76 0.0339
#> 32 32 42 0.0279
#> 33 33 68 0.0249
#> 34 34 28 0.0166
#> 35 35 64 0.0140
#> 36 36 51 0.0106
#> 37 37 89 0.0099
#> 38 38 61 0.0084
#> 39 39 86 0.0081
#> 40 40 67 0.0048
#> 41 41 95 0.0044
#> 42 42 55 0.0041
#> 43 43 41 0.0037
#> 44 44 69 0.0015
#> 45 45 35 0.0014
#> 46 46 47 0.0011
#> 47 47 90 0.0006
#> 48 48 53 0.0002
#> 49 49 80 0.0001
#> 50 50 73 0.0000
head(round(ranking2[["0.85"]], 4), 50)
#> Ranking DMU Probability_predicted betas Probability_target
#> 1 1 1 1.0000 0.0000 0.85
#> 2 2 2 1.0000 0.0000 0.85
#> 3 3 3 1.0000 0.0000 0.85
#> 4 4 18 1.0000 0.0000 0.85
#> 5 5 17 1.0000 0.0000 0.85
#> 6 6 46 0.9851 0.0000 0.85
#> 7 7 20 0.9692 0.0000 0.85
#> 8 8 62 0.9672 0.0000 0.85
#> 9 9 9 0.8137 0.0095 0.85
#> 10 10 75 0.7154 0.0578 0.85
#> 11 11 26 0.7744 0.0589 0.85
#> 12 12 22 0.0965 0.0649 0.85
#> 13 13 59 0.3992 0.0710 0.85
#> 14 14 44 0.6793 0.1155 0.85
#> 15 15 48 0.0370 0.1168 0.85
#> 16 16 36 0.5447 0.1827 0.85
#> 17 17 53 0.0002 0.2052 0.85
#> 18 18 73 0.0000 0.2177 0.85
#> 19 19 71 0.2387 0.2327 0.85
#> 20 20 89 0.0099 0.2399 0.85
#> 21 21 66 0.0000 0.2401 0.85
#> 22 22 69 0.0015 0.2688 0.85
#> 23 23 94 0.0000 0.3038 0.85
#> 24 24 78 0.0000 0.3155 0.85
#> 25 25 61 0.0084 0.3402 0.85
#> 26 26 25 0.2078 0.3505 0.85
#> 27 27 77 0.0000 0.3644 0.85
#> 28 28 12 0.0862 0.3660 0.85
#> 29 29 55 0.0041 0.3939 0.85
#> 30 30 72 0.0000 0.3952 0.85
#> 31 31 68 0.0249 0.4415 0.85
#> 32 32 43 0.0000 0.4489 0.85
#> 33 33 65 0.0000 0.4537 0.85
#> 34 34 31 0.0961 0.4672 0.85
#> 35 35 54 0.0000 0.5288 0.85
#> 36 36 60 0.0000 0.5365 0.85
#> 37 37 51 0.0106 0.5551 0.85
#> 38 38 42 0.0279 0.6201 0.85
#> 39 39 63 0.0000 0.6523 0.85
#> 40 40 28 0.0166 0.6838 0.85
#> 41 41 47 0.0011 0.7725 0.85
#> 42 42 45 0.0000 0.7803 0.85
#> 43 43 29 0.0000 0.8214 0.85
#> 44 44 41 0.0037 0.8358 0.85
#> 45 45 79 0.0000 0.9419 0.85
#> 46 46 35 0.0014 0.9751 0.85
#> 47 47 74 0.0000 1.1128 0.85
#> 48 48 37 0.0000 1.2416 0.85
#> 49 49 24 0.0000 1.2835 0.85
#> 50 50 40 0.0000 1.7136 0.85
For each threshold, we identify peer DMUs that serve as reference comparators.
weighted = FALSE treats peers uniformly; set it to TRUE to weight peers by relative_importance.
peers <- PEAXAI_peer(
data = data,
x = x,
y = y,
final_model = models[["best_model_fit"]][["nnet"]],
efficiency_thresholds = efficiency_thresholds,
weighted = FALSE,
relative_importance = relative_importance
)
head(peers, 50)
#> DMU 0.75 0.85 0.95
#> 1 1 1 1 1
#> 2 2 2 2 2
#> 3 3 3 3 3
#> 4 4 9 3 3
#> 5 5 9 3 3
#> 6 6 9 3 3
#> 7 7 9 3 3
#> 8 8 9 3 3
#> 9 9 9 3 3
#> 10 10 9 3 3
#> 11 11 9 3 3
#> 12 12 18 18 18
#> 13 13 9 3 3
#> 14 14 9 18 18
#> 15 15 18 18 18
#> 16 16 9 3 3
#> 17 17 17 17 17
#> 18 18 18 18 18
#> 19 19 9 18 18
#> 20 20 20 20 20
#> 21 21 9 3 3
#> 22 22 18 18 18
#> 23 23 9 18 18
#> 24 24 18 18 18
#> 25 25 26 17 17
#> 26 26 26 17 17
#> 27 27 18 18 18
#> 28 28 20 20 20
#> 29 29 18 18 18
#> 30 30 18 18 18
#> 31 31 18 18 18
#> 32 32 18 18 18
#> 33 33 9 18 18
#> 34 34 18 18 18
#> 35 35 18 18 18
#> 36 36 20 20 20
#> 37 37 26 18 18
#> 38 38 9 18 18
#> 39 39 18 18 18
#> 40 40 18 18 18
#> 41 41 18 18 18
#> 42 42 46 46 46
#> 43 43 46 46 46
#> 44 44 46 46 46
#> 45 45 18 18 18
#> 46 46 46 46 46
#> 47 47 26 46 46
#> 48 48 46 46 46
#> 49 49 18 18 18
#> 50 50 46 46 46