kcmeans
is an implementation of the K-Conditional-Means
(KCMeans) regression estimator analyzed by Wiemann (2023; arxiv:2311.17021) for
conditional expectation function estimation using categorical features.
The implementation leverages the unconditional KMeans implementation in
one dimension using dynamic programming of the Ckmeans.1d.dp
package.
See the working paper Optimal Categorical Instrumental Variables for further discussion of the KCMeans estimator.
Install the latest development version from GitHub (requires devtools package):
if (!require("devtools")) {
install.packages("devtools")
}::install_github("thomaswiemann/kcmeans", dependencies = TRUE) devtools
To illustrate kcmeans
, consider simulating a small
dataset with a continuous outcome variable y
, two observed
predictors – a categorical variable Z
and a continuous
variable X
– and an (unobserved) Gaussian error. As in
Wiemann (2023), the reduced form has an unobserved lower-dimensional
representation dependent on the latent categorical variable
Z0
.
# Load package
library(kcmeans)
# Set seed
set.seed(51944)
# Sample parameters
= 800 # sample size
nobs # Sample data
<- rnorm(nobs)
X <- sample(1:20, nobs, replace = T)
Z <- Z %% 4 # lower-dimensional latent categorical variable
Z0 <- Z0 + X + rnorm(nobs) y
kcmeans
is then computed by combining the categorical
feature with the continuous feature. By default, the categorical feature
is the first column. Alternatively, the column corresponding to the
categorical feature can be set via the which_is_cat
argument. Computation is very quick – indeed the dynamic
programming algorithm of the leveraged Ckmeans.1d.dp
package is polynomial in the number of values taken by the categorical
feature Z
. See also ?kcmeans
for details.
system.time({
<- kcmeans(y = y, X = cbind(Z, X), K = 4)
kcmeans_fit
})#> user system elapsed
#> 1.09 0.04 1.14
We may now use the predict.kcmeans
method to construct
fitted values and/or compute predictions of the lower-dimensional latent
categorical feature Z0
. See also
?predict.kcmeans
for details.
# Predicted values for the outcome + R^2
<- predict(kcmeans_fit, cbind(Z, X))
y_hat round(1 - mean((y - y_hat)^2) / mean((y - mean(y))^2), 3)
#> [1] 0.695
# Predicted values for the latent categorical feature + missclassification rate
<- predict(kcmeans_fit, cbind(Z, X), clusters = T) - 1
Z0_hat mean((Z0 - Z0_hat)!=0)
#> [1] 0
Finally, it is also straightforward to compute standard errors for
the final coefficients, e.g., using summary.lm
:
# Compute the linear regression object and call summary.lm
<- lm(y ~ as.factor(Z0_hat) + X)
lm_fit summary(lm_fit)
#>
#> Call:
#> lm(formula = y ~ as.factor(Z0_hat) + X)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.1205 -0.6916 0.0544 0.6700 3.4201
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.03897 0.07434 0.524 0.6
#> as.factor(Z0_hat)1 0.88393 0.10265 8.611 <2e-16 ***
#> as.factor(Z0_hat)2 1.88314 0.10271 18.334 <2e-16 ***
#> as.factor(Z0_hat)3 3.01094 0.10636 28.310 <2e-16 ***
#> X 1.04636 0.03541 29.549 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.03 on 795 degrees of freedom
#> Multiple R-squared: 0.6954, Adjusted R-squared: 0.6939
#> F-statistic: 453.7 on 4 and 795 DF, p-value: < 2.2e-16
Since the cardinality of the support of the underlying
low-dimensional latent categorical variable is often unknown, it is
useful to consider multiple KCMeans estimators with varying values for
K. The below code snippet uses the ddml
package to
compute the cross-validation mean-square prediction error (MSPE) of
three KCMeans estimators (see also ?ddml::crossval
for
details).
In addition, the KCMeans MSPEs are compared to the MSPE of three alternative conditional expectation function estimators:
?ddml::ols
)?ddml::mdl_glmnet
)# load the ddml package
library(ddml)
# one-hot encoding for ols, lasso, and ridge
<- model.matrix(~ as.factor(Z))
Z_indicators
# Combine features and create indices
<- cbind(Z, X, Z_indicators)
X_all <- 1:2
indx_factor <- 2:(2 + ncol(Z_indicators))
indx_indicators
# Create the learners, assign indicators to ols, lasso, and ridge
<- list(list(fun = kcmeans,
learner_list args = list(K = 2),
assign_X = indx_factor),
list(fun = kcmeans,
args = list(K = 4),
assign_X = indx_factor),
list(fun = kcmeans,
args = list(K = 6),
assign_X = indx_factor),
list(fun = ols,
assign_X = indx_indicators),
list(fun = mdl_glmnet,
assign_X = indx_indicators),
list(fun = mdl_glmnet,
args = list(alpha = 0),
assign_X = indx_indicators))
# Compute the cross-valdiation MSPE
<- crossval(y = y, X = X_all,
cv_res learners = learner_list,
cv_folds = 20, silent = T)
The results show that KCMeans with K=4 and K=6 achieve the smallest MSPE among the considered estimators.
# Print the results
names(cv_res$mspe) <- c("KCMeans (K=2)", "KCMeans (K=4)", "KCMeans (K=6)",
"OLS", "Lasso", "Ridge")
round(cv_res$mspe, 4)
#> KCMeans (K=2) KCMeans (K=4) KCMeans (K=6) OLS Lasso
#> 1.3170 1.0650 1.0655 1.0803 1.0797
#> Ridge
#> 1.0890
# Which learner is the best?
names(which.min(cv_res$mspe))
#> [1] "KCMeans (K=4)"
Wiemann T (2023). “Optimal Categorical Instruments.” https://arxiv.org/abs/2311.17021