Decision Rule to Detect Troubling Multicollinearity
Given a multiple linear regression model with n observations and k independent variables, near-multicollinearity affects its statistical analysis (at significance level alpha) if there is a variable i, with i = 1,...,k, for which the null hypothesis of the individual significance test is not rejected in the original model but is rejected in the orthogonal model taken as reference.
Details
This function compares the individual inference of the original model with that of an orthonormal model taken as reference.
If the null hypothesis is rejected in the individual significance tests of the model in which there are no linear relationships between the independent variables (the orthonormal model) but is not rejected in the original model, the non-rejection is due to the existing linear relationships between the independent variables (multicollinearity) in the original model.
The reference model is obtained from the original one by a QR decomposition, which eliminates the initial linear relationships.
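The comparison described above can be sketched with base R alone. The following is an illustrative sketch, not the package's implementation (the RVIF values and thresholds returned by multicollinearity() are not reproduced here): two nearly collinear regressors are orthonormalized via QR, and the individual p-values of both models are compared.

```r
## Illustrative sketch with base R: compare individual significance tests
## in the original model and in a QR-orthonormalized reference model.
set.seed(2024)
n  <- 100
x3 <- rnorm(n, 5, 10)
x4 <- x3 + rnorm(n, 5, 0.5)      # nearly collinear with x3
X  <- cbind(1, x3, x4)
y  <- 4 - 9 * x3 - 2 * x4 + rnorm(n, 0, 2)

Q <- qr.Q(qr(X))                 # orthonormal columns spanning the same space
round(crossprod(Q), 10)          # ~ identity: no linear relationships remain

## p-values of the individual significance tests in each model
summary(lm(y ~ X - 1))$coefficients[, "Pr(>|t|)"]  # original model
summary(lm(y ~ Q - 1))$coefficients[, "Pr(>|t|)"]  # orthonormal reference
```

Because both design matrices span the same column space, the two models have identical fitted values and residuals; only the individual inference changes, which is what the decision rule exploits.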
Value
The function returns the RVIF values and the established thresholds, and indicates whether or not the individual significance analysis is affected by multicollinearity at the chosen significance level.
References
Salmerón, R., García, C.B. and García, J. (2025). A Redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics, 65, 337-363, doi: https://doi.org/10.1007/s10614-024-10575-8.
Salmerón, R., García, C.B. and García, J. Overcoming the inconsistencies of the variance inflation factor: a redefined VIF and a test to detect statistical troubling multicollinearity (working paper, https://arxiv.org/pdf/2005.02245).
Author
Román Salmerón Gómez (University of Granada) and Catalina B. García García (University of Granada).
Maintainer: Román Salmerón Gómez (romansg@ugr.es)
Examples
### Example 1
set.seed(2024)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # related to the intercept: non-essential multicollinearity
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential multicollinearity
x5 = rnorm(obs, -1, 3)
x6 = rnorm(obs, 15, 0.5)
y = 4 + 5*x2 - 9*x3 - 2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
x = cbind(cte, x2, x3, x4, x5, x6)
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 2.522626e+03 710.3979874 1.502436e-01 b.1 Yes
#> 2 9.875420e+01 41.8217535 3.659512e-01 b.1 Yes
#> 3 5.555945e-02 4.5808651 7.263924e-07 a.1 No
#> 4 5.528041e-02 0.3113603 1.122944e-02 a.1 No
#> 5 1.234970e-03 0.2585459 6.317865e-06 a.1 No
#> 6 5.039751e-02 3.3626976 7.553190e-04 a.1 No
### Example 2
### Effect of sample size
obs = 25 # decreasing the number of observations also makes x4 affected
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # related to the intercept: non-essential multicollinearity
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential multicollinearity
x5 = rnorm(obs, -1, 3)
x6 = rnorm(obs, 15, 0.5)
y = 4 + 5*x2 - 9*x3 - 2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
x = cbind(cte, x2, x3, x4, x5, x6)
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 1.286600e+04 1.053297e+04 9.384591e-01 b.1 Yes
#> 2 5.707087e+02 5.156288e+02 1.726140e+02 b.1 Yes
#> 3 4.045355e-01 6.263143e+00 2.210555e-05 a.1 No
#> 4 4.005247e-01 4.780367e-02 9.441789e-02 b.1 Yes
#> 5 4.651546e-03 2.659262e-01 7.675577e-05 a.1 No
#> 6 4.833250e-01 1.945277e+00 1.200872e-01 a.1 No
### Example 3
y = 4 - 9*x3 - 2*x5 + rnorm(obs, 0, 2)
x = cbind(cte, x3, x5) # x3 and x5 were generated independently
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 0.0446929027 0.7442711 1.003540e-04 a.1 No
#> 2 0.0004039021 4.2977544 3.601141e-08 a.1 No
#> 3 0.0044674952 0.2131438 9.363871e-05 a.1 No
### Example 4
### Detection of multicollinearity in Wissel data
head(Wissel, n=5)
#> t D cte C I CP
#> 1 1996 3.8051 1 4.7703 4.8786 808.23
#> 2 1997 3.9458 1 4.7784 5.0510 798.03
#> 3 1998 4.0579 1 4.9348 5.3620 806.12
#> 4 1999 4.1913 1 5.0998 5.5585 865.65
#> 5 2000 4.3585 1 5.2907 5.8425 997.30
y = Wissel[,2]
x = Wissel[,3:6]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 1.948661e+02 7.371069e+00 1.017198e+00 b.1 Yes
#> 2 3.032628e+01 4.456018e+00 9.157898e-01 b.1 Yes
#> 3 4.765888e+00 2.399341e+00 1.053598e+01 b.2 No
#> 4 3.821626e-05 2.042640e-06 7.149977e-04 b.2 No
### Example 5
### Detection of multicollinearity in euribor data
head(euribor, n=5)
#> E cte HIPC BC GD
#> 1 3.63 1 92.92 17211 -51384.0
#> 2 3.90 1 93.85 2724 -49567.1
#> 3 3.45 1 93.93 17232 -52128.4
#> 4 3.01 1 94.41 9577 -53593.3
#> 5 2.54 1 95.08 4117 -65480.0
y = euribor[,1]
x = euribor[,2:5]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 5.325408e+00 1.575871e+01 2.166907e-02 a.1 No
#> 2 5.357830e-04 3.219456e-06 4.249359e-05 b.1 Yes
#> 3 5.109564e-11 1.098649e-09 2.586237e-12 a.1 No
#> 4 1.631439e-11 3.216522e-10 8.274760e-13 a.1 No
### Example 6
### Detection of multicollinearity in Cobb-Douglas production function data
head(CDpf, n=5)
#> P cte logK logW
#> 1 37641114 1 17.93734 15.55598
#> 2 42620804 1 18.01187 15.60544
#> 3 37989413 1 17.98800 15.54486
#> 4 40464915 1 18.00700 15.58605
#> 5 41002031 1 18.02283 15.59570
y = CDpf[,1]
x = CDpf[,2:4]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 6388.881402 88495.933700 1.64951764 a.1 No
#> 2 4.136993 207.628058 0.05043083 a.1 No
#> 3 37.336325 9.445619 147.58213164 b.2 No
### Example 7
### Detection of multicollinearity in data on the number of employees of Spanish companies
head(employees, n=5)
#> NE cte FA OI S
#> 1 2637 1 44153 38903 38867
#> 2 15954 1 9389509 4293386 4231043
#> 3 162503 1 17374000 23703000 23649000
#> 4 162450 1 9723088 23310532 23310532
#> 5 28389 1 95980120 29827663 29215382
y = employees[,1]
x = employees[,3:5]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 1.829154e-16 2.307712e-16 4.679301e-17 a.1 No
#> 2 1.696454e-12 9.594942e-13 2.129511e-13 b.1 Yes
#> 3 1.718535e-12 1.100437e-12 2.683809e-12 b.2 No
### Example 8
### Detection of multicollinearity in simple linear model simulated data
head(SLM1, n=5)
#> y1 cte V
#> 1 82.392059 1 19.001420
#> 2 -1.942157 1 -1.733458
#> 3 7.474090 1 1.025146
#> 4 -12.303381 1 -4.445014
#> 5 30.378203 1 6.689864
y = SLM1[,1]
x = SLM1[,2:3]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 0.0403049717 0.6454323 1.045802e-05 a.1 No
#> 2 0.0002675731 0.8383436 8.540101e-08 a.1 No
head(SLM2, n=5)
#> y2 cte Z
#> 1 43.01204 1 9.978211
#> 2 40.04163 1 9.878235
#> 3 40.17086 1 9.924592
#> 4 40.79076 1 10.019123
#> 5 44.72774 1 10.104728
y = SLM2[,1]
x = SLM2[,2:3]
multicollinearity(y, x)
#> RVIFs c0 c3 Scenario Affects
#> 1 187.800878 21.4798003 0.03277691 b.1 Yes
#> 2 1.879296 0.3687652 9.57724567 b.2 No
### Example 9
### Detection of multicollinearity in soil characteristics data
head(soil, n=5)
#> BaseSat SumCation CECbuffer Ca Mg K Na P Cu Zn
#> 1 2.34 0.1576 0.614 0.0892 0.0328 0.0256 0.010 0.000 0.080 0.184
#> 2 1.64 0.0970 0.516 0.0454 0.0218 0.0198 0.010 0.000 0.064 0.112
#> 3 5.20 0.4520 0.828 0.3306 0.0758 0.0336 0.012 0.240 0.136 0.350
#> 4 4.10 0.3054 0.698 0.2118 0.0536 0.0260 0.014 0.030 0.126 0.364
#> 5 2.70 0.2476 0.858 0.1568 0.0444 0.0304 0.016 0.384 0.078 0.376
#> Mn HumicMatter Density pH ExchAc Diversity
#> 1 3.200 0.1220 0.0822 0.516 0.466 0.2765957
#> 2 2.734 0.0952 0.0850 0.512 0.430 0.2613982
#> 3 4.148 0.1822 0.0746 0.554 0.388 0.2553191
#> 4 3.728 0.1646 0.0756 0.546 0.408 0.2401216
#> 5 4.756 0.2472 0.0692 0.450 0.624 0.1884498
y = soil[,16]
x = soil[,-16]
x = cbind(rep(1, length(y)), x) # the intercept must be in the first column of the design matrix
multicollinearity(y, x)
#> System is computationally singular. Modify the design matrix before running the code.
multicollinearity(y, x[,-3]) # eliminating the problematic variable (SumCation)
#> RVIFs c0 c3 Scenario Affects
#> 1 4.407184e+02 6.150190e-03 1.480048e+00 b.1 Yes
#> 2 3.828858e+00 1.142356e-02 7.653413e+00 b.2 No
#> 3 1.093791e+05 1.254955e+02 7.236491e+04 b.1 Yes
#> 4 9.883235e+04 3.938383e+01 2.237445e+05 b.2 No
#> 5 1.767758e+05 1.101028e+03 3.609837e+05 b.2 No
#> 6 1.150029e+05 1.627349e+03 1.976176e+05 b.2 No
#> 7 4.627807e+04 5.960870e+02 2.033176e+06 b.2 No
#> 8 1.338591e+01 6.062571e-01 4.060382e+02 b.2 No
#> 9 3.113066e+02 4.089095e+01 5.246698e+05 b.2 No
#> 10 5.177176e+01 6.371216e+00 8.094828e+02 b.2 No
#> 11 1.905089e-01 3.907589e-02 9.787963e-01 b.2 No
#> 12 3.379360e+02 4.534540e+01 2.861964e+02 b.1 Yes
#> 13 4.761238e+04 8.453066e+01 3.828016e+08 b.2 No
#> 14 1.502903e+03 7.901580e+01 9.961215e+03 b.2 No
#> 15 1.066711e+05 2.369347e+02 4.802466e+07 b.2 No
### Example 10
### The intercept must be in the first column of the design matrix
set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = sample(1:500, obs)
x3 = sample(1:500, obs)
x4 = rep(4, obs)
x = cbind(cte, x2, x3, x4)
u = rnorm(obs, 0, 2)
y = 5 + 2*x2 - 3*x3 + 10*x4 + u
multicollinearity(y, x)
#> There is a constant variable. Delete it before running the code or, if it is the intercept, it must be the first column of the design matrix.
#> Perfect multicollinearity exists. Modify the design matrix before running the code.
multicollinearity(y, x[,-4]) # the constant variable is removed
#> RVIFs c0 c3 Scenario Affects
#> 1 7.404884e-02 121.0498871 4.062417e-07 a.1 No
#> 2 4.750899e-07 0.2408853 9.995920e-13 a.1 No
#> 3 4.977510e-07 0.5417202 4.573507e-13 a.1 No