Acknowledgement: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012.
Let us start with a data set. Suppose we want to model and predict patients' total body fat percentage using body fat measurements taken on the triceps, thigh, and mid-arm.
y: total body fat percentages
x1: triceps body fat percentages
x2: thigh body fat percentages
x3: mid-arm body fat percentages
BF <- read.csv("data_BodyFat.csv",h=T)
dim(BF)[1]
## [1] 20
names(BF)
## [1] "x1" "x2" "x3" "y"
head(BF)
## x1 x2 x3 y
## 1 19.5 43.1 29.1 11.9
## 2 24.7 49.8 28.2 22.8
## 3 30.7 51.9 37.0 18.7
## 4 29.8 54.3 31.1 20.1
## 5 19.1 42.2 30.9 12.9
## 6 25.6 53.9 23.7 21.7
pairs(BF,pch=20)
cor(BF)
## x1 x2 x3 y
## x1 1.0000000 0.9238425 0.4577772 0.8432654
## x2 0.9238425 1.0000000 0.0846675 0.8780896
## x3 0.4577772 0.0846675 1.0000000 0.1424440
## y 0.8432654 0.8780896 0.1424440 1.0000000
model1=lm(y~x1+x2+x3,data=BF)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = BF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7263 -1.6111 0.3923 1.4656 4.1277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.085 99.782 1.173 0.258
## x1 4.334 3.016 1.437 0.170
## x2 -2.857 2.582 -1.106 0.285
## x3 -2.186 1.595 -1.370 0.190
##
## Residual standard error: 2.48 on 16 degrees of freedom
## Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641
## F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06
The triceps (x1) and thigh (x2) measurements are highly correlated (r = 0.92), whereas the mid-arm measurement (x3) is much more weakly correlated with the other two. Note also that the overall F-test is highly significant (p-value 7.3e-06) even though none of the individual coefficients is significant at the 5% level.
A serious problem that can dramatically reduce the usefulness of a regression model is multicollinearity, that is, near-linear dependence among the regressors. Its presence can severely impair our ability to estimate individual regression coefficients and can compromise other uses of the model.
Issues with Multicollinearity
Adding or deleting a predictor variable can change the remaining regression coefficients substantially (see the sketch after this list)
Standard errors of the regression coefficients become large when the predictor variables are highly correlated
(IMPORTANT: when the regressors are nearly linearly dependent, the fitted plane has a large amount of "wiggle room" in three or more dimensions, which is what the inflated standard errors reflect)
The estimated regression coefficients may be individually non-significant even though a strong statistical relationship exists between the set of predictor variables and the response
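As a quick illustration of the first point, the sketch below (using the BF data already loaded above) refits the body fat model with one, two, and all three predictors; comparing the estimate for x1 across the three fits shows how much a coefficient can move when correlated predictors enter the model.
# Sketch: watch how the coefficient of x1 changes as correlated predictors are added.
coef(lm(y ~ x1, data = BF))
coef(lm(y ~ x1 + x2, data = BF))
coef(lm(y ~ x1 + x2 + x3, data = BF))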
Typical symptoms of multicollinearity:
Large changes in the estimated regression coefficients when a predictor variable is added or deleted
Nonsignificant results in individual tests of regression coefficients for important predictors
Regression coefficients with a sign that is opposite of that expected
Large coefficients of simple correlation between pairs of predictor variables
Wide confidence intervals for the regression coefficients of important predictors (checked below for the body fat model)
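The last symptom can be checked directly for the body fat model fitted above; the sketch below simply prints the default 95% confidence intervals for the coefficients of model1.
# Sketch: 95% confidence intervals for the coefficients of the body fat model.
confint(model1)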
A formal diagnostic is the variance inflation factor (VIF):
\[ VIF_j=\frac{1}{1-R^2_j} \]
where \(R^2_j\) is the coefficient of determination obtained by regressing the \(j\)-th covariate, \(x_j\), on the other covariates.
If \(x_j\) is highly correlated with any other regressor variables, then \(R^2_j\) will be large.
The square root of the variance inflation factor indicates how much larger the standard error of a coefficient is compared with what it would be if that variable were uncorrelated with the other predictor variables in the model. For example, if VIF = 10, then \(\sqrt{VIF} \approx 3.2\), so the confidence interval is about 3.2 times wider than it would be if the regressors were orthogonal.
The largest VIF is commonly used as an indicator of the overall severity of multicollinearity
Guideline: VIFs greater than 5 to 10 are considered serious (a direct computation for the body fat model is sketched below)
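The sketch below computes the VIFs for the body fat model directly from the definition above, regressing each covariate on the other two; it should reproduce the values returned by the vif() function used later in this section.
# Sketch: VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other covariates.
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = BF))$r.squared
r2_x2 <- summary(lm(x2 ~ x1 + x3, data = BF))$r.squared
r2_x3 <- summary(lm(x3 ~ x1 + x2, data = BF))$r.squared
1 / (1 - c(x1 = r2_x1, x2 = r2_x2, x3 = r2_x3))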
Methods for Dealing with Multicollinearity
Collect more data
Specify the model differently
Use Ridge regression
Ridge regression modifies the method of least squares to produce biased estimators of the regression coefficients, trading a small amount of bias for a substantial reduction in variance.
Traditional least-squares estimation: \[ \hat{\beta} = \arg\min \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij})^2 \]
Ridge regression estimation: \[ \hat{\beta}_\lambda = \arg\min \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \]
The ridge trace plots the coefficient estimates against \(\lambda\)
It shows the impact of multicollinearity on the stability of the coefficients
Choose \(\lambda\) such that \(\hat{\beta}_\lambda\) is stable and the MSE is acceptable
Ridge regression is a good alternative if the model user wants to keep all regressors in the model (a closed-form sketch is given below, before the glmnet implementation)
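For centered and scaled regressors, the ridge estimator also has the closed form \(\hat{\beta}_\lambda=(X^TX+\lambda I)^{-1}X^Ty\). The sketch below applies this formula to the body fat data with an arbitrary \(\lambda = 1\); note that glmnet, used next, scales and parameterizes the penalty differently, so its numbers will not match this output exactly.
# Sketch: closed-form ridge estimate (X'X + lambda*I)^{-1} X'y on standardized predictors.
Z  <- scale(as.matrix(BF[, c("x1", "x2", "x3")]))  # standardize the regressors
yc <- BF$y - mean(BF$y)                            # center the response
lambda <- 1                                        # arbitrary illustrative value
solve(t(Z) %*% Z + lambda * diag(ncol(Z)), t(Z) %*% yc)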
#install.packages("glmnet")
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-1
library(MASS)
# Generalized variance inflation factors (GVIF); this reproduces the vif() function from the car package.
vif <- function(mod, ...) {
  if (any(is.na(coef(mod))))
    stop("there are aliased coefficients in the model")
  v <- vcov(mod)                               # estimated covariance matrix of the coefficients
  assign <- attr(model.matrix(mod), "assign")  # maps design-matrix columns to model terms
  if (names(coefficients(mod)[1]) == "(Intercept)") {
    v <- v[-1, -1]                             # drop the intercept
    assign <- assign[-1]
  }
  else warning("No intercept: vifs may not be sensible.")
  terms <- labels(terms(mod))
  n.terms <- length(terms)
  if (n.terms < 2) stop("model contains fewer than 2 terms")
  R <- cov2cor(v)                              # correlation matrix of the coefficient estimates
  detR <- det(R)
  result <- matrix(0, n.terms, 3)
  rownames(result) <- terms
  colnames(result) <- c("GVIF", "Df", "GVIF^(1/(2*Df))")
  for (term in 1:n.terms) {
    subs <- which(assign == term)
    result[term, 1] <- det(as.matrix(R[subs, subs])) *
      det(as.matrix(R[-subs, -subs])) / detR   # generalized VIF for this term
    result[term, 2] <- length(subs)            # degrees of freedom of the term
  }
  if (all(result[, 2] == 1)) result <- result[, 1]  # ordinary VIFs when every term has 1 df
  else result[, 3] <- result[, 1]^(1/(2 * result[, 2]))
  result
}
vif(model1)
## x1 x2 x3
## 708.8429 564.3434 104.6060
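All three VIFs are far above the 5 to 10 guideline, confirming severe multicollinearity. As a cross-check, the same values can be read off the diagonal of the inverse of the predictors' correlation matrix (a standard identity for VIFs); a minimal sketch:
# Sketch: VIFs as the diagonal of the inverse correlation matrix of the predictors.
diag(solve(cor(BF[, c("x1", "x2", "x3")])))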
library(glmnet)
rreg=glmnet(x=as.matrix(BF[,-4]), y=BF[,4],alpha=0,lambda = seq(0,0.5,0.01))
rbind(rreg$lambda,rreg$beta)
## 4 x 51 sparse Matrix of class "dgCMatrix"
## [[ suppressing 51 column names 's0', 's1', 's2' ... ]]
##
## 0.5000000 0.4900000 0.4800000 0.4700000 0.4600000 0.4500000
## x1 0.4312373 0.4316196 0.4322645 0.4330645 0.4339617 0.4349253
## x2 0.4370659 0.4375796 0.4378913 0.4380835 0.4382005 0.4382662
## x3 -0.1141151 -0.1146001 -0.1152155 -0.1159092 -0.1166533 -0.1174327
##
## 0.4400000 0.4300000 0.4200000 0.4100000 0.4000000 0.3900000
## x1 0.4359388 0.4369938 0.4380862 0.4392147 0.4403794 0.4415814
## x2 0.4382931 0.4382877 0.4382527 0.4381892 0.4380969 0.4379748
## x3 -0.1182396 -0.1190699 -0.1199218 -0.1207947 -0.1216887 -0.1226044
##
## 0.3800000 0.3700000 0.3600000 0.3500000 0.3400000 0.3300000
## x1 0.4428224 0.4441045 0.4454302 0.4468025 0.4482245 0.4496999
## x2 0.4378214 0.4376348 0.4374127 0.4371527 0.4368520 0.4365074
## x3 -0.1235429 -0.1245053 -0.1254929 -0.1265073 -0.1275503 -0.1286238
##
## 0.3200000 0.3100000 0.3000000 0.2900000 0.2800000 0.2700000
## x1 0.4512324 0.4528265 0.4547170 0.4566152 0.4585564 0.4605643
## x2 0.4361157 0.4356731 0.4349902 0.4342984 0.4335692 0.4327830
## x3 -0.1297298 -0.1308707 -0.1321646 -0.1334671 -0.1347957 -0.1361629
##
## 0.2600000 0.2500000 0.2400000 0.2300000 0.2200000 0.2100000
## x1 0.4626572 0.4651257 0.4676284 0.4704937 0.4733934 0.4766835
## x2 0.4319249 0.4307581 0.4295587 0.4280588 0.4265246 0.4246646
## x3 -0.1375778 -0.1391880 -0.1408211 -0.1426441 -0.1444910 -0.1465431
##
## 0.2000000 0.1900000 0.1800000 0.1700000 0.1600000 0.1500000
## x1 0.4800450 0.4838628 0.4881014 0.4927671 0.4979024 0.5035803
## x2 0.4227384 0.4204291 0.4177636 0.4147346 0.4113047 0.4074109
## x3 -0.1486386 -0.1509747 -0.1535340 -0.1563209 -0.1593584 -0.1626850
##
## 0.1400000 0.1300000 0.1200000 0.1100000 0.1000000 0.0900000
## x1 0.5099019 0.5172989 0.5258012 0.5358791 0.5476771 0.5616054
## x2 0.4029663 0.3976080 0.3913056 0.3836600 0.3745417 0.3635975
## x3 -0.1663544 -0.1705907 -0.1754120 -0.1810646 -0.1876272 -0.1953170
##
## 0.0800000 0.0700000 0.0600000 0.0500000 0.0400000 0.0300000
## x1 0.5792731 0.6012218 0.6302994 0.6697241 0.7271691 0.8186159
## x2 0.3494620 0.3316591 0.3077599 0.2750022 0.2268213 0.1495351
## x3 -0.2049759 -0.2168954 -0.2325715 -0.2537031 -0.2843309 -0.3328752
##
## 0.020000000 0.0100000 .
## x1 0.987221926 1.4028661 4.113612
## x2 0.006197198 -0.3486234 -2.668242
## x3 -0.422073583 -0.6414275 -2.069967
matplot(rreg$lambda,t(rreg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)
#select(rreg)
#rreg
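One data-driven way to choose \(\lambda\), rather than eyeballing the ridge trace, is cross-validation. The sketch below uses cv.glmnet from the glmnet package already loaded, with its default 10-fold split; with only 20 observations the chosen value varies noticeably across random seeds, so treat it as a rough guide.
# Sketch: pick lambda by 10-fold cross-validation (results depend on the random folds).
set.seed(1)
cvfit <- cv.glmnet(x = as.matrix(BF[, -4]), y = BF[, 4], alpha = 0)
cvfit$lambda.min               # lambda with the smallest cross-validated MSE
coef(cvfit, s = "lambda.min")  # coefficients at that lambda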
#Webster data
#install.packages("car")
#library(car)
W <- read.csv("data_Webster.csv",h=T)
pairs(W,pch=20)
n=dim(W)[1]
# Same pairs plot with a small uniform jitter added to four predictor columns
# (x1-x4, assuming the column order y, x1, ..., x6) so that overlapping points are visible.
pairs(W+cbind(0,matrix(1.5*runif(n*4),n,4),0,0),pch=20)
model1=lm(y~x1+x2+x3+x4+x5+x6,data=W)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = W)
##
## Residuals:
## 1 2 3 4 5 6
## -3.698e-15 -1.545e+00 1.545e+00 7.649e-01 -2.517e-01 -5.132e-01
## 7 8 9 10 11 12
## -6.030e-01 4.730e-01 1.300e-01 1.510e-01 -2.371e-01 8.610e-02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6599 14.0465 1.186 0.288885
## x1 -0.5313 1.3418 -0.396 0.708482
## x2 -0.8385 1.4206 -0.590 0.580722
## x3 -0.7753 1.4094 -0.550 0.605914
## x4 -0.8440 1.4031 -0.601 0.573745
## x5 1.0232 0.3909 2.617 0.047247 *
## x6 5.0470 0.7277 6.936 0.000956 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.129 on 5 degrees of freedom
## Multiple R-squared: 0.9457, Adjusted R-squared: 0.8806
## F-statistic: 14.52 on 6 and 5 DF, p-value: 0.004993
rreg <- glmnet(as.matrix(W[,-1]),W[,1],alpha=0,lambda=seq(0,0.2,.01))
rbind(rreg$lambda,rreg$beta)
## 7 x 21 sparse Matrix of class "dgCMatrix"
## [[ suppressing 21 column names 's0', 's1', 's2' ... ]]
##
## 0.200000000 0.19000000 0.18000000 0.17000000 0.16000000
## x1 0.171841980 0.17147381 0.17141456 0.17135186 0.17109254
## x2 -0.082476146 -0.08346788 -0.08412549 -0.08478592 -0.08565685
## x3 -0.009277846 -0.01080200 -0.01205498 -0.01334375 -0.01485757
## x4 -0.097257945 -0.09808449 -0.09862875 -0.09918995 -0.09995534
## x5 1.001994679 1.00454668 1.00701669 1.00945447 1.01188741
## x6 4.689062458 4.70627701 4.72374209 4.74140345 4.75922489
##
## 0.15000000 0.14000000 0.13000000 0.12000000 0.11000000 0.10000000
## x1 0.17083233 0.17053514 0.17017082 0.16971425 0.16891288 0.16755448
## x2 -0.08652962 -0.08744168 -0.08842518 -0.08950679 -0.09095557 -0.09299885
## x3 -0.01638751 -0.01798007 -0.01966612 -0.02147204 -0.02365695 -0.02642894
## x4 -0.10072965 -0.10155258 -0.10245452 -0.10346147 -0.10482746 -0.10676485
## x5 1.01431184 1.01671515 1.01909850 1.02146117 1.02381419 1.02617989
## x6 4.77719648 4.79534417 4.81366687 4.83216556 4.85082235 4.86959199
##
## 0.09000000 0.08000000 0.07000000 0.06000000 0.05000000 0.04000000
## x1 0.16614853 0.16384486 0.16116009 0.15711833 0.15157243 0.14295745
## x2 -0.09509029 -0.09813670 -0.10158390 -0.10647031 -0.11294738 -0.12267351
## x3 -0.02928527 -0.03308809 -0.03732844 -0.04300549 -0.05029319 -0.06082395
## x4 -0.10877334 -0.11170891 -0.11506473 -0.11983615 -0.12619561 -0.13576981
## x5 1.02849637 1.03081248 1.03306319 1.03527721 1.03740169 1.03940733
## x6 4.88857033 4.90765970 4.92696009 4.94637967 4.96596429 4.98563943
##
## 0.03000000 0.0200000 0.01000000 .
## x1 0.12892511 0.1021866 0.03485326 -0.4249941
## x2 -0.13813117 -0.1670323 -0.23888023 -0.7259783
## x3 -0.07707419 -0.1066999 -0.17901478 -0.6638667
## x4 -0.15102595 -0.1795933 -0.25068390 -0.7328694
## x5 1.04118335 1.0424608 1.04216009 1.0267822
## x6 5.00535961 5.0249027 5.04346264 5.0507129
matplot(rreg$lambda,t(rreg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)
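Once a value of \(\lambda\) has been picked from the trace, the corresponding coefficient vector can be extracted from the fitted glmnet object; the value 0.05 below is purely illustrative.
# Sketch: coefficients of the Webster-data ridge fit at an illustrative lambda = 0.05.
coef(rreg, s = 0.05)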
Comments on Ridge Regression
\(R^2\) can be defined analogously to its least-squares counterpart (see the sketch after this list)
Ridge estimates are more stable than least-squares estimates and are little affected by small changes in the data
Ridge estimates can provide good estimates of mean responses for levels of the predictor variables outside the region of the observations
Standard inferential procedures are not applicable since the exact distributions of the ridge estimators are not known
Selection of \(\lambda\) is judgmental; cross-validation (as sketched earlier with cv.glmnet) offers one data-driven alternative.
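To illustrate the first comment, an \(R^2\) analogue for the ridge fit can be computed as \(1-\mathrm{SSE}/\mathrm{SST}\) using the glmnet predictions; the sketch below does this for the Webster data at the same illustrative \(\lambda = 0.05\) used above.
# Sketch: R^2 analogue for the ridge fit, 1 - SSE/SST, at an illustrative lambda = 0.05.
yhat <- predict(rreg, newx = as.matrix(W[, -1]), s = 0.05)
1 - sum((W$y - yhat)^2) / sum((W$y - mean(W$y))^2)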