Sunday, 23 March 2014

MULTICOLLINEARITY IN LINEAR MODELS

1. Introduction

One of the fundamental assumptions of the general linear model

1) y=Xβ + u

where u is a vector of zero-mean and uncorrelated stochastic errors, is that the data matrix X of order n × k has rank k, that is, the explanatory variables are not linearly dependent. This is because the least squares solution

2) b = (X'X)⁻¹X'y

requires the inversion of X'X, which is not possible if the rank of X is <k (because X'X becomes singular). If some or all of the explanatory variables are perfectly collinear, then the model 1) is said to be affected by "extreme" multicollinearity. 
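As a minimal numerical sketch of why the solution 2) breaks down (NumPy, with simulated data; names and seed are purely illustrative), adding a perfectly collinear column drops the rank of X below the number of columns and makes X'X numerically singular:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])      # full-rank design (rank 3)
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

    b = np.linalg.inv(X.T @ X) @ X.T @ y           # solution 2) works here
    print(b)

    X_bad = np.column_stack([X, x1 + x2])          # add a perfectly collinear column
    print(np.linalg.matrix_rank(X_bad))            # rank 3 < 4 columns
    print(np.linalg.cond(X_bad.T @ X_bad))         # astronomically large: X'X is (numerically) singular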

Problems in the calculation of the solution 2) can emerge even when the collinearity among the variables is not perfect. The main effects of collinearity are [1]:


  • Lowered precision of the estimates, which makes it difficult, if not impossible, to separate the relative influences of the different variables. Large errors may not only spoil the estimates of the coefficients, but the estimates may also become correlated with each other. 
  • Increased standard errors of the coefficients, which cause an adverse selection of the variables: individual variables may appear non-significant even in the presence of a high R² and a significant overall regression.
  • Coefficients that are "wrong" (for instance, carrying a sign contrary to expectations) or implausible by an order of magnitude.
  • Ill-conditioning: small changes in the data produce large differences in the estimated parameters. The estimates of the coefficients thus become more sensitive to particular sample sub-sets, so that a few additional observations may drastically change some coefficients. 


2. Diagnostics of multicollinearity

a) Examination of the correlation matrix of the regressors: high correlations (say, >0.9) may indicate the presence of collinearity. However, this method can only identify problems for pairs of variables, and it leaves open the question of what to do when more than two variables jointly create the multicollinearity.
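A quick sketch of this check (NumPy, with simulated regressors; the 0.9 threshold follows the rule of thumb above):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)     # nearly collinear with x1
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])

    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the regressors
    print(np.round(R, 3))
    i, j = np.where(np.triu(np.abs(R) > 0.9, k=1)) # flag pairs with |r| > 0.9
    print(list(zip(i.tolist(), j.tolist())))       # here: the (x1, x2) pair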

b) An alternative strategy is to run "auxiliary regressions" of a "suspect" variable (say, Xj) on the other k-1 explanatory variables. If the resulting coefficient of determination (Rj²) is close to 1, the coefficient of that variable in the original regression is affected by the problem of multicollinearity. 
An indicator that immediately identifies the variables generating multicollinearity is the VIF (Variance Inflation Factor):

3) VIFj = 1/(1 − Rj²)

So, if VIFj > 10, then Rj² > 0.9, and therefore the variable Xj is strongly correlated with one or more of the other explanatory variables. On the contrary, if Xj is not linearly dependent on the other k-1 variables, then Rj² = 0 and VIFj = 1. In the presence of non-perfect multicollinearity, VIFj measures to what extent the increase in the variance of the estimated coefficient bj is due to multicollinearity. 
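One way to compute 3) directly from the auxiliary regressions is sketched below (NumPy; the helper name vif is mine, not a library function):

    import numpy as np

    def vif(X):
        # Variance Inflation Factor of each column of X (columns = regressors, no intercept)
        n, k = X.shape
        out = np.empty(k)
        for j in range(k):
            xj = X[:, j]                                       # the "suspect" regressor Xj
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            b, *_ = np.linalg.lstsq(Z, xj, rcond=None)         # auxiliary regression
            r2 = 1.0 - ((xj - Z @ b) ** 2).sum() / ((xj - xj.mean()) ** 2).sum()   # Rj²
            out[j] = 1.0 / (1.0 - r2)                          # equation 3)
        return out

    # e.g. with the simulated X of the previous sketch, vif(X) returns large values for x1 and x2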

c) Method of Eigenvalues. The determinant of the matrix X'X equals the product of its eigenvalues:

4) det(X'X) = λ1·λ2·…·λk

det(X'X) close to 0 means that one or more eigenvalues are close to zero. Therefore we can calculate the condition index:

5) K = √(λmax/λmin)

When the columns of X are orthogonal, K = 1, and K increases with the collinearity between the variables; experimental studies have shown that K > 20 is a symptom of multicollinearity.
Furthermore, to identify the variable(s) affected by collinearity we can calculate the condition number for each regressor as:


6) Kj = √(λmax/λj)
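The eigenvalue diagnostics 4)-6) can be computed as sketched here (NumPy; scaling the columns of X to unit length first is an assumption of this sketch, though it is the usual convention for condition indices):

    import numpy as np

    def condition_diagnostics(X):
        Xs = X / np.linalg.norm(X, axis=0)      # columns scaled to unit length
        eig = np.linalg.eigvalsh(Xs.T @ Xs)     # eigenvalues of X'X, in ascending order
        K = np.sqrt(eig.max() / eig.min())      # condition index, equation 5)
        Kj = np.sqrt(eig.max() / eig)           # condition number for each eigenvalue, equation 6)
        return eig, K, Kj

    # e.g. eig, K, Kj = condition_diagnostics(X); K > 20 points to multicollinearity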

d) Contradiction between the t-tests on the individual coefficients and the F-test of joint significance. This is not a necessary condition for the existence of multicollinearity, but it is a symptom: the coefficient of determination is high (and hence the regression as a whole is significant), but the t statistics of the individual regression coefficients are not significant. In addition, the partial correlations between the regressors are low.
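This symptom is easy to reproduce on simulated data (a sketch using statsmodels; the data, seed and near-collinear construction are made up for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)              # nearly collinear with x1
    y = 1.0 + x1 + x2 + rng.normal(size=n)

    res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(res.rsquared)      # high R²
    print(res.f_pvalue)      # the regression as a whole is highly significant
    print(res.pvalues[1:])   # yet the individual t-tests on x1 and x2 are typically not significant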

3. How to fix multicollinearity


There are several methods that can be used to address the problem of multicollinearity: 

  • the addition of new observations that make X a full-rank matrix (even if this remedy is not always applicable); 
  • the exclusion from the model of either the correlated variables or those for which the estimated variance of the regression coefficient is high; 
  • the transformation of the variables that cause multicollinearity. This technique is particularly appropriate in the case of exact multicollinearity: one can substitute variables and estimate the new parameters by Ordinary Least Squares, obviously abandoning the idea of estimating the original parameters;
  • the use of principal component regression (PCR): the principal components are extracted from the original regressors (these new variables are by definition orthogonal to each other) and the response variable is regressed on them; 
  • the use of ridge regression (both PCR and ridge are sketched below).
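A minimal sketch of the last two remedies (NumPy; the ridge parameter alpha and the number of retained components are illustrative choices, and both helpers assume centred data):

    import numpy as np

    def ridge(X, y, alpha=1.0):
        # Ridge estimate b = (X'X + alpha*I)^(-1) X'y (X and y assumed centred)
        k = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

    def pcr(X, y, n_components):
        # Principal component regression: regress y on the first principal components
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z = Xc @ Vt[:n_components].T                 # orthogonal component scores
        gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
        return Vt[:n_components].T @ gamma           # coefficients mapped back to the original regressors

    # e.g. b_ridge = ridge(X - X.mean(axis=0), y - y.mean(), alpha=10.0)
    #      b_pcr   = pcr(X, y, n_components=2)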

3.1. Rescaling the regressors

An easy way to prevent, or at least reduce, the effect of multicollinearity is to rescale the variables with respect to their means. A regression equation with an intercept is often misunderstood in the context of multicollinearity [2]. Mean-centering facilitates the interpretation of the intercept term, which becomes the expected value of the outcome y when the explanatory variables are set to their mean values. When the variables have been centered, the intercept has no effect on the collinearity of the other variables [3]. 
Applying other transformations, introducing additive constants or using uncentered variables can have a large effect [4], especially in regressions with higher-order terms, where the level of the predictors' means may shift the covariance between the interaction terms and each component. Rescaling changes the means so that the regressors' covariance also changes, yielding different regression weights for the predictors in the higher-order function [2]. 
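The point can be seen directly on a predictor and its square (NumPy, simulated data): before centring the two are almost perfectly correlated, after centring the correlation practically disappears.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=5.0, scale=1.0, size=500)    # predictor with a non-zero mean

    print(np.corrcoef(x, x**2)[0, 1])               # close to 1: x and x² are nearly collinear
    xc = x - x.mean()                               # mean-centred predictor
    print(np.corrcoef(xc, xc**2)[0, 1])             # close to 0 for a (roughly) symmetric x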



