Variance inflation factor (VIF) in Python – multicollinearity

variance inflation factor in python - www.devpyjp.com

Hey, machine learning humans! Welcome to the variance inflation factor in Python tutorial. Checking for multicollinearity is an important step before you move on to model building in machine learning.

In this tutorial, I am gonna teach you the topics below:

  1. Collinearity
  2. Multicollinearity
  3. Variance inflation factor (VIF)
  4. Detecting multicollinearity
  5. Solutions to multicollinearity

Ok, are you ready to learn new things in machine learning? Let’s do it now. Before that, check out our best articles on machine learning.

Model evaluation techniques in machine learning

Types of cross-validation in machine learning

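1. Collinearity:

Hey, do you remember the collinear points topic from your school days? If not, don’t worry, I am here to help you.

Points that lie on the same straight line are called collinear points, and this property is called collinearity. Any two points are always collinear, so the idea only gets interesting with three or more points. It’s that simple, dude.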

collinear points
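
If you want to check this in code, here is a minimal sketch. It relies on the fact that three points lie on one line exactly when the triangle they form has zero area. The function name `are_collinear` and the tolerance `tol` are my own choices for this illustration.

```python
def are_collinear(p, q, r, tol=1e-9):
    """Return True if the 2-D points p, q and r lie on one straight line."""
    # Twice the signed area of the triangle p-q-r (a 2-D cross product).
    area2 = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(area2) <= tol


print(are_collinear((0, 0), (1, 1), (2, 2)))  # all three points are on the line y = x
print(are_collinear((0, 0), (1, 1), (2, 3)))  # the third point is off that line
```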

2. Multicollinearity:

When we are building a multiple linear regression model or a classification model, we must care about multicollinearity.

So, what is multicollinearity? It is a situation where 2 or more variables carry nearly the same information for predicting the target, in other words, 2 or more variables in the data are highly correlated with each other.

Multicollinearity can show up in both regression and classification problems. It is a property of the data, not of the model.

In this data, the loan and funded columns carry almost the same information. Check out more here.

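3. Variance inflation factor in Python:

The variance inflation factor (VIF) measures how well one feature can be predicted from the other features. For feature j, regress it on all the remaining features and take the R^2 of that regression; then VIF_j = 1 / (1 - R_j^2). A VIF of 1 means the feature is not linearly related to the others, and the larger the VIF, the more the variance of that feature's regression coefficient is inflated by multicollinearity.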
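
To see what the number means, here is a rough sketch that computes the VIF by hand with NumPy only. It uses the usual definition: regress each feature on all the others (plus an intercept), take the R^2 of that regression, and compute 1 / (1 - R^2). The helper name `vif_by_hand` and the random data are made up for this illustration.

```python
import numpy as np


def vif_by_hand(X):
    """VIF of each column of X (n_samples x n_features), via 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress feature j on all other features plus an intercept column.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Made-up data: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

print(vif_by_hand(np.column_stack([x1, x2, x3])))
```

With data like this, the first two values should come out large and the third should stay close to 1, because only x1 and x2 can be predicted from each other.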

4. Detection of multicollinearity:

Ok, now we need to detect the multicollinear features in our data. There are 2 ways to detect them:

  • Correlation
  • Variance inflation factor (VIF)

Correlation:

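Correlation measures the strength of the linear relationship between a pair of variables, on a scale from -1 to 1.

# Pairwise correlation between the numeric columns of df
corr_matrix = df.corr()
corr_matrix

Loan and funded are multicollinear features in our data, so they are highly correlated. Remember, if two features are collinear then their correlation is close to 1 or -1; it is exactly 1 or -1 only when one feature is a perfect linear function of the other.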
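
Reading a big correlation matrix by eye gets tiring, so here is a small sketch that lists only the highly correlated pairs. The data is randomly generated as a stand-in for the loan data (the column names are borrowed from this post, the values are made up), and the 0.9 cutoff is just an assumption for the demo.

```python
import numpy as np
import pandas as pd

# Made-up data: `funded` is almost a copy of `loan`, `emp_exp` is unrelated.
rng = np.random.default_rng(42)
loan = rng.uniform(1_000, 35_000, size=300)
df_demo = pd.DataFrame({
    "loan": loan,
    "funded": loan - rng.uniform(0, 500, size=300),
    "emp_exp": rng.integers(0, 11, size=300).astype(float),
})

corr = df_demo.corr()

# Visit each pair of columns once and keep the strongly correlated ones.
threshold = 0.9  # assumed cutoff for this demo
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

for a, b, value in high_pairs:
    print(f"{a} / {b}: {value:.4f}")
```

Keep in mind that correlation only looks at two features at a time, so it can miss a feature that is well predicted by a combination of several others.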

Variance inflation factor (VIF):

We can detect the multicollinear features in our data using the variance inflation factor in Python.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Features to check for multicollinearity
X = df[['loan', 'funded', 'emp_exp']]

# One VIF value per column of X
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

print(vif)

Note: A common rule of thumb is that a VIF greater than 10 signals serious multicollinearity for that feature. Some practitioners use a stricter cutoff of 5.

5. Solutions to Multicollinearity

After finding the multicollinear features, we need to do something about the data. Let’s do that too.

Here are two common ways to handle multicollinear features once you have found them.

Method-1: Remove one of the correlated features from the data and keep the other.

Method-2: Combine the collinear features into a single new feature, and remove the original collinear features from the data.

Ok! It’s time to say bye! Thank you for being here. Show your appreciation with a comment, and subscribe to our newsletter to get more articles on machine learning.