Types of cross validation in machine learning


Hello, machine learning enthusiasts, and welcome to another article by DevpyJp. Cross validation is a technique that helps us improve model accuracy in machine learning.

Simply put, it is a way of splitting our data into train and test sets when building a model, and it plays a major role in training machine learning models.

The training phase is the most significant part of machine learning: if you do not train your model under good constraints, it will give you poor results.

Let's dig a little deeper into the types of cross validation. Here you will learn the most common and useful cross validation techniques. They are:

  • Simple cross validation
  • K fold cross validation
  • Stratified k fold cross validation
  • Time-based splitting


1. Simple cross validation

It is a very simple technique: we split the data into train and test sets in a ratio of 80:20 or 75:25 respectively, so that we use 75-80% of the data for training and the remaining 20-25% for testing the model.


This is okay, but it is not always good enough for training a machine learning model.

Drawback: we won't get the same results every run unless we fix the random state, and we don't know how the split is distributed with respect to the class labels.
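For illustration, here is a minimal sketch of a simple hold-out split using scikit-learn's train_test_split; the toy data is a placeholder for a real dataset, and fixing random_state addresses the reproducibility drawback above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy placeholder data standing in for a real dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# 80:20 split; fixing random_state makes the split reproducible,
# which addresses the drawback mentioned above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 80 20
```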

2. K fold cross validation

It is a better version of cross validation: it overcomes the drawback of simple cross validation, though it has some drawbacks of its own. Here, K represents the number of folds we split the train data into.

In this technique, we follow these steps:

  • Split the data into train and test sets in a ratio of 80:20 or 75:25.
  • Define the K folds, where K is an integer, e.g. K = 5; usually K should be between 5 and 10.
  • Re-split the train data according to the K folds: in each of the K rounds, we train on (K-1)/K of the data and test on the remaining 1/K.
  • Calculate the loss for each fold.
  • Calculate the average loss and take it as the final training loss of the model (see the sketch after this list).
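A minimal sketch of 5-fold cross validation with scikit-learn, assuming a toy classification dataset and log loss as the per-fold loss (both are my choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data standing in for the train set
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# K = 5: each round trains on 4/5 of the data and tests on 1/5
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports losses negated, so flip the sign back
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=kf, scoring="neg_log_loss")
print("loss per fold:", -scores)
print("average loss :", -scores.mean())
```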

Drawbacks: we still don't know how the data is split with respect to the class labels, which can give the train and test folds different class distributions, and training takes more time since the model is fit K times.

3. Stratified k fold cross validation

It avoids the drawback of k fold cross validation: stratified k fold cross validation builds the k folds with respect to the class label ratio, so that each fold gets the same ratio of data points from each class. The remaining steps are the same as in k fold cross validation.
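A minimal sketch showing how scikit-learn's StratifiedKFold keeps the class ratio in every fold; the imbalanced toy data (roughly 90:10) is an assumption for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every test fold keeps roughly the same 90:10 class ratio
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[test_idx])
    print(f"fold {fold}: class counts in test fold = {counts}")
```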


Drawback: it takes more time to train the model because we have k folds.

4. Time-based splitting

This one is completely different from the other cross validation techniques in machine learning. It is problem-specific, so I can't recommend it for every case.

Example: we have a dataset from Jan 2015 to Dec 2019. At the start, the company launched some products; in 2017 it added more, and in 2019 it introduced new products and also dropped some that were added previously.

In this situation, any other cross validation technique won't predict accurately, because some products may be present in one split but absent in another, and in the future we may get entirely new products.

In this case, we use time-based splitting so the model can predict accurately; it performs well here. In this technique, we follow these steps:

  • Sort the dataset by its date column.
  • Split the data chronologically into train and test sets.
  • Train the model on the train set.
  • If needed, apply k fold cross validation (not all the time).
  • Test the model on the test set (a sketch follows this list).
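A minimal sketch of a time-based split with pandas; the column names ("date", "feature", "target") and the 80:20 cutoff are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset with a time column named "date"
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=10, freq="180D"),
    "feature": range(10),
    "target": [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})

# Step 1: sort by the time column so older rows come first
df = df.sort_values("date")

# Step 2: split chronologically; train on the past,
# test on the most recent 20% of rows
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

print(train["date"].max(), "<", test["date"].min())
```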

It is not always accurate, but it works well for eCommerce and other time-related projects.

Caution: we must have a time column in our dataset, otherwise we can't use a time-based split. We always test the model on the most recent data.

We can build a time-based split from scratch as sketched above, and scikit-learn also provides TimeSeriesSplit (in sklearn.model_selection) for time-ordered cross validation.
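A quick sketch of TimeSeriesSplit; note that it assumes the rows are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten rows, assumed to be already sorted by time
X = np.arange(10).reshape(-1, 1)

# Each split trains on an expanding window of past rows
# and tests on the rows that come immediately after it
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```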

These are the most widely used techniques, and they can help us improve model performance.

I hope you love the content and images. Please show your appreciation and subscribe to our newsletter. If you have any queries, comment below. Thank you!
