FEATURE ENGINEERING: SOME FAMOUS TECHNIQUES TO HANDLE MISSING VALUES


Author: Bhaskar Kumar Das


Introduction:

Humans are prone to errors, but in many cases errors occur not because of human negligence or fault. They arise for a variety of reasons beyond our imagination, and one frequent type of error we encounter in Data Science is the presence of missing values.

Missing values generally arise when whoever collects or prepares the dataset fails to record a value, or when a person or system is unwilling to share information (e.g. it is observed that men are not likely to share their salary, and women are sometimes reluctant to share their age). So, as data scientists, it becomes our duty to handle those missing values.


We have two choices:

1) Drop the rows containing NaN values.

2) Replace the NaN values with some other value.

Here, in this blog, we'll opt for the second choice, as dropping rows can be costly: it leads to a problem called loss of information. If the dataset has few data points, we can't afford to do that.

So let's start...


Dataset Source: https://www.kaggle.com/c/titanic/data


Dataset Description: This dataset is based on the famous and unfortunate sinking of the RMS Titanic.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. 

Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Data Dictionary:

Variable    Definition                                    Key
Survived    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Here "Survived" is the target vector.

You can find more about this dataset at https://www.kaggle.com/c/titanic/data


Well, let's load the dataset...
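A minimal sketch of this step, assuming the train.csv file from the Kaggle link above has been downloaded into the working directory:

import pandas as pd

# load the Titanic training data (file name as on the Kaggle page)
df = pd.read_csv('train.csv')
print(df.head())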



Let's find the features of the data set and count the data points: we have 891 of them.
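Both checks in one quick sketch, continuing with the df loaded above:

# feature (column) names
print(df.columns.tolist())

# rows and columns; the Kaggle train set is 891 rows by 12 columns
print(df.shape)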

Now let's inspect whether there are any missing values in our data set.
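Counting the NaNs per column is one line with pandas:

# number of missing values in each column
print(df.isnull().sum())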


Yes, we do have some missing values: Age, Cabin and Embarked have 177, 687 and 2 missing values respectively.


Let's find what percentage of each column is missing...
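Taking the column-wise mean of the NaN mask gives the fraction missing:

# percentage of missing values per column
print(df.isnull().mean() * 100)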


Age, Cabin and Embarked have 19.87%, 77.1% and 0.22% missing values respectively.

19.87% of Age and 77.1% of Cabin are missing among 891 points!

So we can't afford to drop those rows...

 

Let's see the data types of the values present in those fields.

              
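A quick check of the three affected columns:

# data types of the columns that contain missing values
print(df[['Age', 'Cabin', 'Embarked']].dtypes)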

Age is a numeric (float64) field, while the other two fields contain categorical ('O' means Object) data.

Now, here is a catch: handling a continuous variable (age is continuous, not discrete) and handling a categorical variable are completely different tasks and need different approaches. In this blog we'll focus on the continuous variable; we'll deal with the categorical variables in an upcoming blog.

So let's leave those two categorical variables behind and recreate the dataset using just three features: Survived, Age and Fare.
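A sketch of that step, reusing the assumed train.csv file:

# keep only the three columns we need for this post
df = pd.read_csv('train.csv', usecols=['Survived', 'Age', 'Fare'])
print(df.head())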


Here we'll discuss four famous techniques for handling missing values.

Let's dive into some really effective techniques...

1) Median Imputation:

It can be applied when the data are missing completely at random, i.e. when there is no relationship between the missingness of Age and any other feature, and Age satisfies that criterion here.

Here we replace the NaN values of Age with the median of the Age column.

The only reason we don't use the mean is that means are greatly affected by the presence of outliers, while medians are barely affected.
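A minimal sketch of median imputation on the three-column df from above:

# median of the observed Age values (NaNs are skipped automatically)
median_age = df['Age'].median()

# new column with every NaN replaced by that median
df['Age_median'] = df['Age'].fillna(median_age)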


Now my data frame has a new Age_median column alongside the original Age.


Now let's find some key details of this transformation.
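A quick before-and-after comparison of the spread:

# standard deviation before and after median imputation
print(df['Age'].std())         # original, NaNs ignored
print(df['Age_median'].std())  # noticeably smaller after imputation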


If you look carefully, you'll see that the standard deviation of the transformed Age_median is lower than that of the original Age.

The graph below confirms this. That is not at all what we want: our variance gets distorted, there is a threat of loss of information, and our correlations with other features can also be impacted.
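The graph can be reproduced with a density plot along these lines (pandas' kde plot needs scipy installed; the colours are an assumed choice):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
df['Age'].dropna().plot(kind='kde', ax=ax, color='blue', label='Age')
df['Age_median'].plot(kind='kde', ax=ax, color='red', label='Age_median')
ax.legend()
plt.show()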

2) Random Sample Imputation:

Here we take random observations from our dataset (from the observed values of Age) and use those values to replace the NaNs.
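A sketch of one common way to do this; random_state is an assumed choice to make the result reproducible:

# draw as many random observed Age values as there are NaNs
n_missing = df['Age'].isnull().sum()
random_sample = df['Age'].dropna().sample(n_missing, random_state=0)

# give the sample the index of the missing rows so pandas can align it
random_sample.index = df[df['Age'].isnull()].index

df['Age_random'] = df['Age'].copy()
df.loc[df['Age'].isnull(), 'Age_random'] = random_sample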

Now our data frame has an extra Age_random column...


Now let's inspect...
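The overlap plot can be reproduced just like the earlier density plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
df['Age'].dropna().plot(kind='kde', ax=ax, color='blue', label='Age')
df['Age_random'].plot(kind='kde', ax=ax, color='red', label='Age_random')
ax.legend()
plt.show()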


Here we can see that the blue Age curve and the red Age_random curve overlap almost completely, so we can conclude there is little or no distortion in variance, and this transformation can safely be applied to our Age feature.


3) End of Distribution Imputation:

The main motive behind this technique is to reduce the effect of the outliers in the data set.

Let's plot the histogram of Age...
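A quick sketch of the histogram (the bin count is an assumed choice):

import matplotlib.pyplot as plt

# pandas drops the NaNs before binning
df['Age'].hist(bins=50)
plt.xlabel('Age')
plt.show()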


Here we can notice a right (positive) skew in the Age distribution. This tells us that Age contains outliers, and they sit at the extreme right-hand side of the plot.

Our box plot also confirms this...
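And a sketch of the box plot, dropping the NaNs explicitly first:

import matplotlib.pyplot as plt

df['Age'].dropna().plot(kind='box', vert=False)
plt.show()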



Just look at the black dots to the right of the whisker: these are outliers, and the right-hand side contains plenty of them.


So let's pick an extreme value, about 3 standard deviations to the right of the mean of Age. Since the outliers sit on the right, we want to pick such an extreme value from the right and replace every NaN with it.

This will reduce the effect of the outliers.
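A minimal sketch of that replacement:

# a value roughly 3 standard deviations beyond the mean of Age
extreme_value = df['Age'].mean() + 3 * df['Age'].std()

# replace every NaN with that end-of-distribution value
df['Age_end_distribution'] = df['Age'].fillna(extreme_value)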



Let's plot the new histogram...
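Same histogram sketch as before, now on the imputed column:

import matplotlib.pyplot as plt

df['Age_end_distribution'].hist(bins=50)
plt.xlabel('Age_end_distribution')
plt.show()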


Although it looks a little weird, we have managed to reduce its positive skewness to some extent, and that was our main motive here.


And just look at our box plot: it now shows no outliers.

But this is a risky approach: when the number of missing values is large, as it is here, this method pushes a huge number of points to the tail of our distribution and distorts the original distribution.


It can be a go-to approach, though, if we have a small number of missing values.

4) Arbitrary Value Imputation:

Here we replace the missing values with some arbitrary values.
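A sketch with the two arbitrary choices used below, zero and one hundred:

# replace every NaN with an arbitrary constant
df['Age_zero'] = df['Age'].fillna(0)
df['Age_hundred'] = df['Age'].fillna(100)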



Here we've replaced every NaN with zero in one column and with one hundred in another.

Let's look at the histograms of the zero and hundred distributions.
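Both histograms can be reproduced with the same sketch as earlier:

import matplotlib.pyplot as plt

df['Age_zero'].hist(bins=50)
plt.xlabel('Age_zero')
plt.show()

df['Age_hundred'].hist(bins=50)
plt.xlabel('Age_hundred')
plt.show()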

In both cases the distribution is distorted, so this is not a good approach to follow.


Hurray!!! You've reached the end of this post. In our next blog we'll discuss some techniques to handle those categorical feature variables. Thanks for reading my blog!

If you haven't checked out my previous blogs, please check out this link: https://bhaskar47899.blogspot.com/2020/08/exploratory-data-analysis-eda.html
Thank You!!!!
