FEATURE ENGINEERING: SOME FAMOUS TECHNIQUES TO HANDLE MISSING VALUES


Author: Bhaskar Kumar Das


Introduction:

Humans are prone to errors, but in many cases errors occur not because of human negligence or fault. They arise for a variety of reasons beyond our imagination, and one frequent type of error we encounter in Data Science is the presence of missing values.

Missing values generally arise when whoever collects or prepares the dataset fails to record a value, or when a person or system is unwilling to share information (e.g. it is observed that men are not likely to share their salary, and women are sometimes reluctant to share their age). So, as data scientists, it becomes our duty to handle those missing values.


We have two choices:

1) Drop the rows containing NaN values.

2) Replace the NaN values with some other value.

Here, in this blog, we'll opt for the second choice, as dropping rows can be costly: it leads to a problem called loss of information. If the dataset has few data points, we can't afford to do that.

So let's start...


Dataset Source: https://www.kaggle.com/c/titanic/data


Dataset Description: This dataset is based on the famous and unfortunate sinking of the RMS Titanic.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. 

Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Data Dictionary:

Variable    Definition                                    Key
Survived    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Here "Survived" is the target vector.

You can find more about this dataset at https://www.kaggle.com/c/titanic/data


Well, let's load the dataset...
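A minimal sketch of this step, assuming the train.csv file from the Kaggle link above has been downloaded into the working directory:

import pandas as pd

# load the Titanic training data (file name as on the Kaggle page)
df = pd.read_csv('train.csv')
print(df.head())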



Let's find the features of the data set and count the data points: we have 891 of them.
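Both checks in one quick sketch, continuing with the df loaded above:

# feature (column) names
print(df.columns.tolist())

# rows and columns; the Kaggle train set is 891 rows by 12 columns
print(df.shape)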

Now let's inspect whether there are any missing values in our data set.
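Counting the NaNs per column is one line with pandas:

# number of missing values in each column
print(df.isnull().sum())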


Yes, we do have some missing values: Age, Cabin and Embarked have 177, 687 and 2 missing values respectively.


Let's find what percentage of each column is missing...
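Taking the column-wise mean of the NaN mask gives the fraction missing:

# percentage of missing values per column
print(df.isnull().mean() * 100)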


Age, Cabin and Embarked have 19.87%, 77.1% and 0.22% missing values respectively.

19.87% of Age and 77.1% of Cabin are missing among 891 points!

So we can't afford to drop those rows...

 

Let's see the data types of the values present in those fields.

              
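A quick check of the three affected columns:

# data types of the columns that contain missing values
print(df[['Age', 'Cabin', 'Embarked']].dtypes)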

Age is a numeric (float64) field, while the other two fields contain categorical ('O' means Object) data.

Now, here is a catch: handling a continuous variable (age is continuous, not discrete) and handling a categorical variable are completely different tasks and need different approaches. In this blog we'll focus on the continuous variable; we'll deal with the categorical variables in an upcoming blog.

So let's leave those two categorical variables behind and recreate the dataset using just three features: Survived, Age and Fare.
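A sketch of that step, reusing the assumed train.csv file:

# keep only the three columns we need for this post
df = pd.read_csv('train.csv', usecols=['Survived', 'Age', 'Fare'])
print(df.head())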


Here we'll discuss four famous techniques for handling missing values.

Let's dive into some really effective techniques...

1) Median Imputation:

It can be applied when the data are missing completely at random, i.e. when there is no relationship between the missingness of Age and any other feature, and Age satisfies that criterion here.

Here we replace the NaN values of Age with the median of the Age column.

The only reason we don't use the mean is that means are greatly affected by the presence of outliers, while medians are barely affected.
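A minimal sketch of median imputation on the three-column df from above:

# median of the observed Age values (NaNs are skipped automatically)
median_age = df['Age'].median()

# new column with every NaN replaced by that median
df['Age_median'] = df['Age'].fillna(median_age)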


Now my data frame has a new Age_median column alongside the original Age.


Now let's find some key details of this transformation.
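A quick before-and-after comparison of the spread:

# standard deviation before and after median imputation
print(df['Age'].std())         # original, NaNs ignored
print(df['Age_median'].std())  # noticeably smaller after imputation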


If you look carefully, you'll see that the standard deviation of the transformed Age_median is lower than that of the original Age.

The graph below confirms this. That is not at all what we want: our variance gets distorted, there is a threat of loss of information, and our correlations with other features can also be impacted.
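The graph can be reproduced with a density plot along these lines (pandas' kde plot needs scipy installed; the colours are an assumed choice):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
df['Age'].dropna().plot(kind='kde', ax=ax, color='blue', label='Age')
df['Age_median'].plot(kind='kde', ax=ax, color='red', label='Age_median')
ax.legend()
plt.show()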

2) Random Sample Imputation:

Here we take random observations from our dataset (from the observed values of Age) and use those values to replace the NaNs.
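A sketch of one common way to do this; random_state is an assumed choice to make the result reproducible:

# draw as many random observed Age values as there are NaNs
n_missing = df['Age'].isnull().sum()
random_sample = df['Age'].dropna().sample(n_missing, random_state=0)

# give the sample the index of the missing rows so pandas can align it
random_sample.index = df[df['Age'].isnull()].index

df['Age_random'] = df['Age'].copy()
df.loc[df['Age'].isnull(), 'Age_random'] = random_sample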

Now our data frame has an extra Age_random column...


Now let's inspect...
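The overlap plot can be reproduced just like the earlier density plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
df['Age'].dropna().plot(kind='kde', ax=ax, color='blue', label='Age')
df['Age_random'].plot(kind='kde', ax=ax, color='red', label='Age_random')
ax.legend()
plt.show()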


Here we can see that the blue Age curve and the red Age_random curve overlap almost completely, so we can conclude there is little or no distortion in variance, and this transformation can safely be applied to our Age feature.


3) End of Distribution Imputation:

The main motive behind this technique is to reduce the effect of the outliers in the data set.

Let's plot the histogram of Age...
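A quick sketch of the histogram (the bin count is an assumed choice):

import matplotlib.pyplot as plt

# pandas drops the NaNs before binning
df['Age'].hist(bins=50)
plt.xlabel('Age')
plt.show()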


Here we can notice a right (positive) skew in the Age distribution. This tells us that Age contains outliers, and they sit at the extreme right-hand side of the plot.

Our box plot also confirms this...
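And a sketch of the box plot, dropping the NaNs explicitly first:

import matplotlib.pyplot as plt

df['Age'].dropna().plot(kind='box', vert=False)
plt.show()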



Just look at the black dots to the right of the whisker: these are outliers, and the right-hand side contains plenty of them.


So let's pick an extreme value, about 3 standard deviations to the right of the mean of Age. Since the outliers sit on the right, we want to pick such an extreme value from the right and replace every NaN with it.

This will reduce the effect of the outliers.
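A minimal sketch of that replacement:

# a value roughly 3 standard deviations beyond the mean of Age
extreme_value = df['Age'].mean() + 3 * df['Age'].std()

# replace every NaN with that end-of-distribution value
df['Age_end_distribution'] = df['Age'].fillna(extreme_value)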



Let's plot the new histogram...
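Same histogram sketch as before, now on the imputed column:

import matplotlib.pyplot as plt

df['Age_end_distribution'].hist(bins=50)
plt.xlabel('Age_end_distribution')
plt.show()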


Although it looks a little weird, we have managed to reduce its positive skewness to some extent, and that was our main motive here.


And just look at our box plot: it now shows no outliers.

But this is a risky approach: when the number of missing values is large, as it is here, this method pushes a huge number of points to the tail of our distribution and distorts the original distribution.


It can be a go-to approach, though, if we have a small number of missing values.

4) Arbitrary Value Imputation:

Here we replace the missing values with some arbitrary values.
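A sketch with the two arbitrary choices used below, zero and one hundred:

# replace every NaN with an arbitrary constant
df['Age_zero'] = df['Age'].fillna(0)
df['Age_hundred'] = df['Age'].fillna(100)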



Here we've replaced every NaN with zero in one column and with one hundred in another.

Let's look at the histograms of the zero and hundred distributions.
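Both histograms can be reproduced with the same sketch as earlier:

import matplotlib.pyplot as plt

df['Age_zero'].hist(bins=50)
plt.xlabel('Age_zero')
plt.show()

df['Age_hundred'].hist(bins=50)
plt.xlabel('Age_hundred')
plt.show()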

In both cases the distribution is distorted, so this is not a good approach to follow.


Hurray!!! You've reached the end of this post. In our next blog we'll discuss some techniques to handle those categorical feature variables. Thanks for reading my blog!

If you haven't checked out my previous blogs, please check out this link: https://bhaskar47899.blogspot.com/2020/08/exploratory-data-analysis-eda.html
Thank You!!!!
