FEATURE ENGINEERING : SOME FAMOUS TECHNIQUES TO HANDLE MISSING VALUES
Author: Bhaskar Kumar Das
Introduction:
Humans are prone to errors, but many errors in data do not come from human negligence at all. They arise for a variety of reasons that are hard to anticipate, and one of the most frequent problems we encounter in Data Science is the presence of missing values.
Missing values generally arise when whoever collects or prepares a data set fails to record a value, or when a person or system is unwilling to share information (for example, it is sometimes observed that men are reluctant to share their salary and women are sometimes reluctant to share their age). As data scientists, it is our job to handle those missing values.
We have two choices: drop the rows or columns that contain missing values, or fill in (impute) the missing values.
Dataset Source: https://www.kaggle.com/c/titanic/data
Dataset Description: This data set is based on the famous and tragic sinking of the RMS Titanic.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
Data Dictionary:
| Variable | Definition | Key |
|---|---|---|
| Survived | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Here, "Survived" is the target variable.
You can find more about this data set at https://www.kaggle.com/c/titanic/data
Let's load the data set.
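The original loading code is not shown here, so the following is only a minimal sketch of how the data could be read with pandas. The helper name `load_titanic` and the three toy rows are assumptions made for illustration; the toy rows are made up and only mimic part of the real column layout.

```python
import io

import pandas as pd


def load_titanic(source):
    """Read Titanic-style data from a file path or a file-like object."""
    return pd.read_csv(source)


# Toy stand-in for the Kaggle file so the snippet runs without a download.
# These rows are invented for illustration, not taken from the data set.
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked\n"
    "1,0,3,male,22,7.25,,S\n"
    "2,1,1,female,38,71.28,C85,C\n"
    "3,1,3,female,,7.92,,S\n"
)

df = load_titanic(toy_csv)
print(df.columns.tolist())  # feature names
print(df.shape)             # (number of rows, number of columns)
```

With the real data, the same helper would be called with the path of the downloaded training file, for example `load_titanic("train.csv")`, assuming the file sits in the working directory.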
Let's look at the features of the data set and its size: we have 891 data points.
Now let's check whether there are any missing values in our data set. There are: Age, Cabin and Embarked have 177, 687 and 2 missing values respectively.
Let's find what percentage of them are missing. Of the 891 data points, Age, Cabin and Embarked are missing 19.87%, 77.10% and 0.22% of their values respectively.
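As a rough sketch, the counts and percentages can be computed with `isnull()`. The helper name `missing_report` and the toy frame are assumptions for illustration; only the counts 177, 687, 2 and the total of 891 come from the text.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return count and percentage of missing values for affected columns."""
    counts = df.isnull().sum()
    percent = (df.isnull().mean() * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    return report[report["missing"] > 0]


# Toy frame with invented values, standing in for the real data set.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
report = missing_report(toy)
print(report)

# Checking the percentages quoted in the text from the raw counts.
titanic_percent = {
    name: round(count / 891 * 100, 2)
    for name, count in {"Age": 177, "Cabin": 687, "Embarked": 2}.items()
}
print(titanic_percent)
```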
Nearly 20% of Age and 77% of Cabin are missing, so we can't afford to simply drop those rows.
Let's see the data types of the values in those fields.
Age is a floating-point field (float64), while the other two fields contain categorical data ('O' means object).
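A small sketch of how the data types can be inspected. The toy values are invented; the point is that a numeric column containing NaN is stored as a float, while text columns are reported as object (newer pandas versions may report a dedicated string type instead).

```python
import numpy as np
import pandas as pd

# Toy frame with invented values that mimic the three affected columns.
toy = pd.DataFrame({
    "Age": [22, 38, np.nan],           # numeric with a missing value
    "Cabin": [np.nan, "C85", np.nan],  # text with missing values
    "Embarked": ["S", "C", None],      # text with a missing value
})

# NaN cannot be stored in an integer column, so Age becomes float64.
print(toy.dtypes)
```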
Now, here is the catch: continuous variables (Age is continuous, not discrete) and categorical variables need completely different approaches. In this blog we'll focus on continuous variables; categorical variables will be covered in an upcoming blog.
So we leave those two categorical variables behind and recreate the data set using three features: Survived, Age and Fare.
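One way to rebuild the data set with only these three features is the `usecols` parameter of `pd.read_csv`. This is a sketch on invented toy rows, not necessarily the way the original notebook did it.

```python
import io

import pandas as pd

# Toy stand-in for the Kaggle file; the rows are invented for illustration.
toy_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked\n"
    "1,0,3,male,22,7.25,,S\n"
    "2,1,1,female,38,71.28,C85,C\n"
    "3,1,3,female,,7.92,,S\n"
)

# Only the listed columns are parsed; they keep the order used in the file.
df = pd.read_csv(toy_csv, usecols=["Survived", "Age", "Fare"])
print(df.head())
```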
Here we'll discuss four famous techniques for handling missing values.
Let's dive into some really effective techniques.
1) Median Imputation:
and here age satisfies that criteria.
Here we replace the NaN values of Age with the median of the Age column.
We use the median rather than the mean because the mean is strongly affected by outliers, whereas the median is barely affected by them.
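A minimal sketch of median imputation with pandas. The helper name `impute_median` and the toy ages are assumptions for illustration; the new column name follows the Age_median naming used in this post.

```python
import numpy as np
import pandas as pd


def impute_median(df, column):
    """Add '<column>_median' with NaNs replaced by the column median."""
    median = df[column].median()  # NaNs are skipped when computing the median
    out = df.copy()
    out[column + "_median"] = out[column].fillna(median)
    return out


# Toy ages with invented values, including one large outlier (80).
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan, 80.0]})
result = impute_median(toy, "Age")
print(result)
```

Keeping the original column next to the imputed one makes it easy to compare the two distributions afterwards.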
The data frame now has an additional Age_median column.
Now let's look at some key details of this transformation.
If you look carefully, you'll see that the standard deviation (std dev) of the original Age is higher than that of the transformed Age_median: filling many rows with the same central value shrinks the spread.
The graph of the two distributions confirms this. This is undesirable: the variance is distorted, information may be lost, and correlations with other features can be affected.
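The effect of median imputation on the spread can be checked directly by comparing standard deviations before and after. The sketch below uses synthetic, normally distributed ages (an assumption made so the example is self-contained), not the Titanic values.

```python
import numpy as np
import pandas as pd

# Synthetic ages, generated only for illustration.
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(30, 14, size=500))

# Knock out 20% of the values at random positions to simulate missing data.
missing_positions = rng.choice(500, size=100, replace=False)
age.iloc[missing_positions] = np.nan

age_median = age.fillna(age.median())

std_before = age.std()        # computed on the observed values only
std_after = age_median.std()  # computed after filling with the median
print(std_before, std_after)
```

Because every imputed row receives the same central value, the spread after imputation comes out smaller than before.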
2) Random Sampling Distribution:
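In this technique each missing value is replaced by a value drawn at random from the observed values of the same column. The following is a sketch under stated assumptions: the helper name `impute_random`, the fixed seed and the choice to sample with replacement are illustrative, and the column name follows the Age_random naming used in this post.

```python
import numpy as np
import pandas as pd


def impute_random(df, column, seed=0):
    """Add '<column>_random' with NaNs filled by randomly drawn observed values."""
    out = df.copy()
    missing = out[column].isnull()
    # Draw one observed value per missing row; sampling with replacement
    # also works when there are more missing rows than observed ones.
    sampled = out[column].dropna().sample(
        n=int(missing.sum()), replace=True, random_state=seed
    )
    # Re-label the draws so they line up with the rows that need filling.
    sampled.index = out.index[missing]
    out[column + "_random"] = out[column]
    out.loc[missing, column + "_random"] = sampled
    return out


# Toy ages with invented values.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan, 80.0]})
result = impute_random(toy, "Age")
print(result)
```

Fixing the seed keeps the imputed values reproducible from run to run.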
Now let's inspect the result.
Here the blue Age curve and the red Age_random curve overlap almost completely, so we can conclude that there is little or no distortion in the variance and that this transformation can be applied to our Age feature.
3) End of Sample Distribution:
Here we can see a right (positive) skew in the histogram of the Age distribution. It tells us that Age contains outliers, and that they sit at the extreme right of the plot.
Our box plot confirms this.
So let's pick an extreme value about three standard deviations to the right of the mean of Age. Since the outliers are on the right, we take the extreme value from the right tail and replace every NaN with it.
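In a box plot of the transformed column, fewer points tend to be flagged as outliers, because the imputed tail values widen the spread.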
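The replacement value described above is the mean plus three standard deviations. A minimal sketch follows; the helper name `impute_end_of_distribution`, the new column suffix and the toy ages are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def impute_end_of_distribution(df, column, n_std=3):
    """Fill NaNs with mean + n_std * std, a value in the right tail."""
    extreme = df[column].mean() + n_std * df[column].std()
    out = df.copy()
    # Hypothetical column name, chosen to mirror Age_median and Age_random.
    out[column + "_end_distribution"] = out[column].fillna(extreme)
    return out, extreme


# Toy ages with invented values.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan, 80.0]})
result, extreme = impute_end_of_distribution(toy, "Age")
print(extreme)
print(result)
```

Both the mean and the standard deviation are computed on the observed values only, since pandas skips NaN by default.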
























