Exploratory Data Analysis (EDA) on the Iris Dataset

Author: Bhaskar Kumar Das
LinkedIn: https://www.linkedin.com/in/bhaskar-kumar-das-64019a168/

Introduction:  

Have you ever wondered how Google can offer its world-class services at zero cost? Google offers Search, YouTube, Gmail, Drive, the Play Store, Maps, Classroom, Blogger and many more without taking a single penny from its users. Have you ever wondered how cab providers like Uber can offer rides at such a low cost? Are they mad? The answer is no. The one thing that allows them not only to offer valuable services but also to build hugely valuable businesses is data. When you agree to their terms of service before using their apps, you give them permission to use your data for their own purposes, and that data is used to work out the behaviour and trends of customers, which is what advertisers and business partners pay for. Now you know why Google and other major data companies are so good at predicting market trends: they have an abundance of varied data at their fingertips.

Someone has rightly said, "Data is the new oil", and it will be for centuries to come.

But the question now is how to use that data. Data can come in any format, structured or unstructured, and the latter is more common. On top of that, data usually contains plenty of impurities such as missing values, erroneous values and other unwanted content, so we have to process it before we can derive insights from it.

So we first need to explore and visualize the data in order to find any problems in it, and Exploratory Data Analysis is the technique for doing that.

Exploratory Data Analysis (EDA) is the process of visualizing and analyzing data to extract insights from it. In other words, EDA is the process of summarizing important characteristics of data in order to gain better understanding of the dataset.

In this blog, we are going to see some examples of EDA on the famous Iris dataset.


Problem Description:

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Dataset Source: https://www.kaggle.com/uciml/iris


1) Loading the dataset and the required libraries:

We have used Python libraries for our purpose.

Libraries used: pandas, NumPy, Matplotlib, seaborn and Plotly

We have used pandas to read our CSV file.

Note: we have to drop the Id column from our dataset, as it is only a row identifier and can hinder some of the operations that we'll perform on our data.
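The loading step described above might look like the following minimal sketch. It assumes the Kaggle file is named Iris.csv and has the columns Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm and Species; neither the file name nor the column names are stated in this post. So that the sketch runs without the download, a few rows of the data are inlined as a stand-in for the file.

```python
import io

import pandas as pd

# Stand-in for the downloaded file, using the assumed Kaggle column layout.
# With the real file this would be: iris = pd.read_csv("Iris.csv")
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""
iris = pd.read_csv(io.StringIO(csv_text))

# Id is only a row counter; drop it so it is never treated as a feature.
iris = iris.drop(columns=["Id"])

print(iris.head())
```

Dropping the column right after loading means that every later summary or plot sees only the four measurements and the species label.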


2) Looking at some statistical aspects of the data

We have used the describe() method to get the count, mean, standard deviation, minimum, 25th, 50th and 75th percentiles, and maximum of the four features. This falls under descriptive statistical analysis.
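As a small illustration of describe(), here is a sketch on a handful of rows from the dataset. The column names follow the Kaggle CSV layout and are an assumption, and the printed numbers describe only these nine rows, not the full 150.

```python
import pandas as pd

COLUMNS = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

# A few rows of the Iris data, inlined so the sketch runs on its own.
iris = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
        (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
        (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
        (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
        (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
        (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    ],
    columns=COLUMNS,
)

# describe() summarizes the numeric columns only; Species is skipped.
summary = iris.describe()
print(summary)

# Individual statistics can be read off by row label, e.g. the median.
print(summary.loc["50%", "PetalLengthCm"])
```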

3) Getting information about classes and features of our dataset
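One way to get this information is sketched below, again on a few inlined rows and with the Kaggle column names assumed. On the full dataset, value_counts() should report 50 rows for each of the three species, as stated in the problem description.

```python
import pandas as pd

COLUMNS = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

# A few rows of the Iris data, inlined so the sketch runs on its own.
iris = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
        (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
        (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
        (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
        (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
        (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    ],
    columns=COLUMNS,
)

print(iris.shape)             # (number of rows, number of columns)
print(iris.columns.tolist())  # feature names plus the class column

# How many samples belong to each class; equal counts mean a balanced dataset.
class_counts = iris["Species"].value_counts()
print(class_counts)

# Missing values per column, one of the impurities EDA should surface early.
missing = iris.isnull().sum()
print(missing)
```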


Now that we have gathered the basic information, it's time for the real work: some statistical analysis to uncover what is hidden in the data.

Our analysis is divided into 3 parts:

1) Univariate Analysis: analysis of a single variable at a time.

2) Bivariate Analysis: analysis of two variables together.

3) Multivariate Analysis: analysis of more than two variables together.


Univariate Analysis: 

It is mainly done to find out the distribution of the chosen random variable. A random variable can follow any distribution. To see what its distribution looks like, we usually plot its PDF (Probability Density Function), which has the possible values of the random variable on its X-axis and their probability density on its Y-axis. Values where the curve is high are more likely to occur.

Let's see some examples.

Here the random variable is petal length.

The bell-shaped curves are what we call the PDF. Just by looking at them, we can say that setosa differs from the other two species in terms of petal length.


In terms of petal width, all three species differ noticeably from one another.


In terms of sepal length, it is a bit difficult to distinguish between them, so sepal length is not a good predictor for species identification.

In terms of sepal width, there is very little difference between the three species, so we cannot make any distinction using this feature.

Let's look at the cumulative distribution function (CDF).

The CDF tells us what percentage of the data points lie at or below a given value.



In terms of petal length, virginica and versicolor show a much wider spread of data than setosa, as their CDFs rise more gradually than the CDF of setosa.
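The CDF can be computed by hand in a few lines of NumPy, which makes the link between spread and steepness concrete. This is a minimal sketch on the first five setosa and versicolor rows; the helper name empirical_cdf is made up for this illustration.

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values x and cumulative fractions y.

    y[i] = (i + 1) / n is the share of points up to and including
    the i-th smallest point.
    """
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Petal lengths (cm) of the first five rows of two species.
setosa = [1.4, 1.4, 1.3, 1.5, 1.4]
versicolor = [4.7, 4.5, 4.9, 4.0, 4.6]

xs, ys = empirical_cdf(setosa)
xv, yv = empirical_cdf(versicolor)

# Both curves climb from 1/n to 1, but over different ranges of x.
# The wider the range, the more gradually the CDF rises.
setosa_range = xs[-1] - xs[0]
versicolor_range = xv[-1] - xv[0]
print("setosa range:", setosa_range)
print("versicolor range:", versicolor_range)
```

Plotting y against x for each species (for example with Matplotlib's plot) gives the CDF curves themselves.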

Now let's use some more interesting plots.

We have drawn a box plot of the petal length of the three species.
Box plots are excellent tools for showing the 25th, 50th and 75th percentile values of the data. They also show outliers.
The lower edge of the box denotes the 25th percentile, the line in the middle the 50th percentile (the median) and the upper edge the 75th percentile. The ends of the whiskers denote the largest and smallest values that are not outliers (by default, values within 1.5 times the interquartile range of the box), and the points beyond the whiskers denote outliers.
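The quantities a box plot draws can be computed directly, which helps when reading one. The sketch below uses the petal lengths of the first ten setosa rows and applies the 1.5 x IQR rule that Matplotlib and seaborn use by default for whiskers and outliers.

```python
import numpy as np

# Petal lengths (cm) of the first ten setosa rows of the Iris data.
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# Edges of the box and the line inside it.
q1, median, q3 = np.percentile(petal_length, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points further than 1.5 * IQR from the box edges count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_inside = (petal_length >= lower_fence) & (petal_length <= upper_fence)
inside = petal_length[is_inside]
outliers = petal_length[~is_inside]

print("box (25th, 50th, 75th):", q1, median, q3)
# Whiskers stop at the most extreme points that are still inside the fences.
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers)
```

In this small sample the value 1.7 falls above the upper fence, so a box plot would draw it as a separate point and end the upper whisker at 1.5.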


These are called violin plots. They are a great way to show all the features of a box plot along with the distribution; in fact they are a combination of a box plot and a density plot. The outline on either side of each violin shows the distribution of the data.
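A violin plot of petal length per species can be sketched with seaborn's violinplot. The column names are assumed from the Kaggle CSV and only four rows per species are inlined, so the shapes will be much rougher than with the full dataset.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Petal lengths (cm) of the first four rows of each species.
iris = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6],
        "Species": ["Iris-setosa"] * 4 + ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    }
)

fig, ax = plt.subplots()

# One violin per species: a small box plot inside, the density as the outline.
sns.violinplot(data=iris, x="Species", y="PetalLengthCm", ax=ax)
ax.set_title("Petal length by species")
# Call fig.savefig("petal_length_violin.png") or plt.show() to view the figure.
```

Replacing violinplot with boxplot, which takes the same arguments here, gives the plain box plot for comparison.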

Let's proceed to bivariate analysis.




Let's plot a 2-D scatter plot of sepal length against sepal width. We can easily notice that setosa is quite different from the other two species.
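A scatter plot of this kind, coloured by species, can be sketched with plain Matplotlib by plotting one group at a time. The column names are assumed from the Kaggle CSV, and only four rows per species are inlined.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Sepal measurements (cm) of the first four rows of each species.
iris = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 7.0, 6.4, 6.9, 5.5, 6.3, 5.8, 7.1, 6.3],
        "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.2, 3.2, 3.1, 2.3, 3.3, 2.7, 3.0, 2.9],
        "Species": ["Iris-setosa"] * 4 + ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    }
)

fig, ax = plt.subplots()

# One scatter call per species so each gets its own colour and legend entry.
for species, group in iris.groupby("Species"):
    ax.scatter(group["SepalLengthCm"], group["SepalWidthCm"], label=species)

ax.set_xlabel("SepalLengthCm")
ax.set_ylabel("SepalWidthCm")
ax.legend()
# Call fig.savefig("sepal_scatter.png") or plt.show() to view the figure.
```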

Let's plot a 3-D scatter plot of sepal length, sepal width and petal width.
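A 3-D scatter plot of these three features can be sketched with Matplotlib's 3-D axes. The post lists Plotly among its libraries, which may be what produced the original interactive figure, so the choice of Matplotlib here is an assumption made to keep the sketch short. The column names are assumed from the Kaggle CSV.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" on older Matplotlib)

# Measurements (cm) of the first four rows of each species.
iris = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 7.0, 6.4, 6.9, 5.5, 6.3, 5.8, 7.1, 6.3],
        "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.2, 3.2, 3.1, 2.3, 3.3, 2.7, 3.0, 2.9],
        "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
        "Species": ["Iris-setosa"] * 4 + ["Iris-versicolor"] * 4 + ["Iris-virginica"] * 4,
    }
)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# One scatter call per species, with the third feature on the z-axis.
for species, group in iris.groupby("Species"):
    ax.scatter(
        group["SepalLengthCm"],
        group["SepalWidthCm"],
        group["PetalWidthCm"],
        label=species,
    )

ax.set_xlabel("SepalLengthCm")
ax.set_ylabel("SepalWidthCm")
ax.set_zlabel("PetalWidthCm")
ax.legend()
# Call fig.savefig("iris_3d.png") or plt.show() to view the figure.
```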


Multivariate Analysis: 

We humans can't visualize anything beyond three dimensions, and even 3-D graphs are often difficult to plot and read. To get around this, we use pair plots.





Here we've plotted each feature against every other feature.
This is basically a scatter plot carried out for multiple pairs of features at the same time. In the third plot of the last row, between petal width and petal length, it is very much possible to distinguish all the species, so this pair of features is by far the best one to consider for prediction.




 
The three plots above are the joint plots for the three species.
A joint plot combines a scatter plot of two variables with a histogram of each. It can also display the data using a Kernel Density Estimate (KDE) or hexagonal bins, and we can draw a regression line through the scatter plot. Using the spearmanr function from SciPy, we can print the Spearman rank correlation between the two variables.
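To show what the Spearman correlation measures, the sketch below computes it with pandas alone: rank both variables, then take the ordinary (Pearson) correlation of the ranks. This is the definition of Spearman's coefficient and is a stand-in for calling SciPy's spearmanr, not the code used for the original plots. The column names are assumed from the Kaggle CSV, and the numbers describe only the nine inlined rows.

```python
import pandas as pd

# Petal measurements (cm) of the first three rows of each species.
iris = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
        "PetalWidthCm": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    }
)

length = iris["PetalLengthCm"]
width = iris["PetalWidthCm"]

# Pearson: strength of the linear relationship between the raw values.
pearson = length.corr(width)

# Spearman: Pearson correlation of the ranks (ties get their average rank),
# so it measures how consistently one variable increases with the other.
spearman = length.rank().corr(width.rank())

print("Pearson: ", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Because it works on ranks, the Spearman value is less affected by outliers and by relationships that are monotonic but not linear.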


The diagram above is the heatmap. It is one of the most heavily used ways to visualize the correlation between different features. It is mainly used in the feature selection stage, where we need to select those features which have a high correlation with the target variable. Here we can see that petal length and petal width have a high correlation of 0.96, and sepal length and petal length a correlation of 0.87.
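A correlation heatmap can be sketched by computing the correlation matrix with pandas and passing it to seaborn's heatmap. The column names are assumed from the Kaggle CSV. Only nine rows are inlined, so the values printed here will differ from the 0.96 and 0.87 quoted above, which come from the full dataset.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

COLUMNS = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

# A few rows of the Iris data, inlined so the sketch runs on its own.
iris = pd.DataFrame(
    [
        (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
        (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
        (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
        (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
        (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
        (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
        (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
        (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
        (7.1, 3.0, 5.9, 2.1, "Iris-virginica"),
    ],
    columns=COLUMNS,
)

# Pairwise Pearson correlation of the four numeric features.
corr = iris.drop(columns=["Species"]).corr()
print(corr.round(2))

fig, ax = plt.subplots()
# Fix the colour scale to [-1, 1] so colours mean the same in every plot.
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
# Call fig.savefig("iris_heatmap.png") or plt.show() to view the figure.
```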

NOTE: As this is a classification problem, the heatmap is not strictly needed here.
Here the confusion matrix, precision and recall will be more useful for evaluating the performance of a model built on this data.
