Exploratory Data Analysis (EDA) on the Iris Dataset
Introduction:
Have you ever wondered how Google can offer world-class services at zero cost? Google offers Search, YouTube, Gmail, Drive, the Play Store, Maps, Classroom, Blogger and many more without charging users a single penny. Have you ever wondered how cab providers like Uber can offer rides at such low prices? Are they mad? The answer is no. What allows them not only to offer valuable services but also to earn substantial revenue is data. When you agree to their terms of service before using their apps, you give them permission to use your data for their own purposes, and that data is used, for example in targeted advertising, to work out behavioural patterns and trends among customers. Now you know why Google and other major data companies are so good at predicting market trends: they have an abundance of varied data at their fingertips.
Someone has rightly said, "Data is the new oil", and it is likely to stay that way for a long time to come.
The question now is how to use that data. Data can come in any format, structured or unstructured, and the latter is more common. On top of that, raw data usually contains plenty of impurities such as missing values, erroneous values and other unwanted entries, so to derive insights from it we first have to process it.
So we first need to visualize the data in order to find any problems in it, and Exploratory Data Analysis is the technique for doing that.
Exploratory Data Analysis (EDA) is the process of visualizing and analyzing data to extract insights from it. In other words, EDA is the process of summarizing the important characteristics of the data in order to gain a better understanding of the dataset.
In this blog, we are going to see some examples of EDA on the famous Iris dataset.
Problem Description:
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Dataset Source: https://www.kaggle.com/uciml/iris
1) Loading the dataset and the required libraries:
Libraries used: pandas, NumPy, Matplotlib, seaborn and Plotly
Note: we drop the Id column from the dataset here, because it is only a row identifier and can get in the way of some of the operations we'll perform on the data.
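The loading step could look roughly like the sketch below. Treat it as an illustration only: the column names (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) are assumed to match the Kaggle Iris.csv file, and a six-row inline excerpt stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Six-row excerpt standing in for the Kaggle file (assumed layout of Iris.csv).
# With the real download you would call pd.read_csv("Iris.csv") instead.
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# The Id column is only a row counter, so drop it before any analysis.
df = df.drop(columns=["Id"])

print(df.head())
```

After this step the frame holds just the four measurements and the species label.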
2) Visualizing some statistical aspects of the data
We use the describe() method to get the count, mean, standard deviation, minimum, 25th, 50th and 75th percentiles, and maximum of the four features. This falls under descriptive statistical analysis.
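A minimal sketch of the describe() call is shown here. It again assumes the Kaggle column names and uses a six-row inline excerpt in place of the full 150-row file, so the numbers it prints are not the ones you would get on the complete dataset.

```python
import io

import pandas as pd

# Inline excerpt standing in for Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV)).drop(columns=["Id"])

# describe() summarises the numeric columns only, so Species is left out.
summary = df.describe()
print(summary)

# Individual statistics can be read back by row and column label.
print("Mean sepal length:", summary.loc["mean", "SepalLengthCm"])
```

Each row of the result is one statistic and each column is one feature, which makes it easy to compare the scale and spread of the four measurements.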
3) Getting information about classes and features of our dataset
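One way to gather this information is sketched below, using info(), value_counts() and isnull(). The column names are assumed to follow the Kaggle file, and the six-row inline excerpt is only a stand-in for the real data, where each species has 50 rows.

```python
import io

import pandas as pd

# Inline excerpt standing in for Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV)).drop(columns=["Id"])

# Column names, dtypes and non-null counts for every feature.
df.info()

# How many samples belong to each class.
class_counts = df["Species"].value_counts()
print(class_counts)

# Total number of missing values across the whole frame.
missing_total = int(df.isnull().sum().sum())
print("Missing values:", missing_total)
```

A balanced class count and zero missing values mean no resampling or imputation is needed before the analysis.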
Now that we have gathered the basic information, it's time for the real work: some actual statistical analysis to uncover what is hidden in the data.
Our analysis is divided into 3 main parts:
1) Univariate Analysis: analysis of a single variable.
2) Bivariate Analysis: analysis of the relationship between two variables.
3) Multivariate Analysis: analysis involving more than two variables.
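To make the three levels concrete, here is a small sketch with one numeric example of each. These analyses are usually drawn as plots (histograms, scatter plots, pair plots); the version below uses plain pandas summaries as a stand-in so that it runs without a plotting backend. The column names and the six-row inline excerpt are assumptions, as before.

```python
import io

import pandas as pd

# Inline excerpt standing in for Iris.csv (assumed column names).
SAMPLE_CSV = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV)).drop(columns=["Id"])
features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Univariate: look at one variable on its own.
petal_length_range = df["PetalLengthCm"].max() - df["PetalLengthCm"].min()
print("Petal length range:", petal_length_range)

# Bivariate: relate two variables, here via the Pearson correlation.
petal_corr = df["PetalLengthCm"].corr(df["PetalWidthCm"])
print("Petal length vs. width correlation:", petal_corr)

# Multivariate: consider all four features together, split by species.
species_means = df.groupby("Species")[features].mean()
print(species_means)
```

On the full dataset the same three calls apply unchanged; only the numbers differ.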






















