EDA on Haberman’s Survival Data Set
Exploratory Data Analysis on Haberman's Survival Dataset
Introduction:
The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Dataset Attribute Information:
1) Age of patient at time of operation (numerical)
2) Patient's year of operation (year - 1900, numerical)
3) Number of positive axillary nodes detected (numerical)
4) Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
Number of Instances: 306
Data Source: https://www.kaggle.com/gilsousa/habermans-survival-data-set
This blog continues from the previous one, where we discussed several EDA techniques on the Iris dataset. If you haven't read that blog yet, I would highly recommend visiting this link: https://bhaskar47899.blogspot.com/2020/08/exploratory-data-analysis-eda.html
In this blog, we will perform EDA on this dataset. We are given 4 attributes: 3 of them (age, year of operation and number of positive lymph nodes) constitute the feature matrix, and the target vector is the survival status (whether the patient survived 5 years or longer; yes is denoted by 1 and no by 2). So basically, this is a classification problem, where we have to determine the class. In this blog we will perform only EDA and not the other phases of the ML life cycle.
Some Medical Information about Lymph Nodes: Normal lymph nodes are tiny and can be hard to find, but when there’s infection, inflammation, or cancer, the nodes can get larger. Those near the body’s surface often get big enough to feel with your fingers, and some can even be seen. But if there are only a few cancer cells in a lymph node, it may look and feel normal. In that case, the doctor must check for cancer by removing all or part of the lymph node.
When a surgeon operates to remove a primary cancer, one or more of the nearby (regional) lymph nodes may be removed as well. Removal of one lymph node is called a biopsy. When many lymph nodes are removed, it’s called lymph node sampling or lymph node dissection. When cancer has spread to lymph nodes, there’s a higher risk that the cancer might come back after surgery. This information helps the doctor decide whether more treatment, like chemo or radiation, might be needed after surgery.
source: https://www.cancer.org/cancer/cancer-basics/lymph-nodes-and-cancer.html
So let's begin our analysis.
Libraries Used: Python with pandas, matplotlib, seaborn and numpy
Loading The Dataset:
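The loading step can be sketched roughly as below. This is a minimal sketch, assuming the CSV has no header row; the column names (age, year, nodes, status) and the helper name load_haberman are my own choices for illustration, and the four rows are illustrative stand-ins in the same layout, not the real file.

```python
import io

import pandas as pd

# Hypothetical column names; the raw file is assumed to have no header row.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(source):
    """Read the four-column CSV from a path or file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Illustrative rows in the same layout as the dataset, not the full data.
sample_csv = io.StringIO("30,64,1,1\n34,59,0,2\n38,69,4,2\n42,65,0,1\n")
haberman = load_haberman(sample_csv)

print(haberman.shape)
print(haberman.head())
```

With the downloaded file, the same helper would be called with the file path instead of the in-memory sample.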
The class labels are 1 and 2; we will convert them to 1 and 0.
1 will remain 1, and 2 will be converted to 0.
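One way to do this conversion is a simple mapping on the label column. The sketch below assumes the label column is called status (a name chosen here for illustration) and uses a few illustrative rows rather than the real data.

```python
import pandas as pd


def recode_status(df):
    """Return a copy where status 1 stays 1 and status 2 becomes 0."""
    out = df.copy()
    out["status"] = out["status"].map({1: 1, 2: 0})
    return out


# Illustrative rows only; "status" is an assumed column name.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42],
    "nodes": [1, 0, 4, 0],
    "status": [1, 2, 2, 1],
})
recoded = recode_status(sample)
print(recoded["status"].tolist())
```

Working on a copy keeps the original labels available in case we want to double check the mapping later.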
Not much of significance to notice here, right? Don't worry, it will come, so have patience.
Studying data science, especially data exploration and feature engineering, needs lots and lots of patience and anger management. 😅😓
Let's find out the count of each class label.
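Counting the labels can be sketched with value_counts, which also gives the share of each class when normalize=True is passed. The column name status and the sample values are assumptions for illustration only.

```python
import pandas as pd


def class_balance(df, label="status"):
    """Count and share of each class label."""
    counts = df[label].value_counts()
    share = df[label].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": share})


# Illustrative labels after recoding: 1 = survived, 0 = did not survive.
sample = pd.DataFrame({"status": [1, 0, 0, 1, 0, 1, 1, 1]})
balance = class_balance(sample)
print(balance)
```

Looking at the shares tells us right away whether the dataset is balanced or not, which matters when we read the later plots.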
Let's have a look at the scatter plot:
Well, it's not that surprising. For binary classification, the graph would surely look like this: for y, we have only two possible values, 0 or 1.
Now let's proceed to the really interesting part; from here on we'll gain an actual view of what's going on.
Let's divide the dataset into two dataframes: one will contain the patients who survived and the other the patients who didn't.
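The split can be sketched with two boolean filters on the recoded label. The names split_by_status, survived and not_survived are hypothetical, and the rows are illustrative, not taken from the real file.

```python
import pandas as pd


def split_by_status(df):
    """Return (survived, not_survived) dataframes based on the 1/0 label."""
    survived = df[df["status"] == 1]
    not_survived = df[df["status"] == 0]
    return survived, not_survived


# Illustrative rows; status is assumed to be recoded to 1 / 0 already.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42, 47, 52, 61, 70],
    "year": [64, 59, 69, 65, 63, 62, 65, 58],
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})
survived, not_survived = split_by_status(sample)

# describe() gives count, mean, std, min, quartiles and max per column.
print(survived.describe())
print(not_survived.describe())
```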
Now if you look at the descriptive statistics of the two dataframes, you will surely be surprised by some key information.
Observation 1:
Although the mean age of patients who survived is almost equal to the mean age of patients who didn't, the mean number of nodes differs by about 5 between patients who survived 5 years or more and patients who didn't.
So we can conclude that, on average, patients who survived 5 years or more had fewer positive nodes than patients who didn't survive.
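The comparison of means in this observation can be reproduced with a groupby. This is only a sketch on illustrative rows with assumed column names, so the numbers it prints are not the ones from the real dataset.

```python
import pandas as pd

# Illustrative rows; 1 = survived, 0 = did not survive.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42, 47, 52, 61, 70],
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})

# Mean age and mean node count for each class.
means = sample.groupby("status")[["age", "nodes"]].mean()

# Gap in mean node count between the two classes.
node_gap = means.loc[0, "nodes"] - means.loc[1, "nodes"]

print(means)
print(node_gap)
```

Keep in mind that the mean is sensitive to a few patients with very many nodes, so it is worth reading it together with the median.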
Let's do some univariate analysis:
Observation 2:
1) Major overlap is observed between the two classes, which tells us that age alone cannot separate survivors from non-survivors.
2) Patients aged 30-40 have a higher chance of survival, patients aged 40-60 have a lower chance of survival, and patients aged 60-75 have a roughly equal chance of surviving and not surviving.
3) To conclude, we cannot decide the chance of survival from age alone.
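The age-band reading above can also be checked numerically instead of by eye. Below is a rough sketch that bins age with pd.cut and computes the survival rate per band; the bin edges, band labels and column names are assumptions chosen for illustration, and the rows are not the real data.

```python
import pandas as pd


def survival_rate_by_age_band(df, edges, labels):
    """Patient count and survival rate (mean of the 1/0 label) per age band."""
    bands = pd.cut(df["age"], bins=edges, labels=labels)
    return df.groupby(bands, observed=True)["status"].agg(["count", "mean"])


# Illustrative rows; 1 = survived, 0 = did not survive.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42, 47, 52, 61, 70],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})

# pd.cut intervals are right-inclusive: (29, 40], (40, 60], (60, 85].
rates = survival_rate_by_age_band(
    sample, edges=[29, 40, 60, 85], labels=["30-40", "41-60", "61+"]
)
print(rates)
```

Reporting the count next to the rate helps, because a rate computed from only a handful of patients should not be trusted much.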
Observation 3:
1) There is major overlap here too, so the year of operation can't be taken as a deciding factor.
2) Patients operated on in 1960 and 1965 show more unsuccessful outcomes.
This is the PDF of nodes.
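A per-class density plot of this kind can be sketched with plain matplotlib histograms. This is not necessarily how the plot in this blog was produced; the function name, column names and sample rows are assumptions, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_density_by_class(df, column="nodes", bins=10):
    """Overlay normalized histograms of one feature for both classes."""
    fig, ax = plt.subplots()
    for status, label in [(1, "survived"), (0, "did not survive")]:
        values = df.loc[df["status"] == status, column]
        # density=True scales each histogram so its area is 1.
        ax.hist(values, bins=bins, density=True, alpha=0.5, label=label)
    ax.set_xlabel(column)
    ax.set_ylabel("density")
    ax.legend()
    return ax


# Illustrative rows; 1 = survived, 0 = did not survive.
sample = pd.DataFrame({
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})
ax = plot_density_by_class(sample)
ax.figure.savefig("nodes_density.png")
```

The more the two histograms overlap, the less that feature helps on its own in separating the classes.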
And let our violins play a symphony 🎵🎵
Joint plot between year and age
Observation 4:
Patients with 0 or 1 nodes have a high chance of survival, while patients with more than 15-20 nodes have a very low chance of survival.
The number of nodes is by far the most important feature for our analysis.
So let's plot the CDF of nodes.
Reading the CDF, about 83% of the patients who survived had between 0 and 4 nodes.
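A CDF value read off at 4 nodes is the share of a group that has at most 4 nodes, which is a different quantity from the survival rate among patients with at most 4 nodes. The sketch below computes both on illustrative rows so the two are not mixed up; the function and column names are assumptions, and the printed numbers are not those of the real dataset.

```python
import numpy as np
import pandas as pd


def empirical_cdf(values):
    """Sorted values and the cumulative share of observations up to each one."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Illustrative rows; 1 = survived, 0 = did not survive.
sample = pd.DataFrame({
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})
survivor_nodes = sample.loc[sample["status"] == 1, "nodes"]
x, y = empirical_cdf(survivor_nodes)

# Share of survivors with at most 4 nodes (what the survivors' CDF shows).
share_survivors_low = float((survivor_nodes <= 4).mean())

# Survival rate among all patients with at most 4 nodes (a different question).
survival_rate_low = float(sample.loc[sample["nodes"] <= 4, "status"].mean())

print(x, y)
print(share_survivors_low, survival_rate_low)
```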
Let's plot some box plots.
Observation 6:
1) A large number of patients with 0 nodes survived, but a small number of patients with 0 nodes also died, so survival cannot be guaranteed even with 0 nodes.
2) A large number of the operations performed in 1965 ended in death.
3) More patients in the 40-65 age group did not survive.
4) Patients below 40 have a better chance of survival.
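The numbers a box plot draws (the quartiles and the median) can also be printed directly, which makes observations like these easier to verify. A small sketch, assuming the same hypothetical column names as before and illustrative rows:

```python
import pandas as pd


def box_stats(df, column):
    """25th, 50th and 75th percentiles of one feature for each class."""
    return df.groupby("status")[column].quantile([0.25, 0.5, 0.75]).unstack()


# Illustrative rows; 1 = survived, 0 = did not survive.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42, 47, 52, 61, 70],
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
    "status": [1, 0, 0, 1, 0, 1, 1, 1],
})
node_stats = box_stats(sample, "nodes")
print(node_stats)
```

Each row of the result is one class, and the three columns are the bottom edge, middle line and top edge of that class's box.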
Let's look at some serious yet interesting joint plots.
Observation 7:
From 1960 to 1965, more operations were performed on patients aged 45 to 60.
Joint plot between age and nodes.
Observation 8:
Patients aged 40 to 65 more commonly have more than 0 nodes.
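What a joint plot shows visually can be cross-checked with a simple contingency table. The sketch below counts patients with and without positive nodes in each age band; the band edges, labels and column names are assumptions picked for illustration, and the rows are not the real data.

```python
import pandas as pd

# Illustrative rows only.
sample = pd.DataFrame({
    "age": [30, 34, 38, 42, 47, 52, 61, 70],
    "nodes": [1, 0, 4, 0, 23, 3, 0, 8],
})

# Right-inclusive bands: (29, 39], (39, 65], (65, 85].
bands = pd.cut(
    sample["age"], bins=[29, 39, 65, 85],
    labels=["under 40", "40-65", "over 65"],
)

# Flag whether a patient has at least one positive node.
has_nodes = (sample["nodes"] > 0).map({True: "nodes > 0", False: "nodes = 0"})
has_nodes = has_nodes.rename("positive_nodes")

table = pd.crosstab(bands, has_nodes)
print(table)
```

Since the age bands hold different numbers of patients, comparing row-wise shares (for example with normalize="index" in pd.crosstab) is fairer than comparing raw counts.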
Final Conclusion:
1) Age and year are not deciding factors, but patients below 40 are more likely to survive.
2) The number of nodes is the most important feature: the higher the number of nodes, the lower the chance of survival.
References:
1)https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
Haberman, S. J. (1976). Generalized Residuals for Log-Linear Models, Proceedings of the 9th International Biometrics Conference, Boston, pp. 104-122.
Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.
A kind Request to all Readers:
From kindergarten to people pursuing PhDs in esteemed institutions, let's be honest, we all use Wikipedia to quench our thirst for knowledge. We all know it is run by a non-profit organization and depends solely on donations. It hurts me a lot every time I visit their page and the request message pops up. Yes, I do agree that they still have a handful of cash, but it's our responsibility too to help them in their need.
I have donated a very small amount of 110 to them and request you to do the same.
And please follow me for more Data Science related blogs.
God Bless You and Thank You 😊















