Kaggle competition to create a model that predicts which passengers survived the sinking of the Titanic.
Programming Language: R
The Titanic, the largest ship afloat at the time, struck an iceberg on April 14, 1912 with 2,208 people on board and sank in the early hours of April 15. The sinking killed 1,496 people, making it one of the deadliest maritime disasters.
Packages used
MASS -> Modern Applied Statistics with S
mlogit -> Estimation of the multinomial logit models in R
sqldf -> Manipulate R data frames using SQL
ggplot2 -> Visualization
Hmisc -> Harrell Miscellaneous. Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation
dplyr -> Data manipulation tasks
HH -> Statistical Analysis and Data Display: Heiberger and Holland
gmodels -> various R programming tools for Model Fitting
rms -> a collection of functions that assist with and streamline modeling.
pROC -> For Visualizing area under the curve (AUC)
fastDummies -> to create dummies automatically
EXPLORATORY DATA ANALYSIS
Variables :
'PassengerId' : Id of each passenger
'Survived' : Survival indicator ( 0 = did not survive , 1 = survived )
'Pclass' : Ticket Class ( 1: 1st , 2: 2nd , 3: 3rd)
'Name' : Name of the passenger
'Sex': Sex
'Age': Age in Years
'SibSp': number of siblings / spouses aboard the Titanic
'Parch' : number of parents / children aboard the Titanic
'Ticket' : Ticket number
'Fare' : Passenger fare
'Cabin' : Cabin number
'Embarked' : Port of Embarkation ( C = Cherbourg, Q = Queenstown, S = Southampton )
Number of observations: 891
Number of variables: 12
Univariable Analysis
Categorical
Survived : 0 or 1
0 means the passenger did not survive, while 1 means the passenger survived
Pclass : ticket class on the ship.
Most passengers travelled in 3rd class
Sex : there are more men than women in the data
SibSp : most passengers had no siblings or spouses aboard
Parch : a few passengers travelled with 1 or 2 parents or children; most had none
Age : most passengers were between 20 and 40 years old
Fare (continuous variable) : about 90% of passengers paid less than 100, and some paid nothing
Embarked : more than 600 passengers embarked at Southampton
Cabin : the cabin is not recorded for most passengers
Bivariable Analysis
Continuous vs. Categorical
Fare:
Passengers who paid a fare of around 50 were more likely to survive
Categorical
Note: this analysis already includes the new variables created (ad_ch, f_size_desc, family_size).
We can see that children, women and 1st-class passengers are more likely to survive
DATA CLEANING
Missing Values : Age : 177 , Cabin : 687 , Embarked : 2
Treatment :
Embarked : missing values assigned to category "C"
Cabin : missing values assigned to a new category "Unknow"
Age : missing values assigned to a new category "Unknow"
Treatment of the Cabin variable:
Remove the numbers from Cabin and keep only the letters
Remove PassengerId , Name , Ticket and Cabin, keeping only the new Cabin variable ( the removed variables are not important in this model )
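The cleaning steps above can be sketched in base R as follows. This is only an illustration on a few made-up rows, not the original notebook code: the column name `cabin_letter` is hypothetical, and keeping just the first letter of a multi-cabin entry is an assumption.

```r
# Toy rows standing in for the Kaggle training file (illustrative values only).
df <- data.frame(
  PassengerId = 1:4,
  Name     = c("Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
               "Heikkinen, Miss. Laina", "Icard, Miss. Amelie"),
  Ticket   = c("A/5 21171", "PC 17599", "STON/O2. 3101282", "113572"),
  Age      = c(22, 38, NA, 38),
  Cabin    = c("", "C85", NA, "B28"),
  Embarked = c("S", "C", "S", NA),
  stringsAsFactors = FALSE
)

# Count missing values per column, treating empty strings as missing.
n_missing <- sapply(df, function(x) sum(is.na(x) | x == ""))

# Embarked: missing values go to category "C" (as stated in the text).
df$Embarked[is.na(df$Embarked) | df$Embarked == ""] <- "C"

# Cabin: missing values go to the new category "Unknow".
df$Cabin[is.na(df$Cabin) | df$Cabin == ""] <- "Unknow"

# Keep only letters from Cabin. Assumption: use the first letter only,
# so an entry such as "C23 C25" would become "C".
letters_only    <- gsub("[^A-Za-z]", "", df$Cabin)
df$cabin_letter <- ifelse(df$Cabin == "Unknow", "Unknow",
                          substr(letters_only, 1, 1))

# Drop the identifier-like columns and the raw Cabin column.
df <- df[, !(names(df) %in% c("PassengerId", "Name", "Ticket", "Cabin"))]
print(df)
```

In practice the Name column would be dropped only after the title has been extracted from it in the feature engineering step.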
FEATURE ENGINEERING
Creating new variables
Family Size : family size equal to 1 : "Single" , less than or equal to 5 : "Small" , more than 6 : "Large"
Child or Adult by age : less than 18 : "Child" , 18 or more : "Adult" , age not identified : "Unknow"
Passenger Title : extract the title from each name, e.g. Mrs|Master|Mr|Miss|Dr|Col|Rev|Mlle
Mother or not : derived from Age, Parch and Title
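A minimal sketch of these derived variables, on made-up rows. `family_size`, `f_size_desc` and `ad_ch` are the names mentioned in the bivariable analysis note; `title` and `mother` are hypothetical names. Several details are assumptions: family size is taken as SibSp + Parch + 1, a family of exactly 6 is placed in "Large" because the text does not say where it belongs, and the mother rule (adult woman, Parch > 0, title other than "Miss") is one common choice rather than the confirmed original rule.

```r
df <- data.frame(
  Name  = c("Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Palsson, Master. Gosta Leonard",
            "Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",
            "Moran, Mr. James"),
  Sex   = c("male", "female", "male", "female", "male"),
  Age   = c(22, 38, 2, 38, NA),
  SibSp = c(1, 1, 3, 1, 0),
  Parch = c(0, 0, 1, 5, 0),
  stringsAsFactors = FALSE
)

# Assumption: the passenger plus siblings/spouses plus parents/children.
df$family_size <- df$SibSp + df$Parch + 1
df$f_size_desc <- ifelse(df$family_size == 1, "Single",
                  ifelse(df$family_size <= 5, "Small", "Large"))

# Child / Adult / Unknow by age.
df$ad_ch <- ifelse(is.na(df$Age), "Unknow",
            ifelse(df$Age < 18, "Child", "Adult"))

# Title: the text between the comma and the first period,
# assuming names follow the pattern "Surname, Title. Given names".
df$title <- sub("^.*, *([^.]+)\\..*$", "\\1", df$Name)

# Mother flag (assumed rule, see lead-in).
is_mother <- df$Sex == "female" & !is.na(df$Age) & df$Age >= 18 &
             df$Parch > 0 & df$title != "Miss"
df$mother <- ifelse(is_mother, "Yes", "No")

print(df[, c("family_size", "f_size_desc", "ad_ch", "title", "mother")])
```

Anchoring the title on the comma and period avoids matching a title-like fragment inside a surname, which a bare alternation such as `Mrs|Mr|Dr` could do for a surname beginning with "Dr".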
Most passengers on the ship travelled alone ("Single")
There are more adults than children on the ship
"Mr" is the most common title on the ship
Most women are not mothers
Dummy Variables : created with the fastDummies library
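As a rough illustration of what dummy encoding produces, the sketch below uses base R's `model.matrix` on made-up rows instead of fastDummies, so it runs without extra packages. Dropping the first level of each factor is an assumption here; it is a common way to avoid perfectly collinear columns.

```r
df <- data.frame(
  Sex      = factor(c("male", "female", "female")),
  Embarked = factor(c("S", "C", "Q"))
)

# One 0/1 column per factor level, minus the first level of each factor
# (female and C act as the reference levels). The intercept column is dropped.
dummies <- model.matrix(~ Sex + Embarked, data = df)[, -1, drop = FALSE]
print(dummies)
```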
VIF : Variance Inflation Factor
Measures correlation among the independent variables ( multicollinearity ). Threshold used : VIF less than 3
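The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data; `vif_manual` is a hypothetical helper, not a function from the HH or rms packages.

```r
# VIF for each column of X: regress it on the remaining columns.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(200)                 # independent of the others

v <- vif_manual(data.frame(x1, x2, x3))
print(round(v, 2))

# Predictors that pass the "less than 3" threshold from the text.
print(names(v)[v < 3])
```

When several predictors exceed the threshold, a common practice is to remove them one at a time, starting with the highest VIF, and recompute after each removal.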
MODEL
Split the data into Train and Test sets and fit the model on Train
Best model : 636.1
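A minimal sketch of the split-and-fit step. It uses R's built-in `Titanic` contingency table (Class, Sex, Age, Survived) rather than the Kaggle file, so the variables and numbers differ from this project. The 80/20 split is an assumption, chosen because the confusion matrix counts reported later (713 train rows, 178 test rows) suggest roughly that proportion.

```r
# Expand the built-in count table to one row per person (not the Kaggle data).
tab  <- as.data.frame(Titanic)
full <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]
full$Survived <- as.integer(full$Survived == "Yes")

# Random 80/20 split (assumed proportion).
set.seed(123)
train_idx <- sample(nrow(full), size = floor(0.8 * nrow(full)))
train <- full[train_idx, ]
test  <- full[-train_idx, ]

# Logistic regression fitted on the training rows only.
fit <- glm(Survived ~ Class + Sex + Age, data = train, family = binomial)
print(summary(fit)$coefficients)

# AIC is one criterion for comparing candidate models (lower is better).
print(AIC(fit))

# Predicted survival probabilities for the held-out rows.
test$prob <- predict(fit, newdata = test, type = "response")
```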
MODEL VALIDATION
Pseudo R-squared (McFadden's R-squared)
0.357
Note: a value in the range 0.2 to 0.4 is considered to indicate a good model fit
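McFadden's pseudo R-squared compares the log-likelihood of the fitted model with that of an intercept-only model: 1 - logLik(model) / logLik(null). The sketch below shows the calculation on R's built-in `Titanic` table, which is not the Kaggle data, so the resulting value is not expected to match 0.357.

```r
tab  <- as.data.frame(Titanic)
full <- tab[rep(seq_len(nrow(tab)), tab$Freq), ]
full$Survived <- as.integer(full$Survived == "Yes")

fit  <- glm(Survived ~ Class + Sex + Age, data = full, family = binomial)
null <- glm(Survived ~ 1, data = full, family = binomial)

# McFadden's pseudo R-squared.
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
print(mcfadden)
```

For a 0/1 response the same quantity can be read from the fitted object as `1 - fit$deviance / fit$null.deviance`.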
Concordance
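Compares every pair made of one 1 and one 0 in the dependent variable; a pair is concordant when the model gives the 1 the higher predicted probability
87%
Note: any concordance above 80% is considered good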
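Concordance can be computed directly by comparing every (1, 0) pair. The sketch below is a hypothetical helper on made-up values, shown only to make the definition concrete.

```r
# Share of (actual 1, actual 0) pairs that are concordant, discordant or tied.
concordance <- function(y, p) {
  p1 <- p[y == 1]
  p0 <- p[y == 0]
  d  <- outer(p1, p0, "-")  # one cell per (1, 0) pair
  c(concordant = mean(d > 0),
    discordant = mean(d < 0),
    tied       = mean(d == 0))
}

y <- c(1, 1, 1, 0, 0)
p <- c(0.9, 0.6, 0.3, 0.4, 0.3)
res <- concordance(y, p)
print(res)
```

Here 4 of the 6 pairs are concordant, 1 is discordant and 1 is tied. `outer` builds an n1 by n0 matrix, which is fine at the size of this dataset but grows quickly on large ones.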
Accuracy
Accuracy is computed on both the Train and Test data
ROC curve
Area under the curve : 0.8611
Note: the closer it is to 1, the better
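The area under the ROC curve can also be obtained without drawing the curve, from the ranks of the predicted probabilities (the Mann-Whitney formulation). The sketch below uses a hypothetical helper and made-up values; the pROC package listed above provides `roc()` and `auc()` for the same purpose.

```r
# AUC from ranks: the probability that a random positive is scored
# higher than a random negative (ties get average ranks, so count as half).
auc_rank <- function(y, p) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
a <- auc_rank(y, p)
print(a)  # 0.75
```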
Confusion Matrix
Train Data
True Positive : 351
True Negative : 230
False Positive : 83
False Negative : 49
Formula: (TP + TN) / (TP + TN + FP + FN)
Accuracy : 81% of predictions correct
Test Data
True Positive : 91
True Negative : 52
False Positive : 24
False Negative : 11
Formula: (TP + TN) / (TP + TN + FP + FN)
Accuracy : 80% of predictions correct
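The two accuracy figures can be reproduced from the confusion matrix counts listed above. `accuracy` is a hypothetical helper name.

```r
# Accuracy = correct predictions / all predictions.
accuracy <- function(tp, tn, fp, fn) (tp + tn) / (tp + tn + fp + fn)

acc_train <- accuracy(tp = 351, tn = 230, fp = 83, fn = 49)  # 581 / 713
acc_test  <- accuracy(tp = 91,  tn = 52,  fp = 24, fn = 11)  # 143 / 178

print(round(c(train = acc_train, test = acc_test), 3))
```

This gives about 0.815 on Train and 0.803 on Test, consistent with the 81% and 80% reported.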
Train Data Only
Precision : ratio of correctly predicted positive observations to all predicted positive observations
Formula: TP / (TP + FP)
Of all the passengers predicted to survive, how many actually survived? 0.808
Recall : ratio of correctly predicted positive observations to all observations in the actual positive class (yes)
Formula: TP / (TP + FN)
Of all the passengers who actually survived, how many were identified? 0.877
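Specificity : ratio of correctly predicted negative observations to all observations in the actual negative class
Formula: TN / (TN + FP)
Of all the passengers in the actual negative class, how many did we correctly predict? 0.824
Note: with the Train counts above, TN / (TN + FP) = 230 / 313 = 0.735; the value 0.824 equals TN / (TN + FN) = 230 / 279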
F1 - SCORE : harmonic mean of precision and recall
Formula: 2 * (Recall * Precision) / (Recall + Precision)
Train : 0.78
Test (calculated automatically) : 0.75
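The metrics above can be recomputed from the Train confusion matrix counts. The text does not state which class was treated as positive, so this sketch (with a hypothetical helper `metrics`) evaluates both orientations. With the 351 cell as the positive class, F1 comes out near 0.84; with the 230 cell as the positive class, it comes out near 0.78, which matches the reported Train value.

```r
metrics <- function(tp, tn, fp, fn) {
  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1          <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall,
    specificity = specificity, f1 = f1)
}

# Orientation as listed in the text: TP = 351, TN = 230.
train <- metrics(tp = 351, tn = 230, fp = 83, fn = 49)

# Same table with the other class treated as positive:
# TP and TN swap, and FP and FN swap.
train_swapped <- metrics(tp = 230, tn = 351, fp = 49, fn = 83)

print(round(rbind(train, train_swapped), 3))
```

The same call with the Test counts in the swapped orientation (tp = 52, tn = 91, fp = 11, fn = 24) gives an F1 of about 0.75.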
LIFT CHART
The lift chart compares the model's predictions with random selection
Lift on Test Data : 41%
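The text does not say how the 41% figure was computed, so the following is only one common definition of lift: the survival rate among the highest-scored fraction of passengers, divided by the overall survival rate. `lift_at` and the data are made up for illustration.

```r
# Lift in the top `frac` of observations ranked by predicted probability.
lift_at <- function(y, p, frac = 0.1) {
  n_top <- max(1, round(frac * length(y)))
  top   <- order(p, decreasing = TRUE)[seq_len(n_top)]
  mean(y[top]) / mean(y)
}

y <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
p <- c(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)

l20 <- lift_at(y, p, frac = 0.2)
print(l20)  # 2.5
```

A lift of 2.5 means the top 20% contains 2.5 times as many survivors as a random 20% sample would on average. A lift of 1 means the model is no better than random selection.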
CONCLUSION
We can conclude that the significant variables in the model include Sex_female, SibSp_4 and Cabins ( A, B, C, D and E ), among others. We will justify only the variables whose p-value is flagged with more than 2 stars "*".
Sex_female : We know that women and children had priority in boarding the lifeboats.
SibSp : Families with 4 children were given preference in boarding the lifeboats.
Cabins : Below we can see the illustration of the cabins (decks) on the ship.
First Class : A,B,C,D and E
Second Class: D, E and F
Third Class: D,E,F and G
The model includes cabins A, B, C, D and E. Passengers in these cabins had a better chance of surviving, which makes sense given the illustration: they were close to the lifeboats.
Thank You