
Titanic - Machine Learning from Disaster

Lila Guimarães Reis

Updated: Sep 18, 2021

Kaggle competition to create a model that predicts which passengers survived the sinking of the Titanic.

Programming language: R




The Titanic, the largest ship of its time, hit an iceberg on April 14, 1912 with 2,208 people on board and sank, resulting in the death of 1,496 people and making it one of the deadliest maritime disasters.

 

Packages used

MASS -> Modern Applied Statistics with S

mlogit -> Estimation of the multinomial logit models in R

sqldf -> Manipulate R data frames using SQL

ggplot2 -> Visualization

Hmisc -> Harrell Miscellaneous. Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation

dplyr -> Data manipulation tasks

HH -> Statistical Analysis and Data Display: Heiberger and Holland

gmodels -> various R programming tools for Model Fitting

rms -> a collection of functions that assist with and streamline modeling.

pROC -> For Visualizing area under the curve (AUC)

fastDummies -> to create dummies automatically

 

EXPLORATORY DATA ANALYSIS


Variables :


'PassengerId' : Id of each passenger

'Survived' : whether the passenger survived (0 = No, 1 = Yes)

'Pclass' : Ticket Class ( 1: 1st , 2: 2nd , 3: 3rd)

'Name' : Name of the passenger

'Sex': Sex

'Age': Age in Years

'SibSp': number of siblings / spouses aboard the Titanic

'Parch' : number of parents / children aboard the Titanic

'Ticket' : Ticket number

'Fare' : Passenger fare

'Cabin' : Cabin number

'Embarked' : Port of Embarkation ( C = Cherbourg, Q = Queenstown, S = Southampton )


Number of observations: 891

Number of variables: 12


Univariate Analysis

Categorical


Survived : 0 or 1


0 means the passenger did not survive, while 1 means the passenger survived







Class type on the ship.


3rd class has the most passengers







There are more males than females in the data








Most people on the ship had no siblings or spouses aboard








Few passengers travelled with 1 or 2 parents or children; most had none









Most passengers were between 20 and 40 years old








About 90% of the people on the ship paid less than 100 dollars, and some paid nothing (Fare is a continuous variable)







More than 600 people embarked at Southampton








Most passengers have no cabin recorded






 

Bivariate Analysis

Continuous vs. Categorical




Fare:

People who paid around 50 dollars are likely to survive




Categorical


Note: this analysis already includes the new variables created (ad_ch, f_size_desc, family_size).

We can see that children, women and 1st class passengers are more likely to survive

 

DATA CLEANING


Missing Values : Age : 177 , Cabin : 687 , Embarked : 2

Treatment :

  • Embarked: assign missing values to category "C"

  • Cabin: create a new category "Unknow"

  • Age: create a new category "Unknow"
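The treatment above could be sketched in base R as follows. This is a minimal illustration on made-up rows, not the author's code: the data frame `titanic`, the `age_missing` flag, and the assumption that missing Cabin and Embarked values arrive as empty strings are all hypothetical.

```r
# Toy rows standing in for the Kaggle training file (values are made up).
titanic <- data.frame(
  Age      = c(22, NA, 38, NA),
  Cabin    = c("", "C85", "", "E46"),
  Embarked = c("S", "", "C", "Q"),
  stringsAsFactors = FALSE
)

# Assumption: missing text fields were read as empty strings, so convert them to NA.
titanic$Cabin[titanic$Cabin == ""]       <- NA
titanic$Embarked[titanic$Embarked == ""] <- NA

# Count missing values per column.
missing_counts <- colSums(is.na(titanic))

# Embarked: move missing values to category "C".
titanic$Embarked[is.na(titanic$Embarked)] <- "C"

# Cabin: missing values become the category "Unknow" (label as spelled in the post).
titanic$Cabin[is.na(titanic$Cabin)] <- "Unknow"

# Age: keep the numeric column and record missingness in a hypothetical flag;
# the "Unknow" label can be applied once Age is turned into a category.
titanic$age_missing <- is.na(titanic$Age)
```

On the real data, `missing_counts` is the kind of summary that would produce the 177 / 687 / 2 figures reported above.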

Treatment in Cabin variable:

Remove the numbers from Cabin and keep only the letters.


Remove PassengerId, Name, Ticket and the original Cabin, keeping only the new Cabin variable (the removed variables are not important for this model).
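A possible way to derive the cabin letter and drop the unused columns is sketched below. The column name `cabin_deck` and the choice to keep only the first letter when a passenger has several cabins are assumptions made for the example.

```r
# Toy rows (made up); "Unknow" marks passengers without a recorded cabin.
titanic <- data.frame(
  PassengerId = 1:4,
  Name   = c("A", "B", "C", "D"),
  Ticket = c("t1", "t2", "t3", "t4"),
  Cabin  = c("C85", "Unknow", "B57 B59", "E46"),
  stringsAsFactors = FALSE
)

# Strip digits and spaces so only letters remain, e.g. "B57 B59" -> "BB".
letters_only <- gsub("[0-9 ]", "", titanic$Cabin)

# Assumption: keep the first letter as the deck; leave "Unknow" untouched.
titanic$cabin_deck <- ifelse(titanic$Cabin == "Unknow",
                             "Unknow",
                             substr(letters_only, 1, 1))

# Drop the identifier-like columns and the raw Cabin column.
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")
titanic <- titanic[, !(names(titanic) %in% drop_cols), drop = FALSE]
```

Because the passenger title is extracted from Name during feature engineering, Name would need to be dropped only after that step.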

 

FEATURE ENGINEERING

Creating new variables


  • Family Size : family size equal to 1 "Single", less than or equal to 5 "Small", more than 6 "Large"

  • Child or Adult, classified by age : less than 18 "Child", 18 or more "Adult", and "Unknow" when the age is not identified

  • Passenger Title : extract the title from each name, e.g. Mrs|Master|Mr|Miss|Dr|Col|Rev|Mlle

  • Mother or not : derived from Age, Parch and Title.
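The four derived variables could look roughly like this in base R. The names `family_size`, `f_size_desc` and `ad_ch` come from the post; everything else is an assumption for illustration: that family size is `SibSp + Parch + 1`, that every size above 5 counts as "Large", the regular expression for the title, the `mother` column name, and the rule that a mother is an adult titled "Mrs" with `Parch > 0`.

```r
# Toy rows with invented names and values.
titanic <- data.frame(
  Name  = c("Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Master. Tom", "Poe, Miss. Ann"),
  Age   = c(22, 38, 2, NA),
  SibSp = c(1, 1, 3, 0),
  Parch = c(0, 2, 4, 0),
  stringsAsFactors = FALSE
)

# Family size: the passenger plus siblings/spouses plus parents/children (assumed).
titanic$family_size <- titanic$SibSp + titanic$Parch + 1
titanic$f_size_desc <- ifelse(titanic$family_size == 1, "Single",
                       ifelse(titanic$family_size <= 5, "Small", "Large"))

# Child / Adult / Unknow by age.
titanic$ad_ch <- ifelse(is.na(titanic$Age), "Unknow",
                 ifelse(titanic$Age < 18, "Child", "Adult"))

# Title: the text between the comma and the first period of the name.
titanic$title <- sub("^.*, *([^.]+)\\..*$", "\\1", titanic$Name)

# Mother (assumed rule): adult, titled "Mrs", travelling with parents/children.
is_mother <- !is.na(titanic$Age) & titanic$Age >= 18 &
  titanic$Parch > 0 & titanic$title == "Mrs"
titanic$mother <- ifelse(is_mother, "Mother", "Not Mother")
```

Anchoring the title pattern on the comma and period avoids matching title-like fragments inside surnames, which a bare `Mrs|Master|Mr|...` search could do.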



Most people on the ship travelled alone (family size "Single")










There are more adults than children on the ship








The most common title on the ship is Mr.








Most women are not mothers







Dummy variables: created with the fastDummies library
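To show what dummy encoding produces, here is a small base-R stand-in using `model.matrix()`. The post itself uses fastDummies, so this is only an equivalent illustration on made-up rows, not the author's call.

```r
# Toy categorical columns (made up).
titanic <- data.frame(
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = TRUE
)

# model.matrix() expands each factor into 0/1 columns and drops the first level
# of each factor as the reference category; column 1 is the intercept, so remove it.
dummies <- as.data.frame(model.matrix(~ Sex + Embarked, data = titanic)[, -1])
```

Dropping one level per variable keeps the dummy columns from being perfectly collinear, which matters for the VIF check that follows.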

 

VIF : Variance Inflation Factor

Measures the correlation among independent variables (multicollinearity). Threshold used: VIF less than 3
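The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors; the helper `vif_manual` and the data are hypothetical, and the post may well have used a packaged function instead.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing x_j on the other predictors.
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X)
keep <- names(vifs)[vifs < 3]   # threshold stated in the post
```

Here `x1` and `x3` get very large VIFs while `x2` stays near 1. In practice one would usually remove the worst offender and recompute, rather than dropping every variable above the threshold at once.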

 

MODEL

Split the data into Train and Test sets and fit the model on the Train set

Best model : 636.1
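The post does not name the model type or say what the 636.1 figure measures. The validation metrics that follow (McFadden's pseudo R-squared, concordance, p-value stars) are typical of a logistic regression, and the confusion-matrix totals (713 train, 178 test) are consistent with roughly an 80/20 split, so the sketch below assumes both. The data is simulated and the column names `Sex_female` and `Pclass_3` are placeholders.

```r
set.seed(42)
n <- 500

# Simulated predictors and outcome: NOT the real Titanic data.
toy <- data.frame(
  Sex_female = rbinom(n, 1, 0.35),
  Pclass_3   = rbinom(n, 1, 0.55)
)
toy$Survived <- rbinom(n, 1, plogis(-1 + 2.5 * toy$Sex_female - 1 * toy$Pclass_3))

# Assumed 80/20 random split into train and test.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression fitted on the train set only.
model <- glm(Survived ~ Sex_female + Pclass_3, data = train, family = binomial)

# Predicted survival probabilities on the test set, cut at an assumed 0.5 threshold.
test_prob <- predict(model, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)
```

`summary(model)` prints the coefficient table with the significance stars referred to in the conclusion.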

 

MODEL VALIDATION


Pseudo R-squared (McFadden's R-squared)

0.357

Note: a value between 0.2 and 0.4 is considered a good fit
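McFadden's pseudo R-squared compares the log-likelihood of the fitted model with that of an intercept-only model. A minimal sketch on simulated data (the variables `x` and `y` are made up):

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))

fit  <- glm(y ~ x, family = binomial)   # fitted model
null <- glm(y ~ 1, family = binomial)   # intercept-only model

# McFadden: 1 - logLik(fitted) / logLik(intercept-only)
mcfadden_r2 <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
```

For a 0/1 outcome the same value can be read off a single fit as `1 - fit$deviance / fit$null.deviance`.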


Concordance

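Compares every pair made of one 0 and one 1 in the dependent variable; a pair is concordant when the model gives the 1 a higher predicted probability than the 0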

87%

Note: any concordance over 80% is good
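Concordance can be computed directly by comparing the predicted probability of every actual 1 with that of every actual 0. The helper `concordance` and the four toy observations below are hypothetical.

```r
concordance <- function(actual, prob) {
  p1 <- prob[actual == 1]        # predicted probabilities of the actual 1s
  p0 <- prob[actual == 0]        # predicted probabilities of the actual 0s
  diffs <- outer(p1, p0, "-")    # one entry per (1, 0) pair
  c(concordant = mean(diffs > 0),
    discordant = mean(diffs < 0),
    tied       = mean(diffs == 0))
}

actual <- c(1, 1, 0, 0)
prob   <- c(0.9, 0.6, 0.2, 0.7)
res <- concordance(actual, prob)
```

Of the four pairs here, three rank the 1 above the 0, so the concordance is 0.75. Concordance plus half of the tied share equals the area under the ROC curve.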


Accuracy

To see the accuracy on Train and Test

ROC curve

Area under the curve : 0.8611

Note: the closer it is to 1 the better it is

Confusion Matrix

Train Data

True Positive : 351

True Negative : 230

False Positive : 83

False negative: 49


Formula: (TP + TN) / (TP + TN + FP + FN)

Prediction 81% Correct


Test Data

True Positive : 91

True Negative : 52

False Positive : 24

False negative: 11

Formula: (TP + TN) / (TP + TN + FP + FN)

Prediction 80% Correct
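The two accuracy figures can be checked from the confusion-matrix counts listed above; the function name `accuracy` is just for this example.

```r
# Accuracy = correct predictions / all predictions.
accuracy <- function(tp, tn, fp, fn) (tp + tn) / (tp + tn + fp + fn)

train_acc <- accuracy(tp = 351, tn = 230, fp = 83, fn = 49)
test_acc  <- accuracy(tp = 91,  tn = 52,  fp = 24, fn = 11)

round(c(train = train_acc, test = test_acc), 3)
```

This gives about 0.815 on train (581 of 713) and 0.803 on test (143 of 178), in line with the 81% and 80% reported.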


Train Data Only

Precision : ratio of correctly predicted positive observations to all predicted positive observations

Formula: TP / (TP + FP)

  • of all the passengers predicted to survive, how many actually survived? 0.808


Recall : ratio of correctly predicted positive observations to all observations in the actual positive class (yes)

Formula: TP / (TP + FN)

  • of all the passengers who actually survived, how many were identified? 0.877

Specificity : ratio of correctly predicted negative observations to all observations in the actual negative class (no)

Formula: TN / (TN + FP)

  • of all the passengers who actually did not survive, how many did we correctly predict? 0.824


F1 - SCORE : Harmonic mean of precision and recall

Formula: 2*(Recall * Precision) / (Recall + Precision)

Train : 0.78


For the test data, the score was calculated automatically

Test : 0.75
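The formulas above can be applied to the listed counts as a sanity check. The helper `class_metrics` is hypothetical. With the train counts exactly as listed, precision and recall come out at about 0.809 and 0.878, close to the reported figures, while TN / (TN + FP) gives about 0.735 and F1 about 0.842. With the two classes swapped, precision is about 0.824 and F1 is about 0.78 on train and 0.75 on test, which suggests the reported 0.824 and F1 values may have been computed with the other class treated as positive. That is an inference from the arithmetic, not something the post states.

```r
class_metrics <- function(tp, tn, fp, fn) {
  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1          <- 2 * (recall * precision) / (recall + precision)
  c(precision = precision, recall = recall, specificity = specificity, f1 = f1)
}

# Train counts exactly as listed in the confusion matrix.
as_listed <- class_metrics(tp = 351, tn = 230, fp = 83, fn = 49)

# Same counts with the positive and negative classes swapped
# (TP <-> TN and FP <-> FN).
swapped_train <- class_metrics(tp = 230, tn = 351, fp = 49, fn = 83)
swapped_test  <- class_metrics(tp = 52,  tn = 91,  fp = 11, fn = 24)

round(rbind(as_listed, swapped_train, swapped_test), 3)
```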


LIFT CHART




The Lift Chart compares the model's predictions with random selection


Lift on Test Data : 41%
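The post does not say how the 41% figure was derived, so the sketch below only illustrates the general idea behind a lift chart: rank passengers by predicted probability, take the top fraction, and compare the survival rate there with the overall rate (what random selection would give on average). The helper `lift_at` and the ten toy observations are hypothetical.

```r
# Lift at a given fraction: positive rate among the top-scored cases
# divided by the overall positive rate.
lift_at <- function(actual, prob, frac = 0.1) {
  n_top <- ceiling(frac * length(actual))
  top   <- order(prob, decreasing = TRUE)[seq_len(n_top)]
  mean(actual[top]) / mean(actual)
}

actual <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0)
prob   <- c(0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.25, 0.20, 0.10, 0.05)

lift_top20 <- lift_at(actual, prob, frac = 0.2)
```

In this toy case the top 20% contains only survivors while the overall rate is 30%, so the model is about 3.3 times better than random selection in that slice. A lift chart repeats this calculation for each decile.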





 

CONCLUSION

We can conclude that the significant variables in the model include Sex_female, SibSp_4, Cabins (A, B, C, D and E) and others. We will justify only the variables whose p-value is flagged with more than 2 significance stars ("*").


Sex_female : We know that women and children had priority in boarding the lifeboats.


SibSp : Families with 4 children had higher priority in boarding the lifeboats.


Cabins : Below we can see the illustration of the cabins (decks) on the ship.

First Class : A,B,C,D and E

Second Class: D, E and F

Third Class: D,E,F and G



In the model, we have cabins A, B, C, D and E. The people in these cabins had a better chance of surviving, which makes sense: as the illustration above shows, they were close to the lifeboats.



Thank You


