Lila Guimarães Reis

Updated: Sep 5, 2021

HR Analytics: Job Change of Data Scientists


The Challenge

The business challenge is to create a predictive model that identifies the factors that contribute to employee churn in a company.


Packages used

MASS -> Modern Applied Statistics with S

mlogit -> Estimation of the multinomial logit models in R

sqldf -> Manipulate R data frames using SQL

Hmisc -> Harrell Miscellaneous. Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation

dplyr -> Data manipulation tasks

HH -> Statistical Analysis and Data Display: Heiberger and Holland

gmodels -> Various R programming tools for model fitting

rms -> A collection of functions that assist with and streamline modeling

pROC -> For visualizing ROC curves and the area under the curve (AUC)

dummies -> To create dummy variables automatically


Data complexities

Missing Values:

We calculated the percentage of missing values in the data and decided to work with about 66% of it.

Total Data – 19,158 values

Working Data – 12,673 values

Created new category for missing values

Inserted missing values in highest category

Removed insignificant variables
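
The missing-value steps above can be sketched as follows. This is only an illustration in plain Python (the original analysis was done in R); the sample rows, the column names and the "Missing" label are assumptions made for the example, not the team's actual code or data.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"gender": "Male", "company_size": "50-99"},
    {"gender": None, "company_size": None},
    {"gender": "Female", "company_size": None},
    {"gender": "Male", "company_size": "100-500"},
]


def missing_percentage(rows, column):
    """Return the percentage of rows in which `column` is missing."""
    missing = sum(1 for row in rows if row[column] is None)
    return 100.0 * missing / len(rows)


def fill_with_new_category(rows, column, label="Missing"):
    """Return copies of the rows where missing values get their own category."""
    return [
        {**row, column: label if row[column] is None else row[column]}
        for row in rows
    ]


pct_gender = missing_percentage(rows, "gender")
pct_size = missing_percentage(rows, "company_size")
filled = fill_with_new_category(rows, "company_size")

# Share of the data kept, using the totals reported in the post.
share_kept = 100.0 * 12673 / 19158
print(pct_gender, pct_size, round(share_kept))
```

Giving missing values their own category keeps those rows in the working data instead of dropping them, at the cost of one extra level per variable.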


Data Exploration: Dependent Variable

“Target” is the “churn” value, which is either 0 or 1.

0 means that the employee will not leave the company, while 1 means that the employee will leave the company.

Note: the banding for experience could be improved.

•There are 2 continuous variables (City_development_Index and Training hours); the other variables are categorical.

•The “experience” variable had several values, so a band was created.

•“Dummy variables” were also created for the categorical variables.
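
Banding and dummy-variable creation can be sketched like this. It is a plain-Python illustration rather than the R code used in the project; the cut points for the bands and the `exp` prefix are assumptions chosen for the example, since the post does not state the actual bands.

```python
def experience_band(years):
    """Map years of experience to a band (cut points are illustrative)."""
    if years < 2:
        return "0-1"
    if years <= 6:
        return "2-6"
    if years <= 10:
        return "7-10"
    return "11+"


def make_dummies(values, prefix):
    """One-hot encode categories, dropping the first level as the baseline."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped level is represented by all zeros
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in kept} for v in values]


years = [0, 3, 6, 8, 15]  # hypothetical values
bands = [experience_band(y) for y in years]
dummies = make_dummies(bands, "exp")

for band, row in zip(bands, dummies):
    print(band, row)
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for the multicollinearity check later in the post.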


Bivariate Analysis - Categorical

People who live in cities that are rated below 0.8 on the development index scale are likely to churn.

People who spend more than 70 hours on training are more likely to churn.

People with background in STEM education and a first degree are more likely to churn. Males are also more likely to churn than females.

People not currently enrolled in any University are likely to churn.

People with 2 to 6 years of relevant experience are more likely to churn. Also, people who have been on the job for 1 year are likely to churn. Finally, we see that people employed in the private sector, in companies of 50 – 500 employees, are also likely to churn.


Dummy Variable

Categorical variables were converted to dummy variables.


Variance Inflation Factor - VIF

VIF was used as a multicollinearity test, with the acceptable score defined as less than 3.

Variables removed
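
As a rough illustration of the check, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below covers only the simplest case of two predictors, where R² equals the squared Pearson correlation. The columns are made-up data, and this plain-Python version is not the R routine used in the project.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor columns.
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]  # moves closely with a
c = [3, 5, 1, 6, 2, 4]  # almost unrelated to a

THRESHOLD = 3  # cut-off stated in the post
vif_ab = vif_two_predictors(a, b)
vif_ac = vif_two_predictors(a, c)
print(vif_ab > THRESHOLD, vif_ac > THRESHOLD)
```

With more than two predictors, the same formula applies but R² has to come from a full regression of each predictor on all the others.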


The Model

Model outputs & interpretation

In the final model iteration, where we got the lowest AIC and the most significant variables, factors such as a city’s level of development, company size and an individual’s level of experience significantly affect the employee churn rate.

This means people in growing cities where opportunities are readily available are more likely to churn than those in cities where the economy is struggling. It could also mean that staff in large companies might churn more frequently as they don’t feel the ‘personal touch’ and maybe feel more like just a number on the payroll.


Model Validation

ROC Curve

Concordance: 0.79

Area Under Curve: 0.7811 (the closer this is to 1 the better; 0.5 would be no better than random guessing)

Accuracy: Train - 0.8255252, Test - 0.8196527 (predictions are correct about 82% of the time)

F1 Score: Train - 0.5542958, Test - 0.5204617 (the harmonic mean of precision and recall is just above 0.5)
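
For readers who want to see how these validation numbers are defined, here is a minimal plain-Python sketch. The labels, scores and the 0.5 cut-off are invented for the example, and the functions are illustrative rather than the pROC or other R calls used in the project. Concordance is computed here as the share of (churner, non-churner) pairs in which the churner gets the higher score, with ties counted as half, which is also one way of computing the AUC.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(actual, predicted):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(actual, predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_by_concordance(actual, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical churn labels and predicted probabilities.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.5]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

acc = accuracy(actual, predicted)
f1 = f1_score(actual, predicted)
auc = auc_by_concordance(actual, scores)
print(acc, f1, auc)
```

Accuracy and F1 depend on the chosen cut-off, while the AUC does not, which is one reason a model can show high accuracy and a much lower F1 when churners are the minority class.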

Comparing Model Train, Model Test and Random Selection

(chart built in Excel)



•Using the model, the company will be able to identify (& possibly reach out to) almost double the number of staff that could potentially leave the company. If their concerns are addressed and needs met, the company will be able to save on the costs of rehiring (onboarding, training, restaffing etc.)

•Suggestions for Implementation: any company seeking to implement this model will need to collect similar data and apply the model to their data.
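
The comparison between the model and random selection can be expressed as a lift: the churn rate among the staff the model ranks highest, divided by the overall churn rate. The sketch below is a plain-Python illustration on invented data, not the Excel workbook the team built; the `lift_at` helper and the 50% cut are assumptions for the example.

```python
def lift_at(actual, scores, fraction):
    """Churn rate in the top `fraction` of scores divided by the overall rate."""
    ranked = [
        a for _, a in sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    ]
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical churn labels (1 = left) and model scores for ten employees.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.6, 0.1, 0.4, 0.3, 0.5, 0.35, 0.15]

lift = lift_at(actual, scores, 0.5)
print(lift)
```

A lift of about 2 at a given cut means that contacting the staff ranked highest by the model reaches roughly twice as many likely leavers as contacting the same number of staff picked at random.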

Team members: Corine, Ferheng, Lila and Suganya
