HR Analytics: Job Change of Data Scientists
Source: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists (dataset: aug_train)
The Challenge
The business challenge is to create a predictive model that identifies the
factors that contribute to employee churn in a company.
Packages used
MASS -> Modern Applied Statistics with S
mlogit -> Estimation of multinomial logit models in R
sqldf -> Manipulate R data frames using SQL
Hmisc -> Harrell Miscellaneous. Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation
dplyr -> Data manipulation tasks
HH -> Statistical Analysis and Data Display: Heiberger and Holland
gmodels -> Various R programming tools for model fitting
rms -> A collection of functions that assist with and streamline modeling
pROC -> Display and analyze ROC curves, including the area under the curve (AUC)
dummies -> Create dummy variables automatically
Data complexities
•Missing Values:
Calculated the percentage of missing values in the data and decided to work with 66% of the rows.
Total Data – 19,158 rows
Working Data – 12,673 rows
Created a new category for missing values
Assigned missing values to the highest (most frequent) category
Removed insignificant variables
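The original analysis was done in R. As a rough illustration of the two missing-value strategies listed above, here is a minimal Python sketch. The toy rows and the "Unknown" label are invented for illustration, and the choice of which column gets which strategy is an assumption, not something the source specifies.

```python
from collections import Counter

# Hypothetical toy rows; None marks a missing value. The column names follow
# the Kaggle dataset, but the values are invented for illustration.
rows = [
    {"gender": "Male", "company_size": "50-99", "education_level": "Graduate"},
    {"gender": None, "company_size": None, "education_level": "Graduate"},
    {"gender": "Female", "company_size": "100-500", "education_level": None},
    {"gender": "Male", "company_size": None, "education_level": "Masters"},
]

def missing_share(rows, column):
    """Fraction of rows where `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def fill_with_new_category(rows, column, label="Unknown"):
    """Strategy 1: treat missingness as its own category."""
    return [{**r, column: label if r[column] is None else r[column]} for r in rows]

def fill_with_mode(rows, column):
    """Strategy 2: assign missing values to the most frequent category."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    mode = counts.most_common(1)[0][0]
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]

for col in rows[0]:
    print(col, f"{missing_share(rows, col):.0%} missing")

cleaned = fill_with_new_category(rows, "company_size")
cleaned = fill_with_mode(cleaned, "gender")

# Share of the data kept in the working set, using the counts reported above.
print(f"{12673 / 19158:.0%} of rows kept")
```

A separate "Unknown" level keeps the information that a value was missing, while mode imputation is simpler but can blur that signal.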
Data Exploration : Dependent variable
“Target” is the churn indicator, which is either 0 or 1.
0 means that the employee will not leave the company, while 1 means that the employee will leave the company.
Note: the banding for experience could be improved.
•There are 2 continuous variables (City_development_Index and Training hours), while the other variables are categorical.
•The “experience” variable had many distinct values, so it was grouped into bands.
•“Dummy variables” were also created for the categorical variables.
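The banding and dummy-variable steps above can be sketched as follows. This is a Python illustration, not the team's R code. The cut points (0-1, 2-6, 7-15, 16+) and the `exp_` column prefix are hypothetical, since the source does not list the bands it used. The raw values "<1" and ">20" are how the Kaggle dataset encodes the ends of the experience range.

```python
def experience_band(value):
    """Map a raw experience string to a band. The cut points are hypothetical."""
    if value == "<1":
        years = 0
    elif value == ">20":
        years = 21
    else:
        years = int(value)
    if years <= 1:
        return "0-1"
    if years <= 6:
        return "2-6"
    if years <= 15:
        return "7-15"
    return "16+"

def one_hot(values, prefix="exp_", drop_first=True):
    """Create 0/1 dummy columns. Dropping one level avoids perfect
    collinearity among the dummies in a regression model."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"{prefix}{lvl}": int(v == lvl) for lvl in kept} for v in values]

raw = ["<1", "3", "10", ">20", "5"]
bands = [experience_band(v) for v in raw]
dummies = one_hot(bands)

for r, b, d in zip(raw, bands, dummies):
    print(r, b, d)
```

The dropped level ("0-1" here) acts as the reference category: a row with all dummies equal to 0 belongs to it.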
Bivariate Analysis - Categorical
People who live in cities rated below 0.8 on the development index scale are more likely to churn.
People who spend more than 70 hours on training are more likely to churn.
People with a background in STEM education and a first degree are more likely to churn. Males are also more likely to churn than females.
People not currently enrolled in any university are more likely to churn.
People with 2 to 6 years of relevant experience are more likely to churn. Also, people who have been on the job for 1 year are more likely to churn. Finally, we see that people employed in the private sector, in companies of 50 – 500 employees, are also more likely to churn.
Dummy Variable
to categorical
Variance Inflation Factor - VIF
Multicollinearity test; the acceptable VIF score was defined as less than 3.
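To make the VIF check concrete, here is a small Python sketch (the original work used R). It relies on the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The column names and values are made up, and flagging every variable at or above 3 in one pass is a simplification; a common practice is to drop the worst variable and recompute.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length numeric columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(x - m) / s for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting.
    Fails with a division by zero if the matrix is singular."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per column: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated.
names = ["x1", "x2", "x3"]
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
x3 = [5, 3, 6, 2, 7, 4]

vifs = vif([x1, x2, x3])
flagged = [name for name, v in zip(names, vifs) if v >= 3]
print(dict(zip(names, vifs)))
print("VIF >= 3:", flagged)
```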
Variables removed
The Model
Model outputs & interpretation
In the final model iteration, which had the lowest AIC and the most significant variables, factors such as a city’s level of development, company size and an individual’s level of experience significantly affect employee churn.
This suggests that people in growing cities, where opportunities are readily available, are more likely to churn than people in cities where the economy is struggling. It could also mean that staff in large companies might churn more frequently because they don’t feel the ‘personal touch’ and maybe feel more like just a number on the payroll.
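For readers unfamiliar with the AIC criterion used to pick the final iteration, here is a minimal Python sketch of how it is computed for a binary-outcome model. It assumes the model outputs a churn probability per employee, as a logistic regression would; the source does not state the model type explicitly. The outcomes, probabilities and parameter counts are invented.

```python
import math

def log_likelihood(y_true, p_hat):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for y, p in zip(y_true, p_hat))

def aic(y_true, p_hat, n_params):
    """AIC = 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_hat)

# Hypothetical outcomes and fitted probabilities from two candidate models.
y = [0, 0, 1, 1, 0, 1]
p_small = [0.2, 0.3, 0.7, 0.8, 0.4, 0.6]    # 3 parameters
p_large = [0.2, 0.25, 0.7, 0.8, 0.4, 0.65]  # 6 parameters, slightly better fit

aic_small = aic(y, p_small, n_params=3)
aic_large = aic(y, p_large, n_params=6)
print(f"small model AIC: {aic_small:.2f}")
print(f"large model AIC: {aic_large:.2f}")
```

In this toy case the larger model fits slightly better, but the penalty for its extra parameters outweighs the gain, so the smaller model has the lower AIC.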
Model Validation
ROC Curve
•Concordance: 0.79 (79%)
•Area Under Curve: 0.7811. This indicates a reasonably good model, since values closer to 1 are better and 0.5 is no better than random.
•Accuracy: Train - 0.8255252, Test - 0.8196527. Predictions are correct about 82% of the time.
•F1 Score: Train - 0.5542958, Test - 0.5204617. F1 is the harmonic mean of precision and recall, so the model balances the two at a little over 0.5.
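The validation metrics above can be computed from a vector of actual outcomes and predicted scores. The following Python sketch (the original results came from R) uses invented data and an assumed 0.5 classification threshold; the source does not state which threshold was used.

```python
def concordance(y_true, scores):
    """Share of (leaver, stayer) pairs where the leaver has the higher score.
    Ties count as half, which makes this equal to the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical outcomes (1 = churn) and model scores.
y = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.3, 0.1, 0.5]
pred = [int(s >= 0.5) for s in scores]  # assumed threshold

auc = concordance(y, scores)
acc = accuracy(y, pred)
f1 = f1_score(y, pred)
print(f"concordance/AUC={auc:.3f} accuracy={acc:.3f} F1={f1:.3f}")
```

Accuracy counts correct stayers as well as correct leavers, so on data where most employees stay it can sit well above F1, which only looks at how well the leavers are found.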
Comparing Model Train, Model Test and Random Selection
Built in Excel
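The comparison against random selection was built in Excel. As an assumption about what such a comparison typically contains, here is a Python sketch of a cumulative gains table: employees are ranked by predicted score, and the share of actual leavers captured in the top groups is compared with the share expected from picking people at random. The data and the number of bins are invented.

```python
def cumulative_gains(y_true, scores, n_bins=10):
    """Share of all leavers captured within the top-scored 1..n_bins groups."""
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda t: -t[0])]
    total = sum(ranked)
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains

# Hypothetical outcomes (1 = churn) and model scores for 10 employees.
y = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.3, 0.1, 0.5, 0.6, 0.05]

n_bins = 5
gains = cumulative_gains(y, scores, n_bins)
baseline = [b / n_bins for b in range(1, n_bins + 1)]  # random selection
lift = [g / r for g, r in zip(gains, baseline)]

for b, (g, r, l) in enumerate(zip(gains, baseline, lift), start=1):
    print(f"top {b}/{n_bins}: model {g:.0%}, random {r:.0%}, lift {l:.2f}")
```

A lift of about 2 in the top groups would correspond to identifying roughly twice as many potential leavers as random selection would.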
Conclusion
•Using the model, the company will be able to identify (and possibly reach out to) almost double the number of staff who could potentially leave the company, compared with random selection. If their concerns are addressed and needs met, the company will be able to save on the costs of rehiring (onboarding, training, restaffing, etc.).
•Suggestions for Implementation: any company seeking to implement this model will need to collect similar data and apply the model to their data.
Team members: Corine, Ferheng, Lila and Suganya