Employee Retention Problem Part 2
In this section, I would like to use Logistic Regression to predict whether an employee will leave or not.
A Brief Explanation of Logistic Regression
If you have studied statistics, you will be familiar with Linear Regression. Linear Regression predicts the dependent variable Y given an independent variable X. The Linear Regression model can be represented by equation (1):
Y = β₀ + β₁X (1)
or, for a dataset with more than one X, the model would be represented by equation (2):
Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ (2)
Everything goes fine until Y is categorical. For instance, our goal is to predict whether an employee will leave or not. Suppose we divide the employees into 2 categories, i.e., 1 or 0: 1 for an employee who left and 0 for an employee who stayed. The problem then becomes estimating the probability of Y=1 given X (or Y=0 given X). Since a Linear Regression model produces continuous numbers, the result may fall outside the closed interval [0,1]. To handle this problem, we have to use a function that transforms the result into [0,1].
Here the sigmoid function comes to the rescue. The sigmoid function transforms the output of Linear Regression into an S-shaped curve with values in the closed interval [0,1]. The threshold is 0.5: when the result is below 0.5, the class is 0; otherwise, the class is 1. The sigmoid function is represented by equation (3):
σ(z) = 1 / (1 + e^(−z)) (3)
Plugging the linear model of equation (2) into the sigmoid gives us the Logistic Regression model, P(Y=1 | X) = σ(β₀ + β₁X₁ + … + βₚXₚ).
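To make this concrete, here is a minimal sketch of the sigmoid and the 0.5 threshold in plain Python/NumPy (the sample values are made up for illustration):
import numpy as np

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.3])        # example linear-model outputs
probs = sigmoid(z)                    # -> approx. [0.047, 0.5, 0.909]
labels = (probs >= 0.5).astype(int)   # apply the 0.5 threshold -> [0, 1, 1]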
Logistic Regression Model
Before we start our training, we have to prepare the dataset. You can access the dataset from the link below.
IBM HR Analytics Employee Attrition & Performance | Kaggle
We will use Python to execute this project. First, we import the essential libraries: pandas, numpy, seaborn, and matplotlib.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Then, we load the dataset into a DataFrame using pandas.
employee_df = pd.read_csv('drive/MyDrive/Human_Resources.csv')
Make sure you have the right path to the file. In this case, I used Google Drive to store the dataset.
The dataset has 35 columns (34 independent variables and 1 dependent variable) and 1470 observations. Fortunately, there are no null values in the dataset.
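If you want to verify these numbers yourself, a quick sanity check looks like this:
# Check the shape and count missing values
print(employee_df.shape)                  # expected: (1470, 35)
print(employee_df.isnull().sum().sum())   # expected: 0 (no null values)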
Before we train the model, we must deal with the categorical variables. We have 6 categorical variables with more than two categories and 3 categorical variables with exactly two categories. For the 3 two-category variables, i.e., Attrition, Over18, and OverTime, we transform "Yes" (or "Y" for Over18) into 1 and "No" into 0. For the 6 remaining categorical variables, we use the one-hot encoding method, so every category gets its own column. Here is the code:
# Variables that have two categories
# Transform "Yes" into 1 and "No" into 0 ("Y"/"N" for Over18)
# Let's replace the 'Attrition', 'Over18', and 'OverTime' columns with integers
employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x:1 if x == "Yes" else 0)
employee_df['Over18'] = employee_df['Over18'].apply(lambda x:1 if x == "Y" else 0)
employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x:1 if x == "Yes" else 0)
# Variables that have more than two categories
X_cat = employee_df[['BusinessTravel','Department','EducationField','Gender','JobRole','MaritalStatus']]
# Transform The Category data to numeric using One Hot Encoding
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
# Transform X_cat into dataframe
X_cat = pd.DataFrame(X_cat)
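One caveat: pd.DataFrame(X_cat) produces unnamed integer columns. If you are on scikit-learn 1.0 or newer, you can optionally attach readable names to the one-hot columns (this step is my addition, not part of the original flow):
# Optional: readable column names such as 'Department_Sales' (scikit-learn >= 1.0)
X_cat.columns = onehotencoder.get_feature_names_out()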
Okay, we have handled the categorical variables. Now, let's look at Figure 1.
Figure 1 shows that EmployeeCount, StandardHours, and Over18 each contain only 1 distinct value. Such constant variables simply cannot affect the model, so we would like to drop them.
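If you do not have the figure handy, a short check reproduces the same finding:
# Columns with a single distinct value carry no information
n_unique = employee_df.nunique()
print(n_unique[n_unique == 1])   # expected: EmployeeCount, StandardHours, Over18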
# It makes sense to drop 'EmployeeCount', 'StandardHours', and 'Over18'
# since they do not change from one employee to another
employee_df.drop(['EmployeeCount','StandardHours','Over18'], axis=1, inplace=True)
Thus, we have 32 columns including Attrition (dependent variable).
After handling the categorical variables, we must handle the numeric variables. We transform every numeric variable's values into the closed interval [0,1] using the Min-Max Scaler, which maps each value x to (x − min) / (max − min). We do this because some variables have a huge scale while others have a small one, and we want all features on the same scale.
# Numeric Variables
X_num = employee_df[['Age','DailyRate','DistanceFromHome','Education','EnvironmentSatisfaction',
'HourlyRate','JobInvolvement','JobLevel','JobSatisfaction','MonthlyIncome',
'MonthlyRate','NumCompaniesWorked','OverTime','PercentSalaryHike','PerformanceRating',
'RelationshipSatisfaction','StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear','WorkLifeBalance',
'YearsAtCompany','YearsInCurrentRole','YearsSinceLastPromotion','YearsWithCurrManager']]
# Scale the numeric variables into [0, 1] using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_num = scaler.fit_transform(X_num)
X_num = pd.DataFrame(X_num)
X_num
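As a quick check that the scaling worked, every column should now have a minimum of 0 and a maximum of 1:
# All scaled values should lie in [0, 1]
print(X_num.min().min(), X_num.max().max())   # expected: 0.0 1.0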
Then, we concatenate the categorical and the numeric variables into X (our independent variables).
# Concat X_cat and X_num
X = pd.concat([X_cat,X_num], axis=1)
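Note that if you kept the default integer column names, X now contains duplicate labels (both parts use 0, 1, 2, …). scikit-learn converts X to a plain array anyway, but if you want to inspect X with pandas, a simple workaround is renumbering the columns:
# Avoid duplicate integer labels after the concat
X.columns = range(X.shape[1])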
Then, we set Attrition as our dependent variable.
y = employee_df['Attrition']
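It is also worth checking the class balance here, because it will color how we read the metrics later; in this dataset the classes are quite imbalanced:
# Class balance of the target
print(y.value_counts())   # roughly 1233 stayed (0) vs. 237 left (1)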
Next, we separate the data into training and test sets with a 75:25 proportion.
# Split the data into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
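A small caveat: without a fixed seed, every run yields a different split and slightly different metrics. If you want a reproducible split that also preserves the 0/1 proportions, you can pass the optional arguments below (the seed value 42 is arbitrary):
# Reproducible, class-preserving split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)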
Now, we are ready to train our Logistic Regression Model.
# Train data using Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
# The prediction if we input test data
y_pred = model.predict(X_test)
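One practical note: with scikit-learn's default settings, LogisticRegression may stop at its iteration limit (max_iter=100) and emit a convergence warning on data like this. If you see that warning, allowing more iterations is a harmless fix:
# Same model, just a higher iteration budget for the solver
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)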
The last step is to compute metrics that measure how good our model is. Here I will use accuracy, precision, recall, and f1-score as our metrics.
# Testing Set Performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
# Accuracy
print("Accuracy: {:.2f} %".format(100 * accuracy_score(y_test, y_pred)))
# Precision
print("Precision: {:.2f} %".format(100 * precision_score(y_test, y_pred)))
# Recall
print("Recall: {:.2f} %".format(100 * recall_score(y_test, y_pred)))
# F1-score
print("F1-score: {:.2f} %".format(100 * f1_score(y_test, y_pred)))
And we got the result.
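Beyond the headline numbers, the already-imported confusion_matrix and classification_report give a per-class view, which matters here because leavers are the rarer class:
# Rows are actual classes (0, 1); columns are predicted classes
print(confusion_matrix(y_test, y_pred))
# Per-class precision, recall, and f1-score in one table
print(classification_report(y_test, y_pred))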
It seems our model is not good enough at predicting whether an employee will leave or not. So, in the next part, we will improve it or try another model to get the best one. See you.