Python is one of the most frequently used programming languages for financial data analysis, with plenty of useful libraries and built-in functionality. In this article, you’ll see how Python’s machine learning libraries can be used for customer churn prediction.
Customer churn is a financial term that refers to the loss of a client or customer—that is, when a customer ceases to interact with a company or business. Similarly, the churn rate is the rate at which customers or clients are leaving a company within a specific period of time. A churn rate higher than a certain threshold can have both tangible and intangible effects on a company’s business success. Ideally, companies like to retain as many customers as they can.
With the advent of advanced data science and machine learning techniques, it’s now possible for companies to identify potential customers who may cease doing business with them in the near future. In this article, you’ll see how a bank can predict customer churn based on different customer attributes such as age, gender, geography, and more. The details of the features used for customer churn prediction are provided in a later section.
Overview: Using Python for Customer Churn Prediction
Python comes with a variety of data science and machine learning libraries that can be used to make predictions based on different features or attributes of a dataset. Python’s scikit-learn library is one such tool. In this article, we’ll use this library for customer churn prediction.
The Dataset: Bank Customer Churn Modeling
The dataset you’ll be using to develop a customer churn prediction model can be downloaded from this kaggle link. Be sure to save the CSV to your hard drive.
Taking a closer look, we see that the dataset contains 14 columns (also known as features or variables). The first 13 columns are the independent variable, while the last column is the dependent variable that contains a binary value of 1 or 0. Here, 1 refers to the case where the customer left the bank after 6 months, and 0 is the case where the customer didn’t leave the bank after 6 months. This is known as a binary classification problem, where you have only two possible values for the dependent variable—in this case, a customer either leaves the bank after 6 months or doesn’t.
It’s important to mention that the data for the independent variables was collected 6 months before the data for the dependent variable, since the task is to develop a machine learning model that can predict whether a customer will leave the bank after 6 months, depending on the current feature values.
You can use machine learning classification algorithms to solve this problem.
Note: All the code in this article is executed using the Spyder IDE for Python.
Here’s an overview of the steps we’ll take in this article:
- Importing the libraries
- Loading the dataset
- Selecting relevant features
- Converting categorical columns to numeric ones
- Preprocessing the data
- Training a machine learning algorithm
- Evaluating the machine learning algorithm
- Evaluating the dataset features
All right, let’s begin!
Step 1: Importing the Libraries
The first step, as always, is to import the required libraries. Execute the following code to do so:
import numpy as np import matplotlib.pyplot as plt import pandas as pd
Step 2: Loading the Dataset
The second step is to load the dataset from the local CSV file into your Python program. Let’s use the
read_csv method of the pandas library. Execute the following code:
customer_data = pd.read_csv(r'E:/Datasets/Churn_Modelling.csv')
If you open the
customer_data dataframe in Spyder’s Variable Explorer pane, you should see the columns as shown below:
Step 3: Feature Selection
As a reminder, there are 14 columns total in our dataset (see the screenshot above). You can verify this by executing the following code:
columns = customer_data.columns.values.tolist() print(columns)
In the output, you should see the following list :
['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']
Not all columns affect the customer churn. Let’s discuss each column one by one:
RowNumber—corresponds to the record (row) number and has no effect on the output. This column will be removed.
CustomerId—contains random values and has no effect on customer leaving the bank. This column will be removed.
Surname—the surname of a customer has no impact on their decision to leave the bank. This column will be removed.
CreditScore—can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
Geography—a customer’s location can affect their decision to leave the bank. We’ll keep this column.
Gender—it’s interesting to explore whether gender plays a role in a customer leaving the bank. We’ll include this column, too.
Age—this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
Tenure—refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
Balance—also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
NumOfProducts—refers to the number of products that a customer has purchased through the bank.
HasCrCard—denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
IsActiveMember—active customers are less likely to leave the bank, so we’ll keep this.
EstimatedSalary—as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
Exited—whether or not the customer left the bank. This is what we have to predict.
After careful observation of the features, we’ll remove the
Surname columns from our feature set. All the remaining columns do contribute to the customer churn in one way or another.
To drop these three columns, execute the following code:
dataset = customer_data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
Notice here that we’ve stored our filtered data in a new data frame named
customer_data data frame still contains all the columns. We’ll reuse that later.
Step 4: Converting Categorical Columns to Numeric Columns
Machine learning algorithms work best with numerical data. However, in our dataset, we have two categorical columns:
Gender. These two columns contain data in textual format; we need to convert them to numeric columns.
Let’s first isolate these two columns from our dataset. Execute the following code to do so:
dataset = dataset.drop(['Geography', 'Gender'], axis=1)
One way to convert categorical columns to numeric columns is to replace each category with a number. For instance, in the
Gender column, female can be replaced with 0 and male with 1, or vice versa. This works for columns with only two categories.
For a column like Geography with three or more categories, you can use the values 0, 1, and 2 for the three countries of France, Germany, and Spain. However, if you do this, the machine learning algorithms will assume that there is an ordinal relationship between the three countries. In other words, the algorithm will assume that 2 is greater than 1 and 0, which actually is not the case in terms of the underlying countries the numbers represent.
A better way to convert such categorical columns to numeric columns is by using one-hot encoding. In this process, we take our categories (France, Germany, Spain) and represent them with columns. In each column, we use a 1 to designate that the category exists for the current row, and a 0 otherwise.
In this case, with the three categories of France, Germany, and Spain, we can represent our categorical data with just two columns (Germany and Spain, for example). Why? Well, if for a given row we have that Geography is France, then the Germany and Spain columns will both have a 0, implying that the country must be the remaining one not represented by any column. Notice, then, that we do not actually need a separate column for France.
Let’s convert both the Geography and Gender columns into numeric columns. Execute the following script:
Geography = pd.get_dummies(customer_data.Geography).iloc[:,1:] Gender = pd.get_dummies(customer_data.Gender).iloc[:,1:]
get_dummies method of the pandas library converts categorical columns to numeric columns. Then,
.iloc[:,1:] ignores the first column and returns the rest of the columns (Germany and Spain). As noted above, this is because we can always represent “n” categories with “n - 1” columns.
Now if you open the
customer_data data frames in the Variable Explorer pane, you should see something like this:
In accordance with our earlier explanation, the
Geography data frame contains two columns instead of three. When the geography is France, both
Spain contain 0. When the geography is Spain, you can see a 1 in the
Spain column and a 0 in the
Germany column. Similarly, in the case of
Germany, you can see a 1 in the
Germany column and a 0 in the
Next, we need to add the
Gender data frames back to the data set to create the final dataset. You can use the
concat function from pandas to horizontally concatenate two data frames as shown below:
dataset = pd.concat([dataset,Geography,Gender], axis=1)
Step 5: Data Preprocessing
Our data is now ready, and we can train our machine learning model. But first, we need to isolate the variable that we’re predicting from the dataset.
X = dataset.drop(['Exited'], axis=1) y = dataset['Exited']
Here, X is our feature set; it contains all the columns except the one that we have to predict (
Exited). The label set, y, contains only the
So we can later evaluate the performance of our machine learning model, let’s also divide the data into a training and test set. The training set contains the data that will be used to train our machine learning model. The test set will be used to evaluate how good our model is. We’ll use 20% of the data for the test set and the remaining 80% for the training set (specified with the
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Step 6: Machine Learning Algorithm Training
Now, we’ll use a machine learning algorithm that will identify patterns or trends in the training data. This step is known as algorithm training. We’ll feed the features and correct output to the algorithm; based on that data, the algorithm will learn to find associations between the features and outputs. After training the algorithm, you’ll be able to use it to make predictions on new data.
There are several machine learning algorithms that can be used to make such predictions. However, we’ll use the random forest algorithm, since it’s simple and one of the most powerful algorithms for classification problems.
To train this algorithm, we call the
fit method and pass in the feature set (X) and the corresponding label set (y). You can then use the predict method to make predictions on the test set. Look at the following script:
from sklearn.ensemble import RandomForestClassifier classifier = RandomForestClassifier(n_estimators=200, random_state=0) classifier.fit(X_train, y_train) predictions = classifier.predict(X_test)
Step 7: Machine Learning Algorithm Evaluation
Now that the algorithm has been trained, it’s time to see how well it performs. For evaluating the performance of a classification algorithm, the most commonly used metrics are the F1 measure, precision, recall, and accuracy. In Python’s scikit-learn library, you can use built-in functions to find all of these values. Execute the following script:
from sklearn.metrics import classification_report, accuracy_score print(classification_report(y_test,predictions )) print(accuracy_score(y_test, predictions ))
The output looks like this:
precision recall f1-score support 0 0.89 0.95 0.92 1595 1 0.73 0.51 0.60 405 avg / total 0.85 0.86 0.85 2000 0.8635
The results indicate an accuracy of 86.35%, which means that our algorithm successfully predicts customer churn 86.35% of the time. That’s pretty impressive for a first attempt!
Step 8: Feature Evaluation
As a final step, let’s see which features play the most important role in the identification of customer churn. Luckily,
RandomForestClassifier contains an attribute named
feature_importance that contains information about the most important features for a given classification.
The following code creates a bar plot of the top 10 features for predicting customer churn:
feat_importances = pd.Series(classifier.feature_importances_, index=X.columns) feat_importances.nlargest(10).plot(kind='barh')
And the output looks like this:
Based on this data, we can see that age has the highest impact on customer churn, followed by a customer’s estimated salary and account balance.
Customer churn prediction is crucial to the long-term financial stability of a company. In this article, you successfully created a machine learning model that’s able to predict customer churn with an accuracy of 86.35%. You can see how easy and straightforward it is to create a machine learning model for classification tasks.
Interested in exploring some other applications of Python for financial data analysis? Enroll in our Python Basics course to gain more hands-on experience.