All code related to the article below can be found here.
‘Customer Churn’ is the loss of clients or customers. In order to avoid losing customers, a company needs to examine why its customers have left in the past.
How can we use data science to better understand customer churn?
The Tool: Survival Analysis
To do so, we’re going to borrow a tool from an unlikely place: survival analysis. Survival analysis was first developed by actuaries and medical professionals to predict survival rates.
Survival analysis works well in situations where we can define:
- A ‘Birth’ event: for our application, this will be a customer entering a contract with a company
- A ‘Death’ event: for us, ‘death’ is a customer ending a relationship with a company
What makes survival analysis better suited to this problem than ordinary regression models is its ability to deal with censoring in the data.
In the traditional sense, censoring may refer to losing track of an individual, or to an individual not dying before the end of an observation period. This data is ‘censored’ because everyone dies eventually; we are just missing the data.
Similarly, we would expect to lose all customers eventually. Just because we haven’t observed them canceling their contract doesn’t mean they never will.
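To make the effect of censoring concrete, here is a minimal sketch on made-up data. The tenures and churn flags below are illustrative assumptions, not values from the telecom dataset. Each customer is stored as a duration plus a flag saying whether the ‘death’ event was actually observed.

```python
# Toy records: (tenure in months, churned?). churned=False means the customer
# was still active when observation stopped, so the tenure is censored.
customers = [
    (2, True),
    (5, True),
    (12, False),
    (24, False),
    (30, True),
    (48, False),
]

# Naive approach: average lifetime over the customers who churned only.
churned_tenures = [t for t, churned in customers if churned]
naive_mean = sum(churned_tenures) / len(churned_tenures)

# Every censored customer has already lasted at least their observed tenure,
# so the mean observed tenure is a lower bound on this group's mean lifetime.
lower_bound_mean = sum(t for t, _ in customers) / len(customers)

censored_share = sum(1 for _, churned in customers if not churned) / len(customers)

print(f"mean tenure, churned customers only: {naive_mean:.1f} months")
print(f"mean observed tenure, all customers: {lower_bound_mean:.1f} months (lower bound)")
print(f"share of censored records: {censored_share:.0%}")
```

On this toy data, throwing away the censored customers pulls the estimated lifetime well below even the lower bound, which is the kind of bias survival analysis is designed to avoid.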
The Problem: Customer Churn in Telecom
Treselle Systems, a data consulting service, analyzed customer churn data using logistic regression.
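This approach works for a binary classification of whether or not a customer has left, but survival analysis is more appropriate because it also uses how long each customer stayed and accounts for customers who have not left yet.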
The data can be found here.
Our goal is to identify ways for the telecom company to reduce customer churn.
The Analysis: Lifelines Library in Python
For our analysis, we will use the lifelines library in Python. Our first step will be to install and import the library, along with some of the classics.
!pip install lifelines
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lifelines
Next, we will import the data and perform some basic cleaning.
For each customer, we will need two important data points for survival analysis:
- ‘Tenure’: how long they have been a customer when the data is observed
- ‘Churn’: whether or not the customer left when the data was observed
We will first identify these features and ensure the data type is correct. Note that many customers in our data have not left yet.
churn_data = pd.read_csv(
'https://raw.githubusercontent.com/treselle-systems/'
'customer_churn_analysis/master/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# transform tenure and churn features
churn_data['tenure'] = churn_data['tenure'].astype(float)
churn_data['Churn'] = churn_data['Churn'] == 'Yes'
churn_data.head()
Before going into any further analysis, let’s look at the survival rate for the average customer using a Kaplan-Meier survival curve.
Using the code below, we can fit a KM survival curve to the customer churn data, and plot our survival curve with a confidence interval.
The survival curve is cumulative: in the graph below, the chance that a customer has not canceled service after 20 months is just above 80%.
# fitting kmf to churn data
t = churn_data['tenure'].values
churn = churn_data['Churn'].values
kmf = lifelines.KaplanMeierFitter()
kmf.fit(t, event_observed=churn, label='Estimate for Average Customer')
# plotting kmf curve
fig, ax = plt.subplots(figsize=(10,7))
kmf.plot(ax=ax)
ax.set_title('Kaplan-Meier Survival Curve: All Customers')
ax.set_xlabel('Customer Tenure (Months)')
ax.set_ylabel('Customer Survival Chance (%)')
plt.show()
The above should give us some basic intuition about the customers.
As we would expect for telecom, churn is relatively low. Even after 72 months, the company is able to retain 60% or more of its customers.
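For intuition about what the fitter computes, the Kaplan-Meier (product-limit) estimate can be sketched by hand. This is a simplified illustration on made-up durations, not the lifelines implementation, and the helper name `kaplan_meier` is hypothetical. At each time where a churn is observed, the running survival probability is multiplied by the share of at-risk customers who did not churn.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed event."""
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)   # everyone leaves the risk set at their duration
    at_risk = len(durations)     # customers still under observation
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk customers who survived time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Churned and censored customers both drop out of the risk set.
        at_risk -= exits[t]
    return curve

# Toy data (assumed for illustration): tenure in months and churn flag.
durations = [2, 5, 12, 24, 30, 48]
events = [True, True, False, False, True, False]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
```

Censored customers never trigger a drop in the curve, but they do shrink the risk set, so a later churn removes a larger fraction of the remaining customers.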
To examine the effects of different features, we will use the Cox Proportional Hazards Model. We can think of this as a Survival Regression model.
‘Hazards’ can be thought of as factors that increase or decrease the chances of survival. In our business problem, for example, a hazard may be the type of contract a customer has. Customers with multi-year contracts probably cancel less frequently than those with month-to-month contracts.
One restriction is that the model assumes a constant ratio of hazards over time across groups. The lifelines library offers a built-in check_assumptions method for the CoxPHFitter object.
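Written out, the Cox model in its standard textbook form (stated here as background, not taken from the article's code) scales a shared baseline hazard by an exponential function of the covariates:

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right)
\qquad\Longrightarrow\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^\top (x_a - x_b)\right)
```

The ratio on the right contains no t, which is what the ‘constant ratio of hazards’ restriction means: two customers' hazards may rise and fall over time, but always in the same proportion.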
After some data cleaning, including encoding categorical variables (k-1 dummies), we can fit a survival regression model to the data.
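The cleaning code is not reproduced in this article, so the following is only a sketch of the k-1 dummy encoding step on a tiny made-up frame. The rows and the names `toy` and `toy_hazard` are assumptions for illustration; the real churn_hazard frame would presumably be built from the full telecom data in a similar way.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the telecom data.
toy = pd.DataFrame({
    "tenure": [1.0, 34.0, 2.0, 45.0],
    "Churn": [True, False, True, False],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Partner": ["Yes", "No", "No", "Yes"],
})

# drop_first=True keeps k-1 dummies per categorical column; the dropped level
# becomes the baseline that each hazard ratio is compared against.
toy_hazard = pd.get_dummies(
    toy, columns=["Contract", "Partner"], drop_first=True, dtype=float
)

print(toy_hazard.columns.tolist())
```

Here ‘Month-to-month’ and ‘No’ are the dropped levels, so a coefficient on Contract_Two year would be read relative to month-to-month customers.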
cph = lifelines.CoxPHFitter()
cph.fit(churn_hazard, duration_col='tenure', event_col='Churn', show_progress=False)
cph.print_summary()
Survival Regression Coefficients
In the above regression, the key output is exp(coef). This is interpreted as the multiplicative change in hazard for each additional unit of the variable, with 1.00 being neutral.
For example, the last exp(coef), corresponding to PaymentMethod_Mailed check, means a customer who pays by mailing a check is 1.68 times as likely to cancel their service at any given time as a customer in the baseline payment category, with the other variables held fixed.
For the company, an exp(coef) below 1.0 is good, meaning the customer is less likely to cancel.
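The arithmetic behind exp(coef) can be checked in a few lines. The sketch below uses the 1.68 figure quoted above and the 0.82 figure quoted later in the article for customers with a partner; combining them in one customer is a hypothetical illustration, not a result from the fitted model.

```python
import math

hr_mailed_check = 1.68                         # exp(coef) for PaymentMethod_Mailed check
coef_mailed_check = math.log(hr_mailed_check)  # the raw coefficient behind it

# An exp(coef) of 1.68 corresponds to a 68% higher hazard than the baseline.
pct_change = (hr_mailed_check - 1) * 100

# Coefficients add on the log scale, so hazard ratios multiply.
hr_partner = 0.82                              # exp(coef) quoted for customers with a partner
combined = math.exp(coef_mailed_check + math.log(hr_partner))

print(f"coef = {coef_mailed_check:.3f}")
print(f"hazard change = {pct_change:+.0f}%")
print(f"combined hazard ratio = {combined:.2f}")
```

A combined ratio above 1.0 means that, under the model, paying by mailed check outweighs the protective effect of having a partner.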
To better visualize the above, we can plot the coefficient outputs and their confidence intervals.
# plotting coefficients
fig_coef, ax_coef = plt.subplots(figsize=(12,7))
ax_coef.set_title('Survival Regression: Coefficients and Confidence Intervals')
cph.plot(ax=ax_coef);
Visualizing Coefficient Confidence Intervals
The Conclusion
How can our telecom company reduce customer churn?
We can make recommendations along three dimensions: contract specification, customer selection, and payment systems.
To visualize some of our findings, we will fit a separate Kaplan-Meier curve for each customer category and plot them together, allowing us to see differences in churn rate between categories.
Contract Specification
The most important feature, by far, is the presence of a 1- or 2-year contract. Customers under a 1-year or 2-year contract are 0.25 and 0.02 times as likely, respectively, to cancel their service as customers who are not under contract.
Cancellation fees are a possible underlying cause. As long as these fees do not prohibit new sales, we would recommend continuing to put them into as many contracts as possible.
Kaplan-Meier Curves Segmented by Contract Type
Customer Selection
Customers with a partner or dependents are 0.82 and 0.91 times as likely to cancel, respectively, as customers without them. Families and other large households seem to be less likely to change providers.
This could be due to higher incomes, less time to consider options, or another combination of factors.
Kaplan-Meier Curves Segmented by Dependents
Payment Systems
There is a reason companies now default to opting employees into 401k plans. It takes effort for people to make a change, even if it is beneficial.
Make sure your customers’ default payment method is an automatic monthly payment. This requires little effort from the customer to remain subscribed.
Conversely, sending a check, whether in the mail or electronically, is a pain. It requires effort to remain subscribed.
Kaplan-Meier Survival Curve Segmented by Payment Method
That’s all for now! Please let me know if you have comments, questions, or other topics you would like covered.