Data transformation based optimized customer churn prediction model for the telecommunication industry

01/11/2022
by   Joydeb Kumar Sana, et al.

Data transformation (DT) is a process that transforms the original data into a form that supports a particular classification algorithm and helps to analyze the data for a special purpose. To improve prediction performance, we investigated various data transformation methods. This study is conducted in a customer churn prediction (CCP) context in the telecommunication industry (TCI), where customer attrition is a common phenomenon. We propose a novel approach of combining data transformation methods with machine learning models for the CCP problem. We conducted our experiments on publicly available TCI datasets and assessed the performance in terms of widely used evaluation measures (e.g., AUC, precision, recall, and F-measure). In this study, we present comprehensive comparisons to affirm the effect of the transformation methods. The comparison results and statistical tests show that most of the proposed data transformation based optimized models improve the performance of CCP significantly. Overall, an efficient and optimized CCP model for the telecommunication industry is presented in this manuscript.



1 Introduction

Over the last few decades, the telecommunication industry (TCI) has witnessed enormous growth and development in terms of technology, level of competition, number of operators, new products and services, and so on. However, because of extensive competition, saturated markets, a dynamic environment, and attractive and lucrative offers, the TCI faces serious customer churn issues, which is considered a formidable problem in this regard Óskarsdóttir et al. (2017). In a competitive market, where customers have numerous choices of service providers, they can easily switch services and even service providers. Such customers are referred to as churned customers Óskarsdóttir et al. (2017) with respect to the original service provider.

The three main generic strategies to generate more revenue in an industry are (i) to increase the retention period of customers, (ii) to acquire new customers, and (iii) to up-sell to existing customers Wei and Chiu (2002). In fact, customer retention is believed to be the most profitable strategy, as customer turnover severely hits the company’s income and its marketing expenses Amin et al. (2016).

Churn is an inevitable result of a customer’s long-term dissatisfaction with the company’s services. Complete withdrawal from a service (provider) on the part of a customer does not happen in a day; rather, the customer’s dissatisfaction, grown over time and exacerbated by the service provider’s lack of attention, culminates in such a drastic step. To prevent this, the service provider must work on the limitations (perceived by the customers) in its services to retain the aggrieved customers. Thus it is highly beneficial for a service provider to be able to identify a customer as a potential churned customer. In this context, non-churn customers are those who are reluctant to move from one service provider to another, in contrast to churn customers.

If a telephone company (TELCO) can predict that a customer is likely to churn, then it can potentially cater targeted offerings to that customer to reduce his/her dissatisfaction, increase his/her engagement and thus potentially retain him/her. This has a clear positive impact on revenue. Additionally, customer churn adversely affects the company’s fame and branding. As such, churn prediction is a very important task, particularly in the telecom sector. To this end, TELCOs generally maintain detailed records of their customers to understand their standing and to anticipate how long they will continue using the services. Since the expense of acquiring new customers is relatively high Lu (2002); Hadden et al. (2008), TELCOs nowadays principally focus on retaining their long-term customers rather than acquiring new ones. This makes churn prediction essential in the telecom sector Keramati et al. (2014); Xie et al. (2009). With the above backdrop, in this paper, we revisit the customer churn prediction (CCP) problem as a binary classification problem in which all of the customers are partitioned into two classes, namely, Churn and Non-Churn.

1.1 Brief Literature review

The problem of CCP has been tackled using various approaches including machine learning models, data mining methods, and hybrid techniques. Several Machine Learning (ML) and data mining approaches (e.g., Rough set theory Amin et al. (2016, 2015), Naïve Bayes and Bayesian networks Kirui et al. (2013), Decision tree Hung et al. (2006); De Caigny et al. (2018), Logistic regression De Caigny et al. (2018), RotBoost Idris and Khan (2012), Support Vector Machine (SVM) Renjith (2017), Genetic algorithm based neural network Pendharkar (2009), AdaBoost ensemble learning Idris et al. (2017), etc.) have been proposed for churn prediction in the TCI using customer relationship management (CRM) data. Notably, CRM data is widely used in prediction and classification problems Huang et al. (2010). A detailed literature review considering all these works is beyond the scope of this paper; however, we briefly review some of the most relevant papers below.

Brandusoiu et al. (2016) presented a data mining based approach for prepaid customer churn prediction. To reduce the data dimension, the authors applied Principal Component Analysis (PCA). Three machine learning classifiers were used, namely, Neural Networks (NN), Support Vector Machine (SVM), and Bayes Networks (BN), to predict churn customers. He et al. (2009) proposed a model based on Neural Networks (NN) to tackle the CCP problem in a large Chinese TELCO that had about a million customers. Idris et al. (2012) proposed a technique combining genetic programming with AdaBoost to model the churn problem in the TCI. Huang et al. (2015) studied the problem of CCP in a big data platform; the aim of the study was to show that big data significantly improves the performance of churn prediction using the Random Forest classifier.

Makhtar et al. (2017) proposed a rough set theory based model for churn prediction in TELCO. Amin et al. (2016), on the other hand, focused on tackling the data imbalance issue in the context of CCP in TELCO and compared six unique sampling strategies for oversampling. Burez and Van den Poel (2009) also studied the issue of unbalanced datasets in churn prediction models and conducted a comparative study of different methods for tackling the data imbalance issue. Hybrid strategies have also been used for processing massive amounts of customer information together with regression techniques that provide effective churn prediction results Qureshi et al. (2013). On the other hand, Etaiwi et al. (2017) showed that their Naïve Bayes model was able to beat SVM in terms of precision, recall, and F-measure.

To the best of our knowledge, an important limitation in this context is that most of the methods in the literature have been experimented on a single dataset. Also, the impact of data transformation methods on CCP models has not been investigated deeply. There are various DT methods, such as Log, Rank, Z-score, Discretization, Min-max, Box-cox, Arcsine and so on. Among these, researchers have broadly used the Log, Z-score, and Rank DT methods in different domains (e.g., software metrics normality and maintainability Zhang et al. (2013); Zhang et al. (2017), defect prediction Fukushima et al. (2014), dimensionality reduction Fukushima et al. (2014), etc.). To the best of our knowledge, there is only one work in the literature where DT methods have been applied in the context of CCP in TELCO Amin et al. (2018), where only two DT methods (Log and Rank) and a single classifier (Naïve Bayes) were leveraged. Therefore, there is considerable room for improvement in this context, which we address in this work.

1.2 Our Contributions

This paper makes the following key contributions:

  • We develop customer churn prediction models that leverage various data transformation (DT) methods and various optimized machine learning algorithms. In particular, we combine six different DT methods with eight different optimized classifiers to develop a number of models to handle the CCP problem. The DT methods we utilize are: Log, Rank, Box-cox, Z-score, Discretization and Weight-of-evidence (WOE). On the other hand, the classification algorithms we use include K-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DTree), Gradient Boosting (GB), Feed-Forward Neural Networks (FNN) and Recurrent Neural Networks (RNN).

  • We have conducted extensive experiments on three different publicly available datasets and evaluated our models using various information retrieval metrics, such as AUC, precision, recall and F-measure. Our models achieved promising results, and we conclusively found that the DT methods have a positive impact on CCP models.

  • We also conduct statistical tests to check whether our findings are statistically significant or not. Our results clearly indicate that the impact of DT methods on the classifiers is not only positive but also statistically significant.

2 Materials and Methods

2.1 Datasets

We use three publicly available benchmark datasets (referred to as Dataset-1, 2 and 3 henceforth) that are broadly used for the CCP problem in the telecommunication area. Table 1 describes these three datasets.

Description                     Dataset-1   Dataset-2   Dataset-3
No. of samples                  100000      5000        3333
No. of attributes               101         20          21
No. of class labels             2           2           2
Percentage churn samples        50.43       85.86       85.5
Percentage non-churn samples    49.56       14.14       14.5
Source of the dataset           URL-1       URL-2       URL-3

URL-1: https://www.kaggle.com/abhinav89/telecom-customer/data (last access: September 29, 2019).
URL-2: https://data.world/earino/churn (last access: February 10, 2020).
URL-3: https://www.kaggle.com/becksddf/churn-in-telecoms-dataset/data (last access: February 10, 2020).

Table 1: Summary of datasets

2.1.1 Data preprocessing

We apply the following essential data preprocessing steps:

  • We ignore the sample IDs and/or descriptive texts which are used only for informational purposes.

  • Redundant attributes are removed.

  • Missing numerical values are replaced with zero (0) and missing categorical values are treated as a separate category.

  • We normalize the categorical values (such as ‘yes’/‘no’, ‘true’/‘false’) into 0s and 1s, where each value represents the corresponding category Amin et al. (2015). A label encoder is used to normalize the categorical attributes, as sketched below.
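The preprocessing steps above can be realized in a few lines of pandas/scikit-learn code. The following is a minimal sketch under our own assumptions; the column name customer_id is a hypothetical stand-in, not an actual attribute of the three datasets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preprocessing steps described above."""
    # Drop ID / purely descriptive columns ("customer_id" is hypothetical).
    df = df.drop(columns=["customer_id"], errors="ignore")
    # Remove redundant attributes (columns that duplicate another column).
    df = df.loc[:, ~df.T.duplicated()]
    # Missing numerical values -> 0; missing categoricals -> own category.
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns
    df[num_cols] = df[num_cols].fillna(0)
    df[cat_cols] = df[cat_cols].fillna("missing")
    # Label-encode the categorical attributes ('yes'/'no' -> 1/0, etc.).
    for col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```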

2.2 Data Transformation (DT) Methods

Data transformation refers to the application of a deterministic mathematical function to each point in a data set. Table 2 provides a description of the DT methods leveraged in our research.

DT Method: Log
Each variable x is replaced with log(x), where the base of the log is left up to the analyst Zhang et al. (2017); Menzies et al. (2007); Feng et al. (2014). In this study, since a feature may contain zero values, a constant 1 is added and the natural logarithm is used:

    x' = \ln(x + 1)    (1)

where x is the value of any feature variable of the original dataset.

DT Method: Rank
It is a statistically calculated rank value Zhang et al. (2017); Bishara and Hittner (2015). In this research, we followed the study of Zhang et al. (2017) to transform the initial values of every feature in the original dataset into ten (10) ranks, using each 10th percentile of the given feature’s values:

    x' = i \ \text{if} \ p_{10(i-1)} < x \le p_{10i}, \quad i = 1, \dots, 10    (2)

where p_k is the k-th percentile of the corresponding metric, with p_0 = -\infty and p_{100} = +\infty.

DT Method: Box-Cox
It is a lambda-based power transformation method Zhang et al. (2017); Feng et al. (2014). This transformation method is a process to transform non-normal dependent feature values into a normal distribution:

    x' = \begin{cases} (x^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}    (3)

where \lambda is configurable by the analyst and x is the given value of any feature of the initial dataset. The value of \lambda ranges from -5 to +5; in this study, we used \lambda = 0.5.

DT Method: Z-score
It indicates the distance of a data point from the mean in units of the standard deviation Cheadle et al. (2003):

    x' = (x - \mu)/\sigma    (4)

where x is the given value of any feature of the original dataset, and \mu and \sigma are the mean and standard deviation of that feature.

DT Method: Discretization
It is a binning technique Fayyad and Irani (1992). For continuous variables, four widely used discretization techniques are K-means, equal width, equal frequency, and decision tree based discretization. We used the equal width discretization technique, which is a very simple method. For any given continuous variable x, provided x_{\min} is the minimum of a selected feature and x_{\max} is the maximum, the bin width w can be computed as

    w = (x_{\max} - x_{\min})/b    (5)

Hence, the discretization technique generates b bins with boundaries at x_{\min} + i \cdot w, where i = 1, 2, \dots, (b-1); b is a parameter chosen by the analyst.

DT Method: Weight-of-evidence (WOE)
It is a binning and logarithm based transformation Siddiqi (2005). In most cases, WOE mitigates skew in the data distribution. WOE is the natural logarithm (ln) of the distribution of the good events (1) divided by the distribution of the bad events (0):

    \mathrm{WOE} = \ln\left(\frac{\text{distribution of good events}}{\text{distribution of bad events}}\right)    (6)
Table 2: List of data transformation methods.
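To make the transformations in Table 2 concrete, the sketch below implements the Log, Rank, Box-Cox, Z-score and WOE transformations for a single feature column. This is our own illustrative rendering of Equations (1)-(6), not the paper's released code (available in the linked repository); the shift applied before Box-Cox and the smoothing constant eps in the WOE function are our own guards, not part of the original definitions.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

def log_transform(x: pd.Series) -> pd.Series:
    # Eq. (1): natural log, with +1 to handle zero-valued features.
    return np.log(x + 1)

def rank_transform(x: pd.Series) -> pd.Series:
    # Eq. (2): map each value to one of ten ranks via 10th-percentile cuts.
    return pd.qcut(x.rank(method="first"), q=10, labels=False) + 1

def boxcox_transform(x: pd.Series, lam: float = 0.5) -> np.ndarray:
    # Eq. (3): power transform with lambda = 0.5; scipy requires strictly
    # positive input, so we shift the feature if needed (our own guard).
    shifted = x - x.min() + 1 if (x <= 0).any() else x
    return boxcox(shifted, lmbda=lam)

def zscore_transform(x: pd.Series) -> pd.Series:
    # Eq. (4): distance from the mean in units of standard deviation.
    return (x - x.mean()) / x.std()

def woe_transform(x_binned: pd.Series, y: pd.Series, eps: float = 0.5) -> pd.Series:
    # Eq. (6): ln(distribution of good events / distribution of bad events),
    # computed per bin and mapped back onto the observations.
    good, bad = y.sum(), len(y) - y.sum()
    counts = pd.crosstab(x_binned, y)
    woe = np.log(((counts.get(1, 0) + eps) / good) /
                 ((counts.get(0, 0) + eps) / bad))
    return x_binned.map(woe)
```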

2.3 Evaluation Measures

The confusion matrix is generally used to assess the overall performance of a predictive model. For the CCP problem, the individual components of confusion matrix is defined as follows: (i) True Positives (TP): correctly predicted churn customers (ii) True Negatives (TN): correctly predicted non-churn customers (iii) False Positives (FP): non-churn customers, miss-predicted as churned customers and (iv) False Negatives (FN): churn customers, miss-predicted as non-churn customers. We use the following popular evaluation measures for comparing the performance of the models.

Precision : Mathematically precision can be expressed as:

(7)

The probability of detection (POD)/ Recall:

POD or recall is a valid choice of evaluation metric when we want to capture as many true churn customers as possible. Mathematically POD can be expressed as:

(8)

The probability of false alarm (POF): The value of POF should be small as much as possible (in an ideal case, the value of POF is 0 ). Mathematically POF can be defined as:

(9)

We use POF for measuring incorrect churn predictions.

The area under the curve (AUC): Both POF and POD are used to measure the AUC Zhang et al. (2017) Amin et al. (2019). A higher AUC value indicates a higher performance of the model. Mathematically AUC can be expressed as:

(10)

F-Measure:

The F-measure is the harmonic mean of the precision and recall. F-measure is needed when we want to seek a balance between precision and recall. A perfect model has an F-measure of 1. The Mathematical formula of F-measure is defined below.

(11)
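For illustration, Equations (7)-(11) translate directly into code. The sketch below assumes the POD/POF-based AUC estimate reconstructed in Equation (10); the example counts are invented.

```python
def churn_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation measures of Section 2.3 from raw counts."""
    precision = tp / (tp + fp)            # Eq. (7)
    pod = tp / (tp + fn)                  # Eq. (8): recall / prob. of detection
    pof = fp / (fp + tn)                  # Eq. (9): probability of false alarm
    auc = (1 + pod - pof) / 2             # Eq. (10): POD/POF-based estimate
    f_measure = 2 * precision * pod / (precision + pod)  # Eq. (11)
    return {"precision": precision, "recall": pod,
            "pof": pof, "auc": auc, "f_measure": f_measure}

# Example: 80 churners caught, 15 missed, 10 false alarms, 95 true negatives.
print(churn_metrics(tp=80, tn=95, fp=10, fn=15))
```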
Key     Classifier                      Model type                      Description
KNN     K-Nearest Neighbor              Instance-based, lazy learning   KNN assumes that similar things exist in close proximity.
NB      Naïve Bayes                     Gaussian, probabilistic         NB is a family of probabilistic algorithms; it gives the conditional probability based on Bayes’ theorem.
LR      Logistic Regression             Statistical model               LR estimates the parameters of a logistic model (a form of binary regression).
RF      Random Forest                   Trees                           RF is an ensemble tree-based learning algorithm.
DTree   Decision Tree                   Trees                           DTree builds classification or regression models in the form of a tree structure.
GB      Gradient Boosting               Trees                           GB is an ensemble tree-based boosting method.
FNN     Feed-Forward Neural Networks    Deep learning                   FNN is a deep learning classifier where the input travels in one direction.
RNN     Recurrent Neural Networks       Deep learning                   RNN is a deep learning classifier where the output from the previous step is fed as input to the current step.

Table 3: List of baseline classifiers.

2.4 Optimized CCP models

The baseline classifiers used in our research are presented in Table 3. To examine the effect of the DT methods, we apply them to the original datasets and subsequently, on the transformed data, we train our CCP models with the machine learning classifiers (KNN, NB, LR, RF, DTree, GB, FNN and RNN) listed in Table 3.

2.4.1 Validation method and steps

In all our experiments, the classifiers of the CCP models were trained and tested using 10-fold cross-validation on the three different datasets described in Table 1. Firstly, a RAW data based CCP model was constructed without leveraging any of the DT methods on any features of the original datasets. In this case, we did not apply any feature selection steps either. However, we used the best hyper-parameters for the classifiers.

Subsequently, we applied a DT method on each attribute of the dataset and retrained our models on the transformed dataset. We experimented with each of the DT methods listed in Table 2. For each DT based model, we also used a feature selection and optimization procedure, which is described in the following section.

Figure 1: Flowchart of the Optimized CCP model using data transformation methods.

2.4.2 Feature Selection and Optimization

We have a set of hyper-parameters, and we aim to find the combination of their values that optimizes the objective function. For tuning the hyper-parameters, we applied grid search Syarif et al. (2016). Figure 1 illustrates the overall flowchart of our proposed optimized CCP model. First, we applied the necessary preprocessing steps on the datasets. Then, the DT methods (Log, Rank, Box-cox, Z-score, Discretization, and WOE) were applied thereon. Next, we used the univariate feature selection technique to select the highest scored features from the dataset (we selected the top 80 features for dataset-1 and the top 15 features for both dataset-2 and dataset-3). We applied grid search to find the best hyper-parameters for the individual classifier algorithms. Finally, 10-fold cross-validation was employed to train and validate the models, as sketched below.
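A minimal scikit-learn sketch of this selection-and-tuning step for a single classifier follows; the parameter grid and the random forest settings are illustrative placeholders, not the exact values tuned in our experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Univariate selection (top-k scored features) followed by grid search,
# validated with the same 10-fold scheme used throughout the paper.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=15)),  # k=80 for dataset-1
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {  # illustrative grid, not the paper's exact search space
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=10, n_jobs=-1)
# search.fit(X_transformed, y)  # X_transformed: the DT-applied feature matrix
```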

3 Stability measurement tests

We used the Friedman non-parametric statistical test (FMT) Demšar (2006) to examine the reliability of the findings and whether the improvements achieved by the DT based classification models are statistically significant. The Friedman test is a non-parametric statistical test for analyzing and finding differences in treatments across multiple attempts Demšar (2006). It does not assume any particular distribution of the data. The Friedman test ranks all the methods, ranking the classifiers independently for each dataset; a lower rank indicates a better performer. We performed the Friedman test on the F-measure results. Here, the null hypothesis H0 represents: “there is no difference among the performances of the CCP models”. In our experiments, the test was carried out with the significance level α = 0.05.

Subsequently, the post hoc Holm test is conducted to perform the paired comparisons with respect to the best performing DT model. In particular, when the null hypothesis is rejected, we use the post hoc Holm test to compare the performance of the models. This test is a similarity measurement process that compares all the models. We performed Holm’s post hoc comparison for α = 0.05 and α = 0.10.
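Both tests can be reproduced with standard Python libraries. The sketch below assumes an F-measure matrix with one row per classifier/dataset combination and one column per DT method; note that we use Wilcoxon signed-rank pairs with Holm correction as a stand-in for the exact rank-based Holm procedure applied in the paper, and the random matrix is only a placeholder for the real results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# One row per (classifier, dataset) block: 8 classifiers x 3 datasets.
methods = ["RAW", "LOG", "RANK", "BOX-COX", "Z-SCORE", "DISCR", "WOE"]
rng = np.random.default_rng(0)
f1 = rng.random((24, len(methods)))  # placeholder for the real F-measures

stat, p = friedmanchisquare(*[f1[:, j] for j in range(len(methods))])
print(f"Friedman chi-square = {stat:.4f}, p-value = {p:.5f}")

# Post hoc: pairwise comparisons against the best-ranked method,
# Holm-adjusted at alpha = 0.05.
best = int(np.argmax(f1.mean(axis=0)))
others = [j for j in range(len(methods)) if j != best]
pvals = [wilcoxon(f1[:, best], f1[:, j]).pvalue for j in others]
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for j, pv, rej in zip(others, p_holm, reject):
    print(f"{methods[best]} vs. {methods[j]}: p = {pv:.5f}, rejected = {rej}")
```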

4 DT methods and Data Distribution

Data transformation attempts to change the data from one representation to another to enhance its quality, with the goal of enabling the analysis of certain information for specific purposes. In order to find out the impact of the DT methods on the datasets, data skewness and data normality measurement tests have been performed on the three different datasets, and the results are visualized through Q-Q (quantile-quantile) plots Amin et al. (2019); Zhang et al. (2017).
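Such Q-Q plots can be generated with SciPy and matplotlib; the following is a minimal sketch for comparing one feature before and after a transformation.

```python
import matplotlib.pyplot as plt
from scipy import stats

def qq_compare(raw, transformed, feature_name="feature"):
    """Side-by-side Q-Q plots against the normal distribution."""
    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    stats.probplot(raw, dist="norm", plot=axes[0])
    axes[0].set_title(f"{feature_name} (RAW)")
    stats.probplot(transformed, dist="norm", plot=axes[1])
    axes[1].set_title(f"{feature_name} (transformed)")
    plt.tight_layout()
    plt.show()
```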

4.0.1 Coding and Experimental Environment

All experiments were conducted on a Windows 10 64-bit machine with an Intel Core i7 3.6GHz processor, 24GB RAM, and a 500GB hard disk. All code was implemented in Python 3.7, using Jupyter Notebook. All data and code are available at the following link: https://github.com/joysana1/Churn-prediction.

5 Results

The impact of the DT methods on all 8 classifiers (through rigorous experimentation on the 3 benchmark datasets) is illustrated in Figures 2 through 9. Each of these figures shows the performance comparison (in terms of AUC, precision, recall, and F-measure) between the RAW data based CCP model and the DT method based CCP models for all three datasets (please see Table 4 for a map for understanding the figures). Tables 7, 8 and 9 in the supplementary file report the values of all the measures for all the datasets.

Figure 2: Performance comparison among the DT methods using the KNN classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 3: Performance comparison among the DT methods using the NB classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 4: Performance comparison among the DT methods using the RF classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 5: Performance comparison among the DT methods using the LR classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 6: Performance comparison among the DT methods using the FNN classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 7: Performance comparison among the DT methods using the RNN classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 8: Performance comparison among the DT methods using the DTree classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure 9: Performance comparison among the DT methods using the GB classifier: (a) Dataset-1, (b) Dataset-2, (c) Dataset-3
Figure   Classifier
2        KNN
3        NB
4        RF
5        LR
6        FNN
7        RNN
8        DTree
9        GB

In every figure, sub-figures (a), (b) and (c) show the results on Dataset-1, Dataset-2 and Dataset-3, respectively.
Table 4: Map of the results illustrated in different figures

5.1 Results on Dataset 1

The performance of the baseline classifiers (referred to as RAW in the figures) on dataset 1 is quite poor across all metrics: the best performer in terms of F-measure is NB, with a value of only 0.636. Interestingly, not all DT methods performed better than RAW. However, the performance of WOE is consistently better than RAW across all classifiers. In a few cases, of course, some other DT method is able to outperform WOE: for example, across all combinations on Dataset 1, the best individual performance is achieved by FNN with Z-SCORE, with a staggering F-measure of 0.917. As for AUC, the most consistent performer is again WOE, with the best value achieved for FNN (0.802).

5.2 Results on Dataset 2

Interestingly, the performance of some baseline classifiers on Dataset 2 is quite impressive, particularly in the context of AUC. For example, both DTree and GB (RAW versions) achieved an AUC above 0.82; the F-measure was also acceptable, particularly for GB (0.78).

Among the DT methods, again, WOE performs most consistently (in terms of F-measure), albeit with the glitch that for DTree and GB it performs slightly worse than RAW. In fact, surprisingly enough, for GB the best performer is RAW; for DTree, however, Z-SCORE is the winner, very closely followed by BOX-COX.

5.3 Results on Dataset 3

On Dataset 3 as well, the performance of DTree and GB in RAW mode is quite impressive: for DTree, the AUC and F-measure values are 0.84 and 0.727, respectively, and for GB they are even better, 0.86 and 0.809, respectively. Again, the performance of WOE is the most consistent, except in the cases of DTree and GB, where it is beaten by RAW. The overall winner is GB with the LOG transformation, which registers an AUC of 0.864 and an F-measure of 0.818.

6 Statistical test results

Algorithm        Rank (#Position)
WOE              2.4167 (#1)
Z-SCORE          3.5417 (#2)
RAW              3.7917 (#3)
Discretization   4.0833 (#4)
BOX-COX          4.1667 (#5)
RANK             4.9375 (#6)
LOG              5.0625 (#7)
Table 5: Average rankings of the algorithms (Friedman test; lower is better)

Table 5 summarizes the ranking produced by the Friedman test among the DT methods. The Friedman statistic, distributed according to the Chi-square distribution with (k - 1) degrees of freedom, is 24.700893, where k is the number of methods (here, k = 7). The p-value computed by the Friedman test is 0.00039. From the Chi-square distribution table, the critical value is 12.59. Notably, a 95% confidence interval (CI) has been considered for this test. Our Friedman test statistic (24.700893) is greater than the critical value (12.59), so the decision is to reject the null hypothesis H0. Subsequently, the post hoc Holm test revealed significant differences among the DT methods. Figure 10 illustrates the results of Holm’s test as a heat map, where the p-values are taken as the evidence of significance. Figure 10 shows that WOE’s performance is significantly different from that of the other DT methods, except for Z-SCORE. Table 6 reflects the post hoc comparisons for α = 0.05 and α = 0.10. When the p-value of the test is smaller than the significance level (α = 10% or 5%), Holm’s procedure rejects the null hypothesis. Evidently, the WOE DT based models are found to be significantly better than the other models.

Figure 10: Performance difference heatmap among DT based CCP models in terms of p-value
     Method                  p-value    Hypothesis (α = 0.05)   Hypothesis (α = 0.10)
1    WOE vs. LOG             0.000022   Rejected                Rejected
2    WOE vs. RANK            0.000053   Rejected                Rejected
3    WOE vs. BOX-COX         0.005012   Rejected                Rejected
4    WOE vs. Discretization  0.007526   Rejected                Rejected
5    WOE vs. RAW             0.027461   Rejected                Rejected
6    WOE vs. Z-SCORE         0.071229   Not Rejected            Rejected
Table 6: Friedman and Holm test results

7 Impact of the DT methods on Data Distribution

The Q-Q plots are shown in Figures 11, 12 and 13 for Dataset-1, Dataset-2 and Dataset-3, respectively. As we found that the WOE and Z-Score DT methods perform better than the RAW (without DT) approach (see the Friedman rank table, Table 5), we generated Q-Q plots only for the RAW, WOE, and Z-Score methods. In each Q-Q plot, the first 3 features of the respective dataset are shown. From the Q-Q plots, it is observed that after transformation by the WOE DT method, we achieve less skewness (i.e., the data becomes more normally distributed). Normally distributed data is beneficial for the classifiers Amin et al. (2019); Coussement et al. (2017). Similar behavior is also observed for Z-SCORE.

8 Discussion

From the comparative analysis and statistical tests, it is evident that DT methods have a great impact on improving CCP performance in TELCO. A few prior works (e.g., Zhang et al. (2013), Zhang et al. (2017), and Amin et al. (2018)) also studied the effect of DT methods, but on a limited scale and without considering the optimization issues. We, on the other hand, conducted a comprehensive study considering six DT methods and eight machine learning classifiers on three different benchmark datasets. The performance of the DT based classifiers has been investigated in terms of AUC, precision, recall, and F-measure.

The data transformation techniques have shown great promise in improving the quality of the data distribution in general. Specifically, in our experiments, the WOE method improved the data normality, which in turn had a clear positive impact on the prediction performance for customer churn prediction (Figures 11 - 13).

The comparative analyses involving the RAW based and DT based CCP models clearly suggest the potential of DT methods in improving CCP performance (Figures 2 through 9). In particular, our experimental results strongly suggest that the WOE method contributed the most towards improving the performance, albeit with the exception of the DTree and GB classifiers on datasets 2 and 3. While the performance of WOE in these cases was satisfactory, it failed to outperform the RAW based models. We hypothesize that this is due to the binning technique within the WOE method. Moreover, those two datasets are unbalanced, and the DTree and GB classifiers might treat the binned values as ordered even though the bins carry no meaningful order.

From Table 5, we notice that WOE is the best ranked method, with a rank value of 2.4167. The post hoc comparison heatmap (Figure 10) and Table 6 reflect how WOE is better than the other methods. As the Friedman test rejects the null hypothesis and the post hoc Holm analysis advocates the WOE method’s supremacy, it is clear that DT methods significantly improve customer churn prediction performance for the telecommunication industry. Therefore, to construct a successful CCP model, we recommend selecting one of the best performing classifiers (LR, FNN) together with the WOE data transformation method.
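Putting this recommendation together, the sketch below wires a WOE encoding and an LR classifier into a single cross-validated pipeline. It relies on the third-party category_encoders package for the WOE step, which is our own substitution; the paper's actual implementation is available in the linked GitHub repository.

```python
from category_encoders import WOEEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# WOE-transform the features, then classify with logistic regression,
# evaluated with the paper's 10-fold cross-validation protocol.
ccp_model = Pipeline([
    ("woe", WOEEncoder()),                   # assumes category_encoders
    ("lr", LogisticRegression(max_iter=1000)),
])
# scores = cross_val_score(ccp_model, X, y, scoring="f1", cv=10)
# print(scores.mean())
```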

9 Conclusion

Predicting customer churn is one of the most important factors in business planning for TELCOs. To improve churn prediction performance, we investigated six different data transformation methods, namely, Log, Rank, Box-cox, Z-score, Discretization, and Weight-of-evidence. We used eight different machine learning classifiers: K-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DTree), Gradient Boosting (GB), Feed-Forward Neural Networks (FNN), and Recurrent Neural Networks (RNN). For each classifier, we applied a univariate feature selection method to select the top ranked features and used grid search for hyper-parameter tuning. We evaluated our methods in terms of AUC, precision, recall, and F-measure. The experimental outcomes indicate that, in most cases, the data transformation methods enhance the data quality and improve the prediction performance. To support our experimental results, we performed the Friedman non-parametric statistical test and the post hoc Holm statistical analysis, which confirmed that the Weight-of-evidence and Z-score DT based CCP models perform better than the RAW data based CCP model. To test the robustness of our DT-augmented CCP models, we performed our experiments on both a balanced dataset (dataset-1) and unbalanced datasets (dataset-2 and dataset-3).

CCP remains a hard and swiftly evolving problem for competitive businesses in general and for telecommunication companies in particular. Future research may offer better outcomes on other datasets with multiple classifiers. Another future direction is to extend this study with other types of data transformation approaches and classifiers. Our proposed model can be tested on other telecom datasets to examine the generalization of our results at a larger scale. Last but not least, our approach can be extended to customer churn datasets from other business sectors to study the generalization of our claim across business domains.

Figure 11: Q-Q plots for the WOE and Z-Score DT methods and the RAW (no DT) data on dataset-1
Figure 12: Q-Q plots for the WOE and Z-Score DT methods and the RAW (no DT) data on dataset-2
Figure 13: Q-Q plots for the WOE and Z-Score DT methods and the RAW (no DT) data on dataset-3

References

  • A. Amin, B. Shah, A. M. Khattak, T. Baker, H. u. Rahman Durani, and S. Anwar (2018) Just-in-time customer churn prediction: with and without data transformation. In 2018 IEEE Congress on Evolutionary Computation (CEC), pp. 1–6. Cited by: §1.1, §8.
  • A. Amin, S. Anwar, A. Adnan, M. Nawaz, K. Aloufi, A. Hussain, and K. Huang (2016) Customer churn prediction in telecommunication sector using rough set approach. Neurocomputing, pp. . External Links: Document Cited by: §1.1, §1.
  • A. Amin, S. Anwar, A. Adnan, M. Nawaz, N. Howard, J. Qadir, A. Hawalah, and A. Hussain (2016) Comparing oversampling techniques to handle the class imbalance problem: a customer churn prediction case study. IEEE Access PP, pp. 7940–7957. External Links: Document Cited by: §1.1.
  • A. Amin, B. Shah, A. M. Khattak, F. J. Lopes Moreira, G. Ali, A. Rocha, and S. Anwar (2019) Cross-company customer churn prediction in telecommunication: a comparison of data transformation methods. International Journal of Information Management 46, pp. 304 – 319. External Links: ISSN 0268-4012, Document, Link Cited by: §2.3, §4, §7.
  • A. Amin, S. Shehzad, C. Khan, and S. Anwar (2015) Churn prediction in telecommunication industry using rough set approach. Vol. 572, pp. 83–95. External Links: Document Cited by: §1.1, item 4.
  • A. J. Bishara and J. B. Hittner (2015) Reducing bias and error in the correlation coefficient due to nonnormality. Educational and Psychological Measurement 75 (5), pp. 785–804. External Links: Document Cited by: Table 2.
  • I. Brandusoiu, G. Toderean, and H. Beleiu (2016) Methods for churn prediction in the pre-paid mobile telecommunications industry. pp. 97–100. External Links: Document Cited by: §1.1.
  • J. Burez and D. Van den Poel (2009) Handling class imbalance in customer churn prediction. Expert Systems with Applications 36 (3 PART 1), pp. 4626–4636. External Links: Document, ISSN 09574174, Link Cited by: §1.1.
  • C. Cheadle, M. Vawter, W. Freed, and K. Becker (2003) Analysis of microarray data using z-score transformation. The Journal of molecular diagnostics : JMD 5 (2), pp. 73–81. External Links: Document Cited by: Table 2.
  • K. Coussement, S. Lessmann, and G. Verstraeten (2017) A comparative analysis of data preparation algorithms for customer churn prediction: a case study in the telecommunication industry. Decision Support Systems 95, pp. 27 – 36. External Links: ISSN 0167-9236, Document, Link Cited by: §7.
  • A. De Caigny, K. Coussement, and K. De Bock (2018) A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. European Journal of Operational Research 269, pp. . External Links: Document Cited by: §1.1.
  • J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1–30. External Links: ISSN 15337928 Cited by: §3.
  • W. Etaiwi, M. Biltawi, and G. Naymat (2017) Evaluation of classification algorithms for banking customer’s behavior under apache spark data processing system. Procedia Computer Science 113, pp. 559 – 564. External Links: ISSN 1877-0509, Document, Link Cited by: §1.1.
  • U. M. Fayyad and K. B. Irani (1992) On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8 (1), pp. 87–102. Cited by: Table 2.
  • C. Feng, W. Hongyue, N. Lu, T. Chen, H. He, Y. Lu, and X. Tu (2014) Log-transformation and its implications for data analysis. Shanghai archives of psychiatry 26 (2), pp. 105–9. External Links: Document Cited by: Table 2.
  • T. Fukushima, Y. Kamei, S. McIntosh, K. Yamashita, and N. Ubayashi (2014) An empirical study of just-in-time defect prediction using cross-project models. Empirical Software Engineering 21, pp. 172–181. External Links: ISBN 978-1-4503-2863-0, Document Cited by: §1.1.
  • J. Hadden, A. Tiwari, R. Roy, and D. Ruta (2008) Churn prediction: Does technology matter. World Academy of Science, Engineering and Technology (16), pp. 973–979. External Links: Link Cited by: §1.
  • Y. He, Z. He, and D. Zhang (2009) A study on prediction of customer churn in fixed communication network based on data mining. pp. 92–94. External Links: Document Cited by: §1.1.
  • B.Q. Huang, T.-M. Kechadi, B. Buckley, G. Kiernan, E. Keogh, and T. Rashid (2010) A new feature set with new window techniques for customer churn prediction in land-line telecommunications. Expert Systems with Applications 37 (5), pp. 3657 – 3665. External Links: ISSN 0957-4174, Document, Link Cited by: §1.1.
  • Y. Huang, F. Zhu, M. Yuan, K. Deng, Y. Li, B. Ni, W. Dai, Q. Yang, and J. Zeng (2015) Telco churn prediction with big data. pp. 607–618. External Links: Document Cited by: §1.1.
  • S. Hung, D. Yen, and H. Wang (2006) Applying data mining to telecom chum management. Expert Systems with Applications 31, pp. 515–524. External Links: Document Cited by: §1.1.
  • A. Idris, A. Iftikhar, and Z. Rehman (2017) Intelligent churn prediction for telecom using gp-adaboost learning and pso undersampling. Cluster Computing 22, pp. 7241–7255. Cited by: §1.1.
  • A. Idris, A. Khan, and Y. S. Lee (2012) Genetic programming and adaboosting based churn prediction for telecom. pp. 1328–1332. External Links: ISBN 978-1-4673-1713-9, Document Cited by: §1.1.
  • A. Idris and A. Khan (2012) Customer churn prediction for telecommunication: employing various features selection techniques and tree based ensemble classifiers. pp. 23–27. External Links: ISBN 978-1-4673-2249-2, Document Cited by: §1.1.
  • A. Keramati, R. Jafari-Marandi, M. Aliannejadi, I. Ahmadian, M. Mozaffari, and U. Abbasi (2014) Improved churn prediction in telecommunication industry using data mining techniques. Applied Soft Computing 24, pp. 994 – 1012. External Links: ISSN 1568-4946, Document, Link Cited by: §1.
  • C. Kirui, L. Hong, W. Cheruiyot, and H. Kirui (2013) Predicting customer churn in mobile telephony industry using probabilistic classifiers in data mining. IJCSI Int. J. Comput. Sci. Issues 10, pp. 165–172. Cited by: §1.1.
  • J. Lu (2002) Predicting Customer Churn in the Telecommunications Industry –– An Application of Survival Analysis Modeling Using SAS. Techniques 114-27, pp. 114–27. External Links: Link Cited by: §1.
  • M. Makhtar, s. Nafis, M. A. Mohamed, M. K. Awang, M.N.A. Rahman, and M. Mat Deris (2017) Churn classification model for local telecommunication company based on rough set theory. Journal of Fundamental and Applied Sciences 9 (6), pp. 854–68. External Links: Document Cited by: §1.1.
  • T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald (2007) Problems with precision: a response to ”comments on ’data mining static code attributes to learn defect predictors’”. IEEE Transactions on Software Engineering 33 (9), pp. 637–640. Cited by: Table 2.
  • M. Óskarsdóttir, C. Bravo, W. Verbeke, C. Sarraute, B. Baesens, and J. Vanathien (2017) Social network analytics for churn prediction in telco: model building, evaluation and network architecture. Expert Systems with Applications 85, pp. . External Links: Document Cited by: §1.
  • P. C. Pendharkar (2009) Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Systems with Applications 36 (3, Part 2), pp. 6714 – 6720. External Links: ISSN 0957-4174, Document, Link Cited by: §1.1.
  • S. Renjith (2017) B2C e-commerce customer churn management: churn detection using support vector machine and personalized retention using hybrid recommendations. International Journal on Future Revolution in Computer Science and Communication Engineering (IJFRCSCE) 3, pp. 34 – 39. External Links: Document Cited by: §1.1.
  • S. S. A. Qureshi, A. Rehman, A. Qamar, A. Kamal, and A. Rehman (2013) Telecommunication subscribers’ churn prediction model using machine learning. pp. 131–136. External Links: Document Cited by: §1.1.
  • N. Siddiqi (2005) Credit risk scorecards: developing and implementing intelligent credit scoring. Cited by: Table 2.
  • I. Syarif, A. Prugel-Bennett, and G. Wills (2016) SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA (Telecommunication Computing Electronics and Control) 14 (4), pp. 1502. External Links: Document Cited by: §2.4.2.
  • C. Wei and I. Chiu (2002) Turning telecommunications call details to churn prediction: a data mining approach. Expert Systems with Applications 23, pp. 103–112. External Links: Document Cited by: §1.
  • Y. Xie, X. Li, E.W.T. Ngai, and W. Ying (2009) Customer churn prediction using improved balanced random forests. Expert Systems with Applications 36 (3, Part 1), pp. 5445 – 5449. External Links: ISSN 0957-4174, Document, Link Cited by: §1.
  • F. Zhang, I. Keivanloo, and Y. Zou (2017) Data transformation in cross-project defect prediction. Empirical Software Engineering 22 (6), pp. 1–33. External Links: Document Cited by: §1.1, §2.3, Table 2, §4, §8.
  • F. Zhang, A. Mockus, Y. Zou, F. Khomh, and A. E. Hassan (2013) How does context affect the distribution of software maintainability metrics?. pp. 350–359. External Links: Document Cited by: §1.1, §8.