Business Intelligence studies have attracted tremendous interest from both academic research and industrial practice. Every decision is made within a decision environment, which is defined as the assemblage of all exogenous responses and activities that affect the loss or reward gained by endogenous decision actions.
Thanks to developments in big data and machine learning, data-driven decision support systems are able to record streaming data on the cloud, gain awareness of the decision environment, and act accordingly to achieve optimal system performance. Decision environment characterizations are usually not available to decision makers and need to be inferred from data collected through different types of equipment, recorded in a variety of database streams, and combined on a cloud platform. Obtaining an ideal characterization of the decision environment is costly, sometimes impossible, which leads to a constrained estimation of the decision environment. Machine learning, as one of the major technology sets in data mining, provides a wealth of methods to recognize decision environment patterns and to predict the current decision environment using point or statistical estimations. Adaptive machine learning further enables the decision maker to gain awareness of the environment on-the-fly while choosing proper decision actions.
E-commerce, as one of the major components of modern business, is under threat from unforeseeable online fraudsters. An earlier paper discussed optimization models built for Microsoft's dynamic e-commerce transaction fraud control system. It pointed out that fraud control decisions of e-commerce merchants should not be made independently, but interactively with external associated decision parties, such as banks and manual review agent teams. Transaction information can be only partially shared among different decision parties due to information confidentiality and privacy; e.g., Microsoft cannot release customers' purchase history to banks, and a bank should not share cardholders' purchase history at other merchants with Microsoft. Figure 1 depicts the dynamic and interactive decision environment of the fraud control decision support system (DSS). After a purchase transaction occurs, the risk scoring engine first evaluates the risk level of the transaction using an estimated risk score (transactions with the same risk score are considered one category of control objectives), and the DSS provides a control action that either approves, reviews, or rejects the purchase request. In other words, the task of the DSS is to decide how to assign control actions to control objectives in different score categories. Approved transactions are sent to the payment issuing bank for an authorization check, and only bank-authorized transactions can be granted final purchase approval. Reviewed transactions are first sent to the payment issuing bank for an authorization check, and bank-authorized transactions are then sent to manual review agents for further risk screening; only transactions that are both bank authorized and manually approved are marked as finally approved. If a purchase request receives a rejection from any of the three decision parties, the transaction is declined and marked as rejected.
Microsoft has observed rapid fluctuations in the decision behavior patterns of banks and manual review teams. For example, for transactions belonging to the same risk category (having the same risk score), the bank approval rate can sometimes vary substantially from time to time. The decision behavior patterns of banks and manual review teams are dynamic and highly correlated with the decision quality of the fraud control DSS. The observed behavior patterns are listed below:
If the recently reported fraud incidence number increases, banks and manual review agents become more conservative in approving purchase transactions, which leads to higher rejection rates;
On the other hand, if the recently reported fraud incidence number decreases, banks and manual review agents become less conservative, which results in lower rejection rates but is likely to let more fraudulent transactions through as a consequence;
In addition, when fewer fraudulent transactions are submitted to manual review teams, fraud patterns become less massive and obvious, so MR teams have a more difficult time detecting fraud.
From the list above, it is not hard to see the importance of studying the interactive behavior patterns among decision parties with respect to the fraud control actions of the e-commerce merchant in order to achieve profit optimality.
One of the most commonly used transaction fraud labels in e-commerce is the "chargeback," which is the return of funds to the credit card holder, initiated by the payment instrument issuing bank to settle a debt. With this type of fraud label, it is fairly common that the true feedback of a decision is inaccessible immediately after the fraud control decision is made. The delay of fraud labels for purchase transactions is due to the fact that it usually takes some time for the legitimate credit card holder to realize that the card has been misappropriated and to file a dispute with the bank. In this case, business intelligence models are not able to capture the most recent fraud patterns and provide the most accurate fraud control decisions. This delay is usually referred to as the data maturity lead time in the big data era. Inaccurate risk decisions made by business intelligence models cause feedback-loop effects that degrade the accuracy of decisions made by all decision-making parties in the decision environment and lead to less profitable outcomes for e-commerce merchants. Being able to estimate the decision environment, as well as its interaction effects, and to use it as a model input is critical for the risk decision system to reach the most profitable decisions. For the fraud control DSS, if we ignore recent data and only use mature data to estimate the behavior patterns of other decision parties, our control policy will already be outdated due to the lag in pattern recognition. If we ignore the data maturity lead time and use partially mature labels to estimate the behavior patterns of other decision parties, we may be misled by recent decision quality and in turn overestimate or underestimate the approval rates of banks or manual review agents. Predictive modeling of the decision environment thus becomes challenging due to delayed information.
The demand for learning the decision environment, as well as the challenge of the data maturity lead time, provides strong motivation for the authors to design proper inference methods to predict the rapidly fluctuating decision environment for the Microsoft e-commerce fraud control DSS. This paper proposes two frameworks, the Current Environment Inference (CEI) and Future Environment Inference (FEI) frameworks, that resolve maturity lead time issues in decision environment prediction. Both frameworks first generate decision environment related features using long-term mature data and short-term partially mature data, and then estimate the decision environment using a variety of learning methods, including linear regression (LR), random forest (RF), gradient boosted tree (GB), artificial neural network (ANN), and recurrent neural network (RNN). The CEI module is designed to use partially mature data (delayed information) to predict the decision environment of the coming decision epoch. The FEI module is designed to further estimate the decision environment of a future decision epoch, to help evaluate the effect of current control decisions on that epoch. Although these frameworks are designed for a fraud control DSS, they can be easily customized for other industrial applications that also face the challenges of data mining with delayed information.
This paper is organized as follows. Section 2 reviews literature related to this research topic. Section 3 first illustrates the structure of partially mature streaming data, and then defines the decision environment in fraud control. Section 4 illustrates two frameworks that predict decision environment patterns for the fraud control DSS. Readers who are interested in the use of these two frameworks may refer to the cited work for more system control operation details. The performance of these two frameworks is tested by implementing them on a portfolio of transaction data from the Microsoft e-commerce database, and all testing results are discussed in Section 5. Section 6 concludes the paper and briefly introduces how to extend the use of the proposed frameworks to other industries that face similar challenges of having only partially mature data.
2 Related Literature
Predictive modeling of the decision environment dates back to the early 1980s, when the importance of estimating the decision environment dynamically, and making decisions accordingly, was first described and emphasized. The value of decision environment estimation for a dynamic medical decision-making problem was studied with a simulated medical system, and the results suggested that decision behaviors were influenced by the features of the decision environment. Following this research, probabilistic representations for decision environment prediction were proposed, in which probabilistic predictions are updated and reinforced using sequential reward/loss returns from the environment. On another research branch, instance-based learning was proposed to predict the decision environment with the help of a similarity-based exemplary database constructed from historical data. To the best of the authors' knowledge, there is currently no existing literature on predictive modeling for e-commerce transaction fraud control. The lack of literature in this field has two reasons: (1) it is not easy for academic researchers to obtain access to e-commerce transaction data, as these data are strictly confidential; (2) conventional fraud control models do not consider the interaction effects of decisions made by other decision parties and assume that decision environment patterns are fixed.
Suppose a dynamic decision environment can be described by a number of attributes whose values vary over time; we can then record series of attribute values with respect to time. If at a given time point the decision environment can be characterized by $n$ attributes, then the values of these attributes can be represented by an $n$-dimensional trajectory. In this way, the dynamic decision environment prediction problem can be modeled as a trajectory prediction problem, which has a rich literature. Trajectory prediction dates back to the 1920s, using classic time series analysis methods; linear prediction methods for time series data, as a special kind of trajectory, have been summarized in surveys. Recent trajectory prediction research, motivated by different applications, considers higher trajectory dimensions with more features, as well as nonlinear relations between these features, and adopts machine learning (artificial neural networks, random forests, gradient boosted trees) and deep learning methodologies. The use of artificial neural networks in trajectory prediction started in the early 1990s, motivated by electricity load prediction. These papers used neural networks to model nonlinear relations between the electricity load in past periods (the trajectory history) and other features, such as temperature and location, and claimed high prediction accuracy with neural network models. A comprehensive review covers other applications using neural networks to predict trajectory-type data. One study compared the performance of classic time series models with a neural network model on real-world price trajectory data, and claimed a significant improvement over ARIMA models, reducing the mean square error by 27-56 percent.
Two papers considered speech trajectories and adopted random forest and gradient boosted tree methods, respectively, to predict the next trajectory value, which was matched against a wording database to predict the next input word. Another paper adopted the random forest method for demand prediction in a water supply system; it constructed a long-term demand trend feature, a short-term demand calibration feature, and other incidence features to increase the prediction accuracy of the coming water demand based on the historical demand trajectories of different locations. Similar ideas have been widely used in the engineering and medical fields (e.g., electricity demand trajectory prediction and disease outbreak incidence trajectory prediction). A recent study conducted an extensive number of numerical performance comparisons between predictive modeling using classic time series forecasting (ARMA and ARIMA) with different parameters and random forests on 16,000 simulated and 135 real temperature trajectories, and observed that the random forest method outperformed the traditional time series methods in most of its tests. One of the pioneering articles introduced recurrent neural networks to trajectory prediction; recurrent neural network methods exploit temporal dependencies in a time sequence and use internal states to model interactions between different time steps of the trajectory data. Follow-up studies claimed that recurrent neural network models provide faster and more accurate speech trajectory predictions than traditional artificial neural networks, and others adopted recurrent neural networks for next-location prediction using trajectory history data and user-dependent attributes.
Our research departs from current trajectory prediction research: as mentioned earlier, e-commerce transaction streaming data have delayed labels, so the trajectory cannot represent the exact decision environment history for the periods of recent immature streaming data. Handling delayed information and data mining with delayed labels is considered one of the most important open challenges in the big data era. Researchers have identified gaps between current research and meaningful applications, and highlighted the importance and the challenge of predictive modeling using streaming data with delayed data maturity. Despite the fact that predictive modeling with delayed data maturity is an important problem, only a few studies discuss how to solve it. Several papers used semi-supervised nearest neighborhood methods to resolve clustering problems when cluster labels are partially observed. However, we have not yet found any literature that addresses regression problems or continuous-valued time series prediction problems with delayed information.
3 Partially Mature Data and Fraud Control Decision Environment
In this section, we first give an overview of the structure of the streaming data set collected for the fraud control decision engine in Section 3.1. Next, in Section 3.2, we define the mathematical form of the decision environment.
3.1 Streaming Data Structure
Figure 2 demonstrates the structure of streaming data set.
A transaction carries the user's account information (e.g., Microsoft account information), product information (e.g., a Surface Book with a set of specifications, the total price and cost), and payment information (e.g., type of payment instrument, location of payment, etc.). The time of occurrence of a purchase transaction is immediately recorded as "ReceivingTime". A risk scoring engine then scores the transaction and records the result in the risk control database as "RiskScore". A fraud control engine then makes a control decision, which is recorded as "InlineDecision". The payment issuing bank and the manual review (MR) team make their decisions, which are recorded in the database as "BankDecision" and "MRDecision". The fraud label of the transaction is set to "False" by default in "FraudFlag", and after a stochastic lead time in data maturity, we receive the final fraud label and update the transaction's FraudFlag to "True" if a chargeback returns. The maturity time of a transaction is recorded simultaneously in the column "MaturityTime".
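The record layout described above can be sketched as a simple data structure. The field names follow the column names in the text; the types and defaults are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Transaction:
    """One row of the fraud-control streaming data set (field names from the
    text; types and defaults are our assumptions)."""
    ReceivingTime: datetime                  # recorded immediately at purchase time
    RiskScore: int                           # assigned by the risk scoring engine
    InlineDecision: str                      # fraud control engine: approve/review/reject
    BankDecision: Optional[str] = None       # issuing bank authorization result
    MRDecision: Optional[str] = None         # manual review team decision
    FraudFlag: bool = False                  # default False; True once a chargeback returns
    MaturityTime: Optional[datetime] = None  # when the final fraud label arrived

tx = Transaction(ReceivingTime=datetime(2018, 3, 1), RiskScore=420,
                 InlineDecision="approve", BankDecision="authorized")
```

A freshly recorded transaction thus carries a default `FraudFlag` of `False` and an empty `MaturityTime` until its label matures.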
We discretize the time span into equal-length periods, e.g., treating one week as a period. Transactions occurring in the same period, e.g., the same week, are then treated as occurring at the same time stamp. Let $L$ be the maximum data maturity lead time (measured in number of periods); then at the beginning of a given period $t$, the available streaming data can be separated into two segments:
Long-term mature data: streaming data with time stamp no later than $t-L$;
Short-term partially mature data: streaming data with time stamp from $t-L+1$ to $t-1$.
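The two-segment split above can be sketched as follows; `period` is the discretized week stamp of each transaction and `L` the maximum maturity lead time (both hypothetical names).

```python
def split_stream(records, t, L):
    """Split streaming records, at the beginning of period t, into
    long-term mature data (time stamp <= t - L) and short-term
    partially mature data (time stamps t - L + 1 .. t - 1)."""
    mature = [r for r in records if r["period"] <= t - L]
    partial = [r for r in records if t - L < r["period"] < t]
    return mature, partial

records = [{"period": p} for p in range(1, 10)]   # transactions from periods 1..9
mature, partial = split_stream(records, t=10, L=4)
# mature covers periods 1..6; partially mature covers periods 7..9
```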
3.2 Decision Environment of Dynamic Fraud Control
The decision environment of inline fraud control is characterized by probabilistic measures of the banks' and MR teams' decision behavior patterns. Decision environment characteristics are introduced in the form of conditional probabilities. Let $s$ denote the risk score of a transaction, where $s$ has a finite integral support $S$; the decision environment of optimal transaction fraud control is then characterized by the following five probabilistic functions:
$G_1(s)$: probability that a transaction with score $s$ is authorized by the payment issuing bank and turns out to be non-fraudulent;
$G_2(s)$: probability that a transaction with score $s$ is authorized by the bank and turns out to be fraudulent;
$G_3(s)$: probability that a transaction with score $s$ is authorized by the bank and approved by the manual review team, and finally turns out to be non-fraudulent;
$G_4(s)$: probability that a transaction with score $s$ is authorized by the bank and approved by the manual review team, and finally turns out to be fraudulent;
$G_5(s)$: probability that a transaction with score $s$ is authorized by the bank.
These five probabilistic functions, denoted here $G_1(s)$ through $G_5(s)$ in the order listed above, are called G-functions, short for "gold functions," since their values describe profit-related probabilities associated with different risk operations in the transaction fraud control system. $G_5$ can be estimated using the most recent transaction streaming data, as bank decision signals are available instantly (within a few seconds). Predicting $G_1$, $G_2$, $G_3$, and $G_4$, on the other hand, is not trivial, since we do not have up-to-date fraud labels due to the data maturity lead time. We therefore focus on the problem of how to predict $G_1$ to $G_4$ in this paper. We choose predictive modeling of $G_1$ as an example; estimating the other three functions follows exactly the same procedure. Let $G^t(s)$ be the function value in period $t$; the decision environment inference tasks can then be stated as follows:
At the beginning of a given period, e.g., period $t$, what is the current decision environment $G^t(s)$ for all $s \in S$?
During a given period $t$, will the series of control decisions made so far affect the $G$ function in the future? Is there a way to estimate a future $G$ function, e.g., $G^{t+1}$ for period $t+1$?
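On mature data, each of the five probabilities above reduces to an empirical conditional frequency per score. A minimal sketch, where the indices follow the order in which the five probabilities are listed and the record fields are our assumptions:

```python
from collections import defaultdict

def estimate_g(records, event):
    """Empirical P(event | RiskScore = s) from mature records.
    `event` is a predicate on a record; returns {score: probability}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["score"]] += 1
        if event(r):
            hits[r["score"]] += 1
    return {s: hits[s] / totals[s] for s in totals}

mature = [
    {"score": 1, "bank_auth": True,  "fraud": False},
    {"score": 1, "bank_auth": True,  "fraud": True},
    {"score": 1, "bank_auth": False, "fraud": False},
    {"score": 2, "bank_auth": True,  "fraud": False},
]
g5 = estimate_g(mature, lambda r: r["bank_auth"])                     # bank authorized
g1 = estimate_g(mature, lambda r: r["bank_auth"] and not r["fraud"])  # authorized & legit
```

The remaining functions follow the same pattern with predicates that additionally test the manual review decision.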
4 Predictive Modeling Frameworks
In this section, we illustrate the details of two predictive modeling frameworks that resolve the two major tasks proposed in Section 3.2. We propose a general framework for current-period decision environment inference in Section 4.1, named CEI. Considering the fact that decision actions taken so far in the current period $t$ will affect the future decision environment, e.g., in period $t+1$, we introduce the second inference framework, called FEI, in Section 4.2. We use $G_1$ as an example throughout this section; estimating the other functions follows the exact same procedure.
4.1 Current Environment Inference (CEI) Framework
Figure 3 depicts the system logic of the Current Environment Inference (CEI) framework.
The CEI framework consists of two segments: data pre-processing and the learning module. As the transaction streaming data set expands each period, the CEI model is updated at the beginning of each decision period. The most recent transaction streaming data set is first pre-processed into a training data set, and a machine learning or deep learning method can then be deployed to build a model for current decision environment inference. In Section 4.1.1, we first demonstrate how to obtain useful features and construct the training data for CEI. The pre-processed training data are later fed into the learning module in Section 4.1.2 to produce an inference model that maps input features to a prediction of the current period's $G$ function.
4.1.1 Data Pre-processing
Recall that the decision environment in period $t$ is characterized by $G^t(s)$ for all $s \in S$. At any given risk score $s$, the values of $G^t(s)$ compose a time series along the time axis. Considering that $G^t(s)$ might be correlated with $G^t(s')$ for $s' \neq s$ in any period $t$, we include the risk score and the period number as two features of the training data in the first place.
We introduce the Long-Term-Short-Term (LTST) idea: we collect features observed at different time points, which helps improve estimation accuracy. Recall that the lead time in data maturity is at most $L$ periods, so at the beginning of a given decision period $t$ we have access to the exact decision environment for all periods up to $t-L$. We therefore include the most recent mature decision environment information, i.e., $G^{t-L-m+1}(s), \ldots, G^{t-L}(s)$ for a trajectory length $m$, as features of the training data. On the other hand, given that the bank and the MR team continually adjust their decision behavior patterns as described in Section 1, we calculate biased chargeback rates for recent periods using the partially mature streaming data. We do not have access to the exact chargeback rate of a recent period $t-k$ ($k < L$), but a biased chargeback rate can be estimated as the number of chargebacks observed so far for period $t-k$ divided by the number of finally approved transactions in period $t-k$.
Using correlation tests, we can find out whether any of these biased chargeback rates is related to $G^t$. We include the most related biased chargeback rate as a feature of the training data. For example, if $G^t$ has a significant correlation with the biased chargeback rate of period $t-k$, we include that rate in the feature set of the training data. The responses of the training data are the exact $G$ values.
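The biased chargeback rate of a recent period counts only the fraud labels that have matured so far; a sketch under our assumed record fields:

```python
def biased_chargeback_rate(partial_records, period):
    """Chargeback rate of `period` estimated from partially mature data:
    chargebacks observed so far over finally approved transactions.
    Biased low, since more chargebacks may still arrive for this period."""
    approved = [r for r in partial_records
                if r["period"] == period and r["final_approved"]]
    if not approved:
        return 0.0
    charged_back = sum(1 for r in approved if r["fraud_flag"])
    return charged_back / len(approved)

stream = [
    {"period": 7, "final_approved": True,  "fraud_flag": True},
    {"period": 7, "final_approved": True,  "fraud_flag": False},
    {"period": 7, "final_approved": True,  "fraud_flag": False},
    {"period": 7, "final_approved": False, "fraud_flag": False},
]
rate = biased_chargeback_rate(stream, period=7)   # 1 of 3 approved charged back so far
```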
The idea behind this training data construction is adopted from trajectory prediction research. For each risk score $s$, the series of $G^t(s)$ values over periods can be considered a trajectory. Furthermore, we consider the fact that for different scores $s$, the trajectories of the $G$ function might be correlated and should not be estimated separately. The intuition behind including the most recent mature decision environment information, i.e., $G^{t-L-m+1}(s), \ldots, G^{t-L}(s)$, as features is to record the most recent available exact trajectory to estimate the level and trend of the long-term $G$ function. Including a recent biased chargeback rate in the feature set, on the other hand, provides a short-term calibration factor that amends the long-term estimation, so that our $G$ function estimations are more representative of the most recent decision environment. We summarize the data structure of the CEI training data in Table I, where "x" indicates training feature data and "o" indicates training response data.
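One row of this training data pairs a lagged mature trajectory and a calibration factor with the exact response of the target period. A sketch with hypothetical field names and synthetic history:

```python
def build_cei_rows(g_history, biased_cb, t, L, m, k):
    """Build CEI training rows for target periods with exact responses
    (periods up to t - L). g_history: {(period, score): exact G value};
    biased_cb: {period: biased chargeback rate}. Each row pairs an
    m-period mature trajectory and a calibration factor with the exact
    G value of the target period (the response)."""
    scores = sorted({s for (_, s) in g_history})
    rows = []
    for tau in range(L + m + 1, t - L + 1):   # targets with a full feature window
        for s in scores:
            features = {
                "score": s,
                "period": tau,
                # long-term trend: the m most recent mature G values before tau
                "trajectory": [g_history[(tau - L - i, s)] for i in range(m, 0, -1)],
                # short-term calibration: biased chargeback rate k periods before tau
                "calibration": biased_cb[tau - k],
            }
            rows.append((features, g_history[(tau, s)]))
    return rows

g_hist = {(p, 0): 0.9 - 0.01 * p for p in range(1, 21)}   # synthetic G history
cb = {p: 0.02 + 0.001 * p for p in range(1, 21)}          # synthetic biased rates
rows = build_cei_rows(g_hist, cb, t=24, L=4, m=3, k=2)
```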
4.1.2 CEI Learning module
We consider the learning module a regression problem that maps input features to an estimation of the current period's $G$ function. A number of alternative methods can be considered as the core learning model. We provide readers a variety of options in machine learning and deep learning, including linear regression (LR), artificial neural network (ANN), random forest (RF), gradient boosted tree (GB), and recurrent neural network (RNN). With the model trained (and model parameters tuned using cross validation) at the beginning of decision period $t$, the most recent mature $G$ functions, i.e., $G^{t-L-m+1}(s), \ldots, G^{t-L}(s)$ for all $s \in S$, and the calibration factor compose the input, and the CEI model outputs estimations of the $G$ functions in period $t$ for all $s \in S$, denoted by $\hat{G}^t(s)$.
The scope of this paper is to illustrate a general framework for decision environment inference with partially mature data. We therefore leave a certain degree of freedom for readers to choose a more suitable learning method (out of LR, ANN, RF, GB, and RNN, or other regression-based learning methods with linear or nonlinear structure). We compare the results of performance tests for CEI with LR, ANN, RF, GB, and RNN as the learning cores in Section 5.
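As the simplest of the options above, a linear-regression learning core can be sketched by solving the normal equations directly. This is an illustrative stand-in for any of the listed learners, fitted on generic synthetic feature rows (an intercept column is appended internally):

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y,
    solved by Gaussian elimination with partial pivoting. X: list of
    feature rows; returns weights, last entry being the intercept."""
    X = [row + [1.0] for row in X]                     # append intercept column
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                               # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                                      # back substitution
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, row):
    return sum(wi * xi for wi, xi in zip(w, row + [1.0]))

# synthetic rows generated from y = 0.5*x0 + 0.3*x1 - 2*x2 + 0.1
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1], [2, 1, 0]]
y = [0.6, 0.4, -1.9, -1.1, 1.4]
w = fit_linear(X, y)
```

In the CEI setting, each feature row would hold the lagged mature $G$ trajectory and the calibration factor, and the response the exact $G$ value.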
4.2 Future Environment Inference (FEI) Framework
Figure 4 depicts the system logic of the Future Environment Inference (FEI) framework.
The FEI framework consists of a data pre-processing module and two learning modules. As with CEI, the FEI model is updated once per period, after including new streaming data and updating fraud labels. However, FEI has the following two main differences from CEI.
The data processing module of FEI transforms the transaction streaming data into two training data sets: training data set I includes $G$-function trajectories of length $m$ and contributes to training learning module I, while training data set II includes shorter $G$-function trajectories of length $m-1$ and is used by learning module II;
The two training data sets are fed into the two learning modules: learning module I is identical to the CEI learning module, and learning module II is a modified version of the CEI learning module with a shorter $G$-function trajectory for training and prediction.
4.2.1 Data Pre-processing
In this section, we illustrate which features are useful for predicting the future decision environment, and how to pre-process the streaming data into training data sets.
As in the CEI data pre-processing, we use the LTST idea to include potential features observed at different time points. CEI suggests putting the most recent mature $G$-function trajectory into the training feature set; for target period $t+1$, the long-term trend of the $G$ function would ideally be estimated from the $m$ most recent periods preceding it. However, at the beginning of decision period $t$, we only have access to mature data up to period $t-L$. We therefore use $G^{t-L-m+2}(s), \ldots, G^{t-L}(s)$ to estimate the long-term trend of $G^{t+1}$, so this shorter trajectory of length $m-1$ is included in the feature set. CEI also suggests that the chargeback rate of the current period $t$, i.e., $C_t$, is correlated with $G^{t+1}$ and should be included in the feature set. This feature set is used to pre-process training data set II in Figure 4.
Training data set II alone is not enough to predict the decision environment of period $t+1$, because at the time we make the prediction, i.e., during period $t$, we do not yet know the chargeback rate $C_t$. However, if we have access to estimates of $G_1^t$ and $G_2^t$ (the bank-authorized non-fraud and fraud probabilities defined in Section 3.2), then for the transaction sequence that has occurred in the current period so far (from the beginning of the decision period to the time we make estimations), letting $(a_1, \ldots, a_J)$ be the action sequence and $(s_1, \ldots, s_J)$ be the risk score sequence, we can obtain an estimation of $C_t$:
$$\hat{C}_t = \frac{\sum_{j=1}^{J} \mathbb{1}\{a_j = \text{approve}\}\, \hat{G}_2^t(s_j)}{\sum_{j=1}^{J} \mathbb{1}\{a_j = \text{approve}\}\, \big[\hat{G}_1^t(s_j) + \hat{G}_2^t(s_j)\big]},$$
where $\mathbb{1}\{A\}$ is the indicator function of event $A$, i.e., $\mathbb{1}\{A\} = 1$ if $A$ is true and $0$ otherwise. As Section 4.1 provides methods to estimate these functions, we utilize the CEI model to first obtain the estimations $\hat{G}_1^t$ and $\hat{G}_2^t$, and then use them to calculate $\hat{C}_t$. In this way, we construct training data set I of FEI in exactly the same way as the training data set for CEI.
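The current-period chargeback estimate combines the inline action sequence with CEI's estimates; a sketch under the assumption that `g1_hat` and `g2_hat` hold the estimated bank-authorized non-fraud and fraud probabilities per score:

```python
def estimate_current_chargeback(actions, scores, g1_hat, g2_hat):
    """Estimate the chargeback rate of the running period from the actions
    taken so far: expected fraudulent approvals over expected total
    bank-authorized approvals (indexing of g1/g2 is our assumption)."""
    approved = [s for a, s in zip(actions, scores) if a == "approve"]
    num = sum(g2_hat[s] for s in approved)
    den = sum(g1_hat[s] + g2_hat[s] for s in approved)
    return num / den if den else 0.0

g1_hat = {1: 0.90, 2: 0.60}   # P(bank authorized & non-fraud | score), estimated
g2_hat = {1: 0.01, 2: 0.15}   # P(bank authorized & fraud | score), estimated
actions = ["approve", "reject", "approve", "approve"]
scores = [1, 1, 2, 2]
c_hat = estimate_current_chargeback(actions, scores, g1_hat, g2_hat)
```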
We summarize the data structure of the FEI training data in Table II. Training data set I consists of feature data tagged "x" and response data tagged "o"; training data set II consists of feature data and response data marked with a separate pair of tags.
4.2.2 FEI Learning Modules
The FEI framework includes two learning modules. Learning module I produces a model that maps the most recent mature $G$-function trajectory for all $s \in S$, together with the calibration factor, to the current-period function estimations $\hat{G}_1^t$ and $\hat{G}_2^t$. With $\hat{G}_1^t$ and $\hat{G}_2^t$, we can estimate the chargeback rate of the current decision period $t$, i.e., $\hat{C}_t$. Learning module II is then used to produce a model that maps a shorter recent mature $G$-function trajectory, together with the calibration factor $\hat{C}_t$, to the function estimation for period $t+1$, $\hat{G}^{t+1}$.
As with the CEI framework, the FEI framework can use a variety of learning cores for learning modules I and II. We suggest several options for both learning cores, including linear regression (LR), artificial neural network (ANN), random forest (RF), gradient boosted tree (GB), and recurrent neural network (RNN). We compare the results of performance tests for FEI with LR, ANN, RF, GB, and RNN as the learning cores in Section 5.
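End to end, FEI chains the two fitted modules; a sketch with toy stand-in callables (all names hypothetical):

```python
def fei_predict(module1, module2, long_traj, short_traj, calib, chargeback_estimator):
    """FEI inference: module1 (identical in form to the CEI learning module)
    maps the length-m mature trajectory and calibration factor to current-period
    G estimates; those feed a chargeback estimate for the running period, which
    module2 combines with the shorter (length m-1) trajectory to predict the
    next period's G function."""
    g_hat_current = module1(long_traj, calib)      # learning module I
    c_hat = chargeback_estimator(g_hat_current)    # estimated current chargeback rate
    return module2(short_traj, c_hat)              # learning module II

# toy stand-ins for the fitted modules and the chargeback estimator
m1 = lambda traj, c: {"g1": 0.9, "g2": 0.1}
est = lambda g: g["g2"] / (g["g1"] + g["g2"])
m2 = lambda traj, c: max(0.0, traj[-1] - c)        # hypothetical mapping
g_next = fei_predict(m1, m2, [0.92, 0.91, 0.90], [0.91, 0.90], 0.03, est)
```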
5 A Case Study of Microsoft E-commerce
In this section, we conduct systematic performance tests for the CEI and FEI modules using real-world e-commerce transaction data from Microsoft. For the chosen business, we consider one decision period to be one week, and previous research at Microsoft has concluded the maximum data maturity lead time to be 12 weeks, meaning that at week $t$, all transactions occurring in or before week $t-12$ have exact fraud labels. Hence all $G$ functions have exact values in any period up to $t-12$. We choose the most recent one month of $G$ functions to estimate long-term function trends, i.e., $m = 4$ weeks. Next we show how to choose the proper chargeback and partial chargeback rates as short-term calibration factors in a statistical way.
With mature historical data, we can obtain a score-aggregated value of the $G$ function for each week, for instance by averaging $G^t(s)$ over all scores $s \in S$.
We calculate the full chargeback rate and partial chargeback rates of $k$ weeks ago; the full chargeback rate of week $t-k$ is calculated as the total number of charged-back transactions of week $t-k$ divided by the number of finally approved transactions of week $t-k$,
and the $j$-week partial chargeback rate is calculated analogously, counting only chargebacks received within $j$ weeks of the purchase.
We believe that the current score-aggregated $G$ value is correlated with these full and partial chargeback rates.
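Both rates can be computed from the same labeled history; `weeks_to_chargeback` below is a hypothetical field giving the lag between purchase and chargeback (`None` if no chargeback ever arrived).

```python
def chargeback_rates(weeks_to_chargeback, j):
    """Full chargeback rate (all chargebacks, mature data) and j-week partial
    chargeback rate (only chargebacks arriving within j weeks) for one week's
    approved transactions."""
    n = len(weeks_to_chargeback)
    full = sum(1 for w in weeks_to_chargeback if w is not None) / n
    partial = sum(1 for w in weeks_to_chargeback if w is not None and w <= j) / n
    return full, partial

# weeks until chargeback for one week's approved transactions (None = never)
lags = [1, 3, 6, None, None, 2, 9, None, None, None]
full, partial_4wk = chargeback_rates(lags, j=4)
```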
We use one of the $G$ functions as a demonstration example. We collect its weekly score-aggregated values and the candidate chargeback rates from historical data. We then conduct non-parametric statistical tests to assess the correlation and the monotonic relation between the $G$ values and each candidate rate. The testing results are summarized in Table III.
Table III (excerpt): Spearman p-values for the four candidate chargeback rates: 0.016, 3.297e-4, 1.541e-4, 0.007.
The p-values of the Kendall tests suggest that we can reject the Kendall null hypothesis ($H_0$: the two series are independent), and hence we can claim a significant weekly dependence between the $G$ values and the candidate chargeback rates. The Spearman tests also verify a monotonically decreasing relation, as their p-values suggest that the Spearman null hypothesis ($H_0$: the two series have no monotone increasing/decreasing relation) is rejected. The monotone decreasing trends are visualized in Figure 5.
We choose and include the candidate with the smallest Kendall and Spearman p-values as a short-term calibration factor. Similarly, we test the correlation and monotonic relation for the second $G$ function in order to include its short-term calibration factor. The results in Table IV confirm the dependence, as well as the monotone decreasing relation, between that $G$ function and its calibration factor. This decreasing monotone relation is validated in Figure 6.
|Test||Statistic||p-value|
|Kendall tau test||-0.467||6.21e-4|
|Spearman r test||-0.639||2.72e-4|
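The two rank-correlation statistics used above are simple enough to sketch directly for the no-ties case (p-values, which require the null distributions, are omitted; in practice `scipy.stats.kendalltau` and `spearmanr` provide both):

```python
def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) pairs over all
    pairs (assumes no ties)."""
    n, conc = len(x), 0
    pairs = n * (n - 1) // 2
    for i in range(n):
        for j in range(i + 1, n):
            conc += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return conc / pairs

def spearman_rho(x, y):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))
    (assumes no ties)."""
    rank = lambda v: {val: r for r, val in enumerate(sorted(v))}
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

g_bar = [0.80, 0.78, 0.75, 0.74, 0.70]     # weekly score-aggregated G values (synthetic)
cb = [0.020, 0.024, 0.030, 0.033, 0.041]   # candidate chargeback rate (synthetic)
tau, rho = kendall_tau(g_bar, cb), spearman_rho(g_bar, cb)
```

With a strictly decreasing $G$ series against a strictly increasing chargeback series, both statistics reach their extreme value of $-1$, mirroring the monotone decreasing relation reported above.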
We conducted a 10-week field test with Microsoft e-commerce transaction streams. We also conducted performance tests of the CEI and FEI frameworks in parallel with a number of learning cores, including linear regression (LR), artificial neural network (ANN), random forest (RF), gradient boosted tree (GB), and recurrent neural network (RNN). We recorded our predictions of the $G$ functions for all scores for the 10 testing weeks. The true $G$ functions were calculated 12 weeks later, and we report the mean square errors (MSE) and error standard deviations of the different versions of CEI and FEI in Table V and Table VI.
Table V: performance measures (MSE and error standard deviation) for each CEI learning core.
Table VI: performance measures (MSE and error standard deviation) for each FEI learning core.
We first discuss the CEI framework testing results in detail. From Table V, we observe that CEI with ANN had the smallest MSE, as well as the smallest error standard deviation, for some of the $G$ functions over all scores. However, for the remaining functions, the RNN learning core outperformed all other methods by providing the smallest MSE. Moreover, by comparing error standard deviations, we can claim that CEI-RNN performance was relatively robust across all functions, as it always yielded the smallest or second smallest error standard deviations. CEI-ANN had the largest error standard deviations for predicting several of the functions, which indicates its instability in predictive modeling. CEI-RF can be claimed the most robust prediction method, providing the smallest error standard deviations and relatively small MSEs. CEI with LR and GB had mediocre performance in this group of parallel tests.
The FEI testing results in Table VI show that the double estimation of the FEI framework did increase prediction MSEs, as the inputs of learning module II depend on the outputs of learning module I, which can be inaccurate. In this test, FEI with an RNN learning core had the best performance, providing the smallest MSEs as well as the smallest error standard deviations for all functions. FEI-ANN had the worst performance, as its MSEs for several of the functions were much larger than those of all other methods. The FEI frameworks with LR, RF, and GB had similar performance, with FEI-GB yielding slightly smaller MSEs and FEI-LR relatively smaller error standard deviations.
Lastly, we discuss the computational performance of the CEI and FEI frameworks. Algorithm designers should care about both the prediction accuracy and the computational complexity of each version of the CEI and FEI frameworks. A good predictive modeling framework should not only provide accurate decision environment predictions, but also have acceptable computation time for training the learning module(s) that derive a prediction, i.e., the outputs of the environment functions.
Table VII summarizes the average computation time and its standard deviation for each learning module (CEI learning module, FEI learning module I, and FEI learning module II) with each learning core (LR, ANN, RF, GB, RNN), based on parallel tests on 100 training samples using the same personal computer with an Intel Core i7-7700HQ CPU. Each computation time includes both cross-validation time and model training time. The LR method had the shortest computation time, averaging less than 1 millisecond. ANN had the second shortest, at a few milliseconds. The RF and GB methods had similar computational performance: obtaining a prediction with RF or GB as the learning core required 80 to 95 milliseconds. Training an RNN-based learning module required around 10 seconds, the price of its accurate predictions. We provide these computation time results for algorithm designers in Microsoft's e-commerce risk control group, who may sometimes face a trade-off between prediction accuracy and computational complexity.
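The timing methodology can be sketched as follows. The `train_with_cv` stand-in and the sample sizes are assumptions for illustration, not Microsoft's actual benchmark harness:

```python
# Illustrative sketch: per-module computation time over repeated training
# samples, where each timed call covers cross-validation plus model
# training, as in Table VII. The "model" here is a trivial stand-in.
import statistics
import time

def train_with_cv(sample):
    """Stand-in for cross-validation plus model training on one sample."""
    # A trivial "model": per-fold averages over 5 CV folds.
    folds = [sample[i::5] for i in range(5)]
    return [sum(f) / len(f) for f in folds if f]

samples = [[float(j) for j in range(100)] for _ in range(20)]

timings_ms = []
for sample in samples:
    start = time.perf_counter()
    train_with_cv(sample)
    timings_ms.append((time.perf_counter() - start) * 1000.0)

print(f"mean: {statistics.mean(timings_ms):.3f} ms, "
      f"std: {statistics.stdev(timings_ms):.3f} ms")
```

Reporting both the mean and the standard deviation of the timings, as Table VII does, lets designers weigh a core's typical cost against its variability.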
We conclude our research by highlighting the contributions of this paper and by describing how the proposed frameworks can be extended to other industries facing the challenge of only partially mature data.
In this paper, we discussed predictive modeling with delayed information, based on a real-world application in dynamic e-commerce transaction fraud control. Two frameworks, Current Environment Inference (CEI) and Future Environment Inference (FEI), were proposed to resolve the issue of long lead time to data maturity in fraud control decision environment prediction. These frameworks construct prediction features following a long-term/short-term idea, obtaining long-term trend features from mature data and short-term calibration features from partially mature data. A number of learning methods were also proposed as candidate learning cores for both frameworks, including linear regression, random forest, gradient boosted tree, artificial neural network, and recurrent neural network. Performance tests were conducted on a portfolio of e-commerce transaction data from Microsoft to compare the different versions of CEI and FEI. The testing results suggest that the proposed frameworks achieve high prediction accuracy. We observed great potential for recurrent neural networks in predictive modeling with delayed information. However, if the predictive modeling requires millisecond-level computation time, we would suggest random forest or gradient boosted tree as the candidates algorithm designers should consider first.
The ideas behind the CEI and FEI modules can be readily adopted and extended to other industries with delayed information, such as sales prediction for inventory control, citation prediction for journal ranking, and multi-sensor recognition with delayed signals. Instead of using only up-to-date data to learn the most recent fraud patterns, we can track the long-term trend of the prediction target for long-term estimation, and find a correlated short-term factor to calibrate predictions based on that long-term estimate. By assuming linear or nonlinear relations between the long-term factors, the short-term factors, and the prediction target, data scientists can train machine learning and deep learning models on the mature data set. In this way, the proposed frameworks generalize to a much broader class of applications in predictive modeling with delayed information.
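A minimal sketch of this long-term/short-term idea is given below. The moving-average trend, the naive maturity correction, and the equal-weight blend are all assumptions for illustration; the CEI/FEI frameworks learn these relations with a trained learning core rather than fixing them by hand:

```python
# Illustrative sketch: combine a long-term trend feature (from mature
# data) with a short-term calibration feature (from partially mature
# data) to predict the current target value.

def long_term_trend(mature_values, window=4):
    """Long-term estimate: moving average over mature observations."""
    recent = mature_values[-window:]
    return sum(recent) / len(recent)

def calibrated_prediction(mature_values, partial_value, maturity_rate):
    """Blend the long-term trend with a scaled-up partially mature
    observation; the 50/50 weighting is an arbitrary assumption."""
    trend = long_term_trend(mature_values)
    short_term = partial_value / maturity_rate  # naive maturity correction
    return 0.5 * trend + 0.5 * short_term

mature = [100.0, 110.0, 105.0, 115.0, 120.0]
# Latest week: only ~60% of outcomes have matured so far.
print(calibrated_prediction(mature, partial_value=66.0, maturity_rate=0.6))
```

In the paper's frameworks, the learning core replaces the fixed blend above, fitting the relation between long-term features, short-term features, and the target on the mature data set.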
This research was supported by Microsoft, Seattle, WA. The authors are thankful to the members of the Membership Knowledge and Growth group at Microsoft for data and technical support, and to the anonymous reviewers and referees for their constructive comments.
-  J. Li, Y. Liu, Y. Jia, and J. Nanduri, “Discriminative data-driven self-adaptive fraud control decision system with incomplete information,” arXiv:1810.01982 [cs.AI], 2018. [Online]. Available: https://arxiv.org/pdf/1810.01982.pdf
-  J. W. Payne, “Contingent decision behavior,” Psychological Bulletin, vol. 92, no. 2, pp. 382–402, 1982.
-  D. N. Kleinmuntz and J. B. Thomas, “The value of action and inference in dynamic decision making,” Organizational Behavior and Human Decision Processes, vol. 39, pp. 341–364, 1987.
-  R. S. Sutton and A. G. Barto, Reinforcement Learning—An Introduction. Cambridge, MA: MIT Press, 1998.
-  C. Gonzalez, J. F. Lerch, and C. Lebiere, “Instance-based learning in dynamic decision making,” Cognitive Science, vol. 27, pp. 591–635, 2003.
-  G. U. Yule, “On a method of investigating periodicities in disturbed series with special reference to Wolfer’s sunspot numbers,” Philosophical Transactions of the Royal Society London, no. 226, pp. 267–298, 1927.
-  G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control. Englewood Cliffs, NJ: Prentice-Hall, 1994.
-  D. Park, M. El-Sharkawi, R. Marks, L. Atlas, and M. Damborg, “Electric load forecasting using an artificial neural network,” IEEE Transactions on Power Engineering, vol. 6, no. 2, pp. 442–449, 1991.
-  K. Y. Lee, Y. T. Cha, and J. Park, “Short-term load forecasting using an artificial neural network,” IEEE Transactions on Power Systems, vol. 7, no. 1, pp. 124–132, 1992.
-  G. Zhang, B. E. Patuwo, and M. Y. Hu, “Forecasting with artificial neural networks: The state of the art,” International Journal of Forecasting, no. 14, pp. 35–62, 1998.
-  N. Kohzadi, M. S. Boyd, B. Kermanshahi, and I. Kaastra, “A comparison of artificial neural network and time series models for forecasting commodity prices,” Neurocomputing, no. 10, pp. 169–181, 1996.
-  P. Xu and F. Jelinek, “Random forests and the data sparseness problem in language modeling,” Computer Speech and Language, vol. 21, pp. 105–152, 2007.
-  M. Herrera, L. Torgo, J. Izquierdo, and R. Perez-Garcia, “Predictive models for forecasting hourly urban water demand,” Journal of Hydrology, vol. 387, pp. 141–150, 2010.
-  G. Dudek, Advances in Intelligent Systems and Computing. Springer, Cham, 2015, vol. 323, ch. Short-Term Load Forecasting Using Random Forests.
-  M. J. Kane, N. Price, M. Scotch, and P. Rabinowitz, “Comparison of arima and random forest time series models for prediction of avian influenza h5n1 outbreaks,” BMC Bioinformatics, vol. 15, no. 276, 2014.
-  H. Tyralis and G. Papacharalampous, “Variable selection in time series forecasting using random forests,” Algorithms, vol. 10, no. 114, 2017.
-  T. G. Dietterich, A. Ashenfelter, and Y. Bulatov, “Training conditional random fields via gradient tree boosting,” in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
-  T. Mikolov, M. Karafiat, L. Burget, J. H. Cernocky, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, Makuhari, Chiba, Japan, 26-30 September 2010, pp. 1045–1048.
-  ——, “Extensions of recurrent neural network language model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 22-27 May 2011.
-  Q. Liu, S. Wu, L. Wang, and T. Tan, “Predicting the next location: A recurrent model with spatial and temporal contexts,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016, pp. 194–200.
-  D. Yao, C. Zhang, J. Huang, and J. Bi, “Serm: A recurrent model for next location prediction in semantic trajectories,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 2411–2414.
-  G. Krempl, I. Zliobaite, D. Brzezinski, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, and J. Stefanowski, “Open challenges for data stream mining research,” ACM SIGKDD Explorations Newsletter, vol. 16, no. 1, pp. 1–10, 2014.
-  L. I. Kuncheva and J. S. Sanchez, “Nearest neighbour classifiers for streaming data with delayed labelling,” in 8th IEEE International Conference on Data Mining, 2008, pp. 869–874.
-  M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, “Classification and novel class detection in concept-drifting data streams under time constraints,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 859–874, 2011.
-  M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, and N. C. Oza, “Facing the reality of data stream classification: coping with scarcity of labeled data,” Knowledge and Information Systems, vol. 33, no. 1, pp. 213–244, 2012.