In the analysis of large-scale data, preprocessing steps that include outlier detection and normalization are required. These steps are especially important when the data are used to assess real-world applications, as in consumer price statistics. In particular, outliers or anomalies bias the analysis and should therefore be removed before the analysis. In this paper, we discuss outlier detection in scanner data.
Scanner data are detailed data on the sales of products, obtained by scanning the bar codes of individual products at the points of sale in retail outlets. The data contain the information about which item is sold at which store, with the volume and price of the sold item. Both researchers and practitioners are interested in the use of the data and make efforts to analyze them. In particular, many national statistical offices (NSOs), including the UK NSO, Swiss NSO, and Dutch NSO, use the scanner data when calculating various price indices (Bird et al., 2014; Haan and van der Grient, 2011; Haan and Krsinich, 2014). The goal of our research is to detect abnormal transactions in prices from the scanner data. The existence of anomalies in the data results in a biased conclusion about the price trend and variation.
Outlier detection in price changes is often linked to the traditional consumer price index (CPI) survey (Saidi and Rubin-Bleuer, 2005; Rais, 2008). There are two representative methods for outlier detection: the quartile method and Tukey algorithm. Saidi and Rubin-Bleuer (2005) at Statistics Canada and Rais (2008) show that the quartile method is preferred over other methods, while the United Kingdom’s ONS uses the Tukey algorithm to detect outliers in their CPI. The ILO Consumer Price Index Manual (2004) recommends both the quartile method and Tukey algorithm as tools for outlier detection. However, the two methods depend only on the information of price changes. The methods set intervals for the relative prices and determine that the price changes that lie outside the intervals are outliers. This is because, unlike the scanner data, the traditional CPI survey does not contain other detailed information such as the sales time and volume of each item. Thus, the direct application of the traditional methods does not fully use all of the given information and could be inaccurate or inefficient in determining outliers.
To understand the difficulty encountered by existing methods in detecting abnormal transactions, let us consider an example where the price depends on the sales volume (see Figure 1). Suppose that the log of the sales price of an item follows the normal distribution with mean $\mu$ and variance $\sigma_1^2$ when sold individually, whereas, in a bulk sale, it follows the normal distribution with mean $\mu$ and variance $\sigma_2^2$ ($\sigma_1^2 > \sigma_2^2$). This scenario implies that the price is more variable when the item is sold individually. If we do not consider the sales volume in constructing the limits, the in-control intervals determined by existing methods lie between the intervals for individual sales and bulk sales, as in Figure 1. If we apply the existing methods that set the lower and upper limits for outlier detection, many individual sales are classified as abnormal transactions, although they are not.
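The size of this effect can be checked with a short calculation. The following is a minimal sketch, assuming a common mean of zero for both sale types, 3-sigma limits built from the pooled distribution, and hypothetical values for the two standard deviations and the share of individual sales; none of these numbers come from the scanner data.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def false_alarm_rate(sigma_group: float, sigma_limit: float, k: float = 3.0) -> float:
    """P(|Y| > k * sigma_limit) when Y ~ N(0, sigma_group^2)."""
    return 2.0 * (1.0 - normal_cdf(k * sigma_limit / sigma_group))


# Hypothetical values: individual sales are noisier than bulk sales.
sigma_individual = 0.10
sigma_bulk = 0.02
share_individual = 0.5  # assumed share of individual-sale transactions

# Standard deviation of the pooled (mixture) distribution with a common mean.
sigma_pooled = math.sqrt(
    share_individual * sigma_individual ** 2
    + (1.0 - share_individual) * sigma_bulk ** 2
)

# Probability that an in-control individual sale is flagged.
rate_pooled = false_alarm_rate(sigma_individual, sigma_pooled)
rate_own = false_alarm_rate(sigma_individual, sigma_individual)

print(f"flagged with pooled limits:         {rate_pooled:.4f}")
print(f"flagged with volume-specific limits: {rate_own:.4f}")
```

Under these assumed values, the pooled limits are too narrow for individual sales, so in-control individual sales are flagged more often than the nominal rate.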
To resolve this problem, we propose a new method to detect outliers in the scanner data. As in traditional methods, we set the confidence limits for the log of the price change and determine the price changes that exceed the limits as outliers. However, unlike the traditional method, where the values of the limits are constant on sales volume, we allow for the upper and lower limits of the intervals to depend on the sales volume, the additional covariate. To accomplish this, we assume that the variance of the log of the price change is a smooth function of the sales volumes and estimate it from the previously observed data.
The rest of this paper is organized as follows. In Section 2, we review the existing outlier detection methods. In Section 3, we introduce a procedure to estimate the variance function of the log of the price change and propose a new outlier detection method for the scanner data. In Section 4, we numerically show the advantage of the new method over existing methods. In Section 5, we apply the methods to the scanner data collected in weekly intervals by the Korean Chamber of Commerce and Industry (KCCI) between the years 2013 and 2014. We conclude the paper with a brief summary and discussion in Section 6.
2 Outlier detection methods for scanner data
Let $p_t$ denote the unit price of an item at time $t$. The price change between time $t-1$ and $t$ is defined as the ratio $r_t = p_t / p_{t-1}$. The two existing methods, the quartile method and Tukey algorithm, construct an interval for $r_t$ (or $\log r_t$) and declare an outlier if $r_t$ (or $\log r_t$) lies outside the interval. This corresponds to the concept of statistical quality control, which motivates our interest in finding the proper interval for monitoring the price change $r_t$ (or $\log r_t$). We refer to this interval as being defined by control limits in the rest of this paper. Hence, control charts visualize the control limits of the natural variation inherent in the price change.
2.1 Quartile method
The quartile method is regarded as the conventional outlier detection method. It builds the control limits from the quartiles of $r_t$, as in the standard boxplot (Tukey, 1977). Let $Q_1$, $Q_2$, and $Q_3$ be the first, second, and third quartiles, respectively. Then, the upper control limit (UCL) and the lower control limit (LCL) for the price change are defined as follows:

$$\mathrm{LCL} = Q_2 - C_L\,(Q_2 - Q_1), \qquad \mathrm{UCL} = Q_2 + C_U\,(Q_3 - Q_2), \tag{1}$$

where $C_L$ and $C_U$ are some predetermined constants. If $r_t$ follows the normal distribution $N(\mu, \sigma^2)$, $Q_2$ is equal to $\mu$ and $Q_3 - Q_2 = Q_2 - Q_1 \approx 0.6745\,\sigma$. Hence, $C_L = C_U \approx 4.45$ approximately leads to the interval $(\mu - 3\sigma,\ \mu + 3\sigma)$, which indicates that among normal observations, the probability of falling outside of the control limits in (1) is approximately 0.27%. Note that this interval is set according to the method with constant variance in the simulation study. When the price change 1). In general, the control limits can be adjusted through ($C_L$, $C_U$) so that a much less tolerable $r_t$ falls outside the interval. However, this method has some disadvantages. The first is that when there is only a slight price fluctuation, so that the quartiles nearly coincide and the interval in (1) collapses, most data are classified as outliers even if $C_L$ and $C_U$ are large. The second problem occurs when analyzing right-skewed data. The control limits for price changes are more sensitive to the right tail of the distribution and less sensitive to the left tail. Since the price distribution tends to be skewed to the right, the natural log transformation is generally applied before the analysis (Saidi and Rubin-Bleuer, 2005; Thompson and Sigman, 1999).
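The quartile limits can be sketched in a few lines. This is an illustrative implementation, assuming limits of the form $Q_2 - C_L (Q_2 - Q_1)$ and $Q_2 + C_U (Q_3 - Q_2)$ with default constants of 4.45 (the value that reproduces 3-sigma limits under normality). The function names and the sample ratios are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is one of several conventions.

```python
import statistics


def quartile_limits(values, c_lower=4.45, c_upper=4.45):
    """Return (LCL, UCL) built from the sample quartiles of `values`."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lcl = q2 - c_lower * (q2 - q1)
    ucl = q2 + c_upper * (q3 - q2)
    return lcl, ucl


def flag_outliers(values, lcl, ucl):
    """Mark each value that lies outside the interval [lcl, ucl]."""
    return [v < lcl or v > ucl for v in values]


# Hypothetical price ratios r_t; the last one is a 50% price jump.
ratios = [0.98, 0.99, 1.0, 1.01, 1.02, 1.5]
lcl, ucl = quartile_limits(ratios)
flags = flag_outliers(ratios, lcl, ucl)
print(lcl, ucl, flags)
```

If most ratios are exactly 1, then $Q_1 = Q_2 = Q_3$ and both limits equal 1 regardless of the constants, which is the first disadvantage noted above.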
2.2 Tukey algorithm
As previously mentioned, the United Kingdom’s ONS adopts the Tukey algorithm as the outlier detection method in its CPI. Unlike the quartile method, it works well even if there is only a slight price fluctuation. The Tukey algorithm removes data with no price fluctuation ($r_t = 1$) and reconstructs a sample for $r_t$, which is called the Tukey sample. Let us denote the Tukey samples by $r_1^*, \ldots, r_m^*$. Then, the control limits are defined as follows:
$$\mathrm{LCL} = \bar{r} - 2.5\,(\bar{r} - \bar{r}_L), \qquad \mathrm{UCL} = \bar{r} + 2.5\,(\bar{r}_U - \bar{r}). \tag{2}$$

In this case, $\bar{r}$ is the sample mean of the Tukey samples, $\bar{r}_U$ is the average of the Tukey samples larger than $\bar{r}$, and $\bar{r}_L$ is the average of the Tukey samples that are smaller than $\bar{r}$. The Tukey algorithm is known to be a robust method for calculating intervals even when some outliers are present, since it eliminates unnecessary data in advance (Office for National Statistics, 2014). However, since only a fraction of the data is used to detect outliers, a small number of Tukey samples may not provide accurate outlier detection (Rais, 2008).
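The steps of the Tukey algorithm described above can be sketched as follows. This is a simplified illustration rather than the ONS production procedure: it assumes limits of the form "mean minus or plus `c` times the distance to the lower or upper mean", the multiplier `c = 2.5` is an assumed default that is not taken from this paper, and no trimming of extreme ratios is performed. The function name and the example ratios are hypothetical.

```python
def tukey_limits(ratios, c=2.5, tol=1e-12):
    """Return (LCL, UCL) from the Tukey sample of price ratios."""
    # Step 1: drop ratios with no price change (r = 1) to form the Tukey sample.
    sample = [r for r in ratios if abs(r - 1.0) > tol]
    if not sample:
        raise ValueError("Tukey sample is empty: every ratio equals 1")

    # Step 2: overall mean, and the means of the values above and below it.
    mean = sum(sample) / len(sample)
    upper = [r for r in sample if r > mean]
    lower = [r for r in sample if r < mean]
    if not upper or not lower:
        raise ValueError("need Tukey samples on both sides of the mean")
    mean_upper = sum(upper) / len(upper)
    mean_lower = sum(lower) / len(lower)

    # Step 3: limits are the mean shifted by c times each one-sided distance.
    lcl = mean - c * (mean - mean_lower)
    ucl = mean + c * (mean_upper - mean)
    return lcl, ucl


# Hypothetical ratios; the two unchanged prices are removed before averaging.
lcl, ucl = tukey_limits([1.0, 1.0, 0.9, 1.1, 0.95, 1.05])
print(lcl, ucl)
```

The two `ValueError` branches make the small-sample weakness explicit: with few price changes the one-sided means rest on very few observations or cannot be computed at all.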
In addition to these difficulties, the quartile method and Tukey algorithm have defects in applications with scanner data. They consider the price only when determining the control limits for price changes. We expect that existing methods will yield higher type I errors if the variance of price changes depends on the quantity. Therefore, we propose a new method to improve the control limits of previous methods by incorporating the additional information of sales volume.
3 Covariate-dependent control chart
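As mentioned in the Introduction, the distribution of the price change is expected to be influenced by the sales volume. Here, we assume that the log value of the price change $y_t = \log(p_t / p_{t-1})$, for $t = 2, \ldots, T$, follows the model

$$y_t = \mu_t + \sigma(x_{t-1}, x_t)\,\epsilon_t, \tag{3}$$

where $\mu_t$ is the mean of $y_t$ at time $t$, $\sigma(\cdot,\cdot)$ is a continuous function of $x_{t-1}$ and $x_t$, and $\epsilon_t$ is IID from a distribution with a mean of $0$ and a variance of $1$. If the log of the price change is in the in-control status, there is no abrupt change in the price at time $t$, implying $\mu_t = 0$. The control limits at time $t$ then become simply

$$(\mathrm{LCL}_t,\ \mathrm{UCL}_t) = \bigl(-3\,\sigma(x_{t-1}, x_t),\ 3\,\sigma(x_{t-1}, x_t)\bigr), \tag{4}$$

where $x_{t-1}$ and $x_t$ are the sales volumes at time $t-1$ and $t$, respectively. The limits in (4) yield a type I error of 0.27% when $\epsilon_t$ is normally distributed. Moreover, if $\sigma(\cdot,\cdot)$ is a constant function, it corresponds to the limits of the quartile method.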
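A covariate-dependent chart of this kind can be sketched as follows, assuming an in-control mean of zero and symmetric limits of plus or minus `k` standard deviations with `k = 3`. The standard-deviation function `example_sigma`, the function names, and all numbers are hypothetical and serve only to show how the limits move with the sales volume.

```python
import math


def covariate_limits(sigma, x_prev, x_curr, k=3.0):
    """Limits for the log price change given the two sales volumes."""
    s = sigma(x_prev, x_curr)
    return -k * s, k * s


def monitor(prices, volumes, sigma, k=3.0):
    """Flag each log price change that falls outside its own limits."""
    flags = []
    for t in range(1, len(prices)):
        y = math.log(prices[t] / prices[t - 1])
        lcl, ucl = covariate_limits(sigma, volumes[t - 1], volumes[t], k)
        flags.append(y < lcl or y > ucl)
    return flags


def example_sigma(x_prev, x_curr):
    """Hypothetical smooth function: dispersion shrinks as volume grows."""
    return 0.01 + 0.04 * math.exp(-(x_prev + x_curr) / 20.0)


# The same 10% price drop is tolerated at low volume but flagged at high volume.
print(monitor([100, 90], [1, 1], example_sigma))
print(monitor([100, 100, 90, 90], [1, 1, 100, 100], example_sigma))
```

Passing a constant function for `sigma` reduces this to a chart with fixed limits, which is the constant-variance special case.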
The variance function is unknown in practice and is estimated from the data. Many methods have been proposed to estimate the variance function in the heteroscedastic regression model (3). Two of the major approaches are the residual-based and difference-based approaches. The residual-based method estimates the variance function by estimating the mean of the squared residual and using . Here, under the in-control status, . In practice, the squared residuals are evaluated at the “data points” , where for , and the mean function is replaced by its estimate . The local polynomial (e.g., linear, quadratic) regression estimator is widely used to estimate the mean of the squared residuals (Hall and Carroll, 1989; Fan and Yao, 1998). For a given , it solves
is the vector of, , and with is the bandwidth. The residual-based variance function estimator is defined as .
Another popular procedure for the variance function is the difference-based method, which actually does not require the estimation of the mean function. The method utilizes the fact that when the data points are sorted so that , the pseudoresidual
where is a fixed constant and coefficients satisfy and
, making an unbiased estimator of the variance. For example, if , , , and , the estimator becomes
. In the heteroscedastic model (3) above, Brown and Levine (2007) consider applying local linear regression to the pseudoresiduals to estimate the variance function. However, in our case, the variance function is bivariate, and the difference-based methods are not directly applicable.
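To make the difference-based idea concrete, the following is a minimal sketch of its simplest univariate form, assuming observations already sorted by a scalar design point and a mean that changes little between neighbours. It uses the first-order difference coefficients $(1/\sqrt{2}, -1/\sqrt{2})$, which sum to zero and have squares summing to one, so each squared pseudoresidual is $(y_{i+1} - y_i)^2 / 2$. This choice of coefficients and the function name are assumptions for illustration, not the estimator used in this paper.

```python
def first_difference_variance(y):
    """Average of squared first-order pseudoresiduals (y[i+1] - y[i])^2 / 2."""
    if len(y) < 2:
        raise ValueError("need at least two observations")
    squared = [(b - a) ** 2 / 2.0 for a, b in zip(y, y[1:])]
    return sum(squared) / len(squared)


# Hypothetical responses: no mean estimate is needed, only neighbouring differences.
print(first_difference_variance([0.0, 1.0, 0.0, 1.0]))
```

The estimator relies on an ordering of the design points, which is why it does not carry over directly to the bivariate covariate $(x_{t-1}, x_t)$.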
4 Simulation study
In this section, we numerically investigate the performance of our proposal (Var) in Section 3 to detect abnormal changes in the price. We consider four methods for comparison: (i) the method with constant variance independent of sales volume (Const), (ii) the Tukey algorithm (Tukey), (iii) the quartile method (Quartile), and (iv) the method with a known true variance function (Oracle). The Oracle is the gold standard, and the first three existing methods do not use the information on the sales volume.
The data for the simulation study are generated as follows. In each dataset, we generate data points, where the first data points are under the control status (training period) and the next data points are to be tested (the period possibly having abnormal changes). The data are generated from the following model for :
where , , , and for ; otherwise, .
We set the proportion of abnormal changes to comprise (15 data points) and (30 data points) of the testing data. Finally, we consider three cases regarding the variance function : (a) , (b) , and (c) . Case (a) implies that the variance function does not depend on the sales volume. In case (b), the variance is affected only by the sales volume at time . Finally, case (c) implies that the variance of day is complexly related to the sales volume at both time and .
We compare the five methods mentioned above, including our approach, in terms of “sensitivity” (SEN), “specificity” (SPE) and “accuracy” (ACC). The three measures are formally defined as

$$\mathrm{SEN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{SPE} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \qquad \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{P} + \mathrm{N}},$$

where true positive (TP) refers to the number of data points that are assigned as positive among observations of abrupt changes (P); true negative (TN) is the number of data points determined to be negative from the normal observations (N); false negative (FN) is the number of data points that are determined to be normal points among observations of abrupt changes; and finally, false positive (FP) is the number of data points determined as positive among the normal observations.
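The three measures can be computed directly from the true labels and the flags produced by a detection method. The sketch below uses the standard definitions; the function name and the example labels are hypothetical.

```python
def detection_metrics(truth, flagged):
    """Return (SEN, SPE, ACC) from boolean truth labels and detection flags."""
    tp = sum(1 for t, f in zip(truth, flagged) if t and f)
    fn = sum(1 for t, f in zip(truth, flagged) if t and not f)
    tn = sum(1 for t, f in zip(truth, flagged) if not t and not f)
    fp = sum(1 for t, f in zip(truth, flagged) if not t and f)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one abnormal and one normal observation")
    sensitivity = tp / (tp + fn)            # share of abrupt changes detected
    specificity = tn / (tn + fp)            # share of normal points left unflagged
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical labels: two abrupt changes, three normal observations.
truth = [True, True, False, False, False]
flagged = [True, False, False, True, False]
print(detection_metrics(truth, flagged))
```

A method with narrow limits flags many points, which raises sensitivity and lowers specificity; accuracy summarizes both.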
We simulate datasets for the three types of variance functions and two choices of the proportion of abnormal changes. In each dataset, the variance function is estimated using the local constant regression, which is the solution to (5) with for all and as
where is a set of training data used for estimation and is the second-order Gaussian kernel function obtained by differentiating the Gaussian kernel twice. The kernel estimator (8) is supported by the ‘npreg’ function of the ‘np’ package in R software. The bandwidth is selected by a cross-validation procedure minimizing the cross-validated error ,
where is the kernel estimator calculated without the observation of the -th observation.
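The estimation and bandwidth selection steps can be sketched in plain Python. This is not the `npreg` implementation: it assumes an in-control mean of zero (so the squared residual is the squared log price change), a product Gaussian kernel, and one common bandwidth `h` for both sales-volume coordinates. The function names, the candidate grid, and the toy training data are hypothetical.

```python
import math


def _weight(xi, x, h):
    """Product Gaussian kernel weight; the normalising constant cancels in the ratio."""
    d0 = (xi[0] - x[0]) / h
    d1 = (xi[1] - x[1]) / h
    return math.exp(-0.5 * (d0 * d0 + d1 * d1))


def local_constant_variance(x, train_x, train_y, h, skip=None):
    """Kernel-weighted mean of squared log price changes near x = (x_prev, x_curr)."""
    num = den = total = 0.0
    count = 0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        if i == skip:  # leave one observation out for cross-validation
            continue
        w = _weight(xi, x, h)
        num += w * yi * yi
        den += w
        total += yi * yi
        count += 1
    if den > 0.0:
        return num / den
    return total / count  # all weights underflowed: fall back to the global mean


def cv_error(train_x, train_y, h):
    """Leave-one-out squared prediction error for the squared responses."""
    n = len(train_x)
    errors = [
        (train_y[i] ** 2
         - local_constant_variance(train_x[i], train_x, train_y, h, skip=i)) ** 2
        for i in range(n)
    ]
    return sum(errors) / n


def select_bandwidth(train_x, train_y, grid):
    """Pick the candidate bandwidth with the smallest cross-validated error."""
    return min(grid, key=lambda h: cv_error(train_x, train_y, h))


# Toy training data: small changes at low volume, large changes at high volume.
train_x = [(0.0, 0.0), (0.0, 0.0), (10.0, 10.0), (10.0, 10.0)]
train_y = [1.0, -1.0, 3.0, -3.0]
h_best = select_bandwidth(train_x, train_y, [0.5, 50.0])
print(h_best, local_constant_variance((0.0, 0.0), train_x, train_y, h_best))
```

The square root of the fitted variance at $(x_{t-1}, x_t)$ can then be plugged into the control limits for the corresponding test observation.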
Tables 1 and 2 report the performance measures of the five methods for the two proportions of abnormal changes, respectively. First, the tables show that our proposed method (Var) achieves a higher accuracy than all of the other methods (Quartile, Tukey and Const) when the variance of the price change depends on the sales volume (cases (b) and (c)). Further, the performance numbers from our method are close to those of Oracle, which uses the true variance function. Second, in case (a), the case of constant variance of the price change, Const and our method (Var) perform similarly, and both are more accurate than the other two methods (Quartile and Tukey). Third, the Tukey method has high sensitivity but low specificity in all three cases. This is because the Tukey method tends to build narrower control limits than the other methods, so data points are more likely to be flagged as outliers. Table 3 shows that the Tukey algorithm has a larger size than the other methods, which supports the observation that the Tukey algorithm builds narrower control limits.
We also investigate the accuracy of the variance estimation. Figure 5 displays the mean square error (MSE) of variance estimates for each case. In the figure, the black dots are the observed data points of . The figure shows that MSE is relatively small for the area of , where more data points are observed, whereas it is large on the upper-right side of and , where few data points are observed.
5 Data example
In this section, we apply the proposed detection method to the real scanner data obtained from the Korean Chamber of Commerce and Industry (KCCI). These scanner data were collected at weekly intervals between the years 2013 and 2014 from approximately 2,000 retail stores. We monitor price changes for one of the popular items, that is, A-brand cartons of milk for toddlers.
We set up a retail store and monitor for anomalies in price changes for this item. We calculate the weekly average price, , and the log value of the price change , for . The data collected in 2013 in all of the stores are used as the training samples to compute control limits, and the data collected in 2014 for the specified store are used as the test samples for monitoring. We aim to compare four different methods of establishing control limits for monitoring the price changes as in the simulation studies: (i) the new method with estimated variance (Var), (ii) the method with constant variance independent of sales volume (Const), (iii) the quartile method (Quartile), and (iv) the Tukey algorithm (Tukey).
Figure 6 plots the monitoring results for the log of the price change for the test sample, with four control limits from four methods that are computed by the training samples. The x-axis represents a one-week sampling interval, and the y-axis on the left represents the log of the price change, . To facilitate an understanding of this item’s price, the right axis represents the weekly average selling price, (unit: KRW) at that point. The solid lines represent the control limit of the proposed method (Var), and the two-dashed, dot-dashed and long dashed lines are for the existing methods: Const, Quartile and Tukey, respectively. The gray solid line around the center represents the average selling price of this item sold every week. In addition, ‘’ represents the log of the price change, ‘’ represents the outlier detected by the variable method, and ‘’ represents the outlier detected by the existing methods (Const, Quartile and Tukey). Therefore, we can observe that our new method identifies fewer rare events compared with all of the other methods (Const, Quartile and Tukey).
In Figure 6, outliers are found by all of the existing methods, but only outliers are found by the variable method, implying that the variable method may tend to show high specificity. From the observations between week and , two points are judged as outliers in all methods, and we can see that these are cases when the price dropped from KRW to KRW and again increased to KRW . It is not surprising that price changes are detected since prices change approximately .
On the other hand, from the observations between week and , all existing methods detect the nd and rd points as outliers except the variable method. The price changes around week are relatively small, approximately KRW in width. This means when the price change is regarded as sufficiently acceptable, the new method in this study does not judge the observations as obvious outliers or unusual observations. This is a more reasonable judgment to predict outliers.
Figure 4 shows the contour plot of the variance estimates for price changes of this item as a function of sales volume at time and . The x-axis and the y-axis indicate the sales volume of the test sample at time and , respectively. Note that we choose as a test sample one retail store, where the weekly average of the sales volume lies between and . In Figure 4, the variance estimates when the sales volume is relatively small (bottom left) appear larger than when the sales volume is relatively large (top right).
In this study, we propose an outlier detection method based on the fact that the variance of the price change depends on the sales volume. While the existing methods judge whether or not the price changes are detected irrespective of the sales volume, the proposed method (Var) reflects the fact that the dispersion of the price changes can differ according to the sales volume. The simulation results and empirical analysis show that the detection method considering the variance of the price change as a function of the sales volume achieves high accuracy and, especially, an increased specificity.
These characteristics allow for conservative anomaly detection when the price change varies depending on the sales volume, i.e., when the variables of interest are affected by the information of other explanatory variables.
6 Conclusion
In this paper, we propose a new procedure to monitor the changes in a sequence of target variables (prices) when the target variable is influenced by another observed covariate (sales volume). This study is motivated by the utility of employing scanner data collected by the Korean Chamber of Commerce and Industry (KCCI), where we are interested in detecting and removing abnormal changes in the log of the price. To build the control limits for monitoring, we model the variance of the changes of the log of the price that is under control as a smooth function of the sales volume and adopt the local polynomial regression to estimate it. The numerical study and data examples show the advantages of the new proposal over existing methods in various measures of performance.
We now conclude the paper with a discussion on the covariate-dependent mean of the log of the price. If the mean of the log of the price that is under control also depends on the sales volume, the log of the price change $y_t$ is modeled as follows:

$$y_t = m(x_{t-1}, x_t) + \mu_t + \sigma(x_{t-1}, x_t)\,\epsilon_t,$$

where $m(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ are continuous functions in $x_{t-1}$ and $x_t$, and all others are defined the same as those in (3). In the sequel, the control limits at time $t$ become

$$\bigl(m(x_{t-1}, x_t) - 3\,\sigma(x_{t-1}, x_t),\ m(x_{t-1}, x_t) + 3\,\sigma(x_{t-1}, x_t)\bigr),$$

where $x_{t-1}$ and $x_t$ are the sales volumes at time $t-1$ and $t$, respectively, and $m$ and $\sigma^2$ are the smooth mean and variance functions, respectively. However, in our analysis, we consider the price under control as a constant that does not depend on the sales volume because, otherwise, it raises a fundamental question of “what is the price of a product”.
- Bird et al. (2014) Bird, D., Breton, R., Payne, C., and Restieaux, A. (2014). Initial report on experiences with scanner data in ONS, Office for National Statistics, UK. http://www.ons.gov.uk/ons/guide-method/usesr-guidance/prices/cpi-and-rpi/intial-report-on-experiences-with-scanner-data-in-ons.pdf (accessed 17 February, 2019).
- Haan and van der Grient (2011) Haan, J. de and van der Grient, H. (2011). Eliminating chain drift in price indexes based on scanner data, Journal of Econometrics, 161, 36-46.
- Haan and Krsinich (2014) Haan, J. de and Krsinich, F. (2014). Scanner data and the treatment of quality change in nonrevisable price indexes, Journal of Business & Economic Statistics, 32, 341-358.
- Saidi and Rubin-Bleuer (2005) Saidi, S. and Rubin-Bleuer, S. (2005). Detection of outliers in the Canadian consumer price index, Business Survey Methods Division, Statistics Canada, 5, 16-18.
- Rais (2008) Rais, S. (2008). Outlier detection for the consumer price index, Proceedings of the Statistical Society of Canada. http://ssc.ca/sites/default/files/survey/documents/SSC2008_S_Rais.pdf (accessed 17 February, 2019).
- Tukey (1977) Tukey, J. W. (1977). Exploratory data analysis, Addison-Wesley, Massachusetts.
- ILO Consumer Price Index Manual (2004) International Labor Organization, International Monetary Fund, Organization for Economic Co-operation and Development, United Nations Economic Commission for Europe, The World Bank (2004). Consumer Price Index Manual: Theory and Practice. https://www.ilo.org/wcmsp5/groups/public/---dgreports/---stat/documents/presentation/wcms_331153.pdf (accessed 17 February, 2019).
- Hall and Carroll (1989) Hall, P. and Carroll, R. J. (1989). Variance function estimation in regression: the effect of estimating the mean, Journal of the Royal Statistical Society, Series B, 51(1), 3-14.
- Fan and Yao (1998) Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression, Biometrika, 85, 645-660.
- Brown and Levine (2007) Brown, L. D. and Levine, M. (2007). Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5), 2219-2232.
- Thompson and Sigman (1999) Thompson, K. and Sigman, S. (1999). Statistical methods for developing ratio edit tolerances for economic data, Journal of Official Statistics, 15(4), 517-535.
- Office for National Statistics (2014) Office for National Statistics (2014). Consumer price indices technical manual, UK. http://doc.ukdataservice.ac.uk/doc/7022/mrdoc/pdf/7222technical_manual_2014.pdf (accessed 17 February, 2019).