Forecasting Granular Audience Size for Online Advertising

01/08/2019 · Ritwik Sinha et al.

Orchestration of campaigns for online display advertising requires marketers to forecast audience size at the granularity of specific attributes of web traffic, all of which are categorical (e.g. US, Chrome, Mobile). With each attribute taking many values, the very large set of attribute-value combinations makes estimating audience size for any specific combination challenging. We modify Eclat, a frequent itemset mining (FIM) algorithm, to accommodate categorical variables. For the resulting frequent and infrequent itemsets, we then provide forecasts using time series analysis, with conditional probabilities to aid approximation. An extensive simulation, based on typical characteristics of audience data, is built to stress test our modified-FIM approach. On two real datasets, comparison with baselines, including neural network models, shows that our method lowers the computation time of FIM for categorical data. On hold-out samples we show that the proposed forecasting method outperforms these baselines.


1. Introduction

The online display advertising (hereafter, display ad) ecosystem has many players that intermediate between publishers and marketers (Muthukrishnan, 2009). For targeting ad campaigns to consumers, it is imperative for a marketer to estimate the number of visitors satisfying a set of targeted attribute values in a future time period. Consider one such target: {Country:US, Browser:Chrome, Device:Mobile}. The marketer may be interested in predicting the number of advertising bid requests for this target flowing into the Demand Side Platform (DSP) in the following week. This helps optimize spend allocation among the various targets in a campaign, as well as manage the marketing budget across campaigns.

Forecasting granular-level audience size poses a considerable data mining challenge because of the explosion in the number of possible categorical attribute value combinations. One of our two real-world datasets contains several attributes, each taking many values, resulting in an enormous number of unique targets. While not all combinations are observed in the data, it is still infeasible to store data for all observed combinations and apply time series estimation methods to each. Notably, forecasting audience size for web traffic is an under-researched area, although programmatic advertising is the subject of growing research, with inroads in diverse topics like bid optimization (Zhang et al., 2014), targeting (Goldfarb and Tucker, 2011), estimating conversion rate (Lee et al., 2012) and click-through rate (Zhang et al., 2016).

In proposing a practicable solution, we develop a three-stage approach: first, bringing the problem to a tractable scale using frequent itemset mining (FIM); second, using conditional probability to extend to unobserved targets; and third, leveraging time series analysis to forecast. Our approach is evaluated on two datasets: first, bid request data received by a DSP, and second, web analytics data of a US publisher. The DSP receives bid requests from multiple Ad Exchanges and serves multiple advertisers. The web analytics data, although from a single publisher, is more feature-rich than the bid request data. While the two settings are different, the forecasting problem has important commonalities: both datasets comprise historical time-stamped consumer events (bid requests and page views, respectively), where each event is defined by values for a set of categorical attributes. For each dataset, we forecast the number of events occurring in a given time period for a specific target set defined using values of categorical attributes. Our solution for the first dataset computes and stores the support for only the frequent itemsets, a small fraction of all possible targets, and maintains only a small number of time series models, and yet is more accurate than the baselines.

Online audience estimation requires forecasts that (1) are available for any arbitrary attribute value combination, (2) are frequently updated, and (3) account for temporal variations. Historical time-stamped events are used to estimate the number of events with specified attribute values in a future time period. Notably, all attributes of web traffic data are categorical and most attributes show long-tailed univariate distributions (Figure 1). Under this premise, our contributions are: One, we leverage the categorical nature of attributes to efficiently mine frequent attribute combinations from the event database (Section 3.1) by modifying a leading FIM algorithm to include categorical constraints. This improves running time and helps meet (2). Two, the mined frequent item (attribute) sets (FIS) are only a small portion of the attribute combinations used by firms for targeting. For the non-FIS, a very large set, we offer a scalable forecasting method, since the cost of storing all data is prohibitive. Our solution uses an approximation based on conditional probability, storing data for only relatively few attribute sets. Three, given a target set definition and a time period in the testing phase, we select an appropriate time series model and then use information obtained from FIM to estimate audience size, which meets (3). The approach also produces estimates for non-FIS, thereby providing estimates for any arbitrary itemset (Section 3.2) and satisfying (1). Four, contributing to the FIM literature concerned with categorical variables, we introduce a simulation framework to stress test FIM algorithms.

2. Related Work

The curse of cardinality in web traffic attribute combinations manifests in adverse query time and massive cost of storing temporal data. Websites need forecasts at a granular level of attribute combinations, updated often. While existing FIM algorithms may handle the curse by extracting FIS, that alone fails to meet websites' needs for forecasts for the many other, non-frequent itemsets. We bring tools from probability and time series analysis to address these issues. The forecasting problem considered in this work has been explored earlier by Agarwal et al. (Agarwal et al., 2010). However, they use domain knowledge in display advertising to build time series models for a subset of attribute combinations; we use FIM to build a generalizable approach.

The Apriori algorithm (Agrawal and Srikant, 1994) has been followed by the Eclat (Zaki et al., 1997), FP-Growth (Han et al., 2000), and LCM (Uno et al., 2004) algorithms. The latter three are considered better off-the-shelf algorithms for association rule mining problems (Borgelt, 2012). Building on constraint-based mining (Srikant et al., 1997; Pei et al., 2001), Do et al. (2003) add category-based constraints to Apriori. Advancing this work, we add categorical constraints to Eclat and show better performance against other state-of-the-art algorithms.

Time series forecasting is not new (Hamilton, 1994). Recent attention has turned to searching over a class of models and forecasting with the best-performing one, including Exponential Smoothing (Hyndman et al., 2008), automatic ARIMA models (Hyndman and Khandakar, 2008) and Prophet (Taylor and Letham, 2017). We explore these three and a neural network based approach in our experiments.

Our new framework for stress testing FIM algorithms draws upon statistical copulas (Nelsen, 1999) to capture statistical dependencies among categorical variables. Existing datasets for testing FIM algorithms are not built around categorical variables. This addition to the FIM literature is expected to help in testing and comparing the suitability of algorithms for data with categorical variables, which are common in web traffic.

3. Approach

Let us define a set of attributes A = {a_1, ..., a_m}, where each a_i takes one value from a possible set of values V_i. Let the set of events be D = {e_1, ..., e_n}, where each event or transaction e = (v_1, ..., v_m) is defined by a value assignment for each attribute, with v_i ∈ V_i. Additionally, each transaction has a timestamp associated with it. We define a target definition as T = (t_1, ..., t_m), where t_i ∈ V_i ∪ {*}, with * a special marker indicating that a_i can take any value in V_i. This marker defines targets where some attributes are left unspecified. A transaction satisfies the target definition if v_i = t_i for all i where t_i ≠ *. The audience estimation problem is formally stated as: given a historic dataset D, estimate the number of events satisfying T in a future time range.
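As a concrete illustration of this formulation, the sketch below checks whether an event satisfies a target definition and counts matching events in a time window. The dictionary representation, attribute names, and the "*" wildcard marker are illustrative choices, not the authors' implementation.

def satisfies(event, target):
    # An event matches if every specified (non-"*") attribute agrees.
    return all(value == "*" or event.get(attr) == value
               for attr, value in target.items())

def audience_size(events, target, start, end):
    # Count historical events in [start, end) that satisfy the target.
    return sum(1 for e in events
               if start <= e["timestamp"] < end and satisfies(e, target))

# Example target: {Country:US, Browser:Chrome, Device:Mobile}, other attributes unspecified.
target = {"country": "US", "browser": "Chrome", "device": "Mobile", "os": "*"}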

3.1. Frequent Itemset Mining

In frequent itemset mining (FIM), the events could be transactions, as in the case of purchases, or occurrences of an audience member on a publisher site, as in our case. The problem is formally stated as follows (Borgelt, 2012). For the set of transactions D, such that each transaction is a set of items, denote the set of all possible items as I. Hence, each transaction t is an itemset, t ⊆ I. The cover K(S) of an itemset S is the set of all transactions t such that S ⊆ t. The support s(S) is the size of the cover, s(S) = |K(S)|. The problem is to find all itemsets S ⊆ I with support at least a threshold s_min. Additional constraints allow more efficient enumeration of frequent itemsets (Srikant et al., 1997). A constraint is a mapping from the power set of items to a boolean value, C: 2^I → {0, 1}. FIM algorithms exploit properties of the support constraint (C_supp(S) = 1 iff s(S) ≥ s_min).

A characteristic of online traffic is that each attribute a_i of a transaction takes only one of the values in V_i. This implies that any itemset which has two or more values for the same attribute must have zero count, which we encode as the categorical constraint C_cat. We modify Eclat by checking C_cat during the candidate set generation stage. Note that LCM and FP-Growth keep both the horizontal and vertical representations of the transactions (explicitly in the case of LCM and as the FP-Tree in FP-Growth) (Borgelt, 2012), and thus cannot benefit from the inclusion of C_cat in the same way. Formally, C_cat(S) = 1 iff S contains at most one value from each V_i as defined in Section 3, and 0 otherwise. Constraints can be characterized by properties such as anti-monotone, succinct, and convertible (Ng et al., 1998). We state the definitions of two such properties here.

Definition 3.1 (Anti-monotone).

Anti-monotone: A constraint C defined on sets is anti-monotone iff for all itemsets S ⊆ S', C(S') = 1 implies C(S) = 1.

Definition 3.2 (Succinct).

Succinct: A constraint C defined on sets is succinct iff, for all itemsets S, C(S) can be expressed as: p(x) holds for every item x ∈ S, for a predicate p defined on individual items.

C_cat is anti-monotone and succinct (Do et al., 2003). Anti-monotone constraints can be applied in a level-wise algorithm, at each level successively (Do et al., 2003). Moreover, if a constraint is succinct, it is also pre-counting pushable. While (Do et al., 2003) applied this to Apriori, we extend it to Eclat. This is done by pre-counting pruning: C_cat is pushed to the stage after the candidate generation phase and prior to support-related checks, discarding ineligible candidates. For Eclat, the check is pushed to the stage prior to computing the intersections of the transaction lists of generated candidates (see Algorithm 1).

// Define T(S): transaction ID list for itemset S
// Initial call: EclatCC(∅, I, s_min)
Function EclatCC(P, X, s_min)
       Result: F, the set of frequent itemsets
       forall x ∈ X do
             // P ∪ {x} is a frequent itemset
             F ← F ∪ {P ∪ {x}}, X' ← ∅
             forall y ∈ X with y > x do
                   S = P ∪ {x, y}
                   // Pre-counting pruning
                   if C_cat(S) = 1 then
                         T(S) = T(P ∪ {x}) ∩ T(P ∪ {y})
                         if |T(S)| ≥ s_min then X' ← X' ∪ {y}
                   end if
             end forall
             // Recursive call
             if X' ≠ ∅ then EclatCC(P ∪ {x}, X', s_min)
       end forall
Algorithm 1 Eclat-CC
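For concreteness, the following is a minimal Python sketch of Eclat with the categorical constraint, under the assumption that each item is an (attribute, value) pair so the constraint simply rejects candidates that repeat an attribute; it illustrates the idea above and is not the authors' implementation.

from collections import defaultdict

def eclat_cc(transactions, min_support):
    # Vertical representation: item -> set of transaction ids (TID lists).
    tidlists = defaultdict(set)
    for tid, t in enumerate(transactions):
        for item in t:                       # item = (attribute, value)
            tidlists[item].add(tid)

    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            new_candidates = []
            for other, other_tids in candidates[i + 1:]:
                # Pre-counting pruning: drop candidates repeating an attribute.
                if any(other[0] == it[0] for it in itemset):
                    continue
                inter = tids & other_tids    # TID-list intersection
                if len(inter) >= min_support:
                    new_candidates.append((other, inter))
            if new_candidates:
                recurse(itemset, new_candidates)

    roots = [(item, tids) for item, tids in sorted(tidlists.items())
             if len(tids) >= min_support]
    recurse((), roots)
    return frequent

# Example transaction: {("country", "US"), ("browser", "Chrome"), ("device", "Mobile")}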

3.2. Audience Estimation

The previous section described the generation of FIS from the database D containing historical transactions in a training time period H. The mined FIS provide s(S) for every target set S satisfying the threshold s_min. The interest lies in the support of S in a future time period F, that is, s_F(S). While FIM obtains s(S) for many target sets, forecasting each one directly requires maintaining highly granular time series data per target, which is infeasible for arbitrary targets, including non-FIS targets. Our approach requires maintaining a granular time series only for a small number of univariate (single-item) targets, and for these targets performing a time series forecast that captures seasonal and trend patterns.

Denote the set of univariate time series targets as U. Given a target S, the mined supports, and a future time period F, we estimate the expected number of events satisfying S in F. The FIS from the training period are stored along with their support. We compute the best univariate time series (see below) to generate predictions for a target u ∈ U, subject to u ∈ S and s(u) meeting the univariate threshold. The prediction for S is

\hat{s}_F(S) = \hat{P}(S \mid u) \cdot \hat{s}_F(u),     (1)

where we use the empirical estimate for P(S | u) from the training period, given by

\hat{P}(S \mid u) = s(S) / s(u),     (2)

since u ∈ S implies s(S ∪ {u}) = s(S). In equation (1) we make the assumption that P_F(S | u) ≈ P_H(S | u), that is, the conditional probability of S given u remains (almost) constant from the training to the forecasting period. We tested this assumption empirically on the FIS from the two real datasets we work with, and found a strong Pearson correlation between these two quantities for both.

When S is not frequent, we approximate P(S | u) by the product of the conditional probabilities of the individual specified attribute values given u, where an attribute marked with * may take any value in its support and contributes no factor. In other words, we assume conditional independence among the attributes and compute the joint probability as the product of marginal probabilities.

When the itemset needed for a factor is frequent, we can use the formulation described in equation (2). In the other case, we use the threshold probability estimate s_min / s(u), where s_min is the support threshold used for FIM. This is an upper bound on the empirical estimate for this itemset (using equation (2)).

To estimate the second term in equation (1), we explore multiple classes of time series models to generate the forecast \hat{s}_F(u), along with its standard deviation, for all elements of U (details in Section 4.2). The granularity of the forecasts depends on the granularity of the input data; we generate hourly forecasts.

Now, from the set of candidate univariate time series for a target S, that is, those u which satisfy (1) u ∈ S and (2) s(u) above the univariate support threshold, we choose the time series with the least error in prediction. From this limited set of univariate time series we still generate good predictions, as shown in our experiments. We preselect the univariates satisfying the support condition at the time of computing the frequent itemsets, and choose among the resulting candidate time series at prediction time by minimizing the standard error of the estimate \hat{s}_F(u).
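A hedged sketch of this estimation rule follows. The data structures (a support dictionary keyed by frozensets of (attribute, value) items, a set of stored univariates, and a ts_forecast helper returning a point forecast and its standard error) are assumptions made for illustration.

def conditional_prob(items, u, support, s_min):
    # P(items | u): equation (2) when the itemset is frequent, otherwise a
    # conditional-independence fallback with s_min as an upper bound per factor.
    s_u = support[frozenset([u])]
    if items in support:
        return support[items] / s_u
    p = 1.0
    for item in items - {u}:
        p *= support.get(frozenset([item, u]), s_min) / s_u
    return p

def forecast_target(target, support, univariates, ts_forecast, period, s_min):
    items = frozenset(target.items())
    # Candidate univariates: stored single-item targets appearing in the target.
    candidates = [u for u in items if u in univariates]
    u = min(candidates, key=lambda c: ts_forecast(c, period)[1])  # least std. error
    point, _ = ts_forecast(u, period)
    return conditional_prob(items, u, support, s_min) * point     # equation (1)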

4. Experiments

Statistical Copula for a Simulation Framework: The current FIM literature offers synthetic datasets (Agrawal and Srikant, 1994) which do not emulate the categorical nature of web traffic. Our framework fills this gap by creating synthetic data with two important properties: first, the marginal distributions follow the structure typically seen in audience data, with many attributes depicting a long-tailed distribution (Figure 1); second, the strong dependence structure common in web traffic is maintained. For example, a type of browser is more likely to be used on a certain operating system. We achieve this by introducing the statistical copula (Nelsen, 1999) into the FIM literature. A copula is a function that joins a multivariate distribution function to its one-dimensional marginals. This approach allows arbitrary marginal distributions while controlling the level of dependence between attributes.

We construct a Gaussian copula from a multivariate normal distribution by first specifying a correlation matrix Σ. We simulate the random vector Z = (Z_1, ..., Z_m) from the multivariate Gaussian with correlation matrix Σ. Then, the vector U = (Φ(Z_1), ..., Φ(Z_m)) (where Φ is the univariate standard normal CDF) has marginal distributions which are uniform in [0, 1] and a Gaussian copula which captures the dependence. Finally, to achieve the target distributions F_1, ..., F_m, we perform the transformation X_i = F_i^{-1}(U_i), where F_i^{-1} is the inverse CDF corresponding to F_i. The resulting vector X has the desired marginals with a given dependence structure.

We are still left with deciding two quantities, the marginal distributions F_i and the correlation matrix Σ. Experiments with long-tailed distributions show a good way to select the F_i: base them on the observed multinomial distribution of attribute values. We base the marginals on distributions typically observed in real data (Figure 1). To choose Σ, we make use of the structure of the observed data, ensuring that the association matrices for the real and simulated data follow a similar pattern.
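The simulation step can be sketched as follows; the function and variable names, and the small example marginals and correlation matrix, are illustrative assumptions rather than the exact settings used in the paper.

import numpy as np
from scipy.stats import norm

def simulate_events(marginals, corr, n_events, seed=0):
    rng = np.random.default_rng(seed)
    m = len(marginals)
    # Correlated Gaussians encode the dependence structure.
    z = rng.multivariate_normal(np.zeros(m), corr, size=n_events)
    # Gaussian CDF -> uniform marginals; the copula preserves the dependence.
    u = norm.cdf(z)
    # Inverse-CDF transform to the target (long-tailed) multinomial marginals.
    events = np.empty((n_events, m), dtype=int)
    for j, probs in enumerate(marginals):
        cdf = np.cumsum(probs)
        events[:, j] = np.minimum(np.searchsorted(cdf, u[:, j]), len(probs) - 1)
    return events  # each row is an event; each column holds an attribute value index

# Example: three attributes with steep marginals and moderate correlation.
marginals = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
corr = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
sample = simulate_events(marginals, corr, n_events=100_000)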

With the goal of testing the robustness of FIM approaches and comparing them, we vary the following parameters in the synthetic data. First, we vary the number of attributes. Next, the association is either as observed in real data or the off-diagonal elements are halved (referred to as 'high' or 'low' correlation). Finally, we set the multinomial marginal distributions to be either long-tailed ('steep') or uniform ('flat').

BRD                                 PVD
Attribute          Unique Values    Attribute        Unique Values
ad_exchange        8                browser          876
browser            100              color_depth      8
country            233              country          233
device_family      23141            domain           61684
device             3                language         153
os                 55               os               257
region             2598             ref_type         7
slot_size          693              region           1043
slot_visibility    3                resolution       448
                                    visit_number     12884
Table 1. Distinct values for attributes in the real datasets

Bid Request Dataset (BRD): This dataset arises in the ecosystem where the publisher seeks competitive bids using Ad Exchanges and a Real-Time Bidding (RTB) platform. The publisher delivers the consumer's information, comprising nine attributes, for real-time bidding by marketers seeking consumers matching those attributes. The training data comprise logs from a period in March 2017, and the testing data comprise logs for April 2017. Millions of bid request events are present in both the training and the testing periods, large enough for valid experiments. Each event has nine attributes (Table 1) and a time stamp, and most attributes have a substantial number of distinct values, making the number of possible attribute combinations enormous. The histogram for two attributes is presented in Figure 1; a similar long-tailed distribution exists across all attributes.

Figure 1. Attribute value frequencies in BRD and PVD

Page View Dataset (PVD): This dataset comes from a publisher who sells the consumer's information directly to marketers based on contractual pricing (Roels and Fridgeirsdottir, 2009). For each page view, the publisher matches the consumer's attributes to those desired by marketers and then offers it to a matched marketer. The contractual mechanism is less studied; our work applies to both competitive bidding and contractual pricing, so this second dataset affords generalization of our approach. The training data comprise millions of page views from March to April 2017, and the testing data comprise millions of page views for April. We refer to this dataset as PVD. The dataset has ten attributes, some with a large number of distinct values (Table 1), leading to an enormous number of possible itemsets. As in BRD, attributes display a long-tailed distribution (Figure 1).

Figure 2. Computation time of FIM algorithms on synthetic data. Average time from three runs, presented for different numbers of attributes, correlations across attributes and marginal distributions, for a support of 10% (other support levels not displayed in the interest of space).

4.1. Frequent Itemset Mining

We perform experiments on the synthetic data and the two real datasets. The experiments are carried out on a machine with 16GB RAM and a 3.5GHz CPU running a Linux distribution. The algorithms included in our analysis are Apriori-CC (Do et al., 2003), Eclat (Zaki et al., 1997), Eclat-CC, LCM (Uno et al., 2004) and FP-Growth (Han et al., 2000). We follow or extend the implementation of Borgelt (Borgelt, 2012) for these algorithms and record the computation time averaged over three runs.

The methods are first compared on the synthetic data (Figure 2). We present results for two levels of correlation and two univariate distribution patterns, across three different numbers of attributes. For each combination, millions of events are generated. We make a few broad observations. First, as expected, a higher number of attributes makes the problem more challenging, as reflected in increased computation times. Second, lower correlation leads to a limited decrease in computation times. Third, a steep distribution in the univariates leads to higher running times than flat (equally likely) marginals. This happens because a steep distribution and higher correlation lead to a higher number of itemsets meeting the threshold, and hence longer run times.

Dataset Support Apriori-CC Eclat Eclat-CC FP-Growth LCM
PVD 1% 70.7 83.2 76.8 71.7 72.5
5% 75.5 46.6 46.0 66.4 66.2
10% 71.8 41.9 42.1 59.9 59.1
BRD 1% 160.1 112.0 105.4 165.1 167.5
5% 167.9 80.1 78.7 154.3 151.0
10% 155.6 73.7 73.0 135.8 137.6
Table 2. Comparison of FIM algorithms. The average computation time (in seconds) from three runs of the algorithm.

In comparing the algorithms, some of the findings are: one, Eclat-CC performs better than unconstrained Eclat on average, which itself performs better than Apriori-CC. Considering average ranks across different scenarios, the performance of the algorithms in decreasing order is: Eclat-CC, Eclat, LCM, FP-Growth, Apriori-CC. Thus, incorporating the categorical constraint into Eclat leads to an algorithm that performs better than the other state-of-the-art algorithms.

On the real data BRD, we find (Table 2) that Eclat-CC is roughly 1% to 6% faster than the next best algorithm (Eclat), and roughly 36% to 49% faster than FP-Growth and LCM. On the other real dataset, PVD, Eclat-CC is the best algorithm at a support of 5%, while being very close to the best algorithm (Eclat) at a support of 10%. Moreover, at low support (1%), Eclat-CC performs in between the best algorithm (Apriori-CC) and Eclat. Thus, incorporating categorical constraints into FIM algorithms leads to more efficient implementations for audience size estimation. It is worth noting that the training data for BRD contains more events than the other datasets analyzed, suggesting that the gains from incorporating CC may be more pronounced for larger datasets. We test this hypothesis with a larger simulated dataset and find that Eclat-CC remains faster than both Eclat and FP-Growth.

4.2. Audience Forecasting

To evaluate the accuracy of forecasts, we compare our approach with a naïve but feasible baseline (FB), an accurate but infeasible baseline using individual target time series (TS), and a machine learning based method (ML). This comparison, across both the BRD and PVD datasets, is done on two different target sets: FIS and IFIS (infrequent itemsets). For FIS, the support threshold is set at 0.01% of the dataset size throughout. We find millions of FIS in both BRD and PVD. We sample FIS from each dataset with probability proportional to the support of the itemset, ensuring that itemsets of varying supports are included in the sample. For IFIS, we sample infrequent itemsets from among those with support below the threshold. We now describe our baseline approaches.

Individual Target Time Series based infeasible baseline (TS): The entire time series is stored for all itemsets in FIS and in IFIS. Forecasts are generated directly by modeling the time series for each itemset, without using conditional probabilities and univariate time series. This baseline ignores the computation time and storage requirements of maintaining time series for millions of itemsets; we use it as a boundary-condition baseline against which to compare our approach.

Feasible Baseline (FB): We find all univariate itemsets satisfying a given threshold (0.5% of the dataset). For each of these, we obtain the hourly counts as a percentage of the global count for that hour. For each such time series, we train a model, so that we can forecast the fraction of the hourly global count represented by the respective univariate itemset. We also maintain the global time series, for the target in which every attribute may take any value. For a target, we predict the hourly count estimate by multiplying the global time series forecast by the forecast percentages of each specified univariate. For univariate values where we do not have a time series, we assume that the percentage varies up to the threshold value used (in our case, the interval from 0 to 0.5%). This gives us a ranged estimate, which we average to get the point estimate.
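A short sketch of this baseline follows; the helper names (global_forecast, fraction_forecast) and the midpoint treatment of the 0.5% univariate threshold are illustrative assumptions.

def fb_estimate(target, tracked, global_forecast, fraction_forecast, hour,
                univariate_threshold=0.005):
    # Global hourly forecast scaled by forecast fractions of each specified univariate.
    estimate = global_forecast(hour)
    for item in target.items():
        if item in tracked:
            estimate *= fraction_forecast(item, hour)
        else:
            # Untracked value: its share lies in [0, threshold]; use the midpoint.
            estimate *= univariate_threshold / 2
    return estimate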

Machine Learning Baseline (ML): We modify the datasets to recast the audience forecasting task as a supervised learning problem. To achieve this, we create a training set by sampling 5,000 itemsets from the FIS mined at 0.01% support from both PVD and BRD, sampling with probability proportional to the support, ensuring that itemsets of varying supports are included in the sample (similar to the FIS target sets). We collect hourly counts for these itemsets throughout the training and testing periods. Each row of each dataset consists of the itemset, the hour of the day, and the count of transactions (page views/bid requests) satisfying the itemset in that hour. We drop the day-of-the-week attribute, since our dataset is limited to a single week and capturing weekly seasonality is not possible in such a situation. Following the construction of this derived dataset, the forecasting problem is reduced to a regression problem, with a categorical input (itemset, hour) and the count of transactions as output. However, since the total number of levels across the various attributes is large (ranging into a few thousands), it is intractable for machine learning models to capture interactions among attributes. Hence, we group all attribute values for which we do not have univariate time series, i.e., those present in less than 0.5% of the dataset, into a new level.

The model is first trained on a subset of the sampled FIS, and the trained model is used to make predictions for the same benchmark set as the other baselines. The model is a multi-layer fully connected network, with dropout, and with an additional embedding layer at the input, implemented in PyTorch (Paszke et al., 2017). Categorical inputs are mapped to columns of the embedding layer and then passed through the network to make predictions that minimize MAPE. Hyperparameters are chosen using the hyperopt library (https://github.com/hyperopt/hyperopt), considering a hyperparameter space spanning the embedding layer dimension, dropout, and number of layers. Each parameter is sampled uniformly from the corresponding parameter space. We use 6 days of data to train each model, and optimize the hyperparameters according to the MAPE for the seventh day, using the TPE algorithm to guide the search over the hyperparameter space across 1000 trials, with 10 epochs per trial. This search leads to a 3-layer model, with layer dimensions 384, 192, and 64, a dropout of 0.05, and an embedding dimension of 128. With this model, we generate predictions for the ML baseline; the results are shown in Figure 3. The model obtained by this process performs worse than our approach across all experiments.
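The hyperparameter search can be outlined with hyperopt's TPE optimizer as below; the search-space bounds and the train_and_evaluate_mape helper are assumptions for illustration, since the exact ranges are not reproduced here.

from hyperopt import fmin, tpe, hp, Trials

# Assumed search space; the paper's exact bounds are not reproduced here.
space = {
    "embedding_dim": hp.choice("embedding_dim", [32, 64, 128, 256]),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
    "n_layers": hp.choice("n_layers", [2, 3, 4]),
}

def objective(params):
    # Train the embedding + fully connected network for 10 epochs with these
    # hyperparameters and return the held-out-day MAPE (hypothetical helper).
    return train_and_evaluate_mape(**params)

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=1000, trials=Trials())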

We generate time series forecasts for the univariate targets in U (from Section 3.2) using four methods: Exponential Smoothing (ETS), automatic ARIMA (ARIMA), Neural Network Autoregression (NNAR) and Prophet. We use the respective R packages to automatically choose the best hyperparameters for our time series methods. We use six days of hourly data to train and offer hourly forecasts for the seventh day, capturing daily seasonality. The methods are evaluated using average Mean Absolute Percentage Error (MAPE). Based on the superior performance of ETS (Table 3), we use it as the time series model for all evaluations.
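As a rough illustration of the univariate forecasting step, the sketch below uses statsmodels' exponential smoothing as a Python stand-in for the R ETS implementation the paper relies on; the hourly_counts series is an assumed input.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def ets_forecast(hourly_counts: pd.Series, horizon: int = 24) -> pd.Series:
    # Additive trend and a 24-hour additive seasonal component capture daily seasonality.
    model = ExponentialSmoothing(hourly_counts, trend="add", seasonal="add",
                                 seasonal_periods=24).fit()
    return model.forecast(horizon)  # hourly forecasts for the held-out day

def mape(actual: pd.Series, predicted: pd.Series) -> float:
    return float((abs(actual - predicted) / actual).mean() * 100)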

Figure 3 shows the results. In the box plots, bars of the same color denote results on the same target sets, by dataset. The mean and median of MAPE across all itemsets are denoted by dashed and solid horizontal lines, respectively. Mean MAPEs for FIS in BRD and PVD are lower than the mean MAPEs for FIS-TS in both datasets, although not the medians, reflecting the higher variability of FIS-TS (a larger spread in the box plot). Hence, we claim that the proposed approach is better, in terms of mean MAPE, than the infeasible baseline. Similarly, the proposed approach always performs better than the feasible but naïve baseline (FB), for both FIS and IFIS, the effect being stronger for FIS. The poor performance of IFIS-TS for PVD may be due to fewer page view data points for infrequent itemsets. The higher MAPEs for IFIS vs. FIS are due to IFIS itemsets having only a few events every hour on average, a small sample from which to obtain good estimates. Surprisingly, even for such small itemsets, our approach, which assumes conditional independence, compares reasonably with IFIS-TS.

Table 3. MAPEs for univariate time series
Method   BRD    PVD     Method    BRD    PVD
ETS      23.2   13.6    NNAR      24.4   17.2
ARIMA    32.2   17.6    Prophet   23.6   26.5

Figure 3. MAPEs (Y-axis) for forecasting: solid and dashed horizontal lines are median and mean. Comparison is relevant across boxes of the same color.

We also explored a Long Short-Term Memory (LSTM) based time series model, but this failed to provide acceptable accuracy.

MAPEs are benchmarked against (Hyndman et al., 2002), where ETS produces MAPEs between 10 and 20% for time series in the M3 competition (Makridakis and Hibon, 2000). Our MAPEs for univariate time series targets, a task comparable to the competition, are 14 to 23%. The audience estimation task is more challenging, since forecasts are made for thousands of attribute combinations without recording the time series for each. Our MAPE values are therefore likely to be acceptable in practice.

5. Conclusion

Knowing the likely size of audience segments for web traffic can help websites better plan their ad campaigns. Audience forecasting is challenging because of the combinatorial explosion in attribute values, any combination of which could be a relevant target audience. We address this problem with a combination of frequent itemset mining and time series modeling. We achieve good accuracy levels on real datasets from two use cases within online display advertising and compare our results with three baseline approaches. We also give a novel FIM approach specific to the categorical characteristics of audience data, and demonstrate its superior performance over state-of-the-art algorithms using a new simulation framework that we propose.

References

  • Agarwal et al. (2010) Deepak Agarwal, Datong Chen, Long-ji Lin, Jayavel Shanmugasundaram, and Erik Vee. 2010. Forecasting high-dimensional data. In ACM SIGMOD 2010.
  • Agrawal and Srikant (1994) Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In VLDB ’94.
  • Borgelt (2012) Christian Borgelt. 2012. Frequent item set mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 6 (2012).
  • Do et al. (2003) Tien Dung Do, Siu Cheung Hui, and Alvis Fong. 2003. Mining Frequent Itemsets with Category-Based Constraints. In Discovery Science: International Conference.
  • Goldfarb and Tucker (2011) Avi Goldfarb and Catherine Tucker. 2011. Online Display Advertising: Targeting and Obtrusiveness. Marketing Science 30, 3 (2011), 389–404.
  • Hamilton (1994) J.D. Hamilton. 1994. Time Series Analysis. Princeton University Press.
  • Han et al. (2000) Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. In ACM SIGMOD.
  • Hyndman and Khandakar (2008) Rob Hyndman and Yeasmin Khandakar. 2008. Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software, Articles 27, 3 (2008).
  • Hyndman et al. (2008) Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. 2008. Forecasting with exponential smoothing: The state space approach. Springer.
  • Hyndman et al. (2002) Rob J Hyndman, Anne B Koehler, Ralph D Snyder, and Simone Grose. 2002. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18, 3 (2002), 439–454.
  • Lee et al. (2012) Kuang-chih Lee, Burkay Orten, Ali Dasdan, and Wentong Li. 2012. Estimating Conversion Rate in Display Advertising from Past Performance Data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Makridakis and Hibon (2000) Spyros Makridakis and Michele Hibon. 2000. The M3-Competition: results, conclusions and implications. International journal of forecasting 16, 4 (2000).
  • Muthukrishnan (2009) S. Muthukrishnan. 2009. Ad Exchanges: Research Issues. In Proceedings of the 5th International Workshop on Internet and Network Economics (WINE ’09).
  • Nelsen (1999) Roger B Nelsen. 1999. Introduction. In An Introduction to Copulas. Springer, 1–4.
  • Ng et al. (1998) Raymond T. Ng, Laks V. S. Lakshmanan, Jiawei Han, and Alex Pang. 1998. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. In the 1998 ACM SIGMOD International Conference on Management of Data.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Pei et al. (2001) Jian Pei, Jiawei Han, and L. V. S. Lakshmanan. 2001. Mining frequent itemsets with convertible constraints. In International Conference on Data Engineering.
  • Roels and Fridgeirsdottir (2009) Guillaume Roels and Kristin Fridgeirsdottir. 2009. Dynamic revenue management for online display advertising. Journal of Revenue and Pricing Management (2009).
  • Srikant et al. (1997) Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal. 1997. Mining Association Rules with Item Constraints. In KDD’97.
  • Taylor and Letham (2017) Sean J Taylor and Benjamin Letham. 2017. Forecasting at scale. The American Statistician (2017).
  • Uno et al. (2004) Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. 2004. An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases.
  • Zaki et al. (1997) Mohammed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li, et al. 1997. New Algorithms for Fast Discovery of Association Rules. In KDD.
  • Zhang et al. (2014) Weinan Zhang, Shuai Yuan, and Jun Wang. 2014. Optimal Real-time Bidding for Display Advertising. In KDD’14.
  • Zhang et al. (2016) Weinan Zhang, Tianxiong Zhou, Jun Wang, and Jian Xu. 2016. Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising. In KDD ’16.