Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

02/28/2018
by Maurits Kaptein, et al., Tilburg University


Abstract

In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume and the fact that they are continuously augmented means that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we coin oFMLR, allows segments to be identified in data streams or extremely large static datasets. Contrary to black box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high level overview of possible methods of dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative, Bayesian, methods; we also identify more general issues that arise from dealing with continuously augmented data sets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset.

Keywords:

Clustering, Finite mixture models, Logistic regression, Online estimation, Expectation-Maximization.

1 Introduction

In marketing we are often confronted with continuous data streams of customer behavior (Su and Chen, 2015; Chatterjee et al., 2003; Moe and Fader, 2004; Bucklin and Sismeiro, 2009). Consider online advertising, which has become a major advertising medium, with billions of USD invested in 2014 in search engines, social media, and general websites (Yeu et al., 2013). Online advertising is an extremely competitive medium, and customers' attention is not easily won: numerous studies have shown that customers deliberately avoid internet ads (Yaveroglu and Donthu, 2008; Cho and Cheon, 2004; Rojas-Méndez et al., 2009; Baek and Morimoto, 2012; Seyedghorban et al., 2015). As such, getting ads noticed and processed by users is of particular importance (Huang and Lin, 2006; Yeu et al., 2013; Yaveroglu and Donthu, 2008). This is not easy, and marketing and advertising scholars have spent considerable effort studying online customer behavior over the last decades (see, e.g., Ranaweera, 2005; Huang and Lin, 2006; Micheaux, 2011; Bleier and Eisenbeiss, 2015; Vakratsas et al., 2004). In these studies, scholars have stressed the difficult challenge of understanding online customer behavior (Yeu et al., 2013).

On a positive note, online customer behavior can potentially make it possible to measure the effectiveness of marketing messages, experiment with novel types of messages, and understand customer responses better (Huang and Lin, 2006). Since online behavioral responses of customers can often be measured directly on a scale hitherto unknown, online advertising and retailing open up a treasure trove of information: we can now obtain direct, near real-time data on customers. This information should allow us to develop and test new marketing theories (Laczniak, 2015).

However, to use these emerging opportunities for research and marketing practice, we need sound analysis methods capable of dealing with novel data sources. For example, while potentially extremely useful, analyzing click stream data is challenging for several reasons. First, click stream datasets grow very large: databases containing responses (clicks) to online advertisements quickly exceed millions of records, often called rows. Since we often also have access to a large number of customer background characteristics (e.g., their search queries, their IP address, their geo-location, and their purchase behavior), the database often also contains a large number of columns. The sheer volume of data poses problems. But volume is not the only problem: click stream data are often continuously augmented. New data points arrive continuously, and often at high velocity. Hence, to truly utilize the data sources at our disposal, we need analysis methods that can deal with both high volume and high velocity data (Boyd and Crawford, 2012).

Recently, many technologies and statistical methods have been proposed to deal with Big Data or high velocity data. On the one hand, a large number of technical advances have been made: infrastructures such as Hadoop, Apache Spark, and the Map/Reduce framework allow us to store and process large volumes of data (Chu et al., 2007; Alpcan and Bauckhage, 2009). These software packages are, however, geared towards computer scientists, not marketing scholars. On the other hand, scholars in statistics and machine learning are actively working on novel estimation methods to deal with high volume and high velocity data. Examples include methods for fitting generalized linear models in data streams using gradient descent methods (Zinkevich et al., 2010), as well as computationally efficient bootstrap methods (Owen and Eckles, 2012).

This paper proposes another estimation method that is well suited for the analysis of click-stream responses to marketing messages: we present an online (or streaming) method for estimating Finite Mixtures of Logistic Regression models, which we coin oFMLR. Our method will appeal to marketing scholars for a number of reasons: first, it is well suited for analyzing the dichotomous response data that are often observed in online advertising and retailing. Furthermore, the model is capable of dealing with extremely high volume and high velocity data, thus making it a perfect candidate for analyzing click stream data. Finally, it builds on the well-known logistic regression model that is frequently used in marketing research (see, e.g., Yeu et al., 2013; Bonfrer and Drèze, 2009; Kireyev et al., 2015, for examples of the use of logistic regression). This has the advantage that marketing scholars can readily interpret the estimated model parameters, unlike the "black box" models often used in machine learning.

The proposed oFMLR is related to recent attempts to cluster customers in large datasets: for example, Su and Chen (2015) proposed a specific method of deriving customer interest segments from behavioral observations in online commerce. Our method is, however, more general in the sense that it does not rely on particularities of internet usage such as the frequency of customer visits: our method can be used to uncover segments in data streams, as long as the primary outcome measure is dichotomous. oFMLR is a specific implementation of the online expectation-maximization (EM) algorithm more generally described by Cappé and Moulines (2009), and allows clustering (or segmentation) of customers during a data stream: the model allows one to identify homogenous groups of customers without using ad-hoc analysis.

Mixture models, more specifically mixtures of logistic regression models, have been used frequently and effectively in the marketing literature for decades. Already two decades ago, Wedel et al. (1993) discussed latent class Poisson regression models and more general mixture models (Wedel and DeSarbo, 1995), in work that remains influential in marketing practice today. Applications of finite mixtures of regression models date even further back, with Kamakura and Russell (1989) applying mixture models for customer segmentation. Research into the use of mixture models for applied market segmentation is still thriving (Wedel and Kamakura, 2012), and, despite having been known for decades, their use is still increasing. Mixture models are an invaluable tool for analyzing marketing data and identifying consumer segments. Unfortunately, many estimation methods of mixture models are computationally demanding for extremely large or continuously augmented datasets. Our aim is to solve this problem for a specific class of models. While presenting our solution, we also emphasize the issues that arise when analyzing high velocity data streams more generally, and discuss several possible approaches.

The next section of this paper presents some of the conceptual building blocks of our proposed method. First, we introduce the notation that is used. Second, we discuss possible methods of dealing with high volume or high velocity data, focussing explicitly on online estimation as a preferred approach. Third, we discuss estimation of a simple logistic regression model in a data stream using stochastic gradient descent (SGD). Next, we detail oFMLR: we present an algorithm for estimating a finite mixture of logistic regression models in a data stream. After formalizing the model and discussing the Expectation-Maximization (EM) algorithm as a general method for estimating latent variable models, we present oFMLR and discuss how the estimated parameters can be interpreted. In order to explain the issue to a broader audience, we introduce the methods for dealing with high velocity data, SGD, and EM quite broadly. Next, we further discuss convergence, identifiability, and model selection, and we evaluate oFMLR using a simulation study and by analyzing an actual dataset. Finally, we show the practical use of oFMLR and discuss possibilities for future work.

2 Big Data Analysis in Online Marketing

This section presents several conceptual solutions for dealing with extremely high volume or high velocity data streams, and introduces an estimation method for logistic regression models that is well-suited for dealing with high volume and velocity data. (Note that the same method is more broadly applicable for maximum likelihood estimation; our example, however, focusses on logistic regression.)

Before discussing the possible methods of dealing with large datasets, we first introduce our notation. We assume that the data of interest consist of clicks on marketing messages (e.g., advertisements or product displays) that arrive sequentially over time. We further assume the dataset is continuously augmented. We will write $y_t \in \{0, 1\}$ for a click on a message at a specific point in time $t = 1, 2, \dots$ (where the horizon might not be known). We will assume that clicks on messages originate from different customers; by analogy, the subscript $t$ would, in the analysis of static datasets, often be replaced by $i = 1, \dots, n$, where $n$ is the total number of responses; we will use $i$ and $n$ explicitly when discussing static analysis methods. Next to our dependent variable of interest $y_t$, we often know a number of properties of the message displayed (e.g., the content of the advertisement), and we know a number of customer characteristics as well. We will denote this information using the feature vector $x_t$.

When considering the analysis of a static dataset we will sometimes refer to the vector $y = (y_1, \dots, y_n)^\top$ and the ($n \times p$) matrix $X$ as the data. Here, $p$ is the total number of features (often including a column of 1's for the intercept of a model). The full dataset $D$ is thus composed of rows $d_i = (y_i, x_i)$. Finally, we will use $\theta$ quite generally to denote the quantities of interest to a researcher (e.g., the parameters of some model), and $\theta_t$ for the estimated state of the parameters at time $t$.

2.1 Methods of Dealing with “Big Data”

Analyzing the click-stream data resulting from online marketing is frequently a Big Data problem for a variety of reasons: first, the number of observations $n$ is large. This might be cumbersome since analysis using standard statistical software may take too long. Second, the data are often continuously augmented; thus, $n$ grows continuously. This is cumbersome not just when $n$ is large, but also when the data arrive at high speed: it is not feasible to re-analyze the newly resulting dataset every time the old dataset is augmented. (The dimensionality $p$ of the feature vector $x$ might also pose problems in the analysis. Many feature selection and dimensionality reduction methods have been developed to deal with such problems (see, e.g., Yue et al., 2012; Gaber et al., 2005); these are outside the scope of this paper.) There are, however, several methods available to deal with large or quickly augmented datasets. Broadly, the following four approaches can be identified:

  • Subsampling: A simple method of dealing with a dataset that is too large is subsampling from the collected data. Here we randomly select $m$ rows from the dataset $D$, without replacement. Clearly, if $m$ is chosen such that the resulting dataset can be dealt with using conventional software, the problem is solved; it is, however, solved at the cost of losing the information from data points not included in the sample. Subsampling can also be done in data streams: in this case, one randomly ignores a proportion of the incoming datapoints.

  • Moving Window: A second approach for dealing with large datasets is to use only the last $m$ elements of the data for analysis. Thus, at each time point, the analyzed data are composed of rows $d_{t-m+1}, \dots, d_t$. If $m$ is chosen such that the analysis can be carried out faster than the rate at which the dataset is augmented, the problem is solved; once again, however, it is solved at the cost of ignoring available historical information.

  • Parallelize: A method that has become popular over the last decade is parallelization. Effectively, if one computing machine (often referred to as a core) cannot deal with all the data, we split the data over $c$ cores. Each core receives its own batch of rows, performs the analysis on its batch, and the results are combined. This primarily solves the problem of analysis becoming too lengthy: the computation time is effectively spent in parallel as opposed to in sequence, and is thus often greatly reduced. Note, however, that not all analysis methods can be easily parallelized: some methods that are relatively standard in advertising, such as mixture models (Cheong et al., 2011), are not easily expressed in separate, independent batches (for more on this topic see Poggio et al., 2011).

  • Online learning (or streaming): A fourth possible approach is called "online learning", which is a somewhat confusing name in situations where the data are also collected online (the latter referring to on the Web, the former referring to the estimation method). The basic idea behind online learning is to identify the parameters of interest up front, and subsequently to make use of an update function of the parameters each time a new datapoint enters: $\theta_t = f(\theta_{t-1}, d_t)$.

In this paper we will focus solely on online learning since a) it is based on all the datapoints (as opposed to subsampling or moving window approaches), and b) it is extremely well-suited for dealing with high velocity data streams: as long as the update of $\theta$ can be computed quickly, online learning can provide accurate estimates at each point in time. Note that we will sometimes use the shorthand notation $f(\theta, d)$ to denote the parameter update as a function of the old parameter and a single datapoint.

To illustrate the computational advantage of online estimation, consider the computation of a simple sum: we merely want to know the number of views of an advertisement. Hence, we maintain a counter $c$ denoting the number of views. Conventional analysis would compute this as $c_t = \sum_{i=1}^{t} 1$. Each time a new datapoint enters, this count would have to be computed again over all rows, which requires $O(t)$ computations per update. Using online learning one would instead specify $c_t = f(c_{t-1}, d_t) = c_{t-1} + 1$; a very simple update function, using just one computation for each new datapoint. This tremendous difference in the computational complexity of computing a count offline vs. online clearly highlights the appeal of online methods for fitting statistical models in high volume or high velocity data streams. Luckily, more than just sums can be computed exactly online: means, variances, covariances, linear models, and a number of other interesting quantities can be computed using an update function rather than by revisiting all historical data (Ippel et al., 2017). However, exact online estimation methods are not available for more complex models—such as the mixture model we discuss below. The next section discusses possible approximation methods to fit more complex models. Note that we focus on frequentist estimation; alternative Bayesian methods will be discussed in Section 3.7.
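To make this concrete, the following [R] sketch (with illustrative simulated data; the variable names are ours, not part of any package) maintains a count and a running mean in a single pass over a stream, without revisiting historical datapoints:

set.seed(1)
count <- 0
mean_x <- 0
for (x in rnorm(1000, mean = 5)) {
  # c_t = c_{t-1} + 1: one computation per arriving datapoint
  count <- count + 1
  # online mean: mean_t = mean_{t-1} + (x_t - mean_{t-1}) / t
  mean_x <- mean_x + (x - mean_x) / count
}
count   # 1000
mean_x  # close to 5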

2.2 Online Learning of Logistic Regression

This section discusses how a logistic regression model can be estimated using online learning. We aim to model the observations $y_t$, which are Binomial random variables whose proportion parameter $p_t$ depends on $x_t$. Hence, using a logit link:

$$p_t = \Pr(y_t = 1 \mid x_t) = \frac{1}{1 + e^{-x_t^\top \beta}} \qquad (1)$$

where $x_t^\top \beta$ is called the linear part of the model, and the parameter vector $\beta$ of the model is easy to interpret (for more details see Huang and Lin, 2006; Micheaux, 2011; Yaveroglu and Donthu, 2008).

In a conventional frequentist offline analysis, the parameters are generally estimated using maximum likelihood. The likelihood of the data as a function of the parameters is given by

$$L(\beta \mid y, X) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \qquad (2)$$

For computational convenience, one often takes the log of the likelihood (which simplifies the computation by effectively replacing the product by a sum), and subsequently finds the maximum of the log-likelihood as a function of $\beta$ by setting its derivative to 0. As usual, we will denote the resulting estimate $\hat\beta$. However, in the case of logistic regression, this procedure results in a system of nonlinear equations that is not easily solved; the solution cannot be derived algebraically and must be numerically approximated.

However, other methods are available, a particularly simple one being Gradient Descent (GD), or Gradient Ascent (GA) in the case of maximization: we compute the gradient—the vector of first order derivatives—of the log-likelihood, $\nabla \ell(\beta)$, and evaluate it at some initial value of the parameters $\beta^{(0)}$. Next, we apply the following iterative scheme:

$$\beta^{(k+1)} = \beta^{(k)} + \gamma \, \nabla \ell(\beta^{(k)}) \qquad (3)$$

Hence, we make a step at each iteration (the size of which is determined by $\gamma$) in the direction of the gradient.

This procedure is intuitive: consider finding the maximum of a parabola $f(x) = -(x - a)^2$, where $a$ is some constant (and also the sought-for value of $x$ that maximizes $f(x)$). If $f'(x_0) > 0$ (the derivative evaluated at $x_0$ being positive), the curve is going up, and thus the maximum must be to the right of $x_0$; one thus has to make a step towards higher values of $x$. Similarly, if $f'(x_0) < 0$, the curve is sloping down, and lower values of $x$ would approach the maximum. Gradient Ascent formalizes this intuition, and generalizes it to multidimensional cases. Given an appropriate step size $\gamma$, GA will converge to the maximum likelihood estimate (Poggio et al., 2011).

It is relatively simple to transform parameter estimation using GD into an online learning method called Stochastic Gradient Descent (SGD); we merely change from updating our parameter estimates using multiple iterations through the entire dataset to updating our estimates for each new data point $d_t$:

$$\beta_{t+1} = \beta_t + \gamma \, \nabla \ell(\beta_t; d_t) \qquad (4)$$

or, for the logistic regression model, equivalently,

$$\beta_{t+1} = \beta_t + \gamma \, (y_t - p_t) \, x_t \qquad (5)$$

For specific models and specific learn rates, one can show that SGD converges to the maximum likelihood estimate (Poggio et al., 2011). Logistic regression is especially well behaved in this case, but for an extensive discussion of SGD learning—and convergence—rates we refer the reader to Toulis et al. (2014). Note that SGD is a useful online estimation method for any model for which the derivative of the log-likelihood is easy to evaluate: hence, SGD provides a general method for fitting statistical models in data streams (Zinkevich et al., 2010). However, if the gradient of the log-likelihood is not easy to evaluate, then SGD might be cumbersome.
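As an illustration, Eq. 5 can be implemented in a few lines of [R]; the sketch below simulates a stream from a single logistic regression model and recovers its parameters using SGD. The parameter values are illustrative, and we use a small fixed learn rate (see Section 3.5 for a discussion of learn rate choices):

set.seed(42)
beta_true <- c(-1, 2)   # true intercept and slope
beta_hat  <- c(0, 0)    # starting values

for (t in 1:1e5) {
  x <- c(1, runif(1, -2, 2))                             # features incl. intercept
  y <- rbinom(1, 1, 1 / (1 + exp(-sum(x * beta_true))))  # simulated click
  # SGD update (Eq. 5): beta <- beta + gamma * (y - p_hat) * x
  p_hat <- 1 / (1 + exp(-sum(x * beta_hat)))
  beta_hat <- beta_hat + 0.01 * (y - p_hat) * x
}
beta_hat  # close to beta_true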

3 Mixed Logistic Regression Models in Data Streams

Logistic regression models are frequently used to analyze click-stream data. In the previous section, we have illustrated how this model can be estimated in a high volume, high velocity data stream using SGD. However, marketing click-stream data are often generated by diverse groups of customers. While the logistic regression model is well-suited to deal with such heterogeneity if these clusters (or segments) of customers are known, this is not always the case: though customers can often be meaningfully grouped into relatively homogenous clusters, the cluster memberships are not known a priori. Rather, the latent, unobserved, cluster memberships can only be derived ad hoc from the collected data. Such clustering is quite often undertaken in advertising and marketing research (e.g., Bakshy et al., 2012; McFarland et al., 2006; Su and Chen, 2015; Jedidi et al., 1997; Kireyev et al., 2015) and is meaningful for marketing theory: identifying clusters of consumers and interpreting model estimates can greatly improve our understanding of the effects of marketing messages.

3.1 A Finite Mixture of Logistic Regression Models

The finite mixture of logistic regression models offers a suitable clustering method for online advertising (Gortmaker et al., 1994; Wang and Puterman, 1998). This model extends the logistic regression model by assuming that the observed data originate from a number of unobservable clusters that are characterized by different regression models. Effectively, one assumes that there are $K$ homogenous clusters of customers, each with its own relationship between the feature vector $x$ and the observed clicks $y$. For example, one could imagine two groups of customers differing in their responses to product pricing: one group of customers ($k = 1$) might be seeking exclusivity, and thus be attracted by high prices, while another group ($k = 2$) is looking for bargains. This means that a positive slope of price would be expected for the first group, all else being equal, while a negative slope would be expected for the second group.

An intuitive way of thinking about mixture models is to simply extend the logistic regression likelihood presented in Equation 2. We can denote the likelihood of the finite mixture model by:

$$L(\theta \mid y, X) = \prod_{i=1}^{n} \sum_{k=1}^{K} \alpha_k \, p_{ik}^{y_i} (1 - p_{ik})^{1 - y_i} \qquad (6)$$

where the proportion parameter $p_{ik}$ of the observed $y_i$'s is modeled by a separate vector of coefficients $\beta_k$ within each unobserved cluster. The cluster membership probabilities $\alpha_k$ ($\sum_{k=1}^{K} \alpha_k = 1$) are considered "mixing probabilities" which relate to the sizes of the clusters.

One could interpret the model as originating from the following sampling scheme: first, a cluster $k$ is drawn from a multinomial distribution with probabilities $\alpha_1, \dots, \alpha_K$. Second, given the cluster, the observation is generated from the corresponding logistic regression model with parameters $\beta_k$. Note that we will often use $\theta$ to refer to both the mixture probabilities ($\alpha_1, \dots, \alpha_K$) as well as the regression coefficients within each cluster ($\beta_1, \dots, \beta_K$).

3.2 Estimating Mixture models: The EM algorithm

Estimating the finite mixture model is more involved than estimating simple regression models since the log-likelihood of the data is not merely a summation over datapoints, but rather a summation over datapoints and often unknown clusters (Wang and Puterman, 1998). This complicates estimation, as the gradient of the (log-)likelihood is not easily evaluated. However, if the cluster membership of each customer were known, then the estimation would be simple: if we know in advance to which cluster a customer belongs, we can simply estimate separate logistic regression models. This situation occurs frequently: for many latent variable or latent class models, estimation would be greatly simplified if the class memberships were known. The EM algorithm is frequently used to estimate parameters in such situations.

To give an idea of how the EM algorithm tackles the unobserved data problem, we consider a relatively simple example (inspired by the explanation in Do and Batzoglou, 2008). Consider observing a total of 50 coin tosses originating from the tossing of two coins, A and B, in batches of ten. Hence, the data could be fully described using two vectors, each with five elements, where the vector $h$ denotes the number of heads in each of the five batches (thus $0 \le h_j \le 10$), and $z$ denotes the coin (A or B) that was tossed in batch $j$. Our objective is to estimate the probability that each coin will land heads, which we will denote $\theta = (\theta_A, \theta_B)$. A complete observed dataset would pair each batch's head count with the coin that generated it, which immediately suggests intuitive estimates of $\theta_A$ and $\theta_B$.

Given the full dataset, maximum likelihood estimation is simple:

$$\hat\theta_A = \frac{\text{number of heads observed in batches tossed with coin A}}{\text{total number of tosses with coin A}} \qquad (7)$$

and similarly for $\hat\theta_B$. This is analogous to knowing the cluster memberships for our finite mixture model, and merely using a conventional ML approach within each cluster. However, what can we do if we only observe the vector $h$, and not $z$?

Formally, we are seeking a method to maximize the log-likelihood of the observed data, $\ell(\theta; h)$. The problem is easy when we observe all the data and we are looking for the complete data solution, $\ell(\theta; h, z)$, as demonstrated above. The EM algorithm allows us to solve the incomplete data case, where $z$ is not observed. The iterative algorithm starts with a "guess" of the parameters; for example, we can set initial values $\theta_A^{(0)}$ and $\theta_B^{(0)}$. Next, we iterate the following steps:

  1. Expectation (E) step: In the expectation step, a probability distribution over the missing data is "guessed" given the current model parameters. In the coin tossing example, for each batch we compute the probability of that batch originating from coin A or coin B. For example, for batch 1 (3 heads out of 10 tosses) and the initial guesses specified above, the data are more probable under coin A than under coin B whenever $\theta_A^{(0)}$ is closer to $3/10$ than $\theta_B^{(0)}$ is. (These probabilities are computed by evaluating the binomial density under both $\theta_A^{(0)}$ and $\theta_B^{(0)}$ and normalizing. This results in two vectors containing, for each batch, the probability that coin A or coin B generated its data.)

  2. Maximization (M) step: In the maximization step, the model parameters are updated using the guessed probability distribution over the missing data. Here, we compute the maximum likelihood estimate as if we had observed $z$, but we weight the data according to the probability that each batch originates from one of the coins: we are not certain which coin a batch belongs to (A or B, as denoted in the vector $z$), so we assign it probabilistically. This gives us new estimates $\theta_A^{(1)}$ and $\theta_B^{(1)}$, which are used in the next expectation step. (The updated $\theta_A$ is computed by multiplying each batch's head count by the probability that coin A generated that batch, to obtain a "weighted" number of successes. The same is done for the failures, and based on these weighted observations the guesses of the parameters are updated.)

The numerical values of the estimates change less and less in subsequent iterations; in this example, the algorithm seems to have converged in as few as five iterations.
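The following [R] sketch implements this two-coin EM scheme; the head counts and starting values below are illustrative stand-ins, not the exact numbers used in the original example:

h <- c(3, 8, 9, 2, 7)            # illustrative heads per batch
n <- 10                          # tosses per batch
theta_A <- 0.3; theta_B <- 0.6   # initial guesses

for (iter in 1:20) {
  # E step: probability that each batch was generated by coin A (vs. B)
  lik_A <- dbinom(h, n, theta_A)
  lik_B <- dbinom(h, n, theta_B)
  w_A <- lik_A / (lik_A + lik_B)
  # M step: weighted ML estimates of the head probabilities
  theta_A <- sum(w_A * h) / sum(w_A * n)
  theta_B <- sum((1 - w_A) * h) / sum((1 - w_A) * n)
}
c(theta_A, theta_B)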

The EM algorithm is very flexible, and can be used for a large number of problems that can be formulated as a missing data problem. We refer the interested reader to Do and Batzoglou (2008) for a general introduction, and to Moon (1996) or Dempster et al. (1977) for a thorough mathematical overview. The algorithm is well-known, and the EM-solution for the offline estimation of a mixture of logistic regression models is also well-known (Wang and Puterman, 1998). The EM algorithm for a mixture of logistic regression models can be understood as follows:

  1. Expectation (E) step: Given initial values of $\alpha$ (the mixture probabilities) and $\beta_1, \dots, \beta_K$ (the regression coefficients in each cluster), we compute the expected cluster membership probability $z_{ik}$ of each observation (thus, we compute a "probabilistic assignment" to each cluster $k$ for each observed unit $i$).

  2. Maximization (M) step: Given the conditional cluster membership assignments as computed in the E step, we fit a weighted logistic regression (using conventional ML estimation) for each cluster to obtain an update of $\beta_k$. The update of $\alpha_k$ is obtained by averaging over the expected membership probabilities: $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} z_{ik}$.

This iterative scheme, when combined with SGD, is surprisingly easy to implement online, as will be demonstrated in the next section.

3.3 Online Finite Mixture of Logistic Regressions

Here we present our method for fitting the oFMLR model in full. The algorithm starts with a choice of $K$ (fixed—see Section 3.6 for a discussion), and starting values for $\alpha$ and each $\beta_k$ (which we jointly refer to as $\theta$). Next, for each arriving datapoint we compute the following:

  1. Expectation (E) step: In the expectation step the conditional probability $z_{tk}$ of the datapoint belonging to each cluster is computed given the current state of $\alpha$ and $\beta_1, \dots, \beta_K$. Thus, we compute

    $$z_{tk} = \frac{\alpha_k \, p_{tk}^{y_t} (1 - p_{tk})^{1 - y_t}}{\sum_{k'=1}^{K} \alpha_{k'} \, p_{tk'}^{y_t} (1 - p_{tk'})^{1 - y_t}} \qquad (8)$$

    where $p_{tk} = (1 + e^{-x_t^\top \beta_k})^{-1}$ for each $k$. Note that this computation involves only the current datapoint and the current parameter estimates, and is thus an online update.

  2. Maximization (M) step: In the M step we update our estimates of $\alpha$ and $\beta_1, \dots, \beta_K$ (note that $\beta_k$ below denotes the full coefficient vector of mixture component $k$, not an individual element of $\beta$):

    • We first update each $\beta_k$ online using SGD weighted by $z_{tk}$:

      $$\beta_k' = \beta_k + \gamma \, z_{tk} \, \nabla \ell(\beta_k; d_t) \qquad (9)$$
      $$\phantom{\beta_k'} = \beta_k + \gamma \, z_{tk} \, (y_t - p_{tk}) \, x_t \qquad (10)$$

      where again $\gamma$ denotes the learn rate and $p_{tk} = (1 + e^{-x_t^\top \beta_k})^{-1}$.

    • We then update $\alpha$ by computing an online average of $z_{tk}$:

      $$\alpha_k' = \alpha_k + \frac{1}{t} \left( z_{tk} - \alpha_k \right) \qquad (11)$$

We thus maintain in memory a counter $t$, a $K$-dimensional vector $\alpha$ denoting the mixing probabilities, and $K$ $p$-dimensional vectors $\beta_k$, each of which is updated fully online; a stripped-down sketch of a single update is given below. An efficient [R] S4 package to fit the above model can be downloaded and installed from https://github.com/MKaptein/ofmlr (the package is discussed in more detail in Section 4.2).
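For exposition, the logic of a single oFMLR update can be written in a few lines of [R]. This is a sketch of Eqs. 8-11 under our own naming conventions, not the internals of the ofmlr package:

ofmlr_update <- function(state, y, x) {
  state$t <- state$t + 1
  gamma_t <- 1 / state$t                                # decreasing learn rate
  p_k <- 1 / (1 + exp(-as.vector(state$beta %*% x)))    # p_tk for each cluster k

  # E step (Eq. 8): conditional membership probabilities z_tk
  lik_k <- state$alpha * p_k^y * (1 - p_k)^(1 - y)
  z <- lik_k / sum(lik_k)

  # M step (Eqs. 9-10): weighted SGD update of each beta_k (rows of beta)
  state$beta <- state$beta + gamma_t * outer(z * (y - p_k), x)
  # M step (Eq. 11): online average for the mixing probabilities
  state$alpha <- state$alpha + (z - state$alpha) / state$t
  state
}

# Usage: K = 2 clusters, p = 2 features (incl. intercept)
state <- list(t = 0, alpha = c(0.5, 0.5), beta = matrix(rnorm(4), nrow = 2))
state <- ofmlr_update(state, y = 1, x = c(1, 0.4))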

3.4 Identifiability of Model Parameters

Mixture models are appealing for their ease of interpretation and intuitive representation as a latent variable model. However, mixture models also present challenges; we have already mentioned the difficulty of estimating the parameters of the mixture model. This section discusses another challenge: identifiability. A given mixture model is said to be identifiable if it is uniquely characterized, so that no two distinct sets of parameters defining the mixture yield the same distribution. Identifiability of distinct mixture models has been actively studied: Teicher (1963) and Teicher (1967) provided conditions under which mixtures of binomial distributions are identifiable. Follmann and Lambert (1991) extended this work to mixtures of logistic regression models with random intercepts. Butler et al. (1997) considered a latent linear model for binary data with any class of mixing distribution and provided sufficient conditions for identifiability of the fixed effects and mixing distribution. The main result of this work for our current discussion is easily summarized: not all mixture models are identifiable. This implies that the parameters cannot be uniquely estimated for all mixture models. It is not very hard to see why this is the case: suppose we consider a simple mixture of two Bernoulli random variables, $p(y) = \alpha \, \theta_1^y (1 - \theta_1)^{1-y} + (1 - \alpha) \, \theta_2^y (1 - \theta_2)^{1-y}$, where $\alpha$ specifies the mixture probability and $\theta_1$ and $\theta_2$ quantify the probability of success for each of the two Bernoulli processes. Clearly, the parameters $\alpha$, $\theta_1$, and $\theta_2$ are unidentifiable: informally, we could say that multiple combinations of parameters would each maximize the likelihood of a given dataset. To illustrate, consider five Bernoulli observations: any choice of parameters for which $\alpha \theta_1 + (1 - \alpha) \theta_2$ equals the observed proportion of successes maximizes the likelihood, and thus there is an infinite number of parameter sets that yield the same distribution.

The example above immediately illustrates that the mixture of logistic regressions is not always identifiable: if we consider a finite mixture of two logistic regression models with Bernoulli observations where each model contains only an intercept, the situation is exactly the same as in our example. However, a number of mixtures of logistic regression models have been shown to be identifiable, and sufficient conditions for identification have been studied (Follmann and Lambert, 1991). Identifiability can be obtained in two ways for the finite mixture of logistic regression model. The first is by considering Binomial data generation instead of Bernoulli: if we manage to meaningfully group our observations, this can ensure identification. If we happened to know that the five observations above originated from two Binomial processes, with a known grouping of the observations into processes, then we could obtain the maximum likelihood estimates of $\alpha$, $\theta_1$, and $\theta_2$. More generally, having sufficiently many repeated observations for each Binomial process suffices to identify the mixture components. The presence of such batched or grouped data also ensured identifiability in our coin tossing example used to illustrate the general EM algorithm.

Unfortunately, this approach is of little use in the streaming data scenario we have been focussing on; it would entail storing the groupings in a way which would complicate the fitting algorithm. Luckily, however, having repeated observations is not the only sufficient condition for a mixture model to be identifiable (and note that it is not a necessary condition). A mixture of logistic regression models can also be identified, informally, when the pattern within each individual logistic regression is clear enough. If enough unique values are available for the independent variables, this can serve as a sufficient criterion for identifiability. Follmann and Lambert (1991) showed that for a single independent variable having $q$ unique values, a mixture with a bounded number of components can be identified, at least if each of the logistic regressions is itself also identified (in the same way that a standard non-mixture logistic regression cannot be fit without additional constraints if the number of parameters is larger than the number of observations). Hence, when treating the observations as Bernoulli, sufficient criteria for identifying mixtures of logistic regression models can also be obtained in the data stream. However, we should note that the number of components guaranteed identifiable by their theorem grows only slowly with the number of unique values of the independent variable. Since we are considering large, continuous data streams here, we consider the situation of many unique values of an independent variable quite likely, but we must still caution against over-enthusiasm when choosing a high number of mixture components. A formal discussion of identifiability for mixture models in general can be found in Titterington et al. (1985) or Frühwirth-Schnatter (2006), while Follmann and Lambert (1991) specifically discuss the mixture of logistic models.

3.5 Convergence

Even when a model is identified, one may wonder whether the procedure used to estimate the model parameters (whether iterative or sequential) converges: i.e., whether the parameter values can theoretically be guaranteed to end up in a position where the likelihood function is maximized (or at least whether the procedure finds a local maximum). This question is challenging for the suggested oFMLR procedure since it combines stochastic gradient descent with an online version of the EM algorithm, each of which might or might not converge. However, convergence of both the EM algorithm (and online versions thereof) and SGD has been widely studied.

For SGD, the study of convergence, assuming a stationary (i.e., not changing over time) process, revolves around finding a learn rate $\gamma_t$ that is both large enough for the parameter estimates to change sufficiently in the face of new evidence, and small enough for the parameters to stabilize. This intuition can be formalized using the following criteria: $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$. These objectives can be attained by using a gradually decreasing learn rate; learn rates such as $\gamma_t = 1/t$ have frequently been used (Kubrusly and Gravier, 1973; Cappé and Moulines, 2009). However, such a decreasing learn rate is not the only feasible choice: constant (but small) learn rates effectively give more weight to recent observations than to older ones, and thus fixed learn rates can be considered a sort of smooth moving window. Especially when the data generating process is expected to change over time, such a fixed learn rate might be preferred over learn rates that are guaranteed to converge but assume stationarity. A small fixed learn rate thus combines online estimation with a smooth moving window approach.

For oFMLR, if the individual SGDs converge, we must still ask whether the latent variable scheme used in the EM algorithm converges. This has also been studied for a large number of cases, specifically by Cappé and Moulines (2009), who considered the convergence of computationally feasible EM approximations similar to the online version suggested here. We omit the technical details but caution the reader to assess convergence by at least studying the traces of the fitted model parameters; an approach for oFMLR will be illustrated below.

In practice, offline model convergence is often determined by looking at the changes of the model parameters over different iterations. The notion of multiple iterations is absent in the online or streaming context, which means that traditional convergence statistics cannot be used. However, we propose a similar approach, looking at the following statistic (included by default in our [R] package) when assessing convergence of oFMLR in a data stream:

$$\bar{\Delta}_t = \frac{w - 1}{w} \, \bar{\Delta}_{t-1} + \frac{1}{w} \left( \lVert \theta_t \rVert - \lVert \theta_{t-1} \rVert \right) \qquad (12)$$

Here we use $\lVert \theta_t \rVert$ to denote the $\ell_2$ norm (or Euclidian distance / length, $\sqrt{\theta^\top \theta}$) of the parameter vector at a specific datapoint $t$, and $\bar{\Delta}_t$ to denote the moving average of the difference of this length compared to the length of the parameter vector at the previous datapoint $t - 1$. The term $w$ in the denominator ensures that we are looking not just at the average change in norm during the full data stream, but rather at a moving window; once the number of observations exceeds the window length $w$, older datapoints are smoothly discarded. This is similar to the effect of choosing a small but fixed learn rate, as in the SGD example discussed earlier. We will examine the performance of this convergence metric below.
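A minimal [R] version of this statistic, written as a closure that is called once per parameter update, might look as follows (the exact smoothing used in the package may differ slightly):

norm_change_tracker <- function(w) {
  prev_norm <- NA; delta_bar <- 0; t <- 0
  function(theta) {
    t <<- t + 1
    cur_norm <- sqrt(sum(theta^2))   # l2 norm of the parameter vector
    if (!is.na(prev_norm)) {
      # moving average of the change in norm over a smooth window of length w
      delta_bar <<- delta_bar + ((cur_norm - prev_norm) - delta_bar) / min(t, w)
    }
    prev_norm <<- cur_norm
    delta_bar
  }
}
# Usage: track <- norm_change_tracker(1000); call track(theta_t) after each update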

3.6 Interpretation of the Model Parameters and Choosing $K$

One of the advantages of oFMLR is that it provides a meaningful interpretation of the parameters within each cluster, which is useful for policy making. Hence, contrary to more black box approaches to clustering (cf. Chipman et al., 2007, 2010), the current approach allows researchers not only to segment customers, but also to interpret these segments. The first meaningful interpretation can be derived from the estimates of the vector $\alpha$. Each element $\alpha_k$ represents the estimated proportion of customers belonging to a specific segment. Inspecting the elements of $\alpha$ can give one direct information regarding customer segments and their sizes. In addition, estimating these segments in a data stream provides researchers with the opportunity to inspect the dynamics of customer segments: if the values of the elements of $\alpha$ change over time, then the behavior of different segments is apparently not stable.

Researchers can also directly interpret the estimated regression coefficients within each cluster. These estimates are standard logistic regression estimates, and can thus be interpreted as such. Within a specific cluster, the vector of estimated coefficients, $\hat\beta_k = (\hat\beta_{k1}, \dots, \hat\beta_{kp})$, where $p$ is the number of predictors in the model, is available during the data stream. These coefficients can be interpreted in terms of familiar log-odds, and one can directly inspect which variables play the largest role in the behavior within a segment.

In this discussion, we have not considered the choice of $K$: it was treated as given. However, in reality, the number of clusters is unknown before the start of the analysis. The choice of $K$ might be strictly guided by theory, which allows the analyst to choose $K$ in advance. However, in conventional, offline, analyses $K$ is often determined ad hoc by fitting models with different choices of $K$ and inspecting their model fit and estimated parameters. The latter is quite easily incorporated when using oFMLR: one can simply estimate, in parallel, models with different choices of $K$ (see the simulation results below).

Formally choosing the "best" number of components is more challenging: although we can easily fit multiple models with different choices of $K$ in parallel, formal tests to determine the value of $K$ in the face of streaming data are still actively being developed (Zhou et al., 2005). The traditional, offline, methods for selecting the number of components mostly rely on comparing either the likelihood or the AIC or BIC values of the competing models (Titterington et al., 1985; Frühwirth-Schnatter, 2006). The likelihood (or log-likelihood) of the mixture of logistic regressions is easily computed for a static dataset (see also Eq. 6) and a set of parameter estimates. Subsequently computing the AIC or BIC to enable model comparisons is straightforward:

$$\mathrm{AIC} = 2q - 2 \ln \hat{L} \qquad (13)$$
$$\mathrm{BIC} = q \ln(n) - 2 \ln \hat{L} \qquad (14)$$

where $n$ is the total number of observations, $q$ is the number of parameters in the model, and, for conciseness, $\hat{L}$ is chosen to denote the maximized value of the likelihood function.

The corresponding quantities are, however, not easily defined in a continuous data stream. Computing the full likelihood is not feasible if the dataset is large, since this requires re-evaluating the log-likelihood of the full dataset for the current state of the parameters $\theta_t$. Going back through all the data retrospectively would defeat the purpose of fitting the model in the stream. For oFMLR we therefore chose a slightly different approach, one that is—to the best of our knowledge—novel. We compute the streaming average log-likelihood of each datapoint, given the state of the parameters at that moment, for a smooth window of pre-specified size $w$:

$$\bar{\ell}_t = \frac{w - 1}{w} \, \bar{\ell}_{t-1} + \frac{1}{w} \, \ell(\theta_t; d_t) \qquad (15)$$

where $\ell(\theta_t; d_t)$ denotes the log-likelihood of the datapoint arriving at time $t$. Thus, $\bar{\ell}_t$ quantifies the average log-likelihood of each of the streaming datapoints. Note that this notation hides the fact that the value of $\theta_t$ used to compute $\ell$ changes at each timepoint; $\bar{\ell}_t$ thus provides a stochastic approximation. The streaming average log-likelihood provides a quantification of the fit of the model for the latest points in the data stream; we discuss the behavior of this metric below. From this metric, we can derive streaming approximations to the AIC and BIC, respectively:

$$\mathrm{sAIC} = 2q - 2 w \bar{\ell}_t \qquad (16)$$
$$\mathrm{sBIC} = q \ln(w) - 2 w \bar{\ell}_t \qquad (17)$$

which can be used to compare models with different numbers of parameters fitted in parallel during the data stream.
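In [R], these quantities are cheap to maintain alongside the parameter updates. The sketch below assumes the scaling of Eqs. 16-17 as reconstructed above, which may differ in detail from the package's internal definitions:

# Streaming average log-likelihood over a smooth window of size w (Eq. 15)
update_ll_bar <- function(ll_bar, ll_t, w) ll_bar + (ll_t - ll_bar) / w

# Streaming information criteria (Eqs. 16-17); q is the number of parameters
sAIC <- function(ll_bar, q, w) 2 * q - 2 * w * ll_bar
sBIC <- function(ll_bar, q, w) q * log(w) - 2 * w * ll_bar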

3.7 Alternative (Bayesian) Approaches

In our presentation of oFMLR we have focussed on a frequentist approach. However, ever since the introduction of finite mixture modeling, authors have considered Bayesian approaches for estimating the parameters (see Titterington et al., 1985, for an introduction). As is often the case, the Bayesian approach for parameter estimation is conceptually simple:

$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)} \qquad (18)$$
$$\propto p(D \mid \theta) \, p(\theta) \qquad (19)$$

which indicates that we can obtain the joint posterior distribution of $\theta$ by simply specifying a prior distribution $p(\theta)$, after which, if we are interested, we can obtain point estimates of $\theta$ by summarizing the posterior in some way, often taking its expectation or its maximum value (Gelman, 2008). This simple formula seems to indicate that we only need to specify a prior distribution to compute $p(\theta \mid D)$, since we have already worked out the likelihood. Furthermore, one should note that the Bayesian framework potentially also offers tremendous opportunities for dealing with data streams (Opper and Winther, 1998) since the following holds:

$$p(\theta \mid d_1, \dots, d_t) \propto p(d_t \mid \theta) \, p(\theta \mid d_1, \dots, d_{t-1}) \qquad (20)$$

indicating that we can use the posterior at time $t - 1$ as the prior for time $t$ and thus naturally update our beliefs as the data come streaming in. Using Eq. 20, Bayesian parameter estimation is naturally carried out online!
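For models with a conjugate prior, this sequential updating is trivial. As a toy illustration of Eq. 20 (for a single Bernoulli click probability, not for the mixture model itself), consider the following [R] snippet:

# Online Bayesian updating with a Beta prior and Bernoulli observations:
# the posterior after each datapoint serves as the prior for the next.
a <- 1; b <- 1                 # Beta(1, 1) prior
set.seed(7)
for (y in rbinom(10000, 1, 0.1)) {
  a <- a + y                   # clicks
  b <- b + 1 - y               # no clicks
}
a / (a + b)                    # posterior mean; close to 0.1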

Nevertheless, the approach is not as straightforward as it would appear. It has long been known that a simple analytic treatment of even Eq. 19 is not possible when $p(D \mid \theta)$ derives from a mixture model (see Titterington et al., 1985, and references therein). It follows that implementing a Bayesian estimation procedure for finite mixture models is not straightforward when analyzing static data, let alone in the streaming data case we are considering.

In recent years, successful Bayesian approaches have relied on sampling methods and/or other types of approximations. Markov Chain Monte Carlo (MCMC) sampling is popular (see, e.g., Ryu et al., 2011), and effective MCMC samplers for finite mixture models have been presented (Dippold and Hruschka, 2013). For the standard (offline) fitting of a Bayesian mixture model, software packages such as JAGS or Stan can be used relatively easily (Kruschke, 2014). Approximations using variational methods have also gained popularity, and offline versions of this algorithm can easily be found (Attias, 1999).

However, only recently have authors started to consider online (or streaming) versions of these estimation methods: recent work on sequential MCMC (Kantas et al., 2009; Scott et al., 2016) and stochastic variational Bayes (VB) methods (Hoffman et al., 2013; Tank et al., 2015) suggests computationally attractive approximations for fitting complex Bayesian models in data streams. To the best of our knowledge, there are however no ready-to-use implementations of either sequential MCMC or stochastic VB approaches for the finite mixture of logistic regression models presented here; developing such implementations would be of great benefit.

4 Evaluations of oFMLR

This section examines how oFMLR performs, using a simulation study and an empirical example. In each case we describe the data that are used and the resulting estimates, and we reflect on convergence and model selection. All the models presented here have been fit using the [R] oFMLR package that accompanies this article. The package allows easy fitting of mixtures of logistic regression models in data streams, and allows one to compare multiple models in parallel that are fit to the same data stream. ([R] is presumably not the programming language of choice for dealing with large data streams in practice. However, several tools that do handle large data streams, such as Apache Spark (for an example see Salloum et al., 2016), directly interface with [R].)

Before presenting the simulation study and our empirical example, we first describe the oFMLR package in a bit more detail and present some code examples to enable readers to directly use the package themselves. The following [R] code downloads and installs the package:

> library(devtools)
> install_github("MKaptein/ofmlr")
> library(ofmlr)

Here, the first line is used to load the devtools package which allows—among many other utilities—[R] packages to be easily installed from github. The second line downloads and installs the oFMLR package (this line only needs to be run once) and the final line is used to load the package. After running these three lines of code, the package is ready for use and, as is standard in [R], one can start by browsing the documentation using ?ofmlr.

Next, we present an example for fitting a simple model. The following code allows one to generate a mixture dataset and subsequently fit the oFMLR model to a simulated data-stream (note that the lines following the # sign are comments):

set.seed(123456)
# Use utility functions in package to generate data
n <- 10^5
data <- generate_mixture(n, 2, 2, beta=matrix(c(3,-2.5,-2, 5), nrow=2), ak=c(.3,.7))
# Create a new oFMLR object
M <- online_log_mixture(2,2, trace=1000, ll.window=1000)
# "Stream" the data using a for loop
for(i in 1:n){
        M <- add_observation(M, data$data[i,1], data$data[i, -1], lambda=0)
}

The first line of the code sets the seed for reproducibility of the results. Next, we use the generate_mixture utility function included in the package to create a dataset consisting of $n = 10^5$ observations, with the number of mixture components $K = 2$, the number of parameters per logistic regression model $p = 2$, and with the model parameters $\beta$ and $\alpha = (.3, .7)$ as specified in the call. The next line is used to instantiate a new oFMLR object with similar properties (i.e., $K = 2$ and $p = 2$) and random starting values for $\alpha$ and the $\beta_k$'s. The trace=1000 argument in the call to instantiate the new oFMLR object ensures that a snapshot of the parameters ($\theta$) is stored every 1,000 datapoints, together with the fit and convergence statistics discussed above, and the ll.window=1000 argument sets the window size $w$ used for the streaming fit statistics. Finally, the lambda=0 argument sets the learn rate to the default decreasing learn rate.

After instantiating the model, the for loop simulates a sequential run through the dataset. At each row of the dataset that is visited, the parameter values of the oFMLR model are updated. After this process has finished, a call to summary(M) produces the following output:

Online fit of logistic mixture model (oFMLR)
Number of mixture components:  2
Number of predictors (including intercept):  2
Estimated cluster membership probabilities:
[1] 0.719 0.281
Estimated coefficients:
      [,1] [,2]
[1,] -1.16  3.4
[2,]  1.49 -1.3
Total number of observations in data stream:  1e+05
The current (streaming average) log likelihood is  -0.593 (-0.076), sAIC= 1197.197, sBIC= 1226.644 .
NOTE: The l2 norm of the parameters has changed by < 0.001 in the last update.

This shows that the parameter estimates are reasonable—although the values for $\beta$ still seem influenced by the starting values. The values of $\alpha$, however, are very close to the true values. We investigate the precision of the parameter estimates more thoroughly in the simulation study presented below. The "NOTE" at the end highlights that the parameter values have not changed much over the last datapoints. It is difficult to interpret the log-likelihood and the sAIC and sBIC in isolation; below we illustrate how these can be used for model comparisons. Finally, the value between brackets after the streaming average log-likelihood denotes the maximum average log-likelihood of the datapoints when using fixed cluster assignments based on the posterior probabilities $z_{tk}$; this can be used for additional diagnostic purposes. Since we have traced the progress of the model fitting in the data stream, a call to plot(M, params=TRUE) produces Figure 1, which gives an overview of the learn rate and convergence diagnostics.

Figure 1: Trace of the parameter estimates and the fit and convergence statistics (streaming average log-likelihood, sAIC, sBIC, and change in parameter norm) during the data stream.

In addition to fitting an individual model, the oFMLR package also allows comparisons of multiple models fit to the same datastream. The following [R] code instantiates multiple models and adds them to a model comparison object (which we will call a multi_online_log_mixture) which is subsequently updated using a data stream:

compare_models <- multi_online_log_mixture(online_log_mixture(2,1))
compare_models <- add_model(compare_models, online_log_mixture(2,2))
compare_models <- add_model(compare_models, online_log_mixture(2,3))
for(i in 1:n){
        compare_models <- add_observation(compare_models, data$data[i,1], data$data[i, -1], 0)
}

After running the above code, a call to summary(compare_models) produces Table 1. Here the use of the sAIC and sBIC is clear: the model with two components, which matches the true data generating model, is preferred. (In practice we recommend running multiple models in parallel, each with different starting values, to assess model fit.) Having introduced the oFMLR package, we will investigate the parameter convergence and its practical applications in the next section.

M k p ll maxll sAIC sBIC dNorm n
1 M1 1.0 2 -0.618 -0.618 12359 12381 0.0034 100000
2 M2 2.0 2 -0.592 -0.182 11853 11896 0.0001 100000
3 M3 3.0 2 -0.592 -0.126 11863 11928 0.0001 100000
Table 1: Table of model comparisons created by calling summary() on a multi_online_log_mixture object.

4.1 Simulation Study: Parameter estimates

This section examines the convergence of oFMLR over a large series of models and for different sizes of data stream. We simulate the previously mentioned scenario in which the observed clicks on an advertisement originate from two homogeneous clusters of customers: one price-sensitive, the other seeking exclusivity. Once again we fix the true model parameters $\alpha$, $\beta_1$, and $\beta_2$, and simulate observations from this model. We then draw the values of the independent variable (the $x$'s) uniformly at random and subsequently simulate an observation from the model. Thus, to simulate an observation we first draw a cluster membership with probabilities $\alpha$, and then generate $y$ from the corresponding data generating model. We repeat this process over a large number of independent simulation runs.

Next, we fit both an offline version of the mixture model using standard EM (as implemented in the mixtools package in [R] (Benaglia et al., 2009)) and an online version using the oFMLR package. For both the online and the offline version, we need to specify starting values for both $\alpha$ and $\beta$. We choose $1/K$ as the value for each element of $\alpha$, and a random draw for each of the elements of $\beta$. Furthermore, for the offline version, we choose a fixed number of iterations to allow the EM algorithm to converge. (A larger number of iterations might improve the parameter estimates in some runs, while a smaller number obviously decreases running time; we feel that a fixed—albeit relatively large—number of iterations provides a more valid comparison.) Finally, for the online version, we choose a fixed learn rate $\gamma$.

Table 2 gives an overview of the obtained parameter estimates for different lengths of the data stream; the mean and the standard error (SE) as computed over the simulation runs are presented. (Note that the order in which the clusters are identified is not the same in every simulation run. For interpretation purposes, we order both the elements of $\alpha$ and the components of $\beta$.) A number of things are quite striking: first, both methods largely fail on the smaller data streams (or at least have very high standard errors on the parameters). Both methods work relatively well for medium-sized streams, where the resulting estimates are close to the true data generating model, and both perform well for large streams, where the signs of all the coefficients are properly estimated. The current simulations further show that for a number of clusters that is too high compared to the data generating model, oFMLR finds two clusters with similar estimates for $\beta$. As mentioned earlier, this can be used to select the number of clusters when analyzing a data stream. This does not seem to work for the offline method, which is unable to converge—using the standard convergence criteria in the mixtools package—when the number of clusters is too high. Finally, for an even longer online simulation, the estimated coefficients using oFMLR are very close to the true data generating values, with small SEs (this run is not presented in the table since the offline version takes too much time). At small sample sizes it is clear that the starting values for the SGD still play a large role in the parameter estimation; however, for large data streams our oFMLR seems both fast and accurate.

          Short stream       Medium stream      Long stream
Offline   0.04 (0.01)        0.03 (0)           0.04 (0)
          0.27 (0)           0.27 (0)           0.27 (0)
          0.44 (0)           0.46 (0)           0.46 (0)
          0.56 (0)           0.54 (0)           0.54 (0)
          -0.59 (0.22)       0.08 (0.06)        0.21 (0)
          3.25 (0.74)        0.3 (0.19)         -0.16 (0)
          -16.17 (6.33)      -1.44 (0.06)       -1.5 (0.03)
          52.16 (20.41)      6.07 (0.2)         6.41 (0.07)
          0.23 (0)           0.22 (0)           0.23 (0)
          0.28 (0)           0.28 (0)           0.26 (0)
          0.49 (0.01)        0.5 (0)            0.52 (0)
          0.27 (0.07)        0.31 (0.04)        0.07 (0.03)
          0.44 (0.11)        0.37 (0.07)        0.03 (0.03)
          0.27 (0.07)        0.29 (0.02)        0.34 (0.04)
          0.2 (0.39)         -0.34 (0.01)       -0.28 (0.04)
          -16.8 (6.07)       -2.19 (0.08)       -1.69 (0.06)
          52.02 (16.34)      7.97 (0.17)        7.14 (0.13)
Online    0.05 (0.03)        0.01 (0.02)        0.02 (0.03)
          0.31 (0.02)        0.33 (0.02)        0.32 (0.03)
          0.44 (0)           0.4 (0)            0.35 (0)
          0.56 (0)           0.6 (0)            0.65 (0)
          0.12 (0.02)        0.42 (0.02)        0.99 (0.04)
          -0.27 (0.02)       -0.55 (0.02)       -0.97 (0.03)
          0.02 (0.02)        -0.76 (0.02)       -1.71 (0.03)
          1.7 (0.02)         3.77 (0.02)        5.6 (0.02)
          0.31 (0)           0.3 (0)            0.3 (0)
          0.33 (0)           0.33 (0)           0.33 (0)
          0.36 (0)           0.37 (0)           0.37 (0)
          0.07 (0.02)        0.63 (0.03)        2.28 (0.11)
          -0.44 (0.04)       -0.83 (0.09)       -1.8 (0.17)
          0.03 (0.01)        -0.37 (0.04)       -1.46 (0.11)
          0.98 (0.04)        2.53 (0.11)        4.28 (0.18)
          0.01 (0.01)        -0.63 (0.02)       -2.01 (0.02)
          1.29 (0.03)        3.11 (0.02)        5.15 (0.02)
Table 2: Results of Simulation Study 1. Each cell reports the mean (SE) of a parameter estimate over the simulation runs; columns correspond to increasingly long data streams, and the rows list, per estimation method, the estimated mixture weights and regression coefficients. Data were generated from the two-cluster model described in the text.

4.2 Empirical Evaluation: Practical utility

To demonstrate the utility of our proposed method for marketing applications, we use the oFMLR [R] package to carry out a post-hoc analysis of an existing large dataset of online consumer browsing behavior. The dataset describes the web-browsing behavior of prospective customers of a large insurance company and was collected between March and June 2016. The data, supplied by a firm that wishes to remain unidentified, contain records of 105,502 unique customers, identifying the number of pages they visited during their first visit to the insurance company's website, the total time spent browsing in that session, and whether or not the customer ended up purchasing a product. These behavioral measures were later augmented by the company with an orientation score: based on the content of the pages visited, the insurance company enriches its database by dividing customers into two categories, those who are merely looking for information regarding the company and different insurance types, and those who seem interested in purchasing one of its products, as determined by the fact that they visit one or more of the actual product information pages. Table 3 presents some descriptives to further illustrate the dataset.

Variable Description Min Max Mean Median
Purchase Did the customer buy? 0 1 0.037 0
Time Total session time 1 1190 283 180
Pages Number of pages visited in session 1 19 4.4 4
Orient “Looker” or “purchaser” 0 1 0.53 1
Table 3: Descriptives for the consumer browsing behavior dataset. Note the fairly low purchase (or conversion) rate, as is common in many online marketing applications. The total number of observations was 105,502, and the data were collected between March and June 2016.

Interestingly, a separate analysis, using simple logistic regression models with "time" and "pages" as predictors, shows that the effect of the time consumers spend on the site is positive for those who are merely "looking," while it is negative—albeit not significantly so—for those customers who seem interested in purchasing. Table 4 presents the results of these two independent analyses. The results are fairly intuitive: when visitors who wish to purchase a product spend more time, they are most likely experiencing barriers in the check-out process, while for visitors who are merely "lookers," a longer session probably reflects increased interest.

Estimate Std. Error z value Pr(z)
Lookers (Intercept) -3.2445 0.0261 -124.42 0.0000
Time 0.6572 0.0183 35.83 0.0000
Pages 0.4527 0.0161 28.20 0.0000
Purchasers (Intercept) -5.6142 0.0726 -77.30 0.0000
Time -0.0425 0.0898 -0.47 0.6356
Pages 0.0135 0.0943 0.14 0.8858
Table 4: Comparison of separate logistic regressions examining the effect of session time and the number of pages visited for customers qualified (ad hoc) as either "Lookers" or "Purchasers" based on the content of the webpages they visited.
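A minimal sketch of these two separate analyses follows, assuming a data frame with the column names of Table 3, standardized predictors, and the coding Orient = 0 for "lookers" and 1 for "purchasers" (the file name matches the code in the next section; the coding of Orient is an assumption):

# Separate logistic regressions per orientation group (sketch)
sessions <- read.csv("session_data.csv")
sessions$Time  <- scale(sessions$Time)          # z-scores, as in Section 4.2
sessions$Pages <- scale(sessions$Pages)
fit_lookers    <- glm(Purchase ~ Time + Pages, family = binomial,
                      data = subset(sessions, Orient == 0))
fit_purchasers <- glm(Purchase ~ Time + Pages, family = binomial,
                      data = subset(sessions, Orient == 1))
summary(fit_lookers)      # positive, significant effect of Time
summary(fit_purchasers)   # negative, non-significant effect of Time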

To evaluate the use of oFMLR, we fit multiple logistic mixture models to the data while simulating a data stream: we effectively "replay" all the observations in chronological order and fit a number of competing models, with different starting values and different numbers of components, to the generated stream. Furthermore, we assume that the orientation score is unknown at the time the data arrive; these scores were determined after the fact by the insurance company, following a lengthy analysis of the URLs the consumers visited. The code below details how the ofmlr package was used to simulate the data stream and fit multiple models:

# Open the dataset
data <- read.csv("session_data.csv")

# Standardize the predictors (z-scores) to prevent numerical instability
data[, 2:3] <- scale(data[, 2:3])

# Load the ofmlr package and specify six competing models: three
# predictors (intercept, time, pages) and 1, 2, or 3 mixture components,
# with two different random initializations each
library(ofmlr)
models <- multi_online_log_mixture(online_log_mixture(3, 1))
models <- add_model(models, online_log_mixture(3, 1))
models <- add_model(models, online_log_mixture(3, 2))
models <- add_model(models, online_log_mixture(3, 2))
models <- add_model(models, online_log_mixture(3, 3))
models <- add_model(models, online_log_mixture(3, 3))

# Simulate the data stream by replaying the rows in chronological order;
# each row supplies the response and a feature vector with an intercept
for (i in 1:nrow(data)) {
  models <- add_observation(models, data[i, 1], c(1, as.numeric(data[i, 2:3])))
}

This code fits six distinct models on the simulated stream: for each number of mixture components (one, two, or three) we fit two models with different (random) starting values, each containing three predictors (the intercept, the effect of "time", and the effect of "pages"). Given the large number of unique values of the predictors, we should have sufficient data to identify models of this size. The for loop effectively "replays" the incoming data in chronological order. Before fitting, we standardize (compute z-scores for) the variables "time" and "pages" to prevent numerical instability.

The output of the call to summary(models) is presented in Table 5. The moving-average log-likelihoods of the two models with the same number of components but different starting values differ little from each other, and the average change in the norm of the parameter vector is small, indicating reasonable convergence for each of the six models. When comparing the models on AIC and BIC, the single-component model is preferred; perhaps the second mixture component is either (a) too small to identify, or (b) the estimated negative coefficient of time for the "purchasers" is not strong enough to affect the estimates of the two-component model (indeed, this coefficient was not significantly different from zero in the separate analysis). However, when looking at the maximum log-likelihood—the mean log-likelihood obtained when each datapoint is attributed to its highest posterior component—one would be inclined to prefer models M3 or M4. Both of these two-component models score high on this metric, which provides some (albeit weak) evidence in favor of the two-component model. Inspecting the parameters of one of these two-component models shows a large cluster (76%) of customers for whom the effect of time is positive, and a smaller one (23%) for whom the effect is instead negative; at least qualitatively, this replicates the findings of the separate analysis, which included the omitted variable "orientation". Hence, though the signal in the current dataset is weak, fitting multiple logistic mixture models in parallel can shed light on possible differences between clusters of consumers. Again, it is intuitive to identify clusters of visitors based on the time spent during their visit to the website: time on the page can be a proxy for general interest, and thus have a positive effect, but it can also signal usability issues with the website, in which case it will likely have a negative effect for consumers trying to make an actual purchase.

M    k  p  ll       maxll    AIC     BIC     Norm    n
M1   1  3  -0.1003  -0.1003  208.60  228.23  0.0031  105502
M2   1  3  -0.1003  -0.1003  208.60  228.23  0.0031  105502
M3   2  3  -0.1079  -0.0416  231.87  271.13  0.0015  105502
M4   2  3  -0.1060  -0.0539  227.96  267.22  0.0018  105502
M5   3  3  -0.1071  -0.0588  238.25  297.14  0.0011  105502
M6   3  3  -0.1045  -0.0967  233.04  291.93  0.0020  105502
Table 5: Comparison of oFMLR models with 1–3 mixture components fit to the consumer browsing behavior data in a simulated data stream.
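The attribution of a datapoint to its highest posterior component, used for the maximum log-likelihood metric above, follows the usual mixture-model computation: the posterior probability of component m is proportional to its mixture weight times the Bernoulli likelihood of the observation under that component's logistic model. A minimal sketch (the parameter values below are illustrative only):

# Posterior component membership for one observation (y, x)
posterior_component <- function(y, x, alpha, beta) {
  lik <- sapply(seq_along(alpha), function(m) {
    p <- plogis(sum(beta[[m]] * x))        # P(y = 1 | x, component m)
    alpha[m] * p^y * (1 - p)^(1 - y)       # weight times Bernoulli likelihood
  })
  lik / sum(lik)                           # normalize to posterior probabilities
}
# Example with illustrative two-component estimates:
alpha_hat <- c(0.76, 0.24)
beta_hat  <- list(c(-3, 0.5, 0.4), c(-5, -0.1, 0.1))
which.max(posterior_component(1, c(1, 0.2, -0.5), alpha_hat, beta_hat))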

5 Discussion

This article introduced a method—and associated [R] package—for the analysis of customer responses arriving in a continuous data stream (in this case, clicking behavior). The method is an online, or row-by-row, implementation of the EM algorithm to fit a finite mixture of logistic regression models. These models have been widely investigated, and we implement a specific version of the more general online EM algorithm discussed by Cappé and Moulines (2009). Our specific implementation of the finite mixture of logistic regression models, and the software provided, make the current work especially suited for analyzing consumer behavior arriving in large and continuous data streams. We have detailed the challenges faced when analyzing customer data that arrive in high-velocity data streams, and we have explained how online estimation methods can help marketing scholars deal with the large data streams that originate from online clickstreams. We have also discussed the concepts behind both SGD and the EM algorithm. SGD is an optimization method with many applications for fitting models in data streams; its importance for future work in analyzing online advertising data can hardly be overstated. The EM algorithm is a general algorithm for dealing with missing-data problems, and in many cases it can be transformed into an online version, thus providing a wealth of methods for analyzing data streams. Finally, we have discussed identification and convergence issues, and we have linked our work to Bayesian approaches with similar aims. We hope the current paper is instructive for readers unfamiliar with the analysis of large continuous data streams, and that it serves as an introduction to our oFMLR model.

We envision using oFMLR (or related statistical models that can be fit online) to continuously monitor the effects of online marketing campaigns. We can encode both features of the marketing messages (the product being displayed, the shape, the persuasive appeals being used, etc.) and features of the customers. The data stream that results from customers' click behavior on the messages or surrounding products can then be analyzed continuously using multiple versions of oFMLR running in parallel (with different choices of k and different starting values). The estimated mixture probabilities provide direct, real-time feedback to policy makers about homogeneous clusters within the target audience, and the estimated coefficients within each cluster can be used to interpret these findings qualitatively: as our empirical example illustrated, the analysis can highlight both positive and negative effects of a single predictor. Interpreting such differences between clusters can inspire different courses of action; in our empirical example, for instance, one can imagine that a redesign of the website could improve its usability and make purchasing simpler.

Both the logistic regression model and its finite mixture extension (and variants thereof) have been used for numerous applications in marketing research (e.g., Zhang and Krishnamurthi, 2004; Van den Bulte and Joshi, 2007; Schmittlein and Peterson, 1994; West et al., 1997). These models have been applied for a wide range of purposes: to understand and cluster customers' online behavior, to understand new product diffusion, and to model consumer choice. As such, the models are flexible and have tremendous potential for applications in marketing; even recent developments in machine learning still rely heavily on the logistic regression model (see, e.g., He et al., 2014). Technological advances have created new measurement opportunities in all areas of marketing, and marketing researchers are increasingly confronted with high-volume or high-velocity streams. We hope the current work will contribute to the use of logistic regression models to understand such continuously collected data. We believe that analysis practice will change in the coming years: when confronted with continuous data streams, a continuous analysis—such as demonstrated in our empirical example—can inspire continuously updated marketing policies. Using fixed learn rates—and thus effectively "forgetting" older data—we can track the estimates of multiple parallel statistical models, providing a continuous source of information for policy decisions rather than testing specific hypotheses at specific points in time. The data collection is never finished, nor are our attempts to optimize our marketing policies.

Admittedly, this paper has discussed the Bayesian paradigm only briefly. This is by no means intended to discourage a Bayesian treatment of modeling continuous data streams; for example, the relatively recent probit regression approach presented by Graepel et al. (2010) provides an extremely usable, scalable, and fully online approach to modeling binary outcomes using a single component. We would welcome the development of similar online approaches for mixture models. Large steps are also being made in statistics and computer science, as witnessed by recent work on stochastic variational inference and sequential MCMC (e.g., Tank et al., 2015; Scott et al., 2016). In the years to come, these approaches may well surpass the frequentist treatment presented here because of the inherent benefits of the Bayesian approach when quantifying uncertainty in the estimated parameters. However, we have focused on what we believe is still the most common approach in marketing practice; we hope the current work contributes to further developments in both frequentist and Bayesian methods for dealing with high-velocity continuous data streams.

This paper has focused solely on learning from a data stream case by case. In practice, however, models that are fit online might benefit from a hybrid approach, in which either batches of datapoints are used to compute updates of the parameters, or an offline analysis of a static dataset is used to determine the starting values of a model that is subsequently updated in a data stream. Both approaches warrant further investigation; such hybrid approaches have been explored specifically for the EM algorithm (see Neal and Hinton (1998) and Liang and Klein (2009) for examples). Finally, this paper introduced several diagnostic tools geared specifically towards the analysis of a data stream (for example, the online moving average of the log-likelihood as presented in Eq. 15); these proposals need to be studied in more detail.
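Equation 15 is not reproduced in this section; one plausible form of such a diagnostic is an exponentially weighted moving average of the per-observation log-likelihood, sketched below under that assumption (the decay rate gamma is an illustrative choice):

# Online moving average of the log-likelihood (one plausible form)
ma_loglik <- function(ll_prev, y, p_hat, gamma = 0.001) {
  ll_obs <- y * log(p_hat) + (1 - y) * log(1 - p_hat)  # Bernoulli log-likelihood
  (1 - gamma) * ll_prev + gamma * ll_obs               # exponentially weighted update
}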

In this paper, we demonstrated the performance of oFMLR in a simulation study as well as on an empirical dataset, and we provided an easy-to-use [R] package to fit the oFMLR model. While the results of our simulations are promising, and the reduction in computation time obtained by using online estimation is very appealing, we must stress that finite mixtures of logistic regression models do not always converge properly: this is true for both the online and the offline versions. Furthermore, especially in the online case, the analysis is influenced by the starting values, and, without substantial domain knowledge, it is hard to establish identifiability criteria at the start of the data stream. Hence, the analyst should—as always—be careful in interpreting the results, and should seek additional methods to validate the clustering found when analyzing a data stream using oFMLR. In practice, we recommend fitting multiple models (for different choices of k) online, and using an online bootstrap method, such as that suggested by Owen and Eckles (2012), to quantify the uncertainty in the estimated model parameters. Combined with a long data stream, this will ensure that our proposed method can be used efficiently and responsibly to identify clusters of customers based on clickstream data.
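A conceptual sketch of such an online bootstrap, following the Poisson-weighting idea of Owen and Eckles (2012) and reusing the ofmlr interface from Section 4.2 (the replicate count and weighting scheme here are illustrative, not a prescription):

# Online bootstrap: B model replicates; each arriving observation enters
# replicate b rpois(1, 1) times, mimicking resampling in a stream
B    <- 25
boot <- replicate(B, online_log_mixture(3, 2), simplify = FALSE)
for (i in 1:nrow(data)) {
  w <- rpois(B, 1)                         # per-replicate Poisson(1) weights
  for (b in which(w > 0)) {
    for (r in 1:w[b]) {
      boot[[b]] <- add_observation(boot[[b]], data[i, 1],
                                   c(1, as.numeric(data[i, 2:3])))
    }
  }
}
# The spread of the estimates across the B replicates quantifies the
# uncertainty in the oFMLR parameter estimates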

We believe that oFMLR is a useful addition to the toolbox of statistical methods available to marketing scholars and practitioners for analyzing data streams. We also hope to have provided additional understanding of (a) the basic conceptual methods for dealing with high-volume and high-velocity data streams, and (b) the methodological building blocks (SGD and EM) that were used to develop oFMLR.

Acknowledgements

M.K. would like to gratefully acknowledge the support of Dr. Joris Mulder, Prof. Dr. Dean Eckles, and Prof. Dr. Petri Parvinen for their insightful discussions during the preparation of this manuscript. Furthermore, M.K. would like to thank Hans van der Poel and Dave Kruizinga for supplying the data for the case study in Section 4.2.

References

  • Alpcan and Bauckhage (2009) Alpcan, T. and Bauckhage, C. (2009). A distributed machine learning framework. Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference, pages 2546–2551.
  • Attias (1999) Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 21–30. Morgan Kaufmann Publishers Inc.
  • Baek and Morimoto (2012) Baek, T. H. and Morimoto, M. (2012). Stay Away From Me. Journal of Advertising, 41(1):59–76.
  • Bakshy et al. (2012) Bakshy, E., Eckles, D., Yan, R., and Rosenn, I. (2012). Social Influence in Social Advertising : Evidence from Field Experiments. In Electronic Commerce 2012, volume 1.
  • Benaglia et al. (2009) Benaglia, T., Chauveau, D., Hunter, D., and Young, D. (2009). mixtools: An r package for analyzing finite mixture models. Journal of Statistical Software, 32(6):1–29.
  • Bleier and Eisenbeiss (2015) Bleier, A. and Eisenbeiss, M. (2015). Personalized Online Advertising Effectiveness: The Interplay of What, When, and Where. Marketing Science, 34(5):669–688.
  • Bonfrer and Drèze (2009) Bonfrer, A. and Drèze, X. (2009). Real-Time Evaluation of E-mail Campaign Performance. Marketing Science, 28(2):251–263.
  • Boyd and Crawford (2012) Boyd, D. and Crawford, K. (2012). Critical Questions for Big Data. Information, Communication & Society, 15(5):662–679.
  • Bucklin and Sismeiro (2009) Bucklin, R. E. and Sismeiro, C. (2009). Click Here for Internet Insight: Advances in Clickstream Data Analysis in Marketing. Journal of Interactive Marketing, 23(1):35–48.
  • Butler et al. (1997) Butler, S. M., Louis, T. A., et al. (1997). Consistency of maximum likelihood estimators in general random effects models for binary data. The Annals of Statistics, 25(1):351–377.
  • Cappé and Moulines (2009) Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613.
  • Chatterjee et al. (2003) Chatterjee, P., Hoffman, D. L., and Novak, T. P. (2003). Modeling the Clickstream: Implications for Web-Based Advertising Efforts. Marketing Science, 22(4):520–541.
  • Cheong et al. (2011) Cheong, Y., Leckenby, J. D., and Eakin, T. (2011). Evaluating the Multivariate Beta Binomial Distribution for Estimating Magazine and Internet Exposure Frequency Distributions. Journal of Advertising, 40(1):7–24.
  • Chipman et al. (2007) Chipman, H. A., George, E. I., and McCulloch, R. E. (2007). Bayesian ensemble learning. In Advances in Neural Information Processing Systems 19.
  • Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
  • Cho and Cheon (2004) Cho, C.-H. and Cheon, H. J. (2004). Why do people avoid advertising on the internet? Journal of Advertising, 33(4):89–97.
  • Chu et al. (2007) Chu, C.-t., Kim, S. K., Lin, Y.-a., and Ng, A. Y. (2007). Map-Reduce for Machine Learning on Multicore. Advances in neural information processing systems, 19(23):281.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B, 39(1):1–22.
  • Dippold and Hruschka (2013) Dippold, K. and Hruschka, H. (2013). A model of heterogeneous multicategory choice for market basket analysis. Review of Marketing Science, 11(1):1–31.
  • Do and Batzoglou (2008) Do, C. B. and Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature biotechnology, 26(8):897–899.
  • Follmann and Lambert (1991) Follmann, D. A. and Lambert, D. (1991). Identifiability of finite mixtures of logistic regression models. Journal of Statistical Planning and Inference, 27(3):375–381.
  • Frühwirth-Schnatter (2006) Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer Science & Business Media.
  • Gaber et al. (2005) Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S. (2005). Mining data streams. ACM SIGMOD Record, 34(2):18.
  • Gelman (2008) Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3(3):445–450.
  • Gortmaker et al. (1994) Gortmaker, S. L., Hosmer, D. W., and Lemeshow, S. (1994). Applied Logistic Regression.
  • Graepel et al. (2010) Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13–20.
  • He et al. (2014) He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. (2014). Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM.
  • Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.
  • Huang and Lin (2006) Huang, C.-Y. and Lin, C.-S. (2006). Modeling the audience's banner ad exposure for internet advertising planning. Journal of Advertising, 35(2):123–136.
  • Ippel et al. (2017) Ippel, L., Kaptein, M., and Vermunt, J. (2017). Dealing with data streams: an online, row-by-row, estimation tutorial. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences.
  • Jedidi et al. (1997) Jedidi, K., Jagpal, H. S., and DeSarbo, W. S. (1997). Finite-Mixture Structural Equation Models for Response-Based Segmentation and Unobserved Heterogeneity. Marketing Science, 16(1):39–59.
  • Kamakura and Russell (1989) Kamakura, W. A. and Russell, G. (1989). A probabilistic choice model for market segmentation and elasticity structure. Journal of marketing research, 26:379–390.
  • Kantas et al. (2009) Kantas, N., Doucet, A., Singh, S. S., and Maciejowski, J. M. (2009). An overview of sequential monte carlo methods for parameter estimation in general state-space models. IFAC Proceedings Volumes, 42(10):774–785.
  • Kireyev et al. (2015) Kireyev, P., Pauwels, K., and Gupta, S. (2015). Do display ads influence search? Attribution and dynamics in online advertising. International Journal of Research in Marketing.
  • Kruschke (2014) Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
  • Kubrusly and Gravier (1973) Kubrusly, C. and Gravier, J. (1973). Stochastic approximation algorithms and applications. In 1973 IEEE Conference on Decision and Control including the 12th Symposium on Adaptive Processes, number 12, pages 763–766.
  • Laczniak (2015) Laczniak, R. N. (2015). The Journal of Advertising and the Development of Advertising Theory: Reflections and Directions for Future Research. Journal of Advertising, 44(4):429–433.
  • Liang and Klein (2009) Liang, P. and Klein, D. (2009). Online em for unsupervised models. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, pages 611–619. Association for Computational Linguistics.
  • McFarland et al. (2006) McFarland, R. G., Challagalla, G. N., and Shervani, T. A. (2006). Influence Tactics for Effective Adaptive Selling. Journal of Marketing, 70(4):103–117.
  • Micheaux (2011) Micheaux, A. L. (2011). Managing e-mail Advertising Frequency from the Consumer Perspective. Journal of Advertising, 40(4):45–66.
  • Moe and Fader (2004) Moe, W. W. and Fader, P. S. (2004). Capturing evolving visit behavior in clickstream data. Journal of Interactive Marketing, 18(1):5–19.
  • Moon (1996) Moon, T. K. (1996). The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13:47–60.
  • Neal and Hinton (1998) Neal, R. M. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer.
  • Opper and Winther (1998) Opper, M. and Winther, O. (1998). A Bayesian approach to on-line learning. In On-line Learning in Neural Networks, ed. D. Saad. Cambridge University Press.
  • Owen and Eckles (2012) Owen, A. B. and Eckles, D. (2012). Bootstrapping data arrays of arbitrary order. The Annals of Applied Statistics, 6(3):895–927.
  • Poggio et al. (2011) Poggio, T., Voinea, S., and Rosasco, L. (2011). Online Learning, Stability, and Stochastic Gradient Descent. Artificial Intelligence, 8(11):11.
  • Ranaweera (2005) Ranaweera, C. (2005). A model of online customer behavior during the initial transaction: Moderating effects of customer characteristics. Marketing Theory, 5(1):51–74.
  • Rojas-Méndez et al. (2009) Rojas-Méndez, J. I., Davies, G., and Madran, C. (2009). Universal differences in advertising avoidance behavior: A cross-cultural study. Journal of Business Research, 62(10):947–954.
  • Ryu et al. (2011) Ryu, D., Li, E., and Mallick, B. K. (2011). Bayesian nonparametric regression analysis of data with random effects covariates from longitudinal measurements. Biometrics, 67(2):454–466.
  • Salloum et al. (2016) Salloum, S., Dautov, R., Chen, X., Peng, P. X., and Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, pages 1–20.
  • Schmittlein and Peterson (1994) Schmittlein, D. C. and Peterson, R. A. (1994). Customer base analysis: An industrial purchase process application. Marketing Science, 13(1):41–67.
  • Scott et al. (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus monte carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
  • Seyedghorban et al. (2015) Seyedghorban, Z., Tahernejad, H., and Matanda, M. J. (2015). Reinquiry into Advertising Avoidance on the Internet: A Conceptual Replication and Extension. Journal of Advertising, pages 1–10.
  • Su and Chen (2015) Su, Q. and Chen, L. (2015). A method for discovering clusters of e-commerce interest patterns using click-stream data. Electronic Commerce Research and Applications, 14(1):1–13.
  • Tank et al. (2015) Tank, A., Foti, N. J., and Fox, E. B. (2015). Streaming variational inference for bayesian nonparametric mixture models. In AISTATS.
  • Teicher (1963) Teicher, H. (1963). Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269.
  • Teicher (1967) Teicher, H. (1967). Identifiability of mixtures of product measures. The Annals of Mathematical Statistics, 38(4):1300–1302.
  • Titterington et al. (1985) Titterington, D. M., Smith, A. F., and Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley.
  • Toulis et al. (2014) Toulis, P., Airoldi, E., and Rennie, J. (2014). Statistical analysis of stochastic gradient methods for generalized linear models. In ICML, pages 667–675.
  • Vakratsas et al. (2004) Vakratsas, D., Feinberg, F. M., Bass, F. M., and Kalyanaram, G. (2004). The Shape of Advertising Response Functions Revisited: A Model of Dynamic Probabilistic Thresholds. Marketing Science, 23(1):109–119.
  • Van den Bulte and Joshi (2007) Van den Bulte, C. and Joshi, Y. V. (2007). New product diffusion with influentials and imitators. Marketing Science, 26(3):400–421.
  • Wang and Puterman (1998) Wang, P. and Puterman, M. L. (1998). Mixed Logistic Regression Models. Journal of Agricultural, Biological, and Environmental Statistics, 3(2):175.
  • Wedel and DeSarbo (1995) Wedel, M. and DeSarbo, W. S. (1995). A mixture likelihood approach for generalized linear models. Journal of Classification, 12(1):21–55.
  • Wedel et al. (1993) Wedel, M., DeSarbo, W. S., Bult, J. R., and Ramaswamy, V. (1993). A latent class poisson regression model for heterogeneous count data. Journal of Applied Econometrics, 8(4):397–411.
  • Wedel and Kamakura (2012) Wedel, M. and Kamakura, W. A. (2012). Market segmentation: Conceptual and methodological foundations, volume 8. Springer Science & Business Media.
  • West et al. (1997) West, P. M., Brockett, P. L., and Golden, L. L. (1997). A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Science, 16(4):370–391.
  • Yaveroglu and Donthu (2008) Yaveroglu, I. and Donthu, N. (2008). Advertising Repetition and Placement Issues in On-Line Environments. Journal of Advertising, 37(2):31–44.
  • Yeu et al. (2013) Yeu, M., Yoon, H.-S., Taylor, C. R., and Lee, D.-H. (2013). Are Banner Advertisements in Online Games Effective? Journal of Advertising, 42:241–250.
  • Yue et al. (2012) Yue, Y., Hong, S., and Guestrin, C. (2012). Hierarchical exploration for accelerating contextual bandits. arXiv preprint arXiv:1206.6454.
  • Zhang and Krishnamurthi (2004) Zhang, J. and Krishnamurthi, L. (2004). Customizing promotions in online stores. Marketing Science, 23(4):561–578.
  • Zhou et al. (2005) Zhou, J., Foster, D., Stine, R., and Ungar, L. (2005). Streaming feature selection using alpha-investing. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 384–393. ACM.
  • Zinkevich et al. (2010) Zinkevich, M. A., Smola, A., and Weimer, M. (2010). Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 23(6):1–9.