Modeling Time to Open of Emails with a Latent State for User Engagement Level

08/18/2019
by   Moumita Sinha, et al.
adobe

Email messages have been an important mode of communication, not only for work, but also for social interactions and marketing. When messages have time sensitive information, it becomes relevant for the sender to know what is the expected time within which the email will be read by the recipient. In this paper we use a survival analysis framework to predict the time to open an email once it has been received. We use the Cox Proportional Hazards (CoxPH) model that offers a way to combine various features that might affect the event of opening an email. As an extension, we also apply a mixture model (MM) approach to CoxPH that distinguishes between recipients, based on a latent state of how prone to opening the messages each individual is. We compare our approach with standard classification and regression models. While the classification model provides predictions on the likelihood of an email being opened, the regression model provides prediction of the real-valued time to open. The use of survival analysis based methods allows us to jointly model both the open event as well as the time-to-open. We experimented on a large real-world dataset of marketing emails sent in a 3-month time duration. The mixture model achieves the best accuracy on our data where a high proportion of email messages go unopened.


1. Introduction

Email has a rich history of being a data source for machine learning techniques. Starting with spam filtering (Cormack, 2008), the range of applications today covers a rich spectrum of scenarios. The Enron Corpus (Klimt and Yang, 2004) enabled research into the modeling of users’ interactions with email in a collaborative environment (Chapanond et al., 2005). For email service providers, detailed understanding of consumers’ interactions with the email system allows building predictive models for specific actions, e.g. if an email will be replied to or not (Yang et al., 2017) and creating rich experiences (Karagiannis and Vojnovic, 2009; Kannan et al., 2016) for the recipients. On the consumer side, given its popularity, there has been much work on different ways to handle large volumes of email effectively (Whittaker and Sidner, 1996). An early paper by Horvitz et al. (1999) proposed that autonomous agents may be able to identify and prioritize emails that need attention. Di Castro et al. (2016) show that historical data allows the prediction of what actions a user might take on the receipt of an email, for example, marking it for deletion (Dabbish et al., 2003). Apart from being a mode of communication, email is also used as a personal information management environment (Ducheneaut and Bellotti, 2001), leading to the need to support other forms of interactions like search (Narang et al., 2017).

The domain of interest in the current paper is marketing, where the email channel ranks high in popularity (VanBoskirk et al., 2011) alongside social media, search & display advertising. Email based marketing is predicted to have a compound annual growth rate of (VanBoskirk et al., 2011) and nearly every enterprise marketer uses it as a delivery channel (Tsirulnik, 2011; Chaffey, 2009). The engagement levels however are typically low, as compared to personal email messages at work or among friends. The open rates for the marketing email messages, vary by industry - ranging from to in the e-commerce, beauty and personal care, and gambling industries, and in the range of to in the hobbies, home and garden or health and fitness industries (Wells, 2016). Marketers are therefore always on the lookout for techniques that might enhance the engagement levels. For example, Kumar et al. (Kumar et al., 2014) modeled opt-in and opt-out behaviour and related these to transactions made by the consumer. Bonfrer et al. (Bonfrer and Drèze, 2009) proposed a framework that allows real-time evaluation of an email campaign.

In this submission, we propose the use of survival analysis for jointly modeling the open event on an email, as well as the time-to-open. The next section provides technical background to some important concepts in survival analysis that are relevant in the current scenario.

2. Survival Analysis

Survival analysis refers to an area of statistical modeling where the main variable of interest is the time to an event. Historically, the event is assumed to be death. One characteristic of data that makes the use of survival models appropriate is the presence of censoring. This refers to the fact that not all individuals would have experienced the event within the observation window. The censoring may occur because the event had not yet occurred at the time of analysis, or because the corresponding individual can no longer be tracked. Figure 1 is a pictorial representation of survival data in the context of emails. Observations are synchronized at $t = 0$, which is the time at which the individuals receive the email. If the email is not read within a chosen time interval (a fixed number of hours), this would be a censored data point. And some recipients may of course not read the email at all.

Figure 1. Example of survival data with different censoring windows. Emails are received at $t = 0$ by all individuals. Note that an outcome that is censored under a shorter window may be observed under a longer one.

Consider a random variable $T$ for the time to the event of interest, with the corresponding probability density function $f(t)$ and the cumulative distribution function being $F(t) = P(T \le t)$ at a given time $t$. Then the survivor function is defined as

$$S(t) = P(T > t) = 1 - F(t) \tag{1}$$

It represents the probability that an individual will survive beyond time $t$. Equivalently, given that the individual has not yet experienced the event till time $t$, the hazard function $h(t)$ represents the instantaneous chance of the event occurring at time $t$.

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)} \tag{2}$$

The relationship between the survivor function and the hazard function can be derived as being $S(t) = \exp(-H(t))$, where $H(t) = \int_0^t h(u)\,du$ is the cumulative hazard function corresponding to $h(t)$.

A survival analysis dataset containing $N$ individuals is represented as $\mathcal{D} = \{(x_i, y_i, \delta_i)\}$, with $i = 1, \dots, N$. For the $i$-th individual, $x_i$ is a vector of features that are believed to be predictive of the survival time. The target $y_i = \min(t_i, c_i)$ represents the survival time, where $c_i$ represents the duration of time for which the individual was observed and is also known as the censoring window. If observed within the censoring window, $t_i$ is the time to event for the $i$-th individual. The indicator variable $\delta_i$ encodes if the individual experienced the event of interest within the censoring window.

$$\delta_i = \begin{cases} 1 & \text{if } t_i \le c_i \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
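As an illustration of how the targets and event indicators depend on the censoring window, the following minimal sketch builds them from raw times-to-open. The function name `censor` and the toy values are hypothetical and not taken from the paper's pipeline; emails that were never opened are assumed to be encoded as infinity (or NaN).

```python
import numpy as np

def censor(time_to_open_hours, window_hours):
    """Return (y, delta) for a censoring window shared by all individuals.

    Assumption: emails that were never opened are encoded as np.inf or np.nan.
    """
    t = np.asarray(time_to_open_hours, dtype=float)
    t = np.where(np.isnan(t), np.inf, t)       # treat missing opens as never opened
    delta = (t <= window_hours).astype(int)    # 1 if the open falls inside the window
    y = np.minimum(t, window_hours)            # observed time, capped at the window
    return y, delta

# Hypothetical times-to-open (hours) for four recipients.
times = [0.5, 2.0, 7.5, np.inf]
y3, d3 = censor(times, 3.0)     # the third recipient is censored at 3 hours
y12, d12 = censor(times, 12.0)  # ...but observed with a 12-hour window
```

The same recipient can therefore be censored under one window and observed under another, which is why the training data changes whenever the window changes.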

2.1. Cox Proportional Hazard Regression

Given a feature vector $x_i$ for the $i$-th individual, the hazard function for the individual at any given time $t$ can be defined as

$$h(t \mid x_i) = h_0(t)\,\phi(x_i) \tag{4}$$

Here $h_0(t)$ is the baseline hazard function at time $t$, and $\phi(x_i)$ incorporates the dependence on the individual-specific features $x_i$, which are independent of time. The specific factorization of $h(t \mid x_i)$ into a global time-dependent component ($h_0(t)$) and an individual’s time-independent factor ($\phi(x_i)$) is the Proportional Hazards assumption - Section 3.2 provides a methodology to validate this assumption on a given dataset. What has been defined above is a semi-parametric approach, in that no assumptions have been made about the shape of the baseline hazard function $h_0(t)$. The parametric alternative would be to impose a functional form, e.g. a Weibull distribution. Based on the relation between the survivor and hazard functions, the survivor function of the $i$-th individual for Cox Proportional Hazard (CoxPH) regression is

$$S(t \mid x_i) = \exp\big(-H_0(t)\,\phi(x_i)\big), \qquad H_0(t) = \int_0^t h_0(u)\,du \tag{5}$$

The corresponding partial likelihood function (Cox, 1972) is defined as

$$L(\theta) = \prod_{i:\,\delta_i = 1} \frac{\phi(x_i; \theta)}{\sum_{j \in R(t_i)} \phi(x_j; \theta)} \tag{6}$$

where the function $\phi$ has been parameterized by $\theta$ that controls the combination of the features. $R(t_i)$ is the set of individuals who are at-risk of the event at time $t_i$, that is, the set of individuals for whom the event has not occurred yet. $t_i$ is also the observed time to event of the $i$-th individual. Note that the numerator of the likelihood is a function of only the individuals that observed the event, and censored individuals only contribute to the denominator of Equation 6. The $\theta$ values are estimated by maximizing the above likelihood using a gradient based method.

The most common form is $\phi(x_i; \beta) = \exp(\beta^\top x_i)$, where $\beta$ is a vector of parameters controlling the dependence between the features in $x_i$ and target $y_i$. Doing so assumes a linear scaling of the relative (log) hazards of different individuals with respect to the values of the features. Ridgeway (1999) proposed that the likelihood in Equation 6 can alternatively be optimized directly using gradient boosting methods that might provide benefits in scenarios where the effect of the features is non-linear. Note that this is still a Proportional Hazards model, but with $\log \phi(x_i)$ taken to be the output of a gradient boosting machine (GBM).
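To make the role of the risk set concrete, here is a minimal sketch of the negative log partial likelihood for the log-linear relative hazard $\exp(\beta^\top x_i)$. It is an illustration under simplifying assumptions (no tied event times, a plain Python loop instead of an optimized implementation), and the function name is hypothetical rather than the authors' code.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, y, delta):
    """Negative log Cox partial likelihood with relative hazard exp(beta^T x).

    Assumption: no tied event times, so no Breslow/Efron tie correction is applied.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    eta = X @ np.asarray(beta, dtype=float)    # log relative hazard per individual
    nll = 0.0
    for i in np.flatnonzero(delta == 1):       # only observed events enter the product
        at_risk = y >= y[i]                    # risk set: event not yet occurred at y[i]
        nll -= eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return nll
```

Censored rows never appear in the outer loop, but they do appear in `at_risk`, matching the remark that they contribute only to the denominator. The returned value could be passed to a general-purpose gradient-based optimizer to estimate `beta`.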

2.2. Mixture Model with Cox Proportional Hazard Regression

The CoxPH model assumes that all individuals will eventually experience the event. But there may be a proportion of individuals who are not prone to the event, i.e., who are not predisposed to opening emails. The level at which an individual user is engaged with marketing messages influences his/her act of opening the email (and how quickly). The CoxPH model described earlier tries to explain all the observations using only the features ($x_i$) as the explanatory factors. Through the use of mixture models (Farewell, 1982; Branders et al., 2015), we might expect to get more discriminatory power. The $i$-th individual is now represented as $(x_i, z_i, y_i, \delta_i, u_i)$, where $u_i$ is a latent indicator variable such that

$$u_i = \begin{cases} 1 & \text{if the individual is prone to the event} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

$z_i$ is a set of features that help predict if an individual is prone to the event of interest or not. The feature set $z_i$ can also be the same as the feature set $x_i$.

$$\pi_i = P(u_i = 1 \mid z_i) = \frac{1}{1 + \exp(-\gamma^\top z_i)} \tag{8}$$

The probability $\pi_i$ is estimated using logistic regression here, and is introduced as a mixture probability into the overall survivor function:

$$S(t \mid x_i, z_i) = \pi_i\, S(t \mid x_i, u_i = 1) + (1 - \pi_i) \tag{9}$$

If the individual is predisposed to not experiencing the event, then $\pi_i \approx 0$, leading to a prediction of a survival probability close to $1$. Conversely, a scenario with $\pi_i \approx 1$ leads to the first term dominating, with the quantity $S(t \mid x_i, u_i = 1)$ representing the survival probability in the traditional sense. A proportional hazards assumption can be encoded by setting $h(t \mid x_i, u_i = 1) = h_0(t)\,\phi(x_i)$, with baseline hazard $h_0(t)$ and relative hazard $\phi(x_i)$ as before. The likelihood of the model is given by:

$$L = \prod_{i=1}^{N} \big[\pi_i\, h(y_i \mid x_i, u_i = 1)\, S(y_i \mid x_i, u_i = 1)\big]^{\delta_i} \big[(1 - \pi_i) + \pi_i\, S(y_i \mid x_i, u_i = 1)\big]^{1 - \delta_i} \tag{10}$$

Since there are latent variables (the $u_i$), the optimization is an Expectation Maximization based iterative procedure that estimates the $u_i$, along with $\gamma$ (for calculating $\pi_i$) and the parameters $\beta$ controlling how the features of an individual affect the relative hazards. In the current setting, we are interpreting $\pi_i$ as the engagement level of a given user $i$; the model however is more general. For example, it can be used to represent the probability that a patient has been cured, which in turn affects the chances that he/she will experience the event.
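The effect of the mixture probability on the survivor function can be sketched as follows. This is an illustrative snippet only: `engagement_probability` and `mixture_survival` are hypothetical names, the logistic form is the one described in the text, and the inputs are toy values rather than fitted quantities.

```python
import numpy as np

def engagement_probability(gamma, z):
    """Logistic-regression probability that an individual is prone to opening."""
    return 1.0 / (1.0 + np.exp(-np.dot(gamma, z)))

def mixture_survival(pi, surv_prone):
    """Overall survival: prone individuals follow surv_prone, the rest never open.

    pi         : probability of being prone to the event (engagement level)
    surv_prone : survival probabilities S(t | prone) on some time grid
    """
    return pi * np.asarray(surv_prone, dtype=float) + (1.0 - pi)

# Toy survival curve for a prone individual on a hypothetical time grid.
s_prone = np.array([1.0, 0.8, 0.5, 0.3])
low_engagement = mixture_survival(0.1, s_prone)   # stays close to 1
high_engagement = mixture_survival(0.9, s_prone)  # close to the prone curve
```

For a weakly engaged recipient the curve flattens near 1, which is the behaviour a plain CoxPH model cannot express because it assumes everyone eventually experiences the event.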

2.3. Related Work

Survival analysis has traditionally been used in the health-care domain to determine the time to ‘death’ in patients, but the usage of this range of techniques has recently expanded to other application areas (Wang et al., 2017). Examples include prediction of early student dropouts (Ameri et al., 2016), post-click engagement on native ads (Barbieri et al., 2016), query specific micro-blog ranking for improved retrieval (Efron, 2012), recommender systems in e-commerce (Wang and Zhang, 2013), search engine evaluation via the use of “absence time” (Chakraborty et al., 2014), and predicting time for crowd-sourced tasks (Lease et al., 2011).

By appropriately defining the event being modeled, existing marketing concepts also lend themselves to survival analysis techniques. For example, re-purchasing behavior is an indicator of high engagement (Lee et al., 2012) and a proxy for the potential value of a customer (Drye et al., 2001; Lu and Park, 2003). Attrition modeling helps businesses identify customers who are most at-risk so that attempts can be made to keep them in the system, and Lee et al. (2012) propose a survival analysis based solution.

Much of the literature referred to above involves applying well-known and established models (like CoxPH) in different scenarios. But more recently, growing interest in the use of survival analysis has led to modeling improvements. For instance, when modeling time-to-event of related tasks, the parameters of the different models can be more reliably estimated using regularization techniques commonly used in multi-task learning (Li et al., 2016). Even in traditional application areas of survival analysis, given a large number of data points and a variety of features that potentially have a highly non-linear dependence on the time-to-event, deep latent models provide better performance (Ranganath et al., 2016).

The closest related work to that presented here is described in (Dave et al., 2017) where time-to-event is modelled in the email domain. Given this context, the contribution of the current paper is two-fold: (1) we describe techniques from the rich history of survival analysis to identify those models whose assumptions are better matched with the characteristics of the data (2) for the application of predicting time-to-event when the censored rows dominate, the mixture model (MM) described above is shown to not only describe the data better but also provide better predictive performance.

3. Problem Definition and Data Description

When emails containing time sensitive information are sent, it may be relevant for the sender to know what is the expected time within which the email will be read by the recipient. Specifically in marketing messages, if the email advertises a flash sale, the marketer will need to decide on the time window for the sale - to optimize between reaching sufficient consumers within the window and yet keep it exclusive. Prediction of time-to-open of an email by a consumer helps to determine the size of the recipient list one wants to reach.

Our dataset corresponds to email marketing campaigns that are sent out to consumers of an enterprise and we are interested in a predictive model that answers questions of two types: (a) Is a particular email likely to be opened by a given recipient? (b) Can we predict the time within which the email will be opened?

In the dataset, there is a high degree of variability amongst the marketing messages - some are sent to a large group of recipients, while others are targeted at a narrow set of consumers - e.g. a personalized birthday communication. We expect that the nature of people’s interaction with these different types of emails varies drastically. In particular, we are interested in modeling how people differ in terms of their engagement with the mass marketing emails. For this reason, the analysis presented here includes only those emails that were sent to at least of the total consumers. We have additionally dropped those consumers who received fewer than messages during the period of interest.

The time at which an email reaches a consumer is labelled as its start-time. In the event that the email is read, the email has a corresponding open-time. The difference between the two time-stamps is referred to as the time-to-open. The emails are divided into 3 non-overlapping buckets based on the start-time: a Training dataset (spanning 4 weeks) and one dataset each for Validation and Test (spanning 3 weeks each). Table 1 shows the size of each of these datasets. Chronologically ordered, these 3 datasets cover 13 weeks of email messages with a week gap between Validation and Test. Within each group, data from the initial two weeks are used to compute features that will be used to model users’ interaction with emails sent in the subsequent week(s).

| Dataset | #Recipients (millions) | #Emails (millions) |
| Training | 2.05 | 31.86 |
| Validation | 2.04 | 22.73 |
| Test | 2.22 | 19.14 |

Table 1. Overview of Datasets

3.1. Survival and Censoring

In the context of email messaging, the opening of an email is the event and the survivor function at time $t$ provides the probability that the email will not be opened by this time. For the purposes of the results presented here, the data was gathered by monitoring the emails for 10 days after they were sent. Figure 2 shows the distribution of the time-to-open in our Training dataset. Note that the plot accounts for only a fraction of the total data points, since most emails are never opened.

In applications of survival models, the size of the censoring window is dictated by when the analysis needs to be performed or when information stops being available about the individuals. The choice of censoring window should include as many of the eventual opens as possible. Amongst the emails that were opened within the observed 10 days, happened within the first hours, with the corresponding numbers for and hours being and respectively. Beyond hours, the opens become very rare.

With too short a censoring window $c_i$ for each individual $i$, over half the eventual opens would be lost - this is undesirable. On the other hand, there are practical limitations to being able to track and monitor an email campaign for extended periods of time. In addition, for time sensitive messages, the censoring window is the time within which the message needs to be opened by the recipients to ensure that the promotion is valid. Given the observations above about the data and the constraints imposed by the domain, in the current paper, we are evaluating three different censoring windows - 3, 6 and 12 hours - with a preference for the higher values. For a given choice of $c_i$, the emails that were not opened by their recipients within the time period are considered censored.

Given that all individuals are synchronized at $t = 0$ (Figure 1), the $c_i$’s across all individuals will be the same for a given choice of censoring window. Note that when the censoring window changes, the training data provided to the survival models changes (via Equation 3). Therefore, identifying the right censoring window for the given application requires us to build a given model multiple times - one for each value of the censoring window.

Figure 2. Distribution of time-to-open as a percentage of the number of recipients

Across the dataset, we can construct a histogram of the cumulative fraction of individuals that survived (i.e., emails that are yet to be opened) up until a given time. The Kaplan-Meier estimator (Kaplan and Meier, 1958) constructs a non-parametric, step-function estimate of this survival curve that accounts for censoring. In Figure 4, the curve corresponding to “Combined” represents the Training dataset; the significance of the other data in the plot will be explained in Section 3.2. We can see that the survival rate is still high at around three hours - this represents the fact that only a small fraction of the users have actual time-to-open data for the censoring window of three hours (as also seen in Figure 2). This high degree of censorship is a unique characteristic of our dataset and represents one of the challenges in terms of the traditional use of survival models.
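The Kaplan-Meier estimate itself is short enough to sketch directly. The snippet below is a minimal illustration using the standard product-limit formula; the function name and the toy data are hypothetical, and an established survival library would normally be used instead.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Product-limit estimate of the survivor function.

    y     : observed times (event time, or censoring time if censored)
    delta : 1 if the event was observed, 0 if censored
    Returns the distinct event times and the estimated S(t) just after each.
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(y[delta == 1])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(y >= t)                    # still unopened just before t
        n_events = np.sum((y == t) & (delta == 1))    # opens happening exactly at t
        s *= 1.0 - n_events / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy example: two opens (at 1h and 2h) and two emails censored at 3h.
km_times, km_surv = kaplan_meier([1.0, 2.0, 3.0, 3.0], [1, 1, 0, 0])
```

With heavy censoring, as in the dataset described here, the estimated curve stays well above zero at the end of the window.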

3.1.1. Baselines:

We consider the historical open rate of an individual as the baseline for predicting his/her future open events. We also include a Logistic Regression (LR) model as a classification baseline in our experiments. The auxiliary problem of predicting the time to the open event is not as straightforward. As a simplistic baseline, we consider the censoring time as the predicted time to open for all individuals (B). Given that the majority of rows are censored, this constant is taken as the median of the $y_i$ values, with the mean being only marginally lower. We also considered a linear regression (LR) model to predict time to open as a baseline in our experiments; this model excluded all individuals who had censored information (Burke et al., 1997; Delen et al., 2005). Here, with slight abuse of notation, we denote both logistic and linear regression baselines by LR. The distinction is made clearly while presenting results.

3.2. Testing Model Assumptions

In this section, we will describe statistical analyses that validate the choice of models in our experimental section. Every model represents a corresponding set of statistical assumptions, and we need to verify that the data aligns with the assumptions of the models used.

In the original CoxPH model, the hazard of an individual at a time $t$ is a product of two quantities - the baseline hazard and the individual’s relative hazard (see Equation 4). The baseline hazard is a global quantity that incorporates the dependence on time. And the relative hazard of an individual is expressed as a parameterized function of features and is independent of time. Verifying the applicability of a proportional hazards model equates to validating that the effects of the features are independent of time. To test this independence, we take all observed (i.e., not censored) events for individual $i$ and calculate the Schoenfeld residual, which is then scaled (Grambsch and Therneau, 1994), for each feature $k$ as follows:

$$r_{ik} = x_{ik} - \bar{x}_k(\beta, t_i) \tag{11}$$

where $x_{ik}$ is the value of the $k$-th feature and

$$\bar{x}_k(\beta, t_i) = \frac{\sum_{j \in R(t_i)} x_{jk} \exp(\beta^\top x_j)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)} \tag{12}$$

In the equation above, $t_i$ represents the time at which the event occurred for individual $i$. And as before, the risk set $R(t_i)$ represents those individuals who have not yet experienced the event until $t_i$. Plotting the $r_{ik}$ values as a function of the times $t_i$ allows us to validate the proportionality assumption - the check being that the $r_{ik}$ have no trend over time. Figure 3 plots the scaled Schoenfeld residuals calculated using the cox.zph function in the survival package in R. For the sake of clarity, only a sample of data points are displayed in each plot. Also included in the plot are the corresponding p-values for the null hypothesis test, that the sum of residuals for the corresponding feature across individuals is $0$. This procedure allows us to ensure that the features included in the model satisfy the null hypothesis (i.e., the proportionality assumption) at a chosen significance level - the features illustrated above have p-values greater than the chosen threshold.

Figure 3. Test for the Proportionality Assumption in CoxPH for different features. The scaled Schoenfeld residuals do not have a trend over time

Note that the above procedure utilizes the parameters from a previously fitted CoxPH model. A similar procedure can be used for the GBM (where the dependence between the features and the hazard is non-linear) as well as the Mixture Model (MM). All the models considered in the current paper are based on the idea of proportional hazards, hence the relevance of such a test.
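For readers who want to see what the residual computation involves, the following is a minimal sketch of the unscaled Schoenfeld residuals for a log-linear relative hazard. It is not a replacement for cox.zph: the scaling step and the hypothesis test are omitted, tied event times are not handled, and the function name is hypothetical.

```python
import numpy as np

def schoenfeld_residuals(beta, X, y, delta):
    """Unscaled Schoenfeld residuals for a fitted log-linear Cox model.

    For each observed event, the residual is the individual's feature vector
    minus the risk-weighted mean feature vector over the risk set.
    Assumption: no tied event times; scaling (as done by cox.zph) is omitted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    w = np.exp(X @ np.asarray(beta, dtype=float))   # relative hazards
    events = np.flatnonzero(delta == 1)
    residuals = []
    for i in events:
        at_risk = y >= y[i]
        x_bar = (w[at_risk, None] * X[at_risk]).sum(axis=0) / w[at_risk].sum()
        residuals.append(X[i] - x_bar)
    return y[events], np.array(residuals)
```

Plotting each column of the returned residuals against the returned event times, and checking for the absence of a trend, mirrors the visual check described for Figure 3.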

We note two popular extensions available from the survival analysis literature that were excluded from our experiments because they did not match the characteristics of our scenario:

  1. Parametric distributions, like the Weibull, are common choices for modeling event data. However, these were not appropriate here due to the multi-modal nature of the time-to-open, that is, a large portion of the recipients are not prone to opening the message.

  2. There exist extensions to the CoxPH model that are capable of utilizing time-dependent features. Given the nature of email data, where most messages that will be opened are opened within a relatively small period, obtaining updated predictions at short intervals due to changes in time-dependent feature values is not practical.

Mixture models (MM) are applicable to scenarios where there are sub-populations with different characteristics. The MM estimates the probability of each individual belonging to one of the two latent states, those who will not open and those who may open the messages. The mixture probability in Equation 9, therefore, represents how engaged the individual is with the email messages. A proxy method to test the existence of groups with differing engagement behaviors is to use a logistic regression model to classify the recipients into two groups - prone to opening their emails versus not - and plot their survivor functions, as done in Figure 4. Statistical differences between the two groups can then be validated by a log-rank test, where the ranks of the survival times in the two groups are compared.

Figure 4. Survivor Functions of the Two Groups of Consumers

3.3. Procedure for Predicting Time-to-Event

The application scenarios discussed in the Related Work section as well as the work described here, all utilize the main strengths of survival analysis - modeling time-to-event data as a function of features in the presence of censoring.

To predict the time-to-event for an individual (say $i$), we first calculate the hazard function $h(t \mid x_i)$ for that individual. This is obtained by multiplying the estimate of the global baseline hazard, found using the Breslow estimator (Lin, 2007), with the relative hazard of that individual. The individual’s survival curve is then constructed as $S(t \mid x_i) = \exp(-H(t \mid x_i))$, where $H(t \mid x_i)$ is the cumulative sum of $h(t \mid x_i)$ over time $t$. This survivor function will be similar to the one shown in Figure 5a, but instead of characterizing the population, it would be for the individual. In the survivor function, a natural estimate of time-to-open would be to use the median survival time, typically referred to as $t(50)$, which corresponds to the time when the survival probability is $0.5$. Given that the observed overall survival rates are quite high in the current scenario, due to the fact that most marketing emails are never opened, the median might not be defined if the survival curve flattens above $0.5$ on the Y-axis, as in Figure 5b. Thus, the concept is generalized to the notion of a percentile such that the predicted time-to-open an email by the individual, $t(p)$, is the largest $t$ for which $S(t \mid x_i) \ge p/100$ for a given percentile $p$. Percentiles other than the median can therefore be used. Figure 5 illustrates how a time corresponding to a given survival probability threshold is identified from a survival curve. In the current paper, we evaluate a range of percentiles as predictors of time-to-open for future emails to that recipient, and all predictions of time are made at the scale of minutes.

Figure 5. Two examples of survivor functions. (b) corresponds to our training data. Note that median survival time, i.e., t(50) is not defined in (b)
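Reading a percentile-based time off a survival curve can be sketched as below. The snippet assumes the curve has already been evaluated on a sorted time grid and is non-increasing; the function name, the grid (in minutes) and the survival values are hypothetical.

```python
import numpy as np

def percentile_time(times, surv, p):
    """Largest grid time t whose survival probability is still >= p/100.

    Assumptions: `times` is sorted ascending and `surv` is non-increasing.
    Returns None if the curve is below the threshold everywhere on the grid.
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    above = surv >= p / 100.0
    if not above.any():
        return None
    return float(times[np.flatnonzero(above)[-1]])

# Hypothetical individual curve that flattens well above 0.5 (as in Figure 5b).
grid_minutes = [0, 60, 120, 180]
curve = [1.0, 0.97, 0.93, 0.92]
t95 = percentile_time(grid_minutes, curve, 95)  # last grid time with S >= 0.95
t50 = percentile_time(grid_minutes, curve, 50)  # curve never reaches 0.5
```

When the curve never drops to the threshold, this rule returns the end of the grid, which here coincides with the censoring window.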

4. Experiments and Results

In the earlier sections, we have described how the use case of modeling email interaction data maps to concepts in the area of survival analysis. In the current section, we evaluate the performance of a set of models on the two tasks of interest: (a) predicting the likelihood of a marketing message being opened by a recipient, and (b) estimating the time at which this event is likely to occur.

Table 2 contains the models evaluated in the current paper. The baseline for the classification task is the historical open rate of the corresponding recipient, and the baseline for the prediction of time-to-open is a constant equal to the censoring window. Logistic Regression is used for the “open vs not” classification task and Linear Regression is used for the time to open of emails. Models were trained by adding an Elastic Net (Zou and Hastie, 2005) regularization penalty to the corresponding likelihoods. Values of the hyper-parameters (the strength of the regularization and the trade-off between L1 and L2) were chosen via the use of the Validation dataset. The equivalent parameters of the GBM - number of trees, learning rate and minimum number of observations for a leaf node - were similarly chosen.

Features:

The set of features available to us are aggregates of actions that a user can take on emails, for example the number of messages received, opened and clicked, and the open rate in the past. We also included campaign specific information, like whether the last message was opened and/or clicked, and consumer specific features, like how long the consumer has been signed up to receive these email messages. In the current paper, we primarily focused on the methodology, instead of identifying an expansive set of features. If a particular feature has good predictive signal, we expect all the models described here to benefit from it.

B: Baselines
LR: Logistic Regression (Classification) / Linear Regression (Time to Open)
CPH-L: CoxPH Model with log-linear relative hazard, $\exp(\beta^\top x_i)$
CPH-G: CoxPH Model with relative hazard from a GBM
MM: Mixture Model with Proportional Hazards

Table 2. List of Models

4.1. Evaluation Metrics

For the binary prediction of whether the email will be opened within the censoring window $c_i$, we calculate the area-under-the-ROC-curve (AUC) for the probabilities calculated from the survivor function. For prediction of the time to open an email, we defined the Mean Relative Absolute Deviation (MRAD) between the actual value, $y_i$, for the time-to-open and the predicted value $\hat{y}_i$, and is given by

$$\mathrm{MRAD} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \tag{13}$$

When we calculate the MRAD over all the individuals in the dataset, we refer to the resulting metric as MRAD(A). We also define MRAD(O) as the version of the metric calculated only over the observed individuals, i.e., all the rows for which $\delta_i = 1$. For both the metrics, $\hat{y}_i$ is estimated by the percentile-based time read off the individual's survival curve for a given value of the percentile, as described in the previous section.

We compared the models shown in Table 2 on the basis of AUC for classification of which recipients will open the messages and MRAD for time to open the messages. The higher the AUC, the better the classifier, and a lower MRAD indicates a smaller deviation between the observed and the predicted times. We now describe different sets of experiments evaluating various aspects of the models.
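The two variants of the time-to-open metric can be sketched as follows. The normalization by the actual value is an assumption based on the metric's name, and the function name and toy values are hypothetical.

```python
import numpy as np

def mrad(y_true, y_pred, delta=None, observed_only=False):
    """Mean Relative Absolute Deviation between actual and predicted times.

    Assumption: each absolute deviation is divided by the actual value y_true.
    With observed_only=True, only rows with delta == 1 are used (MRAD(O));
    otherwise all rows are used (MRAD(A)).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if observed_only:
        mask = np.asarray(delta, dtype=int) == 1
        y_true, y_pred = y_true[mask], y_pred[mask]
    return float(np.mean(np.abs(y_true - y_pred) / y_true))

# Toy example in minutes: two observed opens and one row censored at 180.
actual = [10.0, 20.0, 180.0]
predicted = [20.0, 20.0, 180.0]
observed = [1, 1, 0]
mrad_all = mrad(actual, predicted)                               # all rows
mrad_obs = mrad(actual, predicted, observed, observed_only=True) # observed rows only
```

In this toy case the censored row is predicted exactly at the window, so including it lowers the metric, which is the same mechanism by which the all-rows variant can come out lower than the observed-only variant.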

4.2. Comparison of Models Across Censoring Windows

| Censoring Window | Model | AUC | MRAD(A) | MRAD(O) |
| 3 hours | B | 0.863 | 1.226 | 26.641 |
| 3 hours | LR* | 0.931 | 1.332 | 8.411 |
| 3 hours | CPH-L | 0.931 | 1.085 | 11.953 |
| 3 hours | CPH-G | 0.932 | 0.941 | 12.217 |
| 3 hours | MM | 0.929 | 0.483 | 9.499 |
| 6 hours | B | 0.870 | 2.504 | 40.602 |
| 6 hours | LR* | 0.939 | 1.653 | 11.706 |
| 6 hours | CPH-L | 0.939 | 1.835 | 23.245 |
| 6 hours | CPH-G | 0.940 | 1.372 | 14.788 |
| 6 hours | MM | 0.938 | 0.678 | 9.832 |
| 12 hours | B | 0.878 | 5.079 | 62.740 |
| 12 hours | LR* | 0.948 | 2.332 | 17.501 |
| 12 hours | CPH-L | 0.948 | 1.707 | 19.831 |
| 12 hours | CPH-G | 0.949 | 1.572 | 28.978 |
| 12 hours | MM | 0.948 | 1.318 | 15.657 |

  • *AUC results for LR correspond to Logistic Regression. MRAD(A) and MRAD(O) results for LR correspond to Linear Regression.

Table 3. Comparison of the Models under AUC and MRAD across Censoring Windows

We compared the different models across three different censoring windows (3, 6 and 12 hours), using the predicted time-to-open calculated at a fixed percentile. Section 4.3 explores the effect of the percentile choice on both AUC and MRAD for the various censoring windows. Table 3 summarizes the comparisons, and reports the results for the best configuration of parameters on the Validation set. The following observations can be made from Table 3:

  1. The separate models for each task - Logistic Regression for classification and Linear Regression for predicting time - are very compelling baselines. The benefit of jointly modeling both tasks via survival models becomes more obvious at the larger censoring windows. As discussed earlier, with a short censoring window, over half of the eventual opens are treated as censored by the survival models. The noise so introduced reduces with larger choices of the censoring window, allowing MM in particular to perform well.

  2. Analyzing the importance of the features in our models, we find that the historical open rate is, as expected, the strongest. AUC numbers when using the open rate as a predictor of a user’s future open actions have also been included (B).

  3. CPH-G models non-linear relations between the features and hazard function, and as a classifier is marginally stronger than the other models.

  4. The assumptions of MM represent the characteristics of our data most accurately. We see that the AUC is similar to that of CPH-L. Note, however, that the predictions of time provided by MM are more accurate than the alternatives at larger censoring windows.

  5. The MRAD(A) is much lower than the MRAD(O) as it considers the accuracy of predicting the time-to-open of the censored individuals as well, which the models do efficiently, as evidenced from the high AUC values.

  6. When moving through the censoring window options (3, 6 and 12 hours), the classification task generally gets easier, as indicated by the increasing AUC values. This is because the set of observed individuals gets more complete, and separating these from the individuals that are unlikely to open the email is what the model needs to do.

An additional observation from these experiments is regarding the validation metric. Across the different hyper-parameter settings, the model that had the highest AUC on the Validation set was chosen, instead of that with lowest MRAD. The choice of the validation metric dictates the future performance of the model - a setting chosen on the basis of higher AUC does not necessarily achieve a lower MRAD on held-out datasets compared to a parameter combination driven by MRAD. This observation indicates that there is an implicit need to pick one of the tasks as being of higher priority - here we have chosen the classification task to have a higher priority. Also, for the rest of the experiments in this paper, to compare models for prediction of time to open, we consider MRAD(O), instead of MRAD(A). This is because AUC already provides a comparison for classification and MRAD(O) helps to compare specifically based on those who have opened the messages.
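The point about the validation metric can be made concrete with a small sketch. The configuration names and metric values below are invented purely for illustration and are not results from our experiments; `select_config` is a hypothetical helper.

```python
from typing import Dict, List


def select_config(configs: List[Dict], metric: str, higher_is_better: bool) -> Dict:
    """Pick the hyper-parameter setting that is best under one validation metric."""
    chooser = max if higher_is_better else min
    return chooser(configs, key=lambda c: c[metric])


# Hypothetical validation results for three hyper-parameter settings.
validation_results = [
    {"name": "config_a", "auc": 0.940, "mrad_o": 14.0},
    {"name": "config_b", "auc": 0.938, "mrad_o": 10.5},
    {"name": "config_c", "auc": 0.931, "mrad_o": 12.0},
]

best_by_auc = select_config(validation_results, "auc", higher_is_better=True)
best_by_mrad = select_config(validation_results, "mrad_o", higher_is_better=False)

# The two criteria can disagree, so one task has to be given priority.
print(best_by_auc["name"], best_by_mrad["name"])
```

When the two selections differ, as in this toy example, choosing by AUC amounts to prioritizing the classification task over the time prediction task.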

4.3. Model Performance Across Survival Thresholds

One of the characteristics of the dataset used in this paper is the high percentage of censored observations. This manifests itself in the fact that the survival curves for individuals routinely flatten well before they reach ; an illustration of this at the population level can be seen in Figure 5. Thus, across the different settings and models evaluated here, using a large value of for provides sufficient fidelity for predicting the probability of opening the email.

Table  4 describes how the MRAD(O) changes for all the models as we alter the percentile . The column titles use the conventional notation , the equivalent survival probability can be worked out readily. corresponds to the largest time when . Note that the AUC values remain constant by design, and therefore are not displayed. In general, as is reduced, since we are operating in the flatter regions of the survival curves, all become equal to the duration of the censoring window. The point at which the MRAD(O) values appear to reach the maximum is indicative of where the corresponding saturation region is in the survival plot. We would expect the corresponding MRAD(O) values to approach that of the baseline in Table  3.

In that sense, MM is most closely modeling the data - there is of the population that never opens their messages and the survival curve for MM arrests its drop around percentile, with maximum MRAD(O) observed for all time points after that. The survival curve for CPH-L on the other hand seems to gradually go towards the same maximum MRAD(O), around . The survival curve for CPH-G is in between the MM and CPH-L.

Censoring Window  Model   MRAD(O) (one column per percentile)
3 hours           CPH-L   11.952  12.608  13.929  16.744  23.462  25.716
3 hours           CPH-G   12.217  14.593  18.501  25.407  26.629  26.641
3 hours           MM      9.499   26.641  26.641  26.641  26.641  26.641
6 hours           CPH-L   23.245  17.456  18.857  19.870  27.041  34.542
6 hours           CPH-G   14.788  19.588  26.233  30.970  38.102  40.602
6 hours           MM      9.832   12.483  40.602  40.602  40.602  40.602
12 hours          CPH-L   19.831  34.632  27.229  32.510  40.356  40.750
12 hours          CPH-G   28.978  31.866  29.986  34.590  57.086  61.040
12 hours          MM      15.657  21.705  62.545  62.740  62.740  62.740
Table 4. The effect of varying the percentile of Survivor Function on the prediction quality of the time-to-event
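The saturation behaviour discussed in this section can be illustrated with a minimal sketch. It assumes a generic cure-type mixture survival curve (a never-open fraction plus an exponentially decaying susceptible part) and a prediction rule that takes the largest time at which the survival probability is still at or above a chosen threshold. Both are simplifying assumptions for illustration and not the fitted models of this paper; all names and parameter values are hypothetical.

```python
import math
from typing import List


def mixture_survival(t: float, never_open_fraction: float, open_rate: float) -> float:
    """Survival curve that plateaus at the never-open fraction (assumed form)."""
    return never_open_fraction + (1.0 - never_open_fraction) * math.exp(-open_rate * t)


def time_at_threshold(times: List[float], survival: List[float], threshold: float) -> float:
    """Largest grid time whose survival probability is still >= threshold."""
    eligible = [t for t, s in zip(times, survival) if s >= threshold]
    return max(eligible) if eligible else 0.0


censoring_window = 3.0  # hours
times = [i / 10 for i in range(31)]  # 0.0, 0.1, ..., 3.0
survival = [mixture_survival(t, never_open_fraction=0.7, open_rate=2.0) for t in times]

t_90 = time_at_threshold(times, survival, 0.90)  # threshold above the plateau
t_70 = time_at_threshold(times, survival, 0.70)  # threshold at the plateau
t_50 = time_at_threshold(times, survival, 0.50)  # threshold below the plateau
print(t_90, t_70, t_50)
```

Once the threshold falls to or below the plateau of the curve, every grid time qualifies and the predicted time collapses to the censoring window, which is the saturation effect described above.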

4.4. Measuring Model Stability

In this section we test the sensitivity of our models, and of the corresponding fitting processes, to variations in the data. We construct an experiment where we take bootstrap samples of the Training set, fit each model in turn to the datasets generated, and calculate the AUC and MRAD(O) metrics for the various models on the Validation set. The hyper-parameter values used were the optimal parameters for defining classification at , the same as for the results reported in Table 3. The mean and standard deviation of the metrics across the samples, for each of the four models, are shown in Table 5.

Censoring Window  Model   AUC Mean  AUC StdDev  MRAD(O) Mean  MRAD(O) StdDev
3 hours           LR*     0.931     4e-5        8.215         0.036
3 hours           CPH-L   0.931     2e-5        13.579        1.913
3 hours           CPH-G   0.929     8e-3        13.746        0.743
3 hours           MM      0.929     3e-4        9.277         1.854
6 hours           LR*     0.939     4e-5        11.651        0.079
6 hours           CPH-L   0.939     1e-5        20.301        2.486
6 hours           CPH-G   0.939     3e-4        19.208        0.798
6 hours           MM      0.938     2e-4        9.753         1.194
12 hours          LR*     0.948     4e-5        17.514        0.096
12 hours          CPH-L   0.948     7e-5        38.143        3.333
12 hours          CPH-G   0.949     2e-4        29.455        1.071
12 hours          MM      0.948     2e-4        15.444        1.683
  • *AUC results for LR correspond to Logistic Regression and MRAD(O) results for LR correspond to Linear Regression.

Table 5. Mean and Standard Deviation of AUC and MRAD(O) on bootstrapped samples
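The bootstrap procedure behind Table 5 can be sketched generically as follows. The `fit` and `evaluate` callables stand in for fitting any of the models on a resampled Training set and scoring it on the fixed Validation set; the toy "model" (a sample mean) and the data values are hypothetical placeholders, not our actual models or data.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def bootstrap_metric(
    train_rows: Sequence[float],
    fit: Callable[[List[float]], float],
    evaluate: Callable[[float], float],
    n_samples: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Refit on bootstrap resamples; return mean and std of the validation metric."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Sample the training rows with replacement, keeping the original size.
        sample = rng.choices(train_rows, k=len(train_rows))
        scores.append(evaluate(fit(sample)))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy stand-ins: the "model" is the mean time-to-open of the resampled rows.
train_times = [1.0, 2.0, 4.0, 8.0, 16.0]
validation_times = [3.0, 5.0]


def toy_fit(sample: List[float]) -> float:
    return statistics.mean(sample)


def toy_evaluate(prediction: float) -> float:
    # Relative absolute deviation on the fixed validation rows.
    return statistics.mean([abs(prediction - v) / v for v in validation_times])


mean_score, std_score = bootstrap_metric(train_times, toy_fit, toy_evaluate, n_samples=200)
print(mean_score, std_score)
```

The validation data stays fixed across resamples, so the spread of the scores reflects only the sensitivity of the fitting process to the training data.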

This sampling procedure allows us to obtain proxy estimates for the confidence intervals for each metric via the standard deviation. This would be required if we were to reliably order the different models. We note that AUC in general has very low variability, and therefore, the ordering over the models on the basis of AUC is reliably stable. MRAD(O) on the other hand, shows more sensitivity to input variations. This offers another reason as to why AUC might be a more appropriate metric for validation/model-selection.

The difference between the two metrics can be attributed to the fact that AUC is a ranking metric that is an aggregate over the entire list of consumers, i.e., even if the fitted model parameters across the samples were slightly different, as long as the ordering imposed by the model's scores over the dataset does not change appreciably, AUC remains the same. In contrast, MRAD is a comparison between the individual predicted and actual time-to-open, and is expected to be more sensitive.
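The ranking nature of AUC can be checked with a small sketch. The pairwise formulation below is the standard probability-of-correct-ordering view of AUC; the scores and labels are toy values chosen only for illustration.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (opener, non-opener) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

# Squaring is monotone on [0, 1]: the values change but the ordering does not.
perturbed_scores = [s ** 2 for s in scores]

auc_original = auc(scores, labels)
auc_perturbed = auc(perturbed_scores, labels)
print(auc_original, auc_perturbed)
```

Both calls give the same value because only the ordering of the scores matters, whereas a per-individual error such as MRAD would change under the same perturbation.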

When comparing across the models for a given censoring window, we find that MM provides the lowest mean MRAD(O) among the survival models, and the lowest overall at the 6- and 12-hour windows, although its standard deviation is higher than that of CPH-G.

4.5. Out-of-Time Model Evaluation

The final set of experiments verifies whether the models continue to be predictive into the future. To do this, we use the Test dataset that has been held aside up until now. We combine the Training and the Validation sets, and use this concatenated data for building the models. The parameters that are used as input are those that have been chosen via the validation process. This includes not only the hyper-parameters of the models (e.g. the regularization parameters for the Elastic Net based models), but also the choice of percentile to be used for predicting time (which is the one with the least MRAD(O) in Section 4.3 for each model).

Censoring Window  Model   AUC     MRAD(O)
3 hours           LR*     0.937   7.753
3 hours           CPH-L   0.937   10.009
3 hours           CPH-G   0.934   13.787
3 hours           MM      0.935   7.381
6 hours           LR*     0.944   10.194
6 hours           CPH-L   0.944   23.227
6 hours           CPH-G   0.940   18.339
6 hours           MM      0.942   9.340
12 hours          LR*     0.952   13.409
12 hours          CPH-L   0.952   29.131
12 hours          CPH-G   0.950   21.080
12 hours          MM      0.951   11.653
  • *AUC results for LR correspond to Logistic Regression and MRAD(O) results for LR correspond to Linear Regression.

Table 6. AUC and MRAD(O) for each of the models for the out-of-time dataset

The results presented in Table 6 indicate that the performance does not degrade over time. For all the models, the AUC values are high. As was seen in Section 4.3, with the right choice of the percentile to be used for estimating the time-to-open, the Mixture Model obtains prediction accuracy that is better than that of the other models. Coupled with the fact that the model assumptions of MM align more closely with the characteristics of our data, we believe that designing better optimization strategies with this model would be a fruitful avenue to explore in future work.

5. Discussion and Conclusion

In this paper, we proposed the use of survival models for the problems of predicting if a recipient will open an email or not, and estimating the corresponding time-to-open. We have provided a step by step procedure to check model assumptions using statistical tests. These tests help to determine whether the models that we have compared are meaningful for the data that we have. We have then performed a thorough experimental comparison of the models. Since a large proportion of emails are not opened, the median time-to-open is not defined and hence cannot be used as the estimate of future time-to-open. We therefore tested our models for various percentiles and the percentile is found to be the best predictor for time-to-open.

Across our experiments, it was observed that for the task of predicting if an individual will open the email or not, a traditional classification based model (here, Logistic Regression) is a strong baseline. Survival analysis based models achieve similar accuracy to the classifier when predicting the event. While linear regression is a strong baseline for predicting time at the shortest censoring window, the proposed survival analysis performs best with the addition of new data when the censoring window is larger. Focusing on marketing email messages, we observe that a large fraction of emails do not get opened (even after days), owing to the fact that certain recipients are inclined to ignore such messaging. This characteristic is not commonly encountered in applications of survival analysis. Given data of this nature, a mixture model that allows for individuals to have very low predispositions to experiencing the event (here, opening the email) was explored. This differential modeling of sub-populations was done in association with a Proportional Hazards assumption and was shown to be most effective. As future extensions to our work, we plan to derive a regularization framework for the mixture model, which promises to be more effective, especially when the model includes a large number of features with correlations between them.

Acknowledgements.
We thank Frederic Mary in Adobe Product Management for his insights, and anonymous reviewers for their constructive feedback.

References

  • S. Ameri, M. J. Fard, R. B. Chinnam, and C. K. Reddy (2016) Survival analysis based framework for early prediction of student dropouts. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, New York, NY, USA, pp. 903–912. External Links: ISBN 978-1-4503-4073-1, Link, Document Cited by: §2.3.
  • N. Barbieri, F. Silvestri, and M. Lalmas (2016) Improving post-click user engagement on native ads via survival analysis. In Proceedings of the 25th International Conference on World Wide Web, pp. 761–770. Cited by: §2.3.
  • A. Bonfrer and X. Drèze (2009) Real-time evaluation of e-mail campaign performance. Marketing Science 28 (2), pp. 251–263. Cited by: §1.
  • S. Branders, R. D’Ambrosio, and P. Dupont (2015) A mixture cox-logistic model for feature selection from survival and classification data. arXiv preprint arXiv:1502.01493. Cited by: §2.2.
  • H. B. Burke, P. H. Goodman, D. B. Rosen, D. E. Henson, J. N. Weinstein, F. E. Harrell, J. R. Marks, D. P. Winchester, and D. G. Bostwick (1997) Artificial neural networks improve the accuracy of cancer survival prediction. Cancer 79 (4), pp. 857–862. Cited by: §3.1.1.
  • D. Chaffey (2009) Mint.com used strongmail influencer to create this viral program.. Note: http://www.strongmail.com/pdf/sm_casestudy_mint.pdf Cited by: §1.
  • S. Chakraborty, F. Radlinski, M. Shokouhi, and P. Baecke (2014) On correlation of absence time and search effectiveness. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, New York, NY, USA, pp. 1163–1166. External Links: ISBN 978-1-4503-2257-7, Link, Document Cited by: §2.3.
  • A. Chapanond, M. S. Krishnamoorthy, and B. Yener (2005) Graph theoretic and spectral analysis of enron email data. Computational & Mathematical Organization Theory 11 (3), pp. 265–281. Cited by: §1.
  • G. V. Cormack (2008) Email spam filtering: a systematic review. In Foundations and Trends in Information Retrieval, Vol. 1. Cited by: §1.
  • D. Cox (1972) Regression models and life tables. Journal of the Royal Statistical Society 34, pp. 187–220. Cited by: §2.1.
  • L. Dabbish, G. Venolia, and J. Cadiz (2003) Marked for deletion: an analysis of email data. In CHI ’03 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’03, New York, NY, USA, pp. 924–925. External Links: ISBN 1-58113-637-4, Link, Document Cited by: §1.
  • V. S. Dave, M. Al Hasan, and C. K. Reddy (2017) How fast will you get a response? predicting interval time for reciprocal link creation.. In ICWSM, pp. 508–511. Cited by: §2.3.
  • D. Delen, G. Walker, and A. Kadam (2005) Predicting breast cancer survivability: a comparison of three data mining methods. Artificial intelligence in medicine 34 (2), pp. 113–127. Cited by: §3.1.1.
  • D. Di Castro, Z. Karnin, L. Lewin-Eytan, and Y. Maarek (2016) You’ve got mail, and here is what you could do with it!: analyzing and predicting actions on email messages. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, New York, NY, USA, pp. 307–316. External Links: ISBN 978-1-4503-3716-8, Link, Document Cited by: §1.
  • T. Drye, G. Wetherill, and A. Pinnock (2001) When are customers in the market? applying survival analysis to marketing challenges. Journal of Targeting, Measurement and Analysis for Marketing 10 (2), pp. 179–188. Cited by: §2.3.
  • N. Ducheneaut and V. Bellotti (2001) E-mail as habitat: an exploration of embedded personal information management. interactions 8 (5), pp. 30–38. Cited by: §1.
  • M. Efron (2012) Query-specific recency ranking: survival analysis for improved microblog retrieval. In Proceedings of the TAIA-12 Workshop associated to SIGIR-12, Cited by: §2.3.
  • V. T. Farewell (1982) The use of mixture models for the analysis of survival data with long-term survivors. Biometrics, pp. 1041–1046. Cited by: §2.2.
  • P. M. Grambsch and T. M. Therneau (1994) Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, pp. 515–526. Cited by: §3.2.
  • E. Horvitz, A. Jacobs, and D. Hovel (1999) Attention-sensitive alerting. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI’99, San Francisco, CA, USA, pp. 305–313. External Links: ISBN 1-55860-614-9, Link Cited by: §1.
  • A. Kannan, K. Kurach, S. Ravi, T. Kaufmann, A. Tomkins, B. Miklos, G. Corrado, L. Lukács, M. Ganea, P. Young, and V. Ramavajjala (2016) Smart reply: automated response suggestion for email. CoRR abs/1606.04870. External Links: Link Cited by: §1.
  • E. L. Kaplan and P. Meier (1958) Nonparametric estimation from incomplete observations. Journal of the American statistical association 53 (282), pp. 457–481. Cited by: §3.1.
  • T. Karagiannis and M. Vojnovic (2009) Behavioral profiles for advanced email features. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, New York, NY, USA, pp. 711–720. External Links: ISBN 978-1-60558-487-4, Link, Document Cited by: §1.
  • B. Klimt and Y. Yang (2004) The enron corpus: a new dataset for email classification research. In Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings, J. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi (Eds.), pp. 217–226. External Links: ISBN 978-3-540-30115-8, Document, Link Cited by: §1.
  • V. Kumar, X. Zhang, and A. Luo (2014) Modeling customer opt-in and opt-out in a permission-based marketing context. Journal of Marketing Research 51 (4), pp. 403–419. External Links: Document, Link, http://dx.doi.org/10.1509/jmr.13.0169 Cited by: §1.
  • M. Lease, V. R. Carvalho, and E. Yilmaz (2011) Crowdsourcing for search and data mining. SIGIR Forum 45 (1), pp. 18–24. External Links: ISSN 0163-5840, Link, Document Cited by: §2.3.
  • J. Lee, H. Zhang, and V. A. Petrushin (2012) Survival analysis for marketing. Note: https://pdfs.semanticscholar.org/2360/bb9ea10622c8c21595ade8f43cc237aac230.pdf[Online; accessed 15-March-2017] Cited by: §2.3.
  • Y. Li, J. Wang, J. Ye, and C. K. Reddy (2016) A multi-task learning formulation for survival analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1715–1724. Cited by: §2.3.
  • D. Lin (2007) On the breslow estimator. Lifetime Data Analysis 13 (4), pp. 471–480. Cited by: §3.3.
  • J. Lu and O. Park (2003) Modeling customer lifetime value using survival analysis—an application in the telecommunications industry. Data Mining Techniques, pp. 120–128. Cited by: §2.3.
  • K. Narang, S. T. Dumais, N. Craswell, D. Liebling, and Q. Ai (2017) Large-scale analysis of email search and organizational strategies. In Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval, CHIIR ’17, New York, NY, USA, pp. 215–223. External Links: ISBN 978-1-4503-4677-1, Link, Document Cited by: §1.
  • R. Ranganath, A. Perotte, N. Elhadad, and D. Blei (2016) Deep survival analysis. In Proceedings of the 1st Machine Learning for Healthcare Conference, pp. 101–114. Cited by: §2.3.
  • G. Ridgeway (1999) The state of boosting. Cited by: §2.1.
  • G. Tsirulnik (2011) British airways mobile email campaign garners 250k app downloads. Note: http://www.mobilemarketer.com/ex/mobilemarketer/cms/news/email/9056.html Cited by: §1.
  • S. VanBoskirk, C. Overby, and S. Takvorian (2011) US interactive marketing forecast 2011 to 2016, forrester research. Inc. Cited by: §1.
  • J. Wang and Y. Zhang (2013) Opportunity model for e-commerce recommendation: right product; right time. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 303–312. Cited by: §2.3.
  • P. Wang, Y. Li, and C. K. Reddy (2017) Machine learning for survival analysis: a survey. ACM Computing Surveys. Cited by: §2.3.
  • D. Wells (2016) Email marketing statistics 2017. Note: http://www.smartinsights.com/email-marketing/email-communications-strategy/statistics-sources-for-email-marketing/ Cited by: §1.
  • S. Whittaker and C. Sidner (1996) Email overload: exploring personal information management of email. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 276–283. Cited by: §1.
  • L. Yang, S. Dumais, P. Bennett, and A. H. Awadallah (2017) Characterizing and predicting enterprise email reply behavior. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Cited by: §1.
  • H. Zou and T. Hastie (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), pp. 301–320. Cited by: §4.