1. Introduction
Email has a rich history of being a data source for machine learning techniques. Starting with spam filtering (Cormack, 2008), the range of applications today covers a rich spectrum of scenarios. The Enron Corpus (Klimt and Yang, 2004) enabled research into the modeling of users’ interactions with email in a collaborative environment (Chapanond et al., 2005). For email service providers, a detailed understanding of consumers’ interactions with the email system allows building predictive models for specific actions, e.g., whether an email will be replied to (Yang et al., 2017), and creating rich experiences (Karagiannis and Vojnovic, 2009; Kannan et al., 2016) for the recipients. On the consumer side, given its popularity, there has been much work on different ways to handle large volumes of email effectively (Whittaker and Sidner, 1996). An early paper by Horvitz et al. (1999) proposed that autonomous agents may be able to identify and prioritize emails that need attention. Di Castro et al. (2016) show that historical data allows the prediction of what actions a user might take on the receipt of an email, for example, marking it for deletion (Dabbish et al., 2003). Apart from being a mode of communication, email is also used as a personal information management environment (Ducheneaut and Bellotti, 2001), leading to the need to support other forms of interaction like search (Narang et al., 2017).

The domain of interest in the current paper is marketing, where the email channel ranks high in popularity (VanBoskirk et al., 2011) alongside social media, search and display advertising. Email-based marketing is predicted to have a compound annual growth rate of (VanBoskirk et al., 2011) and nearly every enterprise marketer uses it as a delivery channel (Tsirulnik, 2011; Chaffey, 2009). The engagement levels, however, are typically low compared to personal email messages at work or among friends.
The open rates for marketing email messages vary by industry, ranging from to in the e-commerce, beauty and personal care, and gambling industries, and in the range of to in the hobbies, home and garden or health and fitness industries (Wells, 2016). Marketers are therefore always on the lookout for techniques that might enhance the engagement levels. For example, Kumar et al. (2014) modeled opt-in and opt-out behaviour and related these to transactions made by the consumer. Bonfrer and Drèze (2009) proposed a framework that allows real-time evaluation of an email campaign.
In this submission, we propose the use of survival analysis for jointly modeling the open event on an email, as well as the time-to-open. The next section provides technical background to some important concepts in survival analysis that are relevant in the current scenario.
2. Survival Analysis
Survival analysis refers to an area of statistical modeling where the main variable of interest is the time to an event. Historically, the event is assumed to be death. One characteristic of data that makes the use of survival models appropriate is the presence of censoring. This refers to the fact that not all individuals would have experienced the event within the observation window. The censoring may arise because the event had not yet occurred at the time of analysis, or because the corresponding individual can no longer be tracked. Figure 1 is a pictorial representation of survival data in the context of emails. Observations are synchronized at $t = 0$, which is the time at which the individuals receive the email. If the event of the email being read is not within a chosen time interval, e.g. hours, this would be a censored data point. And some recipients may of course not read the email at all.
Consider a random variable $T$ for the time to the event of interest, with the corresponding probability density function $f(t)$ and the cumulative distribution function being $F(t)$ at a given time $t$. Then the survivor function is defined as

$$S(t) = \Pr(T > t) = 1 - F(t). \tag{1}$$

It represents the probability that an individual will survive beyond time $t$. Equivalently, given that the individual has not yet experienced the event till time $t$, the hazard function $h(t)$ represents the instantaneous chance of the event occurring at time $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}. \tag{2}$$

The relationship between the survivor function and the hazard function can be derived as being $S(t) = \exp(-H(t))$, where $H(t) = \int_0^t h(u)\,du$ is the cumulative hazard function corresponding to $h(t)$.
A survival analysis dataset containing $N$ individuals is represented as $D = \{(x_i, t_i, \delta_i)\}$, with $i = 1, \ldots, N$. For the $i$-th individual, $x_i$ is a vector of features that are believed to be predictive of the survival time. The target $t_i$ represents the survival time, where $C_i$ represents the duration of time for which the individual was observed and is also known as the censoring window. If observed within the censoring window, $T_i$ is the time to event for the individual. The indicator variable $\delta_i$ encodes if the individual experienced the event of interest within the censoring window.

$$t_i = \min(T_i, C_i), \qquad \delta_i = \begin{cases} 1 & \text{if } T_i \le C_i, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
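As a minimal sketch of how a common censoring window turns raw event times into survival targets, the snippet below applies the rule "record the event time if it falls inside the window, otherwise record the window length". The function name `apply_censoring`, the use of `np.inf` for emails that are never opened, and the toy values are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def apply_censoring(time_to_event, window):
    """Return (t, delta) for a common censoring window.

    time_to_event: latent event times; np.inf marks "never happened" (assumption).
    window: length of the censoring window, in the same time unit.
    """
    T = np.asarray(time_to_event, dtype=float)
    delta = (T <= window).astype(int)   # 1 if the event was observed in the window
    t = np.minimum(T, window)           # recorded time is capped at the window
    return t, delta

# Toy example in minutes with a 3-hour (180-minute) window.
t, delta = apply_censoring([30.0, 200.0, np.inf, 180.0], window=180.0)
```

Changing `window` changes both outputs, which is why a different censoring window yields a different training set for the survival models.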
2.1. Cox Proportional Hazard Regression
Given a feature vector $x_i$ for the $i$-th individual, the hazard function for the individual at any given time $t$ can be defined as

$$h_i(t) = h_0(t) \exp(f(x_i)). \tag{4}$$

Here $h_0(t)$ is the baseline hazard function at time $t$, and $\exp(f(x_i))$ incorporates the dependence on the individual-specific features $x_i$, which are independent of time. The specific factorization of $h_i(t)$ into a global time-dependent component ($h_0(t)$) and an individual’s time-independent factor ($\exp(f(x_i))$) is the Proportional Hazards assumption; Section 3.2 provides a methodology to validate this assumption on a given dataset. What has been defined above is a semi-parametric approach, in that no assumptions have been made about the shape of the baseline hazard function $h_0(t)$. The parametric alternative would be to impose a functional form, e.g. a Weibull distribution. Based on the relation between the survivor and hazard functions, the survivor function of the $i$-th individual for Cox Proportional Hazard (CoxPH) regression is

$$S_i(t) = \exp\left(-H_0(t) \exp(f(x_i))\right), \qquad H_0(t) = \int_0^t h_0(u)\,du. \tag{5}$$
The corresponding partial likelihood function (Cox, 1972) is defined as

$$L(\theta) = \prod_{i : \delta_i = 1} \frac{\exp(f_\theta(x_i))}{\sum_{j \in R(t_i)} \exp(f_\theta(x_j))}, \tag{6}$$

where the function $f$ has been parameterized by $\theta$ that controls the combination of the features. $R(t_i)$ is the set of individuals who are at-risk of the event at time $t_i$, that is, the set of individuals for whom the event has not occurred yet. $t_i$ is also the observed time to event of the $i$-th individual. Note that the numerator of the likelihood is a function of only the individuals that observed the event, and censored individuals only contribute to the denominator of Equation 6. The $\theta$ values are estimated by maximizing the above likelihood using a gradient-based method.
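The structure of the partial likelihood can be made concrete with a short sketch. The function below computes the negative log partial likelihood for given per-individual scores (the log relative hazards); the name `neg_log_partial_likelihood`, the $O(N^2)$ loop, and the Breslow-style handling of tied times are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def neg_log_partial_likelihood(scores, t, delta):
    """Negative log Cox partial likelihood.

    scores: log relative hazard of each individual (e.g. beta @ x_i).
    t: recorded times; delta: 1 if the event was observed, 0 if censored.
    """
    scores = np.asarray(scores, dtype=float)
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    nll = 0.0
    for i in np.flatnonzero(delta == 1):      # only observed events enter the numerator
        at_risk = t >= t[i]                   # risk set: event not yet seen at t[i]
        log_denominator = np.log(np.sum(np.exp(scores[at_risk])))
        nll -= scores[i] - log_denominator
    return nll

# Toy check: with all scores equal, each term is log(size of the risk set).
nll = neg_log_partial_likelihood([0.0, 0.0, 0.0], t=[1.0, 2.0, 3.0], delta=[1, 1, 0])
```

In the toy call the third individual is censored, so it appears only in the risk sets of the first two and contributes no term of its own.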
The most common form is $f_\theta(x_i) = \beta^T x_i$, where $\beta$ is a vector of parameters controlling the dependence between the features in $x_i$ and the target $t_i$. Doing so assumes a linear scaling of the relative (log) hazards of different individuals with respect to the values of the features. Ridgeway (1999) proposed that the likelihood in Equation 6 can alternatively be optimized directly using gradient boosting methods, which might provide benefits in scenarios where the effect of the features is nonlinear. Note that this is still a Proportional Hazards model, but with $f$ taken to be the output of a gradient boosting machine (GBM).

2.2. Mixture Model with Cox Proportional Hazard Regression
The CoxPH model assumes that all individuals will eventually experience the event. But there may be a proportion of individuals who are not prone to the event, i.e., who are not predisposed to opening emails. The level at which an individual user is engaged with marketing messages influences his/her act of opening the email (and how quickly). The CoxPH model described earlier tries to explain all the observations using only the features ($x_i$) as the explanatory factors. Through the use of mixture models (Farewell, 1982; Branders et al., 2015), we might expect to get more discriminatory power. The individual is now represented as $(x_i, w_i, t_i, \delta_i, z_i)$, where $z_i$ is a latent indicator variable such that

$$z_i = \begin{cases} 1 & \text{if individual } i \text{ is prone to the event,} \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$

$w_i$ is a set of features that help predict if an individual is prone to the event of interest or not. The feature set $w_i$ can also be the same as the feature set $x_i$.

$$\pi_i = \Pr(z_i = 1 \mid w_i) \tag{8}$$

The probability $\pi_i$ is estimated using logistic regression here, and is introduced as a mixture probability into the overall survivor function:

$$S(t \mid x_i, w_i) = \pi_i\, S(t \mid x_i, z_i = 1) + (1 - \pi_i). \tag{9}$$

If the individual is predisposed to not experiencing the event, then $\pi_i \approx 0$, leading to a prediction of a survival probability close to 1. Conversely, a scenario with $\pi_i \approx 1$ leads to the first term dominating, with the quantity $S(t \mid x_i, z_i = 1)$ representing the survival probability in the traditional sense. A proportional hazards assumption can be encoded by setting $h(t \mid x_i, z_i = 1) = h_0(t) \exp(f(x_i))$ as before. The likelihood of the model is given by:

$$L = \prod_{i=1}^{N} \left[\pi_i\, h(t_i \mid x_i, z_i = 1)\, S(t_i \mid x_i, z_i = 1)\right]^{\delta_i} \left[(1 - \pi_i) + \pi_i\, S(t_i \mid x_i, z_i = 1)\right]^{1 - \delta_i}. \tag{10}$$
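To illustrate how the mixture probability enters the survivor function and the likelihood, here is a small sketch. It assumes the standard cure-model form in which an observed event contributes (probability of being prone) × hazard × survival, and a censored row contributes the mixed survival probability. The function names and toy numbers are hypothetical, and all inputs are assumed to be strictly positive.

```python
import numpy as np

def mixture_survival(pi, s_prone):
    """Overall survival: prone individuals follow s_prone, the rest never experience the event."""
    return pi * s_prone + (1.0 - pi)

def mixture_log_likelihood(pi, hazard, s_prone, delta):
    """Log-likelihood of a two-component mixture (cure) model.

    pi: probability of being prone to the event, per individual.
    hazard, s_prone: hazard and survival of the prone component at the recorded time.
    delta: 1 if the event was observed, 0 if censored.
    """
    pi = np.asarray(pi, dtype=float)
    hazard = np.asarray(hazard, dtype=float)
    s_prone = np.asarray(s_prone, dtype=float)
    delta = np.asarray(delta, dtype=int)
    observed = np.log(pi) + np.log(hazard) + np.log(s_prone)
    censored = np.log(mixture_survival(pi, s_prone))
    return float(np.sum(np.where(delta == 1, observed, censored)))

# Toy values: one observed open and one censored email.
ll = mixture_log_likelihood(pi=[0.5, 0.5], hazard=[0.1, 0.1],
                            s_prone=[0.8, 0.8], delta=[1, 0])
```

With `pi` equal to 0 the survival probability is 1 regardless of the prone-component curve, and with `pi` equal to 1 the model reduces to the ordinary survivor function.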
Since there are latent variables (the indicators $z_i$), the optimization is an Expectation Maximization based iterative procedure that estimates the $z_i$, along with the logistic regression parameters (for calculating the mixture probability $\pi_i$) and $\theta$ controlling how the features of an individual affect the relative hazards. In the current setting, we are interpreting $\pi_i$ as the engagement level of a given user $i$; the model, however, is more general. For example, it can be used to represent the probability that a patient has been cured, which in turn affects the chances that he/she will experience the event.

2.3. Related Work
Survival analysis has traditionally been used in the healthcare domain to determine the time to ‘death’ in patients, but the usage of this range of techniques has recently expanded to other application areas (Wang et al., 2017). Examples include prediction of early student dropouts (Ameri et al., 2016), post-click engagement on native ads (Barbieri et al., 2016), query-specific microblog ranking for improved retrieval (Efron, 2012), recommender systems in e-commerce (Wang and Zhang, 2013), search engine evaluation via the use of "absence time" (Chakraborty et al., 2014), and predicting time for crowdsourced tasks (Lease et al., 2011).
By appropriately defining the event being modeled, existing marketing concepts also lend themselves to survival analysis techniques. For example, repurchasing behavior is an indicator of high engagement (Lee et al., 2012) and a proxy for the potential value of a customer (Drye et al., 2001; Lu and Park, 2003). Attrition modeling helps businesses identify customers who are most at-risk so that attempts can be made to keep them in the system, and Lee et al. (2012) propose a survival analysis based solution.
Much of the literature referred to above involves applying well-known and established models (like CoxPH) in different scenarios. But more recently, growing interest in the use of survival analysis has led to modeling improvements. For instance, when modeling the time-to-event of related tasks, the parameters of the different models can be more reliably estimated using regularization techniques commonly used in multi-task learning (Li et al., 2016). Even in traditional application areas of survival analysis, given a large number of data points and a variety of features that potentially have a highly nonlinear dependence on the time-to-event, deep latent models provide better performance (Ranganath et al., 2016).
The closest related work to that presented here is described in Dave et al. (2017), where time-to-event is modelled in the email domain. Given this context, the contribution of the current paper is twofold: (1) we describe techniques from the rich history of survival analysis to identify those models whose assumptions are better matched with the characteristics of the data, and (2) for the application of predicting time-to-event when the censored rows dominate, the mixture model (MM) described above is shown to not only describe the data better but also provide better predictive performance.
3. Problem Definition and Data Description
When emails containing time-sensitive information are sent, it may be relevant for the sender to know the expected time within which the email will be read by the recipient. Specifically in marketing messages, if the email advertises a flash sale, the marketer will need to decide on the time window for the sale, trading off reaching sufficient consumers within the window against keeping the sale exclusive. Prediction of the time-to-open of an email by a consumer helps to determine the size of the recipient list one wants to reach.
Our dataset corresponds to email marketing campaigns that are sent out to consumers of an enterprise and we are interested in a predictive model that answers questions of two types: (a) Is a particular email likely to be opened by a given recipient? (b) Can we predict the time within which the email will be opened?
In the dataset, there is a high degree of variability amongst the marketing messages: some are sent to a large group of recipients, while others are targeted at a narrow set of consumers, e.g. a personalized birthday communication. We expect that the nature of people’s interaction with these different types of emails varies drastically. In particular, we are interested in modeling how people differ in terms of their engagement with the mass marketing emails. For this reason, the analysis presented here includes only those emails that were sent to at least of the total consumers. We have additionally dropped those consumers who received fewer than messages during the period of interest.
The time at which an email reaches a consumer is labelled as its start-time. In the event that the email is read, the email has a corresponding open-time. The difference between the two timestamps is referred to as the time-to-open. The emails are divided into 3 non-overlapping buckets based on the start-time: a Training dataset (spanning 4 weeks) and one dataset each for Validation and Test (spanning 3 weeks each). Table 1 shows the size of each of these datasets. Chronologically ordered, these 3 datasets cover 13 weeks of email messages with a week gap between Validation and Test. Within each group, data from the initial two weeks are used to compute features that will be used to model users’ interaction with emails sent in the subsequent week(s).
Table 1. Size of each dataset.

Dataset     #Recipients (millions)  #Emails (millions)
Training    2.05                    31.86
Validation  2.04                    22.73
Test        2.22                    19.14
3.1. Survival and Censoring
In the context of email messaging, the opening of an email is the event and the survivor function at time $t$ provides the probability that the email will not be opened by this time. For the purposes of the results presented here, the data was gathered by monitoring the emails for 10 days after they were sent. Figure 2 shows the distribution of the time-to-open in our Training dataset. Note that the plot accounts for only of the total data points.
In applications of survival models, the size of the censoring window is dictated by when the analysis needs to be performed or when information stops being available about the individuals. The choice of censoring window should include as many of the eventual opens as possible. Amongst the emails that were opened within the observed 10 days, happened within the first hours, with the corresponding numbers for and hours being and respectively. Beyond hours, the opens become very rare.
With hours for each individual , over half the eventual opens would be lost; this is undesirable. On the other hand, there are practical limitations to being able to track and monitor an email campaign for extended periods of time. In addition, for time-sensitive messages, the censoring window is the time within which the message needs to be opened by the recipients to ensure that the promotion is valid. Given the observations above about the data and the constraints imposed by the domain, in the current paper we evaluate three different censoring windows (3, 6 and 12 hours), with a preference for the higher values. For a given choice of censoring window, the emails that were not opened by their recipients within the time period are considered censored.
Given that all individuals are synchronized at $t = 0$ (Figure 1), the censoring windows $C_i$ across all individuals will be the same for a given choice of censoring window. Note that when the censoring window changes, the training data provided to the survival models changes (via Equation 3). Therefore, identifying the right censoring window for the given application requires us to build a given model multiple times, once for each value of the censoring window.
Across the dataset, we can construct a histogram of the cumulative fraction of individuals that survived (i.e., emails that are yet to be opened) up until a given time. The Kaplan-Meier estimator (Kaplan and Meier, 1958) constructs a smooth nonparametric approximation to this histogram. In Figure 4, the curve corresponding to “Combined” represents the Training dataset; the significance of the other data in the plot will be explained in Section 3.2. We can see how the curve has a survival rate of at around hours. This represents the fact that only of the users have actual time-to-open data for the censoring window of three hours (as also seen in Figure 2). This high degree of censorship is a unique characteristic of our dataset and represents one of the challenges in terms of the traditional use of survival models.
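For readers who want to see the estimator spelled out, the following is a minimal sketch of the Kaplan-Meier product-limit computation on recorded times and event indicators. The function name `kaplan_meier` and the toy data are illustrative assumptions; a production analysis would more likely rely on an established survival library.

```python
import numpy as np

def kaplan_meier(t, delta):
    """Product-limit estimate of the survivor function.

    Returns the distinct observed event times and the estimated survival
    probability just after each of them.
    """
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    event_times = np.unique(t[delta == 1])
    surv = []
    s = 1.0
    for u in event_times:
        n_at_risk = np.sum(t >= u)                   # still unopened just before u
        n_events = np.sum((t == u) & (delta == 1))   # opens occurring exactly at u
        s *= 1.0 - n_events / n_at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: two opens (at times 1 and 2) and three censored emails.
km_times, km_surv = kaplan_meier([1, 2, 2, 3, 3], [1, 1, 0, 0, 0])
```

When most rows are censored, as in the dataset described here, the estimated curve flattens at a high survival probability instead of approaching zero.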
3.1.1. Baselines:
We consider the historical open rate of an individual as the baseline for predicting his/her future open events. We also include a Logistic Regression (LR) model as a classification baseline in our experiments. The auxiliary problem of predicting the time to the open event is not as straightforward. As a simplistic baseline, we consider the censoring time as the predicted time to open for all individuals (B). Given that the majority of rows are censored, this constant is taken as the median of the survival-time values, with the mean being only marginally lower. We also considered a linear regression (LR) model to predict time to open as a baseline in our experiments, which excluded all individuals who had censored information from the model (Burke et al., 1997; Delen et al., 2005). Here, with slight abuse of notation, we denote both logistic and linear regression baselines by LR. The distinction is made clearly while presenting results.

3.2. Testing Model Assumptions
In this section, we describe statistical analyses that validate the choice of models in our experimental section. Every model represents a corresponding set of statistical assumptions, and we need to verify that the data aligns with the assumptions of the models used.
In the original CoxPH model, the hazard of an individual at a time $t$ is a product of two quantities: the baseline hazard and the individual’s relative hazard (see Equation 4). The baseline hazard is a global quantity that incorporates the dependence on time. The relative hazard of an individual is expressed as a parameterized function of features and is independent of time. Verifying the applicability of a proportional hazards model equates to validating that the features are independent of time. To test this independence, we take all observed (i.e., not censored) events and, for each such individual $i$, calculate the scaled Schoenfeld residual (Grambsch and Therneau, 1994) for each feature $k$ as follows:

$$r_{ik} = x_{ik} - \bar{x}_k(t_i), \tag{11}$$

where $x_{ik}$ is the value of the $k$-th feature and

$$\bar{x}_k(t_i) = \frac{\sum_{j \in R(t_i)} x_{jk} \exp(f(x_j))}{\sum_{j \in R(t_i)} \exp(f(x_j))}. \tag{12}$$
In the equation above, $t_i$ represents the time at which the event occurred for individual $i$. And as before, the risk set $R(t_i)$ represents those individuals who have not yet experienced the event until $t_i$. Plotting the residual values as a function of the event times allows us to validate the proportionality assumption, the check being that the residuals have no trend over time. Figure 3 plots the scaled Schoenfeld residuals calculated using the cox.zph function in the survival package in R. For the sake of clarity, only a sample of data points is displayed in each plot. Also included in the plot are the corresponding p-values for the null hypothesis test, that the sum of residuals for the corresponding feature across individuals is zero. This procedure allows us to ensure that the features included in the model satisfy the null hypothesis (i.e., the proportionality assumption) at a chosen significance level; the features illustrated above have p-values greater than .

Note that the above procedure utilizes the parameters from a previously fitted CoxPH model. A similar procedure can be used for the GBM (where the dependence between the features and the hazard is nonlinear) as well as the Mixture Model (MM). All the models considered in the current paper are based on the idea of proportional hazards, hence the relevance of such a test.
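The residual computation can be sketched in a few lines. The snippet below computes unscaled Schoenfeld residuals for a linear relative hazard with already-fitted coefficients. It deliberately omits the covariance-based scaling step that cox.zph applies, and the function name, argument names and toy data are illustrative assumptions.

```python
import numpy as np

def schoenfeld_residuals(X, t, delta, beta):
    """Unscaled Schoenfeld residuals for a linear Cox model.

    For every observed event, the residual is the individual's feature vector
    minus the hazard-weighted mean feature vector over the risk set.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=int)
    weights = np.exp(X @ np.asarray(beta, dtype=float))   # relative hazards
    rows = []
    for i in np.flatnonzero(delta == 1):
        risk = t >= t[i]
        expected = (weights[risk, None] * X[risk]).sum(axis=0) / weights[risk].sum()
        rows.append(X[i] - expected)
    return np.array(rows)

# Toy data: one binary feature, three individuals, the last one censored.
residuals = schoenfeld_residuals(X=[[1.0], [0.0], [0.0]], t=[1.0, 2.0, 3.0],
                                 delta=[1, 1, 0], beta=[0.0])
```

Plotting each column of `residuals` against the corresponding event times and looking for a trend mirrors the visual check described above.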
We note two popular extensions available from the survival analysis literature that were excluded from our experiments because they did not match the characteristics of our scenario:

(1) Parametric distributions, like the Weibull, are common choices for modeling event data. However, these were not appropriate here due to the multimodal nature of the time-to-open; that is, a large portion of the recipients are not prone to opening the message.

(2) There exist extensions to the CoxPH model that are capable of utilizing time-dependent features. Given the nature of email data, where most messages that will be opened are opened within a relatively small period, obtaining updated predictions at short intervals due to changes in time-dependent feature values is not practical.

Mixture models (MM) are applicable to scenarios where there are sub-populations with different characteristics. The MM estimates the probability of each individual belonging to one of the two latent states, those who will not open and those who may open the messages. The mixture probability in Equation 9, therefore, represents how engaged the individual is with the email messages. A proxy method to test the existence of groups with differing engagement behaviors is to use a logistic regression model to classify the recipients into two groups (prone to opening their emails versus not) and plot their survivor functions, as done in Figure 4. Statistical differences between the two groups can then be validated by a log-rank test, where the ranks of the survival times in the two groups are compared.

3.3. Procedure for Predicting Time-to-Event
The application scenarios discussed in the Related Work section, as well as the work described here, all utilize the main strength of survival analysis: modeling time-to-event data as a function of features in the presence of censoring.
To predict the time-to-event for an individual (say $i$), we first calculate the hazard function $h_i(t)$ for that individual. This is obtained by multiplying the estimate of the global baseline hazard, found using the Breslow estimator (Lin, 2007), with the relative hazard of that individual [$\exp(f(x_i))$]. The individual’s survival curve is then constructed as $S_i(t) = \exp(-H_i(t))$, where $H_i(t)$ is the cumulative sum of $h_i(t)$ over time $t$. This survivor function will be similar to the one shown in Figure 5a, but instead of characterizing the population, it would be for the individual. In the survivor function, a natural estimate of time-to-open would be to use the median survival time, typically referred to as , which corresponds to the time when the survival probability is 0.5. Given that the observed overall survival rates are quite high in the current scenario, due to the fact that most marketing emails are never opened, the median might not be defined if the survival curve flattens above 0.5 on the Y-axis, as in Figure 5b. Thus, the concept is generalized to the notion of a percentile such that the predicted time-to-open an email by the individual, is the largest for which for a given percentile . Therefore, we can have and . Figure 5 illustrates how a time corresponding to a given survival probability threshold is identified from a survival curve. In the current paper, we evaluate a range of percentiles as predictors of time-to-open for future emails to that recipient, and all predictions of time are made at the scale of minutes.
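The prediction procedure can be sketched as follows, on a discrete time grid. The sketch assumes the rule is read as "the largest grid time at which the individual's survival probability is still at or above a chosen threshold". Both that reading of the inequality and the names `individual_survival` and `percentile_time` are assumptions made for illustration, and the cumulative baseline hazard values are made up.

```python
import numpy as np

def individual_survival(cum_baseline_hazard, score):
    """Survival curve exp(-H_0(t) * exp(score)) evaluated on a time grid."""
    return np.exp(-np.asarray(cum_baseline_hazard, dtype=float) * np.exp(score))

def percentile_time(times, surv, threshold):
    """Largest grid time whose survival probability is still >= threshold."""
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    still_above = surv >= threshold
    if not still_above.any():
        return float(times[0])
    return float(times[still_above].max())

# Toy grid in minutes for a 3-hour censoring window.
times = np.array([0.0, 60.0, 120.0, 180.0])
H0 = np.array([0.0, 0.1, 0.2, 0.3])     # made-up cumulative baseline hazard
surv = individual_survival(H0, score=0.0)

median_like = percentile_time(times, surv, threshold=0.5)    # curve never drops to 0.5
earlier = percentile_time(times, surv, threshold=0.85)
```

In this toy curve the survival probability never falls to 0.5, so the median-style prediction collapses to the end of the censoring window, while a higher threshold yields an earlier predicted time.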
4. Experiments and Results
In the earlier sections, we have described how the use case of modeling email interaction data maps to concepts in the area of survival analysis. In the current section, we evaluate the performance of a set of models on the two tasks of interest: (a) predicting the likelihood of a marketing message being opened by a recipient, and (b) estimating the time at which this event is likely to occur.
Table 2 contains the models evaluated in the current paper. The baseline for the classification task is the historical open rate of the corresponding recipient, and the baseline for the prediction of time-to-open is a constant equal to the censoring window. Logistic Regression is used for the "open vs not" classification task and Linear Regression is used for the time to open of emails. Models were trained by adding an Elastic Net (Zou and Hastie, 2005) regularization penalty to the corresponding likelihoods. Values of the hyperparameters $\lambda$ (the strength of the regularization) and $\alpha$ (trading off L1 vs L2) were chosen via the use of the Validation dataset. The equivalent parameters of the GBM (number of trees, learning rate and minimum number of observations for a leaf node) were similarly chosen.
Features:
The set of features available to us are aggregates of actions that a user can take on emails, for example the number of messages received, opened and clicked, and the open rate in the past. We also included campaign-specific information, like whether the last message was opened and/or clicked, and consumer-specific features, like how long the consumer has been signed up for receiving these email messages. In the current paper, we primarily focused on the methodology, instead of identifying an expansive set of features. If a particular feature has good predictive signal, we expect all the models described here to benefit from it.
Table 2. Models evaluated.

B     Baselines
LR    Logistic Regression (Classification) / Linear Regression (Time to Open)
CPHL  CoxPH Model with relative hazard $\exp(\beta^T x_i)$
CPHG  CoxPH Model with relative hazard from a GBM
MM    Mixture Model with Proportional Hazards
4.1. Evaluation Metrics
For the binary prediction of whether the email will be opened within the censoring window, we calculate the area under the ROC curve (AUC) for the probabilities calculated from the survivor function. For prediction of the time to open an email, we define the Mean Relative Absolute Deviation (MRAD) between the actual value, $t_i$, for the time-to-open and the predicted value $\hat{t}_i$, given by

$$\mathrm{MRAD} = \frac{1}{N} \sum_{i=1}^{N} \frac{|t_i - \hat{t}_i|}{t_i}. \tag{13}$$

When we calculate the MRAD over all the individuals in the dataset, we refer to the resulting metric as MRAD(A). We also define MRAD(O) as the version of the metric calculated only over the observed individuals, i.e., all the rows for which $\delta_i = 1$. For both the metrics, $\hat{t}_i$ is estimated from the individual's survival curve at a given percentile, as described in the previous section.
We compared the models shown in Table 2 on the basis of AUC for classifying which recipients will open the messages and MRAD for the time to open the messages. The higher the AUC, the better the classifier; a lower MRAD indicates less deviation between the observed and the predicted values. We now describe different sets of experiments evaluating various aspects of the models.
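A short sketch of the two MRAD variants follows. It assumes the deviation is normalized by the actual recorded time, which is our reading of "relative" here; the function name `mrad` and the toy values are illustrative.

```python
import numpy as np

def mrad(actual, predicted, delta=None, observed_only=False):
    """Mean relative absolute deviation between actual and predicted times.

    observed_only=False gives the all-rows variant, MRAD(A);
    observed_only=True restricts to rows with delta == 1, MRAD(O).
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    relative = np.abs(actual - predicted) / actual   # assumes actual times are > 0
    if observed_only:
        relative = relative[np.asarray(delta, dtype=int) == 1]
    return float(relative.mean())

# Toy example in minutes: one open after 10 minutes, two censored at 180 minutes,
# scored against the constant baseline that always predicts the window length.
actual = [10.0, 180.0, 180.0]
predicted = [180.0, 180.0, 180.0]
delta = [1, 0, 0]
mrad_all = mrad(actual, predicted)
mrad_observed = mrad(actual, predicted, delta, observed_only=True)
```

Because censored rows are predicted exactly by the constant baseline, they pull the all-rows value down, which is consistent with MRAD(A) being much smaller than MRAD(O) in the results.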
4.2. Comparison of Models Across Censoring Windows
Table 3. Comparison of models across censoring windows.

Censoring window  Model  AUC    MRAD(A)  MRAD(O)
3 hours           B      0.863  1.226    26.641
3 hours           LR*    0.931  1.332    8.411
3 hours           CPHL   0.931  1.085    11.953
3 hours           CPHG   0.932  0.941    12.217
3 hours           MM     0.929  0.483    9.499
6 hours           B      0.870  2.504    40.602
6 hours           LR*    0.939  1.653    11.706
6 hours           CPHL   0.939  1.835    23.245
6 hours           CPHG   0.940  1.372    14.788
6 hours           MM     0.938  0.678    9.832
12 hours          B      0.878  5.079    62.740
12 hours          LR*    0.948  2.332    17.501
12 hours          CPHL   0.948  1.707    19.831
12 hours          CPHG   0.949  1.572    28.978
12 hours          MM     0.948  1.318    15.657

*AUC results for LR correspond to Logistic Regression. MRAD(A) and MRAD(O) results for LR correspond to Linear Regression.
We compared the different models across three different censoring windows (3, 6 and 12 hours), using the predicted time-to-open calculated at . Section 4.3 explores the effect of on both AUC and MRAD for the various censoring windows. Table 3 summarizes the comparisons, and reports the results for the best configuration of parameters on the Validation set. The following observations can be made from Table 3:

(1) The separate models for each task, Logistic Regression for classification and Linear Regression for predicting time, are very compelling baselines. The benefit of jointly modeling both tasks via survival models becomes more obvious at the larger censoring windows. As discussed earlier, with hours, of the actual open events are considered censored by the survival models. The noise so introduced reduces with larger choices of the censoring window, allowing MM in particular to perform well.

(2) Analyzing the importance of the features in our models, we find that the historical open rate is, as expected, the strongest. AUC numbers when using the open rate as a predictor of a user’s future open actions have also been included (B).

(3) CPHG models nonlinear relations between the features and the hazard function, and as a classifier is stronger than the other models.

(4) The assumptions of MM represent the characteristics of our data most accurately. We see that the AUC is similar to that of CPHL. Note, however, that the predictions of time provided by MM are more accurate than the alternatives at larger censoring windows.

(5) The MRAD(A) is much lower than the MRAD(O) as it considers the accuracy of predicting the time-to-open of the censored individuals as well, which the models do well, as evidenced by the high AUC values.

(6) When moving through the censoring window options (3, 6 and 12 hours), the classification task generally gets easier, as indicated by the increasing AUC values. This is because the set of observed individuals gets more complete, and separating these from the individuals that are unlikely to open the email is what the model needs to do.

An additional observation from these experiments concerns the validation metric. Across the different hyperparameter settings, the model that had the highest AUC on the Validation set was chosen, instead of that with the lowest MRAD. The choice of the validation metric dictates the future performance of the model: a setting chosen on the basis of higher AUC does not necessarily achieve a lower MRAD on held-out datasets compared to a parameter combination driven by MRAD. This observation indicates that there is an implicit need to pick one of the tasks as being of higher priority; here we have chosen the classification task to have a higher priority. Also, for the rest of the experiments in this paper, to compare models for prediction of time to open, we consider MRAD(O) instead of MRAD(A). This is because AUC already provides a comparison for classification, and MRAD(O) helps to compare specifically on those who have opened the messages.
4.3. Model Performance Across Survival Thresholds
One of the characteristics of the dataset utilized in the current paper is the high percentage of censored observations. This manifests itself in the fact that the survival curves for individuals routinely flatten well before they reach , an illustration of this at the population level can be seen in Figure 5. Thus, across the different settings and models evaluated here, using a large value of for provides sufficient fidelity for predicting the probability of opening the email.
Table 4 describes how the MRAD(O) changes for all the models as we alter the percentile . The column titles use the conventional notation , the equivalent survival probability can be worked out readily. corresponds to the largest time when . Note that the AUC values remain constant by design, and therefore are not displayed. In general, as is reduced, since we are operating in the flatter regions of the survival curves, all become equal to the duration of the censoring window. The point at which the MRAD(O) values appear to reach the maximum is indicative of where the corresponding saturation region is in the survival plot. We would expect the corresponding MRAD(O) values to approach that of the baseline in Table 3.
In that sense, MM is most closely modeling the data  there is of the population that never opens their messages and the survival curve for MM arrests its drop around percentile, with maximum MRAD(O) observed for all time points after that. The survival curve for CPHL on the other hand seems to gradually go towards the same maximum MRAD(O), around . The survival curve for CPHG is in between the MM and CPHL.
Censoring Window  Model  MRAD(O)
3 hours           CPHL   11.952  12.608  13.929  16.744  23.462  25.716
                  CPHG   12.217  14.593  18.501  25.407  26.629  26.641
                  MM      9.499  26.641  26.641  26.641  26.641  26.641
6 hours           CPHL   23.245  17.456  18.857  19.870  27.041  34.542
                  CPHG   14.788  19.588  26.233  30.970  38.102  40.602
                  MM      9.832  12.483  40.602  40.602  40.602  40.602
12 hours          CPHL   19.831  34.632  27.229  32.510  40.356  40.750
                  CPHG   28.978  31.866  29.986  34.590  57.086  61.040
                  MM     15.657  21.705  62.545  62.740  62.740  62.740
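The saturation behaviour in this table can be sketched in code. The function below is an illustration under stated assumptions: the survival curve is given on a discrete time grid, the percentile time is taken as the largest grid time at which the survival probability is still at or above a threshold `q`, and the censoring window duration is returned whenever the curve flattens above `q`. The name `percentile_time`, the grid, and the curve values are hypothetical.

```python
def percentile_time(times, survival, q, censoring_window):
    """Largest grid time t with S(t) >= q, or the censoring window if S never drops below q.

    times: increasing time points (hours); survival: matching non-increasing S(t) values.
    """
    if survival[-1] >= q:
        # The curve has flattened above q, so the prediction saturates at the window.
        return censoring_window
    eligible = [t for t, s in zip(times, survival) if s >= q]
    if not eligible:
        # The curve is already below q at the first grid point.
        return times[0]
    return max(eligible)


# Hypothetical curve for a 3-hour censoring window that flattens at 0.85.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
survival = [1.0, 0.95, 0.9, 0.87, 0.85, 0.85, 0.85]

for q in (0.95, 0.9, 0.8, 0.5):
    print(q, percentile_time(times, survival, q, censoring_window=3.0))
```

Every threshold below the plateau of the curve yields the same prediction, namely the censoring window, which is why the MRAD(O) values stop changing once the flat region is reached.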
4.4. Measuring Model Stability
In this section we test the sensitivity of our models, and of the corresponding fitting processes, to variations in the data. We construct an experiment where we take bootstrap samples of the Training set, fit each model in turn to the datasets generated, and calculate the AUC and MRAD(O) metrics for the various models on the Validation set. The hyperparameter values used were the optimal parameters for defining classification at , same as for the results reported in Table 3. The mean and standard deviation of the metrics across the samples, for each of the four models, are shown in Table 5.

Censoring Window  Model  AUC Mean  AUC StdDev  MRAD(O) Mean  MRAD(O) StdDev
3 hours           LR*    0.931     4e-5         8.215        0.036
                  CPHL   0.931     2e-5        13.579        1.913
                  CPHG   0.929     8e-3        13.746        0.743
                  MM     0.929     3e-4         9.277        1.854
6 hours           LR*    0.939     4e-5        11.651        0.079
                  CPHL   0.939     1e-5        20.301        2.486
                  CPHG   0.939     3e-4        19.208        0.798
                  MM     0.938     2e-4         9.753        1.194
12 hours          LR*    0.948     4e-5        17.514        0.096
                  CPHL   0.948     7e-5        38.143        3.333
                  CPHG   0.949     2e-4        29.455        1.071
                  MM     0.948     2e-4        15.444        1.683

*AUC results for LR correspond to Logistic Regression and MRAD(O) results for LR correspond to Linear Regression.
This sampling procedure allows us to obtain proxy estimates for the confidence intervals of each metric via the standard deviation. Such estimates are required if we are to order the different models reliably. We note that AUC in general has very low variability, and therefore the ordering of the models on the basis of AUC is stable. MRAD(O), on the other hand, shows more sensitivity to input variations. This offers another reason why AUC might be a more appropriate metric for validation and model selection.
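The resampling procedure can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the number of bootstrap samples (`n_samples=20`), the toy "model" (the mean of the training targets), and the toy metric (mean absolute error) are all hypothetical stand-ins for the actual models and for AUC and MRAD(O).

```python
import random
import statistics


def bootstrap_metric(train, validation, fit, score, n_samples=20, seed=0):
    """Mean and standard deviation of a validation metric over bootstrap refits."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        # Resample the training rows with replacement, keeping the original size.
        sample = [rng.choice(train) for _ in train]
        model = fit(sample)
        values.append(score(model, validation))
    return statistics.mean(values), statistics.stdev(values)


# Toy stand-ins (hypothetical): the "model" is the mean training target,
# and the metric is the mean absolute error on the validation targets.
def fit_mean(rows):
    return statistics.mean([y for _, y in rows])


def mean_abs_error(model, rows):
    return statistics.mean([abs(y - model) for _, y in rows])


train = [(i, float(i % 7)) for i in range(200)]
validation = [(i, float(i % 5)) for i in range(50)]

mean_val, std_val = bootstrap_metric(train, validation, fit_mean, mean_abs_error)
print(round(mean_val, 3), round(std_val, 3))
```

The validation set is held fixed across refits, so the reported standard deviation reflects only the variation induced by resampling the training data.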
The difference between the two metrics can be attributed to the fact that AUC is a ranking metric aggregated over the entire list of consumers: even if the fitted model parameters differ slightly across the samples, AUC remains the same as long as the ordering that the model imposes over the dataset does not change appreciably. In contrast, MRAD compares the individual predicted and actual time-to-open, and is therefore expected to be more sensitive.
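This contrast can be made concrete with a small sketch. It assumes that MRAD is a mean of relative absolute deviations, |predicted - actual| / actual; the exact definition is given earlier in the paper and may differ. The scores, labels, and times below are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random opener is scored above a random non-opener (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrad(predicted, actual):
    """Assumed form of MRAD: mean of |predicted - actual| / actual."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.3, 0.4, 0.6]
squared = [s ** 2 for s in scores]  # different values, identical ordering

actual = [1.0, 2.0, 4.0]
pred_a = [1.2, 2.5, 3.0]
pred_b = [1.5, 2.9, 2.6]  # each prediction shifted slightly

print(auc(scores, labels) == auc(squared, labels))
print(mrad(pred_a, actual) < mrad(pred_b, actual))
```

Any monotone change to the scores leaves AUC untouched, whereas every per-person shift in the predicted time feeds directly into MRAD.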
When comparing across the models for a given censoring window, we find that MM provides the lowest mean MRAD(O) among the survival models, although its standard deviation is higher than that of CPHG.
4.5. Out-of-Time Model Evaluation
The final set of experiments verifies whether the model continues to be predictive into the future. To do this, we use the Test dataset that has been held aside up until now. We combine the Training and the Validation sets, and use this concatenated data for building the models. The parameters used as input are those chosen via the validation process. This includes not only the hyperparameters of the models (e.g. the regularization parameters of the Elastic Net based models), but also the choice of percentile to be used for predicting time (the one with the least MRAD(O) in Section 4.3 for each model).
Censoring Window  Model  AUC    MRAD(O)
3 hours           LR*    0.937   7.753
                  CPHL   0.937  10.009
                  CPHG   0.934  13.787
                  MM     0.935   7.381
6 hours           LR*    0.944  10.194
                  CPHL   0.944  23.227
                  CPHG   0.940  18.339
                  MM     0.942   9.340
12 hours          LR*    0.952  13.409
                  CPHL   0.952  29.131
                  CPHG   0.950  21.080
                  MM     0.951  11.653

*AUC results for LR correspond to Logistic Regression and MRAD(O) results for LR correspond to Linear Regression.
The results presented in Table 6 indicate that the performance does not degrade over time. For all the models, the AUC values are high. As was seen in Section 4.3, with the right choice of the percentile used for estimating the time-to-open, the Mixture Model obtains better prediction accuracy than the other models. Coupled with the fact that the model assumptions of MM align more closely with the characteristics of our data, we believe that designing better optimization strategies for this model would be a fruitful avenue to explore in future work.
5. Discussion and Conclusion
In this paper, we proposed the use of survival models for the problems of predicting whether a recipient will open an email and estimating the corresponding time-to-open. We have provided a step-by-step procedure to check model assumptions using statistical tests. These tests help to determine whether the models that we have compared are meaningful for the data that we have. We have then performed a thorough experimental comparison of the models. Since a large proportion of emails are not opened, the median time-to-open is not defined and hence cannot be used as the estimate of future time-to-open. We therefore tested our models for various percentiles and the percentile is found to be the best predictor for time-to-open.
Across our experiments, we observed that for the task of predicting whether an individual will open the email, a traditional classification-based model (here, Logistic Regression) is a strong baseline. Survival-analysis-based models achieve similar accuracy to the classifier when predicting the event. While linear regression is a strong baseline for predicting time when the censoring window is short, the proposed survival analysis performs best with the addition of new data when the censoring window is larger. Focusing on marketing email messages, we observe that a large fraction of emails do not get opened (even after days), owing to the fact that certain recipients are inclined to ignore such messaging. This characteristic is not commonly encountered in applications of survival analysis. Given data of this nature, we explored a mixture model that allows for individuals to have very low predispositions to experiencing the event (here, opening the email). This differential modeling of subpopulations was done in association with a Proportional Hazards assumption and was shown to be most effective. As a future extension to our work, we plan to derive a regularization framework for the mixture model, which promises to be more effective, especially when the model includes a large number of correlated features.
Acknowledgements.
We thank Frederic Mary in Adobe Product Management for his insights, and anonymous reviewers for their constructive feedback.