Despite the emergence of many new communication tools in the workplace, email remains a major, if not the dominant, messaging platform in many corporate settings [Agema2015]. Helping people manage and act on their emails can make them more productive. Recently, Google’s system for suggesting email replies has gained wide adoption [Kannan et al.2016]. We can imagine many other classes of assistance scenarios that could improve worker productivity. For example, consider a system capable of predicting your next action upon receiving an email. The system could then offer assistance to accomplish that action, for example in the form of a quick reply, adding a task to your to-do list, or helping you take an action in another system. To build and train such systems, email datasets are essential; unfortunately, public email datasets such as the Enron and Avocado corpora [Klimt and Yang2004, Oard et al.2015] are much smaller than the proprietary data used by Google and, more importantly, lack any direct information or annotation regarding the recipients’ actions.
In this paper, we design an annotation scheme for such actions and apply it to a corpus of publicly available emails. To overcome the data bottleneck for end-to-end training, we leverage other data and annotations that we hypothesize to contain structures similar to emails and recipient actions. We apply multitask and multidomain learning, which use task- or domain-invariant knowledge to improve performance on a specific task or domain [Caruana1997, Yang and Hospedales2014]. We show that these secondary domains and tasks, in combination with multitask and multidomain learning, help our model discover invariant structures in conversations that improve a classifier on our primary data and task: email recipient action classification.
Previous work in the deep learning literature has tackled multidomain/multitask learning by designing an encoder that encodes all data and the domain/task description into a shared representation space [Collobert and Weston2008, Glorot, Bordes, and Bengio2011, Ammar et al.2016, Yang, Salakhutdinov, and Cohen2017]. The overall model architecture is generally unchanged from the single-domain, single-task setting, but the learned representations are reparametrized to take account of knowledge from additional data and task/domain knowledge. In this work, we propose an alternative approach: model reparametrization. We train multiple parameter-sharing models across different domains and tasks jointly, without maintaining a shared encoded representation in the network. We show that reparametrized LSTMs consistently achieve better likelihood and overall accuracy on test data than common domain adaptation variants. We also show that the representation extracted from a network instantiated with the shared parameter weights performs well on a previously unseen task.
The contributions of this paper are:
First, we designed an annotation scheme for labeling actionable workplace emails, which, as we argue in section 2.2, is more amenable to an end-to-end training paradigm, and collected an annotated dataset. Second, we propose a family of reparametrized RNNs for both multitask and multidomain learning. Finally, we show that such models encode domain-invariant features and, in the absence of sufficient data for end-to-end learning, still provide useful features for scoping tasks in an unsupervised learning setting.
2.1 The Avocado Dataset
In this study, all email messages we annotate and evaluate on are part of the Avocado dataset [Oard et al.2015], which consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as “Avocado”. (We considered other email corpora, such as the Enron corpus [Klimt and Yang2004], but chose the Avocado dataset because it is the largest and newest one publicly available.) Email threads are reconstructed from the recipients’ mailboxes. For the purpose of this paper, we only use complete (the thread contains all replies) and linear (every follow-up is a reply to the previous email) threads. Summary statistics are in table 3.
2.2 Recipient Actions
Workplace email is known to be highly task-oriented [Khoussainov and Kushmerick2005, Corston-Oliver et al.2004]. In contrast to Internet chit-chat, speaker intents and the actions expected of email recipients are in general very precise. We aim to annotate these actions, which makes our approach differ in a subtle but important way from previous work such as [Cohen, Carvalho, and Mitchell2004], which mostly focused on annotating emails for sender intents, modeled after illocutionary acts in Speech Act theory [Searle1976]. We believe that annotating recipient actions has the following advantages over annotating sender intents. First, action-based annotation is not tied to a particular speech act taxonomy. The design of such a taxonomy is highly dependent on the system’s use cases [Traum1999], and definitions of sender intent can be circular [Riezler2014]. Even within a single domain such as email, there have been several different sender intent taxonomies [Goldstein and Sabin2006]. A speech-act-agnostic scheme that focuses on the recipient’s action generalizes better across scenarios. Our annotation scheme also has a lower risk of injected bias, because the annotation relies on expected (or even observed) actions performed in response to an email, rather than on the annotator’s intuition about the sender’s intent. Lastly, while in this paper we rely on annotators for these action annotations, many of our annotated actions translate into very specific actions on the computer. We therefore anticipate that intelligent user interfaces could be used to capture and remind users of such email actions, as in [Dredze et al.2008].
Based on our findings in two pilot runs of email annotations among the authors, we propose the set of recipient actions listed in table 1, which fall in three broad categories:
- Message sending
We find that in many cases, the recipient is most likely to send out another email, either as a reply to the sender or to someone else. As listed in table 1, Reply-Yesno, Reply-Ack, Reply-Other, Investigate, and Send-New-Email are actions that send out a new email, either on the same thread or on a new one.
- Software interaction
In our pilot study we find that some of the most likely recipient actions are interactions with office software, such as Setup-Appointment and Approve-Request.
- Share content
On many occasions, the most likely actions are to share a document, either as an attachment or via other means. We have an umbrella action Share-Content to capture these actions.
2.3 Data Annotation
| Action | Description |
| --- | --- |
| Reply-Yesno | Short yes/no reply to a question raised in the previous email. |
| Reply-Ack | Simple acknowledgements such as ‘got it’, ‘thank you.’ |
| Reply-Other | Reply to the thread based on information that is available without doing any additional investigation. |
| Investigate | Look into some questions/problems to gather the necessary information and reply with that information. |
| Send-New-Email | Write a new email that is not a reply to the current thread. |
| Setup-Appointment | Set up or cancel appointments. |
| Approve-Request | Approve requests (typically from subordinates) through an external system, such as an expense report system. |
| Share-Content | Share content, as an attachment, a link in the email body, or a location on the network that is known to both the sender and recipients. |

Table 1: The proposed set of recipient actions.
A subset of the preprocessed email threads described in section 2.1 is subsequently annotated. We ask each annotator to imagine that they are a recipient of threaded emails in a workplace environment. For each message, we ask the annotator to read through the previous messages in the thread and annotate it with the most likely action (from table 1) they would perform had they been the addressee of that message. If the most probable action is not in our list, we ask the annotators to annotate it with a catch-all Other action.
A total of emails from distinct threads have been annotated by two paid and trained independent annotators. Cohen’s Kappa is for the two annotators. The authors arbitrated the disagreements. We include the distribution across the actions in table 1.
| Domain | Example message |
| --- | --- |
| IRC | could somebody explain how i get the oss compatibility drivers to load automatically in ubuntu ? |
| IRC | you should try these ones , apt src deb __URL__ unstable/ |
| IRC | Ah , cool . Thanks , I ’ll try that . |
| Reddit | Does this really appeal to Sanders supporters ? Can one ( or more of you ) explain to me why ? Full disclosure : I do n’t pay ATM fees . |

Table 2: Example messages from the additional domains.
| Dataset name (type) | # of threads | # of messages | Average thread length | Average message length |
| --- | --- | --- | --- | --- |
| Ubuntu Dialog (IRC) | | | | |

Table 3: Summary statistics of the datasets.
2.4 Additional Domains
The annotations we collect are comparable in size to other speech-act-based annotation datasets. However, like other expert-annotated datasets, ours is not large enough for end-to-end training. Therefore, we aim to enrich our training with additional semantic and pragmatic information derived from other tasks and domains without annotations for expected actions. We consider data from the following additional domains for multidomain learning:
The Ubuntu Dialog Corpus is a curated collection of chat logs from Ubuntu’s Internet Relay Chat technical support channels [Lowe et al.2015].
Reddit is an internet discussion community consisting of several subreddits, each of which is more or less a discussion forum pertaining to a certain topic. We curate a dataset from the subreddit r/politics over two consecutive months. Each entry in our dataset consists of the post title, an optional post body, and an accompanying tree of comments. We collect linear threads by recursively sampling from the trees.
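The linear-thread collection can be sketched as a walk from the root post, choosing one reply at each level until a leaf is reached. This is a minimal sketch; the dictionary schema (`text`, `comments` keys) is an assumption for illustration, not the paper's actual data format:

```python
import random

def sample_linear_thread(post, rng=random):
    """Sample one linear thread (every message replies to the previous
    one) from a post's comment tree by recursively picking a child."""
    thread = [post["text"]]
    node = post
    while node.get("comments"):
        node = rng.choice(node["comments"])
        thread.append(node["text"])
    return thread

# Toy post with exactly two possible linear threads.
post = {"text": "title + body", "comments": [
    {"text": "reply A", "comments": [{"text": "reply to A", "comments": []}]},
    {"text": "reply B", "comments": []},
]}
```

Repeatedly sampling with different seeds yields a set of linear threads covering different root-to-leaf paths of the comment tree.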
Messages from IRC and Reddit are less precise in terms of speaker intents, and our recipient action scheme is not directly applicable to them. However, previous studies of speech acts in Internet forums and chatrooms have shown that there are speech acts common to all these heterogeneous domains, e.g. information requests and deliveries [Arguello and Shaffer2015, Moldovan, Rus, and Graesser2011]. Some such examples are listed in table 2. We hypothesize that more data from these domains will help recognize these speech acts, which in turn helps recognize the resulting recipient actions.
In all experiments in section 4, we use half of the dataset as training data, a quarter as the validation data and the remaining quarter as test data.
2.5 Metadata-Derived Prediction Tasks
The datasets introduced in sections 2.4 and 2.1 are largely unlabeled as far as recipient actions are concerned, except for the small subset of Avocado data that was manually annotated. However we can still extract useful information from their metadata, such as inferred end-of-thread markers or system-logged events that can help us formulate additional prediction tasks for a multitask learning setting (listed in table 4). We also use these multitask labels to evaluate our multitask/domain model in section 4.3.
| Label | Description |
| --- | --- |
| e-t | end of an email thread |
| e-a | this message has attachment(s) |
| r-t | end of a Reddit thread |

Table 4: Metadata-derived prediction tasks.
3 Modeling Threaded Messages
We model threaded messages as a two-layer hierarchy: at the lower layer, a message $m$ consists of a list of words $m = (w_1, \ldots, w_{|m|})$. And in turn, a thread $t$ is a list of messages $t = (m_1, \ldots, m_{|t|})$. We assume each message thread to come from a specific domain, and therefore define a many-to-one mapping $d(t) \in \mathcal{D}$, where $\mathcal{D}$ is the set of all domains. We also define the tasks $s$ to have a many-to-one mapping $d(s) \in \mathcal{D}$. For prediction we define the predictor of task $s$ as $f_s$, which predicts sequential tags $f_s(t)$ from a thread $t$ on (a valid) task $s$. We also define the real-valued task loss of task $s$ on thread $t$ to be $\ell_s(f_s(t), y)$, where $y$ is the ground truth.
3.2 Definition of Multitask/domain Loss
In this paper, we define the multitask loss as the sum of the task losses of all tasks under the same domain for a single (output, ground truth) pair:

$$L_{\mathrm{task}}(t, y) = \sum_{s \,:\, d(s) = d(t)} \ell_s(f_s(t), y_s),$$

and the aggregate loss of a domain $d'$ as the sum over its examples:

$$L_{d'} = \sum_{t \,:\, d(t) = d'} L_{\mathrm{task}}(t, y_t).$$

We also define the multidomain loss to be the sum of aggregate losses over all domains:

$$L = \sum_{d' \in \mathcal{D}} L_{d'}.$$
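The loss definitions can be written down directly. A minimal sketch, assuming each task loss is a callable and `tasks_of_domain` maps a domain to its tasks (both names are illustrative):

```python
def multitask_loss(thread, gold, domain, task_losses, tasks_of_domain):
    """Multitask loss: sum the losses of every task s with d(s) = d(t)."""
    return sum(task_losses[s](thread, gold[s]) for s in tasks_of_domain[domain])

def multidomain_loss(examples, task_losses, tasks_of_domain):
    """Multidomain loss: per-example multitask losses summed over all
    examples, and hence over all domains."""
    return sum(multitask_loss(t, y, d, task_losses, tasks_of_domain)
               for t, y, d in examples)
```

In a real trainer the task losses would be differentiable (e.g. cross entropy over network outputs); here they are stubs to show the aggregation structure.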
3.3 The Recurrent AttentIve Neural Bag-Of-Words model (Rainbow)
We start with the Recurrent AttentIve Neural Bag-Of-Words model (Rainbow) as the baseline model of threaded messages. From a high-level view, Rainbow is a hierarchical neural network with two encoder layers: the lower-level encoder is a neural bag-of-words encoder that encodes each message $m_i$ into its message embedding $e_i$. And in turn, the upper-level encoder transforms the independently encoded message embeddings $(e_1, \ldots, e_{|t|})$ into thread embeddings via a learned recurrent neural network. (There is a slight abuse of notation here, since the input actually differs for threads of different lengths.) Rainbow has three main components: a message encoder, a thread encoder, and a predictor.
We implement the message encoder as a bag-of-words model over the words in $m$. Motivated by the unigram features in previous work on email intent modeling, we also add an attentive pooling layer [Rush, Chopra, and Weston2015] to pick up important keywords. The averaged embeddings then undergo a nonlinear transformation:

$$e_m = g\Big(\sum_{w \in m} a(w)\, v_w\Big),$$

where $g$ is a learned feedforward network, $v_w$ is the word embedding of $w$, and $a$ is the (learned) attentive network that judges how much each word contributes towards the final representation $e_m$. (There may be concerns about the unordered nature of the neural bag-of-words (NBOW) model. However, it has been shown that with a deep enough network, an NBOW model is competitive against syntax-aware RNN models such as Tree-LSTMs [Tai, Socher, and Manning2015]. In preliminary experiments we did not find the difference between an NBOW and an RNN to be substantial, but the NBOW architecture trains much faster.)
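Attentive pooling reduces to a softmax-weighted average of word embeddings. A pure-Python sketch, where the scoring callable stands in for the learned attention network $a$ and the final feedforward $g$ is omitted:

```python
import math

def attentive_nbow(word_vecs, score):
    """Attention-weighted average of word embeddings: softmax the
    per-word scores, then mix the vectors with the resulting weights."""
    scores = [score(v) for v in word_vecs]
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(word_vecs[0])
    return [sum(w * v[i] for w, v in zip(weights, word_vecs))
            for i in range(dim)]
```

With a constant score, this degrades to the unweighted mean used in the -A ablation of section 4.2.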
Thread encoder and predictor.
The message embeddings are passed to the thread-level LSTM to produce a thread embedding vector:

$$h_t = \mathrm{LSTM}(e_1, \ldots, e_{|t|}).$$

Thread embeddings are then passed to the predictor layer. In this paper, the predictions are distributions over possible labels. We therefore define the predictor of task $s$ to be a $k$-layer feedforward network that maps thread embeddings to distributions over $\mathcal{Y}_s$, the label set of task $s$. The accompanying loss is naturally defined as the cross entropy between the prediction $f_s(t)$ and the empirical distribution $\hat{y}$:

$$\ell_s(f_s(t), y) = -\sum_{y' \in \mathcal{Y}_s} \hat{y}(y') \log f_s(t)(y').$$
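A single-layer stand-in for the $k$-layer predictor and its cross-entropy loss against a one-hot empirical distribution might look like this; the parameter names are illustrative:

```python
import math

def predictor(thread_emb, W, b):
    """Map a thread embedding to a distribution over a task's labels
    via one affine layer followed by a softmax."""
    logits = [sum(wi * x for wi, x in zip(row, thread_emb)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(pred, gold_label):
    """Cross entropy against a one-hot ground-truth distribution."""
    return -math.log(pred[gold_label])
```

A deeper predictor simply stacks more affine-plus-nonlinearity layers before the softmax.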
3.4 Multi-Task RNN Reparametrization
Rainbow is an extension of Deep Averaging Networks [Iyyer et al.2015] to threaded message modeling. It works well for tagging threaded messages for the messages’ properties, such as conversation-turn marking in online chats and end-of-thread detection in emails. However, in its current form, the model is trained to work on exactly one task. It also does not capture the shared dynamics of these different domains jointly when given out-of-domain data. In this section we describe a family of reparametrized recurrent neural networks that easily accommodates multi-domain multi-task learning settings.
In general, recurrent neural networks take a sequence of input data $(x_1, \ldots, x_n)$ and recurrently apply a nonlinear function to get a sequence of transformed representations $(h_1, \ldots, h_n)$. Here we denote such a transformation, parametrized by the RNN parameters $\theta$, as $F_\theta$. For an LSTM model, $\theta$ can be formulated as the concatenated vector of input, output, forget and cell gate parameters $\theta = [\theta_i; \theta_o; \theta_f; \theta_c]$. And in general, the goal of training an RNN is to find the optimal real-valued vector $\theta^* = \arg\min_\theta L(\theta)$, for a given loss function $L$.
In the context of multidomain learning, we parametrize eq. 1 in a similar fashion:
Here we are faced with two modeling choices (depicted in fig. 1(a)): we can either model every task Disjointly or with Tied parameters. The Disjoint approach learns a separate set of parameters $\theta_s$ per task $s$. Performance on a task is therefore little affected by data from other domains/tasks, except for the regularizing effect through the shared word embeddings.
On the other hand the Tied approach ties parameters of all domains to a single , which has been a popular choice for multitask/domain modeling — it has been found that the RNN often learns to encode a good shared representation when trained jointly for different tasks [Collobert et al.2011, Yang, Salakhutdinov, and Cohen2016]. The network also seems to generalize over different domains, too [Ragni et al.2016, Peng and Dredze2016]. However it hinges on the assumption that either all domains are similar, or the network is capable enough to capture the dynamics of data from all domains at the same time.
In this paper we propose an alternative approach. Instead of having a single set of parameters for all domains, we propose to reparametrize $\theta_d$ as a function of shared components $\theta_0$ and domain-specific components $\phi_d$. Namely:

$$\theta_d = h(\theta_0, \phi_d),$$

and our goal becomes minimizing the loss w.r.t. both $\theta_0$ and $\{\phi_d\}$:

$$\theta_0^*, \{\phi_d^*\} = \arg\min_{\theta_0, \{\phi_d\}} L(\theta_0, \{\phi_d\}).$$
A comparison between the vanilla RNN and our proposed modification can be found in fig. 1. This reparametrization allows us to share parameters among networks trained on data from different domains through the shared component $\theta_0$, while allowing the network to behave differently on data from each domain through the domain-specific parameters $\phi_d$.

The design of the function $h$ requires striking a balance between model flexibility and generalizability. In this paper we consider the following variants of $h$:
Additive (Add)

First we consider $h$ to be a linear interpolation of a shared base $\theta_0$ and a network-specific component $\phi_d$:

$$\theta_d = \theta_0 + \lambda_d \phi_d,$$

where $\lambda_d \in \mathbb{R}$. In this formulation (Add) we learn a shared $\theta_0$ and additive domain-specific parameters $\phi_d$ for each domain. We also learn $\lambda_d$ for each domain $d$, which controls how much effect $\phi_d$ has on the final parameters.

Both Disjoint and Tied can be seen as degenerate cases of Add: we recover Disjoint when the shared component is a zero vector, $\theta_0 = \mathbf{0}$, and with $\lambda_d = 0$ we have $\theta_d = \theta_0$, namely Tied.
Additive + Multiplicative (AddMul)
Add has no nonlinear interaction between $\theta_0$ and $\phi_d$: they have independent effects on the composite $\theta_d$. In AddMul we have two components in $h$: the additive component $\phi_d$ and the multiplicative component $\mu_d$, which introduces nonlinearity without significantly increasing the parameter count:

$$\theta_d = \theta_0 \odot \mu_d + \lambda_d \phi_d,$$

where $\odot$ is the Hadamard product and $\lambda_d$ are learned parameters as in the Add formulation.
In a third formulation, the domain-specific components $\phi_d$ are seen as task embeddings. We apply a learned affine transformation to the task embeddings and add the shared component $\theta_0$:

$$\theta_d = \theta_0 + W \phi_d,$$

where $W$ is a learned parameter.
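The three compositions differ only in how the domain-specific components modulate the shared base. A sketch under our reading of the variants above, with parameter vectors as Python lists (a real implementation would operate on the concatenated LSTM gate parameters):

```python
def compose_params(theta0, phi_d, lam_d=1.0, mu_d=None, W=None, variant="add"):
    """Build domain-specific parameters theta_d from the shared base
    theta0 and domain-specific components phi_d (and mu_d or W)."""
    if variant == "add":      # theta_d = theta0 + lam_d * phi_d
        return [t + lam_d * p for t, p in zip(theta0, phi_d)]
    if variant == "addmul":   # theta_d = theta0 (Hadamard) mu_d + lam_d * phi_d
        return [t * m + lam_d * p for t, m, p in zip(theta0, mu_d, phi_d)]
    if variant == "affine":   # theta_d = theta0 + W @ phi_d (phi_d: task embedding)
        return [t + sum(wij * p for wij, p in zip(row, phi_d))
                for t, row in zip(theta0, W)]
    raise ValueError(variant)
```

Setting `lam_d=0` recovers Tied, and a zero `theta0` recovers Disjoint, mirroring the degenerate cases discussed above.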
4.1 Evaluation Metrics
In this section we evaluate Rainbow and its multitask/multidomain variants on the datasets we introduced in section 2. We also apply our extracted thread embeddings on a real-world task setting of email action classification with impoverished resources.
Probabilistic models are usually evaluated on the log-likelihood of the test data $\mathcal{D}$: $\sum_{(t, y) \in \mathcal{D}} \log p(y \mid t)$. However, in our multidomain setting we have multiple datasets that differ in size and average sequence length. Therefore we evaluate our models on mean average cross entropy (MACE):

$$\mathrm{MACE}(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(t, y) \in \mathcal{D}} \frac{1}{|t|}\, \ell(f(t), y),$$

where $f$ and $\ell$ follow the definitions in section 3.3. MACE normalizes by both sequence length $|t|$ and dataset size $|\mathcal{D}|$: a model that ignores the resource-poor tasks or short sequences tends to perform poorly under this metric. MACE can therefore be seen as a per-task (log) perplexity: a larger MACE value means the model performs worse on the dataset, and an oracle would obtain a MACE value of $0$. The average of MACE scores also has the natural interpretation of the geometric mean of log-likelihoods over different tasks/domains. In addition to MACE, we also evaluate accuracy in table 6.
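MACE is straightforward to compute once a per-thread cross entropy is available. A minimal sketch, assuming `loss` is a callable returning the total cross entropy summed over a thread's positions:

```python
def mace(dataset, loss):
    """Mean average cross entropy: per-thread loss normalized by thread
    length |t|, then averaged over the dataset, so that short threads
    and small datasets carry equal weight."""
    return sum(loss(t, y) / len(t) for t, y in dataset) / len(dataset)
```

Because each thread contributes its *average* per-position loss, a long thread cannot dominate the score the way it would under a raw summed log-likelihood.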
After each epoch of training, the model is evaluated on the validation split to check whether performance has stopped improving. The training procedure terminates when no new performance gains are observed for two consecutive epochs.
4.2 Effectiveness of Rainbow: Ablation Studies
We evaluate Rainbow by comparing it, in the single-task setup, against two simpler variant architectures: one removes the recurrent thread encoder (-R), the other replaces the attentive pooling layer with an unweighted mean (-A). We evaluate the four configurations on the four labels listed in table 4 and report the averaged MACE numbers in table 5. We find that both attentive pooling and the recurrent network help, but the latter has a much more pronounced effect. Rainbow without the two additions (-R, -A) reduces to the vanilla Deep Averaging Network model, a neural baseline that has been shown to be competitive against other neural and non-neural models.
4.3 Multidomain/task Experiments
We compare our reparametrized models against the following feature-reparametrizing approaches:
For each task $s$, we concatenate the word embeddings $v_w$ with task embeddings $u_s$: $[v_w; u_s]$. The task embeddings $u_s$ are trained along with the network and, hopefully, contain task-relevant information. This idea originated from the MaLOPa (MAny Language One PArser) parser [Ammar et al.2016].
In this setting, each task has its own predictor and two message encoders, one shared and the other specific to itself. The two encoder outputs are concatenated, linearly transformed, and fed into the predictor. This is an adaptation of the Fenda (Frustratingly Easy Neural Domain Adaptation) model of [Kim, Stratos, and Sarikaya2016], which in turn is a neural extension of the classic approach of [Daume III2007].
We also compare them against the two baselines of section 3.4, Disjoint and Tied. In the Disjoint baseline, each task has its own predictor, thread encoder, and message encoder.
We evaluate our proposed models, the feature-reparametrizing models, and the non-domain-adaptive baselines on the tasks listed in table 4 in the following multidomain/multitask transfer settings: (E), (E+I), (E+R), (I+R), and (E+I+R), where E=Email, I=IRC, R=Reddit. Note that since only the emails have the two meta features E-A and E-T, (E) is our only multitask transfer setting. The results are in table 6. The differences between the results of all models are small. We inspected the model outputs and found that they all suffer severely from the label bias problem: all four tasks have very unbalanced label distributions, and the networks learn to strongly favor the more frequent label. The label bias problem could potentially be addressed with a globally normalized model, which we leave as future work. Despite the small margins, we can see that both the model- and feature-reparametrizing models outperform the baselines in terms of likelihood. Moreover, our reparametrized models consistently achieve higher likelihood than the baselines on test data in all transfer settings. In addition, Add and AddMul perform comparably well against the strong domain-adaptive models in terms of accuracy.
4.4 Recipient Action with Minimal Supervision
Hyperparameters are regularization strength and transfer setting.
We now turn to a task-based evaluation where we use our extracted thread embeddings on the task of predicting an email recipient’s next action. In particular, we focus on scenarios where we do not have a sizable amount of annotated data to train a neural network in an end-to-end fashion, and when we simply did not anticipate the task when we trained the model. This setting evaluates the network’s ability to generalize over multiple tasks and learn a good representation.
To be more specific, the setup is as follows: we use the trained models from section 4.3 to encode thread embeddings for the action-annotated emails of section 2. Subsequently we use these thread embeddings to train Tied, Disjoint, MaLOPa, and Fenda. We also compare these against doc2vec embeddings trained on the whole Avocado corpus (listed in table 7 as Doc2Vec).
Given the small size of the annotated data, we decided to evaluate the models with nested cross validation (CV). In the outer layer, we randomly split the annotated emails into (train+dev)-test splits (120 splits, made in a thread-wise fashion). In the inner layer, we use 7-fold CV on the (train+dev) split to find the best hyperparameters. The best hyperparameters are then used to train a classifier, which is subsequently evaluated on the test split of the outer-layer CV. We report the averages in table 7. Disjoint performs poorly on this task since there is no baked-in constraint for it to learn a shared representation. All shared-representation baselines (Tied, Fenda, MaLOPa) performed better than both Disjoint and Doc2Vec. Still, our reparametrized models compare favorably against the feature-reparametrizing baselines.
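The nested CV procedure can be sketched as follows. Here `train_eval` is a hypothetical callable that trains a classifier on one split under hyperparameters `hp` and returns its score; the 80/20 outer ratio is an assumption for illustration only:

```python
import random

def nested_cv(examples, hyper_grid, train_eval, outer_splits=5, inner_folds=7, seed=0):
    """Nested cross-validation: random outer (train+dev)/test splits,
    with inner k-fold CV on (train+dev) to pick hyperparameters."""
    rng = random.Random(seed)
    scores = []
    for _ in range(outer_splits):
        data = examples[:]
        rng.shuffle(data)
        cut = int(0.8 * len(data))          # assumed outer split ratio
        traindev, test = data[:cut], data[cut:]

        def inner_score(hp):
            # average score of hp over the inner folds
            fold = len(traindev) // inner_folds or 1
            s = []
            for k in range(inner_folds):
                dev = traindev[k * fold:(k + 1) * fold]
                tr = traindev[:k * fold] + traindev[(k + 1) * fold:]
                if tr and dev:
                    s.append(train_eval(tr, dev, hp))
            return sum(s) / len(s) if s else float("-inf")

        best_hp = max(hyper_grid, key=inner_score)
        scores.append(train_eval(traindev, test, best_hp))
    return sum(scores) / len(scores)
```

The paper's actual setup additionally splits thread-wise so that messages of one thread never straddle train and test; that bookkeeping is omitted here.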
We run another cross-validation evaluation over different transfer settings in table 8. While both the Reddit (E+R) and IRC (E+I) settings do better than email only (E), the IRC dataset is much more helpful than Reddit. This resonates with our initial findings in section 2.4 that the IRC dataset is more similar to emails. We note that all the scores are low. Nonetheless, we find it encouraging that out-of-domain data is able to help learn a better representation in this extremely resource-scarce setting.
5 Related Work
There has been a lot of work on multidomain/multitask learning with shared representations, as described in section 1. Our work is also closely related to work on email speech act modeling and recognition [Cohen, Carvalho, and Mitchell2004, Lampert et al.2008, Jeong, Lin, and Lee2009, De Felice and Deane2012]. The idea of model reparametrization for domain adaptation is abundant in the literature on hierarchical Bayesian modeling, such as [Finkel and Manning2009, Eisenstein, Ahmed, and Xing2011].

Within the deep learning literature, our work is also related to work on DNN reparametrization for multitask learning, such as [Spieckermann, Udluft, and Runkler2014, Yang and Hospedales2016]. Our work shows that the reparametrization approach also works for domain adaptation. Finally, we would like to point out that [Ha, Dai, and Le2016] introduces an alternative and much more sophisticated reparametrization of RNNs. An interesting future direction for our work is to follow it by reparametrizing networks as hypernetworks that take a task embedding as input. In that case, using the terminology introduced in this paper, we would be feature-reparametrizing the hypernetwork, which in turn model-reparametrizes an RNN.
In this paper, we have introduced an email recipient action annotation scheme and a dataset annotated according to this scheme. By annotating the recipient action rather than the sender’s intent, our taxonomy is agnostic to specific speech act theories and arguably more suitable for training systems that suggest such actions. We have curated an annotated dataset with good inter-annotator agreement. We have also introduced a hierarchical threaded message model, Rainbow, to model such emails. To cope with the problem of data scarcity, we have introduced RNN reparametrization as an approach to domain adaptation and applied it to the problem of email recipient action modeling. It is competitive against common feature-reparametrized neural models when trained in an end-to-end fashion. We also show that while it is not explicitly designed to encode a shared representation across tasks and domains, it learns to generalize in a minimally supervised scenario. There are many possible future directions for our work. For example, with appropriate software, we could obtain more annotations automatically, and possibly learn the taxonomy along the way. Our reparametrization framework is also quite extensible; for instance, user-specific parameters could be learned for personalized models, as in [Li et al.2016].
- [Agema2015] Agema, L. 2015. Death by e-mail overload.
- [Ammar et al.2016] Ammar, W.; Mulcaire, G.; Ballesteros, M.; Dyer, C.; and Smith, N. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics 4:431–444.
- [Arguello and Shaffer2015] Arguello, J., and Shaffer, K. 2015. Predicting speech acts in mooc forum posts. In ICWSM, 2–11.
- [Caruana1997] Caruana, R. 1997. Multitask learning. Machine Learning 28(1):41–75.
- [Cohen, Carvalho, and Mitchell2004] Cohen, W.; Carvalho, V.; and Mitchell, T. 2004. Learning to classify email into “speech acts”. In EMNLP.
- [Collobert and Weston2008] Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 160–167. ACM.
- [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12:2493–2537.
- [Corston-Oliver et al.2004] Corston-Oliver, S. H.; Ringger, E.; Gamon, M.; and Campbell, R. 2004. Task-focused summarization of email. In ACL.
- [Daume III2007] Daume III, H. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 256–263. Prague, Czech Republic: Association for Computational Linguistics.
- [De Felice and Deane2012] De Felice, R., and Deane, P. 2012. Identifying speech acts in e-mails: Toward automated scoring of the toeic e-mail task. ETS Research Report Series 2012(2):i–62.
- [Dredze et al.2008] Dredze, M.; Brooks, T.; Carroll, J.; Magarick, J.; Blitzer, J.; and Pereira, F. 2008. Intelligent email: Reply and attachment prediction. In IUI.
- [Eisenstein, Ahmed, and Xing2011] Eisenstein, J.; Ahmed, A.; and Xing, E. P. 2011. Sparse additive generative models of text. In Getoor, L., and Scheffer, T., eds., ICML, 1041–1048. Omnipress.
- [Finkel and Manning2009] Finkel, J. R., and Manning, C. D. 2009. Hierarchical bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL ’09, 602–610. Stroudsburg, PA, USA: Association for Computational Linguistics.
- [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), 513–520.
- [Goldstein and Sabin2006] Goldstein, J., and Sabin, R. E. 2006. Using speech acts to categorize email and identify email genres. In System Sciences, 2006. HICSS’06. Proceedings of the 39th Annual Hawaii International Conference on, volume 3, 50b–50b. IEEE.
- [Ha, Dai, and Le2016] Ha, D.; Dai, A. M.; and Le, Q. V. 2016. Hypernetworks. CoRR abs/1609.09106.
- [Iyyer et al.2015] Iyyer, M.; Manjunatha, V.; Boyd-Graber, J. L.; and Daumé, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.
- [Jeong, Lin, and Lee2009] Jeong, M.; Lin, C.-Y.; and Lee, G. G. 2009. Semi-supervised speech act recognition in emails and forums. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3, EMNLP ’09, 1250–1259. Stroudsburg, PA, USA: Association for Computational Linguistics.
- [Kannan et al.2016] Kannan, A.; Kurach, K.; Ravi, S.; Kaufmann, T.; Tomkins, A.; Miklos, B.; Corrado, G.; Lukács, L.; Ganea, M.; Young, P.; et al. 2016. Smart reply: Automated response suggestion for email. arXiv preprint arXiv:1606.04870.
- [Khoussainov and Kushmerick2005] Khoussainov, R., and Kushmerick, N. 2005. Email task management: An iterative relational learning approach. In CEAS.
- [Kim, Stratos, and Sarikaya2016] Kim, Y.-B.; Stratos, K.; and Sarikaya, R. 2016. Frustratingly easy neural domain adaptation. ACL – Association for Computational Linguistics.
- [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Klimt and Yang2004] Klimt, B., and Yang, Y. 2004. The Enron Corpus: A New Dataset for Email Classification Research. Berlin, Heidelberg: Springer Berlin Heidelberg. 217–226.
- [Lampert et al.2008] Lampert, A.; Dale, R.; Paris, C.; et al. 2008. The nature of requests and commitments in email messages. In Proceedings of the AAAI Workshop on Enhanced Messaging, 42–47.
- [Li et al.2016] Li, J.; Galley, M.; Brockett, C.; Spithourakis, G. P.; Gao, J.; and Dolan, W. B. 2016. A persona-based neural conversation model. CoRR abs/1603.06155.
- [Lowe et al.2015] Lowe, R.; Pow, N.; Serban, I.; and Pineau, J. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
- [Moldovan, Rus, and Graesser2011] Moldovan, C.; Rus, V.; and Graesser, A. C. 2011. Automated speech act classification for online chat. MAICS 710:23–29.
- [Oard et al.2015] Oard, D.; Webber, W.; Kirsch, D.; and Golitsynskiy, S. 2015. Avocado research email collection. DVD.
- [Peng and Dredze2016] Peng, N., and Dredze, M. 2016. Multi-task multi-domain representation learning for sequence tagging. arXiv preprint arXiv:1608.02689.
- [Ragni et al.2016] Ragni, A.; Dakin, E.; Chen, X.; Gales, M. J.; and Knill, K. M. 2016. Multi-language neural network language models. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, volume 8, 3042–3046.
- [Riezler2014] Riezler, S. 2014. On the problem of theoretical terms in empirical computational linguistics. Computational Linguistics 40(1):235–245.
- [Rush, Chopra, and Weston2015] Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- [Searle1976] Searle, J. R. 1976. A classification of illocutionary acts. Language in society 5(01):1–23.
- [Spieckermann, Udluft, and Runkler2014] Spieckermann, S.; Udluft, S.; and Runkler, T. 2014. Data-efficient temporal regression with multi-task recurrent neural networks.
- [Tai, Socher, and Manning2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. CoRR abs/1503.00075.
- [Traum1999] Traum, D. R. 1999. Speech acts for dialogue agents. In Foundations of rational agency. Springer. 169–201.
- [Yang and Hospedales2014] Yang, Y., and Hospedales, T. M. 2014. A Unified Perspective on Multi-Domain and Multi-Task Learning. ArXiv e-prints.
- [Yang and Hospedales2016] Yang, Y., and Hospedales, T. M. 2016. Deep multi-task representation learning: A tensor factorisation approach. CoRR abs/1605.06391.
- [Yang, Salakhutdinov, and Cohen2016] Yang, Z.; Salakhutdinov, R.; and Cohen, W. W. 2016. Multi-task cross-lingual sequence tagging from scratch. CoRR abs/1603.06270.
- [Yang, Salakhutdinov, and Cohen2017] Yang, Z.; Salakhutdinov, R.; and Cohen, W. W. 2017. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. ArXiv e-prints.