Listening between the Lines: Learning Personal Attributes from Conversations

04/24/2019 · Anna Tigunova et al. · Max Planck Society

Open-domain dialogue agents must be able to converse about many topics while incorporating knowledge about the user into the conversation. In this work we address the acquisition of such knowledge, for personalization in downstream Web applications, by extracting personal attributes from conversations. This problem is more challenging than the established task of information extraction from scientific publications or Wikipedia articles, because dialogues often give merely implicit cues about the speaker. We propose methods for inferring personal attributes, such as profession, age or family status, from conversations using deep learning. Specifically, we propose several Hidden Attribute Models, which are neural networks leveraging attention mechanisms and embeddings. Our methods are trained on a per-predicate basis to output rankings of object values for a given subject-predicate combination (e.g., ranking the doctor and nurse professions high when speakers talk about patients, emergency rooms, etc). Experiments with various conversational texts including Reddit discussions, movie scripts and a collection of crowdsourced personal dialogues demonstrate the viability of our methods and their superior performance compared to state-of-the-art baselines.


1. Introduction

Motivation and Problem: While interest in dialogue agents has grown rapidly in recent years, creating agents capable of holding personalized conversations remains a challenge. For meaningful and diverse dialogues with a real person, a system should be able to infer knowledge about the person’s background from her utterances. Consider the following example, where H stands for the human and A for the agent:
   H: What’s the best place for having brekky?
   A: The porridge at Bread and Cocoa is great.
   H: Any suggestions for us and the kids later?
   H: We already visited the zoo.
   A: There’s the San Francisco Dungeon,
   A: an amusement ride with scary city history.

From the word ‘brekky’ in the first utterance, the system understands that the user is Australian and may thus like porridge for breakfast. However, the cue is missed that the user is with pre-teen children (talking about kids and the zoo), and the resulting suggestion is inappropriate for young children. Instead, with awareness of this knowledge, a better reply could have been:
   A: I bet the kids loved the sea lions, so you
   A: should see the dolphins at Aquarium of the Bay.

A possible remedy to improve this situation is to include user information into an end-to-end learning system for the dialogue agent. However, any user information would be bound to latent representations rather than explicit attributes. Instead, we propose to capture such attributes and construct a personal knowledge base (PKB) with this information, which will then be a distant source of background knowledge for personalization in downstream applications such as Web-based chatbots and agents in online forums.

The advantages of an explicit PKB are twofold: it is an easily re-usable asset that benefits all kinds of applications (rather than merely fixing the current discourse), and it provides a convenient way of explaining the agent’s statements to the human whenever requested. Constructing a PKB involves several key issues:

  • What is personalized knowledge about users?

  • How can this knowledge be inferred from the users’ utterances in conversations?

  • How should such background knowledge be incorporated into a dialogue system?

As for the first issue, interesting facts about a user could be attributes (age, gender, occupation, etc), interests and hobbies, relationships to other people (family status, names of friends, etc) or sentiments towards certain topics or people. In this paper’s experiments, we focus on crisp attributes like profession and gender.

The second issue is concerned with information extraction from text. However, prior work has mostly focused on clean, well-edited text genres such as Wikipedia articles or news stories. These methods do not work as well when conversations are the input. Compared to formal documents, dialogues are noisy, utterances are short, the language used is colloquial, topics are diverse (including smalltalk), and information is often implicit (“between the lines”). This paper addresses these problems by proposing methods that identify the terms that are informative for an attribute and leverage these terms to infer the attribute’s value.

A detailed exploration of the third issue is beyond the scope of this paper, which focuses on inferring personal attributes. However, we do partially address the issue of integrating background knowledge by arguing that such information should be captured in an explicit PKB. In addition to being independent of downstream applications, an explicit PKB can provide transparency by allowing users to see what is known about them as well as giving users the opportunity to consent to this data being stored.

State of the Art and its Limitations: Currently the most successful dialogue agents are task-oriented, for instance, supporting users with car navigation or delivery orders (e.g., (Madotto et al., 2018; Mo et al., 2018)). General-purpose chatbot-like agents show decent performance in benchmarks (e.g., (Ghazvininejad et al., 2018; Li et al., 2016; Sordoni et al., 2015)), but critically rely on sufficient training data and tend to lack robustness when users behave in highly varying ways. Very few approaches have considered incorporating explicit knowledge on individual users, and these approaches have assumed that personal attributes are explicitly mentioned in the text (Li et al., 2014; Zhang et al., 2018; Jing et al., 2007).

To illustrate that identifying explicit mentions of attributes is insufficient, we developed an oracle to obtain an upper bound on the performance of pattern-based approaches, such as (Li et al., 2014). This oracle, which is described in Section 5.2, assumes that we have perfect pattern matching that correctly extracts an attribute value every time it is mentioned. (When multiple attribute values are mentioned, we assume the oracle picks the correct one.) This oracle routinely performs substantially worse than our proposed methods, demonstrating that extracting information from utterances requires listening between the lines (i.e., inferring the presence of attribute values that are never explicitly stated).

On the other hand, many efforts have considered the problem of profiling social media users in order to predict latent attributes such as age, gender, or regional origin (e.g., (Rao et al., 2010; Burger et al., 2011; Schwartz et al., 2013; Sap et al., 2014; Flekova et al., 2016a; Kim et al., 2017; Vijayaraghavan et al., 2017; Bayot and Gonçalves, 2018; Fabian et al., 2015)). While social media posts and utterances are similar in that both are informal, the former can be associated with many non-textual features that are unavailable outside of the social media domain (e.g., social-network friends, likes, etc. and explicit self-portraits of users). We consider several user profiling baselines that rely on only textual features and find that they do not perform well on our task of inferring attributes from conversational utterances.

Approach and Contributions: We devise a neural architecture, called Hidden Attribute Model (HAM), trained with subject-predicate-object triples to predict objects on a per-predicate basis, e.g., for a subject’s profession or family status. The underlying neural network learns to predict a scoring of different objects (e.g., different professions) for a given subject-predicate pair by using attention within and across utterances to infer object values. For example, as illustrated later in Table 7, our approach infers that a subject who often uses terms like theory, mathematical, and species is likely to be a scientist, while a subject who uses terms like senate, reporters, and president may be a politician.

Our salient contributions are the following:

  • a viable method for learning personal attributes from conversations, based on neural networks with novel ways of leveraging attention mechanisms and embeddings,

  • a data resource of dialogues from movie scripts, with ground truth for professions, genders, and ages of movie characters,

  • a data resource of dialogue-like Reddit posts, with ground truth for the authors’ professions, genders, ages, and family statuses

  • an extensive experimental evaluation of various methods on Reddit, movie script dialogues, and crowdsourced personalized conversations (PersonaChat),

  • an experimental evaluation of a transfer learning approach: leveraging ample user-generated social media text (Reddit) for inferring users’ latent attributes from data-scarce speech-based dialogues (movie scripts and PersonaChat).

2. Related work

Personal Knowledge from Dialogues: Awareness of background knowledge about users is important in both goal-oriented dialogues (Chen et al., 2016; Joshi et al., 2017) and chitchat settings (Li et al., 2016; Zhang et al., 2018). Such personal knowledge may range from users’ intents and goals (Chen et al., 2016; Wakabayashi et al., 2016) to users’ profiles, with attributes such as home location, gender, age, etc. (Joshi et al., 2017; Li et al., 2016; Lin and Walker, 2011), as well as users’ preferences (Mo et al., 2018; Zhang et al., 2018).

Prior works on incorporating background knowledge model personas and enhance dialogues by injecting personas into the agent’s response-generating model (Li et al., 2016; Lin and Walker, 2011; Zhang et al., 2018) or by adapting the agent’s speech style (Joshi et al., 2017). These approaches typically use latent representations of user information. In contrast, there is not much work on capturing explicit knowledge from user utterances. Li et al. encode implicit speaker-specific information via distributed embeddings that can be used to cluster users along some traits (e.g., age or country) (Li et al., 2016). Zhang et al. describe a preliminary study on predicting simple speaker profiles from a set of dialogue utterances (Zhang et al., 2018). The most closely related prior work, by Garera and Yarowsky, captures latent biographic attributes via SVM models from conversation transcripts and email communication, taking into account contextual features such as partner effect and sociolinguistic features (e.g., % of “yeah” occurrences), in addition to n-grams (Garera and Yarowsky, 2009).

Several works explore information extraction from conversational text to build a personal knowledge base of speakers given their utterances, such as extracting profession: software engineer and employment_history: Microsoft from “I work for Microsoft as a software engineer”, using maximum-entropy classifiers (Jing et al., 2007) and sequence-tagging CRFs (Li et al., 2014). However, this approach relies on user attributes being explicitly mentioned in the utterances. In contrast, our method can infer attribute values from implicit cues, e.g., “I write product code in Redmond.”

Social Media User Profiling: The rapid growth of social media has led to a massive volume of user-generated informal text, which sometimes mimics conversational utterances. A great deal of work has been dedicated to automatically identifying latent demographic features of online users, including age and gender (Rao et al., 2010; Burger et al., 2011; Schwartz et al., 2013; Sap et al., 2014; Flekova et al., 2016a; Kim et al., 2017; Vijayaraghavan et al., 2017; Bayot and Gonçalves, 2018; Fabian et al., 2015), political orientation and ethnicity (Rao et al., 2010; Pennacchiotti and Popescu, 2011; Preoţiuc-Pietro et al., 2017; Preoţiuc-Pietro and Ungar, 2018; Vijayaraghavan et al., 2017), regional origin (Rao et al., 2010; Fabian et al., 2015), personality (Schwartz et al., 2013; Gjurković and Šnajder, 2018), as well as occupational class that can be mapped to income (Preoţiuc-Pietro et al., 2015; Flekova et al., 2016b). Most of these works focus on user-generated content from Twitter, with a few exceptions that explore Facebook (Schwartz et al., 2013; Sap et al., 2014) or Reddit (Fabian et al., 2015; Gjurković and Šnajder, 2018) posts.

Most existing studies that infer social media users’ latent attributes rely on classification over hand-crafted features such as word/character n-grams (Rao et al., 2010; Burger et al., 2011; Basile et al., 2017), Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001) categories (Preoţiuc-Pietro et al., 2017; Preoţiuc-Pietro and Ungar, 2018; Gjurković and Šnajder, 2018), topic distributions (Pennacchiotti and Popescu, 2011; Preoţiuc-Pietro et al., 2015; Flekova et al., 2016a), and sentiment/emotion labels of words derived from existing emotion lexicons (Pennacchiotti and Popescu, 2011; Preoţiuc-Pietro et al., 2017; Preoţiuc-Pietro and Ungar, 2018; Gjurković and Šnajder, 2018). The best performing system (Basile et al., 2017) in the shared task on author profiling organized by the CLEF PAN lab (Potthast et al., 2017; Francisco Manuel et al., 2017) utilizes a linear SVM with word/character n-gram features.

There have been only limited efforts to identify attributes of social media users using neural network approaches. Bayot and Gonçalves explore the use of Convolutional Neural Networks (CNNs) together with word2vec embeddings to perform user profiling (age and gender) of English and Spanish tweets (Bayot and Gonçalves, 2018). Kim et al. employ Graph Recursive Neural Networks (GRNNs) to infer demographic characteristics of users (Kim et al., 2017). Vijayaraghavan et al. exploit attention-based models to identify demographic attributes of Twitter users given multi-modal features extracted from users’ profiles (e.g., name, profile picture, and description), social network, and tweets (Vijayaraghavan et al., 2017).

The vast majority of these works rely on features specific to social media such as hashtags, users’ profile descriptions and social network structure; only (Preoţiuc-Pietro et al., 2015; Bayot and Gonçalves, 2018; Basile et al., 2017) infer users’ latent attributes based solely on user-generated text. In our evaluation we consider these three methods as baselines.

Neural Models with Attention: Recently, neural models enhanced with attention mechanisms have boosted the results of various NLP tasks (Tan et al., 2018; Yang et al., 2016; Zhou et al., 2016), particularly in neural conversation models for generating responses (Xing et al., 2018; Yao et al., 2016) and in predicting user intents (Chen et al., 2016). While our attention approach bears some similarity to (Xing et al., 2018) in that both consider attention on utterances, both our approach’s architecture and our use case differ substantially. The role of attention weights has been studied for various neural models, including feed-forward networks (Vaswani et al., 2017), CNNs (Yin et al., 2016), and RNNs (Bahdanau et al., 2015).

3. Methodology

EDWARDS
Put down the gun and put your hands on the counter!
KAY
I warned him.
EDWARDS
Drop the weapon!
KAY
You warned him.
EDWARDS
You are under arrest.
You have the right to remain silent.

Figure 1. An excerpt from Men in Black (1997).

In this section we propose Hidden Attribute Models (HAMs) for ranking object values given a predicate and a sequence of utterances made by a subject. For example, given the movie script excerpt shown in Figure 1, the profession predicate, and the subject Edwards, policeman should be ranked as the subject’s most likely profession.

More formally, given a subject $s$ and a predicate $p$, our goal is to predict a probability distribution over the predicate's object values $O_p$ based on the subject’s utterances from a dialogue corpus (e.g., a movie script). Each subject is associated with a sequence of $N$ utterances $u_1, \ldots, u_N$ containing $M$ terms each, $u_i = (t_{i,1}, \ldots, t_{i,M})$. Each term $t_{i,j}$ is represented as a $d$-dimensional word embedding.

HAMs can be described in terms of three functions and their outputs:

  1. $f_{utter}$ creates a representation $r_i$ of the $i$-th utterance given the terms in the utterance:

    $r_i = f_{utter}(t_{i,1}, \ldots, t_{i,M})$   (1)

  2. $f_{subj}$ creates a subject representation $R$ given the sequence of utterance representations:

    $R = f_{subj}(r_1, \ldots, r_N)$   (2)

  3. $f_{obj}$ outputs a probability distribution over object values given the subject representation:

    $P(O_p) = f_{obj}(R)$   (3)

In the following sections we describe our proposed HAMs by instantiating these functions.

3.1. Hidden Attribute Models

HAM_avg illustrates the most straightforward way to combine term embeddings and utterances. In this model,

    $avg(x_1, \ldots, x_n) = \frac{1}{n} \sum_{k=1}^{n} x_k$   (4)

serves as both $f_{utter}$ and $f_{subj}$; the $i$-th utterance representation is created by averaging the terms in the $i$-th utterance, and the subject representation is created by averaging the utterance representations together. Two stacked fully connected layers serve as the function $f_{obj}$,

    $ffnn(x) = \phi_2(W_2 \cdot \phi_1(W_1 \cdot x))$   (5)

where $\phi_1$ and $\phi_2$ are activation functions and $W_1$ and $W_2$ are learned weights. The full HAM_avg model is then

    $r_i = avg(t_{i,1}, \ldots, t_{i,M})$   (6)
    $R = avg(r_1, \ldots, r_N)$   (7)
    $P(O_p) = ffnn(R)$   (8)

where $\phi_1$ uses a sigmoid activation and $\phi_2$ uses a softmax activation function in order to predict a probability distribution over object values.
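
To make the model concrete, the following is a minimal sketch of HAM_avg in PyTorch. The framework, tensor shapes and layer names are our own assumptions for illustration; the paper does not prescribe an implementation.

    import torch
    import torch.nn as nn

    class HamAvg(nn.Module):
        """Sketch of HAM_avg: average term embeddings per utterance, average the
        utterance representations, then apply two stacked fully connected layers."""

        def __init__(self, emb_dim: int, hidden_dim: int, num_objects: int):
            super().__init__()
            self.hidden = nn.Linear(emb_dim, hidden_dim)      # first layer (sigmoid)
            self.output = nn.Linear(hidden_dim, num_objects)  # second layer (softmax)

        def forward(self, utterances: torch.Tensor) -> torch.Tensor:
            # utterances: (N utterances, M terms, d-dim embeddings) for one subject
            utter_repr = utterances.mean(dim=1)    # f_utter: average over terms      -> (N, d)
            subj_repr = utter_repr.mean(dim=0)     # f_subj: average over utterances  -> (d,)
            hidden = torch.sigmoid(self.hidden(subj_repr))
            return torch.softmax(self.output(hidden), dim=-1)  # distribution over object values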

HAM_2attn extends HAM_avg with two attention mechanisms, allowing the model to learn which terms and utterances to focus on for the given predicate. In this model the utterance representations and the subject representation are computed using attention-weighted averages,

    $wavg(x_1, \ldots, x_n; \alpha) = \sum_{k=1}^{n} \alpha_k x_k$   (9)

with the attention weights $\alpha$ calculated over utterance terms and utterance representations, respectively. That is, $r_i = wavg(t_{i,1}, \ldots, t_{i,M}; \alpha^{term}_i)$ and $R = wavg(r_1, \ldots, r_N; \alpha^{utter})$, where the attention weights for each term in an utterance are calculated as

    $g_{i,j} = \phi_a(W_a \cdot t_{i,j})$   (10)
    $\alpha^{term}_{i,j} = \exp(g_{i,j}) / \sum_{k=1}^{M} \exp(g_{i,k})$   (11)

with an activation function $\phi_a$ and learned attention weights $W_a$; the utterance representation weights $\alpha^{utter}$ are calculated analogously over $r_1, \ldots, r_N$. Given these attention weights, the HAM_2attn model is

    $r_i = wavg(t_{i,1}, \ldots, t_{i,M}; \alpha^{term}_i)$   (12)
    $R = wavg(r_1, \ldots, r_N; \alpha^{utter})$   (13)
    $P(O_p) = f_{obj}(R)$   (14)

where the function $f_{obj}$ uses a softmax activation function as in the previous model.
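
A plausible implementation of the attention-weighted average is sketched below. The single linear scoring layer with a sigmoid (following the hyperparameters in Section 5.3) is an assumption, not the authors' exact formulation.

    import torch
    import torch.nn as nn

    class AttentionPool(nn.Module):
        """Attention-weighted average over a sequence of vectors (terms or utterances)."""

        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)  # assumed scoring layer: one scalar per input vector

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (sequence_length, dim)
            scores = torch.sigmoid(self.score(x)).squeeze(-1)  # g_k, one score per vector
            weights = torch.softmax(scores, dim=0)             # attention weights alpha_k
            return (weights.unsqueeze(-1) * x).sum(dim=0)      # weighted average -> (dim,)

In HAM_2attn, one such module would be applied to the terms of each utterance (yielding r_i) and a second one to the resulting utterance representations (yielding R).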

HAM_CNN considers n-grams when building utterance representations, unlike both previous models, which treat each utterance as a bag of words. In this model $f_{utter}$ is implemented with a text classification CNN (Kim, 2014) with a ReLU activation function and k-max pooling across utterance terms (i.e., each filter’s top k values are kept). A second k-max pooling operation across utterance representations serves as $f_{subj}$. As in the previous model, a single fully connected layer with a softmax activation function serves as $f_{obj}$.

HAM_CNN-attn extends HAM_CNN by using attention to combine utterance representations into the subject representation. This mirrors the approach used by HAM_2attn, with the weights computed using equations 10 and 11 as before. This model uses the same $f_{utter}$ and $f_{obj}$ as HAM_CNN. That is, utterance representations are produced using a CNN with k-max pooling, and a single fully connected layer produces the model’s output.
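
The sketch below illustrates the CNN-based $f_{utter}$ under the same PyTorch assumption, with kernel width 2 as in Section 5.3; the number of filters and k are free hyperparameters here.

    import torch
    import torch.nn as nn

    class CnnUtteranceEncoder(nn.Module):
        """Sketch of f_utter for HAM_CNN / HAM_CNN-attn: a 1-D convolution over the
        term embeddings of one utterance (kernel width 2, i.e., bigrams), a ReLU,
        and k-max pooling over term positions."""

        def __init__(self, emb_dim: int, num_filters: int, k: int):
            super().__init__()
            self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=2)
            self.k = k

        def forward(self, terms: torch.Tensor) -> torch.Tensor:
            # terms: (M terms, emb_dim) for a single utterance, with M - 1 >= k
            features = torch.relu(self.conv(terms.t().unsqueeze(0)))  # (1, num_filters, M - 1)
            pooled = features.topk(self.k, dim=-1).values             # keep each filter's top-k values
            return pooled.flatten()                                   # utterance representation r_i

HAM_CNN then applies a second k-max pooling across the resulting utterance representations, whereas HAM_CNN-attn combines them with the attention-weighted average described above.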

3.2. Training

All HAMs were trained with gradient descent to minimize a categorical cross-entropy loss. We use the Adam optimizer (Kingma and Ba, 2015) with its default values and apply an L2 weight decay (2e-7) to the loss.
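
A hedged sketch of this training setup, reusing the HamAvg module from the sketch in Section 3.1; the data shapes below are dummy values for illustration only.

    import torch

    # Assumed training sketch: Adam with default settings, L2 weight decay of 2e-7,
    # and a categorical cross-entropy objective over the predicted distribution.
    model = HamAvg(emb_dim=300, hidden_dim=100, num_objects=43)
    optimizer = torch.optim.Adam(model.parameters(), weight_decay=2e-7)

    utterances = torch.randn(20, 10, 300)  # one dummy subject: 20 utterances of 10 terms
    label = torch.tensor([3])              # index of the gold object value

    for _ in range(100):  # a few gradient steps on the dummy subject
        optimizer.zero_grad()
        probs = model(utterances)  # distribution over object values
        loss = torch.nn.functional.nll_loss(torch.log(probs + 1e-9).unsqueeze(0), label)
        loss.backward()
        optimizer.step()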

4. Data acquisition and processing

MovieChAtt dataset. Following prior work on personalized dialogue systems, we explore the applicability of fictional dialogues from TV or movie scripts to approximate real-life conversations (Li et al., 2016; Lin and Walker, 2011). Specifically, we compile a Movie Character Attributes (MovieChAtt) dataset consisting of characters’ utterances and the characters’ attributes (e.g., profession).

To create MovieChAtt, we labeled a subset of characters in the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011) of 617 movie scripts (http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). From each movie, we derive a sequence of utterances for each character, excluding characters who have fewer than 20 lines in the movie. Each utterance is represented as a sequence of words, excluding stop words, the 1,000 most common first names (removed to prevent overfitting; http://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data), and words that occur in fewer than four different movies. The latter two types of words are excluded in order to prevent the model from relying on movie-specific or character-specific signals that will not generalize.

We extracted characters’ gender and age attributes by associating characters with their entries in the Internet Movie Database (IMDb) and extracting the corresponding actor or actress’ attributes at the time the movie was filmed. This yielded 1,963 characters labeled with their genders and 4,548 characters labeled with their ages. We discretized the age attribute into the following ranges: (i) 0–13: child, (ii) 14–23: teenager, (iii) 24–45: adult, (iv) 46–65: middle-aged and (v) 66–100: senior. In our data the distribution of age categories is highly imbalanced, with adult characters dominating the dataset (58.7%) and child being the smallest category (1.7%).
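
The discretization above corresponds to the following small helper (our own rendering of the stated ranges):

    def age_category(age: int) -> str:
        """Map an actor's age at filming time to the MovieChAtt age categories."""
        if age <= 13:
            return "child"
        if age <= 23:
            return "teenager"
        if age <= 45:
            return "adult"
        if age <= 65:
            return "middle-aged"
        return "senior"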

To obtain the ground-truth labels of characters’ profession attributes, we conducted a Mechanical Turk (MTurk) crowdsourcing task to annotate 517 of the movies in our corpus. Workers were asked to indicate the professions of characters in a movie given the movie’s Wikipedia article. Workers were instructed to select professions from a general predefined list if possible (e.g., doctor, engineer, military personnel), and to enter a new profession label when necessary. We manually defined and refined the list of professions based on several iterations of MTurk studies to ensure high coverage and to reduce ambiguity in the options (e.g., journalist vs reporter). We also included non-occupational “professions” that often occur in movies, such as child and criminal.

We assessed the crowdworkers’ inter-annotator agreement with Fleiss’ kappa. Disagreement was oftentimes caused by one character having multiple professions (Batman is both a superhero and a businessman) or by a change of profession in the storyline (e.g., from banker to unemployed). We kept only characters for which at least 2 out of 3 workers agreed on their profession, which yielded 1,405 characters labeled with 43 distinct professions. The highly imbalanced distribution of professions reflects the bias in our movie dataset, which features more criminals and detectives than waiters or engineers.

PersonaChat dataset. We also explore the robustness of our models using the PersonaChat corpus (Zhang et al., 2018), available at http://convai.io/#personachat-convai2-dataset, which consists of conversations collected via Mechanical Turk. Workers were given five-sentence persona descriptions (e.g., “I am an artist”, “I like to build model spaceships”) and asked to incorporate these facts into a short conversation (up to 8 turns) with another worker. We split these conversations by persona, yielding a sequence of 3 or 4 utterances for each persona in a conversation.

We automatically labeled personas with profession and gender attributes by looking in the persona description for patterns “I am/I’m a(n) term”, where term is either a profession label or a gender-indicating noun (woman, uncle, mother, etc). We manually labeled personas with family status by identifying persona descriptions containing related words (single, married, lover, etc) and labeling the corresponding persona as single or not single. Overall, we collected 1,147 personas labeled with profession, 1,316 with gender, and 2,302 with family status.

attribute pattern
profession “(I|i) (am|’m)” + [one of the profession names]
gender “(I|i) (am|’m)” + [words indicating gender (lady, father, etc.)]
age “(I|i) (am|’m)” + number (5-100) + “years old”
“(I|i) (am|’m|was) born in” + number (1920-2013)
family status “(I|i) (am|’m)” + [married, divorced, single, … ]
“(M|m)y” or “(I|i) have a” + [wife, boyfriend, …]
Table 1. Patterns for extracting ground truth labels from Reddit posts.

Reddit dataset. As a proxy for dialogues we used discussions from Reddit online forums. We used a publicly available crawl (https://files.pushshift.io/reddit/) spanning the years 2006 to 2017. Specifically, we tapped into two subforums on Reddit: “iama”, where anyone can ask questions to a particular person, and “askreddit”, with more general conversations. In selecting these subforums we followed two criteria: 1) they are not concerned with fictional topics (e.g., computer games) and 2) they are not too topic-specific, as this could heavily bias the classification of user attributes.

To create ground-truth labels for users, we searched for posts that matched specific patterns. The list of patterns for all four attributes is given in Table 1. For the case of profession, we created a list of synonyms and subprofessions for better coverage. For example, the list for businessperson includes entrepreneur, merchant, trader, etc.
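
As an illustration, the following is a hedged sketch of how the age patterns from Table 1 could be applied to a single Reddit post with regular expressions. This is our own rendering, not the authors’ exact implementation; the post year used for the “born in” pattern is an assumption.

    import re

    AGE_PATTERNS = [
        re.compile(r"\b(?:I am|I'm) (\d{1,3}) years old\b", re.IGNORECASE),
        re.compile(r"\b(?:I am|I'm|I was) born in (\d{4})\b", re.IGNORECASE),
    ]

    def extract_age(post: str, post_year: int = 2017):
        """Return an age label if the post matches one of the Table 1 age patterns."""
        for pattern in AGE_PATTERNS:
            match = pattern.search(post)
            if match:
                value = int(match.group(1))
                if 5 <= value <= 100:       # "... years old" pattern
                    return value
                if 1920 <= value <= 2013:   # "born in <year>" pattern
                    return post_year - value
        return None

    print(extract_age("Well, I'm 27 years old and I still love cartoons."))  # -> 27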

We selected only posts of length between 10 and 100 words by users who have between 20 and 100 posts. Also, we removed users who claim multiple values for the same attribute, as we allow only single-label classification. To further increase data quality, we rated users by language style in order to give preference to those whose posts sound more like utterances in a dialogue. This rating was computed as the fraction of a user’s posts that contain personal pronouns, as pronouns are known to be abundant in dialogue data.
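
A minimal sketch of this rating follows; the pronoun list is illustrative, as the paper does not specify the exact lexicon.

    PERSONAL_PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our", "you", "your"}

    def dialogue_style_score(posts: list) -> float:
        """Fraction of a user's posts that contain at least one personal pronoun."""
        def has_pronoun(post: str) -> bool:
            return any(tok.strip(".,!?").lower() in PERSONAL_PRONOUNS for tok in post.split())
        return sum(has_pronoun(p) for p in posts) / len(posts)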

The test set, disjoint from the training data and created in the same manner, was further checked by manual annotators, because the above-mentioned patterns may also produce false positives: for example, the wrong profession from utterances such as “they think I am a doctor” or “I dreamed I am an astronaut”, or the wrong age and family status from “I am 10 years old boy’s mother” or “I am a single child”. The final Reddit test set consists of approximately 400 users per predicate.

Limitations. The predicates we consider are limited by the nature of our datasets. We do not consider the family status predicate for MovieChAtt, because the necessary information is often not easily available from Wikipedia articles. Similarly, we do not consider the age predicate on the PersonaChat dataset, because this attribute is not easily found in the persona descriptions. More generally, all users in our datasets are labeled with exactly one attribute value for each predicate. In a real setting it may be impossible to infer any attribute value for some users, whereas other users may have multiple correct values. We leave such scenarios for future work.

5. Experimental setup

5.1. Data

We randomly split the MovieChAtt and PersonaChat datasets into training (90%) and testing (10%) sets. The Reddit test set consists of approximately 400 users per predicate whose label extractions were manually verified by annotators. We tuned models’ hyperparameters by performing a grid search with 10-fold cross validation on the training set.

For the binary predicates family status and gender, we balanced the number of subjects in each class. For the multi-valued profession and age predicates, which have very skewed distributions, we did not balance the number of subjects in the test set. During training we performed downsampling to reduce the imbalance: each batch consisted of an equal number of training samples per class, and these samples were drawn randomly for each batch. This both removes the class imbalance in the training data and ensures that the model ultimately sees all instances during training, regardless of class size.

Note that all three datasets are somewhat biased regarding the attribute values and are not representative of real-life applications. For example, the professions in MovieChAtt are dominated by the themes of entertaining fiction, and the gender distributions are not realistic either. The data simply provides a diverse range of inputs for a fair comparison across different extraction methods.

5.2. Baselines

We consider the following baselines to compare our approach against.

Pattern matching oracle. Assuming that we have a perfect sequence tagging model (e.g., (Li et al., 2014)) that extracts a correct attribute value every time one appears in an utterance, we can determine the upper-bound performance of such sequence tagging approaches. Note that, as mentioned in the introduction, this type of model assumes attribute values explicitly appear in the text. In order to avoid vocabulary mismatches between our attribute value lists and the attribute values explicitly mentioned in the data, we augment our attribute values with synonyms identified in the data (e.g., we add terms like ‘soldier’ and ‘sergeant’ as synonyms of the profession attribute’s value ‘military personnel’). For both MRR and accuracy, a subject receives a score of 1 if the correct attribute value appears in any one of the subject’s utterances. If the correct attribute value never appears, we assume the model returns a random ordering of attribute values and use the expectation over this list (i.e., given $n$ attribute values, the subject receives a score of $\frac{1}{n}\sum_{k=1}^{n}\frac{1}{k}$ for MRR and $\frac{1}{n}$ for accuracy). This oracle method does not provide class confidence scores, so we do not report AUROC for this method.
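
For concreteness, the per-subject oracle score can be computed as below, a direct rendering of the expectation just described:

    def oracle_scores(num_values: int, value_mentioned: bool):
        """Expected (MRR, accuracy) contribution of one subject under the pattern oracle."""
        if value_mentioned:
            return 1.0, 1.0
        # Expectation over a uniformly random ordering of the num_values attribute values.
        expected_rr = sum(1.0 / rank for rank in range(1, num_values + 1)) / num_values
        expected_acc = 1.0 / num_values
        return expected_rr, expected_acc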

Embedding similarity. Given an utterance representation created by averaging the embeddings of the words within the utterance, we compute the cosine similarity between this representation and the embeddings of each attribute value.

Logistic regression. Given an averaged utterance representation (as used with embedding similarity), we apply a multinomial logistic regression model to classify the representation as one of the possible attribute values. This model obtains a ranking by ordering on the per-class probabilities of its output.

Multilayer Perceptron (MLP). Given an averaged utterance representation (as used with embedding similarity), we apply an MLP with one hidden layer of size 100 to classify the utterance representation as one of the possible attribute values. This model can be used to obtain a ranking by considering the per-class probabilities.

The embedding similarity, logistic regression, and MLP baselines are distantly supervised, because the subject’s labels are applied to each of the subject’s utterances. While this is necessary because the baselines do not incorporate the notion of a subject with multiple utterances, it results in noisy labels, because it is unlikely that every utterance contains information about the subject’s attributes. We address this issue by using a window of four concatenated utterances (the window size was determined by a grid search) as input to each of these methods. With these distantly supervised models, the label prediction scores are summed across all utterances for a single subject and then ranked.
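
The per-subject aggregation just described amounts to the following sketch, where window_scores is assumed to hold the per-window class probabilities produced by one of these baselines:

    import numpy as np

    def rank_attribute_values(window_scores: np.ndarray) -> np.ndarray:
        """Aggregate per-window prediction scores for one subject into a ranking.

        window_scores: array of shape (num_windows, num_attribute_values) holding
        class probabilities for each window of concatenated utterances.
        Returns attribute-value indices ordered from most to least likely."""
        summed = window_scores.sum(axis=0)  # sum label prediction scores across windows
        return np.argsort(-summed)          # descending order = ranking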

CNN (Bayot and Gonçalves, 2018). We consider the Convolutional Neural Network (CNN) architecture proposed by Bayot and Gonçalves for the task of predicting the age and gender of Twitter users. This approach is a simpler variant of HAM_CNN in which $f_{utter}$ is implemented with a tanh activation function and max pooling (i.e., k = 1), and $f_{obj}$ is a fully connected layer with dropout and a softmax activation function. The CNN is applied to individual utterances and the majority classification label is used for the user, which differs from the in-model aggregation performed by the HAMs.

New Groningen Author-profiling Model (N-GrAM) (Basile et al., 2017). Following the best performing system at CLEF 2017’s PAN shared task on author profiling (Francisco Manuel et al., 2017), we implemented a classification model using a linear Support Vector Machine (SVM) that utilizes character n-gram features as well as term unigrams and bigrams with sublinear tf-idf weighting.

Neural Clusters (W2V-C) (Preoţiuc-Pietro et al., 2015). We consider the best classification model reported by Preoţiuc-Pietro et al. for predicting the occupational class of Twitter users, which is a Gaussian Process (GP) classifier with neural clusters (W2V-C) as features. Neural clusters were obtained by applying spectral clustering to a word similarity matrix (computed via cosine similarity of pre-trained word embeddings) to obtain word clusters. Each post’s feature vector is then represented as the ratio of words from each cluster.

For both N-GrAM and W2V-C baselines, flattened representations of the subject’s utterances are used. That is, the model’s input is a concatenation of all of a given user’s utterances.

5.3. Hyperparameters

Hyperparameters were chosen by grid search using ten-fold cross validation on the training set. Due to the limited amount of data, we found a minibatch size of 4 users performed best on MovieChAtt and PersonaChat. All models were trained with a minibatch size of 32 on Reddit. We used 300-dimensional word2vec embeddings pre-trained on the Google News corpus (Mikolov et al., 2013) to represent the terms in utterances. We fixed the number of utterances per character and the number of terms per utterance, and truncated or zero-padded the sequences as needed.

  • HAM_avg uses a hidden layer of size 100 with the sigmoid activation function. The model was trained for 30 epochs.

  • HAM_CNN uses 178 kernels of size 2 and k-max pooling. The model was trained for 40 epochs.

  • HAM_2attn uses a sigmoid activation with both attention layers and with the prediction layer. The model was trained for 150 epochs.

  • HAM_CNN-attn uses 128 kernels of size 2. The model was trained for 50 epochs.

5.4. Evaluation metrics

For binary predicates (gender and family status), we report models’ performance in terms of accuracy. Due to the difficulty of performing classification over many attribute values, a ranking metric is more informative for the multi-valued predicates (profession and age category). We report Mean Reciprocal Rank for these predicates.

Mean Reciprocal Rank (MRR) measures the position of the correct answer in a ranked list of attribute values provided by the model. We obtain a ranking of attribute values for a movie character or a persona by considering the attribute value probabilities output by a model. Given one ranked list of attribute values per character, MRR is computed by determining the reciprocal rank of the correct attribute value in each list and then taking the average of these reciprocal ranks. We report both macro MRR, in which we first average the reciprocal ranks within each attribute value before averaging across values, and micro MRR, which averages over all subjects’ reciprocal ranks.
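
A sketch of both variants, assuming each subject’s prediction is a ranked list of attribute values:

    import numpy as np

    def micro_mrr(rankings, gold):
        """Average the reciprocal rank of the gold value over all subjects."""
        rr = [1.0 / (ranking.index(g) + 1) for ranking, g in zip(rankings, gold)]
        return float(np.mean(rr))

    def macro_mrr(rankings, gold):
        """Average reciprocal ranks within each gold attribute value first, then across values."""
        per_value = {}
        for ranking, g in zip(rankings, gold):
            per_value.setdefault(g, []).append(1.0 / (ranking.index(g) + 1))
        return float(np.mean([np.mean(v) for v in per_value.values()]))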

Area under the ROC Curve (AUROC) measures the performance of a model as a function of its true positive rate and false positive rate at varying score thresholds. We report micro AUROC for all predicates. For multi-valued predicates, we binarize the labels in a one-vs-all fashion.
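
For the multi-valued predicates, the binarization can be done as in this sketch; the use of scikit-learn is our assumption, as the paper does not name a toolkit.

    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import label_binarize

    def micro_auroc(scores, gold, classes):
        """Micro AUROC with one-vs-all binarization of the gold labels.

        scores: (num_subjects, num_classes) predicted probabilities per attribute value
        gold:   list of gold attribute values; classes: the full list of attribute values
        """
        y_true = label_binarize(gold, classes=classes)
        return roc_auc_score(y_true, scores, average="micro")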

profession gender
Models MovieChAtt PersonaChat Reddit MovieChAtt PersonaChat Reddit
MRR AU- MRR AU- MRR AU- Acc AU- Acc AU- Acc AU-
micro / macro ROC micro / macro ROC micro / macro ROC ROC ROC ROC
Embedding sim. 0.22* / 0.14* 0.60* 0.30* / 0.25* 0.63* 0.15* / 0.13* 0.59* 0.52* 0.54* 0.49 0.50 0.61* 0.60*
Logistic reg. 0.46* / 0.20* 0.76* 0.81* / 0.77* 0.58* 0.13* / 0.19* 0.57 0.59 0.62 0.86 0.93 0.69* 0.75*
MLP 0.47* / 0.20 0.75 0.86* / 0.77* 0.97 0.46* / 0.23 0.78 0.57* 0.60* 0.80 0.87 0.71 0.77
N-GrAM (Basile et al., 2017) 0.21* / 0.16* 0.62* 0.83* / 0.83 0.88 0.17* / 0.26 0.64* 0.57 0.58 0.86 0.87 0.66* 0.71*
W2V-C (Preoţiuc-Pietro et al., 2015) 0.25* / 0.13* 0.74* 0.59* / 0.46 0.89 0.27* / 0.17* 0.74* 0.62 0.66 0.73* 0.80* 0.64* 0.73*
CNN (Bayot and Gonçalves, 2018) 0.19* / 0.20* 0.66* 0.77* / 0.77* 0.81* 0.26* / 0.24* 0.76* 0.60 0.60 0.72* 0.73* 0.61* 0.61*
HAM 0.39* / 0.37* 0.81* 0.86* / 0.91* 0.98* 0.34* / 0.22* 0.82* 0.72 0.82 0.79 0.87 0.86 0.92
HAM 0.42* / 0.37 0.83 0.96* / 0.94 0.99 0.36* / 0.37* 0.86* 0.75 0.85 0.95 0.99 0.86 0.93*
HAM 0.43* / 0.50 0.85 0.90* / 0.93 0.99 0.51* / 0.40 0.9 0.77 0.84 0.96 0.97 0.85 0.94
HAM 0.39* / 0.34 0.84 0.94* / 0.93 0.99 0.43* / 0.42 0.89 0.69 0.77 0.94 0.98 0.80 0.91
Table 2. Comparison of models on all datasets for the profession and gender attributes. Results marked with * significantly differ from the best method (in bold face), as measured by a paired t-test (MRR) or McNemar’s test (Acc and AUROC).

6. Results and Discussion

6.1. Main Findings

In Tables 2, 3 and 4 we report results for the HAMs and the baselines on all datasets (MovieChAtt, PersonaChat and Reddit) for all considered attributes (profession, gender, age and family status). We do not report results for the pattern oracle baseline in the tables, as we evaluated this baseline solely on the MovieChAtt dataset: the oracle essentially replicates the way we labeled persona descriptions and posts in the PersonaChat and Reddit datasets, respectively. The pattern oracle baseline yields 0.21/0.20 micro/macro MRR for profession, 0.67 accuracy for gender, and 0.41/0.40 micro/macro MRR for age. HAMs significantly outperform this baseline, indicating that identifying explicit mentions of attribute values is insufficient in our dialogue setting.

HAMs outperform the distantly supervised models (i.e., embedding similarity, logistic regression and Multilayer Perceptron (MLP)) in the vast majority of cases. MLP and logistic regression perform best on several occasions for the profession and age attributes when micro MRR is considered. However, their macro MRR scores fall behind the HAMs’, showing that HAMs are better at dealing with multi-valued attributes with skewed distributions. The low performance of these distantly supervised methods may be related to their strong assumption that every sequence of four utterances contains information about the attribute being predicted.

Comparing with baselines from prior work, HAMs significantly outperform N-GrAM (Basile et al., 2017) in many cases, suggesting that representing utterances using word embeddings, instead of merely character and word n-grams, is important for this task. Using neural clusters (W2V-C) as features for the classification task (Preoţiuc-Pietro et al., 2015) works quite well for the age attribute, where different ‘topics’ may correlate with different age categories (e.g., ‘video game’ for teenager and ‘office’ for adult). However, W2V-C is often significantly worse for the profession, gender, and family status attributes, which may be caused by similar discriminative words (e.g., ‘husband’/‘wife’ for gender) being clustered together in the same topic. The CNN baseline (Bayot and Gonçalves, 2018) is significantly worse than the best method in the majority of cases. Furthermore, it generally performs substantially worse than HAM_CNN, further illustrating the advantage of aggregating utterances within the model.

In general, HAM_avg performs worse than the other HAMs, demonstrating that simple averaging is insufficient for representing utterances and subjects. (For the sake of brevity we neither instantiate nor report results for LSTM-based HAMs, i.e., variants in which an LSTM builds the utterance or subject representations. These models were unable to outperform the HAMs reported here, with the best variant obtaining a micro MRR of only 0.31 after a grid search on the profession predicate on MovieChAtt; cf. Table 2. This is in line with recent results suggesting that RNNs are not ideal for identifying semantic features (Tang et al., 2018).) In most cases HAM_CNN performs slightly worse than HAM_CNN-attn, demonstrating the value of exploiting an attention mechanism to combine a subject’s utterances. HAM_2attn and HAM_CNN-attn achieve the strongest performance across predicates, with HAM_2attn generally performing better. HAM_CNN-attn performs particularly well on the gender and family status attributes, where detecting bigrams may yield an advantage. For example, HAM_CNN-attn places high attention weights on terms like ‘family’ and ‘girlfriend’ where the previous term may be a useful signal (e.g., ‘my family’ vs. ‘that family’).

The gap between the baselines and the HAMs is often smaller on PersonaChat than on the other two datasets, illustrating the relative simplicity of crowdsourced dialogue compared to movie scripts or Reddit discussions. This is also supported by the fact that the maximum metric values on PersonaChat are much higher. Several factors may be responsible for this: (1) the dialogue in PersonaChat was created by crowdworkers with the goal of stating facts that were given to them, which often leads to the facts being stated in a straightforward manner (e.g., “I am an author”); (2) PersonaChat utterances are much shorter and there are far fewer utterances per character (i.e., a maximum of 4 in PersonaChat vs. a minimum of 20 in MovieChAtt), leading to a higher density of information related to attributes; and (3) the persona descriptions in PersonaChat are used for many dialogues, giving models an opportunity to learn specific personas.

age
Models MovieChAtt Reddit
MRR AU- MRR AU-
micro / macro ROC micro / macro ROC
Embedding sim. 0.45* / 0.45* 0.61* 0.55* / 0.44* 0.56*
Logistic reg. 0.65* / 0.49* 0.76 0.80* / 0.61 0.87
MLP 0.64* / 0.48* 0.83 0.78* / 0.48 0.88
N-GrAM (Basile et al., 2017) 0.69* / 0.47 0.85 0.48* / 0.53* 0.55*
W2V-C (Preoţiuc-Pietro et al., 2015) 0.67* / 0.45 0.86 0.75* / 0.51 0.88
CNN (Bayot and Gonçalves, 2018) 0.66* / 0.62* 0.83 0.68* / 0.65* 0.79*
HAM 0.62* / 0.59 0.76* 0.67* / 0.67 0.77*
HAM 0.73* / 0.63 0.84 0.73* / 0.61* 0.89*
HAM 0.73* / 0.60 0.86 0.79* / 0.68 0.90
HAM 0.74* / 0.6 0.85 0.72* / 0.6 0.82
Table 3. Comparison of models on all datasets for the age attribute. Results marked with * significantly differ from the best method (in bold face), as measured by a paired t-test (MRR) or McNemar’s test (Acc and AUROC).
family status
Models PersonaChat Reddit
Acc AUROC Acc AUROC
Embedding sim. 0.41* 0.49* 0.42* 0.47*
Logistic reg. 0.75* 0.84* 0.71 0.74
MLP 0.70 0.80 0.62* 0.60*
N-GrAM (Basile et al., 2017) 0.85 0.86 0.45* 0.47*
W2V-C (Preoţiuc-Pietro et al., 2015) 0.74* 0.82* 0.70 0.78
CNN (Bayot and Gonçalves, 2018) 0.74 0.74 0.69 0.69
HAM 0.80 0.91 0.67 0.72
HAM 0.93 0.99 0.52* 0.62*
HAM 0.92 0.98 0.70 0.78
HAM 0.88 0.94 0.64 0.67
Table 4. Comparison of models on all datasets for the family status attribute. Results marked with * significantly differ from the best method (in bold face), as measured by a paired t-test (MRR) or McNemar’s test (Acc and AUROC).

6.2. Study on word embeddings

In the previous experiments, we represented terms using embeddings from a word2vec skip-gram model trained on Google News (Mikolov et al., 2013). In this study we compare the Google News embeddings with word2vec embeddings trained on Reddit posts, GloVe (Pennington et al., 2014) embeddings trained on Common Crawl, and GloVe embeddings trained on Twitter. We also consider ELMo (Peters et al., 2018), a contextualized embedding model. To capture semantic variations, this model creates a contextualized character-based representation of words using a bidirectional language model. We use AllenNLP’s small ELMo model (https://allennlp.org/elmo) trained on the 1 Billion Word Benchmark of news crawl data from WMT 2011 (Chelba et al., 2013).

Given space limitations and the higher model variance on the profession attribute on MovieChAtt, we restrict the study to this predicate and corpus. We evaluated the two best performing HAMs, i.e., HAM_2attn and HAM_CNN-attn. Table 5 shows the results obtained with the various embedding methods trained on different datasets. Performance does not greatly vary across embedding models and training corpora, with Google News embeddings performing best in terms of macro MRR and Reddit embeddings performing best in terms of micro MRR. Despite their strong performance on some NLP tasks, the ELMo contextualized embeddings do not yield a performance boost for any method or metric. We view this observation as an indicator that the choice of term embedding method matters less for this task than the method used to combine terms into an utterance representation.

Model Corpus HAM HAM
MRR AU- MRR AU-
micro / macro ROC micro / macro ROC
word2vec Google News 0.42 / 0.44 0.77 0.39 / 0.37 0.83
(skip-gram) Reddit 0.43 / 0.37 0.82 0.50 / 0.37 0.83
GloVe Common Crawl 0.40 / 0.37 0.76 0.40 / 0.39 0.82
Twitter 0.39 / 0.35 0.67 0.36 / 0.34 0.81
ELMo WMT News 0.38 / 0.32 0.76 0.37 / 0.37 0.83
Table 5. Comparison of embedding models trained on different datasets, for identifying profession attribute.

6.3. Ablation study

We performed an ablation study in order to determine the performance impact of the HAMs’ components. Given space limitations and the higher model variance on the profession attribute on MovieChAtt, we restrict the study to this predicate and dataset. Ablation results for HAM_2attn using cross validation on the training set are shown in Table 6. Replacing either representation function (i.e., $f_{utter}$ or $f_{subj}$) with an averaging operation reduces performance, as shown in the last two lines. Attention on utterance representations is slightly more important in terms of MRR, but both types of attention contribute to HAM_2attn’s performance. Similarly, removing both types of attention corresponds to HAM_avg, which consistently underperforms HAM_2attn in Tables 2, 3 and 4.

Removing attention from HAM_CNN-attn yields HAM_CNN, which consistently performs worse than HAM_CNN-attn in Tables 2, 3 and 4, supporting the observation that attention is important for performance on our task. Intuitively, attention provides a useful signal because it allows the model to focus on only those terms that provide information about an attribute.

(a) profession: military personnel
(b) age (category): child
Figure 2. Attention visualization for profession and age predicates on MovieChAtt.
(a) gender: female
(b) family status: married
Figure 3. Attention visualization for gender and family status predicates on Reddit.
MRR AUROC
micro / macro
HAM_2attn 0.57 / 0.42 0.84
− attention on terms 0.49 / 0.40 0.81
− attention on utterances 0.48 / 0.34 0.82
Table 6. Ablation study for the profession attribute.

6.4. Case study on attention weights

In order to illustrate the types of terms the model is looking for, we display HAM_2attn’s term and utterance weights for the profession and age attributes (on MovieChAtt) in Figure 2, as well as for the gender and family status attributes (on Reddit) in Figure 3. While HAM_2attn is sometimes outperformed by HAM_CNN-attn, it is more interpretable because individual terms are considered in isolation. Note that all these dialogue snippets come from the specific datasets in our experiments, partly from fictional conversations. Some are skewed and biased, and not representative of the envisioned downstream applications.

When predicting military personnel as the profession (Figure 2(a)), the model focuses on military terms such as mission, guard, barracks, and colonel. When predicting child as the age category (Figure 2(b)), on the other hand, the model focuses on terms a child is likely to use, such as pet, mommy, and daddy. According to Reddit posts, female gender is suggested by terms such as boyfriend, pms and jewelry (Figure 3(a)). Meanwhile, married users were identified through terms such as dated, fiance and divorces, along with obvious terms like marrying and marriages (Figure 3(b)). These examples illustrate how the model is able to infer a subject’s attribute by aggregating signals across utterances.

In addition to looking at specific utterances, we investigated which terms the model most strongly associates with a specific attribute. To do so, we computed attribute value probabilities for each term in the corpus and kept the top five terms for each attribute value. The results using HAM_2attn are shown in Table 7, which is divided into words that appear informative (top section) and words that do not (bottom section). In the case of informative words, there is a clear relationship between the words and the corresponding profession. Many of the uninformative words appear to be movie-specific, such as names (e.g., xavier, leonard) and terms related to a waiter’s role in a specific movie (e.g., rape, stalkers). Reducing the impact of setting-specific signals like this is one direction for future work.

profession significant words
scientist characteristics, theory, mathematical, species, changes
politician governors, senate, secretary, reporters, president
detective motel, spotted, van, suitcase, parked
military personnel captured, firepower, guard, soldiers, attack
student playing, really, emotional, definitely, unbelievable
photographer xavier, leonard, collins, cockatoo, burke
waiter rape, stalkers, murdered, overheard, bothering
Table 7. Top-5 words from HAM_2attn characterizing each profession.
Models profession gender age
MRR AU- Acc AU- MRR AU-
micro / macro ROC ROC micro / macro ROC
HAM 0.19 / 0.18 0.58 0.56 0.58 0.57 / 0.54 0.69
HAM 0.21 / 0.21 0.67 0.61 0.64 0.45 / 0.41 0.45
Table 8. Transfer learning performance of pre-trained Reddit models on MovieChAtt.

6.5. Insights on transfer learning

To investigate the robustness of our trained HAMs, we tested the best performing models (i.e., HAM_2attn and HAM_CNN-attn) on a transfer learning task between our datasets. Specifically, we leveraged user-generated social media text (Reddit), which is available in abundance, for inferring personal attributes of subjects in speech-based dialogues (MovieChAtt and PersonaChat). We report the results in Tables 8 and 9, respectively.

While the scores on PersonaChat are low compared to those in Tables 2 and 4, the HAMs’ performance on MovieChAtt is often comparable with the baselines’ performance in Tables 2, 3 and 4. This difference may be caused by the fact that PersonaChat is a smaller, more synthetic dataset, as discussed in Section 6.1. On MovieChAtt with the profession predicate, both HAMs match the performance of all six baselines in terms of macro MRR. Similarly, one of the HAMs matches the performance of five of the six baselines on the gender predicate (accuracy), and one matches the performance of four of the six baselines on the age predicate (macro MRR). Particularly for the profession attribute, missing training subjects in the Reddit dataset for certain professions such as astronaut or monarch contribute to the decreased performance, although the models still make a reasonable prediction of scientist for astronaut. The methods do not perform as well in terms of micro MRR, which may be due to the substantially different attribute value distributions between datasets (i.e., the professions common in movies are uncommon on Reddit). Improving the HAMs’ transfer learning performance is a direction for future work.

Models profession gender family status
MRR AU- Acc AU- Acc AU-
micro / macro ROC ROC ROC
HAM 0.20 / 0.16 0.58 0.52 0.50 0.74 0.74
HAM 0.21 / 0.18 0.71 0.51 0.54 0.62 0.64
Table 9. Transfer learning performance of pre-trained Reddit models on PersonaChat.

6.6. Profession misclassification study

In this section we investigate common misclassifications on the MovieChAtt dataset for the profession predicate, which is both the most challenging predicate and the one with the most possible object values (i.e., 43 on MovieChAtt). A confusion matrix for HAM is shown in Figure 4. Dotted lines indicate several interesting misclassifications: policemen are often confused with detectives and special agents (red line); scientists are confused with astronauts (yellow line), because sci-fi films often feature characters who arguably serve in both roles; and a child is often labeled as a student or a housewife (green line), because they sometimes use similar terms (e.g., ‘school’ is used by both children and students, and ‘mommy’ is used by both children and housewives). Finally, many occupations are confused with criminal, which is the most common profession in MovieChAtt.

Figure 4. Confusion matrix computed with HAM. True positives are not shown. Darker squares indicate more misclassifications.

To further compare the performance of the model on the direct and transfer learning tasks, we computed the confusion matrix for HAM trained on the Reddit dataset and tested on MovieChAtt. Interesting misclassifications include the following: artistic professions (actor, painter, musician, director) are often confused with one another (red lines); banker is confused with manager (green lines); policeman and airplane pilot are confused with military personnel (purple lines); and stewardess is often confused with nurse, as both involve caring and serving tasks (yellow line).

Figure 5. Confusion matrix for HAM trained on Reddit and tested on MovieChAtt. True positives are not shown. Darker squares indicate more misclassifications.

7. Conclusion

We proposed Hidden Attribute Models (HAMs) for inferring personal attributes, such as a person’s profession, from conversations. We demonstrated the viability of our approach in extensive experiments considering several attributes on three datasets with diverse characteristics: Reddit discussions, movie script dialogues and crowdsourced conversations. We compared HAMs against a variety of state-of-the-art baselines and achieved substantial improvements over all of them, most notably with the HAM_2attn and HAM_CNN-attn variants. We also demonstrated that the attention weights assigned by our methods provide informative explanations of the computed output labels.

As a stress test for our methods, we investigated transfer learning by training HAMs on one dataset and applying the learned models to other datasets. Although we observed some degradation in output quality compared to training on in-domain data, it is noteworthy that the transferred HAMs matched the MovieChAtt performance of the baselines trained on in-domain data. We plan to further investigate this theme of transfer learning in our future work on constructing personal knowledge bases and leveraging them for personalized Web agents.

References

  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR’15.
  • Basile et al. (2017) Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-GrAM: New Groningen Author-profiling Model—Notebook for PAN at CLEF 2017. In Working Notes Papers of the CLEF 2017 Evaluation Labs.
  • Bayot and Gonçalves (2018) Roy Khristopher Bayot and Teresa Gonçalves. 2018. Age and Gender Classification of Tweets Using Convolutional Neural Networks. In Machine Learning, Optimization, and Big Data. Springer International Publishing, Cham.
  • Burger et al. (2011) John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating Gender on Twitter. In Proceedings of EMNLP’11.
  • Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. Technical Report. Google.
  • Chen et al. (2016) Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016. End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding. In Proceedings of Interspeech’16. https://doi.org/10.21437/Interspeech.2016-312
  • Danescu-Niculescu-Mizil and Lee (2011) Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
  • Fabian et al. (2015) Benjamin Fabian, Annika Baumann, and Marian Keil. 2015. Privacy on Reddit? Towards Large-scale User Classification. In Proceedings of ECIS’15.
  • Flekova et al. (2016a) Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016a. Analyzing Biases in Human Perception of User Age and Gender from Text. In Proceedings of ACL’16 (Volume 1: Long Papers).
  • Flekova et al. (2016b) Lucie Flekova, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2016b. Exploring Stylistic Variation with Age and Income on Twitter. In Proceedings of ACL’16 (Volume 2: Short Papers).
  • Francisco Manuel et al. (2017) Francisco Manuel, Rangel Pardo, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Working Notes Papers of the CLEF 2017 Evaluation Labs.
  • Garera and Yarowsky (2009) Nikesh Garera and David Yarowsky. 2009. Modeling Latent Biographic Attributes in Conversational Genres. In Proceedings of ACL/IJCNLP’09.
  • Ghazvininejad et al. (2018) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen tau Yih, and Michel Galley. 2018. A Knowledge-Grounded Neural Conversation Model. In Proceedings of AAAI’18.
  • Gjurković and Šnajder (2018) Matej Gjurković and Jan Šnajder. 2018. Reddit: A Gold Mine for Personality Prediction. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, NAACL-HLT’18.
  • Jing et al. (2007) Hongyan Jing, Nanda Kambhatla, and Salim Roukos. 2007. Extracting Social Networks and Biographical Facts From Conversational Speech Transcripts. In Proceedings of ACL’07.
  • Joshi et al. (2017) Chaitanya K. Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in Goal-Oriented Dialog. In Proceedings of Conversational AI Workshop, NIPS’17.
  • Kim et al. (2017) Sunghwan Mac Kim, Qiongkai Xu, Lizhen Qu, Stephen Wan, and Cecile Paris. 2017. Demographic Inference on Twitter using Recursive Neural Networks. In Proceedings of ACL’17 (Volume 2: Short Papers).
  • Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP’14.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR’15.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of ACL’16 (Volume 1: Long Papers).
  • Li et al. (2014) Xiang Li, Gökhan Tür, Dilek Z. Hakkani-Tür, and Qi Li. 2014.

    Personal knowledge graph population from user utterances in conversational understanding. In

    Proceedings of IEEE Spoken Language Technology Workshop (SLT).
  • Lin and Walker (2011) Grace I. Lin and Marilyn A. Walker. 2011. All the World’s a Stage: Learning Character Models from Film. In Proceedings of AIIDE’11.
  • Madotto et al. (2018) Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems. In Proceedings of ACL’18 (Volume 1: Long Papers).
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of NIPS’13.
  • Mo et al. (2018) Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, and Qiang Yang. 2018.

    Personalizing a Dialogue System With Transfer Reinforcement Learning. In

    Proceedings of AAAI’18.
  • Pennacchiotti and Popescu (2011) Marco Pennacchiotti and Ana-Maria Popescu. 2011. A Machine Learning Approach to Twitter User Classification. In Proceedings of ICWSM’11.
  • Pennebaker et al. (2001) James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71 (2001).
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP’14.
  • Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of NAACL’18, Volume 1 (Long Papers).
  • Potthast et al. (2017) Martin Potthast, Francisco Rangel, Michael Tschuggnall, Efstathios Stamatatos, Paolo Rosso, and Benno Stein. 2017. Overview of PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International Conference of the CLEF Initiative (CLEF 17). Berlin Heidelberg New York.
  • Preoţiuc-Pietro et al. (2015) Daniel Preoţiuc-Pietro, Vasileios Lampos, and Nikolaos Aletras. 2015. An analysis of the user occupational class through Twitter content. In Proceedings of ACL/IJCNLP’15 (Volume 1: Long Papers).
  • Preoţiuc-Pietro et al. (2017) Daniel Preoţiuc-Pietro, Ye Liu, Daniel Hopkins, and Lyle Ungar. 2017. Beyond Binary Labels: Political Ideology Prediction of Twitter Users. In Proceedings of ACL’17 (Volume 1: Long Papers).
  • Preoţiuc-Pietro and Ungar (2018) Daniel Preoţiuc-Pietro and Lyle Ungar. 2018. User-Level Race and Ethnicity Predictors from Twitter Text. In Proceedings of COLING’18.
  • Rao et al. (2010) Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying Latent User Attributes in Twitter. In Proceedings of SMUC’10.
  • Sap et al. (2014) Maarten Sap, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. Developing Age and Gender Predictive Lexica over Social Media. In Proceedings EMNLP’14.
  • Schwartz et al. (2013) H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E. P. Seligman, and Lyle H. Ungar. 2013. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. In PloS one.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of NAACL-HLT’15.
  • Tan et al. (2018) Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep Semantic Role Labeling With Self-Attention. In Proceedings of AAAI’18.
  • Tang et al. (2018) Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP’18.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Vijayaraghavan et al. (2017) Prashanth Vijayaraghavan, Soroush Vosoughi, and Deb Roy. 2017. Twitter Demographic Classification Using Deep Multi-modal Multi-task Learning. In Proceedings of ACL’17 (Volume 2: Short Papers).
  • Wakabayashi et al. (2016) Kei Wakabayashi, Johane Takeuchi, Kotaro Funakoshi, and Mikio Nakano. 2016. Nonparametric Bayesian Models for Spoken Language Understanding. In Proceedings of EMNLP’16.
  • Xing et al. (2018) Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou. 2018. Hierarchical Recurrent Attention Network for Response Generation. In Proceedings of AAAI’18.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of NAACL-HLT’16. https://doi.org/10.18653/v1/N16-1174
  • Yao et al. (2016) Kaisheng Yao, Baolin Peng, Geoffrey Zweig, and Kam-Fai Wong. 2016. An Attentional Neural Conversation Model with Improved Specificity. CoRR abs/1606.01292 (2016).
  • Yin et al. (2016) Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. TACL 4 (2016).
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too?. In Proceedings of ACL’18 (Volume 1: Long Papers).
  • Zhou et al. (2016) Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016.

    Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In

    Proceedings of ACL’16 (Volume 2: Short Papers).