Dialogue Modeling Via Hash Functions: Applications to Psychotherapy

04/26/2018 · by Sahil Garg, et al.

We propose a novel machine-learning framework for dialogue modeling which uses representations based on hash functions. More specifically, each person's response is represented by a binary hashcode where each bit reflects presence or absence of a certain text pattern in the response. Hashcodes serve as compressed text representations, allowing for efficient similarity search. Moreover, hashcode of one person's response can be used as a feature vector for predicting the hashcode representing another person's response. The proposed hashing model of dialogue is obtained by maximizing a novel lower bound on the mutual information between the hashcodes of consecutive responses. We apply our approach in psychotherapy domain, evaluating its effectiveness on a real-life dataset consisting of therapy sessions with patients suffering from depression.


1 Introduction

Correspondence Address: sahil.garg.cs@gmail.com

Dialogue modeling and generation is an active research area of great practical importance as it provides a solid basis for building successful conversational agents in a wide range of applications. However, despite recent successes of deep neural dialogue models, the open dialog generation problem is far from being solved.

Therefore, it is important to continue exploring novel types of models and model-selection criteria, beyond today's deep neural dialogue systems, in order to better capture the structure of different types of dialogues and to overcome certain limitations of neural models, including dependence on large training datasets, long training times, and difficulties incorporating non-standard objective functions, among others. Moreover, different applications may possess specific properties which suit some approaches better than others.

In this work, one of the motivating applications is the fast-growing area of (semi-)automated psychotherapy: easily accessible, round-the-clock psychotherapeutic services provided by a conversational agent. The importance of this area cannot be overestimated: according to recent statistics, mental health disorders affect one in four adult Americans, one in six adolescents, and one in eight children; the World Health Organization predicts that by 2030, the amount of worldwide disability and life loss attributable to depression may become greater than that of any other condition, including cancer, stroke, heart disease, accidents, and war.

However, many people do not receive adequate treatment. One of the major factors here is the limited availability of mental health care professionals relative to the number of potential patients; thus, automating at least some aspects of the treatment is a promising direction.

One of the domain-specific challenges in automated therapy is the difficulty of obtaining large training datasets, which are often necessary for neural dialogue models; this limitation may require developing alternative approaches. Another domain-specific property of therapeutic dialogues, which can potentially simplify dialogue generation, is the classical pattern of relatively long patient utterances (up to thousands of words) followed by much shorter therapist responses. Therapist responses are often high-level, generic statements, confirming and/or summarizing the patient's responses; they can be viewed as semantic "labels" to be predicted from the patient's "input samples".

Furthermore, a therapy session is typically an example of a collaborative dialogue, unlike debates, political arguments, and so on. Indeed, a fundamental concept in psychotherapy is the working alliance between the therapist and the patient [Bordin1979]. The alliance involves agreement on the goals to be achieved and the tasks to be carried out, and the bond, trust and respect to be established over the course of the therapy. While an encompassing formalization of the working alliance is a challenging task, we propose maximizing mutual information (infogain; we use the two terms as synonyms) between the patient's and therapist's responses as a simple criterion that helps capture, to some extent, the dynamics of agreement expected to develop in most therapies and, more generally, in other types of dialogues. (Note that the imbalanced response lengths between the two participants, as well as the collaborative property, are shared with some other types of dialogues, e.g., TV show interviews such as the Larry King dataset analyzed in this paper, where the guest of a show produces long responses, with the host inserting relatively short comments facilitating the interview.) Furthermore, maximizing mutual information between the patient's and therapist's responses can improve the predictability of the latter from the former, thus facilitating better dialogue generation.

Motivated by the above considerations, we introduce here a novel dialogue modeling framework where responses are represented as locality-sensitive binary hashcodes [Kulis and Grauman2009, Joly and Buisson2011, Garg et al.2019], and the hashing models are optimized using a novel mutual-information lower bound, since exact mutual information computation is intractable in high-dimensional spaces. Using hashcode representations may allow for a more tractable way of predicting responses in a compressed, general representation space instead of direct generation of textual responses. (Hashcode representations were previously applied successfully in work on information extraction [Garg et al.2019].) Once the compressed representation of the response is inferred, any separately trained generative model can be plugged in to produce the actual textual response. (Herein we used a quick and easy approach to text generation, simply selecting the nearest neighbor of the inferred response from the training dataset; the main objective was to validate the proposed representation learning and representation prediction techniques, and exploring better text-generation models remains a topic of future work.) Note that separating response inference in the representation space from the actual text generation increases the method's flexibility, while the mutual information criterion facilitates better alignment between the responses of the two subjects and higher predictability of the proper response. It is also important to note that, while the psychotherapy domain was our primary motivation, the proposed approach is generally applicable to a wider range of domains, as demonstrated in the empirical section.

Overall, our key contributions include: (1) a novel generic framework for dialogue modeling and generation using locality-sensitive hash functions; (2) a novel lower bound on the Mutual Information (MI) between the hashcodes of the responses from the two agents, used as an optimization criterion for the locality-sensitive hash functions; (3) an extensive empirical evaluation on three different dialogue domains, from depression therapy to TV show interviews and Twitter data, demonstrating advantages of our approach over state-of-the-art neural-network-based dialog systems, both in terms of the quality of generated responses (especially on relatively small datasets of thousands of samples, including therapy sessions and Larry King TV interviews) and in terms of computational efficiency, reducing training time from days or even weeks (e.g., on the near-million-sample Twitter dataset) to a few hours.

2 Related Work

Therapy chatbots, such as Woebot [Fitzpatrick et al.2017] and similar systems, are becoming increasingly popular; however, these agents have limited ability to understand free text and have to resort to a fixed set of preprogrammed responses to choose from [Di Prospero et al.2017, Ly et al.2017, Schroeder et al.2018, Morris et al.2018, Hamamura et al.2018]. (Also, see [Jurafsky and Martin2014] for an overview.)

For dialogue modeling in general domains, several recently proposed neural-network-based approaches are considered state-of-the-art [Serban et al.2016b, Serban et al.2017, Shao et al.2017, Asghar et al.2017, Wu et al.2017]. However, they usually require very large training datasets, unavailable in many practical applications; furthermore, they are not typically explored in dialogue settings, such as therapy, that include very long responses (up to thousands or even tens of thousands of words). Also, evaluating the effectiveness of the therapist's response requires some notion of relevance (e.g., mutual information) which goes beyond the standard measures of its semantic features [Papineni et al.2002, Liu et al.2016, Li and Jurafsky2016, Lowe et al.2017, Li et al.2017].

Unlike task-driven dialogue [Zhai and Williams2014, Wen et al.2016, Althoff et al.2016, Lewis et al.2017, He et al.2017], an immediate response quality metric may not be available in our setting, since the effect of therapy is harder to evaluate and multiple sessions are often required to achieve the desired outcome. Attention to specific parts of the response, as well as background knowledge, explored in neural-network-based dialogue modeling [Kosovan et al.2017], can be helpful in therapeutic dialogues; those aspects are, to some extent, implicitly captured by learning the hashing models. Note that in the related work of [Bartl and Spanakis2017], hashcodes are only used for nearest-neighbor search, instead of serving as the response representation as proposed here.

In [He et al.2018], an approach to task-driven (e.g., negotiation) dialogue is presented which maps a response to an ordered list of rules, where each rule represents a task-specific intent; this does not apply to our more open-ended dialogues, which lack such specific tasks.

While mutual information has previously been used in dialogue modeling at test time [Li et al.2015], it was not used as a model-selection criterion for learning representations. Moreover, the popular BLEU score [Papineni et al.2002] does not measure relevance between the responses, but rather tries to capture all information when comparing the ground truth with the produced text.

Finally, there are multiple approaches for estimating mutual information from data [Kraskov and Grassberger2004, Koeman and Heskes2014, Singh and Póczos2014, Gao et al.2015]. However, these estimators are computationally expensive in high-dimensional settings, and can be quite inaccurate when the number of samples is small. A recent approach performs neural estimation of mutual information between two high-dimensional continuous variables, but it is not applicable to discrete variables such as hashcodes [Belghazi et al.2018]. For discrete variables, theoretical analysis has been limited to the one-dimensional case [Jiao et al.2017]. Previously, several mutual information lower bounds were proposed for classification problems [Chalk et al.2016, Gao et al.2016, Alemi et al.2017], assuming a one-dimensional class label; unfortunately, they do not apply in our setting, where the predicted response is a high-dimensional vector.

3 Problem Formulation and Background

We now present a novel framework for dialogue modeling using binary hash functions. We will refer to the two dialogue agents as the patient and the therapist, respectively, although the approach is generally applicable to a wider variety of dialogue settings, as demonstrated later in the empirical section on datasets such as TV show interviews and Twitter dialogues.

3.1 Problem Formulation and Approach Overview

We consider a dialogue dataset of N samples, {(S^p_i, S^t_i) : i = 1, …, N}, where each sample is a pair of a patient response S^p_i and the corresponding therapist response S^t_i; we will also write S^p and S^t for the patient and therapist responses when the sample index is not needed. Each response is a natural language structure which can be simply a text, a text with part-of-speech (PoS) tags, or a syntactic/semantic parse of the text.

Given a response S^p, the dialogue generation task is to produce the response S^t. We approach this task as a three-stage problem: first, we learn a representation model, based on locality-sensitive hashing, which maps each text response S into a binary hashcode vector c(S); second, we train a classifier to infer the therapist's hashcode c^t given the patient's hashcode c^p, so that the inference takes place in the abstract representation space; the hashcode representation aims at capturing, in a compressed form, the semantic essence of the responses while leaving out irrelevant details; finally, we produce a textual response based on the predicted hashcode representation. (As mentioned earlier, we simply choose the nearest hashcode among the training samples; sublinear algorithms can be used for efficient similarity search in Hamming spaces [Norouzi et al.2014, Komorowski and Trzcinski2017]. Alternatives include various unsupervised generative models, e.g. [Bowman et al.2015, Jozefowicz et al.2016, Semeniuta et al.2017, Yu et al.2017].)

Our objective is to choose a hashcode-based text representation model so that consecutive responses of the dialogue participants are maximally relevant to each other, as measured by the mutual information between the corresponding hashcode representations; from another perspective, this also makes the second person's response more predictable given the first person's response.

3.2 Background: Locality Sensitive Hashing

The main idea behind locality-sensitive hashing is that similar data points are assigned hashcodes within a short Hamming distance of each other, and vice versa [Grauman and Fergus2013, Zhao et al.2014, Wang et al.2017]. Such hashcodes can be used as generalized representations of language structures, e.g., responses of dialogue participants. There are multiple hash functions proven to be locality sensitive [Wang et al.2017]. Recently, several kernel-based locality-sensitive hashing approaches have been developed that are applicable to natural language processing [Kulis and Grauman2009, Joly and Buisson2011, Garg et al.2019]. These techniques rely on a convolution kernel similarity function K(S_i, S_j; θ), defined for any pair of structures S_i and S_j, with kernel parameters θ [Srivastava et al.2013, Mooney and Bunescu2005, Haussler1999].

In order to construct hash functions mapping textual responses to hashcodes, we first select from a training dataset a random subset S_R of text structures (responses) of size M, called a reference set. Further, let h_1(S), …, h_H(S) denote a set of H binary-valued hash functions, and let c(S) denote the vector (h_1(S), …, h_H(S)). The hashcode representation of a response S is then given as c(S).

To generate a hash function h_l for each bit l, we first select a random subset of the reference set S_R. Next, we assign label 0 to a randomly selected portion of that subset's elements and label 1 to the remaining elements, creating an artificial binary-labeled training dataset which can then be fed into any binary classifier to learn a function h_l. We generate H such random splits of the reference set and learn the corresponding H binary classifiers, i.e. hash functions; with a Max-Margin (SVM) classifier this yields the approach we call LSH-RMM. We also tried a k-nearest-neighbor (kNN) classifier, resulting in the hashing approach we refer to as LSH-RkNN.
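To make the construction concrete, here is a minimal sketch of generating such hash functions, assuming a precomputed kernel matrix over the reference set and a max-margin (SVM) classifier per bit (the LSH-RMM variant); the half/half split size, function names, and the rng argument are illustrative assumptions rather than specifics from the paper:

import numpy as np
from sklearn.svm import SVC

def fit_hash_functions(K_ref, H, alpha, rng):
    """Learn H binary hash functions from random splits of a reference set.
    K_ref is the (M, M) kernel matrix between reference-set responses; each
    response is represented by its row of kernel similarities to the set."""
    M = K_ref.shape[0]
    hash_fns = []
    for _ in range(H):
        idx = rng.choice(M, size=alpha, replace=False)  # random subset of S_R
        y = np.zeros(alpha, dtype=int)
        y[rng.choice(alpha, size=alpha // 2, replace=False)] = 1  # random binary split
        hash_fns.append(SVC(C=1.0, kernel="linear").fit(K_ref[idx], y))
    return hash_fns

def hashcode(k_row, hash_fns):
    """Map a response, given its M kernel similarities to the reference set,
    to an H-bit binary hashcode."""
    return np.array([h.predict(k_row[None, :])[0] for h in hash_fns], dtype=np.uint8)

Hashing a new response then costs M kernel evaluations plus H classifier predictions, in line with the complexity discussion below.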

Overall, to obtain a hashcode of a given response S, we must compute M kernel similarities K(S, S_i; θ) for S_i in S_R. For a limited size M of the reference set S_R, hashcodes can be computed efficiently, with computational cost linear in M; also, note that the LSH techniques described above are easily parallelizable.

Finally, our LSH-RLSTM model uses an LSTM language model for generating hashcodes; no reference-set optimization is required here, since the LSTM easily handles large training datasets; however, other hyperparameters, including the network's architecture, need to be optimized.

4 Learning Hashcode Representations

Given that each specific hashing model described above involves several model-selection choices, our task will be to optimize those choices using the information-theoretic criterion proposed below.

Optimizing LSH Model Parameters. As per the discussion of LSH above, an LSH model involves the function c(S; θ, S_R) for mapping text responses to hashcodes, where each hash function h_l is built from a random subset of the reference set S_R using either a kernel-based (kNN, SVM) or a neural network (LSTM) classifier. For kernel-based LSH, θ are the parameters of the convolution kernel similarity function K(S_i, S_j; θ). For neural hashing (LSH-RLSTM), θ refers to the neural architecture hyperparameters (number of layers, number of units in a layer, type of units, etc.); θ also includes LSH-specific parameters such as the hashcode length H and the reference-set size M.

When learning LSH models on a training dataset, the (hyper)parameters θ as well as the reference set S_R are optimized with respect to the information-theoretic objective introduced below. Namely, for LSH-RkNN and LSH-RMM, the kernel parameters θ are optimized via grid search. For LSH-RLSTM, θ reflects the neural architecture, i.e. the number of layers and the number of units in each layer, optimized by greedy search. Similarly, S_R is also constructed via a greedy algorithm.
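As an illustration of what such a greedy construction of S_R might look like, the following sketch grows the reference set one response at a time; score_fn is a hypothetical helper, assumed to rebuild the hash functions on the candidate reference set, hash a held-out set of response pairs, and return the MI lower bound:

def greedy_reference_set(candidates, M, score_fn):
    """Greedily grow the reference set S_R: at each step, add the candidate
    response that most improves the MI lower bound of the induced model."""
    ref_set, remaining = [], list(candidates)
    while len(ref_set) < M and remaining:
        best, best_score = None, float("-inf")
        for cand in remaining:
            score = score_fn(ref_set + [cand])
            if score > best_score:
                best, best_score = cand, score
        ref_set.append(best)
        remaining.remove(best)
    return ref_set

Each greedy step re-scores every remaining candidate, so the candidate pool would typically be kept small, e.g. a random sample of the training data.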

Info-theoretic Objective Function. The objective function for hashcode-based model selection in dialog generation should (1) characterize the quality of hashcodes as generalized/compressed representations of dialogue responses and (2) favor representation models leading to higher-accuracy response generation.

Mutual information I(S^p; S^t) between the dialog responses S^p (e.g., patient) and S^t (e.g., therapist) is a natural candidate objective, as it implies higher predictability of one response from another. However, it is hard to compute in practice, as the joint distribution over all pairs of textual responses is not available. We can, however, attempt to approximate it using hashcode representations. If c(·) represents a function from the space of all statements to the hashcode space, then the data processing inequality implies that I(S^p; S^t) ≥ I(c(S^p); c(S^t)), and maximizing the quantity on the right can be more computationally feasible.

Thus we will maximize the mutual information (MI) between the response hashcodes over the LSH model parameters; it turns out that MI reflects both the inference accuracy and the representation quality, as we will see below:

    θ*, S_R* = argmax_{θ, S_R} I(C^t; C^p),      (1)
    I(C^t; C^p) = H(C^t) − H(C^t | C^p).         (2)

Herein, C^p and C^t are the multivariate binary random variables associated with the hashcodes of patient and therapist responses, respectively. Minimizing the conditional entropy H(C^t | C^p) improves the predictive accuracy when inferring the therapist's response hashcode, while maximizing the entropy term H(C^t) should ensure good quality of the hashcodes as generalized representations of text responses; thus, the MI objective satisfies both criteria stated at the beginning of this section.
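To see why a bound is needed at all, consider the naive plug-in estimate of Eq. (2) above; it is feasible only for toy code lengths, since the number of joint hashcode configurations grows exponentially in H (a sketch for illustration, not part of the proposed method):

import numpy as np
from collections import Counter

def plugin_mi(Cp, Ct):
    """Naive plug-in estimate of I(C^t; C^p) from paired hashcode samples
    (arrays of H-bit rows). The empirical joint over 2^(2H) possible
    configurations is hopeless for H = 100, motivating the lower bound
    developed in Section 4.1."""
    def entropy(counter, n):
        p = np.array(list(counter.values()), dtype=float) / n
        return float(-(p * np.log2(p)).sum())
    n = len(Cp)
    kp = [np.asarray(c).tobytes() for c in Cp]
    kt = [np.asarray(c).tobytes() for c in Ct]
    # I(C^t; C^p) = H(C^t) + H(C^p) - H(C^t, C^p)
    return (entropy(Counter(kt), n) + entropy(Counter(kp), n)
            - entropy(Counter(zip(kp, kt)), n))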

4.1 Information-Theoretic Bounds

Since computing mutual information between two high-dimensional variables can be both computationally expensive and inaccurate when the number of samples is small [Kraskov and Grassberger2004, Walters-Williams and Li2009, Singh and Póczos2014, Gao et al.2015], we develop a novel lower bound on the mutual information which is easy to compute. For derivation details, see the Supplementary material (https://tinyurl.com/y6gefz8k).

We will first introduce the information-theoretic quantity called Total Correlation [Watanabe1960], TC(X) = Σ_j H(X_j) − H(X), which captures the non-linear correlation among the dimensions of a random variable X; given an additional random variable Y, the conditional total correlation is defined as TC(X | Y) = Σ_j H(X_j | Y) − H(X | Y), and TC(X; Y) = TC(X) − TC(X | Y) measures the amount of total correlation in X explained by Y.

Theorem 1 (Lower Bound on Mutual Information).

Mutual information between two random hashcode variables, I(C^t; C^p), can be bounded from below as follows:

    I(C^t; C^p) ≥ Σ_{j=1}^H H(C^t_j) − TC(C^t; Y) + Σ_{j=1}^H 〈log q(C^t_j | C^p)〉,

where 〈·〉 denotes expectation w.r.t. the joint distribution of the hashcodes. Herein, TC(C^t; Y) describes the total correlation within C^t that can be explained by a latent variable representation Y (learned so as to account for the correlations among the bits of C^t); q(C^t_j | C^p) is a proposal conditional distribution for the j-th bit of the hashcode C^t, predicted using a probabilistic classifier, such as a Random Forest model.

As discussed in [Ver Steeg and Galstyan2014], TC(C^t; Y) can be computed efficiently.

Note that the first two terms in the MI lower bound contribute to improving the quality of hashcodes as response representations, maximizing entropy of each hashcode bit while discouraging redundancies between the bits, while the last term containing conditional entropies aims at improving inference of individual hashcode bits.
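A sketch of how the tractable terms of the bound can be estimated from paired hashcode samples follows; the per-bit proposal distributions q are fit here with Random Forests, while the TC(C^t; Y) term is assumed to be supplied by an external CorEx-style latent-factor model [Ver Steeg and Galstyan2014]; names and the in-sample likelihood shortcut are illustrative assumptions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mi_lower_bound(Cp, Ct, tc_term=0.0):
    """Estimate the MI lower bound from paired 0/1 hashcodes of shape (n, H):
    sum of per-bit marginal entropies H(C^t_j), minus the externally supplied
    total-correlation term TC(C^t; Y), plus the average per-bit log-likelihood
    under the Random Forest proposals q(C^t_j | C^p). In-sample likelihoods
    are used for brevity; held-out estimates are preferable in practice."""
    n, H = Ct.shape
    eps = 1e-12
    p1 = Ct.mean(axis=0)
    bit_entropy = -(p1 * np.log(p1 + eps) + (1 - p1) * np.log(1 - p1 + eps)).sum()
    loglik = 0.0
    for j in range(H):
        if len(np.unique(Ct[:, j])) < 2:
            continue  # constant bit: zero entropy, perfectly predictable
        q = RandomForestClassifier(n_estimators=100).fit(Cp, Ct[:, j])
        proba = q.predict_proba(Cp)[np.arange(n), Ct[:, j]]
        loglik += np.log(proba + eps).mean()
    return bit_entropy - tc_term + loglik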

5 Empirical Evaluation

Several variants of the proposed hashing-based dialog model, using kNN, SVM or LSTM to build hashcodes, were evaluated on three different datasets and compared with three state-of-the-art dialog generation approaches of [Serban et al.2017, Serban et al.2016a] and [Vinyals and Le2015]. Besides several standard evaluation metrics adopted by those approaches, we also report model rankings obtained by human evaluators via Amazon Mechanical Turk.

5.1 Experimental Setup

Datasets. The three datasets used in our experiments include (1) depression therapy sessions, (2) Larry King TV interviews, and (3) Twitter data. The depression therapy dataset (https://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series) consists of transcribed recordings of nearly 400 therapy sessions between multiple therapists and patients. Each patient response followed by a therapist response is treated as a single sample; all such pairs, from all sessions, were combined into one set of N=42000 samples. We select 10% of the data randomly as a test set (4200 samples), and then perform another random 90/10 split of the remaining samples (roughly 38,000) into training and validation subsets, respectively. We follow the experimental setup from the prior work cited above when comparing the respective neural network models with our hashing-based approaches: all models are trained only once using the same training and validation datasets, and evaluated on the same test set. However, for our hashing model metrics introduced below, we average the estimates over 10 random subsets, using 95% of test samples each time.

The Larry King dataset (http://transcripts.cnn.com/TRANSCRIPTS/lkl.html) contains transcripts of interviews with the guests of TV talk shows, conducted by the host, Larry King. Similarly to the depression therapy dataset, we put together all pairs of guest/host responses from 69 sessions into a single set of size 8200. The data are split into training, validation and test subsets as described earlier.

Next, we experimented with the Twitter Dialogue Corpus [Ritter et al.2010]. Considering the original tweet and the comments that follow it in the same session, the task is to infer the next tweet. Note that we treat all utterances preceding that tweet as one long utterance, i.e. as the first "response" S^p, mapped to one hashcode, while the next tweet is the second "response" S^t; this differs from the approach of [Serban et al.2017] we compare with, where the previous utterances in a session are explicitly viewed as a sequence. The numbers of tweet sessions (each viewed as a separate sample, i.e. a pair of responses) in the training, validation, and test subsets are, respectively, 749060, 93633, and 93633.

Task. For all datasets, the task is to train a model on a set of training samples, i.e. response pairs (S^p, S^t), where S^p is a response of person A, followed by the corresponding response S^t of person B. Each test sample is then given as a response S^p of person A, and the task is to generate the response S^t of person B.

Hashing Models.
Step 1: Representation Learning. We evaluate three different hashing models. The first two, based on kernel locality-sensitive hashing (KLSH) [Joly and Buisson2011, Garg et al.2019], are called LSH-RMM and LSH-RkNN, and use, respectively, a Max-Margin (SVM) classifier (with parameter C=1) or a kNN classifier (k=1) to compute each hash function; R stands for random data splits. The third model, LSH-RLSTM, uses an LSTM for hash function computation. We use hashcode vectors of dimensionality H=100. For LSH-RkNN and LSH-RMM, we use as a reference set a random subset of M=300 samples from the training dataset, to reduce the computational complexity of training those models; for LSH-RLSTM, we use the whole training dataset as a reference set. (The optimized architectures found by LSH-RLSTM were: for Depression Therapy, a four-layer network [16,64,16,8] having 16, 64, 16, and 8 nodes, respectively, in the 1st, 2nd, 3rd and 4th hidden layers; for the Twitter dataset, a 3-layer net [8,32,16]; and for the Larry King dataset, only a two-layer net [8,32].) Parameters of the LSH models are obtained by maximizing the proposed MI lower-bound criterion.

Step 2: Hashcode Prediction. We now map all responses, of both participants A and B, in both training and test sets, to the corresponding hashcodes using one of the above representation models. Next, to predict the response hashcode of person B given a hashcode of person A, we train a separate Random Forest (RF) classifier (each containing 100 decision trees) for each hashcode bit (i.e. 100 such RF classifiers, since H=100).
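Step 2 thus amounts to H independent binary classification problems; a minimal sketch (illustrative names) using scikit-learn:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_bit_predictors(Cp_train, Ct_train, n_trees=100):
    """One Random Forest per bit of person B's hashcode (H classifiers,
    100 trees each, as described above)."""
    return [RandomForestClassifier(n_estimators=n_trees).fit(Cp_train, Ct_train[:, j])
            for j in range(Ct_train.shape[1])]

def predict_hashcodes(models, Cp_test):
    """Infer person B's hashcodes bit-by-bit from person A's hashcodes."""
    return np.stack([m.predict(Cp_test) for m in models], axis=1)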

Step 3: Textual Response Generation. Given a response hashcode inferred by the RF classifiers above, mapping it to an actual text can be performed in multiple ways; for now, we simply find the nearest neighbor of the generated hashcode in the set of all hashcodes corresponding to person B's responses in our training data. As mentioned before, better generative models remain a direction of future work; this simple method is used to validate our main contribution, a hashcode-based representation and response prediction model driven by infogain maximization.
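The retrieval step can then be a brute-force Hamming nearest-neighbor lookup over person B's training responses (the sublinear search structures cited earlier can replace the linear scan); a minimal sketch:

import numpy as np

def nearest_response(pred_code, train_codes, train_texts):
    """Return the stored response whose hashcode is closest in Hamming
    distance to the predicted code."""
    dists = (train_codes != pred_code).sum(axis=1)  # Hamming distances
    return train_texts[int(dists.argmin())]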

Baseline: Neural Network Dialog Generation Models.
We compare our dialog generation method with the state-of-the-art VHRED approach of [Serban et al.2017], as well as with two other approaches, HRED [Serban et al.2016a] and LSTM [Vinyals and Le2015], also used as baselines in the VHRED paper. We adopt the same hyperparameter settings as those used in [Serban et al.2017]. For the Twitter dataset, we compare with the results presented in that paper, while on the other two datasets we train the above models ourselves. The input vocabulary size is set via grid search over values from 1000 to 100000. The neural network structures are chosen by an informal search over a set of architectures; we set the maximum number of gradient steps to 80, the validation frequency to 500, and the step-size decay for SGD to 1e-4.

(a) Depression Dataset
Model            LSTM  HRED  VHRED  LSH-RkNN  LSH-RMM  LSH-RLSTM
Appropriate (%)   3.7   8.6    9.5      28.7     24.1       25.4
Diverse (%)       0.7   9.3    0.7      35.1     13.2       41.1

(b) Larry King Dataset
Model            LSTM  HRED  VHRED  LSH-RkNN  LSH-RMM  LSH-RLSTM
Appropriate (%)   5.3   3.9    5.3      31.7     25.6       28.3
Diverse (%)       3.3  13.3    0.0      36.7     10.0       36.7

Table 1: Human evaluation scores on (a) the Depression dataset (900 test samples) and (b) the Larry King dataset (180 test samples).
(a) Depression Therapy Dataset
Model                       Average      Greedy       Extrema
LSTM [Vinyals and Le2015]   0.61±0.31    0.58±0.29    0.28±0.16
HRED [Serban et al.2016a]   0.48±0.23    0.43±0.20    0.29±0.16
VHRED [Serban et al.2017]   0.48±0.23    0.43±0.20    0.29±0.16
LSH-RkNN                    0.55±0.39    0.47±0.29    0.26±0.21
LSH-RMM                     0.56±0.38    0.53±0.33    0.31±0.23
LSH-RLSTM                   0.64±0.37    0.51±0.28    0.28±0.19

(b) Twitter Dataset
Model                       Average      Greedy       Extrema
LSTM                        0.51         0.39         0.37
HRED                        0.50         0.38         0.36
VHRED                       0.53         0.40         0.38
LSH-RkNN                    0.61±0.17    0.40±0.13    0.25±0.09
LSH-RMM                     0.61±0.17    0.41±0.13    0.25±0.09
LSH-RLSTM                   0.60±0.18    0.39±0.13    0.24±0.09

(c) Larry King Dataset
Model                       Average      Greedy       Extrema
LSTM                        0.71±0.24    0.60±0.20    0.35±0.14
HRED                        0.71±0.25    0.61±0.20    0.29±0.12
VHRED                       0.70±0.24    0.72±0.25    0.43±0.18
LSH-RkNN                    0.76±0.28    0.60±0.21    0.34±0.15
LSH-RMM                     0.73±0.28    0.59±0.22    0.35±0.16
LSH-RLSTM                   0.76±0.27    0.58±0.21    0.33±0.15

Table 2: Comparison between state-of-the-art neural network models (LSTM, HRED and VHRED) and the proposed hashing models (LSH-RkNN, LSH-RMM and LSH-RLSTM), on three datasets – Depression Therapy, Twitter, and Larry King – using word-embedding-based similarity metrics between the actual and generated responses. Mean and standard deviation across samples (response pairs) are reported for all metrics on each test set, except for the Twitter results with the prior-art models (LSTM, HRED, VHRED), for which we use the numbers reported in [Serban et al.2017] without rerunning the models; standard deviations were not reported in that paper.

Evaluation metrics.
Human evaluation. Using Amazon Mechanical Turk, we obtained model rankings from 108 human readers (annotators). For each test sample, we showed the reader the responses produced by all six models evaluated here, in random order for each instance, and without specifying which model produced which response. We asked the annotator to choose the most appropriate response; then, for each model, we computed the percentage of readers, across all test samples, who voted for that model. Furthermore, in a separate session, we also asked each annotator which model produced the most diverse responses; in this case we listed the models' names alongside their responses.

Embedding-based metrics. We compare our methods with the state-of-the-art neural network approaches listed above using three word-embedding-based topic similarity metrics – embedding average, embedding greedy, and embedding extrema [Liu et al.2016] – adopted by [Serban et al.2017]. Following the prior art, we used the Google News Corpus to train the embeddings (https://code.google.com/archive/p/word2vec/).

Given a textual response (of person B) generated by a particular method, and the true textual response (also of person B), all words are first mapped to their corresponding embeddings. An average across the words in each response is computed, and the cosine between the two resulting vectors constitutes the embedding average similarity metric.

Another approach to computing response-level embeddings is to use vector extrema: for each dimension of the word vectors, we select the most extreme value amongst all word vectors in the response and use that value in the response-level embedding; the cosine similarity is then computed between the corresponding response-level embeddings, resulting in a metric called embedding extrema [Liu et al.2016].

Finally, the third metric, embedding greedy, does not compute response-level embeddings. Instead, given two responses, each token of one response is greedily matched with a token of the other having maximum cosine similarity between the corresponding word embeddings, and the total score is averaged across all words [Liu et al.2016].
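For concreteness, minimal sketches of the three metrics follow, assuming a dict-like mapping emb from word to vector and tokenized responses; out-of-vocabulary handling and symmetrization details vary across implementations:

import numpy as np

def _vecs(tokens, emb):
    """Stack embeddings of in-vocabulary tokens."""
    return np.array([emb[w] for w in tokens if w in emb])

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_average(r1, r2, emb):
    """Cosine between the mean word vectors of the two responses."""
    return _cos(_vecs(r1, emb).mean(axis=0), _vecs(r2, emb).mean(axis=0))

def embedding_extrema(r1, r2, emb):
    """Cosine between dimension-wise most-extreme values of each response."""
    def extrema(V):
        hi, lo = V.max(axis=0), V.min(axis=0)
        return np.where(np.abs(hi) >= np.abs(lo), hi, lo)
    return _cos(extrema(_vecs(r1, emb)), extrema(_vecs(r2, emb)))

def embedding_greedy(r1, r2, emb):
    """Each token of r1 greedily matched to its best cosine match in r2,
    averaged over r1's tokens (one direction shown; often symmetrized)."""
    A, B = _vecs(r1, emb), _vecs(r2, emb)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).max(axis=1).mean())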

The mean and standard deviation statistics for each metric are computed over 10 runs of the experiment, as mentioned above; in the case of the hashcode-bit prediction accuracy, the statistics are computed over all 100 hashcode bits and over 10 trials.

5.2 Results

Computational efficiency.

First of all, we observed that the hashing models are much more computationally efficient than the neural network approaches we compared with: it takes from several days (for the smaller datasets, Depression therapy and Larry King, containing about 42,000 and 8,200 samples, respectively) to more than two weeks (on Twitter, with roughly 749,000 samples) to train a neural network model, even on a 1000-core GPU, whereas an LSH model is typically optimized within several hours (e.g., less than 5-6 hours for Twitter) on a 16-core CPU.

Human evaluation.

We performed an extensive human evaluation of the dialog generation quality (as described in Section 5.1) on the two relatively smaller datasets, Depression and Larry King. The results are shown in Tables 1a and 1b, respectively. The hashcode-based approaches considerably outperform the neural net models, by one to two orders of magnitude on some measures, in terms of the responses being both more appropriate and more diverse.

Embedding-based Metrics.

Table 2 summarizes the performance of all methods with respect to the embedding-based metrics. Overall, the proposed hashing approaches are quite competitive with the neural net methods on these metrics as well: on all three datasets, our LSH methods outperform their competitors in terms of the average similarity metric (the most intuitive of the three in terms of reflecting the topic similarity between the true and system-generated responses); moreover, our methods also achieve the best performance w.r.t. the greedy metric on the Twitter dataset and w.r.t. the extrema metric on the Depression dataset.

Note that, similarly to other studies, we observe that these commonly used metrics cannot truly capture the quality of the produced dialogues, and human evaluation is still hard to replace. For example, on the Depression dataset, due to its relatively small size, none of the neural net models managed to actually learn how to generate good responses, as the human evaluation results in Table 1 show, despite relatively good embedding-based metric results. Those models kept generating practically the same response, of questionable quality, on the majority of test samples (see the examples below), while the hashing methods often produced much more meaningful responses:

<patient> I did lose three pounds, did you notice that?
<therapist> That’s great, I did notice that.
-------------- Hashing models: -----------------------
<LSH-RMM> Yes, you did.
<LSH-RkNN> To help you with your diet?
<LSH-RLSTM> Okay what’s the explanation?
-------------- Neural network  models: ---------------
<LSTM> I don’t want to be a lot a lot ... of the way
<HRED> Yeah. you
<VHRED> Uh-huh. you
-------------
-------------
<patient> ...Which is stupid because like, ya know, I guess
if it’s not what they like, then it’s not what they like
and they can like do something else, but like  ya know
there’s some psychiatrists who charge like $300 an hour...
<therapist> I’m sorry you’re having such trouble getting
... getting access to the care that you need.
-------------- Hashing models: -----------------------
<LSH-RMM> You got to do.
<LSH-RkNN> Oh yeah, you never tried Lithium?
<LSH-RLSTM> And you ought not to just shove it away out
of your life because you will need it again.
-------------- Neural network  models: ---------------
<LSTM> Uh-huh. you
<HRED> Yeah. you
<VHRED> Uh-huh. you

For more examples, see Supplementary material at https://tinyurl.com/y6gefz8k.

6 Conclusions

This paper introduces a novel approach to dialogue modeling where the responses of both participants are represented by hashcodes. Furthermore, a novel lower bound on Mutual Information (infogain) is derived and used as a hashcode-based model-selection criterion, in order to facilitate better alignment in collaborative dialogue, as well as predictability of responses. Our empirical results consistently demonstrate superior performance of the proposed approach over state-of-the-art neural network dialogue models, in terms of both computational efficiency and response quality. In the future, we plan to further improve the approach by choosing a better response from a larger corpus and/or using more sophisticated hashcode-to-text generative models; moreover, we plan to go beyond the current simplistic treatment of response pairs as i.i.d. samples and focus on modeling the dynamics of a dialogue.

References