Compositional Sentence Representation from Character within Large Context Text

05/02/2016 · by Geonmin Kim, et al.

This paper describes a Hierarchical Composition Recurrent Network (HCRN) consisting of a 3-level hierarchy of compositional models: character, word and sentence. The model is designed to overcome two problems of representing a sentence on the basis of its constituent word sequence: the data-sparsity problem in word embedding, and the lack of inter-sentence dependency. In the HCRN, word representations are built from characters, resolving the data-sparsity problem, and inter-sentence dependency is embedded into the sentence representation at the level of sentence composition. We adopt a hierarchy-wise learning scheme in order to alleviate the optimization difficulties of learning a deep hierarchical recurrent network end-to-end. The HCRN was quantitatively and qualitatively evaluated on a dialogue act classification task. In particular, sentence representations with inter-sentence dependency are able to capture both implicit and explicit semantics of a sentence, significantly improving performance. In the end, the HCRN achieved state-of-the-art performance with a test error rate of 22.7%.

1 Introduction

Sentence representations are usually built from representations of constituent word sequences by a compositional word model. Many compositional word models based on neural networks have been proposed, and have been used for sentence classification

[Socher et al.2010, Kim2014] or generation [Sutskever et al.2014]

tasks. However, learning to represent a sentence on the basis of its constituent word sequence faces two difficulties. First, the vector representation of each word, its word embedding, varies independently of other words. Estimation of rare words therefore suffers from a data-sparsity problem: there are too few samples to learn their embeddings reliably, and poorly estimated embeddings of rare words can yield sentence representations of inferior quality. Second, conventional sentence representation does not take into account the dependency of the meaning of one sentence on the meanings of other sentences. This inter-sentence dependency is especially evident in large-context text such as documents and dialogues. Without accounting for it, a model captures only the superficial meaning of a sentence and cannot capture implicit aspects such as intention, which often require linguistic context to understand.


Figure 1: Illustration of the Hierarchical Composition Recurrent Network. The thick arrows indicate transformations described in equations (3)-(6); the thin arrows indicate identity transformations. For simplicity, each level is shown with one layer.

In this paper, we propose a Hierarchical Composition Recurrent Network (HCRN), which consists of a 3-level hierarchy of compositional models: character, word and sentence. Sequences at each level are composed by a Recurrent Neural Network (RNN), which has shown good performance on various sequence modeling tasks. In the HCRN, the output of each lower-level compositional model is fed into the next higher level. Sentence representation by the HCRN enjoys several advantages over sentence representation by a single compositional word model. With the compositional character model, word representations are built from characters by modeling morphological processes shared by different words; in this way, the data-sparsity problem with rare words is resolved. With the compositional sentence model, inter-sentence dependency can be embedded into the sentence representation. Sentence representation with inter-sentence dependency is able to capture the implicit intention as well as the explicit semantics of a given sentence.

Training the HCRN end-to-end presents optimization difficulties, since the resulting deep hierarchical recurrent network may suffer from the vanishing gradient problem across the different levels of the hierarchy. To alleviate this, a hierarchy-wise language learning algorithm is proposed and empirically shown to improve optimization of the network. The hierarchy-wise learning algorithm trains the lower-level network first, and then trains higher levels successively.

The efficacy of the proposed method is verified on a spoken dialogue act classification task, where the goal is to classify the communicative intention of each sentence in a spoken dialogue. Compared to conventional sentence classification, this task presents two challenges. First, it requires the model to estimate representations of spoken words, which often include rare and partial words. Second, understanding the dialogue context is often required to clarify the meaning of a sentence within a given dialogue.

The HCRN with hierarchy-wise learning algorithm achieves state-of-the-art performance on the SWBD-DAMSL database.

2 Hierarchical Composition Recurrent Network

Figure 1 shows our proposed Hierarchical Composition Recurrent Network (HCRN). The HCRN consists of a hierarchy of RNNs with compositional character, compositional word, and compositional sentence levels. At each level, a sequence is encoded by the hidden state of the RNN at the end of the sequence, as is commonly done in other research [Cho et al.2014, Tang et al.2015]. In Figure 1 and in this section, each level of RNN is assumed to have one layer for simplicity of notation. Well-known transformations are denoted as follows: a gating unit such as an LSTM or GRU is written $\mathrm{GRU}(\cdot)$, and an affine transformation followed by a non-linearity is written $f(\cdot)$.

Consider a dialogue which consists of a sentence sequence $(s_1, \dots, s_N)$ and its associated labels $(y_1, \dots, y_N)$. The compositional character model sequentially takes $c_{i,j,k}$, the embedding of the $k$-th character of word $j$ in sentence $i$, and recurrently calculates the hidden state $h^{c}_{i,j,k}$ to produce a word representation $w_{i,j}$:

$h^{c}_{i,j,k} = \mathrm{GRU}(h^{c}_{i,j,k-1}, c_{i,j,k})$   (1)
$w_{i,j} = h^{c}_{i,j,T^{c}_{i,j}}$   (2)

where $T^{c}_{i,j}$ is the length of the character sequence of word $j$ in sentence $i$.

Similarly, the compositional word model takes the word sequence $(w_{i,1}, \dots, w_{i,T^{w}_{i}})$ as input and iteratively calculates the hidden state $h^{w}_{i,j}$ to produce a representation $s_i$ of the $i$-th sentence:

$h^{w}_{i,j} = \mathrm{GRU}(h^{w}_{i,j-1}, w_{i,j})$   (3)
$s_i = h^{w}_{i,T^{w}_{i}}$   (4)

where $T^{w}_{i}$ is the number of words in sentence $i$.

The compositional sentence model updates its hidden state $h^{s}_{i}$ in the same way as the lower levels. Especially for dialogue, the compositional sentence model additionally takes an agent (or speaker) identity change vector $a_i$ as input. Agent identity, or at least agent identity change across neighboring sentences, is an important clue for understanding the intended meaning of a sentence in a dialogue, as shown in previous research [Li et al.2016]:

$h^{s}_{i} = \mathrm{GRU}(h^{s}_{i-1}, [s_i; a_i])$   (5)

where the agent identity change vector is

$a_i = [1, 0]$ if $g_i = g_{i-1}$, and $[0, 1]$ otherwise   (6)

where $g_i$ is the agent identity of the $i$-th sentence.

The memory of the compositional sentence model, $h^{s}_{i}$, which includes information from the $i$-th sentence as well as from the previous sentences in the dialogue, is fed into a multi-layer perceptron for classification of its label.
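To make the composition concrete, the sketch below outlines one possible forward pass of the three-level hierarchy with GRUs, following equations (1)-(6). It is a minimal illustration, not the authors' released code: the class and variable names, tensor shapes, and the 2-dimensional encoding of the speaker-change input are our own assumptions.

```python
import torch
import torch.nn as nn

class HCRNSketch(nn.Module):
    """Minimal sketch of the 3-level HCRN: character -> word -> sentence -> MLP.
    Each level keeps only the final GRU hidden state as its representation (eqs. 2 and 4)."""

    def __init__(self, n_chars, char_dim=15, word_dim=64, sent_dim=128, n_classes=42):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_gru = nn.GRU(char_dim, word_dim, batch_first=True)      # CC
        self.word_gru = nn.GRU(word_dim, sent_dim, batch_first=True)      # CW
        # CS input: sentence vector plus 2-dim speaker-change indicator (eq. 6, assumed encoding)
        self.sent_gru = nn.GRU(sent_dim + 2, sent_dim, batch_first=True)  # CS
        self.mlp = nn.Sequential(                                         # MLP classifier
            nn.Linear(sent_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, dialogue, speakers):
        """dialogue: list of sentences, each a list of LongTensors of character ids.
        speakers: list of speaker ids, one per sentence."""
        sent_vecs, h_s = [], None
        for sent in dialogue:
            word_vecs = []
            for chars in sent:                          # eqs. (1)-(2): CC
                emb = self.char_emb(chars).unsqueeze(0)
                _, h_c = self.char_gru(emb)
                word_vecs.append(h_c[-1])
            words = torch.stack(word_vecs, dim=1)       # (1, n_words, word_dim)
            _, h_w = self.word_gru(words)               # eqs. (3)-(4): CW
            sent_vecs.append(h_w[-1])
        logits = []
        for i, s in enumerate(sent_vecs):               # eqs. (5)-(6): CS
            change = speakers[i] != speakers[i - 1] if i > 0 else False
            a = torch.tensor([[0.0, 1.0]] if change else [[1.0, 0.0]])
            x = torch.cat([s, a], dim=-1).unsqueeze(1)
            _, h_s = self.sent_gru(x, h_s)
            logits.append(self.mlp(h_s[-1]))
        return torch.cat(logits, dim=0)                 # one DA prediction per sentence
```

A dialogue would then be passed as nested lists of character-id tensors together with its speaker ids, yielding one dialogue-act distribution per sentence; batching and padding are omitted here for clarity.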

One advantage of the HCRN is its ability to learn from long character sequences. While conventional stacked RNNs have difficulty dealing with very long sequences [Bengio et al.1994, Hochreiter1998], each level of the HCRN deals only with the short sequences of the type specific to that level, so vanishing gradients during back-propagation through time are relatively insignificant. Each level of the HCRN also operates at a different speed of dynamics during sequence processing, so the model can learn both short-range and long-range dependencies in large text samples (see the preliminary experiment in Section A of the supplementary material for details).

The following abbreviations are used for the rest of this paper: the compositional character model (CC), the compositional word model (CW), the compositional sentence model (CS), and the multi-layer perceptron (MLP).

3 Hierarchy-wise Language Learning

In order to alleviate the optimization difficulties that occur when the entire hierarchy of RNNs is trained in an end-to-end fashion, hierarchy-wise language learning is proposed. In the hierarchy of compositional models, the lower-level composition network is trained first, and higher-level composition networks are gradually added after the lower-level network has been optimized for a given objective function. This approach is inspired by the unsupervised layer-wise pre-training algorithm of [Hinton et al.2006], known to provide better initialization for subsequent supervised learning.

3.1 Unsupervised Word-hierarchy learning

To pre-train the CC, we adopt the pre-training scheme of [Srivastava2015], while following the RNN Encoder-Decoder architecture of [Cho et al.2014]. Figure 2 shows the RNN Encoder-Decoder architecture used for this learning; the parameters of the CC are obtained from the RNN Encoder.

In this architecture, the representation of a word, which consists of characters, is built by the CC. This representation is then fed into the RNN Decoder. The CC and the RNN Decoder are jointly trained, by minimizing the negative log likelihood, so that the output sequence becomes exactly the same as the input character sequence. This phase helps the CC learn how to spell words as character sequences, reducing the burden of learning the morphological structure of words during subsequent learning at higher levels of the hierarchy.
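A minimal sketch of this reconstruction objective is shown below, assuming a teacher-forced decoder and cross-entropy over next-character prediction; the class name, the zeroed start-of-word input, and the dummy minibatch are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class CharAutoencoder(nn.Module):
    """Sketch of word-hierarchy pre-training: a CC encoder and an RNN decoder
    trained to reproduce the input character sequence (negative log likelihood)."""

    def __init__(self, n_chars, char_dim=15, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.GRU(char_dim, hidden, batch_first=True)   # becomes the CC
        self.decoder = nn.GRU(char_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, chars):
        """chars: (batch, seq_len) character ids, ending with an end-of-word token."""
        emb = self.emb(chars)
        _, h = self.encoder(emb)                       # word representation = final state
        # teacher forcing: decoder sees the gold previous character at each step,
        # with an assumed all-zero start-of-word input at position 0
        start = torch.zeros_like(emb[:, :1])
        dec_in = torch.cat([start, emb[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)                       # (batch, seq_len, n_chars)

# usage sketch: minimize the NLL of reconstructing the same character sequence
model = CharAutoencoder(n_chars=31)
loss_fn = nn.CrossEntropyLoss()
chars = torch.randint(0, 31, (10, 8))                  # a minibatch of 10 dummy words
logits = model(chars)
loss = loss_fn(logits.reshape(-1, 31), chars.reshape(-1))
loss.backward()
```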


Figure 2: Illustration of the architecture used for word-hierarchy learning. The training objective forces the network to reconstruct exactly the input character sequence.

3.2 Supervised Sentence-hierarchy and Discourse-hierarchy learning

Next, sentence-hierarchy learning and discourse-hierarchy learning proceed in a supervised way with the given sentence labels. In sentence-hierarchy learning, the model is trained to classify the label of a single sentence independently of the other sentences in its context. In discourse-hierarchy learning, the model is trained to classify labels across multiple sentences. An MLP is stacked on top of the sentence representation $s_i$ (for sentence-hierarchy learning) or the discourse state $h^{s}_{i}$ (for discourse-hierarchy learning) to predict class labels.

A randomly initialized CW is stacked on top of the CC pre-trained during word-hierarchy learning, and a randomly initialized CS is stacked on top of the CC and CW resulting from sentence-hierarchy learning. The parameters of the pre-trained lower-level models are excluded from training during the first few epochs, in order to prevent them from changing too much due to the large error signals generated by the randomly initialized layer. This method is also employed in [Oquab et al.2014] when adding a new layer on top of pre-trained layers for transfer learning. In Section 4.5, hierarchy-wise learning will be empirically shown to alleviate the optimization difficulties of training the HCRN compared to end-to-end learning.
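One way to implement this freeze-then-unfreeze schedule is sketched below in PyTorch; the stand-in GRU modules, the loop bounds, and the choice of toggling requires_grad are our assumptions about a reasonable realization, not the authors' code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle gradient computation for every parameter of a sub-network."""
    for p in module.parameters():
        p.requires_grad = trainable

# illustrative stand-ins for the pre-trained CC/CW and a freshly initialized CS
cc, cw, cs = nn.GRU(15, 64), nn.GRU(64, 128), nn.GRU(130, 128)

freeze_epochs = 1   # 1 epoch in sentence-hierarchy learning, 5 in discourse-hierarchy learning
for epoch in range(10):
    frozen = epoch < freeze_epochs
    for pretrained in (cc, cw):
        set_trainable(pretrained, not frozen)
    # ... run one epoch of training here; only unfrozen parameters receive updates
```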

4 Experiments

4.1 Task and Dataset

The HCRN was tested on a spoken dialogue act classification task. The dialogue act (DA) is the communicative intention of the speaker for each sentence. The predicted DA can be further used as an input to modules of a dialogue system such as the dialogue manager. We chose the SWBD-DAMSL database (available at https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz), which is a subset of the Switchboard-I (LDC97S62) corpus annotated with a DA for each sentence. Switchboard-I consists of phone conversations between strangers. SWBD-DAMSL has 1155 dialogues on 70 pre-defined topics, 0.22M sentences, and 1.4M word tokens. Several dialogue act tagsets are available depending on the purpose of the application. We chose the 42-class tagset from DAMSL, which is widely used to analyze dialogue acts in phone conversations (see supplementary material Section B for the complete class list).

The character dictionary has 31 elements: 26 letters, - (indicating a partial word), ' (indicating the possessive case), . (indicating an abbreviation), noise (indicating a non-verbal sound), and unk, which covers all other characters. We follow the train/test division of [Stolcke et al.2000]: 1115/19 dialogues, respectively. Validation data consists of 19 dialogues chosen from the training data. After pre-processing of the corpus, the numbers of sentences in the train/test/validation sets are 197370, 4190 and 3315, respectively (see supplementary material Section C for pre-processing details). Table 1 shows the sequence length statistics. Note that a sentence becomes a much longer sequence when represented by characters (37.92 on average) than by words (8.28).
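As an illustration of the 31-symbol inventory described above, the snippet below builds one possible character-to-index dictionary; the exact spellings of the special tokens (here "<noise>" and "<unk>") are assumptions.

```python
import string

# 26 letters + 5 special symbols = 31 entries (special token names are assumed)
symbols = list(string.ascii_lowercase) + ["-", "'", ".", "<noise>", "<unk>"]
char2id = {ch: i for i, ch in enumerate(symbols)}

def encode_word(word):
    """Map a word to character ids, sending unseen characters to <unk>."""
    return [char2id.get(ch, char2id["<unk>"]) for ch in word.lower()]

print(len(char2id))            # 31
print(encode_word("probab-"))  # partial words are handled like any other character string
```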

Hierarchy #C/W #W/S #S/D #C/S
Mean 8.19 8.28 161.26 37.92
Stddev 2.52 8.11 67.32 39.72
Table 1: Summary of the average sequence length for each level. C, W, S, and D indicate character, word, sentence and dialogue levels of processing, respectively. #A/B means the average number of A per B.
CC CW CS
Small
Large
Table 2: Size of the compositional model at each level, represented by (number of layers) × (number of hidden units in each layer). Note that the complexity of the model increases with the level of the hierarchy, following the assumption that the complexity of composition increases with the level of language.

4.2 Common settings

We employ the Gated Recurrent Unit (GRU) as the basic unit of the RNN, because several studies have shown that the GRU performs similarly to the LSTM while requiring fewer parameters [Chung et al.2014, Jozefowicz et al.2015]. The configuration of the HCRN is specified by the hierarchy of compositional models and their sizes, where each size is given by the number of layers and the number of hidden units in each layer. We tested two different sizes of compositional model at each level, as shown in Table 2.

In all supervised learning, the classifier is an MLP with 3 layers of affine transformation and non-linearity. The non-linearity of the first 2 layers is the Rectified Linear Unit (ReLU), and that of the last layer is the softmax. The number of hidden neurons in the MLP is 128. A common hyperparameter setting is used in all experiments. All weights are initialized from a uniform distribution within [-0.1, 0.1], except pre-trained weights. We optimized all networks with Adadelta [Zeiler2012] with decay rate (ρ) 0.9 and a small constant (ε), which showed much faster convergence than stochastic gradient descent with momentum 0.9, and employed the gradient clipping strategy of [Pascanu et al.2013] with clipping threshold 5. Early stopping based on validation loss was used to prevent overfitting.
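A sketch of these common settings in PyTorch follows; the stand-in model and the Adadelta ε left at the library default are assumptions, since only the decay rate and clipping threshold are given above.

```python
import torch
import torch.nn as nn

model = nn.GRU(15, 64)                       # stand-in for any HCRN sub-network

# uniform initialization in [-0.1, 0.1] for non-pre-trained weights
for p in model.parameters():
    nn.init.uniform_(p, -0.1, 0.1)

optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9)

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # gradient clipping with threshold 5, following Pascanu et al. (2013)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```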

4.3 Unsupervised Word-hierarchy learning

During word-hierarchy learning, the CC is jointly trained with the RNN Decoder to reconstruct input character sequences. The number of unique words in the training set is 19353. An end-of-word token is appended to every character sequence. Parameters are updated after processing a minibatch of 10 words. The character embedding dimension is 15. Learning is terminated if the validation loss fails to decrease by 0.1 for three consecutive epochs.

Model   In Vocabulary                       Out of Vocabulary
        CPER (%)  WRFR (%)  Length          CPER (%)  WRFR (%)  Length
Small   0.39      2.25      13.1 (2.6)      2.06      9.17      12.3 (2.2)
Large   0         0         -               1.21      5.28      12.7 (2.4)

Table 3: Reconstruction performance of the RNN Encoder-Decoder on words in the vocabulary and out-of-vocabulary (OOV). The Length column gives the mean and the standard deviation (in parentheses) of the character length of words for which complete reconstruction failed.

Pre-training performance itself is evaluated by sequence reconstruction ability. For reconstruction, the RNN Decoder generates a character sequence from the encoding produced by the CC, using greedy sampling at each time step. Performance is evaluated with two measures: Character Prediction Error Rate (CPER) and Word Reconstruction Fail Rate (WRFR). CPER measures the ratio of incorrectly predicted characters in the reconstructed sequence. WRFR is the ratio of words for which complete reconstruction fails out of the total words in the test set.
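The sketch below computes the two measures under our reading of these definitions, with CPER as an edit-distance-based character error rate and WRFR as the fraction of words not reconstructed exactly; the paper's exact formulation may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cper_wrfr(references, reconstructions):
    """CPER: character errors / total reconstructed characters (assumed definition).
    WRFR: fraction of words whose reconstruction is not an exact match."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, reconstructions))
    chars = sum(len(h) for h in reconstructions)
    fails = sum(r != h for r, h in zip(references, reconstructions))
    return errors / max(chars, 1), fails / len(references)

# tiny usage example with made-up reconstructions
refs = ["really", "probably"]
hyps = ["really", "probally"]
print(cper_wrfr(refs, hyps))   # (1/14 ≈ 0.07, 0.5)
```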

The reconstruction performance of the RNN Encoder-Decoder on words both in vocabulary and out-of-vocabulary (OOV) is summarized in Table 3. Overall, the model almost perfectly reconstructs character sequences of the training data, and even generalizes well to unseen words. The large model outperforms the small model. Words for which reconstruction failed were, on average, longer than 12 characters.

4.4 Supervised Sentence-hierarchy learning

During sentence-hierarchy learning, the parameters are updated after processing a minibatch of 64 sentences.

Target word (frequency bin)     3 nearest words by the CC                            3 nearest words by non-compositional embedding
uh-huh (unigram count ≥ 5)      uh-oh, huh-uh, um                                    hmm, helpful, yeah
really (unigram count ≥ 5)      reall-, real, very                                   believe, very, frankly
emphasizing (1 ≤ count ≤ 4)     emphasize, emphasis, surpassing                      -
probab- (1 ≤ count ≤ 4)         probably, probability, probable                      -
environmentalism (OOV)          environmentalist, environmentals, environmental      -
seventy-eights (OOV)            ninety-eight, seventeenth, twentiy-six               -

Table 4: Comparison of word representations built by the CC and by non-compositional word embedding. The 3 nearest words by Euclidean distance are retrieved for each target word; '-' indicates words that the non-compositional method maps to unk.

Initialization of CC: Random vs. Pre-trained  The test-set classification error rates of sentence-hierarchy learning with and without the pre-trained CC are compared, to evaluate how well the pre-trained CC initializes sentence-hierarchy learning. With the pre-trained CC, the parameters of the CC are first frozen and the CW and MLP are trained for 1 epoch (the number of epochs to freeze the pre-trained model was chosen as the best value in preliminary experiments on the validation set). After that, the whole architecture consisting of the CC, CW, and MLP is jointly trained. Evaluation was performed on architectures with different CC and CW sizes (see Table 2). In addition, pre-training with two different amounts of training data (50% and 100%) is compared. The results are shown in Figure 3. Pre-training consistently reduces the test error rate across the various architectures, as shown in Fig. 3(a). Moreover, as shown in Fig. 3(b), the improvement from pre-training is especially significant when less training data is available, where the model is liable to overfit.


Figure 3: Quality of the pre-trained CC as an initialization for sentence-hierarchy learning. (a) Test error rate (%). (b) Relative error improvement rate of the pre-trained architecture with respect to the randomly initialized architecture. At each level, size is denoted as small (S) or large (L), as in Table 2.

CC vs. non-compositional word embedding  In this section we compare two different methods of building word representations: the CC and non-compositional word embedding. Note that the pre-trained CC is not directly compared to widely used pre-trained word embeddings such as Word2Vec [Mikolov et al.2013], since pre-training of the CC aims at learning the underlying morphological structure of words rather than the semantic/syntactic similarities between different words. Therefore, for a fair comparison, both models are randomly initialized rather than using pre-trained word embeddings.

For the non-compositional word embedding method, we set two different cutoff frequencies, yielding vocabularies of 6294 and 11746 words. The dimensions of the non-compositional word embedding are chosen as 64 and 128, matching the dimensions produced by the CC sizes in Table 2.


Figure 4: Quality of word representations built by the CC. (a) Test error rate (%). (b) Relative error improvement rate of the CC with respect to non-compositional word representation with two different cutoff frequencies. At each level, size is denoted as small (S) or large (L), as listed in Table 2.

Figure 4 compares test error rates under the above settings. Non-compositional word embedding with the high cutoff setting outperforms the low cutoff setting, because the data-sparsity problem in estimating rare words is more severe with the lower cutoff. Compared to the non-compositional method, the CC performs better or on par with fewer parameters, since the number of parameters of non-compositional word embedding scales with the number of words in the dictionary.

Table 4 shows the 3 nearest neighbors of word representations built by the two methods: the CC and non-compositional representation. For each method, the model with the best test accuracy is chosen. For word representations built by the CC, the retrieved nearest words usually have similar form and similar meaning. Moreover, rare words and OOV words, such as partial words, can be mapped to semantically similar words, whereas the non-compositional representation estimates them all with a single unk embedding.
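The retrieval used for Table 4 can be sketched as a brute-force Euclidean nearest-neighbour search over word vectors; the helper below and its toy vocabulary and random vectors are illustrative only, as in the paper the vectors would come from the CC or the embedding table.

```python
import numpy as np

def nearest_words(query_vec, vocab_vecs, vocab_words, k=3):
    """Return the k vocabulary words closest to query_vec in Euclidean distance."""
    dists = np.linalg.norm(vocab_vecs - query_vec, axis=1)
    return [vocab_words[i] for i in np.argsort(dists)[:k]]

# toy usage: note the query word itself is in the vocabulary, so it is its own nearest neighbour
rng = np.random.default_rng(0)
vocab_words = ["uh-huh", "uh-oh", "huh-uh", "um", "really"]
vocab_vecs = rng.normal(size=(len(vocab_words), 64))
print(nearest_words(vocab_vecs[0], vocab_vecs, vocab_words, k=3))
```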

4.5 Discourse-hierarchy supervised learning

During discourse-hierarchy learning, the CS stacked on top of the CC and CW is trained. The pre-trained model was chosen as the one with the lowest validation error rate during sentence-hierarchy learning. For the first 5 epochs, the network is trained with the CC and CW frozen; then the whole network is jointly optimized. The model is updated after processing a minibatch of 8 dialogues. We evaluate discourse-hierarchy learning with the two different sizes of CS listed in Table 2.


Figure 5: Learning curves on (a) training data and (b) test data. The objective function converges to a much lower value when the model is initialized from the pre-trained model resulting from sentence-hierarchy learning.

Optimization difficulty of end-to-end learning  To verify that hierarchy-wise learning actually alleviates optimization difficulties, we compared the objective function curves of discourse-hierarchy learning under two different model initializations: with the pre-trained model from sentence-hierarchy learning, and with random initialization (end-to-end learning). The learning curves in Figure 5 clearly show that initializing with the pre-trained model significantly alleviates optimization difficulties.

Effects of dialogue context on sentence representation  Table 5 shows the test classification error rates of sentence-hierarchy learning and discourse-hierarchy learning. Compared to sentence-hierarchy learning, discourse-hierarchy learning improves performance significantly (up to a 13.48% relative improvement in error).
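For concreteness, the relative improvement (Rel) reported in Table 5 is consistent with computing it against the sentence-hierarchy error rate:

\[
\text{Rel} = \frac{\text{Err}_{\text{sentence}} - \text{Err}_{\text{discourse}}}{\text{Err}_{\text{sentence}}} \times 100
           = \frac{26.27 - 22.73}{26.27} \times 100 \approx 13.48\%.
\]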

To analyze the improvement qualitatively, we show example sentences from the test set for which prediction is improved by dialogue context. The analysis uses the model that achieved the best test accuracy during discourse-hierarchy learning. Table 6 shows an example dialogue segment of 8 sentences with the labels predicted by both sentence-hierarchy and discourse-hierarchy learning. The cases where discourse-hierarchy learning predicts correctly while sentence-hierarchy learning fails can be seen where the two estimated-label columns differ. For example, "yeah" in the 3rd sentence can be interpreted as both Agreement and Backchannel, and an informed decision between the two is only possible when the dialogue context is available. This example demonstrates that sentence representation with dialogue context helps to distinguish confusable dialogue acts (see supplementary material Section D for further examples).

Hierarchy    Err (%)   Rel (%)
Sentence     26.27     -
Discourse    22.73     13.48
Discourse    22.99     12.49

Table 5: Test error rates of sentence-hierarchy learning and discourse-hierarchy learning (the two discourse rows correspond to the two CS sizes in Table 2). The relative error improvement (Rel) with respect to sentence-hierarchy learning is also provided.
Dialogue segment (with speaker) | True label | Estimated label without context | Estimated label with context
A: and uh quite honestly i just got so fed up with it i just could not stand it any more | Statement | Statement | Statement
B: is that right | Backchannel-question | Yes-No question | Backchannel-question
A: yeah | Agreement | Backchannel | Agreement
A: i mean this is the kind of thing you look at | Statement | Opinion | Statement
B: yeah | Backchannel | Backchannel | Backchannel
A: you sit there | Statement | Statement | Statement
A: and when you are writing up budgets you wonder okay how much money do we need | Statement | Wh-question | Statement

Table 6: An example dialogue segment containing 8 sentences. Label predictions from the model of sentence-hierarchy learning (without dialogue context) and discourse-hierarchy learning (with dialogue context) are provided along with the true labels.

Comparison with other methods  Several other methods for dialogue act classification are compared with our approach in Table 7. Our approach outperforms the other benchmarks, achieving a 22.7% classification error rate on the test set. Similar approaches employ neural network models that hierarchically compose sequences starting from word sequences [Kalchbrenner and Blunsom2013, Ji et al.2016, Serban et al.2016]. We conjecture that the improvement demonstrated by our model is due to two factors: first, our model builds word representations from constituent characters and thus suffers less from the data-sparsity problem with rare words; second, the hierarchy-wise language learning method alleviates the optimization difficulties of the deep hierarchical recurrent network. To the best of our knowledge, our model achieves the state-of-the-art performance for dialogue act classification on the SWBD-DAMSL database.

Method                                                                               Test err. (%)
Class based LM + HMM [Stolcke et al.2000]                                            29.0
RCNN [Kalchbrenner and Blunsom2013]                                                  26.1
HCRN with word as basic unit [Serban et al.2016]*                                    24.9
Utterance feature + Tri-gram context + Active learning + SVM [Gambäck et al.2011]    23.5
Discourse model + RNNLM [Ji et al.2016]                                              23.0
Discourse-hierarchy learning (ours)                                                  22.7
  • *This performance was evaluated by ourselves due to task difference.

Table 7: Performance comparison with other methods for dialogue act classification on SWBD-DAMSL.

5 Related works

The difficulty RNNs have in learning long-range dependencies within character sequences has been addressed in [Bojanowski et al.2016]. Hierarchical RNNs have been proposed as one possible solution. [Graves2012, Chan et al.2015] proposed sub-sampling sequences hierarchically to reduce sequence length at higher levels. [Koutnik et al.2014, Chung et al.2015] proposed RNN architectures in which different layers operate at different speeds of dynamics. Compared with these models, the HCRN deals with shorter sequences at each level, so the vanishing gradient problem is rendered relatively insignificant.

There are several recent studies on representing large-context text hierarchically for sentence [Li et al.2015, Serban et al.2016] and document classification [Tang et al.2015]. These approaches benefit from hierarchical representations, which represent long sequences as hierarchies of shorter sequences. However, their basic unit is the word, and models that begin at this level of representation are exposed to the data-sparsity problem. This problem can be alleviated by building word representations from constituent character sequences; successful examples can be found in POS classification [Santos and Zadrozny2014] and language modeling [Botha and Blunsom2014, Ling et al.2015, Kim et al.2016].

6 Conclusion

In this paper, we introduced the Hierarchical Composition Recurrent Network (HCRN), a model consisting of a 3-level hierarchy of compositional models: character, word and sentence. The inclusion of the compositional character model improves the quality of word representations, especially for rare and OOV words. Moreover, embedding inter-sentence dependency into the sentence representation with the compositional sentence model significantly improves dialogue act classification performance, because intentions that remain ambiguous in a single sentence are revealed by the dialogue context, facilitating proper classification. The HCRN is trained in a hierarchy-wise language learning fashion, alleviating the optimization difficulties of end-to-end training. In the end, the proposed HCRN with the hierarchy-wise learning algorithm achieves state-of-the-art performance with a test classification error rate of 22.7% on the dialogue act classification task on the SWBD-DAMSL database.

Future work aims at learning the hierarchical structure of sequential data without explicitly given hierarchy information. Another direction is applying the HCRN to other tasks which might benefit from OOV-free sentence representation in large contexts, such as document summarization.

References

Appendix A Effects of Hierarchical Composition

In order to learn character-level representations of large text units such as sentences and dialogues, our model is hierarchically composed. As a preliminary experiment, we compared the performance of hierarchical (Fig. 6(a), sentence-hierarchy learning in the paper) and non-hierarchical (Fig. 6(b), conventional stacked RNN) architectures on the dialogue act classification task with the SWBD-DAMSL database.


Figure 6: Comparison of networks that learn sentence representations from characters. (a) With hierarchical composition. (b) Without hierarchical composition (conventional stacked RNN).

For a fair comparison, the hierarchical and non-hierarchical composition networks have the same number of hidden units in each layer. For the non-hierarchical composition network, we add blank tokens to the character sequences to indicate word boundaries. Table 8 shows that the hierarchical composition network outperforms the non-hierarchical one. This is because the network with hierarchical composition has a compositional word model (CW), which can be viewed as a specially designed compositional character model (CC) in which composition is only allowed between hidden states at the ends of words. Therefore, the network with hierarchical composition processes shorter sequences in each layer than the non-hierarchical model, so the vanishing gradient problem becomes relatively insignificant in each RNN layer.

Non-hierarchical composition       Hierarchical composition
Error (%)                          Error (%)    Rel. (%)
35.80                              27.13        24.22
36.38                              27.81        23.56

Table 8: Test error rate comparison of networks with and without hierarchical composition. The relative improvement (Rel.) from non-hierarchical to hierarchical composition is also reported. '+' indicates consecutive layers in a conventional stacked RNN.

Appendix B Complete list of class names in SWBD-DAMSL

Non-Opinion Declarative question Other answers
Backchannel Backchannel(question) Opening
Opinion Quotation Or clause
Abandoned Summarize Dispreferred answer
Agreement Non-yes answer 3rd party talk
Appreciation Action-directive Offers
Yes-No-Question Completion Self talk
Non-verbal Repeat phrase Downplayer
Yes answer Open question Accept part
Closing Rhetorical question Tag question
Wh-question Hold before answer Declarative question
No answer Reject Apology
Acknowledgment Non-no answer Thanking
Hedge Non-understand Others
Table 9: The 42-class dialogue act tagset provided by SWBD-DAMSL. Classes are sorted from most frequent to least frequent, from top-left to bottom-right, in column-major order.

Appendix C Detail of pre-processing

We pre-processed the raw text according to the rules below. First, all letters are converted to lower case. Disfluency tags and special punctuation marks such as ?, ! and ,, which cannot be produced by a speech recognizer, are removed. We also merge sentences with segment tags into the previous unfinished sentence. Segment tags indicate the interruption of one speaker by another; it is difficult even for humans to predict the tags of segmented sentences, because sentence segments often do not provide enough information for reliable ascription of their DA. Sentences which interrupt others are placed after the combined sentences. This scheme is also used in (Webb et al., 2005; Milajevs and Purver, 2014).
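A rough sketch of the normalization part of these rules (lower-casing and stripping punctuation a speech recognizer would not produce) is given below; the disfluency-tag removal and segment-merging steps depend on the corpus annotation format and are not reproduced here.

```python
import re

def normalize_utterance(text):
    """Lower-case and drop the punctuation marks listed in the paper (? ! ,)."""
    text = text.lower()
    text = re.sub(r"[?!,]", "", text)          # punctuation a recognizer cannot produce
    text = re.sub(r"\s+", " ", text).strip()   # collapse any whitespace left behind
    return text

print(normalize_utterance("Is that right?"))   # -> "is that right"
```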

Appendix D Supporting results showing discourse context improves sentence representation

D.1 Comparison of sentence-hierarchy learning and discourse-hierarchy learning on class accuracy


Figure 7: Class accuracy for sentence-hierarchy learning and discourse-hierarchy learning.

Figure 7 shows the per-class improvement when dialogue context is incorporated. The two models achieving the best test-set accuracy under sentence-hierarchy learning (without dialogue context) and discourse-hierarchy learning (with dialogue context) are compared. Classes are sorted in descending order of relative improvement rate (from sentence-hierarchy to discourse-hierarchy). 33 out of 42 classes improve with discourse-hierarchy learning (Type 1), 3 classes fail under both methods (Type 2), and performance degrades with discourse-hierarchy learning for 6 classes (Type 3).

D.2 Improved cases

Text | True label | Estimated label without context | Estimated label with context
B: what d- what is that | Wh-Question | Wh-Question | Wh-Question
A: it's more uh noise | Abandoned | Abandoned | Abandoned
A: i don't know how to explain it | Hold before answer | Opinion | Opinion
A: kind of pop you know rock | Statement | Statement | Statement
B: rock | Repeat phrase | Statement | Repeat phrase
A: yeah | Agree | Backchannel | Agree
B: hard rock | Summarize | Statement | Summarize
A: well not hard laughter | Reject | Statement | Reject
A: huh well are you going to paint the outside of your house too | Yes-No-Question | Declarative question | Yes-No-Question
B: well yeah | Yes answers | Yes answers | Yes answers
B: i think i am going to do it this spring actually | Statement | Statement | Statement
A: oh really | Backchannel-question | Backchannel-question | Backchannel-question
B: yeah | Yes answers | Backchannel | Yes answer
B: there are six houses | Statement | Statement | Statement
B: see the people that own the house they uh pay for anything like that we do as far as the materials | Statement | Opinion | Statement
A: there are three houses on this street the same color of yellow out of six houses | Statement | Statement | Statement

Table 10: Two examples of improved dialogue segments, each containing 8 sentences (further examples in the style of Table 6).