The task of multiclass user intent classification arises in the context of conversational agents such as chatbots. When a chatbot system processes a user query, the first step is to identify the user's intent.
The challenge of Chinese intent classification stems from the fact that, unlike English, where most words are spelled out of a 26-letter phonetic alphabet, Chinese is logographic. Chinese words are composed of Chinese characters, logograms that carry independent meanings which vary with context. Previous work on Chinese intent classification mainly adopts pre-trained word embedding vectors for learning. However, compared to other text classification tasks, the text in intent detection contains more low-frequency domain-specific words, such as a flight number or the name of a dish, which are out of vocabulary for many pre-trained word embeddings. These less frequent words often share characters with more frequent words, like “步行街” and “步行” (“walkway” and “walk”), but with embedding learning tools like Word2Vec, the morphological commonality between “步行街” and “步行” is lost because the two words are mapped to different word ids. On the other hand, Chinese characters occur more frequently in fixed collocations, which limits the variety of contexts around a character; this makes training Chinese character embeddings easier and more accurate, so features learned from Chinese characters are highly informative. By analogy, the close counterpart of Chinese characters is English subwords, such as suffixes and prefixes, while Chinese radicals (the graphical components of a character) are the close counterpart of English letters. Therefore, incorporating character or character n-gram vectors into conventional word vectors helps capture the morphology within a rare Chinese word and produces a better vector representation, because the character embeddings are shared across rare and frequent words.
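To make the morphology sharing concrete, here is a minimal Python sketch (the bigram extraction and the toy vocabulary are illustrative, not the paper's implementation) showing that the rare word and the frequent word above share a character bigram even though a word-level vocabulary assigns them unrelated ids:

```python
def char_bigrams(word):
    """Return the set of character bigrams of a Chinese word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

rare, frequent = "步行街", "步行"  # "walkway" vs "walk"

# A word-level vocabulary treats the two words as unrelated ids...
vocab = {frequent: 0, rare: 1}

# ...but at the character level they share the bigram "步行", so character
# features learned on the frequent word transfer to the rare one.
shared = char_bigrams(rare) & char_bigrams(frequent)
print(shared)  # {'步行'}
```

This is exactly the commonality that per-word ids discard and that the character-level feature maps below recover.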
In addition, the meaning and relevancy of a word to the conversational intent are closely related to the sentence context it is in, but fixed pre-trained word and character embedding vectors are unable to adapt to contextual information. To account for the lack of adaptivity, word and character embeddings could be dynamically updated during training.
To address the inadequacy of using static, pre-trained word embeddings alone for intent classification, we propose a CNN based approach that learns dynamic character-level n-gram feature maps and integrates them with word embedding vectors during classifier training and real-time inference. The resulting word-character embeddings are stacked with an LSTM based contextual information extraction module to produce the final class probability distribution. The highlights of the experimental results are summarized as follows:
- A 2-D CNN based approach learns dynamic character-level n-gram feature maps and integrates them dynamically with word embedding vectors. Running experiments on a Chinese benchmark dataset, we observed that the proposed models outperform models that do not utilize characters.
- Combining multiple proposed word-character models with ensemble techniques, we observed an F1 score of 93.55% on the Chinese SMP benchmark dataset, which outperforms several existing methods and is on par with the state-of-the-art result of 93.91%, while greatly reducing development time.
- The word-character approach can be applied with little effort to many kinds of existing neural network models, simply by inserting the proposed word-character embedding module.
The remainder of this paper is organized as follows. The related work section presents work related to intent classification and embedding learning. The model section details the word-character embedding integration approach and the overall architecture, together with the ensemble methods used to combine our single models. This is followed by the experimental setup and the results and discussion sections, where we describe the experimental settings, then show, compare, and analyze the experimental results. In the last section, we conclude on the work we have done and discuss future work.
2 Related Work
We review works related to the proposed method, including intent classification methods and works on joint word and character embeddings.
Previous work has applied machine learning methods such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), n-gram language models, or combined approaches to intent determination [Wang et al.2002, Wu et al.2005, Raymond and Riccardi2007]. Knowledge based methods have also been explored [Broder et al.2007, Li et al.2008, Hu et al.2009].
In recent years, neural network based architectures and word embeddings have gained growing popularity for intent classification. Recurrent Neural Network Language Models (RNN LMs) have proven effective at capturing temporal sentence semantics through a series of hidden units [Mikolov et al.2010]. Long Short-Term Memory (LSTM) neural networks were developed to avoid the exploding and vanishing gradient problems of traditional RNNs by regulating the memory cells through activations computed by the gates [Hochreiter and Schmidhuber1997, Schmidhuber2000]. Deep RNN and LSTM based models with word embeddings have shown remarkable results on slot and intent detection, and some models detect both jointly [Yao et al.2013, Yao et al.2014, Ravuri and Stolcke2015, Shi et al.2015, Shi et al.2016, Zhang and Wang2016]. Convolutional Neural Network (CNN) based CRFs have also been applied to joint detection of slots and intents with comparable results [Xu and Sarikaya2013].
Character-aware neural language models [Kim et al.2016b] and leveraging external lexicons like WordNet to enrich word embeddings [Kim et al.2016a] are some of the effective approaches to move beyond purely word-level input. Another approach is to incorporate character-level embeddings into word embeddings, which has proven useful for part-of-speech tagging, named entity recognition, and implicit discourse relation recognition [dos Santos and Zadrozny2014, dos Santos and Guimaraes2015, Qin et al.2016]. Joint learning of word and character embeddings has also drawn growing attention, where the character-enhanced word embeddings are evaluated on intrinsic word-level tasks including word relatedness and word analogy, and can further be used for extrinsic NLP tasks such as text classification [Xu et al.2016, Chen et al.2015, Yu et al.2017]. For Chinese in particular, sub-character information has also been leveraged: the radical-level character learning methods proposed by [Sun et al.2014, Li et al.2015] have been shown to work well on either Chinese character similarity judgment and Chinese word segmentation, or on word similarity and text classification.
3 Model
As shown in Figure 1, the overall architecture is a hybrid neural network with model ensembling at the output level. The rest of this section details the CNN based word-character module, the LSTM based contextual information extraction module, and the ensemble method used.
3.1 Word-Character Embedding Module
This module learns context-adaptive, integrated Chinese word-character embeddings during training and at runtime, without relying on an external corpus for training.
Input Layer The input layer creates placeholders for sentence-level input in both word and character representations. For a sentence of n words, where each word consists of c characters (padding or truncation applied), a word-based input is represented as a sequence of n words, where the value of each word will be filled in by its d_w-dimensional word embedding vector. A character-based input is a sequence of character sequences: it is depicted as a sequence of n words, where each word is decomposed into a sequence of c characters, and the value of each character will be given by its d_c-dimensional character embedding vector. For simplicity of notation and from the mini-batch training perspective, for a batch of B sentences, the word- and character-based inputs are 3-D and 4-D tensors, i.e., W ∈ R^{B×n×d_w} and C ∈ R^{B×n×c×d_c}.
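As a sketch of these input shapes (the sizes below are toy values standing in for the batch size B, sentence length n, word length c, and embedding dimensions; they are not the paper's settings), the word- and character-based placeholders can be pictured as follows:

```python
import numpy as np

B, n, c = 2, 4, 3      # batch size, words per sentence, chars per word (after padding)
d_w, d_c = 6, 5        # word and character embedding dimensions (toy values)

# Word-based input: one embedding vector per word -> 3-D tensor (B, n, d_w).
word_input = np.zeros((B, n, d_w), dtype=np.float32)

# Character-based input: each word decomposed into c characters,
# one embedding vector per character -> 4-D tensor (B, n, c, d_c).
char_input = np.zeros((B, n, c, d_c), dtype=np.float32)

print(word_input.shape, char_input.shape)  # (2, 4, 6) (2, 4, 3, 5)
```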
Word Embedding and Character Embedding Layer The embedding layer takes the outputs from the input layer, performs word and character embedding look-ups, and fills in the placeholders with the corresponding word and character vectors.
2-D Convolutional Layer The purpose of this layer is to uncover the information embedded in the characters. 2-D convolutional neural networks are used to extract the features because they are good at extracting temporal and spatial information within a sequence. This layer takes the character output C from the embedding layer and applies a bank of f filters F ∈ R^{h×d_c×f} to compute a 2-D convolution over a window of h characters in the second dimension covering all d_c features in the third dimension. In this work, we fix h = 2 to create bi-gram character features. An output element is obtained by the following equation, where b, i, j, k are in the ranges of [1, B], [1, n], [1, c − h + 1], and [1, f] respectively, e_k is a bias, and g is a non-linear activation function applied to the convolution result plus bias:

O_{b,i,j,k} = g( Σ_{p=1}^{h} Σ_{q=1}^{d_c} F_{p,q,k} · C_{b,i,j+p−1,q} + e_k )

The resulting feature map is a 4-D tensor O ∈ R^{B×n×(c−h+1)×f}.
Max-Pooling Layer The feature map O can be interpreted as a batch of sentences in a character-level feature representation, where each word is represented by (c − h + 1) × f character features. To reduce the c − h + 1 features along the character axis to the single most informative feature within each word, a max-pooling operator with a sliding window of size (c − h + 1) × 1 is applied on O. A pooling output is computed by:

P_{b,i,1,k} = max_{1 ≤ j ≤ c−h+1} O_{b,i,j,k}

The feature map is thereby downsampled to size B × n × 1 × f. After proper reshaping, P has shape B × n × f, which, with f = d_w, matches the dimensions of the word-based input W.
Integration Layer This layer integrates the pooled character feature map with the input word vectors to bring out the most of both words and characters. Taking computation time into consideration, we average the two representations element-wise instead of concatenating them. The result is a tensor of integrated word-character vectors I ∈ R^{B×n×d_w}.
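The character pipeline above (2-D convolution over bi-gram character windows, max-pooling within each word, element-wise averaging with the word vectors) can be sketched in NumPy as below. The toy shapes and the choice of f = d_w filters are assumptions made so that the averaging step is well-defined; this is an illustration of the computation, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, c = 2, 4, 5          # batch, words per sentence, chars per word
d_w = d_c = 6              # embedding dims; f = d_w so averaging is well-defined
h, f = 2, d_w              # bi-gram window, number of filters

W = rng.standard_normal((B, n, d_w))        # word embeddings
C = rng.standard_normal((B, n, c, d_c))     # character embeddings
F = rng.standard_normal((h, d_c, f))        # convolution filter bank
e = rng.standard_normal(f)                  # bias

# 2-D convolution over a window of h characters (full depth d_c), tanh activation.
O = np.empty((B, n, c - h + 1, f))
for j in range(c - h + 1):
    window = C[:, :, j:j + h, :]                        # (B, n, h, d_c)
    O[:, :, j, :] = np.tanh(np.einsum('bnhq,hqk->bnk', window, F) + e)

# Max-pooling over the character axis: one feature vector per word.
P = O.max(axis=2)                                       # (B, n, f)

# Integration: element-wise average of word and character representations.
I = (W + P) / 2.0                                       # (B, n, d_w)
print(I.shape)  # (2, 4, 6)
```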
3.2 Contextual Information Extraction Module
Up to this point, features at the word and character levels have been extracted, but these features cannot be fully utilized without considering the dependency of words on their contexts. Contextual information, such as past neighboring words, is still important for revealing the actual meaning of a word in a sentence. We therefore use an LSTM cell to remember past temporal information, and feed the integrated embedding vectors I as input to the LSTM to obtain context-level features.
The recurrent LSTM layer has 512 hidden units, and the output is stacked with a linear layer that computes an output probability distribution over the intent classes. The argmax of the distribution is taken and returned as a single model prediction.
The LSTM works as follows. An LSTM cell is obtained by modifying the basic RNN cell, which computes the output at each timestep from the current timestep input and the previous output via

h_t = g(W x_t + U h_{t−1} + b)

The LSTM cell augments the RNN cell by implementing a forget gate f_t and an input gate i_t that control what past information is kept or discarded:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)

This allows the cell to have a state vector

C_t = f_t ⊙ C_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)

that represents the memory of the network. The output is then calculated from the cell state C_t and an output gate o_t, where

o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
h_t = o_t ⊙ tanh(C_t)
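The gated computations above can be written out as a single NumPy LSTM step. This is a generic textbook LSTM cell for illustration (with toy dimensions), not the paper's 512-unit implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep: gates regulate what the cell state keeps, adds, and emits."""
    Wf, Uf, bf, Wi, Ui, bi, Wc, Uc, bc, Wo, Uo, bo = params
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)          # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)          # input gate
    C_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)      # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                  # new cell state (memory)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)          # output gate
    h_t = o_t * np.tanh(C_t)                            # output
    return h_t, C_t

d_in, d_h = 4, 3                                        # toy input/hidden sizes
rng = np.random.default_rng(1)
params = [rng.standard_normal(s) for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 4]
h, C = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, C.shape)  # (3,) (3,)
```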
Furthermore, to enable context-adaptive embeddings, the word and character embeddings will be updated during backpropagation.
3.3 Model Ensemble
The previous discussion focused on producing a single-model result. To account for the variance in single-model predictions, we employ a model ensemble scheme at the final output level. Following the same single-model architecture as described above, we re-train three single models to form an ensemble, where the final prediction is selected by majority voting over the single-model predictions, as is also shown in Figure 1.
4 Experimental Setup
4.1 The SMP Dataset
The SMP2017ECDT (SMP) dataset consists of Chinese user queries recorded from human-computer dialogues and transcribed to text [Zhang et al.2017]. It covers 31 intents, including Chit-chat, Weather, Flights, and Cookbook. A sample query has the form 你好请问一下明天广州的天气如何 (Hello, I want to know the weather in Guangzhou tomorrow), which is labeled with the Weather intent. The dataset is split into a train set of 3,069 samples and a test set of 667 samples. The SMP dataset is somewhat imbalanced: the Chit-chat category contains around 20% of all data, whereas the remaining 30 categories are distributed more evenly.
4.2 Word and Character Embeddings
We hypothesize that dynamically integrating character features into input word features enhances performance compared to using word features alone. We therefore use random initialization for character embeddings in our experiments, and use both randomly initialized and open-domain pre-trained word embeddings (trained on a 1G Chinese Wikipedia corpus, http://pan.baidu.com/s/1boPm2x5) for experiment and control. The character embedding vectors are initialized as 300-dimensional with component values ranging from 0 to 1, and the word embedding vectors are initialized in the same range with a dimension of 60, consistent with the pre-trained word embeddings. Both randomly initialized and pre-trained embeddings are updated during every backpropagation step in training.
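The random initialization described above can be sketched as follows; the vocabulary sizes are hypothetical placeholders, only the dimensions and value range come from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_words, vocab_chars = 5000, 3000   # hypothetical vocabulary sizes

# Character embeddings: random uniform in [0, 1), 300-dimensional.
char_emb = rng.uniform(0.0, 1.0, size=(vocab_chars, 300)).astype(np.float32)

# Word embeddings: same value range, 60-dimensional to match the pre-trained vectors.
word_emb = rng.uniform(0.0, 1.0, size=(vocab_words, 60)).astype(np.float32)

print(char_emb.shape, word_emb.shape)  # (3000, 300) (5000, 60)
```

Both tables would then be marked trainable so they are updated at every backpropagation step.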
4.3 Baseline Models
Our hypothesis is that the proposed word-character based model improves intent classification accuracy compared to a word-alone model. The experiment group is therefore the hybrid word-character embedding based neural model, and the control group is the word embedding based LSTM, whose details are presented in the previous section.
Combining the hypotheses on models and embeddings, we develop four experimental settings. Two experiments use word-alone LSTMs, one with randomly initialized word embeddings and the other with pre-trained word embeddings. The experiments on the proposed model are likewise divided into two parts: one uses random initialization for both embeddings, while the other uses pre-trained word embeddings with randomly initialized character embeddings, to test whether the proposed scheme of integrating character features can still boost performance even when the word embeddings are pre-trained. We omit experiments with pre-trained character embeddings, because our focus is the effectiveness of a low-effort way of generating and integrating character features dynamically, without relying on a large external corpus or on pre-training embeddings.
For ensembles, we compare the ensemble of the proposed models to the ensemble of baseline LSTMs. A comparison of our best model with the state of the art is drawn in the Results and Discussion section.
4.4 Preprocessing
Since the data is not tokenized into words, the first step is to tokenize the sentences; the Jieba Chinese tokenizer (https://github.com/fxsjy/jieba) is applied in this work. Sentences and the words within them are then padded to fixed maximum lengths in order to do mini-batch training. Similarly, for run-time prediction, padding or truncation to the same fixed lengths is performed as a preprocessing step.
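The padding/truncation step can be sketched as below; the helper name and pad token are illustrative, and the token list assumes a Jieba-style segmentation of the sample query from the dataset section:

```python
def pad_or_truncate(tokens, max_len, pad="<PAD>"):
    """Fix a token sequence to exactly max_len by right-padding or truncating."""
    return (tokens + [pad] * max_len)[:max_len]

# Tokens as a Jieba-style segmentation of the sample query (illustrative output):
tokens = ["你好", "请问", "一下", "明天", "广州", "的", "天气", "如何"]
print(pad_or_truncate(tokens, 10))   # padded to length 10
print(pad_or_truncate(tokens, 5))    # truncated to length 5
```

The same helper would be applied per word at the character level, so every sentence yields fixed-shape word and character tensors for mini-batch training.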
4.5 Hyper-parameter Tuning
For model selection, we perform hyper-parameter tunings by grid search. The component single models in the ensemble share the same set of hyper-parameters.
4.6 Evaluation Metrics
For this multiclass intent classification problem, we measure model performance by unweighted F1 scores, implemented with the Python scikit-learn package [Pedregosa et al.2011].
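For reference, unweighted (macro-averaged) F1 averages the per-class F1 with equal class weight; below is a minimal reimplementation for illustration, assuming the label set is the union of true and predicted labels (in practice the scikit-learn `f1_score` is used, as stated above):

```python
def macro_f1(y_true, y_pred):
    """Unweighted (macro) F1: per-class F1 averaged with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with hypothetical intent labels: Weather f1 = 0.8, Flights f1 = 0.
print(macro_f1(["Weather", "Flights", "Weather"],
               ["Weather", "Weather", "Weather"]))  # 0.4
```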
5 Results and Discussion
Table 1: F1 scores of the single models on SMP.

| Embeddings and Model | F1 (%) |
|---|---|
| 1 Pre-trained word, LSTM | 78.71 |
| 2 Randomly initialized word, LSTM | 86.06 |
| 3 Pre-trained word, randomly initialized char, Word-Char LSTM | 87.86 |
| 4 Randomly initialized word and char, Word-Char LSTM | 89.51 |

Table 2: F1 scores of the ensembles on SMP.

| Ensemble | F1 (%) |
|---|---|
| Ensemble of proposed models | 93.55 |
| Ensemble of baseline LSTMs | 87.26 |
Table 3: Overall comparison of F1 scores and development time on SMP.

| Model | F1 (%) on SMP | Development time on SMP |
|---|---|---|
| The proposed single model | 89.51 | Low |
| Ensemble of baseline LSTMs | 87.26 | Low to medium |
| Ensemble of the proposed models | 93.55 | Low to medium |
| N-gram SVM [Li et al.2017] | 90.89 | Medium, with feature engineering |
| Ensemble of SIR-CNNs [Lu2017] | 92.88 | Medium to high |
| Ensemble of keyword-based LSTMs [Tang et al.2017] | 93.91 | High, with feature engineering |
The results of the proposed Word-Char (CNN based) LSTM models and the baseline LSTMs are shown in Table 1. Ensemble results are given in Table 2. A comparison of overall performance and development time across different methods is presented in Table 3.
5.1 Effectiveness of the Word-Character Approach
As shown in Table 1, performance increases from experiment 1 to 3 and from experiment 2 to 4, with a 9.15% improvement from 1 to 3 and 3.45% from 2 to 4. We thus verify that the CNN method described in the word-character module is useful for extracting and integrating informative Chinese character-level features.
Besides, comparing experiment 1 with 2, or 3 with 4, we observe an interesting finding: even though the pre-trained word embeddings are dynamically updated and fine-tuned during training, the result is still worse than with randomly initialized embeddings. This can be explained by the fact that the external source used to pre-train the word embeddings does not align well with the task at hand. More specifically, in our case the pre-training corpus does not necessarily contain contexts similar to the queries in human-computer dialogues, so for the same word, the pre-trained embedding vector and the actual vector associated with the SMP task can point in quite different directions when projected into an embedding representation space. Thus for task-specific data, relying on pre-trained vectors can have a diminishing effect on performance. This is especially true for Chinese data, where words can have very different meanings depending on context. Fortunately, Chinese characters have fewer variations and provide a more stable source for character-level word representations, which can then be easily learned and safely used without exploiting external resources.
5.2 Combining Word-Character Models with Ensemble Techniques
With ensembling, the classification accuracy of the ensemble of the proposed character-level models reaches 93.55%, an increase of 6.29% over the ensemble of baseline LSTMs. The ensemble method helps reduce variance and brings out the best of the constituent word-character models.
5.3 Overall Performance Comparisons
In this section we compare our work with the state-of-the-art works in terms of F1 scores and development time. Our ensemble model outperforms the models in two of the three works and is on par with the top-scoring model. Table 3 lists the F1 scores and development times of the proposed single model, the ensemble of proposed models, and the state-of-the-art works on SMP.
The work of [Li et al.2017] takes a classical machine learning approach to text classification. They adopt a one-vs-rest SVM classifier from the LibSVM package with n-gram character based feature vectors as input, achieving an F1 score of 90.89%. They experimented with different combinations of n-grams; in the final model, 1+2+3+4-gram vectors with a dimension of 2642 are used, with feature weights calculated by tf-idf. The overall development time is medium. Compared to their model, ours achieves higher classification accuracy with a neural architecture, and is straightforward to build without feature engineering.
The work of [Lu2017] utilizes pre-trained character embeddings as input and a same-structure ensemble of Self-Inhibiting Residual CNNs (SIR-CNNs). Convolution and max-pooling are done in 1-D, and the character embeddings are fine-tuned during training. The result is an F1 score of 92.88%, and the development time is medium to high. Our performance is better with lower computation time.
The top score comes from an ensemble of 10 single LSTM models (with multiple hidden layers) combined with data-driven keyword extraction, proposed by [Tang et al.2017]. They developed a domain-keyword based LSTM classifier, applied ensembling techniques to integrate 10 such classifiers re-trained with the same parameters, and used majority voting to select the final prediction. Their word vectors are trained on 10G of Weibo (a Chinese microblogging website) data with fine-tuning. Due to fairly complicated feature engineering, such as domain keyword extraction, the development time of their model is very high. Instead of ensembling a large number of fine-tuned complex single models with a feature extraction algorithm, the ensemble of our proposed word-character models requires no feature engineering and comprises fewer and simpler constituent models, which makes faster training possible while achieving a result comparable to the state of the art. Since their best single-classifier performance on the test set is not reported, we cannot compare their single-model result and time to ours.
6 Conclusion
In this paper, we address the Chinese intent classification problem and propose a low-effort integrated word-character approach that enhances classification accuracy compared to models using only Chinese word embeddings. We run experiments on the SMP dataset with different word and character embedding configurations. Our single model achieves an F1 score of 89.51% on SMP. Our main findings are that the Chinese SMP data benefits from the character approach, and that the proposed method does not need to rely on pre-trained word embeddings. The proposed word-character module exploits the internal word and character relationships via CNN and pooling, and the embeddings are learned during training by optimizing the same loss function on the logits as the word embedding model does.
Taking the ensemble method into account, we observe an improvement from ensembles without characters to those with characters; the best ensemble achieves 93.55% on SMP, which is on par with the state of the art. The proposed model is easy to implement and train, which greatly reduces development time compared to works that rely on feature engineering and sophisticated architectures.
Our future work could focus on experiments with different pooling strategies and embedding combination methods: for example, instead of giving equal weights to the word and character embeddings when combining them, we would like to determine the contributions of word and character embeddings by experimenting with various weights. In addition, concatenation could be used instead of a weighted average. Another focus of our work will be exploring different ensembling and stacking techniques with the character-level models. Last but not least, we will investigate attention mechanisms that could potentially further improve intent classification results.
- [Bojanowski et al.2017] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL.
- [Broder et al.2007] Andrei Broder, Marcus Fontoura, Evgeniy Gabrilovich, Amruta Joshi, Vanja Josifovski, and Tong Zhang. 2007. Robust classification of rare queries using web knowledge. the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 231–238.
- [Béchet2008] Frédéric Béchet. 2008. Processing spontaneous speech in deployed spoken language understanding systems: a survey. SLT, 1.
- [Chen et al.2015] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. IJCAI.
- [dos Santos and Guimaraes2015] Cicero Nogueira dos Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of ACL Named Entities Workshop.
- [dos Santos and Zadrozny2014] Cicero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
- [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- [Hu et al.2009] Jian Hu, Gang Wang, Fred Lochovsky, Jian-Tao Sun, and Zheng Chen. 2009. Understanding user’s query intent with wikipedia. WWW.
- [Kim et al.2016a] Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin Cao, and Ye-Yi Wang. 2016a. Intent detection using semantically enriched word embeddings. Spoken Language Technology Workshop (SLT), 2016 IEEE.
- [Kim et al.2016b] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016b. Character-aware neural language models. AAAI.
- [Li et al.2008] Xiao Li, Ye-Yi Wang, and Alex Acero. 2008. Learning query intent from regularized click graphs. Proc. of the 31th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-08).
- [Li et al.2015] Yanran Li, Wenjie Li, Fei Sun, and Sujian Li. 2015. Component-enhanced chinese character embeddings. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 829–834.
- [Li et al.2017] Dayu Li, Wenyue Zhang, Peng Zhang, Yujie Fu, and Suge Wang. 2017. Multiple features and multiple classifiers. SMP-ECDT.
- [Lu2017] Chenyu Lu. 2017. Character based cnn for text classification. SMP-ECDT.
- [Mikolov et al.2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, 2:1045–1048.
- [Mori2007] Renato De Mori. 2007. Spoken language understanding: A survey. ASRU, page 365–376.
- [Pedregosa et al.2011] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- [Qin et al.2016] Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016. Implicit discourse relation recognition with context-aware character-enhanced embeddings. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1914–1924.
- [Ravuri and Stolcke2015] Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and lstm models for lexical utterance classification. Interspeech, pages 135–139.
- [Raymond and Riccardi2007] Christian Raymond and Giuseppe Riccardi. 2007. Generative and discriminative algorithms for spoken language understanding. INTERSPEECH.
- [Schmidhuber2000] Felix A. Gers and Jürgen Schmidhuber. 2000. Learning to forget: Continual prediction with lstm. Neural Computation, 12(10):2451–2471.
- [Shi et al.2015] Yangyang Shi, Kaisheng Yao, Hu Chen, Yi-Cheng Pan, Mei-Yuh Hwang, and Baolin Peng. 2015. Contextual spoken language understanding using recurrent neural networks. ICASSP.
- [Shi et al.2016] Yangyang Shi, Kaisheng Yao, Le Tian, and Daxin Jiang. 2016. Deep lstm based feature mapping for query classification. Proceedings of NAACL-HLT, page 1501–1511.
- [Sun et al.2014] Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2014. Radical-enhanced chinese character embedding. arXiv preprint arXiv:1404.4714.
- [Tang et al.2017] Jiecong Tang, Yongshi Liang, Jiangyue Yan, Yanghui Li, Dawei Ling, Zhen Zeng, Zefeng Du, and Peijie Huang. 2017. Team report of scau sigsds on the closed task of intent classification for the evaluation of chinese human computer dialogue technology. SMP-ECDT.
- [Tur et al.2010] Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in atis? Spoken Language Technology Workshop (SLT), 2010 IEEE.
- [Wang et al.2002] Ye-Yi Wang, Alex Acero, Ciprian Chelba, Brendan Frey, and Leon Wong. 2002. Combination of statistical and rule-based approaches for spoken language understanding. ICSLP-2002, pages 609–612.
- [Wu et al.2005] Wei-Lin Wu, Ru-Zhan Lu, Feng Gao, and Yan Yuan. 2005. Combining multiple statistical classifiers to improve the accuracy of task classification. Gelbukh A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, 3406:452–462.
- [Xu and Sarikaya2013] Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular crf for joint intent detection and slot filling. ASRU.
- [Xu et al.2016] Jian Xu, Jiawei Liu, Liangang Zhang, Zhengyu Li, and Huanhuan Chen. 2016. Improve chinese word embeddings by exploiting internal structure. Proceedings of NAACL-HLT, pages 1041–1050.
- [Yao et al.2013] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. Interspeech.
- [Yao et al.2014] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. IEEE SLT.
- [Yu et al.2017] Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint embeddings of chinese words, characters, and fine-grained subcharacter components. the 2017 Conference on Empirical Methods in Natural Language Processing, pages 286–291.
- [Zhang and Wang2016] Xiaodong Zhang and Houfeng Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. IJCAI-16.
- [Zhang et al.2017] Weinan Zhang, Zhigang Chen, Wanxiang Che, Guoping Hu, and Ting Liu. 2017. The first evaluation of chinese human-computer dialogue technology. CoRR, abs/1709.10217.