Discourse segmentation aims to identify Elementary Discourse Units (EDUs), which are defined as the minimal building blocks of a text [Mann and Thompson1988]. What exactly constitutes an EDU remains controversial. A generally accepted practice takes EDUs to be non-overlapping clauses [Carlson and Marcu2001], which have proven to be a reasonable linguistic unit and have been used successfully in downstream applications such as automatic summarization [Hirao et al.2013, Li, Thadani, and Stent2016, Durrett, Berg-Kirkpatrick, and Klein2016].
For research on Chinese discourse, several discourse corpora have been constructed [Zhou and Xue2012, Li, Kong, and Zhou2014, Zhang, Qin, and Liu2014]. [Zhou and Xue2012] built a predicate-argument style Chinese discourse treebank, following the annotation scheme of the English Penn Discourse Treebank (PDTB) [Prasad et al.2007]. Motivated by Rhetorical Structure Theory (RST), [Li, Kong, and Zhou2014] constructed an RST-like discourse treebank. These works directly treat EDUs as Chinese clauses delimited by punctuation marks such as commas, semicolons and periods. Meanwhile, other work has investigated whether punctuation marks, especially commas, are EDU boundaries [Xue and Yang2011, Yang and Xue2012, Xu and Li2013, Li, Feng, and Feng2015].
In our opinion, Chinese EDUs (or clauses) do not always end at punctuation marks. Fig. 1 shows two Chinese examples of discourse segmentation (i.e., 1a and 2a). These examples show that, besides punctuation marks, some intra-sentential words can also be EDU boundaries, e.g., “亚美尼亚指出 (Armenia states)” in 1a and “在制定下一个比额表时 (when developing the next scale)” in 2a.
The discourse guidelines defined by [Carlson and Marcu2001] compile a series of rules for segmenting English EDUs, which can also be applied to Chinese text. However, compared to English, it is relatively difficult to identify Chinese EDUs under these guidelines, because Chinese phrase structure is syntactically similar to sentence structure, and relative pronouns and subordinating conjunctions are rarely used to introduce clauses [Tsao1990]. Thus, in this paper we adopt the EDU segmentation guidelines proposed by [Carlson and Marcu2001] and aim to conduct Chinese discourse segmentation in a more appropriate way.
For our work, the main problem is the absence of labeled Chinese data that conforms to this segmentation guideline. Fortunately, we observe some discourse commonalities between Chinese and English. Figure 1 gives two English discourse segmentation examples (i.e., 1b and 2b), which are translations of 1a and 2a respectively. We can see that each EDU in either language contains a verb-like word, and that similar syntactic and lexical features help in judging EDU boundaries. This leads to the idea that Chinese EDU segmentation can be conducted with the help of the discourse commonality summarized from substantial English discourse data such as the RST Discourse Treebank (RST-DT) [Carlson and Marcu2001].
As for cross-lingual discourse segmentation, [Braud, Lacroix, and Søgaard2017] designed a multi-task learning framework for five languages (i.e., English, Spanish, German, Dutch and Brazilian Portuguese), which all belong to the Indo-European family and are very similar. Their model is relatively simple and cannot extract well the common features for discourse segmentation shared by very different languages such as Chinese and English, since Chinese belongs to the Sino-Tibetan language family and English belongs to the Indo-European family. To transfer English segmentation information to Chinese, inspired by the work of [Ganin and Lempitsky2015], we design an adversarial neural network framework that leverages both language-specific features and shared language-independent features for discourse segmentation. We propose to use English labeled data to train a discourse segmenter and simultaneously transfer the learned discourse segmentation features to Chinese. A common bidirectional long short-term memory (BiLSTM) network is trained with language-adversarial training to exploit the language-independent knowledge that can be transferred from English to Chinese. At the same time, two private BiLSTMs are used to extract language-specific knowledge.
To the best of our knowledge, we are the first to propose that Chinese EDU segmentation should not be limited to judging the functions of punctuation marks, and we contribute a small-scale labeled Chinese corpus. We also contribute an adversarial neural network framework that conducts Chinese discourse segmentation with the help of English labeled data. Experiments show that our models can leverage common features from English data and learn effective Chinese-specific features from a small amount of Chinese labeled data, outperforming the baseline models.
Chinese Elementary Discourse Units
In this section, we briefly introduce the definition of Chinese EDUs in our work. Generally, the basic principle is to treat an EDU as a clause. However, since a discourse unit is a semantic concept while a clause is syntactically defined, researchers have further refined EDU segmentation. We mainly follow the guidelines defined by [Carlson and Marcu2001] and manually label a small corpus according to the characteristics of Chinese. We list some of the EDU segmentation criteria below.
Subject clauses, object clauses of non-attribution verbs, and verb complement clauses are not segmented as EDUs.
A prepositional clause is an EDU.
Complements of attribution verbs, including both speech acts and other cognitive acts, are treated as EDUs.
Coordinated sentences and clauses are broken into separate EDUs.
Coordinated verb phrases are not separated into EDUs.
Temporal expressions are marked as separate EDUs.
Correlative subordinators consist of two separate EDUs, provided that the subordinate clause contains a verbal element.
Strong discourse cues can start a new EDU regardless of whether they are followed by a clause or a phrase.
There are some embedded discourse units where one EDU is separated by a subordinate clause.
We also show some EDU segmentation examples in Fig. 2, where the parts corresponding to the above segmentation criteria are in bold face. Due to space limitations, we do not explain the segmentation criteria in detail. Following the EDU segmentation guidelines, we manually segment 782 sentences from People’s Daily for training, validation and testing.
To conduct Chinese discourse segmentation by exploiting English labeled data, we design our model as in Fig. 3. The whole architecture is mainly composed of four modules: a common feature extractor ($F_c$) that extracts the shared language-independent features for discourse segmentation, private feature extractors ($F_{zh}$ and $F_{en}$) that learn language-specific features for Chinese and English respectively, an EDU segmenter ($S$) that conducts discourse segmentation for a sequence, and a language discriminator ($D$) that judges whether a sequence is in Chinese or English.
To be specific, given a sentence $x$ composed of $n$ words, we look up the pre-trained bilingual word embeddings $e^w$ and Universal POS tag embeddings $e^p$, and use their concatenation ($e = e^w \oplus e^p$) as the input of three feature extractors (i.e., $F_c$, $F_{zh}$ and $F_{en}$), each of which is modeled as a bidirectional LSTM (BiLSTM). Given the language of $x$, $e$ is fed into the common feature extractor and the corresponding language-specific feature extractor. The common feature extractor maps $e$ into the hidden states $h^c$, which contain both forward and backward information learned by the LSTMs. Similarly, the private feature extractor converts $e$ into the hidden states $h^{zh}$ if $x$ is in Chinese, or $h^{en}$ if $x$ is in English. Next, with the hidden states as input, we design two tasks: one is discourse segmentation and the other is a language-adversarial task.
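Discourse segmentation task. Discourse segmentation is regarded as a sequence labeling problem with two labels (i.e., 0 and 1) indicating whether to segment at the current word. With the concatenation of $h^c$ with $h^p$ (i.e., $h^{zh}$ or $h^{en}$), a BiLSTM layer is applied to obtain the abstract hidden states $h$ for the following CRF layer, which outputs the final labeling results. Suppose $A$ is the transition matrix of a linear-chain CRF, and $s^{start}$, $s^{end}$ are score vectors that capture the cost of beginning and ending with a given label. Then, the global score of a given label sequence $y = (y_1, \dots, y_n)$ is represented by

$$\mathrm{score}(x, y) = s^{start}_{y_1} + \sum_{i=1}^{n} \phi(h_i, y_i) + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + s^{end}_{y_n},$$

where $\phi(h_i, y_i)$ denotes the cost of labeling the $i$-th word as $y_i$ with the hidden representation $h_i$ of that word.
Then, we choose the best label sequence based on the global score. Let the parameters of the discourse segmenter be $\theta_s$. Given $h^c$ and $h^p$, we use cross-entropy loss $\mathcal{L}$ as the objective function:

$$J_S \equiv \min_{\theta_s} \mathbb{E}_{(x, y)} \left[ \mathcal{L}\big(S(h^c \oplus h^p), y\big) \right].$$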
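To make the CRF scoring and decoding described above concrete, here is a minimal pure-Python sketch. The function names, the toy score values, and the convention that higher scores are better are assumptions for illustration only; in the actual model the per-word scores would come from the BiLSTM rather than being hard-coded.

```python
from itertools import product

def global_score(emissions, transitions, start, end, labels):
    """Global score of one label sequence under a linear-chain CRF.

    emissions[i][y]   : score of giving word i the label y
    transitions[a][b] : score of moving from label a to label b
    start[y], end[y]  : scores of beginning / ending with label y
    """
    score = start[labels[0]] + end[labels[-1]]
    score += sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score

def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring label sequence and its global score."""
    n_labels = len(start)
    # best[y] = best score of any labeled prefix that ends in label y
    best = [start[y] + emissions[0][y] for y in range(n_labels)]
    backpointers = []
    for emit in emissions[1:]:
        new_best, pointers = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][y])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][y] + emit[y])
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda y: best[y] + end[y])
    total = best[last] + end[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total

# Toy scores for a four-word sentence with labels 0 / 1 (hypothetical values).
emissions = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.4], [0.0, 1.2]]
transitions = [[0.3, 0.0], [0.0, -0.5]]
start, end = [0.1, 0.0], [0.0, 0.4]

best_path, best_total = viterbi(emissions, transitions, start, end)
# Brute-force check over all 2^4 label sequences.
brute_total = max(
    global_score(emissions, transitions, start, end, list(c))
    for c in product([0, 1], repeat=len(emissions))
)
```

Dynamic programming keeps decoding linear in sentence length, whereas the brute-force check enumerates every label sequence and is only feasible for toy inputs.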
Language-adversarial task. The language-adversarial task aims to learn language-independent features that make it difficult for the language discriminator to tell which language is used. The idea is inspired by domain adversarial training and Generative Adversarial Networks (GANs), which have been applied to sentiment classification and POS tagging [Chen et al.2016, Kim et al.2017].
First, the common language-independent features $h^c$ are fed into three convolution filters [Kim2014] whose activation function is Leaky ReLU [Maas, Hannun, and Ng2013] and whose window sizes are set to 3, 4 and 5 respectively. Then, the max-over-time pooling (MaxPool) operation is applied to obtain three fixed-length vectors, which are concatenated into a single vector. Next, a fully connected layer takes the concatenated vector as input and outputs a scalar value. The higher the value is, the more probable it is that the language is English.
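The discriminator's forward pass described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the Leaky ReLU slope of 0.01, the zero-padding of sentences shorter than a window, the random initialization range, and all function and parameter names are hypothetical and not taken from the paper.

```python
import random

def leaky_relu(x, slope=0.01):  # slope value is an assumption
    return x if x > 0 else slope * x

def conv_maxpool(states, kernels, biases, window):
    """1-D convolution over time with Leaky ReLU, then max-over-time pooling.

    states  : list of T hidden vectors (lists of floats of equal length)
    kernels : list of K kernels, each of length window * dim
    returns : K pooled features, independent of the sentence length T
    """
    dim = len(states[0])
    if len(states) < window:  # zero-pad short sentences (assumption)
        states = states + [[0.0] * dim] * (window - len(states))
    pooled = []
    for kernel, bias in zip(kernels, biases):
        activations = []
        for t in range(len(states) - window + 1):
            patch = [v for vec in states[t:t + window] for v in vec]
            activations.append(leaky_relu(sum(k * p for k, p in zip(kernel, patch)) + bias))
        pooled.append(max(activations))
    return pooled

def init_discriminator(dim, n_kernels, rng):
    """Randomly initialize filters for windows 3, 4, 5 and one linear layer."""
    conv = {}
    for window in (3, 4, 5):
        kernels = [[rng.uniform(-0.1, 0.1) for _ in range(window * dim)]
                   for _ in range(n_kernels)]
        conv[window] = (kernels, [0.0] * n_kernels)
    fc_weights = [rng.uniform(-0.1, 0.1) for _ in range(3 * n_kernels)]
    return {"conv": conv, "fc": (fc_weights, 0.0)}

def discriminator_score(states, params):
    """Scalar language score; higher is meant to indicate English."""
    features = []
    for window in (3, 4, 5):
        kernels, biases = params["conv"][window]
        features.extend(conv_maxpool(states, kernels, biases, window))
    weights, bias = params["fc"]
    return sum(w * f for w, f in zip(weights, features)) + bias

rng = random.Random(0)
params = init_discriminator(dim=4, n_kernels=2, rng=rng)
long_sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
short_sentence = long_sentence[:2]
score_long = discriminator_score(long_sentence, params)
score_short = discriminator_score(short_sentence, params)
```

Because max-over-time pooling collapses the time axis, sentences of any length map to the same number of features before the fully connected layer.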
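When training the language discriminator, we expect the scores for English instances to be higher and the scores for Chinese instances to be lower. Here, we define the distributions of the common language-independent features for English and Chinese instances as:

$$P^{en}_{F_c} \triangleq P\big(F_c(x) \mid x \in \text{English}\big), \qquad P^{zh}_{F_c} \triangleq P\big(F_c(x) \mid x \in \text{Chinese}\big).$$

We train $F_c$ to make these two distributions as close as possible so that $F_c$ is independent of language. Following [Chen et al.2016] and [Arjovsky, Chintala, and Bottou2017], we minimize the Wasserstein distance between $P^{en}_{F_c}$ and $P^{zh}_{F_c}$, since the Wasserstein distance is relatively stable with respect to hyperparameter selection. We denote the language discriminator as the function $g$; according to the Kantorovich-Rubinstein duality [Villani2008], the Wasserstein distance $W$ is:

$$W\big(P^{en}_{F_c}, P^{zh}_{F_c}\big) = \sup_{\|g\|_L \le 1} \mathbb{E}_{f \sim P^{en}_{F_c}}\big[g(f)\big] - \mathbb{E}_{f' \sim P^{zh}_{F_c}}\big[g(f')\big],$$

where the supremum is taken over all 1-Lipschitz functions $g$.
Let the discriminator $D$ be parameterized by $\theta_d$, and the objective function given $F_c$ is:

$$J_D \equiv \max_{\theta_d} \; \mathbb{E}_{F_c(x) \sim P^{en}_{F_c}}\big[D(F_c(x))\big] - \mathbb{E}_{F_c(x') \sim P^{zh}_{F_c}}\big[D(F_c(x'))\big].$$

Finally, when training the parameters of the common feature extractor $F_c$, we minimize the objective function $J_D$ given that the discriminator has been trained well. This means that the common features between English and Chinese become language-independent and the discriminator has no capacity to distinguish the two languages. At the same time, we need to consider the discourse segmentation task with $h^c$ and $h^p$ as input. Let the feature extractors be parameterized by $\theta_f$ (i.e., $\theta_c$, $\theta_{zh}$ and $\theta_{en}$), and we train the features with the objective function:

$$\min_{\theta_f} \; J_S + \lambda J_D,$$

where $J_S$ denotes the discourse segmentation objective and $\lambda$ balances the two objectives.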
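The adversarial objectives above can be estimated on a minibatch as in the following sketch. The function names are hypothetical; the clipping threshold 0.01 and the balance weight 0.1 are the values reported in the training settings, and the use of plain batch means as estimates of the expectations is an assumption.

```python
def critic_objective(en_scores, zh_scores):
    """Batch estimate of E_en[D] - E_zh[D]; the discriminator maximizes this."""
    return sum(en_scores) / len(en_scores) - sum(zh_scores) / len(zh_scores)

def clip_weights(weights, threshold=0.01):
    """Weight clipping used to keep the discriminator roughly Lipschitz."""
    return [max(-threshold, min(threshold, w)) for w in weights]

def extractor_objective(segmentation_loss, en_scores, zh_scores, lam=0.1):
    """Objective minimized by the feature extractors: J_S + lambda * J_D."""
    return segmentation_loss + lam * critic_objective(en_scores, zh_scores)

# Toy discriminator scores for two English and two Chinese sentences.
gap = critic_objective([2.0, 4.0], [1.0, 1.0])
clipped = clip_weights([0.5, -0.3, 0.005])
joint = extractor_objective(1.0, [2.0, 4.0], [1.0, 1.0])
```

The discriminator pushes the score gap up, while the common feature extractor is updated to pull the same gap down, which is what makes the shared features hard to attribute to either language.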
To train our model, we use English labeled data, Chinese unlabeled data and zero (or a little) Chinese labeled data. The English labeled corpus is important for learning language-independent features. According to the Chinese resource we use, two ways of training are designed.
With zero Chinese labeled data (ZL). We use only the English labeled corpus and the Chinese unlabeled corpus to train the language discriminator ($D$), the common feature extractor ($F_c$) and the EDU segmenter ($S$). Since no labeled Chinese data is used, the private Chinese-specific feature extractor cannot be trained. Only the common language-independent features are fed into the segmenter for Chinese discourse segmentation.
Only with a little Chinese labeled data (LL). We use English and Chinese labeled data to train our model and do not use Chinese unlabeled data. To balance the scale of Chinese data with that of English data, we keep duplicating the small amount of Chinese labeled data until it has a similar scale with the English labeled data. The duplicated Chinese labeled data and English labeled data are used to train our model until the Chinese validation data reaches the highest performance.
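The duplication step in the LL setting can be sketched as below. The helper name `oversample` is hypothetical, and truncating the repeated data to exactly the English size is an assumption, since the text only asks for a similar scale.

```python
def oversample(small, target_size):
    """Repeat `small` until it reaches `target_size` items, then truncate."""
    if not small:
        raise ValueError("cannot oversample an empty dataset")
    repeats = -(-target_size // len(small))  # ceiling division
    return (small * repeats)[:target_size]

# 400 labeled Chinese sentences balanced against 9636 English sentences.
chinese_ids = list(range(400))
balanced = oversample(chinese_ids, 9636)
```

Each Chinese sentence therefore appears about 24 times per pass over the English data.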
Chinese data. In total, there are 782 manually labeled Chinese sentences, which were introduced in the section Chinese Elementary Discourse Units. Among these, we choose 182 sentences as the test set and 200 sentences as the validation set, and use the remaining 400 sentences as training data when needed. Besides, another 9236 sentences from People’s Daily serve as unlabeled Chinese training data.
English data. We use the RST-DT corpus composed of 385 discourse-segmented articles from the Wall Street Journal (WSJ) and 500 segmented articles from [Yang and Li2018]. In total, we obtain 9636 segmented English sentences.
Bilingual word embeddings (BWE). To obtain cross-lingual word embeddings in which Chinese and English words share the same space, we use BilBOWA [Gouws, Bengio, and Corrado2015] to train 200-dimensional bilingual word embeddings. To train BilBOWA, we use 570,499 Chinese sentences from People’s Daily, 700,000 English sentences from CNN/DailyMail [Hermann et al.2015], and a parallel corpus composed of 50,000 pairs of aligned Chinese and English sentences.
Universal POS tags. To ensure that Chinese and English adopt the same POS tagset, we use the Universal POS tagset [Nivre et al.2016]. For English text, we use the pre-trained UDPipe model [Straka, Hajic, and Straková2016] for POS tagging. For Chinese text, we use the Stanford CoreNLP toolkit [Manning et al.2014] to tag with the UPenn tagset and then convert each tag to a Universal POS tag with a conversion map (we refer to http://universaldependencies.org/tagset-conversion/zh-conll-uposf.html and make some modifications), because UDPipe lacks Chinese training data and performs poorly.
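The tag conversion step amounts to a table lookup, as in the sketch below. The mapping shown is only a small illustrative subset chosen for this example; it is not the authors' modified conversion table, and mapping unknown tags to `X` is an assumption.

```python
# Illustrative subset of a Chinese Treebank -> Universal POS mapping.
# NOT the authors' modified table; entries are assumptions for this example.
CTB_TO_UPOS = {
    "NN": "NOUN",   # common noun
    "NR": "PROPN",  # proper noun
    "VV": "VERB",   # verb
    "AD": "ADV",    # adverb
    "P": "ADP",     # preposition
    "PN": "PRON",   # pronoun
    "CD": "NUM",    # cardinal number
    "JJ": "ADJ",    # noun modifier
    "PU": "PUNCT",  # punctuation
}

def to_universal(tagged_tokens, mapping=CTB_TO_UPOS, unknown="X"):
    """Map (word, treebank_tag) pairs to (word, universal_tag) pairs."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]

converted = to_universal([("中国", "NR"), ("指出", "VV"), ("，", "PU"), ("的", "DEG")])
```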
When training our model, we train the feature extractors ($F_c$, $F_{zh}$ and $F_{en}$) and the discourse segmenter ($S$) together, and train the language discriminator ($D$) separately. $D$ is trained for $k$ iterations per iteration of the feature extractors and the segmenter. Following the ADAN network [Chen et al.2016], we set $k$ to 5. For the parameter $\lambda$, which balances the influence of $J_S$ and $J_D$, we experiment with the values 0.2, 0.1 and 0.05 on the validation data and finally set it to 0.1. All the models are optimized using RMSProp with a learning rate of 0.001 and a decay rate of 0.9. We do not use Adam because momentum-based optimizers are unreliable when training weight-clipped WGANs [Arjovsky and Bottou2017]. When training $D$, the weight-clipping threshold is set to 0.01. The word embedding dimension is set to 200. The POS tag embeddings are initialized randomly and their dimension is set to 50. The dimensions of the hidden states in the BiLSTMs are set to be the same as the dimensions of their corresponding input vectors. Besides, the outputs of all BiLSTMs are regularized with a dropout rate of 0.5 [Pham et al.2014]. The minibatch size is set to 20.
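The optimizer and update schedule described above can be sketched as follows. The learning rate 0.001, decay rate 0.9 and the ratio of 5 discriminator updates per main update come from the text; the epsilon term, its placement outside the square root, the ordering of updates within an iteration, and all names are assumptions.

```python
import math

def rmsprop_step(weights, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and the new squared-gradient cache."""
    new_cache = [decay * c + (1 - decay) * g * g for c, g in zip(cache, grads)]
    new_weights = [
        w - lr * g / (math.sqrt(c) + eps)
        for w, g, c in zip(weights, grads, new_cache)
    ]
    return new_weights, new_cache

def training_schedule(num_iterations, k=5):
    """Yield the module updated at each step: k discriminator steps, then one main step."""
    for _ in range(num_iterations):
        for _ in range(k):
            yield "discriminator"
        yield "extractors+segmenter"

weights, cache = rmsprop_step([1.0], [2.0], [0.0])
steps = list(training_schedule(2))
```

RMSProp scales each step by a running average of squared gradients but carries no momentum term, which is the property the text relies on when avoiding Adam.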
To verify the effectiveness of our model, we design two sets of experiments. In the first set of experiments, we compare our models with several baselines and exhibit the performance improvement brought by the discourse commonality between Chinese and English and the adversarial language training model. In the second set of experiments, we analyze the factors such as size of the labeled Chinese data which may influence the performance of our model. In the experiments, we use Precision, Recall and F-measure to evaluate the performance and F-measure is the main metric of measuring the quality of a model.
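As an illustration of the evaluation metrics, the sketch below computes Precision, Recall and F-measure over predicted EDU boundaries. It assumes that label 1 marks a boundary position and that sentence-final boundaries are counted like any other; the function name is hypothetical.

```python
def boundary_prf(gold, pred):
    """Precision, recall and F-measure over positions labeled 1 (EDU boundaries)."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    n_pred, n_gold = sum(pred), sum(gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: three gold boundaries, three predicted, two of them correct.
p, r, f = boundary_prf([0, 1, 0, 0, 1, 0, 1], [0, 1, 0, 1, 0, 0, 1])
```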
Since there are no available tools that follow the RST-DT criteria to segment Chinese EDUs, we design several baselines for comparison. First, for the setting with no Chinese labeled training data, we use three baselines to contrast with our ZL model. The first baseline, S-baseline, directly treats each sentence as an EDU. The second baseline, P-baseline, segments a sentence at every punctuation mark except quotation marks (i.e., “ ”) and the slight-pause mark (i.e., 、). The third baseline, NoAdvers-Z, is identical to our model except that the adversarial training part is removed. From another point of view, NoAdvers-Z can be seen as a segmenter composed of a two-layer LSTM and one CRF layer. We train NoAdvers-Z on English labeled data and test it on Chinese text, which is similar to the method of [Braud, Lacroix, and Søgaard2017]. For the setting with a small amount of Chinese labeled data, we design a baseline NoAdvers-L to contrast with our LL model. The only difference between NoAdvers-L and NoAdvers-Z is that NoAdvers-L is trained using the duplicated Chinese data as the training set.
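The two rule-based baselines are simple enough to sketch directly. The code assumes pre-tokenized input, a label of 1 on the last token of each EDU, and a hand-picked punctuation inventory; all three are assumptions, as is the toy sentence.

```python
# Punctuation treated as EDU boundaries (assumed inventory).
PUNCTUATION = set("，。；：？！,.;:?!")
# Marks the P-baseline never segments on: quotation marks and the slight-pause mark.
NON_BOUNDARY = {"“", "”", "、"}

def s_baseline(tokens):
    """S-baseline: the whole sentence is one EDU, so only the last token is a boundary."""
    return [0] * (len(tokens) - 1) + [1] if tokens else []

def p_baseline(tokens):
    """P-baseline: every punctuation mark except NON_BOUNDARY ends an EDU."""
    labels = [1 if tok in PUNCTUATION and tok not in NON_BOUNDARY else 0 for tok in tokens]
    if labels:
        labels[-1] = 1  # the sentence end always closes an EDU
    return labels

# Toy tokenized sentence (hypothetical, not taken from the corpus).
tokens = ["他", "说", "，", "今天", "、", "明天", "都", "下雨", "。"]
s_labels = s_baseline(tokens)
p_labels = p_baseline(tokens)
```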
| Model | F-measure | Precision | Recall |
|---|---|---|---|
| NoAdvers-Z | 71.13% | 77.50% | 65.72% |
Table 1 compares our models with the baselines. S-baseline achieves the worst performance, though its precision is 100 percent because each sentence boundary must be an EDU boundary. Its recall of 48.46% means that only 48.46% of all EDU boundaries are sentence boundaries. P-baseline achieves a high F-measure of 79.41% despite its simplicity. Its performance shows that 71.2% of EDU boundaries fall on punctuation marks and that most punctuation marks (89.76%) are EDU boundaries. Thus, the main challenge in Chinese discourse segmentation is to recall the remaining 28.8% of EDU boundaries and to pick out those punctuation marks that do not serve as EDU boundaries. We can also see that NoAdvers-Z performs worse than P-baseline on all three metrics of Precision, Recall and F-measure. This means that we cannot directly use the English training data to train a Chinese discourse segmenter, even if the bilingual embeddings of the two languages share the same space. ZL significantly outperforms the three baselines when no labeled Chinese data is available and achieves an F-measure of 82.35% by leveraging the adversarially trained common feature extractor.
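As a quick consistency check on the P-baseline figures quoted above (precision 89.76%, recall 71.2%), the reported F-measure follows from the harmonic mean of the two, assuming the balanced F-measure is the one used:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.8976 \times 0.712}{0.8976 + 0.712}
  = \frac{1.2782}{1.6096}
  \approx 0.7941
\]
```

The same formula applied to the S-baseline (precision 100%, recall 48.46%) would give an F-measure of about 65.28%.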
Compared to P-baseline, ZL can identify more EDU boundaries because the shared language-independent features learned from English labeled data are better suited to recognizing Chinese EDUs. This can be verified by the segmentation examples shown in Figure 4. In Example (3a), ZL identifies the attribution verb “提醒 (remind)”, which indicates an EDU boundary following a subordinate clause. This suggests that “remind” also occurs as an attribution verb in the English training data and that the Chinese segmenter learns this feature. In Example (4a), ZL correctly judges the functions of two commas, where the content before the first comma is a temporal expression and the second comma is a real EDU boundary, because our model may learn the language-independent features that a temporal clause cannot be a separate EDU and that an EDU should contain a verb-like word. ZL clearly has its limitations. Example (5a) is wrongly segmented and the segmented EDUs do not have complete meanings, though the model learns to include a verb-like word in each EDU. This is because Chinese is a paratactic language and expresses the passive voice by adding a function word (e.g., “由…创作 (be created)” in this example), which is different from English. It is difficult for ZL to learn such Chinese-specific features without any Chinese labeled data.
Table 1 also shows the performance of NoAdvers-L and LL, which use the 400 Chinese labeled sentences to supervise the training. NoAdvers-L and LL both outperform the methods that use zero Chinese labeled data. With Chinese and English labeled data together, LL can effectively train the Chinese-specific features as well as the common language-independent features, both of which are useful for labeling EDU boundaries. At the same time, NoAdvers-L achieves satisfactory results, which again verifies that labeled data can provide effective supervision. LL outperforms NoAdvers-L, since the common features learned through adversarial language discrimination and cross-lingual resources compensate for the insufficient training of Chinese-specific features caused by the small amount of Chinese labeled data. Figure 5 shows two segmentation examples produced by the LL model. Here LL correctly segments Example (6a), which was wrongly segmented by ZL (i.e., 5a). This example shows that the Chinese labeled data can train more Chinese-specific features such as the embedded attributive clause. Example (7a) is still wrongly segmented by LL. Here “报告 (report)” and “伤亡 (casualty)” are treated as nouns. In fact, when our annotators read this sentence, “伤亡” is always understood as the verb phrase “be killed or injured”. This is because verbs and verb phrases in Chinese can be nominalized without any overt marker. Without a large amount of Chinese labeled data, it is difficult to learn such ambiguous knowledge, which is crucial to discourse segmentation.
Impact of Resources
In this subsection, we mainly investigate the influence of the size of the Chinese labeled data and of the Universal POS tagset. To figure out to what extent our model can remedy the lack of labeled Chinese data, we explore how performance changes with the size of these data and compare our models with the NoAdvers models. Figure 6 displays the performance change. With zero Chinese labeled data, there is a large performance gap between our ZL model and the NoAdvers-Z model, because the common features learned through the adversarial mechanism contribute much to Chinese discourse segmentation in ZL. As the Chinese labeled data increases, the performance gap between our LL model and the NoAdvers-L model first decreases and then remains stable once the size of the Chinese labeled data reaches 200 sentences. This trend means that Chinese labeled data, when available, is more effective for training the Chinese discourse segmenter than English data. The more Chinese labeled data a model has, the higher the performance it achieves. From the performance gaps between our models and the NoAdvers models, we can say that the common language-independent features learned from cross-lingual discourse commonality also contribute to Chinese discourse segmentation. Since we do not have a large amount of Chinese labeled data, we are not sure whether the two performance curves will intersect, i.e., whether the common features learned from English data will stop contributing to Chinese discourse segmentation when the Chinese labeled data is large enough. At least, with the small-scale Chinese training data at hand, the common language-independent features learned through adversarial discrimination are still important to the performance of a Chinese discourse segmenter.
To show the necessity of using the Universal POS tagset, we conduct contrast experiments with language-specific POS tagsets (the English Penn Treebank POS tagset and the Chinese Penn Treebank POS tagset) as input to the ZL and LL models respectively. All settings are the same as introduced above except for the POS tagsets and POS tag embeddings used in the network. The results are shown in Table 2. Both the ZL and LL models with the Universal POS tagset show superior results to the models with language-specific POS tagsets. These results suggest that the Universal POS tagset provides a good basis for learning cross-lingual knowledge, which is important for discourse segmentation.
Recently, much work has been conducted on English EDU segmentation. Most of these segmenters are based on statistical classifiers and rely on many manually extracted features, among which lexico-syntactic subtree features play an important role [Soricut and Marcu2003]. Modifications to the statistical model and features have brought great improvements [Fisher and Roark2007, Joty, Carenini, and Ng2015], and [Bach, Minh, and Shimazu2012] achieved the best results to date. Besides, [Subba and Di Eugenio2007] first used neural networks for this task, and some neural network models have achieved comparable results using fewer hand-crafted features. Since [Sporleder and Lapata2005] treat EDU segmentation as a sequence labeling problem, we leverage the LSTM-CRF model [Huang, Xu, and Yu2015, Ma and Hovy2016], which has proved effective for this problem.
As for Chinese EDU segmentation, previous work mainly focused on identifying EDU boundaries at punctuation marks and treated the task as comma classification [Li, Feng, and Zhou2012, Xue and Yang2011, Yang and Xue2012, Xu and Li2013]. [Cao et al.2017] claimed to conduct Chinese EDU segmentation based on RST, but their criteria do not consider EDU boundaries without punctuation, and only a small corpus is released. Thus, with zero or little segmented Chinese data, we attempt to borrow knowledge from abundant English labeled data and design an adversarial neural network framework to learn the discourse commonality across different languages.
The main motivation of cross-lingual tasks is to exploit the commonality between languages or to remedy the lack of labeled data in a language. One kind of method projects annotations across parallel corpora [Yarowsky, Ngai, and Wicentowski2001, Diab and Resnik2002, Padó and Lapata2006, Xi and Hwa2005]. However, parallel corpora are difficult to obtain. Among the alternatives, one kind of method is based on statistical models and requires cross-lingual features [Ando and Zhang2005, Darwish2013]; the other kind is based on neural networks, since word embeddings in different languages are capable of representing semantic meanings in the same space. Besides bilingual word embeddings, bilingual character embeddings also improve performance [Yang, Salakhutdinov, and Cohen2016, Cotterell and Duh2017]. However, these works are usually constrained to one language family, such as the Indo-European languages, which share some characters [Braud, Lacroix, and Søgaard2017]. For Chinese and English, which belong to different language families, we propose a method to obtain Universal POS tags for Chinese and use an adversarial network to solve the language adaptation problem, inspired by [Ganin and Lempitsky2015], [Chen et al.2016] and [Kim et al.2017].
In this paper, we propose to segment Chinese EDUs based on RST and to identify EDU boundaries where there is no punctuation. Furthermore, because labeled Chinese data following this criterion is lacking, we design an adversarial network that exploits abundant English discourse data to help segment Chinese EDUs. Based on cross-lingual discourse commonality, we use an adversarial language discrimination task to extract common language-independent features, together with language-specific features, which are useful for discourse segmentation. Experimental results verify the effectiveness of our models.
- [Ando and Zhang2005] Ando, R. K., and Zhang, T. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6(Nov):1817–1853.
- [Arjovsky and Bottou2017] Arjovsky, M., and Bottou, L. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.
- [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
- [Bach, Minh, and Shimazu2012] Bach, N. X.; Minh, N. L.; and Shimazu, A. 2012. A reranking model for discourse segmentation using subtree features. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 160–168. Association for Computational Linguistics.
- [Braud, Lacroix, and Søgaard2017] Braud, C.; Lacroix, O.; and Søgaard, A. 2017. Cross-lingual and cross-domain discourse segmentation of entire documents. arXiv preprint arXiv:1704.04100.
- [Cao et al.2017] Cao, S.; Xue, N.; da Cunha, I.; Iruskieta, M.; and Wang, C. 2017. Discourse segmentation for building a rst chinese treebank. In Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms, 73–81.
- [Carlson and Marcu2001] Carlson, L., and Marcu, D. 2001. Discourse tagging reference manual. ISI Technical Report ISI-TR-545 54:56.
- [Chen et al.2016] Chen, X.; Sun, Y.; Athiwaratkun, B.; Cardie, C.; and Weinberger, K. 2016. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.
- [Cotterell and Duh2017] Cotterell, R., and Duh, K. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, 91–96.
- [Darwish2013] Darwish, K. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 1558–1567.
- [Diab and Resnik2002] Diab, M., and Resnik, P. 2002. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 255–262. Association for Computational Linguistics.
- [Durrett, Berg-Kirkpatrick, and Klein2016] Durrett, G.; Berg-Kirkpatrick, T.; and Klein, D. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. arXiv preprint arXiv:1603.08887.
- [Fisher and Roark2007] Fisher, S., and Roark, B. 2007. The utility of parse-derived features for automatic discourse segmentation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 488–495.
- [Ganin and Lempitsky2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 1180–1189.
- [Gouws, Bengio, and Corrado2015] Gouws, S.; Bengio, Y.; and Corrado, G. 2015. Bilbowa: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, 748–756.
- [Hermann et al.2015] Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.
- [Hirao et al.2013] Hirao, T.; Yoshida, Y.; Nishino, M.; Yasuda, N.; and Nagata, M. 2013. Single-document summarization as a tree knapsack problem. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1515–1520.
- [Huang, Xu, and Yu2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
- [Joty, Carenini, and Ng2015] Joty, S.; Carenini, G.; and Ng, R. T. 2015. Codra: A novel discriminative framework for rhetorical analysis. Computational Linguistics 41(3):385–435.
- [Kim et al.2017] Kim, J.-K.; Kim, Y.-B.; Sarikaya, R.; and Fosler-Lussier, E. 2017. Cross-lingual transfer learning for pos tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2832–2838.
- [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- [Li, Feng, and Feng2015] Li, Y.; Feng, H.; and Feng, W. 2015. Chinese discourse segmentation based on punctuation marks. Int. J. Signal Process. Image Process. Pattern Recognit 8(3):177–186.
- [Li, Feng, and Zhou2012] Li, Y.; Feng, W.; and Zhou, G. 2012. Elementary discourse unit in chinese discourse structure analysis. In Workshop on Chinese Lexical Semantics, 186–198. Springer.
- [Li, Kong, and Zhou2014] Li, Y.; Kong, F.; and Zhou, G. 2014. Building chinese discourse corpus with connective-driven dependency tree structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2105–2114.
- [Li, Thadani, and Stent2016] Li, J. J.; Thadani, K.; and Stent, A. 2016. The role of discourse units in near-extractive summarization. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 137–147.
- [Ma and Hovy2016] Ma, X., and Hovy, E. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
- [Maas, Hannun, and Ng2013] Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, 3.
- [Mann and Thompson1988] Mann, W. C., and Thompson, S. A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse 8(3):243–281.
- [Manning et al.2014] Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; and McClosky, D. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, 55–60.
- [Nivre et al.2016] Nivre, J.; de Marneffe, M.-C.; Ginter, F.; Goldberg, Y.; Hajic, J.; Manning, C. D.; McDonald, R. T.; Petrov, S.; Pyysalo, S.; Silveira, N.; et al. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
- [Padó and Lapata2006] Padó, S., and Lapata, M. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 1161–1168. Association for Computational Linguistics.
- [Pham et al.2014] Pham, V.; Bluche, T.; Kermorvant, C.; and Louradour, J. 2014. Dropout improves recurrent neural networks for handwriting recognition. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, 285–290. IEEE.
- [Prasad et al.2007] Prasad, R.; Miltsakaki, E.; Dinesh, N.; Lee, A.; Joshi, A.; Robaldo, L.; and Webber, B. L. 2007. The penn discourse treebank 2.0 annotation manual.
- [Soricut and Marcu2003] Soricut, R., and Marcu, D. 2003. Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 149–156. Association for Computational Linguistics.
- [Sporleder and Lapata2005] Sporleder, C., and Lapata, M. 2005. Discourse chunking and its application to sentence compression. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 257–264. Association for Computational Linguistics.
- [Straka, Hajic, and Straková2016] Straka, M.; Hajic, J.; and Straková, J. 2016. Udpipe: Trainable pipeline for processing conll-u files performing tokenization, morphological analysis, pos tagging and parsing. In LREC.
- [Subba and Di Eugenio2007] Subba, R., and Di Eugenio, B. 2007. Automatic discourse segmentation using neural networks. In Proc. of the 11th Workshop on the Semantics and Pragmatics of Dialogue, 189–190.
- [Tsao1990] Tsao, F. 1990. Sentence and clause structure in Chinese: a functional perspective. Student Book Co. Ltd.
- [Villani2008] Villani, C. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.
- [Xi and Hwa2005] Xi, C., and Hwa, R. 2005. A backoff model for bootstrapping resources for non-english languages. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 851–858. Association for Computational Linguistics.
- [Xu and Li2013] Xu, S., and Li, P. 2013. Recognizing chinese elementary discourse unit on comma. In Asian Language Processing (IALP), 2013 International Conference on, 3–6. IEEE.
- [Xue and Yang2011] Xue, N., and Yang, Y. 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, 631–635. Association for Computational Linguistics.
- [Yang and Li2018] Yang, A., and Li, S. 2018. Scidtb: Discourse dependency treebank for scientific abstracts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
- [Yang and Xue2012] Yang, Y., and Xue, N. 2012. Chinese comma disambiguation for discourse analysis. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, 786–794. Association for Computational Linguistics.
- [Yang, Salakhutdinov, and Cohen2016] Yang, Z.; Salakhutdinov, R.; and Cohen, W. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
- [Yarowsky, Ngai, and Wicentowski2001] Yarowsky, D.; Ngai, G.; and Wicentowski, R. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research, 1–8. Association for Computational Linguistics.
- [Zhang, Qin, and Liu2014] Zhang, M.; Qin, B.; and Liu, T. 2014. Chinese discourse relation semantic taxonomy and annotation. Journal of Chinese Information Processing 28:28–36.
- [Zhou and Xue2012] Zhou, Y., and Xue, N. 2012. Pdtb-style discourse annotation of chinese text. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 69–77. Association for Computational Linguistics.