
TransSent: Towards Generation of Structured Sentences with Discourse Marker

by   Xing Wu, et al.

This paper focuses on the task of generating long structured sentences with explicit discourse markers, by proposing a new task Sentence Transfer and a novel model architecture TransSent. Previous works on text generation fused semantic and structure information in one mixed hidden representation. However, the structure was difficult to maintain properly when the generated sentence became longer. In this work, we explicitly separate the modeling process of semantic information and structure information. Intuitively, humans produce long sentences by directly connecting discourses with discourse markers like and, but, etc. We thus define a new task called Sentence Transfer. This task represents a long sentence as (head discourse, discourse marker, tail discourse) and aims at tail discourse generation based on head discourse and discourse marker. Then, by connecting original head discourse and generated tail discourse with a discourse marker, we generate a long structured sentence. We also propose a model architecture called TransSent, which models relations between two discourses by interpreting them as transferring from one discourse to the other in the embedding space. Experiment results show that our model achieves better performance in automatic evaluations, and can generate structured sentences with high quality. The datasets can be accessed by dataset.



1 Introduction

Automatically generating semantically meaningful and well-structured text has many applications in question answering, dialogue systems, product reviews, etc. Due to the obscurity and complexity of human language, it is difficult to generate realistic sentences, making natural language generation (NLG) one of the most challenging tasks in natural language processing (NLP). Recently, with the development of deep neural networks, many studies have shown promising results in NLG. Some generate sentences from scratch [3, 26, 13, 27, 14, 19], while others [11, 10] explore the influence of different attributes. However, they focus on improving the quality of individual sentences while ignoring cross-sentence relations and dependencies. All these methods fuse semantic information and structure information in one mixed hidden representation, and decode the representation into a sentence. As the generated sentence becomes longer, its structure becomes difficult to maintain. Although a few works focus on long text generation [7, 2], these generative models do not have a clear mechanism to relieve the problem.

| S1 | marker | S2 |
|---|---|---|
| She was late to class | because | she missed the bus |
| She was sick at home | so | she missed the class |
| She was good at soccer | but | she missed the goal |
| She had a clever son | and | she loved him |

Table 1: Example discourse pairs with correct discourse markers.

Discourse markers [8, 23, 16] are the words that mark the semantic relationship between two sentences, such as because, but, and and. Humans naturally connect discourses with discourse markers, as shown in Table 1. Works such as DisSent [18] learn high-quality sentence embeddings by leveraging explicit discourse relations. However, there have been few attempts to generate long structured sentences explicitly composed of discourses. Compared to generating a long structured sentence from scratch out of one hidden representation, it is easier to generate two discourses with an explicit discourse relation indicated by a discourse marker. Such generation can also help generative tasks like QA, which makes it a useful task.

We thus take a structured sentence as a triple $(s_h, r, s_t)$ of head discourse, relation, and tail discourse, where the relation $r$ is indicated by the discourse marker. Based on this, we propose a new task called Sentence Transfer, which aims at tail discourse generation based on the head discourse and the discourse marker. An example illustrating the task is shown in Figure 1. We want to generate a tail discourse that still holds the “and” relation with the head discourse “I enjoy the movie”, such as “i like the popcorn sold in this cinema”. Such a setting brings two benefits. First, sentences generated through our method are naturally structured. Second, as humans naturally use discourse markers, it is easy to collect huge amounts of text for model training, without hand annotation.

Figure 1: An example of Sentence Transfer.

We also propose a novel model architecture called TransSent to fulfill the task. TransSent consists of three parts, as shown in Figure 4. The AutoEncoder network learns hidden representations for sentences and decodes hidden representations into text. The relation translation network translates the head discourse into the tail discourse in embedding space. Following [25, 6, 4], we use a language model as the discriminator to measure the coherence and cohesion of a long structured sentence.

Figure 2: Generation of long structured sentence.

As shown in Figure 2, once TransSent is trained, we are able to generate long structured sentences in three steps. We first use a trained Variational AutoEncoder (VAE) to randomly generate a head discourse. We then select one discourse marker and perform Sentence Transfer with TransSent, generating the tail discourse. Finally, we concatenate the original head discourse and the generated tail discourse with the discourse marker to get a long structured sentence.

We evaluate the performance of our model and the baselines with a discriminator trained on the DMP task [18]. Experimental results show that our model outperforms the baseline and can generate long structured sentences with high quality.

Our contributions are as follows:

  • We define the Sentence Transfer task and construct a domain-specific dataset and two open domain datasets for further research.

  • We propose a novel model architecture called TransSent for long structured sentence generation, which can be easily combined with tasks like QA to generate a better reply.

  • Experimental results on all datasets show that our method performs better in the automatic evaluation and human evaluation.

2 Related Work

2.1 Long Text Generation

Natural language generation has drawn considerable attention and has been studied by many researchers. [3] uses a VAE to generate sentences from a continuous space. [26] models text generation as a sequential decision-making process by training the generative model with a policy gradient strategy [21]. Since then, many related improvements have been proposed. [7] proposes LeakGAN for generating long text via adversarial training, which allows the discriminative net to leak its own high-level extracted features to the generative net to further help the guidance. [2] investigates the use of discourse-aware rewards with reinforcement learning to guide a model to generate long, coherent text.

However, these works encode semantic information and structure information directly in a mixed hidden representation, making it hard to decode a semantically meaningful and well-structured long sentence. As the generated sentence gets longer, its structure becomes difficult to maintain, and the quality of the entire sentence degrades.

2.2 Translation Models for Knowledge Representation Learning

Knowledge representation learning aims to embed the entities and relations in knowledge graphs (KGs) into a continuous semantic space. A knowledge graph is a set of fact triples of the form $(h, r, t)$, where $h$ and $t$ are the head and tail entities holding relation $r$. TransE [1] regards a relation as a translation between the head and tail entities, and assumes that $h + r \approx t$ when $(h, r, t)$ holds. To address the issues of TransE when modeling complex mapping relations, TransH [22] enables an entity to have different representations when involved in various relations. TransR [15] observes that an entity may exhibit different attributes in distinct relations and chooses to model entities and relations in separate spaces.
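The translation assumption of TransE can be illustrated with a small sketch. The embeddings below are hypothetical toy values chosen only for illustration, and `transe_score` is an illustrative helper name, not part of any library: a triple is considered plausible when the distance between head plus relation and tail is small.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance between (head + relation) and tail; lower means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical 2-d embeddings, chosen so that the first triple holds exactly.
paris = [1.0, 2.0]
capital_of = [0.5, -1.0]
france = [1.5, 1.0]
germany = [4.0, 4.0]

good = transe_score(paris, capital_of, france)   # head + relation lands on the tail
bad = transe_score(paris, capital_of, germany)   # head + relation is far from the tail
print(good, bad)
```

TransSent applies the same idea to sentence representations instead of entity embeddings, with the discourse marker playing the role of the relation.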

Figure 3: Simple illustration of TransSent.

Inspired by the success of these methods, we perform a similar translation between two sentence representations, as in the Trans-family methods, in order to simulate the latent semantic relation indicated by the discourse marker, as shown in Figure 3. One challenge is that sentences are more complicated than knowledge entities when encoded into a continuous semantic space. The other challenge is that the translated hidden representation must be decodable into a sentence, which increases the difficulty of the problem.

2.3 Discourse Relations

[8] argues that discourse relations always exist; they compose into parsable structures and can be categorized. [24] proposes to use discourse context and reward to refine translation quality from the discourse perspective. Nie et al. [18] define the term “discourse markers” and propose the discourse marker prediction (DMP) task, which aims to predict which discourse marker should be used to connect two adjacent discourses. Moreover, Nie et al. [18] provide an automatic way to collect a dataset of sentence pairs and the relations between them from natural text corpora, using a set of explicit discourse markers and universal dependency parsing [20].

3 Sentence Transfer Task

We focus here on structured sentences with explicit discourse markers between adjacent discourses, rather than implicit relations between a sentence and the related discourse. We define a structured sentence as a triple $(s_h, r, s_t)$ of head discourse, relation, and tail discourse, where the relation $r$ is indicated by the discourse marker. The Sentence Transfer task aims at tail discourse generation based on the head discourse and the discourse marker. The generated tail discourse should hold the relation $r$ to the head discourse. An example is shown in Figure 1.

Sentence Transfer is a useful helper for generating long structured sentences. We can first generate a head discourse, then apply a discourse marker and generate a tail discourse through the Sentence Transfer task. By concatenating the three parts, we get a long structured sentence.
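The task interface can be sketched as follows. This is a minimal illustration under stated assumptions: `StructuredSentence`, `sentence_transfer`, and `toy_generator` are hypothetical names, and the tail generator is a stand-in callable that returns the example tail from Figure 1 rather than a trained model.

```python
from typing import Callable, NamedTuple


class StructuredSentence(NamedTuple):
    """A structured sentence as (head discourse, discourse marker, tail discourse)."""
    head: str
    marker: str
    tail: str

    def to_text(self) -> str:
        # Concatenate the three parts into one long structured sentence.
        return f"{self.head} {self.marker} {self.tail}"


def sentence_transfer(head: str, marker: str,
                      generate_tail: Callable[[str, str], str]) -> StructuredSentence:
    """Generate a tail discourse for (head, marker) and assemble the triple."""
    tail = generate_tail(head, marker)
    return StructuredSentence(head, marker, tail)


def toy_generator(head: str, marker: str) -> str:
    # Stand-in for a trained model; always returns the example tail.
    return "i like the popcorn sold in this cinema"


example = sentence_transfer("I enjoy the movie", "and", toy_generator)
print(example.to_text())
```

Keeping the generator as a parameter separates the task definition (triple in, sentence out) from whichever model produces the tail.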

Figure 4: The overall model architecture of TransSent.

4 TransSent Model Architecture

In this section, we introduce TransSent in detail. Its two core components are an autoencoder and a relation translation network. The encoder of the autoencoder learns the latent representation of a sentence, and the decoder maps the latent representation back to the original text. Specifically, we use BERT fine-tuned on the discourse marker prediction task as the encoder. The relation translation network learns to translate a head discourse to its corresponding tail discourse in latent representation space, according to a specified relationship.

Formally, we are given a training set of discourse pairs and the relations between them. Each example is denoted as $(s_h, r, s_t)$ (the subscript $h$ stands for head and $t$ for tail), composed of a head discourse $s_h$, a tail discourse $s_t$, and a relationship $r$. Our model learns vector embeddings of the discourses and the relations. As discourses and relations are different kinds of objects, a common embedding space may not be able to represent both. We thus propose to model discourses and relations in distinct spaces, i.e., a sentence space and a relation space, and perform sentence transfer in sentence space.


TransSent consists of three parts: an AutoEncoder, a relation translation network, and a discriminator constraint, as shown in Figure 4. The AutoEncoder comprises two sub-parts, an encoder $E$ and a decoder $D$. $E$ encodes the head discourse $s_h$ into its feature representation $z_h$. Then the relation translation network applies the relation translation to $z_h$ and obtains $\hat{z}_t$, the feature representation of a tail discourse. Finally, $D$ decodes $\hat{z}_t$ into a tail discourse $\hat{s}_t$.

4.1 AutoEncoder

Encoder: Fine-tune BERT on Discourse Marker Prediction Task

Figure 5: Encoder: Fine-tune BERT on Discourse Marker Prediction Task.

We use a pre-trained deep bidirectional transformer, known as BERT [5], as our encoder. This model applies a multi-head self-attention operation over the input context tokens and the corresponding position and segment embeddings. Trained with two novel unsupervised prediction tasks, BERT is able to provide good sentence representations. To get a highly self-interpretable representation $z$ for a sentence $s$, we fine-tune BERT on the discourse marker prediction (DMP) task [18]. The DMP task aims to predict which discourse marker should be used to connect two adjacent discourses. BERT takes a pair of adjacent discourses as input and outputs the probability distribution over discourse markers. Each of the discourses is encoded in one input segment. We leave out the details of fine-tuning, as there is an official open tutorial about it.

The fine-tuning process can also be considered as teaching BERT to understand the structural relations between sentence pairs. One extra benefit is that the fine-tuned BERT is also used as the discriminator for automatic evaluation, to judge whether the discourse marker within a structured sentence is correct or not.

We optimize the encoder by minimizing the following loss:
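Since DMP is a multi-class classification over discourse markers, a standard choice for this objective is the cross-entropy of the correct marker. The form below is an assumption based on that standard choice, not a formula confirmed by the text; $\mathcal{L}_{enc}$ is a hypothetical name, and $p(r \mid s_h, s_t)$ denotes the probability that BERT assigns to marker $r$ for the discourse pair $(s_h, s_t)$.

```latex
\mathcal{L}_{enc} = - \sum_{(s_h,\, r,\, s_t)} \log p\left(r \mid s_h, s_t\right)
```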


Then we keep the fine-tuned BERT model weights fixed and use the model to extract sentence representations. Specifically, suppose a sentence $s = (x_1, \dots, x_n)$, where $n$ is the total number of tokens. The fine-tuned BERT outputs a hidden representation for each position in its last layer, i.e., $(h_1, \dots, h_n)$. We concatenate the $h_i$ and apply a nonlinear projection to the concatenated vector to get the sentence's representation $z$:
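One way to write this projection is sketched below. The choice of $\tanh$ as the nonlinearity and the names $W_z$ and $b_z$ for the projection weights are assumptions made for illustration; the text only states that a nonlinear projection is applied to the concatenated token representations.

```latex
z = \tanh\left( W_z \, [\, h_1 ; h_2 ; \dots ; h_n \,] + b_z \right)
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation, so $W_z$ maps a vector of size $n$ times the BERT hidden size to the sentence-representation size.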



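We adopt an LSTM [9] as the decoder to recover the sentence $s = (x_1, \dots, x_n)$ from its representation $z$:

$$ o_i = \mathrm{LSTM}(o_{i-1}, x_{i-1}), \qquad p(x_i \mid x_{<i}, z) = \mathrm{softmax}(W_o\, o_i), $$

where $o_i$ is the decoder hidden state at step $i$, conditioned on $z$, and $W_o$ is a projection matrix.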

We optimize the decoder by minimizing the following loss:
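A typical reconstruction objective for such a decoder is the negative log-likelihood of the original tokens given the sentence representation. The form below is an assumed standard formulation rather than the authors' stated one; $\mathcal{L}_{rec}$ is a hypothetical name, $s = (x_1, \dots, x_n)$ is the sentence, and $z$ is its representation.

```latex
\mathcal{L}_{rec} = - \sum_{i=1}^{n} \log p\left(x_i \mid x_{<i},\, z\right)
```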


4.2 Relation Translation Network

Following a similar setting in knowledge representation, we represent relations as translations in the embedding space (sentence embedding is equivalent to sentence representation $z$). We propose a hypothesis under which we construct a translation module.

Referring to TransE, we assume:

$$ z_h + r \approx z_t, $$

that is, there exists a mapping in embedding space between a tail discourse and a head discourse plus some vector that depends on the relation $r$. Formally, for each triple $(s_h, r, s_t)$, the sentences are represented as $z_h, z_t \in \mathbb{R}^k$ in sentence space and the relation is represented as $r \in \mathbb{R}^d$ in relation space.

Referring to TransR, the dimensions of sentence embeddings ($k$) and relation embeddings ($d$) are not necessarily identical, i.e., $k \neq d$ in general. For each relation $r$, we train a projection matrix $M_r \in \mathbb{R}^{k \times d}$, which projects sentence representations from sentence space to relation space. With the projection matrix, we define the projected representation vectors of the head and tail sentences as:

$$ z_h^r = z_h M_r, \qquad z_t^r = z_t M_r. $$
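Specifically, in relation space, for a triple $(s_h, r, s_t)$, we apply the translation to the projected head representation $z_h^r$:

$$ \hat{z}_t^r = W_c \, [\, z_h^r \oplus r \,], $$

where $\oplus$ denotes the concatenation operation and $W_c$ is a projection matrix. We measure the distance between the translated vector $\hat{z}_t^r$ and the projected tail representation $z_t^r$ in relation space with the 2-norm distance:

$$ \mathcal{L}_{dis} = \lVert \hat{z}_t^r - z_t^r \rVert_2. $$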
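To further encourage the translated vector $\hat{z}_t^r$ to be close to the projected tail representation $z_t^r$, we want $\hat{z}_t^r$ to be much farther from the projected head representation $z_h^r$ than from $z_t^r$. So we introduce another loss: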
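A margin-based ranking loss is one natural way to express this constraint. The form below is an assumption in the spirit of TransE-style objectives, not the authors' stated formula: $\mathcal{L}_{mar}$ is a hypothetical name, $\gamma > 0$ is a hypothetical margin hyperparameter, $\hat{z}_t^r$ is the translated tail representation, and $z_t^r$ and $z_h^r$ are the projected tail and head representations in relation space.

```latex
\mathcal{L}_{mar} = \max\left( 0,\; \gamma
    + \lVert \hat{z}_t^r - z_t^r \rVert_2
    - \lVert \hat{z}_t^r - z_h^r \rVert_2 \right)
```

The loss is zero once the translated vector is closer to the tail than to the head by at least $\gamma$, and positive otherwise.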


We feed the translated vector $\hat{z}_t^r$ into a feed-forward network that maps it from relation space back into sentence space, approximating the matrix inversion operation with neural computation and yielding $\hat{z}_t$. The tail discourse $\hat{s}_t$ can then be obtained through the decoder described in Section 4.1. Ideally, $\hat{s}_t$ holds the original relation $r$ with the original head discourse $s_h$.

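Combining Equations 5, 9, and 10, we obtain the training objective:

$$ \mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{dis} + \lambda_2 \mathcal{L}_{mar}, $$

where $\mathcal{L}_{rec}$ is the reconstruction loss (Eq. 5), $\mathcal{L}_{dis}$ is the distance loss (Eq. 9), $\mathcal{L}_{mar}$ is the additional relation loss (Eq. 10), and $\lambda_1$ and $\lambda_2$ are balancing hyperparameters.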

The training details of TransSent are shown in Algorithm 1.

Algorithm 1: Training procedure of the TransSent model
Pretrain a language model
Fine-tune BERT on the DMP task to obtain the encoder
Fix the weights of the encoder
for each iteration i = 1, 2, …, M do
    Sample a structured sentence (head discourse, relation, tail discourse)
    Obtain the sentence representation of the head discourse based on Eq. 2
    Calculate the reconstruction loss based on Eq. 5
    Do relation translation and acquire the new tail discourse representation based on Eq. 7-8
    Calculate the distance loss and the additional relation loss based on Eq. 9-10
    Calculate the overall training objective based on Eq. 11
    Update model parameters
end for
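The relation translation step inside the loop can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the matrices are hand-picked rather than learned, and matrices are stored as lists of rows and applied as matrix-times-vector. With the weights chosen below, the projection-plus-concatenation form reduces to the TransE-style sum of projected head and relation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def translate(z_head, relation, proj, w_concat):
    """Project the head into relation space, concatenate the relation vector,
    and map the concatenation to an estimate of the projected tail."""
    z_head_r = matvec(proj, z_head)             # sentence space -> relation space
    return matvec(w_concat, z_head_r + relation)  # list '+' is concatenation


def l2_distance(a, b):
    """2-norm distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy values: sentence space has 3 dimensions, relation space has 2.
z_head = [1.0, 2.0, 3.0]
relation = [0.5, -0.5]
proj = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]           # keeps the first two coordinates
w_concat = [[1.0, 0.0, 1.0, 0.0],
            [0.0, 1.0, 0.0, 1.0]]  # adds the relation to the projected head

z_tail_r_hat = translate(z_head, relation, proj, w_concat)
z_tail_r = [1.5, 1.5]               # projected representation of the true tail
distance = l2_distance(z_tail_r_hat, z_tail_r)
print(z_tail_r_hat, distance)
```

In training, `proj` and `w_concat` would be learned parameters and `distance` would feed the distance loss.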

Once trained, TransSent can be used to generate structured sentences, as shown in Figure 2. We first use a trained Variational AutoEncoder (VAE) to randomly generate a head discourse. We then select one discourse marker and perform Sentence Transfer with TransSent, generating the tail discourse. Finally, we concatenate the original head discourse and the generated tail discourse with the discourse marker to get a long structured sentence.

5 Data Collection

We use the public code from [18] for data collection. There are many discourse markers, and we focus on a set of five common ones: and, but, because, if, when. We collect a dataset of discourse pairs and the relations between them from natural text corpora using this set of explicit discourse markers.

  • Yelp-dm is extracted from Yelp, a sentiment-domain corpus whose examples come from a business review website.

Dependency Parsing

The position of discourse markers relative to their connected sentences can vary. For example, “Because [it was cold outside], [I wore a jacket]” is equivalent to “[I wore a jacket], because [it was cold outside]”. So, following [18], we use the Stanford CoreNLP dependency parser [20] to extract the appropriate pairs of sentences and filter them based on the order of the sentences in the original text.

We further exclude any cases where one of the two discourses is shorter than 5 words or longer than 15 words. However, as discourse markers are distributed differently in the corpus, the numbers of extracted pairs for different discourse markers are imbalanced. To construct a balanced corpus, we randomly select the same number of pairs for each discourse marker and add them to the corpus. All the datasets are randomly split into train, development, and test sets. Dataset statistics are shown in Table 2.
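The length filter and per-marker balancing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `filter_and_balance` is a hypothetical helper, whitespace tokenization is an assumption, and the toy pairs are made up for the example.

```python
import random
from collections import defaultdict


def filter_and_balance(pairs, min_len=5, max_len=15, seed=0):
    """Keep (head, marker, tail) triples whose discourses both have
    min_len..max_len whitespace-separated words, then sample the same
    number of triples for every marker."""
    by_marker = defaultdict(list)
    for head, marker, tail in pairs:
        if all(min_len <= len(d.split()) <= max_len for d in (head, tail)):
            by_marker[marker].append((head, marker, tail))
    if not by_marker:
        return []
    per_marker = min(len(group) for group in by_marker.values())
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    balanced = []
    for marker in sorted(by_marker):
        balanced.extend(rng.sample(by_marker[marker], per_marker))
    return balanced


toy_pairs = [
    ("the food here was really good", "and", "the staff were friendly to us"),
    ("we waited for a long time", "and", "nobody came to take our order"),
    ("i wanted to like this place", "but", "the service was slow and rude"),
    ("too short", "but", "the service was slow and rude"),  # head fails the filter
]
balanced_pairs = filter_and_balance(toy_pairs)
print(len(balanced_pairs))
```

Balancing down to the rarest marker discards data for frequent markers, which is the trade-off implied by requiring equal counts per marker.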

| Dataset | Train | Dev | Test |
|---|---|---|---|
| YELP-dm | 10K | 1K | 1K |

Table 2: Dataset statistics. Each discourse marker has the same amount in train/dev/test sets.

6 Experiments

There are many discourse markers and in our experiments we focus on the five most common ones: and, but, because, if, when.

6.1 Baselines

We compare TransSent with a VAE baseline. At test time, TransSent takes a head discourse and a discourse marker as input and generates a tail discourse. We concatenate the head discourse and the tail discourse with the discourse marker to form a sentence. For the VAE, we train it on Yelp and generate a large pool of sentences with the trained model. We then randomly select 1K generated sentences for comparison, each containing one of the five discourse markers. Finally, we use the trained discriminator to judge whether the relation in the generated sentences (represented by the discourse marker) holds.

6.2 Experiment detail

For TransSent, we use the officially released BERT as our encoder, with the model configuration unchanged. The input size is kept compatible with the original BERT, and the hyperparameter settings can be found in [5]. We fine-tune BERT on the discourse marker prediction task for 6 epochs, which gives us the discriminator and the encoder simultaneously. The trained discriminator achieves an accuracy of 80.4% on Yelp-dm. We concatenate the token representations from the top layers of the fine-tuned BERT and use an affine fully-connected layer to project the concatenated vector to the discourse representation. We adopt a single-layer bidirectional LSTM as our decoder, with dropout set to 0.4. We train our model for 50 epochs, with Adam [12] as our optimizer. The relation network projects the head discourse representation into relation space and concatenates it with the relation vector, then feeds the concatenated vector into a feed-forward network to obtain the representation of the tail discourse. The embedding dimensions are all set to 768. Our model is implemented in PyTorch, while the VAE model comes from released TensorFlow code.

6.3 Evaluation

6.4 Automatic Evaluation


We use the fine-tuned BERT as the discriminator to assess whether the relation between the discourses in generated sentences holds. As the VAE encodes structure information and content information into one hidden vector, it performs poorly on both datasets. Although TransSent achieves much better performance than the VAE, its accuracy is still far from satisfactory.

| Model | Accuracy (%) |
|---|---|
| VAE | 19.4 |
| TransSent | 73.0 |

Table 3: Accuracy (%) in automatic evaluation by the pretrained DMP classifier.
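The accuracy in Table 3 is the share of generated sentences whose discourse marker matches the marker predicted by the discriminator. A minimal sketch of that computation is given below; `marker_accuracy` is a hypothetical helper, and the predictor is passed in as a callable standing in for the fine-tuned BERT classifier.

```python
def marker_accuracy(examples, predict_marker):
    """Percentage of (head, marker, tail) triples whose marker equals
    the marker predicted from (head, tail)."""
    if not examples:
        return 0.0
    correct = sum(predict_marker(head, tail) == marker
                  for head, marker, tail in examples)
    return 100.0 * correct / len(examples)


# Toy check with a stand-in predictor that always answers "and".
toy_examples = [
    ("I enjoy the movie", "and", "i like the popcorn sold in this cinema"),
    ("She was good at soccer", "but", "she missed the goal"),
]
toy_accuracy = marker_accuracy(toy_examples, lambda head, tail: "and")
print(toy_accuracy)
```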

Coherence and Cohesion

To evaluate the coherence and cohesion of the different models, we report the scores of two widely used metrics, negative log-likelihood (NLL) and perplexity (PPL). (As our sentences are generated randomly from scratch, without any references, the BLEU metric cannot be used here.)

The perplexity loss indicates the continuity and coherence of content between the tail discourse and the head discourse. A standard LM is trained with the loss function

$$ \mathcal{L}_{LM} = - \sum_{t=1}^{T} \log p(x_t \mid x_{<t}), $$

where $T$ is the length of the concatenated sentence and $x_t$ is its $t$-th token. The pretrained language model we use is from public code.

The pretrained language model achieves NLL=3.90 and PPL=49.39 on 29K real-world sentences. We then evaluate our model and the baseline with it. As shown in Table 4, TransSent achieves better performance than the baseline on both metrics.
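The two metrics are related by PPL = exp(NLL) when NLL is the average per-token negative log-likelihood in nats, which is consistent with the reported pair (exp(3.90) is approximately 49.4). The sketch below assumes per-token natural-log probabilities are already available from some language model; `nll_and_ppl` is a hypothetical helper name.

```python
import math


def nll_and_ppl(token_log_probs):
    """Average per-token NLL and perplexity from natural-log token probabilities.

    token_log_probs: list of sentences, each a list of log p(x_t | x_<t).
    """
    num_tokens = sum(len(sentence) for sentence in token_log_probs)
    total_nll = -sum(lp for sentence in token_log_probs for lp in sentence)
    nll = total_nll / num_tokens
    return nll, math.exp(nll)


# Toy input: every token has probability 0.25, so perplexity is about 4.
toy_log_probs = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
toy_nll, toy_ppl = nll_and_ppl(toy_log_probs)
print(toy_nll, toy_ppl)
```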

| Model | NLL | PPL |
|---|---|---|
| VAE | 5.02 | 151.41 |
| TransSent | 4.33 | 75.69 |

Table 4: Perplexity performance by pretrained language model.

Human Evaluation

As TransSent only generates tail discourses, for a fair comparison we also evaluate only the tail discourse in sentences generated by the VAE. Annotators rate the outputs of our model and the VAE. We adopt two criteria, each rated from 1 to 5 (1 is very bad and 5 is very good): grammaticality, and relation correctness with respect to the target discourse marker. For each dataset, we randomly sample 200 generated examples. As shown in Table 5, TransSent performs much better than the VAE on both criteria. Our better performance on grammaticality is partly because the VAE generates a whole sentence from scratch and its LSTM decoder suffers from the long-range dependency problem. In contrast, we generate only the tail discourse, which is much shorter and whose quality is therefore easier to guarantee.

| Model | Gra | Rel |
|---|---|---|
| VAE | 2.4 | 2.5 |
| TransSent | 3.4 | 3.8 |

Table 5: Human evaluation results on two datasets. We show average human ratings for grammaticality (Gra) and relation correctness (Rel).

7 Future Work

There are many discourse markers, and in our experiments we focus on the five most common ones. In the future, we will explore more discourse markers and construct a larger corpus for further research.

Following [25], we will use a language model as the discriminator to further ensure the content quality of a long structured sentence.

GPT-2 has shown strong ability in text generation, which can be utilized as our decoder instead of LSTM.

We will focus on generating long structured sentences with several discourse markers in a recursive way.

We constructed another two open domain datasets for further research:

  • Wiki-dm (80K/5K/3K) is extracted from an open domain corpus, WikiText-103 [17], which consists of Wikipedia articles.

  • Book-dm (400/30K/30K) is extracted from another open domain corpus, BookCorpus [28], which consists of text from unpublished novels.

8 Conclusions

In this paper, we focus on generating long sentences with explicit structure. To achieve this, we define a new task, Sentence Transfer, which generates a tail discourse based on a head discourse and a discourse marker, and we construct a dataset for this task. We then propose a novel model, TransSent, which translates the representation of the head discourse to that of the tail discourse in relation hidden space and outputs a structured sentence through decoding and concatenation. Empirical results on the Yelp dataset verify our method's capacity for structured sentence generation.


References

  • [1] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2787–2795.
  • [2] A. Bosselut, A. Celikyilmaz, X. He, J. Gao, P. Huang, and Y. Choi (2018) Discourse-aware neural rewards for coherent text generation. arXiv preprint arXiv:1805.03766.
  • [3] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In CoNLL 2016, pp. 10–21.
  • [4] W. S. Cho, P. Zhang, Y. Zhang, X. Li, M. Galley, M. Wang, and J. Gao (2018) A bird’s-eye view on coherence, and a worm’s-eye view on cohesion. CoRR abs/1811.00511.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • [6] X. Gu, K. Cho, J. Ha, and S. Kim (2018) DialogWAE: multimodal response generation with conditional wasserstein auto-encoder. CoRR abs/1805.12352.
  • [7] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang (2018) Long text generation via adversarial training with leaked information. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [8] J. R. Hobbs (1990) Literature and cognition. (21).
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [10] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596.
  • [11] Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling output length in neural encoder-decoders. In EMNLP 2016, pp. 1328–1338.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [13] Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
  • [14] K. Lin, D. Li, X. He, Z. Zhang, and M. Sun (2017) Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165.
  • [15] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI 2015, pp. 2181–2187.
  • [16] D. Marcu (1998) A surface-based approach to identifying discourse markers and elementary textual units in unrestricted texts. Discourse Relations and Discourse Markers.
  • [17] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  • [18] A. Nie, E. D. Bennett, and N. D. Goodman (2017) DisSent: sentence representation learning from explicit discourse relations. CoRR abs/1710.04334.
  • [19] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville (2017) Adversarial generation of natural language. arXiv preprint arXiv:1705.10929.
  • [20] S. Schuster and C. D. Manning (2016) Enhanced English universal dependencies: an improved representation for natural language understanding tasks. In LREC, pp. 23–28.
  • [21] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
  • [22] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI 2014, pp. 1112–1119.
  • [23] B. Webber, A. Knott, M. Stone, and A. Joshi (1999) Discourse relations: a structural and presuppositional account using lexicalised TAG. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 41–48.
  • [24] H. Xiong, Z. He, H. Wu, and H. Wang (2018) Modeling coherence for discourse neural machine translation. arXiv preprint arXiv:1811.05683.
  • [25] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick (2018) Unsupervised text style transfer using language models as discriminators. In Advances in Neural Information Processing Systems, pp. 7287–7298.
  • [26] L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [27] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin (2017) Adversarial feature matching for text generation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4006–4015.
  • [28] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.