Triple-to-Text: Converting RDF Triples into High-Quality Natural Languages via Optimizing an Inverse KL Divergence

05/25/2019 ∙ by Yaoming Zhu, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Shanghai Jiao Tong University

Knowledge bases are one of the main forms of representing information in a structured way. A knowledge base typically consists of Resource Description Framework (RDF) triples, which describe entities and their relations. Generating natural language descriptions of a knowledge base is an important task in NLP, which has been formulated as a conditional language generation task and tackled using the sequence-to-sequence framework. Current works mostly train the language models by maximum likelihood estimation, which tends to generate low-quality sentences. In this paper, we argue that this problem of maximum likelihood estimation is intrinsic and generally cannot be remedied by changing network structures. Accordingly, we propose a novel Triple-to-Text (T2T) framework, which approximately optimizes the inverse Kullback-Leibler (KL) divergence between the distributions of the real and generated sentences. Because inverse KL imposes a large penalty on fake-looking samples, the proposed method can significantly reduce the probability of generating low-quality sentences. Our experiments on three real-world datasets demonstrate that T2T can generate higher-quality sentences and outperform baseline models in several evaluation metrics.




1. Introduction

Figure 1. A small knowledge base, (a) its associated RDF triples and (b) an example of the corresponding natural language description.

Knowledge bases (KB) are gaining attention for their wide range of industrial applications, including question answering (Q&A) systems (Fader et al., 2014; Zou et al., 2014), search engines (Ding et al., 2004), and recommender systems (Huang et al., 2002). The Resource Description Framework (RDF) is the general framework for representing entities and their relations in a structured knowledge base. Based on the W3C standard (Magazine, 1998), each RDF datum is a triple consisting of three elements, in the form of (subject, predicate, object). An instance can be found in Figure 1(a), which illustrates a knowledge base about Neil Armstrong and its corresponding RDF triples.

Based on the RDF triples, Q&A systems can answer questions such as “Which country does Neil Armstrong come from?” Although such tuples in RDF allow machines to process knowledge efficiently, they are generally hard for humans to understand. Some human interaction interfaces (e.g., DBpedia) are designed to deliver knowledge bases in the form of RDF triples in a human-readable way.

In this paper, given a knowledge base in the form of RDF triples, our goal is to generate natural language descriptions of the knowledge base that are grammatically correct, easy to understand, and capable of delivering the information to humans. Figure 1(b) shows the natural language description for the knowledge base about Neil Armstrong.

Traditionally, the Triple-to-Text task relies on rules and templates (Dale et al., 2003; Turner et al., 2010; Cimiano et al., 2013), which require a large amount of human effort. Moreover, even once developed, these systems often suffer from low scalability and an inability to handle complex logic.

Recently, with significant progress in deep learning, neural network (NN) based natural language generation models, especially the sequence-to-sequence framework (SEQ2SEQ) (Sutskever et al., 2014), have achieved remarkable success in machine translation (Bahdanau et al., 2014) and text summarization (Nallapati et al., 2016). The SEQ2SEQ framework has also been employed to translate knowledge bases into natural languages. Vougiouklis et al. (Vougiouklis et al., 2018) proposed Neural Wikipedian to generate summaries of the RDF triples.

However, most existing studies focus on the design of the model structure (Vougiouklis et al., 2018), while paying less attention to the training objective. These models are usually trained via maximum likelihood estimation, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the ground-truth conditional distribution ($P$) and the estimated distribution ($G$), i.e., $KL(P \| G)$. Models trained with KL divergence tend to have high diversity, but at the same time, they are likely to generate low-quality samples (Huszár, 2015).

In such tasks, we usually care more about the quality of the translation and care less about diversity. Hence, we propose the triple-to-text model. By introducing a new component called judger, we optimize the model in two directions: minimizing the approximated inverse KL divergence and maximizing the self-entropy.

Our main contributions can be summarized as follows:

  • We propose a theoretically sound and empirically effective framework (T2T) for optimizing the inverse KL divergence in the conditional language generation task of translating a knowledge base into its natural language description.

  • We conduct a series of experiments on different datasets to validate our proposed method. The results show that our method outperforms baselines in common metrics.

We organize the remaining parts of this paper as follows. In Section 2, we formulate the problem and introduce the preliminaries. In Section 3, we provide our analysis of why it is preferable to optimize an inverse KL divergence. Then Section 4 details our proposed model. We then present the experiment results in Section 5. Finally, we discuss the related work in Section 6 and conclude the paper in Section 7.

| Symbol | Description |
|---|---|
| $K$ | a knowledge base that consists of RDF triples |
| $t$ | a resource description framework (RDF) triple |
| $s, p, o$ | subject, predicate and object within an RDF triple |
| $S$ | a sentence |
| $w$ | a word in a sentence |
| $X$ | conditional context for SEQ2SEQ framework |
| $Y$ | target context for generative models |
| $x_i$ | $i$-th token from conditional context |
| $y_j$ | $j$-th token from target context |
| $Y_{1:j}$ | prefix of target context: $y_1, \dots, y_j$ |
| $P$ | the target (ground-truth) distribution |
| $G_\theta$ | learned distribution of generator |
| $M_\phi$ | learned distribution of judger |
| $\theta$ | parameters of generator |
| $\phi$ | parameters of judger |

Table 1. Glossary

2. Formulation and Preliminaries

In this section, we formulate the task and introduce the preliminaries of language generation models.

2.1. Task Definition

A knowledge base is formulated as a set of RDF triples, i.e., $K = \{t_1, t_2, \dots, t_N\}$, where each RDF triple $t_i$ is represented as $(s_i, p_i, o_i)$. The three elements in a triple denote subject, predicate and object, respectively. Given the knowledge base $K$, our goal is to generate a natural language sentence $S$ which consists of a sequence of words $(w_1, w_2, \dots, w_n)$, where $w_j$ denotes the $j$-th word in the sentence $S$. The generated sequence is required to be grammatically sound and to correctly represent all the information contained in the knowledge base $K$.

Figure 2. (a) shows the target distribution $P$, and the histogram in the background represents the frequency of different samples; (b), (c) illustrate the empirical results of $G$ obtained by minimizing $KL(P \| G)$ and $KL(G \| P)$ respectively.

2.2. Sequence to Sequence Framework

Our work is based on the sequence to sequence framework (SEQ2SEQ). The standard sequence to sequence framework consists of an encoder and a decoder. Both of them are parameterized by recurrent neural networks (RNN).

The encoder takes in a sequence of discrete tokens $X = (x_1, \dots, x_m)$. At the $i$-th step, the encoder takes in a token $x_i$ and updates the hidden state recurrently:

$$h^{e}_i = f(h^{e}_{i-1}, e(x_i)), \tag{1}$$

where $e(x_i)$ denotes the word embedding (Mikolov et al., 2013) of the $i$-th token. In general, $e(x_i) = W_e x_i$, where $W_e$ is a pre-trained or learned word embedding matrix with each column representing the embedding vector of a token; given that $x_i$ is a one-hot vector, $W_e x_i$ selects the corresponding column of $W_e$ for token $x_i$. $f$ is a nonlinear function. Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU) (Cho et al., 2014) are common choices for this function. The final output of the encoder is an array of hidden states $H^{e} = (h^{e}_1, \dots, h^{e}_m)$. Each hidden state $h^{e}_i$ can be regarded as a vector representation of all the previous tokens.

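The decoder takes in the hidden states of the encoder as input and outputs a sequence of hidden states $H^{d} = (h^{d}_1, \dots, h^{d}_n)$. The hidden state at its $j$-th step is computed by:

$$h^{d}_j = f(h^{d}_{j-1}, e(y_{j-1}), c_j), \tag{2}$$

where $e(y_{j-1})$ is the word embedding of the last output token of the decoder, and $e(y_0)$ is set to be a zero vector. $c_j$ is a function of the encoder hidden states that provides a summary of the input sequence at step $j$; typical choices include: i) the final encoder state, $c_j = h^{e}_m$; ii) the attention mechanism (Bahdanau et al., 2014), with $c_j = \sum_{i=1}^{m} \alpha_{ji} h^{e}_i$, where the $\alpha_{ji}$ are attention weights.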
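The attention-based context vector described above can be sketched in a few lines. This is a minimal illustration, assuming the alignment scores are already given (in Bahdanau attention they come from a small learned network, which is omitted here); the function name `attention_context` and the toy numbers are hypothetical.

```python
import math

def attention_context(scores, encoder_states):
    """Softmax the alignment scores and return (weights, weighted sum of encoder states)."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # attention weights, sum to 1
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Hypothetical example: two encoder states of dimension 2 with equal scores,
# so the context vector is simply their average.
weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [3.0, 2.0]])
```

With equal scores the weights are uniform; as one score grows, the context vector moves toward the corresponding encoder state.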

In general, language generation is modeled as an autoregressive sequential generation process, with each token sampled from a probability distribution conditioned on its previous tokens. The probability distribution of the $j$-th token is parameterized by a softmax over an affine transformation of the decoder's hidden state at step $j$, i.e.,

$$G_\theta(y_j \mid Y_{1:j-1}, X) = \mathrm{softmax}(W h^{d}_j + b), \tag{3}$$

where $W$ and $b$ are the weight matrix and the bias vector of the output layer, respectively, and $Y_{1:j-1}$ denotes the first $j-1$ tokens of the target context. The probability distribution of the entire output sequence $Y$ conditioned on the input sequence $X$ is thus modeled as

$$G_\theta(Y \mid X) = \prod_{j=1}^{n} G_\theta(y_j \mid Y_{1:j-1}, X). \tag{4}$$
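As a small illustration of the autoregressive factorization, the sketch below turns per-step output-layer logits into token probabilities with a softmax and sums their logarithms over the sequence. The logits are hypothetical stand-ins for the affine transformation of the decoder state; no trained model is involved.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, token_ids):
    """Log-probability of a sequence: sum of log p(token_j | prefix) over steps."""
    log_p = 0.0
    for logits, tok in zip(step_logits, token_ids):
        log_p += math.log(softmax(logits)[tok])
    return log_p

# Hypothetical logits for a 3-word vocabulary over two decoding steps.
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
log_p = sequence_log_prob(step_logits, [0, 1])
```

Working in log space avoids the underflow that multiplying many small probabilities would cause for long sentences.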


2.3. Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) is the most widely used method for training neural language models. The objective is equivalent to minimizing the cross entropy between the real data distribution $P$ and the probability distribution $G_\theta$ estimated by the generative model:

$$\max_\theta \; \mathbb{E}_{Y \sim P}[\log G_\theta(Y)] = \max_\theta \; -H(P, G_\theta), \tag{5}$$

where $Y$ denotes a complete sequence sampled from the real data distribution and $H(P, G_\theta)$ denotes the cross entropy.

Maximizing Eq. 5 is equivalent to minimizing the Kullback-Leibler (KL) divergence between the target distribution $P$ and the learned distribution $G_\theta$, which is defined as

$$KL(P \| G_\theta) = \mathbb{E}_{Y \sim P}[\log P(Y) - \log G_\theta(Y)] = H(P, G_\theta) - H(P), \tag{6}$$

where $H(P)$ is a constant irrelevant to the parameter $\theta$.

For clarity, we ignore the conditional context here. We will later regard maximum likelihood estimation as minimizing the Kullback-Leibler (KL) divergence.

3. Objective Analysis

In this section, we will give a detailed discussion on the fundamental problems of minimizing KL divergence in training and explain why we choose the inverse KL divergence as our optimization objective. We will also discuss several related solutions.

3.1. Practical Tendency of KL and Inverse KL

The KL divergence between two distributions $P$ and $G$ is formulated as

$$KL(P \| G) = \sum_{Y} P(Y) \log \frac{P(Y)}{G(Y)}. \tag{7}$$

Since the KL divergence is non-negative, it is minimized when $G = P$. Unfortunately, in real-world scenarios, the target is usually a very complex distribution. Given limited capacity, the learned probabilistic model may only be a rough approximation.

As pointed out in (Arjovsky and Bottou, 2017), $KL(P \| G)$ goes to infinity if $P(Y) > 0$ and $G(Y) \to 0$, which means that the cost function is extremely high when the distribution of the generator fails to cover some patterns of the real data. On the other hand, the cost function is relatively low when the generator produces low-quality samples, as the summand $P(Y) \log \frac{P(Y)}{G(Y)}$ goes to zero if $P(Y) \to 0$ and $G(Y) > 0$.

That is, although the optimal $G$ is guaranteed to be $P$ under the MLE objective, during training the estimated distribution is more likely to have a wide coverage and possibly contain samples outside the real distribution, as illustrated in Figure 2(b). In practice, models trained via MLE have a high probability of generating rarely-seen sequences, most of which are inconsistent with human expressions due to exposure bias (Bengio et al., 2015).

With a similar argument about the behavioral tendency of $KL(P \| G)$, it can be shown that $KL(G \| P)$ imposes less penalty on “mode collapse”, in which $G$ tends to generate a family of similar samples. By contrast, $KL(G \| P)$ assigns a large penalty to fake-looking samples. The typical non-optimal estimation, as illustrated in Figure 2(c), covers several major modes of the real distribution but misses several minor modes.

We argue that in conditional language generation tasks, especially triple-to-text tasks, minimizing the inverse KL divergence is preferable, because in these translation tasks people usually care more about the quality of the generated text than about its diversity. In other words, it is tolerable to have low diversity, but it is usually unacceptable to be grammatically incorrect or to miss important information.
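The contrast between the two divergences can be checked numerically on a toy example. The sketch below uses a hypothetical four-outcome target with two major modes and compares a wide-coverage approximation with one that collapses onto a single mode; all numbers are illustrative assumptions, not data from the experiments.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 * log(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target: two major modes (outcomes 0 and 3), two rare outcomes.
P = [0.48, 0.02, 0.02, 0.48]
covering = [0.25, 0.25, 0.25, 0.25]    # wide coverage, much mass on rare outcomes
collapsed = [0.94, 0.02, 0.02, 0.02]   # concentrates on one major mode

forward = {"covering": kl(P, covering), "collapsed": kl(P, collapsed)}
inverse = {"covering": kl(covering, P), "collapsed": kl(collapsed, P)}
```

On this example the forward divergence prefers the wide-coverage approximation, while the inverse divergence prefers the collapsed one, matching the tendencies described above.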

Figure 3. General framework. Sentences and RDF triples are pre-processed into discrete tokens. Then after embedding, they are fed into an encoder-decoder neural network with attention mechanism.

3.2. The Decomposed Objective of Inverse KL

Here, we explain the property of inverse KL divergence via objective decomposition. We will show that minimizing the inverse KL divergence can be regarded as a direct optimization of the performance of the Turing test.

In a Turing test, we assume that the human judges know the accurate natural language distribution $P$ (Mahoney, 1999). Given a language sample $Y$, its quality is scored by $\log P(Y)$. Thus the averaged score in a Turing test can be modeled as the negative cross-entropy $-H(G, P) = \mathbb{E}_{Y \sim G}[\log P(Y)]$.

The inverse KL divergence can be rewritten as

$$KL(G \| P) = \mathbb{E}_{Y \sim G}[\log G(Y)] - \mathbb{E}_{Y \sim G}[\log P(Y)] = H(G, P) - H(G). \tag{8}$$

Eq. 8 illustrates that the objective of minimizing an inverse KL divergence can be decomposed into two parts:

  • Minimizing $H(G, P)$, which corresponds to the objective of the Turing test.

  • Maximizing $H(G)$, the self-entropy of the generator. It helps expand the support of $G$ and avoids disjoint supports between $P$ and $G$, which may lead to the vanishing gradient problem (Arjovsky and Bottou, 2017).
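A short numeric sketch shows why both terms of the decomposition matter. Under an assumed three-outcome target, a point mass on the most likely outcome achieves a lower cross-entropy than the target itself, yet its inverse KL divergence stays positive because its self-entropy is zero. The distributions are hypothetical.

```python
import math

def cross_entropy(g, p):
    """H(G, P) = -sum_Y G(Y) log P(Y), with the convention 0 * log(.) = 0."""
    return -sum(gi * math.log(pi) for gi, pi in zip(g, p) if gi > 0)

def entropy(g):
    """Self-entropy H(G), i.e. the cross-entropy of G with itself."""
    return cross_entropy(g, g)

def inverse_kl(g, p):
    """KL(G || P) computed through the decomposition H(G, P) - H(G)."""
    return cross_entropy(g, p) - entropy(g)

P = [0.6, 0.3, 0.1]            # hypothetical target distribution
point_mass = [1.0, 0.0, 0.0]   # generator that always emits the top outcome
```

Minimizing the cross-entropy term alone would therefore favor the degenerate generator; the entropy term is what pulls the optimum back to the target.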

3.3. Estimation of the Real Distribution

In most real applications, $P$ is an empirical distribution and not directly accessible. For this reason we cannot directly optimize the inverse KL divergence. In our proposed method, we introduce a new module $M_\phi$, called judger, to approximate the target distribution $P$. The judger is trained via maximum likelihood estimation and the objective function for $M_\phi$ is

$$\max_\phi \; \mathbb{E}_{Y \sim P}[\log M_\phi(Y)]. \tag{9}$$

Note that, although the judger distribution $M_\phi$ might suffer from the earlier-mentioned problems of MLE, i.e., it does not precisely model all the modes, it generally covers the distribution widely, with the major modes having large probability masses. Then, based on this inaccurate estimated distribution $M_\phi$, we minimize the inverse KL divergence $KL(G_\theta \| M_\phi)$. As we discussed before, the inverse KL divergence cares more about the major modes and tends to ignore the minor modes, including small fake modes stemming from imperfect MLE estimation, so the shortcoming of the MLE-based estimated distribution poses no serious problems here.

It is also important to notice that, if the two steps in our algorithm both reach the optimum, we have $G_\theta = M_\phi = P$, which is the same as previous methods. The key benefit of our algorithm is that when it does not reach the optimum, the generated samples still tend to be feasible.

Figure 4. The overall training process of our proposed algorithm.

3.4. JS Divergence: GANs and CoT

Some previous works also recognized the limitations of KL divergence and alleviated this problem with various optimization methods. Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) introduced a module named discriminator to distinguish whether a sample is from the real distribution $P$ or is forged by the generator $G$. In theory, given a perfect discriminator, the training objective of GAN is equivalent to minimizing the Jensen-Shannon divergence, which is defined as the symmetrized version of the two aforementioned divergences:

$$JSD(P \| G) = \frac{1}{2} KL(P \| M^{*}) + \frac{1}{2} KL(G \| M^{*}), \tag{10}$$

where $M^{*} = \frac{1}{2}(P + G)$ is the average of the two distributions.
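For reference, the Jensen-Shannon divergence can be computed directly from its definition on discrete distributions. The following is a minimal sketch with hypothetical inputs; it also checks two standard properties, symmetry and the upper bound of log 2 reached by distributions with disjoint supports.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 * log(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

a = [0.7, 0.2, 0.1]        # hypothetical distributions
b = [0.1, 0.3, 0.6]
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```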

GAN was initially designed for generating continuous data and is not directly applicable to discrete tokens, such as language sentences. SeqGAN (Yu et al., 2017) was introduced to generate discrete tokens via adversarial training. However, the generative models trained by SeqGAN tend to have high variance due to the REINFORCE algorithm (Williams, 1992).

Another attempt to leverage Jensen-Shannon divergence on sequence generation tasks is CoT (Lu et al., 2018). CoT introduces a new module, the mediator $M_\psi$, which estimates the mixture data distribution $\frac{1}{2}(P + G_\theta)$ via maximum likelihood estimation. Then $M_\psi$ is used to guide the training of the generator $G_\theta$ with JSD. However, in practice, we find that the optimization of $M_\psi$ can be problematic. According to our experiments, as the real distribution $P$ becomes complicated, $M_\psi$ tends to fit $G_\theta$ rather than accurately modeling the mixture $\frac{1}{2}(P + G_\theta)$. We explain this phenomenon as follows.

The real distribution $P$ is relatively complex, and the estimated distribution $G_\theta$ tends to be simple and smooth. Because of the wide coverage tendency of MLE, $M_\psi$ would cover $\frac{1}{2}(P + G_\theta)$ in general; while due to the limited capacity of $M_\psi$, it tends to fit the simple one, i.e., $G_\theta$.

The problem that $M_\psi$ captures only limited differences between $P$ and $G_\theta$ makes the training hard to converge.

Note that one key difference is that the target mediator distribution $\frac{1}{2}(P + G_\theta)$ in CoT is dynamic and depends on the learned distribution $G_\theta$, while the judger in our method estimates the static distribution $P$.

4. Methodology

In this section, we will first explain how to convert the task into a sequence-to-sequence generation problem, and then illustrate the details of how to optimize it with inverse KL divergence.

Input: a corpus of knowledge bases and their corresponding natural sentences {($K$, $S$)}, hyper-parameters $N_M$ and $N_G$
Output: a generator $G_\theta$, a judger $M_\phi$
Pre-process the knowledge base corpus {($K$, $S$)} into discrete token sequence pairs {($X$, $Y$)}
Initialize $G_\theta$ and $M_\phi$ with random parameters $\theta$ and $\phi$
Pre-train $G_\theta$ using Maximum Likelihood Estimation (optional)
while $G_\theta$ has not converged do
      for $N_M$ steps do
            Sample ($X$, $Y$) from sequence pairs {($X$, $Y$)}
            Update judger $M_\phi$ via maximizing $\log M_\phi(Y \mid X)$ (Eq. 13)
      end for
      for $N_G$ steps do
            Sample conditional context $X$ from pairs {($X$, $Y$)}
            Generate the estimated target sentence $\hat{Y}$ given $X$ according to $G_\theta$
            Update generator $G_\theta$ via minimizing $KL(G_\theta \| M_\phi)$ by Eq. 15
      end for
end while
return $G_\theta$, $M_\phi$
Algorithm 1 Triple-to-Text algorithm

4.1. General Framework

The SEQ2SEQ framework cannot process graph-based data like RDF triples directly. Thus we first use a pre-processing technique similar to the one mentioned in (Trisedya et al., 2018). It replaces the subjects and objects in the RDF triples, and their corresponding entities in the sentences, with their types.

For example, consider a knowledge base [(“Bill Gates”, “founder”, “Microsoft Corporation”), (“Microsoft Corporation”, “startDate”, “April 4, 1975”)] and its corresponding human-annotated natural sentence “Bill Gates founded the Microsoft Corporation in April 4, 1975”. The pre-processing module maps “Bill Gates”, “Microsoft Corporation” and “April 4, 1975” to “PERSON”, “CORPORATION” and “DATE” respectively, in both the RDF triples and the corresponding sentence. Pre-processing reduces the size of the vocabulary and improves the generalization capacity of the model, so that it can handle most dates rather than just “April 4, 1975”. Considering that the nodes in the knowledge bases are unordered, we also permute the triples to augment the training data, and we believe this approach can improve the generalization capability of the final generation model.

The pre-processed RDF triples are then transformed into a sequence of discrete tokens. We use commas to separate elements within an RDF triple, and semicolons to separate different triples. For instance, the knowledge base mentioned above is turned into “PERSON, founder, CORPORATION; CORPORATION, startDate, DATE”. In addition, zero padding is used to bring all sequences to the same length.
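The linearization and permutation steps described above can be sketched as follows. This is an illustrative approximation, not the pre-processing code used in the experiments: the entity-to-type dictionary is assumed to be given, and the helper names `linearize` and `augment` are hypothetical.

```python
from itertools import permutations

def linearize(triples, entity_types):
    """Join triples into one string, replacing known entities by their types."""
    parts = []
    for subj, pred, obj in triples:
        s = entity_types.get(subj, subj)   # fall back to the raw entity if untyped
        o = entity_types.get(obj, obj)
        parts.append(f"{s}, {pred}, {o}")  # commas separate elements of a triple
    return "; ".join(parts)                # semicolons separate triples

def augment(triples, entity_types):
    """One linearized string per ordering of the triples (permutation augmentation)."""
    return [linearize(list(order), entity_types) for order in permutations(triples)]

kb = [("Bill Gates", "founder", "Microsoft Corporation"),
      ("Microsoft Corporation", "startDate", "April 4, 1975")]
types = {"Bill Gates": "PERSON",
         "Microsoft Corporation": "CORPORATION",
         "April 4, 1975": "DATE"}

encoded = linearize(kb, types)
variants = augment(kb, types)
```

Enumerating all orderings grows factorially with the number of triples, so for larger knowledge bases a random subset of permutations would be a more practical choice.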

Finally, a SEQ2SEQ method introduced in Section 2.2 is used to encode the processed triple and then translate it into a human-understandable sentence. To enhance the performance of the encoder-decoder model, attention mechanisms (Bahdanau et al., 2014) are used in our proposed framework. Figure 3 illustrates the general structure of this method.

| Dataset | #Train | #Test | #RDF triples | Sentence length | Vocabulary in sentences | Vocabulary in triples |
|---|---|---|---|---|---|---|
| WebNLG | 20288 | 2240 | 1-7 | 82 | 4678 | 2718 |
| SemEval | 8000 | 2717 | 1 | 97 | 24986 | 7333 |
| Baidu SKE | 19520 | 2000 | 1-5 | 84 | 25027 | 22713 |

Table 2. Dataset statistics, including the number of RDF triples-sentence pairs used in training and test, the number of RDF triples per datum, the (maximum) number of tokens per sentence and the vocabulary list size.

4.2. Algorithm Details

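The general idea of the proposed method is as follows: a module called judger, $M_\phi$, is introduced to approximate the target distribution $P$ and is trained via maximum likelihood estimation. Based on the approximated distribution $M_\phi$, we then minimize the inverse KL divergence $KL(G_\theta \| M_\phi)$. The overall process is illustrated in Figure 4.

Because we target the sequence-to-sequence translation task, the distribution of the generator $G_\theta$ is modeled as a chain product of the probability distributions of the next token conditioned on the input sequence $X$ and the prefix $Y_{1:j-1}$,

$$G_\theta(Y \mid X) = \prod_{j=1}^{n} G_\theta(y_j \mid Y_{1:j-1}, X). \tag{11}$$

Within our framework, the judger $M_\phi$ is trained to model the target distribution $P$ via maximum likelihood estimation. The judger is also modeled as a chain product of conditional distributions,

$$M_\phi(Y \mid X) = \prod_{j=1}^{n} M_\phi(y_j \mid Y_{1:j-1}, X). \tag{12}$$

The objective function for $M_\phi$ is

$$\max_\phi \; \mathbb{E}_{(X, Y) \sim P}[\log M_\phi(Y \mid X)]. \tag{13}$$

Given $M_\phi$, which estimates the real distribution $P$, we then update $G_\theta$ via minimizing the inverse KL divergence $KL(G_\theta \| M_\phi)$:

$$\min_\theta \; KL(G_\theta \| M_\phi) = \min_\theta \; \mathbb{E}_{(X, Y)}[\log G_\theta(Y \mid X) - \log M_\phi(Y \mid X)], \tag{14}$$

where $(X, Y)$ denotes the data pair in which $X$ is sampled from the conditional contexts and $Y$ is the output of the generator given $X$ as input. The objective can be directly optimized by substituting Eq. (11) and (12) into Eq. (14), which can be reformulated as

$$\min_\theta \; \mathbb{E}_{X}\left[\sum_{j=1}^{n} \mathbb{E}_{Y_{1:j-1} \sim G_\theta}\left[\sum_{y_j} G_\theta(y_j \mid Y_{1:j-1}, X) \log \frac{G_\theta(y_j \mid Y_{1:j-1}, X)}{M_\phi(y_j \mid Y_{1:j-1}, X)}\right]\right]. \tag{15}$$

Algorithm 1 illustrates the overall algorithm of our proposed method. Note that instead of training the judger to convergence at the beginning, the judger and the generator are trained alternately. From the perspective of curriculum learning (Bengio et al., 2009), gradually increasing the complexity of the generator's training objective improves the generalization ability of the generator and helps find a better local optimum. Our method shares the same computational complexity as MLE training.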
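The step from the sequence-level divergence to a sum of token-level terms relies on the chain rule of KL divergence for autoregressive models. The sketch below checks this identity by brute force on a toy unconditional model with a two-token vocabulary and length-two sequences; the next-token tables `g_next` and `m_next` are hypothetical stand-ins for the generator and the judger, not trained networks.

```python
import math
from itertools import product

VOCAB = (0, 1)
LENGTH = 2

def g_next(prefix):
    """Toy generator: next-token distribution given the prefix (hypothetical values)."""
    return [0.7, 0.3] if not prefix or prefix[-1] == 0 else [0.4, 0.6]

def m_next(prefix):
    """Toy judger: next-token distribution given the prefix (hypothetical values)."""
    return [0.5, 0.5] if not prefix else [0.8, 0.2]

def seq_prob(next_dist, seq):
    """Chain product of next-token probabilities along the sequence."""
    p = 1.0
    for j, tok in enumerate(seq):
        p *= next_dist(seq[:j])[tok]
    return p

def sequence_level_kl():
    """KL(G || M) obtained by enumerating every complete sequence."""
    total = 0.0
    for seq in product(VOCAB, repeat=LENGTH):
        pg, pm = seq_prob(g_next, seq), seq_prob(m_next, seq)
        total += pg * math.log(pg / pm)
    return total

def stepwise_kl():
    """Sum over steps of the token-level KL, with prefixes weighted by the generator."""
    total = 0.0
    for j in range(LENGTH):
        for prefix in product(VOCAB, repeat=j):
            g, m = g_next(prefix), m_next(prefix)
            step = sum(gi * math.log(gi / mi) for gi, mi in zip(g, m))
            total += seq_prob(g_next, prefix) * step
    return total
```

In training, the expectation over prefixes would be estimated from sequences sampled from the generator instead of enumerated, and the inner sum over the vocabulary is computed exactly from the two softmax outputs at each step.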

5. Experiments

5.1. Datasets

Our methods are evaluated on the following datasets.

WebNLG (Gardent et al., 2017) is extracted from 15 different DBPedia (Auer et al., 2007) categories, which consists of 25,298 (data, text) pairs and 9,674 distinct data units. The data units are sets of RDF triples, and the texts are sequences of one or more sentences verbalizing these data units. It also provides a set of 373 distinct RDF properties.

SemEval-2010 Task 8 (Hendrickx et al., 2009) was originally designed for multi-way classification of semantic relations between pairs of nominals. It contains 10,717 samples, divided as 8,000 for training and 2,717 for testing. The dataset contains nine relation types. Since each example is a sentence annotated for a pair of entities and the corresponding relation class for this entity pair in this dataset, we can extract an RDF triple from each sentence.

Baidu SKE is a large-scale human-annotated dataset with more than 410,000 triples in over 200,000 real-world Chinese sentences, bounded by a pre-specified schema with 50 types of predicates. Each sample in SKE contains one sentence and a set of associated tuples. SKE tuples are expressed in the form (subject, predicate, object, subject type, object type). In our experiments, we only use knowledge bases related to the Film and TV works domain, and each Chinese character is treated as a distinct token.

We select some data in the three data sets and divide them into a training set and a test set. Table 2 shows some statistical details about the data.

5.2. Implementation Details

The generator consists of a word embedding matrix, an encoder, a decoder, and the output layer. For the word embedding, we maintain two different sets of embeddings for the encoder and decoder respectively; both are of 64 dimensions. Both the encoder and the decoder are built as an LSTM (Hochreiter and Schmidhuber, 1997) with hidden units of 128 dimensions. The dimension of the hidden units of the output layer is also 128. We apply Bahdanau attention (Bahdanau et al., 2014) to obtain the context vector $c_j$, which is computed as the weighted sum of encoder states. For the judger, we use the same configuration as the generator.

For the initialization, all initial parameters follow a standard Gaussian distribution $\mathcal{N}(0, 1)$. All models are optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of 64. The two step-count hyper-parameters in Algorithm 1 (the numbers of judger and generator updates per iteration) are both set to 1, which makes the objective of the generator gradually harder, as indicated in Section 4.2. We also pre-train the generator via MLE for 2 epochs.

WebNLG SemEval SKE WebNLG SemEval SKE WebNLG SemEval SKE WebNLG SemEval SKE
MLE 40.8 4.24 18.6 30.2 2.73 15.6 0.497 1.07 1.01 0.636 0.222 0.349
CoT 9.84 1.90 15.7 6.40 1.41 12.9 1.085 1.16 1.08 0.349 0.102 0.305
SeqGAN 42.0 4.11 19.0 24.4 2.63 14.1 0.534 1.11 1.10 0.597 0.231 0.344
PG 41.7 4.21 17.9 30.9 2.06 14.1 0.607 1.13 1.12 0.628 0.197 0.310
NW 35.8 2.80 14.6 24.6 1.87 11.9 1.664 1.17 1.92 0.302 0.143 0.301
T2T 42.4 4.35 20.3 32.2 2.83 17.1 0.473 0.957 0.947 0.641 0.247 0.367
Table 3. Comparison of model performance.

5.3. Baseline Algorithms

We validate our proposed method for RDF triple-to-text (we will later refer to as T2T) by comparing it with the following baselines. To give a fair comparison, we apply the same RDF pre-processing technique discussed in Section 4.1 to all the baselines.

  • MLE. A common method for training the sequence-to-sequence framework. For a fair comparison, the parameter setting of the generator is the same as in our model.

  • CoT. We adapt CoT (Lu et al., 2018) into conditional sequence generation task. As its authors suggested, the size of the hidden unit of the mediator is twice the size of the generator.

  • Pointer-Generator Network (PG). See et al. (See et al., 2017) proposed pointer-generator network. Their work can be regarded as a combination of SEQ2SEQ and pointer network (Vinyals et al., 2015).

  • SeqGAN. Yu et al. (Yu et al., 2017) used an adversarial network to provide the reward and train a sequence generator with policy gradient. According to (Li et al., 2017) and our initial experiments, in the SEQ2SEQ framework, when the discriminator is parameterized as a convolutional neural network, it is difficult for the discriminator in SeqGAN to improve the generator. We thus follow (Li et al., 2017) and adapt the discriminator into a hierarchical recurrent neural network (Li et al., 2015).

  • Neural Wikipedian (NW). Vougiouklis et al. (Vougiouklis et al., 2018) used a standard feed-forward neural network to encode RDF triples. The vectors derived from the encoders are then concatenated and used as the input of the decoder, which generates summaries for RDF triples.

methods MLE CoT PG SeqGAN NW T2T
accuracy 0.240 0.258 0.231 0.244 0.155 0.276
Table 4. Predicate accuracy on the SemEval dataset.

5.4. Metrics

For natural language generation tasks, the most widely accepted metric is human evaluation (Belz and Reiter, 2006). While human evaluation is reliable, it is hard to apply to the quality evaluation of a large corpus, since it requires too much human labor. Therefore, we introduce automatic metrics for evaluating all the sentences our system has generated. However, to our knowledge, no single automatic evaluation metric is sufficient to measure the performance of a natural language generation system (Novikova et al., 2017). Thus, in order to give objective results, we use a variety of automatic metrics to compare our model and the baselines.

We have adopted three widely used word-level metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005). BLEU and METEOR calculate the number of $n$-grams of the generated sentence that occur within the set of references. (We use METEOR 1.5 ( alavie/METEOR/README.html), with parameters suggested by Denkowski et al. (Denkowski and Lavie, 2014) for universal evaluation.)
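The n-gram matching at the core of these metrics can be illustrated with the clipped n-gram precision used inside BLEU. The sketch below covers only that component, under the assumption of whitespace-tokenized input; it is not the full BLEU score (which also combines several n-gram orders and applies a brevity penalty), and the sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it appears in a reference.
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())

cand = "the capital of italy is rome".split()
refs = ["rome is the capital of italy".split()]
p1 = clipped_precision(cand, refs, 1)
p2 = clipped_precision(cand, refs, 2)
```

Here every unigram of the candidate appears in the reference, while only three of its five bigrams do, which shows how higher-order n-grams penalize differences in word order.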

Besides the traditional word-based metrics, we also evaluate the generator via likelihood and perplexity. Inspired by likelihood-based discrimination (McLachlan, 2004), we design a new metric, which we refer to as “predicate accuracy”. In detail, given a single RDF triple $t = (s, p, o)$, a natural sentence $S$ describing the triple and a generative model $G_\theta$, we can calculate $G_\theta(S \mid t)$, i.e., the predictive likelihood of the target sentence. If we keep the subject $s$ and object $o$ unchanged and substitute the predicate $p$ with another predicate $p'$, then our generative model yields a likelihood $G_\theta(S \mid t')$ for each predicate class $p'$, where $t'$ denotes the triple $(s, p', o)$. Then, we can use the likelihood of the generative model to predict the predicate given subject $s$, object $o$ and sentence $S$; the predicted predicate is

$$\hat{p} = \arg\max_{p' \in \mathcal{R}} G_\theta(S \mid (s, p', o)), \tag{16}$$

where $\mathcal{R}$ is the set of all predicates. The “predicate accuracy” is defined as the fraction of cases in which $\hat{p}$ in Eq. 16 is the correct predicate for the sentence $S$.
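The predicate accuracy procedure amounts to an argmax over candidate predicates followed by an accuracy count. The following sketch assumes the generator's likelihood is available as a scoring function; `toy_score` is a hypothetical keyword-based stand-in for that likelihood, and the two examples are invented for illustration.

```python
def predict_predicate(score, subject, obj, sentence, predicates):
    """Return the candidate predicate that maximizes score((s, p', o), sentence)."""
    return max(predicates, key=lambda p: score((subject, p, obj), sentence))

def predicate_accuracy(score, examples, predicates):
    """Share of (triple, sentence) examples whose true predicate is recovered."""
    hits = 0
    for (subject, predicate, obj), sentence in examples:
        if predict_predicate(score, subject, obj, sentence, predicates) == predicate:
            hits += 1
    return hits / len(examples)

def toy_score(triple, sentence):
    """Hypothetical stand-in for the sentence likelihood given a triple:
    counts how many words of the predicate name occur in the sentence."""
    _, predicate, _ = triple
    return sum(word in sentence.lower() for word in predicate.lower().split("-"))

PREDICATES = ["cause-effect", "component-whole"]
EXAMPLES = [
    (("virus", "cause-effect", "flu"), "The virus is the cause of the flu."),
    (("wheel", "component-whole", "car"), "The wheel is a component of the car."),
]
acc = predicate_accuracy(toy_score, EXAMPLES, PREDICATES)
```

With a real model, `score` would return the (log-)likelihood of the sentence under the generator for each substituted triple, which requires one forward pass per candidate predicate.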

We also use forward perplexity ($PPL_F$) to evaluate the quality of the generated text. Unlike the traditional perplexity, which is evaluated on the generative model itself, $PPL_F$ evaluates the perplexity of samples generated by the generator $G_\theta$ using another language model (denoted as $M_{LM}$) trained on real data via MLE. According to Zhao et al. (Kim et al., 2017), $PPL_F$ measures the fluency of generated sentences.

$$PPL_F = \exp\left(-\frac{1}{N} \sum_{Y \sim G_\theta} \sum_{j} \log M_{LM}(y_j \mid Y_{1:j-1}, X)\right), \tag{17}$$

where $N$ is the total number of tokens in the generated samples.

In our experiments, $M_{LM}$ is implemented as an LSTM-based SEQ2SEQ model, whose word embedding size is set to 64 and whose encoder and decoder hidden units are both set to 300.
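Once the reference language model has assigned a log-probability to every generated token, the perplexity computation itself is short. The sketch below assumes the standard definition of perplexity (the exponential of the negative mean per-token log-likelihood, averaged over all tokens of all generated sentences); the exact normalization used in the experiments may differ, and the log-probabilities here are hypothetical.

```python
import math

def forward_perplexity(token_log_probs):
    """exp of the negative mean per-token log-likelihood over all sentences."""
    total_log_prob = sum(sum(sentence) for sentence in token_log_probs)
    total_tokens = sum(len(sentence) for sentence in token_log_probs)
    return math.exp(-total_log_prob / total_tokens)

# Hypothetical per-token log-probabilities assigned by the reference language
# model to two generated sentences of lengths 3 and 5.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
ppl = forward_perplexity(uniform)
```

A model that assigns probability 0.25 to every token gives a perplexity of 4, the size of a uniform choice among four options, which is a convenient sanity check for the implementation.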

methods WebNLG SemEval SKE
MLE 1.810 2.918 7.046
CoT 2.423 2.579 6.643
SeqGAN 3.556 4.129 7.834
PG 2.151 3.180 8.078
NW 2.461 2.771 6.916
T2T 1.589 2.067 3.565
Table 5. Forward perplexity among three datasets.
Figure 5. Forward perplexity training curves on the three datasets: (a) WebNLG, (b) SemEval, (c) Baidu SKE.

5.5. Experiments Results

Table 3 shows the overall results of each training method on BLEU, TER, and METEOR among the three datasets. From the results, we find that our proposed T2T method improves the quality of the generated sentence on these word-based metrics. As we have analyzed, generators optimized via inverse KL divergence tend to generate text with more common expressions, while other baselines tend to use some low-quality text. Thus, sentences from T2T will overlap more words with reference text, which means it can achieve better performance on the word based metrics like BLEU.

The experiments on forward perplexity also validate our conclusion. Table 5 reports the forward perplexity on the three datasets. Low forward perplexity indicates that our method leads the generator to produce high-frequency language patterns more often and more fluently. We plot the training curves of forward perplexity in Figure 5.

Table 4 shows the predicate accuracy of different training methods on SemEval datasets. Our model can fit the logical connection between real sentences and RDF predicates better compared with baselines.

Human evaluation is conducted on WebNLG dataset to validate the performance of our framework further. We choose WebNLG dataset because it consists of more RDF triples and its reference sentences are relatively simple. We randomly select 20 RDF triples from the dataset, along with the corresponding sentences generated by T2T and baselines. Ten human volunteers are asked to rate the sentences from two aspects: grammar and correctness. The score on grammar is used to judge whether the sentence contains grammatical errors, improper use of words and repetition. Correctness measures whether the sentence accurately represents the information in the RDF triples. The score for each criterion takes an integer between 1 and 10. Volunteers are given both scoring criteria and examples. Table 6 lists the results of overall human evaluation score.

Table 7 presents samples of generated sentences from different baselines and T2T given a knowledge base about Amatriciana sauce. Compared with the baselines, the text generated by T2T is not only grammatically sound but also correctly expresses all the information from the RDF triples. We also found that when the generated sentence is long, the quality of the output from SeqGAN is compromised, which may be because it uses Monte Carlo sampling to guide the generator, which introduces variance. The sentences generated by MLE correctly express the knowledge, but the grammar and word choice are not quite natural. Text generated using Pointer-Generator suffers from repetition. Neural Wikipedian can hardly express all information soundly given multiple triples.

| Method | Grammar | Correctness |
|---|---|---|
| MLE | 7.6 | 6.7 |
| CoT | 5.5 | 3.5 |
| PG | 7.1 | 5.6 |
| SeqGAN | 8.0 | 5.8 |
| NW | 6.3 | 4.6 |
| T2T | 8.6 | 7.1 |

Table 6. Human evaluation on WebNLG dataset.

| Source | Text |
|---|---|
| RDF inputs | ⟨Italy, capital, Rome⟩, ⟨Italy, leaderName, Matteo Renzi⟩, ⟨Amatriciana sauce, country, Italy⟩, ⟨Italy, leaderName, Laura Boldrini⟩ |
| Reference | Amatriciana sauce is a traditional sauce in italy ( the capital of which is rome ) , where two of the country ’ s leaders are matteo renzi and laura boldrini . |
| MLE | Italy is called a country Amatriciana sauce . Matteo Renzi and Laura Boldrini are leaders in Italy where the capital is Rome . |
| CoT | Laura Boldrini is a leader in Italy where Rome is the capital of the country of Italy where where valencia is bacon. |
| SeqGAN | Amatricana sauce comes from Italy , a political leader and the capital is Rome . matteo renzi and Laura Boldrini are one of the leaders of Italy is |
| PG | Amatriciana sauce , a traditional italian dish from the Rome of the italian , where Rome the the leader is either two leaders include Matteo Renzi. |
| NW | the leader of Italy is Laura Boldrini where amatriciana sauce can be found . |
| T2T | Amatriciana sauce is from the country of Italy where capital is Rome . its leader is Laura Boldrini and Matteo Renzi leads the country . |

Table 7. Sample output of the system.

6. Related Works

Our task can be regarded as a combination of two problems: training neural language models, and converting knowledge bases (structured data) into natural language.

6.1. Knowledge Base to Natural Language

Previous approaches to generating natural language from knowledge bases fall into three types: rule-based, template-based, and neural language model based.

Generating sentences from knowledge bases with hand-crafted rules is the main technique in traditional NLG systems; it often involves domain-specific knowledge and only works for a particular data type. Bontcheva and Wilks (Bontcheva and Wilks, 2004) designed a set of rules to automatically generate natural language reports from medical data. O’Donnell et al. (O’Donnell et al., 2000) designed a text generation system by utilizing the potential rules from relational databases. They specified the semantics of relational databases and reconstructed an “Intelligent Labelling Explorer” (ILEX) system. Based on that, the ILEX system can interpret entities from databases using information such as the domain taxonomy and the specification of the data type. Cimiano et al. (Cimiano et al., 2013) presented a principled language generation architecture by analyzing statistical information derived from a domain corpus. Their system can write recipes based on RDF representations of a cooking domain. They mainly focus on extracting a lexicon and then formulating the recipes with a parse tree.

Template-based generation is another traditional approach to converting structured data into text. In general, developing such a system often requires complex design of grammar, semantics and lexicalization (Deemter et al., 2005). Kukich (Kukich, 1983) designed a knowledge-based report generator which infers semantic messages from the data and then maps that information into a grammar-based template. Flanigan et al. (Flanigan et al., 2016) proposed a two-stage method for natural language generation from Abstract Meaning Representation (Banarescu et al., 2013). Duma and Klein (Duma and Klein, 2013) formulated a system which automatically learns sentence templates from a corpus extracted from Simple English Wikipedia and DBpedia.

These two approaches offer good availability and reliability, and do not rely on large corpora to train a model. However, they require expert manual effort and scale poorly.

6.2. Neural Language Models

The sequence-to-sequence model (Dušek, 2016) adopts an end-to-end generation method that converts a meaning representation into a sentence. As the attention mechanism (Bahdanau et al., 2014) presents advantages in soft-searching the most relevant information in a sequence for neural machine translation, Nallapati et al. (Nallapati et al., 2016) proposed a sequence-to-sequence attentional model to tackle the text summarization task. See et al. (See et al., 2017) proposed a hybrid pointer-generator network that facilitates copying words from the source text via pointing (Vinyals et al., 2015) while retaining the ability to produce new words via the generator, and uses coverage to discourage repetition. Yu et al. (Yu et al., 2017) proposed the SeqGAN framework, which introduces a GAN discriminator (Goodfellow et al., 2014) to provide the reward signal and uses the policy gradient technique (Sutton et al., 2000) to bypass the generator differentiation problem. Lu et al. (Lu et al., 2018) proposed Cooperative Training (CoT), which coordinately trains a generative module and an auxiliary predictive module to optimize the estimated Jensen-Shannon divergence.
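To make the copy-versus-generate idea of the pointer-generator network concrete, the sketch below mixes a vocabulary distribution with attention weights over source tokens, following the mixture described by See et al. (2017): a word's final probability is p_gen times its vocabulary probability plus (1 - p_gen) times the attention mass on source positions holding that word. The function name `final_distribution` and all numbers are illustrative assumptions, not values from any trained model, and coverage is omitted.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities into one word distribution.

    p_gen         -- probability of generating from the vocabulary, in [0, 1]
    vocab_dist    -- dict mapping vocabulary words to probabilities
    attention     -- attention weights, one per source token
    source_tokens -- source words aligned with `attention`
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    if len(attention) != len(source_tokens):
        raise ValueError("attention and source_tokens must have equal length")

    # Generation part: scale every vocabulary probability by p_gen.
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add (1 - p_gen) * attention for each source position,
    # so out-of-vocabulary source words can still receive probability.
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


if __name__ == "__main__":
    # Toy numbers for illustration only.
    dist = final_distribution(
        p_gen=0.5,
        vocab_dist={"is": 0.6, "from": 0.4},
        attention=[0.7, 0.1, 0.2],
        source_tokens=["Amatriciana", "country", "Italy"],
    )
    for word, prob in sorted(dist.items()):
        print(f"{word}: {prob:.2f}")
```

In this toy case "Amatriciana" is absent from the vocabulary yet still receives probability through the copy term, which is the behaviour that lets such a model reproduce rare entity names from the input.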

Besides studies on the design and training of language models, researchers have also proposed many indicators for evaluating the quality of samples generated by language models. These metrics can be classified into word-based metrics and grammar-based metrics. Word-based metrics range from simple n-gram overlap (including BLEU, TER (Snover et al., 2006), ROUGE (Lin, 2004), NIST (Doddington, 2002), LEPOR (Han et al., 2012), CIDEr (Vedantam et al., 2015) and METEOR (Banerjee and Lavie, 2005)) to semantic similarity such as Semantic Text Similarity (Han et al., 2013). Grammar-based metrics include F-score, MaxMatch (Dahlmeier and Ng, 2012) and I-measure (Felice and Briscoe, 2015). In addition, instead of comparing sentences word by word, EmbSim (Zhu et al., 2018) compares the word embeddings. Some metrics are likelihood-based and estimate the cross-entropy between the generated sentences and the true data, such as the one used by Yu et al. (Yu et al., 2017), which estimates the average negative log-likelihood of generated sentences under an oracle LSTM.
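As a minimal sketch of what "n-gram overlap" means in practice, the Python snippet below computes the clipped n-gram precision of a candidate sentence against a single reference. This is only one ingredient of BLEU, which additionally combines several n-gram orders and applies a brevity penalty. The function names and the toy sentences are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against one `reference`.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference, so repeating a matching word does not
    inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        # Candidate is shorter than n tokens: no n-grams to score.
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


if __name__ == "__main__":
    candidate = "the the the cat".split()
    reference = "the cat sat".split()
    print(clipped_precision(candidate, reference, 1))
    print(clipped_precision(candidate, reference, 2))
```

Clipping is what keeps a degenerate, repetitive output from scoring well on unigram overlap alone.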

7. Conclusion

In this paper, we studied the problem of converting knowledge base RDF triples into natural language. To handle this problem, we formulated it as a conditional natural language generation problem and utilized discrete sequence generative models. We analyzed the limitations of existing methods for conditional sequence generative models and proposed a new method, T2T, which approximately optimizes an inverse Kullback-Leibler divergence between the real distribution and the learned one. We validated the proposed method on three benchmark datasets. The experimental results show that our method outperforms the baselines.

Our model is not limited to the task of translating knowledge base RDF triples into natural language; it can also be applied to other conditional generation tasks such as machine translation and question answering, which we leave as future work.

This work is sponsored by the Huawei Innovation Research Program. The corresponding author, Weinan Zhang, acknowledges the support of the National Natural Science Foundation of China (61702327, 61772333, 61632017) and the Shanghai Sailing Program (17YF1428200). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.


References
  • Arjovsky and Bottou (2017) Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web. Springer, 722–735.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 178–186.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
  • Belz and Reiter (2006) Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS. 1171–1179.
  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML. ACM, 41–48.
  • Bontcheva and Wilks (2004) Kalina Bontcheva and Yorick Wilks. 2004. Automatic report generation from ontologies: the MIAKT approach. In International conference on application of natural language to information systems. Springer, 324–335.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
  • Cimiano et al. (2013) Philipp Cimiano, Janna Lüker, David Nagel, and Christina Unger. 2013. Exploiting ontology lexica for generating natural language texts from RDF data. (2013).
  • Dahlmeier and Ng (2012) Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In NAACL. Association for Computational Linguistics, 568–572.
  • Dale et al. (2003) Robert Dale, Sabine Geldof, and Jean-Philippe Prost. 2003. CORAL: Using natural language generation for navigational assistance. In Proceedings of the 26th Australasian computer science conference-Volume 16. Australian Computer Society, Inc., 35–44.
  • Deemter et al. (2005) Kees Van Deemter, Mariët Theune, and Emiel Krahmer. 2005. Real versus template-based natural language generation: A false opposition? Computational Linguistics 31, 1 (2005), 15–24.
  • Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation. 376–380.
  • Ding et al. (2004) Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs. 2004. Swoogle: a search and metadata engine for the semantic web. In CIKM. ACM, 652–659.
  • Doddington (2002) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Baltic HLT. Morgan Kaufmann Publishers Inc., 138–145.
  • Duma and Klein (2013) Daniel Duma and Ewan Klein. 2013. Generating natural language from linked data: Unsupervised template extraction. In IWCS. 83–94.
  • Dušek (2016) Ondřej Dušek. 2016. Sequence-to-Sequence Natural Language Generation. Interaction (2016).
  • Fader et al. (2014) Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In SIGKDD. ACM, 1156–1165.
  • Felice and Briscoe (2015) Mariano Felice and Ted Briscoe. 2015. Towards a standard evaluation method for grammatical error detection and correction. In NAACL. 578–587.
  • Flanigan et al. (2016) Jeffrey Flanigan, Chris Dyer, Noah A Smith, and Jaime Carbonell. 2016. Generation from abstract meaning representation using tree transducers. In NAACL. 731–739.
  • Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The webnlg challenge: Generating text from rdf data. In INLG. 124–133.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NeurIPS. 2672–2680.
  • Han et al. (2012) Aaron LF Han, Derek F Wong, and Lidia S Chao. 2012. LEPOR: A robust evaluation metric for machine translation with augmented factors. Proceedings of COLING 2012: Posters (2012), 441–450.
  • Han et al. (2013) Lushan Han, Abhay L Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: semantic textual similarity systems. In *SEM, Vol. 1. 44–52.
  • Hendrickx et al. (2009) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, 94–99.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Huang et al. (2002) Zan Huang, Wingyan Chung, Thian-Huat Ong, and Hsinchun Chen. 2002. A graph-based recommender system for digital library. In Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries. ACM, 65–73.
  • Huszár (2015) Ferenc Huszár. 2015. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101 (2015).
  • Kim et al. (2017) Yoon Kim, Kelly Zhang, Alexander M Rush, Yann LeCun, et al. 2017. Adversarially Regularized Autoencoders. arXiv preprint arXiv:1706.04223 (2017).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kukich (1983) Karen Kukich. 1983. Design of a knowledge-based report generator. In ACL. Association for Computational Linguistics, 145–150.
  • Li et al. (2015) Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057 (2015).
  • Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial Learning for Neural Dialogue Generation. In EMNLP. 2157–2169.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
  • Lu et al. (2018) Sidi Lu, Lantao Yu, Weinan Zhang, and Yong Yu. 2018. CoT: Cooperative Training for Generative Modeling. arXiv preprint arXiv:1804.03782 (2018).
  • Magazine (1998) D-Lib Magazine. 1998. An Introduction to the Resource Description Framework. D-Lib Magazine (1998).
  • Mahoney (1999) Matthew V Mahoney. 1999. Text compression as a test for artificial intelligence. In AAAI/IAAI. 970.
  • McLachlan (2004) Geoffrey McLachlan. 2004. Discriminant analysis and statistical pattern recognition. Vol. 544. John Wiley & Sons.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS. 3111–3119.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023 (2016).
  • Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for nlg. arXiv preprint arXiv:1707.06875 (2017).
  • O’Donnell et al. (2000) Michael O’Donnell, Alistair Knott, Jon Oberlander, and Chris Mellish. 2000. Optimising text quality in generation from relational databases. In INLG. Association for Computational Linguistics, 133–140.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. Association for Computational Linguistics, 311–318.
  • See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017).
  • Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, Vol. 200.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NeurIPS. 3104–3112.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS. 1057–1063.
  • Trisedya et al. (2018) Bayu Distiawan Trisedya, Jianzhong Qi, Rui Zhang, and Wei Wang. 2018. GTR-LSTM: A Triple Encoder for Sentence Generation from RDF Data. In ACL, Vol. 1. 1627–1637.
  • Turner et al. (2010) Ross Turner, Somayajulu Sripada, and Ehud Reiter. 2010. Generating approximate geographic descriptions. In EMNLP. Springer, 121–140.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4566–4575.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In NeurIPS. 2692–2700.
  • Vougiouklis et al. (2018) Pavlos Vougiouklis, Hady Elsahar, Lucie-Aimée Kaffee, Christophe Gravier, Frederique Laforest, Jonathon Hare, and Elena Simperl. 2018. Neural wikipedian: Generating textual summaries from knowledge base triples. JWS (2018).
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
  • Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI. 2852–2858.
  • Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A Benchmarking Platform for Text Generation Models. SIGIR (2018).
  • Zou et al. (2014) Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, and Dongyan Zhao. 2014. Natural language question answering over RDF: a graph data driven approach. In SIGKDD. ACM, 313–324.