Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation

Giuseppe Russo et al. (Swisscom, ETH Zurich), April 30, 2020

In this work, we present a text generation approach with multi-attribute control for data augmentation. We introduce CGA, a Variational Autoencoder architecture, to control, generate, and augment text. CGA generates natural sentences with multiple controlled attributes by combining adversarial learning with a context-aware loss. Our approach scales because it requires a single discriminator, regardless of the number of attributes. As the main application of our work, we test the potential of this new model in a data augmentation use case. In a downstream NLP task, the sentences generated by our CGA model not only show significant improvements over a strong baseline, but also classification performance very similar to real data. Furthermore, we show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments.


1 Introduction

Recently, natural language generation (NLG) has become a prominent research topic in NLP, due to its diverse applications, ranging from machine translation (e.g., sennrich2016controlling) to dialogue systems (e.g., budzianowski2019hello). A main application and common goal of automatic text generation is the augmentation of datasets used for supervised NLP tasks. Hence, one of the key demands of NLG is controlled text generation, more specifically, the ability to systematically control semantic and syntactic aspects of generated text.

Most previous approaches simplify this problem by reducing NLG to the control of one single aspect of the text, such as sentiment or formality (e.g., li2018delete, fu2018style, and john2019disentangled). However, controlled generation involves multiple components: lexical, syntactic, semantic and stylistic aspects. Therefore, the simultaneous control of multiple attributes becomes vital to generate natural sentences suitable for specific downstream tasks. Methods such as the ones presented by hu2017toward and subramanian2018multiple succeed in simultaneously controlling multiple attributes of sentences. However, these methods depend on the transformation of input reference sentences, or do not scale easily to multiple attributes due to architectural complexities, such as the requirement for a separate discriminator for each additional attribute.

In light of these challenges, with our Control, Generate, Augment model (CGA) we propose a powerful framework to synthesize additional labeled data. The accurate multi-attribute control of our approach offers significant performance gains on downstream NLP tasks.

The main contributions of this paper are:

  1. A scalable model which learns to control multiple semantic and syntactic attributes of a sentence. The CGA model requires only a single discriminator for simultaneously controlling multiple attributes (see Section 2). We present automatic and human assessments to confirm the control over multiple semantic and syntactic attributes. Further, we provide a quantitative comparison to previous work.

  2. A method for natural language generation for data augmentation, which boosts the performance of downstream tasks. To this end, we present data augmentation experiments of various datasets, where we significantly outperform a strong baseline and achieve a performance comparable to real data (Section 3).

2 Method

We now present our model for controlled text generation. Our model is based on the Sentence-VAE framework bowman2016generating. However, we modify the model to allow the generation of sentences conditioned not only on the latent code but also on an attribute vector. We achieve this by disentangling the latent code from the attribute vector, similarly to the Fader networks lample2017fader, originally developed for computer vision tasks. As we will see, this simple adaptation alone is not sufficient. We carefully designed our architecture by taking advantage of a range of techniques, some from the NLP community edunov2018understanding, but many from the computer vision community zhu2017unpaired; sanakoyeu2018style.

Figure 1: Model Architecture depicting the key components: the conditional VAE, the multi-attribute discriminator and the context-aware loss.

2.1 Model Architecture

We assume access to a corpus of sentences and a set of categorical attributes of interest. For each sentence $x$, we use an attribute vector $a$ to represent these associated attributes. Example attributes include the sentiment or verb tense of a sentence.

Given a sentence $x$ and its attribute vector $a$, our goal is to construct an ML model that, given a different attribute vector $a'$, generates a new sentence $x'$ that contains the attributes $a'$.
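In code, this interface can be sketched in a few lines; `encoder` and `decoder` are placeholders for the components defined in the following subsections, so the signatures are illustrative rather than the authors' implementation.

```python
def generate_with_attributes(encoder, decoder, sentence, new_attributes):
    """Encode a sentence, swap in a new attribute vector a', and decode.

    sentence:        the input sentence x (token sequence)
    new_attributes:  the target attribute vector a'
    """
    z = encoder(sentence)              # attribute-invariant latent code
    return decoder(z, new_attributes)  # new sentence x' carrying a'
```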

Sentence Variational Autoencoder

The main component of our model is a Variational Auto-Encoder kingma2013auto. The encoder network $E_\phi$, parameterized by a trainable parameter $\phi$, takes as input a sentence $x$ and defines a probabilistic distribution over the latent code $z$:

$$z \sim q_\phi(z \mid x) \quad (1)$$

The decoder $D_\theta$, parameterized by a trainable parameter $\theta$, tries to reconstruct the input sentence $x$ from a latent code $z$ and its attribute vector $a$. We always assume that the reconstructed sentence has the same number of tokens as the input sentence $x$:

$$p_\theta(x \mid z, a) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z, a) \quad (2)$$

where $T$ is the total number of tokens of the input sentence and $x_t$ is the $t$-th token. Here we abuse the notation slightly and use $p_\theta$ to denote both the sentence-level probability and the word-level conditional probability.

To train the encoder and decoder, we use the following VAE loss:

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z, a)\big] + \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \quad (3)$$

where $p(z)$ is a standard Gaussian distribution.

When we try to optimize the loss in Equation 3, the KL term often vanishes. This problem is known in text generation as posterior collapse bowman2016generating. To mitigate this problem, we follow bowman2016generating and add a weight to the KL term in Equation 3. At the start of training, we set this weight to zero, so that the model learns to encode as much information in $z$ as possible. Then, as training progresses, we gradually increase the weight, as in the standard KL-annealing technique.

Moreover, the posterior collapse problem is partially due to the fact that, during training, our decoder predicts each token conditioned on the previous ground-truth token. We want the model to rely more on $z$. A natural way to achieve this is to weaken the decoder by removing some or all of this conditioning information during the training process. Previous work bowman2016generating; hu2017toward replaces a randomly selected, significant portion of the ground-truth tokens with MASK. However, this can severely affect the decoder and worsen the generative capacity of the model. Therefore, we define a new word-dropout routine, which aims at both mitigating the posterior collapse problem and preserving the decoder capacity. Instead of fixing the word-dropout rate to a large constant value as in bowman2016generating, we use a cyclical word-dropout rate $r_{\mathrm{wd}}(i)$:

(4)

where $i$ is the current training iteration, $c$ is a constant value we use during the warm-up phase, and $T$ defines the period of the cyclical word-dropout rate schedule.
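To make the schedule concrete, here is a minimal sketch of one possible instantiation of Equation 4. The warm-up constant $c$ and period $T$ follow the definitions above; the cosine cycle shape is an assumption, since Equation 4's exact form is not reproduced here.

```python
import math

def cyclical_word_dropout(i, warmup_steps, c, period):
    """Cyclical word-dropout rate r_wd(i) (one plausible instantiation).

    During warm-up the rate is the constant c; afterwards it cycles with
    period T. This sketch assumes a cosine cycle between 0 and c.
    """
    if i < warmup_steps:
        return c
    phase = ((i - warmup_steps) % period) / period  # position in the cycle
    return 0.5 * c * (1.0 - math.cos(2.0 * math.pi * phase))
```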

Disentangling Latent Code and Attribute Vector

To be able to generate sentences given a different attribute vector $a'$, we have to disentangle the attribute vector from the latent code. In other words, we want $z$ to be attribute-invariant: a latent code is attribute-invariant if two semantically equivalent sentences $x_1$ and $x_2$ that differ only in their attributes (e.g., two versions of the same review expressing opposite sentiment) result in the same latent representation $z_1 = z_2 = z$.

To achieve this, we use a concept from predictability minimization schmidhuber1992learning and adversarial training for domain adaptation ganin2016domain; louppe2017learning, which was recently applied in the Fader Networks by lample2017fader. We apply adversarial learning directly on the latent code $z$ of the input sentence $x$. We set up a min-max game and introduce a discriminator $D_\psi$, which takes as input the latent code $z$ and tries to predict the attribute vector $a$. Specifically, for each attribute $a_k$, $D_\psi$ outputs a probability distribution $p_{D_\psi}(\cdot \mid z)$ over all its possible values. To train the discriminator, we optimize the following loss:

$$\mathcal{L}_{\mathrm{dis}}(\psi) = -\sum_{k} \log p_{D_\psi}(a_k \mid z) \quad (5)$$

where $a_k$ is the ground truth of the $k$-th attribute.

Simultaneously, we hope to learn an encoder and decoder whose latent code $z$ (1) combined with the attribute vector $a$, allows the decoder to reconstruct the input sentence $x$, and (2) does not allow the discriminator to infer the correct attribute vector corresponding to $x$. We optimize the following objective:

$$\min_{\theta, \phi}\ \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - \lambda_{\mathrm{dis}}\, \mathcal{L}_{\mathrm{dis}}(\psi) \quad (6)$$
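A minimal PyTorch sketch of this min-max game follows, assuming the discriminator returns one logit tensor per attribute; module names and signatures are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, z, attr_labels, opt_d):
    """Eq. (5): train the discriminator to predict each attribute from z.
    attr_labels is a list of ground-truth label tensors, one per attribute."""
    loss_d = sum(F.cross_entropy(logits, y)
                 for logits, y in zip(disc(z.detach()), attr_labels))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    return loss_d.item()

def adversarial_term(disc, z, attr_labels, lambda_dis):
    """Eq. (6): the encoder is rewarded when the discriminator fails,
    so this term enters the VAE objective with a negative sign."""
    loss_dis = sum(F.cross_entropy(logits, y)
                   for logits, y in zip(disc(z), attr_labels))
    return -lambda_dis * loss_dis
```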
Context-Aware Loss

Equation 6 forces our model to choose which information the latent code should retain or disregard. However, this approach comes with the risk of deteriorating the quality of the latent code itself. Therefore, inspired by sanakoyeu2018style, we propose an attribute-aware context loss, which tries to preserve the context information by comparing the latent representation $z$ of a sentence with its back-context representation $\tilde{z}$, obtained by re-encoding the sentence decoded from $z$ with a different attribute vector:

$$\mathcal{L}_{\mathrm{ctx}}(\theta, \phi) = d\big(z, \tilde{z}\big) \quad (7)$$

where $d$ is a distance between latent representations. The latent vector $z$ can be seen as a contextual representation of the input sentence $x$. This latent representation changes during the training process and hence adapts to the attribute vector. Thus, when measuring the similarity between $z$ and the back-context representation $\tilde{z}$, we focus on preserving those aspects which are profoundly relevant for the context representation.
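A sketch of this back-context comparison, under two assumptions not fixed by the text above: the back-context representation is obtained by re-encoding the sentence decoded with a different attribute vector $a'$, and the distance $d$ is a squared L2 norm.

```python
import torch

def context_aware_loss(encoder, decoder, z, a_new):
    # Decode with a different attribute vector a' (kept differentiable in
    # practice, e.g., via soft token embeddings), then re-encode.
    x_tilde = decoder(z, a_new)      # sentence carrying attributes a'
    z_tilde = encoder(x_tilde)       # back-context representation
    # Assumed distance d: mean squared L2 between the two latent codes.
    return torch.mean((z - z_tilde) ** 2)
```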

Finally, when training the encoder and decoder (given the current discriminator), we optimize the following loss:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - \lambda_{\mathrm{dis}}\, \mathcal{L}_{\mathrm{dis}}(\psi) + \lambda_{\mathrm{ctx}}\, \mathcal{L}_{\mathrm{ctx}}(\theta, \phi) \quad (8)$$
Sentence Attributes
it was a great time to get the best in town and i loved it. Past / Positive
it was a great time to get the food and it was delicious. Past / Positive
it is a must! Present/Positive
they’re very reasonable and they are very friendly and helpful. Present / Positive
i had a groupon and the service was horrible. Past / Negative
this place was the worst experience i’ve ever had. Past / Negative
it is not worth the money. Present / Negative
there is no excuse to choose this place. Present / Negative
Table 1: Examples of generated sentences with two attributes: sentiment and verb tense.
Sentence Attributes
they have a great selection of beers and shakes. Present / Positive / Plural
i love this place and i will continue to go here. Present / Positive / Singular
the mashed potatoes were all delicious! Past / Positive / Plural
the lady who answered was very friendly and helpful. Past / Positive / Singular
the people are clueless. Present / Negative / Plural
i mean i’m disappointed. Present / Negative / Singular
drinks were cold and not very good. Past / Negative / Plural
it was a complete disaster. Past / Negative / Singular
Table 2: Examples of generated sentences with three attributes: sentiment, verb tense, and person number.

3 Evaluation

To assess our newly proposed model for the controlled generation of sentences, we perform the following evaluations described in this section: An automatic and human evaluation to analyze the quality of the new sentences with multiple controlled attributes; an examination of sentence embedding similarity to assess the diversity of the generated samples; downstream classification experiments with data augmentation on two different datasets to prove the effectiveness of the new sentences in a relevant application scenario; and, finally, a comparison of our results to previous work to specifically contrast our model against other single and multi-attribute models.

Datasets

We conduct all experiments on two datasets, YELP and IMDB reviews. Both contain sentiment labels for the reviews. From the YELP business reviews dataset yelp2014data, we use reviews only from the category restaurants, which results in a dataset of approx. 600’000 sentences. The IMDB movie reviews dataset imdb2011 contains approx. 150’000 sentences. For reproducibility purposes, details about training splits and vocabulary sizes can be found in the supplementary materials.

Attributes

For our experiments we use three attributes: sentiment as a semantic attribute; verb tense and person number as syntactic attributes.

  1. Sentiment: We labeled each review as positive or negative following the approach of shen2017style.

  2. Verb Tense: We detect past and present verb tenses using spaCy's part-of-speech tagging model (https://spacy.io/usage/linguistic-features#pos-tagging). We define a sentence as present if it contains more present than past verbs.

  3. Person Number: We also use spaCy to detect singular or plural pronouns and nouns. We label a sentence as singular if it contains more singular than plural pronouns or nouns, as plural in the opposite case, and as balanced otherwise.

We provide more information about the specific PoS tags we use for the labeling in the supplementary materials.

We train our model to generate sentences by controlling one, two or three attributes simultaneously. The sentences are generated by the decoder as described in Equation 2. In Table 1 we illustrate some examples of sentences where we controlled two attributes, Sentiment and Verb Tense, at the same time. Table 2 presents sentences where the model controls three attributes simultaneously. Moreover, example sentences of controlling single attributes can be found in the supplementary material.

Experimental Setting

The generator and encoder are set as single-layer LSTM RNNs with a hidden dimension of 256 and a maximum sample length of 20. The discriminator is a fully-connected layer or a single-layer LSTM. To avoid a vanishingly small KL term in the VAE bowman2016generating, we use a KL term weight annealing that increases from 0 to 1 during training according to a logistic schedule. The discriminator weight increases linearly from 0 to 20. Finally, we set the back-translation weight to 0.5. More specific information is provided in the supplementary material.

3.1 Quality of Generated Sentences

First, we quantitatively measure the sentence attribute control of our CGA model by evaluating how accurately generated sentences exhibit the designated attributes, using both automatic and human evaluations.

Attribute Matching

For this automatic evaluation, we generate sentences given the attribute vector $a$ as described in Section 2. To assign the sentiment attribute labels, we apply a pre-trained TextCNN with 95% accuracy on YELP and 81% accuracy on IMDB kim2014convolutional. To assign the verb tense and person number labels, we use spaCy's part-of-speech tagging (93% accuracy for English). We calculate the accuracy as the percentage of the predictions of these pre-trained models that match the attribute label used to condition our CGA model. Table 3 shows the results on 30K sentences generated by CGA models trained on YELP and IMDB, respectively. The results are averaged over five balanced splits, each with 6,000 samples.
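The attribute-matching protocol is straightforward to sketch. The tense heuristic below uses the VBP/VBZ/VBD tags listed in the supplementary material; `predict` stands in for any of the pre-trained attribute classifiers.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def tense_of(sentence):
    """Label a sentence 'present' or 'past' by majority verb tense."""
    doc = nlp(sentence)
    present = sum(tok.tag_ in ("VBP", "VBZ") for tok in doc)
    past = sum(tok.tag_ == "VBD" for tok in doc)
    return "present" if present > past else "past"

def attribute_matching_accuracy(generated, target_labels, predict):
    """Share of generated sentences whose predicted attribute matches
    the label used to condition the generator."""
    hits = sum(predict(s) == y for s, y in zip(generated, target_labels))
    return hits / len(generated)
```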

Dataset | Sentiment | Tense | Person No.
YELP | 91.1% (0.04) | 96.6% (0.03) | 95.9% (0.06)
IMDB | 88.0% (0.07) | 93.6% (0.04) | 91.4% (0.05)

Table 3: Attribute matching accuracy (in %) of the generated sentences; standard deviation reported in brackets.

Human Evaluation

To further understand the quality of the generated sentences we go beyond the automatic attribute evaluation and perform a human judgement analysis.

One of our main contributions is the generation of sentences with up to three controlled attributes. Therefore, we randomly select 120 sentences generated from the model trained on YELP, which controls all three attributes. Two human annotators labelled these sentences by marking which of the attributes are included correctly in the sentence.

In addition to the accuracy, we report inter-annotator agreement with Cohen's $\kappa$. In 80% of the sentences all three attributes are included correctly, and in 100% of the sentences at least two of the three attributes are present. To facilitate comparisons to previous work (see Section 4), we also derive an attribute score between 1 and 5 from the annotator agreements (1 if the considered attribute is not included, 5 if it is evidently present in the sentence).
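Cohen's $\kappa$ can be computed directly from the two annotators' per-sentence judgments, e.g. with scikit-learn (the labels below are toy values, not the study's data):

```python
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 1, 0, 1, 0, 1]  # 1 = attribute judged present
annotator_2 = [1, 1, 0, 0, 0, 1]
print(cohen_kappa_score(annotator_1, annotator_2))
```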

Finally, the annotators also judged whether the sentence is a correct, complete and coherent English sentence. Most of the incorrect sentences contain repeated words or incomplete endings. The results are shown in Table 4.

Attribute | Sentences | Accuracy ($\kappa$) | AS
Sentiment | 106/120 | 0.88 (0.73) | 4.40
Verb Tense | 117/120 | 0.98 (0.97) | 4.90
Person Number | 114/120 | 0.95 (0.85) | 4.25
2 Attributes | 120/120 | 1.0 | -
3 Attributes | 97/120 | 0.80 | -
Coherence | 79/120 | 0.66 | -

Table 4: Results of the human evaluation showing accuracy and Cohen's $\kappa$ for each attribute. AS stands for Attribute Score.
Sentence Embedding Similarity

Although generative models have been shown to produce outstanding results, in many circumstances they risk producing extremely repetitive examples goodfellow2014generative; zhao2017infovae. In this experiment, we qualitatively assess the capacity of our model to generate diversified sentences, further strengthening the results obtained in this work. We sample 10K sentences from YELP ($S_R$) and from our generated sentences ($S_G$), respectively, both labeled with the sentiment attribute. We retrieve the sentence embedding for each of the sentences in $S_R$ and $S_G$ using the Universal Sentence Encoder cer2018universal. Then, we compute the cosine similarity between the embeddings of all sentences of $S_R$ and, analogously, between the embeddings of our generated sentences $S_G$.

Figure 2: Similarity matrices for real data and data generated by our CGA model controlling the sentiment attribute (panels: Real Data, Generated Data).

Figure 3: Sentence similarity scores computed for real data and data generated by our CGA model on the three sentiment clusters (panels: (a) Negative-Negative, (b) Negative-Positive, (c) Positive-Positive).

Consequently, we obtain two similarity matrices $M_R$ and $M_G$ (see Figure 2). Both matrices show a four-cluster structure:

  • Top-Left: similarity scores between negative reviews.

  • Top-Right or Bottom-Left: similarity scores between negative and positive reviews.

  • Bottom-Right: similarity scores between positive reviews.

Further, for each sample of $S_R$ and $S_G$ we compute a similarity score as follows:

$$s(x_i) = \frac{1}{k} \sum_{x_j \in \mathcal{N}_k(x_i)} \mathrm{sim}(x_i, x_j) \quad (9)$$

where $x_i$ is the $i$-th sample of $S_R$ or $S_G$, $C_{x_i}$ is the cluster to which $x_i$ belongs, $\mathcal{N}_k(x_i)$ is the set of the $k$ most similar neighbours of $x_i$ in cluster $C_{x_i}$, and $k = 50$.
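A sketch of Equation 9 on precomputed embeddings follows; the cosine similarity and the function and variable names are ours.

```python
import numpy as np

def similarity_scores(embeddings, cluster_ids, k=50):
    """Eq. (9): mean cosine similarity of each sample to its k most
    similar neighbours within the same cluster.

    embeddings:  (n, d) array, e.g. Universal Sentence Encoder vectors
    cluster_ids: (n,) array assigning each sample to a cluster
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = np.empty(len(e))
    for i in range(len(e)):
        same = np.flatnonzero(cluster_ids == cluster_ids[i])
        same = same[same != i]          # exclude the sample itself
        sims = e[same] @ e[i]           # cosine similarities
        scores[i] = np.sort(sims)[-k:].mean()
    return scores
```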

To gain a qualitative understanding of the generation capacities of our model, we assume that an ideal generative model should produce samples with similarity scores comparable to those of the real data. Therefore, Figure 3 contrasts the similarity scores of $S_R$ and $S_G$, computed on each cluster separately.

Although our generated sentences are clearly more similar to one another than to the original ones, our model is able to produce samples clustered according to their labels. This highlights the good attribute control abilities of our CGA model and shows that it is able to generate diverse sentences which robustly mimic the structure of the original dataset. Hence, the generated sentences are good candidates for augmenting existing datasets.

We generalized this experiment to the multi-attribute case. The similarity matrices and the histograms for these additional experiments are provided in the supplementary material.

Figure 4: Data augmentation results for the YELP dataset (panels: (a) 500, (b) 1000, (c) 10000 samples).

Figure 5: Data augmentation results for the IMDB dataset (panels: (a) 500, (b) 1000, (c) 10000 samples).
Model | 500 sentences: acc. (std), % | 1000 sentences: acc. (std), % | 10000 sentences: acc. (std), %
Real Data YELP | 0.75 (0.01), 0 | 0.79 (0.01), 0 | 0.87 (0.03), 0
YELP + EDA | 0.77 (0.02), 70 | 0.80 (0.08), 30 | 0.88 (0.02), 70
YELP + CGA (Ours) | 0.80 (0.02), 150 | 0.82 (0.03), 120 | 0.88 (0.04), 100
Real Data IMDB | 0.54 (0.01), 0 | 0.57 (0.06), 0 | 0.66 (0.05), 0
IMDB + EDA | 0.56 (0.02), 150 | 0.58 (0.07), 70 | 0.67 (0.02), 100
IMDB + CGA (Ours) | 0.60 (0.01), 120 | 0.61 (0.01), 200 | 0.67 (0.03), 120

Table 5: Largest increase in performance for each method, independently of the augmentation percentage used. For each method and training size we report accuracy (standard deviation in brackets) and the augmentation percentage.

3.2 Data Augmentation

The main application of our work is to generate sentences for data augmentation purposes. Simultaneously, the data augmentation experiments presented in this section reveal the quality of the sentences generated by our model.

As described, we conduct all experiments on two datasets, YELP and IMDB reviews. We train an LSTM sentiment classifier on both datasets, each with three different training set sizes. We run all experiments for training sets of 500, 1000 and 10000 sentences. These datasets are then augmented with different percentages of generated sentences (10, 20, 30, 50, 70, 100, 120, 150 and 200%). This allows us to analyze the effect of data augmentation on varying original training set sizes as well as varying increments of additionally generated data. In all experiments, we average the results over 5 random seeds and report the corresponding standard deviation.
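Each augmentation run can be sketched as follows; `real` and `generated` are hypothetical lists of (sentence, label) pairs, and the classifier training itself is omitted.

```python
import random

def build_training_set(real, generated, augment_pct):
    """Compose one run of the protocol above: a fixed pool of real
    sentences plus augment_pct% additional (CGA, EDA, or real) sentences."""
    n_extra = int(len(real) * augment_pct / 100)
    return real + random.sample(generated, n_extra)

# e.g., 500 real sentences augmented with 150% generated sentences:
# train_set = build_training_set(real_500, cga_sentences, augment_pct=150)
```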

Model | SM (Automatic) | ST (Automatic) | SM (Human) | ST (Human)
shen2017style | 83.5% | 12K | 63.8% (3.19/5) | 500
fu2018style | 96.0% | 12K | 71.2% (3.35/5) | 100
john2019disentangled | 93.4% | 12K | 86.4% (4.32/5) | 500
CGA (Ours) | 93.1% | 30K | 96.3% (4.81/5) | 120

Table 6: Comparison between our results and the most relevant single-attribute related works. The accuracy refers to the share of generated sentences that match the required attribute. SM stands for Sentiment Matching and ST for the number of sentences tested.
Model | SM (Automatic) | VTM (Automatic) | ST (Automatic) | SM (Human) | ST (Human)
subramanian2018multiple | 74.5% | 91.1% | 12K | 72% (3.59/5) | 500
logeswaran2018content | 76.6% | 94.9% | 12K | 71.2% (3.56/5) | 100
lai2019multiple | 79.9% | 96.1% | 12K | 63.4% (3.17/5) | 500
CGA (Ours) | 91.1% | 96.6% | 30K | 88.0% (4.40/5) | 120

Table 7: Comparison between our results and the most relevant multi-attribute related works. The accuracies refer to the share of generated sentences that match the required attributes. SM stands for Sentiment Matching, VTM for Verb Tense Matching, and ST for the number of sentences tested.

To evaluate how beneficial our generated sentences are for the performance of downstream tasks, we compare data augmentation with sentences generated from our CGA model to (a) real sentences from the original datasets, and (b) sentences generated with the Easy Data Augmentation (EDA) method by wei2019eda. EDA applies a transformation (e.g., synonym replacement or random deletion) to a given sentence of the training set and provides a strong baseline.

The results are presented in Figures 4 and 5, for YELP and IMDB respectively. They show the performance of the classifiers augmented with sentences from our CGA model, from EDA, or from the original datasets. Our augmentation method proves beneficial in all six scenarios. However, when the percentage increment is larger than 120% of the original training size, the average accuracy of the classifier augmented with CGA sentences diverges from that of the classifier augmented with real data. Moreover, our model clearly outperforms EDA in all scenarios, especially with larger training sets.

In Table 5, we report the best average test accuracy as well as the percentage of data increment for each of the six experiments (three training set sizes for each of the two datasets). We compare them with the results obtained by the classifier trained only on real data without augmentation.

4 Comparison with Related Work

As a final analysis, we compare our results with previous state-of-the-art models for both single-attribute and multi-attribute control.

4.1 Single-Attribute Control

li2018delete model style control in the Delete, Retrieve, Generate (DRG) framework, which erases words related to a specific attribute and then inserts new words belonging to the vocabulary of the target style (e.g., sentiment). sudhakar2019transforming improve the DRG framework by combining it with a transformer architecture vaswani2017attention. However, these approaches are susceptible to error, due to the difficulty of accurately selecting only the style-carrying words.

Other approaches to text generation have leveraged adversarial learning. Specifically, shen2017style train a cross-aligned auto-encoder (CAAE) with shared content and separate style distributions. fu2018style suggest a multi-head decoder to generate sentences with different styles. john2019disentangled use a VAE with a multi-task loss to learn content and style representations that allow them to elegantly control the sentiment of the generated sentences.

In Table 6, we report a comparison with models focused on controlling only the sentiment of a sentence. fu2018style and john2019disentangled achieve better sentiment matching accuracy in the automatic evaluation than our CGA model. However, both fu2018style and john2019disentangled, by approximating the style of a sentence with its sentiment, propose models specifically designed to control this single attribute. When our CGA model is trained for sentiment control only, it obtains 93.1% accuracy in the automatic evaluation and 96.3% in the human evaluation, which is comparable with the scores obtained by the related approaches. Consequently, CGA offers a strong competitive advantage because it guarantees high sentiment matching accuracy while controlling additional attributes and thus offers greater control over multiple stylistic aspects of a sentence.

4.2 Multi-Attribute Control

Few works have succeeded in designing an adequate model for text generation that controls multiple attributes. hu2017toward use a VAE with controllable attributes. subramanian2018multiple and logeswaran2018content apply a back-translation technique from unsupervised machine translation to style transfer tasks. lai2019multiple follow the approach of the CAAE with a two-phase training procedure.

In addition to the quantitative evaluation provided for all three controlled attributes in Table 3, we compare the results for the SENTIMENT and VERB TENSE attributes, since they are the common denominator between all methods. These models were trained and tested on the same YELP data splits. We compare the results of our CGA model with the results achieved by lai2019multiple, logeswaran2018content and subramanian2018multiple. This comparison is reported in Table 7. In both evaluation scenarios (i.e., automatic and human), CGA yields significantly better performance. lai2019multiple, logeswaran2018content and subramanian2018multiple report content preservation as an additional evaluation metric. However, this metric is not applicable to our work since, differently from these previous models, CGA generates sentences directly from an arbitrary hidden representation and does not need a reference input sentence.

5 Discussion & Conclusion

To the best of our knowledge, we propose the first approach for controlled multi-attribute text generation which (1) generates coherent sentences with multiple correct attributes by sampling from a smooth latent space, (2) works within a lean and scalable architecture, and (3) improves downstream discriminative tasks by synthesizing additional labeled data.

In this paper we presented a scalable framework for natural language generation which allows fine-grained control over multiple stylistic aspects. We generate sentences of high quality with a maximum sentence length of 20 tokens. While this restricted sentence length is still a limitation, it is longer than the sentences presented in previous work (e.g., hu2017toward). Additionally, although we provide extensive evaluation analyses, defining an appropriate evaluation metric for text generation remains an open research question.

To sum up, our approach, which combines adversarial learning and back-translation, achieves state-of-the-art results with improved accuracy on sentiment, tense and person number attributes in automatic and human evaluations. Moreover, our experiments show that our CGA model can be used effectively as a data augmentation framework to boost the performance of downstream classifiers.

References

Appendix A Supplementary Material

A.1 Dataset

A.1.1 Structure

We use YELP and IMDB for the training, validation and testing of our CGA models. The label distributions for all attributes are described in Table 8.

From the YELP business reviews dataset yelp2014data, we use only reviews from the restaurants category. We use the same splits for training, validation and testing as john2019disentangled, which contain 444,101, 63,483 and 126,670 sentences, respectively. The vocabulary contains 9,304 words. We further evaluate our models on the IMDB dataset of movie reviews imdb2011. We use reviews with fewer than 20 sentences and select only sentences with fewer than 20 tokens. Our final dataset contains 122,345, 12,732 and 21,224 sentences for training, validation and testing, respectively. The vocabulary size is 15,362 words.

A.1.2 Attribute Labeling

In this work we simultaneously control three attributes: sentiment, verb tense and person number.

We use SpaCy’s Part-of-Speech tagging to assign the verb tense labels. Specifically, we use the tags VBP and VBZ to identify present verbs, and the tag VBD to identify past verbs.

Analogously, we use the SpaCy’s PoS tags and the personal pronouns to assign person number labels. In particular, we use the tag NN, which identifies singular nouns, and the following list of pronouns {”i”,”he”,”she”, ”it”, ”myself”} to identify a singular sentence. We use NNS and the list of pronous {”we”, ”they”, ”themselves”, ”ourselves”} to identify a plural sentence.
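A sketch of the person-number heuristic using the exact tags and pronoun lists above (the function name and the tie handling follow the main text, but are our own formulation):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
SINGULAR_PRONOUNS = {"i", "he", "she", "it", "myself"}
PLURAL_PRONOUNS = {"we", "they", "themselves", "ourselves"}

def person_number_label(sentence):
    """Label a sentence singular / plural / balanced by counting NN vs.
    NNS tags and the pronoun lists defined above."""
    doc = nlp(sentence)
    singular = sum(tok.tag_ == "NN" or tok.text.lower() in SINGULAR_PRONOUNS
                   for tok in doc)
    plural = sum(tok.tag_ == "NNS" or tok.text.lower() in PLURAL_PRONOUNS
                 for tok in doc)
    if singular > plural:
        return "singular"
    if plural > singular:
        return "plural"
    return "balanced"
```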

A.2 Training Details

VAE architecture

Our VAE has one GRU encoder and one GRU decoder. The encoder has a hidden layer of 256 dimensions, linearly transformed into the content vector of 32 dimensions (for one or two attributes) or 50 dimensions (for three attributes). For the decoder, we set the initial hidden state from the latent code and the attribute vector. Moreover, we use teacher forcing combined with the cyclical word dropout described in Equation 4.
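A minimal PyTorch sketch of these two modules. Two assumptions not stated above: the decoder's initial hidden state is a linear map of the concatenation [z; a], and the word-embedding dimension is 300.

```python
import torch
import torch.nn as nn

class CGAEncoder(nn.Module):
    """GRU encoder: a 256-d hidden state linearly mapped to the mean and
    log-variance of the content vector (z_dim = 32 or 50)."""
    def __init__(self, vocab_size, emb_dim=300, hidden=256, z_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)

    def forward(self, tokens):
        _, h = self.gru(self.emb(tokens))   # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

class CGADecoder(nn.Module):
    """GRU decoder whose initial hidden state is built from [z; a]
    (an assumption; the exact conditioning is elided above)."""
    def __init__(self, vocab_size, attr_dim, emb_dim=300, hidden=256, z_dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.init_h = nn.Linear(z_dim + attr_dim, hidden)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, z, a):
        h0 = torch.tanh(self.init_h(torch.cat([z, a], dim=-1))).unsqueeze(0)
        out, _ = self.gru(self.emb(tokens), h0)
        return self.out(out)   # per-token vocabulary logits
```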

Discriminator

The discriminator is used to create our attribute-free content vectors. We experimented with two architectures for the discriminator, which yielded similar results: a two-layer (64 dimensions each) fully-connected architecture with batch normalization, and a single-layer LSTM with 50 dimensions (for one or two attributes) or 64 dimensions (for three attributes).

KL-Annealing

One of the challenges during the training process was the KL annealing. Similar to bowman2016generating, we used a logistic KL annealing schedule:

$$\beta(s) = \frac{1}{1 + e^{-k (s - s_0)}} \quad (10)$$

where $s$ is the current training step and $s_0$ indicates how many training steps are needed to reach $\beta = 0.5$. $k$ is a constant value given by:

(11)

We use dataset-specific values of $s_0$ and of the constant in Equation 11 for YELP and IMDB.
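A sketch of the logistic schedule in Equation 10; the constant $k$ of Equation 11 and the step $s_0$ are left as parameters, since their values are dataset-specific.

```python
import math

def kl_weight(step, s0, k):
    """Logistic KL-annealing weight of Eq. (10): rises from ~0 to 1 and
    reaches 0.5 at step s0; k controls the steepness (Eq. 11)."""
    return 1.0 / (1.0 + math.exp(-k * (step - s0)))
```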

Figure 6: Similarity matrices for real data and data generated by our CGA model controlling the sentiment and verb tense attributes (panels: Real Data, Generated Data).

Figure 7: Sentence similarity scores computed for real data and data generated by our CGA model on the four intra-class clusters (panels: (a) Negative&Present, (b) Negative&Past, (c) Positive&Present, (d) Positive&Past).
Discriminator Weight

The interaction between the VAE and the discriminator is a crucial factor for our model. We therefore linearly increase the discriminator weight $\lambda_{\mathrm{dis}}$ during the training process according to Equation 12:

$$\lambda_{\mathrm{dis}}(s) = \begin{cases} 0 & s < s_w \\ \min\!\big(\lambda_{\max},\ \lambda_{\max} \tfrac{s - s_w}{s_{\max} - s_w}\big) & s \geq s_w \end{cases} \quad (12)$$

where $\lambda_{\max}$ is the maximum value that $\lambda_{\mathrm{dis}}$ can take, $s_{\max}$ indicates after how many training steps $\lambda_{\mathrm{dis}} = \lambda_{\max}$, $s$ is the current training step, and $s_w$ is the warm-up value indicating after how many training steps $\mathcal{L}_{\mathrm{dis}}$ is included in the objective. We use dataset-specific values of these constants for YELP and IMDB.
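The linear ramp of Equation 12 as a sketch; the parameter names mirror the where-clause above.

```python
def discriminator_weight(step, s_warmup, s_max, lambda_max):
    """Eq. (12): lambda_dis is 0 during warm-up, then grows linearly
    until it reaches lambda_max at step s_max."""
    if step < s_warmup:
        return 0.0
    frac = (step - s_warmup) / max(1, s_max - s_warmup)
    return min(lambda_max, lambda_max * frac)
```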

Word-Dropout

We use Equation 4 with dataset-specific values of the warm-up constant $c$ and the period $T$ for YELP and IMDB.

Optimizer

The Adam optimizer was used for both the VAE and the discriminator kingma2014adam.

Figure 8: Sentence similarity scores computed for real data and data generated by our CGA model on the six cross-class clusters (panels: (a) Negative&Present-Positive&Present, (b) Negative&Present-Positive&Past, (c) Negative&Present-Negative&Past, (d) Negative&Past-Positive&Present, (e) Negative&Past-Positive&Past, (f) Positive&Present-Positive&Past).
Sentence Sentiment
but i’m very impressed with the food and the service is great. Positive
i love this place for the best sushi! Positive
it is a great place to get a quick bite and a great price. Positive
it’s fresh and the food was good and reasonably priced. Positive
not even a good deal. Negative
so i ordered the chicken and it was very disappointing. Negative
by far the worst hotel i have ever had in the life. Negative
the staff was very rude and unorganized. Negative
Table 9: Examples of generated sentences controlling the sentiment attribute.
Sentence Tense
i love the fact that they have a great selection of wines. Present
they also have the best desserts ever. Present
the food is good , but it’s not worth the wait for it. Present
management is rude and doesn’t care about their patients. Present
my family and i had a great time. Past
when i walked in the door , i was robbed. Past
had the best burger i’ve ever had. Past
my husband and i enjoyed the food. Past
Table 10: Examples of generated sentences controlling the verb tense attribute.
Sentence Person
it was a little pricey but i ordered the chicken teriyaki. Singular
she was a great stylist and she was a sweetheart. Singular
worst customer service i’ve ever been to. Singular
this is a nice guy who cares about the customer service. Singular
they were very friendly and eager to help. Plural
these guys are awesome! Plural
the people working there were so friendly and we were very nice. Plural
we stayed here for NUM nights and we will definitely be back. Plural
Table 11: Examples of generated sentences controlling the person number attribute.

A.3 Evaluation

A.3.1 Sentence Embedding Similarities

Following the approach described in Section 3, we report the results of the sentence embedding similarities for the multi-attribute case (sentiment and verb tense). Similarly to the similarity matrices for the single-attribute case, in Figure 6 we recognize the clustered structure of the similarities. These matrices can be divided into the following clusters:

  • Intra-class clusters: the clusters placed along the diagonal of the matrices, showing high cosine similarity scores. They contain similarity scores between the embeddings of samples with the same labels.

  • Cross-class clusters: the clusters located off the diagonal. They contain the similarity scores between embeddings of samples with different labels and accordingly show lower similarity scores.

To gain a qualitative understanding of the generation capacities of our model, we start from the same assumption as in Section 3: an ideal generative model should produce samples with similarity scores comparable to those of the real data. We contrast the similarity scores computed on each cluster separately in the histograms in Figures 7 and 8.

A.3.2 Data Augmentation

For the data augmentation experiments we use a bidirectional LSTM with input size 300 and hidden size 256. We set dropout to 0.8. For training we use early stopping; specifically, we stop the training process after 8 epochs without improvement in the validation loss.

A.3.3 TextCNN

For the sentiment matching we use the pre-trained TextCNN kim2014convolutional. This network uses 100-dimensional GloVe word embeddings pennington2014glove and 3 convolutional layers with 100 filters each. The dropout rate is set to 0.5 during training.

A.4 Generated Sentences

Tables 9 to 11 provide example sentences generated by the CGA model for the three individual attributes.