SAM: Semantic Attribute Modulation for Language Modeling and Style Variation

07/01/2017 ∙ by Wenbo Hu, et al. ∙ Toutiao ∙ Tsinghua University ∙ eBay

This paper presents Semantic Attribute Modulation (SAM) for language modeling and style variation. SAM incorporates various document attributes, such as titles, authors, and document categories. We consider two types of attributes (title attributes and category attributes) and a flexible attribute selection scheme that automatically scores them via an attribute attention mechanism. The semantic attributes are embedded into the hidden semantic space as generation inputs. With the attributes properly harnessed, SAM can generate texts that are interpretable with regard to the input attributes. Qualitative analysis, including word semantic analysis and attention values, shows the interpretability of SAM. On several typical text datasets, we empirically demonstrate that the Semantic Attribute Modulated language model outperforms baselines without attributes under different combinations of document attributes. Moreover, we present style variation for lyric generation using SAM, which shows a strong connection between the style variation and the semantic attributes.




1 Introduction

Language generation is considered a key task in the artificial intelligence field. The language modeling task aims to model the word distributions of text sequences and can be regarded as a degenerate text generation task that generates only one word at each step. Traditional language generation approaches use phrase templates and related generation rules. For the language modeling task, the count-based n-gram method is broadly used. These methods are conceptually simple but, unlike humans, generalize poorly.

Later on, Bengio et al. [bengio2003neural] developed a feed-forward neural network language model and Mikolov et al. [MKB10] used the recurrent neural network (RNN) to train a language model. Benefiting from large-scale corpora and modified gating functions, such as the long short-term memory (LSTM) or the gated recurrent unit (GRU) [CGCB14], the RNN has demonstrated a good capability in modeling word probabilities and is now the most widely used method for language modeling and language generation [MKB10, Gra13]. Nevertheless, the RNN is often criticized for being incapable of capturing long-term dependencies, which results in the loss of important contextual information. It has been shown that RNN language models (RNNLMs) can be enhanced with specific long-term contextual information, including document topics [MZ12, GVS16, DWGP17], bag-of-words contexts [WC16], a neural cache [GJU17], etc. Several specific text structures have also been considered in RNNLMs, such as hierarchical sentence sequences [LLY15], tree-structured texts [TZH16] and dialog contexts [LL17, MBW17].

In the aforementioned models, only the main text sequences were modeled, while the widely accessible attributes of documents were ignored. Interestingly, document attributes implicitly convey global contextual information about the word distributions and are usually available before the main text is read or heard. Document titles are compact abstracts carefully chosen by authors and keynote speakers. Labels and tags are specific categories assigned by experienced editors. Authorships reflect writing styles. With these widely accessible attributes, one can predict word distributions better (see a concrete example in Figure 1).

Figure 1: An AlphaGo news article from the NY Times has several important semantic attributes, such as the title, the author, the category and the dateline.

Moreover, from the generation perspective, several previous works generate the designed outputs from scratch or from a single semantic attribute [LVM15, SHB16, PT16, LGA16, KZC16, RJS17, HYL17]. However, only a few semantic attributes were incorporated at the same time, which is insufficient for the huge complexity of the text generation task. In this paper, we consider a diversity of semantic attributes and use the attention mechanism to combine them into a joint embedding. Hence, the semantic attribute modulation brings a flexible way to generate texts because we can choose different combinations of these attributes. Due to the strong semantic information conveyed by the attributes, the generated texts are interpretable with regard to the different combinations of the input attributes. With this flexibility, we can obtain text style variation by replacing semantic attributes. An interesting example is: Please let Jason Mraz rewrite the lyric ‘Last Kiss’ (a famous song by Taylor Swift).

1.1 Our Proposal

In this paper, we present SAM, the Semantic Attribute Modulation for language modeling and style variation. We consider the widely accessible semantic language attributes and extract their attribute embeddings. Specifically, we adopt two types of semantic attributes: the title attribute and the category attribute. For the title attribute, we use an RNN encoder to get the title embedding. For the category attribute, our model learns a shared embedding from the documents in the specific category. Then, we generate the outputs with an attention mechanism over a diversity of attribute embeddings.

The semantic attribute modulated (SAM) language model obtains better per-word prediction results than the vanilla RNNLM without SAM. The improved word predictions are highly related to the semantic attributes and are therefore interpretable to humans. Moreover, we present a lyric generation task in which lyric variations are derived from semantic attributes. Text generation conditioned on semantic attributes allows flexible attribute selection: by replacing an input attribute with another learned attribute, we obtain an output with a varied style. Interesting examples of lyric style variation further demonstrate the flexibility of SAM.

In summary, our contributions are as follows:

  • We present SAM, a Semantic Attribute Modulation that incorporates a diversity of semantic document attributes as a flexible modulation input for language generation.

  • By incorporating the Semantic Attribute Modulation, our language model gets better word prediction results on several text datasets. The better word predictions are highly related to the semantic attributes and hence are interpretable to humans.

  • Based on our model, we present the stylistic variations of the lyric generation with a fake author attribute, which further demonstrates the flexibility of SAM.

2 Preliminaries

In this section, we first give a concrete example of semantic attributes and then list the related language generation models.

2.1 A concrete example of semantic attributes

We take an AlphaGo news article from the New York Times as a concrete example (Figure 1). Given the title ‘Google’s Computer Program Beats Lee Se-dol in Go Tournament’, the main text words ‘Google’, ‘program’ and ‘Go’ could be predicted more easily. Given the author attribute ‘CHOE SANG-HUN’, a Pulitzer Prize-winning South Korean journalist, we can better predict the words ‘South-Korean’ and ‘Go’. That is to say, the semantic attributes are indicative of different aspects of the text generation, which motivates us to modularize the semantic attributes in text generation models.

2.2 RNN-LM

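Given a sequence of words $x_1, \dots, x_T$, language modeling aims at computing its probability by

$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),$$

where $x_{<t}$ are the words ahead of $x_t$. We can use the recurrent neural network to build the word probabilities [MKB10]. At each time step $t$, the transition gating function $f$ reads one word $x_t$ and updates the hidden state as $h_t = f(h_{t-1}, e_t)$, where $e_t = E^\top x_t$ is the continuous vector representation of the one-hot input vector $x_t$ and $E$ is the embedding matrix. The probability of the next possible word $k$ in the vocabulary is computed by

$$P(x_{t+1} = k \mid x_{\le t}) = \frac{\exp(W_k^\top h_t + b_k)}{\sum_{j=1}^{V} \exp(W_j^\top h_t + b_j)},$$

where $W \in \mathbb{R}^{d \times V}$ and $b \in \mathbb{R}^{V}$ are the affine weights and biases respectively, $V$ is the vocabulary size, and $d$ is the dimension of the hidden state $h_t$. Here the subscript $k$ specifies the $k$-th column of $W$.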
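The next-word distribution and the chain-rule factorization described above can be sketched in a few lines of pure Python. This is only an illustrative sketch: the function names, the toy numbers, and the choice to store the output weights as one vector per vocabulary word are assumptions, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(h, W, b):
    """Distribution over the next word given hidden state h.

    W[k] holds the weight vector of vocabulary word k (assumed layout),
    and b[k] is its bias.
    """
    scores = [sum(w_i * h_i for w_i, h_i in zip(W[k], h)) + b[k]
              for k in range(len(b))]
    return softmax(scores)

def sequence_log_prob(step_probs, word_ids):
    """Chain rule: sum over t of log P(x_t | x_<t).

    step_probs[t] is the predicted distribution at step t and
    word_ids[t] is the word that actually occurred.
    """
    return sum(math.log(p[w]) for p, w in zip(step_probs, word_ids))

# Toy setup (hypothetical numbers): hidden size 2, vocabulary of 3 words.
h_t = [0.5, -1.0]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
probs = next_word_probs(h_t, W, b)
```

With these toy values, word 0 receives the highest score and word 1 the lowest, so the probabilities are ordered accordingly.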

RNN models have often been criticized for their limited capacity to capture long-term sequential dependence, which results in unsatisfactory modeling of contextual information. Several previous works tried to capture contextual information from the preceding contexts. Let $c$ be the contextual representation extracted from the contexts; the generation process of the RNNLM with $c$ is

$$P(x_1, \dots, x_T \mid c) = \prod_{t=1}^{T} P(x_t \mid x_{<t}, c).$$

The context representation $c$ can be extracted as bag-of-words contexts [WC16], latent topics [MZ12], or a neural network embedding [JCK15].

Figure 2: The SAM architecture

3 Semantic Attribute Modulation

Besides their main texts, documents have semantic attributes, such as titles, authorships, tags, and sentiments, which convey important semantic information. In this section, we present SAM, the Semantic Attribute Modulation, which is built on an attention mechanism over a diversity of attributes. Then, we use SAM for language modeling and for style variation in language generation. Given the semantic attribute modulated representation $s_t$, the generative process of our model is $P(x_t \mid x_{<t}, s_t)$, where $x_{<t}$ are the preceding words in the same document.

3.1 Semantic Attributes

Because semantic attributes take different forms, we use two methods to extract their representations.

3.1.1 Title Attributes

The title is often carefully chosen by the author and is a compact abstract of a document. Given an $M$-length title sequence $y_1, \dots, y_M$, we use a recurrent neural network to extract the hidden state of every title word as

$$g_m = f_{\text{title}}(g_{m-1}, y_m), \quad m = 1, \dots, M,$$

where the dimension of the title word hidden state $g_m$ is $d_a$. Since the title words do not contribute equally to the whole context embedding, we use an attention mechanism for the title attribute and obtain a different title representation for each main text word as a weighted sum:

$$a^{\text{title}}_t = \sum_{m=1}^{M} \alpha_{tm} \, g_m,$$

where $\alpha_{tm}$ is the attention value of the title word $y_m$ for the main text word $x_t$,

$$\alpha_{tm} = \frac{\exp(\eta(h_{t-1}, g_m))}{\sum_{m'=1}^{M} \exp(\eta(h_{t-1}, g_{m'}))},$$

$h_{t-1}$ is the hidden state of the previous time step in the main text, and $\eta$ is an attention function that scores how the title word $y_m$ affects the main text word $x_t$.

With this title attention, we automatically learn different importance weights of the title words for each main text word.
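The title attention described above can be sketched as follows. The exact form of the scoring function is not recoverable from the text here, so this sketch substitutes a plain dot product as an assumption; all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot-product score (an assumed stand-in for the attention function)."""
    return sum(a * b for a, b in zip(u, v))

def title_attention(h_prev, title_states, score=dot):
    """Return (weights, context) for one main-text position.

    h_prev       : previous hidden state of the main-text RNN
    title_states : one hidden state per title word
    weights      : attention values over the title words (sum to 1)
    context      : weighted sum of the title hidden states
    """
    weights = softmax([score(h_prev, g) for g in title_states])
    dim = len(title_states[0])
    context = [sum(w * g[i] for w, g in zip(weights, title_states))
               for i in range(dim)]
    return weights, context

# Toy example (hypothetical values): three title words, hidden size 2.
title_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [2.0, 0.0]
weights, context = title_attention(h_prev, title_states)
```

Because the weights are recomputed from `h_prev` at every step, each main-text word sees its own mixture of title words, which is the behavior the paragraph above describes.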

3.1.2 Category Attributes

Category attributes are commonly used in daily writing and speaking. Useful category attributes include document categories, authorships, sentiments, etc. We formulate the category attribute as a one-hot vector $u$, and the embedding of the category attribute is computed via an encoder of the one-hot vector,

$$a^{\text{cat}} = W_c u,$$

where $W_c$ is a weight matrix that maps the one-hot vector to a continuous category embedding. We use the same embedding dimension $d_a$ for the category attributes as for the title embedding.
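As a small illustration of the category encoder, multiplying a weight matrix by a one-hot vector simply selects one column, so in practice the embedding amounts to a table lookup. The matrix values, the sizes, and the function names below are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_category(u, W_c):
    """Multiply the (d_a x C) matrix W_c by the one-hot vector u."""
    return [sum(w * x for w, x in zip(row, u)) for row in W_c]

# Hypothetical embedding matrix: d_a = 2 dimensions, C = 3 categories.
W_c = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]
u = one_hot(1, 3)
a_cat = embed_category(u, W_c)  # selects column 1 of W_c
```

The same lookup applies to any discrete attribute (document category, author, sentiment), each with its own index space.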

3.2 Language Generation and Style Variation with SAM

With the above semantic embedding extractions, we obtain a set of semantic attribute embeddings $\{a_k\}_{k=1}^{K}$. To weigh the importance of each attribute for a main content word $x_t$, we adopt another semantic attribute attention mechanism to learn the semantic attribute embedding for different main text words as

$$s_t = \sum_{k=1}^{K} \beta_{tk} \, a_k, \qquad \beta_{tk} = \frac{\exp(\phi(h_{t-1}, a_k))}{\sum_{k'=1}^{K} \exp(\phi(h_{t-1}, a_{k'}))},$$

where $\phi$ is an attention function that scores how the attribute $a_k$ affects the main text word $x_t$.

We incorporate the obtained semantic attributes into the RNN framework. By using an attribute attention mechanism, the transition of the RNN hidden state reads not only the current word $x_t$ but also the semantic attribute embedding $s_t$. Specifically, we concatenate the semantic attribute embedding $s_t$ and the input word embedding vector $e_t$. Thus, the hidden states update as:

$$h_t = f(h_{t-1}, [e_t; s_t]).$$
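For the recurrent neural network function $f$, we use the gated recurrent unit (GRU) [CGCB14]. The GRU cell has two gates and a single memory cell. With $\tilde{x}_t$ denoting the concatenated input (the word embedding and the semantic attribute embedding), they are updated as:

$$\begin{aligned}
z_t &= \sigma(W_z \tilde{x}_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r \tilde{x}_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh(W_h \tilde{x}_t + U_h (r_t \odot h_{t-1})), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}$$

where $\sigma$ is the sigmoid function and $\odot$ is the Hadamard product. Our model is trained by maximizing the log-likelihood of the corpus, using the back-propagation through time (BPTT) method [Bod02].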
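The following sketch puts the steps of this subsection together for a single time step: attention over the attribute embeddings, concatenation with the word embedding, and one GRU update. It is a minimal illustration under stated assumptions: the dot-product score, the omission of biases, the random toy weights, and all names are hypothetical rather than the authors' implementation.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attend(h_prev, attrs):
    """Attention-weighted sum of attribute embeddings.

    The dot-product score is an assumption; it requires the attribute
    embeddings to have the same size as the hidden state.
    """
    scores = [sum(a * b for a, b in zip(h_prev, attr)) for attr in attrs]
    weights = softmax(scores)
    dim = len(attrs[0])
    return [sum(w * attr[i] for w, attr in zip(weights, attrs))
            for i in range(dim)]

def gru_step(x, h_prev, p):
    """One GRU update with update gate z and reset gate r (no biases)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def init_params(in_dim, hid_dim, seed=0):
    """Small random weights, only for the toy example."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"Wz": mat(hid_dim, in_dim), "Uz": mat(hid_dim, hid_dim),
            "Wr": mat(hid_dim, in_dim), "Ur": mat(hid_dim, hid_dim),
            "Wh": mat(hid_dim, in_dim), "Uh": mat(hid_dim, hid_dim)}

# Toy example (hypothetical values): hidden size 2, word embedding size 2,
# and two attribute embeddings (say, title and author) of size 2.
h_prev = [0.1, -0.2]
e_t = [0.3, 0.7]
attrs = [[1.0, 0.0], [0.0, 1.0]]
s_t = attend(h_prev, attrs)
params = init_params(in_dim=len(e_t) + len(s_t), hid_dim=len(h_prev))
h_t = gru_step(e_t + s_t, h_prev, params)  # list concatenation acts as [e_t; s_t]
```

Swapping one entry of `attrs` (for example, the author embedding) changes `s_t` and therefore every subsequent hidden state, which is the mechanism that style variation relies on.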

As can be seen in Figure 2, we build a Semantic Attribute Modulated language generation model. Semantic attributes can be considered as the inputs that shape the generated outputs. By comparing the semantic attributes, users can interpret the corresponding outputs. Moreover, since some attributes reflect text styles, we realize text style variation by replacing an attribute with another related one. We give some generated variations of typical lyrics in the experiment part.

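| | PTB | BBCNews | IMDB | TTNews | XLyrics |
|---|---|---|---|---|---|
| #training docs | - | 1,780 | 75k | 70k | 3.6k |
| #training tokens | 923k | 890k | 21m | 30m | 118k |
| #vocabulary | 10k | 10k | 30k | 40k | 3k |
| attribute(s) | Category | Title+Category | Title | Author+Title | Author+Title |
| hidden size | 200 | 1,000 | 1,000 | 1,000 | 1,000 |

Table 1: Statistics and parameters of PTB, BBCNews, IMDB, TTNews, and XLyrics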

4 Discussions and Related Work

Neural Machine Translation. Neural machine translation (NMT) uses an encoder-decoder network to generate a specific response [CVMG14]. In NMT, the encoder network reads source texts in one language and encodes them into continuous embeddings; the decoder network then translates them into another language. NMT has also been used to generate poems after encoding some keywords [WHW16], which is similar to our work in that texts are generated from useful attributes. The difference is that our work uses a semantic attribute attention modulation to extract the semantic embedding instead of an encoder-decoder framework.

Contextual RNN. Our work is related to several contextual language modeling works. In [HCH16], the titles and the keywords were represented as bag-of-words and used to build a conditional RNNLM. However, that work involved only text attributes and could not model discrete attributes. Discrete attributes, such as review ratings and document categories, were also used to control content generation [TYC16]. A variational auto-encoder based model with a generator-discriminator scheme was also used for generating controllable texts [HYL17], but its input attributes are limited to discrete categories.

Our approach has several major advantages over the above methods. First, we adopt a more diverse attribute set, including the widely used category attributes; the semantic information makes SAM interpretable. Second, we use a better attribute representation method that includes a semantic attention mechanism, which gives the model flexibility. Third, by replacing the semantic attributes, our model realizes style variation for lyric generation.

| Category | Change | Words |
|---|---|---|
| politics | Improved | to, be, ireland, bush, one, chairman, fiscal, week, in, or, plan |
| politics | Alike | both, general, many, both, but, is, N, in, been, the, said |
| politics | Worse | of, gm, stock, orders, law, jerry |
| finance | Improved | exchange, share, group, third-quarter, soared, from, is, profit |
| finance | Alike | N, of, days, had, than, month, share, were, yield |
| finance | Worse | reported, analysis, all, yield, vehicles, economics, gm, currently |

Table 2: Words in the documents of the politics and finance categories whose predictions are improved, alike, and worse after adding the category attribute

Figure 3: An example of alignment matrix from SAM-title (Best viewed in color)

5 Experiments

In this section, we first show that the Semantic Attribute Modulated language model gets better word predictions. The extensive qualitative analyses demonstrate the interpretability of the word predictions with regard to the input attributes. We then give several examples of the lyric style variation with SAM, which shows the flexibility of SAM.

5.1 Datasets

We evaluate the proposed language model with semantic attribute attention on five datasets with different attribute combinations. Among these datasets, TTNews, XLyrics and the titles of IMDB were collected by ourselves. We plan to release the collected corpora after resolving the copyright issues. For detailed statistics, see Table 1.

Penn TreeBank (PTB). Penn TreeBank (PTB) is a commonly used dataset for evaluating language models, and its texts are derived from the Wall Street Journal (WSJ). We use the corpus preprocessed by Mikolov et al. [mikolov2011extensions], which has 929k training tokens and a vocabulary of size 10k. (We use the LDA topic model to analyze the PTB corpus with the topic number set to 5 and assign each document the topic with the largest weight as its label. The analysis of this category attribute and more discussions can be seen in Appendix A.)

BBCNews. BBCNews is a formal English news dataset and contains 2,225 BBC news articles collected by [griffiths2004integrating]. (The BBCNews documents have 5 class labels: business, entertainment, politics, sport and technology.)

IMDB Movie Reviews (IMDB). IMDB Movie Reviews (IMDB) is a movie review corpus [MDP11] with 75k training reviews and 25k testing reviews (dataset page: ~amaas/data/sentiment/). Note that Maas et al. [MDP11] did not provide review titles, so we collected the titles according to the provided web links.

TTNews. TTNews is a Chinese news dataset crawled from several major Chinese media outlets. TTNews has 70,000 news articles with 30 million tokens and a vocabulary of size 40k. Each document contains the title and author annotations.

XLyrics. XLyrics is a Chinese pop music lyric dataset crawled from the web. XLyrics has 4k lyrics, about 118k tokens and a vocabulary of size 3k.

5.2 Experimental Settings

We consider several variants of the proposed methods with different combinations of semantic attributes. In detail, we consider the language modeling with a) a category attribute, b) a title attribute and c) a title attribute plus a category attribute. In order to realize the style variation of the generations, we consider generating lyrics with an original title attribute and a fake author attribute.

We train a recurrent language model without any side information as a baseline method. We also report the results of a count-based n-gram model with Kneser-Ney smoothing [CG96, Hea11].

For training, we use the Adam method [KB14] with an initial learning rate of 0.001 to maximize the log-likelihood, and we use early stopping based on the validation log-likelihood. The dimension of the word embedding is set to be the same as the hidden size of the RNN. The detailed parameter settings for each dataset are listed in Table 1.
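Early stopping on the validation log-likelihood can be sketched as below. The patience parameter and the function name are assumptions introduced for illustration; the text does not state how many non-improving epochs are tolerated.

```python
def early_stopping_epoch(val_log_likelihoods, patience=2):
    """Return the index of the epoch whose model would be kept.

    Training stops once the validation log-likelihood has failed to
    improve for `patience` consecutive epochs (patience is an assumed
    hyperparameter, not taken from the paper).
    """
    best_epoch, best_ll, bad_epochs = 0, float("-inf"), 0
    for epoch, ll in enumerate(val_log_likelihoods):
        if ll > best_ll:
            best_epoch, best_ll, bad_epochs = epoch, ll, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# Hypothetical validation curve (higher log-likelihood is better).
curve = [-5.0, -4.0, -4.2, -4.1, -3.0]
kept = early_stopping_epoch(curve, patience=2)
```

With a patience of 2 this curve stops after epoch 3 and keeps epoch 1, whereas a patience of 3 would continue and keep epoch 4, which shows how sensitive the selected model can be to this setting.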

| Method | PTB | BBCNews |
|---|---|---|
| 5-Gram | 141.2 | 131.1 |
| RNN | 117.1 | 76.7 |
| SAM-Cat | 113.5 | 73.8 |

Table 3: Corpus-level perplexity with the category attribute on (a) Penn TreeBank and (b) BBCNews

| Attributes Source | Method | BBCNews | IMDB | TTNews | XLyrics |
|---|---|---|---|---|---|
| Main Texts Only | 5-Gram | 131.1 | 124.6 | 136.7 | 8.13 |
| | RNN | 76.7 | 62.6 | 120.1 | 7.56 |
| +Titles | RNN-State | 72.2 | 61.0 | 118.2 | 8.20 |
| | RNN-BOW | 72.2 | 61.8 | 118.4 | 8.18 |
| | SAM-Title-Att | 71.3 | 61.3 | 118.3 | 7.56 |
| | SAM-Title-Att-State | 72.5 | 60.9 | 118.1 | 7.23 |
| +Titles+Authors | SAM-Title-Au-Att | - | - | 114.1 | 7.08 |
| | SAM-Title-State-Au-Att | - | - | 113.4 | 6.84 |

Table 4: Corpus-level perplexity with the title attribute and the title plus author attributes on BBCNews, IMDB, TTNews, and XLyrics

5.3 Language Modeling Word Predictions

We first show that the Semantic Attribute Modulated language model gets better word predictions. Then we give some qualitative analysis to show the interpretability of SAM.

5.3.1 Language Modeling with Category-Attribute

Document categories are indicative of the topics discussed and therefore of the distribution over words. We first consider language modeling with the category attribute on two corpora, PTB and BBCNews. For the PTB dataset, we use the LDA topic model to analyze the semantic information, and for every document we set the category to the topic that has the largest weight in LDA. The details of the PTB dataset pre-processing can be seen in Appendix A. For the BBCNews dataset, we use the provided news category labels as a discrete category attribute.

In Table 3, 5-Gram represents the count-based 5-gram model [CG96], RNN represents the conventional RNN model without any semantic attribute and SAM-Cat is our SAM model with a category attribute. As the results show, by adding a semantic category attribute, SAM-Cat outperforms the baseline models, achieving lower perplexities on both corpora.
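For reference, perplexity is the exponential of the negative mean log-probability per token. The sketch below pools tokens over all documents, which is one reading of "corpus-level" and is an assumption here; the function name is hypothetical.

```python
import math

def corpus_perplexity(token_log_probs):
    """Perplexity over a corpus.

    token_log_probs is a list of documents, each a list of natural-log
    probabilities that the model assigned to the observed tokens.
    Tokens are pooled across documents before averaging (assumed).
    """
    n_tokens = sum(len(doc) for doc in token_log_probs)
    total_log_prob = sum(sum(doc) for doc in token_log_probs)
    return math.exp(-total_log_prob / n_tokens)

# Sanity check with a model that gives every token probability 1/4.
uniform_docs = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
ppl = corpus_perplexity(uniform_docs)
```

A model that spreads its probability uniformly over four choices has perplexity 4, so lower values in the tables mean the model is effectively choosing among fewer candidate words.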

5.3.2 Language Modeling with Title-Attribute

Document titles are carefully chosen by the authors to summarize the document contents and attract the attention of readers. In this part, we incorporate the title attribute to take advantage of the implicit word distribution represented by the title. We use four corpora for this task: BBCNews and TTNews are two formally published news corpora, IMDB is a movie review corpus and XLyrics is a lyric corpus.

We implement the 5-gram model and the conventional RNN model on the corpus without titles. RNN-State is the conventional RNNLM model with the title’s last hidden state as initialization. This means the title is considered as the first sentence but is not included in the prediction of per-word perplexities. RNN-BOW is the conventional RNNLM model incorporated with a bag-of-words representation of the title at each time step, which is a re-implementation of [HCH16]. The SAM-Title-Att method is the SAM model with the title attribute and the attention mechanism. By adding the title’s last hidden state to SAM-Title-Att as initialization, we get the SAM-Title-Att-State method.

We show the word prediction perplexity results in Table 4. The RNN-based models with the title embedding have better perplexity results. Moreover, SAM-Title is better than RNN-State because title information that is added only at initialization fades after several nonlinear gating functions. The attention-based title attribute performs better than the one without attention, because the attention mechanism provides different importance weights for the title words.

Generally, our SAM model with the title attribute performs better on BBCNews than on IMDB. We believe the result is caused by the different genres of these datasets. For the title attribute to be useful, titles should convey refined summaries of documents. BBCNews, as a formal news corpus written by professional journalists, usually has titles of higher quality than the IMDB corpus.

5.3.3 Language Modeling with Title-Author-Attribute

In this part, we incorporate two different attributes, title and author. We will demonstrate that these two attributes are complementary.

We use the semantic attribute attention to combine the two attributes. The suffix ‘Au’ means that the method incorporates the author categorical attribute; otherwise we keep the method notations used in the previous part. We show the word prediction perplexity results of several attribute combinations in Table 4. For the TTNews and XLyrics datasets, we can see that incorporating both the title and author attributes is better than using a single one.

Figure 4: Generated lyrics with the same title but a fake authorship. The original lyric is of the country style (left) and the generated lyric with a fake author is of the pop rock style (right).
Figure 5: Generated lyrics with the same title but a fake authorship. The original lyric is sentimental (left) and the generated lyric with a fake author is cheerful (right).

5.3.4 Qualitative Analysis on Interpretability of SAM

To understand why SAM-Cat outperforms traditional methods on the PTB dataset in Table 3, we list the words in each category with the largest and the least perplexity changes in Table 2. We mark in bold the words that carry strong semantic information for each specific category. For example, in the politics category, after adding the category attribute, the words with the largest prediction improvement are generally related to politics, such as ‘Ireland’, ‘bush’ and ‘chairman’. The words with the largest prediction degradation generally carry semantic meaning that is not related to politics, such as ‘gm’, ‘stock’ and ‘orders’. The words with the least prediction change are generally function words, such as ‘both’, ‘many’ and ‘but’. The word prediction changes in other categories are similar to those in the politics category. We put the results of the finance category in Table 2 and show the results of other categories in Appendix B due to the space limit.
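An analysis of this kind can be sketched by comparing, word by word, the log-probabilities assigned by the baseline and by the attribute-conditioned model. The sketch below ranks words by their mean gain in log-probability rather than by perplexity change, which is a simplifying assumption; the function names and toy numbers are hypothetical.

```python
from collections import defaultdict

def word_changes(tokens, base_log_probs, sam_log_probs):
    """Mean per-word gain in log-probability (conditioned model minus baseline)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, base, sam in zip(tokens, base_log_probs, sam_log_probs):
        sums[tok] += sam - base
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

def summarize(changes, k=1):
    """Return the k most improved, least changed, and most worsened words."""
    ranked = sorted(changes, key=changes.get)
    improved = ranked[-k:][::-1]
    worse = ranked[:k]
    alike = sorted(changes, key=lambda w: abs(changes[w]))[:k]
    return improved, alike, worse

# Toy token stream with hypothetical log-probabilities from two models.
tokens = ["bush", "the", "stock", "bush", "the"]
base = [-6.0, -2.0, -5.0, -7.0, -2.1]
sam = [-4.0, -2.0, -6.0, -5.0, -2.0]
improved, alike, worse = summarize(word_changes(tokens, base, sam))
```

In this toy stream the topical word gains the most, the function word barely moves, and the off-topic content word gets worse, mirroring the three groups discussed above.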

To further investigate how attention values control the importance weights of the attributes, we visualize some of the attention values in Figure 3. The color depth shows the attention weights. The red rectangles show that the title word ‘Microsoft’ has a large effect on the content words ‘software’ and ‘unauthorized’. The title word ‘move’ has a large effect on the content word ‘prove’. This example shows that the attention mechanism works as a flexible selection of the attributes.

5.4 Flexible Style Variation with SAM

Many downstream applications of the language modeling can be enhanced with the proposed semantic attributes. For machine translation, the semantic attributes could also be titles, authors, and categories. For the speech recognition task, the semantic attributes include the age and the dialect of the speaker. For language generation tasks, such as the question-answering and the poem/lyric generation, the possible attributes are titles, authors, and even styles.

We use the SAM model to perform lyric generation using both the title and author attributes. Given an original lyric, we generate a new one with the same title but a fake author. We obtain several interesting generation results, and the differences between the two lyrics are highly related to the replaced author attribute. Here we give two concrete examples (one in English and the other in Chinese) and leave more examples to Appendix C.

For the English example in Fig. 4: the original lyric ‘Last Kiss’ is a popular song by Taylor Swift in the pop country style. After changing the authorship to Jason Mraz, we generate a new love song that looks like a rock lyric. The styles of the two lyrics match the styles of the two singers.

For the Chinese example in Fig. 5: the original lyric ‘Your Face’ is a sentimental love song written by Xiaotian Wang that recalls a past love. After changing the authorship to Lovely Sisters, a trending Chinese band, we generate a joyful love song about the happiness of falling in love.

6 Conclusion

In this paper, we propose SAM, the semantic attribute modulation for language modeling and style variation. The main idea is to take advantage of widely accessible and meaningful attributes to generate interpretable texts. Our model adopts a diversity of semantic attributes including titles, authors, and categories. With the attention mechanism, our model automatically scores the attributes in a flexible way and embeds the attribute representations into the hidden feature space as the inputs of the generation model. The diversity of the input attributes makes the model more powerful and interpretable, and the semantic attribute attention mechanism brings flexibility to the whole model. Extensive experiments demonstrate the effectiveness and the interpretability of our flexible Semantic Attribute Modulated language generation model.

In the future, we are interested in exploring more attributes which have semantic meaning for the language model task. In addition to the lyric generation task, other language generation tasks can also use our SAM model to utilize more semantic attributes. One possible example is to incorporate the geographic position attribute into the speech recognition task to model the dialects.


Appendix A: Data Preparation of PTB

PTB is a commonly used benchmark corpus for the language modeling task. We use the LDA topic model to extract semantic category attributes. Admittedly, adding a pseudo-category derived from the text may seem questionable for language modeling, since the model appears to see the words in advance and then predict them. We argue that the pseudo-category makes sense in the language modeling evaluation for the following two reasons. First, we only add one discrete assignment for each document, so no straightforward word distribution information is propagated. Second, the category assignments carry strong semantic information, and real category assignments are available for other datasets. The semantic analysis is as follows.

For the PTB dataset, we set the topic number to 5 and take the topic with the largest weight as each document’s category assignment. As can be seen in Table 5, topic 0 focuses on corporate finance, topic 1 on politics, topic 2 on the stock market, topic 3 on managers, and topic 4 on daily news.

| Topic | Top words |
|---|---|
| 0 | million billion share year company cents stock sales income revenue bonds profit corp. |
| 1 | its mr. federal company u.s. new government state court plan officials bill house |
| 2 | market stock trading prices stocks investors new price big index friday rates markets traders |
| 3 | its company mr. inc. new co. corp. president chief executive says group chairman business vice |
| 4 | mr. says when people years new time president work first few think good want city know back |

Table 5: Top words of topics extracted from the PTB dataset
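The category assignment described in this appendix reduces to an argmax over each document's topic weights. The sketch below assumes the per-document topic weights have already been produced by some LDA implementation; the weights shown and the function name are hypothetical.

```python
def assign_categories(doc_topic_weights):
    """For each document, return the index of its highest-weight topic."""
    return [max(range(len(weights)), key=weights.__getitem__)
            for weights in doc_topic_weights]

# Hypothetical topic weights for two documents over 5 topics.
doc_topic_weights = [
    [0.10, 0.60, 0.10, 0.10, 0.10],  # mostly topic 1
    [0.50, 0.20, 0.10, 0.10, 0.10],  # mostly topic 0
]
categories = assign_categories(doc_topic_weights)
```

Each resulting index can then be turned into a one-hot vector and used as the discrete category attribute of the document.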

Appendix B: More word predictions of the SAM-Cat on PTB dataset

In this part, we show more word predictions of our SAM-Cat model on the PTB dataset. We show that after adding the category attribute, the word prediction improvements are semantically meaningful. Table 6 gives the results for the categories ‘stock’ and ‘managements’. We mark in bold the words that carry strong semantic information for each specific category. After adding the category attribute, the words with the largest prediction improvement are generally related to the category information. The words with the largest prediction degradation generally carry semantic meaning that is not related to the category information. The words with the least prediction change are generally function words.

| Category | Change | Words |
|---|---|---|
| stocking | Improved | co, operating, an, markets, considered, commercial, stake |
| stocking | Alike | N, usa, the is, these, discussion, at, the, chicken |
| stocking | Worse | offering, million, money, read, communications, lines, issues, city |
| management | Improved | market, about, results, orders, trading, dow, portfolio, price, market |
| management | Alike | N, likely, of, prepared, southeast, futures, see, group, the |
| management | Worse | bear, totaled, optimistic, executive, chief, manufacturers, about |

Table 6: Words in the documents of the stocking and management categories whose predictions are improved, alike, and worse after adding the category attribute

Appendix C: More Lyric Generation Variations

In this part, we give several lyric generation examples, two in English and one in Chinese. We observe that if the two authors have stylistically different content in the training data, the generated lyrics are very likely to differ in style. In the following figures, we give the detailed generations and the corresponding analyses in the figure captions.

Figure 6: Generated lyrics with the same title but a fake authorship. The original lyric is narrative (left) and the generated lyric with a fake author is whispering and piteous (right).
Figure 7: Generated lyrics with the same title but a fake authorship. The original lyric is complaining (left) and the generated lyric with a fake author has a bystander view (right).
Figure 8: Generated lyrics with the same title but a fake authorship. The original lyric is retro (left) and the generated lyric with a fake author is modern (right).