Language generation is considered a key task in the field of artificial intelligence [RD00]. The language modeling task aims to model the word distributions of text sequences and can be viewed as a degenerate text generation task that generates only one word at each step. Traditional language generation approaches use phrase templates and related generation rules. For the language modeling task, the count-based n-gram method is broadly used. These methods are conceptually simple but generalize poorly compared with humans.
Later on, Bengio et al. [BDVJ03] developed a feed-forward neural network language model and Mikolov et al. [MKB10] introduced a language model based on the recurrent neural network (RNN). With gated units such as the long short-term memory (LSTM) [HS97] or the gated recurrent unit (GRU) [CGCB14], the RNN has demonstrated a good capability in modeling word probabilities and is now the most widely used method for language modeling and language generation [MKB10, Gra13]. Nevertheless, the RNN is often criticized for being incapable of capturing long-term dependencies, which results in the loss of important contextual information. It has been shown that RNN language models (RNNLMs) can be enhanced with specific long-term contextual information, including document topics [MZ12, GVS16, DWGP17], bag-of-words contexts [WC16], and a neural cache [GJU17]. Several specific text structures have also been considered in RNNLMs, such as hierarchical sentence sequences [LLY15], tree-structured texts [TZH16] and dialog contexts [LL17, MBW17].
In the aforementioned models, only the main text sequences were modeled, while the widely available attributes of documents were ignored. Interestingly, document attributes implicitly convey global contextual information about the word distributions and are available before the main text is read, both in daily reading and in speaking. Document titles are compact abstracts carefully chosen by authors and keynote speakers. Labels and tags are specific categories assigned by experienced editors. Authorships reflect writing styles. With these attributes, one can predict word distributions better (see a concrete example in Figure 1).
Moreover, from the generation perspective, several previous works generate the designed outputs from scratch or from a single semantic attribute [LVM15, SHB16, PT16, LGA16, KZC16, RJS17, HYL17]. However, these works incorporate only a few semantic attributes at the same time, which is insufficient for the complexity of the text generation task. In this paper, we consider a diversity of semantic attributes and use an attention mechanism to combine them into a joint embedding. Semantic attribute modulation therefore offers a flexible way to generate texts, because we can choose different combinations of these attributes. Owing to the strong semantic information conveyed by the attributes, the generated texts are interpretable with regard to the different combinations of input attributes. With this flexibility, we can vary the style of a text by replacing its semantic attributes. An interesting example is the request "Please let Jason Mraz rewrite the lyric ‘Last Kiss’" (a famous song by Taylor Swift).
1.1 Our Proposal
In this paper, we present SAM, the Semantic Attribute Modulation for language modeling and style variation. We consider widely available semantic language attributes and extract attribute embeddings from them. Specifically, we adopt two types of semantic attributes: the title attribute and the category attribute. For the title attribute, we use an RNN encoder to get the title embedding. For the category attribute, our model learns a shared embedding from the documents in the specific category. Then, we generate the outputs with an attention mechanism over a diversity of attribute embeddings.
The semantic attribute modulated (SAM) language model obtains better per-word prediction results than the vanilla RNNLM without SAM. The improved word predictions are highly related to the semantic attributes and are therefore interpretable to humans. Moreover, we present the lyric generation task with lyric variation derived from semantic attributes. Text generation conditioned on semantic attributes allows a flexible attribute selection: by replacing an input attribute with another learned attribute, we obtain an output with a varied style. Interesting examples of lyric style variation further demonstrate the flexibility of SAM.
In summary, our contributions are as follows:
- We present SAM, a Semantic Attribute Modulation, which incorporates a diversity of semantic document attributes as a flexible modulation input for language generation.
- By incorporating the Semantic Attribute Modulation, our language model gets better word prediction results on several text datasets. The improved word predictions are highly related to the semantic attributes and hence are interpretable to humans.
- Based on our model, we present stylistic variations of lyric generation with a fake author attribute, which further demonstrates the flexibility of SAM.
In this section, we first give a concrete example of semantic attributes and then list the related language generation models.
2.1 A concrete example of semantic attributes
We take an AlphaGo news article from the New York Times as a concrete example (Figure 1). Given the title ‘Google’s Computer Program Beats Lee Se-dol in Go Tournament’, the main text words ‘Google’, ‘program’ and ‘Go’ can be predicted more easily. Given the author attribute ‘CHOE SANG-HUN’, a Pulitzer Prize-winning South Korean journalist, we can better predict the words ‘South-Korean’ and ‘Go’. That is to say, the semantic attributes are indicative of different aspects of the text generation, which motivates us to modularize the semantic attributes in text generation models.
Given a sequence of words $x = (x_1, \dots, x_T)$, language modeling aims at computing its probability by

$$P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),$$

where $x_{<t} = (x_1, \dots, x_{t-1})$ are the words ahead of $x_t$. We can use the recurrent neural network to build the word probabilities [MKB10]. At each time step $t$, the transition gating function $f$ reads one word $x_t$ and updates the hidden state as $h_t = f(h_{t-1}, E x_t)$, where $E x_t$ is the continuous vector representation of the one-hot input vector $x_t$ and $E$ is the embedding matrix. The probability of the next possible word $k$ in the vocabulary $V$ is computed by

$$P(x_{t+1} = k \mid x_{\le t}) = \frac{\exp(W_k^\top h_t + b_k)}{\sum_{k' \in V} \exp(W_{k'}^\top h_t + b_{k'})},$$

where $W \in \mathbb{R}^{d_h \times |V|}$ and $b \in \mathbb{R}^{|V|}$ are the affine weights and biases respectively, and $d_h$ is the dimension of the hidden state $h_t$. Here the subscript $k$ selects the $k$-th column of $W$ and the $k$-th entry of $b$.

RNN models have often been criticized for their limited capacity to capture long-term sequential dependence, which results in unsatisfactory modeling of contextual information. Several previous works tried to capture the contextual information using the previous contexts. Let $c$ be the contextual representation extracted from the contexts; the generation process of the RNNLM with $c$ is

$$P(x \mid c) = \prod_{t=1}^{T} P(x_t \mid x_{<t}, c).$$
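The next-word softmax and the chain-rule factorization described in this section can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names (`softmax`, `next_word_probs`, `sequence_log_prob`) and the toy numbers are assumptions made for the example, not part of the original model code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(hidden, weights, biases):
    """Distribution over the vocabulary given one hidden state.

    weights[k] is the weight vector of vocabulary word k (one column of the
    affine map) and biases[k] is its bias.
    """
    logits = [sum(w_i * h_i for w_i, h_i in zip(w_k, hidden)) + b_k
              for w_k, b_k in zip(weights, biases)]
    return softmax(logits)

def sequence_log_prob(step_distributions, word_ids):
    """Chain rule: log P(x) = sum over t of log P(x_t | x_<t)."""
    return sum(math.log(dist[w]) for dist, w in zip(step_distributions, word_ids))

# Toy example: 3-word vocabulary, 2-dimensional hidden state (made-up values).
probs = next_word_probs([0.5, -1.0],
                        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                        [0.0, 0.0, 0.0])
```

In a real model the per-step distributions would come from the RNN hidden states; here they are passed in directly so that the factorization itself stays visible.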
3 Semantic Attribute Modulation
Besides their main texts, documents have semantic attributes, such as titles, authorships, tags, and sentiments, which convey important semantic information. In this section, we present SAM, the Semantic Attribute Modulation, which is built on an attention mechanism over a diversity of attributes. Then, we use SAM for language modeling and for style variation in language generation. Given the semantic attribute modulated representation $s_t$, the generative process of our model is $P(x_t \mid x_{<t}, s_t)$, where $x_{<t}$ are the preceding words in the same document.
3.1 Semantic Attributes
Because the semantic attributes take different forms, we use two methods to extract their representations.
3.1.1 Title Attributes
The title is often carefully chosen by the author and is a compact abstract of a document. Given a $K$-length title sequence $y = (y_1, \dots, y_K)$, we use a recurrent neural network $f_{\text{title}}$ to extract the hidden state of every title word as

$$g_k = f_{\text{title}}(g_{k-1}, y_k), \quad k = 1, \dots, K,$$

where the dimension of the title word hidden state $g_k$ is $d_s$. Since the title words do not contribute equally to the whole context embedding, we use an attention mechanism for the title attribute and obtain a different title representation $s_t^{\text{title}}$ for each main text word $x_t$ as a weighted sum:

$$s_t^{\text{title}} = \sum_{k=1}^{K} \alpha_{t,k} \, g_k,$$

where $\alpha_{t,k}$ is the attention value of the title word $y_k$ for the main text word $x_t$,

$$\alpha_{t,k} = \frac{\exp\big(a(h_{t-1}, g_k)\big)}{\sum_{j=1}^{K} \exp\big(a(h_{t-1}, g_j)\big)},$$

$h_{t-1}$ is the hidden state of the previous time step in the main text and $a$ is a learned attention function which scores how the title word $y_k$ affects the main text word $x_t$.

With this title attention, we automatically learn different importance weights of the title words for each main text word.
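The title attention above amounts to a softmax over per-title-word scores followed by a weighted sum of the title hidden states. The sketch below illustrates this with the Python standard library only. The dot-product scoring function is an assumption chosen for brevity (the scoring function in the model is learned), and the names `attend`, `prev_hidden` and `title_states` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists (assumed scoring function)."""
    return sum(a * b for a, b in zip(u, v))

def attend(prev_hidden, title_states, score=dot):
    """Return (weights, context) for one main-text position.

    weights: softmax over score(prev_hidden, title_state) for each title word.
    context: the attention-weighted sum of the title hidden states.
    """
    scores = [score(prev_hidden, g) for g in title_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(title_states[0])
    context = [sum(w * g[i] for w, g in zip(weights, title_states))
               for i in range(dim)]
    return weights, context

# Toy example: three title words with 2-dimensional hidden states.
title_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], title_states)
```

Because the weights are recomputed from the previous main-text hidden state at every step, each main text word receives its own title representation. The same routine could be reused for attention over a set of attribute embeddings by passing those embeddings in place of `title_states`.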
3.1.2 Category Attributes
Category attributes are commonly used in daily writing and speaking. Useful category attributes include document categories, authorships, sentiments, etc. We formulate the category attribute as a one-hot vector $u$, and the embedding of the category attribute is computed via an encoder of the one-hot vector,

$$s^{\text{cat}} = W_{\text{cat}} \, u,$$

where $W_{\text{cat}}$ is a weight matrix which maps the one-hot vector to a continuous category embedding. We use the same embedding dimension $d_s$ for the category attributes as for the title embedding.
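Multiplying a weight matrix by a one-hot vector simply selects one column of that matrix, which is why a category embedding can be implemented as a table lookup. A small sketch, with made-up numbers and hypothetical helper names (`one_hot`, `embed`):

```python
def one_hot(index, size):
    """One-hot list of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(weight, one_hot_vec):
    """Matrix-vector product; `weight` has shape (embedding_dim, num_categories)."""
    return [sum(w * x for w, x in zip(row, one_hot_vec)) for row in weight]

# Hypothetical 2-dimensional embeddings for 3 categories (one per column).
weight = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6]]
embedding = embed(weight, one_hot(1, 3))
column = [row[1] for row in weight]  # direct lookup of the same column
```

Here `embedding` and `column` hold the same values, so in practice the product is usually replaced by the lookup.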
3.2 Language Generation and Style Variation with SAM
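With the above semantic embedding extractions, we obtain a set of semantic attribute embeddings $\{s^{(1)}, \dots, s^{(M)}\}$. To weigh the importance of each attribute for a main content word $x_t$, we adopt another semantic attribute attention mechanism to learn the semantic attribute embedding $s_t$ for different main text words as

$$s_t = \sum_{m=1}^{M} \beta_{t,m} \, s^{(m)}, \qquad \beta_{t,m} = \frac{\exp\big(a_s(h_{t-1}, s^{(m)})\big)}{\sum_{j=1}^{M} \exp\big(a_s(h_{t-1}, s^{(j)})\big)},$$

where $a_s$ is an attention function which scores how the attribute $s^{(m)}$ affects the main text word $x_t$.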
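We incorporate the obtained semantic attributes into the RNN framework. By using an attribute attention mechanism, the transition of the RNN hidden state reads not only the current word $x_t$ but also the semantic attribute embedding $s_t$. Specifically, we concatenate the semantic attribute embedding $s_t$ and the input word embedding vector $E x_t$. Thus, the hidden states update as:

$$h_t = f(h_{t-1}, [E x_t; s_t]).$$

For the recurrent neural network function $f$, we use the gated recurrent unit (GRU) [CGCB14]. The GRU cell has two gates and a single memory cell. Writing $\tilde{x}_t = [E x_t; s_t]$, they are updated as:

$$\begin{aligned}
z_t &= \sigma(W_z \tilde{x}_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r \tilde{x}_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh\big(W_h \tilde{x}_t + U_h (r_t \odot h_{t-1})\big), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}$$

where $\sigma$ is the sigmoid function and $\odot$ is the Hadamard product. Our model is trained by maximizing the log-likelihood of the corpus, using the back-propagation through time (BPTT) method [Bod02].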
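To make the update concrete, the following sketch performs one GRU step whose input is the concatenation of a word embedding and an attribute embedding. It follows the standard GRU formulation of [CGCB14] with biases omitted; the parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`) and the helper functions are assumptions for illustration, and a practical implementation would use a tensor library rather than Python lists.

```python
import math

def matvec(matrix, vec):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(prev_hidden, word_emb, attr_emb, params):
    """One GRU update with input [word_emb; attr_emb].

    params maps names to matrices: Wz, Wr, Wh act on the concatenated input,
    Uz, Ur, Uh act on the hidden state. Biases are omitted for brevity.
    """
    x = word_emb + attr_emb  # list concatenation
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wz"], x), matvec(params["Uz"], prev_hidden))]
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], x), matvec(params["Ur"], prev_hidden))]
    gated = [ri * hi for ri, hi in zip(r, prev_hidden)]  # Hadamard product
    cand = [math.tanh(a + b) for a, b in
            zip(matvec(params["Wh"], x), matvec(params["Uh"], gated))]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, prev_hidden, cand)]

# Toy check with all-zero parameters: both gates equal 0.5 and the candidate
# state is 0, so the new hidden state is half of the previous one.
zeros_in = [[0.0] * 3 for _ in range(2)]   # input dim = 2 (word) + 1 (attribute)
zeros_hid = [[0.0] * 2 for _ in range(2)]  # hidden dim = 2
params = {"Wz": zeros_in, "Wr": zeros_in, "Wh": zeros_in,
          "Uz": zeros_hid, "Ur": zeros_hid, "Uh": zeros_hid}
new_hidden = gru_step([1.0, -2.0], [0.3, 0.7], [0.5], params)
```

Replacing `attr_emb` while keeping the word sequence fixed changes every hidden state, which is the mechanism that allows attribute replacement to alter the generated style.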
As can be seen in Figure 2, we build a Semantic Attribute Modulated language generation model. Semantic attributes can be considered as the inputs that steer the generated outputs. By comparing the semantic attributes, the corresponding outputs are interpretable to users. Moreover, considering that some attributes reflect text styles, we realize text style variation by replacing an attribute with another related one. We give some generated variations of typical lyrics in the experiment part.
4 Discussions and Related Work
Neural Machine Translation. Neural machine translation (NMT) uses an encoder-decoder network to generate a specific response [CVMG14]. In NMT, the encoder network reads source texts in one language and encodes them into continuous embeddings. Then the decoder network translates them into another language. NMT has also been used to generate poems after encoding some keywords [WHW16]. This is similar to our work in that texts are generated from some useful attributes. The difference is that our work uses a semantic attribute attention modulation to extract the semantic embedding instead of an encoder-decoder framework.
Contextual RNN. Our work is related to several contextual language modeling works. In [HCH16], the titles and the keywords were represented as bags of words and used to build a conditional RNNLM. However, that work only involved text attributes and could not model discrete attributes. Discrete attributes, such as review ratings and document categories, were also used to control the content generation [TYC16]. A variational auto-encoder based model with a generator-discriminator scheme was also used for generating controllable texts [HYL17], but its input attributes are limited to discrete categories.
Our approach has several major advantages over the above methods. First, we adopt a more diverse attribute set, including the widely used category attributes. The semantic information makes SAM interpretable. Second, we use a better attribute representation method, including a semantic attention mechanism, which brings flexibility. Third, by replacing the semantic attributes, our model realizes style variation for lyric generation.
| Change | Words in the documents of the politics category |
| --- | --- |
| Improved | to, be, ireland, bush, one, chairman, fiscal, week, in, or, plan |
| Alike | both, general, many, both, but, is, N, in, been, the, said |
| Worse | of, gm, stock, orders, law, jerry |

| Change | Words in the documents of the finance category |
| --- | --- |
| Improved | exchange, share, group, third-quarter, soared, from, is, profit |
| Alike | N, of, days, had, than, month, share, were, yield |
| Worse | reported, analysis, all, yield, vehicles, economics, gm, currently |
In this section, we first show that the Semantic Attribute Modulated language model gets better word predictions. The extensive qualitative analyses demonstrate the interpretability of the word predictions with regard to the input attributes. We then give several examples of the lyric style variation with SAM, which shows the flexibility of SAM.
We evaluate the proposed language model with semantic attribute attention on five different datasets with different attribute combinations. Among these datasets, TTNews, XLyrics and the titles of IMDB were collected by ourselves. We plan to release the collected corpora after resolving the copyright issues. For detailed statistics, see Table 1.
Penn TreeBank (PTB). Penn TreeBank (PTB) is a commonly used dataset for evaluating language models, and its texts are derived from the Wall Street Journal (WSJ). We use the corpus preprocessed by [MKB11], which has 929k training tokens with a vocabulary of size 10k (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz). We use the LDA topic model to analyze the PTB corpus with the topic number set to 5. We assign each document the label of the topic with the largest weight. The analysis of this category attribute and more discussions can be found in Appendix A.
BBCNews. BBCNews is a formal English news dataset and contains 2,225 BBC news articles collected by [GSBT04] (http://mlg.ucd.ie/datasets/bbc.html). The BBCNews documents have 5 class labels: business, entertainment, politics, sport and technology.
IMDB Movie Reviews (IMDB). IMDB Movie Reviews (IMDB) is a movie review corpus [MDP11] and has 75k training reviews and 25k testing reviews (http://ai.stanford.edu/~amaas/data/sentiment/). Note that [MDP11] did not provide review titles, so we collected the titles according to the provided web links.
TTNews. TTNews is a Chinese news dataset crawled from several major Chinese media outlets (http://www.ywnews.cn/, http://www.toutiao.com, http://www.huanqiu.com/, etc.). TTNews has 70,000 news articles with 30 million tokens and a vocabulary of size 40k. Each document contains title and author annotations.
XLyrics. XLyrics is a Chinese pop music lyric dataset crawled from the web. XLyrics has 4k lyrics, about 118k tokens and a vocabulary of size 3k (http://www.xiami.com/song/1771862045?spm=a1z1s.6639577.471966477.105.3HI96A).
5.2 Experimental Settings
We consider several variants of the proposed methods with different combinations of semantic attributes. In detail, we consider the language modeling with a) a category attribute, b) a title attribute and c) a title attribute plus a category attribute. In order to realize the style variation of the generations, we consider generating lyrics with an original title attribute and a fake author attribute.
For training, we use the Adam method [KB14] with an initial learning rate of 0.001 to maximize the log-likelihood, and we use early stopping based on the validation log-likelihood. The dimension of the word embedding is set to be the same as the hidden size of the RNN. The detailed parameter settings for each dataset are listed in Table 1.
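Early stopping on the validation log-likelihood can be sketched as a small control loop. The version below is only an illustration: the `patience` and `max_epochs` values are assumptions (the text does not state them), and `train_epoch` and `validate` are hypothetical callables standing in for one optimization pass and one validation pass.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=50, patience=3):
    """Run train_epoch() until the validation log-likelihood stops improving.

    train_epoch: callable performing one pass of optimization.
    validate: callable returning the validation log-likelihood (higher is better).
    Returns (best_epoch, best_score).
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()
        score = validate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score

# Toy run with a scripted sequence of validation log-likelihoods.
scores = iter([-120.0, -110.0, -105.0, -106.0, -107.0, -108.0, -90.0])
calls = []
best_epoch, best_score = train_with_early_stopping(
    lambda: calls.append(1), lambda: next(scores))
```

In the toy run, training halts after three epochs without improvement, so the late value of -90.0 is never reached. This illustrates the usual trade-off of a small patience value.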
|Main Texts Only||5-Gram||131.1||124.6||136.7||8.13|
5.3 Language Modeling Word predictions
We first show that the Semantic Attribute Modulated language model gets better word predictions. Then we give some qualitative analysis to show the interpretability of SAM.
5.3.1 Language Modeling with Category-Attribute
Document categories are indicative of the topics discussed and therefore of the distribution over words. We first consider language modeling with a category attribute on two corpora, PTB and BBCNews. For the PTB dataset, we use the LDA topic model to analyze the semantic information, and for every document we set the category to the topic which has the largest weight in LDA. The details of the PTB dataset pre-processing can be found in Appendix A. For the BBCNews dataset, we use the provided news category labels as a discrete category attribute.
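Assigning a pseudo-category from topic weights reduces to an argmax per document. The sketch below assumes that the per-document topic weights have already been produced by some LDA implementation; the numbers are made up and the function name `assign_categories` is hypothetical.

```python
def assign_categories(doc_topic_weights):
    """Map each document to the index of its largest topic weight."""
    return [max(range(len(w)), key=w.__getitem__) for w in doc_topic_weights]

# Made-up topic weights for three documents and five topics.
doc_topic_weights = [
    [0.70, 0.10, 0.05, 0.10, 0.05],
    [0.05, 0.15, 0.60, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.20, 0.20],  # tie: the lowest index wins
]
categories = assign_categories(doc_topic_weights)
```

Each resulting index can then be turned into a one-hot vector and used as the discrete category attribute.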
In Table 3, 5-Gram denotes the count-based 5-gram model [CG96], RNN denotes the conventional RNN model without any semantic attribute, and SAM-Cat is our SAM model with a category attribute. As the results show, by adding a semantic category attribute, SAM-Cat outperforms the baseline models and achieves lower perplexities.
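For reference, per-word perplexity is the exponential of the negative mean log-probability that a model assigns to the words of a corpus, so lower values indicate better predictions. A minimal sketch, assuming natural-log probabilities and a hypothetical function name:

```python
import math

def perplexity(word_log_probs):
    """exp of the negative mean natural-log probability per word."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# A model that assigns probability 1/4 to every word has perplexity 4.
uniform = [math.log(1.0 / 4.0)] * 10
ppl = perplexity(uniform)
```

Intuitively, a perplexity of 4 means the model is as uncertain as if it were choosing uniformly among four words at each step.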
5.3.2 Language Modeling with Title-Attribute
Document titles are carefully chosen by the authors to summarize the document contents and attract the attention of readers. In this part, we incorporate the title attribute to take advantage of the implicit word distribution represented by the title. We use four corpora for this task. BBCNews and TTNews are two formal published corpora, IMDB is a movie review corpus and XLyrics is a lyric corpus.
We implement the 5-gram model and the conventional RNN model on the corpus without titles. RNN-State is the conventional RNNLM model with the title’s last hidden state as initialization. This means the title is considered as the first sentence but is not included in the prediction of per-word perplexities. RNN-BOW is the conventional RNNLM model incorporated with a bag-of-words representation of the title at each time step, which is a re-implementation of [HCH16]. The SAM-Title-Att method is the SAM model with the title attribute and the attention mechanism. By adding the title’s last hidden state to SAM-Title-Att as initialization, we get the SAM-Title-Att-State method.
We show the word prediction perplexity results in Table 4. The RNN-based models that use the title embedding have better perplexity results. Moreover, SAM-Title is better than RNN-State, because title information that is supplied only through the initial hidden state fades after several nonlinear gating steps. The attention-based title attribute performs better than the one without attention, because the attention mechanism provides different importance weights for the title words.
Generally, our SAM model with title attribute performs better on BBCNews, compared with IMDB. We believe the result is caused by the different genres of these datasets. In order to make our title attribute useful, titles should be able to convey refined summaries of documents. BBCNews, as a formal news corpus written by professional journalists, usually has titles with higher quality than IMDB corpus.
5.3.3 Language Modeling with Title-Author-Attribute
In this part, we incorporate two different attributes, title, and author. We will demonstrate that these two attributes are complementary.
We use the semantic attribute attention to combine the two attributes. The suffix ‘Au’ indicates that a method incorporates the author categorical attribute; otherwise we keep the method notations used in the previous part. We show the word prediction perplexity results of several attribute combinations in Table 4. For the TTNews and XLyrics datasets, incorporating both the title and author attributes is better than using either one alone.
5.3.4 Qualitative Analysis on Interpretability of SAM
To understand why SAM-Cat outperforms traditional methods on the PTB dataset in Table 3, we show, for each category, the words with the largest and the smallest perplexity changes in Table 2. We mark in bold the words which carry strong semantic information about each specific category. For example, for the politics category, after adding the category attribute, the words with the largest prediction improvement are generally related to politics, such as ‘ireland’, ‘bush’ and ‘chairman’. The words with the largest prediction degradation generally carry semantic meaning but are unrelated to politics, such as ‘gm’, ‘stock’ and ‘orders’. The words with the least prediction change are generally function words, such as ‘both’, ‘many’ and ‘but’. The word prediction changes in other categories are similar to those in the politics category. We put the results of the finance category in Table 2 and show the results of other categories in Appendix B due to the space limit.
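One way to carry out such a per-word comparison is to average each word's log-probability under the baseline and under the attribute-conditioned model, then rank words by the difference. The sketch below uses the change in mean log-probability as a proxy for the per-word perplexity change. This proxy, the function names and the toy numbers are all assumptions for illustration and not the exact procedure behind Table 2.

```python
from collections import defaultdict

def mean_log_prob_by_word(tokens, log_probs):
    """Average log-probability of each distinct word over its occurrences."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, lp in zip(tokens, log_probs):
        sums[tok] += lp
        counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}

def rank_changes(tokens, baseline_log_probs, model_log_probs):
    """Words sorted from most improved to most degraded mean log-probability."""
    base = mean_log_prob_by_word(tokens, baseline_log_probs)
    new = mean_log_prob_by_word(tokens, model_log_probs)
    deltas = {tok: new[tok] - base[tok] for tok in base}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

# Made-up log-probabilities for a five-token document.
tokens = ["bush", "the", "stock", "bush", "the"]
baseline = [-6.0, -1.0, -4.0, -5.0, -1.0]
with_category = [-3.0, -1.0, -5.0, -4.0, -1.0]
ranking = rank_changes(tokens, baseline, with_category)
```

The head of `ranking` corresponds to the "Improved" rows, the tail to the "Worse" rows, and entries with a change near zero to the "Alike" rows.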
To further investigate how attention values control the importance weights of the attributes, we visualize some of the attention values in Figure 3. The color depth shows the attention weights. The red rectangles show that the title word ‘Microsoft’ has a large effect on the content words ‘software’ and ‘unauthorized’. The title word ‘move’ has a large effect on the content word ‘prove’. This example shows that the attention mechanism works as a flexible selection of the attributes.
5.4 Flexible Style Variation with SAM
Many downstream applications of the language modeling can be enhanced with the proposed semantic attributes. For machine translation, the semantic attributes could also be titles, authors, and categories. For the speech recognition task, the semantic attributes include the age and the dialect of the speaker. For language generation tasks, such as the question-answering and the poem/lyric generation, the possible attributes are titles, authors, and even styles.
We use the SAM model to perform lyric generation with both the title and author attributes. Given an original lyric, we generate a new one with the same title but a fake author. We obtain several interesting generation results, and the differences between the two lyrics are highly related to the title attribute. Here we give two concrete examples (one in Chinese and the other in English) and leave more examples to Appendix C.
For the English example in Fig. 4: the original lyric ‘Last Kiss’ is a popular song by Taylor Swift in the pop country style. After changing the authorship to Jason Mraz, we generate a new love song which looks like a rock lyric. The styles of the two lyrics match the styles of the two singers.
For the Chinese example in Fig. 5: the original lyric ‘Your Face’ is a sentimental love song written by Xiaotian Wang which recalls a past love. After changing the authorship to Lovely Sisters, a trending Chinese band, we generate a joyful love song about the happiness of falling in love.
In this paper, we propose SAM, the semantic attribute modulation for language modeling and style variation. The main idea is to take advantage of widely available and meaningful attributes to generate interpretable texts. Our model adopts a diversity of semantic attributes including titles, authors, and categories. With the attention mechanism, our model automatically scores the attributes in a flexible way and embeds the attribute representations into the hidden feature space as inputs to the generation model. The diversity of the input attributes makes the model more powerful and interpretable, and the semantic attribute attention mechanism brings flexibility to the whole model. Extensive experiments demonstrate the effectiveness and the interpretability of our flexible Semantic Attribute Modulated language generation model.
In the future, we are interested in exploring more attributes which have semantic meaning for the language model task. In addition to the lyric generation task, other language generation tasks can also use our SAM model to utilize more semantic attributes. One possible example is to incorporate the geographic position attribute into the speech recognition task to model the dialects.
- [BDVJ03] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
- [Bod02] Mikael Bodén. A guide to recurrent neural networks and backpropagation. The Dallas project, 2002.
- [CG96] Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310–318, 1996.
- [CGCB14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- [CVMG14] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [DWGP17] Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. Topicrnn: A recurrent neural network with long-range semantic dependency. In ICLR, 2017.
- [GJU17] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. In ICLR, 2017.
- [Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- [GSBT04] Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In NIPS, 2004.
- [GVS16] Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. Contextual lstm (CLSTM) models for large scale nlp tasks. arXiv preprint arXiv:1602.06291, 2016.
- [HCH16] Cong Duy Vu Hoang, Trevor Cohn, and Gholamreza Haffari. Incorporating side information into recurrent neural network language models. In Proceedings of NAACL-HLT, pages 1250–1255, 2016.
- [Hea11] Kenneth Heafield. Kenlm: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics, 2011.
- [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [HYL17] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric Xing. Toward controlled generation of text. In ICML, 2017.
- [JCK15] Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.
- [KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [KZC16] Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339, 2016.
- [LGA16] Rémi Lebret, David Grangier, and Michael Auli. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771, 2016.
- [LL17] Bing Liu and Ian Lane. Dialog context language modeling with recurrent neural networks. In ICASSP, 2017.
- [LLY15] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. Hierarchical recurrent neural network for document modeling. In EMNLP, pages 899–907, 2015.
- [LVM15] Zachary C Lipton, Sharad Vikram, and Julian McAuley. Capturing meaning in product reviews with character-level generative text models. arXiv preprint arXiv:1511.03683, 2015.
- [MBW17] Hongyuan Mei, Mohit Bansal, and Matthew R Walter. Coherent dialogue with attention-based language models. In AAAI, 2017.
- [MDP11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In ACL, 2011.
- [MKB10] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
- [MKB11] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531. IEEE, 2011.
- [MZ12] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In SLT, pages 234–239, 2012.
- [PT16] Ellie Pavlick and Joel Tetreault. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics, 4:61–74, 2016.
- [RD00] Ehud Reiter and Robert Dale. Building natural language generation systems. Cambridge university press, 2000.
- [RJS17] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
- [SHB16] Rico Sennrich, Barry Haddow, and Alexandra Birch. Controlling politeness in neural machine translation via side constraints. In HLT-NAACL, pages 35–40, 2016.
- [TYC16] Jian Tang, Yifan Yang, Sam Carton, Ming Zhang, and Qiaozhu Mei. Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900, 2016.
- [TZH16] Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari. Inter-document contextual language model. In Proceedings of NAACL-HLT, pages 762–766, 2016.
- [WC16] Tian Wang and Kyunghyun Cho. Larger-context language modelling with recurrent neural network. In ACL, 2016.
- [WHW16] Zhe Wang, Wei He, Hua Wu, Haiyang Wu, Wei Li, Haifeng Wang, and Enhong Chen. Chinese poetry generation with planning based neural network. In COLING, 2016.
Appendix A: Data Preparation of PTB
PTB is a commonly used benchmark corpus for the language modeling task. We use the LDA topic model to extract semantic category attributes. Admittedly, adding a pseudo-category derived from the corpus may seem questionable for the language modeling task, since the topic model sees the words in advance and the language model then predicts them. We argue that the pseudo-category still makes sense in the language modeling evaluation for the following two reasons. First, we only add one discrete assignment for each document, so no direct word distribution information is propagated. Second, the category assignments carry strong semantic information, and real category assignments are available for other datasets. The semantic analysis is as follows.
For the PTB dataset, we set the topic number to 5 and take the topic with the largest weight as each document’s category assignment. As can be seen in Table 5, topic 0 focuses on corporate finance, topic 1 on politics, topic 2 on the stock market, topic 3 on managers and topic 4 on daily news.
| Topic | Top words |
| --- | --- |
| 0 | million billion share year company cents stock sales income revenue bonds profit corp. |
| 1 | its mr. federal company u.s. new government state court plan officials bill house |
| 2 | market stock trading prices stocks investors new price big index friday rates markets traders |
| 3 | its company mr. inc. new co. corp. president chief executive says group chairman business vice |
| 4 | mr. says when people years new time president work first few think good want city know back |
Appendix B: More word predictions of the SAM-Cat on PTB dataset
In this part, we show more word predictions of our SAM-Cat model on the PTB dataset. We show that after adding the category attribute, the word prediction improvements are semantically meaningful. We show the results for the categories ‘stock’ and ‘management’ in Table 6. We mark in bold the words which carry strong semantic information about each specific category. After adding the category attribute, the words with the largest prediction improvement are generally related to the category information. The words with the largest prediction degradation generally carry semantic meaning but are unrelated to the category information. The words with the least prediction change are generally function words.
| Change | Words in the documents of the stocking category |
| --- | --- |
| Improved | co, operating, an, markets, considered, commercial, stake |
| Alike | N, usa, the is, these, discussion, at, the, chicken |
| Worse | offering, million, money, read, communications, lines, issues, city |

| Change | Words in the documents of the management category |
| --- | --- |
| Improved | market, about, results, orders, trading, dow, portfolio, price, market |
| Alike | N, likely, of, prepared, southeast, futures, see, group, the |
| Worse | bear, totaled, optimistic, executive, chief, manufacturers, about |
Appendix C: More Lyric Generation Variations
In this part, we give several lyric generation examples, two in English and one in Chinese. We observe that if two authors have stylistically different content in the training data, the generated lyrics are very likely to differ in style as well. In the following figures, we give the detailed generations, with the corresponding analyses in the figure captions.