
Image Inspired Poetry Generation in XiaoIce

Vision is a common source of inspiration for poetry. The objects and sentimental imprints that one perceives in an image may lead to various feelings depending on the viewer. In this paper, we present a system that generates poetry from images to mimic this process. Given an image, we first extract a few keywords representing the objects and sentiments perceived in the image. These keywords are then expanded to related ones based on their associations in human-written poems. Finally, verses are generated gradually from the keywords using recurrent neural networks trained on existing poems. Our approach is evaluated by human assessors and compared to other generation baselines. The results show that our method can generate poems that are more artistic than those of the baseline methods. This is one of the few attempts to generate poetry from images. By deploying our proposed approach, XiaoIce has already generated more than 12 million poems for users since its release in July 2017. A book of its poems has been published by Cheers Publishing, which claimed that the book is the first-ever poetry collection written by an AI in human history.



1 Introduction

Figure 1: Illustration of the Image to Poetry framework. The system takes an image query given by a user and outputs a semantically relevant piece of modern Chinese poetry. In the left part of the figure, after the intermediate keywords are extracted from the query by object and sentiment detection, keyword filtering and expansion are applied to generate a keyword set. Each keyword in the set is then used as the seed for one line of the poem, as shown in the poem generation part. A hierarchical generation model is proposed to maintain both the fluency of sentences and the coherence between sentences. In addition, an automatic evaluator is used to select sentences of good quality.

Poetry has always been important and fascinating in Chinese literature, in both its traditional and modern forms. While traditional Chinese poetry is constructed with strict rules and patterns (e.g., a five-character quatrain must contain four lines of five Chinese characters each, with rhymes required at specific positions), modern Chinese poetry is unstructured and written in vernacular Chinese. Compared to traditional Chinese poetry, the readability of the vernacular makes modern Chinese poetry more likely to strike a chord, but errors in wording or grammar are also more easily criticized by users. Good modern poetry also requires more imagination and creative use of language. From these perspectives, it may be more difficult to generate a good modern poem than a classical one.

Poetry can be inspired by many things, among which vision (and images) is certainly a major source. Indeed, poetic feelings may emerge when one contemplates an image, which may represent anything from a natural scene to a painting. Different people usually have different readings of and feelings about the same image, which makes it particularly interesting to read poems by others inspired by the same image. In this work, we present a system that mimics the poetry writing of a poet looking at an image.

Generating poetry from an image is a special case of text generation from images, an area with many existing studies. However, most of them focus on image captioning rather than literary creation; only a few previous systems have addressed the problem of generating poems from images. There have also been many studies and systems for generating poetry. In most cases, a system is provided with a few keywords and is required to compose a poem containing or relating to them. Compared with poetry generation from keywords, using an image as inspiration has several advantages. First, an image is worth a thousand words and contains richer information than keywords, so poems generated from images may be more varied. Second, as mentioned earlier, the same image can lead different people to different interpretations, so using images to inspire poetry generation may often provide an enjoyable surprise and leave an impression of greater imagination. Finally, compared with asking users to provide keywords, uploading an image is a much simpler and more natural way to interact with a system nowadays.

The system we propose, illustrated in Figure 1, aims to generate a modern Chinese poem inspired by visual content. For the image on the left-hand side, we extract objects and sentiments, such as city and busy, to form our initial keyword set. The keywords are then filtered and expanded with associated objects and feelings. Finally, each keyword is used as the initial seed for one sentence of the poem. A hierarchical recurrent neural network models the structure both between words and between sentences, and a fluency checker automatically detects low quality sentences early so that a new sentence can be generated when necessary.

Our main contributions are as follows:

  • We introduce a novel application that uses an image to inspire modern poetry generation, mimicking the human behavior of expressing one's feelings when touched by vision.

  • In order to generate poetry of good quality, we incorporate several verification mechanisms for text fluency, poetry integrity, and the match with the image.

  • We leverage keyword expansion to improve the diversity of generated poems and make them more imaginative.

A book of 139 generated poems, titled “Sunshine Misses Windows”, was published on May 19, 2017 by Cheers Publishing, which claimed that the book is the first-ever poetry collection written by an AI in human history. We also released the system in XiaoIce products in July 2017, and by August 2018 about 12 million poems had been generated for users.

The rest of the paper is organized as follows. Section 2 reviews related work on image captioning and poetry generation. Section 3 describes the problem and our approach in detail. The training details are explained in Section 4, and the datasets and experiments are presented in Section 5. In Section 6, we present a user study comparing our approach with a state-of-the-art image captioning system and CTRIP (the only known system that generates poetry from images). Section 7 concludes the paper.

2 Related Work

Image captioning has been a popular research topic in recent years. [Bernardi et al.2016] provides an overview of most image description research and classifies approaches into three categories. Our work falls under “Description as Generation from Visual Input”, which takes visual features or information from images as input for text generation.

[Patterson et al.2014] and [Devlin et al.2015] treat descriptions as retrieval results in the visual space. Although they can retrieve grammatically correct sentences and are applicable to novel images, the quality depends greatly on the training dataset. Among the works similar to ours that generate descriptions from visual input, RNN-based models have recently achieved strong quality. [Socher et al.2014] maps image and sentence representations into a latent space so that text and image become related. [Soto et al.2015] exploits an encoder-decoder framework. [Karpathy and Li2015] and [Donahue et al.2015] apply LSTM architectures or alignments of image and sentence models for further improvement. However, most of these methods need image-sentence pairs for training, and for image-to-poetry there is no existing large-scale dataset of paired images and poems.

Alongside the glorious history of poetry, automatic poetry generation is another popular research topic in artificial intelligence, starting from the Stochastische Texte system (Lutz 1959). Like that system, the first generators were template-based. [Tosa, Obara, and Minoh2008] and [Wu, Tosa, and Nakatsu2009] developed an interactive system for traditional Japanese poetry. [Oliveira2012] proposed a system based on semantic and grammar templates. Word association rules were applied in [Netzer et al.2009]. Systems based on templates and rules can generate sentences with correct grammar, but at the price of less flexibility. A second type of generator applies genetic algorithms, as in [Manurung2004] and [Manurung, Ritchie, and Thompson2012], which regard poetry generation as a state search. [Yan et al.2013] formulate the task as an optimization problem based on a generative summarization framework under several constraints. [Jiang and Zhou2008] present a phrase-based statistical machine translation approach that generates the second sentence from the first. [He, Zhou, and Jiang2012] extend this approach to sequential translation for quatrains.

The growth of deep learning has also brought success to poem generation. The basic recurrent neural network language model (RNNLM) [Mikolov et al.2010] can generate poetry when trained on a poetry corpus. [Zhang and Lapata2014] generate lines incrementally instead of treating a poem as a single sequence. [Yan2016] adds iterative polishing to a hierarchical architecture. [Wang et al.2016a] apply an attention-based model. [Yi, Li, and Sun2016] extend the approach to a quatrain generator that takes an input word as the topic. [Ghazvininejad et al.2016] generate poems on a user-supplied topic with rhythm and rhyme constraints. [Wang et al.2016b] propose a planning-based method to ensure poem coherence and consistency. All these studies focus on generating a poem from text input; none of them involves a non-textual modality.

There have been other studies connecting multiple modalities. [Schwarz, Berg, and Lensch2016] connect images and poetry by automatically illustrating poems with semantically relevant and visually coherent illustrations. However, their task is not to generate a poem from an image, which is more complex. Our work focuses on automatically generating a semantically relevant poem from an image.

Figure 2: Our proposed hierarchical poem model includes two levels of LSTM. With the poem-level model, illustrated in the lower half of the figure, we predict the content vector of the next sentence by considering all previous sentences. This content vector is then used as an input to the sentence-level LSTM in the upper half of the figure. Note that this figure only shows the backward generator used in recursive generation; the forward version is obtained by reversing the structure.

3 Image to Poetry

3.1 Problem Formulation and System Overview

To achieve the goal of generating poems inspired by images, we formulate the problem as follows: for an image query Q, we try to generate a poem P = {l_1, l_2, ..., l_N}, where l_i represents the i-th line of the poem and N is the number of lines in the poem. The poem is supposed to be relevant to the image content, fluent in language, and coherent in semantics.

The overview of our solution is shown in Figure 1. For the image query, object and sentiment detection are used to extract appropriate nouns, such as city and street, and adjectives, such as busy, as the initial keyword set. After filtering out words with low confidence and rare words, keyword expansion is applied to construct a keyword set K whose size equals the number of lines N of the poem. In the example, place and smile are expanded, so K contains four keywords: city, busy, place and smile. Next, each keyword is regarded as the initial seed for one sentence in the poem generation process; for example, the first sentence is generated from the seed city. A hierarchical recurrent neural network is proposed for modeling the structure between words and between sentences. Finally, we apply a fluency checker to automatically detect low quality sentences early on and re-generate them.
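The pipeline just described can be wired together as a short orchestration sketch; every component here is a toy stand-in, and the names extract, filter_expand, generate_line, and check are illustrative only, not the system's actual interfaces:

```python
def image_to_poem(image, extract, filter_expand, generate_line, check, n_lines=4):
    """End-to-end sketch of the Figure 1 pipeline. All five callables
    are stand-ins for the components described in Sections 3.2-3.6."""
    seeds = extract(image)                    # object + sentiment detection
    keywords = filter_expand(seeds, n_lines)  # keyword filtering and expansion
    poem = []
    for kw in keywords:
        line = generate_line(kw, poem)        # seeded, conditioned on prior lines
        while not check(line):                # fluency evaluator gate
            line = generate_line(kw, poem)    # re-generate on failure
        poem.append(line)
    return poem

# Toy stand-ins wired together (illustrative only).
poem = image_to_poem(
    image=None,
    extract=lambda img: ["city", "busy"],
    filter_expand=lambda kws, n: (kws + ["place", "smile"])[:n],
    generate_line=lambda kw, prev: f"a line about {kw}",
    check=lambda line: True,
)
print(poem)
```

The real system replaces each lambda with the corresponding component: the parallel CNNs, the expansion module, the hierarchical generator, and the fluency evaluator.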

We use Long Short-Term Memory (LSTM) networks for the RNNs mentioned below. The basic element for generation can be either a character or a word; we try both in our experiments.

3.2 Keyword Extraction

We propose detecting objects and sentiments from each image with two parallel convolutional neural networks (CNN), which share the same network architecture but have different parameters. Specifically, one network learns to describe objects by outputting noun words, and the other learns to understand sentiments by outputting adjective words. The two CNNs are pre-trained on ImageNet [Krizhevsky, Sutskever, and Hinton2012] and fine-tuned on noun and adjective categories, respectively. For each CNN, the extracted deep convolutional representation is denoted as r = f(I; W), where W denotes the overall parameters of the CNN, f denotes a set of convolution, pooling and activation operations, and I denotes the input image. Based on this deep representation, we further generate a probability distribution p over the output object or sentiment categories C:

p = softmax(g(r))

where g represents fully-connected layers that map the convolutional features to a feature vector that can be matched with the category entries, and softmax further transforms the feature vector into probabilities. For the proposed parallel CNNs, we denote the probabilities over noun and adjective categories as p_noun and p_adj, respectively. Categories with high probabilities are chosen to construct the candidate keyword set.
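As a rough sketch of this keyword selection step (the category lists, logits, and helper names below are invented for illustration; the real system uses the outputs of the fine-tuned CNNs), top categories can be picked from the two probability distributions like this:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pick_keywords(logits, categories, k=2):
    """Return the k categories with the highest probability,
    paired with their confidence scores."""
    probs = softmax(logits)
    top = np.argsort(probs)[::-1][:k]
    return [(categories[i], float(probs[i])) for i in top]

# Hypothetical outputs of the two parallel CNN heads for one image.
noun_categories = ["city", "street", "tree", "sea"]
adj_categories  = ["busy", "quiet", "bright"]
noun_logits = [3.1, 2.4, 0.2, -1.0]   # favors "city" and "street"
adj_logits  = [2.8, 0.1, 0.5]         # favors "busy"

keywords = (pick_keywords(noun_logits, noun_categories, k=2)
            + pick_keywords(adj_logits, adj_categories, k=1))
print([w for w, _ in keywords])  # ['city', 'street', 'busy']
```

The confidence scores kept alongside each keyword are what the later filtering step (Section 3.5) thresholds on.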

3.3 Sentence Model

RNNLM We follow the recurrent neural network language model (RNNLM) [Mikolov et al.2010] to predict text sequences. Each word is predicted sequentially from the preceding word sequence:

P(w_1, ..., w_n) = ∏_{i=1..n} P(w_i | w_1, ..., w_{i-1})

where w_i is the i-th word and w_1, ..., w_{i-1} is the preceding word sequence.
Recursive Generation To control the content of generated sentences, we use specific keywords as seeds for sentence generation, i.e., we force the RNNLM to generate a sentence containing a given keyword. Due to the directionality of the RNNLM, one can only generate forward from the existing words. To allow the keyword to appear at any position in a sentence, a simple idea is to train a reversed version of the RNNLM (fed the corpus in reverse order during training) and generate backward from the existing text:

P(w_1, ..., w_n) = ∏_{i=1..n} P(w_i | w_n, ..., w_{i+1})

However, if we generate the forward and backward parts separately, the result would be two independent fragments without semantic connection. To solve this problem, we use the simple recursive strategy described below.

Let <sos> and <eos> represent the start and end symbols of a sentence, and let the forward and backward models be the original and reversed versions of the RNNLM. The process of generating the i-th line l_i from the i-th keyword k_i of the poem is described in Algorithm 1.

Algorithm 1 Recursive Generator: the line is grown outward from the seed keyword in both directions. While the start symbol has not yet been produced, the backward model extends the line one word to the left; while the end symbol has not yet been produced, the forward model extends it one word to the right. Each new word is conditioned on the full partial sentence, and generation stops once both boundary symbols have appeared.
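Under this reading, the recursive generator can be sketched as follows; the stub language models and the <sos>/<eos> markers are stand-ins for the actual forward and backward RNNLMs:

```python
SOS, EOS = "<sos>", "<eos>"

def recursive_generate(keyword, forward_lm, backward_lm, max_len=20):
    """Grow one line in both directions from a seed keyword.
    forward_lm / backward_lm stand in for the two RNNLMs: each takes
    the current partial sentence and returns the next word."""
    line = [keyword]
    while (line[0] != SOS or line[-1] != EOS) and len(line) < max_len:
        if line[0] != SOS:
            # backward model extends leftward, seeing the whole partial line
            line.insert(0, backward_lm(line))
        if line[-1] != EOS:
            # forward model extends rightward
            line.append(forward_lm(line))
    return [w for w in line if w not in (SOS, EOS)]

def make_stub(words, stop):
    """Toy language model: emit a canned word sequence, then stop."""
    it = iter(words)
    return lambda context: next(it, stop)

forward_lm  = make_stub(["never", "sleeps"], EOS)  # words after the seed
backward_lm = make_stub(["the"], SOS)              # words before the seed

print(recursive_generate("city", forward_lm, backward_lm))
# -> ['the', 'city', 'never', 'sleeps']
```

Because each extension sees the entire partial sentence, the left and right halves stay semantically connected, unlike generating the two halves independently.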

3.4 Poem Model

Generation with Previous Line While the fluency of individual sentences can be controlled with the RNNLM model and the recursive strategy, in the multi-keyword, multi-line scenario another issue is maintaining consistency between sentences. Since we need to generate in two directions, using the hidden state of the RNNLM to pass information along is no longer feasible. Instead, we extend the input of the RNNLM to two parts: one is the original previous-word input, and the other is information from the previous sentence. Here, we use an LSTM encoding v_{i-1} of the previous line as the input context. For generating the i-th line of the poem, we use:

P(l_i) = ∏_{j} P(w_j | w_1, ..., w_{j-1}, v_{i-1})
Hierarchical Poem Model Although the model above can maintain the consistency of a poem by capturing the previous line's information, an alternative is to maintain a poem-level network. At the poem level, we predict the content vector of the next sentence from all previous sentences; at the sentence level, we use this prediction as an additional input. With the hierarchical structure shown in Figure 2, we can maintain fluency and consistency using not only the previous line but all previous lines. For generating the i-th line of the poem, we use:

c_i = LSTM_poem(v_1, ..., v_{i-1})
P(l_i) = ∏_{j} P(w_j | w_1, ..., w_{j-1}, c_i)

where v_k is the encoding of line l_k and c_i is the predicted content vector.
Notice that since we still need to use the recursive strategy described above, a forward version and a backward version of the model are both required.
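The two-level scheme can be illustrated with stand-ins: a toy bag-of-words line encoder and a simple averaging predictor replace the sentence-level and poem-level LSTMs (all names and the averaging rule are assumptions for illustration, not the paper's networks):

```python
import numpy as np

DIM = 8  # toy content-vector dimensionality

def encode_line(line):
    """Toy bag-of-words line encoder (stand-in for the sentence-level
    LSTM encoder)."""
    v = np.zeros(DIM)
    words = line.split()
    for w in words:
        v[hash(w) % DIM] += 1.0
    return v / max(len(words), 1)

def predict_content(prev_encodings):
    """Stand-in for the poem-level LSTM: predict the content vector of
    the next line from the encodings of ALL previous lines (here, a
    simple average)."""
    if not prev_encodings:
        return np.zeros(DIM)
    return np.mean(prev_encodings, axis=0)

def generate_poem(keywords, sentence_generator):
    poem, encodings = [], []
    for kw in keywords:
        content = predict_content(encodings)      # uses all previous lines
        poem.append(sentence_generator(kw, content))
        encodings.append(encode_line(poem[-1]))   # feed back into poem level
    return poem

# Toy sentence generator that just records its conditioning signal.
toy = lambda kw, c: f"{kw} line (ctx {np.linalg.norm(c):.2f})"
print(generate_poem(["city", "busy", "place", "smile"], toy))
```

The key structural point is the loop: each new line is conditioned on a content vector summarizing every earlier line, not just the immediately preceding one.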

3.5 Keyword Expansion

Since we attempt to control the generation using the detected objects and sentiments as keywords, the final results correspond closely to the keywords. However, two problems can lead to failure: low-confidence keywords and rare keywords. The former is caused by the limitations of the image recognition model and makes the generated sentences irrelevant to the query image, while the latter leads to low-quality or monotonous sentences due to insufficient training data. Thus we only use keywords that have both high confidence in image recognition and enough occurrences in the training corpus. After filtering, however, the number of image keywords may be less than the number of lines N. Even when the number of initial keywords is larger than N, keyword expansion is still useful: it allows us to go beyond what is directly observable from the image, and such expanded keywords can make the poetry more imaginative and less purely descriptive. In this work, we test several options:
Without Expansion The first option is the simplest: we choose only the keywords with high confidence in recognition and enough occurrences in the corpus. These keywords serve as seeds for the recursive generation with the forward and backward models; for the remaining lines of the poem, without any new keywords, we generate each new line from the previous line using the forward model only.
Frequent Words One way to expand the keyword set is to select frequent nouns and adjectives from the training corpus. After deleting rare and inappropriate words, the expanded keywords are sampled from the word distribution of the training corpus: the more frequent a word, the greater its chance of being selected. The three most frequent nouns in our corpus are life, time, and place. Applying these words can enhance both the diversity and the imagination of the generated poems without straying too far from the topic.
High Co-occurred Words Another idea is to consider only words that co-occur frequently with the original image keywords in the training corpus. We sample keywords from the distribution of co-occurrence frequencies with the original keywords: the more often a word co-occurs with a selected image keyword, the greater its chance of being selected. Take city for example: the words with the highest co-occurrence with city are place, child, heart and land. Unlike the previous method, these words are usually more relevant to the keywords recognized from the image query, so the result is expected to stay more on topic.
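The two expansion strategies can be sketched with simple counting over a toy corpus (the function names, the toy corpus, and the sampling details below are illustrative assumptions, not the system's exact implementation):

```python
import random
from collections import Counter

def expand_by_frequency(seeds, corpus_lines, n_total, rng):
    """expand_freq-style: sample extra keywords from the corpus word
    distribution (rare/inappropriate words assumed pre-filtered)."""
    counts = Counter(w for line in corpus_lines for w in line.split())
    for s in seeds:
        counts.pop(s, None)                 # do not re-pick the seeds
    words, weights = zip(*counts.items())
    extra = []
    while len(seeds) + len(extra) < n_total:
        w = rng.choices(words, weights=weights)[0]
        if w not in extra:
            extra.append(w)
    return list(seeds) + extra

def expand_by_cooccurrence(seeds, corpus_lines, n_total, rng):
    """expand_cona-style: weight candidates by how often they appear
    in the same line as one of the seed keywords."""
    co = Counter()
    for line in corpus_lines:
        ws = set(line.split())
        for s in seeds:
            if s in ws:
                co.update(ws - set(seeds))
    words, weights = zip(*co.items())
    extra = []
    while len(seeds) + len(extra) < n_total:
        w = rng.choices(words, weights=weights)[0]
        if w not in extra:
            extra.append(w)
    return list(seeds) + extra

corpus = ["the city is busy", "a busy place", "city of smile",
          "time and life", "life in the city"]
rng = random.Random(0)
print(expand_by_cooccurrence(["city", "busy"], corpus, 4, rng))
```

The co-occurrence variant only ever proposes words that have appeared alongside a seed keyword, which is why it tends to stay closer to the image topic than pure frequency sampling.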

3.6 Fluency Evaluator

In poetry it is desirable to generate diverse sentences even for the same keyword, so we randomly sample word candidates from the top-n best in beam search. The resulting sentence may be one never seen in the training data, and we can also generate diverse sentences for images with the same objects or sentiments. However, this diversity in the generation process may yield poor sentences that are not fluent or are semantically inconsistent.

To overcome these issues, we use an automatic sentence evaluator. We use n-gram and skip n-gram models to measure whether a word is correct and whether two words are semantically consistent. At the grammar level, we train an LSTM-based language model on a POS-tagged corpus and apply it to compute the generation probabilities of POS-tagged candidate sentences. A sentence that fails the evaluation is discarded and another one is generated.
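A minimal stand-in for the n-gram part of such an evaluator, using add-one-smoothed bigram log-probabilities on a toy corpus (the real evaluator also uses skip n-grams and a POS-tagged LSTM language model, which are omitted here):

```python
import math
from collections import Counter

def train_bigrams(corpus_lines):
    """Count unigrams and bigrams, with sentence boundary markers."""
    uni, bi = Counter(), Counter()
    for line in corpus_lines:
        words = ["<s>"] + line.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def sentence_score(line, uni, bi, vocab_size):
    """Average add-one-smoothed bigram log-probability; a crude
    stand-in for the paper's n-gram checks."""
    words = ["<s>"] + line.split() + ["</s>"]
    logp = sum(math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
               for a, b in zip(words, words[1:]))
    return logp / (len(words) - 1)

def passes(line, uni, bi, vocab_size, threshold):
    """Reject (i.e., trigger re-generation of) low-scoring sentences."""
    return sentence_score(line, uni, bi, vocab_size) >= threshold

corpus = ["the city never sleeps", "the busy street glows",
          "a quiet place to smile"]
uni, bi = train_bigrams(corpus)
V = len(uni)
good = sentence_score("the city never sleeps", uni, bi, V)
bad  = sentence_score("sleeps smile the never", uni, bi, V)
assert good > bad   # familiar word order scores higher
```

Any candidate falling below the threshold would be discarded and re-generated, mirroring the reject-and-retry loop described above.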

4 Training Details

As a training corpus, we collect 2,027 modern Chinese poems composed of 45,729 sentences. The character vocabulary size is 4,547. For training the word-based model, word segmentation is applied to the corpus; the word vocabulary size is 54,318.

In the keyword extraction model, for each CNN in our parallel architecture we select GoogLeNet [Szegedy et al.2015] as the basic network structure, since GoogLeNet produced first-class performance in the ImageNet competition [Krizhevsky, Sutskever, and Hinton2012]. Following [Fu et al.2015] and [Borth et al.2013], we use 272 nouns and 181 adjectives as the categories for noun and adjective CNN training, since these categories are adequate to describe common objects and sentiments conveyed by images. Training each CNN takes about 50 hours on a Tesla K40 GPU, evaluated by top-1 classification accuracy on the noun and adjective testing sets from [Fu et al.2015] and [Borth et al.2013], respectively.

In the poetry generation model, the recurrent hidden layers at the sentence level and the poem level each contain 3 layers with 1024 hidden units per layer. The sentence encoder dimensionality is 64. The model was trained with the Adam optimizer [Kingma and Ba2014] with a minibatch size of 128. Training takes about 100 hours on a Tesla K80 GPU.

5 Experiments of Our Approach

The system involves several components, each with several design choices. Since it is hard to measure all combinations, as they may influence each other, we optimize our system with a greedy strategy and separate the experiments into two parts. In each part, we compare the method choices for one more component, combined with the best approach from the previous experiment. The first part covers poem generation with sentence-level and poem-level models. Since we use keywords as seeds for generation, the second part focuses on the quality of the keywords produced by keyword extraction and the different keyword expansion methods.

Figure 3: The human assessment tool is designed to capture the relative judgments among methods. For each image, a four-line poem is generated with each method, and all poems are displayed side by side.

5.1 Experiment Setup

Test Image Data For the model optimization experiment, 100 public domain images are crawled from Bing image search using 60 randomly sampled nouns and adjectives from our predefined categories. We focus on 45 images recognized as views for optimizing our model. The data will be released to the research community. Note that although our experiments are conducted on one type of image, our proposed method is general: since we released our system in July 2017, users have submitted about 1.2 million images of all kinds and received generated poems by August 2018.
Human Evaluation As shown in [Liu et al.2016], overlap-based evaluation metrics, such as BLEU and METEOR, have little correlation with real human judgments. Thus, we conduct user studies to evaluate our method.

The judgment interface is shown in Figure 3. We present an image at the top and, below it, the poems generated by the different methods side by side for comparison. For each poem, we ask assessors to give a rating from 1 (dislike) to 5 (like) after they have compared all the poems. We do not show one poem at a time and ask for a rating, because such ratings are not stable for comparing the quality of poems: assessors may change their standards unconsciously. Our design borrows the idea of the A/B tests widely used in search evaluation. When an assessor can easily read and compare all poems before rating, his/her scores provide meaningful information on the relative ordering of the poems. Therefore, we focus on the relative performance in each experiment rather than on absolute scores. In addition, we randomly shuffle the order of the methods for each image to remove ordering bias and bias toward any particular method. Due to the high cost of human evaluation, we invite five college students with engineering backgrounds and two with literature backgrounds to judge all methods for model optimization.

5.2 Poem generation

In poem generation, we run the sentence-level experiment first; after the best approach is chosen, information from previous sentences is used in the poem-level models.
Sentence Level At the sentence level, we aim to determine whether the recursive generating strategy can produce more fluent sentences containing specific keywords (here two nouns and two adjectives). As a baseline, we generate the part before a keyword with the backward model and the part after it with the forward model separately, and then combine them. We also consider the influence of the generating element (character or word). This gives four methods: char_combine, char_recursive, word_combine, and word_recursive. Although this experiment focuses on sentence generation, for the convenience of the assessors' judgments we still present a four-line poem with four fixed keywords for each method. As shown in Table 1, word_recursive is significantly better than the two character-based methods, and char_recursive is significantly better than char_combine. Although the difference between the two word-based methods is not significant, word_recursive is better than word_combine, in particular when both are compared against char_recursive: word_recursive significantly outperforms char_recursive, while word_combine does not. Therefore, we choose word_recursive as the best method in this step.
Poem Level As the best method from the previous step, word_recursive is kept, and additional methods are added in the second user study. While word_recursive ignores the relationship between sentences, word_preline uses the encoding of the previous line as an additional input, and word_poemlstm considers all previous lines by maintaining a poem-level LSTM. Again, four fixed keywords are given for each method. The results in Table 1 show that both methods bring significant improvements over word_recursive. By considering all previous lines, word_poemlstm also significantly outperforms word_preline by about 11%, which indicates that our proposed hierarchical poem model works best.

Sentence Level
Approach         Average Score
char_combine     2.30
char_recursive   2.55
word_combine     2.68
word_recursive   2.86

Poem Level
Approach         Average Score
word_recursive   2.64
word_preline     2.95
word_poemlstm    3.27

Table 1: Human evaluation results of our poem generator at the sentence level and the poem level. The average scores show that both the recursive strategy and the hierarchical model yield significant improvements (p-value less than 0.01).

5.3 Keyword Extraction and Expansion

Given the superiority of the word_poemlstm method, we then measure keyword filtering and expansion with this poem generation method. The original four keywords we provide are the two nouns and two adjectives with the highest probabilities from the keyword extraction step. We compare this baseline with the different keyword expansion methods. The without_expand method uses only the two appropriate extracted keywords (one noun and one adjective) that remain after keyword filtering. The expand_freq method enlarges these two keywords according to word frequency in the whole corpus, and expand_cona expands them based on co-occurrence frequency with them. Since this step focuses on the relevance of keywords to the image, besides a 1-to-5 score for each poem, the assessors are also asked to assign a true/false label to each keyword and its corresponding sentence according to their relevance to the image query.

While there is no obvious difference between the average scores in Table 2, the four approaches show very different performance in relevance to the image query. Since without_expand uses only high-confidence keywords, it has the lowest keyword irrelevance rate. However, as longer stretches of unguided generation tend to be more problematic, the irrelevance rate of sentences generated by without_expand increases dramatically to 30%. expand_freq and expand_cona, by contrast, reduce the sentence irrelevance rate by enlarging the keyword set with additional words. While the keyword irrelevance rate of expand_freq increases slightly, expand_cona, by considering word co-occurrence, achieves the best sentence relevance rate and decreases the keyword irrelevance rate from 18.7% to 15.6%.

Approach         Average Score   Keyword irrelevance rate   Sentence irrelevance rate
word_poemlstm    3.10            18.7%                      23.5%
without_expand   3.11            6.9%                       30.0%
expand_freq      3.12            19.4%                      22.8%
expand_cona      3.11            15.6%                      21.1%
Table 2: The performance of different keyword expansion methods. While they share close average scores, both the keyword and the sentence irrelevance rates drop when applying expansion based on word co-occurrence.
Method           Overall Average Score   Relevant Positive Rate   Fluent Positive Rate
our method       4.27                    6.3%                     26.5%
CTRIP            4.25                    5.9%                     63.0%
Image2caption    3.67                    82.1%                    6.2%

Method           Imaginative Positive Rate   Touching Positive Rate   Impressive Positive Rate
our method       57.0%                       45.4%                    43.4%
CTRIP            27.7%                       36.0%                    36.6%
Image2caption    8.1%                        10.1%                    10.7%

Table 3: Human evaluation results of our method and two baselines. While the “relevant” aspect is dominated by Image2caption and CTRIP is stronger in the “fluent” aspect, our method significantly outperforms both baselines in the other aspects.

6 Experiments on Comparison

After optimizing the method through the pilot experiments described above, we compare the resulting model with an existing image caption generator and a poem generator in a large-scale experiment.

6.1 Experiment Setup

Test Image Data While we used randomly crawled images as test queries in the previous experiment, we report our results here on the existing Microsoft COCO dataset for image description tasks [Chen et al.2015] [Lin et al.2014]. We use 270 images (recognized as views) from the COCO validation set for the comparative analysis experiment.
Assessors To obtain rich feedback without user bias, 22 assessors from a variety of career fields are chosen, including: 1) 8 female and 14 male users; 2) 13 users with a bachelor's degree and 1 user with a master's or higher degree; 3) 11 users who prefer traditional poetry, 10 who prefer modern poetry, and 1 who prefers neither.
Baselines We compare with two state-of-the-art methods.

  • Image2caption Since we are trying to describe an image, a top-performing image captioning approach [Fang et al.2015] is chosen as a baseline according to the leaderboard on the COCO dataset. As the approach generates a one-line English sentence from a given image query, we manually translate the caption into Chinese and separate it into multiple lines (usually two) with appropriate Chinese grammar.

  • CTRIP At the final stage of our research, we noticed that CTRIP released a traditional poetry generation application in 2017 [Ctrip2017]. While we cannot find a corresponding publication, since it has a similar goal we choose their released service as the second baseline in the comparison experiment. 47 image queries failed in their poetry generation process; the specific reasons are unknown due to the lack of technical detail. To maintain fairness, we evaluate the performance of all three methods only on the other 223 images.

Figure 4: Example poems generated by the two baseline methods and our proposed method.

Evaluation Metrics Besides the 1-to-5 overall rating for each approach, we also ask the assessors to vote for the best method in terms of five different criteria: relevant, fluent, imaginative, touching, and impressive. We provide multi-choice check boxes for each aspect and ask the assessors to vote. We accumulate the votes from all assessors over all images and normalize the counts to percentages. The percentages of votes obtained are shown in Table 3.
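The tallying step above can be sketched as follows. This is a minimal illustration with hypothetical votes (the method and criterion names follow the paper, but the counts are invented, not the actual study data):

```python
from collections import defaultdict

# Hypothetical votes: each entry records that an assessor checked a
# method as "best" for a given criterion (multi-choice check boxes,
# so one assessor may vote for several methods per criterion).
votes = [
    ("relevant", "image2caption"), ("relevant", "ours"),
    ("fluent", "CTRIP"),
    ("imaginative", "ours"), ("imaginative", "CTRIP"),
    ("touching", "ours"),
    ("impressive", "ours"),
]

def vote_percentages(votes):
    """Accumulate votes per (criterion, method), then normalize each
    criterion's counts to percentages of the votes it received."""
    counts = defaultdict(lambda: defaultdict(int))
    for criterion, method in votes:
        counts[criterion][method] += 1
    percentages = {}
    for criterion, per_method in counts.items():
        total = sum(per_method.values())
        percentages[criterion] = {
            m: 100.0 * c / total for m, c in per_method.items()
        }
    return percentages

result = vote_percentages(votes)
print(result["imaginative"])  # {'ours': 50.0, 'CTRIP': 50.0}
```

Each criterion is normalized independently, so the percentages within one row of a table like Table 3 sum to 100.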

6.2 Experiment Results

As Table 3 indicates, both CTRIP and our method significantly outperform the image2caption method on all but the relevant measure, which indicates that describing images with poetry is more attractive. Our method is significantly better than the two baselines in terms of being imaginative, touching and impressive. This indicates that our proposed method is more effective at bringing objects and feelings into modern poetry, and thus our generated results are more imaginative and touching. Because the meaning of modern poems is fully understandable, readers’ attention can also be sustained longer, making our method the best in being impressive. These advantages come at the cost of some relevance compared to image2caption and some fluency compared to CTRIP.

We show three generated poems from one image in Figure 4. Image2caption describes images in a straightforward way and is thus the most relevant. The poem generated by CTRIP is fluent and has enjoyable rhythms, but it introduces the “pair of butterfly”, which has no natural association with the image in semantics or sentiment. Our poem is related to the image, but also brings feelings such as “loneliness” into the poem. When a user reads “Stroll the empty / The land becomes soft”, he or she may feel soft and hopeful too. This example shows that our modern poems are more likely to stimulate interesting ideas and generate emotional resonance.

In summary, while the “relevant” aspect is dominated by image2caption and CTRIP is stronger on “fluent”, our method significantly outperforms both baselines in the other three aspects related to being artistic. The best overall score is also achieved by our proposed method.

7 Conclusion and Future Work

This paper introduces an innovative idea for generating an artistic poem from an image. Our best proposed solution includes a hierarchical sentence-to-poem model for poem generation, deep learning based keyword extraction, and statistics based keyword expansion. A large-scale user study indicates that our generated modern poems earn much more favor than generated captions, because our generated poems are more imaginative, touching and impressive. Although the generated traditional poems are more fluent than our modern poems, our proposed method significantly outperforms them in all other three aspects.

This study is our first attempt to generate poetry from images. While some connection between image and poetry is leveraged, more can be done. For example, when we expand keywords, it is possible to verify whether the additional keywords have some connection with the image content; it may be desirable that they carry some sort of connection with the image. Similarly, once the keywords have been extracted, the poem generation step is decoupled from the image; it may also be desirable that the generated sentences correspond better to the image. We will investigate approaches for this in the future.


  • [Bernardi et al.2016] Bernardi, R.; Cakici, R.; Elliott, D.; Erdem, A.; Erdem, E.; Ikizler-Cinbis, N.; Keller, F.; Muscat, A.; and Plank, B. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res.(JAIR) 55:409–442.
  • [Borth et al.2013] Borth, D.; Ji, R.; Chen, T.; Breuel, T.; and Chang, S.-F. 2013. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM Multimedia, 223–232.
  • [Chen et al.2015] Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • [Ctrip2017] Ctrip. 2017. Ctrip Xiao Shi Ji.
  • [Devlin et al.2015] Devlin, J.; Cheng, H.; Fang, H.; Gupta, S.; Deng, L.; He, X.; Zweig, G.; and Mitchell, M. 2015. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809.
  • [Donahue et al.2015] Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634.
  • [Fang et al.2015] Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1473–1482.
  • [Fu et al.2015] Fu, J.; Mei, T.; Yang, K.; Lu, H.; and Rui, Y. 2015. Tagging personal photos with transfer deep learning. In WWW, 344–354.
  • [Ghazvininejad et al.2016] Ghazvininejad, M.; Shi, X.; Choi, Y.; and Knight, K. 2016. Generating topical poetry. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1183–1191.
  • [He, Zhou, and Jiang2012] He, J.; Zhou, M.; and Jiang, L. 2012. Generating chinese classical poems with statistical machine translation models. In AAAI.
  • [Jiang and Zhou2008] Jiang, L., and Zhou, M. 2008. Generating chinese couplets using a statistical mt approach. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, 377–384. Association for Computational Linguistics.
  • [Karpathy and Li2015] Karpathy, A., and Li, F. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 3128–3137.
  • [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
  • [Liu et al.2016] Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
  • [Manurung, Ritchie, and Thompson2012] Manurung, R.; Ritchie, G.; and Thompson, H. 2012. Using genetic algorithms to create meaningful poetic text. Journal of Experimental & Theoretical Artificial Intelligence 24(1):43–64.
  • [Manurung2004] Manurung, H. 2004. An evolutionary algorithm approach to poetry generation.
  • [Mikolov et al.2010] Mikolov, T.; Karafiát, M.; Burget, L.; Cernockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2,  3.
  • [Netzer et al.2009] Netzer, Y.; Gabay, D.; Goldberg, Y.; and Elhadad, M. 2009. Gaiku: Generating haiku with word associations norms. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, 32–39. Association for Computational Linguistics.
  • [Oliveira2012] Oliveira, H. G. 2012. Poetryme: a versatile platform for poetry generation. Computational Creativity, Concept Invention, and General Intelligence 1:21.
  • [Patterson et al.2014] Patterson, G.; Xu, C.; Su, H.; and Hays, J. 2014. The sun attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision 108(1-2):59–81.
  • [Schwarz, Berg, and Lensch2016] Schwarz, K.; Berg, T. L.; and Lensch, H. P. 2016. Auto-illustrating poems and songs with style. In Asian Conference on Computer Vision, 87–103. Springer.
  • [Shum, He, and Li2018] Shum, H.-Y.; He, X.; and Li, D. 2018. From eliza to xiaoice: Challenges and opportunities with social chatbots.
  • [Socher et al.2014] Socher, R.; Karpathy, A.; Le, Q. V.; Manning, C. D.; and Ng, A. Y. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL 2:207–218.
  • [Soto et al.2015] Soto, A. J.; Kiros, R.; Keselj, V.; and Milios, E. E. 2015. Machine learning meets visualization for extracting insights from text data. AI Matters 2(2):15–17.
  • [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1–9.
  • [Tosa, Obara, and Minoh2008] Tosa, N.; Obara, H.; and Minoh, M. 2008. Hitch haiku: An interactive supporting system for composing haiku poem. In International Conference on Entertainment Computing, 209–216. Springer.
  • [Wang et al.2016a] Wang, Q.; Luo, T.; Wang, D.; and Xing, C. 2016. Chinese song iambics generation with neural attention-based model. arXiv preprint arXiv:1604.06274.
  • [Wang et al.2016b] Wang, Z.; He, W.; Wu, H.; Wu, H.; Li, W.; Wang, H.; and Chen, E. 2016. Chinese poetry generation with planning based neural network. arXiv preprint arXiv:1610.09889.
  • [Wu, Tosa, and Nakatsu2009] Wu, X.; Tosa, N.; and Nakatsu, R. 2009. New hitch haiku: An interactive renku poem composition supporting tool applied for sightseeing navigation system. In International Conference on Entertainment Computing, 191–196. Springer.
  • [Yan et al.2013] Yan, R.; Jiang, H.; Lapata, M.; Lin, S.-D.; Lv, X.; and Li, X. 2013. i, poet: Automatic chinese poetry composition through a generative summarization framework under constrained optimization. In IJCAI.
  • [Yan2016] Yan, R. 2016. i, poet: Automatic poetry composition through recurrent neural networks with iterative polishing schema. IJCAI.
  • [Yi, Li, and Sun2016] Yi, X.; Li, R.; and Sun, M. 2016. Generating chinese classical poems with rnn encoder-decoder. arXiv preprint arXiv:1604.01537.
  • [Zhang and Lapata2014] Zhang, X., and Lapata, M. 2014. Chinese poetry generation with recurrent neural networks. In EMNLP, 670–680.