E-commerce companies such as Taobao and Amazon invest heavily in improving the user experience of their mobile apps. To rank higher in search engine results, merchants usually write lengthy, over-informative, and sometimes incorrect titles. For example, an original product title may contain more than 20 Chinese words, which may be suitable for PCs. However, such titles are truncated on a mobile phone, where no more than 10 words can be displayed on a screen of only 4 to 5.8 inches. Hence, to display products properly on mobile screens, it is important to produce succinct short titles that preserve the important information of the original long titles and describe the products accurately.
This problem is related to text summarization, whose methods fall into two classes: extractive [3, 11, 10] and abstractive [5, 4, 13]. Extractive methods select important words from the original titles, while abstractive methods generate titles by extracting words from the original titles or generating new words from a data corpus. Both usually approximate this goal by predicting the next word given the previously predicted words under a maximum likelihood estimation (MLE) objective. Despite their considerable success, they suffer from exposure bias, which may cause the models to behave in undesired ways, e.g., generating repetitive or truncated outputs. In addition, predicting each next word from previously generated words leaves the learned model without a human-like holistic view of the whole generated short product title.
More recent state-of-the-art methods [6, 16] treat short product title generation as a sentence compression task following an attention-based extractive mechanism. They extract key characteristics mainly from the original long product titles. However, in real E-commerce scenarios, product titles are usually redundant and over-informative, and sometimes even inaccurate; e.g., the long title of a piece of clothing may include both “嘻哈狂野 (hip-hop wild)” and “文艺淑女 (artsy delicate)” simultaneously. It is very hard to generate succinct and accurate short titles by relying on the original titles alone. Therefore, it is insufficient to regard short title generation as a traditional text summarization problem in which the original text already contains complete information.
In this paper, we propose a novel Multi-Modal Generative Adversarial Network, named MM-GAN, to better generate short product titles. It includes a generator and a discriminator. The generator produces a short product title based on the original long title together with additional information from the corresponding visual image and attribute tags, while the discriminator distinguishes, from a human-like view, whether a generated short title is human-produced or machine-produced. The task is treated as a reinforcement learning problem, in which the quality of a machine-generated short product title depends on its ability to fool the discriminator into believing it was written by a human, and the output of the discriminator serves as a reward for the generator to improve the quality of its output.
We highlight that our model can: (a) incorporate the image and attribute tags, in addition to the original long product titles, into the generator. To the best of our knowledge, this is the first attempt to design a multi-modal model that considers multiple modalities of inputs for better short product title generation in E-commerce; (b) generate short product titles in a holistic manner.
2 Multi-Modal Generative Adversarial Network
In this section, we describe the proposed MM-GAN in detail. The problem can be formulated as follows: given an original long product title $L = \{l_1, \dots, l_K\}$ consisting of $K$ Chinese or English words, where a single word can take a form like “skirt” in English or “半身裙” in Chinese, together with an additional image $I$ and attribute tags $A = \{a_1, \dots, a_M\}$, the model aims to generate a human-like short product title $S = \{s_1, \dots, s_N\}$, where $K$ and $N$ are the number of words in $L$ and $S$, respectively.
2.1 Multi-Modal Generator
The multi-modal generative model $G_\theta$ defines a policy $\pi_\theta$ of generating short product titles $S$ given original long titles $L$, with additional information from the product image $I$ and attribute tags $A$. Fig. 1 illustrates the architecture of our proposed multi-modal generator, which follows the seq2seq [15] framework.
Multi-Modal Encoder. As mentioned before, our model incorporates multiple modalities of a product (i.e., image, attribute tags and long title). To learn the multi-modal embedding of a product, we first adopt a pre-trained VGG16 [14] as the CNN architecture to extract features $V = [v_1, \dots, v_Z]$ of an image $I$ from the condensed fully connected layers, where $Z$ is the number of latent features. To obtain more descriptive features, we fine-tune the last 3 layers of VGG16 on a supervised classification task over given classes of product images. Second, we encode the attribute tags $A$ into a fixed-length feature vector $U$, with $U = f_{\mathrm{fc}}(A)$, where $f_{\mathrm{fc}}(\cdot)$ denotes fully connected layers. Third, we encode the original long title $L$. Specifically, the features extracted from the original long title are $H = [h_1, \dots, h_K]$, where $h_k = f(l_k, h_{k-1})$. Here $f(\cdot)$ represents a non-linear function, and in this paper the LSTM unit [7] is adopted.
Decoder. The hidden state $d_t$ for the $t$-th target word in the short product title can be calculated as $d_t = f(d_{t-1}, s_{t-1}, \hat{c}_t)$. Here we adopt an attention mechanism [2] to capture important words from the original long title $L$. The context vector $\hat{c}_t$ is a weighted sum of the hidden states $H$, computed as $\hat{c}_t = \sum_{k=1}^{K} \alpha_{t,k} h_k$, where $\alpha_{t,k}$ is the contribution of the input word $l_k$ to the $t$-th target word, obtained with an alignment model $g$: $\alpha_{t,k} = \frac{\exp(g(d_{t-1}, h_k))}{\sum_{k'=1}^{K} \exp(g(d_{t-1}, h_{k'}))}$. After obtaining the features $U$, $V$ and $\hat{C}$ from $A$, $I$ and $L$, respectively, we concatenate them into the final feature vector $C = W[\hat{C}; V; U]$, where $W$ are learnable weights and $[\,;\,]$ denotes the concatenation operator. Finally, $C$ is fed into the LSTM-based decoder to predict the probability of generating each target word of the short product title. As sequence generation can be viewed as a sequential decision making process [1], we denote the whole generation process as a policy $\pi_\theta(S \mid L, I, A)$.
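The attention and fusion steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: a dot product stands in for the learned alignment model, the fusion is a single linear map over the concatenated features, and the function names (`attention_context`, `fuse`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_context(prev_state: Vector,
                      encoder_states: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoding step.

    Assumption: a dot-product score replaces the learned alignment model.
    """
    scores = [dot(prev_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of the encoder hidden states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


def fuse(context: Vector, image_feat: Vector, attr_feat: Vector,
         weight: List[Vector]) -> Vector:
    """Concatenate title, image and attribute features, then apply a linear map."""
    joined = context + image_feat + attr_feat
    return [dot(row, joined) for row in weight]


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder hidden states
    weights, context = attention_context([0.5, 0.5], states)
    fused = fuse(context, [0.2], [0.1, 0.3], [[0.1] * 5, [0.2] * 5])
    print(weights, context, fused)
```

In this toy input the third encoder state receives the largest attention weight because it has the highest score against the previous decoder state.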
2.2 Discriminator
The discriminator model $D_\phi$ is a binary classifier that takes a generated short product title as input and distinguishes whether it is human-generated or machine-generated. The short product title is encoded into a vector representation $h_S$ by a two-layer LSTM model and then fed to a two-way softmax function, which returns the probability of the input short product title being generated by a human: $D_\phi(S) = \mathrm{softmax}(W_D h_S + b_D)$, where $W_D$ is a weight matrix and $b_D$ is a bias.
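As a rough illustration of the final classification step, the sketch below applies a weight matrix and a bias to an already-encoded title vector and takes a two-way softmax. The LSTM encoder is omitted, and the class order (index 0 for machine, index 1 for human) and the function name `human_probability` are assumptions made for this example.

```python
import math
from typing import Sequence


def human_probability(title_vec: Sequence[float],
                      weight: Sequence[Sequence[float]],
                      bias: Sequence[float]) -> float:
    """Two-way softmax over [machine, human] logits; returns P(human).

    `weight` has two rows (one per class) and `bias` has two entries.
    """
    logits = [sum(w * x for w, x in zip(row, title_vec)) + b
              for row, b in zip(weight, bias)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    return exps[1] / sum(exps)


if __name__ == "__main__":
    vec = [0.3, -0.1, 0.8]  # toy encoding of a short title
    print(human_probability(vec, [[0.0] * 3, [1.0] * 3], [0.0, 0.0]))
```

With all-zero weights and biases the two classes are indistinguishable and the function returns 0.5.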
2.3 End-to-End Training
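The multi-modal generator $G_\theta$ tries to generate a sequence of tokens $\hat{S}$ under a policy $\pi_\theta$ and to fool the discriminator $D_\phi$ by maximizing the reward signal received from $D_\phi$. The objective of $G_\theta$ can be formulated as follows:
$$J(\theta) = \mathbb{E}_{\hat{S} \sim \pi_\theta(\cdot \mid L, I, A)}\left[ R_\phi(\hat{S}) \right],$$
where $R_\phi(\hat{S})$ is the reward assigned by $D_\phi$, and $\theta$ and $\phi$ are learnable parameters for $G_\theta$ and $D_\phi$, respectively.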
Conventionally, GANs are designed for generating continuous data, and thus $G_\theta$ is differentiable with continuous parameters guided by the objective function from $D_\phi$. Unfortunately, it is difficult to update the parameters through back-propagation when dealing with discrete data in text generation. To solve this problem, we adopt the REINFORCE algorithm [17]. Specifically, once the generator reaches the end of a sequence (i.e., $\hat{S} = \hat{S}_{1:N}$), it receives a reward $R_\phi(\hat{S})$ from $D_\phi$ based on the probability of the sequence being real.
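To make the REINFORCE update concrete, the following toy sketch trains a single-step softmax policy over a few discrete actions. It is a simplification under stated assumptions: a fixed reward table stands in for the discriminator's probability, the policy has no input, and no baseline is used; the function name `reinforce` and all hyperparameters are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards: List[float], steps: int = 5000, lr: float = 0.05,
              seed: int = 0) -> List[float]:
    """Train a softmax policy with REINFORCE and return its final probabilities.

    `rewards[a]` is a stand-in for the discriminator's score of action `a`.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a discrete action; sampling itself is not differentiable.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        for i in range(len(logits)):
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log
    return softmax(logits)


if __name__ == "__main__":
    print(reinforce([0.1, 0.9, 0.2]))
```

The update only needs the sampled action and its scalar reward, which is why this estimator suits discrete token generation where gradients cannot flow through the sampling step.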
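In text generation, $D_\phi$ provides a reward to $G_\theta$ only when the whole sequence has been generated, and no intermediate reward is obtained before the final token of $\hat{S}$ is generated. This may cause the discriminator to assign a low reward to all tokens in a sequence even though some tokens are proper results. To mitigate this issue, we utilize Monte Carlo (MC) search with $P$-time roll-outs [18] to assign rewards to intermediate tokens:
$$\{\hat{S}^{1}_{1:N}, \dots, \hat{S}^{P}_{1:N}\} = \mathrm{MC}^{\pi_\theta}(\hat{S}_{1:t}; P),$$
where $\hat{S}^{p}_{1:t} = \hat{S}_{1:t}$ and $\hat{S}^{p}_{t+1:N}$ is sampled based on the roll-out policy and the current state. The intermediate reward is now $R_\phi(\hat{S}_{1:t-1}, a = \hat{s}_t) = \frac{1}{P} \sum_{p=1}^{P} D_\phi(\hat{S}^{p}_{1:N})$, where $a$ is the action at the current state $\hat{S}_{1:t-1}$. We can now compute the gradient of the objective function for the generator $G_\theta$:
$$\nabla_\theta J(\theta) \approx \sum_{t=1}^{N} \mathbb{E}_{\hat{s}_t \sim \pi_\theta}\left[ R_\phi(\hat{S}_{1:t-1}, a = \hat{s}_t)\, \nabla_\theta \log \pi_\theta(\hat{s}_t \mid \hat{S}_{1:t-1}, L, I, A) \right],$$
where $\nabla_\theta$ is the partial differential operator for $\theta$ in $G_\theta$, and $R_\phi$ is fixed while the generator is updated.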
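The roll-out idea can be sketched as follows: a partial sequence is completed several times with a sampling policy, each completion is scored, and the average score is used as the reward for the prefix. This is an illustrative sketch only; it assumes fixed-length sequences (no end-of-sequence token), and `sample_next` and `score` are hypothetical stand-ins for the roll-out policy and the discriminator.

```python
import random
from typing import Callable, List, Sequence

Token = str


def rollout_reward(prefix: Sequence[Token],
                   max_len: int,
                   sample_next: Callable[[List[Token], random.Random], Token],
                   score: Callable[[List[Token]], float],
                   n_rollouts: int = 16,
                   seed: int = 0) -> float:
    """Estimate the reward of a prefix by averaging scores of sampled completions."""
    rng = random.Random(seed)
    if len(prefix) >= max_len:
        # The sequence is already complete, so it is scored directly.
        return score(list(prefix))
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < max_len:
            seq.append(sample_next(seq, rng))
        total += score(seq)
    return total / n_rollouts


if __name__ == "__main__":
    vocab = ["retro", "shirt", "wild"]

    def uniform_policy(seq: List[Token], rng: random.Random) -> Token:
        return rng.choice(vocab)  # toy roll-out policy

    def toy_score(seq: List[Token]) -> float:
        # Toy discriminator: fraction of tokens that are not "wild".
        return sum(tok != "wild" for tok in seq) / len(seq)

    print(rollout_reward(["retro"], 3, uniform_policy, toy_score))
    print(rollout_reward(["wild"], 3, uniform_policy, toy_score))
```

In the toy run, the prefix that starts with an irrelevant word receives a lower estimated reward than the one that starts with a core word, even though neither sequence is finished.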
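The objective function for the discriminator $D_\phi$, which is maximized with respect to $\phi$, can be formulated as:
$$J(\phi) = \mathbb{E}_{S \sim p_{\mathrm{data}}}\left[ \log D_\phi(S) \right] + \mathbb{E}_{\hat{S} \sim \pi_\theta}\left[ \log\left(1 - D_\phi(\hat{S})\right) \right],$$
where $p_{\mathrm{data}}$ denotes the distribution of human-written short product titles.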
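A binary classifier of this kind is commonly trained with a cross-entropy objective. The sketch below assumes that standard form and computes it as a loss to minimize, given the discriminator's predicted probability of "human" on human-written and on generated titles; the function name `discriminator_loss` is hypothetical.

```python
import math
from typing import Sequence


def discriminator_loss(p_human_on_real: Sequence[float],
                       p_human_on_generated: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Binary cross-entropy: real titles should score high, generated ones low."""
    real_term = sum(math.log(p + eps) for p in p_human_on_real)
    real_term /= len(p_human_on_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in p_human_on_generated)
    fake_term /= len(p_human_on_generated)
    # Negate so that a better discriminator has a lower loss.
    return -(real_term + fake_term)


if __name__ == "__main__":
    print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))  # confident and correct
    print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))  # undecided
```

An undecided discriminator that always outputs 0.5 has a loss of about 2 ln 2, and the loss falls as its predictions separate the two classes.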
3 Experiments
3.1 Experimental Setup
Dataset. We train our model on the LESD4EC dataset [6], which consists of more than 6M products from the Taobao platform. Each product includes a long product title and a short product title written by professional writers, along with a high-quality image and attribute tags. We randomly select a portion of the data as the training set and divide the rest equally between the validation set and the test set.
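A random split of this shape can be sketched as below. The training fraction is left as a parameter because the exact ratio is not given here; the value 0.8 in the usage example and the function name `split_dataset` are purely illustrative.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T], train_frac: float,
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle, take `train_frac` for training, halve the rest into val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    rest = shuffled[n_train:]
    half = len(rest) // 2
    return shuffled[:n_train], rest[:half], rest[half:]


if __name__ == "__main__":
    # 0.8 is an illustrative fraction, not the ratio used in the paper.
    train, val, test = split_dataset(range(100), train_frac=0.8)
    print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible across runs.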
Baselines. We compare our proposed model with the following four baselines: (a) Pointer Network (Ptr-Net) [13], a seq2seq-based framework with a pointer-generator network that copies words from the source text via pointing. (b) Feature-Enriched-Net (FE-Net) [6], a deep and wide model based on an attentive RNN that generates short titles from the textual long product titles. (c) Agreement-based MTL (Agree-MTL) [16], a multi-task learning approach that improves product title compression with user search log data. (d) Generative Adversarial Network (GAN), a generative adversarial method for text generation with only a single input.
3.2 Automatic Evaluation
To evaluate the quality of the generated short product titles, we follow [16, 6] and use the standard recall-oriented ROUGE metric [9]. Experimental results on the test set are shown in Table 1. From this table, we note that our proposed MM-GAN achieves the best performance on all three metrics. Furthermore, comparing MM-GAN with GAN shows that additional information from the product, such as the image and attribute tags, helps our model generate better short titles. In addition, our proposed model also outperforms Agree-MTL, which can be explained from two aspects: (a) it incorporates multiple sources, which contain more information than the input of single-source models; (b) it applies a discriminator to distinguish whether a short product title is human-generated or machine-generated, which makes the model evaluate the generated sequence from a human-like view and naturally avoids the exposure bias of other methods.
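For readers unfamiliar with recall-oriented ROUGE, the sketch below computes ROUGE-N recall (clipped n-gram overlap divided by the number of reference n-grams) and ROUGE-L recall (longest common subsequence length divided by reference length) on tokenized titles. It is a simplified illustration rather than the official ROUGE package [9], and the example tokens are made up.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: Sequence[str], reference: Sequence[str],
                   n: int) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    cand = ngrams(candidate, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


def rouge_l_recall(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Longest common subsequence length divided by the reference length."""
    if not reference:
        return 0.0
    prev: List[int] = [0] * (len(reference) + 1)
    for c in candidate:
        cur = [0]
        for j, r in enumerate(reference, start=1):
            cur.append(prev[j - 1] + 1 if c == r else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(reference)


if __name__ == "__main__":
    reference = ["Artka", "retro", "shirt"]  # human-written short title
    candidate = ["Artka", "wild", "shirt"]   # generated short title
    print(rouge_n_recall(candidate, reference, 1))
    print(rouge_n_recall(candidate, reference, 2))
    print(rouge_l_recall(candidate, reference))
```

In the toy example the candidate recovers two of the three reference words but none of the reference bigrams, so unigram and LCS recall are 2/3 while bigram recall is 0.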
3.3 Case Study
Table 2 shows a sample of short product titles generated by MM-GAN and the baselines. From this table, we note that (a) the short product titles generated by our model are more fluent and informative than those of the baselines, and core product words (e.g., “Artka (阿卡)”, “复古 (retro)”, “衬衫 (shirt)”) can be recognized. (b) There are over-informative words (e.g., “阿卡 (Artka)”, “S110061Q”) and irrelevant words (e.g., “狂野 (wild)”) in the long product titles. Over-informative words may disturb the model's generation process, and irrelevant words may give incorrect information to the model. Such situations can happen in real E-commerce environments. FE-Net misses the English brand name “Artka” and gives its Chinese name “阿卡” instead. Agree-MTL, which uses user search log data, performs better than GAN. However, Agree-MTL still generates the over-informative word “阿卡”. MM-GAN outperforms all baselines: information in the additional attribute tags (such as “复古 (retro)” and “Artka”) and other information from the main product image are considered together by the model, helping it select core words and filter out irrelevant words in the generated short product titles. This shows that using different types of inputs helps MM-GAN generate better short product titles.
4 Conclusion
In this paper, we propose a multi-modal generative adversarial network for short product title generation in E-commerce. Unlike conventional methods, which only consider textual information from long product titles, we design a multi-modal generative model that incorporates additional information from the product image and attribute tags. Extensive experiments on a large real-world E-commerce dataset verify the effectiveness of our proposed model compared with several state-of-the-art baselines.
References
- [1] Philip Bachman and Doina Precup. Data generation as sequential decision making. In NIPS, pages 3249–3257, 2015.
- [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- [3] Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. Attsum: Joint learning of focusing and summarization with neural attention. In COLING, 2016.
- [4] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. Distraction-based neural networks for document summarization. In IJCAI, 2016.
- [5] Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL-HLT, pages 93–98, 2016.
- [6] Yu Gong, Xusheng Luo, Kenny Q Zhu, Shichen Liu, and Wenwu Ou. Automatic generation of Chinese short product titles for mobile display. In IAAI, 2018.
- [7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [8] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In EMNLP, 2017.
- [9] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
- [10] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP, 2016.
- [11] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pages 3075–3081, 2017.
- [12] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
- [13] Abigail See, Peter J Liu, and Christopher D Manning. Get to the point: Summarization with pointer-generator networks. In ACL, 2017.
- [14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [15] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
- [16] Jingang Wang, Junfeng Tian, Long Qiu, Sheng Li, Jun Lang, Luo Si, and Man Lan. A multi-task learning approach for improving product title compression with user search log data. In AAAI, 2018.
- [17] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- [18] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.