Smartphones and voice-activated smart speakers, such as Amazon Alexa, Google Home, and Apple Siri, have led to increased adoption of voice-enabled shopping experiences. In such experiences, reducing user friction and saving time is key, especially for low-consideration purchases and repeat grocery purchases. Display-based experiences typically use a short product title when presenting a product, but these short titles do not fit naturally into a conversational flow. For example, showing “Jergens Natural Glow Daily Moisturizer, Medium to Tan, 7.5 oz” as the short title of a product may be acceptable in a display-based experience, but it is not suitable for voice-based applications because it is not a naturally spoken title. At the same time, display-based experiences can show a lot of additional metadata, which is not possible in conversational systems. The product title for a conversational system needs to encapsulate the important information in a succinct, grammatically correct natural language sentence that fits the conversational flow during the dialogue rounds between the user and the system. E-commerce companies can have millions to billions of products in their ever-changing catalogs, so it is typically prohibitively expensive to manually annotate naturally spoken titles for all products. In such industry settings, it is ideal to have a model that can generate naturally spoken titles for an evolving catalog. The primary goal of this work is to examine the methods and challenges of converting short product titles into grammatically correct, naturally spoken language.
This problem is similar to the text summarization task, which is well studied in natural language processing (NLP). However, applying it to e-commerce is not trivial. Text summarization approaches can be classified into two broad sub-categories: extractive and abstractive text summarization. Extractive approaches usually try to extract a few sentences (or keywords) from lengthy documents [4, 10, 9]. Because they only select existing text, they cannot insert words that are absent from the input. To generate natural language titles from web titles, however, a model must be able to generate conjunctions, articles, etc. at the appropriate positions.
Abstractive text summarization attempts to understand the content of the document and produce summaries that may contain novel words or phrases. Sequence-to-sequence (seq2seq) models based on recurrent neural networks (RNNs) [5, 2], and more recently developed attention models, have been shown to perform well on abstractive summarization tasks. However, these models tend to generate repetitive words, which can cause a negative customer experience in voice e-commerce applications. In traditional text summarization, a summary may introduce novel words that are not part of the input sequence, but in an e-commerce setting, paraphrasing factual details such as brand or quantity may imply a different product. In addition, new products are continuously added to e-commerce catalogs, which can introduce out-of-vocabulary words. The summarization model should be able to generalize to these words.
In this paper, we investigate the application of text summarization techniques to voice-based e-commerce. Our major contributions are:
- We adapt different state-of-the-art NLP models to a real-world e-commerce dataset with limited labels.
- We perform extensive evaluation of these models on established evaluation metrics, as well as metrics relevant to our application. We also perform human judgement evaluation.
The following sections provide a summary of related work, followed by a description of the methods applied to convert web-based short product titles (sequences of English words) into more naturally spoken summary titles (also sequences of English words) for voice-based applications. In our problem setting, we are mainly interested in building an abstractive text summarization model that can generate novel words in the decoded summary. Section 4 describes the salient features of the dataset and the implementation details of the methods. This is followed by a discussion of the results and the conclusion.
2 Related Work
Text summarization is a long-studied problem in natural language processing. With the advent of deep learning based approaches, seq2seq models have proven highly successful in abstractive text summarization. Some of these models and relevant developments in the field are mentioned below.
Ptr-Net, the pointer-generator network [12], builds on the seq2seq model with attention for the summarization task. It uses the concept of the pointer network introduced by Vinyals et al. (2015) to decide which words from the main text should be copied directly to the summary. This helps preserve important factual information from the input text and also assists in handling out-of-vocabulary words. The Ptr-Net model also adds a coverage loss, which compares the current attention with the attention given to previously generated words, in an attempt to fix word repetition, a persistent issue in seq2seq models. Gehrmann et al. (2018) try to improve the fluency of the generated text through various constraints applied during model training. Soft constraints on the size of the text constrain the length of generated descriptions, while constraints on the output probability distribution of words ameliorate word repetition.
Developments in language models have subsequently led to increased use of pre-trained models such as BERT [3] and GPT-2 [11], which are trained on huge text corpora and are used to generate word embeddings for input texts. While Khandelwal et al. (2019) examine the feasibility of pre-trained language models in a low-data setting and move away from the seq2seq framework, Liu and Lapata [7] use BERT in a seq2seq model to summarize data. Details of this model are discussed further in Section 3. In our study, this is the primary model adapted for our use case.
Text summarization finds natural application in e-commerce, where products may have long descriptions but the end user is interested only in the salient features of a particular product. Increased interaction with mobile devices and voice-based interfaces such as Amazon Alexa and Google Home presents new challenges, as product titles now need to be succinct. Rule-based methods have been developed, as in Camargo de Souza et al. (2018). Deep learning based methods also find natural application and attempt to use multimodal information (images and text) to generate product titles. Chen et al. (2019) generate personalized product titles using user personas and an external knowledge base. Mathur et al. (2018) generate titles in different languages for the same product. Sun et al. (2018) build on Ptr-Net, using a separate encoder network for important attributes such as product quantity and brand, and then use two pointers to decide where to copy data from. Zhang et al. (2019) propose a method for generating short descriptions for e-commerce that uses multimodal information: a word sequence describing the product and the image of the product in the catalog are chosen as descriptors and are used to generate short titles. An adversarial network is then used to decide whether a generated title is machine or human generated, in order to improve the quality of the generated titles and make them more human-like.
In this section, we formulate the problem of automatic natural language title generation and discuss various approaches that can be used to solve this problem.
Problem Definition: The goal of this task is to build a system that can automatically generate natural language product titles which are easily interpretable in a voice-enabled shopping experience. Given the short web title represented as a sequence of words $W = (w_1, \dots, w_n)$, the goal of this system is to generate the corresponding natural language title $W^* = (w_1^*, \dots, w_T^*)$.
In the following sub-sections, we discuss various models which we apply for the automatic natural language title generation task.
3.1 seq2seq + Attention
Consider a sequence of input tokens $w_1, \dots, w_n$ fed into an encoder (LSTM), producing a sequence of encoder hidden states $h_1, \dots, h_n$. The decoder receives the word embedding of the previous word and has a decoder state $s_t$ at time step $t$. The attention distribution $a^t$ is computed as shown in the following equations:

$$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}) \tag{1}$$

$$a^t = \mathrm{softmax}(e^t) \tag{2}$$

where $v$, $W_h$, $W_s$, and $b_{attn}$ are learnable parameters. The attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h_t^*$, computed by:

$$h_t^* = \sum_i a_i^t h_i \tag{3}$$

The context vector $h_t^*$ and the decoder state $s_t$ are fed through linear layers to get the vocabulary distribution $P_{vocab}$. The network is trained end-to-end using the negative log-likelihood of the target word $w_t^*$ at each time step.
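As an illustration of the attention step described above, the following is a minimal pure-Python sketch of one decoder time step. The function names (`attention_step`, `matvec`, `softmax`) and the toy parameter values are assumptions made for this example and are not taken from the paper's implementation; a real model would use learned weights and a tensor library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_step(enc_states, dec_state, W_h, W_s, b_attn, v):
    """Return the attention distribution a^t and the context vector h*_t."""
    Ws_s = matvec(W_s, dec_state)
    scores = []
    for h_i in enc_states:
        Wh_h = matvec(W_h, h_i)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(Wh_h, Ws_s, b_attn)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))  # e_i^t
    attn = softmax(scores)                                        # a^t
    dim = len(enc_states[0])
    context = [sum(a * h[d] for a, h in zip(attn, enc_states)) for d in range(dim)]
    return attn, context

# Toy example: three encoder states of size 2 and hypothetical parameters.
identity = [[1.0, 0.0], [0.0, 1.0]]
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, context = attention_step(enc, [0.5, -0.5], identity, identity, [0.0, 0.0], [1.0, 1.0])
print(attn, context)
```

The attention weights sum to one, so the context vector is a convex combination of the encoder hidden states.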
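Ptr-Net [12] is a hybrid between the baseline seq2seq model with attention and a pointer network. In this model, the generation probability $p_{gen}$ (the probability of generating a word from the vocabulary rather than copying one from the input) depends on the context vector $h_t^*$, which is itself built from the attention distribution $a^t$, together with the decoder state $s_t$ and the decoder input $x_t$. Following the pointer-generator formulation, it is computed as follows:

$$p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr}) \tag{4}$$

where $w_{h^*}$, $w_s$, $w_x$, and $b_{ptr}$ are learnable parameters. $p_{gen}$ is used to choose between generating a word from the vocabulary by sampling from $P_{vocab}$, or copying a word from the input sequence by sampling from the attention distribution $a^t$. The final probability distribution over the vocabulary is computed by:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t \tag{5}$$

The model is trained end-to-end, similarly to seq2seq + attention, using the negative log-likelihood of the target word $w_t^*$ as the loss function:

$$\mathrm{loss}_t = -\log P(w_t^*) \tag{6}$$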
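To make the copy mechanism concrete, here is a small sketch of how a generation distribution and an attention distribution can be mixed into one distribution over words, including source words that are missing from the fixed vocabulary. The function names and the numbers in the toy example are hypothetical and chosen only for illustration.

```python
import math

def final_distribution(p_gen, p_vocab, attn, source_tokens):
    """Mix generation and copy probabilities into a single distribution.

    p_vocab: dict mapping each vocabulary word to its generation probability.
    attn: attention weights aligned with source_tokens (sums to 1).
    """
    final = {word: p_gen * prob for word, prob in p_vocab.items()}
    for token, weight in zip(source_tokens, attn):
        # Copy mass is added per source position, so repeated or
        # out-of-vocabulary source words accumulate probability here.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

def nll(final, target_word):
    """Negative log-likelihood of the target word under the mixed distribution."""
    return -math.log(final[target_word])

# Toy example: "paas" is out of vocabulary but can still be produced by copying.
p_vocab = {"a": 0.5, "bag": 0.3, "of": 0.2}
source = ["paas", "egg", "kit", "bag"]
attn = [0.4, 0.3, 0.2, 0.1]
dist = final_distribution(0.6, p_vocab, attn, source)
print(dist["paas"], dist["bag"], nll(dist, "paas"))
```

Because both input distributions sum to one, the mixture also sums to one for any generation probability between 0 and 1.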
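Transformers [14] are attention-based models in which the relationship between a given word and its context is modeled through multi-head attention. Each layer in a transformer consists of multi-head attention ($\mathrm{MHAtt}$) followed by a layer norm ($\mathrm{LN}$) and a feed-forward network ($\mathrm{FFN}$), as shown in Equation 7.

$$\begin{aligned} \tilde{h}^l &= \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1})) \\ h^l &= \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l)) \end{aligned} \tag{7}$$

Here $h^{l-1}$ is the output of the previous layer and $h^0$ is the input representation. The final layer representation from the encoder is given to the decoder, and the decoder is trained using the negative log-likelihood of the target word $w_t^*$.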
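The layer structure (attention, residual connection, layer norm, feed-forward network, residual connection, layer norm) can be sketched in a few lines. To keep the example short, this sketch makes several simplifying assumptions that differ from a real transformer: it uses a single attention head without learned query, key, or value projections, a layer norm without learned gain and bias, and hypothetical toy weights for the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def self_attention(H):
    """Single-head scaled dot-product self-attention without projections."""
    d = len(H[0])
    out = []
    for q in H:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, H)) for j in range(d)])
    return out

def feed_forward(x, W1, W2):
    """Two linear maps with a ReLU in between (biases omitted)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * hi for w, hi in zip(row, hidden)) for row in W2]

def transformer_layer(H, W1, W2):
    """Attention sub-layer then feed-forward sub-layer, each with residual + layer norm."""
    attended = self_attention(H)
    H_tilde = [layer_norm([a + b for a, b in zip(h, att)]) for h, att in zip(H, attended)]
    return [layer_norm([a + b for a, b in zip(ht, feed_forward(ht, W1, W2))]) for ht in H_tilde]

# Toy example: two positions with hidden size 3 and a feed-forward size of 4.
H0 = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.3, 0.1, -0.2], [0.05, 0.05, 0.05]]
W2 = [[0.1, 0.0, -0.1, 0.2], [0.2, 0.1, 0.0, -0.1], [0.0, 0.3, 0.1, 0.1]]
print(transformer_layer(H0, W1, W2))
```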
We use BERT (Bidirectional Encoder Representations from Transformers) [3] to encode the web titles. BERT is trained on large text corpora using unsupervised tasks (masked language modelling and next sentence prediction). BERT uses token, segment, and position embeddings to represent input tokens. Segment embeddings are used in pairwise tasks to differentiate between segments (e.g., question and answer in SQuAD tasks). This input representation is then fed into multiple transformer layers.
Liu and Lapata [7] modify the BERT formulation for the text summarization task. A [CLS] token is used at the start of each sentence, and the representation of this token is used as the sentence representation. $E_A$ and $E_B$ are used as segment embeddings for odd and even sentences, respectively. Position embeddings for sentences longer than 500 words are learnt as model parameters. We adopt this format to represent our data. For our use case, web product titles are one sentence long, hence the segment embedding $E_B$ is not used.
We insert [CLS] and [SEP] tokens into the input web title as shown in Figure 1. The input representation for the transformer is then prepared by adding the position ($E_{pos}$), token ($E_{tok}$), and segment ($E_A$) embeddings for the corresponding words:

$$r_i = E_{tok}(w_i) + E_{pos}(i) + E_A$$
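The input preparation can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are whole words rather than BERT subwords, the embedding tables are tiny hypothetical lookups, and the layer normalization that BERT applies after summing the embeddings is omitted. The name `build_input` is introduced only for this example.

```python
def build_input(tokens, token_emb, pos_emb, seg_emb):
    """Wrap a tokenized web title in [CLS]/[SEP] and sum the three embeddings per position."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    reps = []
    for i, tok in enumerate(wrapped):
        reps.append([t + p + s for t, p, s in zip(token_emb[tok], pos_emb[i], seg_emb)])
    return wrapped, reps

# Hypothetical 2-dimensional embeddings for a two-word title.
token_emb = {
    "[CLS]": [0.1, 0.0],
    "[SEP]": [0.0, 0.1],
    "white": [1.0, 0.0],
    "onions": [0.0, 1.0],
}
pos_emb = [[0.01 * i, 0.0] for i in range(8)]  # one vector per position
seg_emb = [0.5, 0.5]                           # single segment for a one-sentence title

wrapped, reps = build_input(["white", "onions"], token_emb, pos_emb, seg_emb)
print(wrapped)
print(reps)
```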
The encoder then transforms this input representation using transformer layers, each of which applies the transformation shown in Equation 7 to the input representation $r$. In Equation 7, $h^0$ represents the input representation $r$. The continuous representation from the final layer of BERT is then given to the decoder. While pretrained BERT is used as the encoder, a randomly initialized 8-layer transformer is used as the decoder. We train the decoder to generate summaries from the true labels in the ground truth data using the abstractive summarization framework described above, without the coverage and copy mechanisms. We use the Adam optimizer with the following learning rate schedule for the encoder ($E$) and decoder ($D$), which follows the warmup schedule of Liu and Lapata [7]:

$$lr_E = \tilde{lr}_E \cdot \min(\mathrm{step}^{-0.5}, \mathrm{step} \cdot \mathrm{warmup}_E^{-1.5})$$

$$lr_D = \tilde{lr}_D \cdot \min(\mathrm{step}^{-0.5}, \mathrm{step} \cdot \mathrm{warmup}_D^{-1.5})$$
We set , , , and .
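A warmup schedule of this kind can be sketched as below, assuming the form used by Liu and Lapata, in which the rate rises linearly during warmup and then decays with the inverse square root of the step. The base rates and warmup values in the example are illustrative placeholders only and are not the settings used in this work.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate at a given step (step must be >= 1)."""
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)

# Placeholder values: separate schedules let a pretrained encoder
# and a randomly initialized decoder be updated at different rates.
encoder = {"base_lr": 0.002, "warmup_steps": 20000}
decoder = {"base_lr": 0.1, "warmup_steps": 10000}

for step in (1000, 10000, 20000, 35000):
    lr_e = warmup_lr(step, **encoder)
    lr_d = warmup_lr(step, **decoder)
    print(step, lr_e, lr_d)
```

The two terms inside `min` are equal exactly at `step == warmup_steps`, which is where the rate peaks.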
4 Experimental Setup
We use a proprietary dataset from one of the largest e-commerce retailers in the world. Our dataset consists of only 19,269 pairs of web product titles and their corresponding voice titles. Details of the dataset are provided in Table 2. The web product titles are either entered by merchants or algorithmically generated for certain categories when products are published on the e-commerce website. The voice titles for the corresponding web titles are created manually by human annotators through a crowdsourcing platform. Some examples of web titles and corresponding voice titles are shown in Table 1.
Table 1: Examples of web titles and corresponding voice titles.
| Web titles (Input Sequence of Words) | Voice titles (Output Sequence of Words) |
|---|---|
| El Monterey Beef & Cheese Burritos 8 ct bag a family size pack El Monterey Beef and cheese | a family size pack of 8 El Monterey Frozen Beef And Cheese Burrito |
| Paas Magical Color Cup Egg Decorating Kit a pack Paas Magical Color Cup | a pack of Paas Magical Color Cup Egg Decorating Kit |
| Wonderful Roasted & Salted Pistachios 8 oz. Bag a bag Wonderful Roasted and salted | an 8 ounce bag of Wonderful Roasted And Salted Pistachios |
There are certain key differences in the characteristics of web titles and voice titles. Important distinctions are listed below:
- Web titles often contain abbreviations for units of measurement for succinctness, e.g., Row 3 in Table 1 mentions “8 oz. Bag”. However, the voice title should contain the corresponding natural language word “ounce”.
- Web titles may or may not contain articles, but voice titles need to have grammatically correct articles, conjunctions, etc. For example, refer to Row 1 in Table 1.
- Web titles sometimes contain specific product attributes, such as brand or quantity. These product attributes may have altered positions in the voice title, but the attribute phrase needs to be retained exactly and in its entirety.
- As shown in Table 2, the average voice title length is 11.39 tokens, compared with 15.34 tokens for web titles. Voice titles need to be short and succinct, as this information is spoken through a voice device to the end user.
Table 2: Dataset statistics.
| Statistic | Value |
|---|---|
| avg. web title length | 15.3352 |
| avg. voice title length | 11.3886 |
| avg. # unique words in web title | 13.0805 |
| avg. # unique words in voice titles | 11.3009 |
| avg. # of novel one grams in voice title | 4.0138 |
| # train examples | 13874 |
| # val examples | 1926 |
| # test examples | 3469 |
We observed that web titles for a few products do not contain specific details such as brand, hence we append additional product metadata to the web title when available. This metadata contains attributes such as brand, container type, and size.
The dataset is randomly partitioned into 13,874 train examples, 1,926 validation examples, and 3,469 test examples.
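The two dataset operations mentioned above, counting novel one-grams and randomly partitioning the pairs, can be sketched as follows. The tokenization (lowercasing and splitting on whitespace), the fixed random seed, and the function names are assumptions made for this example; the preprocessing actually used for Table 2 is not specified here.

```python
import random

def novel_unigrams(web_title, voice_title):
    """Words of the voice title that do not occur in the web title."""
    web_words = set(web_title.lower().split())
    return [w for w in voice_title.lower().split() if w not in web_words]

def split_dataset(pairs, n_train, n_val, n_test, seed=0):
    """Shuffle the pairs and cut them into train, validation, and test sets."""
    if n_train + n_val + n_test != len(pairs):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Row 3 of Table 1: the voice title introduces "an", "ounce", and "of".
web = "Wonderful Roasted & Salted Pistachios 8 oz. Bag a bag Wonderful Roasted and salted"
voice = "an 8 ounce bag of Wonderful Roasted And Salted Pistachios"
print(novel_unigrams(web, voice))

# Stand-in integer ids in place of the proprietary title pairs.
train, val, test = split_dataset(range(19269), 13874, 1926, 3469)
print(len(train), len(val), len(test))
```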
4.2 Implementation Details
We use the PyTorch ‘bert-base-uncased’ version of BERT for the encoder along with the subword tokenizer (https://github.com/nlpyang/PreSumm/). In the decoder, the transformer has 768 hidden units and 8 layers, while the feed-forward layers have size 2,048. The learning rate is as described in Section 3, the batch size is 256, and the model was trained for 35,000 steps. We used beam search for decoding, which continues until the end-of-sequence token is emitted. We also block repeated trigrams. We use a minimum length of 4 and a maximum length of 50 for decoding. A checkpoint was saved every 2,000 steps, and the checkpoint that performed best on the validation data was used to report performance on the test data.
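Trigram blocking during decoding can be illustrated with a small check that a beam-search implementation might run before extending a hypothesis. The function name and the way the check would be wired into beam search are assumptions for this sketch, not a description of the PreSumm code.

```python
def repeats_trigram(prefix, candidate):
    """Return True if appending `candidate` to `prefix` would repeat a trigram."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen

# A hypothesis that already contains "a bag of" may not produce it again.
prefix = ["a", "bag", "of", "a", "bag"]
print(repeats_trigram(prefix, "of"))      # this extension would be blocked
print(repeats_trigram(prefix, "onions"))  # this extension is allowed
```

Note that this only prevents repeated three-word sequences; a single word can still be repeated, as long as no trigram recurs.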
We compare the BERT model with seq2seq, Ptr-Net, Ptr-Net + Coverage, and the Transformer model as baselines. For seq2seq, Ptr-Net, and Ptr-Net + Coverage, the implementation of See et al. [12] (https://github.com/abisee/pointer-generator) was used to generate the results.
The implementation details of the various baselines are provided below:
- seq2seq: The Stanford CoreNLP PTBTokenizer is used to tokenize the data, which is converted into the story format used by the popular CNN/Daily Mail dataset [8] for text summarization. We use the authors' default parameters and change the maximum encoder length to 50 and the maximum decoder length to 35, as those are the maximum lengths of the augmented web titles and the voice titles, respectively. When decoding test data, beam search with a beam size of 4 is used to generate the predicted title. A minimum length of 5 is set for the predicted title.
- Ptr-Net: The pointer network uses the same parameters as the seq2seq model, and the validation set is used to identify the optimal training checkpoint for the model.
- Ptr-Net + Coverage: We use the authors' default implementation and train the model in a two-step process. First, the pointer network model is trained without any coverage loss, and the best model is selected using the validation loss. We then add the coverage loss term and continue training, using the previously selected best model as a warm start.
- Transformer: We use a 6-layer transformer encoder with a hidden size of 512 and a 2,048-dimensional feed-forward layer. For the decoder, we use the same configuration as for BERT. The learning rate and other hyper-parameters are obtained from Liu and Lapata [7].
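For the Ptr-Net + Coverage baseline, the coverage term can be sketched as below, assuming the pointer-generator definition in which a coverage vector accumulates the attention of all earlier decoder steps and the loss at each step sums the element-wise minimum of the current attention and that coverage. The function name and the toy attention values are hypothetical.

```python
def coverage_losses(attn_steps):
    """Per-step coverage loss for a list of attention distributions."""
    coverage = [0.0] * len(attn_steps[0])
    losses = []
    for attn in attn_steps:
        # Penalize attention that falls on positions already attended to.
        losses.append(sum(min(a, c) for a, c in zip(attn, coverage)))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return losses

# The second step attends to the same positions as the first and is penalized
# heavily; the third step mostly attends to a new position.
steps = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
print(coverage_losses(steps))
```

In the two-step procedure above, this term would only be added to the training objective in the second step.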
4.3 Evaluation Metrics
We use the following metrics to evaluate the proposed model and the baselines:
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [6]: measures the overlap between the candidate summary and the ground truth using precision, recall, and F-1 scores. However, ROUGE does not give a clear idea about repetitions or duplicates in the generated summary. We report F-1 ROUGE scores for 1, 2, and L (Longest Common Subsequence).
  - ROUGE-1 refers to the overlap of unigrams.
  - ROUGE-2 refers to the overlap of bigrams.
  - ROUGE-L takes sentence-level structure similarity into account and automatically identifies the longest co-occurring in-sequence n-grams.
- Avg. # duplicate 1-grams: the number of duplicate one-grams between the ground truth summary and the candidate summary.
- Human Evaluation: Three human annotators were given 100 random samples of model-output titles and were asked to rate each title on a scale of 1-5 based on product relevance (i.e., whether the product is the same), grammatical correctness, and correct preservation of important attributes (such as brand name, quantity, and unit of measurement). The average score of the annotators is taken as the judgement score for the model. This evaluation was performed only for the Transformer and BERT Abstractive model outputs.
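The automatic metrics can be approximated with the short sketch below. It computes plain F-1 scores over whitespace tokens, so its numbers may differ from the official ROUGE package, which applies its own tokenization and options. The `repeated_unigrams` function reflects one possible reading of the duplicate metric (extra occurrences of a word within the candidate title) and is an assumption, not the paper's exact definition.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def f1(overlap, n_candidate, n_reference):
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def rouge_n_f1(candidate, reference, n):
    """F-1 of clipped n-gram overlap between candidate and reference tokens."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

def repeated_unigrams(candidate):
    """Extra occurrences of words within the candidate (assumed definition)."""
    return sum(count - 1 for count in Counter(candidate).values())

# Row 4 of Table 4: the prediction adds the word "frosted".
reference = "a pack of 6 hostess donettes mini donuts".split()
candidate = "a pack of 6 hostess donettes frosted mini donuts".split()
print(rouge_n_f1(candidate, reference, 1))
print(rouge_n_f1(candidate, reference, 2))
print(rouge_l_f1(candidate, reference))

# Row 2 of Table 5: "produce" appears three times in the prediction.
print(repeated_unigrams("a 7 ounce pack of produce produce produce baby food blend".split()))
```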
5 Results and Observations
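Table 3: Model performance on the evaluation metrics (X = not evaluated).
| Method | R-1 | R-2 | R-L | avg. # of duplicates | Human Evaluation |
|---|---|---|---|---|---|
| seq2seq + attention | 0.7951 | 0.6607 | 0.7883 | 0.257135 | X |
| Ptr-Net with Coverage | 0.8917 | 0.8042 | 0.8800 | 0.3041 | X |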
Table 3 provides a summary of model performance on the different evaluation metrics. We observe that the Transformer and BERT models outperform the seq2seq-based approaches in terms of both ROUGE and the avg. # of duplicates metric. The Ptr-Net model outperforms the basic seq2seq + attention approach. Paradoxically, however, Ptr-Net with coverage underperforms compared to the Ptr-Net model. The coverage loss is meant specifically to address word repetition, but instead of fixing it, it leads to an increase in the avg. # of duplicates to 0.3041. While the Transformer model does have a better ROUGE score than the BERT model, BERT produces fewer repeated words in its output, which has a greater qualitative impact on the title. Since BERT is pre-trained on a large corpus, it should generalize better, especially in low-data scenarios. Human evaluation of the generated titles reinforces this, showing that BERT performs better than the Transformer-based model on output quality.
Table 4: Examples from the test dataset where the BERT Abstractive model performs well.
| # | Web title | Ground truth | Model prediction |
|---|---|---|---|
| 1 | White Onions 2 lbs a bag premium | a 2 pound bag of white onions | a 2 pound bag of white onions |
| 2 | Great Value Large Grade AA, 6 Eggs a carton Great Value Large Grade AA | a half dozen carton of Great Value Large Grade AA Eggs | a 6 count carton of great value large grade aa eggs |
| 3 | Lucky Charms Gluten Free Breakfast Cereal, 20.5 oz a box Lucky Charms Gluten Free | a 19.3 ounce box of Lucky Charms Gluten Free Cereal | a 20.5 ounce box of lucky charms gluten free cereal |
| 4 | Hostess Donettes Frosted Mini Donuts, 6 ct, 3 oz a pack hostess donettes | a pack of 6 hostess donettes mini donuts | a pack of 6 hostess donettes frosted mini donuts |
| 5 | Pork Butt Steaks Large, Tray, 3.1 - 5.1 lbs a tray | a 34.4 ounce tray of pork butt | a 3.1 to 5.1 pound tray of pork butt steaks |
| 6 | Pork Cube Steaks, Tray, 0.45 - 1.35 lbs a tray | a 12 ounce tray of ground pork | a 1.35 pound tray of pork cubes |
Table 5: Examples from the test dataset where the BERT Abstractive model performs poorly.
| # | Web title | Ground truth | Model prediction |
|---|---|---|---|
| 1 | Yoo-Hoo Chocolate Milk Fridge Pack, 12 pk a pack Yoo Hoo Chocolate Fridge Pack | a 12 count pack of yoo hoo chocolate milk fridge pack | a pack of 12 yoo hoo chocolate bar |
| 2 | Produce Unbranded Eat Your Vegetables Blend 7 Oz a pack | a 7 ounce pack of veggie blend snack | a 7 ounce pack of produce produce produce baby food blend |
| 3 | Diet 7UP, 0.5 L, 6 pack a pack | a 6 pack of .5 liter diet 7up | a 6 pack of 0.5 fluid ounce diet 7 up |
| 4 | Garlic, each (1 bulb) a garlic | garlic sold individually | a pound of garlic sold individually |
| 5 | Harvestland Chicken Breast, 1.5-2 lbs. a tray perdue harvestl and free range | a 1.5 to 2 pound tray of Perdue Harvestland Free Range Chicken Breasts | a 1.5 to 1.5 pound tray of perdue harvestland free range chicken breast |
Table 4 and Table 5 list examples of BERT Abstractive model predictions on the test dataset where the model performs well and poorly, respectively. The corresponding ground truth and web titles are also provided for comparison. The model repeats words in certain cases, for example in Row 2 of Table 5. Given that most of the training data consists of products with ounce and pound as the units of measurement, liter is incorrectly converted to fluid ounce by the model in Row 3 of Table 5, and pound is added to a single bulb of garlic in Row 4 of Table 5. However, in some cases the model clearly does better than the ground truth and even fixes incorrect quantities in the ground truth (Rows 3-6 in Table 4). The model is able to add important attributes such as “frosted” in Row 4 of Table 4. Row 2 of Table 4 shows that the model is able to keep a brand name such as “great value” unchanged and to provide appropriate measurement units for different products, such as “6 count” or “pack of 6”. Thus the model is able to fulfill the requirement of preserving quantity and brand between web and voice titles.
We observe that, overall, the BERT-based model performs better both quantitatively and qualitatively, maintaining factual details in the output title while also reducing repeated words.
In this paper, we studied the problem of generating succinct, grammatically correct voice titles for products in a large e-commerce catalog with limited labels. We evaluated four baselines and demonstrated, through ROUGE metrics and human evaluation, that BERT-based summarization can generate good titles even when data is extremely limited. Generating personalized titles for different user segments based on rich user metadata, and incorporating web data with additional product attributes that may be product dependent, are some directions for extending this work.
- [1] Bahdanau et al. (2014) Neural machine translation by jointly learning to align and translate.
- [2] Cho et al. (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259.
- [3] Devlin et al. (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [4] Dorr et al. (2003) Hedge trimmer: a parse-and-trim approach to headline generation. Technical report, Maryland Univ College Park Inst for Advanced Computer Studies.
- [5] Hochreiter and Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- [6] Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
- [7] Liu and Lapata (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.
- [8] Nallapati et al. (2016) Sequence-to-sequence RNNs for text summarization. CoRR abs/1602.06023.
- [9] Nallapati et al. (2016) SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. CoRR abs/1611.04230.
- [10] Neto et al. (2002) Automatic text summarization using a machine learning approach. In Advances in Artificial Intelligence, G. Bittencourt and G. L. Ramalho (Eds.), Berlin, Heidelberg, pp. 205–215.
- [11] Radford et al. (2019) Language models are unsupervised multitask learners.
- [12] See et al. (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- [13] Sutskever et al. (2014) Sequence to sequence learning with neural networks.
- [14] Vaswani et al. (2017) Attention is all you need.