In recent years, live streaming has become an increasingly popular sales channel in E-commerce. In a live stream, an anchor introduces the listed items (we use items and products interchangeably in this paper) in an attractive way and offers discounts or coupons to encourage user interaction and increase transaction volume. If a customer is interested in a particular item, s/he can view its details and ask specific questions. Unlike the traditional purchasing process, in which customers make purchasing decisions through product search and evaluation on an E-commerce website, customers of live streaming more naturally purchase within an online broadcasting room.
To support this new purchasing mode, we need to provide rich and attractive information that details the important aspects of each item in an online broadcasting room. Ideally, such information constitutes a cognitive product profile through which customers can truly understand a product: why they need it, when and where they can use it, how to use it, what effect it has, etc. In addition, as customers may have specific questions about an item while browsing, we need to help the anchor answer questions, so as to respond to users and provide a better user experience.
To this end, we establish AliMe MKG, a multi-modal knowledge graph that centers on items and aggregates rich information about them. Based on the knowledge graph, we further build an online live assistant that features product search, product exhibition and question answering, allowing customers to conveniently seek information in an online broadcasting room, including but not limited to skimming over the item list, viewing item details, and asking item-related questions.
Our system has been launched online in the Taobao APP, and currently serves hundreds of thousands of customers per day. In this paper, we introduce how we construct our multi-modal knowledge graph through cross-modal information fusion, and demonstrate its value through its application in the live assistant.
2. Multi-Modal Knowledge Graph
In this section, we introduce our multi-modal knowledge graph, including core ontology, construction process, experimental results and statistics.
2.1. KG Ontology
We show our core ontology in Figure 1. Three commonly accepted concepts, namely “User”, “Item” and “Scenario”, are adopted from the classic buying process: a user intends to buy some items at/for a certain scenario. The concept “Property_value” captures property values of items (e.g., “bisabolol 红没药醇” is the property value of the ingredient of “cleansing foam 洁面泡沫”). “Problem” and “POI (Point of Interest)” are adopted from AliMe KG (Li et al., 2020a), our previous domain knowledge graph: “Problem” refers to a problematic state that a user is in (e.g., “pimple 长痘痘”); “POI” captures a user's need or a solution to a user problem (e.g., “antiacne 清痘抑痘”). The new concept “Image” is a visual entity that graphically represents a textual scenario, user problem, property value or POI, and is the key contribution of this work.
There are three types of newly added links: cause, which relates scenario to problem (e.g., “changeover period between autumn and winter 秋冬换季” - cause - “dry skin 皮肤干”); need, which relates problem to POI (e.g., “pimple 长痘痘” - need - “antiacne 清痘抑痘”); and satisfy, which links property value and POI (e.g., “bisabolol 红没药醇” - satisfy - “antiacne 清痘抑痘”). These links are established based on domain knowledge, while the dotted ones are added or predicted through KG completion. For instance, if a user has a problem “pimple”, which needs “antiacne”, we will add a preference link between the user and “antiacne”. In addition, we use “has_image” to link an image to a corresponding textual entity.
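The link types and the preference-completion example above can be sketched as a tiny in-memory triple store. This is only an illustration under stated assumptions: the relation name `prefer`, the user identifier `user_1`, and the dictionary-based index are hypothetical and are not taken from the system itself.

```python
from collections import defaultdict

# Toy triples using the three link types from the text (entities are the paper's examples).
TRIPLES = [
    ("changeover period between autumn and winter", "cause", "dry skin"),
    ("pimple", "need", "antiacne"),
    ("bisabolol", "satisfy", "antiacne"),
]


def build_index(triples):
    """Index triples as (head, relation) -> set of tails."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def complete_preferences(user_problems, index):
    """Add a (user, 'prefer', POI) link whenever a user's problem needs that POI.

    'prefer' is a hypothetical name for the preference link described in the text.
    """
    inferred = set()
    for user, problem in user_problems:
        for poi in index.get((problem, "need"), ()):
            inferred.add((user, "prefer", poi))
    return inferred


index = build_index(TRIPLES)
print(complete_preferences([("user_1", "pimple")], index))
```

A user with the problem “pimple” thus receives a preference link to “antiacne”, mirroring the example in the paragraph above.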
2.2. KG Construction
We present our KG construction process in Figure 2. In general, it includes three phases: textual knowledge extraction, image extraction and cross-modal fusion.
2.2.1. Textual Knowledge Extraction
As in AliMe KG (Li et al., 2020a), ⟨Item, property, Property_value⟩ triples are imported from the Alibaba product knowledge graph, POIs are extracted from E-commerce content (i.e., item detail pages and articles) using phrase mining and a binary BERT classifier, and the relations between property values and POIs are acquired through relation extraction. Finally, the relation between an item and a POI is inferred as follows: if an item has a property value that satisfies a specific POI, then the item has that POI.
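The inference rule in the last sentence can be written as a simple join between item triples and satisfy links. The following is a minimal sketch; the function name and the data layout are assumptions made for illustration.

```python
def infer_item_pois(item_property_values, satisfy_links):
    """Return (item, POI) pairs implied by the rule in the text.

    item_property_values: iterable of (item, property, property_value) triples.
    satisfy_links: iterable of (property_value, POI) pairs.
    """
    pois_by_value = {}
    for value, poi in satisfy_links:
        pois_by_value.setdefault(value, set()).add(poi)

    inferred = set()
    for item, _property, value in item_property_values:
        # An item inherits every POI satisfied by one of its property values.
        for poi in pois_by_value.get(value, ()):
            inferred.add((item, poi))
    return inferred


items = [("cleansing foam", "ingredient", "bisabolol")]
satisfy = [("bisabolol", "antiacne")]
print(infer_item_pois(items, satisfy))
```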
2.2.2. Image Extraction
We currently extract images mainly from item detail pages. The initially collected images can be overlong or convey time-sensitive information (e.g., a promotion that goes offline after a certain time point). We first cut overlong images into pieces based on image content using edge detection, and then filter out noisy images through image classification and optical character recognition (OCR) together with heuristic rules (e.g., the area proportion of OCR text, the number of OCR text blocks).
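The two heuristic signals named above (the area proportion of OCR text and the number of OCR text blocks) could be combined as in the sketch below. The thresholds, the direction of the rule (more text means noisier), and the `ImageInfo` structure are all assumptions for illustration; the image classifier and the OCR engine themselves are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) of one OCR text block


@dataclass
class ImageInfo:
    width: int
    height: int
    ocr_boxes: List[Box] = field(default_factory=list)


def text_area_ratio(image: ImageInfo) -> float:
    """Fraction of the image area covered by OCR text blocks (overlaps are not merged)."""
    image_area = image.width * image.height
    if image_area == 0:
        return 0.0
    text_area = sum(w * h for (_left, _top, w, h) in image.ocr_boxes)
    return min(1.0, text_area / image_area)


def is_noisy(image: ImageInfo, max_text_ratio: float = 0.5, max_text_blocks: int = 20) -> bool:
    """Flag an image if either (hypothetical) heuristic threshold is exceeded."""
    return (
        text_area_ratio(image) > max_text_ratio
        or len(image.ocr_boxes) > max_text_blocks
    )


print(is_noisy(ImageInfo(100, 100, [(0, 0, 100, 60)])))
```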
2.2.3. Cross-Modal Fusion
We treat the linking of images to textual entities as a text-image matching problem. Inspired by the success of pre-trained language models (e.g., BERT (Devlin et al., 2018)) in NLP, pre-training for vision-language tasks has also attracted increasing attention in recent years. Observing the limitation of pre-defined and mismatched categories in region detection (Huang et al., 2020), and the effectiveness of directly applying Transformer to images (Dosovitskiy et al., 2020), we choose to represent an image by learning from image patches rather than bounding boxes.
Model Pre-training. As shown in Figure 3, we follow LightningDOT (Sun et al., 2021) and adopt a two-stream Transformer architecture that removes the time-consuming cross-modal attention to accelerate inference. For image input (the left branch), we employ the vision Transformer (ViT) (Dosovitskiy et al., 2020) as the feature extractor. Specifically, we first reshape an image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2-D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch and $N = HW/P^2$ is the number of patches, and then provide the sequence of linear embeddings of these patches, along with the corresponding segment and position embeddings, as input to ViT. Moreover, to represent the whole image, we add a “[CLS]” token to the beginning of the patch list. For text input, we employ StructBERT (Wang et al., 2020) as our backbone and follow the BERT convention of using the “[CLS]” token to represent the whole text.
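The patch-flattening step can be illustrated without any deep learning library. The sketch below splits an image of height H, width W and C channels (stored as nested lists) into non-overlapping square patches of side P, giving H·W/P² flattened patches of length P·P·C each. It covers only the reshaping; the linear projection, the “[CLS]” token and the segment and position embeddings are omitted, and the function name is an assumption.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of N = (H / P) * (W / P) patches, each a flat list of
    P * P * C values, ordered row by row over the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# A 4 x 4 image with 3 channels; each pixel stores [row, col, 0] for easy inspection.
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))
```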
We pre-train our model on a large number of image-sentence pairs with three objectives, namely Masked Language Modeling (MLM), Masked Patch Feature Regression (MPFR) and Cross-modal Retrieval (CMR). In the MLM task, we treat the paired image as complementary information when reconstructing masked tokens in a sentence, by adding the embedding of the whole image to that of the masked tokens. In the MPFR task, our model learns to regress the output of each masked patch to its visual feature through an MSE (mean square error) loss, and exploits the global text representation when reconstructing masked patches in a similar spirit. Finally, in the CMR task, we use the dot product to measure the similarity score between a text $t$ and an image $v$. Specifically, we calculate the dot product between the embeddings of the “[CLS]” tokens from ViT and StructBERT, respectively. To better capture the supervision signal, we employ the bi-directional variant of contrastive loss proposed in LightningDOT (Sun et al., 2021): given a batch of $n$ matched image-text pairs $\{(v_i, t_i)\}_{i=1}^{n}$, we treat image $v_i$ as the query and the other texts $t_j$ ($j \neq i$) as negatives, and do the same with text $t_i$ as the query.
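To make the CMR objective concrete, the sketch below computes one common in-batch softmax formulation of a bi-directional contrastive loss over dot-product scores of the “[CLS]” embeddings. It is an illustration under assumptions: the exact loss used in the paper or in LightningDOT may differ (for example in scaling or averaging), and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def _diagonal_cross_entropy(scores):
    """Mean negative log-softmax of the matched (diagonal) entry in each row."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_norm = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_norm - row[i]
    return total / len(scores)


def bidirectional_contrastive_loss(image_embs, text_embs):
    """Image-to-text plus text-to-image loss; pair i is the positive, the rest are negatives."""
    scores = [[dot(v, t) for t in text_embs] for v in image_embs]
    transposed = [list(column) for column in zip(*scores)]
    return _diagonal_cross_entropy(scores) + _diagonal_cross_entropy(transposed)


matched = bidirectional_contrastive_loss([[10.0, 0.0], [0.0, 10.0]], [[10.0, 0.0], [0.0, 10.0]])
swapped = bidirectional_contrastive_loss([[10.0, 0.0], [0.0, 10.0]], [[0.0, 10.0], [10.0, 0.0]])
print(matched < swapped)
```

Correctly matched pairs give a loss close to zero, while swapping the texts makes every positive score lower than its negative and the loss large.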
To perform pre-training, we collect more than 6 million image-sentence pairs from items’ detail pages. Operationally, the pairs are constructed through pairing an image with its OCR text. Heuristic rules are also employed to prune low quality text and images.
Model Fine-tuning. For the downstream text-image matching task, we fine-tune the pre-trained multi-modal model with task-specific datasets. Finally, we will obtain triples that describe the “has_image” relation between textual and image entities through model matching.
2.3. The Performance of Cross-Modal Matching
We compare our approach with Pixel-BERT (Huang et al., 2020), which uses a CNN-based visual encoder to extract image features, and feeds the text and visual features into a single-stream Transformer network for image-text matching. For fair comparison, we also pre-trained Pixel-BERT with the same amount of E-commerce image-text pairs. Instead of sampling pixels, we use all extracted pixel features during pre-training. In addition, we compared the linear projection of patches with ResNet50 for image feature extraction when using ViT as the image encoder in our approach.
We evaluated the three methods on an image-text matching dataset in the Beauty domain, where each text has 10 candidate images. We extract image features offline in advance, and then perform image-text matching for each incoming text query. We ran each experiment 3 times and report the average AUC, inference speed and feature extraction speed in Table 1. Our approach not only achieves the best performance (AUC=0.9861) but also brings substantial speed improvement (55.18 for online inference time and 1.61 for offline image feature extraction).
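Because the two streams do not attend to each other, image embeddings can be computed once offline and each incoming text query only needs a dot product per candidate. The sketch below shows this matching step with toy vectors; the image identifiers, the embedding values and the function names are hypothetical, and the encoders that would produce the embeddings are not shown.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_images(text_embedding, image_index, top_k=3):
    """Rank precomputed image embeddings against one text embedding.

    image_index: dict mapping image_id -> embedding computed offline.
    Returns up to top_k (image_id, score) pairs, best first.
    """
    scored = [(dot(text_embedding, emb), image_id) for image_id, emb in image_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Offline: embeddings stored per candidate image (toy values).
offline_index = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.5, 0.5]}
# Online: only the text query is encoded, then matched by dot product.
print(rank_images([1.0, 0.0], offline_index, top_k=2))
```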
2.4. KG Statistics
Our multi-modal KG is still growing. Currently, our MKG covers three vertical domains, namely Clothing, Beauty and Snacks. It has accumulated 400 scenarios, 1K user problems, 500K POIs, 2K “Scenario - cause - Problem” triples, 12K “Problem - need - POI” triples, 500K “Property_value - satisfy - POI” triples, 300K items, and 28M associated images. All triples other than those involving items and images have been fully checked through crowd-sourcing to ensure their quality. A spot check on cross-modal matching shows that its accuracy is high enough for practical use.
2.5. KG Example
We show an excerpt of our multi-modal knowledge graph in Figure 4. As it shows, the scenario “Stay up late 熬夜” causes the problem “Dull skin 皮肤暗沉”, which requires “Fair skin 皮肤白皙”. A “facial mask 面膜” item contains the ingredient “Dipotassium glycyrrhizinate 甘草酸二钾”, which helps achieve “Fair skin”, and hence the item fits the corresponding users.
3. KG Application: Live Assistant
We have built an online live assistant based on our MKG for users to skim over the item list, view item details and ask item-related questions. We show the overall processing flow of customer inputs in Figure 5. Given a user query $q$, our online system first conducts intention identification. If the user is requesting to view an item, e.g., “Can I see the lipstick?”, s/he is directed to our Item Exhibition component, which displays item cards. If the user is asking an item-related question, e.g., “What is the size of the T-shirt?”, the question is answered by our QA Engine module. If the query is irrelevant to products, our system responds with a pre-configured answer.
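The processing flow above amounts to a three-way dispatch on the identified intent. The sketch below shows only the routing logic: the intent labels, the component names and the keyword-based `toy_intent_classifier` are hypothetical stand-ins for the learned intention identification model, which is not described in detail here.

```python
from typing import Callable

# Hypothetical intent labels; the text does not name them.
VIEW_ITEM = "view_item"
ASK_ITEM_QUESTION = "ask_item_question"
IRRELEVANT = "irrelevant"


def toy_intent_classifier(query: str) -> str:
    """Keyword stand-in for a learned intent model (illustration only)."""
    text = query.lower()
    if "can i see" in text or "show me" in text:
        return VIEW_ITEM
    if text.startswith(("what", "how", "does", "is")):
        return ASK_ITEM_QUESTION
    return IRRELEVANT


def route(query: str, classify: Callable[[str], str] = toy_intent_classifier) -> str:
    """Return the name of the component that should handle the query."""
    intent = classify(query)
    if intent == VIEW_ITEM:
        return "item_exhibition"
    if intent == ASK_ITEM_QUESTION:
        return "qa_engine"
    return "preconfigured_answer"


print(route("Can I see the lipstick?"))
print(route("What is the size of the T-shirt?"))
```

Passing the classifier as a parameter keeps the routing independent of how intents are actually predicted.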
Item Retrieval Engine.
Given a customer query $q$ that is related to product(s), the engine parses the query, then searches for and ranks related items. If more than one item is found, it returns the ranked list and asks the user to make a selection. To enable our model to better understand products, we employ an offline NER model to identify the key aspects of items from their titles and detail pages, e.g., category, brand, functionality, etc., and add the identified semantic types to the product profiles in our KG. During item search, we employ FLAT (Li et al., 2020b), a light-weight NER model, to identify pre-defined semantic types from the query $q$, and calculate an enhanced similarity score between $q$ and each item based on their tokens and identified semantic types.
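The text does not give the formula of the enhanced similarity score, so the following is only one plausible sketch: a weighted sum of token overlap (Jaccard) and the fraction of query semantic types whose values match the item profile. The weighting scheme, the `type_weight` parameter and all function names are assumptions, and the NER outputs are supplied by hand rather than produced by FLAT.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def enhanced_similarity(query_tokens, query_types, item_tokens, item_types, type_weight=0.5):
    """Combine token overlap with agreement on identified semantic types.

    query_types / item_types: dicts such as {"category": "lipstick"}.
    """
    token_score = jaccard(set(query_tokens), set(item_tokens))
    if query_types:
        matched = sum(1 for key, value in query_types.items() if item_types.get(key) == value)
        type_score = matched / len(query_types)
    else:
        type_score = 0.0
    return (1 - type_weight) * token_score + type_weight * type_score


query = (["red", "lipstick"], {"category": "lipstick"})
items = {
    "item_1": (["matte", "red", "lipstick"], {"category": "lipstick", "brand": "X"}),
    "item_2": (["red", "t-shirt"], {"category": "t-shirt"}),
}
ranked = sorted(items, key=lambda name: enhanced_similarity(*query, *items[name]), reverse=True)
print(ranked)
```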
Once a specific item is chosen, our live assistant retrieves a rich set of information about the item from our multi-modal KG and displays it in a pop-up window. Roughly, item information is organized into three categories: appearance, POI and comment. The core elements of our MKG, including scenarios, POIs, item images and property values, are exhibited according to this pre-defined structure.
To help an anchor answer questions, we designed a hybrid approach that employs both KBQA (question answering over a knowledge base) and DeepQA (question answering over frequently asked questions). In KBQA, we adopt an NER model to identify properties from customer questions, and retrieve the corresponding property values and associated images from our MKG. To improve readability, we synthesize a natural language sentence instead of delivering a single property value. If no property is identified in the question or there is no answer in the MKG, we use a text matching model to find the most similar FAQ and return the corresponding answer.
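The KBQA-then-FAQ fallback described above can be sketched as follows. Everything model-related is replaced by a simple stand-in: `identify_property` uses substring lookup instead of an NER model, `token_overlap` uses word-level Jaccard instead of a text matching model, and the sentence template is hypothetical. Retrieval of associated images is omitted.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets; a stand-in for a text matching model."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def identify_property(question, known_properties):
    """Stand-in for the NER step: return the first known property mentioned in the question."""
    text = question.lower()
    for prop in known_properties:
        if prop in text:
            return prop
    return None


def hybrid_answer(question, item, kg, faq):
    """Try KBQA first, then fall back to the most similar FAQ entry.

    kg: dict mapping (item, property) -> property value.
    faq: dict mapping FAQ question -> answer.
    """
    properties = sorted({prop for (_item, prop) in kg})
    prop = identify_property(question, properties)
    if prop is not None and (item, prop) in kg:
        # Synthesize a sentence instead of returning the bare property value.
        return f"The {prop} of the {item} is {kg[(item, prop)]}."
    if not faq:
        return None
    best_question = max(faq, key=lambda q: token_overlap(question, q))
    return faq[best_question]


kg = {("T-shirt", "size"): "M"}
faq = {"When will my order ship?": "Orders ship within 48 hours."}
print(hybrid_answer("What is the size of the T-shirt?", "T-shirt", kg, faq))
print(hybrid_answer("When will it ship?", "T-shirt", kg, faq))
```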
Our system has been launched online in the Taobao app, and currently serves hundreds of thousands of customers per day.
We demonstrate the key features of our system in Figure 6. Figure 6 (a) shows a list of items that are retrieved based on the user query “Can I see the lipstick?” and shown in a pop-up window. Users can skim over the item list and choose the one that interests them most. Once an item is chosen, the important images, POIs and properties of the item are exhibited, as shown in Figure 6 (b). Figure 6 (c) demonstrates the richness of our KBQA answer, which contains a textual description and a size chart, and answers a question about the size of a specific T-shirt.
We also show an innovative application of our multi-modal KG to short video production in Figure 7. Following the cognitive path in Figure 4, we generate a natural language utterance for each node, organize the associated images in the given order, and finally produce a short video using templates. The generated short videos can be played in the item card, as in Figure 6 (b). Such knowledge-based short videos present the core selling points of an item in an attractive and cognitive manner, and are hence more convincing in influencing customers’ buying decisions. Moreover, with our knowledge graph, the productivity of such knowledge-based video production can be greatly improved, making its large-scale application feasible.
In this work, we present AliMe MKG, a multi-modal knowledge graph that aggregates rich product information, and introduce its innovative applications in live-streaming E-commerce. Providing cognitive product profiles to customers is both valuable and challenging. Many interesting problems, such as acquiring encyclopedic or scenarized knowledge, personalized exhibition, and knowledge-enhanced text generation, will be further explored to enrich our multi-modal KG and polish our online live assistant.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dosovitskiy et al. (2020). An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- He et al. (2016). Deep residual learning for image recognition. pp. 770–778.
- Huang et al. (2020). Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. CoRR abs/2004.00849.
- Li et al. (2020a). AliMe KG: domain knowledge graph construction and application in e-commerce. CoRR abs/2009.11684.
- Li et al. (2020b). FLAT: Chinese NER using flat-lattice transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 6836–6842.
- Sun et al. (2021). LightningDOT: pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 982–997.
- Wang et al. (2020). StructBERT: incorporating language structures into pre-training for deep language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.