1. Introduction and Related Work
Query intent understanding is a key step in designing advanced retrieval systems such as e-commerce search engines (Croft et al., 2010). Various approaches have been proposed to address query understanding, such as: 1) considering predefined high-level categories (i.e., informational, navigational, and transactional), 2) deploying semi-supervised learning with click graphs, 3) modeling temporal query intent, 4) understanding word-level user intent, and 5) applying relevance feedback and user behaviors. Although user intent inference has improved significantly, query understanding remains a major challenge (et al, 2019b).
E-commerce search queries have multiple intents associated with them. Ashkan et al. (2009) categorized search queries for e-commerce websites into commercial and non-commercial intents. However, Zhao et al. (2019) ignore non-commercial queries due to their small percentage of the search traffic. Commercial queries are queries with purchasing intent, while non-commercial queries cover a wide range of customer-service needs (e.g., “military discounts” and “installation guides”), as shown in Table 1.
| Search Queries | Intent | Product Categories |
|---|---|---|
| where is my shipped order | non-commercial | - |
| how to install my tiles | non-commercial | - |
| cost to rent a carpet cleaner | non-commercial | - |
| 18 volt ryobi | commercial | [tools, electrical, lighting] |
| 24 in. classic Samsung refrigerator | commercial | [appliance, electrical] |
Query understanding in e-commerce search is challenging: 1) queries are often short, vague, and lacking in textual evidence (Ha et al., 2016), 2) a small variation in textual evidence can cause a drastic change in query intent; for example, “30 in. 5.8 cu. ft. gas range installation kit” has commercial intent, but “30 in. 5.8 cu. ft. gas range installation” has non-commercial intent, 3) product category mapping is a multi-label, non-exclusive problem: a practical solution must include the broadest possible set of correct categories while simultaneously keeping precision as high as possible (Zhao et al., 2019), 4) there is class imbalance in both the commercial vs. non-commercial and the product category mapping tasks, because only a small fraction of the data (1.5% in our domain) has non-commercial intent, and within the commercial queries some product categories contain significantly more samples than others, and 5) commercial queries are easy to identify using user-behavior information such as click rates; the same is not true for non-commercial queries.
To address these problems, we introduce a new method that jointly learns query intent and category mapping, which allows transferring inductive bias between these two related tasks. We also leverage label representations, which provide a richer way to model the product categories. Furthermore, we propose an active learning algorithm to generate data for commercial vs. non-commercial intent. To address the class imbalance problem, we deploy focal loss, which is borrowed from computer vision.
Joint learning has been proposed as a practical approach to simultaneously learning related tasks, owing to the transfer of inductive bias among them. Joint learning finds applications in computer vision and natural language understanding (Khatri et al., 2018), and improves the regularization and generalization of learning models by utilizing domain information (Caruana, 1997). In addition, with a joint model that addresses multiple tasks, only one model needs to be deployed; this reduces overhead and facilitates the maintenance of the system (Wang et al., 2018b). In this paper, we propose a joint-learning model that simultaneously learns both commercial and non-commercial query intents, and maps each incoming commercial query to a set of relevant product categories.
In this paper, we introduce a data-driven approach, which we call Joint Query Intent Mapping (JointMap). JointMap leverages the label representation proposed by Wang et al. (2018a) and modifies it to be applicable to a joint-learning task. JointMap also utilizes a self-attention mechanism to improve the quality of the joint word-label attention vectors. For product category mapping, JointMap handles the imbalanced-class problem using focal loss (Lin et al., 2017), which has been well studied in computer vision to control the sparse set of candidate object locations. Finally, we propose an approach based on distant supervision in combination with active learning to generate both commercial and non-commercial queries.
In summary, our contributions are: 1) proposing a deep learning model to jointly learn product category mapping as well as users’ non-commercial intents, 2) developing an active learning algorithm in conjunction with distant supervision to generate a user intent dataset from e-commerce data logs, and 3) modifying the joint word-category representation for query intent mapping tasks in e-commerce, as described in detail next.
2. Model Overview
In this section, we present the network architecture of JointMap, shown in Figure 1. JointMap utilizes both word and category embeddings, which are jointly trained to achieve an efficient semantic representation of a query. The proposed model consists of two deep learning layers: the first for understanding the user's commercial intent, and the second for predicting the relevant product categories in the taxonomy. Accordingly, the model contains three embedding layers: a word embedding layer and two category embedding layers, i.e., commercial vs. non-commercial and product categories. Both category embedding types are concatenated to compute the final product category representations. Then, a Compatibility Matrix (CM) is generated by computing the cosine similarity between the label and word representations. CM represents the relative spatial information among consecutive words (phrases) and their associated product category and commercial vs. non-commercial labels. CM is then passed through a multi-head self-attention layer to calculate attention scores. The word vectors simultaneously go through two highway layers, and the output of each highway layer is multiplied by the corresponding attention scores to generate the final query representation. Finally, the loss value is computed using sigmoid cross-entropy for product category mapping, and using softmax cross-entropy for determining the query's commercial intent.
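The data flow above can be sketched end-to-end in a few lines of numpy. This is a simplified, hypothetical stand-in, not the paper's implementation: dimensions are illustrative, the multi-head layer is reduced to a single head, and one highway layer per task is collapsed into one shared layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): n-word query, d-dim embeddings,
# 2 intent labels (commercial vs. non-commercial), 32 product categories.
n, d, n_int, n_cat = 5, 16, 2, 32

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine(a, b):
    # Pairwise cosine similarity between the rows of a and the rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Word embeddings (Word2Vec-initialized in the paper) and the two label
# embedding tables, concatenated into the joint label space.
W = rng.normal(size=(n, d))
L = np.concatenate([rng.normal(size=(n_int, d)),    # intent labels
                    rng.normal(size=(n_cat, d))])   # product categories

# Compatibility Matrix: cosine similarity between every word and every label.
CM = cosine(W, L)                                   # (n, n_int + n_cat)

# Stand-in for the multi-head self-attention layer: a single head over CM.
A = softmax(CM @ CM.T / np.sqrt(CM.shape[1])) @ CM  # (n, n_int + n_cat)

# Per-word attention scores, one set per task (intent vs. category columns).
beta_int = softmax(A[:, :n_int].mean(axis=1))       # (n,)
beta_cat = softmax(A[:, n_int:].mean(axis=1))       # (n,)

def highway(x, W_h, W_t):
    # Highway layer: a transform gate mixes a nonlinearity with the identity.
    t = sigmoid(x @ W_t)
    return t * np.tanh(x @ W_h) + (1 - t) * x

H = highway(W, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

# Attention-weighted query representations and the two output heads:
# softmax over intents, independent sigmoids over product categories.
p_int = softmax(beta_int @ H @ rng.normal(size=(d, n_int)))
p_cat = sigmoid(beta_cat @ H @ rng.normal(size=(d, n_cat)))
```

Note how the two heads differ: the intent probabilities sum to one (exclusive classes), while each category probability is independent, matching the non-exclusive category mapping task.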
In the next section we explain the details of the proposed model.
2.1. Joint-Learning of High-Level Intent Tasks
We now introduce JointMap, a joint-learning model for high-level user intent prediction.
Suppose there is a search query dataset D = {(q, y, c)}, where q ∈ Q is a search query, y ∈ Y represents the user's commercial vs. non-commercial intent, and c ⊆ C is the candidate product category set. Each query q consists of a sequence of n words, represented as q = {w_1, w_2, ..., w_n}. Also, Y and C are mapped to the embedding spaces E_Y and E_C, respectively. Then, the matrices E_Y and E_C are concatenated into the joint label matrix L = [E_Y; E_C] to represent the whole label space. The word and label embeddings are initialized with Word2Vec and random embeddings of size d, respectively. The cosine similarity between the word embedding matrix W and L is computed for each query, G = cos(W, L), to extract the relative spatial information among consecutive words and their associated labels, where cos(·, ·) indicates the cosine similarity function.
To extract the contribution of the words with respect to their category, a multi-head self-attention mechanism with h different heads is applied to G. Multi-head self-attention consists of h parallel linear projections of a single scaled dot-product function. Eq. 2 shows a single head of the self-attention mechanism:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V,    (2)

where K is the key matrix, V is the value matrix, and d_k is the dimension of the keys. Each projection is responsible for extracting the attention between word-label pairs in a query, and computes its output as a weighted sum of the values. Next, the attention output is split into two matrices, one per task, covering the commercial vs. non-commercial columns and the product category columns, respectively. For both tasks, the word embedding vectors W are fed into a highway encoder layer, which has shown its effectiveness in improving network capacity (et al, 2019b). The output of each highway layer is then multiplied by the corresponding attention scores.
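A minimal sketch of the multi-head mechanism described above, assuming randomly initialized projection matrices and illustrative dimensions (the head count and sizes are not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    # Eq. 2: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is
    # the key dimension; each output row is a weighted sum of the rows of V.
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, h, d_head, rng):
    # h parallel linear projections of the same scaled dot-product function;
    # the head outputs are concatenated along the feature axis.
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(size=(X.shape[1], d_head)) for _ in range(3))
        heads.append(attention_head(X @ Wq, X @ Wk, X @ Wv))
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(1)
G = rng.normal(size=(6, 34))   # hypothetical 6-word compatibility matrix
A = multi_head(G, h=4, d_head=8, rng=rng)
```

Each of the 4 heads can attend to a different word-label interaction pattern; the concatenated output A then supplies the per-task attention scores.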
The resulting task-specific matrices then go through a fully connected layer to generate the semantic representations of both tasks. For product category mapping, a sigmoid cross-entropy loss function is used, since with a sigmoid the loss computed for each output is not affected by the other component values. A binary softmax cross-entropy loss is applied to train the user commercial vs. non-commercial intent.
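The two loss choices can be illustrated concretely; the logits and targets below are made-up toy values, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, targets):
    # Per-label loss: each output component is penalized independently, so
    # one category's value does not affect another's -- suited to the
    # non-exclusive, multi-label product category mapping task.
    p = sigmoid(logits)
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean()

def softmax_cross_entropy(logits, target_idx):
    # Mutually exclusive classes -- suited to the binary commercial vs.
    # non-commercial intent task.
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[target_idx]

cat_logits = np.array([2.0, -1.0, 0.5])   # 3 hypothetical category scores
cat_targets = np.array([1.0, 0.0, 1.0])   # non-exclusive target vector
int_logits = np.array([0.3, 1.7])         # [commercial, non-commercial] scores

l_cat = sigmoid_cross_entropy(cat_logits, cat_targets)
l_int = softmax_cross_entropy(int_logits, target_idx=0)
```

Because the sigmoid loss treats each category independently, a query can be pushed toward several categories at once without the categories competing for probability mass.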
In these loss functions, p represents the prediction distribution and y indicates the target labels. To address the class imbalance problem, particularly in the product category dataset, we update the loss values based on the focal loss proposed in (Lin et al., 2017). Focal loss was initially proposed for object detection, to remove the effect of the extreme foreground-background class imbalance in images:

FL = -Σ_i y_i (1 - p_i)^γ log(p_i),

where y is the target vector, i is the class index, and γ is a factor that decreases the influence of well-classified samples.
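A minimal sketch of the focal-loss modulation, in the binary-sigmoid form of Lin et al. (2017); the γ default and the example logits are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, targets, gamma=1.5):
    # Focal loss (Lin et al., 2017): p_t is the probability assigned to the
    # true outcome, and the (1 - p_t)^gamma factor shrinks the contribution
    # of well-classified samples, letting rare classes influence training.
    # gamma=1.5 here is just one of the values swept in the experiments.
    p = sigmoid(logits)
    p_t = np.where(targets == 1, p, 1 - p)
    return -((1 - p_t) ** gamma * np.log(p_t)).mean()

# An easy (confidently correct) positive contributes far less than a hard one.
easy = focal_loss(np.array([4.0]), np.array([1.0]))
hard = focal_loss(np.array([-1.0]), np.array([1.0]))
```

With γ = 0 the modulating factor vanishes and the loss reduces to plain sigmoid cross-entropy, which is the sense in which focal loss "updates" the base loss rather than replacing it.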
JointMap overall loss:
The final loss function is computed as a weighted sum of the commercial vs. non-commercial and product category mapping losses:

L = λ_1 L_intent + λ_2 L_category.    (9)
3. Dataset Overview
In this section, we describe the dataset, collected from the search logs of a large e-commerce search engine in July 2019, and provide details of the algorithms used for generating the user-intent datasets. We propose an algorithm that simultaneously generates both datasets in three steps: 1) generating the commercial vs. non-commercial queries, 2) oversampling the non-commercial queries to balance the dataset, and 3) creating the product category dataset based on the commercial queries. Algorithm 1 presents the steps for generating commercial vs. non-commercial samples. In this method, we first generate a small dataset that covers all expected non-commercial intents (e.g., “installation guides”).
Then, we over-sample the non-commercial queries as described in (et al, 2019a) to balance the dataset (only 1.5% of the queries have a non-commercial intent). Similar to (Zhao et al., 2019), we utilize user behavior data, such as click rate, to generate the category labels associated with each commercial query. Algorithm 2 describes the steps to create the product mapping dataset.
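The oversampling step can be sketched as simple random duplication of minority-class queries; this is only a stand-in, since the cited approach handles difficult minority labels more carefully, and the example queries are illustrative values from Table 1:

```python
import random

def oversample_minority(queries, labels, minority="non-commercial", seed=7):
    # Randomly duplicate minority-class queries until the two classes are
    # balanced -- the simplest stand-in for the oversampling step.
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    extra = [rng.choice(minority_idx)
             for _ in range(max(0, majority_count - len(minority_idx)))]
    return (queries + [queries[i] for i in extra],
            labels + [labels[i] for i in extra])

queries = ["18 volt ryobi", "24 in. refrigerator", "where is my order"]
labels = ["commercial", "commercial", "non-commercial"]
q2, y2 = oversample_minority(queries, labels)
```

After the call, the two classes occur equally often, which is the property the balancing step needs before training.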
Finally, a dataset of size 195K with 32 product categories (e.g., tools, appliance, outdoors) is extracted from the search logs.
4. Experimental Setup
In this section, we describe the parameter setting, metrics, baseline models, and experimental procedures used to evaluate JointMap.
We used the Adam optimizer and a mini-batch of size 64 for training. A dropout rate of 0.5 is applied at the fully-connected and ReLU layers to prevent the model from overfitting.
To evaluate JointMap, we report both micro- and macro-averaged F1-scores for both tasks.
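The difference between the two averages matters for the imbalanced setting studied here. A minimal single-label sketch (category mapping itself is multi-label; the toy labels below are made up):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred, n_classes):
    # Micro-F1 pools TP/FP/FN over all classes, so majority classes dominate;
    # macro-F1 averages per-class F1, weighting every class equally -- which
    # is why macro-F1 is the more sensitive metric for minority categories.
    tp = np.zeros(n_classes)
    fp = np.zeros(n_classes)
    fn = np.zeros(n_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return micro, per_class.mean()

# A classifier that always predicts class 0 looks passable on micro-F1
# but poor on macro-F1 when classes 1 and 2 are minorities.
micro, macro = micro_macro_f1([0, 0, 0, 1, 2], [0, 0, 0, 0, 0], n_classes=3)
```

This gap between the two scores is exactly the effect discussed later, where macro-averaged F1 improves much more than micro-averaged F1.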
We summarize the multi-label classification methods compared in the experimental results.
Dataset Experimental Design.
We use an SVM model with n-gram tf*idf features to perform the distant supervision step, for several reasons: 1) SVMs are fast and scalable, 2) the features and results are interpretable for supervisors, 3) SVMs have proven effective on text data, and 4) SVMs provide confidence scores that help detect tricky samples. Moreover, two human annotators were asked to manually label 540 samples. The resulting (Matching, Kappa) scores of (0.98, 0.96) indicate “significant agreement.” The category distribution is shown in Figure 2.
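The role of the confidence score in this loop can be sketched without the SVM itself. The stand-in below uses a dependency-free nearest-centroid scorer over term-frequency vectors instead of n-gram tf*idf with an SVM, and the tiny seed sets are illustrative queries from Table 1:

```python
import math
from collections import Counter

# Tiny seed sets standing in for the manually built intent dataset.
seeds = {
    "commercial": ["18 volt ryobi", "24 in. classic samsung refrigerator"],
    "non-commercial": ["where is my shipped order", "how to install my tiles"],
}

def vec(text):
    # Unigram term-frequency vector (the paper uses n-gram tf*idf features).
    return Counter(text.lower().split())

def cos(u, v):
    dot = sum(c * v.get(w, 0) for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

centroids = {y: sum((vec(q) for q in qs), Counter()) for y, qs in seeds.items()}

def label_with_confidence(query):
    # Confidence = margin between the two class scores; low-margin ("tricky")
    # samples are the ones worth routing to human annotators.
    scores = sorted(((cos(vec(query), c), y) for y, c in centroids.items()),
                    reverse=True)
    (s1, top), (s2, _) = scores
    return top, s1 - s2

label, margin = label_with_confidence("how to install a ceiling fan")
```

An SVM's decision-function value plays the same role as this margin: queries with scores near the boundary are the ones an active-learning loop sends to annotators.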
4.1. Main Results and Ablation Analysis
Table 2 summarizes the performance of the models. The results are reported for both commercial vs. non-commercial classification and product category mapping. All improvements are statistically significant under a one-tailed Student's t-test with p-value < 0.05.
| Method | Commercial vs. Non-commercial Macro-F1 | Commercial vs. Non-commercial Micro-F1 | Product Category Mapping Macro-F1 | Product Category Mapping Micro-F1 |
|---|---|---|---|---|
| VDCNN (Conneau et al., 2016) | 91.28 | 91.34 | 51.41 | 79.34 |
| FastText (Bojanowski et al., 2017) | 92.18 | 92.15 | 60.06 | 79.69 |
| XML-CNN (Liu et al., 2017) | 93.11 | 93.01 | 58.40 | 81.61 |
| LEAM (Wang et al., 2018a) | 93.96 | 93.66 | 58.90 | 81.31 |
| JointMap | 94.80 (+1.1%) | 94.63 (+1.0%) | 62.60 (+6.3%) | 83.01 (+2.1%) |
For the user commercial intent mapping task, the results indicate that the macro-averaged F1 improves by 4.5%, 3.8%, 2.8%, 1.0%, and 1.8% compared to the tf*idf, VDCNN, FastText, LEAM, and XML-CNN models, respectively. In the product category mapping task, the improvements are larger: 28.4%, 22.1%, 4.2%, 6.3%, and 7.2% over the tf*idf, VDCNN, FastText, LEAM, and XML-CNN models, respectively. As a result, JointMap improves macro-averaged F1 scores over current state-of-the-art deep learning models by 2.3% on commercial vs. non-commercial intent, and by 10% on product category mapping.
In reference to user commercial intent prediction, a 2.3% improvement is considerable, given the context of a large e-commerce search engine that receives billions of search queries per year. For product category mapping, the macro-averaged F1 experiences a larger jump than the micro-averaged F1 (6.3% vs. 2.1%). This improvement indicates the positive impact of the inductive bias shared between the two tasks, which not only boosts the performance of the majority classes but also contributes to the minority classes. For instance, the macro-averaged F1 over the 8 bottom minority classes shown in Figure 2 is 21.76% for XML-CNN and 18.33% for LEAM, while it jumps to 31.28% for JointMap.
Focal Loss Impact.
Using focal loss deteriorates the overall micro- and macro-averaged F1-scores by 0.6% and 1.5%, respectively. However, the macro-averaged F1 on the 8 bottom minority classes is 31.28% without focal loss and 33.81% with focal loss, a relative improvement of 8.1%. We also observe that, in the absence of focal loss, the performance on at least two of the minority classes is 0%, making the use of focal loss necessary.
To evaluate the impact of hyper-parameter tuning in JointMap, we performed a grid search over the two task-loss weights in Eq. 9. We observed that using a smaller weight for a task causes slower convergence for that specific task; however, the final results are not significantly different. In our experiments, a simple average works as well as a fine-tuned hyper-parameter model. For tuning the focal loss hyper-parameter, we repeated the experiments with γ values of 1, 1.2, 1.5, and 2. The best-performing γ in our experiments differed from the γ = 2 suggested by the original paper for computer vision applications.
5. Conclusions and Future Work
We introduced JointMap, a deep learning model for jointly learning two high-level intent tasks on e-commerce search data. JointMap utilizes word and label representations, and leverages focal loss to tackle the class imbalance problem in catalog categories. Our results are promising compared to state-of-the-art deep learning models, with average gains of 2.3% and 10.9% in macro-averaged F1 on user commercial vs. non-commercial intent and product category mapping, respectively. Our future work includes extending JointMap to incorporate contextual information within a session. In summary, the presented work advances the state of the art in user intent prediction, and lays the groundwork for future research on user intent understanding in e-commerce.
We gratefully acknowledge the financial and computing support from The Home Depot Search & NLP team.
- Ashkan et al. (2009). Classifying and characterizing query intent. In Proceedings of ECIR, pp. 578–586.
- Bojanowski et al. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- Caruana (1997). Multitask learning. Machine Learning 28(1), pp. 41–75.
- Conneau et al. (2016). Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.
- Croft et al. (2010). Query representation and understanding workshop. In SIGIR Forum, Vol. 44, pp. 48–53.
- (2019a). Dealing with difficult minority labels in imbalanced multilabel data sets. Neurocomputing 326, pp. 39–53.
- (2019b). Generic intent representation in web search. In Proceedings of SIGIR, pp. 65–74.
- Ha et al. (2016). Large-scale item categorization in e-commerce using multiple recurrent neural networks. In SIGKDD, pp. 107–115.
- Khatri et al. (2018). Contextual topic modeling for dialog systems. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 892–899.
- Lin et al. (2017). Focal loss for dense object detection. In ICCV, pp. 2980–2988.
- Liu et al. (2017). Deep learning for extreme multi-label text classification. In Proceedings of SIGIR, pp. 115–124.
- Wang et al. (2018a). Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174.
- Wang et al. (2018b). A multi-task learning approach for improving product title compression with user search log data. In Proceedings of AAAI.
- Zhao et al. (2019). A dynamic product-aware learning model for e-commerce query intent understanding. In CIKM, pp. 1843–1852.