is one of the basic tasks for natural language processing and has many real-world applications such as question answering, textual entailment, paraphrase identification, and information retrieval 1; 2; 3; 4. Unlike simple text matching, semantic similarity match aims to infer the semantic similarity of two sentences rather than the extent of their word overlap. For example, while “mac pro” and “mac lipstick” look alike, they describe two different items, i.e. a computer and a lipstick; “iPad” and “apple tablet” share no common word at all yet refer to the same item, i.e. a tablet computer. While similarity match usually deals with two homogeneous sentences of comparable lengths and expects to match every position of both sentences, semantic relevance match deals with heterogeneous pieces of text, such as the query and document in ad-hoc information retrieval, and expects to match some keywords in documents with queries. As a specific application of ad-hoc information retrieval, e-commerce search serves as a platform to fetch candidate items highly relevant to a given query to satisfy the user’s purchase requirement. If a search system returns too many semantically irrelevant items, it creates an unpleasant user experience and erodes the user’s trust and confidence in the e-commerce platform. Therefore, semantic relevance estimation is critically important for long-term user satisfaction.
The current research on deep semantic learning can be grouped into two camps: representation-focused and interaction-focused. The representation-focused methods 5; 6 typically embed two input sentences separately using two neural networks and compute a relevance score to measure the similarity between the two embeddings. The common similarity measures used include the cosine similarity and negative Jensen-Shannon divergence. On the other hand, the interaction-focused methods 7; 8 concatenate two input sentences as input to a single neural network which captures interactions between their features. These two types of methods exploit the semantic relationship between the query and the item text and are effective for static and context-free matching problems. However, the semantic match problem in many real applications is often not context-free, e.g. search logs store plenty of valuable context data such as the query/item incidence network and the users’ historical behavior sequences (view, click, purchase, etc.). (In this work, context information refers broadly to the neighbor features in a query/item bipartite graph, rather than the more restricted sense of co-occurring items in a search session.) These types of context information provide the actual semantic content of a query or item, which is ignored by the current semantic match methods.
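The two similarity measures named above can be made concrete with a minimal sketch. The function names are illustrative, the inputs are assumed to be plain Python lists, and the Jensen-Shannon variant assumes both inputs are already probability vectors; none of this is taken from the implementations of the cited methods.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def _kl(p, m):
    """KL(p || m) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)


def negative_js_divergence(p, q):
    """Negative Jensen-Shannon divergence of two probability vectors.

    Values closer to 0 mean the two distributions are more similar.
    """
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return -(0.5 * _kl(p, mix) + 0.5 * _kl(q, mix))
```

Both functions return larger values for more similar inputs, which is why the divergence is negated before being used as a relevance score.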
When exploiting the graphical context information for semantic relevance match, we face the following challenges:
C1: Limited supervised information. Although a variety of user behavior signals were recorded in search logs, they are often noisy and misaligned with the search relevance objective: many factors other than relevance may affect a user’s final decision such as item popularity, title attractiveness (click-baitiness), and result set diversity. Human labels can provide accurate relevance information, but training an excellent deep model often requires millions or more examples, which are labor-intensive and costly to collect. In short, high quality annotated signals are scarce in our problem domain.
C2: Uncertainty in how to integrate context information. Since user behavioral feedback cannot naively substitute relevance signals, systematic utilization of the massive amounts of search log data has been a central research theme in e-commerce search. Incorporating context information on top of the logged user feedback presents new challenges. Any proposed method should be context-aware and corroborative of the final relevance objective, an under-explored area in the current research.
C3: Memory and latency constraints. A query phrase and an item’s title in an e-commerce search can often be represented in diverse forms. If each text is stored as a unique entity in an online key-value store, it may take 100 Gigabytes of memory (space complexity). Such large memory usage results in large models, which in turn respond slowly to online queries and struggle to meet the run-time (latency) requirements.
To address these challenges, we propose to study the semantic match problem in dynamically evolving search scenarios. This problem is different from the existing context-free search problems. In particular, we consider that queries and items in the search log constitute a natural heterogeneous bipartite network structure (Fig. 1). In this network, there are two types of nodes (queries and items) and two types of edges (click behaviors and purchase behaviors). Traditional approaches estimate the relevance of two adjacent nodes in this network in isolation. We argue that contextual relevance can be significantly improved by taking into account their neighbors’ semantic information.
A central problem in constructing a bipartite network is edge refinement, specifically that in the query/item co-occurrence graph. Due to the noisy nature of user behaviors, we cannot rely on them exclusively to build the connection between two nodes. On the other hand, the pre-trained language representation model (e.g. BERT 10) is equipped with a good semantic understanding capacity. So we choose BERT as the teacher model and extract relevance knowledge which can be used as annotated information (C1). We then use this external knowledge to refine the user behavior network, via adding ignored edges and deleting noisy edges, to ensure a high-quality network structure.
Similar to the concept of first and second order proximity 11; 12, we propose alternative definitions of the first-order relevance and second-order relevance. Previous semantic match research 26; 27 only considers the first-order relevance, which is reasonable for context-free semantic match problems, but loses valuable context information when applied to real-world applications. We argued in C2 that both the first and second order relevance should be taken into consideration together for e-commerce search. Therefore, we propose a new model, Heterogeneous GNN for Semantic Match (HG4SM), which can be broadly applied to any search ranking problem that seeks to incorporate the context of real-world applications. The model captures the first-order relevance using a word interaction matrix augmented with positional encoding and captures the second-order relevance using the metapath-guided embedding with graph attention scores. To the best of our knowledge, HG4SM is the first heterogeneous network embedding for the task of search relevance.
Although a query phrase and an item’s title may appear in many different forms, the words in these sentences have a smaller representation space and are easy to embed. Thus, we use a distributed word representation to depict various queries and titles, which greatly reduces the model’s space complexity compared to a document embedding. The above design ensures that our model derives explicit interaction matching signals and reasonable node semantic embeddings so that we only need to employ shallow neural networks to combine all embeddings. Therefore, the whole model has a low time complexity and is suitable for online deployment (C3). We list some related works and compare them with our work in the appendices.
Here we give 1) three basic definitions about heterogeneous networks and node proximity, and 2) two new definitions for the novel problem of second-order relevance.
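Heterogeneous Network. Given a network or graph $G = (V, E)$ with node (vertex) set $V$ and edge set $E$, if the node type mapping function $\phi: V \rightarrow \mathcal{A}$ and the edge type mapping function $\psi: E \rightarrow \mathcal{R}$ satisfy the condition $|\mathcal{A}| + |\mathcal{R}| > 2$, then $G$ is a heterogeneous network.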
For the convenience of exploiting the heterogeneous information in a network, we only consider unsigned networks (i.e. no negative edges) with undirected and unweighted edges. In a heterogeneous network, metapath and its corresponding instances are universal concepts and defined as:
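Metapath and Metapath Instance. A metapath $\mathcal{P} = A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ represents the path from $A_1$ to $A_{l+1}$ successively through $R_1, \ldots, R_l$ (where each $A_i$ is a node type and each $R_i$ is an edge type). A metapath instance $p$ is a definite node sequence instantiated from metapath $\mathcal{P}$.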
A good network embedding can well preserve the network’s structural information, e.g., local network structure and global network structure. The local and global network structures can be respectively represented by the first-order and second-order proximity. Proximity represents two nodes’ spatial closeness, where the first-order proximity is the local pairwise proximity and the second-order proximity is the neighbor structure proximity between two nodes.
First-order Proximity and Second-order Proximity. For two nodes $v_i$ and $v_j$ in a network, their first-order proximity can be formalized as the function $f(v_i, v_j)$ and their second-order proximity can be formalized as the function $g(N(v_i), N(v_j))$, where $N(\cdot)$ denotes a neighbor set. If there is an edge between $v_i$ and $v_j$, then the value of $f$ is larger than when there is no such edge. If $v_i$ and $v_j$ share more common neighbors, then the value of $g$ is larger.
For the first-order proximity function, a common design is:
For the second-order proximity function, a possible choice can be:
Intuitively, even though there is no edge between two nodes $v_i$ and $v_j$, if they share many common neighbors, they should also be very similar to each other. So the second-order proximity is an important supplement to the first-order proximity.
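As an illustration of the two notions, the following minimal sketch uses an edge indicator for first-order proximity and the Jaccard overlap of neighbor sets for second-order proximity. Both choices, the function names, and the toy graph are assumptions made for this example and are not necessarily the designs referred to above.

```python
def first_order_proximity(adjacency, u, v):
    """1.0 if u and v are directly connected, else 0.0 (edge indicator)."""
    return 1.0 if v in adjacency.get(u, set()) else 0.0


def second_order_proximity(adjacency, u, v):
    """Jaccard overlap of the neighbor sets of u and v (illustrative choice)."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Toy undirected graph: "a" and "b" are not connected but share two neighbors.
graph = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "x": {"a", "b"},
    "y": {"a", "b"},
    "z": {"b"},
}
```

In this toy graph, nodes "a" and "b" have zero first-order proximity yet a high second-order proximity, which is exactly the case where the second notion adds information.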
If we view proximity from the perspective of “path”, we can conclude that: 1) the first-order proximity reveals that the shortest path between two nodes $v_i$ and $v_j$ has length 1, and 2) the second-order proximity reveals that the length of their shortest path is 2. Higher-order proximity reveals that the length of their shortest path is greater than 2. As the path length increases, the information intensity gradually weakens along the path, so higher-order proximity is not considered here. Proximity is also suitable for heterogeneous networks, but the definitions of neighbors in homogeneous and heterogeneous networks are different. In heterogeneous networks, neighbors are metapath-based and denote those nodes that have the same type as the central node. For example, in a citation network, the author citation relationship can be represented as “author-paper-author” and its corresponding metapath is “A-P-A”. On this metapath, if two authors cite the same paper, they establish a metapath-based neighbor relationship and tend to have similar research interests.
In a heterogeneous network, the relevance score represents the semantic closeness of two nodes. Similar to node proximity, we introduce the following new definitions.
First-order Relevance and Second-order Relevance. When mapping two sentences into two nodes $v_i$ and $v_j$ in a heterogeneous network $G$, the first-order relevance and second-order relevance can be separately formalized as the functions $F(v_i, v_j)$ and $S(N(v_i), N(v_j))$, where $N(\cdot)$ denotes a neighbor set. If $v_i$ is semantically more similar to $v_j$, then $F(v_i, v_j)$ is larger. If $v_i$’s neighbors are semantically more similar to $v_j$’s neighbors, then $S(N(v_i), N(v_j))$ is larger.
Instead of proximity, our relevance considers local and global node semantics. We argue that both the first-order relevance and second-order relevance are necessary for semantic match. They supplement each other in the same way as the first-order and second-order proximity. A good mechanism should fully consider these two types of relevance and make them cooperate with each other, i.e. the first-order relevance plays a dominant role and the second-order relevance provides important auxiliary information. In other words, when it is difficult to judge whether two nodes are semantically relevant from the textual information of themselves, the second-order relevance which is based on neighbor set can become useful. Finally, we give the definition of the research objective of this work below.
Heterogeneous Network Embedding for Relevance Estimation. For a heterogeneous network $G$, the network embedding for relevance estimation aims to learn each node’s low-dimensional distributed representation, which simultaneously satisfies the needs of the first-order relevance and the second-order relevance, and thus preserves both the local and global node semantics.
We propose a complete heterogeneous network embedding architecture for the e-commerce search relevance match problem (Fig. 2). Concretely, we first denoise the user behavior network: we use a fine-tuned BERT model to produce external knowledge that refines the click behavior edges in the original network. Based on the resulting knowledge-enhanced network, we select the two most important metapaths. We apply a node-level encoder and a metapath attention unit together to integrate the neighboring nodes’ context information into the central node. In addition, considering the particularity of the e-commerce scene, we give: 1) a special vocabulary formation rule to preserve the complete semantics of many products or brands; 2) the word-level interaction representation to capture the micro semantic matching signals between the query and title; and 3) the sentence-level semantic representation to capture the macro semantic matching signals between the query and title.
In a real-world search application, queries and items and their relationships naturally form a heterogeneous bipartite network based on the multi-type user behaviors like view, click, and purchase. However, as mentioned earlier, user behaviors are typically biased and noisy. So if one directly conducts embedding learning on the original user behavior network, it is difficult for the model to estimate explicit relevance relationships. To solve this problem, we introduce external knowledge to refine the user behavior network and then construct a knowledge-enhanced heterogeneous network. Here the external knowledge is provided by the BERT model that is pre-trained on a large text corpus and then fine-tuned on some in-house data. The whole network construction includes the following phases.
Fine-tuning BERT. Transformer-based models such as BERT and ERNIE 15 have been widely used as NLP benchmarks in recent years. Here we use the BERT-Base model (https://github.com/google-research/bert), composed of stacked Transformer units, and fine-tune it on some in-house data. The positive and negative examples in the data are human-labeled and cover various categories of items. The fine-tuned BERT is equipped with a high relevance discrimination ability and thus can act as an expert on filtering noisy data.
Behavior network formation. The user behavior network in the left network of Fig. 2 is built on some user search log (over several months) which records user clicks and purchase behaviors as well as their frequencies. In this network, an edge represents an existing click behavior or purchase behavior between a query-node and an item-node. The edge of purchase behavior is sparse but important. The edge of click behavior is denser but highly noisy, making it difficult to learn a good model and predict reasonable results. Therefore, we introduce BERT to refine this network.
Knowledge-enhanced network refinement. Given the scarcity of user purchase feedback and the noisiness of click feedback, we retain all the original purchase behavior edges while using the external knowledge from BERT to refine the click behaviors. Specifically, we set two different thresholds: the first is used for judging whether a clicked item is truly relevant to its query, and the second is used for judging whether an unclicked item is truly irrelevant to its query. This strategy helps remove noise in user behaviors and at the same time extract the missing but crucial relevance signals not captured by user behaviors. To preserve a high-quality neighbor set, for each central node, we rank its neighboring nodes with the priority of “purchase > high click > low click” and sample the top-2 of them as the final neighbor set.
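The refinement and neighbor sampling steps can be sketched as follows. This is a rough illustration under several assumptions: the teacher scores are supplied as a precomputed dictionary standing in for the fine-tuned BERT, an unclicked pair is added as an edge when its score reaches the second threshold, and “high click” versus “low click” is interpreted as click frequency. The function names and data layout are hypothetical.

```python
def refine_click_edges(pairs, teacher_score, keep_threshold, add_threshold):
    """Return the refined set of (query, item) click edges.

    pairs: iterable of (query, item, clicked) tuples.
    teacher_score: dict mapping (query, item) to a relevance score
        (hypothetical stand-in for the fine-tuned teacher model).
    """
    refined = set()
    for query, item, clicked in pairs:
        score = teacher_score[(query, item)]
        if clicked and score >= keep_threshold:
            refined.add((query, item))  # clicked edge confirmed by the teacher
        elif not clicked and score >= add_threshold:
            refined.add((query, item))  # missing edge recovered by the teacher
    return refined


def sample_neighbors(neighbors, k=2):
    """Pick the top-k neighbors by 'purchase > high click > low click'.

    neighbors: list of (node, purchased, click_count) tuples.
    """
    ranked = sorted(neighbors, key=lambda n: (not n[1], -n[2]))
    return [node for node, _, _ in ranked[:k]]
```

Purchase edges are not passed through `refine_click_edges` in this sketch, mirroring the statement that all original purchase edges are retained.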
Heterogeneous Network Analysis
In HG4SM, the basic units are query nodes, item nodes and the refined user-behavior-oriented edges between them. Based on this schema, we employ two metapaths, “Q-I-Q” and “I-Q-I”, where “Q” and “I” stand for “query” and “item”. These two metapaths correspond closely to the second-order relevance defined earlier. Compared to some more complex metapaths (such as “Q-I-Q-I-Q” and “I-Q-I-Q-I”), the metapaths adopted in our model are both effective (with less information density loss) and computationally efficient. We further choose two instances for each metapath whenever they are available and pad with zero embeddings otherwise. Take “Q-I-Q” as an example: its instances are concrete query-item-query node sequences, as shown in Fig. 2.
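Enumerating instances of a three-node metapath in a bipartite graph can be sketched as below. The adjacency dictionaries, the function name, and the use of `None` as a padding placeholder (to be mapped to a zero embedding downstream) are assumptions for illustration only.

```python
def metapath_instances(forward, backward, central, max_instances=2, pad=None):
    """Collect up to max_instances three-node instances starting at central.

    forward: dict mapping a node of the central type to its neighbors.
    backward: dict mapping a node of the other type to its neighbors.
    Missing instances are filled with `pad`.
    """
    instances = []
    for middle in forward.get(central, []):
        for end in backward.get(middle, []):
            if end == central:
                continue  # skip paths that return to the central node
            instances.append((central, middle, end))
            if len(instances) == max_instances:
                return instances
    instances.extend([pad] * (max_instances - len(instances)))
    return instances


# Toy refined network: queries q1..q3 and items i1, i2.
query_to_items = {"q1": ["i1", "i2"], "q2": ["i1"], "q3": ["i2"]}
item_to_queries = {"i1": ["q1", "q2"], "i2": ["q1", "q3"]}
```

Calling the function with `(query_to_items, item_to_queries, "q1")` yields query-item-query instances, and swapping the two dictionaries yields item-query-item instances for an item node.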
First-order Relevance Modeling
The whole framework of the new HG4SM model includes both the first-order relevance modeling and second-order relevance modeling. We first introduce the first-order relevance modeling which captures macro and micro semantic match signals by incorporating both the representation-focused and interaction-focused designs.
Word embedding in e-commerce scene.
In e-commerce applications, it is infeasible to represent queries and items as individual embeddings since their entity space is effectively unbounded. Instead, we adopt word embedding in HG4SM, which dramatically reduces the representation space and deals well with the cold-start situation. Numerous product type names or attribute names (like “iphone11” and “256GB”) exist, but basic word segmentation based on N-gram or WordPiece splits these special names and thus cannot reveal their complete semantics. To adapt to this feature, we first treat single Chinese characters and contiguous runs of numerals or English letters as single words (Fig. 3). We then retain only the high-frequency words in this list. Compared to potentially millions of queries and billions of items, our vocabulary contains only tens of thousands of words, which reduces memory consumption and the lookup time of ids and embeddings by a large margin. We represent each word with a $d$-dimensional vector and train these word vectors or embeddings using Word2Vec 24. We use $w_i^q$ (and $w_i^t$) to denote the embedding of the $i$-th word of query $q$ (and of the item’s title $t$).
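The segmentation rule and the frequency filter can be sketched with a regular expression. The pattern below assumes that a mixed run of letters and digits (such as “iphone11”) forms one word and that Chinese characters fall in the basic CJK range; the function names and the `min_count` parameter are hypothetical.

```python
import re
from collections import Counter

# One token per alphanumeric run, or per single CJK character (assumed rule).
_TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def segment(text):
    """Split text into alphanumeric runs and single Chinese characters."""
    return _TOKEN_PATTERN.findall(text)


def build_vocabulary(texts, min_count=2):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(token for text in texts for token in segment(text))
    return {token for token, count in counts.items() if count >= min_count}
```

With this rule, “iphone11” and “256GB” each stay intact as one word instead of being split into sub-word pieces.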
Macro matching element. Most of the first-order relevance methods are either representation-focused (capturing macro matching signals) or interaction-focused (capturing micro matching signals). We employ their mixture in HG4SM’s first-order relevance modeling. For the representation-focused part, take query $q$ with $m$ words as an example: its sequence embedding $e_q$ is obtained by calculating the element-wise mean value of its word embeddings $w_1^q, \ldots, w_m^q$:
$$e_q = \frac{1}{m} \sum_{i=1}^{m} w_i^q$$
The above $e_q$ depicts the whole semantic information of query $q$, so it can be viewed as a simplified version of a representation-focused embedding.
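The element-wise mean described above amounts to a few lines of code. This is a minimal sketch that assumes word embeddings are given as equal-length Python lists; the function name is illustrative.

```python
def mean_embedding(word_vectors):
    """Element-wise mean of a non-empty list of equal-length word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    count = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[k] for vec in word_vectors) / count for k in range(dim)]
```

The result has the same dimension as a single word embedding regardless of the sentence length, which is what keeps the representation compact.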
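Micro matching element. To capture micro matching signals, we need to model the word-level interaction information 7 between queries and titles. Suppose the sequence lengths of query $q$ and title $t$ are $m$ and $n$, respectively; we build an interaction matrix $M \in \mathbb{R}^{m \times n}$ whose $(i, j)$-th entry measures the interaction between the $i$-th query word and the $j$-th title word.
The interaction embedding is derived by reshaping $M$ into a one-dimensional vector of length $mn$.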
Position encoding. Besides, considering the sequential structure of texts, we further add a position embedding to each word embedding before calculating the correlation matrix. The position embedding is set as a trainable embedding vector and has the same dimension as the word embedding.
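A minimal sketch of the micro matching path follows: position embeddings are added to word embeddings, a word-by-word interaction matrix is built, and the matrix is flattened. The dot product is assumed as the interaction function purely for illustration, and all names are hypothetical.

```python
def add_position(word_vectors, position_vectors):
    """Add a position embedding to each word embedding (same dimension)."""
    return [
        [w + p for w, p in zip(word, position)]
        for word, position in zip(word_vectors, position_vectors)
    ]


def interaction_matrix(query_vectors, title_vectors):
    """Matrix whose (i, j) entry is the dot product of query word i and title word j."""
    return [
        [sum(a * b for a, b in zip(q, t)) for t in title_vectors]
        for q in query_vectors
    ]


def flatten(matrix):
    """Reshape a matrix into a one-dimensional vector, row by row."""
    return [value for row in matrix for value in row]
```

In a real model the sequences would be truncated or padded to fixed lengths so that the flattened vector has a constant size.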
Second-order Relevance Modeling
Most semantic match methods focus on the first-order relevance but ignore the second-order relevance, which integrates the neighbor information on metapaths into the central node and is important in many real context-aware scenes. A complete semantic relevance estimation model should integrate both. Here we consider how to generate second-order relevance embeddings that incorporate context information in the network to enrich and improve the central nodes’ semantics. In general, the second-order relevance model consists of a node-level encoder and a metapath instance-level aggregator.
Node-level encoder. A metapath instance bridges the communication gap between heterogeneous nodes and thereby can infer the node’s global semantic embedding (rather than the local semantic embedding). For each metapath instance, to derive the global node semantics, we integrate the neighboring node embedding into the central node embedding with a mean encoder. For a query-item-query instance, for example, the instance embedding is the output of this mean encoder over its node embeddings.
Instance-level aggregator with graph attention.
Different metapath instances convey different information, so they should have different effects on the final metapath’s embedding. However, the mapping relationship between the metapath instance’s embedding and the metapath’s embedding is unknown. To learn their relationship automatically, we introduce a “graph attention” mechanism to the formation process of the metapath’s embedding. The attention mechanism enables the model to learn a weight distribution and assign different weights to different components, which has been successfully applied in computer vision and natural language processing 23. Here we introduce graph attention to represent the mapping relationship between a metapath and its instances. The final metapath’s embedding is then obtained by accumulating the embeddings of the metapath’s instances weighted by their attention scores and passing the result through an activation function. Though the attention score can be set as a fixed value, we adopt a more flexible way, i.e., we use a single-layer neural network to learn it automatically. Specifically, we feed the concatenation of the embeddings of the central node and a metapath instance into a one-layer neural network and output an attention distribution in the softmax layer. This network is parameterized by a $1 \times 2d$ trainable vector, shared across all metapaths.
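The node-level encoder and the attention-based aggregator can be sketched together. This is an illustration under stated assumptions: the mean encoder averages all node embeddings along an instance, the attention score is a plain dot product between the shared vector and the concatenated embeddings (no extra nonlinearity), and `tanh` is used as the activation. The section does not fix these details, and all names are hypothetical.

```python
import math


def mean_encode(node_embeddings):
    """Instance embedding: element-wise mean of the node embeddings on the instance."""
    count = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(e[k] for e in node_embeddings) / count for k in range(dim)]


def attention_weights(central, instance_embeddings, attention_vector):
    """Softmax over scores of [central ; instance] against a shared 2d-long vector."""
    scores = [
        sum(a * x for a, x in zip(attention_vector, list(central) + list(inst)))
        for inst in instance_embeddings
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def metapath_embedding(central, instance_embeddings, attention_vector,
                       activation=math.tanh):
    """Attention-weighted sum of instance embeddings followed by an activation."""
    weights = attention_weights(central, instance_embeddings, attention_vector)
    dim = len(central)
    return [
        activation(sum(w * inst[k] for w, inst in zip(weights, instance_embeddings)))
        for k in range(dim)
    ]
```

With an all-zero attention vector the weights are uniform, so the aggregator reduces to a plain average; training the vector lets the model favor the more informative instance.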
Embedding fusion. Based on the first-order and second-order relevance modeling, three types of embeddings can be generated: the representation-focused, interaction-focused and metapath-guided embeddings. To combine them, we concatenate these three embeddings, feed the result to a deep neural network, and output a relevance score. To keep the model’s time cost low, we use efficient three-layer DNNs rather than more complex neural network structures such as CNN and LSTM.
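The fusion step can be sketched as a small feed-forward pass over the concatenated embeddings. The ReLU hidden activations, the sigmoid output, and the `layers` format (a list of weight-matrix and bias pairs) are assumptions for illustration; the layer sizes and activations of the actual model are not given in this section.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def relevance_score(rep_embedding, interaction_embedding, metapath_embedding_, layers):
    """Concatenate the three embeddings and run them through the given layers.

    layers: list of (weights, bias) pairs; the last layer must have one output.
    """
    x = list(rep_embedding) + list(interaction_embedding) + list(metapath_embedding_)
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    logit = dense(x, weights, bias)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # squash to a score in (0, 1)
```

Passing a list of three `(weights, bias)` pairs gives the three-layer configuration mentioned above.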
We now present experimental results. We cover, in order, the baseline methods, performance metrics, final comparison results, ablation study, and online performance. We discuss the datasets used, implementation details, and the parameter sensitivity experiment in the appendices.
The Baselines and Metrics
The HG4SM model is compared to several existing state-of-the-art semantic models. To facilitate their implementation, we use the open-source codebase packages which are as follows.
†: Nine methods 28; 29; 30; 31; 32; 33 are implemented using MatchZoo 25 and compared with HG4SM. We use the default hyper-parameter setting for all the methods in MatchZoo (https://github.com/NTMC-Community/MatchZoo).
§: Two other baselines DSSM 5 and ESIM 8 are also included in the comparison. DSSM is a classical representation-focused method using pairwise examples to train the model. ESIM is a successful interaction-focused method which incorporates the syntactic parsing information into chain LSTM for natural language inference.
To comprehensively measure the performance of our HG4SM and these baseline methods, we use six metrics, including: 1) Area Under the receiver operating characteristic Curve, 2) Accuracy, 3) Precision, 4) Recall, 5) F1-score and 6) False Negative Rate (since it is more serious than False Positive Rate for e-commerce ranking). They are respectively denoted by AUC, Acc, Pre, Recall, F1 and FNR for short. For the first five metrics, the higher the metric value, the better the model’s performance. For FNR, the lower its value, the better the model’s performance. Note that AUC often serves as the most important metric while the others also provide auxiliary supports for our analysis.
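For reference, the six offline metrics can be computed as in the sketch below. It assumes binary labels and predictions in {0, 1}, the textbook definition FNR = FN / (FN + TP), and a pairwise (rank-based) AUC; the exact formulas used in the experiments are not spelled out in this section, so treat these as assumptions.

```python
def classification_metrics(labels, predictions):
    """Acc, Pre, Recall, F1 and FNR from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Acc": (tp + tn) / len(labels),
        "Pre": precision,
        "Recall": recall,
        "F1": f1,
        "FNR": fn / (fn + tp) if fn + tp else 0.0,  # textbook definition (assumed)
    }


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

The pairwise AUC is quadratic in the number of examples and is meant for small illustrations; large evaluations would use a sort-based implementation.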
|Relevance order||Model||AUC||Acc||Pre||Recall||F1||FNR||AUC||Acc||Pre||Recall||F1||FNR|
|1st & 2nd||Int+HIN||0.8761||0.8009||0.7799||0.8691||0.8221||0.2758||0.8824||0.8585||0.8941||0.9263||0.9099||0.3705|
|1st & 2nd||Rep+HIN||0.8656||0.7956||0.7708||0.8736||0.8190||0.2922||0.8750||0.8576||0.8928||0.9266||0.9094||0.3753|
|1st & 2nd||HG4SM||0.8786||0.8025||0.7790||0.8754||0.8244||0.2794||0.8862||0.8597||0.8949||0.9270||0.9107||0.3673|
We compare HG4SM with eleven state-of-the-art deep semantic matching methods using the in-house e-commerce search log data. The results are shown in Table 1. Because some baseline methods (e.g., DRMM and ESIM) have relatively high time complexities, we sample ten million training data from the all-categories dataset. For fairness, all methods including our HG4SM are trained on these data.
As shown in Table 1, HG4SM nearly always outperforms the other methods compared, across all six metrics. More specifically,
Compared to the second-best method, HG4SM obtains 1.8% (and 5.4%) gains in AUC on the mobile-phone dataset (and the all-categories dataset). Furthermore, HG4SM achieves the best (smallest) FNR on both datasets. This implies that HG4SM has a high discrimination power on negative examples, so that it can return a list of satisfactory items which are relevant to the user’s shopping needs.
The collected training data have imbalanced classes (i.e. the positive examples are far more than the negative examples), making model learning a challenge. Most of the methods compared, such as DSSM, are vulnerable to the class imbalance and often fail to correctly estimate many testing examples in this case. Fortunately, our HG4SM learns explicit node semantics benefiting from the neighboring node’s effect, so that it has a robust performance even though the training data are highly imbalanced.
To further examine the importance of each component in the HG4SM model, we remove one or two components from HG4SM at a time and examine how the components affect the overall performance. We have the following empirical observations and analysis of the results, shown in Table 2:
In general, among the three submodels of HG4SM (the representation-focused component alone, the interaction-focused component alone, and both together), the two-component setting often outperforms each single-component setting but performs worse than the three-component setting obtained by introducing HIN (i.e. HG4SM). This demonstrates that introducing HIN helps the model learn comprehensive knowledge and thus give an explicit estimation.
The introduction of second-order relevance modeling can provide stable improvement (e.g. 1.0% to 1.6% AUC gains) to every first-order solution. This demonstrates that applying HIN to the semantic match can effectively exploit the neighboring nodes’ information contained in user-behavior networks, benefiting the final relevance estimation between central nodes.
To further evaluate HG4SM’s performance in the real search scene, we deploy it to an online e-commerce search platform and report its A/B test results in Table 3. Four widely-used online business metrics are adopted: 1) Conversion Rate (CVR): total order number / total click number; 2) User Conversion Rate (UCVR): total order number / search user number; 3) UV-value: total Gross Merchandise Volume (GMV) / total user number; and 4) Revenue Per Mille (RPM): 1000 * search GMV / search number.
|Metrics||Price Sort||Default Sort|
The results of the A/B tests show that our HG4SM outperforms the existing online DNN model in this platform on all of the business metrics. For example, HG4SM improves UV-value by 5.7% under price sort and by 0.5% under default sort. This indicates that 1) our HG4SM model has a low time complexity, which makes it easy to cooperate with other online serial models, and 2) HG4SM provides accurate relevance estimation between queries and items, so that it provides users with an efficient and intelligent search experience.
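The four online business metrics defined above are simple ratios and can be written down directly. The function and parameter names below are illustrative, and the inputs are assumed to be aggregate counts and GMV totals collected over the same A/B test window.

```python
def conversion_rate(total_orders, total_clicks):
    """CVR: total order number / total click number."""
    return total_orders / total_clicks


def user_conversion_rate(total_orders, search_users):
    """UCVR: total order number / search user number."""
    return total_orders / search_users


def uv_value(total_gmv, total_users):
    """UV-value: total Gross Merchandise Volume / total user number."""
    return total_gmv / total_users


def revenue_per_mille(search_gmv, search_count):
    """RPM: 1000 * search GMV / search number."""
    return 1000 * search_gmv / search_count
```

A relative improvement between the treatment and control buckets is then `(treatment - control) / control` for each metric.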
In this paper, we studied the semantic relevance match problem in e-commerce search. In reference to the previous semantic match research using first-order relevance modeling, we proposed a novel idea to combine first-order and second-order relevance match. Based on this new idea, we employed a heterogeneous network embedding to exploit the potential context information in the “query-item” heterogeneous bipartite network. Compared to the current state-of-the-art methods, our novel HG4SM model showed a robust prediction performance. The ablation study verified that the addition of the second-order relevance modeling can significantly improve the performance of the method using the traditional first-order relevance alone. Finally, we applied HG4SM to our in-house e-commerce platform by deploying it to the online search system, which significantly improved the user’s search experience.