Adversarial Mixture Of Experts with Category Hierarchy Soft Constraint

by   Zhuojian Xiao, et al., Inc.

Product search is the most common way for people to satisfy their shopping needs on e-commerce websites. Products are typically annotated with one of several broad categorical tags, such as "Clothing" or "Electronics", as well as finer-grained categories like "Refrigerator" or "TV", both under "Electronics". These tags are used to construct a hierarchy of query categories. Feature distributions such as price and brand popularity vary wildly across query categories. In addition, feature importance for the purpose of CTR/CVR predictions differs from one category to another. In this work, we leverage the Mixture of Experts (MoE) framework to learn a ranking model that specializes for each query category. In particular, our gate network relies solely on the category ids extracted from the user query. While classical MoEs pick expert towers spontaneously for each input example, we explore two techniques to establish more explicit and transparent connections between the experts and query categories. To help differentiate experts on their domain specialties, we introduce a form of adversarial regularization among the expert outputs, forcing them to disagree with one another. As a result, they tend to approach each prediction problem from different angles, rather than copying one another. This is validated by a much stronger clustering effect of the gate output vectors under different categories. In addition, soft gating constraints based on the categorical hierarchy are imposed to help similar products choose similar gate values and make them more likely to share similar experts. This allows aggregation of training data among smaller sibling categories to overcome data scarcity issues among the latter. Experiments on a learning-to-rank dataset gathered from a leading e-commerce search log demonstrate that MoE with our improvements consistently outperforms competing models.



1. Introduction

Increasingly, people are turning to e-commerce to satisfy their shopping needs. Since the early days of selling books and durable goods, e-commerce platforms have expanded to offer a wide range of products, including perishables and services. This poses fresh challenges in search ranking as user queries invariably become more diverse and colloquial, similar to how users would interact with a store cashier.

One key input in e-commerce search ranking is the product category tagging. Shop owners are often required to label their products with these categories to facilitate search indexing. From these product categories one can construct a notion of query categories, usually by aggregating the most frequently occurring product categories correctly retrieved under the query. Most e-commerce ranking systems today do not have the engineering resources to deploy dedicated models for each query category, even the major ones. For a human cataloguer, however, a natural strategy is to first identify the most likely category the query belongs to, then retrieve items within that category. Features may have different importance for product ranking in different categories. Intuitively, separate ranking strategies for different categories should improve overall product search relevance, as judged by user purchase feedback.

In order to put this intuition into practice without incurring unwieldy engineering cost, modeling ideas such as Mixture of Experts quickly come to mind. The latter excels at delegating a single task to a bidding and polling system of multiple expert predictors. The actual mechanism of MoE models, however, differs from this intuition: the model learns the experts spontaneously, without meaningful connection to natural covariates like the product category. While the model quality may be improved, the model remains opaque and monolithic, and difficult to understand from a business perspective.

Here we propose a set of techniques based on MoE to take advantage of natural business categories, such as electronics or books. These techniques improve ranking quality on individual categories and make the expert specialties more distinctive and transparent. This opens up the possibility for subsequent extraction and tweaking of category-dedicated models from the unified ensemble.

We summarize our contributions as follows:

  • Hierarchical Soft Constraint: We introduce a novel soft constraint based on hierarchical categories (Figure 1) in the e-commerce product search scenario to help similar categories learn from one another. By sharing network weights more strategically among similar categories, smaller sibling categories can combine their training data to mitigate data size skew.

  • Adversarial Mixture of Experts: We propose an adversarial regularization technique in the MoE model that encourages experts of different problem domains to disagree with one another, thereby improving diversity of viewpoints in the final ensemble.

  • Benefits in Real-world Datasets: To the best of our knowledge, this work represents the first study of deep MoE models on learning-to-rank datasets. An early work based on classical modeling techniques can be found in [3]. Applications in content recommendation domains such as [18] do not involve the user query, which is a key input feature in our models. Experiments show that our improved MoE model outperforms competing methods, especially on smaller categories that have traditionally suffered from insufficient training data.

Figure 1. Hierarchical Categories

2. Related Works

Since our proposed model is based primarily on the Mixture-of-Experts framework and applies to e-commerce search ranking, we discuss related works in these two areas.

2.0.1. Deep Search and Recommendation Algorithms

Since the beginning of the neural net revolution, deep neural nets have been successfully applied in industrial ranking problems, pioneered notably by DSSM [11], which embeds query and documents as separate input features in a multi-layer perceptron.

Subsequent improvements of DSSM include Deep & Cross [27], Wide & Deep [4], DeepFM [8], etc. A fully-connected network approach is presented in [15]. Reinforcement learning in conjunction with DNN in the e-commerce search setting has been extensively studied in [10]. Ensembling of neural net models with other traditional techniques has been explored in [30]. Mixture of Experts can also be viewed as a form of end-to-end ensembling, with a potentially unbounded number of ensemble components at constant serving cost.

Airbnb’s account of neural network application in their search engine [9] is a remarkable compilation of techniques and challenges that mirror our own. In particular, it emphasizes the notion of “Don’t be a hero” in model choice. Mixture of Experts, in our experience, has however been a heroic exception, as it takes serving constraints into account from the very beginning of its design.

The invention of the attention-based transformer model [26] ushered in a new era of natural language based information retrieval systems, most notably BERT [5] and its successors. These have been applied in large scale industrial text search systems [21] to dramatically improve textual relevance. E-commerce search, however, has an additional emphasis on user conversion rate; the use of non-text features is therefore essential, and typically requires training from scratch with a custom dataset.

An orthogonal angle to our notion of query categories is presented in [25]. The authors classify queries into 5 generic categories based on user intent, while we try to match the experts with existing product categories.

In the content recommendation domain, we cite some recent exciting advancements exploiting users’ historical behavior sequence, most notably DIN [33], DIEN [20], and MIMN [22]. Here the entire user history and profile take on a similar role as the query. We do not focus particularly on user history in this work. This makes our method more generally applicable. Our only requirement is that each query be assigned a category id, preferably with hierarchical structure.

2.0.2. Mixture of Experts

The idea of Mixture of Experts (MoE) was first introduced in [12], and has been used in a considerable number of Machine Learning (ML) problems, such as pattern recognition and classification; see [31] for an early survey. The idea proposed in [12] is quite intuitive: a separate gate network (usually a single layer neural net) softly divides a complex problem into simpler sub-spaces, to be assigned to one or more expert networks. Different ML models have been used as expert network and gate network in MoE, such as neural network models [14], SVM [16], [2], and decision trees [1]. The above works share many similarities, namely that the complete model consists of many expert sub-models and one or a few gate models that control the activation of the experts. [6] introduces a deep mixture of experts by stacking multiple sets of gating and experts as a multi-layer MoE. The model capacity of a deep mixture of experts increases polynomially, through different combinations of experts in each layer.

A major milestone in language modeling applications appears in [24], which introduces a novel gate network, called Noisy Top-K Gating, to accommodate both large model capacity and efficient computation. Since the gate only activates a few experts (far fewer than the total number of experts) for each example, the model can increase capacity by adding a large number of experts. Furthermore, their model imposes a load balancing regularization constraint so that each expert sees roughly the same number of training examples per mini-batch, and gets a similar share of training opportunities. In this work, we extend the Noisy Top-K Gating as well as the load-balancing idea with a hierarchical soft constraint to help similar examples share the same experts. In addition, we also encourage experts to disagree among themselves, so that various experts specialize in different problem sub-spaces.

One major variant of [24] is the MMoE (Multi-gate Mixture-of-Experts) model [18], which combines multi-task learning with MoE, by allocating multiple independent gate networks, one for each task. This was successfully deployed in the Youtube recommendation system [32]. One of our baseline models attempts to replicate this by treating different groups of major product categories as different tasks, within a mini-batch.

Finally, we mention two recent applications of MoE. [29] embeds hundreds of shallow MoE networks, one for each layer of a convolutional neural net, as a way to reduce its computational cost while maintaining or improving generalization error. This is however mostly applicable in the computer vision domain.

In the video captioning domain, [28] builds an MoE model on top of an underlying recurrent neural net with attention mechanism, and achieves impressive wins in generating accurate captions for previously unseen video activities.

3. Category Inhomogeneity

We discuss variance of features across categories in our training data, as a motivation for dedicated expert tower combinations for different categories.

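Let $S$ be the ranked product list in a search session, and let $i, j$ denote items in $S$. Given a feature $f$, we define its feature-importance $FI(f)$ to be the ROC-AUC of the item ranking based on $f$, with respect to the user purchase label:

$$FI(f) = \frac{1}{M} \sum_{S} \frac{\sum_{i, j \in S} \mathbb{1}[f_i > f_j] \, \mathbb{1}[y_i = 1] \, \mathbb{1}[y_j = 0]}{|\{i \in S : y_i = 1\}| \cdot |\{j \in S : y_j = 0\}|}$$

Here $y_i = 1$ means item $i$ has been purchased, $y_j = 0$ means item $j$ has not been purchased, and $i, j$ range over item pairs in a single query session. The expression $|A|$ stands for the number of items in the set $A$. $M$ is the number of search sessions.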
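The feature-importance described above can be illustrated with a short Python sketch. It is a minimal illustration rather than the authors' implementation: the function names and the toy sessions are hypothetical, and two conventions are assumptions of this sketch, namely that ties count as half a pair and that sessions without both a purchased and an unpurchased item are skipped.

```python
def session_auc(feature_values, purchased):
    """ROC-AUC of ranking the items of one session by a single feature.

    Returns None when the session lacks a purchased or an unpurchased item,
    because no (purchased, unpurchased) pair exists in that case.
    """
    pos = [f for f, y in zip(feature_values, purchased) if y == 1]
    neg = [f for f, y in zip(feature_values, purchased) if y == 0]
    if not pos or not neg:
        return None
    # Count pairs in which the purchased item has the larger feature value.
    # Assumption of this sketch: a tie counts as half a pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def feature_importance(sessions):
    """Average the per-session AUC over all sessions that define one."""
    aucs = [session_auc(f, y) for f, y in sessions]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)


# Two toy sessions, each given as (feature values, purchase labels).
sessions = [
    ([0.9, 0.2, 0.5], [1, 0, 0]),  # purchased item has the largest value
    ([0.1, 0.8, 0.4], [1, 0, 0]),  # purchased item has the smallest value
]
print(feature_importance(sessions))  # 0.5
```

A value near 0.5 means the feature ranks purchased items no better than chance, while values near 1.0 mean the feature alone almost reproduces the purchase ordering.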

We analyze products in five different categories from our search logs: Clothing, Sports, Foods, Computer, and Electronics. Figure 2(a) shows the feature-importance of different features, including sales volume and good comments ratio, in different top-categories. Good comments ratio is likely more important in Clothing and Sports products than in Foods, because users pay more attention to bad comments to avoid defective products in the Clothing or Sports categories. In Foods, Computer, or Electronics, they may tend to buy more popular products with high sales volume. We also compute the feature-importance in sub-categories within the same top-category, namely Foods (Figure 2(b)). In contrast to the high variance of feature importance among top-categories, the intra-category feature-importances are more similar. Other top-categories have similar intra-category variance. This agrees with the intuition that users focus their attention on similar features when buying products from the same top-category.

(a) Inter-categories
(b) Intra-categories
Figure 2. Feature-importance in different categories. The X-axis corresponds to different categories, and different colors refer to different features.

In order to assess the variance of sparse features, we examine the relationship between sparse features and the sales volume in specific categories. We look at the proportion and absolute number of brands (a sparse feature in the ranking model) among the top 80% of products by sales volume ranking. In Figure 3, the X-axis corresponds to the categories, the left-hand Y-axis refers to the proportion of brands, and the right-hand Y-axis refers to the number of brands. The sales volume in Electronics is concentrated in the top brands, with the top 80% of sales falling within the top 2% of brands. This means that the top brands have a great influence when users decide whether to buy a product in Electronics. In contrast, the distribution of Sports brands is more dispersed, with the top 80% of product sales scattered across nearly 10% of brands. As with the feature-importance, we also compute the same statistic for sub-categories within the Foods top-category. The resulting intra-category variance is significantly smaller than the inter-category variance, as shown in Figure 3(b). Consistent with the observation regarding the feature-importance of numeric features, sparse features have wildly differing influences on the purchase decision across top-categories, but similar importance among sibling sub-categories.
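One way to read the brand statistic is as the smallest set of top-selling brands that accounts for 80% of sales in a category. The following Python sketch computes that number and proportion under this reading, which is an assumption of the sketch; the function name and the toy sales figures are hypothetical and are not taken from the search log.

```python
def brands_covering(sales_by_brand, share=0.8):
    """Return (number, proportion) of top brands needed to reach `share` of sales."""
    total = sum(sales_by_brand.values())
    covered, count = 0, 0
    # Walk through brands from the best seller downwards.
    for volume in sorted(sales_by_brand.values(), reverse=True):
        if covered >= share * total:
            break
        covered += volume
        count += 1
    return count, count / len(sales_by_brand)


# A concentrated category: one brand dominates.
concentrated = {"a": 85, "b": 5, "c": 5, "d": 5}
# A dispersed category: sales are spread evenly.
dispersed = {"a": 25, "b": 25, "c": 25, "d": 25}

print(brands_covering(concentrated))  # (1, 0.25)
print(brands_covering(dispersed))     # (4, 1.0)
```

A small proportion corresponds to a category like Electronics in the text, where a few brands dominate; a large proportion corresponds to a more dispersed category like Sports.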

(a) Inter-categories
(b) Intra-categories
Figure 3. The Proportion and number of brands in the top 80% of products by sale volume ranking in different categories. The X-axis refers to categories, the left-hand Y-axis corresponds to the Proportion of Brands, and the right-hand Y-axis corresponds to the Number of Brands.

The observations from the log data verify the intuition that features, whether numeric or sparse, have different importance for the purchase decision in different categories. This motivates us to develop a category-wise model to capture different ranking strategies on different categories in product search.

4. Model Description

Formally, let $(x, y)$ be an individual example in the training data set, where $x$ consists of product features, query/user features, and joint 2-sided features. Those features can be divided into two types: numeric features and sparse features. In practice, we employ an embedding procedure to transform sparse features into dense vectors. Finally, we concatenate the embedding vectors and normalized numeric features into one input vector for the ranking model:

$$x = [e_1, \ldots, e_m, \; v_1, \ldots, v_k] \tag{2}$$

Here, $e_i$ is the embedding vector for the $i$-th sparse feature, and $v_j$ is the normalized value of the $j$-th numeric feature. $D = m \cdot d + k$ is the input layer width, where $d$ is the embedding dimension for each sparse feature. $x \in \mathbb{R}^{D}$ is thus the input vector fed to the rest of the model.

$y \in \{0, 1\}$ indicates whether the user has purchased the product. Our goal is to learn a ranking model to evaluate the purchase probability of a product for a given query.
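The construction of the input vector can be sketched as follows. This is a minimal illustration with hypothetical names and toy values; in particular, the z-score normalization of numeric features is an assumption of the sketch, since the text only says the numeric features are normalized.

```python
def build_input(sparse_ids, numeric_values, embedding_tables, numeric_stats):
    """Concatenate sparse-feature embeddings with normalized numeric features."""
    vec = []
    # One embedding lookup per sparse feature (e.g. brand id, category id).
    for table, idx in zip(embedding_tables, sparse_ids):
        vec.extend(table[idx])
    # Assumption of this sketch: numeric features are z-score normalized.
    for value, (mean, std) in zip(numeric_values, numeric_stats):
        vec.append((value - mean) / std)
    return vec


# Two sparse features with embedding dimension 2, and two numeric features.
embedding_tables = [
    [[0.1, 0.2], [0.3, 0.4]],  # table of the first sparse feature (2 ids)
    [[0.5, 0.6]],              # table of the second sparse feature (1 id)
]
numeric_stats = [(8.0, 2.0), (3.0, 1.0)]  # (mean, std) per numeric feature

x = build_input([1, 0], [10.0, 3.0], embedding_tables, numeric_stats)
print(x)       # [0.3, 0.4, 0.5, 0.6, 1.0, 0.0]
print(len(x))  # 6 = 2 sparse features * dimension 2 + 2 numeric features
```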

Figure 4. The architecture of Adv & HSC-MoE.

4.1. Query Level product categorical Ids

While most products have well-defined category ids, it is often helpful in a search engine to assign categories to queries as well, for instance to quickly filter irrelevant products during the retrieval phase. Since such features are crucial as our gating input, we describe their calculation in some detail.

First, we sample about 100k unique queries from the search log. These queries are then annotated with appropriate categories by multiple human raters, and cross validated to ensure consistency. A bidirectional GRU model is then trained with a softmax output layer to predict the most likely product category a given input query belongs to. The set of product categories is also pre-determined and updated very infrequently.

Once the model predicts the sub-categories for a given query, the top-categories are determined automatically via the category hierarchy.

4.2. Basic Mixture of Experts Model

We follow closely the Mixture of Experts model proposed in [24], which appears to be its first successful application in a large scale neural network setting.

The latter consists of a set of $N$ expert towers, $E_1, \ldots, E_N$, with identical network structure and different parameters due to random initialization, and a top $K$ gate network. The output layer combines predictions from a subset of the experts selected by the gate network, weighted by their corresponding gate values. In this paper, we apply a Multi-Layer Perceptron (MLP) as the basic structure of an expert, due to its serving efficiency and ability to handle input data without any sequence structure. Let $G$ denote the gate network and $E_i$, $1 \le i \le N$, the set of expert towers respectively. The MoE model predicts according to the following recipe:

$$\hat{y} = \sum_{i \,:\, G_i(x_g) \ge \hat{G}_K(x_g)} G_i(x_g) \, E_i(x)$$

Here $x_g$ stands for the input feature (in dense form) to the gate network, and $\hat{G}_k(x_g)$ stands for the $k$-th largest gate value among $G_1(x_g), \ldots, G_N(x_g)$, that is, $\hat{G}_1(x_g) \ge \hat{G}_2(x_g) \ge \cdots \ge \hat{G}_N(x_g)$. The final prediction logit $\hat{y}$ is simply a weighted sum of the expert towers. Only the towers whose gate values are among the top $K$ will be computed during training and serving.
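The point that only the selected towers are evaluated can be made concrete with a small Python sketch. The experts here are hypothetical toy functions rather than MLPs, and the gate values are supplied by hand; the names `moe_logit` and `make_expert` are illustrative only.

```python
def moe_logit(x, gate_values, experts, k):
    """Gate-weighted sum over the k experts with the largest gate values."""
    ranked = sorted(range(len(gate_values)), key=lambda i: gate_values[i], reverse=True)
    top = ranked[:k]
    # Experts outside `top` are never called, which is where the saving comes from.
    logit = sum(gate_values[i] * experts[i](x) for i in top)
    return logit, top


calls = []  # records which toy experts were actually evaluated


def make_expert(index, weight):
    """Toy expert: a scaled sum of the input, standing in for an MLP tower."""
    def expert(x):
        calls.append(index)
        return weight * sum(x)
    return expert


experts = [make_expert(i, w) for i, w in enumerate([1.0, 2.0, 3.0, 4.0])]
logit, top = moe_logit([1.0, 1.0], [0.1, 0.6, 0.0, 0.3], experts, k=2)
print(top, calls)  # [1, 3] [1, 3]
```

With four experts and `k=2`, only experts 1 and 3 run, so the serving cost depends on `k` and the size of one tower rather than on the total number of experts.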

4.3. Hierarchical Soft Constraint Gate

The Hierarchical Soft Constraint Gate (HSC Gate) is an additional network, identical in structure to the base MoE gate network (MoE Gate) proposed in [24], namely a Noisy Top-K Gating. The HSC Gate takes the top-category (TC) Ids as input. Since the TC Ids are by design determined by the sub-category (SC) Ids, which the MoE Gate always takes as one of its inputs, they are omitted from the input of the MoE Gate.

As shown in Figure 4, TC and SC have a hierarchical relationship in a tree-based category system, where a TC is a parent node and its SCs are child nodes. To emphasize the functions of the two gates, we call them the inference MoE gate (marked green) and the constraint HSC gate (marked blue) respectively, as in the figure legend.

4.3.1. Inference MoE Gate

In our model, we feed the embedding vector of the sub-category Ids to the inference gate. The inference gate is designed as a parameterized indicator, whose output represents the weights of experts. We define the inference gate function as follows:

$$G^I(x_{sc}) = x_{sc} \cdot W^I$$

where $W^I$ is a trainable $d \times N$ dimensional weight matrix, $d$ and $N$ being the embedding dimension and number of experts respectively. $G^I_i(x_{sc})$ stands for the weight of the $i$-th expert. $x_{sc}$ is the SC embedding vector, a part of the input vector defined in (2).

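To save computation, we only keep the top $K$ values in $G^I(x_{sc})$ and set the rest to $-\infty$. Then we apply the softmax function to get the probability distribution over the top $K$ values as follows:

$$P^I_i(x_{sc}) = \begin{cases} \dfrac{\exp(G^I_i(x_{sc}))}{\sum_{j \in \mathcal{T}} \exp(G^I_j(x_{sc}))} & \text{if } i \in \mathcal{T}, \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{T}$ is the index set of the top $K$ values in $G^I(x_{sc})$. As a result, only the probabilities of the top $K$ values remain greater than $0$. A noise term is added to the original output of $G^I$ to ensure differentiability of the top $K$ operation, as detailed in [24].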
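The masking step can be sketched in plain Python. This is a simplified illustration of top-K gating that deliberately omits the noise term of [24]; the function name and the toy scores are hypothetical.

```python
import math


def top_k_softmax(scores, k):
    """Keep the k largest scores, set the rest to -inf, then apply softmax."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    masked = [s if i in top else -math.inf for i, s in enumerate(scores)]
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0,
    # so masked entries receive exactly zero probability.
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = top_k_softmax([2.0, 1.0, 0.5, 2.0], k=2)
print(probs)  # [0.5, 0.0, 0.0, 0.5]
```

Exactly `k` entries of the result are positive and they sum to one, so the downstream weighted sum only needs the corresponding experts.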

Like other MoE-based models, the output of our model can be written as in (8). Since the inference gate distribution $P^I(x_{sc})$ has only $K$ values greater than $0$, we only activate the corresponding experts to save computation. The computational complexity of the model depends on the network of a single expert and the value of $K$.

$$\hat{y} = \sum_{i=1}^{N} P^I_i(x_{sc}) \, E_i(x) \tag{8}$$
4.3.2. Constraint HSC Gate

In our model, the constraint gate and inference gate have the same structure. We denote the constraint gate by $G^C$. In contrast to the inference gate $G^I$, however, the input feature of $G^C$, denoted $x_{tc}$, is the embedding vector of TC, which has a hierarchical relationship with the SC embedding vector $x_{sc}$. As shown in Figure 4, the Hierarchical Soft Constraint (HSC) is defined between the inference gate and the constraint gate: it measures the discrepancy between $P^I$ and $P^C$, the probability distributions of the inference gate and constraint gate, over the index set corresponding to the top $K$ values in $G^I$.

By design, products from different sub-categories under the same top-category are a lot more similar than products from completely different top-categories. Therefore, it is intuitively helpful to share expert towers among sibling sub-categories. However, we do not know a priori which experts to assign to each sub-category. Indeed, the philosophy of MoE is to let the model figure it out by itself. On the other hand, we do not care about the exact experts assigned to each sub-category. The HSC gate thus seeks to preserve the hierarchy relationship between SC and TC, encouraging queries from sibling sub-categories to choose similar experts.

HSC is a part of the loss function in our model, helping the inference gate learn the hierarchical knowledge. The smaller the HSC, the easier it is to activate the same experts for similar categories.
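A small Python sketch can illustrate how such a constraint behaves. The exact distance is an assumption here: the sketch uses a squared difference between the two gate distributions, summed over the inference gate's top-K indices, which is one plausible choice and not necessarily the paper's formula. The function name and the toy distributions are hypothetical.

```python
def hsc(p_inference, p_constraint, k):
    """Squared difference between two gate distributions on the inference
    gate's top-k indices (an assumed form of the soft constraint)."""
    ranked = sorted(range(len(p_inference)), key=lambda i: p_inference[i], reverse=True)
    top = ranked[:k]
    return sum((p_inference[i] - p_constraint[i]) ** 2 for i in top)


# Constraint gate output for one top-category (4 experts, top 2 active).
p_tc = [0.5, 0.5, 0.0, 0.0]
# A sub-category whose inference gate picks the same experts as its parent.
p_sc_aligned = [0.6, 0.4, 0.0, 0.0]
# A sub-category whose inference gate picks entirely different experts.
p_sc_other = [0.0, 0.0, 0.5, 0.5]

print(hsc(p_sc_aligned, p_tc, k=2) < hsc(p_sc_other, p_tc, k=2))  # True
```

Minimizing a term of this kind pulls the gate outputs of sibling sub-categories towards the shared output of their parent, without prescribing which experts that parent should use.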

4.4. Adversarial Regularization

Ideally, different experts add different perspectives to the final ensemble score. In reality, however, experts tend to reach similar or identical conclusions in prediction tasks, especially if they see the same sequence of training data. To overcome this undesirable unanimity problem, we consider a regularization technique taking inspiration from Generative Adversarial Networks (GAN) [7][23]. More specifically, for each input example in the MoE network, some experts are left idle, due to their relatively low gating values. Intuitively, the model determines that their predictions are somewhat irrelevant to the ground truth. Out of these, we randomly sample a small set of adversarial experts, whose indices are denoted by $\mathcal{A}$, and subtract the L2 difference between their predictions and those of the top $K$ experts from the training loss. In other words, we reward those adversarial experts who predict differently from the top $K$ experts. As a mnemonic convention, these adversarial experts will also be called disagreeing experts. Note that $\mathcal{A}$ never overlaps with the top $K$ experts, since it is sampled from the idle ones.

We also observe that different examples have different sets of random disagreeing experts and top $K$ inference experts, making the implementation less than straightforward. We define the adversarial loss, AdvLoss, to measure the distance between these two disjoint expert sets for a single example, namely the L2 difference between the predictions of the disagreeing experts and those of the inference experts.

AdvLoss helps experts stay different from each other without directly interfering with the active expert predictions in the original MoE framework. The larger the AdvLoss, the greater the distance between disagreeing experts and inference experts.
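The sampling of disagreeing experts can be sketched as follows. Several details are assumptions of this sketch rather than the paper's definition: the top experts are summarized by their plain average prediction, the distance is a mean squared difference, and expert outputs are passed in as precomputed scalars. All names are hypothetical.

```python
import random


def adversarial_loss(expert_outputs, top_idx, num_adv, rng):
    """Distance between sampled idle experts and the active top experts."""
    # Disagreeing experts are drawn only from the experts left idle by the gate.
    idle = [i for i in range(len(expert_outputs)) if i not in top_idx]
    adv_idx = rng.sample(idle, num_adv)
    # Assumption: the top experts are summarized by their average prediction.
    top_mean = sum(expert_outputs[i] for i in top_idx) / len(top_idx)
    loss = sum((expert_outputs[j] - top_mean) ** 2 for j in adv_idx) / num_adv
    return loss, adv_idx


rng = random.Random(0)  # fixed seed so that the example is repeatable
outputs = [1.0, 3.0, 2.0, 5.0]  # toy predictions of four experts
loss, adv_idx = adversarial_loss(outputs, top_idx=[0, 1], num_adv=1, rng=rng)
print(adv_idx, loss)
```

Because the sampled set changes from example to example, a batched implementation would typically need per-example index masks; the scalar loop here is only meant to show the logic.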

4.5. Combined Training Loss

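Our best model combines both Hierarchical Soft Constraint and Adversarial Loss during training. The full objective function contains three parts: (1) Cross Entropy Loss $CE$ with respect to the user purchase binary label $y$, (2) $HSC$ between the inference gates and constraint gates, and (3) $AdvLoss$ between inference experts and disagreeing experts:

$$J = \frac{1}{n} \sum_{i=1}^{n} \left[ CE_i + \lambda_1 \, HSC_i - \lambda_2 \, AdvLoss_i \right] \tag{14}$$

Here $n$ is the number of examples, the subscript $i$ denotes the value of each term on the $i$-th example, and $\lambda_1$ and $\lambda_2$ control the relative importance of the corresponding terms, for which we perform grid search in powers of 10.

While the inference gating weights $W^I$ affect all three components of the training loss, the expert tower weights $\theta_{E_j}$, $1 \le j \le N$, do not affect the HSC regularization component. In other words, $\partial HSC / \partial \theta_{E_j} = 0$. Thus the gradient descent update formula simplifies slightly as follows:

$$\frac{\partial J}{\partial \theta_{E_j}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\partial CE_i}{\partial \theta_{E_j}} - \lambda_2 \frac{\partial AdvLoss_i}{\partial \theta_{E_j}} \right]$$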
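How the three parts combine can be shown with a short Python sketch. It assumes the per-example HSC and adversarial values are already computed, that the objective averages over examples, and that the adversarial term enters with a negative sign because it is subtracted from the training loss; the weight names `lambda_hsc` and `lambda_adv` and the toy batch are hypothetical.

```python
import math


def cross_entropy(label, logit):
    """Binary cross entropy of a purchase label against a prediction logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def combined_loss(batch, lambda_hsc, lambda_adv):
    """Average of CE + lambda_hsc * HSC - lambda_adv * AdvLoss over a batch."""
    total = 0.0
    for label, logit, hsc_value, adv_value in batch:
        total += cross_entropy(label, logit)
        total += lambda_hsc * hsc_value   # penalize gates that ignore the hierarchy
        total -= lambda_adv * adv_value   # reward experts that disagree
    return total / len(batch)


# Each example: (purchase label, prediction logit, HSC value, AdvLoss value).
batch = [(1, 0.0, 0.02, 9.0), (0, -2.0, 0.5, 0.0)]
loss = combined_loss(batch, lambda_hsc=0.1, lambda_adv=0.01)
print(loss)
```

Raising the adversarial value of an example lowers the objective, while raising its HSC value increases it, which matches the intended roles of the two regularizers.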


5. Experiments

In this section, we compare our improved MoE models with several MoE baselines empirically. We also report the improvements of various MoE-based models on test datasets with different categories in Section 5.3. Towards the end (Section 5.4), we study the impact of different hyper-parameters on our model.

5.1. Experiment Setup

5.1.1. Datasets

To verify the effectiveness of our proposed model in a real industrial setting, we experiment on an in-house dataset. We were not able to find a public dataset with good coverage of hierarchical category annotations that is large enough for neural net models to be effective.

We collect users’ purchase records from a leading e-commerce search system. Each example consists of product features (e.g., category, title, price), user features (e.g., age, behavior sequence), and query. In addition there are so-called 2-sided features that depend on both query/user and the product, for instance, the historical CTR of the product under the present query.

The category system has a hierarchical tree structure, with the parent nodes given by the top-categories (TC) and child nodes by the sub-categories (SC). Data statistics are presented in Table 1.

Statistics                             Training Set    Test Set
Data Size    Complete                  26,674,871      2,059,293
             Clothing (C)              755,659         24,588
             Books (B)                 1,520,243       75,218
             Mobile Phone (M)          1,344,726       73,549
Category     # of Top Categories       38              37
             # of Sub Categories       3,479           2,228
Query        # of queries              2,234,913       63,172
             # of query/item pairs     9,978,755       1,479,115
Table 1. Dataset statistics.

5.1.2. Evaluation Metrics

We use two evaluation metrics in our experiments: AUC (Area Under the ROC Curve) and NDCG (Normalized Discounted Cumulative Gain) [13]. Both metrics are computed on a per session basis and averaged over all sessions in the evaluation set. AUC intuitively measures the agreement between the model and user purchase actions, on pairs of items within a session. NDCG is a ranking metric that achieves a similar effect, except that it places more weight on pairs whose positions are close to the top of the search result page.
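For reference, a per-session NDCG can be sketched in a few lines of Python. The gain and discount used here, $(2^{rel} - 1) / \log_2(\text{position} + 1)$, are a common formulation and an assumption of this sketch, since the text does not spell out the exact variant; the function names are illustrative.

```python
import math


def dcg(labels):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels))


def ndcg(scores, labels, n=None):
    """NDCG of one session; `n` restricts the metric to the top n positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order][:n]
    ideal = sorted(labels, reverse=True)[:n]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


# One session: the purchased item is ranked second by the model.
scores = [0.9, 0.5, 0.1]
labels = [0, 1, 0]
print(ndcg(scores, labels))       # 1 / log2(3), roughly 0.63
print(ndcg(scores, labels, n=1))  # 0.0, the purchase is outside the top 1
```

The session-level values would then be averaged over all sessions, as described above for both metrics.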

5.1.3. Model Comparison

We compare 7 models in our experiments: DNN, MoE, MMoE with 4 experts (4-MMoE), MMoE with 10 experts (10-MMoE), Adversarial MoE (Adv-MoE), Hierarchical Soft Constraint MoE (HSC-MoE) and our model with both Adversarial experts and Hierarchical Soft Constraint (Adv & HSC-MoE). Specifically, the prediction tasks under different top categories (TC) are treated as the multiple tasks of the MMoE model in our experiments.

5.1.4. Parameter Settings

In our experiments, the DNN and a single expert tower have the same network structure, as well as the same embedding dimension, which is set to 16 for all sparse features. We use ReLU as the activation function for hidden layers and AdamW, proposed in [17], as the optimizer for all models. The same learning rate is used for all models, and the two regularization weights in objective function (14) are set to the same value.

For MMoE model, we divide categories into 10 buckets of roughly equal example counts in the training set, which are treated as 10 separate tasks in multi-task learning. Thus each training mini-batch is subdivided into 10 disjoint sub-minibatches, each with its own gating network.
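The text does not say how the buckets are formed. One simple possibility, shown below purely as an assumption, is a greedy assignment that places each category, largest first, into the currently lightest bucket. The function name and the toy counts are hypothetical.

```python
import heapq


def bucket_categories(counts, num_buckets):
    """Greedily assign categories to buckets with roughly equal example counts."""
    # Min-heap of (current load, bucket id); the lightest bucket is popped first.
    heap = [(0, b) for b in range(num_buckets)]
    heapq.heapify(heap)
    assignment = {}
    for category, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        load, bucket = heapq.heappop(heap)
        assignment[category] = bucket
        heapq.heappush(heap, (load + n, bucket))
    return assignment


# Toy example counts per category, split into 2 buckets instead of 10.
counts = {"A": 50, "B": 30, "C": 20, "D": 10, "E": 10}
assignment = bucket_categories(counts, num_buckets=2)
loads = [sum(n for c, n in counts.items() if assignment[c] == b) for b in range(2)]
print(loads)  # [60, 60]
```

Each bucket then plays the role of one task with its own gating network, which is why the MMoE structure has to be revisited whenever the category system or the sample volumes change.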

5.2. Full Evaluations

The evaluation results for different models are shown in Table 2. To be fair, we use the same settings for all MoE-based models (MoE, Adv-MoE, HSC-MoE, and Adv & HSC-MoE), including the hyper-parameters for the number of experts (10) and the number of experts activated per example (4). Adv-MoE and Adv & HSC-MoE add a disagreeing expert to calculate the adversarial loss. The computational complexity of 4-MMoE is approximately the same as that of the MoE-based models, since the MoE-based models activate 4 experts for each example. The model capacity (number of parameters) of 10-MMoE is approximately equal to that of the MoE-based models, as they have the same number of experts.

Compared with the DNN model, our model (Adv & HSC-MoE) achieves an absolute AUC gain of 0.96% (0.99% in terms of NDCG), which indicates its good generalization performance. We also make the following observations by carefully comparing the effects of different models.

  • All MoE-based networks improve upon DNN baseline model on all 3 metrics, including AUC, NDCG, and NDCG restricted to the top 10 shown positions. The original MoE model already brings 0.45% improvement on AUC and 0.53% improvement on NDCG over DNN.

  • Hierarchical Soft Constraint has a stable improvement. It brings 0.18% AUC (0.24% NDCG) gain for HSC-MoE over MoE, and 0.49% AUC (0.44% NDCG) gain for Adv & HSC-MoE over Adv-MoE.

  • Using adversarial loss during model training also improves the model’s generalization performance. This component brings an additional 0.02% or 0.33% improvement in AUC compared to the baseline MoE or HSC-MoE models respectively.

  • MMoE (4-MMoE and 10-MMoE) models perform close to our model as measured by AUC. However, our model is flexible enough to deal with various category tagging systems without extra data pre-processing. By contrast, the MMoE models require additional work to divide categories into different prediction tasks, which relies heavily on analytic experience. The model structure of MMoE also changes when the category system changes, for instance when categories are added or removed, or even when sample volume changes under some categories. In terms of NDCG metrics, the MMoE models trail behind our top candidate Adv & HSC-MoE by a wide margin.

Model            AUC       NDCG@10   NDCG
DNN              0.8131    0.5596    0.5960
MoE              0.8176    0.5656    0.6013
4-MMoE           0.8203    0.5635    0.5954
10-MMoE          0.8211    0.5644    0.5964
Adv-MoE          0.8178    0.5658    0.6015
HSC-MoE          0.8194    0.5685    0.6037
Adv & HSC-MoE    0.8227    0.5710    0.6059
Table 2. Performance of different models. A larger AUC means better performance. NDCG@N is computed with the top N items in the ranked list. A larger NDCG means better performance.

5.3. Performance on different categories

We test model performance on various categories to verify the benefit of our model for different categories of products. First, we evaluate the performance of our model in categories with different data sizes in the training dataset. Then, we use datasets from three categories to train and compare DNN and our model. Finally, we analyze the distribution of inference MoE gate values in all categories to investigate the relationship between experts and categories.

We group categories into buckets according to their data size in the training dataset. In Figure 5, the blue bars show the data size of each category bucket: bars on the left stand for small categories with little training data, and bars on the right for large categories. The left-hand Y-axis corresponds to the data size of the category buckets, and the right-hand Y-axis to the AUC improvement. Lines of different colors illustrate the AUC improvement of different models as the data size of the category bucket increases. All MoE-based models outperform the baseline model (DNN), as the AUC improvement is greater than 0 along every line. It is worth noting that our model (purple line) is more effective for small categories than for large ones, as shown by its decreasing trend from left to right.

The improvement in small categories is likely owing to the HSC. The HSC constrains the distribution of gate values for different categories, which in turn affects the choice of experts. Similar categories activate the same experts more easily, so small categories can more easily benefit from transfer learning through shared experts.

Figure 5. Model performance on different top categories. The X-axis corresponds to different category buckets. Left-hand Y-axis gives the combined data sizes of each category bucket, while the right-hand Y-axis shows AUC improvement with respect to the DNN baseline.
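
The bucketed comparison behind Figure 5 can be sketched as follows. This is only an illustration under stated assumptions: the records, the bucket edge, and the helper names `auc` and `bucket_of` are hypothetical, and examples are pooled per bucket before AUC is computed, which may differ from the exact procedure used in the experiments.

```python
from collections import defaultdict

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bucket_of(train_size, edges):
    """Index of the first bucket whose upper edge is >= train_size."""
    for i, edge in enumerate(edges):
        if train_size <= edge:
            return i
    return len(edges)

# Hypothetical per-category records:
# (training size, test labels, baseline DNN scores, MoE-based model scores)
records = [
    (500, [1, 0, 0, 1], [0.4, 0.6, 0.3, 0.7], [0.8, 0.2, 0.3, 0.7]),
    (90000, [1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4], [0.9, 0.1, 0.7, 0.3]),
]
edges = [1000]  # hypothetical boundary between the "small" and "large" bucket

# Pool the test examples of all categories that fall into the same bucket.
pooled = defaultdict(lambda: ([], [], []))
for size, labels, base, model in records:
    y, b, m = pooled[bucket_of(size, edges)]
    y.extend(labels)
    b.extend(base)
    m.extend(model)

# AUC improvement over the baseline, per bucket (the right-hand axis of the figure).
improvement = {k: auc(y, m) - auc(y, b) for k, (y, b, m) in sorted(pooled.items())}
print(improvement)
```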

Moreover, we collect separate training and testing datasets for three categories from our e-commerce search log: Mobile Phone (M), Books (B), and Clothing (C). Their data statistics are shown in Table 1. The Books and Mobile Phone datasets are large enough to train a good model, so the AUC for these two categories is consistently higher than that for Clothing. As shown in Table 3, we train four DNN models and one Adv & HSC-MoE model as follows:

  • M-DNN uses the training dataset of Mobile Phone category only to train a 3-layer Feed-Forward model.

  • B-DNN uses the training dataset of Books category only.

  • C-DNN uses the training dataset of Clothing category only.

  • Joint-DNN uses the joint dataset (M+B+C) to train a DNN.

  • Joint-Ours also uses the joint dataset (M+B+C) to train our best candidate model, namely HSC & Adv-MoE.

As shown in Table 3, we test these models on each of the three category test sets separately. Joint training is most beneficial for categories with less training data: compared with the separately trained DNNs, Joint-DNN improves AUC by 0.36% on Clothing, versus 0.25% on Books and -0.08% on Mobile Phone. Meanwhile, our method (Joint-Ours) outperforms both Joint-DNN and the separate DNNs in all categories, showing the advantage of our proposed method across categories. The pattern of AUC improvement is consistent with Figure 5, in that smaller categories (e.g. Clothing) gain more.

Model Train set Test AUC (M) Test AUC (B) Test AUC (C)
M-DNN M 0.8059 - -
B-DNN B - 0.8393 -
C-DNN C - - 0.7957
Joint-DNN M+B+C 0.8051 0.8418 0.7993
Joint-Ours M+B+C 0.8098 0.8422 0.8052
Table 3. Evaluations on different training and testing datasets. M, B, C are datasets on three categories respectively. M: Mobile Phone, B: Books, C: Clothing
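
The AUC gains quoted in the text can be re-derived from Table 3 with a few lines of arithmetic. The sketch below copies the AUC values from the table; the helper name `gain_points` is hypothetical, and gains are expressed as absolute differences in percentage points.

```python
# AUC values copied from Table 3, keyed by test category.
separate_dnn = {"M": 0.8059, "B": 0.8393, "C": 0.7957}  # M-DNN, B-DNN, C-DNN
joint_dnn = {"M": 0.8051, "B": 0.8418, "C": 0.7993}
joint_ours = {"M": 0.8098, "B": 0.8422, "C": 0.8052}

def gain_points(new, old):
    """Absolute AUC difference in percentage points, rounded to 2 decimals."""
    return {k: round((new[k] - old[k]) * 100, 2) for k in old}

# Joint training versus per-category training with the same DNN.
print(gain_points(joint_dnn, separate_dnn))
# Our model versus the DNN, both trained on the joint dataset.
print(gain_points(joint_ours, joint_dnn))
```

In both comparisons Clothing shows the largest gain of the three categories.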

To investigate the impact of HSC and adversarial loss, we analyze the distribution of inference MoE gate values in all categories. In our experiments, the inference MoE gate values for a given example form a vector with one entry per expert, where each entry stands for the probability that the corresponding expert should be activated. To illustrate the relationship between categories and activated experts, we project these gate vectors into two dimensions using t-SNE [19], which learns 2-dimensional points that approximate the pairwise similarities between output vectors of the gate network for a set of input examples. We group together semantically similar categories and assign a distinct color to each group, as shown in Table 4.

Figure 6 makes it clear that similar categories have much more similar gate vectors under Adv-MoE and Adv & HSC-MoE than under the vanilla MoE. In particular, semantically similar categories form much more structured clumps in Figure 6(c) and Figure 6(b) than the MoE baseline in Figure 6(a).

Moreover, between Figure 6(c) and Figure 6(b), the presence of HSC gates produces an even cleaner separation of clusters than adversarial regularization alone.

Semantic Class Color Representative Categories
Daily Necessities blue Foods, Kitchenware, Furniture
Electronics green Mobile Phone, Computer
Fashion red Clothing, Jewelry, Leather
Table 4. Coloring scheme of similar category grouping
Figure 6. Distribution of inference gate values in different models: (a) MoE, (b) Adv-MoE, (c) Adv & HSC-MoE. Points with the same color appear better clustered under our improved MoE models, indicating that similar categories are better able to share similar sets of experts.
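
The clustering effect is shown visually through t-SNE. As a complementary illustration, and not the procedure used in the paper, the same idea can be checked numerically by comparing distances between gate vectors inside a semantic group with distances across groups. The gate vectors below are hypothetical, and the helper name `mean_pairwise_distances` is an assumption made for this sketch.

```python
import math
from itertools import combinations

def mean_pairwise_distances(gate_vectors, groups):
    """Return (mean within-group, mean between-group) Euclidean distance."""
    within, between = [], []
    for i, j in combinations(range(len(gate_vectors)), 2):
        d = math.dist(gate_vectors[i], gate_vectors[j])
        (within if groups[i] == groups[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical gate outputs over 4 experts for five categories.
gate_vectors = [
    [0.70, 0.20, 0.05, 0.05],  # Mobile Phone
    [0.65, 0.25, 0.05, 0.05],  # Computer
    [0.05, 0.10, 0.60, 0.25],  # Clothing
    [0.05, 0.05, 0.65, 0.25],  # Jewelry
    [0.10, 0.05, 0.55, 0.30],  # Leather
]
groups = ["Electronics", "Electronics", "Fashion", "Fashion", "Fashion"]

within, between = mean_pairwise_distances(gate_vectors, groups)
print(f"within={within:.3f} between={between:.3f} ratio={within / between:.3f}")
```

A smaller within-to-between ratio corresponds to tighter, better separated clumps of the kind described for the improved models.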

5.4. Hyper-Parameter Study

We test different hyper-parameter settings in our model, including 1) the total number of experts, the number of top (chosen) experts, and the number of disagreeing (adversarial) experts; 2) different input features for the MoE gate network; 3) the weight multipliers for HSC and AdvLoss in the training objective function.

We set the total number of experts to 10, 16, or 32; the number of chosen experts to 2 or 4; and the number of adversarial experts to 1 or 2. Each setting is written as a triplet (total experts, chosen experts, adversarial experts). As presented in Figure 7, holding the other parameters fixed, increasing the number of chosen experts consistently improves model generalization. This is expected, since choosing more experts yields greater expert capacity per example. On the other hand, there is no monotonic pattern in the total number of experts or the number of adversarial experts, as evidenced by the pairs of triplets (16, 2, 2), (32, 2, 2) and (32, 4, 1), (32, 4, 2) respectively. A very large number of experts would dilute the amount of training data seen by each expert. Too many adversarial experts can also prevent the overall model from learning even the basic notions of relevance. Figure 7 reports the setting at which our combined candidate model (HSC & Adv-MoE) achieves the best test AUC on our search log dataset.

Figure 7. The HSC & Adv-MoE model under different (total experts, chosen experts, adversarial experts) hyper-parameter settings.
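
The search space described above can be enumerated as a small grid. This is a sketch only: the variable names are hypothetical, the full Cartesian product is shown even though the experiments may have covered just a subset of it, and the validity check assumes that adversarial experts are drawn from experts that were not chosen.

```python
from itertools import product

num_experts_options = [10, 16, 32]   # total number of experts
num_chosen_options = [2, 4]          # experts chosen per example
num_adversarial_options = [1, 2]     # adversarial (disagreeing) experts

settings = [
    (n, k, d)
    for n, k, d in product(num_experts_options, num_chosen_options, num_adversarial_options)
    if k + d <= n  # assumption: adversarial experts come from the unchosen ones
]

print(len(settings))
for triplet in settings:
    print(triplet)
```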

We also test different gate input features. As presented in Table 5, the model achieves the best performance when the gate uses sub-categories (SC) alone. Adding top-categories (TC), the query, user features, or even all features brings no benefit.

One possible explanation is that the additional features introduce noise that activates the “wrong” experts for specific categories. The model performs worst when we feed all features to the inference gate. Using all features, including some product-specific features, causes different products in the same query session to have different inference gate values. As a result, the variance between expert towers dominates the variance between intra-session products, leading to ranking noise. This result indicates that, in our model, the inference gate should be fed query-side features only, which guarantees a unique expert set and unique weight values within the same query session.

gate input feature AUC
SC 0.8212
(TC, SC) 0.8137
(query, TC, SC) 0.8135
(user feature, TC, SC) 0.8131
all features 0.8129
Table 5. Model performance with different gate input features (SC: sub-category, TC: top-category); the other hyper-parameters are held fixed.
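
The argument for query-side gate inputs can be illustrated with a toy gate. In this sketch the gate logits, the category keys, and the helper names `softmax`, `top_k`, and `query_side_gate` are all hypothetical; the point is only that a gate depending on the query sub-category alone returns the same expert set for every item in a session.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(weights, k):
    """Indices of the k largest gate weights, as a sorted tuple."""
    ranked = sorted(range(len(weights)), key=lambda i: -weights[i])
    return tuple(sorted(ranked[:k]))

# Hypothetical gate logits over 4 experts, keyed by query sub-category.
gate_logits = {
    "necklace": [2.0, 0.1, 1.5, -1.0],
    "phone": [-0.5, 2.2, 0.0, 1.9],
}

def query_side_gate(query_sc):
    """Gate output that depends on the query sub-category only."""
    return softmax(gate_logits[query_sc])

session = {"query_sc": "necklace", "items": ["item_a", "item_b", "item_c"]}
experts_per_item = {
    item: top_k(query_side_gate(session["query_sc"]), k=2)
    for item in session["items"]
}
print(experts_per_item)
```

Because the item never enters the gate, score differences inside the session come from the item features seen by a fixed set of experts rather than from switching experts between items.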

The two weight multipliers in the objective function (14) control the relative importance of HSC and AdvLoss in the overall training loss. We sweep both multipliers over the values 1e-1, 1e-2, and 1e-3. As shown in Table 6, our model achieves the best performance when both multipliers are set to 1e-3.

1st-multiplier 2nd-multiplier AUC
1e-1 1e-1 0.8167
1e-1 1e-2 0.8217
1e-1 1e-3 0.8221
1e-2 1e-1 0.8140
1e-2 1e-2 0.8212
1e-2 1e-3 0.8168
1e-3 1e-1 0.8167
1e-3 1e-2 0.8172
1e-3 1e-3 0.8227
Table 6. Experiments with different combinations of the two weight multipliers for HSC and AdvLoss; the last column is AUC. The other hyper-parameters are held fixed.
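
Selecting the best setting from the sweep in Table 6 amounts to an argmax over the grid. The sketch below copies the AUC values from the table; the keys are simply the first and second multiplier in table order, since this illustration makes no assumption about which column belongs to which loss term.

```python
# (first multiplier, second multiplier) -> test AUC, copied from Table 6.
sweep = {
    (1e-1, 1e-1): 0.8167, (1e-1, 1e-2): 0.8217, (1e-1, 1e-3): 0.8221,
    (1e-2, 1e-1): 0.8140, (1e-2, 1e-2): 0.8212, (1e-2, 1e-3): 0.8168,
    (1e-3, 1e-1): 0.8167, (1e-3, 1e-2): 0.8172, (1e-3, 1e-3): 0.8227,
}

best = max(sweep, key=sweep.get)
worst = min(sweep, key=sweep.get)
spread = round(sweep[best] - sweep[worst], 4)

print("best setting:", best, "AUC:", sweep[best])
print("worst setting:", worst, "AUC:", sweep[worst])
print("AUC spread across the grid:", spread)
```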

5.5. Case Studies

To illustrate the qualitative improvements brought by adversarial loss and HSC, we look at the predicted scores of example query-item pairs under MoE and Adv & HSC-MoE. As shown in Table 7, we use the query “Necklace thick rope, long” and 3 items from the ranked results. The label in Table 7 indicates whether the corresponding item was purchased; thus only the first item was purchased by the user. Figure 8 presents the outputs of the 10 experts in MoE and in our model for each example. Both models activate 4 experts, marked as red bars in all sub-figures. The X-axis represents the expert ids, ranging from 0 to 9; the Y-axis gives the scores of individual experts. For the third example, which is a negative sample, the outputs of all MoE experts are positive, leading to an erroneous prediction. Although there are two positive experts in our model (Figure 8(f)), the negative experts 0 and 4 effectively dominate the final weighted average score.

Query: necklace thick rope, long
Title Label Predicted score (MoE) Predicted score (Our Model)
fashion necklace 1 0.5648139 0.630661
four leaf clover necklace 0 0.500628 0.356902
necklace female 0 0.636468 0.463494
Table 7. Predicted scores of MoE and Adv & HSC-MoE (our model) for three items returned for one query.
Figure 8. Scores of individual experts for baseline MoE and the Adv + HSC improved version. Panels (a), (c), and (e) show the MoE scores for the first, second, and third example; panels (b), (d), and (f) show the scores of our model for the same examples. The first example is positive while the last two are negative. Red bars indicate the experts selected by each model.
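
How negative experts can dominate the final score can be seen in a toy calculation. This is a sketch under stated assumptions: the expert outputs and gate weights are hypothetical, and combining them as a sigmoid of the gate-weighted average of the selected experts is an assumed form, not necessarily the exact combination used in the model. The function name `moe_score` is also hypothetical.

```python
import math

def moe_score(expert_outputs, gate_weights, selected):
    """Sigmoid of the gate-weighted average of the selected experts' outputs."""
    total_weight = sum(gate_weights[i] for i in selected)
    logit = sum(gate_weights[i] * expert_outputs[i] for i in selected) / total_weight
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical outputs of 10 experts for one negative example.
expert_outputs = [-1.2, 0.3, 0.1, 0.2, -0.9, 0.4, 0.0, 0.1, 0.2, 0.3]
# Hypothetical gate weights; only the four selected experts are non-zero.
gate_weights = [0.35, 0.15, 0.0, 0.0, 0.30, 0.20, 0.0, 0.0, 0.0, 0.0]
selected = [0, 1, 4, 5]  # experts 0 and 4 are negative, 1 and 5 are positive

score = moe_score(expert_outputs, gate_weights, selected)
print(round(score, 4))
```

With two strongly negative experts carrying most of the gate weight, the combined score falls below 0.5 even though the other two selected experts are positive.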

6. Conclusions and Future Work

The adversarial regularization and hierarchical soft constraint techniques presented here are promising steps towards developing a category-aware ranking model for product search. Together they achieve significant improvement on an industry-scale dataset, mainly owing to two advantages: 1) Small sub-categories under the same top-category are able to share similar experts, thereby overcoming parameter sparsity under limited training data. 2) Adversarial regularization encourages the experts to “think independently” and approach each problem from a diversity of angles.

A few directions remain to be explored. First, we did not use product side category id as input to the gate network, since the result actually deteriorates compared to using only query-side category information. One explanation is that adding item side gating input causes different items to go through different experts under the same query, leading to large prediction variance. Product side categories are typically more accurate, however, and we plan to incorporate them by exploring more factorized architectures, with multi-phased MoE.

The 2-level category hierarchy considered here can be viewed as the simplest form of knowledge graph. An interesting generalization is to apply the soft constraint technique to more general human- or model-annotated knowledge graphs, provided the latter have enough coverage of the training corpus.

Another important source of features in e-commerce search deals with personalization, in particular user historical behavior sequence. While our techniques do not require such structured data, it is natural to apply different experts to different events in the user history, and hopefully focus more on the relevant historical events.

Lastly, it is desirable to fine-tune individual expert models to suit evolving business requirements or training data. It would therefore be interesting to assess the transfer learning potential of the component expert models.


References

  • [1] E. Arnaud, A. Dapogny, and K. Bailly (2019) Tree-gated deep mixture-of-experts for pose-robust face alignment. IEEE Transactions on Biometrics, Behavior, and Identity Science.
  • [2] L. Cao (2003) Support vector machines experts for time series forecasting. Neurocomputing 51, pp. 321–339.
  • [3] S. Chakraborty (2007) Learning to rank using mixture of experts and matching loss functions.
  • [4] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. D. Chandra, H. Aradhye, G. Anderson, G. S. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. pp. 7–10.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [6] D. Eigen, M. Ranzato, and I. Sutskever (2013) Learning factored representations in a deep mixture of experts.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. pp. 2672–2680.
  • [8] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, C. Sierra (Ed.), pp. 1725–1731.
  • [9] M. Haldar, M. Abdool, P. Ramanathan, T. Xu, S. Yang, H. Duan, Q. Zhang, N. Barrow-Williams, B. C. Turnbull, B. M. Collins, et al. (2019) Applying deep learning to Airbnb search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1927–1935.
  • [10] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu (2018) Reinforcement learning to rank in e-commerce search engine: formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 368–377.
  • [11] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2333–2338.
  • [12] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87.
  • [13] K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446.
  • [14] M. I. Jordan and R. A. Jacobs (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6 (2), pp. 181–214.
  • [15] R. Li, Y. Jiang, W. Yang, G. Tang, S. Wang, C. Ma, W. He, X. Xiong, Y. Xiao, and E. Y. Zhao (2019) From semantic retrieval to pairwise ranking: applying deep learning in e-commerce search. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1383–1384.
  • [16] C. A. M. Lima, A. L. V. Coelho, and F. J. Von Zuben (2007) Hybridizing mixtures of experts with support vector machines: investigation into nonlinear dynamic systems identification. Information Sciences 177 (10), pp. 2049–2074.
  • [17] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • [18] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi (2018) Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. pp. 1930–1939.
  • [19] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  • [20] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, et al. (2019) Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091.
  • [21] P. Nayak (2019) Understanding searches better than ever before.
  • [22] Q. Pi, W. Bian, G. Zhou, X. Zhu, and K. Gai (2019) Practice on long sequential user behavior modeling for click-through rate prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2671–2679.
  • [23] M. Qiu, B. Wang, C. Chen, X. Zeng, J. Huang, D. Cai, J. Zhou, and F. S. Bao (2019) Cross-domain attention network with Wasserstein regularizers for e-commerce search. pp. 2509–2515.
  • [24] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
  • [25] P. Sondhi, M. Sharma, P. Kolari, and C. Zhai (2018) A taxonomy of queries for e-commerce search. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1245–1248.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • [27] R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. CoRR abs/1708.05123.
  • [28] X. Wang, J. Wu, D. Zhang, Y. Su, and W. Y. Wang (2019) Learning to compose topic-aware mixture of experts for zero-shot video captioning.
  • [29] X. Wang, F. Yu, L. Dunlap, Y. Ma, R. Wang, A. Mirhoseini, T. Darrell, and J. E. Gonzalez (2018) Deep mixture of experts via shallow embedding. arXiv preprint arXiv:1806.01531.
  • [30] C. Wu, M. Yan, and L. Si (2017) Ensemble methods for personalized e-commerce search challenge at CIKM Cup 2016. arXiv preprint arXiv:1708.04479.
  • [31] S. E. Yuksel, J. N. Wilson, and P. D. Gader (2012) Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23 (8), pp. 1177–1193.
  • [32] Z. Zhao, L. Hong, L. Wei, J. Chen, A. Nath, S. Andrews, A. Kumthekar, M. Sathiamoorthy, X. Yi, and E. H. Chi (2019) Recommending what video to watch next: a multitask ranking system. pp. 43–51.
  • [33] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068.