QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction

We study the problem of query attribute value extraction, which aims to identify named entities from user queries as diverse surface form attribute values and then transform them into canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase while neglecting the equally important AVN. To bridge this gap, this paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO, which covers both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance at lower supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network that is trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervision from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale E-commerce dataset demonstrate the effectiveness of QUEACO.





1. Introduction

Query attribute value extraction is the joint task of detecting named entities in the search queries as the diverse surface form attribute values and normalizing them into a canonical form to avoid misspelling and abbreviation problems. These two sub-tasks are typically called named entity recognition (NER) (Chiu and Nichols, 2016) and attribute value normalization (AVN)  (Putthividhya and Hu, 2011).

Figure 1. The ideal product attribute extraction pipeline.
Case# | Query & Ground-truth Labels | Clicked Product Attribute Values | Weak Labels
1 | [lg][smart tv][32] | lg, 32-inch, television | [lg] smart tv 32
2 | [womans][socks] | women, socks | womans [socks]
3 | [braun][7 series][shaver] | braun, series 7, electric shaver | [braun] 7 series shaver
4 | [trixie][cat litter tray bags][46 x 59][10 pack] | Trixie, waste bag, 46 × 59 cm | [trixie] cat litter tray bags 46 x 59 10 pack
Table 1. Ground-truth labels and noisy weak labels for query NER examples based on the behavior data from the product side. We use colors to denote the entity type and brackets to indicate the entity boundary. Entity labels: Brand, ProductLine, Size, ProductType, Audience.

As shown in Figure 1, we illustrate the ideal query attribute value extraction process. When a user enters the query “MK tote for womans”, we first use a NER model to identify the entity type “brand” for “MK”, “product type” for “tote”, and “audience” for “womans”. These extracted named entities are informal surface forms of attribute values. However, such informal surface forms are inconsistent with the canonical form attribute values, written in a formal style, that index the products. Specifically, “MK” is an abbreviation of the brand “Michael Kors”, “tote” is a hyponym of the product type “handbag”, and “womans” contains a spelling error. This misalignment makes it hard for the product search engine to retrieve the relevant product items that users really prefer. Therefore, the AVN module is equally important for transforming the surface form of each attribute value into its canonical form, i.e., “MK” to “Michael Kors”, “tote” to “handbag”, and “womans” to “women”. In the E-commerce domain, extracting these attribute values from queries is critical to a wide variety of product search applications, such as product retrieval (Cheng et al., 2020), ranking (Wen et al., 2019), and query rewriting (Guisado-Gámez et al., 2016).

Unfortunately, existing works only focus on surface form attribute value extraction based on NER while ignoring the canonical form transformation, which is impractical in realistic scenarios (Kozareva et al., 2016; Cheng et al., 2020; Wen et al., 2019; Cowan et al., 2015). To bridge this gap, this paper proposes a unified query attribute value extraction system that involves both phases. By borrowing treasures from large-scale weakly-labeled behavior data to mitigate the supervision cost, we further improve the extraction performance.

Considering the first NER stage, recent advances in deep learning models (e.g., Bi-LSTM+CRF) have achieved promising results (Huang et al., 2015; Raganato et al., 2017). However, they rely heavily on massive labeled data, where manual token-level labeling is particularly costly and labor-intensive. To alleviate this issue in E-commerce, prior studies (Kozareva et al., 2016; Cheng et al., 2020; Wen et al., 2019) resort to leveraging large-scale behavior data from the product side as weak supervision for queries, based on simple string match strategies. Nonetheless, these weakly-supervised labels contain enormous noise due to partial or incomplete token labels produced by exact string matching. For example, as shown in Table 1 case #1, when we use the attribute values of the top-clicked product “LG 32-inch television”, i.e., “brand” for “LG”, “size” for “32-inch”, and “product type” for “television”, as weak supervision to match the query “lg smart tv 32”, it can only generate the label “brand” for “lg”, concealing useful knowledge for the unannotated tokens. For this reason, weak supervision-based methods (Shang et al., 2018b; Cheng et al., 2020) usually perform very poorly, and even worse once powerful pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019), are introduced, since PLMs fit noise much more easily. To address this issue, we consider a more reliable regime, which further includes some strongly-labeled human-annotated data to denoise the weak labels from the distant supervision. As such, the NER model can be improved by making more effective use of both the large-scale weakly-labeled behavior data and the strongly-labeled human-annotated data.

As for the second AVN phase, customers tend to use diverse surface forms to mention each attribute value in search queries due to misspellings, spelling variants, or abbreviations. This circumstance occurs frequently in user queries and product titles in e-commerce. For example, eBay has noted that 20% of product titles in the clothing and shoes category involve such surface form brand values (Putthividhya and Hu, 2011). Thus, normalizing these surface form attribute values derived from the NER signals to a single normalized attribute value is critical, yet it is usually ignored by existing works (Kozareva et al., 2016; Cheng et al., 2020; Wen et al., 2019; Cowan et al., 2015). To mitigate human annotation efforts, weakly-labeled behavior data can also contribute to AVN. For example, “MK tote for womans”, which mentions the brand “MK”, leads to clicks on product items associated with the brand “Michael Kors”. We can reasonably infer a strong connection between the surface form value “MK” and the canonical form value “Michael Kors” if this association occurs across many queries.

Motivated by these observations, we propose a unified QUEry Attribute Value Extraction in ECOmmerce (QUEACO) framework that efficiently utilizes the large-scale weakly-labeled behavior data for both the query NER and AVN. For query NER, QUEACO leverages the strongly-labeled data to denoise the weakly-labeled data based on a novel teacher-student network, where a teacher network trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for teaching a student network. Unlike classic teacher-student networks that can only produce pseudo-labels from a fixed teacher, our pseudo-labeling process is continuously and dynamically adapted by the feedback of the student’s performance on the strongly-labeled data. This encourages the teacher network to generate better pseudo-labels to teach the student, maximally mitigating the error propagation from the noisy weak labels. For query AVN, we utilize the weakly-labeled query-to-attribute behavior data and QUEACO NER predictions to model the associations between the surface form and canonical form attribute values. As such, the surface form attribute values from queries can be normalized to the most relevant canonical form attribute values from the products. Empirically, extensive experiments on a real-world large-scale E-commerce dataset demonstrate that QUEACO NER significantly outperforms state-of-the-art semi-supervised and weakly-supervised methods. Moreover, we qualitatively show the effectiveness and the necessity of QUEACO AVN.

Our contributions can be summarized as follows: (1) To the best of our knowledge, our work is the first attempt to propose a unified query attribute value extraction system in E-commerce, involving both query NER and AVN. QUEACO can automatically identify product-related attributes from user queries and transform them into canonical forms by leveraging weak supervision from large-scale behavior data; (2) Our QUEACO NER is also the first work that efficiently utilizes both human-annotated strongly-labeled data and large-scale weakly-labeled data from the query-product click graph. Moreover, the proposed QUEACO NER model significantly outperforms the existing state-of-the-art baselines; (3) We propose the QUEACO AVN module that uses aggregated query-to-attribute behavioral data to build connections among queries, surface form attribute values, and canonical form values. The proposed QUEACO AVN module can effectively normalize surface form values with spelling errors, spelling variants, and abbreviation problems.

2. Preliminaries

In this section, we introduce some preliminaries before detailing the proposed QUEACO framework, including the problem formulation and the query NER base model.

2.1. Problem Formulation

2.1.1. QUEACO Named Entity Recognition

We first introduce the task definition for QUEACO NER.

NER. Given a user input query x = (x_1, ..., x_n) with n tokens, the goal of NER is to predict a tag sequence y = (y_1, ..., y_n). We use the BIO (Li et al., 2012) tagging strategy. Specifically, the first token of an entity mention with entity type X ∈ T (where T is the entity type set) is labeled as B-X; the remaining tokens inside that entity mention are labeled as I-X; and the non-entity tokens are labeled as O.
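The BIO conversion described above can be sketched as follows (a minimal illustration; the span format `(start, end, type)` with an exclusive end index is an assumption for this sketch, not the paper's notation):

```python
def spans_to_bio(num_tokens, spans):
    """Convert entity spans [(start, end, type), ...] into a BIO tag sequence."""
    tags = ["O"] * num_tokens  # non-entity tokens default to O
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the mention
    return tags

# Query "lg smart tv 32" -> [lg]_Brand [smart tv]_ProductLine [32]_Size (Table 1, case #1)
print(spans_to_bio(4, [(0, 1, "Brand"), (1, 3, "ProductLine"), (3, 4, "Size")]))
# -> ['B-Brand', 'B-ProductLine', 'I-ProductLine', 'B-Size']
```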

Strongly-Labeled and Large Weakly-Labeled Setting. For our query NER, we have two types of data: 1) strongly-labeled data D_s = {(x_i, y_i)}_{i=1}^{N_s}, which is manually annotated by human annotators; 2) large-scale weakly-labeled data D_w = {(x_i, y_i^w)}_{i=1}^{N_w}, where N_w >> N_s. The goal is to borrow treasures from the large-scale noisy weakly-labeled data to further enhance a supervised NER model trained on the strongly-labeled data.

2.1.2. QUEACO Attribute Value Normalization

For each query x = (x_1, ..., x_n) with n tokens, QUEACO NER predicts a tag sequence y = (y_1, ..., y_n). Given an entity type t (e.g., brand) and the NER prediction y, we can extract the query term s_t as the surface form attribute value for the entity type t. Assume that we have a diverse set of canonical form product attribute values V_t = {v_1, ..., v_m} for the entity type t. For each canonical form attribute value v_j, we can define its relevance given the query q as

    rel(v_j | q) = Σ_{p ∈ P} c(q, p) · 1[p is indexed with v_j for type t],

where c(q, p) is the total number of clicks on product p across the searches using query q in a period of time, such as one month, P is the set of all products, and 1[·] indicates whether the product p is indexed with the value v_j for the entity type t. In a nutshell, we quantify the query-attribute relevance using the query-product relevance and the product-attribute membership. The query-product relevance is measured by the number of clicks in the query logs, which can be viewed as implicit feedback from customers. Finally, we can get the most relevant attribute value of the entity type t, v* = argmax_{v_j ∈ V_t} rel(v_j | q), as the normalized canonical form for the surface form attribute value s_t.
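The click-weighted relevance and argmax normalization can be sketched as below. This is an illustrative toy, not the production pipeline; all function names, products, and click counts are hypothetical:

```python
from collections import defaultdict

def relevance(click_log, product_attrs, query, etype):
    """Score each canonical value of `etype` for `query` by summing clicks
    on products indexed with that value (query-product relevance times
    product-attribute membership)."""
    scores = defaultdict(int)
    for (q, product), clicks in click_log.items():
        if q != query:
            continue
        value = product_attrs.get(product, {}).get(etype)
        if value is not None:            # membership indicator 1[...]
            scores[value] += clicks      # click-weighted sum
    return dict(scores)

def normalize(click_log, product_attrs, query, etype):
    """Return the most relevant canonical value (the argmax), or None."""
    scores = relevance(click_log, product_attrs, query, etype)
    return max(scores, key=scores.get) if scores else None

# Toy example: clicks from "mk tote for womans" land mostly on Michael Kors bags.
log = {("mk tote for womans", "p1"): 40, ("mk tote for womans", "p2"): 3}
attrs = {"p1": {"brand": "Michael Kors"}, "p2": {"brand": "Mango"}}
print(normalize(log, attrs, "mk tote for womans", "brand"))  # -> Michael Kors
```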

2.2. Query NER Base Model

The recent emergence of pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) has achieved superior performance on a variety of public NER datasets. However, existing query NER works (Cheng et al., 2020; Wen et al., 2019; Cowan et al., 2015; Kozareva et al., 2016) still rely on shallower neural models (e.g., BiLSTM-CRF) and are not equipped with powerful PLMs.

Why are PLMs not deployed in existing query NER works? Due to labeled data scarcity in user queries, previous query NER works can only rely on noisy distant supervision data for model training. In such a condition, using powerful multilingual PLMs (mPLMs) as the encoder performs even worse than a shallow Bi-LSTM for query NER (Cheng et al., 2020). Liang et al. (2020) found that PLM-based NER models more easily overfit the noise from distant labels and forget the general knowledge from the pre-training stage. On the other hand, distant supervision-based methods for NER (Shang et al., 2018b; Cheng et al., 2020) usually underperform and cannot meet the high performance requirements of query NER for downstream product search applications like retrieval and ranking. To tackle this issue, we target a different query NER setting, which leverages some strongly-labeled human-annotated data to train a more reliable PLM-based NER model and uses the weakly-labeled data from distant supervision to further improve the model performance. To meet the strict latency constraint, we choose DistilmBERT (Sanh et al., 2019) as the base NER model and do not add the CRF layer.

Figure 2. An overview of the proposed framework QUEACO, showing how weakly-labeled behavior data contributes to the two inter-dependent stages of QUEACO. Entity labels: Brand, ProductLine, Size, ProductType, namedPersonGroup, Color, Audience.

3. Queaco

In this section, we first give an overview of how weakly-labeled behavior data contributes to both the query NER and AVN, and then detail the two components of QUEACO, respectively.

3.1. Overview

Figure 2 shows an overview of QUEACO. At a high level, QUEACO leverages weakly-labeled behavior data for both the query NER and AVN. For QUEACO NER, we have the strongly-labeled data and the large-scale weakly-labeled data for training. Specifically, the QUEACO NER has two stages: the weak supervision pretraining stage and the finetuning stage. 1) In the pretraining stage, we adopt a novel teacher-student network where the teacher network is dynamically adapted based on the feedback from the student network. The goal is to encourage the teacher network to generate better pseudo labels to refine the weakly-labeled data for improving the student network’s performance. 2) After the pretraining stage, we continue to finetune the student network on the strongly-labeled data as the final model. For QUEACO AVN, we extract the surface form attribute values based on the NER predictions and leverage the weakly-labeled query-to-attribute behavior data to transform them into the canonical forms.

3.2. QUEACO Named Entity Recognition

3.2.1. Model architecture

Teacher-Student Network. Before introducing the QUEACO NER model, we give some preliminaries on the teacher-student network of self-training (Lee and others, 2013; Yarowsky, 1995). Self-training stands out among semi-supervised learning approaches: a teacher model produces pseudo-labels for unlabeled samples, and a student model learns from these samples with the generated pseudo-labels. We give the mathematical formulation of self-training in the context of NER. Let f_T and f_S respectively be the teacher and student network, parameterized by θ_T and θ_S. We use f_T(x_i; θ_T) and f_S(x_i; θ_S) to denote the NER predictions on query x_i for the teacher and student, respectively. f_T(x_i; θ_T) can be either soft or converted to hard pseudo labels. The knowledge transfer is then usually achieved by minimizing the consistency loss between the two predicted distributions from the teacher and the student:

    L_c = Σ_i ℓ( f_T(x_i; θ_T), f_S(x_i; θ_S) ).

Pseudo & Weak Label Refinement. Weakly-labeled data suffers from severe incompleteness: the overall span recall is usually very low. Therefore, it is natural to use self-training to annotate the missing labels of the weakly-labeled data. The pseudo labels make up for the missing tags in the weak labels, and meanwhile the weak labels provide high-precision tags to constrain the pseudo labels.

For each weakly-labeled sample (x_i, y_i^w), we convert the soft predictions of the teacher network into hard pseudo labels, i.e., ŷ_{i,j} = argmax_c f_T(x_i; θ_T)_{j,c}. Additionally, we have weak labels that partially annotate the samples, which can be used to further refine the pseudo labels: we maintain the weak labels of the entity tokens and replace the weak labels of the non-entity tokens with the pseudo labels. The refined pseudo labels are then generated by:

    ỹ_{i,j} = y_{i,j}^w if y_{i,j}^w ≠ O, otherwise ŷ_{i,j}.
QUEACO Teacher-Student Network. Prior teacher-student frameworks of self-training rely on rigid teaching strategies, which may hardly produce high-quality pseudo-labels for consecutive and interdependent tokens. This results in progressive drift on the noisy pseudo-labeled data provided by the teacher (a.k.a. the confirmation bias (Arazo et al., 2020)). In QUEACO NER, we propose a novel teacher-student network, where the teacher can be dynamically adapted by the student’s feedback to adjust its pseudo-labeling strategies, inspired by Pham et al. (2021). The student’s feedback is defined as the student’s performance on the strongly-labeled data. Formally, we can formulate our teacher-student network as a bi-level optimization problem:

    min_{θ_T}  L_s(θ_S*(θ_T)) = E_{(x, y) ∈ D_s} [ CE( y, f_S(x; θ_S*(θ_T)) ) ]
    s.t.  θ_S*(θ_T) = argmin_{θ_S} E_{x ∈ D_w} [ CE( ỹ, f_S(x; θ_S) ) ],

where CE is the cross-entropy loss. The ultimate goal is to minimize the loss of the student on the strongly-labeled data after it learns from the refined pseudo labels ỹ, i.e., L_s(θ_S*(θ_T)), which is a function of the teacher’s parameters θ_T; f_S(x; θ_S) denotes the prediction logits of the student network on the weakly-labeled sample x. By optimizing the teacher’s parameters in light of the student’s performance on the strongly-labeled data, the teacher can be adapted to generate better pseudo labels that further improve the student’s performance. This bi-level optimization problem is extremely complicated, but we can approximate the multi-step argmin with a one-step gradient update of θ_S. Plugging this into the constrained optimization problem leads to an unconstrained optimization for the teacher network learning, which gives rise to an alternating optimization procedure between the student and the teacher updates.

3.2.2. Model Training

Student Network. The student network is trained on the refined pseudo-labeled data in order to move closer to the teacher:

    θ_S' = θ_S − η_S ∇_{θ_S} CE( ỹ, f_S(x; θ_S) ).

We update the student network parameters θ_S with one step of gradient descent. In our proposed framework, the feedback signal from the student network to the teacher network is the student’s performance on the strongly-labeled data. We use the student’s loss on the strongly-labeled data to measure this performance before the update (L_s(θ_S)) and after the update (L_s(θ_S'), after learning on the refined pseudo-labeled data).

The difference between them, i.e., h = L_s(θ_S') − L_s(θ_S), can be used as feedback to meta-optimize the teacher network towards the direction that generates better pseudo labels. If the currently generated pseudo labels further boost the student network, then h will be negative, and positive otherwise.
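The sign semantics of this feedback signal can be illustrated with a toy one-parameter student (purely illustrative; the quadratic loss and the numbers below stand in for the real model and are not the paper's formulation):

```python
def student_loss(theta):
    # Toy stand-in for the student's loss on the strongly-labeled data,
    # minimized at theta = 2.0.
    return (theta - 2.0) ** 2

def one_step_update(theta, grad, lr=0.1):
    # One step of gradient descent on the pseudo-labeled loss.
    return theta - lr * grad

theta = 0.0
loss_before = student_loss(theta)
# Suppose learning from the refined pseudo labels pushes theta toward 2.0
# (i.e., the pseudo-label gradient happens to point the right way):
theta_after = one_step_update(theta, grad=-4.0)
loss_after = student_loss(theta_after)
h = loss_after - loss_before
print(h < 0)  # negative feedback -> the pseudo labels helped the student
```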

Teacher Network. The teacher network is jointly optimized by two objectives, a typical semi-supervised learning (SSL) loss L_SSL and a meta learning loss L_meta:

    L_T = L_SSL + L_meta.

The SSL loss consists of the supervised loss L_sup on the strongly-labeled data and the regularization loss L_reg on the weakly-labeled data: L_SSL = L_sup + L_reg.

The supervised loss is defined as

    L_sup = E_{(x, y) ∈ D_s} [ CE( y, f_T(x; θ_T) ) ].

The regularization loss alleviates the overfitting of the teacher by enforcing prediction consistency between the original and augmented weakly-labeled samples:

    L_reg = E_{x_i ∈ D_w} Σ_j KL( softmax( f_T(x_i; θ_T)_j / τ ) ‖ softmax( f_T(x̃_i; θ_T)_j / τ ) ),

where f_T(x_i; θ_T)_j is the prediction logits of the teacher network on the j-th token of the i-th weakly-labeled sample x_i, f_T(x̃_i; θ_T)_j is the prediction logits on the corresponding token of the augmented weakly-labeled sample x̃_i, and τ is the temperature factor that controls the smoothness. Here, we do not explicitly augment the sentence; instead, we add random Gaussian noise to the BERT embedding of each token to increase the diversity of the sentence.
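A minimal single-token sketch of this consistency regularization, assuming a KL-based consistency term and a toy linear "tagger" over raw embeddings (both assumptions for illustration; the real model perturbs BERT token embeddings):

```python
import math
import random

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions (always >= 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def consistency_loss(logits_fn, embedding, sigma=0.1, temperature=0.7):
    """Consistency between predictions on the original embedding and a
    Gaussian-perturbed copy (the implicit augmentation described above)."""
    noisy = [e + random.gauss(0.0, sigma) for e in embedding]
    p = softmax(logits_fn(embedding), temperature)
    q = softmax(logits_fn(noisy), temperature)
    return kl(p, q)

random.seed(0)
# Toy linear tagger: 3-dim token embedding -> 2 class logits.
weights = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1]]
logits_fn = lambda e: [sum(w * x for w, x in zip(row, e)) for row in weights]
loss = consistency_loss(logits_fn, [0.4, -0.2, 0.9])
print(loss >= 0.0)  # KL divergence is non-negative
```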

The meta loss for the teacher network is defined as:

    L_meta = h · E_{x ∈ D_w} [ CE( ŷ, f_T(x; θ_T) ) ].

The performance variation of the student network on the strongly-labeled data is formulated as the feedback signal to dynamically adapt the teacher network’s pseudo-labeling strategies. The teacher and student can have the same encoder (e.g., DistilBERT (Sanh et al., 2019)), or a larger teacher for better prediction (e.g., BERT (Devlin et al., 2019)) and a small student (e.g., DistilBERT) for fast online production inference.

3.3. QUEACO Attribute Value Normalization

In this section, we discuss two different AVN methods: one for the product type attribute and one for general attributes.

3.3.1. AVN for Product type attribute

E-commerce websites usually have their own self-defined product category taxonomy, which is used for organizing and indexing the products. Thus, identifying the product type of a given query is one of the most critical components of the query attribute value extraction.

However, there are three challenges in directly normalizing the surface form product type: 1) some queries do not have an explicit surface form product type while they are implicitly associated with some product types. For example, as shown in Table 2 case #2, there is no surface form product type in the movie query “wonder woman 1984”, but the product type of the query is “movie”; 2) many entity mentions are hyponyms of product type values. For example, as shown in Table 2 case #6, for the query “mini pocket detangler brush”, its surface form product type “detangler brush” is a hyponym of its product type “hair brush”; 3) the same surface form might correspond to different product types. For example, the product type of the query “tote for travel” is “luggage”, but the product type of the query “mk tote for woman” is “handbag”.

Alternatively, we can get the query-to-productType associations using the weakly-labeled behavior data. For frequent queries, we use query search logs to get the product type relevance vector of the query, as defined in Section 2.1.2, and then take the most relevant product types. Given that not all queries have enough user-behavioral signals, we use this weakly-labeled data to train a multi-label query classification model (Hashemi et al.; Kim et al., 2016; Lin et al., 2020) for predicting the product type distribution of less frequent queries. To meet the latency constraint, we also use DistilmBERT as the encoder.

Case# | Query | Surface form | Behavior-based
1 | nike | None | shoes
2 | wonder woman 1984 | None | movie
3 | unicorn | None | clothes, toys
4 | lg smart tv 32 | smart tv | television
5 | patio umbrella | patio umbrella | umbrella
6 | mini pocket detangler brush | detangler brush | hair brush
7 | tote for travel | tote | luggage
8 | mk tote for women | tote | handbag
Table 2. Case study on surface & behavior-based product type.

3.3.2. AVN for general attributes

The attribute value normalization corresponds to the entity disambiguation task in entity linking. Prior entity linking works for search queries (Cornolti et al., 2014; Tan et al., 2017; Blanco et al., 2015) leverage additional information, such as knowledge bases, query logs, and search results. Inspired by this, we propose to extract common surface form to canonical form mappings based on QUEACO NER predictions and weakly-labeled query-to-attribute associations.

We use the entity type “brand” as an example. Using the method defined in Section 2.1.2, we can get the most relevant brand for a query by aggregating the query search logs. Then we can associate the surface form brand s and the most relevant behavior-based brand c through the query q. Given a surface form brand value s and a canonical form brand value c, we can define the mapping probability between them as

    P(c | s) = count(s → c) / Σ_{c'} count(s → c'),

where count(s → c) is the number of aggregated query associations between the surface form s and the behavior-based canonical form c.
However, we find the same surface form can be normalized to different canonical forms depending on the query context. For example, as shown in Table 3 cases #3 and #4, the same surface form brand “apple” can be mapped to “apple barrel” given the query “apple craft paint”, and to “Apple computer” given the query “apple macbook pro”. This finding is consistent with the recent embedding-based entity linking works (Wu et al., 2020; Agarwal and Bikel, 2020). However, due to the strict requirement on inference latency and the very high request volume, it is hard to directly apply the current state-of-the-art embedding-based entity disambiguation models, which use context embeddings for candidate ranking, to the query side (Yamada et al., 2016, 2019; Gillick et al., 2019; Wu et al., 2020; Agarwal and Bikel, 2020). Alternatively, we simplify the setting by using the query product type as the context of the query. We then define the probability of a canonical form attribute value c conditioned on a surface form value s, given product type pt, as:

    P(c | s, pt) = count(s → c, pt) / Σ_{c'} count(s → c', pt).
Case# | Query | Entity | Surface form | Canonical form
1 | lg smart tv 32 | size | 32 | 32 inch
2 | fish tank 32 | size | 32 | 32 gallon
3 | apple craft paint | brand | apple | apple barrel
4 | apple macbook pro | brand | apple | Apple computer
Table 3. Case study on surface & canonical value.
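The count-based, product-type-conditioned mapping probability can be estimated from aggregated query-log associations as in this sketch (all counts and names below are hypothetical):

```python
from collections import Counter

def mapping_probs(associations):
    """Estimate P(canonical | surface, product_type) from aggregated
    (surface, product_type, canonical) query-log associations."""
    joint = Counter(associations)
    marginal = Counter((s, pt) for s, pt, _ in associations)
    return {(s, pt, c): n / marginal[(s, pt)] for (s, pt, c), n in joint.items()}

# Toy aggregated associations for the surface form "apple":
pairs = (
    [("apple", "craft paint", "apple barrel")] * 9
    + [("apple", "laptop", "Apple computer")] * 8
    + [("apple", "laptop", "apple barrel")] * 2
)
probs = mapping_probs(pairs)
print(probs[("apple", "laptop", "Apple computer")])  # -> 0.8
```

Conditioning on the product type disambiguates the surface form: "apple" maps to different canonical brands under "craft paint" and "laptop" contexts.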

4. Experiments

4.1. QUEACO query NER

4.1.1. Data Description

We collect search queries from a real-world e-commerce website and construct two datasets: (1) a strongly-labeled dataset, which is human annotated, and (2) a weakly-labeled dataset, which is generated through partial query tagging, as shown in Table 1. The statistics of the strongly-labeled and the weakly-labeled datasets are shown in Tables 4 and 5. The details of these datasets are as follows:

  • We split the entire dataset into train/dev/test by roughly 90%, 5%, and 5%. The sizes of the strongly-labeled and weakly-labeled training data are 677K and 17M, respectively. The weakly-labeled dataset is noisier and more than 26 times bigger than the strongly-labeled dataset.

  • The strongly-labeled data contains 12 languages: English (En), German (De), Spanish (Es), French (Fr), Italian (It), Japanese (Jp), Chinese (Zh), Czech (Cs), Dutch (Nl), Polish (Pl), Portuguese (Pt), Turkish (Tr). The weakly-labeled dataset does not cover the Zh, Cs, Nl, and Pl languages.

  • The non-O %coverage for the strongly-labeled dataset is 98.31%, and there are 13 non-O types. However, the non-O %coverage for the weakly-labeled data is 43.21%, and there are 11 non-O types, indicating that the weak labels suffer from severe incompleteness issues. The incomplete annotation is due to the exact string matching between query spans and product attribute values (Mehta et al., 2021). Table 5 also presents the precision and recall of the weak labels on a golden evaluation set. In particular, the overall recall is lower than 50%, which is consistent with the non-O %coverage. The low recall issue is even more severe for low-resource languages, like Jp, Pt, and Tr. At the same time, the weak labels also suffer from labeling bias, since the overall precision is lower than 80%.


Dataset #Train #Dev #Test # Non-O Type Non-O %Coverage
En 256571 14193 14269 13 98.87
De 98980 5442 5473 13 95.49
Es 63844 3600 3488 13 99.05
Fr 79176 4383 4504 13 98.91
It 52136 2933 2867 13 99.04
Jp 77457 4422 4365 13 98.65
Zh 22467 1238 1247 13 98.51
Cs 4430 272 252 13 93.66
Nl 8562 423 478 13 97.09
Pl 4489 251 229 13 92.19
Pt 4467 273 247 13 99.45
Tr 5093 267 274 13 99.52
Total 677672 37697 37693 13 98.31
Table 4. The data statistics of strongly-labeled NER dataset.
Dataset #Train # Type %Coverage Span Precision Span Recall
En 14144225 11 42.64 78.50 47.53
De 2004144 11 48.55 83.18 52.35
Es 322435 11 45.79 82.24 51.32
Fr 504309 11 49.00 81.15 51.56
It 475594 11 48.87 81.69 50.82
Jp 241078 11 20.80 67.67 25.53
Pt 134458 11 33.91 80.83 32.23
Tr 23980 11 32.87 86.12 34.95
Total 17850787 11 43.21 79.80 48.04
Table 5. The data statistics of the weakly-labeled NER dataset. Type and Coverage denote the number of entity types and the percentage of non-O tokens.

4.1.2. Evaluation Metrics

We use the span-level micro precision, recall and F1-score as the evaluation metrics for all experiments. For the per language experiment, we only report the span-level micro F1-score for each language, due to the space limit.
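Span-level micro F1 over BIO sequences can be computed as in this sketch (a simplified re-implementation for illustration; production evaluation typically uses a library such as seqeval):

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO sequence (end exclusive)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes a trailing span
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))          # close the open span
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]                # open a new span
    return spans

def span_micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged span-level F1: a span counts only on exact boundary+type match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(extract_spans(gold)), set(extract_spans(pred))
        tp += len(g & p); fp += len(p - g); fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-Brand", "B-ProductType", "O"]]
pred = [["B-Brand", "O", "O"]]
print(span_micro_f1(gold, pred))  # precision 1.0, recall 0.5 -> F1 0.666...
```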

4.1.3. Analysis of the Base Encoder

We benchmark the DistilmBERT performance against the baseline models in the query attribute extraction literature. All RNN experiments use FastText multilingual word embeddings (Conneau et al., 2017) and the TARGER implementation (Chernodub et al., 2019).

  • RNN models: BiLSTM, BiGRU, BiLSTM-CRF and BiGRU-CRF models are benchmarked for the Home Depot query NER model (Cheng et al., 2020).

  • BiLSTM-CNN-CRF (Lample et al., 2016; Ma and Hovy, 2016) is the state-of-the-art NER model architecture before BERT (Devlin et al., 2019; Yang et al., ).

  • DistilmBERT baselines: 1) DistilmBERT (Single) means separately finetuning DistilmBERT on the strongly-labeled data for each single language. 2) DistilmBERT (Multi) means finetuning DistilmBERT on the strongly-labeled data for all languages.

Method (Span level) Precision Recall F1
BiLSTM 65.66 70.09 67.81
BiGRU 64.35 68.96 66.58
BiLSTM-CRF 71.04 69.36 70.19
BiGRU-CRF 69.45 67.98 68.71
BiLSTM-CNN-CRF 70.33 67.92 69.11
BiGRU-CNN-CRF 67.75 65.40 66.56
DistilmBERT (Single) 71.72 74.16 72.92
DistilmBERT (Multi) 73.33 75.29 74.29
Table 6. Comparison of different encoders.

As shown in Table 6, DistilmBERT outperforms the non-BERT baselines. Furthermore, finetuning DistilmBERT on all languages jointly performs better than training a separate model for each language.

4.1.4. Discussion on the training data

Figure 3. Size of strongly & weakly labeled data vs. performance; panels (a), (b), and (c) show span-level precision, recall, and micro F1, respectively. All results are produced by directly finetuning the DistilmBERT model on the subsampled dataset. We subsample varying fractions of the 677K strongly-labeled data and of the 17M weakly-labeled data.

In this section, we discuss the use of training data for QUEACO query NER model. We benchmark our setting with the baseline in the query NER literature, where only weakly-labeled data is available. All experiments use DistilmBERT as the base NER model for a fair comparison.

In Figure 3, we subsample the strongly- and weakly-labeled datasets and find:

  • The precision and recall of the model trained with the weakly-labeled data do not change much as the training data size increases. However, both precision and recall increase dramatically as the size of the strongly-labeled data increases, especially the precision.

  • The best precision that weakly-labeled data can achieve is around . However, 34K strongly-labeled queries can already achieve precision, and the precision reaches when trained with 677K strongly-labeled queries. With only weakly-labeled data, the best recall is only around , much lower than with strongly-labeled data; 7K strongly-labeled queries can already achieve recall.

These findings are consistent with the conclusion of BOND (Liang et al., 2020) that pre-trained language models can easily overfit to incomplete weak labels. This also explains why existing query NER works (Cheng et al., 2020; Wen et al., 2019; Cowan et al., 2015; Kozareva et al., 2016) do not adopt state-of-the-art pre-trained language models.

In Figure 4, we show the performance improvement from introducing weakly-labeled data at different sizes of randomly subsampled strongly-labeled data. The smaller the strongly-labeled dataset, the bigger the improvement the weak labels bring; however, the improvement is marginal when the strongly-labeled dataset is sufficient. In Section 3.2, we introduce the QUEACO query NER model to better utilize the weak labels and further improve performance.

Figure 4. Size of strongly labeled data vs. micro span-level F1. "strongly labeled": a baseline that finetunes DistilmBERT on the strongly-labeled data; "strongly & weakly labeled": a baseline that pretrains DistilmBERT on the weak labels and then finetunes it on the strongly-labeled data.

4.1.5. Implementation Details of QUEACO

We employ DistilmBERT (Sanh et al., 2019) with 6 layers, 768-dimensional hidden states, 12 heads and 134M parameters as our encoder. We use the ADAM optimizer with a learning rate of , tuned amongst {, , , , }. We search the number of epochs in [1, 2, 3, 4, 5] and the batch size in [8, 16, 32, 64]. The Gaussian noise variance is tuned amongst {0.01, 0.1, 1.0}, the temperature factor for smoothness amongst {0.5, 0.6, 0.7, 0.8, 0.9}, and the threshold amongst {0.5, 0.6, 0.7, 0.8, 0.9}. All implementations are based on the transformers library with PyTorch 1.7.0. To alleviate overfitting, we perform early stopping on the validation set during both the pretraining and finetuning stages. For model training, we use an Amazon EC2 virtual machine with 8 NVIDIA A100-SXM4-40GB GPUs, configured with CUDA 11.0.

4.1.6. Baseline Models

As discussed in Sections 2.2 and 4.1.4, the setting that uses DistilmBERT as the base NER model and both strongly- and weakly-labeled data for training outperforms the alternatives. We conduct baseline experiments under similar settings to show the effectiveness of the QUEACO query NER model. All experiments use DistilmBERT as the base NER model for a fair comparison.

Supervised Learning Baseline: We directly fine-tune the pre-trained model on the strongly-labeled data.

Semi-supervised Baseline

  • Self-Training (ST): self-training with hard pseudo-labels.

  • NoisyStudent (Xie et al., 2020) extends self-training and distillation by adding noise to the student during learning.

Weakly-supervised Baseline: Similar to QUEACO, these weakly-supervised baselines also have two stages: pretraining with strongly-labeled and weakly-labeled data, and finetuning with strongly-labeled data. We only report stage 2 performance.

  • Weakly Supervised Learning (WSL): Simply combining strongly-labeled data with weakly-labeled data (Mann and McCallum, 2010).

  • Weighted Weakly Supervised Learning (Weighted WSL): WSL with weighted loss, where weakly-labeled samples have a fixed smaller weight and strongly-labeled samples have weight = 1. We tune the weight and present the best result.

  • Robust WSL: WSL with a mean squared error loss function, which is robust to label noise (Ghosh et al., 2017).

  • BOND (hard/soft): BOND (Liang et al., 2020) employs a state-of-the-art two-stage teacher-student framework with hard pseudo-labels or soft pseudo-labels (Xie et al., 2016).

  • BOND (soft-high): only uses the soft pseudo-labels, with high confidence selection for student network training in the BOND framework.

  • BOND (NoisyStudent): applies noisy student (Xie et al., 2020) to the BOND framework.
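The hard pseudo-labeling shared by the self-training and BOND-style baselines can be sketched as follows. The function name and the masking convention (label -1 for low-confidence tokens, to be excluded from the student's loss) are illustrative assumptions, not the baselines' exact implementations.

```python
import numpy as np

def hard_pseudo_labels(probs, threshold=0.7):
    """Turn teacher token-level class probabilities into hard pseudo-labels.

    probs: array of shape (num_tokens, num_classes).
    Tokens whose top-class probability falls below the confidence threshold
    are masked with -1, mirroring the high-confidence selection variant
    (cf. BOND soft-high)."""
    labels = probs.argmax(axis=-1)
    confident = probs.max(axis=-1) >= threshold
    return np.where(confident, labels, -1)
```

The soft-label variants instead pass the full (optionally temperature-sharpened) probability distribution to the student rather than the argmax.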

4.1.7. Main results

Tables 7 and 8 clearly demonstrate the effectiveness of our proposed QUEACO query NER model:

  • The proposed QUEACO query NER model achieves state-of-the-art performance. More specifically, it improves upon the best weakly-supervised baseline by a margin of on micro span-level F1. The QUEACO query NER model with mBERT as the teacher network further enhances performance.

  • We also find that the weak labels bring an improvement of over the best semi-supervised result, showing that weak labels carry useful information when utilized effectively.

  • Table 8 compares the span F1 between the baseline DistilmBERT model and the QUEACO query NER model for each language. We observe consistent performance improvements for the high-resource languages (En, De, Es, Fr, It, Jp). On the other hand, we observe performance drops for the low-resource languages with little or no weakly-labeled data (Cs, Nl, Pl, Tr). Pt is also a low-resource language but sees a significant improvement because we have more than 100K weakly-labeled training queries for Pt. We believe the performance of these low-resource languages can be further improved if more weakly-labeled data is collected.

Method (Span level) Precision Recall F1
Supervised Baseline
DistilmBERT (Single) 71.72 74.16 72.92
DistilmBERT (Multi) 73.33 75.29 74.29
Semi-supervised Baseline (Encoder: DistilmBERT)
ST 73.29 75.44 74.35
Noisy student 73.28 75.38 74.32
Weakly-supervised Baseline (Encoder: DistilmBERT)
unweighted WSL 73.81 75.93 74.85
weighted WSL 73.77 75.97 74.85
robust WSL 73.10 75.20 74.14
BOND hard 73.77 75.81 74.78
BOND soft 73.65 75.68 74.65
BOND soft high conf 73.95 76.05 74.98
BOND noisy student 73.97 75.99 74.97
Ours (Student: DistilmBERT)
QUEACO (Teacher: DistilmBERT) 74.44 76.35 75.38
QUEACO (Teacher: mBERT) 74.48 76.41 75.44
(+0.51) (+0.36) (+0.46)
Table 7. Comparison between QUEACO and baseline methods on micro span-level F1.
Language Weakly Data available DistilmBERT (Multi) QUEACO
En True 75.42 76.97 (+1.55)
De True 75.26 76.70 (+1.44)
Es True 77.30 77.67 (+0.37)
Fr True 71.56 73.20 (+1.64)
It True 77.88 78.42 (+0.54)
Jp True 65.49 65.88 (+0.39)
Zh False 71.02 72.19 (+1.17)
Cs False 72.61 70.93 (-1.68)
Nl False 75.46 75.30 (-0.16)
Pl False 79.71 79.43 (-0.28)
Pt True 58.24 62.00 (+3.76)
Tr True 72.12 71.80 (-0.32)
Table 8. Comparison between DistilmBERT (Multi) and QUEACO for each language on micro span-level F1.

4.1.8. Ablation Study

  • QUEACO w/o student feedback: uses a fixed teacher network to generate pseudo-labels for the student network.

  • QUEACO w/o noise: remove random Gaussian noise added to the BERT embedding when training the teacher network.

  • QUEACO w/o weak labels: remove the pseudo & weak label refinement step, and only use the pseudo labels for student network training.

  • QUEACO w/o finetune: remove stage 2: strong labels finetuning.

As shown in Table 9, the final finetuning is essential to QUEACO NER. All components of QUEACO, including student feedback, the random Gaussian noise added to the BERT embeddings, and the pseudo & weak label refinement, are effective.
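The Gaussian-noise component ablated above amounts to perturbing the encoder embeddings during teacher training. The sketch below is an illustrative assumption (function name and defaults are ours, and real training would perturb framework tensors rather than NumPy arrays):

```python
import numpy as np

def add_gaussian_noise(embeddings, variance=0.1, rng=None):
    """Perturb encoder embeddings with zero-mean Gaussian noise.

    A noised teacher produces smoother, less overconfident pseudo-labels;
    removing this perturbation corresponds to the "w/o noise" ablation.
    The variance would be tuned amongst {0.01, 0.1, 1.0} as in Section 4.1.5."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.normal(0.0, np.sqrt(variance), size=embeddings.shape)
    return embeddings + noise
```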

Method (Span level) Precision Recall F1
QUEACO w/o student feedback 74.09 76.11 75.09
QUEACO w/o noise 74.18 76.01 75.08
QUEACO w/o weak labels 74.04 75.77 74.89
QUEACO w/o finetune 63.31 66.62 64.92
QUEACO 74.44 76.35 75.38
Table 9. Ablation study.

4.2. QUEACO Attribute Value Normalization

4.2.1. Product type AVN

In query NER, the span-level micro F1-score for product type is only . The performance of NER-based product type value extraction is even worse, since many surface forms cannot be normalized. In Table 10, we show the product-type precision, recall and F1 of the multi-label query classification model, as described in Section 3.3.1, on a golden set. We conclude that the query classification approach, trained with weakly-labeled data, is better suited to product type attribute extraction than query NER.
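At inference time, the multi-label query classification of Section 3.3.1 reduces to independent per-label sigmoid thresholding. The sketch below is illustrative; the function name, label names and threshold are assumptions, not the deployed model.

```python
import numpy as np

def predict_product_types(logits, label_names, threshold=0.5):
    """Multi-label product-type prediction: an independent sigmoid per label,
    returning every product type whose probability clears the threshold
    (a query may express zero, one, or several product-type intents)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return [name for name, p in zip(label_names, probs) if p >= threshold]
```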

Country Eval Data Size Precision Recall F1
USA 2746 85.13 81.1 83.08
UK 2590 85.44 85.71 85.58
Canada 2705 85.07 86.41 85.73
Japan 2151 85.2 80.06 82.55
Germany 2254 85.01 88.54 86.74
Table 10. Product type attribute value extraction performance.

4.2.2. AVN for other attributes

In Table 11, we show attribute normalization results for the brand, color and size attributes using our proposed method. Our method is effective at normalizing common surface forms, including:

  • spelling errors: the brand “Michael Kors” is often misspelled as “Micheal Kors”, and “Levi’s” as “levi”;

  • spelling variants: for example, “3 by 5” and “3x5” are different variants with the same meaning;

  • abbreviations: for example, “mk” abbreviates “Michael Kors”, “wd” abbreviates “Western Digital”, and “in” in the mention “8 in” abbreviates the unit “inches”.
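A minimal sketch of the spelling-error case, matching a noisy surface form against known canonical forms by string similarity; the function name and the similarity cutoff are assumptions. Note this only covers misspellings — spelling variants (“3 by 5” → “3x5”) and abbreviations (“mk” → “Michael Kors”) are not string-similar and need the query-to-attribute behavior data instead.

```python
from difflib import SequenceMatcher

def normalize_attribute(surface, canonical_forms, min_similarity=0.6):
    """Map a misspelled surface form to its closest canonical form;
    return the surface unchanged if nothing is similar enough."""
    surface_l = surface.lower()
    best, best_score = surface, min_similarity
    for canon in canonical_forms:
        score = SequenceMatcher(None, surface_l, canon.lower()).ratio()
        if score > best_score:
            best, best_score = canon, score
    return best
```

For example, `normalize_attribute("micheal kors", ["Michael Kors", "Western Digital"])` returns "Michael Kors".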

attribute surface form product type canonical form
size 3 by 5 rug 3x5
size 2 pack air filter Value Pack (2)
size 28 foot ladder 28 Feet
size 10.5 inch screen protector 10.5 Inches
size 8 in toy figure 8 inches
color golden belt Gold
color turquoise dress blue
color navy blue dress blue
brand levi underpants Levi’s
brand mk watch Michael Kors
brand Micheal Kors watch Michael Kors
brand wd computer drive Western Digital
Table 11. QUEACO attribute normalization result.

5. QUEACO Online Deployment

5.1. Online End-to-End Evaluation

We conducted an end-to-end evaluation of QUEACO on real-world search traffic, with two evaluation metrics: span-level precision and token-level coverage. For span-level precision, we resort to a crowdsourcing data labeling platform, Toloka (https://toloka.yandex.com); the reported overall precision of the QUEACO system is . Since query attribute value extraction is an open-domain problem, human annotators cannot verify the recall of the extracted attribute spans. Therefore, we calculate token-level coverage, i.e., the percentage of tokens annotated by QUEACO, as an approximation of recall. The token-level coverage increased by compared to the current system.
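Token-level coverage as used here can be computed as follows; this sketch assumes span annotations are given as (start, end) token offsets, which is an illustrative representation.

```python
def token_coverage(queries, annotations):
    """Fraction of query tokens covered by at least one extracted attribute
    span -- a proxy for recall when exhaustive gold spans are unavailable.

    queries: list of token lists; annotations: per-query list of
    (start, end) token offsets (end exclusive). Overlapping spans
    count each token only once."""
    covered = total = 0
    for tokens, spans in zip(queries, annotations):
        flags = [False] * len(tokens)
        for start, end in spans:
            for i in range(start, min(end, len(tokens))):
                flags[i] = True
        covered += sum(flags)
        total += len(tokens)
    return covered / total if total else 0.0
```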

5.2. Application: Extracted Attribute Value for Product Reranking

To validate the effectiveness of the QUEACO signal in the product search system, we design a downstream task, product reranking, whose goal is to rerank the top-16 products by their relevance to the query intent. Specifically, we first use QUEACO to extract attributes from the product search queries. Then, we generate boolean features, such as is pt match and is brand match, based on the attribute values of queries and products. We refer to these boolean features as QUEACO features. We then train two learning-to-rank (LTR) models: one uses QUEACO features while the other does not; all other features, settings and hyperparameters of the two models are the same. To compare the two models, we use NDCG@16, the normalized discounted cumulative gain (NDCG) for the top 16 products of the search result. We conducted online A/B experiments for this reranking application in four countries: India, Canada, Japan, and Germany. On average, we improve the NDCG@16 by .
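Generating the boolean QUEACO features for the LTR model can be sketched as follows. The underscored feature-name format (is_pt_match, is_brand_match) and the dict layout are assumptions for illustration, not the production feature pipeline.

```python
def queaco_match_features(query_attrs, product_attrs):
    """Boolean match features for the LTR reranker: for each attribute
    QUEACO extracted from the query, does the product's canonical value
    agree? Attributes absent from the query produce no feature."""
    features = {}
    for attr, value in query_attrs.items():
        features[f"is_{attr}_match"] = product_attrs.get(attr) == value
    return features
```

For example, a query asking for a "mk watch" normalized to brand "Michael Kors" would yield is_brand_match = False against a Fossil watch.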


6. Related Work

6.1. E-commerce Attribute Value Extraction

Most of the previous works on e-commerce attribute value extraction focus on extracting surface-form attribute values from product titles and descriptions. Some early machine learning works formulate the task as a (semi-) classification problem (Ghani et al., 2006; Probst et al., 2007). Later, several researchers (Putthividhya and Hu, 2011; More, 2016) employ a sequence tagging formulation and adopt the CRF model architecture. With the recent advances in deep learning, many RNN-CRF based models have been applied to the sequence tagging task (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016) and achieved promising results. Following this trend, recent works on product attribute value extraction (Zheng et al., 2018; Xu et al., 2019; Mehta et al., 2021) also adopt variants of the BiLSTM-CRF model architecture. In addition, some recent studies have explored BERT-based (Devlin et al., 2019) Machine Reading Comprehension (MRC) (Xu et al., 2019) and Question & Answering (Q&A) (Wang et al., 2020) formulations.

Query attribute value extraction works (Cheng et al., 2020; Wen et al., 2019; Cowan et al., 2015; Kozareva et al., 2016) also employ the sequence tagging formulation and adopt BiLSTM-CRF model architectures and their variants. Recent works (Cheng et al., 2020; Wen et al., 2019) utilize large-scale behavioral data to generate partial query tagging as distant supervision for training the NER model, and also explore data augmentation and active learning to deal with data quality issues.

6.2. NER with Distant Supervision

To alleviate human labeling efforts, various approaches such as transfer learning (Pan and Yang, 2009), semi-supervised learning (Chapelle et al., 2009), and weakly-supervised learning (Zhou, 2018) have emerged and are widely applied to low-resource NLP tasks (Zhang et al., 2020; Li et al., 2020b; Liu et al., 2021), e.g., sentiment classification (Li et al., 2017, 2018, 2019b, 2019a) and information extraction (He, 2017; Shang et al., 2018a; Li et al., 2020a). Specifically, distant supervision is a type of weak supervision that is automatically generated from heuristics, such as matching spans of unlabeled text against a domain dictionary (Shang et al., 2018b; Liang et al., 2020). Existing works on NER with distant supervision (Shang et al., 2018b; Liang et al., 2020) mainly focus on the setting where only distant supervision is accessible. Besides, most existing query NER works (Cheng et al., 2020; Wen et al., 2019) rely only on distant supervision, generated from partial query tagging, for NER model training.

However, in some cases both strongly-labeled data and a large amount of distant supervision are available. The strongly-labeled data, though expensive to collect, has been validated to be critical for boosting distantly supervised NER performance (Jiang et al., 2021).

7. Conclusion

This paper proposes to utilize weakly-labeled behavioral data to improve both the named entity recognition and attribute value normalization phases of query attribute value extraction. We conduct extensive experiments on a real-world large-scale e-commerce dataset and demonstrate that QUEACO NER achieves state-of-the-art performance and that QUEACO AVN effectively normalizes common customer-typed surface forms. We also validate the effectiveness of the proposed QUEACO system on the downstream product reranking application.


  • O. Agarwal and D. M. Bikel (2020) Entity linking via dual and cross-attention encoders. arXiv preprint arXiv:2004.03555. Cited by: §3.3.2.
  • E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness (2020) Pseudo-labeling and confirmation bias in deep semi-supervised learning. In IJCNN, pp. 1–8. Cited by: §3.2.1.
  • R. Blanco, G. Ottaviano, and E. Meij (2015) Fast and space-efficient entity linking for queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 179–188. Cited by: §3.3.2.
  • O. Chapelle, B. Scholkopf, and A. Zien (2009) Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §6.2.
  • X. Cheng, M. Bowden, B. R. Bhange, P. Goyal, T. Packer, and F. Javed (2020) An end-to-end solution for named entity recognition in ecommerce search. arXiv preprint arXiv:2012.07553. Cited by: §1, §1, §1, §2.2, §2.2, 1st item, §4.1.4, §6.1, §6.2.
  • A. Chernodub, O. Oliynyk, P. Heidenreich, A. Bondarenko, M. Hagen, C. Biemann, and A. Panchenko (2019) Targer: neural argument mining at your fingertips. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 195–200. Cited by: §4.1.3.
  • J. P. Chiu and E. Nichols (2016) Named entity recognition with bidirectional lstm-cnns. TACL 4, pp. 357–370. Cited by: §1.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §4.1.3.
  • M. Cornolti, P. Ferragina, M. Ciaramita, H. Schütze, and S. Rüd (2014) The smaph system for query entity recognition and disambiguation. In Proceedings of the first international workshop on Entity recognition & disambiguation, pp. 25–30. Cited by: §3.3.2.
  • B. Cowan, S. Zethelius, B. Luk, T. Baras, P. Ukarde, and D. Zhang (2015) Named entity recognition in travel-related search queries. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3935–3941. Cited by: §1, §1, §2.2, §4.1.4, §6.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.2, §3.2.2, 2nd item, §6.1.
  • R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano (2006) Text mining for product attribute extraction. ACM SIGKDD Explorations Newsletter 8 (1), pp. 41–48. Cited by: §6.1.
  • A. Ghosh, H. Kumar, and P. Sastry (2017) Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: 3rd item.
  • D. Gillick, S. Kulkarni, L. Lansing, A. Presta, J. Baldridge, E. Ie, and D. Garcia-Olano (2019) Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 528–537. Cited by: §3.3.2.
  • J. Guisado-Gámez, D. Tamayo-Domenech, J. Urmeneta, and J. L. Larriba-Pey (2016) ENRICH: a query rewriting service powered by wikipedia graph structure. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 10. Cited by: §1.
  • [16] H. B. Hashemi, A. Asiaee, and R. Kraft. Query intent detection using convolutional neural networks. Cited by: §3.3.1.
  • W. He (2017) Autoentity: automated entity detection from massive text corpora. Cited by: §6.2.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §1, §6.1.
  • H. Jiang, D. Zhang, T. Cao, B. Yin, and T. Zhao (2021) Named entity recognition with small strongly labeled and large weakly labeled data. In ACL/IJCNLP, Cited by: §6.2.
  • J. Kim, G. Tur, A. Celikyilmaz, B. Cao, and Y. Wang (2016) Intent detection using semantically enriched word embeddings. In 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 414–419. Cited by: §3.3.1.
  • Z. Kozareva, Q. Li, K. Zhai, and W. Guo (2016) Recognizing salient entities in shopping queries. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 107–111. Cited by: §1, §1, §2.2, §4.1.4, §6.1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Cited by: 2nd item, §6.1.
  • D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 896. Cited by: §3.2.1.
  • Q. Li, H. Li, H. Ji, W. Wang, J. Zheng, and F. Huang (2012) Joint bilingual name tagging for parallel corpora. In CIKM, pp. 1727–1731. Cited by: §2.1.1.
  • X. Li, L. Bing, W. Zhang, Z. Li, and W. Lam (2020a) Unsupervised cross-lingual adaptation for sequence tagging and beyond. arXiv preprint arXiv:2010.12405. Cited by: §6.2.
  • Z. Li, M. Kumar, W. Headden, B. Yin, Y. Wei, Y. Zhang, and Q. Yang (2020b) Learn to cross-lingual transfer with meta graph learning across heterogeneous languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2290–2301. Cited by: §6.2.
  • Z. Li, X. Li, W. Ying, B. Lidong, Z. Yu, and Q. Yang (2019a) Transferable end-to-end aspect-based sentiment analysis with selective adversarial learning. Cited by: §6.2.
  • Z. Li, Y. Wei, Y. Zhang, and Q. Yang (2018) Hierarchical attention transfer network for cross-domain sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §6.2.
  • Z. Li, Y. Wei, Y. Zhang, X. Zhang, and X. Li (2019b) Exploiting coarse-to-fine task transfer for aspect-level sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4253–4260. Cited by: §6.2.
  • Z. Li, Y. Zhang, Y. Wei, Y. Wu, and Q. Yang (2017) End-to-end adversarial memory network for cross-domain sentiment classification.. In IJCAI, pp. 2237–2243. Cited by: §6.2.
  • C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, and C. Zhang (2020) BOND: bert-assisted open-domain named entity recognition with distant supervision. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Cited by: §2.2, 4th item, §4.1.4, §6.2.
  • H. Lin, P. Xiong, D. Zhang, F. Yang, R. Kato, M. Kumar, W. Headden, and B. Yin (2020) Light feed-forward networks for shard selection in large-scale product search. Cited by: §3.3.1.
  • H. Liu, D. Zhang, B. Yin, and X. Zhu (2021) Improving pretrained models for zero-shot multi-label text classification through reinforced label hierarchy reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pp. 1051–1062. Cited by: §6.2.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074. Cited by: 2nd item, §6.1.
  • G. S. Mann and A. McCallum (2010) Generalized expectation criteria for semi-supervised learning with weakly labeled data.. Journal of machine learning research 11 (2). Cited by: 1st item.
  • K. Mehta, I. Oprea, and N. Rasiwasia (2021) LaTeX-numeric: language-agnostic text attribute extraction for e-commerce numeric attributes. arXiv preprint arXiv:2104.09576. Cited by: 3rd item, §6.1.
  • A. More (2016) Attribute extraction from product titles in ecommerce. arXiv preprint arXiv:1608.04670. Cited by: §6.1.
  • S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §6.2.
  • H. Pham, Z. Dai, Q. Xie, and Q. V. Le (2021) Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11557–11568. Cited by: §3.2.1.
  • K. Probst, R. Ghani, M. Krema, A. E. Fano, and Y. Liu (2007) Semi-supervised learning of attribute-value pairs from product descriptions.. In IJCAI, Vol. 7, pp. 2838–2843. Cited by: §6.1.
  • D. Putthividhya and J. Hu (2011) Bootstrapped named entity recognition for product attribute extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1557–1567. Cited by: §1, §1, §6.1.
  • A. Raganato, C. D. Bovi, and R. Navigli (2017) Neural sequence learning models for word sense disambiguation. In EMNLP, pp. 1156–1167. Cited by: §1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108. Cited by: §2.2, §3.2.2, §4.1.5.
  • J. Shang, J. Liu, M. Jiang, X. Ren, C. R. Voss, and J. Han (2018a) Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30 (10), pp. 1825–1837. Cited by: §6.2.
  • J. Shang, L. Liu, X. Gu, X. Ren, T. Ren, and J. Han (2018b) Learning named entity tagger using domain-specific dictionary. In EMNLP, pp. 2054–2064. Cited by: §1, §2.2, §6.2.
  • C. Tan, F. Wei, P. Ren, W. Lv, and M. Zhou (2017) Entity linking for queries by searching wikipedia sentences. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 68–77. Cited by: §3.3.2.
  • Q. Wang, L. Yang, B. Kanagal, S. Sanghai, D. Sivakumar, B. Shu, Z. Yu, and J. Elsas (2020) Learning to extract attribute value from product via question answering: a multi-task approach. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 47–55. Cited by: §6.1.
  • M. Wen, D. K. Vasthimal, A. Lu, T. Wang, and A. Guo (2019) Building large-scale deep learning system for entity recognition in e-commerce search. In Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, pp. 149–154. Cited by: §1, §1, §1, §2.2, §4.1.4, §6.1, §6.2.
  • L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2020) Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6397–6407. Cited by: §3.3.2.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML, pp. 478–487. Cited by: 4th item.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves ImageNet classification. In CVPR. Cited by: 2nd item, 6th item.
  • H. Xu, W. Wang, X. Mao, X. Jiang, and M. Lan (2019) Scaling up open tagging from tens to thousands: comprehension empowered attribute value extraction from product title. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5214–5223. Cited by: §6.1.
  • I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji (2016) Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 250–259. Cited by: §3.3.2.
  • I. Yamada, K. Washio, H. Shindo, and Y. Matsumoto (2019) Global entity disambiguation with pretrained contextualized embeddings of words and entities. arXiv preprint arXiv:1909.00426. Cited by: §3.3.2.
  • [55] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le XLNet: generalized autoregressive pretraining for language understanding. Cited by: 2nd item.
  • D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pp. 189–196. Cited by: §3.2.1.
  • D. Zhang, T. Li, H. Zhang, and B. Yin (2020) On data augmentation for extreme multi-label classification. arXiv preprint arXiv:2009.10778. Cited by: §6.2.
  • G. Zheng, S. Mukherjee, X. L. Dong, and F. Li (2018) Opentag: open attribute value extraction from product profiles. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1049–1058. Cited by: §6.1.
  • Z. Zhou (2018) A brief introduction to weakly supervised learning. National science review 5 (1), pp. 44–53. Cited by: §6.2.