Domain-oriented Language Pre-training with Adaptive Hybrid Masking and Optimal Transport Alignment

by Denghui Zhang, et al.

Motivated by the success of pre-trained language models such as BERT in a broad range of natural language processing (NLP) tasks, recent research efforts have been made to adapt these models to different application domains. Along this line, existing domain-oriented models have primarily followed the vanilla BERT architecture with a straightforward use of the domain corpus. However, domain-oriented tasks usually require an accurate understanding of domain phrases, and such fine-grained phrase-level knowledge is hard to capture with existing pre-training schemes. In addition, the word co-occurrence guided semantic learning of pre-training models can be largely augmented by entity-level association knowledge, yet doing so risks introducing noise due to the lack of groundtruth word-level alignment. To address these issues, we provide a generalized domain-oriented approach that leverages auxiliary domain knowledge to improve the existing pre-training framework in two aspects. First, to preserve phrase knowledge effectively, we build a domain phrase pool as an auxiliary training tool, and introduce the Adaptive Hybrid Masked Model to incorporate such knowledge. It integrates two learning modes, word learning and phrase learning, and allows them to switch between each other. Second, we introduce Cross Entity Alignment to leverage entity association as weak supervision to augment the semantic learning of pre-trained models. To alleviate the potential noise in this process, we introduce an interpretable Optimal Transport based approach to guide alignment learning. Experiments on four domain-oriented tasks demonstrate the superiority of our framework.



1. Introduction

Figure 1. The single-word and phrase reconstruction accuracy of several existing language pre-training models.
Table 1. An example of review aspect extraction, where correct answers (marked in color) are usually phrases.

Recent years have witnessed the great success of pre-trained language models (PLMs), such as BERT (devlin2019bert), in a broad range of natural language processing (NLP) tasks. Moreover, several domain-oriented PLMs have been proposed to adapt to specific domains (huang2019clinicalbert; gu2020domain; chalkidis2020legal). For instance, BioBERT (lee2020biobert) and SciBERT (beltagy2019scibert) are pre-trained on large-scale domain-specific corpora for biomedical and scientific domain tasks, respectively. However, the above models straightforwardly reuse the same pre-training scheme as BERT, while insightful domain characteristics are largely overlooked. To this end, we raise a natural question: for domain language pre-training, can we go beyond the strategy of vanilla BERT + domain corpus by leveraging domain characteristics? In this paper, we explore this question in the e-commerce domain and present promising approaches that can also be generalized to other domains when auxiliary knowledge is available.

We first discuss the characteristics of domain-oriented tasks and the limitations of current pre-training approaches, then present two major improvement strategies, each leveraging a different type of auxiliary domain knowledge. On the one hand, understanding a great variety of domain phrases is critical to domain-oriented tasks. As shown in Table 1, the review aspect extraction task, widely used in the e-commerce domain, requires language models to understand domain phrases to extract the correct answers. However, such phrase-level domain knowledge is hard to capture with the Masked Language Model (MLM) (devlin2019bert) (i.e., the self-supervised task employed in most language pre-training models). Figure 1 depicts the language reconstruction performance of three existing language pre-training models on a public e-commerce corpus. As can be seen, the reconstruction accuracy drops drastically when the prediction length is increased from single word to multi-word phrase. We attribute this to the fact that MLM is a word-oriented task, i.e., it only reconstructs randomly masked words from the incomplete input but does not explicitly encourage any perception ability for domain phrases. Although later works (joshi2020spanbert; sun2020ernie) propose to mask phrases instead of words in MLM to equip BERT with phrase perception, they have two major drawbacks: (i) overgeneralized phrase selection: they use chunking (sun2019ernie) to randomly select phrases to mask, without considering the quality of phrases or their relatedness to specific domains; (ii) discarding of word masking: word masking helps to acquire word-level semantics essential for phrase learning, and hence should be preserved in pre-training.

On the other hand, pre-trained language models are limited by corpus-level statistics such as co-occurrence, which can be mitigated by auxiliary domain knowledge. For instance, to learn that Android and iOS are semantically related, a large number of co-occurrences in similar contexts are required in the pre-training data. For domain-oriented learning, this requirement can be relaxed by auxiliary knowledge, i.e., entity association. As shown in Table 2, when leveraging the "substitutable" association to pair the description texts of two product entities, Samsung Galaxy and iPhone, we can augment the co-occurrence of some words/phrases (e.g., 5G network vs. 4G signal; Android vs. iOS) by learning the alignments of similar words across entities. However, this intuition is challenging to fulfill in practice, as it constitutes a weakly supervised learning task. In other words, only weak supervision signals (i.e., entity-level alignments) are available, while word-level groundtruth alignments across entities are hard to obtain. Hence, the alignment problem requires a robust learning algorithm to overcome the potential noise under weak supervision. Moreover, the algorithm should offer decent interpretability over the alignment for ease of understanding and validation.

Based on the above insights, we propose an enhanced domain-oriented framework for language pre-training. Our framework takes the aforementioned domain characteristics into consideration and introduces two approaches to tackle the challenges. First, to equip language pre-training with the perception ability for domain phrases, we propose an advanced alternative to the Masked Language Model, namely, the Adaptive Hybrid Masked Model (AHM). In contrast to MLM, which only masks and reconstructs single words, AHM introduces a new sampling scheme for masking quality phrases with the guidance of an external domain phrase pool, and meanwhile a novel phrase completeness regularization term for sophisticated phrase reconstruction. Furthermore, since both word-level and phrase-level semantics are critical to language modeling, we unify the word and phrase learning modes via a loss-based parameter, which allows adaptive switching between the two modes and ensures a smooth and progressive learning process resembling human language cognition.

Table 2. An example of relational text in the e-commerce domain, where product descriptions are connected by the “substitutable” product association.

Second, to exploit the rich co-occurrence signals hidden in entity associations, we formulate a new pre-training task, namely, Cross Entity Alignment (CEA). Specifically, CEA aims to learn the word-level alignment matrix of an entity-association-based text pair (e.g., a description pair) with only weak supervision, i.e., knowing only that two entities are related, with no word-level groundtruth alignments available. Moreover, we propose an alignment learning scheme leveraging Optimal Transport (OT) to train this task in a weakly-supervised fashion. At each round, the OT objective helps to find the pseudo optimal matching of similar words (or phrases) and returns a sparse transport plan, which reveals robust and interpretable alignments. The language model is further optimized with the guidance of the transport plan to minimize the Wasserstein Distance of the aligned entity contents, enabling the model to learn fine-grained semantic correlations.

To validate the effectiveness of the proposed approach, we conduct extensive experiments in the e-commerce domain to compare our pre-training framework with state-of-the-art baselines. Specifically, we employ a pre-training corpus created from publicly available resources and fine-tune on four downstream tasks, i.e., Review-based Question Answering (RQA), Aspect Extraction (AE), Aspect Sentiment Classification (ASC), and Product Title Categorization (PTC). Quantitative results show that our method significantly outperforms BERT and other variants on all the tasks. Additionally, the visualization of the OT-based approach reveals feasible alignment results despite the weak supervision, while presenting convincing interpretability, as the alignment vector is enforced to be sparse. Lastly, while we demonstrate the effectiveness of our approach in the e-commerce domain, the ideas of the framework can be generalized to broader domains since the aforementioned auxiliary knowledge is free of annotation cost: the domain phrase pool can be constructed from a domain corpus, and entity association is broad and general, thus easy to obtain in many domains.

2. Related Work

Pre-trained Language Models. Recently, the emergence of pre-trained language models (PLMs) (devlin2019bert; peters2018deep; radford2018improving) has brought natural language processing into a new era. Compared with traditional word embedding models (mikolov2013distributed), PLMs learn to represent words based on the entire input context to tackle polysemy, hence capturing semantics more accurately. Following PLMs, many endeavors have been made for further optimization in terms of both architecture and training scheme (liu2019roberta; brown2020language; sun2020ernie; liang2020bond). Along this line, SpanBERT (joshi2020spanbert) proposes to reconstruct randomly masked spans instead of single words. However, a span consists of random continuous words and may not form a phrase, thus failing to capture phrase-level knowledge effectively. ERNIE (sun2019ernie) integrates phrase-level masking and entity-level masking into BERT, which is closely related to our masking scheme. Differing from their work, which simply uses chunking to obtain general phrases, we build a high-quality domain phrase pool to assist learning domain-oriented phrase knowledge. Also, we propose a novel phrase regularization term over the reconstruction loss to encourage complete phrase learning. Moreover, we combine word and phrase learning cohesively according to their optimization progress, achieving better performance than either single mode.

Domain-oriented PLMs. To adapt PLMs to specific domains, several domain-oriented BERTs, such as BioBERT (lee2020biobert), SciBERT (beltagy2019scibert), and TweetBERT (qudar2020tweetbert), have been proposed recently. BERT-PT (xu2019bert) proposes to post-train BERT on a review corpus and obtains better performance on the task of review reading comprehension. Gururangan et al. (gururangan2020don) propose an approach for post-training BERT on a domain corpus as well as a task corpus to obtain more performance gains on domain-specific tasks. DomBERT (xu2020dombert) proposes to select data from a mixed multi-domain corpus for the target domain, improving the diversity of domain language learning. More work along this line can be found in (rietzler2020adapt; ma2019domain). Similarly, incorporating domain knowledge has shown effectiveness in broader areas (yuan2020spatio; zhang2017efficient; sun2021market; zhang2019job2vec; Yuan_Liu_Hu_Zhang_Xiong_2021; li2018link) such as representation learning. The above solutions have primarily leveraged the domain corpus for pre-training in a straightforward way, without considering insightful domain characteristics and domain knowledge such as domain phrases and entity associations. Our work is the first to leverage auxiliary domain knowledge to enhance domain-oriented pre-training.

3. Preliminaries

In this section, we give a brief introduction to two essential concepts that are related to our work, namely, Masked Language Model and Optimal Transport.

Masked Language Model. Masked Language Model (MLM) (devlin2019bert) refers to the self-supervised pre-training task that has been applied in pre-trained language models (e.g., BERT, RoBERTa, etc.). It can be viewed as a fill-in-the-blank task, i.e., given an input sequence with 15% of tokens masked, it aims to predict those masked words using the embeddings generated by the language model:

$$\mathcal{L}_{\mathrm{MLM}} = \sum_{i \in \mathcal{M}} \log p\big(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\big), \qquad p\big(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}\big) = \mathrm{softmax}\big(W \, f_\theta(\mathbf{x}_{\setminus \mathcal{M}})_i\big), \tag{1}$$

where $f_\theta$ denotes the Transformer based language model, $\mathbf{x}$ is the full input sequence, $\mathcal{M}$ denotes the indices of all masked tokens in $\mathbf{x}$, $x_i$ indicates one of the tokens in $\mathbf{x}$, and $\setminus$ is set minus. $f_\theta(\mathbf{x}_{\setminus \mathcal{M}})_i$ denotes the output vector corresponding to the masked token $x_i$, and $W$ denotes the softmax matrix with the same number of entries as the vocabulary $V$.

Maximizing $\mathcal{L}_{\mathrm{MLM}}$ enforces $f_\theta$ to infer the meaning of masked words from their surroundings, in other words, preserving contextual semantics.

Optimal Transport and Wasserstein Distance. Optimal Transport (OT) studies the problem of transforming one probability distribution into another (e.g., one group of embeddings to another) at the lowest cost. When the "cost" is a distance, a commonly used distance metric for OT is the Wasserstein Distance (WD) (villani2008optimal). The formal definition is as follows (chen2020graph):

Definition 3.1 (Wasserstein Distance).

Let $\boldsymbol{\mu}, \boldsymbol{\nu}$ denote two discrete probability distributions, formulated as $\boldsymbol{\mu} = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\boldsymbol{\nu} = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ as the Dirac function centered on $x$. $\Pi(\boldsymbol{\mu}, \boldsymbol{\nu})$ denotes all the couplings (joint distributions) of $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$, with marginals $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$. The optimal Wasserstein Distance between the two distributions is defined as:

$$\mathcal{D}_{w}(\boldsymbol{\mu}, \boldsymbol{\nu}) = \inf_{\gamma \in \Pi(\boldsymbol{\mu}, \boldsymbol{\nu})} \mathbb{E}_{(x, y) \sim \gamma}\,[c(x, y)] = \min_{\mathbf{T} \in \Pi(\mathbf{u}, \mathbf{v})} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \cdot c(x_i, y_j), \tag{2}$$

where $\Pi(\mathbf{u}, \mathbf{v}) = \{\mathbf{T} \in \mathbb{R}_{+}^{n \times m} \mid \mathbf{T} \mathbf{1}_m = \mathbf{u},\; \mathbf{T}^{\top} \mathbf{1}_n = \mathbf{v}\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, and the weight vectors $\mathbf{u} = \{u_i\}_{i=1}^{n}$ and $\mathbf{v} = \{v_j\}_{j=1}^{m}$ belong to the $n$- and $m$-dimensional simplex, respectively (i.e., $\sum_{i=1}^{n} u_i = \sum_{j=1}^{m} v_j = 1$). $c(x_i, y_j)$ is the cost function evaluating the distance between $x_i$ and $y_j$ (samples of the two distributions). Computing the optimal distance (1st line) is equivalent to solving the network-flow problem (2nd line) (luise2018differential). The calculated matrix $\mathbf{T}$ denotes the "transport plan", where each element $T_{ij}$ represents the amount of mass shifted from $u_i$ to $v_j$. We propose an Optimal Transport based approach for the cross entity alignment problem in Section 4.2.
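To make the network-flow view above concrete, consider a minimal one-dimensional illustration (ours, not from the paper): for two uniform discrete distributions on the real line, the optimal transport plan simply matches points in sorted order, so the Wasserstein-1 distance reduces to the mean absolute difference of the sorted samples.

```python
# Toy 1-D optimal transport (illustration only). With uniform weights
# and an absolute-value cost, the optimal coupling is the monotone
# (sorted-order) matching, so W1 is the mean gap between sorted samples.

def wasserstein_1d(xs, ys):
    """W1 between uniform distributions over xs and ys (equal sizes)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # each point shifts by 1.0
```

In higher dimensions no such closed form exists, which is why Section 4.2 resorts to an iterative solver for the general problem in Eq. (2).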

4. Methodology

Figure 2. Framework overview.
Figure 3. Illustration of Adaptive Hybrid Masked Model. Based on the feedback losses, it adaptively switches between two learning modes, enabling the language model to learn word-level and phrase-level knowledge simultaneously.

In this section, we provide an in-depth introduction to our enhanced framework for domain-oriented language pre-training. Figure 2 presents an overview of the framework, consisting of two major improvements, i.e., the Adaptive Hybrid Masked Model (AHM) to replace MLM and a new weakly-supervised pre-training task, OT-based Cross Entity Alignment (CEA). The former leverages a domain corpus and a domain phrase pool to learn both word-level and phrase-level semantics, while the latter utilizes the same corpus and an entity association graph to obtain text pairs for augmenting domain semantic learning. Moreover, we employ continual multi-task pre-training (sun2020ernie) to jointly train AHM and CEA. Lastly, the model is fine-tuned to be deployed in domain-oriented applications.

4.1. Adaptive Hybrid Masked Model

In order to enhance the phrase perception ability of the language model while preserving its original word perception ability, we introduce a new masked language model, namely, the Adaptive Hybrid Masked Model (AHM). Specifically, we set two learning modes in AHM, i.e., word learning and phrase learning, which, in a nutshell, mask and then reconstruct word units and phrase units, respectively. Moreover, we combine the two learning modes by adaptively switching between them, enabling the model to capture word-level and phrase-level semantics simultaneously and progressively. Figure 3 provides an illustration of the model.

4.1.1. Word Learning Mode

In this mode, given an input sequence $\mathbf{x}^t$ (where $t$ denotes the iteration), we first randomly sample words from $\mathbf{x}^t$ iteratively until the selected words constitute 15% of all tokens. Then we replace them with: (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, (3) the original token 10% of the time. Next, we predict all the masked/perturbed tokens by feeding their embeddings from the language model to a shared softmax layer. Equivalently, we optimize the log-likelihood function below:

$$\mathcal{L}_{\mathrm{word}}(\mathbf{x}^t) = \sum_{i \in \mathcal{M}_w} \log p\big(x_i \mid \hat{\mathbf{x}}^t\big), \tag{3}$$

where $\mathcal{M}_w$ denotes the indices of all the masked/perturbed tokens in $\mathbf{x}^t$, and $x_i$ and $\hat{\mathbf{x}}^t$ denote the masked token and perturbed input, respectively. $p(\cdot)$ follows the definition in Eq. (1). This mode resembles the original masking scheme in MLM except that we only mask whole words. It helps to learn preliminary word-level semantics, which is not only the basis of language understanding but also essential for phrase learning.
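The 80/10/10 perturbation step can be sketched as follows (a minimal illustration assuming a plain token list and a string "[MASK]" placeholder; a real implementation operates on tokenizer ids):

```python
import random

MASK = "[MASK]"

def perturb(tokens, positions, vocab, rng):
    """Replace each selected position with [MASK] 80% of the time,
    a random vocabulary token 10%, or keep the original 10%."""
    out = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            out[i] = MASK
        elif r < 0.9:
            out[i] = rng.choice(vocab)
        # else: keep the original token (identity perturbation)
    return out

rng = random.Random(0)
tokens = ["the", "battery", "life", "is", "great"]
print(perturb(tokens, [1, 4], ["good", "screen", "phone"], rng))
```

Unselected positions are always left untouched; the loss is computed only over the selected positions.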

4.1.2. Phrase Learning Mode

In the phrase learning mode, we randomly mask consecutive tokens that constitute quality domain phrases and train the language model to reconstruct them. First, given an input sequence $\mathbf{x}^t$ and a domain phrase pool $\mathcal{P}$ (comprising high-quality phrases and their quality scores)111In this paper, we leverage AutoPhrase (shang2018automated) to obtain the domain phrase pool., following Algorithm 1, we detect domain phrases and sample to obtain tokens. Then, similar to the word mode, we replace the selected tokens with the [MASK] token 80% of the time, and with a random token and the original token 10% of the time each. Next, we optimize the following loss function to reconstruct the masked phrases:

$$\mathcal{L}_{\mathrm{phrase}}(\mathbf{x}^t) = \sum_{i \in \mathcal{M}_p} \log p\big(x_i \mid \hat{\mathbf{x}}^t\big) + \sum_{\mathcal{G} \in \mathcal{M}_g} \log p\big(s_{\mathcal{G}} \mid \hat{\mathbf{x}}^t\big), \tag{4}$$

$$p\big(s_{\mathcal{G}} \mid \hat{\mathbf{x}}^t\big) = \mathrm{softmax}\Big(W_p \cdot \tfrac{1}{|\mathcal{G}|} \textstyle\sum_{i \in \mathcal{G}} f_\theta(\hat{\mathbf{x}}^t)_i\Big), \tag{5}$$

where the first term is defined the same way as Eq. (1) and (3) except that $\mathcal{M}_p$ denotes the indices of all the masked tokens obtained via Algorithm 1. With the first term, we reconstruct masked phrases by predicting their tokens. Additionally, we propose a completeness regularization term (the second term) over the masked phrases to encourage complete phrase reconstruction, i.e., the model gets more reward when an entire phrase is correctly predicted. As defined in Eq. (5), where $\mathcal{M}_g$ also denotes the indices of masked tokens but grouped by phrases and $\mathcal{G}$ denotes one of the groups in $\mathcal{M}_g$, we first average all the token embeddings of a phrase to obtain the merged phrase feature. Then we predict each complete phrase $s_{\mathcal{G}}$, instead of the tokens in it, using its merged feature along with a new phrase softmax matrix $W_p$, whose entries correspond to the set of all phrases in the corpus.
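The mean-pool-then-classify step of the completeness term can be sketched as follows (the toy sizes and the name `W_p` are illustrative assumptions; in practice the phrase softmax matrix ranges over the whole corpus phrase set):

```python
import numpy as np

# Sketch of the completeness regularization: token embeddings of one
# masked phrase are averaged into a single feature, which is classified
# against a (toy) phrase vocabulary via a dedicated softmax matrix.

def phrase_logprob(H, group, W_p, target):
    """Log-probability of predicting the whole phrase `target` from the
    merged feature of its token positions `group`."""
    merged = H[group].mean(axis=0)        # average the token embeddings
    logits = W_p @ merged                 # phrase-level softmax logits
    logp = logits - np.log(np.exp(logits).sum())
    return logp[target]

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))      # contextual embeddings of a 6-token input
W_p = rng.normal(size=(10, 4))   # 10 phrases in the toy phrase set
print(phrase_logprob(H, [2, 3], W_p, target=7))
```

Maximizing this log-probability rewards the model only when the phrase as a whole is identified, complementing the per-token term.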

Figure 4. Illustration of the OT based approach for learning the word-level alignments for entity association based text pair.
0:  Input: an input sequence $\mathbf{x}$; the domain phrase pool $\mathcal{P}$.
0:  Output: token indices of domain phrases, denoted by $\mathcal{M}_p$; token indices grouped by phrases, denoted by $\mathcal{M}_g$.
1:  Detect phrases in $\mathbf{x}$ that intersect with $\mathcal{P}$222Fulfilled via a rule-based phrase matcher., denoted by $\mathcal{S}$;
2:  Retrieve their quality scores from $\mathcal{P}$;
3:  Normalize all the scores by softmax;
4:  Let count $\leftarrow 0$, $\mathcal{M}_p \leftarrow \emptyset$, $\mathcal{M}_g \leftarrow \emptyset$;
5:  while count / num_token($\mathbf{x}$) < 15% do
6:     Sample a phrase $s$ from $\mathcal{S}$ based on the normalized scores;
7:     Add the indices of all tokens in $s$ into $\mathcal{M}_p$;
8:     Add the indices in $s$ as a list into $\mathcal{M}_g$;
9:     count $\leftarrow |\mathcal{M}_p|$;
10:  end while
11:  Return $\mathcal{M}_p$, $\mathcal{M}_g$.
Algorithm 1 Token sampling algorithm for the phrase mode.
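Algorithm 1 can be sketched in Python as follows (assumptions: phrases are detected by exact substring matching over the token list, standing in for the rule-based matcher, and the function name is ours):

```python
import math
import random

def sample_phrase_tokens(tokens, phrase_pool, rng, budget=0.15):
    """Sample phrase occurrences, weighted by softmax-normalized quality
    scores, until ~`budget` of the tokens are selected for masking."""
    # Step 1: detect pool phrases occurring in the sequence
    hits = []  # (start index, phrase token list, quality score)
    for phrase, score in phrase_pool.items():
        p = phrase.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] == p:
                hits.append((i, p, score))
    if not hits:
        return [], []
    # Steps 2-3: softmax-normalize the quality scores
    z = [math.exp(s) for _, _, s in hits]
    probs = [w / sum(z) for w in z]
    # Steps 4-10: sample phrases until the token budget is met
    indices, groups = set(), []
    for _ in range(100):  # guard against exhausting distinct phrases
        if len(indices) / len(tokens) >= budget:
            break
        k = rng.choices(range(len(hits)), weights=probs)[0]
        start, p, _ = hits[k]
        span = list(range(start, start + len(p)))
        if not indices.issuperset(span):  # skip already-covered phrases
            indices.update(span)
            groups.append(span)
    return sorted(indices), groups
```

For example, on an 8-token sequence a single sampled two-token phrase already exceeds the 15% budget, so exactly one phrase group is returned.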

4.1.3. Adaptive Hybrid Learning

As both word-level and phrase-level semantics are critical to language modeling, we combine the two learning modes via a dynamic parameter $\lambda$ based on their feedback losses. At each iteration, as shown in Figure 3, the model automatically selects the weaker mode according to the value of $\lambda$.

Calculating $\lambda$. We calculate $\lambda$ based on the relative loss reduction speed of the two modes. Specifically, at each iteration $t$ (assuming $t > 1$), we first calculate a special variable for both modes to track their fitting progress, i.e., $v_w^t$ and $v_p^t$. The larger $v_w^t$ ($v_p^t$) is, the less sufficiently the model is trained on the word (phrase) mode. Then $\lambda$ for the next iteration is calculated as the rescaled ratio of $v_w^t$ and $v_p^t$, i.e.,

$$v_w^t = \frac{\ell_w^{t-1} - \ell_w^{t}}{\ell_w^{0} - \ell_w^{t}}, \qquad v_p^t = \frac{\ell_p^{t-1} - \ell_p^{t}}{\ell_p^{0} - \ell_p^{t}}, \qquad \lambda^{t+1} = \tanh\Big(\frac{v_w^t}{v_p^t}\Big), \tag{6-8}$$

where $\ell_w^t$ denotes the loss of the word learning mode and is only updated if the word mode is selected at the $t$-th iteration. $\ell_w^{t-1} - \ell_w^{t}$ denotes the loss reduction of the word mode between the current and last iteration, and $\ell_w^{0} - \ell_w^{t}$ denotes the total loss reduction. $\ell_p^t$, $\ell_p^{t-1} - \ell_p^{t}$, and $\ell_p^{0} - \ell_p^{t}$ represent the same variables for the phrase mode. Thus, $v_w^t$ and $v_p^t$ indicate the relative loss reduction speed of the two modes respectively, and their ratio $v_w^t / v_p^t$ reflects the relative importance of the word mode. The non-linear function $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ is used to rescale the non-negative ratio to [0, 1).
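The λ computation can be sketched as follows (a hedged reconstruction consistent with the description above, not a verbatim reproduction: we take v as the ratio of the latest loss drop to the total loss drop so far, and λ = tanh(v_word / v_phrase), with small epsilons guarding against division by zero):

```python
import math

def fitting_speed(prev_loss, curr_loss, init_loss):
    """Ratio of the latest loss reduction to the total reduction so far.
    Large values mean the mode is still improving fast, i.e., is
    insufficiently trained."""
    total = max(init_loss - curr_loss, 1e-8)
    return max(prev_loss - curr_loss, 0.0) / total

def adaptive_lambda(word_losses, phrase_losses):
    """word_losses / phrase_losses: loss histories [l_0, ..., l_{t-1}, l_t]."""
    v_w = fitting_speed(word_losses[-2], word_losses[-1], word_losses[0])
    v_p = fitting_speed(phrase_losses[-2], phrase_losses[-1], phrase_losses[0])
    return math.tanh(v_w / max(v_p, 1e-8))

# Word mode still dropping fast, phrase mode nearly flat: lambda -> 1,
# so the framework keeps training the (weaker) word mode.
print(adaptive_lambda([4.0, 3.0, 2.0], [4.0, 3.0, 2.95]))
```

Swapping the two histories drives λ toward 0, handing training over to the phrase mode instead.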

Loss Function of AHM. The overall loss function of AHM combines the losses of the two learning modes, with weights dynamically adjusted by $\lambda$, i.e.,

$$\mathcal{L}_{\mathrm{AHM}} = \sum_{\mathbf{x} \in \mathcal{D}} \mathbb{1}(\lambda)\, \mathcal{L}_{\mathrm{word}}(\mathbf{x}) + \big(1 - \mathbb{1}(\lambda)\big)\, \mathcal{L}_{\mathrm{phrase}}(\mathbf{x}), \tag{9}$$

where $\mathcal{D}$ represents the training corpus and $\mathbb{1}(\lambda) \in \{0, 1\}$ denotes the indicator function selecting the word mode. As can be seen, when $\lambda$ is large, $\mathbb{1}(\lambda) = 1$ and the word mode becomes dominating, and vice versa. In other words, $\lambda$ is able to control the model to switch to the weaker learning mode adaptively.

4.2. OT-based Cross Entity Alignment

To exploit the co-occurrence signals hidden in entity associations, we formulate a new pre-training task, i.e., Cross Entity Alignment (CEA), as defined below. We first exploit the entity association graph to extract a collection of associated text pairs from the domain corpus as training data. Next, an Optimal Transport (OT) based approach is introduced to train CEA effectively.

Definition 4.1 (Cross Entity Alignment).

Given two paired entity contents denoted by word sequences $\mathbf{x} = \{x_i\}_{i=1}^{n}$ and $\mathbf{y} = \{y_j\}_{j=1}^{m}$, Cross Entity Alignment aims to learn a word-level alignment matrix $\mathbf{A} \in \mathbb{R}_{+}^{n \times m}$, where $A_{ij}$ indicates the correlation of $x_i$ and $y_j$ (s.t. $\sum_{i,j} A_{ij} = 1$).

The task is challenging due to the lack of a groundtruth alignment matrix. A common solution to this problem involves designing advanced attention mechanisms to simulate soft alignment. However, the learned attention matrices are often too dense and lack interpretability, inducing less effective alignment learning. On the other hand, OT possesses an ideal sparsity property that makes it a good choice for cross-domain alignment problems (chen2020graph). Specifically, when solved exactly, OT yields a sparse solution containing at most $2r - 1$ non-zero elements, where $r = \max(n, m)$, leading to a more interpretable and robust alignment. Hence, we propose an OT-based approach to address CEA. Figure 4 presents an overview of our Optimal Transport based approach for CEA. Concretely, we follow the procedures below to fulfill it.

Content Embeddings and Cost Matrix. Given the entity pair $(\mathbf{x}, \mathbf{y})$, we first feed their content texts into the language model (Transformer) respectively to get the contextual embeddings, denoted by $\mathbf{E}_x \in \mathbb{R}^{n \times d}$ and $\mathbf{E}_y \in \mathbb{R}^{m \times d}$. Then we calculate a cost matrix $\mathbf{C} \in \mathbb{R}^{n \times m}$, where $C_{ij}$ defines the cost (distance) of shifting one unit of mass from $\mathbf{E}_x^{(i)}$ to $\mathbf{E}_y^{(j)}$, and we use cosine distance as the cost function, i.e., $C_{ij} = 1 - \frac{\mathbf{E}_x^{(i)\top} \mathbf{E}_y^{(j)}}{\|\mathbf{E}_x^{(i)}\| \, \|\mathbf{E}_y^{(j)}\|}$.
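The cost-matrix step can be sketched as follows (the function name is ours; embeddings are row-wise L2-normalized so the dot product gives cosine similarity):

```python
import numpy as np

# Pairwise cosine-distance cost matrix between two embedding sets:
# E1 is n x d, E2 is m x d, the result is n x m.

def cost_matrix(E1, E2):
    E1 = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2 = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    return 1.0 - E1 @ E2.T  # C[i, j] = 1 - cos(E1[i], E2[j])

E1 = np.array([[1.0, 0.0], [0.0, 1.0]])
E2 = np.array([[1.0, 0.0]])
print(cost_matrix(E1, E2))  # identical directions cost 0, orthogonal cost 1
```

Low-cost entries mark the word pairs that the transport plan below will favor.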

Computing Transport Plan as Alignments. Next, by regarding the two sets of content embeddings as two probability distributions, we calculate the optimal transport plan $\mathbf{T}^{*}$ of transforming one distribution into the other. Here $\mathbf{T}^{*}$ is obtained via substituting $\mathbf{C}$ into Eq. (2), i.e.,

$$\mathbf{T}^{*} = \underset{\mathbf{T} \in \Pi(\mathbf{u}, \mathbf{v})}{\arg\min} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, C_{ij}, \tag{10}$$

where each element $T_{ij}$ in $\mathbf{T}^{*}$ denotes how much mass should be shifted from $\mathbf{E}_x^{(i)}$ to $\mathbf{E}_y^{(j)}$. To be noted, the value of $T_{ij}$ is automatically optimized to be smaller if $\mathbf{E}_x^{(i)}$ and $\mathbf{E}_y^{(j)}$ are not strongly correlated, i.e., have a high cost value $C_{ij}$. In other words, $T_{ij}$ actually reflects the strength of the correlation between the word-level content pair across the two products. Therefore, after being jointly optimized with the language model, we use $\mathbf{T}^{*}$ as the approximation to the alignment matrix $\mathbf{A}$.

Efficient Solver: IPOT. Unfortunately, it is computationally intractable (arjovsky2017wasserstein; salimans2018improving) to compute the exact minimization over $\mathbf{T}$. Hence, to ensure efficient training on the large neural networks of language models, we apply the recently introduced Inexact Proximal point method for Optimal Transport (IPOT) algorithm (xie2020fast) to compute the optimal transport plan $\mathbf{T}^{*}$. IPOT approximates the exact solution by iteratively solving the following optimization problem:

$$\mathbf{T}^{(k+1)} = \underset{\mathbf{T} \in \Pi(\mathbf{u}, \mathbf{v})}{\arg\min} \; \sum_{i,j} T_{ij}\, C_{ij} + \beta \cdot D_{\mathrm{KL}}\big(\mathbf{T} \,\|\, \mathbf{T}^{(k)}\big), \tag{11}$$

where $D_{\mathrm{KL}}(\mathbf{T} \| \mathbf{T}^{(k)})$ is the proximity metric term used to penalize solutions that are too distant from the latest approximation $\mathbf{T}^{(k)}$. We do not choose the Sinkhorn algorithm (cuturi2013sinkhorn) to address the efficiency issue as it is too sensitive to the choice of its hyper-parameter in our experiments.
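A compact sketch of the IPOT recursion (following xie2020fast, under the simplifying assumptions of uniform marginals and one inner Sinkhorn-style step per outer iteration; the hyper-parameters are illustrative):

```python
import numpy as np

def ipot(C, beta=1.0, n_iter=200):
    """Approximate the optimal transport plan for cost matrix C with
    uniform marginals, via inexact proximal point iterations."""
    n, m = C.shape
    u, v = np.ones(n) / n, np.ones(m) / m
    G = np.exp(-C / beta)          # proximal kernel
    T = np.ones((n, m))
    b = np.ones(m) / m
    for _ in range(n_iter):
        Q = G * T                  # elementwise product
        a = u / (Q @ b)            # one Sinkhorn-style scaling step
        b = v / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return T

# Two tokens on each side; matching i<->i is free, crossing costs 1.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
T = ipot(C)
print(T.round(3))  # mass concentrates on the zero-cost diagonal
```

Unlike Sinkhorn, whose result depends on the entropic regularization weight, IPOT converges toward the unregularized plan, so β mainly affects speed rather than the solution.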

Loss Function of CEA. Lastly, we train the language model by optimizing the OT distance (i.e., Wasserstein distance) between the aligned content embeddings, with the overall loss function defined as:

$$\mathcal{L}_{\mathrm{CEA}} = \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_a} \sum_{i=1}^{n} \sum_{j=1}^{m} T^{*}_{ij}\, C_{ij}, \tag{12}$$

where $\mathcal{D}_a$ denotes the set of entity-association-based text pairs.

5. Experiments

In this section, we conduct extensive experiments in the e-commerce domain to validate the effectiveness of the proposed framework. We first introduce the external and internal baselines compared in the paper. Next, we present the corpus as well as the auxiliary domain knowledge data used during pre-training. We then elaborate on the downstream tasks (definitions, datasets, performance metrics) used to evaluate all the models. Lastly, we report the main performance comparison, ablation studies, case studies, and visualization of the OT-based alignments.

5.1. Baseline Models

External Baselines. In this paper, we compare our framework to the following external baselines. (1) BERT: The vanilla BERT which is pre-trained on large-scale open-domain corpora by huggingface. (2) BERT-PT (xu2019bert): The vanilla BERT that is further post-trained on review data. This can be considered as the domain-oriented vanilla BERT. (3) BERT-NP: The vanilla BERT using a different masking strategy, i.e., masking noun phrases instead of words. We contrast this method with an internal baseline (DPM) to reveal the effects of different phrase selection schemes. (4) SpanBERT (joshi2020spanbert): A variant of BERT which masks spans of tokens instead of individual tokens. We compare with it to further validate the effect of different masking schemes. (5) RoBERTa (liu2019roberta): A robustly optimized variant of BERT which removes the Next Sentence Prediction task. (6) ALBERT (lan2019albert): A memory-efficient lite BERT that also achieves high performance. To make baselines (2)-(6) domain-oriented, like most existing work, we pre-train them on the same domain corpus as our method (except BERT, which is kept as-is to validate the effects of using the domain corpus).

Internal Baselines. For ablation studies (validating the effects of each component in the framework), we further compare with the following internal baselines: (1) DPM: The vanilla BERT that only masks domain phrases using our phrase pool, abandoning word masking. (2) DPM-R: The vanilla BERT that only masks domain phrases and further employs the phrase regularization term, abandoning word masking. (3) HM-R: The vanilla BERT that masks domain phrases and words in a hybrid way (50%/50% of the time) and employs the phrase regularization term. (4) AHM: The Adaptive Hybrid Masked Model, without leveraging entity association knowledge via Cross Entity Alignment. (5) AHM+CEA: The full version of our framework, combining AHM and OT-based CEA via continual multi-task learning. All internal baselines are pre-trained on the same domain corpus.

5.2. Domain-oriented Tasks and Metrics

We perform evaluations on four tasks of e-commerce. The definition, fine-tuning head and metric of each task is provided below.

Review Question Answering (Review QA). Given a question $q$ about a product and a related review snippet $r$, the task aims to find the span from $r$ that answers $q$. We employ the same BERT fine-tuning head (devlin2019bert) as that used for span-based QA, which maximizes the log-likelihoods of the correct start and end positions of the answer.

Review Aspect Extraction (Review AE). Given a review, the task aims to find product aspects that reviewers have expressed opinions on. It is typically formalized as a sequence labeling task (xu2019bert), in which each token is classified with one of the labels {B, I, O}, and the token spans starting with B and continued by I are considered extracted aspects. Following (xu2019bert), we apply a dense layer and a softmax layer on top of the BERT output embeddings to predict the sequence labels.

Review Aspect Sentiment Classification (Review ASC). Given an aspect $a$ and the review sentence $s$ it was extracted from, this task aims to classify the sentiment polarity (positive, negative, or neutral) expressed on aspect $a$. For fine-tuning, following (xu2019bert), both $a$ and $s$ are input into our framework, and we feed the [CLS] token to a dense layer and a softmax layer to predict the polarity. The training loss is the cross entropy on the polarities.

Product Title Categorization (PTC). Given a product title, the task aims to classify it into a predefined category collection. Each title may belong to multiple categories, making this a multi-label classification problem. We feed the embedding of the [CLS] token to a dense layer and the multi-label classification head for fine-tuning.

Evaluation Metrics. For review QA, we adopt the standard evaluation script from SQuAD 1.1 (rajpurkar2016squad) to report Precision, Recall, F1 scores, and Exact Match (EM). To evaluate review AE, we report Precision, Recall, and F1 score. For review ASC, we report Macro-F1 and Accuracy following (xu2019bert). Lastly, we adopt Accuracy (Acc), and Macro-F1 to evaluate product title categorization.

5.3. Experimental Datasets

5.3.1. Pre-training Resources

In this paper, we collect and leverage a domain corpus and two domain knowledge datasets. Table 3 shows the dataset statistics, and below we present the detailed collection steps.

Domain Corpus. We extract millions of product titles, descriptions, and reviews from the Amazon Dataset (ni2019justifying) to build this corpus. The entire corpus consists of two sub-corpora, i.e., a product corpus and a review corpus. In the first, each line corresponds to a product title and its description, while in the second, each line corresponds to a user comment on a specific product. The corpus serves as the foundation for language models to learn essential semantics of the e-commerce domain.

Domain Phrase Pool. To build the e-commerce domain phrase pool, we extract one million domain phrases from the above corpus leveraging AutoPhrase333, a highly efficient phrase mining algorithm that generates a quality score for each phrase based on corpus-level statistics such as popularity, concordance, informativeness, and completeness. Moreover, we filter out phrases with a score lower than 0.5 to keep quality domain phrases. Table 4 shows the top-ranked phrases from six product categories.

Table 3. The statistics of the pre-training datasets.

Category | Representative phrases
Automotive | jumper cables, cometic gasket, angel eyes, drink holder, static cling
Clothing, Shoes and Jewelry | high waisted jean, nike classic, removable tie, elegant victorian, vintage grey
Electronics | ipads tablets, SDHC memory card, memory bandwidth, auto switching
Office Products | decorative paper, heavy duty rubber, mailing labels, hybrid notebinder
Sports and Outdoors | basketball backboard, table tennis paddle, string oscillation, fishing tackles
Toys and Games | hulk hogan, augmented reality, teacup piggies, beam sabers, naruto uzumaki
Table 4. High-quality phrases of the e-commerce domain.

Entity Association Graph. We build this graph to store product entity associations in the form of associated entity pairs. In this paper, we only consider the "substitutable" association and use a shopping-pattern-based heuristic method (mcauley2015inferring) to extract product pairs with this relation. We exploit all entity pairs in this graph to extract the same number of associated text (title, description) pairs from the product corpus for the CEA task.

Figure 5 presents the overlap between phrases sampled from the same domain corpus by our phrase-pool-based scheme and by the chunking-based scheme. Results are reported for nine categories of the product corpus. Each entry represents the ratio of overlapped phrases to all chunking-based phrases. As can be seen, the overlap ratio is relatively low across all sub-categories, indicating that our phrase-pool-based scheme yields more domain-oriented phrases.
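The overlap statistic reported in Figure 5 can be sketched as below; the example phrase sets are illustrative only:

```python
def overlap_ratio(pool_phrases, chunk_phrases):
    """Ratio of phrases produced by both schemes to all chunking-based phrases."""
    chunk_phrases = set(chunk_phrases)
    if not chunk_phrases:
        return 0.0
    return len(set(pool_phrases) & chunk_phrases) / len(chunk_phrases)

# Toy example: the two schemes agree on one of three chunking-based phrases.
r = overlap_ratio({"jumper cables", "memory bandwidth"},
                  {"memory bandwidth", "the product", "a great"})
```

A low ratio per category means most phrases kept by the domain phrase pool are ones generic chunking would not have selected.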

5.3.2. Task-specific Datasets

For review QA, we evaluate on a newly released Amazon QA dataset (miller2020effect), consisting of 8,967 product-related QA pairs. For review AE and review ASC, we employ the laptop dataset of SemEval 2014 Task 4 (pontiki2016semeval), which contains 3,845 review sentences, 3,012 annotated aspects, and their sentiment polarities. For product title categorization, we create an evaluation dataset by extracting a subset of Amazon metadata, consisting of 10,039 product titles and 98 categories. The first three datasets are publicly available from prior works, and we will share the fourth dataset in the future. We divide each dataset into training/validation/testing sets with a ratio of 7:1:2.
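A minimal sketch of the 7:1:2 split (the helper name and seed are our own, not specified in the paper):

```python
import random

def split_712(examples, seed=42):
    """Shuffle and split a dataset into train/val/test with ratio 7:1:2."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # deterministic shuffle for reproducibility
    n = len(data)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_712(range(100))
```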

Figure 5. The overlap of different phrase sampling schemes.
Table 5. Performance comparison of baselines and our model on the e-commerce downstream tasks (%).

5.4. Implementation Details

Pre-training details. All models are initialized with the same pre-trained BERT (the bert-base-uncased model by Huggingface, with 12 layers, 768 hidden dimensions, 12 attention heads, and 110M parameters). We post-train all models (except BERT) on the domain corpus for 20 epochs, with batch size 32 and learning rate 1e-5. For our framework, we adopt Continual Multi-task Learning (sun2020ernie) to combine AHM and CEA. Specifically, we first train AHM alone on the entire corpus for epochs with the same batch size and learning rate. Then, we train AHM and CEA jointly on the product corpus (with instances reformatted as text pairs by entity associations) for another epochs. In AHM, to initialize and ensure stable training, we fix for t=11,000 (the word learning mode is easier and provides preliminary knowledge, hence we weight it more in the initial iterations). For training the OT-based CEA, we set in the IPOT algorithm. All pre-training is performed on a computational cluster with 8 NVIDIA GTX-1080-Ti GPUs and takes about 20 days.

Fine-tuning Details. In each task, we adopt the same task-specific architecture (task head) as described above for all models. We choose the learning rate and number of epochs from {5e-6, 1e-5, 2e-5, 5e-5} and {2, 3, 4, 5}, respectively. For each task and each model, we pick the best learning rate and number of epochs on the development set and report the corresponding test results. We found that the setting that works best across most tasks and models is 2 or 4 epochs with a learning rate of 2e-5. Results are reported as averages of 10 runs.
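The per-task grid search can be sketched as follows; `train_eval_fn` is a hypothetical callback that fine-tunes a model with the given hyperparameters and returns its dev-set score:

```python
from itertools import product

def grid_search(train_eval_fn,
                lrs=(5e-6, 1e-5, 2e-5, 5e-5),
                epoch_opts=(2, 3, 4, 5)):
    """Exhaustively try every (learning rate, epochs) pair and keep the one
    with the highest dev-set score."""
    best, best_score = None, float("-inf")
    for lr, ep in product(lrs, epoch_opts):
        score = train_eval_fn(lr, ep)
        if score > best_score:
            best, best_score = (lr, ep), score
    return best, best_score

# Fake scorer peaking at lr=2e-5, 4 epochs, standing in for real fine-tuning.
fake = lambda lr, ep: -abs(lr - 2e-5) * 1e6 - abs(ep - 4)
best, best_score = grid_search(fake)
```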

5.5. Experimental Results

5.5.1. Main Results Analysis

Table 5 presents the performance comparison of all the baselines and our framework on the four tasks. The key observations and conclusions are: (1) Our framework (AHM, AHM+CEA) outperforms all the external baselines by a large margin (4.1% on average), indicating the effectiveness of our general idea, i.e., leveraging auxiliary domain knowledge to enhance domain-oriented language modeling. (2) BERT-PT outperforms BERT, showing that for domain-oriented tasks, capturing domain semantics by pre-training on a domain corpus is necessary. (3) The effects of different masking schemes: BERT-NP and SpanBERT consistently outperform BERT-PT, indicating the advantage of phrase/span-based masking over word-based masking. (4) The effects of different phrase selection schemes: DPM achieves larger improvements than BERT-NP and SpanBERT, confirming that domain-phrase-pool-based sampling outperforms general chunking-based phrase sampling. We attribute this to the fact that the domain phrase pool, serving as a "supervisor", enables the language model to "focus" more on domain-oriented phrases, and these phrases have a greater effect on the downstream tasks.

5.5.2. Ablation Studies

The bottom of Table 5 shows the performance comparison of the internal baselines. As can be seen: (1) DPM-R outperforms DPM, validating the effectiveness of the proposed phrase regularization term. Compared with reconstructing phrases token by token, it encourages complete phrase reconstruction, leading to more accurate phrase perception learning. (2) HM-R, which uses hybrid masking in a straightforward way, achieves slightly better performance than DPM-R. Moreover, AHM achieves larger improvements over DPM-R than HM-R does. This indicates that both word learning and phrase learning are essential for language models, and that the adaptive hybrid learning method is a more solid way to combine them. (3) AHM+CEA further improves performance by 0.5%–1.2% over AHM on the four tasks, confirming the effectiveness of our idea of leveraging entity association knowledge to augment semantic learning. Moreover, the proposed OT-based alignment pre-training task can successfully exploit the hidden co-occurrence signals in entity-association-based text pairs.

5.5.3. Case Studies and Visualizations

Table 6 shows a case study of the review aspect extraction task. We compare our model with BERT-PT; both are pre-trained on the same domain corpus and employ the same fine-tuning architecture and task-specific dataset. As can be seen, for "aspects" that span multiple words, our model offers better predictions than BERT-PT in terms of phrase completeness ("size of the screen" vs. "screen"). This indicates that our model indeed possesses the fine-grained phrase perception ability needed for phrase-intensive tasks.

Table 6. Case studies of Aspect Extraction (AE). Given a review, the task aims to extract the specific product "aspects" being discussed. Ground-truth answers are marked in color. For answers consisting of multi-word phrases, our model makes more comprehensive predictions than BERT.

Figure 6 presents the visualization of the optimal transport alignment for two product pairs, where darker color indicates stronger correlation. Example (a) compares a Mandoline Slicer and a Steel Chopper; example (b) compares a docking station and a Dell monitor. As can be seen, in both examples the OT alignments are sparser and offer better interpretability, with meaningful word alignment pairs discovered automatically (Slicer vs. Chopper, Vegetables vs. Veggies, Monitor vs. Display). This confirms that the OT-based alignment task can indeed benefit semantic learning by automatically correlating similar words/phrases across entity pairs.

(a) Two kitchen products. (b) Two electronic products.
Figure 6. Visualizing the optimal transport plan in two real examples.
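For reference, a transport plan like the ones visualized above can be produced by an inexact proximal point OT iteration in the style of IPOT. The sketch below uses one inner Sinkhorn-style scaling per outer step; the proximal weight `beta` and iteration count are illustrative, not the paper's settings:

```python
import numpy as np

def ipot(cost, mu, nu, beta=0.5, n_iter=50):
    """IPOT-style solver: returns a transport plan T whose row sums
    approximate mu and column sums approximate nu, concentrating mass
    on low-cost (i.e., semantically similar) word pairs."""
    n, m = cost.shape
    T = np.ones((n, m)) / (n * m)      # uniform initial plan
    G = np.exp(-cost / beta)           # proximal kernel
    b = np.ones(m) / m
    for _ in range(n_iter):
        Q = G * T                      # fold current plan into the kernel
        a = mu / (Q @ b)               # scale rows toward marginal mu
        b = nu / (Q.T @ a)             # scale columns toward marginal nu
        T = a[:, None] * Q * b[None, :]
    return T

# Toy 2x2 problem: word 0 of one title is "cheap" to align with word 0
# of the other, and likewise for word 1, so mass should settle on the diagonal.
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
T = ipot(cost, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Visualizing `T` as a heatmap (rows: words of one title, columns: words of the other) yields exactly the kind of sparse alignment matrix shown in Figure 6.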

6. Conclusion

In this paper, we investigated how to improve domain-oriented language modeling by leveraging auxiliary domain knowledge. Specifically, we proposed a generalized pre-training framework that enhances existing work from two perspectives. First, we developed the Adaptive Hybrid Masked Model (AHM) to incorporate auxiliary domain phrase knowledge. Second, we designed Cross Entity Alignment (CEA) to leverage entity association as weak supervision for augmenting the semantic learning of pre-trained models. Without loss of generality, we performed experimental validation on four downstream e-commerce tasks. The results showed that incorporating phrase knowledge via AHM improves performance on all tasks, especially the phrase-intensive ones. Furthermore, utilizing entity association knowledge via CEA brings additional gains, and the learned alignments reveal meaningful semantic correlations across word pairs.