Text classification is one of the most important tasks in natural language processing (NLP). However, many text datasets are unlabelled. Moreover, text data is often domain-dependent, and it is difficult to obtain annotated data for every domain of interest. To handle this, researchers apply domain adaptation techniques to text classification. The study [blitzer2007biographies] first applied the structural correspondence learning (SCL) algorithm [blitzer2006domain] to cross-domain sentiment classification. In [pan2010cross], the spectral feature alignment (SFA) algorithm is proposed to reduce the gap between the domain-specific words of the two domains. The study [bollegala2015cross] modeled the cross-domain classification task as an embedding learning problem.
Recently, deep adversarial networks [goodfellow2014generative] have achieved success across many tasks. The domain-adversarial neural network (DANN) structure proposed by [ganin2016domain] outperforms traditional approaches on domain adaptation tasks in sentiment analysis. It employs a domain classifier to learn domain-invariant features. Several studies have extended DANN to multi-source scenarios [liu2017adversarial, zhao2018multiple, chen2018multinomial]. However, they all assume that the label proportions across the domains remain unchanged, an assumption that is often violated in real-world tasks. Changes in the label distribution are known as prior probability shift or label shift, and they prevent the DANN from learning domain-invariant features. In [zhang2013domain], the author applies kernel mean matching (KMM) methods to estimate the label shift. Another attempt to quantify the shift is Black Box Shift Estimation (BBSE) [lipton2018detecting, azizzadenesheli2019regularized], which obtains accurate estimates on high-dimensional datasets. Recently, [li2019target] proposed to address the label shift problem by using distribution matching to estimate label proportions.
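For intuition, the BBSE idea mentioned above can be written in a few lines of numpy: it inverts the source confusion matrix against the target prediction distribution to recover the target priors. This is a minimal sketch under the usual BBSE assumptions (invertible confusion, shared conditional $p(\hat{Y}\mid Y)$); the function name and interface are illustrative, not from the cited papers.

```python
import numpy as np

def bbse_estimate(y_true_val, y_pred_val, y_pred_target, n_classes):
    """Estimate target label proportions via the BBSE idea.

    Solves C w = mu_t, where C[i, j] = P_source(y_hat = i, y = j) is the
    joint confusion on held-out source data and mu_t[i] = P_target(y_hat = i).
    The solution w[j] ~ p_T(y = j) / p_S(y = j) yields the target priors
    after multiplying by the source priors.
    """
    n = len(y_true_val)
    C = np.zeros((n_classes, n_classes))
    for yt, yp in zip(y_true_val, y_pred_val):
        C[yp, yt] += 1.0 / n                     # joint confusion estimate
    mu_t = np.bincount(y_pred_target, minlength=n_classes) / len(y_pred_target)
    w = np.linalg.solve(C, mu_t)                 # importance weights p_T(y)/p_S(y)
    p_s = np.bincount(y_true_val, minlength=n_classes) / n
    p_t = np.clip(w * p_s, 0.0, None)            # clip away small negatives
    return p_t / p_t.sum()                       # renormalise to a distribution
```

With a perfect validation classifier and balanced source labels, the estimate reduces to the target prediction frequencies, as expected.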
In this paper, we propose the domain adversarial network with label proportion estimation (DAN-LPE) framework, which learns domain-invariant features and estimates the target label proportions simultaneously. The proportion estimation uses only the validation confusion and the target label predictions as inputs. We reduce the label shift by re-weighting the samples in the domain classifier based on this estimate. In the experiments, we compare DAN-LPE with other algorithms and show that it leads in most of the tasks.
2 Problem Setup
Let $\mathcal{D}_S$ and $\mathcal{D}_T$ be the source and target domains defined on $\mathcal{X} \times \mathcal{Y}$, and let $f$ be a classifier. We use $X$ and $Y$ to denote the feature and label variables. The output of the classifier is denoted by $\hat{Y}$. We use $p_S$ and $p_T$ to indicate the probability density functions of $\mathcal{D}_S$ and $\mathcal{D}_T$. The source and target datasets are represented by $S$ and $T$. We split $S$ into a training set $S_{tr}$ and a validation set $S_{va}$. The prior distributions of $\mathcal{D}_S$ and $\mathcal{D}_T$ are given by $p_S(Y)$ and $p_T(Y)$.
The red box in Fig. 1 shows the DANN structure, consisting of a feature extractor $G_f$, a text classifier $G_y$ and a domain classifier $G_d$. We expect the feature extractor to capture features satisfying $p_S(G_f(X)) = p_T(G_f(X))$ with the help of $G_d$, which makes the feature distributions of the source and target domains indistinguishable via back-propagation with gradient reversal. However, the performance declines if the prior distributions of $\mathcal{D}_S$ and $\mathcal{D}_T$ differ substantially. To handle this, we implement a prior distribution estimator to estimate the target label proportions and correct the label shift by weighting the samples fed into $G_d$ according to the updated proportions.
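The gradient reversal used by $G_d$ can be illustrated with a minimal, framework-free layer: identity in the forward pass, sign-flipped (and scaled) gradient in the backward pass. This is an illustrative sketch, not the authors' implementation; the class name and `lambda_` knob are ours.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda_ backward.

    This is the key trick in DANN: the domain classifier is trained to
    distinguish domains, while the reversed gradient pushes the feature
    extractor to make the domains indistinguishable.
    """
    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_

    def forward(self, x):
        # pass features through unchanged
        return x

    def backward(self, grad_output):
        # flip and scale the gradient flowing back to the feature extractor
        return -self.lambda_ * grad_output
```

In an autodiff framework this is usually registered as a custom op so the reversal happens transparently during back-propagation.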
3 Domain Adversarial Network With Label Proportion Estimation
3.1 Moments and Matrices Definition
We first define the validation set $S_{va} = \{(x_1, y_1), \dots, (x_m, y_m)\}$. The confusion matrix of $f$ on $\mathcal{D}_S$ and its plug-in estimate computed on $S_{va}$ are denoted by

$$c_{ij} = p_S(\hat{Y} = j \mid Y = i), \qquad \hat{c}_{ij} = \frac{|\{k : \hat{y}_k = j,\ y_k = i\}|}{|\{k : y_k = i\}|}.$$

The distribution of the label predictions on $T$ and its plug-in estimate are expressed by

$$q_j = p_T(\hat{Y} = j), \qquad \hat{q}_j = \frac{1}{|T|} \sum_{x \in T} \mathbb{1}[f(x) = j].$$
3.2 Label Proportions Estimation
The crucial component of DAN-LPE is the way the label proportion estimate is updated. We define a vector $z$ to estimate $p_T(Y)$, and we assume that perfect domain-invariant features are learnt, so that $p_S(\hat{Y} \mid Y) = p_T(\hat{Y} \mid Y)$, which implies $q_j = \sum_i p_T(Y = i)\, c_{ij}$. When $z = p_T(Y)$, the equality $\sum_i z_i c_{ij} = q_j$ holds for every $j$; thus we propose the loss function

$$L(z) = \sum_j \Big( \sum_i z_i c_{ij} - q_j \Big)^2. \qquad (1)$$
After replacing $c_{ij}$ and $q_j$ with the plug-in estimates $\hat{c}_{ij}$ and $\hat{q}_j$, we get

$$\hat{L}(z) = \sum_j \Big( \sum_i z_i \hat{c}_{ij} - \hat{q}_j \Big)^2. \qquad (2)$$
Now we relate $\hat{L}$ only to the observable data. Writing $C = (c_{ij})$, $q = (q_1, \dots, q_K)^{\top}$ and $z = (z_1, \dots, z_K)^{\top}$, we have $L(z) = \| C^{\top} z - q \|_2^2$, and we conclude the following implication: assuming $p_S(\hat{Y} \mid Y) = p_T(\hat{Y} \mid Y)$ and that $C$ is an invertible matrix, $L(z) = 0$ is achieved if and only if $z = p_T(Y)$.
By computing the gradient we can derive

$$\frac{\partial L}{\partial z_i} = 2 \sum_j c_{ij} \Big( \sum_k z_k c_{kj} - q_j \Big). \qquad (3)$$
The proposed prior estimate $z$ is updated by gradient descent using Equation (3). However, since $z$ is constrained by $\sum_i z_i = 1$, we apply projected gradient descent

$$z_i \leftarrow z_i - \alpha_z \Big( \frac{\partial L}{\partial z_i} - \frac{1}{K} \sum_k \frac{\partial L}{\partial z_k} \Big), \qquad (4)$$

where $\alpha_z$ is the learning rate for updating $z$. To avoid negative proportion estimates, we also set a lower bound $\epsilon$ such that $z_i \geq \epsilon$. Once $z_i < \epsilon$, we have

$$z_i \leftarrow \epsilon, \qquad z_j \leftarrow z_j \, \frac{1 - \epsilon}{\sum_{k \neq i} z_k} \ \text{ for } j \neq i, \qquad (5)$$

so that $z$ remains a valid probability distribution.
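The projected gradient update for the label proportion estimate can be sketched in numpy as follows. This is a sketch of the update logic only (one step of gradient descent projected onto the simplex with a lower bound); the function and argument names are ours.

```python
import numpy as np

def lpe_update(z, C_hat, q_hat, lr=0.1, eps=1e-3):
    """One projected-gradient step for the label-proportion estimate z.

    C_hat[i, j] ~ P_S(y_hat = j | y = i) from the validation confusion,
    q_hat[j]    ~ P_T(y_hat = j) from the target predictions.
    Minimises ||C_hat.T @ z - q_hat||^2 while keeping z on the simplex.
    """
    grad = 2.0 * C_hat @ (C_hat.T @ z - q_hat)
    z = z - lr * (grad - grad.mean())   # project gradient onto sum(z)=1 plane
    z = np.clip(z, eps, None)           # enforce the lower bound
    return z / z.sum()                  # renormalise to a distribution
```

Iterating this step with a well-conditioned confusion matrix drives $z$ toward the target prediction distribution mapped through $C^{-\top}$.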
We define $L_y$ and $L_d$ as the loss functions of $G_y$ and $G_d$. To eliminate the prior shift, we re-weight the samples from $S$ in $L_d$ based on their labels. Let $w_i = z_i / p_S(Y = i)$, where $p_S(Y)$ is the prior distribution of $\mathcal{D}_S$. For a mini-batch of size $2b$, in which the instances from $S$ and $T$ are $\{(x_k^s, y_k^s)\}_{k=1}^{b}$ and $\{x_k^t\}_{k=1}^{b}$, the sample weight of $(x_k^s, y_k^s)$ is $w_{y_k^s}$. So we compute $L_d$ by

$$L_d = \frac{1}{2b} \Big( \sum_{k=1}^{b} w_{y_k^s}\, H\big(G_d(G_f(x_k^s)), d_s\big) + \sum_{k=1}^{b} H\big(G_d(G_f(x_k^t)), d_t\big) \Big), \qquad (6)$$

where $H$ denotes the cross-entropy loss and $d_s$, $d_t$ are the source and target domain labels.
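The re-weighted domain loss can be sketched for the binary-domain case as below. This is a simplified numpy version with illustrative names; it assumes the domain classifier outputs the probability of the "source" label and that each source sample carries the weight $z_y / p_S(y)$ described above.

```python
import numpy as np

def weighted_domain_loss(p_domain_src, y_src, p_domain_tgt, z, p_s):
    """Cross-entropy domain loss with source samples re-weighted by label.

    p_domain_src / p_domain_tgt: predicted probability of "source" for
    source and target samples; y_src: class labels of the source samples;
    z: current target-prior estimate; p_s: source prior.
    Each source sample gets weight z[y] / p_s[y], so after weighting the
    source label distribution mimics the estimated target one.
    """
    w = z[y_src] / p_s[y_src]                      # per-sample label weights
    src_term = -(w * np.log(p_domain_src)).sum()   # source samples: label 1
    tgt_term = -np.log(1.0 - p_domain_tgt).sum()   # target samples: label 0
    return (src_term + tgt_term) / (len(y_src) + len(p_domain_tgt))
```

When $z$ equals the source prior, all weights are 1 and this reduces to the standard DANN domain loss.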
The complete pseudo-code of this learning procedure is given in Algorithm 1. In the first step, it alternates between training a domain adversarial net and performing label proportion estimation to obtain an estimate of the target prior distribution. During this procedure, the label shift effect is gradually reduced as the estimate of the label proportions improves, which helps the model derive better domain-invariant features. Since the label shift still matters in early epochs, we need a second step that performs general DANN training with the fixed estimate $z$ obtained in the first step and the modified loss function in (6).
The hyper-parameters of step 1 in Algorithm 1 are quite flexible. The number of iterations is suitable as long as the validation loss does not increase evidently. The role of the warm-up threshold is to guarantee that $z$ is updated only once a decent model has been trained. We update $z$ only every few iterations, which reduces the number of times $S_{va}$ and $T$ must be predicted and accelerates the process. The learning rate $\alpha_z$ controls how fast $z$ changes. DAN-LPE is not very sensitive to these hyper-parameters. When $z$ is fixed as the prior distribution of $\mathcal{D}_S$, step 1 of Algorithm 1 is equivalent to DANN.
4 Experiments
4.1 Experiments on Yelp Data
The Yelp Open Dataset [yelpdata2019] includes 192,609 businesses and 6,685,900 reviews from more than 20 categories. In each review a user expresses opinions about a business and gives a rating ranging from 1 to 5. We compute the average review rating of each business and assign a binary label: positive if the average rating is above an upper threshold and negative if it is below a lower threshold. Businesses with average ratings between the two thresholds are filtered out to create a gap between the classes. We select the data of Financial Services (F), Hotels & Travel (H), Beauty & Spas (B) and Pets (P) for the tasks. Their label distributions, which vary considerably across domains, are shown in Fig. 2a. We sample 2800 businesses from each domain, preserving the label proportions, and predict the class using their reviews. Within each domain, 10% of the samples are split off into the validation set.
In feature extraction, we obtain 500 words as the intersection of the 837 most common words of each domain, and form the bag-of-words representation of each review from the occurrences of these tokens. The feature vector of a business is obtained by averaging the vectors of its reviews. In the DAN-LPE setting we implement a standard neural network with 2 hidden layers of 32 dimensions. $G_d$ takes the output of the first hidden layer as its input and has another hidden layer of the same size. Dropout with p = 0.6 is applied to all hidden layers.
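The vocabulary construction described above can be sketched with the standard library as follows. Only the intersection-of-top-words idea comes from the text; the helper names and the binary-occurrence choice are ours.

```python
from collections import Counter

def build_vocab(domains_tokens, top_k=837):
    """Intersect the top_k most common words of each domain's token stream.

    domains_tokens: one flat list of tokens per domain.
    Returns a word -> index mapping over the shared vocabulary.
    """
    tops = [
        {w for w, _ in Counter(tokens).most_common(top_k)}
        for tokens in domains_tokens
    ]
    vocab = sorted(set.intersection(*tops))
    return {w: i for i, w in enumerate(vocab)}

def bow_vector(tokens, vocab):
    """Binary occurrence vector of one review over the shared vocabulary."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] = 1
    return vec
```

A business vector is then simply the element-wise mean of its reviews' vectors.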
We compare DAN-LPE with SVM, DNN and DANN. The DNN is constructed from $G_f$ and $G_y$, and the DANN from $G_f$, $G_y$ and $G_d$. For DNN, DANN and DAN-LPE, the learning rate is fixed and the mini-batch size is 64. For optimization, the Adam [kingma2014adam] optimizer is applied with an early-stopping strategy. In the first step of DAN-LPE, the hyper-parameters of Algorithm 1 are fixed across all tasks.
To evaluate the label proportion estimation, we define $\hat{z}$ to be the estimate produced by DAN-LPE, and $p_{va}(Y)$ and $p_T(Y)$ to be the empirical label proportions of the samples of $S_{va}$ and $T$.
The results are shown in Table 1. In the first eight tasks, $p_S(Y)$ and $p_T(Y)$ differ a lot and DANN does not show much improvement over DNN; in some tasks it even degrades the classification performance. In these experiments, DAN-LPE shows a significant gain because the label proportion estimate reduces the label shift. In the last four tasks, the label proportions of $\mathcal{D}_S$ and $\mathcal{D}_T$ are close and DANN achieves the best accuracy in three of them. DAN-LPE performs comparably with DANN in these tasks since the re-weighting does not introduce much shift. It is worth mentioning that, given an accurate label shift estimate, we can also improve the classification accuracy by prior probability adjustment [saerens2002adjusting], re-weighting the class importance in $G_y$. Here we focus on the behavior of the domain adapter and do not discuss this further.
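The prior probability adjustment of [saerens2002adjusting] mentioned above amounts to rescaling the classifier posteriors by the target/source prior ratio and renormalising; a minimal sketch (function name is ours):

```python
import numpy as np

def adjust_posteriors(probs, p_src, p_tgt):
    """Prior probability adjustment in the spirit of Saerens et al. (2002).

    probs: (n_samples, n_classes) posteriors from a source-trained model.
    Multiplies each class posterior by p_tgt / p_src, then renormalises
    each row so it remains a probability distribution.
    """
    adjusted = probs * (np.asarray(p_tgt) / np.asarray(p_src))
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

Plugging the DAN-LPE estimate $\hat{z}$ in as the target prior would shift decisions toward the classes believed to be more frequent in the target domain.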
4.2 Experiments on Behavioral Coding Data
One application of text classification is behavioral coding in Motivational Interviewing (MI) [miller2012motivational] counseling. Therapist utterances are coded to evaluate the therapist based on the Motivational Interviewing Skill Code (MISC) [miller2003manual] manual. Some of these MISC codes are easily confused with each other, and training a classifier on a subset of these confusable codes helps improve behavioral coding [chen2019improving]. In this experiment we classify utterances of Giving Information (GI), simple reflection (RES) and complex reflection (REC) collected from MI sessions on alcohol addiction (A), drug abuse (D) [atkins2014scaling] and general psychology conversations (G) at a US university counseling center, with each category containing around 10,000 samples. The label proportions are shown in Fig. 2b. The modules $G_y$ and $G_d$ in the DAN-LPE structure are similar to those in the Yelp experiment, with 128-dimensional hidden layers and dropout rate p = 0.4. We replace $G_f$ of the Yelp experiment with a word embedding layer, followed by a bidirectional LSTM layer and an attention mechanism implemented as in [yang2016hierarchical] above the LSTM. As shown in Fig. 2b, the data is highly imbalanced, so we evaluate the performance by the average F1 score. In the $G_y$ module of both step 1 and step 2 of Algorithm 1, we assign each class a weight inversely proportional to its frequency, to make the algorithm more robust and to improve the F-score.
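The inverse-frequency class weighting can be sketched as below; the normalisation (weights averaging to 1 over the classes) is our choice, since the text does not specify one.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class loss weights inversely proportional to class frequency.

    Rare classes get larger weights, counteracting the class imbalance
    in the classification loss. Normalised so the weights average to 1.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    weights = 1.0 / counts
    return weights * n_classes / weights.sum()
```

These weights would multiply the per-sample cross-entropy terms in $L_y$ according to each sample's label.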
The results in Table 2 show that DAN-LPE achieves the best overall performance and the highest F-score in most tasks, except in the last one, where the estimated proportions do not reduce much label shift. DANN only achieves an F-score comparable to DNN's and even degrades on some tasks. The DNN results suggest that the behavioral coding task is harder than the Yelp experiment, because the number of classes is larger and the behavior codes are human-defined and not orthogonal. Nevertheless, DAN-LPE shows its robustness and still gives reasonable proportion estimates for the data in the unlabelled domain.
5 Conclusion and Future Work
In this paper, we proposed the DAN-LPE framework to handle label shift in the DANN for unsupervised domain adaptation of text classification. In DAN-LPE we estimate the target label distribution and learn domain-invariant features simultaneously. We derived the formula to update the label proportion estimate using the validation confusion and the target label predictions. Experiments show that DAN-LPE estimates the label distribution robustly and improves text classification. In the future, we plan to apply DAN-LPE to other tasks such as image classification.