Interestingly, we find that many of the aforementioned AI tasks emerge in dual forms, i.e., the input and output of one task are exactly the output and input of the other task, respectively. Examples include translation from language A to language B vs. translation from language B to A, image classification vs. image generation, and speech recognition vs. speech synthesis. Even more interestingly (and somewhat surprisingly), this natural duality is largely ignored in the current practice of machine learning. That is, despite the fact that two tasks are dual to each other, people usually train them independently and separately. Then a question arises: can we exploit the duality between two tasks so as to achieve better performance for both of them? In this work, we give a positive answer to this question.
To exploit the duality, we formulate a new learning scheme, which involves two tasks: a primal task and its dual task. The primal task takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and the dual task takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$. Using the language of probability, the primal task learns a conditional distribution $P(y|x;\theta_{xy})$ parameterized by $\theta_{xy}$, and the dual task learns a conditional distribution $P(x|y;\theta_{yx})$ parameterized by $\theta_{yx}$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. In the new scheme, the two dual tasks are jointly learned and their structural relationship is exploited to improve the learning effectiveness. We name this new scheme dual supervised learning (briefly, DSL).
There could be many different ways of exploiting the duality in DSL. In this paper, we use it as a regularization term to govern the training process. Since the joint probability $P(x,y)$ can be computed in two equivalent ways, $P(x,y) = P(x)P(y|x) = P(y)P(x|y)$, for any $x \in \mathcal{X}, y \in \mathcal{Y}$, ideally the conditional distributions of the primal and dual tasks should satisfy the following equality:

$$P(x)P(y|x;\theta_{xy}) = P(y)P(x|y;\theta_{yx}). \tag{1}$$
However, if the two models (conditional distributions) are learned separately by minimizing their own loss functions (as in the current practice of machine learning), there is no guarantee that the above equation will hold. The basic idea of DSL is to jointly learn the two models $P(y|x;\theta_{xy})$ and $P(x|y;\theta_{yx})$ by minimizing their loss functions subject to the constraint of Eqn.(1). By doing so, the intrinsic probabilistic connection between the two models is explicitly strengthened, which is expected to push the learning process in the right direction. To solve the constrained optimization problem of DSL, we convert the constraint of Eqn.(1) into a penalty term by using the method of Lagrange multipliers (Boyd & Vandenberghe, 2004). Note that the penalty term can also be seen as a data-dependent regularization term.
To demonstrate the effectiveness of DSL, we apply it to three artificial intelligence applications. (In our experiments, we chose the most cited models with either open-source code or enough implementation details, to ensure that we can reproduce the results reported in previous papers. All of our experiments are done on a single Tesla K40m GPU.)
(1) Neural Machine Translation (NMT) We first apply DSL to NMT, which formulates machine translation as a sequence-to-sequence learning problem, with the sentences in the source language as inputs and those in the target language as outputs. The input space and output space of NMT are symmetric, and there is almost no information loss while mapping from $x$ to $y$ or from $y$ to $x$. Thus, symmetric tasks in NMT fit well into the scope of DSL. Experimental studies show significant accuracy improvements, measured by BLEU score, from applying DSL to NMT on English↔French, English↔German, and English↔Chinese translation.
(2) Image Processing We then apply DSL to image processing, in which the primal task is image classification and the dual task is image generation conditioned on category labels. Both tasks are active research topics in the deep learning community. We choose ResNet (He et al., 2016b) as our baseline for image classification, and PixelCNN++ (Salimans et al., 2017) as our baseline for image generation. Experimental results show that on CIFAR-10, DSL reduces the error rate of ResNet-110 and obtains a better image generation model with both clearer images and smaller bits per dimension. Note that these primal and dual tasks do not yield a pair of completely symmetric input and output spaces, since there is information loss while mapping from an image to its class label. Therefore, our experimental studies reveal that DSL can also work well for dual tasks with information loss.
(3) Sentiment Analysis Finally, we apply DSL to sentiment analysis, in which the primal task is sentiment classification (i.e., to predict the sentiment of a given sentence) and the dual one is sentence generation with a given sentiment polarity. Experiments on the IMDB dataset show that DSL reduces the error rate of a widely used sentiment classification model, and can generate sentences with clearer and richer styles of sentiment expression.
All of the above experiments on real artificial intelligence applications demonstrate that DSL can improve the practical performance of both tasks simultaneously.
In this section, we formulate the problem of dual supervised learning (DSL), describe an algorithm for DSL, and discuss its connections with existing learning schemes and its application scope.
2.1 Problem Formulation
To exploit the duality, we formulate a new learning scheme, which involves two tasks: a primal task that takes a sample from space $\mathcal{X}$ as input and maps it to space $\mathcal{Y}$, and a dual task that takes a sample from space $\mathcal{Y}$ as input and maps it to space $\mathcal{X}$.
Assume we have $n$ training pairs $\{(x_i, y_i)\}_{i=1}^{n}$ i.i.d. sampled from the space $\mathcal{X} \times \mathcal{Y}$ according to some unknown distribution $P$. Our goal is to reveal the bi-directional relationship between the two inputs $x$ and $y$. To be specific, we perform the following two tasks: (1) the primal learning task aims at finding a function $f: \mathcal{X} \mapsto \mathcal{Y}$ such that the prediction of $f$ for $x$ is similar to its real counterpart $y$; (2) the dual learning task aims at finding a function $g: \mathcal{Y} \mapsto \mathcal{X}$ such that the prediction of $g$ for $y$ is similar to its real counterpart $x$. The dissimilarity is penalized by a loss function. Given any $(x, y)$, let $\ell_1(f(x), y)$ and $\ell_2(g(y), x)$ denote the loss functions for $f$ and $g$ respectively, both of which are mappings from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.
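A common practice to design $(f, g)$ is maximum likelihood estimation based on the parameterized conditional distributions $P(\cdot|x;\theta_{xy})$ and $P(\cdot|y;\theta_{yx})$:

$$f(x;\theta_{xy}) \triangleq \arg\max_{y' \in \mathcal{Y}} P(y'|x;\theta_{xy}), \qquad g(y;\theta_{yx}) \triangleq \arg\max_{x' \in \mathcal{X}} P(x'|y;\theta_{yx}),$$

where $\theta_{xy}$ and $\theta_{yx}$ are the parameters to be learned.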
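By standard supervised learning, the primal model $f$ is learned by minimizing the empirical risk in space $\mathcal{Y}$:

$$\min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1\big(f(x_i;\theta_{xy}), y_i\big),$$

and the dual model $g$ is learned by minimizing the empirical risk in space $\mathcal{X}$:

$$\min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2\big(g(y_i;\theta_{yx}), x_i\big).$$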
Given the duality of the primal and dual tasks, if the learned primal and dual models are perfect, we should have

$$P(x)P(y|x;\theta_{xy}) = P(y)P(x|y;\theta_{yx}) = P(x,y), \quad \forall x, y.$$

We call this property probabilistic duality, which serves as a necessary condition for the optimality of the two learned dual models.
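Probabilistic duality can be checked numerically on a small example. The sketch below is illustrative only: the joint table `JOINT` and the perturbed conditional `perturbed_fwd` are hypothetical values, not taken from our experiments. It shows that the exact conditionals derived from one joint distribution satisfy the equality, while a primal conditional specified independently of the dual one generally does not.

```python
# Hypothetical toy joint distribution P(x, y), indexed as JOINT[x][y].
JOINT = [[0.10, 0.25, 0.05],
         [0.30, 0.10, 0.20]]

def marginals(joint):
    """Return (P(x), P(y)) computed from the joint table."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(joint[x][y] for x in range(len(joint)))
           for y in range(len(joint[0]))]
    return p_x, p_y

def conditionals(joint):
    """Return the exact tables P(y|x) and P(x|y) implied by the joint."""
    p_x, p_y = marginals(joint)
    fwd = [[joint[x][y] / p_x[x] for y in range(len(p_y))]
           for x in range(len(p_x))]
    bwd = [[joint[x][y] / p_y[y] for y in range(len(p_y))]
           for x in range(len(p_x))]
    return fwd, bwd

def max_duality_gap(p_x, p_y, fwd, bwd):
    """Largest |P(x)P(y|x) - P(y)P(x|y)| over all (x, y) pairs."""
    return max(abs(p_x[x] * fwd[x][y] - p_y[y] * bwd[x][y])
               for x in range(len(p_x)) for y in range(len(p_y)))

p_x, p_y = marginals(JOINT)
exact_fwd, exact_bwd = conditionals(JOINT)
gap_exact = max_duality_gap(p_x, p_y, exact_fwd, exact_bwd)

# A valid but slightly wrong primal conditional, standing in for a model
# that was trained without reference to the dual model.
perturbed_fwd = [[0.30, 0.55, 0.15],
                 [0.50, 0.15, 0.35]]
gap_perturbed = max_duality_gap(p_x, p_y, perturbed_fwd, exact_bwd)

print(gap_exact, gap_perturbed)
```

The first gap is zero up to floating-point error, whereas the second is strictly positive, which is the inconsistency the duality constraint is meant to remove.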
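By the standard supervised learning scheme, probabilistic duality is not considered during training, and the primal and the dual models are trained independently and separately. Thus, there is no guarantee that the learned dual models satisfy probabilistic duality. To tackle this problem, we propose explicitly reinforcing the empirical probabilistic duality of the dual models by solving the following multi-objective optimization problem instead:

$$\begin{aligned}
\text{objective 1:}\quad & \min_{\theta_{xy}} \frac{1}{n}\sum_{i=1}^{n} \ell_1\big(f(x_i;\theta_{xy}), y_i\big),\\
\text{objective 2:}\quad & \min_{\theta_{yx}} \frac{1}{n}\sum_{i=1}^{n} \ell_2\big(g(y_i;\theta_{yx}), x_i\big),\\
\text{s.t.}\quad & P(x)P(y|x;\theta_{xy}) = P(y)P(x|y;\theta_{yx}), \quad \forall x, y,
\end{aligned} \tag{2}$$

where $P(x)$ and $P(y)$ are the marginal distributions. We call this new learning scheme dual supervised learning (abbreviated as DSL).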
We provide a simple theoretical analysis which shows that DSL has theoretical guarantees in terms of generalization bound. Since the analysis is straightforward, we put it in Appendix A.
2.2 Algorithm Description
In practical artificial intelligence applications, the ground-truth marginal distributions are usually not available. As an alternative, we use the empirical marginal distributions $\hat{P}(x)$ and $\hat{P}(y)$ to fulfill the constraint in Eqn.(2).
To solve the DSL problem, following the common practice in constrained optimization, we introduce Lagrange multipliers and add the equality constraint of probabilistic duality into the objective functions. First, we convert the probabilistic duality constraint into the following regularization term (with the empirical marginal distributions included):

$$\ell_{duality} = \big(\log \hat{P}(x) + \log P(y|x;\theta_{xy}) - \log \hat{P}(y) - \log P(x|y;\theta_{yx})\big)^2. \tag{3}$$
Then, we learn the models of the two tasks by minimizing the weighted combination between the original loss functions and the above regularization term. The algorithm is shown in Algorithm 1.
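The weighted combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: both task losses are assumed to be negative log-likelihoods, the function names and the dictionary layout of a minibatch item are hypothetical, and the gradient updates themselves are left to whatever optimizer is used for each model.

```python
import math

def duality_regularizer(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared gap between the two factorizations of log P(x, y) for one pair."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2

def dsl_objectives(batch, lambda_xy, lambda_yx):
    """Minibatch averages of the regularized primal and dual objectives.

    Each item in `batch` is a dict holding the four log-probabilities of one
    (x, y) pair. Task losses are assumed to be negative log-likelihoods.
    """
    primal_total, dual_total = 0.0, 0.0
    for item in batch:
        reg = duality_regularizer(item["log_p_x"], item["log_p_y_given_x"],
                                  item["log_p_y"], item["log_p_x_given_y"])
        primal_total += -item["log_p_y_given_x"] + lambda_xy * reg
        dual_total += -item["log_p_x_given_y"] + lambda_yx * reg
    m = len(batch)
    return primal_total / m, dual_total / m

# Hypothetical minibatch: one pair that satisfies duality, one that does not.
consistent = {"log_p_x": math.log(0.4), "log_p_y_given_x": math.log(0.25),
              "log_p_y": math.log(0.4), "log_p_x_given_y": math.log(0.25)}
inconsistent = {"log_p_x": math.log(0.4), "log_p_y_given_x": math.log(0.5),
                "log_p_y": math.log(0.4), "log_p_x_given_y": math.log(0.25)}

primal_obj, dual_obj = dsl_objectives([consistent, inconsistent],
                                      lambda_xy=0.1, lambda_yx=0.1)
print(primal_obj, dual_obj)
```

Because the regularizer depends on both conditional models, the gradient of each objective with respect to its own parameters carries a signal from the other model, which is how each model regularizes the other.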
The duality between tasks has been used to enable learning from unlabeled data in (He et al., 2016a). As an early attempt to exploit the duality, this work actually uses the exterior connection between dual tasks, which helps to form a closed feedback loop and enables unsupervised learning. For example, in the application of machine translation, the primal task/model first translates an unlabeled English sentence $x$ to a French sentence $y'$; then, the dual task/model translates $y'$ back to an English sentence $x'$; finally, both the primal and the dual models are optimized by minimizing the difference between $x'$ and $x$. In contrast, by making use of the intrinsic probabilistic connection between the primal and dual models, DSL makes an innovative attempt to extend the benefit of duality to supervised learning.
While $\ell_{duality}$ can be regarded as a regularization term, it is data-dependent, which makes DSL different from Lasso (Tibshirani, 1996) or SVM (Hearst et al., 1998), where the regularization term is data-independent. More precisely, in DSL, every training sample contributes to the regularization term, and each model contributes to the regularization of the other model.
DSL is different from the following three learning schemes: (1) Co-training focuses on single-task learning and assumes that different subsets of features can provide sufficient and complementary information about the data, while DSL targets learning two tasks with structural duality simultaneously and does not impose any prerequisites or assumptions on features. (2) Multi-task learning requires that different tasks share the same input space and a coherent feature representation, while DSL does not. (3) Transfer learning uses auxiliary tasks to boost the main task, while there is no difference between the roles of the two tasks in DSL, and DSL enables them to boost the performance of each other simultaneously.
We would like to point out that there are several requirements for applying DSL to a certain scenario: (1) Duality should exist between the two tasks. (2) Both the primal and dual models should be trainable. (3) $\hat{P}(x)$ and $\hat{P}(y)$ in Eqn.(3) should be available. If these conditions are not satisfied, DSL might not work very well. Fortunately, as we have discussed in the paper, many machine learning tasks related to image, speech, and text satisfy these conditions.
3 Application to Machine Translation
We first apply our dual supervised learning algorithm to machine translation and study whether it can improve translation quality by utilizing the probabilistic duality of dual translation tasks. In the rest of this section, we perform experiments on three pairs of dual tasks: English↔French (En↔Fr), English↔German (En↔De), and English↔Chinese (En↔Zh). (Since both tasks in each pair are symmetric, they play the same role in the dual supervised learning framework. Consequently, either one of the dual tasks can be viewed as the primal task and the other as the dual task.)
Datasets We employ the same datasets as used in (Jean et al., 2015) to conduct experiments on En↔Fr and En↔De. As a part of WMT’14, the training data consists of parallel sentence pairs for En↔Fr and for En↔De (WMT, 2014). We combine newstest2012 and newstest2013 as the validation set and use newstest2014 as the test set. For the dual tasks of En↔Zh, we use sentence pairs obtained from a commercial company as training data. We use NIST2006 as the validation set and NIST2008 as well as NIST2012 as the test sets. (The three NIST datasets correspond to the Zh→En translation task, in which each Chinese sentence has four English references. To build the test set for En→Zh, we pair each Chinese sentence with one randomly picked English reference to form an En→Zh validation/test pair.) Note that, during the training of all three pairs of dual tasks, we drop all sentences longer than a fixed maximum number of words.
Marginal Distributions $\hat{P}(x)$ and $\hat{P}(y)$ We use the LSTM-based language modeling approach (Sundermeyer et al., 2012; Mikolov et al., 2010) to characterize the marginal distribution of a sentence $x$, defined as $\prod_{i=1}^{T_x} P(x_i \mid x_{<i})$, where $x_i$ is the $i$-th word in $x$, $T_x$ denotes the number of words in $x$, and the index $<i$ indicates $\{1, 2, \dots, i-1\}$. More details about this language modeling approach can be found in Appendix B.
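The chain-rule factorization of the sentence marginal can be sketched as below. The sketch assumes a callable `cond_prob(word, history)` that returns the conditional probability of the next word; here a hypothetical unigram table stands in for the trained LSTM language model, so the numbers are purely illustrative.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Log marginal of a sentence: sum over i of log P(x_i | x_<i)."""
    total = 0.0
    for i, word in enumerate(words):
        total += math.log(cond_prob(word, words[:i]))
    return total

# Hypothetical stand-in for a trained language model: a unigram table that
# ignores the history. A real LSTM would condition on `history`.
UNIGRAM = {"the": 0.5, "cat": 0.25, "sat": 0.25}

def toy_cond_prob(word, history):
    return UNIGRAM[word]

log_p = sentence_log_prob(["the", "cat", "sat"], toy_cond_prob)
print(log_p)
```

Working in log space avoids numerical underflow for long sentences, and the resulting value is exactly the log-marginal term that enters the duality regularizer.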
Model We apply the GRU as the recurrent module to implement the sequence-to-sequence model, which is the same as in (Bahdanau et al., 2015; Jean et al., 2015). The word embedding dimension, the number of hidden nodes, and the vocabulary sizes of the source and target languages are set as hyperparameters, with one vocabulary size for each of En↔Fr, En↔De, and En↔Zh. Out-of-vocabulary words are replaced by a special token UNK. Following common practice, we denote the baseline algorithm proposed in (Bahdanau et al., 2015; Jean et al., 2015) as RNNSearch. We implement the whole NMT learning system based on open-source code (https://github.com/nyu-dl/dl4mt-tutorial).
Evaluation Metrics Translation quality is measured by tokenized case-sensitive BLEU (Papineni et al., 2002) scores, computed with the script of (multi bleu, 2015). The larger the BLEU score, the better the translation quality. During evaluation, we use beam search with beam width 12 to generate sentences. Note that, following common practice, Zh→En is evaluated by case-insensitive BLEU score.
Training Procedure We initialize the two models in DSL (i.e., $\theta_{xy}$ and $\theta_{yx}$) by using two warm-start models, which are generated by following the same process as (Jean et al., 2015). Then, we use SGD with a minibatch size of 80 as the optimization method for dual training. During training, we start from a fixed initial learning rate and halve it if the BLEU score on the validation set does not increase for a certain number of mini-batches. In order to stabilize the parameters, we freeze the embedding matrix once halving the learning rate can no longer improve the BLEU score on the validation set. Gradient clipping (Pascanu et al., 2013) is applied with a separate threshold for each of En↔Fr, En↔De, and En↔Zh. Both $\lambda_{xy}$ and $\lambda_{yx}$ in Algorithm 1 are set to the same value, chosen according to empirical performance on the validation set. Note that, during the optimization process, the LSTM-based language models are not updated.
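The learning-rate rule described above can be sketched as follows. The class name `HalvingSchedule`, the `patience` parameter, and the numeric values in the usage example are hypothetical; in particular, patience is counted here in validation checks rather than mini-batches, which is an assumption made for brevity.

```python
class HalvingSchedule:
    """Halve the learning rate when validation BLEU stops improving."""

    def __init__(self, initial_lr, patience):
        self.lr = initial_lr
        self.patience = patience      # checks without improvement before halving
        self.best = float("-inf")     # best validation BLEU seen so far
        self.stale = 0                # consecutive checks without improvement

    def update(self, val_bleu):
        """Record one validation BLEU score and return the current learning rate."""
        if val_bleu > self.best:
            self.best = val_bleu
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= 2.0
                self.stale = 0
        return self.lr

# Hypothetical validation BLEU trace.
schedule = HalvingSchedule(initial_lr=1.0, patience=2)
lrs = [schedule.update(b) for b in [10.0, 11.0, 11.0, 10.5, 12.0]]
print(lrs)
```

The rule for freezing the embedding matrix could be layered on top of this by tracking whether the best score improved between two consecutive halvings.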
Table 1 compares the BLEU scores on the dual tasks obtained by the DSL method with those obtained by the baseline RNNSearch method. Note that, in this table, we use (MT08) and (MT12) to denote results obtained on NIST2008 and NIST2012, respectively. From this table, we find that, on all three pairs of symmetric tasks, DSL improves the performance of both dual tasks simultaneously.
To better understand the effects of applying the probabilistic duality constraint as the regularization, we compute $\ell_{duality}$ on the test set for DSL and compare it with that of RNNSearch. In particular, after applying DSL to En↔Fr, $\ell_{duality}$ decreases, which also indicates that the two models become more coherent in terms of probabilistic duality.
(Jean et al., 2015) proposed an effective post-processing technique, which achieves better translation performance by replacing each “UNK” with the corresponding word-level translation. After applying this technique to DSL, we report its results on En↔Fr in Table 2, compared with several baselines that have the same model structure as ours and also integrate the “UNK” post-processing technique. From this table, it is clear that DSL achieves better performance than all baseline methods.
|MRT||Direct optimizing BLEU|
|DSL||Refer to Algorithm 1|
| (Jean et al., 2015);  (Shen et al., 2016)|
In the previous experiments, we use a warm-start approach in DSL based on the models trained by RNNSearch. In fact, we can use stronger models for initialization to achieve even better accuracy. We conduct a light experiment to verify this: we use the models trained by (He et al., 2016a) as the initializations in DSL on En↔Fr translation. We find that the BLEU score improves for both En→Fr and Fr→En translation.
There are two hyperparameters, $\lambda_{xy}$ and $\lambda_{yx}$, in our DSL algorithm. We conduct some experiments to investigate their effects. Since the input and output spaces are symmetric, we set $\lambda_{xy} = \lambda_{yx} = \lambda$ and plot the validation accuracy for different values of $\lambda$ in Figure 1(a). From this figure, we can see that both En→Fr and Fr→En reach their best performance at the same value of $\lambda$, and thus the results of DSL reported in Table 1 are obtained with this value. Moreover, we find that, within a relatively large interval of $\lambda$, DSL outperforms standard supervised learning, i.e., the point with $\lambda = 0$. We also plot the BLEU scores on the validation and test sets in Figure 1(b) with respect to training iterations. We can see that, in the first couple of rounds, the test BLEU curves fluctuate with large variance. The reason is that the two separately initialized models of the dual tasks are not yet consistent with each other, i.e., Eqn.(1) does not hold, which causes a decline in the performance of both models as they act as the regularizer for each other. As training goes on, the two models become more consistent and finally boost the performance of each other.
|[Source (En)] A board member at a German blue-chip|
|company concurred that when it comes to economic espionage,|
|"the French are the worst."|
|[Source (Fr)] Un membre du conseil d’administration d’une|
|société allemande renommée estimait que lorsqu’il s’agit|
|d’espionnage économique , « les Français sont les pires » .|
|[RNNSearch (Fr→En)] A member of the board of directors|
|of a renowned German society felt that when it was economic|
|espionage, “the French are the worst. ”|
|[RNNSearch (En→Fr)] Un membre du conseil d’une|
|compagnie allemande UNK a reconnu que quand il s’agissait|
|d’espionnage économique, "le français est le pire".|
|[DSL (Fr→En)] A board member of a renowned German|
|company felt that when it comes to economic espionage,|
|"the French are the worst. "|
|[DSL (En→Fr)] Un membre du conseil d’une compagnie|
|allemande UNK a reconnu que , lorsqu’il s’agit d’espionnage|
|économique, "les Français sont les pires".|
Table 3 shows a couple of translation examples produced by RNNSearch and by DSL. From this table, we find that DSL demonstrates three major advantages over RNNSearch. First, by leveraging the structural duality of sentences, DSL improves mutual translation, e.g., “when it comes to” and “lorsqu’il s’agit de”, which better fit the semantics expressed in the sentences. Second, DSL can consider more contextual information in translation. For example, in Fr→En, “une société” is translated to “company” by DSL, whereas the baseline translates it to “society”. Although the word-level translation is not bad, it should definitely be translated as “company” given the contextual semantics. Third, DSL can better handle the plural form. For example, DSL correctly produces “les Français sont les pires” for “the French are the worst”, which is plural, while the baseline renders it in the singular form.
4 Application to Image Processing
In the domain of image processing, image classification (image→label) and image generation (label→image) are in dual form. In this section, we apply our dual supervised learning framework to these two tasks and conduct experimental studies based on a public dataset, CIFAR-10 (Krizhevsky & Hinton, 2009), with 10 classes of images. In our experiments, we employ a popular method, ResNet (https://github.com/tensorflow/models/tree/master/resnet), for image classification and a very recent method, PixelCNN++ (https://github.com/openai/pixel-cnn), for image generation. Let $\mathcal{X}$ denote the image space and $\mathcal{Y}$ denote the category space related to CIFAR-10.
In our experiments, we simply use the uniform distribution to set the marginal distribution $\hat{P}(y)$ of the 10-class labels, which means the marginal probability of each class equals $0.1$. The image distribution $\hat{P}(x)$ is usually defined as $\prod_{i=1}^{m} P(x_i \mid x_{<i})$, where all pixels of the image are serialized and $x_i$ is the value of the $i$-th pixel of an $m$-pixel image. Note that the model can predict $x_i$ only based on the previous pixels $x_j$ with index $j < i$. We use PixelCNN++, which is so far the best algorithm, to model the image distribution.
Models For the task of image classification, we choose 32-layer ResNet (denoted as ResNet-32) and 110-layer ResNet (denoted as ResNet-110) as two baselines, in order to examine the power of DSL on both relatively simple and complex models. For the task of image generation, we use PixelCNN++ again. Compared to the PixelCNN++ used for modeling the image distribution, the difference lies in the training process: when used for image generation given a certain class, PixelCNN++ takes the class label as an additional input, i.e., it tries to characterize $\prod_{i=1}^{m} P(x_i \mid x_{<i}, y)$, where $y$ is the 1-hot label vector.
Evaluation Metrics We use the classification error rate to measure the performance of image classification. We use bits per dimension (briefly, bpd) (Salimans et al., 2017) to assess the performance of image generation. In particular, for an image $x$ with label $y$, the bpd is defined as:

$$-\frac{1}{N_x}\sum_{i=1}^{N_x} \log P(x_i \mid x_{<i}, y),$$

where $N_x$ is the number of pixels in image $x$. For CIFAR-10, $N_x$ is $3072$ ($32 \times 32 \times 3$) for any image $x$, and we report the average bpd on the test set.
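As a small worked example of the bpd metric, the sketch below computes it from per-pixel conditional probabilities. It assumes base-2 logarithms (so that the result is in bits) and uses a hypothetical four-pixel "image"; a real evaluation would obtain the probabilities from the conditional PixelCNN++ model.

```python
import math

def bits_per_dimension(pixel_cond_probs):
    """bpd for one image: -(1/N) * sum_i log2 P(x_i | x_<i, y)."""
    n = len(pixel_cond_probs)
    total_log2 = sum(math.log2(p) for p in pixel_cond_probs)
    return -total_log2 / n

# Hypothetical image with 4 sub-pixels, each predicted with probability 1/8,
# i.e., each sub-pixel costs exactly 3 bits.
toy_bpd = bits_per_dimension([0.125, 0.125, 0.125, 0.125])
print(toy_bpd)
```

When a model reports its log-likelihood in nats, dividing by $N_x \ln 2$ gives the same quantity.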
Training Procedure We first initialize both the primal and the dual models with the ResNet model and the PixelCNN++ model pre-trained independently and separately. We obtain a 32-layer ResNet and a 110-layer ResNet as the pre-trained models for image classification. The error rates of these two pre-trained models are comparable to the results reported in (He et al., 2016b). We obtain a pre-trained conditional image generation model whose test bpd matches that reported in (Salimans et al., 2017). For DSL training, we set separate initial learning rates for the image classification model and the image generation model. The learning rates follow the same decay rules as those in (He et al., 2016b) and (Salimans et al., 2017). The whole training process takes about two weeks before convergence. Note that the experimental results below are obtained with one fixed setting of $\lambda_{xy}$ and $\lambda_{yx}$.
4.2 Results on Image Classification
Table 4 compares the error rates of two image classification models, i.e., DSL vs. Baseline, on the test set. From this table, we find that, using either ResNet-32 or ResNet-110, DSL achieves better accuracy than the baseline method.
Interestingly, we observe from Table 4 that DSL leads to a higher relative performance improvement on ResNet-110 than on ResNet-32. We hypothesize that one possible reason is that, due to the limited training data, an appropriate regularization benefits the 110-layer ResNet, which has higher model complexity, more; the duality-oriented regularization indeed plays this role and consequently gives rise to a higher relative improvement.
4.3 Results on Image Generation
Our further experimental results show that, based on ResNet-110, DSL decreases the test bpd relative to the baseline, which is a new state-of-the-art result on CIFAR-10. Indeed, it is quite difficult to improve bpd even by what seems like a minor margin. We also find that there is no significant improvement in test bpd based on ResNet-32. An intuitive explanation is that, since ResNet-110 is stronger than ResNet-32 in modeling the conditional probability $P(y|x)$, it can better help the task of image generation through the constraint/regularization of probabilistic duality.
As pointed out in (Theis et al., 2015), bpd is not the only evaluation criterion for image generation. Therefore, we further conduct a qualitative analysis by comparing images generated by dual supervised learning with those generated by the baseline model for each of the image categories, some examples of which are shown in Figure 2.
Each row in Figure 2 corresponds to one category in CIFAR-10; the five images on the left are generated by the baseline model, and the five on the right are generated by the model trained with DSL. From this figure, we find that DSL generally generates images with clearer and more distinguishable characteristics of the corresponding category. Specifically, the five right-hand images in Rows 3, 4, and 6 show more distinguishable characteristics of birds, cats, and dogs respectively, which is mainly due to the benefits of introducing probabilistic duality into DSL. However, there are still some cases in which neither the baseline model nor DSL performs well, such as deer in Row 5 and frogs in Row 7. One reason is that the bpd values of images in the deer and frog categories are significantly larger than the average. This shows that the images of these two categories are harder to generate.
5 Application to Sentiment Analysis
Finally, we apply the dual supervised learning framework to the domain of sentiment analysis. In this domain, the primal task, sentiment classification (Maas et al., 2011; Dai & Le, 2015), is to predict the sentiment polarity label of a given sentence; the dual task, though less apparent, does exist: it is sentence generation based on a sentiment polarity. In this section, let $\mathcal{X}$ denote the sentences and $\mathcal{Y}$ denote the sentiment related to our task.
5.1 Experimental Setup
Dataset Our experiments are performed on the IMDB movie review dataset (IMDB, 2011), which consists of 25k training and 25k test sentences. Each sentence in this dataset is associated with either a positive or a negative sentiment label. We randomly sample a subset of sentences from the training data as the validation set for hyperparameter tuning and use the remaining training data for model training.
Marginal Distributions We simply use the uniform distribution to set the marginal distribution of polarity labels, which means the marginal probability of the positive or negative class equals $0.5$. On the other side, we use LSTM-based language modeling to model the marginal distribution of a sentence $x$, and evaluate the obtained language model by its test perplexity (Bengio et al., 2003).
Model Implementation We use the widely used LSTM (Dai & Le, 2015) modeling approach for the sentiment classification model. (Both supervised and semi-supervised sentiment classification are studied in (Dai & Le, 2015). We focus on supervised learning here. Therefore, we do not compare with the models trained with semi-supervised (labeled + unlabeled) data.) The embedding dimension and the hidden layer size are set as hyperparameters. For sentence generation, we use another LSTM model whose input at each step combines the embedding of the previous word with the embedding of the sentiment label, where $E_w$ and $E_s$ represent the embedding matrices for words and sentiment labels respectively, and the $W$'s represent the connections between the embedding matrices and the LSTM cells. A sentence is generated word by word sequentially, and the probability that a word is generated is proportional to a score computed from the hidden state output by the LSTM. Note that the $W$'s and the $E$'s are the parameters to learn in training. In the following, we call the model for sentiment-based sentence generation the contextual language model (briefly, CLM).
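The conditioning and output steps of the CLM can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the additive combination of word and label embeddings, the softmax over dot-product scores, and all names and numeric values are hypothetical, and the recurrent LSTM update itself is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def conditioned_input(word_vec, label_vec):
    """Assumed additive combination of word and sentiment-label embeddings."""
    return [w + s for w, s in zip(word_vec, label_vec)]

def next_word_distribution(hidden, output_weights):
    """P(word | history, label), proportional to exp(w_word . hidden)."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(scores)

# Hypothetical 2-dimensional embeddings and a 3-word vocabulary.
step_input = conditioned_input([1.0, 2.0], [0.5, -0.5])
probs = next_word_distribution([0.0, 0.0],
                               [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(step_input, probs)
```

With a zero hidden state all scores are equal, so the sketch returns a uniform distribution over the vocabulary; a trained model would instead shift mass toward words consistent with the sentiment label.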
Evaluation Metrics We measure the performance of sentiment classification by the error rate, and that of sentence generation, i.e., CLM, by test perplexity.
Training Procedure To obtain baseline models, we use Adadelta as the optimization method to train both the sentiment classification model and the sentence generation model. Then, we use them to initialize the two models for DSL. At the beginning of DSL training, we use plain SGD with a fixed initial learning rate and then decrease it for both models once there is no further improvement on the validation set. For each $(x, y)$ pair, we set $\lambda_{xy}$ and $\lambda_{yx}$ as functions of $L_x$, where $L_x$ is the length of $x$. The whole training process of DSL takes less than two days.
Table 5 compares the performance of DSL with the baseline method in terms of both the error rate of sentiment classification and the perplexity of sentence generation. Note that the test error of the baseline classification model, as shown in the table, is comparable to the recent results reported in (Dai & Le, 2015). We have two observations from the table. First, DSL reduces the classification error without modifying the LSTM-based model structure. Second, DSL slightly improves the perplexity for sentence generation, but the improvement is not very significant. We hypothesize the reason is that the sentiment label can supply at most 1 bit of information, so the perplexity difference between the language model (i.e., the marginal distribution $P(x)$) and CLM (i.e., the conditional distribution $P(x|y)$) is not large, which limits the improvement brought by DSL.
|Test Error (%)||Perplexity|
Qualitative analysis on sentence generation
In addition to the quantitative studies shown above, we further conduct a qualitative analysis of the performance of sentence generation. Table 6 shows some examples of sentences generated from sentiment labels. From this table, we find that both the baseline model and DSL succeed in generating sentences expressing the given sentiment. The baseline model prefers to produce sentences with words that occur with high frequency in the training data, such as “the plot is simple/predictable, the acting is great/bad”, etc. This is because the sentence generation model itself is essentially a language-model-based generator, which aims at capturing the high-frequency words in the training data. Meanwhile, since the training of CLM in DSL can leverage the signals provided by the classifier, DSL makes it more likely to select words, phrases, or textual patterns that present more specific and more intense sentiment, such as “nothing but good, 10/10, don’t waste your time”, etc. As a result, the CLM in DSL can generate sentences with richer expressions of sentiment.
|i’ve seen this movie a few times. it’s still one of my|
|Base||favorites. the plot is simple, the acting is great.|
|(Pos)||It’s a very good movie, and i think it’s one of the|
|best movies i’ve seen in a long time.|
|I have nothing but good things to say about this|
|movie. I saw this movie when it first came out,|
|DSL||and I had to watch it again and again. I really|
|(Pos)||enjoyed this movie. I thought it was a very good|
|movie. The acting was great, the story was great.|
|I would recommend this movie to anyone.|
|I give it 10 / 10.|
|after seeing this film, i thought it was going to be|
|Base||one of the worst movies i’ve ever seen; the acting|
|(Neg)||was bad, the script was bad. the only thing i can|
|say about this movie is that it’s so bad.|
|this is a difficult movie to watch, and would, not|
|DSL||recommend it to anyone. The plot is predictable,|
|(Neg)||the acting is bad, and the script is awful.|
|Don’t waste your time on this one.|
In previous experiments, we start DSL training with well-trained primal and dual models. We conduct some further experiments to verify whether a warm start is a must for DSL. (1) We train DSL from a warm-start sentence generator and a cold-start (randomly initialized) sentence classifier. In this case, DSL achieves a classification error that is better than that of the baseline classifier in Table 5. (2) We train DSL from a warm-start classifier and a cold-start sentence generator. The perplexity of the generator after DSL training is better than that of the baseline generator. (3) We train DSL from both cold-start models. The final perplexity of the generator is 58.82, and both the classification error and the perplexity are better than the baselines. These results show that the success of DSL does not necessarily require warm-start models, although they can speed up the training of DSL.
6 Conclusions and Future Work
Observing the existence of structural duality among many AI tasks, we have proposed a new learning framework, dual supervised learning, which improves the performance of both the primal and the dual tasks simultaneously. We have introduced a probabilistic duality term to serve as a data-dependent regularizer to better guide the training. Empirical studies have validated the effectiveness of dual supervised learning.
There are multiple directions to explore in the future. First, we will test dual supervised learning on more dual tasks, such as speech recognition and speech synthesis. Second, we will extend the theoretical analysis to better understand dual supervised learning. Third, it is interesting to combine dual supervised learning with unsupervised dual learning (He et al., 2016a) to leverage unlabeled data and further improve the two dual tasks. Fourth, we will combine dual supervised learning with dual inference (Xia et al., 2017) so as to leverage structural duality in both the training and inference procedures.
Appendix A Theoretical Analysis
As we know, the final goal of dual learning is to give correct predictions on unseen test data. That is to say, we want to minimize the (expected) risk of the dual models, which is defined as follows (the parameters of the dual models are omitted when the context is clear):
where , , and are parameter spaces, and the is taken over the underlying distribution . Besides, let denote the product space of the two models satisfying probabilistic duality, i.e., the constraint in Eqn.(4). For ease of reference, define as .
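As a sketch in assumed notation, the risk and the duality-constrained space described above can be written as follows. Here $f \in \mathcal{F}$ is the primal model, $g \in \mathcal{G}$ is the dual model, $\ell_1$ and $\ell_2$ are their per-sample losses, and the equal weighting of the two losses is an assumption.

```latex
% Assumed notation: f is the primal model, g is the dual model,
% l_1 and l_2 are their per-sample losses, P is the data distribution.
R(f, g) = \mathbb{E}_{(x, y) \sim P}
  \left[ \tfrac{1}{2} \bigl( \ell_1(f(x), y) + \ell_2(g(y), x) \bigr) \right],
\qquad f \in \mathcal{F},\; g \in \mathcal{G},

% Product space restricted to pairs that satisfy probabilistic duality.
\mathcal{H}_{\mathrm{dual}} = \bigl\{ (f, g) \in \mathcal{F} \times \mathcal{G} :
  P(x)\, P(y \mid x; f) = P(y)\, P(x \mid y; g) \ \text{for all } (x, y) \bigr\}.
```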
Define the empirical risk on the sample as follows: for any ,
Following Bartlett & Mendelson (2002), we introduce the Rademacher complexity for dual supervised learning, a measure of the complexity of the hypothesis space.
Definition 1. Define the Rademacher complexity of DSL, , as follows:
where , in which and , are i.i.d sampled with .
Based on , we have the following theorem for dual supervised learning:
Theorem 1 (Mohri et al., 2012).
Let be a mapping from to . Then, for any , with probability at least , the following inequality holds for any ,
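For reference, the standard Rademacher bound of this type (Mohri et al., 2012) can be sketched as follows. The symbols are assumed notation: $n$ is the sample size, $\hat{R}_n$ is the empirical risk, $\mathfrak{R}_n^{\mathrm{DSL}}$ is the Rademacher complexity of DSL, and the loss is assumed to take values in $[0, 1]$.

```latex
% Standard form of the bound for a loss bounded in [0, 1];
% all symbol names are assumptions made for this sketch.
R(f, g) \;\le\; \hat{R}_n(f, g) + 2\, \mathfrak{R}_n^{\mathrm{DSL}}
  + \sqrt{\frac{\ln(1/\delta)}{2n}}
\qquad \text{for all } (f, g) \in \mathcal{H}_{\mathrm{dual}},
\ \text{with probability at least } 1 - \delta .
```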
Similarly, we define the Rademacher complexity for standard supervised learning under our framework by replacing the in Definition 1 by . With probability at least , the generalization error bound of supervised learning is smaller than .
Since , by the definition of Rademacher complexity, we have . Therefore, DSL enjoys a smaller generalization error bound than supervised learning.
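The comparison above rests on the fact that a supremum over a subset can never exceed the supremum over the full class. The following sketch checks this numerically with a Monte Carlo estimate of the empirical Rademacher complexity. The finite hypothesis class, its random loss values, and the choice of subset are hypothetical stand-ins for the unconstrained and the duality-constrained spaces.

```python
import random


def empirical_rademacher(loss_vectors, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i * h_i ].

    `loss_vectors` is a finite hypothesis class: each entry holds one
    hypothesis's losses on the same n samples.
    """
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    total = 0.0
    for _ in range(n_draws):
        # One draw of i.i.d. Rademacher variables, uniform on {-1, +1}.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            sum(s * v for s, v in zip(sigma, vec)) / n for vec in loss_vectors
        )
    return total / n_draws


# Hypothetical finite class: 50 hypotheses with losses in [0, 1] on 20 samples.
data_rng = random.Random(1)
full_class = [[data_rng.random() for _ in range(20)] for _ in range(50)]
# Stand-in for the duality-constrained subset: keep every fifth hypothesis.
dual_subset = full_class[::5]

# Both calls use the same seed, so they see identical sigma draws.
r_full = empirical_rademacher(full_class)
r_dual = empirical_rademacher(dual_subset)
```

Because both estimates share the same draws, the subset estimate is no larger than the full-class estimate on every draw, and therefore on average.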
The approximation error of dual supervised learning is defined as
The approximation error for supervised learning is similarly defined.
. Let and denote the two conditional probabilities derived from . We have the following theorem:
If and , then supervised learning and DSL have the same approximation error.
By definition, we can verify that both approximation errors are zero. ∎
Appendix B Details about the Language Models for Marginal Distributions
We use LSTM language models (Sundermeyer et al., 2012; Mikolov et al., 2010) to characterize the marginal distribution of a sentence , defined as , where is the -th word in , denotes the number of words in , and the index indicates . The embedding dimension and the number of hidden nodes are both . We apply dropout of 0.5 to the input embedding and to the last hidden layer before the softmax. The validation perplexities of the language models are shown in Table 7, where the validation sets are the same.
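The marginal of a sentence under a language model is a product of per-word conditional probabilities, and perplexity is the exponential of the negative average per-word log-probability. A minimal sketch follows; the uniform `uniform_model` is a hypothetical stand-in for a trained LSTM, and end-of-sentence tokens are ignored for brevity.

```python
import math


def sentence_log_prob(words, cond_prob):
    """Chain-rule log-probability: sum_i log P(x_i | x_<i)."""
    return sum(
        math.log(cond_prob(words[:i], words[i])) for i in range(len(words))
    )


def perplexity(sentences, cond_prob):
    """exp of the negative average per-word log-probability."""
    total_log_prob = sum(sentence_log_prob(s, cond_prob) for s in sentences)
    total_words = sum(len(s) for s in sentences)
    return math.exp(-total_log_prob / total_words)


# Hypothetical stand-in for a trained LSTM: uniform over a 4-word vocabulary.
VOCAB = ["the", "movie", "was", "great"]


def uniform_model(prefix, word):
    """Ignores the prefix and assigns equal probability to every word."""
    return 1.0 / len(VOCAB)


log_p = sentence_log_prob(["the", "movie", "was", "great"], uniform_model)
ppl = perplexity(
    [["the", "movie", "was", "great"], ["great", "movie"]], uniform_model
)
```

For a uniform model the perplexity equals the vocabulary size, which gives a quick sanity check for any implementation of this computation.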
For the marginal distribution of sentences in sentiment classification, we again use an LSTM language model, as in the machine translation experiments. The two differences are: (i) the vocabulary size is ; (ii) the word embedding dimension is . The perplexity of this language model is .
References
- Amodei et al. (2016) Amodei, Dario, Anubhai, Rishita, Battenberg, Eric, Case, Carl, Casper, Jared, Catanzaro, Bryan, Chen, Jingdong, Chrzanowski, Mike, Coates, Adam, Diamos, Greg, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In 33rd International Conference on Machine Learning, 2016.
- Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
- Bartlett & Mendelson (2002) Bartlett, Peter L and Mendelson, Shahar. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Bengio et al. (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
- Boyd & Vandenberghe (2004) Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press, 2004.
- Dai & Le (2015) Dai, Andrew M and Le, Quoc V. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079–3087, 2015.
- Graves et al. (2013) Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE, 2013.
- He et al. (2016a) He, Di, Xia, Yingce, Qin, Tao, Wang, Liwei, Yu, Nenghai, Liu, Tie-Yan, and Ma, Wei-Ying. Dual learning for machine translation. In Advances In Neural Information Processing Systems, pp. 820–828, 2016a.
- He et al. (2016b) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In , 2016b.
- He et al. (2016c) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016c.
- Hearst et al. (1998) Hearst, Marti A., Dumais, Susan T, Osuna, Edgar, Platt, John, and Scholkopf, Bernhard. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998.
- IMDB (2011) IMDB. Imdb dataset. http://ai.stanford.edu/~amaas/data/sentiment/, 2011.
- Jean et al. (2015) Jean, Sébastien, Cho, Kyunghyun, Memisevic, Roland, and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. In ACL, 2015.
- Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
- Maas et al. (2011) Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for Computational Linguistics, 2011.
- Mikolov et al. (2010) Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernockỳ, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In Interspeech, volume 2, pp. 3, 2010.
- Mohri et al. (2012) Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of machine learning. MIT press, 2012.
multi bleu (2015)
- Oord et al. (2016) Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Papineni et al. (2002) Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
- Pascanu et al. (2013) Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, Kingma, Diederik P., and Bulatov, Yaroslav. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.
- Shen et al. (2016) Shen, Shiqi, Cheng, Yong, He, Zhongjun, He, Wei, Wu, Hua, Sun, Maosong, and Liu, Yang. Minimum risk training for neural machine translation. ACL, 2016.
- Sundermeyer et al. (2012) Sundermeyer, Martin, Schlüter, Ralf, and Ney, Hermann. Lstm neural networks for language modeling. In Interspeech, pp. 194–197, 2012.
- Theis et al. (2015) Theis, Lucas, Oord, Aäron van den, and Bethge, Matthias. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
- Tibshirani (1996) Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
- van den Oord et al. (2016a) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
- van den Oord et al. (2016b) van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In 33rd International Conference on Machine Learning, 2016b.
- WMT (2014) WMT. Wmt dataset for machine translation. http://www.statmt.org/wmt14/translation-task.html, 2014.
- Wu et al. (2016) Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V, Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- Xia et al. (2017) Xia, Yingce, Bian, Jiang, Qin, Tao, Yu, Nenghai, and Liu, Tie-Yan. Dual inference for machine learning. In The 26th International Joint Conference on Artificial Intelligence, 2017.
- Zeiler (2012) Zeiler, Matthew D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.