Off-Policy Self-Critical Training for Transformer in Visual Paragraph Generation

06/21/2020, by Shiyang Yan et al.

Recently, several approaches have been proposed to solve language generation problems. The Transformer is currently the state-of-the-art seq-to-seq model for language generation, while Reinforcement Learning (RL) is useful for mitigating exposure bias and optimising non-differentiable metrics in seq-to-seq language learning. However, the Transformer is hard to combine with RL because on-policy sampling requires costly computing resources. We tackle this problem by proposing an off-policy RL algorithm in which a behaviour policy, represented by GRUs, performs the sampling. We reduce the high variance of importance sampling (IS) by applying the truncated relative importance sampling (TRIS) technique and the concept of Kullback-Leibler (KL)-control. TRIS is a simple yet effective technique, and there is a theoretical proof that KL-control helps to reduce the variance of IS. We formulate this off-policy RL based on self-critical sequence training. Specifically, we use a Transformer-based captioning model as the target policy and an image-guided language auto-encoder as the behaviour policy to explore the environment. The proposed algorithm achieves state-of-the-art performance on visual paragraph generation and improved results on image captioning.


1 Introduction

The Transformer (self-attention) is a seq-to-seq model that has achieved breakthrough successes in natural language processing (NLP), for example in machine translation and image captioning vaswani2017attention cornia2019m. Seq-to-seq models are usually trained using either Maximum Likelihood Estimation (MLE) or Reinforcement Learning (RL) yu2017seqgan rennie2017self. In particular, RL for seq-to-seq models can tackle two problems in language generation: (1) exposure bias, i.e. the train-test discrepancy in seq-to-seq models, where training conditions on the ground truth while testing generates each new token based on the previously generated ones; and (2) gradient estimation for the optimisation of non-differentiable evaluation metrics such as BLEU or CIDEr rennie2017self. Indeed, RL has brought significant performance gains in image captioning and language generation rennie2017self yu2017seqgan. However, there is little literature on applying RL to the Transformer parisotto2019stabilizing. On-policy RL is known to be sample inefficient, and this is especially serious for the Transformer in visual paragraph generation, where the generated paragraph usually contains about 200 words or more krause2017hierarchical. Expensive computing resources are required because the gradient graph of the decoder is built at every time step during on-policy sampling, making such training practically infeasible.

Off-policy RL, by contrast, uses a separate behaviour policy to explore the environment and transfers that experience to the target policy. Off-policy learning is sample efficient gu2016q and can also largely reduce the computing resources required. In RL, the concept of off-policy learning is usually rooted in value-based RL mnih2013playing munos2016safe. However, RL in NLP usually relies on policy-based methods, e.g., REINFORCE-like williams1992simple algorithms yu2017seqgan rennie2017self and actor-critic bahdanau2016actor, because the action space (the vocabulary) is large. Value-based RL is not advantageous when dealing with such a large action space keneshloo2019deep.

Moreover, off-policy RL is sometimes inaccurate because of the discrepancy between the target and behaviour policies. True off-policy learning, where the target and behaviour policies are uncorrelated, is extremely hard fujimoto2018off. Well-known off-policy RL algorithms such as DQN mnih2013playing and DDPG lillicrap2015continuous are only capable of learning from data correlated to their current policy fujimoto2018off. A common approximation in off-policy learning is to use Importance Sampling (IS) estimators farajtabar2018more jiang2015doubly, which correct the mismatch between the distributions under the behaviour policy and the target policy. IS, however, has high variance when the two policy distributions are very different: the ratio of the two sampled probabilities becomes either very small or very large (sometimes unbounded), which leads to huge variance. This phenomenon is pronounced when the RL episode is long, as in the long-sentence generation of visual paragraphs.

Hence, we propose off-policy self-critical sequence training, based on its on-policy version rennie2017self (a REINFORCE-like Policy Gradient algorithm), and apply it to visual paragraph generation. We employ a smoothed version of IS, truncated relative importance sampling (TRIS) humayoo2018relative, to reduce the variance of conventional IS. TRIS is proven to be effective in reducing the variance of IS as it introduces a relative distribution ratio, which is bounded. There is also evidence that the Kullback-Leibler (KL) divergence between the target and behaviour policies influences the variance of IS wexler2012importance. KL-control studies an RL problem in which the agent tries to maximise the task-related reward while minimising deviation from a prior (behaviour) policy. Consequently, when training the target policy with off-policy RL, we penalise its divergence from the behaviour policy with KL-control rawlik2013stochastic: we add a KL-divergence term between the target and behaviour policies to the value function of our RL objective and incorporate it into self-critical sequence training.

Specifically, we train a Meshed Transformer cornia2019m optimised with the proposed off-policy self-critical sequence training for visual paragraph generation. We design a GRU-based image-guided language auto-encoder as the behaviour policy and treat the Transformer as the target policy. The target policy learns from self-critical rewards while minimising its divergence from the behaviour policy, which reduces the variance of TRIS. To summarise, our contributions are threefold: (1) we propose a novel off-policy self-critical sequence training framework, making RL feasible for the Transformer; (2) we reduce the variance of the IS ratio used in the off-policy RL approximation by applying TRIS and the concepts and techniques of KL-control; (3) we achieve state-of-the-art results on visual paragraph generation and improved results on image captioning. Empirical evidence also shows that the IS variance can be significantly reduced.

2 Related Works

2.1 Off-Policy RL Learning

RL with a replay buffer lin1992self can be considered a standard tool for off-policy learning mnih2015human. In these schemes, the behaviour policy in the replay buffer is still related to the target policy mnih2013playing, so this is not 'true' off-policy RL fujimoto2018off. For example, Isele et al. isele2018selective observe that an agent's performance is most reliable when the distribution of data in the replay buffer matches the test distribution.

Many approaches dudik2011doubly farajtabar2018more use IS to re-weight the probability distribution when the target policy differs from the behaviour policy. However, IS has high variance, preventing stable performance. Hanna et al. hanna2018importance use function approximation to estimate the behaviour policy and thereby reduce the variance. Liu et al. liu2018breaking model the stationary state-visitation distribution of the behaviour policy for infinite-horizon off-policy RL tasks. Humayoo et al. humayoo2018relative apply a simple technique, TRIS, to address this problem.

KL-control is a branch of stochastic optimal control in which the KL divergence from another distribution is used as a regulariser abdolmaleki2018maximum jaques2019way. An on-policy policy-gradient example is Trust Region Policy Optimisation (TRPO) schulman2015trust, where a KL penalty term is incorporated into the value function of the Policy Gradient algorithm. KL-control has also been used to improve transfer between MLE training on data and training with RL jaques2017sequence.

2.2 Image Paragraph Generation

Regions-Hierarchical krause2017hierarchical introduced the first large-scale paragraph captioning dataset, which uses images from the Visual Genome dataset with new annotations. The dataset contains more pronouns and verbs and greater diversity than single-sentence captioning datasets, which makes it more challenging.

Several approaches krause2017hierarchical liang2017recurrent chatterjee2018diverse propose hierarchical model structures to generate visual paragraphs, with an effective coupling mechanism between the sentences within a paragraph. These hierarchical architectures model each sentence and couple the sentences into a paragraph, often with superior performance over flat models krause2017hierarchical. Advanced methods such as VAEs chatterjee2018diverse and GANs liang2017recurrent have been applied to boost performance further. However, there is little literature on Transformer-like models trained with RL for visual paragraph generation, as the sampling in on-policy RL is computationally expensive.

3 Methods

In this section, we first formulate the RL setting of the Transformer in visual paragraph generation and then introduce our off-policy self-critical framework for optimisation.

3.1 Formulation of Visual Paragraph Generation in On-Policy Self-Critical Training

We consider the visual paragraph generation process as a finite Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. The Transformer can be viewed as an agent that interacts with the environment (words and image features). In the MDP, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the state transition probability, $r$ is the reward function and $\gamma$ is the discount factor. The agent selects an action $a_t$ from a conditional probability distribution called the policy $\pi_\theta(a_t \mid s_t)$, parametrised by $\theta$. In visual paragraph generation, the state is composed of the image features $I$ and the actions generated so far, $s_t = \{I, a_1, \dots, a_{t-1}\}$. Value functions are the expectation of the accumulated discounted future reward, measuring how good each state is. There are two kinds of value functions: the state value function $V^{\pi}(s_t)$ and the state-action value function $Q^{\pi}(s_t, a_t)$, defined as follows:

$V^{\pi}(s_t) = \mathbb{E}_{\pi}\big[\textstyle\sum_{l \ge 0} \gamma^{l} r_{t+l} \mid s_t\big], \qquad Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\big[\textstyle\sum_{l \ge 0} \gamma^{l} r_{t+l} \mid s_t, a_t\big]$   (1)

The agent tries to maximise the accumulated reward by updating its parameters; the loss function is expressed as follows:

$L(\theta) = -\,\mathbb{E}_{a_{1:T} \sim \pi_\theta}\big[\textstyle\sum_{t} r(s_t, a_t)\big]$   (2)

For Policy Gradient methods sutton2000policy, which are widely applied in sequence generation problems, the optimisation can be formulated as:

$\nabla_\theta L(\theta) = -\,\mathbb{E}_{a_t \sim \pi_\theta}\big[Q^{\pi}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$   (3)

The Policy Gradient estimate is unbiased but has high variance. A common way to address this issue is to subtract an arbitrary baseline $b$:

$\nabla_\theta L(\theta) = -\,\mathbb{E}_{a_t \sim \pi_\theta}\big[\big(Q^{\pi}(s_t, a_t) - b\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$   (4)

The baseline is an arbitrary function that must be independent of the action $a_t$. In self-critical sequence training, the $Q$ function in the above equations is set to the expectation of the accumulated reward. As there is no intermediate reward in the language generation task, self-critical training uses a single Monte Carlo sample to approximate the $Q$ function, which in practice is the CIDEr score of the sampled sequence $a^{s}$. Self-critical training uses the CIDEr score of the greedily decoded sequence $\hat{a}$ as the baseline to reduce the variance of the Policy Gradient:

$\nabla_\theta L(\theta) = -\,\mathbb{E}_{a^{s} \sim \pi_\theta}\big[\big(\mathrm{CIDEr}(a^{s}) - \mathrm{CIDEr}(\hat{a})\big)\, \nabla_\theta \log \pi_\theta(a^{s})\big]$   (5)
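For concreteness, the self-critical gradient in Eq. (5) reduces to a reward-weighted log-likelihood loss. Below is a minimal PyTorch sketch of this on-policy form (the tensor and function names, e.g. `log_probs`, `sample_reward`, `greedy_reward`, are ours for illustration and not from the authors' code):

```python
import torch

def self_critical_loss(log_probs, sample_reward, greedy_reward, mask):
    """Minimal sketch of on-policy self-critical sequence training (Eq. 5).

    log_probs:     (B, T) log pi_theta(a_t | s_t) of the *sampled* tokens
    sample_reward: (B,)   CIDEr of the multinomially sampled sequence
    greedy_reward: (B,)   CIDEr of the greedily decoded baseline sequence
    mask:          (B, T) 1 for real tokens, 0 for padding
    """
    # Advantage: sampled reward minus the greedy baseline.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)          # (B, 1)
    # REINFORCE: minimise the negative advantage-weighted log-likelihood.
    loss = -(advantage.detach() * log_probs * mask).sum() / mask.sum()
    return loss

# Toy usage with placeholder tensors (batch of 2, sequence length 5).
log_probs = torch.randn(2, 5, requires_grad=True)   # stand-in per-token log-probs
loss = self_critical_loss(log_probs,
                          sample_reward=torch.tensor([0.8, 0.3]),
                          greedy_reward=torch.tensor([0.5, 0.4]),
                          mask=torch.ones(2, 5))
loss.backward()
```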

3.2 The Proposed Off-Policy Self-Critical Training

One reason for the instability of off-policy learning is the discrepancy between the distributions of the target and behaviour policies: we wish to estimate expectations under the target policy but can only sample data from the behaviour policy.

Importance Sampling (IS).

IS huber1981wiley precup2000eligibility is a classical approach for handling the discrepancy between the target and behaviour policies. If the behaviour policy is $\beta(a \mid s)$ and the sampled sequence $a^{s}$ is drawn from $\beta$, then IS in off-policy self-critical learning can be written as:

$\nabla_\theta L(\theta) = -\,\mathbb{E}_{a^{s} \sim \beta}\big[\rho\, \big(\mathrm{CIDEr}(a^{s}) - \mathrm{CIDEr}(\hat{a})\big)\, \nabla_\theta \log \pi_\theta(a^{s})\big]$   (6)

where $\rho = \pi_\theta(a^{s} \mid s)/\beta(a^{s} \mid s)$ is the importance ratio. IS has high variance, especially when the discrepancy between the distributions of the target and behaviour policies is large, as the probability ratio becomes unstable.
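A minimal sketch of the plain importance ratio follows, assuming it is taken at the sequence level (a product of per-token ratios), consistent with the remark below that the plain ratio involves a product over the sequence; the function and argument names are ours:

```python
import torch

def importance_ratio(target_logp, behaviour_logp, mask):
    """Plain sequence-level importance sampling ratio.

    target_logp, behaviour_logp: (B, T) per-token log-probabilities, under the
    target and behaviour policies, of the tokens sampled from the behaviour policy.
    """
    # Product of per-token ratios == exp of the summed log-ratio over the sequence.
    log_ratio = ((target_logp - behaviour_logp) * mask).sum(dim=1)    # (B,)
    return log_ratio.exp()    # unbounded: can explode for long sequences
```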

Truncated Relative Importance Sampling (TRIS).

Relative Importance Sampling (RIS) sugiyama2015introduction yamada2013relative humayoo2018relative can be applied to smooth IS and thereby reduce its variance:

$\rho_\lambda = \dfrac{\pi_\theta(a^{s} \mid s)}{\lambda\, \pi_\theta(a^{s} \mid s) + (1-\lambda)\, \beta(a^{s} \mid s)}, \qquad \lambda \in [0, 1]$   (7)

The RIS ratio is bounded, as it is no greater than $1/\lambda$, as proved in humayoo2018relative. Accordingly, RIS has bounded variance and low bias sugiyama2015introduction, and the ratio no longer involves a product of a sequence of unbounded values.

Truncated Relative Importance Sampling (TRIS), expressed as $\max(\rho_\lambda, c)$, further stabilises training by truncating the minimum value of the ratio to $c$, which introduces a lower bound on RIS.
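The sketch below illustrates RIS and TRIS as reconstructed above (relative density-ratio form bounded by $1/\lambda$, lower-truncated at $c$); the per-token granularity and the names `lam` and `c` are our assumptions:

```python
import torch

def ris_ratio(target_logp, behaviour_logp, lam=0.5):
    """Relative importance sampling ratio, assumed form
    pi / (lam * pi + (1 - lam) * beta); bounded above by 1 / lam."""
    pi = target_logp.exp()
    beta = behaviour_logp.exp()
    return pi / (lam * pi + (1.0 - lam) * beta)

def tris_ratio(target_logp, behaviour_logp, lam=0.5, c=0.95):
    """Truncated RIS: clamp the ratio from below at c so it cannot vanish."""
    return ris_ratio(target_logp, behaviour_logp, lam).clamp(min=c)
```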

Penalty in Policy Gradient.

A way to further reduce the variance of TRIS and stabilise training is to encourage the learnt policy to stay close to the behaviour policy wu2019behavior. We therefore penalise the KL divergence in the value function, which relates to the KL-control problem in which a KL penalty is introduced into the value function. Formally, with the TRIS-weighted self-critical reward from Eq. (6), we define the penalised loss objective as:

$L(\theta) = -\,\mathbb{E}_{a^{s} \sim \beta}\big[\rho_\lambda\, \big(\mathrm{CIDEr}(a^{s}) - \mathrm{CIDEr}(\hat{a})\big)\big] + \alpha\, D\big[\pi_\theta(\cdot \mid s) \,\|\, \beta(\cdot \mid s)\big]$   (8)

where $\rho_\lambda$ is the RIS ratio and $D[\cdot \,\|\, \cdot]$ is a divergence between distributions over actions, such as the KL divergence. This formulation penalises the target policy for diverging from the behaviour policy. Since $D_{\mathrm{KL}}[\pi_\theta \,\|\, \beta] = \mathbb{E}_{\pi_\theta}[\log \pi_\theta - \log \beta]$, the loss objective is equivalent to the following expression at the action level:

$L(\theta) = -\,\mathbb{E}_{a^{s} \sim \beta}\Big[\rho_\lambda\, \big(\mathrm{CIDEr}(a^{s}) - \mathrm{CIDEr}(\hat{a}) + \alpha \log \beta(a^{s} \mid s) - \alpha \log \pi_\theta(a^{s} \mid s)\big)\Big]$   (9)

where the term $\alpha \log \beta(a^{s} \mid s)$ rewards the model for choosing actions that have high probability under the prior (behaviour) policy, $-\alpha \log \pi_\theta(a^{s} \mid s)$ acts as entropy regularisation ahmed2018understanding, which is important for efficient exploration in RL, and $\alpha$ is the coefficient weight that controls the contribution of the penalty term.
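The following sketch puts the pieces together under one plausible reading of Eqs. (8)-(9) as reconstructed here: TRIS re-weighting of a self-critical advantage plus the action-level KL-style penalty. The hyper-parameter names (`lam`, `c`, `alpha`) and the exact weighting are our assumptions, not the authors' released implementation:

```python
import torch

def off_policy_self_critical_loss(target_logp, behaviour_logp,
                                  sample_reward, greedy_reward, mask,
                                  lam=0.5, c=0.95, alpha=0.1):
    """Sketch of an off-policy self-critical objective with a KL-style penalty.

    The sampled tokens come from the behaviour policy; target_logp and
    behaviour_logp are their (B, T) log-probabilities under each policy.
    """
    pi = target_logp.exp()
    beta = behaviour_logp.exp()
    # Truncated relative importance sampling weight (per token), treated as constant.
    ris = pi / (lam * pi + (1.0 - lam) * beta)
    tris = ris.clamp(min=c).detach()
    # Action-level penalised reward: advantage + alpha*log(beta) - alpha*log(pi).
    advantage = (sample_reward - greedy_reward).unsqueeze(1)              # (B, 1)
    penalised = advantage + alpha * behaviour_logp.detach() - alpha * target_logp.detach()
    # TRIS-weighted, reward-weighted negative log-likelihood of the target policy.
    loss = -(tris * penalised * target_logp * mask).sum() / mask.sum()
    return loss
```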

The Rationale of Combining TRIS with KL-control.

A fundamental issue of IS is the choice of the importance function, which in our case is the behaviour policy $\beta$. In the general setting, IS estimates an expectation under a target distribution $P(x \mid e)$, where $x$ is the instantiation of the variables in a sample and $e$ is the observed evidence, by sampling from an importance function $Q(x)$. The optimal importance function is $Q^{*}(x) = P(x \mid e)$, which is proportional to $P(x, e)$ and leads to zero variance of the IS estimator. In practice, the optimal $Q^{*}$ is hard to sample from, so many researchers seek other ways to reduce the variance of IS.

While the optimal $Q^{*}$ is hard to find, the KL divergence between the two distributions significantly affects the variance of IS, as proved in wexler2012importance: for two importance functions $Q_1$ and $Q_2$, an increase in the KL divergence from the target distribution can increase the variance of the IS estimator exponentially in that increase. Accordingly, even a small change in the KL divergence can exponentially alter the variance of IS and RIS. Consequently, we penalise the target policy when it diverges from the behaviour policy, in order to further reduce the variance of TRIS.
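A toy numerical illustration of this effect (our own, not from the paper): estimating $\mathbb{E}_p[x^2]$ for a standard Gaussian $p$ by importance sampling from a shifted Gaussian $q$, the spread of the per-sample IS terms grows rapidly as $D_{\mathrm{KL}}(p\,\|\,q)$ grows:

```python
import torch

torch.manual_seed(0)
p = torch.distributions.Normal(0.0, 1.0)        # target distribution

def per_sample_is_variance(q_mean, n=100_000):
    """Variance of the per-sample IS terms w * x^2 when sampling from q."""
    q = torch.distributions.Normal(q_mean, 1.0)
    x = q.sample((n,))
    w = (p.log_prob(x) - q.log_prob(x)).exp()   # importance weights p(x)/q(x)
    var = (w * x**2).var().item()
    kl = torch.distributions.kl_divergence(p, q).item()
    return var, kl

for mu in [0.0, 0.5, 1.0, 2.0]:
    var, kl = per_sample_is_variance(mu)
    print(f"KL(p||q) = {kl:.2f}  ->  per-sample IS variance ~ {var:.2f}")
```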

3.3 The Model Structure

The agent we utilise is the Meshed Transformer cornia2019m, which shows state-of-the-art performance in image captioning. We use a GRU-based image-guided language auto-encoder as the behaviour policy, as shown in Figure 1. The input (ground-truth) paragraph is encoded via a GRU-based language encoder into a hidden vector $h$. Then, at every time step of the language decoder, we feed the region image features extracted from a pre-trained Faster R-CNN model, denoted as $V = \{v_1, \dots, v_K\}$ with each $v_i$ of dimension 4,096, to a visual attention module xu2015show. Hence the input to the language decoder (a GRU model) at each time step is:

$x_t = \big[\, e(w_{t-1});\ \mathrm{Att}(V, h_{t-1})\,\big]$   (10)

where $e(w_{t-1})$ is the embedding of the previously generated word and $\mathrm{Att}(V, h_{t-1})$ is the attended visual feature given the decoder's previous hidden state $h_{t-1}$. The hidden state of the language decoder is initialised with $h$, so the language auto-encoder can be considered image-guided; $h$ is then decoded back into a paragraph. We use this language auto-encoder as the behaviour policy to explore the environment. To approximate the off-policy learning, TRIS and a KL-divergence penalty are utilised in training.
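A minimal sketch of one decoding step of such an image-guided GRU behaviour policy is given below. The layer sizes, the additive attention form, and all module names are our assumptions for illustration; only the 4,096-dimensional Faster R-CNN features and the concatenated input of Eq. (10) follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageGuidedDecoderStep(nn.Module):
    """One decoding step of the GRU behaviour policy (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # FC on region features
        self.att = nn.Linear(hidden_dim * 2, 1)            # simple additive-style score
        self.gru = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, regions, h):
        # regions: (B, K, feat_dim) Faster R-CNN features; h: (B, hidden_dim)
        v = self.feat_proj(regions)                                    # (B, K, H)
        scores = self.att(torch.cat([v, h.unsqueeze(1).expand_as(v)], dim=-1))
        alpha = F.softmax(scores, dim=1)                               # attention weights
        v_hat = (alpha * v).sum(dim=1)                                 # attended visual context
        x = torch.cat([self.embed(prev_word), v_hat], dim=-1)          # Eq. (10)-style input
        h = self.gru(x, h)                                             # new hidden state
        return self.out(h), h                                          # word logits, hidden

# Toy usage: batch of 2, 36 regions, vocabulary of 1,000 words.
step = ImageGuidedDecoderStep(vocab_size=1000)
logits, h = step(torch.randint(0, 1000, (2,)),
                 torch.randn(2, 36, 4096),
                 torch.zeros(2, 512))
```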

Figure 1: The off-policy self-critical framework for visual paragraph generation. The image is first input to a Faster R-CNN model ren2015faster to extract region features, each with a dimension of 4,096. After a fully-connected (FC) transformation, the features are forwarded to the Transformer for training. Meanwhile, the input paragraph is encoded via a GRU encoder into a hidden vector. The hidden vector, along with the visual features, is then input to a GRU decoder to perform multinomial sampling. The sampled words are forwarded to the Transformer to obtain the action probabilities. The self-critical reward obtained from the GRU decoder is combined with a KL penalty term, which reduces the variance of the TRIS ratio used to re-weight the probabilities. Best viewed in colour.

3.4 Training Algorithm

We first train the image-guided language auto-encoder using the image-paragraph pairs provided with the dataset; it is then used as the behaviour policy. The Transformer cornia2019m is pre-trained with the standard MLE scheme on the same dataset. We then treat the Transformer as the target policy and start the off-policy Policy Gradient training described above. When training under RL, the total objective is a combination of the MLE loss and the RL loss:

$L_{\mathrm{total}} = \eta\, L_{\mathrm{RL}} + (1 - \eta)\, L_{\mathrm{MLE}}$   (11)

where $\eta$ is a weighting coefficient and the MLE loss minimises the negative log-probabilities of each generated word token given the previously generated word tokens.
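A short sketch of the combined objective is given below, assuming the convex-combination form of Eq. (11) as reconstructed above; the names `eta` and `total_loss` are ours, and the exact values of the mixing coefficient are not recoverable from the text:

```python
import torch
import torch.nn.functional as F

def total_loss(rl_loss, logits, targets, pad_idx=0, eta=0.5):
    """Combined objective: weighted sum of the off-policy RL loss and the
    MLE (token-level cross-entropy) loss, mixed by the coefficient eta."""
    mle = F.cross_entropy(logits.view(-1, logits.size(-1)),
                          targets.view(-1), ignore_index=pad_idx)
    return eta * rl_loss + (1.0 - eta) * mle
```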

4 Experiments

We conduct experiments on two use cases of off-policy self-critical training for image-based language generation: visual paragraph generation and image captioning. The merits of our algorithm lie mainly in tackling long-sequence generation with Transformer models, e.g., visual paragraph generation. Image captioning generates a single caption for a given image and can be combined with on-policy self-critical training cornia2019m; nevertheless, we apply the proposed method to image captioning as well.

4.1 Visual Paragraph Generation

Implementation Details.

We experiment on the Stanford Visual Paragraph dataset krause2017hierarchical, in which each image is paired with one paragraph. The training, validation and testing sets contain 14,575, 2,487 and 2,489 images, respectively. We evaluate the BLEU, METEOR, ROUGE-L and CIDEr scores of the generated paragraphs. For the MLE baseline, we train the model for 40 epochs. For our off-policy self-critical algorithm, we further train the model for 8 epochs using a combination of off-policy RL and MLE. We evaluate after every epoch and use early stopping on the CIDEr score to choose the best model. The learning rate is set to 4e-4 for MLE training and 4e-5 for off-policy self-critical training. We use the Adam optimiser kingma2014adam with stochastic back-propagation and a batch size of 20. Our experiments are conducted with PyTorch 1.2.0 on a server equipped with an NVIDIA 2080-Ti GPU.
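For reference, the reported training hyper-parameters can be summarised as in the sketch below; `model` is only a stand-in placeholder for the Meshed-Transformer captioner, and the two-optimiser arrangement is our illustrative choice:

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the Meshed-Transformer captioner

MLE_EPOCHS, RL_EPOCHS = 40, 8   # MLE pre-training, then off-policy RL fine-tuning
BATCH_SIZE = 20
mle_optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)   # MLE stage
rl_optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)    # off-policy RL stage
# Early stopping: evaluate CIDEr on the validation set after every epoch
# and keep the best checkpoint.
```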

Ablation Studies.

Firstly, we compare two behaviour policies, the visual attention-based captioning model xu2015show and our image-guided language auto-encoder, as shown in Table 1. The attention model yields poorer performance than our auto-encoder because the auto-encoder also exploits the language information.

The impact of the choice of behaviour policy on the target policy is less pronounced, as revealed in Table 2. This shows that: (1) the behaviour policy is only used for exploration in RL and, in theory, does not determine the target policy; and (2) a behaviour policy that selects better actions can still benefit the target policy, as the rewards tend to be more positive.

RIS can reduce the variance of IS via a simple linear transformation. As the reduced variance leads to more stable training, performance improves, as shown in Table 3. TRIS further boosts performance as it additionally introduces a lower bound on the RIS ratio. This lower bound keeps the ratio well above zero in most cases, leading to more effective training.

The KL-control technique described above penalises the target policy when it diverges from the behaviour policy and thus reduces the variance of IS. The results are shown in Table 3 and Table 4: TRIS with KL-control increases the final performance of the target policy.

We also study the truncation value $c$ in TRIS, as presented in Table 6. A suitable $c$ is critical for maintaining performance, as it directly affects the TRIS ratio; the best-performing setting is reported in Table 6.

The coefficient of the off-policy policy gradient also has an impact on performance; an appropriate value strikes the right balance between supervised learning and off-policy RL learning, as shown in Table 5.

Figure 2: The IS ratio under different schemes: (a) IS ratio; (b) RIS ratio; (c) RIS ratio with KL penalty; (d) TRIS ratio with KL penalty. The x-axis shows iterations and the y-axis shows the ratio value. TRIS and the KL penalty have a clear effect on the ratio.

We plot the IS ratio during training versus the iteration. We run the RL training for 2,000 iterations with a batch size of 20, as shown in Figure 2. The plain IS ratio reaches very high values (more than 3,000) at around iterations 1,600 and 2,000, since it is unbounded. RIS with a relative ratio of 0.5 significantly reduces the variance, keeping the ratio below 0.07, in stark contrast to the IS ratio. KL-control further reduces the variance of the RIS ratio, limiting it to below 0.05. TRIS introduces a lower bound of 0.95 on the ratio, leading to stable training.

Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
behaviour Policy xu2015show 22.4 10.2 4.3 1.7 9.5 24.4 7.0
Our behaviour Policy 46.3 30.8 20.9 14.5 18.0 41.5 66.7
Table 1: The performance of behaviour policies.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
off-policy with behaviour Policy xu2015show 37.8 22.1 13.0 7.6 14.9 29.2 14.1
off-policy with our behaviour Policy 41.9 24.8 14.8 8.9 16.6 29.8 19.0
Table 2: The impact of behaviour policies on the performance of the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
IS, 41.9 24.8 14.8 8.9 16.6 29.8 19.0
IS + KL, 42.1 24.2 14.1 9.2 16.5 28.2 16.9
RIS + KL, 43.1 25.5 15.2 9.0 16.9 29.5 20.0
TRIS + KL, 42.7 25.7 15.5 9.4 16.9 30.2 20.9
Table 3: The impact of TRIS on the performance of the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
RIS+KL, 42.0 24.4 14.4 8.6 16.5 29.0 19.9
RIS+KL, 42.9 25.4 15.1 8.9 16.7 29.7 20.2
RIS+KL, 42.5 25.4 15.3 9.2 16.7 30.1 19.5
Table 4: The impact of coefficient weight of KL-control on the performance on the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
RIS+KL, 42.3 24.8 14.6 8.5 16.5 29.1 18.8
RIS+KL, 43.1 25.5 15.2 9.0 16.9 29.5 20.0
RIS+KL, 42.3 25.1 15.0 8.9 16.6 29.6 20.2
RIS+KL, 42.5 25.4 15.3 9.2 16.7 30.1 19.5
Table 5: The impact of the coefficient of the off-policy policy gradient on the performance.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
TRIS+KL, 44.2 26.8 16.2 9.8 17.2 30.8 20.6
TRIS+KL, 42.7 25.7 15.5 9.4 16.9 30.2 20.9
TRIS+KL, 41.6 24.7 14.7 8.7 16.5 29.5 19.9
Table 6: The impact of the truncated value on the performance of TRIS.

Comparison with the State-of-the-art.

The comparison of our scheme with the current leading methods is shown in Table 7. We achieve state-of-the-art results using our algorithm with a Transformer optimised on CIDEr. Our results even significantly outperform the human annotations on BLEU scores, and the CIDEr score is also state-of-the-art.

Category Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR CIDEr
Flat Models Sentence-Concat krause2017hierarchical 31.1 15.1 7.6 4.0 12.1 6.8
Template krause2017hierarchical 37.5 21.0 12.0 7.4 14.3 12.2
Image-Flat krause2017hierarchical 34.0 19.1 12.2 7.7 12.8 11.1
Top-down Attention anderson2018bottom 32.8 19.0 11.4 6.9 12.9 13.7
self-critical rennie2017self 29.7 16.5 9.7 5.9 13.6 13.8
DAM-Att wang2018look 35.0 20.2 11.7 6.6 13.9 17.3
Meshed Transformer + MLE cornia2019m 37.5 22.3 13.7 8.4 15.4 16.1
Hierarchical Models Regions-Hierarchical krause2017hierarchical 41.9 24.1 14.2 8.7 16.0 13.5
RTT-GAN liang2017recurrent 42.0 24.9 14.9 9.0 17.1 16.9
Diverse (VAE) chatterjee2018diverse 42.4 25.6 15.2 9.4 18.6 20.9
ParaCNN yan2020paracnn 42.0 25.0 14.9 8.8 17.0 20.4
Ours Meshed Transformer cornia2019m + off-policy (c = 0.95) 42.7 25.7 15.5 9.4 16.9 20.9
Meshed Transformer cornia2019m + off-policy (c = 0.96) 44.2 26.8 16.2 9.8 17.2 20.6
Human krause2017hierarchical Annotations 42.9 25.7 15.6 9.7 19.2 28.6
Table 7: The Performance Comparison with the State-of-the-art Methods on the Stanford Visual Paragraph Dataset.

4.2 Extending to the Convolutional Model and Image Captioning

Convolutional captioning aneja2018convolutional has a parallel-computing property similar to the Transformer. The captions to be generated are shorter than paragraphs, requiring relatively fewer GPU resources. Nevertheless, we also test our off-policy learning on convolutional image captioning. Following the practice of aneja2018convolutional, we experiment on the MS-COCO dataset lin2014microsoft under the 'Karpathy' split and report the results in Table 8. We follow the training protocol of aneja2018convolutional for the baseline and further train the model for 5 epochs using our off-policy self-critical algorithm. Our method improves convolutional captioning on almost every language-evaluation metric. Notably, CIDEr is significantly enhanced, as our off-policy RL optimises towards the CIDEr score.

Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Conv Captioning aneja2018convolutional 71.0 53.6 38.9 27.9 24.0 51.9 88.1
Our off-policy learning 70.9 53.8 39.3 28.3 24.2 52.0 90.0
Table 8: The impact of off-policy RL on convolutional image captioning.

5 Conclusions

Transformer- and convolution-based seq-to-seq models are difficult to optimise with on-policy RL for visual paragraph generation, as the computing resources required are beyond typical equipment. Hence, we propose an off-policy self-critical algorithm in which a GRU-based model serves as the behaviour policy and performs the sampling in RL far more efficiently. To better approximate the off-policy RL, both TRIS and a KL-divergence penalty term are applied to reduce the high variance of the IS approximation. With the Transformer thus empowered with RL, we achieve state-of-the-art results on visual paragraph generation and improved results on image captioning.

Broader Impact

In this paper, we introduce an off-policy self-critical sequence training algorithm, especially targeting Transformer-like models in visual paragraph generation, which makes it feasible to combine these advanced models with reinforcement learning (RL).

Language generation can be viewed as a sequential decision-making process: at each time step, the agent selects a word from a pre-defined vocabulary until the whole sentence or paragraph is generated. Previous studies usually have the agent act on-policy. Off-policy learning, however, spares the agent from costly real exploration and is sample efficient; the Transformer is especially expensive in on-policy exploration, making on-policy RL for the Transformer infeasible. One impact of this work is that the proposed algorithm not only saves computing resources directly by avoiding real exploration with the Transformer, but also demonstrates the possibilities of off-policy RL in language generation tasks.

This research is also a test of how off-policy RL performs in problems with large action spaces. Off-policy learning is usually rooted in value-based RL algorithms, where the action space is small. Instead, we propose a policy gradient method, without Temporal Difference (TD) bootstrapping, that directly transfers the Monte Carlo experience of the behaviour policy to the target policy. In general, policy gradient methods behave better with function approximation, whereas TD methods are more readily applied to off-policy learning. The main obstacle is the high variance of the off-policy estimate of the policy gradient; we show that the off-policy policy gradient is feasible if this variance is handled properly.

A drawback of the proposed algorithm is that it introduces an additional RNN-based model as the behaviour policy, which increases the number of model parameters. Further research could therefore focus on reducing the complexity of the training models while achieving the same effect as the off-policy RL.

In summary, this research enables existing natural language processing (NLP) models such as the Transformer to perform off-policy RL learning, which is a novel way of training the Transformer. It also provides insights for other RL learning schemes, for instance actor-critic learning, in various NLP tasks.

References