Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

05/09/2020
by   Xinting Huang, et al.
6

Dialogue policy optimization often obtains feedback until task completion in task-oriented dialogue systems. This is insufficient for training intermediate dialogue turns since supervision signals (or rewards) are only provided at the end of dialogues. To address this issue, reward learning has been introduced to learn from state-action pairs of an optimal policy to provide turn-by-turn rewards. This approach requires complete state-action annotations of human-to-human dialogues (i.e., expert demonstrations), which is labor intensive. To overcome this limitation, we propose a novel reward learning approach for semi-supervised policy learning. The proposed approach learns a dynamics model as the reward function which models dialogue progress (i.e., state-action sequences) based on expert demonstrations, either with or without annotations. The dynamics model computes rewards by predicting whether the dialogue progress is consistent with expert demonstrations. We further propose to learn action embeddings for a better generalization of the reward function. The proposed approach outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/02/2023

PAGAR: Imitation Learning with Protagonist Antagonist Guided Adversarial Reward

Imitation learning (IL) algorithms often rely on inverse reinforcement l...
research
09/21/2020

Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent Policy Optimization

In this paper, we study the problem of obtaining a control policy that c...
research
06/02/2018

Internal Model from Observations for Reward Shaping

Reinforcement learning methods require careful design involving a reward...
research
09/16/2019

Domain Transfer in Dialogue Systems without Turn-Level Supervision

Task oriented dialogue systems rely heavily on specialized dialogue stat...
research
02/19/2019

Learning to Generalize from Sparse and Underspecified Rewards

We consider the problem of learning from sparse and underspecified rewar...
research
01/28/2021

Attention Guided Dialogue State Tracking with Sparse Supervision

Existing approaches to Dialogue State Tracking (DST) rely on turn level ...
research
04/24/2018

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling

Though impressive results have been achieved in visual captioning, the t...

Please sign up or login with your details

Forgot password? Click here to reset