Hierarchical Reinforcement Learning for Open-Domain Dialog

by Abdelrhman Saleh, et al.

Open-domain dialog generation is a challenging problem: maximum likelihood training can lead to repetitive outputs, models have difficulty tracking long-term conversational goals, and training on standard movie or online datasets may lead to the generation of inappropriate, biased, or offensive text. Reinforcement Learning (RL) is a powerful framework that could potentially address these issues, for example by allowing a dialog model to optimize for reduced toxicity and repetitiveness. However, previous approaches that apply RL to open-domain dialog generation do so at the word level, which makes it difficult for the model to learn proper credit assignment for long-term conversational rewards. In this paper, we propose a novel approach to hierarchical reinforcement learning, VHRL, which uses policy gradients to tune the utterance-level embedding of a variational sequence model. This hierarchical approach provides greater flexibility for learning long-term conversational rewards. We use self-play and RL to optimize for a set of human-centered conversation metrics, and show that our approach provides significant improvements, in terms of both human evaluation and automatic metrics, over state-of-the-art dialog models, including Transformers.
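To make the idea of applying policy gradients at the utterance level concrete, here is a minimal toy sketch. It is not the VHRL implementation: it assumes a hypothetical setting in which each utterance is represented by one continuous embedding sampled from a Gaussian with a learnable mean, and in which the conversational reward is replaced by a stand-in (negative squared distance to a made-up target embedding). The function name `reinforce_utterance_level`, the target, and all hyperparameters are illustrative assumptions.

```python
import random


def reinforce_utterance_level(target, steps=3000, lr=0.02, sigma=0.5, seed=0):
    """Toy REINFORCE over a Gaussian utterance-level embedding.

    Hypothetical setup for illustration only: the "policy" is N(mean, sigma^2 I)
    over a single utterance embedding, and the reward is a stand-in for a
    conversation-level metric.
    """
    rng = random.Random(seed)
    mean = [0.0] * len(target)  # learnable mean of the utterance-level policy
    baseline = 0.0              # running average reward, reduces gradient variance

    for _ in range(steps):
        # One action per utterance: sample the whole embedding at once,
        # instead of taking one action per generated word.
        z = [rng.gauss(m, sigma) for m in mean]

        # Stand-in reward (assumption): higher when z is close to the target.
        reward = -sum((zi - ti) ** 2 for zi, ti in zip(z, target))
        advantage = reward - baseline

        # Score-function gradient: d/d mean log N(z; mean, sigma^2) = (z - mean) / sigma^2
        mean = [
            m + lr * advantage * (zi - m) / sigma ** 2
            for m, zi in zip(mean, z)
        ]

        # Update the baseline toward the recent average reward.
        baseline += 0.05 * (reward - baseline)

    return mean


if __name__ == "__main__":
    learned = reinforce_utterance_level([1.0, -1.0])
    print("learned utterance-level mean:", learned)
```

Because the reward is credited to a single utterance-level decision rather than spread over every generated word, the credit-assignment problem in this toy shrinks to one action per turn, which is the intuition the abstract gives for the hierarchical approach.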

