Detecting egregious responses in neural sequence-to-sequence models

09/11/2018, by Tianxing He, et al.

In this work, we attempt to answer a critical question: does there exist an input sequence that will cause a well-trained discrete-space neural sequence-to-sequence (seq2seq) model to generate an egregious output (aggressive, malicious, attacking, etc.), and if such inputs exist, how can we find them efficiently? We adopt an empirical methodology: we first create lists of egregious outputs, then design a discrete optimization algorithm to find input sequences that generate them. The optimization algorithm is further enhanced for large-vocabulary search and constrained to input sequences that are likely to appear in real-world settings. In our experiments, we apply this approach to a dialogue response generation model trained on two real-world dialogue datasets, Ubuntu and Switchboard, and test whether the model can be made to generate malicious responses. We demonstrate that, given the trigger inputs our algorithm finds, the model assigns a large probability to a significant number of malicious sentences.
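The core idea of the search described above can be sketched as coordinate-wise hill climbing over a discrete input space: at each position, try every vocabulary token and keep the one that most increases the model's score for a fixed target "egregious" output. The sketch below is a minimal, self-contained illustration; the scorer is a hypothetical hand-crafted stand-in for log p(target | input), not the paper's actual model or algorithm, and the vocabulary, target list, and helper names are assumptions for demonstration only.

```python
# Toy stand-in for a trained seq2seq model's scorer (hypothetical):
# a real system would query the model for log p(target | input).
VOCAB = ["hello", "you", "are", "stupid", "thanks", "bye"]
TARGET = ("you", "are", "stupid")  # an entry from a list of egregious outputs

def log_prob_target_given_input(inp, target=TARGET):
    # Hypothetical scorer: rewards inputs sharing tokens with the target,
    # with a small bonus for one extra "trigger-ish" token.
    score = 0.0
    for tok in inp:
        if tok in target:
            score += 1.0
        elif tok == "hello":
            score += 0.5
    return score

def greedy_trigger_search(seq_len=3, n_rounds=5):
    """Coordinate-wise hill climbing over the discrete input space:
    for each position, substitute every vocabulary token and keep the
    substitution that most increases the target's score; repeat until
    no single-token change helps."""
    inp = [VOCAB[0]] * seq_len
    best = log_prob_target_given_input(inp)
    for _ in range(n_rounds):
        improved = False
        for pos in range(seq_len):
            for tok in VOCAB:
                cand = inp[:pos] + [tok] + inp[pos + 1:]
                s = log_prob_target_given_input(cand)
                if s > best:
                    inp, best, improved = cand, s, True
        if not improved:
            break
    return inp, best

trigger, score = greedy_trigger_search()
print(trigger, score)
```

In a real setting the inner loop is far too expensive to enumerate naively; the paper's enhancements for large-vocabulary search and the constraint to plausible real-world inputs address exactly that gap.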

Related research

03/07/2022 · Towards Robust Online Dialogue Response Generation
Although pre-trained sequence-to-sequence models have achieved great suc...

03/08/2023 · Automatically Auditing Large Language Models via Discrete Optimization
Auditing large language models for unexpected behaviors is critical to p...

05/27/2020 · Chat as Expected: Learning to Manipulate Black-box Neural Dialogue Models
Recently, neural network based dialogue systems have become ubiquitous i...

11/18/2016 · Generative Deep Neural Networks for Dialogue: A Short Review
Researchers have recently started investigating deep neural networks for...

09/13/2019 · Say What I Want: Towards the Dark Side of Neural Dialogue Models
Neural dialogue models have been widely adopted in various chatbot appli...

10/04/2020 · Meta Sequence Learning and Its Applications
We present a meta-sequence representation of sentences and demonstrate h...

07/04/2016 · Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation
Using neural networks to generate replies in human-computer dialogue sys...
