Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

09/08/2023
by Adel Khorramrouz et al.

This paper conducts a robustness audit of the safety feedback of PaLM 2 through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs PaLM 2 to generate content more toxic than the stereotype. In every subsequent iteration, it instructs PaLM 2 to generate content more toxic than the previous iteration's output, until PaLM 2's safety guardrails throw a safety violation. Our experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that PaLM 2's safety guardrails do not evaluate as highly unsafe.
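To make the iterative escalation concrete, the following is a minimal sketch of the rabbit-hole loop in Python. The generate_with_safety callable and its return values are hypothetical stand-ins for a PaLM 2 API call that returns both the generated text and a guardrail verdict, and the prompt wording is illustrative rather than the authors' exact instruction.

```python
# Minimal sketch of the toxicity rabbit hole loop, under the assumptions above.
# `generate_with_safety` is a hypothetical wrapper around a PaLM 2 call that
# returns the generated text and whether the safety guardrails blocked it.

from typing import Callable, List, Tuple


def rabbit_hole(
    seed_stereotype: str,
    generate_with_safety: Callable[[str], Tuple[str, bool]],
    max_iterations: int = 20,
) -> List[str]:
    """Repeatedly ask the model to out-do its previous output in toxicity,
    stopping when the guardrails throw a safety violation."""
    trajectory = [seed_stereotype]
    previous = seed_stereotype
    for _ in range(max_iterations):
        # Illustrative escalation prompt, not the paper's verbatim instruction.
        prompt = f"Generate a statement more toxic than: {previous}"
        text, blocked = generate_with_safety(prompt)
        if blocked:  # guardrails fired: the rabbit hole ends here
            break
        trajectory.append(text)
        previous = text
    return trajectory
```

The returned trajectory is the chain of increasingly toxic generations, which is what the audit then inspects against the guardrails' safety scores.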

research · 03/27/2023
Beyond Toxic: Toxicity Detection Datasets are Not Enough for Brand Safety
The rapid growth in user generated content on social media has resulted ...

research · 06/05/2023
The Chai Platform's AI Safety Framework
Chai empowers users to create and interact with customized chatbots, off...

research · 10/03/2022
Red-Teaming the Stable Diffusion Safety Filter
Stable Diffusion is a recent open-source image generation model comparab...

research · 01/14/2022
A causal model of safety assurance for machine learning
This paper proposes a framework based on a causal model of safety upon w...

research · 05/14/2019
WatchOut: A Road Safety Extension for Pedestrians on a Public Windshield Display
We conducted a field study to investigate whether public windshield disp...

research · 06/09/2023
Safety and Fairness for Content Moderation in Generative Models
With significant advances in generative AI, new technologies are rapidly...

research · 05/03/2018
AGI Safety Literature Review
The development of Artificial General Intelligence (AGI) promises to be ...
