Baselines for Identifying Watermarked Large Language Models

05/29/2023
by Leonard Tang, et al.

We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed-source large language models (LLMs). We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing the distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from those of standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each identification mechanism with respect to the watermarking scenario. Along the way, we formalize the specific problem of identifying watermarks in LLMs, as well as LLM watermarks and watermark detection in general, providing a framework and foundations for studying them.
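To make the idea of diverging output distributions concrete, here is a minimal toy sketch. It is not one of the paper's baseline algorithms. It assumes a "green list" style watermark (in the spirit of Kirchenbauer et al., 2023) that adds a bias `delta` to the logits of a fixed subset of tokens for a single context, and it measures how far sampled token frequencies drift from an unmarked reference. The vocabulary size, the fixed green list, the sample counts, and every function name are hypothetical choices for illustration.

```python
import numpy as np


def softmax(logits):
    """Convert a logit vector into a probability vector."""
    z = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def watermarked_probs(logits, green_mask, delta):
    """Toy watermark (assumption): add `delta` to the logits of green tokens."""
    return softmax(logits + delta * green_mask)


def empirical_dist(probs, n_samples, rng):
    """Estimate a token distribution from samples, as one would for a hosted model."""
    samples = rng.choice(len(probs), size=n_samples, p=probs)
    counts = np.bincount(samples, minlength=len(probs))
    return counts / n_samples


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


rng = np.random.default_rng(0)
vocab_size = 50  # hypothetical toy vocabulary
logits = rng.normal(size=vocab_size)  # stand-in for one context's logits

# Exactly half of the vocabulary is marked green (fixed here for simplicity).
green_mask = (rng.permutation(vocab_size) < vocab_size // 2).astype(float)

base = softmax(logits)
reference = empirical_dist(base, 20_000, rng)  # samples from the unmarked model

for delta in (0.0, 1.0, 2.0):
    marked = watermarked_probs(logits, green_mask, delta)
    observed = empirical_dist(marked, 20_000, rng)
    print(f"delta={delta}: TV distance = {total_variation(reference, observed):.3f}")
```

With `delta = 0` the printed distance reflects sampling noise only, while larger values of `delta` push it upward, which matches the intuition that stronger watermarks are easier to identify. A real scheme would typically re-derive the green list at each generation step, and a reference distribution from an unmarked model may not be available in practice, so this remains an idealized illustration.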
