Eliciting Latent Predictions from Transformers with the Tuned Lens

03/14/2023
by   Nora Belrose, et al.

We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable, and unbiased than the logit lens. With causal experiments, we show that the tuned lens uses features similar to those of the model itself. We also find that the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
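To make the method concrete, here is a minimal sketch of the two lenses described above. The names `h`, `W_U`, `A`, and `b` are illustrative assumptions, not the paper's code: `h` stands for a layer's hidden state, `W_U` for the model's unembedding matrix, and `(A, b)` for the learned per-layer affine translator. The logit lens pushes the hidden state directly through the unembedding; the tuned lens first applies the affine translator, which is trained to minimize the KL divergence to the final layer's output distribution.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(h, W_U):
    """Logit lens: decode a hidden state h (d,) directly
    through the unembedding W_U (d, vocab)."""
    return softmax(h @ W_U)

def tuned_lens(h, A, b, W_U):
    """Tuned lens: apply a learned affine translator (A, b)
    to the hidden state before the shared unembedding."""
    return softmax((h @ A + b) @ W_U)

def translator_loss(p_lens, p_final, eps=1e-12):
    """Training objective (sketch): KL divergence from the
    final-layer distribution p_final to the lens output p_lens."""
    return float(np.sum(p_final * (np.log(p_final + eps) - np.log(p_lens + eps))))
```

In practice the translators are trained with gradient descent on this KL objective, one `(A, b)` pair per transformer block, while the model itself stays frozen; initializing `A` near the identity makes the tuned lens start out close to the logit lens.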


