
How is BERT surprised? Layerwise detection of linguistic anomalies

by   Bai Li, et al.

Transformer language models have shown remarkable ability to detect when a word is anomalous in context, but likelihood scores offer no information about the cause of the anomaly. In this work, we use Gaussian models for density estimation at intermediate layers of three language models (BERT, RoBERTa, and XLNet), and evaluate our method on BLiMP, a grammaticality judgement benchmark. In lower layers, surprisal is highly correlated with low token frequency, but this correlation diminishes in upper layers. Next, we gather datasets of morphosyntactic, semantic, and commonsense anomalies from psycholinguistic studies; we find that the best-performing model, RoBERTa, exhibits surprisal at earlier layers for morphosyntactic anomalies than for semantic ones, while commonsense anomalies do not exhibit surprisal at any intermediate layer. These results suggest that language models employ separate mechanisms to detect different types of linguistic anomalies.
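The core idea of the method — fitting a Gaussian density to contextual token representations at a given layer, then scoring new tokens by their distance from that density — can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the synthetic features stand in for real layer activations, and the Mahalanobis-style score is one simple way to turn a fitted Gaussian into a surprisal value.

```python
import numpy as np

def fit_gaussian(features):
    # Fit a multivariate Gaussian to token feature vectors (one row per token).
    # A small ridge term keeps the covariance matrix invertible.
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def surprisal(x, mu, cov_inv):
    # Mahalanobis-style surprisal: squared distance of x from the Gaussian mean,
    # scaled by the inverse covariance. Higher score = less typical token.
    d = x - mu
    return float(d @ cov_inv @ d)

# Synthetic stand-in for one layer's activations over in-domain tokens.
rng = np.random.default_rng(0)
in_domain = rng.normal(0.0, 1.0, size=(500, 8))
mu, cov_inv = fit_gaussian(in_domain)

typical = surprisal(in_domain[0], mu, cov_inv)
anomalous = surprisal(np.full(8, 5.0), mu, cov_inv)  # point far from the training density
print(typical < anomalous)
```

Repeating this per layer is what allows the layerwise analysis: the layer at which a token's surprisal first spikes hints at the kind of anomaly involved.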



