How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

03/07/2023
by Yuchen Li, et al.

While the successes of transformers across many domains are indisputable, accurate understanding of their learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks that include a variety of structured and reasoning tasks, but mathematical understanding lags substantially behind. Recent lines of work have begun studying representational aspects of this question: that is, the size/depth/complexity of attention-based networks needed to perform certain tasks. However, there is no guarantee that the learning dynamics will converge to the constructions proposed. In our paper, we provide a fine-grained mechanistic understanding of how transformers learn "semantic structure", understood as capturing the co-occurrence structure of words. Precisely, through a combination of mathematical analysis and experiments on synthetic data generated from a Latent Dirichlet Allocation (LDA) model as well as on Wikipedia data, we show that both the embedding layer and the self-attention layer encode the topical structure. In the former, this manifests as a higher average inner product between the embeddings of same-topic words; in the latter, as higher average pairwise attention between same-topic words. The mathematical results rely on several assumptions that make the analysis tractable, which we verify on data and which may be of independent interest as well.
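To make the two measurements concrete, here is a minimal, self-contained Python sketch (not the authors' code): it samples documents from an LDA generative model and compares the average embedding inner product for same-topic versus cross-topic word pairs. The co-occurrence-SVD embeddings are an assumed stand-in for the trained transformer's embedding layer, and all sizes and hyperparameters (num_topics, vocab_size, alpha, beta, the rank of 32) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LDA sizes (illustrative assumptions, not the paper's settings).
num_topics, vocab_size, num_docs, doc_len = 5, 500, 2000, 64
alpha, beta = 0.1, 0.05

# Topic-word distributions; each word is labeled by its most probable topic,
# which is what "same-topic word pair" means in the comparison below.
topic_word = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)
word_topic = np.argmax(topic_word, axis=0)

# Sample documents from the LDA generative process.
docs = []
for _ in range(num_docs):
    theta = rng.dirichlet(np.full(num_topics, alpha))        # per-doc topic mixture
    topics = rng.choice(num_topics, size=doc_len, p=theta)   # topic of each token
    words = np.array([rng.choice(vocab_size, p=topic_word[t]) for t in topics])
    docs.append(words)

# Stand-in embeddings: SVD of the log document co-occurrence matrix.
# (Assumption: the paper inspects a trained transformer's embedding layer;
# this is only to make the same-topic metric concrete and runnable.)
cooc = np.zeros((vocab_size, vocab_size))
for words in docs:
    uniq = np.unique(words)
    cooc[np.ix_(uniq, uniq)] += 1.0
u, s, _ = np.linalg.svd(np.log1p(cooc), full_matrices=False)
emb = u[:, :32] * np.sqrt(s[:32])                            # rank-32 embeddings

# Average inner product for same-topic vs. cross-topic word pairs.
gram = emb @ emb.T
same = word_topic[:, None] == word_topic[None, :]
off_diag = ~np.eye(vocab_size, dtype=bool)
print("same-topic  mean inner product:", gram[same & off_diag].mean())
print("cross-topic mean inner product:", gram[~same & off_diag].mean())
```

With these toy settings, the same-topic average should come out noticeably larger than the cross-topic one, mirroring the qualitative claim about the embedding layer; the analogous attention-based measurement would replace the Gram matrix with averaged attention scores from a trained model.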


