Transformers are Universal Predictors

07/15/2023
by Sourya Basu et al.

We find limits of the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
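For readers unfamiliar with the information-theoretic notion invoked here, the following is a minimal sketch, not from the paper, of what "universal prediction" means under log-loss: a predictor whose per-symbol regret against the best fixed source chosen in hindsight vanishes as the sequence length grows. It uses the classical Krichevsky-Trofimov (KT) estimator on binary sequences as the universal predictor; the paper argues Transformers have an analogous property, which this toy example does not attempt to reproduce, and all helper names below are illustrative.

import math
import random

def kt_prob_of_one(ones: int, zeros: int) -> float:
    """KT add-1/2 estimate of P(next symbol = 1) given counts so far."""
    return (ones + 0.5) / (ones + zeros + 1.0)

def log_loss_bits(p: float) -> float:
    """Log-loss (in bits) of assigning probability p to the observed symbol."""
    return -math.log2(p)

def kt_cumulative_loss(seq) -> float:
    """Total log-loss of the sequential KT predictor over the sequence."""
    ones = zeros = 0
    total = 0.0
    for x in seq:
        p1 = kt_prob_of_one(ones, zeros)
        total += log_loss_bits(p1 if x == 1 else 1.0 - p1)
        ones += x
        zeros += 1 - x
    return total

def best_iid_loss(seq) -> float:
    """Loss of the best Bernoulli source chosen in hindsight (empirical rate)."""
    n, k = len(seq), sum(seq)
    if k in (0, n):
        return 0.0
    q = k / n
    return -(k * math.log2(q) + (n - k) * math.log2(1 - q))

random.seed(0)
for n in (100, 1000, 10000):
    seq = [1 if random.random() < 0.3 else 0 for _ in range(n)]
    regret = kt_cumulative_loss(seq) - best_iid_loss(seq)
    # Per-symbol regret shrinks roughly like (log n) / (2n): universality.
    print(f"n={n:6d}  per-symbol regret = {regret / n:.5f} bits")

Running this shows the per-symbol regret decaying toward zero as n grows, which is the asymptotic sense in which a predictor is called universal over the Bernoulli class.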
