N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations

10/22/2020
by Aaron Baier-Reinio, et al.

We use neural ordinary differential equations (ODEs) to formulate a variant of the Transformer that is depth-adaptive in the sense that the ODE solver takes an input-dependent number of time steps. Our goal in proposing the N-ODE Transformer is to investigate whether its depth-adaptivity helps overcome specific known theoretical limitations of the Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations that can be overcome only by using a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not remedy the inherently nonlocal nature of the parity problem, and we explain why this is so. We then pursue regularization of the N-ODE Transformer by penalizing the arclength of the ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. We suggest avenues for future research on modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency for sequence-modelling tasks such as neural machine translation.
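The abstract describes three ingredients: a Transformer layer reused as the vector field of an ODE, an adaptive solver whose number of steps depends on the input, and an arclength penalty on the ODE trajectory. The sketch below is a minimal illustration of that idea under stated assumptions; it is not the authors' implementation. The class and function names (ODETransformerBlock, integrate_adaptive) and the simple Heun/Euler step-doubling solver are assumptions made for this example, standing in for whatever architecture and adaptive solver the paper actually uses.

```python
# Minimal sketch (not the authors' code): a Transformer layer as ODE dynamics
# dx/dt = f(t, x), integrated by an adaptive solver whose accepted-step count
# depends on the input, with a crude arclength estimate for regularization.
import torch
import torch.nn as nn


class ODETransformerBlock(nn.Module):
    """Self-attention + feed-forward sub-layer used as the ODE vector field."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.nfe = 0  # number of function evaluations (a proxy for "depth")

    def forward(self, t: float, x: torch.Tensor) -> torch.Tensor:
        self.nfe += 1
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)          # self-attention over the sequence
        return self.ff(self.norm2(x + a))  # dx/dt at "time" t


def integrate_adaptive(f, x0, t0=0.0, t1=1.0, tol=1e-2, h=0.25):
    """Heun/Euler pair with step doubling: steps are accepted or rejected based
    on a local error estimate, so the step count is input-dependent."""
    t, x, arclength, steps = t0, x0, 0.0, 0
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        x_euler = x + h * k1                 # 1st-order estimate
        x_heun = x + 0.5 * h * (k1 + k2)     # 2nd-order estimate
        err = (x_heun - x_euler).abs().max().item()
        if err < tol:                        # accept the step
            # Riemann-sum proxy for trajectory arclength, used as a penalty term.
            arclength += h * k1.norm(dim=-1).mean().item()
            t, x, steps = t + h, x_heun, steps + 1
            h *= 1.5
        else:                                # reject and shrink the step
            h *= 0.5
    return x, steps, arclength


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 64
    block = ODETransformerBlock(d_model)
    # Toy parity setup: binary sequences whose (nonlocal) target is the XOR of all bits.
    bits = torch.randint(0, 2, (2, 16))       # (batch, seq_len)
    parity = bits.sum(dim=1) % 2              # target label per sequence
    x0 = nn.Embedding(2, d_model)(bits)       # (batch, seq_len, d_model)
    x1, steps, arc = integrate_adaptive(block, x0)
    print(f"accepted steps: {steps}, function evals: {block.nfe}, arclength: {arc:.3f}")
```

In this sketch the accepted-step count and the function-evaluation counter play the role of the input-dependent "depth", and the arclength estimate would be added to the training loss as the regularizer discussed in the abstract.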

Related research

ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation (04/06/2021)
It has been found that residual networks are an Euler discretization of ...

Discovering ordinary differential equations that govern time-series (11/05/2022)
Natural laws are often described through differential equations yet find...

Abstraction, Reasoning and Deep Learning: A Study of the "Look and Say" Sequence (09/27/2021)
The ability to abstract, count, and use System 2 reasoning are well-know...

Predicting Ordinary Differential Equations with Transformers (07/24/2023)
We develop a transformer-based sequence-to-sequence model that recovers ...

Depth-Adaptive Transformer (10/22/2019)
State of the art sequence-to-sequence models perform a fixed number of c...

Explicitly Modeling Adaptive Depths for Transformer (04/27/2020)
The vanilla Transformer conducts a fixed number of computations over all...
