Transformers with Learnable Activation Functions

08/30/2022
by Haishuo Fang, et al.

Activation functions can have a significant impact on reducing the topological complexity of input data and therefore improve a model's performance. Selecting a suitable activation function is an essential step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models. Their activation functions are chosen beforehand and then remain fixed from pre-training to fine-tuning. As a result, the inductive biases they impose on models cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) in using the same activation function without justification. In this paper, we investigate the effectiveness of using the Rational Activation Function (RAF), a learnable activation function, in the Transformer architecture. In contrast to conventional, predefined activation functions, RAFs can adaptively learn an optimal activation function during training according to the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms the counterpart model across the majority of tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71 points on average in the low-data scenario (where 100 training examples are available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially between different layers of the pre-trained model and mostly look very different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models according to their learned activation functions.
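
As background for how such a learnable activation can be realized, below is a minimal PyTorch sketch of a rational activation function R(x) = P(x) / Q(x) with learnable numerator and denominator coefficients. The polynomial degrees, the initialization, and the absolute-value safeguard in the denominator are illustrative assumptions, not necessarily the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Learnable rational activation R(x) = P(x) / Q(x).

    P has degree m, Q has degree n; Q is kept strictly positive via
    1 + |...| so the function has no poles (a common "safe",
    Padé-style parameterization). Degrees and initialization here are
    illustrative assumptions, not the paper's exact setup.
    """

    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        # Numerator coefficients a_0..a_m and denominator coefficients b_1..b_n
        # are plain parameters, trained jointly with the rest of the network.
        self.numerator = nn.Parameter(torch.randn(m + 1) * 0.1)
        self.denominator = nn.Parameter(torch.randn(n) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = a_0 + a_1 * x + ... + a_m * x^m
        p = sum(a * x**i for i, a in enumerate(self.numerator))
        # Q(x) = 1 + |b_1 * x + ... + b_n * x^n|  (always >= 1, so no division by zero)
        q = 1.0 + torch.abs(sum(b * x**(k + 1) for k, b in enumerate(self.denominator)))
        return p / q

# Example: a BERT-style feed-forward block with a learnable activation
# in place of a fixed GELU.
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    RationalActivation(),
    nn.Linear(3072, 768),
)
hidden = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size)
out = ffn(hidden)                  # same shape as the input
```

In a RAFT-style model, such a module would stand in for the fixed GELU inside each Transformer feed-forward block, so that every layer can learn its own activation shape end to end during pre-training and fine-tuning.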


