Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference

06/21/2023
by Alessandro Sordoni, et al.

We view large language models (LLMs) as stochastic language layers in a network, where the learnable parameters are the natural language prompts at each layer. We stack two such layers, feeding the output of one layer to the next. We call the stacked architecture a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-layer language network (DLN-1). We then show how to train 2-layer DLNs (DLN-2), where two prompts must be learned. We consider the output of the first layer as a latent variable to marginalize over, and devise a variational inference algorithm for joint prompt training. A DLN-2 reaches higher performance than a single layer, sometimes comparable to few-shot GPT-4, even when each LLM in the network is smaller and less powerful. The DLN code is open source: https://github.com/microsoft/deep-language-networks.
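To make the stacked architecture concrete, below is a minimal Python sketch of a DLN-2 forward pass. It is illustrative only: call_llm is a hypothetical placeholder rather than the authors' API, and prompt optimization and the variational inference step are omitted (see the linked repository for the actual implementation).

# Minimal sketch of a 2-layer Deep Language Network (DLN-2) forward pass.
# `call_llm` is a hypothetical placeholder, not the authors' API; see the
# linked repository for the actual implementation.

def call_llm(prompt: str, text: str) -> str:
    """Placeholder for a single stochastic language layer.

    In a real DLN this would query an LLM with the given prompt (the layer's
    learnable parameter) applied to the input text, returning sampled output.
    """
    # Replace with a call to your preferred LLM client.
    return f"[LLM output for prompt={prompt!r} on input={text!r}]"


def dln2_forward(x: str, prompt_1: str, prompt_2: str) -> str:
    """Stack two language layers: the output of layer 1 is fed to layer 2.

    The intermediate text h plays the role of the latent variable that the
    paper marginalizes over when training both prompts jointly.
    """
    h = call_llm(prompt_1, x)   # layer 1: produce an intermediate "hidden" text
    y = call_llm(prompt_2, h)   # layer 2: map the intermediate text to the answer
    return y


if __name__ == "__main__":
    answer = dln2_forward(
        x="Is the following review positive or negative? 'Great movie!'",
        prompt_1="Summarize the key sentiment cues in the input.",
        prompt_2="Given the analysis above, answer 'positive' or 'negative'.",
    )
    print(answer)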


