Exploiting Inductive Bias in Transformers for Unsupervised Disentanglement of Syntax and Semantics with VAEs

We propose a generative model for text generation which exhibits disentangled latent representations of syntax and semantics. Contrary to previous work, this model does not need syntactic information such as constituency parses or semantic information such as paraphrase pairs. It relies solely on the inductive bias found in attention-based architectures such as Transformers. In Transformer attention, keys handle information selection while values specify what information is conveyed. Our model, dubbed QKVAE, uses Attention in its decoder to read latent variables, where one latent variable infers keys while another infers values. Experiments on latent representations and on syntax/semantics transfer show that QKVAE displays clear signs of disentangled syntax and semantics. We also show that our model displays competitive syntax transfer capabilities compared to supervised models, and that comparable supervised models need a fairly large amount of data (more than 50K samples) to outperform it on both syntactic and semantic transfer. The code for our experiments is publicly available.


1 Introduction

Disentanglement, the process of obtaining neural representations whose components have an identified meaning, is a crucial component of research on interpretability Rudin2022InterpretableChallenges. A form of disentanglement that has received a lot of interest from the NLP community is the separation between syntax and semantics in neural representations Chen2019ARepresentations; Bao2020; Zhang2020Syntax-infusedGeneration; Chen2020ControllableExemplar; Huang2021GeneratingPairs; Huang2021DisentanglingModelsb. Previous works perform this disentanglement using paraphrase pairs as information for semantics, and/or constituency parses as information for syntax. The dependence of models on labeled data is known to entail high costs (see Seddah2020BuildingHell on syntactic annotation), and often requires new labels to handle problems such as concept drift Lu2019LearningReview and domain adaptation Farahani2021AAdaptation.

In light of the above, we propose an unsupervised model which directs syntax and semantics into different neural representations without semantic or syntactic supervision. In the Transformer architecture Vaswani2017, the attention mechanism is built upon queries from a set $Q$, which pool values $V$ through keys $K$. For each query, values are selected according to a matching score computed as the similarity between their corresponding keys and the query. Building on an analogy between the couple $(K, V)$ and syntactic roles with their lexical realizations (explained in §4.2), we present QKVAE (a contraction of the triplet $(Q, K, V)$ with the VAE acronym), a Transformer-based Variational Autoencoder (VAE; Kingma2014Auto-encodingBayes).

To build our model, we modify a previous Transformer-based VAE, called the Attention-Driven VAE (ADVAE; Felhi2021TowardsRoles). Using Cross-Attention, our model encodes sentences into two kinds of latent variables: one used to infer the values passed to the decoder's attention, and another used to assign keys to these values. These keys and values are then used in the Attention mechanism of a Transformer Decoder to generate sentences. We show that the key variable tends to contain syntactic information, while the value variables tend to represent semantic information. Additionally, comparisons with a supervised model show that it needs a considerable amount of data to outperform our model on syntactic and semantic transfer metrics.

Our contributions can be summarized as follows:

  • We describe QKVAE, a model designed to disentangle syntactic information from semantic information by using separate latent variables for keys and values in Transformer Attention.

  • We run experiments on an English dataset which empirically show that the two types of latent variables have strong preferences, respectively, for syntax and semantics.

  • We also show that our model is capable of transferring syntactic and semantic information between sentences by using their respective latent variables. Moreover, we show that our model’s syntax transfer capabilities are competitive with supervised models when they use their full training set (more than 400k sentences), and that a supervised model needs a fairly large amount of labeled data (more than 50k samples) to outperform it on both semantic and syntactic transfer.

2 Related Work

We broadly divide works on explainability in NLP into two research directions. The first seeks post hoc explanations for black-box models, and has led to a rich literature of observations on the behavior of neural models in NLP tenney-etal-2019-bert; jawahar-etal-2019-bert; Hu2020AModels; Kodner2020OverestimationModels; Marvin2020TargetedModels; Kulmizev2020DoFormalisms; rogers-etal-2020-primer. Along with these observations, this line of work has also led to numerous advances in methodology concerning, for instance, the use of attention as an explanation Jain2019AttentionExplanation; Wiegreffe2020AttentionExplanation, the validity of probing Pimentel2020Information-TheoreticStructure, or contrastive evaluation with minimal pairs Vamvas2021OnEvaluation. The second research direction seeks to build models that are explainable by design. This has led to models with explicit linguistically informed mechanisms such as the induction of grammars (RNNG; Dyer2016RecurrentGrammars, URNNG; Kim2019UnsupervisedGrammars) or constituency trees (ON-LSTM; Shen2019OrderedNetworks, ONLSTM-SYD; Du2020ExploitingApproach).

Disentangled representation learning is a sub-field of this second research direction which aims at separating neural representations into neurons with known associated meanings. This separation has been performed for various characteristics of text such as style John2020DisentangledTransfer; Cheng2020ImprovingGuidance, sentiment and topic Xu2020OnSupervision, or word morphology Behjati2021InducingAttention. Within work on disentanglement, considerable effort has been put into separating syntax from semantics, whether merely to obtain an interpretable specialization in the embedding space Chen2019ARepresentations; Bao2020; ravfogel-etal-2020-unsupervised; Huang2021DisentanglingModelsb, or for controllable generation Zhang2020Syntax-infusedGeneration; Chen2020ControllableExemplar; Huang2021GeneratingPairs; hosking-lapata-2021-factorising; li-etal-2021-deep-decomposable; hosking-etal-2022-hierarchical. However, all these works rely on syntactic information (constituency parses and PoS tags) or semantic information (paraphrase pairs). To the best of our knowledge, our work is the first to present a method that directs syntactic and semantic information into assigned embeddings in the challenging unsupervised setup.

From a broader machine learning perspective, we use knowledge of the underlying phenomena in our data to design QKVAE with an inductive bias that induces understandable behavior in an unsupervised fashion. Among existing applications of this principle Rezende2016UnsupervisedImages; Hudson2018; Locatello2020Object-centricAttention; Tjandra2021UnsupervisedRepresentation, ADVAE Felhi2021TowardsRoles, the model on which QKVAE is based, is designed to separate information coming from the realizations of different syntactic roles without supervision, on a dataset of regularly structured sentences.

3 Background

In this section, we go over the components of our model, namely VAEs, attention in Transformers, and ADVAE, the model on which QKVAE is based.

3.1 VAEs as Language Models

Given a set of observations $X = \{x^{(1)}, \ldots, x^{(N)}\}$, VAEs are a class of deep learning models that train a generative model $p_\theta(x|z)p(z)$, where $p(z)$ is a prior distribution on latent variables $z$ that serve as a seed for generation, and $p_\theta(x|z)$ is called the decoder and generates an observation $x$ from each latent variable value $z$. Since directly maximizing the likelihood $p_\theta(x)$ to train a generative model is intractable, an approximate inference distribution $q_\phi(z|x)$, called the encoder, is used to formulate a lower-bound to the exact log-likelihood of the model, called the Evidence Lower-Bound (ELBo):

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left[q_\phi(z|x)\,\|\,p(z)\right] \qquad (1)$$

Early works on VAEs as language models have shown that, contrary to non-generative sequence-to-sequence models (Sutskever2014b), they learn a smooth latent space (Bowman2016GeneratingSpace). In fact, this smoothness enables decoding an interpolation of latent codes (i.e. a homotopy) coming from two sentences to yield a well-formed third sentence that clearly shares characteristics (syntactic, semantic, …) with both source sentences. This interpolation will be used as a control baseline in our experiments.
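
To make the training objective concrete, the sketch below shows how the two ELBo terms of Eq. 1 are typically computed for a text VAE with a diagonal Gaussian posterior and a standard Normal prior. It is a minimal PyTorch illustration under our own naming, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def elbo_terms(logits, targets, mu, logvar):
    """Negative ELBo terms (Eq. 1) for a text VAE with a diagonal Gaussian
    posterior q(z|x) = N(mu, diag(exp(logvar))) and a standard Normal prior."""
    # Reconstruction term: -E_q[log p(x|z)], estimated with one sample of z.
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    rec = F.cross_entropy(logits.transpose(1, 2), targets, reduction="sum")
    # Analytical KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec, kl  # minimizing rec + kl maximizes the ELBo
```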

3.2 Attention in Transformers

The inductive bias responsible for the disentanglement capabilities of our model is based on the design of Attention in Transformers Vaswani2017. In attention mechanisms, each element $q_i$ of a series of query vectors $Q$ performs a soft selection of values $V$ whose compatibility with the query is given by their corresponding key vector in $K$ via dot product. For each $q_i$, the series of dot products is normalized and used as weights for a convex interpolation of the values. Formally, the result is compactly written as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \qquad (2)$$

Here, we stress that $K$ is only capable of controlling what information is selected from $V$, while $V$ is responsible for the value of this information. Using the $\mathrm{Att}$ operator above and the embedding-level concatenation operator $\oplus$, Multi-Head Attention ($\mathrm{MHA}$) in Transformers is defined as follows:

$$\mathrm{MHA}(Q, K, V) = \left[\mathop{\oplus}_{i=1}^{h} \mathrm{Att}\!\left(QW^Q_i,\, KW^K_i,\, VW^V_i\right)\right] W^O$$

where $W^Q_i$, $W^K_i$, $W^V_i$, and $W^O$ are trainable parameter matrices. In turn, Self-Attention ($\mathrm{SA}$) and Cross-Attention ($\mathrm{CA}$) are defined, for a set of source elements $S$ and a set of target elements $T$, as follows:

$$\mathrm{SA}(T) = \mathrm{MHA}(T, T, T), \qquad \mathrm{CA}(S, T) = \mathrm{MHA}(T, S, S)$$

The $\mathrm{SA}$ mechanism is used to exchange information between elements of the target $T$, while in $\mathrm{CA}$, targets pull (or query for) information from each element of the source $S$. Transformer Encoders ($\mathrm{TE}$) are defined as the composition of $L_E$ layers, each consisting of a Self-Attention followed by a Feed-Forward Network ($\mathrm{FF}$); we omit residual connections and layer normalizations after each $\mathrm{SA}$ or $\mathrm{FF}$ for simplicity:

$$\mathrm{TE}(T) = \left(\mathrm{FF}_{L_E} \circ \mathrm{SA}_{L_E}\right) \circ \cdots \circ \left(\mathrm{FF}_{1} \circ \mathrm{SA}_{1}\right)(T)$$

Transformer Decoders ($\mathrm{TD}$) are defined with instances of $\mathrm{SA}$, $\mathrm{CA}$, and $\mathrm{FF}$:

$$\mathrm{TD}(S, T) = \left(\mathrm{FF}_{L_D} \circ \mathrm{CA}_{L_D}(S, \cdot) \circ \mathrm{SA}_{L_D}\right) \circ \cdots \circ \left(\mathrm{FF}_{1} \circ \mathrm{CA}_{1}(S, \cdot) \circ \mathrm{SA}_{1}\right)(T)$$

where $L_E$ and $L_D$ are respectively the numbers of layers of $\mathrm{TE}$ and $\mathrm{TD}$. For autoregressive decoding, Vaswani2017 define a masked version of $\mathrm{TD}$, which we will call $\mathrm{TD}_M$. In this version, the result of each $\mathrm{Att}$ (Eq. 2) in Self-Attention is masked so that each element $t_i$ in $T$ only queries for information from elements $t_j$ with $j \le i$. Even though $\mathrm{TD}_M$ yields a sequence of length equal to that of the target $T$, in the following sections we will consider its output to be only the last element of this sequence, in order to express auto-regressive generation in a clear manner.
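
As a compact illustration of Eq. 2 and of the key/value separation it relies on, here is a minimal scaled dot-product attention in PyTorch (our own sketch, not the paper's code):

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention (Eq. 2): the keys K decide *which* values
    are selected for each query, while V decides *what* content is passed on."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    weights = torch.softmax(scores, dim=-1)  # one convex combination per query
    return weights @ V
```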

3.3 ADVAE

ADVAE is a Variational Autoencoder for unsupervised disentanglement of sentence representations. It mainly differs from previous LSTM-based (Bowman2016GeneratingSpace) and Transformer-based (Li2020Optimus:Space) VAEs in that it uses Cross-Attention to encode and decode latent variables, which is the cornerstone of our model. In ADVAE, Cross-Attention is used to: i) encode information from sentences into a fixed number of vectorial latent variables; ii) decode these vectorial latent variables by using them as sources for the target sentences generated by a Transformer Decoder.

Formally, let us define $L_\mu$, $L_\sigma$, and $L_w$ to be linear layers that will respectively be used to obtain the latent variables' means and standard deviations, and the generated words' probabilities; $n_z$ the number of vectorial latent variables $z = \{z_1, \ldots, z_{n_z}\}$; and finally $e^{enc} = \{e^{enc}_1, \ldots, e^{enc}_{n_z}\}$ and $e^{dec} = \{e^{dec}_1, \ldots, e^{dec}_{n_z}\}$ two sets of trainable embeddings. Embeddings $e^{enc}_i$ and $e^{dec}_i$ serve as fixed identifiers for the latent variable $z_i$, respectively in the encoder and in the decoder.

Given an input token sequence $x = (x_1, \ldots, x_T)$, the encoder first yields parameters $\mu_i$ and $\sigma_i$ to be used by the diagonal Gaussian distribution of each of the latent variables $z_i$ as follows (to simplify equations, we omit word embedding look-up tables and positional embeddings):

$$\mu_i = L_\mu\!\left(\mathrm{CA}\!\left(\mathrm{TE}(x),\, e^{enc}_i\right)\right), \qquad \sigma_i = L_\sigma\!\left(\mathrm{CA}\!\left(\mathrm{TE}(x),\, e^{enc}_i\right)\right) \qquad (3)$$

Cross-Attention is also used by the ADVAE decoder to dispatch information from the source latent variable samples to the target generated sequence. Accordingly, using a beginning-of-sentence token $x_0$, the decoder yields probabilities for the categorical distribution of the generated tokens by decoding the latent variables $z_i$ concatenated with their embeddings $e^{dec}_i$:

$$p(x_t \mid x_{<t}, z) = \mathrm{softmax}\!\left(L_w\!\left(\mathrm{TD}_M\!\left(\left[z_1 \oplus e^{dec}_1, \ldots, z_{n_z} \oplus e^{dec}_{n_z}\right],\, x_{0:t-1}\right)\right)\right)$$
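
The sketch below illustrates this encoding scheme: a small set of learned "identifier" embeddings acts as queries that cross-attend over the token encodings, and each query yields the Gaussian parameters of one latent variable. It is a schematic PyTorch rendition under our own naming and layer choices (e.g. `nn.MultiheadAttention`, 8 heads), not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentCrossAttentionEncoder(nn.Module):
    """Schematic ADVAE-style encoder: n_z learned identifier queries attend
    over token encodings; each one parameterizes a diagonal Gaussian latent."""
    def __init__(self, d_model=768, n_z=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_z, d_model))  # e^enc_1..e^enc_{n_z}
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_mu = nn.Linear(d_model, d_model)      # plays the role of L_mu
        self.to_logvar = nn.Linear(d_model, d_model)  # parameterizes the std. dev.

    def forward(self, token_states):                  # (batch, seq_len, d_model)
        q = self.queries.expand(token_states.size(0), -1, -1)
        pooled, _ = self.cross_att(q, token_states, token_states)
        return self.to_mu(pooled), self.to_logvar(pooled)  # each (batch, n_z, d_model)
```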

4 QKVAE: Using separate latent variables for Keys and Values

In this section, we describe the architecture of our model, the behavior it entails, and how we deal with the optimization challenges it poses.

values ($V$):   child | to wear | cloak     | winter
keys (set 1):   nsubj | root    | dobj      | —         decoded (keys 1, $V$): A child wears a cloak.
keys (set 2):   agent | root    | nsubjpass | pobj      decoded (keys 2, $V$): A cloak is worn, in winter, by a child.
Table 1: Example of interpretable values for the keys and the values in our model (here with four value vectors). We display a sentence transiting from the active form to the passive form, to illustrate how different keys arranging the same values can lead to the same minimal semantic units being rearranged according to a different syntactic structure. We also stress that a different set of keys may omit or bring forth an element from the values vector (e.g. "winter" here above).

4.1 QKVAE architecture

The modification we bring to ADVAE is aimed at controlling how information is selected from the latent space with the value of a newly introduced latent variable. We call this new latent variable the key variable $z^K$, and refer to the latent variables already formulated in ADVAE as the value variables $z^V = \{z^V_1, \ldots, z^V_{n_z}\}$. $z^K$ is obtained with the same process as each $z^V_i$ (Eq. 3), i.e. by adding an additional identifier embedding, together with additional matrices to obtain its mean and standard-deviation parameters.

For the QKVAE Decoder, we modify the Transformer Decoder $\mathrm{TD}$ into $\mathrm{TD}_{QK}$ so as to use Multi-Head Attention $\mathrm{MHA}$ with separate inputs for keys and values instead of Cross-Attention $\mathrm{CA}$:

$$\mathrm{TD}_{QK}(K, V, T) = \left(\mathrm{FF}_{L_D} \circ \mathrm{MHA}_{L_D}(\cdot, K, V) \circ \mathrm{SA}_{L_D}\right) \circ \cdots \circ \left(\mathrm{FF}_{1} \circ \mathrm{MHA}_{1}(\cdot, K, V) \circ \mathrm{SA}_{1}\right)(T)$$

where $L_D$ is the number of layers. Similar to $\mathrm{TD}_M$, we define $\mathrm{TD}_{QK,M}$ to be the auto-regressive version of $\mathrm{TD}_{QK}$. The QKVAE decoder yields probabilities for the generated tokens by using this operator on values given by the $z^V_i$ concatenated with the embeddings $e^{dec}_i$, and keys given by a linear transformation of $z^K$:

$$p(x_t \mid x_{<t}, z^K, z^V) = \mathrm{softmax}\!\left(L_w\!\left(\mathrm{TD}_{QK,M}\!\left(L_K(z^K),\ \left[z^V_1 \oplus e^{dec}_1, \ldots, z^V_{n_z} \oplus e^{dec}_{n_z}\right],\ x_{0:t-1}\right)\right)\right)$$

where $L_K$ is a linear layer whose output is reshaped to obtain a matrix of $n_z$ keys. While ADVAE already uses Cross-Attention to encode and decode latent variables, our model uses separate variables to obtain keys and values for the Multi-Head Attention in its decoder.
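
To make the key/value split concrete, here is a schematic decoder layer in PyTorch: the cross-attention keys are a linear map of the key latent variable, the cross-attention values are the value latent vectors, and the running target representation provides the queries. All names, the number of heads, and the omission of residual connections are our own simplifications; this is an illustrative sketch, not the released QKVAE code.

```python
import torch
import torch.nn as nn

class QKVDecoderLayer(nn.Module):
    """Schematic QKVAE-style decoder layer: queries come from the target side,
    keys from a projection of z_key, and values from the value latents."""
    def __init__(self, d_model=768, n_z=4, n_heads=8):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.key_proj = nn.Linear(d_model, n_z * d_model)  # one key per value latent
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, tgt, z_values, z_key, causal_mask=None):
        # tgt: (batch, seq, d), z_values: (batch, n_z, d), z_key: (batch, d)
        keys = self.key_proj(z_key).view(z_key.size(0), -1, tgt.size(-1))
        h, _ = self.self_att(tgt, tgt, tgt, attn_mask=causal_mask)
        # keys only route information; the content itself comes from z_values
        h, _ = self.cross_att(h, keys, z_values)
        return self.ff(h)
```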

4.2 QKVAE Behavior

In the Multi-Head Attention of our decoder, $z^K$ controls the keys, and the $z^V_i$ control the values. In other words, the value of each $z^V_i$ is called to be passed to the target sequence according to its key, which is given by $z^K$. Therefore, given a query, $z^K$ decides which content vector participates most in the value of the generated token at each generation step. To better convey the kind of behavior intended by this construction, we assume in Table 1, for explanatory purposes, that our decoder has one layer and one attention head, that the keys in the two displayed key sets correspond to syntactic roles, and that each value informs on the realization of the corresponding syntactic role. Table 1 displays the resulting sentence when each of the two key sets is coupled with the same values.

In the examples in Table 1, the generator uses a query at each generation step to pick a word in a manner that would comply with English syntax. Therefore, the key of each value should inform on its role in the target structure, which justifies syntactic roles as an adequate meaning for keys.

Although our model may stray from this possibility and formulate non-interpretable values and keys, keys will still inform on the roles of values in the target structure, and therefore influence the way values are injected into the target sequence. Moreover, given that our model uses multiple layers and attention heads, and that keys in Attention are continuous (as opposed to discrete syntactic role labels), it performs a multi-step, continuous version of the behavior described in Table 1.

Injecting values into the structure of a sentence requires the decoder to model this structure. Previous works have shown that this is well within the capabilities of Transformers. Specifically, Hewitt2019ARepresentations showed that Transformers embed syntactic trees in their inner representations, Clark2019WhatAttentionb showed that numerous attention heads attend to specific syntactic roles, and Felhi2021TowardsRoles showed that Transformer-based VAEs can capture the realizations of syntactic roles in latent variables obtained with Cross-Attention.

4.3 Balancing the Learning of $z^K$ and $z^V$

Similar to ADVAE, we use a standard Normal distribution as a prior and train QKVAE with the $\beta$-VAE objective Higgins2019-VAE:Framework, which is simply the ELBo (Eq. 1) with a weight $\beta$ on its Kullback-Leibler ($\mathrm{KL}$) term. Higgins2019-VAE:Framework show that a higher $\beta$ leads to better unsupervised disentanglement. However, the $\mathrm{KL}$ term is responsible for a phenomenon called posterior collapse, where the latent variables become uninformative and are not used by the decoder Bowman2016GeneratingSpace. Therefore, higher values of $\beta$ cause poorer reconstruction performance Chen2018c. To avoid posterior collapse, we follow Li2020AText: i) we pretrain our model as an autoencoder by setting $\beta$ to 0; ii) we linearly increase $\beta$ to its final value ($\beta$ annealing; Bowman2016GeneratingSpace) and we threshold each dimension of the $\mathrm{KL}$ term with a factor $\lambda$ (Free-Bits strategy; Kingma2016ImprovedFlow).

In preliminary experiments with our model, we observed that it tends to encode sentences using only the value variables $z^V$. As we use conditionally independent posteriors for our latent variables (these posteriors are ADVAE encoders, Eq. 3), their $\mathrm{KL}$ terms (Eq. 1) can be written separately, and they can therefore be weighted separately with different values of $\beta$. Using a lower $\beta$ for $z^K$, as was done by Chen2020ControllableExemplar (although not explicitly mentioned in their paper, this is performed in their companion source code), did not prove effective in making it informative for the model. Alternatively, linearly annealing $\beta$ for one of the two variables before the other did solve the issue. This intervention on the learning process was inspired by the work of Li2020ProgressiveRepresentations, which shows that latent variables used at different parts of a generative model should be learned at different paces.
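
The sketch below shows the two ingredients discussed here, linear $\beta$ annealing and the Free-Bits threshold, as they might be wired into a training step with separate schedules for the two latent groups. Function names, the schedule boundaries shown in the usage comment, and the assignment of weights to variable groups are illustrative placeholders; the actual values are given in Appendix B.

```python
import torch

def beta_schedule(step, start, end, beta_max):
    """Linear KL annealing: beta is 0 before `start`, then grows linearly
    until it reaches `beta_max` at step `end`."""
    if step <= start:
        return 0.0
    return min(beta_max, beta_max * (step - start) / (end - start))

def kl_free_bits(kl_per_dim, threshold=0.05):
    """Free-Bits: dimensions whose KL is below the threshold stop contributing,
    which removes the incentive to collapse them further."""
    return torch.clamp(kl_per_dim, min=threshold).sum()

# Hypothetical use inside a training step, with one schedule per latent group
# (which group gets which schedule and weight follows Appendix B):
# loss = rec \
#        + beta_schedule(step, 3_000, 6_000, beta_a) * kl_free_bits(kl_a_per_dim) \
#        + beta_schedule(step, 7_000, 20_000, beta_b) * kl_free_bits(kl_b_per_dim)
```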

5 Experiments

5.1 Setup

Data

To compare our model to its supervised counterparts, we train it with data from ParaNMT Wieting2018ParanMT-50M:Translations, an English dataset of machine-generated paraphrase pairs. More specifically, we use the 493K samples used by Chen2020ControllableExemplar (available at https://drive.google.com/open?id=1HHDlUT_-WpedL6zNYpcN94cLwed_yyrP) to train their model VGVAE. Since our model is unsupervised, we only use the reference sentences (half the training set) to train it. Using the development and test sets of ParaNMT, Chen2020ControllableExemplar also provide a curated set of triplets formed by a target sentence (target), a semantic source (sem_src), and a syntactic source (syn_src). The semantic source is a paraphrase of the target sentence, while the syntactic source is selected by finding a sentence that is syntactically close to the target (i.e. the edit distance between the PoS-tag sequences of both sentences is low; we follow Chen2020ControllableExemplar in using this evaluation data, although edit distance between PoS tags might not be a good proxy for syntactic similarity) and semantically different from the paraphrase (i.e. has a low BLEU score with it). Contrary to paraphrases in the training set of ParaNMT, paraphrases from this set were manually curated. These triplets are divided into a development set of 500 samples and a test set of 800 samples. We display results on the test set in the main body of the paper. The results on the development set, which lead to the same conclusions, are reported in Appendix A.

Training details & hyper-parameters

Encoders and Decoders in QKVAE are initialized with parameters from BART Lewis2020BART:Comprehension. After manual trial and error on the development set, we set the sizes of $z^K$ and $z^V$ to 768, and the number of value latent variables $n_z$ to 4. Further hyper-parameters are given in Appendix B. We train 5 instances of our model and report the average scores throughout all experiments.

Baselines

We compare our system to 4 previously published models, where 2 are supervised and 2 are unsupervised: i) VGVAE Chen2020ControllableExemplar: a VAE-based paraphrase generation model with an LSTM architecture. This model is trained using paraphrase pairs and PoS Tags to separate syntax and semantics into two latent variables. This separation is used to separately specify semantics and syntax to the decoder in order to produce paraphrases; ii) SynPG Huang2021GeneratingPairs: A paraphrase generation Seq2Seq model based on a Transformer architecture which also separately encodes syntax and semantics for the same purpose as VGVAE. This model is, however, trained using only source sentences with their syntactic parses, without paraphrases; iii) Optimus Li2020Optimus:Space: A large-scale VAE based on a fusion between BERT Devlin2018b and GPT-2 Radford2018LanguageLearners with competitive performance on various NLP benchmarks; iv) ADVAE: This model is QKVAE without its syntactic variable. The size of its latent variable is set to 1536 to equal the total size of latent variables in QKVAE.

Official open-source instances of the 4 models above are available (VGVAE: github.com/mingdachen/syntactic-template-generation/; SynPG: github.com/uclanlp/synpg; Optimus: github.com/ChunyuanLI/Optimus; ADVAE: github.com/ghazi-f/ADVAE), which ensures accurate comparisons. The off-the-shelf instances of VGVAE and SynPG are trained on ParaNMT with GloVe Pennington2014 embeddings (gains could be observed with better embeddings for these supervised models, but we stick to the original implementations). We fine-tune a pre-trained Optimus on our training set following instructions from the authors. Similar to our model, we initialize ADVAE with parameters from BART Lewis2020BART:Comprehension and train 5 instances of it on ParaNMT.

5.2 Syntax and Semantics Separation in the Embedding Space

We first test whether $z^K$ and $z^V$ respectively specialize in syntax and semantics. A syntactic (resp. semantic) embedding should place syntactically (resp. semantically) similar sentences close to each other in the embedding space.

Using the (target, sem_src, syn_src) triplets, we calculate for each embedding the probability that target is closer to sem_src than it is to syn_src in the embedding space. For simplicity, we refer to the syntactic and semantic embeddings of all models as syn and sem. For Gaussian latent variables, we use the mean parameter as a representation (and the mean direction parameter for the von Mises-Fisher distribution of the semantic variable of VGVAE). We use an L2 distance for Gaussian variables and a cosine distance for the others. Since Optimus and ADVAE do not have separate embeddings for syntax and semantics: i) we take the whole embedding for Optimus; ii) for ADVAE, we measure the above probability on the development set for each latent variable (Eq. 3), and then choose the latent variable that places target sentences closest to their sem_src (resp. syn_src) as its semantic (resp. syntactic) variable. The results are presented in Table 2.
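
As an illustration of this specialization score, here is a minimal NumPy sketch (our own helper, with hypothetical names) that computes the fraction of triplets for which an embedding places the target closer to its semantic source than to its syntactic source:

```python
import numpy as np

def closer_to_sem(emb_target, emb_sem, emb_syn, metric="l2"):
    """Fraction of triplets where the target embedding is closer to its
    semantic source than to its syntactic source (Table 2 style score)."""
    if metric == "l2":
        d_sem = np.linalg.norm(emb_target - emb_sem, axis=-1)
        d_syn = np.linalg.norm(emb_target - emb_syn, axis=-1)
    else:  # cosine distance
        def cos_dist(a, b):
            return 1.0 - (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
        d_sem, d_syn = cos_dist(emb_target, emb_sem), cos_dist(emb_target, emb_syn)
    return float((d_sem < d_syn).mean())
```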

                        sem (↑)   syn (↓)
Supervised Models
VGVAE                   99.9      14.8
SynPG                   93.4      26.5
Unsupervised Models
Optimus                 91.8      –
ADVAE                   39.5      40.0
QKVAE                   89.2      26.4
Table 2: The probability (×100) that an embedding places a target sentence closer to its semantic source than to its syntactic source in the embedding space. Arrows (↑/↓) indicate whether higher or lower scores are better.
            vs. sem_src           vs. syn_src           vs. target
            STED  TMA2  TMA3      STED  TMA2  TMA3      STED  TMA2  TMA3
Control and Reference baselines
sem_src     0.0   100   100       13.0  40.3  4.8       12.0  39.6  7.0
syn_src     13.0  40.3  4.8       0.0   100   100       5.9   84.3  45.8
Optimus     11.6  50.0  15.9      9.2   61.6  23.6      10.2  58.9  21.8
Supervised Models
VGVAE       13.1  39.9  5.4       3.3   86.4  64.1      6.7   80.4  44.6
SynPG       11.7  41.9  18.0      13.5  74.1  10.5      13.1  69.1  13.3
Unsupervised Models
ADVAE       11.9  47.3  14.0      10.3  54.3  19.2      11.1  52.3  17.0
QKVAE       12.7  40.2  7.8       7.2   68.2  39.5      8.9   63.9  28.1
Table 3: Syntactic transfer results. STED is the Syntactic Tree Edit Distance, and TMA2/TMA3 is the exact matching rate between constituency trees truncated at the 2nd/3rd level.

Table 2 clearly shows, for QKVAE, SynPG, and VGVAE, that the syntactic (resp. semantic) variables lean towards positioning sentences in the embedding space according to their syntax (resp. semantics). Surprisingly, the syntactic variable of our model specializes in syntax (i.e. obtains a low score) as much as that of SynPG. The generalist latent variable of Optimus seems to position sentences in the latent space according to their semantics; accordingly, we report its score in the sem column. Interestingly, the variables in ADVAE have very close scores, both well below 50, which shows that the entire ADVAE embedding leans more towards syntax. This means that, without the key/value distinction in the Attention-based decoder, the variables specialize more in structure than in content.

5.3 Syntactic and Semantic Transfer

Similar to (Chen2020ControllableExemplar), we aim to produce sentences that take semantic content from sem_src sentences and syntax from syn_src sentences. For each of SynPG, VGVAE, and QKVAE we simply use the syntactic embedding of syn_src, and the semantic embedding of sem_src as inputs to the decoder to produce new sentences. Using the results of the specialization test in the previous experiment, we do the same for ADVAE by taking the 2 latent variables that lean most to semantics (resp. syntax) as semantic (resp. syntactic) variables. The output sentences are then scored in terms of syntactic and semantic similarity with sem_src, syn_src and target.

            vs. sem_src      vs. syn_src      vs. target
            M     PB         M     PB         M     PB
Control and Reference baselines
sem_src     100   1.0        6.9   0.14       28.8  0.84
syn_src     6.9   0.14       100   1.0        12.1  0.16
Optimus     12.4  0.34       15.9  0.39       10.8  0.32
Supervised Models
VGVAE       17.6  0.58       15.3  0.18       24.9  0.58
SynPG       45.9  0.87       8.0   0.13       25.2  0.75
Unsupervised Models
ADVAE       8.0   0.19       8.3   0.17       7.4   0.19
QKVAE       12.8  0.35       11.0  0.19       12.6  0.34
Table 4: Semantic transfer results. M is the Meteor score, and PB is the ParaBART cosine similarity.

Control and reference baselines

Besides model outputs, we also use our syntactic and semantic comparison metrics, described below, to compare syn_src and sem_src sentences to one another and to target sentences. Additionally, using Optimus, we embed sem_src and syn_src, take the dimension-wise average of both embeddings, and decode it. As VAEs are known to produce quality sentence interpolations Bowman2016GeneratingSpace; Li2020Optimus:Space, the scores for this sentence help contrast a naïve fusion of features in the embedding space with a composition of well-identified disentangled features.

Transfer metrics

We measure the syntactic and semantic transfer from source sentences to output sentences. i) Semantics: previous works Chen2020ControllableExemplar; Huang2021GeneratingPairs rely on lexical overlap measures such as BLEU Papineni2001BLEU:Translation, ROUGE lin2004rouge, and Meteor denkowski-lavie-2014-meteor. As will be shown in our results, the lexical overlap signal does not capture semantic transfer between sentences when this transfer is too weak to produce paraphrases. Therefore, we use Meteor (M) in conjunction with ParaBART Huang2021DisentanglingModelsb, a model where BART Lewis2020BART:Comprehension is fine-tuned using syntactic information to produce neural representations that maximally encode semantics and minimally encode syntax. We measure the cosine similarity between sentences according to ParaBART embeddings (PB). ii) Syntax: we use the script of Chen2020ControllableExemplar to produce a syntactic tree edit distance (STED) between the constituency trees of sentences, as was done to assess VGVAE. Additionally, following the evaluation procedure designed by Huang2021GeneratingPairs for SynPG, we measure the Template Matching Accuracy between sentences, where the template is the constituency tree truncated at the second level (TMA2). TMA2 is the percentage of sentence pairs whose templates match exactly. We extend this measure by also providing it at the third level (TMA3). Results are presented in Tables 3 and 4. In both tables, we indicate the comparison scores between outputs and syn_src that are not significantly different from the corresponding scores with regard to sem_src (we consider differences to be significant if their associated t-test yields a p-value < 0.01).
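
For concreteness, here is one way the template-matching accuracies could be computed from bracketed constituency parses with NLTK. The paper relies on the original authors' evaluation scripts, so the truncation convention below (`levels=2` for TMA2, `levels=3` for TMA3) is an approximation we assume for illustration:

```python
from nltk import Tree

def truncate(tree, levels):
    """Keep constituent labels down to `levels` levels (root included);
    lexical leaves are dropped."""
    if not isinstance(tree, Tree):
        return None                      # leaf token: ignored
    if levels == 1:
        return tree.label()
    return (tree.label(), tuple(truncate(child, levels - 1) for child in tree))

def template_matching_accuracy(pred_parses, ref_parses, levels=2):
    """Share of sentence pairs whose truncated constituency trees match exactly
    (levels=2 approximates TMA2, levels=3 approximates TMA3)."""
    hits = [truncate(Tree.fromstring(p), levels) == truncate(Tree.fromstring(r), levels)
            for p, r in zip(pred_parses, ref_parses)]
    return 100.0 * sum(hits) / len(hits)
```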

Sanity checks with metrics and baselines

We notice in Table 4 that using Meteor as a semantic similarity measure results in various inconsistencies. For instance, the target paraphrases have a higher Meteor score with the syntactic sources than with the interpolations from Optimus. It can also be seen that the Meteor scores between outputs from VGVAE and both the syntactic and the semantic sources are rather close (this was not observed by Chen2020ControllableExemplar, as they only compared outputs from VGVAE to the target paraphrases). In contrast, the ParaBART score behaves as expected across comparisons in Table 4. Consequently, we retain the ParaBART score as our semantic similarity measure. In the following, we use the scores between sem_src, syn_src, and target (first two rows in Tables 3 and 4) as reference scores for unrelated sentences, paraphrase pairs, and syntactically similar sentences.

Comparing the supervised baselines

VGVAE and SynPG greatly differ in scores. It can be seen that SynPG copies many lexical items from its semantic input (high Meteor score), which allows for higher semantic similarity scores. However, Table 3 shows that SynPG transfers syntax from syn_src only at a coarse level (high TMA2, but low TMA3). In contrast, VGVAE transfers syntax and semantics in a balanced way and achieves the best syntax transfer scores overall (lowest STED with syn_src and target).

Analysing the scores of QKVAE

The semantic similarity scores (PB) of QKVAE outputs with target and sem_src are close to those of Optimus outputs. Although these scores are low compared to supervised models, they are notably higher than the semantic similarity scores between unrelated sentences (e.g. syn_src and sem_src). However, in contrast to Optimus, QKVAE outputs display low PB scores with syn_src, which shows that they draw very little semantic information from the syntactic sources. Concerning syntactic transfer in Table 3, QKVAE outputs share syntactic information with syn_src at all levels (low STED, and high TMA2 and TMA3). Our model is even competitive with SynPG on TMA2, and better on TMA3 and STED. As expected, the scores comparing QKVAE outputs to sem_src show that they share very little syntactic information. On the other hand, ADVAE shows poor transfer performance on syntax and semantics, with only slight differences between scores w.r.t. syn_src and scores w.r.t. sem_src.

sem_src syn_src SynPG VGVAE QKVAE target
we have destroyed the 49th armored division. concomitant usage is not recommended. we have destroyed the 49th armored division. armored division hasn’t destroyed. this military force will be destroyed. 49th armored division has been destroyed .
let the fire burn and put a piece of hot iron in it. sing a song. sing a song for boys. don’t put the fire in it burn a hot piece of iron and fire. burn the fire. put the iron on burns. come on fire. get a fire on it. keep this fire going. keep a piece of hot iron on it.
they took the lunch boxes ? have you given me your hands ? do they boxes took the lunch ? have they taken them your snacks ? have you heard of some lunch ? have they taken the lunch boxes ?
does it have a coach ? that’s a phone switcher, right ? how does it have a coach ? that’s a coach coach, right ? that’s a warden, huh? it has a coach, no ?
an old lady in a cemetery. that is a bad time for a war. there’s a lady in an old cemetery. that’s an old lady in the cemetery. this is a strange place for a woman. there is an old lady in the cemetery.
don’t be afraid. there are still many places to go. you don’t be afraid. there aren’t be afraid to be. there will be no need to worry. there is no need to be afraid .
isn’t there a door open ? the machines are still good, right ? a isn’t open door there ? the doors aren’t open, right ? the door will be open, okay? there is a door open, right ?
Table 5: Syntactic sources (syn_src), semantic sources (sem_src), the sentences produced when using them with different models, and the corresponding correct paraphrases (target).

5.4 Comparing our Model to a Supervised Model with Less Data

Since VGVAE displays balanced syntactic and semantic transfer capabilities, we use it for this experiment, in which we train it on subsets of increasing sizes drawn from its original training data. Our goal is to find out how much labeled data is needed for VGVAE to outperform our unsupervised model on both transfer metrics.

Figure 1: Plotting STED w.r.t syn_ref and the PB cosine similarity w.r.t sem_ref for VGVAE with different amounts of labeled data and for QKVAE. Points are scaled proportionally to the amount of training data. The vertical and horizontal diameters of each ellipse are equal to the standard deviation of the associated data points and axes.

In Figure 1, we plot, for QKVAE and for the VGVAE instances, the STED of their outputs w.r.t. syn_src and the PB of these outputs w.r.t. sem_src. All values are averages over 5 runs, with standard deviations plotted as ellipses. Figure 1 shows that, to outperform QKVAE on both syntactic and semantic transfer, VGVAE needs more than 50K labeled samples.

6 Discussion and conclusion

In Table 5, we display example outputs of SynPG, VGVAE, and QKVAE along with their syntactic sources, semantic sources, and targets. We generally observed that the outputs of QKVAE range from paraphrases (line 6) to broadly related sentences (line 3). As was shown by our quantitative results, outputs from VAE-based models (VGVAE and QKVAE) share relatively few lexical items with the semantic input. This can be seen in the qualitative examples where they often swap words in the semantic source with closely related words (e.g. "armored division" to "military force" in line 1, or "lunch boxes" to "snacks" in line 2). We attribute this quality to the smoothness of the latent space of VAEs which places coherent alternative lexical choices in the same vicinity. The examples above also show that our model is capable of capturing and transferring various syntactic characteristics such as the passive form (line 1), the presence of subject-verb inversion (lines 3, 4, and 7), or interjections (lines 4 and 6).

We presented QKVAE, an unsupervised model which disentangles syntax from semantics without syntactic or semantic information. Our experiments show that its latent variables effectively position sentences in the latent space according to these attributes. Additionally, we show that QKVAE displays clear signs of disentanglement in transfer experiments. Although the semantic transfer is moderate, syntactic transfer with QKVAE is competitive with SynPG, one of its supervised counterparts. Finally, we show that VGVAE, a supervised model, needs more than 50K samples to outperform QKVAE on both syntactic and semantic transfer.

We plan to extend this work in three directions: i) finding ways to bias the representation of each value latent variable towards understandable concepts; ii) applying QKVAE to non-textual data, since it is data agnostic (e.g. to rearrange elements of a visual landscape); iii) investigating the behavior of QKVAE on other languages.

Acknowledgments

This work is supported by the PARSITI project grant (ANR-16-CE33-0021) given by the French National Research Agency (ANR), the Laboratoire d’excellence “Empirical Foundations of Linguistics” (ANR-10-LABX-0083), as well as the ONTORULE project. It was also granted access to the HPC resources of IDRIS under the allocation 20XX-AD011012112 made by GENCI.

References

Appendix A Results on the development set

We hereby display the scores on the development set. The encoder scores concerning the specialization of latent variables are in Table 6, while the transfer scores are in Table 7 for semantics and Table 8 for syntax. The values on the development set concerning the comparison of QKVAE with VGVAE trained on various amounts of data are in Figure 2.

Figure 2: Plotting STED w.r.t syn_ref and the PB cosine similarity w.r.t sem_ref for VGVAE with different amounts of labeled data and for QKVAE. Points are scaled proportionally to the amount of training data. The vertical and horizontal diameters of each ellipse are equal to the standard deviation of the associated data points and axes.

Appendix B Hyper-parameters

Hyper-parameter values

The weight $\beta$ on the $\mathrm{KL}$ divergence is set to 0.6 for $z^V$ and to 0.3 for $z^K$, and the threshold $\lambda$ for the Free-Bits strategy is set to 0.05. $\beta$ annealing is performed between steps 3K and 6K for one of the two variables, and between steps 7K and 20K for the other. The model is trained using Adafactor Shazeer2018Adafactor:Cost, a memory-efficient variant of Adam Kingma2015. Using a batch size of 64, we train for 40 epochs, which takes about 30 hours on a single Nvidia GeForce RTX 2080 GPU. We use 4 layers for both the Transformer encoders and decoders. The encoders (resp. decoders) are initialized with parameters from the first 4 layers (resp. last 4 layers) of the BART encoder (resp. decoder). In total, our model uses 236M parameters.
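
For reference, the reported settings could be gathered into a configuration object along the following lines. The dictionary keys are our own naming, and the mapping of the two annealing windows onto the two latent groups is left unassigned because it is not specified above.

```python
# Hypothetical training configuration mirroring Appendix B (key names are ours).
qkvae_config = {
    "latent_size": 768,          # size of z^K and of z^V
    "n_value_latents": 4,        # n_z
    "n_layers": 4,               # Transformer encoder/decoder layers, taken from BART
    "beta_values": 0.6,          # KL weight on the value latents
    "beta_key": 0.3,             # KL weight on the key latent
    "free_bits": 0.05,           # Free-Bits threshold lambda
    "anneal_windows": [(3_000, 6_000), (7_000, 20_000)],  # one window per latent group
    "optimizer": "Adafactor",
    "batch_size": 64,
    "epochs": 40,
}
```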

Manual Hyper-parameter search

Given that the architecture for Transformer layers is fixed by BART, we mainly explored 3 parameters: number of latent variables , number of Transformer layers, values for . Our first experiments have shown that setting to 8 or 16 does not yield good results, which is probably due to the fact that a high

raises the search space for possible arrangements of values with keys, and consequently makes convergence harder. Concerning the number of layers, we observed that results with the full BART model (6 layers) have high variance over different runs. Reducing the number of layers to 4 solved this issue. In regards to

, we observed that it must be or less for the model to produce adequate reconstructions and that it is beneficial to set it slightly lower for than for so as to absorb more syntactic information with .