It is standard for neural networks to map an input to a point in a d-dimensional real space (hochreiter1997long; vaswani2017attention; lecun1989backpropagation). However, this makes it difficult to find a specific point when the space is sampled randomly, which can limit the applicability of pre-trained models beyond their initial scope. Some approaches do map an input to a volume in the latent space. The family of approaches that stems from the idea of Variational Autoencoders (Kingma2014; Bowman2016; rezende2015variational; chen2018neural) is trained to encourage such representations. By encoding an input into a probability distribution that is sampled before decoding, several neighbouring points can end up representing the same input.
However, this often implies having two summands in the loss, a Kullback–Leibler divergence term and a log-likelihood term (Kingma2014; Bowman2016), that pull in different directions. If we want a smooth and volumetric representation, encouraged by the KL loss, it may come at the cost of worse reconstruction or classification, encouraged by the log-likelihood. Each term therefore diminishes the strength and influence of the other.
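To make the trade-off concrete, the two summands can be sketched numerically. The sketch below assumes a diagonal-Gaussian posterior and a standard-normal prior, with the reconstruction term abstracted as a given negative log-likelihood; the helper names are ours, not from any paper.

```python
import math

# Sketch of the two competing VAE summands: a reconstruction term and a
# KL term for a diagonal-Gaussian posterior against a N(0, I) prior.
# `vae_loss` and `kl_diag_gaussian` are illustrative names, not a real API.

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def vae_loss(recon_nll, mu, logvar, kl_weight=1.0):
    # kl_weight < 1 (annealing) relaxes the pull towards the prior,
    # trading smoothness of the representation for reconstruction quality.
    return recon_nll + kl_weight * kl_diag_gaussian(mu, logvar)

# A posterior collapsed onto the prior (mu = 0, logvar = 0) has zero KL,
# so the loss reduces to the reconstruction term alone.
loss = vae_loss(recon_nll=2.3, mu=[0.0, 0.0], logvar=[0.0, 0.0])
```

Any pressure towards a wider posterior (larger volume per input) raises the reconstruction term, which is exactly the conflict described above.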
By partially giving up on the smoothness of the representation, we instead propose a method to explicitly construct volumes, without a loss that implicitly encourages such behavior. We propose AriEL, a method that maps sentences to volumes in [0, 1]^d for efficient retrieval, with either random sampling or a network that operates in its continuous space. It draws inspiration from arithmetic coding (AC) and k-d trees (KdT), and we name it after them: Arithmetic coding and k-d trEes for Language (AriEL). For simplicity we choose to focus on language, even though the technique is applicable to the coding of any variable-length sequence of discrete symbols. We show how such a volume representation eases the retrieval of stored learned patterns.
AC is one of the most efficient lossless data compression techniques (witten1987arithmetic; arithmetic1963). As illustrated in figure 1, AC assigns a sequence to a segment in [0, 1] whose length is proportional to the frequency of that sequence in the dataset. KdT (bentley1975multidimensional) is a data structure for storage that can handle different types of queries efficiently. It is typically used as a fast approximation to k-nearest neighbours in low dimensions (friedman1977algorithm). It organizes data in space according to which half of the space each point belongs to with respect to the median along one axis. It then moves to the next of the k axes and repeats the process of splitting with respect to the median and turning to a new axis.
Our contributions are therefore:
the use of a context-free grammar and a random bias in the dataset (Section 3.3), which allow us to automatically quantify the quality of the generated language;
2 Related Work
Volume codes: We define a volume code as a pair of functions, an encoding and a decoding function, where the encoding function maps an input into a set that contains compact and connected subsets of R^d (munkres2018elements), and the decoding function maps every point within that set back to the input. It is a form of distributed representation (hinton1984distributed), in the sense that the latter only assumes that the input will be represented as a point in R^d. For simplicity, we define point codes as their complement: the distributed representations that are not volume codes, and that therefore map to isolated points in R^d. Volume codes differ from coarse coding (hinton1984distributed) in that, in the latter, the code is a list of zeros and ones identifying which overlapping sets the input falls into. Both generative and discriminative models (ng2002discriminative; Kingma2014; jebara2012machine) can end up learning volume codes by encouraging smoothness of representation via the loss function (bengio2013representation). We call the result an implicit volume code when the volume code is encouraged through the loss function, and an explicit volume code when the volumes are instead constructed through the arrangement of neurons in the network.
Sentence generation through random sampling: Generative Adversarial Networks (GAN) (Goodfellow2014) are generative models conceived to map random samples to a learned generation through a two-player game training procedure. They have historically had trouble with text generation, due to the non-differentiability of the argmax performed at the end of the generator, and because partially generated sequences are non-trivial to score (yu2017seqgan). Several advances have significantly improved the performance of this technique for text generation, such as using the generator as a reinforcement learning agent trained on next-symbol generation through Policy Gradient (yu2017seqgan), avoiding the binary classification typical in GAN in favor of a cross-entropy discriminator that evaluates each generated word (xu2018diversity), or using the Gumbel-Softmax distribution (kusner2016gans). Random sampling of the latent space is used as well by Variational Autoencoders (VAE) (Kingma2014), to smooth the representation of the learned patterns. Training VAE for text has been shown to be possible with KL annealing and word dropout (Bowman2016), and made easier with convolutional decoders (46890; yang2017improved). An important line of research has focused on generalizing VAE to more flexible priors, through techniques such as Normalizing Flows (rezende2015variational) or Inverse Autoregressive Flows (kingma2016improved). Several works explore how VAE and GAN can be combined (makhzani2015adversarial; tolstikhin2017wasserstein; mescheder2017adversarial), where GAN provides VAE with a learnable prior distribution, and VAE provides GAN with a more stable training procedure. We design AriEL to be usable in combination with the previously mentioned methods, since it can act as a generator or a discriminator in a GAN, or as an encoder or a decoder in an autoencoder. However, it differs from them in the explicit procedure used to construct volumes in the latent space that correspond to different inputs. The intention is to fill the entire latent space with the learned patterns, to make them easy to retrieve by uniform random sampling.
Arithmetic coding and neural networks: AC has been used for neural network compression (wiedemann2019deepcabac), but typically neural networks are used in AC as the model of the data distribution, to perform prediction-based compression: for real-time compressed speech transmission (pasero2003neural), image compression (triantafyllidis2002neural; jiang1993lossless), high-efficiency video coding (ma2019neural) and general-purpose compression (tatwawadi2018deepzip). We turn AC into a compression algorithm over real numbers, to combine its properties with the properties of high-dimensional spaces, the domain of neural networks.
K-d trees and neural networks: Neural networks are typically used in conjunction with KdT to reduce the dimensionality of the search space, so that KdT can perform queries efficiently (woodbridge2018detecting; yin2017efficient; vasudevan2009gaussian). KdT have been substituted completely by neural networks in (cheng2018deep), to make better use of the limited memory resources of low-performance hardware. KdT has also been used in combination with Delaunay triangulation for function learning, as an alternative to neural networks with backpropagation (gross1995kd). A KdT-inspired algorithm is used in (maillard1994neural) to guide the creation of neurons to grow a neural network. We use KdT to make sure that when we turn AC into a multidimensional version of itself, it splits all dimensions of R^d in a systematic way, so it can make use of all the available space.
3.1 AriEL: volume coding of language in continuous spaces
AriEL maps a sentence to a d-dimensional volume of [0, 1]^d. The sentence is encoded as the center of that volume for simplicity, and any point within it is decoded to the same sentence. Decoding iteratively computes the bounds of the volumes for all possible next symbols and checks inside which bounds the vector falls, to find the next symbol at each step. The exact algorithm is described in Algorithms 1 and 2.
To adapt KdT to more than binary splits, we split the chosen dimension by giving each possible next symbol a portion of the segment from 0 to 1, proportional to the probability of that next symbol: the first symbol s1 in the sentence is assigned a segment of length p(s1) on the first axis chosen, and subsequent symbols are assigned a segment proportional to their probability conditioned on the symbols previously seen, e.g. p(s3 | s1, s2) on the third axis, where s1, s2 and s3 are the first three symbols in the sentence. We then turn to the following axis and continue the process of splitting and turning.
In figure 1, for example, the possible symbols in the dataset are {A, B, C, EOS}, where EOS stands for End-Of-Sentence. The initial token A is given a portion of the first axis of length p(A), larger than the portion given to the other initial tokens, since fewer sentences start with them. Then, we split the second axis according to the probability of the next symbol. In this case the second most likely symbol after A is B, and that is why AB ends up with a larger volume than the alternatives, such as A followed directly by EOS (an abbreviation for the one-symbol sentence 'A'). As the sentences become longer than the chosen dimensionality, subsequent symbols are assigned an axis that has already been split, but only the section of interest is further split. So, in the figure, the sentence ABC will take a portion of AB equal to p(C | A, B), while 'AB' will take a portion equal to p(EOS | A, B).
We select the next axis of [0, 1]^d to split as x_i with i = t mod d, where t is the current length of the sequence. If t is larger than the dimension d, then the segment of x_i previously selected by the splitting process is split again. Since we do not have access to the true statistics of the data, p(s_t | s_1, ..., s_{t-1}), we use a neural network to approximate that distribution, the Language Model (LM) of AriEL. This approximates the frequency information that makes AC entropically efficient (figure 1), since after a successful training the LM output converges to the true conditional distribution. AriEL thus conserves the arithmetic coding property of assigning larger volumes to frequent sentences.
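A minimal sketch of this d-dimensional splitting, assuming a hypothetical bigram table in place of the trained Language Model (the real method uses the LM's conditional probabilities and the full Algorithms 1 and 2):

```python
# AriEL-style volume coding sketch: axes are split round-robin (as in a
# k-d tree), each split proportional to the conditional next-symbol
# probabilities. `lm` is a toy bigram table standing in for the LM.

def _segments(lo, hi, cond):
    """Split [lo, hi] into sub-segments proportional to cond[symbol]."""
    segs, cum = {}, 0.0
    for sym, p in cond.items():
        segs[sym] = (lo + cum * (hi - lo), lo + (cum + p) * (hi - lo))
        cum += p
    return segs

def encode(sentence, lm, d):
    """Map a sentence to the centre of its hyper-rectangle in [0, 1]^d."""
    lo, hi = [0.0] * d, [1.0] * d
    prev = "BOS"
    for t, sym in enumerate(sentence + ["EOS"]):   # EOS pins the exact sentence
        ax = t % d                                  # round-robin axis choice
        lo[ax], hi[ax] = _segments(lo[ax], hi[ax], lm[prev])[sym]
        prev = sym
    return [(l + h) / 2 for l, h in zip(lo, hi)]

def decode(z, lm, d, max_len=20):
    """Walk the nested segments that contain z, emitting one symbol per step."""
    lo, hi = [0.0] * d, [1.0] * d
    prev, out = "BOS", []
    for t in range(max_len):
        ax = t % d
        for sym, (l, h) in _segments(lo[ax], hi[ax], lm[prev]).items():
            if l <= z[ax] < h:
                break
        if sym == "EOS":
            return out
        out.append(sym)
        lo[ax], hi[ax] = l, h
        prev = sym
    return out

# Toy conditional table P(next | prev); each row sums to 1.
lm = {"BOS": {"A": 0.7, "B": 0.3},
      "A":   {"B": 0.6, "EOS": 0.4},
      "B":   {"C": 0.5, "EOS": 0.5},
      "C":   {"EOS": 1.0}}
z = encode(["A", "B", "C"], lm, d=2)
# the centre (or any interior point of the hyper-rectangle) decodes back
sentence = decode(z, lm, d=2)
```

Likely sequences receive wide segments on every split and therefore large volumes; unlikely ones shrink exponentially, which is the property exploited for retrieval by random sampling.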
AriEL only uses a bounded region of R^d, the interval [0, 1]^d, so the encoder and decoder map each input to, and from, a compact set respectively. Moreover, the AriEL encoder assigns a sequence s to a hyper-rectangle (hyperrectangle), and the AriEL decoder assigns values inside that hyper-rectangle to the sequence s. Since hyper-rectangles cannot be divided into two disjoint non-empty closed sets, they are connected (munkres2018elements). Therefore AriEL is a volume code. AriEL is an explicit volume code, since its LM is trained only on a next-word-prediction log-likelihood loss, without a regularization term, and the volumes are constructed by arranging the outputs of the softmax neurons into a d-dimensional grid. However, even though sentences that start with the same symbols remain close to each other, making the representation smooth (bengio2013representation), this imposes a prior that might not be suited to every task.
In this work, AriEL's language model consists of a word embedding, followed by an LSTM unit and a feedforward layer that outputs a softmax distribution over the possible next symbols. The argmax is not applied directly to the softmax; instead, the probabilities it defines are used as a deterministic Russian roulette, with the latent space point acting as the deterministic pointer that chooses the position on the roulette.
The most notable features of AriEL are that (1) if the language model has learned the grammar, almost all of the continuous space encodes almost only grammatically correct sentences, and (2) sequences are mapped to a volume, not to a point, so small noise in the continuous space will produce the same sentence. Theoretically this is so because a sequence of symbols that are unlikely together, with a very small joint probability, will be assigned an exponentially small volume of [0, 1]^d. We support this claim experimentally with the generation studies below. Another feature inherited from KdT is (3) that any sequence is assigned a volume in the continuous space, no matter how small: in theory this allows AriEL to perfectly encode and decode any sequence to itself. This is studied experimentally with the generalization studies below.
AriEL with an RNN-based language model has a computational complexity of O(T·H^2) for both encoding and decoding, as can be seen in Algorithms 1 and 2, where T is the length of the sequence and H is the dimensionality of the RNN hidden state. We use capital H to refer to the length of the longest hidden layer among the recurrent layers in the encoder or the decoder, while d refers to the length of the last hidden layer, the one that defines the size of the latent space. Since the language model only performs short-term (i.e. next word) modelling, AriEL allows the use of a smaller H and thus has significantly fewer trainable parameters than other seq2seq models. This time complexity for both encoding and decoding is on par with conventional recurrent networks for seq2seq learning.
3.2 Neural Networks: models and experimental conditions
We compare AriEL to some classical approaches for mapping variable-length discrete spaces to fixed-length continuous spaces: the sequence-to-sequence recurrent autoencoder (AE) (Sutskever2014), its variational version (VAE) (Bowman2016) and the Transformer (vaswani2017attention). All of them are trained for next-word prediction of word w_t when all the previous words are given as input, and they are trained on the biased train set defined in section 3.3. We applied teacher forcing (6795228) to the decoder during training. All of them can be split into an encoder that maps the input sentences into R^d, and a decoder that maps back from R^d into a sentence. For all methods, the whole input sentence is fed to the encoder, and subsequently the output of the encoder is used to produce the complete decoded output. We call word vector representations the models that pass vectors representing words from the encoder to the decoder, such as the Transformer, and sentence vector representations the models that pass a single vector representing the whole sentence from the encoder to the decoder, such as AE, VAE and AriEL (kiros2015skip).
For both AE and VAE, we stack two GRU layers (Cho2014) with 128 units in both the encoder and the decoder, to increase their representational capabilities (Pascanu2014). Other recurrent networks gave similar results (hochreiter1997long; Li2018). The last encoder layer has either 16 or 512 units for all methods. The output of the decoder is a softmax distribution over the entire vocabulary. AriEL's Language Model was implemented as an embedding layer of dimension 64, followed by an LSTM of 140 units and a fully connected layer with a softmax output to predict the next symbol in the sentence.
Since the Transformer is a fixed-length representation at the word level but variable-length at the sentence level, we padded all sentences to the maximum length in the dataset to be able to compare its latent space capacity to the other models. We split the Transformer into encoder and decoder to study how they make use of the real latent space, and we use its decoder as a generator by sampling its input randomly. The literature tends to use only the decoder of the original model (dai2019transformer; radford2018improving), and there is an active interest in widening the applications and clarifying the inner mechanisms (kovaleva-etal-2019-revealing; jawahar-etal-2019-bert). We treat the Transformer's d_model as its latent dimension, taking a value of 16 or 512. We choose most of the other parameters as in the original work (vaswani2017attention), such as the number of parallel attention heads, the key and value dimensions, and a dropout regularization of 0.1; we only change the number of stacked identical decoders and encoders, and the inner dimension of the position-wise feed-forward network, to obtain a number of parameters similar to the other methods.
We choose the hyper-parameters of AE, VAE, Transformer and AriEL's LM to keep the number of trainable parameters comparable across methods: for the toy dataset and d = 16, about 270K parameters. For d = 512 we used the same hyper-parameters as before and only changed the latent dimension. This implied that each model scaled differently: 120M parameters for AE and VAE, 9M for the Transformer, while AriEL keeps the same small network, since the size of its latent space can be set at any time without any trainable parameters depending on it. During training on the GuessWhat?! dataset, the scaling with respect to the other methods was different: we trained the 16-dimensional Transformer on the GuessWhat?! dataset with a configuration that brought its number of trainable parameters to the same order of magnitude as the other methods, 2,666K rather than 588K, but the performance was much worse, so we decided to present the results for the smaller Transformer. For d = 512, again, all of the models scaled differently, as can be seen in table IV.
We go through the training data 10 times, in mini-batches of 256 sentences. We use the Adam optimizer (Kingma2015) with a learning rate of 1e-3 and gradient clipping at a magnitude of 0.5. During training, the learning rate was reduced by a factor of 0.2 if the loss did not decrease over the last 5 epochs, with a minimum learning rate of 1e-5. For all RNN-based embeddings, kernel weights used the Xavier uniform initialization (Glorot2010), while recurrent weights used random orthogonal matrix initialization (Saxe2014). All biases are initialized to zero. All embedding layers are initialized with a uniform distribution in [-1, 1]. For the Transformer, all the matrices in the multi-head attention and in the position-wise feed-forward module used the Xavier uniform initialization (Glorot2010); the beta of the layer normalization is initialized with zeros and its gamma with ones. AE and VAE are trained with a word dropout of 0.25 at the input, and VAE is trained with KL loss annealing that moves the weight of the KL loss from zero to one during the 7th epoch, similarly to the original work (Bowman2016).
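The plateau-based learning-rate schedule described above can be mirrored in a few lines of pure Python (in practice one would use the equivalent framework callback; the class name here is our own):

```python
# Mirror of the schedule above: start at 1e-3, multiply by 0.2 when the
# loss has not improved for 5 epochs, floor at 1e-5. `ReduceOnPlateau`
# is our own name for this sketch, not a framework class.

class ReduceOnPlateau:
    def __init__(self, lr=1e-3, factor=0.2, patience=5, min_lr=1e-5):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best, self.wait = float("inf"), 0

    def step(self, loss):
        """Call once per epoch with the monitored loss; returns the new rate."""
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
            if self.wait > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = ReduceOnPlateau()
for loss in [1.0, 0.9] + [0.9] * 6:   # the loss plateaus after epoch 2
    lr = sched.step(loss)
# after 6 epochs without improvement the rate drops to 1e-3 * 0.2 = 2e-4
```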
3.3 Datasets: toy and human sentences
We perform our analysis on two datasets: a toy dataset of sentences generated from a context-free grammar, and a realistic dataset of sentences written by humans while playing a cooperative game.
The toy dataset: we generate questions about objects with a context-free grammar (CFG), fully specified in the Supplementary Materials, section 7. To stress the learning methods and understand their limits, we choose a CFG with a large vocabulary and numerous grammar rules, rather than smaller but more classic alternatives (e.g. REBER). The intention is also to focus on dialogue agents, which is why all sentences are framed as questions about objects.
In this work we distinguish between unbiased sentences, those that have simply been sampled from the CFG, and biased sentences, those that after being sampled from the CFG have been selected according to an additional structural constraint. To do so we generate an adjacency matrix of words that can occur together in the same sentence, and we use it as the filter to bias the sentences. For simplicity, the adjacency matrix is generated randomly. The intention is to emulate the setting where a CFG is constrained by realistic scenes, in which case not all grammatically correct sentences are semantically correct: e.g. "Is it the wooden toilet in the kitchen?" could be grammatically correct in a given CFG, but semantically incorrect, given that it does not usually occur in a realistic scene. We use this to measure to which degree each learning method is able to extract the grammar and the role of each word, despite a bias that could make the task harder.
The vocabulary consists of 840 words. The maximal length of the sentences is 19 symbols and the mean length is 9.9 symbols. We split the biased dataset into 1M train sentences, 10k test sentences and 512 validation sentences, with no sentence shared between sets. The validation set is small to speed up training. We created another set of 10k unbiased test sentences with the same CFG, where we only gather sentences that do not follow the adjacency matrix, to make sure that this test set has zero overlap with the previous ones. We train the learning models on the biased sentences and we use the unbiased ones to test whether they were able to grasp the underlying grammar.
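The biasing filter can be sketched as follows; the vocabulary, candidate sentences and adjacency matrix below are toy placeholders, not the actual 840-word CFG:

```python
import itertools
import random

# Toy sketch of the structural bias: a random symmetric adjacency matrix
# decides which word pairs may co-occur, and CFG samples are split into
# biased (all pairs allowed) and unbiased (some pair forbidden) pools.

random.seed(0)
vocab = ["is", "it", "the", "wooden", "toilet", "kitchen", "red", "table"]
idx = {w: i for i, w in enumerate(vocab)}

adj = [[1] * len(vocab) for _ in vocab]
for i, j in itertools.combinations(range(len(vocab)), 2):
    adj[i][j] = adj[j][i] = random.randint(0, 1)

def is_biased(sentence):
    """True iff every pair of words in the sentence is allowed to co-occur."""
    ids = [idx[w] for w in sentence]
    return all(adj[a][b] for a, b in itertools.combinations(ids, 2))

candidates = [["is", "it", "the", "red", "table"],
              ["is", "it", "the", "wooden", "toilet"]]
biased = [s for s in candidates if is_biased(s)]        # training pool
unbiased = [s for s in candidates if not is_biased(s)]  # generalization test pool
```

Both pools are grammatical by construction; only the co-occurrence constraint separates them, which is what lets the unbiased pool probe generalization.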
The real dataset: we choose the GuessWhat?! dataset (deVries2016), a dataset of sentences asked by humans to other humans to solve a cooperative game. This dataset features a vocabulary of 10,469 words, an order of magnitude larger than the toy CFG. The maximal length of the sentences is 57 symbols, and the mean length is 5.9 symbols.
3.4 Evaluation Metrics
We perform a qualitative and a quantitative assessment of the models.
3.4.1 Qualitative evaluations
The four qualitative studies are: (1) we list a few samples of reconstruction via next-word prediction of unbiased sentences, to understand the generalization capabilities of the different models (table I); (2) we list a few samples of sentences generated when the latent space is sampled randomly, to understand the generation capabilities (table II); (3) we visualize all the points randomly sampled in the latent space for generation, color coded according to the number of adjectives present in the produced sentence if it was grammatically correct, and in a separate color if it was not (first row, figure 2); and (4) we visualize where biased test sentences belonging to different grammar rules and sentence lengths were mapped by the encoder (second and third rows, figure 2). All qualitative studies are performed for d = 16.
3.4.2 Quantitative evaluations on the toy grammar, CFG
We propose measures that cover three properties of an autoencoder: the quality of generation, prediction and generalization. We perform our studies for networks with a latent dimension of 16 units, to understand their compression limits, and with a latent dimension of 512 units, which is often taken as the default size in the literature (Kingma2014; vaswani2017attention).
Generation/Decoding Quality is evaluated with sentences produced by the decoder when the latent space of each model is sampled randomly. The sampling is done uniformly in the continuous latent space, within the maximal hyper-cube defined by the encoded test sentences. We sample 10k sentences and apply four measures: i) grammar coverage (GC), the number of grammar rules (e.g. single adjective, multiple adjectives) that could be parsed in the sampled sentences, over four, the maximal number of adjectives plus one for sentences without adjectives; ii) vocabulary coverage (VC), the ratio between the number of vocabulary words that appeared in the sampled sentences and 840, the size of the complete vocabulary; iii) uniqueness (U), the ratio of unique sampled sentences; and iv) validity (V), the ratio of valid sampled sentences, i.e. sentences that were both unique and grammatically correct.
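A sketch of how these sampling-based measures can be computed, with a stand-in grammaticality test in place of the CFG parser:

```python
# Stand-in computation of VC, U and V over a set of sampled sentences.
# `is_grammatical` abstracts the CFG parser; the vocabulary and samples
# below are toy placeholders.

def generation_metrics(samples, vocab, is_grammatical):
    unique = set(samples)
    seen = {w for s in unique for w in s.split()}
    vc = len(seen & set(vocab)) / len(vocab)            # vocabulary coverage
    u = len(unique) / len(samples)                      # uniqueness
    v = sum(is_grammatical(s) for s in unique) / len(samples)  # validity
    return vc, u, v

vocab = ["is", "it", "red", "blue", "?"]
is_grammatical = lambda s: s.endswith("?")   # stand-in for the CFG parse
samples = ["is it red ?", "is it red ?", "is it blue ?", "blue red"]
vc, u, v = generation_metrics(samples, vocab, is_grammatical)
# vc = 5/5, u = 3/4, v = 2/4 (unique AND grammatical, over all samples)
```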
Prediction Quality is evaluated by encoding the 10k biased test sentences and looking at the reconstructions produced by the decoder, using the following objective criteria: i) prediction accuracy biased (PAB), the ratio of correctly reconstructed sentences (i.e. all words must match); ii) grammar accuracy (GA), the ratio of grammatically correct reconstructions (i.e. those that can be parsed by the CFG, even if the reconstruction is not accurate); and iii) bias accuracy (BA), the ratio of inaccurate reconstructions that are still grammatical and keep the bias of the training set.
Generalization Quality is evaluated using the 10k unbiased test sentences, while the embeddings were trained on the biased training set. The prediction accuracy unbiased (PAU) is computed in the same way as PAB, as the ratio of correctly reconstructed unbiased sentences. It measures how well the latent space generalizes to grammatically correct sentences outside the language bias.
3.4.3 Quantitative evaluations on the real dataset, GuessWhat?!
In a real dataset we do not have a notion of what is grammatically correct, since humans can spontaneously use ungrammatical constructions. We quantified the quality of the language learned with two measures: uniqueness, the percentage of sentences generated by random sampling that were unique over the 10K generations, and validity, the percentage of the unique sentences that could be found in the training data, indicating how easy it was to retrieve the learned information.
3.4.4 Quantitative evaluations: random interpolations within AriEL
In figure 3 we show what we call the interpolation diversity as a function of the dimension of the latent space. It measures how many of the sentences generated along a straight line between two random points in [0, 1]^d were unique and grammatically correct for AriEL, for different values of d. The Language Model is the one trained on the toy grammar.
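The measurement can be sketched as follows, with any point-to-sentence decoder in place of AriEL's; the toy decoder below is a placeholder:

```python
import random

# Count the distinct sentences decoded along a straight line between two
# random points of [0, 1]^d. `decode` is any point-to-sentence function;
# the trivial decoder below is a placeholder for AriEL's.

def interpolation_diversity(decode, d, steps=100, seed=0):
    rng = random.Random(seed)
    a = [rng.random() for _ in range(d)]
    b = [rng.random() for _ in range(d)]
    sentences = set()
    for k in range(steps + 1):
        t = k / steps
        z = [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]
        sentences.add(tuple(decode(z)))
    return len(sentences)

# A decoder that only distinguishes the first coordinate's half-space can
# yield at most two distinct "sentences" along any line.
toy_decode = lambda z: ["low"] if z[0] < 0.5 else ["high"]
n = interpolation_diversity(toy_decode, d=4)
```

In the paper's version, only the unique and grammatically correct sentences along the line are counted.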
4.1 Qualitative Evaluations
We present the qualitative studies performed for d = 16. Table I shows the output of the generalization study. To avoid cherry-picking, we display the first 4 reconstructed sentences. AE and VAE fail to generalize to the unbiased language; however, both manage to keep, to a large extent, the structure of the input sentence at the output. Their behavior improved significantly when the latent space dimension was increased to d = 512 (figure 4), with the corresponding increase in parameters. In theory, AriEL is able to reconstruct any sequence by design, by keeping a volume for each of them. In practice, however, it failed as well, even if slightly less often than the Transformer. Both produce reconstructions of the unbiased input at a similar rate, as can be seen in table I and in the PAU metric in table III and figure 4. This means that, to a reasonable degree, the areas that represent data unseen during training are available and relatively easy to track for AriEL and the Transformer. Instead, for AE and VAE the latent space seems to be taken almost exclusively by the content of the training set, since sentences not seen during training (in this case the unbiased sentences) cannot be reconstructed at all.
The generation study is shown in Table II (first 4 samples for each model). AriEL excels at this task: almost all generations are unique and grammatically correct (valid, in our definition). AE and VAE perform remarkably well given the small latent space. As will be shown in the quantitative study, VAE almost triples AE performance in terms of generation of valid sentences when d = 16 (validity in table III). The Transformer performs poorly at this task, and it is very hard to obtain grammatical sentences when its latent space is sampled randomly. The quantitative analysis reveals, however, that with the increase of the latent space, Transformer, AE and VAE all achieve improved validity, remaining at one third of AriEL's performance.
In figure 2, each dot represents a sentence in the latent space. In the first row, the dot in the latent space is passed as input to the decoder, while in the second and third rows the dot is the output of the encoder when the biased test sentence is fed at its input. Two random axes of the latent space are chosen for the generator (first row), while two axes were chosen subjectively among the first components of a PCA for the encoder (second and third rows). In every case, the values in the latent space were normalized between zero and one to ease the visualization. Lines are used to ease the visualization of the clusters and shifts of data with their label, since the point clouds overlap and are hard to see. The curves are constructed as concave hulls of the dots based on their Delaunay triangulation, a method called alpha shapes (1056714).
We can see in figure 2 (first row) how easy it is to find grammatical sentences when randomly sampling the latent space of each model. AriEL practically only generates grammatical sentences, and AE and VAE perform reasonably well too, while the Transformer fails. AriEL failures are plotted on top, to show how few they are, while AE and VAE failures are plotted at the bottom, as otherwise they would hide the rest, given how numerous they are. In the same figure (rows two and three) we can observe how the different methods structure the input in the latent space, each with prototypical clusters and shifts. The Transformer presents an interesting structure of clusters whose purpose remains unclear. Interestingly, the encoding maps seem to be more organized than the decoding ones. All the models seem to cluster or shift data belonging to different classes at the encoding, which could be taken advantage of by a learning agent placed in the latent space. However, it seems hard to use the Transformer as a generator module for an agent. The good performance of AriEL is a consequence of the fact that all the latent space is utilized, with no large gaps in any direction. This can be seen in the two encoding rows, where the white spaces around the cloud of dots are a consequence of the rotation performed by the PCA; otherwise all the space between 0 and 1 would be utilized by AriEL.
[Tables I and II: the first four reconstructions of unbiased test sentences, and the first four sentences generated by random sampling of the latent space, for each model; the flattened table contents are not reproduced here.]
4.2 Quantitative Evaluations
The results of the quantitative study are shown graphically in figure 4 and in table III. AriEL outperforms or closely matches every other method on all 8 measures, outperforming every alternative by a large margin on validity, which stands for unique and grammatical sentences generated and is the most important of the metrics for the toy dataset. The Transformer performs remarkably well at not overfitting, and it is able to reconstruct biased and unbiased sentences better than the other non-AriEL methods. It does so even in the under-parameterized version (small latent space, 16 dimensions). It manages to cover all grammar rules in generation, but it performs very poorly at generating a diverse set of valid sentences from uniform random sampling. Remarkably, it only needed one iteration through the data to achieve almost perfect validation accuracy, without losing performance when we trained for the complete set of 10 epochs. The 16-dimensional VAE, despite poor generalization to the biased and unbiased test sets (figure 4), results in the best non-AriEL generator, as measured by validity. In this case, it could be said that the conflict between cross-entropy and KL divergence encouraged the VAE to look for sentences outside the bias of the training set, since it was able to produce more grammatically correct, albeit unbiased, sentences than AE.
Increasing the number of learned parameters by moving from to had no effect on Transformer, which was already excellent on several of the metrics, apart from a significant improvement in validity. However, the larger latent space, and the increase in the number of parameters that followed, was necessary for AE and VAE not to overfit (better PAU and PAB).
It is not completely fair, however, to claim that AriEL is performing some sort of generalization. Rather, AriEL keeps a volume available for every possible sequence of symbols, even if of negligible size. Those volumes can easily be found and tracked given access to AriEL's Language Model.
When trained on human sentences from the GuessWhat?! dataset, similar patterns arise and AriEL again achieves the best validity. Every other approach seems to generate more unique sentences than AriEL, but only a very small fraction of them are good generations: less than of the unique sentences generated by AE, VAE and Transformer are in the training set, while AriEL achieves and more.
As we can see in the interpolation diversity study in figure 3, for low we pass through many sentences between two random points in the latent space, while as we augment the dimensionality the sentences are distributed along different directions, so we find fewer sentences when moving along the straight line between two random points. The specific curve, its lower threshold and its speed of decay will vary with the vocabulary size and the complexity of the language learned, but the shape is expected to remain similar.
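The measurement behind this study can be sketched as follows: decode at evenly spaced points along the segment between two random latent points and count the distinct outputs. This is a hypothetical sketch; `grid_decoder` is a toy stand-in, not any of the trained models:

```python
import random

def interpolation_diversity(decoder, dim, steps=100, seed=0):
    """Count the distinct sentences decoded along the straight line
    between two random points of a dim-dimensional latent space."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(dim)]
    b = [rng.random() for _ in range(dim)]
    decoded = set()
    for i in range(steps + 1):
        t = i / steps  # interpolation coefficient in [0, 1]
        decoded.add(decoder([(1 - t) * x + t * y for x, y in zip(a, b)]))
    return len(decoded)

# Hypothetical decoder: one "sentence" per cell of a coarse grid.
grid_decoder = lambda z: tuple(int(4 * c) for c in z)
```

Repeating the count over many random pairs and several values of `dim` would produce one point of the curve in figure 3 per dimensionality.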
| Method | Param | Grammar coverage | Vocabulary coverage | Validity | Uniqueness | Bias accuracy | Grammar accuracy | Prediction accuracy biased | Prediction accuracy unbiased |
|---|---|---|---|---|---|---|---|---|---|
| AriEL | 237K | 100.0 ± 0.0% | 70.4 ± 0.2% | 97.6 ± 0.2% | 99.7 ± 0.1% | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 53.1 ± 0.4% |
| Transformer | 258K | 100.0 ± 0.0% | 70.1 ± 0.8% | 4.7 ± 2.7% | 99.1 ± 0.5% | 99.98 ± 0.01% | 99.95 ± 0.02% | 99.92 ± 0.02% | 49.0 ± 0.1% |
| AE | 258K | 100.0 ± 0.0% | 6.89 ± 0.7% | 11.5 ± 4.2% | 13.9 ± 5.1% | 89.5 ± 2.3% | 98.0 ± 1.7% | 0.0 ± 0.1% | 0.0 ± 0.1% |
| VAE | 258K | 100.0 ± 0.0% | 11.5 ± 2.6% | 16.0 ± 9.2% | 24.3 ± 14.8% | 85.4 ± 5.2% | 85.1 ± 8.8% | 0.0 ± 0.1% | 0.0 ± 0.1% |
| AriEL | 237K | 100.0 ± 0.0% | 70.2 ± 0.3% | 97.9 ± 0.2% | 99.8 ± 0.1% | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 53.2 ± 0.3% |
| Transformer | 9M | 100.0 ± 0.0% | 67.3 ± 0.9% | 17.2 ± 6.3% | 87.2 ± 7.5% | 99.99 ± 0.01% | 99.91 ± 0.03% | 99.86 ± 0.05% | 49.0 ± 0.1% |
| AE | 120M | 100.0 ± 0.0% | 39.3 ± 6.0% | 21.0 ± 11.8% | 71.8 ± 5.6% | 82.2 ± 3.5% | 86.8 ± 1.3% | 34.7 ± 11.4% | 24.4 ± 6.0% |
| VAE | 120M | 85.0 ± 12.6% | 28.9 ± 2.4% | 26.5 ± 2.4% | 95.2 ± 3.8% | 73.8 ± 2.2% | 89.5 ± 2.8% | 4.3 ± 3.7% | 4.9 ± 3.6% |
Each experiment is run 5 times; we report the mean and the variance for each configuration. We comment on the content of this table in section 4.2, and the results are plotted in figure 4. Our proposed method, AriEL, achieves almost perfect performance on almost all the metrics we defined, especially on generation validity, which quantifies how many random samples gave a unique and grammatical sentence at the output of the decoder. Transformer performed exceptionally well even in the under-parameterized case, with a 16-dimensional latent space, except for validity. All methods improved their performance with a larger latent space and when over-parameterized (), particularly in validity, but still achieved less than one third of AriEL's performance. VAE is consistently the second best performer in validity, supporting our hypothesis that volume coding facilitates the retrieval of information by random sampling.
Brief summary of the results. Transformer proved exceptional at not overfitting during training, and learned very quickly. It did so while requiring significantly fewer parameters than the classical approaches (AE and VAE). The embeddings learned by the Transformer encoder revealed interesting structures that remain unexplained (figure 2); those structures are projected as keys and values to the multi-head attention (vaswani2017attention, ). However, it is hard to recover the language it learned by randomly sampling the latent space. We hypothesize that this is due to its word-level embedding nature: even though its latent dimension was defined at the word level, at the sentence level an artificial padding to the maximum sentence length () was introduced to be able to perform several of the quantitative studies. A consequence of very high dimensional spaces is that points tend to be all equally distant from each other. The explanation might therefore be that ungrammatical sentences outnumber grammatical ones, and were therefore easier to find. AE and VAE needed a high dimensional latent space, and therefore a larger number of parameters, to generalize to the biased and unbiased test sets (PAU and PAB). However, they proved to be decent generators for a small latent space.
AriEL's latent space is a free parameter. A detail worth stressing is that the size of AriEL's latent space can be defined at any time, for a fixed Language Model. It is therefore conceivable to control it with a learnable parameter, with the activity of another neuron, or as another function of the input. In fact, as we augment the dimension, each volume tends to have more neighbouring volumes that represent different sentences, as confirmed by the interpolation study (figure 3). This could also have implications during training, yielding a gradient that can rely more on its angle than on its magnitude.
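To illustrate why the dimensionality can be chosen after the fact, here is a sketch, under our own simplifying assumptions, of folding an arithmetic-coding interval of [0, 1) into a point of a `dim`-dimensional cube by distributing the binary digits of its midpoint across the axes, k-d-tree style. The function `interval_to_box` is illustrative, not AriEL's actual construction:

```python
def interval_to_box(low, high, dim, bits=30):
    """Map an interval [low, high) of [0, 1) to a dim-dimensional point.
    The binary digits of the interval midpoint are dealt out cyclically
    to the axes, so the same interval yields a point in any dimension."""
    x = (low + high) / 2.0       # midpoint of the coded interval
    coords = [0.0] * dim
    scale = [0.5] * dim          # weight of the next digit on each axis
    for i in range(bits):
        axis = i % dim           # cycle through the axes, k-d-tree style
        x *= 2.0
        bit = int(x)             # next binary digit of the midpoint
        x -= bit
        coords[axis] += bit * scale[axis]
        scale[axis] /= 2.0
    return coords
```

The same interval, hence the same sentence under the Language Model, can be placed in a 2-, 5- or 100-dimensional cube by changing only the `dim` argument, which is what makes the latent dimensionality a free parameter in this scheme.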
What to choose for a learning agent with a language module? For a learning agent that needs a language model to interact with other agents, our study suggests that it will benefit from AriEL to generate a diverse language (generation). Interestingly, AriEL outperformed the other methods not only on the toy dataset but also on the real data. To encode language, the agent might benefit from any of the methods, or an ensemble of them. Note also that in this work we trained AriEL's Language Model (LM) before using it inside AriEL; if AriEL were used as a module of a larger architecture, it would be necessary to design a proper pseudo-gradient to train its LM end to end. This could be the case when training a GAN for text goodfellow2014generative ; fedus2018maskgan , which typically fails due to (1) mode collapse and (2) the non-differentiability of argmax. AriEL would ensure the availability of a wide and complex language model, decreasing the chances of mode collapse, and its next-token selection is not necessarily done through argmax.
Partial evidence for volume codes. From the experiments performed, it is our impression that the volume aspect of AriEL is responsible for its success, which is why we propose volume coding as a definition that could prove more useful than the specific algorithm we present. We have provided some evidence of how volume coding can benefit the retrieval, by random sampling, of stored information composed of variable-length sequences of discrete symbols, in contrast with simple distributed representations. Indeed AriEL, the method that performs explicit volume coding, is the one that generates the most valid sentences, and VAE, which performs implicit volume coding encouraged by its loss, is second on the toy dataset, although it performed rather poorly on the real dataset. Point encodings can still provide an advantage, for example in classification. In our case, a point encoding is represented by Transformer, and the classification task by the PAU and PAB metrics: it trained fast, without overfitting, and under-parameterized, but the absence of dense volumes made it hard to retrieve the stored information using uniform random samples.
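The contrast between volume and point codes can be made concrete in one dimension. In the hypothetical sketch below, a volume code partitions [0, 1) into intervals proportional to sentence frequency, so every uniform sample decodes to a stored sentence, while a point code places each sentence at a single coordinate that a sample almost never hits. The sentences, frequencies and point placements are invented for illustration:

```python
import bisect
import random

sentences = ["is it red ?", "is it blue ?", "is it heavy ?"]
freqs = [0.5, 0.3, 0.2]  # hypothetical training frequencies

# Volume code: cumulative boundaries tile all of [0, 1) with intervals.
bounds = [sum(freqs[:i + 1]) for i in range(len(freqs))]

def decode_volume(z):
    i = bisect.bisect_right(bounds, z)
    return sentences[min(i, len(sentences) - 1)]  # guard fp rounding

# Point code: each sentence occupies a single point; a sample retrieves
# it only by landing within eps of it (hypothetical placement).
points = [0.25, 0.65, 0.9]

def decode_point(z, eps=1e-3):
    for s, p in zip(sentences, points):
        if abs(z - p) < eps:
            return s
    return None

rng = random.Random(0)
samples = [rng.random() for _ in range(1000)]
volume_hits = sum(decode_volume(z) is not None for z in samples)
point_hits = sum(decode_point(z) is not None for z in samples)
```

Every one of the 1000 samples retrieves a sentence under the volume code, and the retrieved frequencies approximate the training frequencies, while the point code retrieves almost nothing, which is the behavior the validity metric measures.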
6 Conclusion and Future Work
We proposed AriEL, a volume mapping of language into a continuous hypercube. It provides a latent organization of language that excels at several important metrics related to the use of language, with special emphasis on generating many unique and grammatically correct sentences by sampling the latent space uniformly, what we call valid sentences. AriEL fuses arithmetic coding and k-d trees to construct volumes that preserve the statistics of a dataset. In this way we construct a latent representation that explicitly assigns a data sample to a volume instead of a point. Compared to standard techniques, it performs favorably in generation, prediction and generalization.
Moreover, we used a manually designed context-free grammar (CFG) to generate our own large-scale dataset of sentences, and we assigned a random bias using a randomly generated adjacency matrix of words that can appear together. This allows us to (1) automatically check if the language generated by the models belongs to the grammar used and (2) understand if different deep learning methods can grasp notions of grammar separately from other notions of bias. We used a dataset of real human interactions to make sure the findings would hold for a larger vocabulary, and a less strict grammar.
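The dataset construction above can be sketched with a deliberately tiny stand-in grammar. This is not the paper's 15,396-rule CFG; `adjectives`, `nouns` and the coin-flip adjacency matrix are invented to show the mechanism of separating grammaticality from bias:

```python
import itertools
import random

# Toy stand-in for the paper's CFG: a sentence is "is it ADJ NOUN ?".
adjectives = ["red", "heavy", "tiny", "textured"]
nouns = ["box", "mirror", "carpet"]

def grammatical(sentence):
    """Membership check for the toy grammar (check 1 in the text)."""
    w = sentence.split()
    return (len(w) == 5 and w[:2] == ["is", "it"] and w[-1] == "?"
            and w[2] in adjectives and w[3] in nouns)

# Random bias: an adjacency matrix of word pairs allowed to co-occur
# in the biased (training) set; the rest form the unbiased set.
rng = random.Random(0)
allowed = {(a, n): rng.random() < 0.5
           for a, n in itertools.product(adjectives, nouns)}

biased = [f"is it {a} {n} ?" for (a, n), ok in allowed.items() if ok]
unbiased = [f"is it {a} {n} ?" for (a, n), ok in allowed.items() if not ok]
```

Every sentence in both sets belongs to the grammar; the bias only restricts which grammatical word pairs appear in training, which is what lets check (2) in the text distinguish learning the grammar from memorizing the bias.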
Recurrent-based continuous sentence embeddings largely overfit the training data and only cover a small subset of the possible language space, particularly when the latent space is small. They also fail to learn the underlying CFG and to generalize to unbiased sentences from that CFG, although they do manage to generate quite a few diverse valid sentences. Transformer avoided overfitting even after being overtrained, proving its robustness, and generalized remarkably to the unbiased data. However, it proves hard to use as a generator from the continuous latent space through random sampling.
We stress that volume-based codes can provide an advantage over point codes in generation tasks, or sampling tasks, to call them differently. Moreover, our method allows us, in theory, to sample/generate from the same probability distribution as the training set and, in practice, a much more diverse set of sentences, as demonstrated on both the toy dataset and the human dataset.
Our planned next step is to use AriEL as a module in a learning agent. It would also be interesting to apply AriEL to k-nearest-neighbour search and to optimize it for very high dimensional latent spaces. This study focused on dialogue-based language generation, which implies short sentences; it would be useful for the NLP community to understand whether this method generalizes to the compression of longer texts.
- (1) S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- (2) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
- (3) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.
- (4) D. P. Kingma and M. Welling, “Auto-encoding variational bayes.” in ICLR, 2014.
- (5) S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” in 20th SIGNLL CoNLL. Association for Computational Linguistics, 2016, pp. 10–21.
- (6) D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
- (7) T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in NeurIPS, 2018, pp. 6571–6583.
- (8) I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
- (9) P. Elias and N. Abramson, Information Theory and Coding, 1st ed., ser. Electronic Science. McGraw-Hill Inc.,US, 1963, pp. 72–89.
- (10) J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
- (11) J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM TOMS, vol. 3, no. 3, pp. 209–226, 1977.
- (12) J. R. Munkres, Elements of algebraic topology. CRC Press, 2018.
- (13) G. E. Hinton, J. L. McClelland, D. E. Rumelhart et al., Distributed representations. Carnegie-Mellon University Pittsburgh, PA, 1984.
- (15) T. Jebara, Machine learning: discriminative and generative. Springer Science & Business Media, 2012, vol. 755.
- (16) Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- (17) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
- (18) L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- (19) J. Xu, X. Ren, J. Lin, and X. Sun, “Diversity-promoting gan: A cross-entropy based generative adversarial network for diversified text generation,” in EMNLP, 2018, pp. 3940–3949.
- (20) M. J. Kusner and J. M. Hernández-Lobato, “Gans for sequences of discrete elements with the gumbel-softmax distribution,” arXiv preprint arXiv:1611.04051, 2016.
- (21) A. Severyn, E. Barth, and S. Semeniuta, “A hybrid convolutional variational autoencoder for text generation,” in EMNLP, 2017.
- (22) Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, “Improved variational autoencoders for text modeling using dilated convolutions,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 3881–3890.
- (23) D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in NeurIPS, 2016, pp. 4743–4751.
- (24) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
- (25) I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein auto-encoders,” arXiv preprint arXiv:1711.01558, 2017.
- (26) L. Mescheder, S. Nowozin, and A. Geiger, “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 2391–2400.
- (27) S. Wiedemann, H. Kirchhoffer, S. Matlage, P. Haase, A. Marban, T. Marinc, D. Neumann, A. Osman, D. Marpe, H. Schwarz et al., “Deepcabac: Context-adaptive binary arithmetic coding for deep neural network compression,” arXiv preprint arXiv:1905.08318, 2019.
- (28) E. Pasero and A. Montuori, “Neural network based arithmetic coding for real-time audio transmission on the tms320c6000 dsp platform,” in ICASSP, vol. 2. IEEE, 2003, pp. II–761.
- (29) G. Triantafyllidis and M. Strintzis, “A neural network for context-based arithmetic coding in lossless image compression,” in WSES ICNNA, 2002.
- (30) W. Jiang, S.-Z. Kiang, N. Hakim, and H. Meadows, “Lossless compression for medical imaging systems using linear/nonlinear prediction and arithmetic coding,” in ISCAS. IEEE, 1993, pp. 283–286.
- (31) C. Ma, D. Liu, X. Peng, Z.-J. Zha, and F. Wu, “Neural network-based arithmetic coding for inter prediction information in hevc,” in ISCAS. IEEE, 2019, pp. 1–5.
- (32) K. Tatwawadi, “Deepzip: Lossless compression using recurrent networks,” URL https://web.stanford.edu/class/cs224n/reports/2761006.pdf, 2018.
- (33) J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, “Detecting homoglyph attacks with a siamese neural network,” in SPW. IEEE, 2018, pp. 22–28.
- (34) H. Yin, X. Ding, L. Tang, Y. Wang, and R. Xiong, “Efficient 3d lidar based loop closing using deep neural network,” in ROBIO. IEEE, 2017, pp. 481–486.
- (35) S. Vasudevan, F. Ramos, E. Nettleton, and H. Durrant-Whyte, “Gaussian process modeling of large-scale terrain,” Journal of Field Robotics, vol. 26, no. 10, pp. 812–840, 2009.
- (36) Y. Cheng, L. Zou, Z. Zhuang, Z. Sun, and W. Zhang, “Deep reinforcement learning combustion optimization system using synchronous neural episodic control,” in 2018 37th Chinese Control Conference (CCC). IEEE, 2018, pp. 8770–8775.
- (37) E. M. Gross, “Kd trees and delaunay based linear interpolation for kinematic control: a comparison to neural networks with error backpropagation,” in Proceedings of 1995 IEEE International Conference on Robotics and Automation, vol. 2. IEEE, 1995, pp. 1485–1490.
- (38) E. Maillard and B. Solaiman, “A neural network based on lvq2 with dynamic building of the map,” in ICNN, vol. 2. IEEE, 1994, pp. 766–770.
- (39) N. W. Johnson, Geometries and transformations. Cambridge University Press, 2018.
- (40) I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NeurIPS. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112.
- (41) R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, June 1989.
- (42) R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in NeurIPS, 2015, pp. 3294–3302.
- (43) K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in EMNLP. Association for Computational Linguistics, 2014, pp. 1724–1734.
- (44) R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” in ICLR, 2014.
- (45) S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao, “Independently recurrent neural network (indrnn): Building A longer and deeper RNN,” in IEEE CPVR, 2018.
- (46) Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
- (47) A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” Technical report, OpenAI, 2018.
- (48) O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky, “Revealing the dark secrets of BERT,” in EMNLP-IJCNLP. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 4364–4373.
- (49) G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in 57th ACL, Jul. 2019, pp. 3651–3657.
- (50) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- (51) X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in JMLR W&CP: 13th AISTATS, vol. 9, May 2010, pp. 249–256.
- (52) A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” in ICLR, 2014.
- (53) H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville, “Guesswhat?! visual object discovery through multi-modal dialogue.” in IEEE CPVR, vol. 1, no. 2, 2017.
- (54) H. Edelsbrunner, D. Kirkpatrick, and R. Seidel, “On the shape of a set of points in the plane,” IEEE Transactions on Information Theory, vol. 29, no. 4, pp. 551–559, July 1983.
- (55) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, 2014, pp. 2672–2680.
- (56) W. Fedus, I. Goodfellow, and A. M. Dai, “MaskGAN: Better text generation via filling in the _,” in ICLR, 2018.
- (57) G. A. Miller, “Wordnet: A lexical database for english,” Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.
7 Context-free grammar (CFG) used in the experiments
The context-free grammar used to generate the biased and unbiased sentences is composed of the following rules:
8 Size of the language space
From the CFG used in the experiment, it is possible to extract a total of 15,396 distinct grammar rules, some of which are shown below. For simplicity, however, we group them into only 4 classes, according to the number of adjectives in the sentence. In the case of the unbiased dataset, those rules can produce a total of 9.81e+18 unique sentences. The total number of unique sentences for the biased dataset is expected to be an order of magnitude smaller.
9 Example of sentences generated from the CFG
9.1 Biased sample sentences
is it large , light yellow and light ?
is it white , deep pink and average-sized ?
is it a light , huge and shallow laminate ?
is the object average-sized and light ?
is the object fashionable , ghost white and pale turquoise ?
is the thing huge , huge and khaki ?
is the thing small , ignitable and very light ?
is the object a notable very light orange carpet ?
is the object this small wood made of facing stone ?
is the object a textured and combinable floor cover made of laminate ?
9.2 Unbiased sample sentences
is the object the huge tiny lovable guest room ?
is the object the closed closed transparent textile ?
is the thing a transparent , narrow and slightly heavy textile ?
is it steerable , dark orange and light ?
is it gray , very heavy and textured ?
is it closed , heavy and moderately light ?
is it transparent , transformable and moderately light ?
is the thing average-sized and dark red ?
is the thing large and deep garage ?
is it that slightly heavy stucco made of grass ?
| Annotation | Nb. of classes | Example of classes |
|---|---|---|
| Noun | 86 | air conditioner, mirror, window, door, piano |
| WordNet category (Miller1995, ) | 580 | instrument, living thing, furniture, decoration |
| Location | 24 | kitchen, bedroom, bathroom, office, hallway, garage |
| Color | 139 | red, royal blue, dark gray, sea shell |
| Color property | 2 | transparent, textured |
| Material | 15 | wood, textile, leather, carpet, decoration stone |
| Overall mass | 7 | light, moderately light, heavy, very heavy |
| Overall size | 4 | tiny, small, large, huge |
| Category-relative size | 10 | tiny, small, large, huge, short, shallow, narrow, wide |
| Acoustical capability | 3 | sound, speech, music |
| Affordance | 100 | attach, bend, divide, play, shake, stretch, wear |