Understanding spoken language is one of the key capabilities of intelligent systems which need to interact with humans. Applications include personal assistants, search engines, vehicle navigation systems and many others. The standard approach to understanding spoken language, both in industry and in research, has been to decompose the problem into two components arranged in a pipeline: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). The audio signal representing a spoken utterance is first transcribed into written text, which is subsequently processed to extract some semantic representation of the utterance. Recent works have proposed to learn semantic embeddings of spoken language by using photographic images of everyday situations matched with their spoken captions, without an intermediate transcription step (harwath2016unsupervised; chrupala2017representations). The weak and noisy supervision in these approaches is closer to how humans learn to understand speech, by grounding it in perception, and is thus more useful as a cognitive model. It can also have some practical advantages: in certain circumstances it may be easier to find or collect speech associated with images than transcribed speech, for example when dealing with languages whose speakers are illiterate, or with languages that have no standard writing system (note that even some languages with many millions of speakers, like Cantonese, may not have a standardized writing system). On the other hand, the learning problem in this type of framework is less constrained, and harder, than standard ASR.
In order to alleviate this shortcoming, we propose to use multitask learning (MTL) and exploit transcribed speech within the end-to-end visually-grounded setting, and thus combine some features of both the pipeline and end-to-end approaches. We describe a three-task architecture which combines the main objective of matching speech with images with two auxiliary objectives: matching speech with text, and matching text with images.
The plain end-to-end speech/image matching task, modeled via standard architectures such as recurrent neural networks, lacks a language-specific learning bias. This type of model may discover in the course of learning that speech can be represented as a sequence of symbols (such as, for example, phonemes or graphemes), but it is in no way predisposed to make this discovery. Human learners may be more efficient at least in part thanks to their innate inductive bias whereby they assume that language is symbolic. They presumably acquired such bias via the process of evolution by natural selection, or possibly by cultural evolution (Kirby5241). In the context of machine learning, inductive bias can instead be injected via multi-task learning, where supervision from the secondary task guides the model towards appropriately biased representations.
Specifically, our motivation for the speech/text task is to encourage the model to learn speech representations which are correlated with the encoding of spoken language as a sequence of characters. Additionally, and for completeness, we also consider a second auxiliary task matching text to images.
Our contribution consists in formulating and answering the following questions:
Do the auxiliary tasks improve the main speech/image task? The speech/text task helps but we have no evidence of the text/image task improving performance.
If so, is this mainly because MTL allows us to exploit extra data, or because the additional task injects an appropriate inductive bias into the model? The inductive bias is key to the performance gains of MTL, while extra data makes no impact.
Which parameters should be shared between tasks and which should be task specific? Best performance is achieved by sharing only the lower layers of the speech encoder.
What are the specific effects of the symbolic inductive bias on the learned representations? The speech/text task contributes to making the encoded speech more speaker-invariant, and more strongly correlated with the written form of the utterances.
2 Related work
2.1 Visually grounded semantic embeddings of spoken language
The most relevant strand of related work is on visually-grounded learning of (spoken) language. It dates back at least to Roy2002113, but has recently attracted further interest due to better-performing modeling tools based on neural networks.
harwath2015deep collect spoken descriptions for the Flick8K captioned image dataset and present a model which is able to map pre-segmented spoken words to aspects of visual context. harwath2016unsupervised describe a larger dataset of images paired with spoken captions (Places Audio Caption Corpus) and present an architecture that learns to project images and unsegmented spoken captions to the same embedding space. The sentence representation is obtained by feeding the spectrogram to a convolutional network. Further elaborations on this setting include P17-1047, which shows a clustering-based method to identify grounded words in the speech-image pairs, and harwath2018jointly
which constructs a three-dimensional tensor encoding affinities between image regions and speech segments.
The work of chrupala2017representations is similar in that it exploits datasets of images with spoken captions, but their grounded speech model is based around multi-layer Recurrent Highway Networks, and focuses on quantitative analyses of the learned representations. They show that the encoding of meaning tends to become richer in higher layers, whereas encoding of form tends to initially increase and then stay constant or decrease. K17-1037 further analyze the representations of the same model and show that phonological form is reliably encoded in the lower recurrent layers of the network but becomes substantially attenuated in the higher layers.
drexler2017analysis also analyze the representations of a visually grounded speech model, with a view to using such representations for unsupervised speech recognition, and show that they contain more linguistic and less speaker information than filterbank features.
Other work uses images as a pivot to learn to associate textual labels with spoken utterances, by mapping utterances and images into a joint semantic space. After labeling the images with an object classifier, these labels can be further associated with utterances, providing a bag-of-words representation of spoken language which can be useful in speech retrieval.
2.2 Multi-task learning for speech and language
The concept of multi-task learning (MTL) was introduced by caruana1997multitask. Neural architectures widely used in the fields of speech and language processing make it easy to define parameter-sharing architectures and exploit MTL, and thus there has been a recent spurt of reports on its impact.
Within Natural Language Processing (NLP), 44928 explore sharing encoders and decoders in a sequence-to-sequence architecture for translation, syntactic parsing, and image captioning, and show gains on some configurations. E17-2026 investigate which particular pairs of NLP tasks lead to gains, concluding that learning curves and label entropy of the tasks may be used as predictors. mccann2018natural propose a 10-task NLP challenge, and a single MTL model which performs reasonably well on all tasks.
P16-2038 show that which parameters are shared in a multi-task architecture matters a lot: they find that when sharing parameters between syntactic chunking or supertagging and POS tagging as an auxiliary task, it was consistently better to only share the lower-layers of the model. Relatedly, D17-1206 propose a method of training NLP tasks at multiple levels of complexity by growing the depth of the model to solve increasingly more difficult tasks. Swayamdipta2018SyntacticSF use similar ideas and show that syntactic information can be incorporated in a semantic task with MTL, using auxiliary syntactic tasks without building full-fledged syntactic structure at prediction time.
MTL can lead to a bewildering number of choices regarding which tasks to combine, which parameters to share and how to schedule and weight the tasks. Some recent works have suggested specific approaches to deal with this complexity: ruder2017learning propose to learn from data which parameters to share in MTL with sluice networks and show some gains on NLP tasks. kiperwasser2018scheduled investigate how to interleave learning syntax and translation and how to schedule these tasks.
Several works show that exploiting MTL via the use of multiple language versions of the same or comparable data leads to performance gains (e.g. lee2017fully; Q17-1024; de2018parameter). gella2017image and kadar2018lessons learn visual semantic embeddings from textual-visual datasets and show gains from additional languages which reuse the same encoder. kadar2018lessons additionally show that an extra objective linking the languages directly rather than only via the visual modality provides additional performance gains. In the context of audio-visual data, harwath2018vision applies a type of MTL in the setting where there are images paired with descriptions in English and Hindi. They project the images, English speech and Hindi speech into a joint semantic space, and show that training on multiple tasks matching both languages to images works better compared to only using a single monolingual task.
MTL has also recently seen some success in speech processing. Similar to what we see in machine translation, in ASR parameter sharing between different languages is also beneficial (6639348). More recently, DBLP:journals/corr/abs-1802-07420 show that exploiting this effect is especially useful for low-resource languages.
6639012 apply MTL for phone recognition with three lower-level auxiliary tasks and show noticeable reductions in error rates. DBLP:conf/interspeech/ToshniwalTLL17 use MTL for conversational speech recognition with lower-level tasks (e.g. phoneme recognition) in an encoder-decoder model for direct character transcription. 7953071 learn to align utterances with phonetic transcriptions in a lower layer and graphemic transcriptions in the final layer, exploiting again the relation between task level of complexity and levels of neural architecture in a MTL setting. They also show a benefit of sharing model parameters between different varieties of the same language, specifically US, British, Indian and Australian English. DBLP:journals/corr/abs-1710-08377 demonstrate the effectiveness of transfer from generic audio classification to speech command recognition, which can also be considered a particular instance of MTL.
How our work fits in.
The current paper uses an intuition also present in several of the works mentioned above: namely that an end-to-end model which needs to induce several levels of intermediate latent representations should be guided to find useful ones by including auxiliary prediction tasks at the intermediate layers. These auxiliary prediction tasks typically use lower-level linguistically-motivated structures such as phonemes for end-to-end ASR, or syntactic trees for semantic parsing.
The present study extends this setting to a full speech-to-semantics setup: the main task is to take spoken language as input and learn a semantic representation based on feedback from the visual modality, while an ASR-like task (speech/text matching) is merely auxiliary. The lower-level linguistic structures in our case are the sequences of phoneme-like units approximated by the written form of the language.
The modeling framework uses a multi-task setup. The core model is a three-task architecture depicted in Figure 1: there are three encoders, one for each modality: speech, image, and text. Each modality has a shared encoder which works directly on the input modality, and two specialized encoders which take as input the encoded data from the shared encoder. The three tasks correspond to three losses (depicted with circles in the figure): each loss works with a pair of modalities and attempts to minimize the distance between matching encoded items, while maximizing the distance between mismatching ones. For a pair of modalities with encoded objects u and i, the loss is defined as follows:
∑_{u,i} ( ∑_{u'} max[0, α + d(u,i) − d(u',i)] + ∑_{i'} max[0, α + d(u,i) − d(u,i')] )

where u and i are matching objects (for example an utterance and a matching image), u' and i' are mismatched objects within a batch, and d is the cosine distance between encoded objects.
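As a concrete illustration, this margin-based loss can be sketched in a few lines of numpy. This is a simplified sketch, not the training code used in the experiments: the margin value and the exclusion of the diagonal terms (where the "mismatched" object equals the matching one) are illustrative choices.

```python
import numpy as np

def cosine_distance(U, I):
    """Pairwise cosine distances between row-wise encodings."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    return 1.0 - U @ I.T

def matching_loss(U, I, alpha=0.2):
    """Margin loss: the distance d(u, i) for matching pairs (the diagonal)
    should be smaller than d(u', i) and d(u, i') for mismatched pairs,
    by at least the margin alpha."""
    D = cosine_distance(U, I)      # D[j, k] = d(u_j, i_k)
    pos = np.diag(D)               # distances of matching pairs
    # alpha + d(u,i) - d(u',i): vary the utterance u' (rows) per image
    loss_u = np.maximum(0.0, alpha + pos[None, :] - D)
    # alpha + d(u,i) - d(u,i'): vary the image i' (columns) per utterance
    loss_i = np.maximum(0.0, alpha + pos[:, None] - D)
    # exclude the degenerate terms where u' = u or i' = i
    np.fill_diagonal(loss_u, 0.0)
    np.fill_diagonal(loss_i, 0.0)
    return loss_u.sum() + loss_i.sum()
```

With well-separated embeddings (e.g. orthogonal matching pairs) the loss is zero; when all encodings collapse to a single point, every mismatched pair violates the margin and contributes exactly alpha.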
The speech/image part of the architecture is based on the grounded speech model from chrupala2017representations, with the main difference being that these authors used Recurrent Highway Networks (pmlr-v70-zilly17a) for the recurrent layers, while we chose the simpler Gated Recurrent Unit networks (chung2014empirical), because they have optimized low-level CUDA support, which makes them much faster to run and enables us to carry out an at least somewhat comprehensive set of experiments.
3.2 Image Encoders
3.3 Speech Encoders
The shared encoder s consists of a 1-dimensional convolutional layer which subsamples the input, followed by a stack of recurrent layers. The modality-specific encoders s2t and s2i consist of a stack of recurrent layers, followed by an attention operator. The shared encoder is defined as follows:

Enc_s(x) = GRU_l(Conv_{k,d,z}(x))

where Conv_{k,d,z} is a convolutional layer with kernel size k, d channels, and stride z, and GRU_l is a stack of l GRU layers. An encoder of modality x is defined as

Enc_x(h) = unit(Attn(GRU_l(h)))

where Attn is the attention operator and unit is L2-normalization. Note that for the case l = 0, GRU_0 is simply the identity function. The attention operator computes a weighted sum of the RNN activations h_t at all timesteps:

Attn(h) = ∑_t α_t h_t

where the weights α_t are determined by an MLP with learned parameters U and W, and passed through the timewise softmax function:

α_t = exp(U tanh(W h_t)) / ∑_{t'} exp(U tanh(W h_{t'}))
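The attention pooling and the normalization step can be sketched in numpy as follows. This is an illustrative sketch only: the shapes assumed for the MLP parameters W and U are assumptions, and the recurrent and convolutional layers are omitted.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(H, W, U):
    """Timewise attention pooling over RNN states H of shape (T, d).
    W of shape (d, k) and U of shape (k,) are the learned MLP parameters;
    returns a single vector of shape (d,)."""
    scores = np.tanh(H @ W) @ U        # (T,) unnormalized attention weights
    alpha = softmax(scores, axis=0)    # timewise softmax
    return alpha @ H                   # weighted sum of activations

def unit(v):
    """L2-normalization of a vector."""
    return v / np.linalg.norm(v)
```

With zero-initialized parameters, the attention weights are uniform and the operator reduces to mean pooling over timesteps.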
3.4 Text Encoders
The text encoders are defined in the same way as the speech encoders, with the only difference being that the convolutional layer is replaced by an embedding layer, i.e. a lookup table mapping characters to embedding vectors.
The model is trained by alternating between the tasks, and updating the parameters of each task in turn. Note that the input data for the three tasks can be the same, but can also be partly or completely disjoint. We report on two conditions:
Joint: all tasks use the same data;
Disjoint: the data for the Speech/Text task is disjoint from the data for the other two tasks.
We consider the disjoint condition somewhat more realistic, in that it is easier to find separate datasets for each pair of modalities than it is to find a single dataset with all three modalities. However, the main reason for including both conditions is that it allows us to disentangle via which mechanism MTL contributes: by enabling the use of extra data, or by enforcing an inductive bias.
3.6 Architecture variants
There is a multitude of ways in which the details of the core architecture can be varied. In order to reduce them to a manageable number we made the following choices:
Keep the image encoder simple and fixed.
Keep the architecture of the encoders fixed, and only vary encoder depth and the degree of sharing.
In addition to variants of the full three-task model, we also have single-task and two-task baselines which are the three-task model with the speech/text and text/image tasks completely ablated, or with only the text/image task ablated. Note that we do not include a condition with only the speech/text task ablated, as the two remaining tasks do not share any learnable parameters (since I is fixed).
3.7 Evaluation metrics
Below we introduce metrics evaluating performance on the image retrieval task, as well as additional analytical metrics which quantify some aspects of the internal representation learned by the encoders.
Evaluating image retrieval
In order to evaluate how well the main speech/image task performs we report the recall at 10 (R@10) and median rank (Medr) for the speech/image task: utterances in the development set are encoded via s2i and images via i2s. For each utterance the images are ranked in order of cosine distance; R@10 counts the mean proportion of correct images among the top 10 ranked images, while Medr gives the median rank of the correct image (where the correct image is the one originally paired with the utterance).
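Both metrics can be computed from a matrix of pairwise distances, sketched below. This is an illustrative sketch assuming a one-to-one pairing between utterances and images; it ignores ties and the fact that each image in the actual dataset has several captions.

```python
import numpy as np

def retrieval_metrics(D):
    """R@10 and median rank from a distance matrix D, where D[j, k] is the
    cosine distance between utterance j and image k, and image j is the
    correct match for utterance j."""
    ranks = []
    for j in range(D.shape[0]):
        order = np.argsort(D[j])                           # images by distance
        ranks.append(int(np.where(order == j)[0][0]) + 1)  # 1-based rank
    ranks = np.array(ranks)
    r_at_10 = float(np.mean(ranks <= 10))  # fraction of correct images in top 10
    medr = float(np.median(ranks))         # median rank of the correct image
    return r_at_10, medr
```

For a distance matrix where each utterance is closest to its own image, R@10 is 1.0 and the median rank is 1.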
Invariance to speaker
We measure how invariant the utterance encoding is to the identity of the speaker; in principle it is expected and desirable that the utterance encoding captures the meaning of the spoken language rather than other aspects of it such as who spoke it or other features irrelevant to the task. To quantify this invariance we report the accuracy of an L2-penalized logistic regression model on the task of decoding the identity of the speaker from the output of the s2i encoder. The logistic model is trained on one portion of the development data and tested on the held-out remainder.
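This probing classifier can be sketched with scikit-learn. The split proportion and regularization strength below are assumptions for illustration; the exact values used in the experiments are not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def speaker_id_accuracy(encodings, speakers, seed=0):
    """Train an L2-penalized logistic regression to decode speaker identity
    from utterance encodings; return held-out accuracy. Higher accuracy
    means the representations are *less* speaker-invariant."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        encodings, speakers, test_size=0.5, random_state=seed)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```

On encodings that cleanly separate speakers the probe reaches near-perfect accuracy; a speaker-invariant encoder pushes it towards chance level.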
Representational Similarity Analysis (kriegeskorte2008representational)
gauges the correlation between two sets of pairwise similarity measurements. Here we use it to quantify the correlation of the learned representation space with the written text space and with the image space. For the encoder representations, the pairwise similarities between utterances are given by the cosine similarities. For the written form, the similarities are the inverse of the normalized Levenshtein distance between the character sequences encoding each pair of utterances:

sim(a, b) = 1 − lev(a, b) / max(|a|, |b|)

where lev is the Levenshtein distance and |·| is string length.
We compute the Pearson correlation coefficient between two similarity matrices on the upper triangulars of each matrix, excluding the diagonal.
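The RSA computation can be sketched as follows. One assumption to flag: normalizing the Levenshtein distance by the length of the longer string is an illustrative choice, as the exact normalization is not spelled out above.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Inverse of the Levenshtein distance, normalized by the longer length."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def rsa(S1, S2):
    """Pearson correlation between the upper triangulars (excluding the
    diagonal) of two pairwise-similarity matrices."""
    iu = np.triu_indices_from(S1, k=1)
    return np.corrcoef(S1[iu], S2[iu])[0, 1]
```

Identical similarity structures give an RSA score of 1; unrelated structures give a score near 0.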
3.8 Experimental settings
The speech/image and text/image tasks are always trained on the Flickr8K Audio Caption Corpus (harwath2016unsupervised), which is based on the original Flickr8K dataset (hodosh2013framing). Flickr8K consists of 8,000 photographic images depicting everyday situations. Each image is accompanied by five brief English descriptions produced by crowd workers. Flickr8K Audio Caption Corpus enriches this data with spoken descriptions, read aloud and recorded by crowd workers. One thousand images are held out for validation, and another one thousand for the test set, using the splits provided by karpathy2015deep. In the joint condition the speech/text task is also trained on this data.
In the disjoint condition, we train the speech/text task on the LibriSpeech dataset (7178964), which consists of approximately 1,000 hours of read English speech derived from audiobooks. There are 291,630 sentences in the corpus, of which 1,000 are held out for validation.
We preprocess the audio by extracting 12-dimensional mel-frequency cepstral coefficients (MFCC) plus log of the total energy. We use 25 millisecond windows, sampled every 10 milliseconds. The shared image encoder is fixed and consists of the 4096-dimensional activations of the pre-classification layer of VGG-16 (simonyan2014very) pre-trained on ImageNet (Russakovsky2015).
We use a simple round-robin training scheme: we alternate between tasks, and for each task update the parameters of that task as well as the shared parameters based on supervision from one batch of data. The data ordering for each task is independent, both in the joint and disjoint condition: for each epoch we reshuffle the dataset associated to each task and iterate through the batches until the smallest dataset runs out. This procedure makes sure that the only difference between the joint and disjoint conditions is the actual data and not other aspects of training.
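The round-robin scheme can be sketched as follows. This is a schematic sketch: `update(task, batch)` stands in for one gradient step on the given task's loss with respect to its task-specific and shared parameters.

```python
import random

def round_robin_epoch(datasets, batch_size, update):
    """One epoch of round-robin MTL: reshuffle each task's dataset, then
    alternate between tasks, drawing one batch per task per turn, until
    the smallest dataset runs out."""
    iters = {}
    for task, data in datasets.items():
        data = list(data)
        random.shuffle(data)  # independent reshuffling per task
        iters[task] = iter([data[i:i + batch_size]
                            for i in range(0, len(data), batch_size)])
    while True:
        for task in datasets:
            batch = next(iters[task], None)
            if batch is None:      # smallest dataset exhausted: end of epoch
                return
            update(task, batch)    # one step on task-specific + shared params
```

Because iteration stops when the smallest dataset is exhausted, the number of updates per epoch is the same in the joint and disjoint conditions, so the only difference between them is the data itself.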
We will release the code to reproduce our results and analyses as open source.
Table 1 shows the evaluation results on validation data, on the image retrieval task of 13 configurations of the model, including three versions with one or two tasks ablated.
Table 2 shows the results on the test set with the 1-task baseline model and the best performing configuration compared to previously reported results on this dataset. As can be seen, the baseline model is somewhat worse than the best previously reported result on this data, while the 3-task model is much better.
Below we discuss and interpret the patterns in performance on image retrieval as measured by Recall@10 and median rank.
Impact of tasks
The most striking result is the large gap in performance between the 1-task condition (row 1) and most of the other rows. Comparing row 1 versus rows 2 and 3 we see that adding the speech/text task leads to a substantial improvement. However, comparing rows 2 and 3 versus rows 6 and 11, the addition of the text/image task does not seem to have a major impact on performance, at least to the extent that can be gleaned from the experiments we carried out. It is possible that with more effort put into engineering this component of the model we would see a better result.
Role of data vs inductive bias
The other major finding is that whether we use the same or different data to train the main and auxiliary task has overall little impact: this is indicated by relatively small differences between configurations in the joint vs disjoint condition. The differences that are there tend to favor the joint setting. This lends support to the conclusion that the speech/text auxiliary task contributes to improved performance on the main task via a strong inductive bias rather than merely via enabling the use of extra data. This is in contrast to many other applications of MTL.
Impact of parameter sharing design
The third important effect is about how parameters between the tasks are shared, specifically how the shared and task-specific parts of the speech encoder are apportioned. The configuration with maximum sharing of parameters among the tasks (rows 8 and 13) performs poorly compared to sharing only the lower layers of the encoders for speech and text (i.e. rows 6 and 11). Additionally, we see that the inclusion of a text-specific speech encoder s2t degrades performance: compare for example row 6 to 7, and row 11 to 12. Thus it is best to have a shared speech encoder whose output is directly used by the speech/text task, while the speech/image task carries out further transformations of the input via an image-specific speech encoder s2i. We can interpret this as the MTL emulating a pipeline architecture to some extent: direct connection of the speech/text task to the shared encoder forces it to come up with a representation closely correlated with a written transcription, and then the image-specific speech encoder takes this as input and maps it to a more meaning-related representation, useful for the speech/image task.
In addition to the above patterns of performance on image retrieval we now address our further research questions by investigating selected aspects of the encoder representations.
Figure 2 shows the relation between Recall@10 and the accuracy of speaker identification from the activation patterns of the output of encoder s2i. Configurations with more speaker invariance (i.e. lower accuracy) tend to perform better on the speech/image
retrieval task, but this is a noisy relation: speaker identification accuracy accounts for only 10% of the total variance in recall. Specifically, low accuracy on speaker ID implies high recall, but some configurations which have high speaker ID accuracy can nevertheless have relatively high recall.
As seen in Figure 2, the depth of the s2i encoder has an impact on speaker identification accuracy, while the type of the s2t encoder is moderately associated with speaker ID accuracy and also moderately related to recall.
Quantitatively we can measure the association between a variable x and an outcome y, while controlling for other variables, by using the coefficient of partial determination, i.e. the relative reduction in error caused by including variable x in a linear model for y:

CPD(x, y) = (SSE_{−x} − SSE) / SSE_{−x}

where SSE is the sum squared error of the model with all variables, and SSE_{−x} is the sum squared error of the model with x removed. Table 3 shows the strength of the relations between the two encoders and speaker identification accuracy and recall.
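The coefficient of partial determination can be computed with ordinary least squares, as sketched below. This is an illustrative numpy sketch; the intercept handling is an assumption.

```python
import numpy as np

def cpd(X, y, j):
    """Coefficient of partial determination for column j of design matrix X:
    (SSE_reduced - SSE_full) / SSE_reduced, where SSE_full is the sum of
    squared errors of the linear model with all variables and SSE_reduced
    omits column j. An intercept column is added internally."""
    def sse(A):
        A = np.column_stack([np.ones(len(A)), A])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ coef
        return r @ r
    full = sse(X)
    reduced = sse(np.delete(X, j, axis=1))
    return (reduced - full) / reduced
```

A variable that drives the outcome yields a CPD near 1; an irrelevant variable yields a CPD near 0 (it is never negative, since the full model's error cannot exceed the reduced model's).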
In summary, using encoder s2i of depth 2 leads to more speaker invariance but not necessarily to better recall; including the speech/text task with a zero-depth s2t encoder tends to improve both speaker invariance and recall. A likely interpretation of this pattern is that the speech/text task improves performance at least in part by encouraging speaker-invariant representations.
| Outcome | Variable | Controlling for | CPD  |
| Speaker | s2i      | s, t, t2i, s2t  | 0.76 |
| R@10    | s2i      | s, t, t2i, s2t  | 0.00 |
| Speaker | s2t      | s, t, t2i, s2i  | 0.45 |
| R@10    | s2t      | s, t, t2i, s2i  | 0.34 |
RSA with regard to textual and visual spaces
Table 4 shows the RSA scores between the encoder representations of utterances and their representations in the spoken, written and visual modalities.
| Encoder      | Audio | Text  | Image |
| Model 1, s2i | 0.043 | 0.194 | 0.187 |
| Model 6, s2i | 0.030 | 0.212 | 0.222 |
| Model 6, s2t | 0.099 | 0.243 | 0.105 |
Comparing the RSA scores between the s2i encoder of model 1 (single task) and model 6 (3 task) we see that the correlations with the textual modality and the visual modality are enhanced while the correlation with the input audio modality drops. This can be interpreted as the speech/text task nudging the model to align more closely with the text, which also ends up contributing to the correlation with the image space. Looking at the scores for model 6 but using the output of the s2t encoder, we see the correlation with the text space is even higher while the correlation with the image space is low. These patterns are what we would expect if speech/text does indeed inject a symbolic inductive bias to the model. Finally an interesting fact is that while the RSA score between the textual and visual modalities is low (0.083), nevertheless model 6’s encoder s2i is moderately correlated with both of these (0.212 and 0.222 respectively).
We show that the addition of the speech/text task leads to substantial performance improvements when compared to training the speech/image task in isolation. Via controlled experiments and analyses we show evidence that this is due to the role of inductive bias on the learned encoder representations.
Limitations and future work
Our current model does not include an explicit speech-to-text decoder, which limits the types of analyses we can perform; in particular, it makes it infeasible to carry out an apples-to-apples comparison with a pipeline architecture. Going forward we would like to go beyond matching tasks and evaluate the impact of an explicit speech-to-text decoder as an auxiliary task. We are also planning to investigate how sensitive our approach is to the amount of data for the auxiliary task. This would be especially interesting given that one motivation for a visually-supervised end-to-end approach is the unavailability of large amounts of transcribed speech in certain circumstances.
I would like to thank Afra Alishahi, Lieke Gelderloos and Ákos Kádár for helpful comments and discussion about this work.