Sound-Word2Vec: Learning Word Representations Grounded in Sounds

03/06/2017 ∙ by Ashwin K Vijayakumar, et al. ∙ Georgia Institute of Technology Virginia Polytechnic Institute and State University 0

To be able to interact better with humans, it is crucial for machines to understand sound - a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic textual similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec - a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we learn that two seemingly (semantically) unrelated concepts, like leaves and paper are similar due to the similar rustling sounds they make. Our embeddings prove useful in textual tasks requiring aural reasoning like text-based sound retrieval and discovering foley sound effects (used in movies). Moreover, our embedding space captures interesting dependencies between words and onomatopoeia and outperforms prior work on aurally-relevant word relatedness datasets such as AMEN and ASLex.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Sound and vision are the dominant perceptual signals, while language helps us communicate complex experiences via rich abstractions. For example, a novel can stimulate us to mentally construct the image of the scene despite having never physically perceived it. Indeed, language has evolved to contain numerous constructs that help depict visual concepts. For example, we can easily form the picture of a white, furry cat with blue eyes via. a description of the cat in terms of its visual attributes (Lampert et al., 2009; Parikh and Grauman, 2011).

Need for Onomatopoeia. However, how would one describe the auditory instantiation of cats? While a first thought might be to use audio descriptors like loud, shrill, husky as mid-level constructs or “attributes”, arguably, it is difficult to precisely convey and comprehend sound through such language. Indeed, Wake and Asahi (1998) find that humans first communicate sounds using “onomatopoeia” – words that are suggestive of the phonetics of sounds while having no explicit meaning meow, tic-toc. When asked for further explanation of sounds, humans provide descriptions of potential sound sources or impressions created by the sound (pleasant, annoying, .)

Need for Grounding in Sound. While onomatopoeic words exist for commonly found concepts, a vast majority of concepts are not as perceptually striking or sufficiently frequent for us to come up with dedicated words describing their sounds. Even worse, some sounds, say, musical instruments, might be difficult to mimic using speech. Thus, for a large number of concepts there seems to be a gap between sound and its counterpart in language (Sundaram and Narayanan, 2006). This becomes problematic in specific situations where we want to talk about the heavy tail of concepts and their sounds, or while describing a particular sound we want to create as an effect (say in movies). To alleviate this, a common literary strategy is to provide metaphors to more relatable exemplars. For example, when we say, “He thundered angrily”, we compare the person’s angry speech to the sound of thunder to convey the seriousness of the situation. However, without this grounding in sound, thunder and anger both appear to be seemingly unrelated concepts in terms of semantics.

Contributions. In this work, we learn embeddings to bridge the gap between sound and its counterpart in language. We follow a retrofitting strategy, capturing similarity in sounds associated with words, while using distributional semantics (from word2vec) to provide smoothness to the embeddings. Note that we are not interested in capturing phonetic similarity, but the grounding in sound of the concept associated with the word (say “rustling” of leaves and paper.) We demonstrate the effectiveness of our embeddings on three downstream tasks that require reasoning about related aural cues:
1. Text-based sound retrieval – Given a textual query describing the sound and a database containing sounds and associated textual tags, we retrieve sound samples by matching text (Sec. 5.1)
2. Foley Sound Discovery – Given a short phrase that outlines the technique of producing Foley sounds111Foley sounds are sound effects (typically ambient sounds) that are added to movies in the post-production stage to make actions or situations appear more realistic. These sounds are generally created using easily available proxy objects that mimic the sound of the true situation being depicted. For example, sound of breaking celery sticks is used to create the effect of breaking bones., we discover other relevant words (objects or actions) which can produce similar sound effects (Sec. 5.2)
3. Aurally-relevant word relatedness assessment on AMEN and ASLex (Kiela and Clark, 2015) (Sec. 5.3)

We also qualitatively compare with word2vec to highlight the unique notions of word relatedness captured by imposing auditory grounding.

2 Related Work

Audio and Word Embeddings. Multiple works in the recent past (Bruni et al., 2014; Lazaridou et al., 2015; Lopopolo and van Miltenburg, 2015; Kiela and Clark, 2015; Kottur et al., 2016) have explored using perceptual modalities like vision and sound to learn language embeddings. While Lopopolo and van Miltenburg (2015)

show preliminary results on using sound to learn distributional representations,

Kiela and Clark (2015) build on ideas from Bruni et al. (2014) to learn word embeddings that respect both linguistic and auditory relationships by optimizing a joint objective. Further, they propose various fusion strategies to combine knowledge from both the modalities. Instead, we “specialize” embeddings to exclusively respect relationships defined by sounds, while initializing with word2vec embeddings for smoothness. Similar to previous findings (Melamud et al., 2016), we observe that our specialized embeddings outperform language-only as well as other multi-modal embeddings in the downstream tasks of interest.
In an orthogonal and interesting direction, other recent works (Chung et al., 2016; He et al., 2016; Settle and Livescu, 2016) learn word representations based on similarity in their pronunciation and not the sounds associated with them. In other words, phonetically similar words that have near identical pronunciations are brought closer in the embedding space (, flower and flour).
Sundaram and Narayanan (2006) 

study the applicability of onomatopoeia to obtain semantically meaningful representations of audio. Using a novel word-similarity metric and principal component analysis,

they find representations for sounds and cluster them in this derived space to reason about similarities. In contrast, we are interested in learning word representations that respect aural-similarity. More importantly, our approach learns word representations for in a data-driven manner without having to first map the sound or its tags to corresponding onomatopoeic words.

Multimodal Learning with Surrogate Supervision.  Kottur et al. (2016) and Owens et al. (2016) use a surrogate modality to induce supervision to learn representations for a desired modality. While the former learns word embeddings grounded in cartoon images, the latter learns visual features grounded in sound. In contrast, we use sound as the surrogate modality to supervise representation learning for words.

3 Datasets

Freesound. We use the freesound database (Font et al., 2013), also used in prior work (Kiela and Clark, 2015; Lopopolo and van Miltenburg, 2015) to learn the proposed sound-word2vec embeddings. Freesound is a freely available, collaborative dataset consisting of user uploaded sounds permitting reuse. All uploaded sounds have human descriptions in the form of tags and captions in natural language. The tags contain a broad set of relevant topics for a sound (, ambience, electronic, birds, city, reverb) and captions describing the content of the sound, in addition to details pertaining to audio quality. For the text-based sound retrieval task, we use a subset of 234,120 sounds from this database and divide it into training (80%), validation (10%) and testing splits (10%). Further, for foley sound discovery, we aggregate descriptions of foley sound production provided by sound engineers (epicsound, accessed 23-Jan-2017; Singer, accessed 23-Jan-2017) to create a list of 30 foley sound pairs, which forms our ground truth for the task. For example, the description to produce a foley “driving on gravel” sound is to record the “crunching sound of plastic or polyethene bags”.

AMEN and ASLex. AMEN and ASLex (Kiela and Clark, 2015) are subsets of the standard MEN (Bruni et al., 2014) and SimLex (Hill et al., 2015) word similarity datasets consisting of word-pairs that “can be associated with a distinctive associated sound”. We evaluate on this dataset for completeness to benchmark our approach against previous work. However, we are primarily interested in the slightly different problem of relating words with similar auditory instantions that may or may not be semantically related as opposed to relating semantically similar words that can be associated with some common auditory signal.

4 Approach

We use the Freesound database to construct a dataset of tuples , where is a sound and is the set of associated user-provided tags. We then aim to learn an embedding space for the tags that respects auditory grounding using sound information as cross-modal context – similar to word2vec (Mikolov et al., 2013) that uses neighboring words as context / supervision. We now explain our approach in detail.

Audio Features and Clustering. We represent each sound

by a feature vector consisting of the mean and variance of the following audio descriptors that are readily available as part of Freesound database:

  • Mel-Frequency Cepstral Co-efficients: This feature represents the short-term power spectrum of an audio and closely approximates the response of the human auditory system – computed as given in (Ganchev et al., 2005).

  • Spectral Contrast: It is the magnitude difference in the peaks and valleys of the spectrum – computed according to (Akkermans et al., 2009).

  • Dissonance: It measures the perceptual roughness of the sound (Plomp and Levelt, 1965).

  • Zero-crossing Rate: It is the percentage of sign changes between consecutive signal values and is indicative of noise content.

  • Spectral Spread: This feature is the concatenation of the

    -order moments of the spectrum, where

    .

  • Pitch Salience: This feature helps discriminate between musical and non-musical tones. While, pure tones and unpitched sounds have values near 0, musical sounds containing harmonics have higher values (Ricard, 2004).

We then use -Means algorithm to cluster the sounds in this feature space to assign each sound to a cluster . We set to 30 by evaluating the performance of the embeddings on text-based audio-retrieval on the held out validation set. Note that the clustering is only performed once, prior to representation learning described below.

Representation Learning. We represent each tag using a

dimensional one-hot encoding denoted by

, where is the set of all unique tags in the training set (the size of our dictionary). This one-hot vector is projected into a -dimensional vector space via , the projection matrix. This projection matrix computes the representation for each word in . The idea of our approach is to use to accurately predict cluster assignments (for sounds associated with words), which enforces grounding in sound. For each data-point, we obtain the summary of the tags , by averaging the projections of all tags in the set as .

Figure 1: The model used to learn the proposed sound-word2vec embeddings. The projection matrix containing that is used as the sound-word2vec embedding is learned by training the model to accurately predict the cluster assignment of the sound.

We then transform the so obtained summary representation via a linear layer (with parameters ) and pass the output through the softmax function to obtain a distribution, over the K sound clusters. We perform maximum-likelihood training for the correct cluster assignment 222We also tried to regress directly to sound features instead of clustering, but found that it had poor performance., optimizing for parameters and :

(1)

We use SGD with momentum to optimize this objective which essentially is the cross-entropy between cluster assignments and . We set to to be consistent with the publicly available word2vec embeddings.

Initialization. We initialize with word2vec embeddings (Mikolov et al., 2013) trained on the Google news corpus dataset with 3M words. We fine-tune on a subset of 9578 tags which are present in both Freesound as well as Google news corpus datasets, which is 55.68% of the original tags in the Freesound dataset. This helps us remove noisy tags unrelated to the content of the sound.

In addition to enlarging the vocabulary, the pre-training helps induce smoothness in the sound-word2vec embeddings – allowing us to transfer semantics learnt from sounds to words that were not present as tags in the Freesound database. Indeed, we find that word2vec pre-training helps improve performance (Sec. 2). Our use of language embeddings as an initialization to fine-tune (specialize) from, as opposed to formulating a joint objective with language and audio context (Kiela and Clark, 2015) is driven by the fact that we are interested in embeddings for words grounded in sounds, and not better generic word similarity.

5 Results

Ablations. In addition to the language-only baseline word2vec (Mikolov et al., 2013), we compare against tag-word2vec – that predicts a tag using other tags of the sound as context, inspired by (Font et al., 2014). We also report results with a randomly initialized projection matrix (sound-word2vec(r) to evaluate the effectiveness of pre-training with word2vec.

Prior work. We compare against previous works Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015). While the former uses a standard bag of words and SVD pipeline to arrive at distributional representations for words, the latter trains under a joint objective that respects both linguistic and auditory similarity. We use the openly available implementation for Lopopolo and van Miltenburg (2015) and re-implement Kiela and Clark (2015) and train them on our dataset for a fair comparison of the methods. In addition, we show a comparison to word-vectors released by (Kiela and Clark, 2015) in the supplementary material. All approaches use an embedding size of 300 for consistency.

5.1 Text-based Sound Retrieval

Given a textual description of a sound as query, we compare it with tags associated with sounds in the database to retrieve the sound with the closest matching tags. Note that this is a purely textual task, albeit one that needs awareness of sound. In a sense, this task exactly captures what we want our model to be able to do – bridge the semantic gap between language and sound. We use the training split (Sec. 3) to learn the sound-word2vec vectors, validation to pick the number of clusters (K), and report results on the test split.

For retrieval, we represent sounds by averaging the learnt embeddings for the associated tags. We embed the caption provided for the sound (in the Freesound database) in a similar manner, and use it as the query. We then rank sounds based on the cosine similarity between the tag and query representations for retrieval.

We evaluate using standard retrieval metrics – Recall@{1,10,50,100}. Note that the entire testing set (10k sounds) is present in the retrieval pool. So, recall@100 corresponds to obtaining the correct result in the top of the search results, which is a relatively stringent evaluation criterion.

Results. Table. 1 shows that our sound-word2vec embeddings outperform the baselines. We see that specializing the embeddings for sound using our two-stage training outperforms prior work(Kiela and Clark (2015) and Lopopolo and van Miltenburg (2015)), which did not do specialization. Among our approaches, tag-word2vec performs second best – this is intuitive since the tag distributions implicitly capture auditory relatedness (a sound may have tags cat and meow), while word2vec and sound-word2vec(r) have the lowest performance.

Embedding Recall
@1 @10 @50 @100
word2vec 6.470.00 14.250.05 21.720.12 26.030.22
tag-word2vec 6.950.02 15.100.03 22.430.09 27.210.24
sound-word2vec(r) 6.490.00 14.980.03 21.960.11 26.430.20
(Lopopolo and van Miltenburg, 2015) 6.480.02 15.090.05 21.820.13 26.890.23
(Kiela and Clark, 2015) 6.520.01 15.210.03 21.920.08 27.740.21
sound-word2vec 7.110.02 15.880.04 23.140.09 28.670.17
Table 1: Text-based sound retrieval (higher is better). We find that our sound-word2vec model outperforms all baselines.

5.2 Foley Sound Discovery

In this task, we evaluate how well embeddings identify matching pairs of target sounds (flapping bird wings) and descriptions of Foley sound production techniques (rubbing a pair of gloves). Intuitively, one expects sound-aware word embeddings to do better at this task than sound-agnostic ones. We setup a ranking task by constructing a set of original Foley sound pairs and decoy pairs formed by pairing the target description with every word from the vocabulary. We rank using cosine similarity between the average word-vectors in each member of the pair. A good embedding is one in which the original Foley sound pair has the lowest rank. We use the mean rank of the Foley sound in the dataset for evaluation. We transfer the embeddings from Sec. 5.1 to this task, without additional training.

Results.  We find that Sound-word2vec performs the best with a mean rank of 34.6 compared to other baselines tag-word2vec (38.9), sound-word2vec(r) (114.3) and word2vec (189.45). As observed previously, the second best performing approach is tag-word2vec. Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015) perform worse than tag-word2vec with a mean rank of 48.4 and 42.1 respectively. Note that random chance gets a rank of .

5.3 Evaluation on AMEN and ASLex

Embedding Spearman Correlation
AMEN ASLex
(Lopopolo and van Miltenburg, 2015) 0.4100.09 0.2370.04
(Kiela and Clark, 2015) 0.6480.08 0.3660.11
sound-word2vec 0.6740.05 0.3910.06
Table 2: Comparison to state of the art AMEN and ASLex datasets (Kiela and Clark, 2015) (higher is better). Our approach performs better than Kiela and Clark (2015).

AMEN and ASLex (Kiela and Clark, 2015) are subsets of the MEN and SimLex-999 datasets for word relatedness grounded in sound. From tab:amen, we can see that our embeddings outperform (Kiela and Clark, 2015) on both AMEN and ASLex. These datasets were curated by annotating concepts related by sound; however we observe that relatedness is often confounded. For example, (river, water), (automobile, car) are marked as aurally related — however they do not stand out as aurally-related examples as they are already semantically related. In contrast, we are interested in how onomatopoeic words relate to regular words (tab:2), which we study by explicit grounding in sound. Thus while we show competitive performance on this dataset, it might not be best suited for studying the benefits of our approach.

6 Discussion and Conclusion

word word2vec sound-word2vec
apple apples, pear, fruit bite, snack, chips
berry, pears, strawberry chew, munch, carton
wood lumber, timber, softwoods, wooden, snap, knock,
hardwoods, cedar, birch smack, whack, snapping
bones skull, femur, skeletons, eggshell, carrot, arm
thighbone, pelvis, molar blood, polystyrene, crunch
glass hand-blown, glassware, tumbler, shattered, ceramic, smash
Plexiglass, wine-glass, bottle clink, beer, spoon
Onomatopoeic query words
boom booms, booming, bubble, bomb, bang, explosion
craze, downturn, upswing bombing, exploding, ecstatic
jingle song, commercial, catchy-tune, magic, tinkle, nails
ditty, slogan, anthem bells, key, doorbell
slam slams, piledriver, uranage shut, lock, opening
spinkick, hiptoss, hit closing, latch, door
quack charlatan, quackery, crackpot duck, snort, calling
homeopaths, concoctions, snake-oil chirp, tweet, oink
Table 3: We show nearest neighbors in both word2vec and sound-word2vec spaces for eight words (‘regular’ words, top half and onomatopoeic words, bottom half).

We show nearest neighbors in both sound-word2vec and word2vec space (tab:2) to qualitatively demonstrate the unique dependencies captured due to auditory grounding. While word2vec maps a word (say, apple) to other semantically similar words (other fruits), similar ‘sounding’ words (chips) or onomatopoeia (munch) are closer in our embedding space. Moreover, onomatopoeic words (say, boom and slam) are mapped to relevant objects (explosion and door). Interestingly, parts (, lock, latch) and actions (closing) are also closer to the onomatopoeic query – exhibiting an understanding of the auditory scene.
Conclusion. In this work we introduce a novel word embedding scheme that respects auditory grounding. We show that our embeddings provide strong performance on text-based sound retrieval, Foley sound discovery along with intuitive nearest neighbors for onomatopoeia that are tasks in text requiting auditory reasoning. We hope our work motivates further efforts on understanding and relating onomatopoeia words to “regular” words.

Acknowledgements. We thank the Freesound team and Frederic Font in particular for helping us with the Freesound API. We also thank Khushi Gupta and Stefan Lee for fruitful discussions. This work was funded in part by an NSF CAREER, ONR Grant N00014-16-1-2713, ONR YIP, Sloan Fellowship, Allen Distinguished Investigator, Google Faculty Research Award, Amazon Academic Research Award to DP.

References

  • Akkermans et al. (2009) Vincent Akkermans, Joan Serrà, and Perfecto Herrera. 2009. Shape-based spectral contrast descriptor. In Proc. of the Sound and Music Computing Conf.(SMC).
  • Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics.

    Journal of Artificial Intelligence Research (JAIR)

    .
  • Chung et al. (2016) Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-yi Lee, and Lin-Shan Lee. 2016.

    Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder.

    CoRR.
  • epicsound (accessed 23-Jan-2017) epicsound. accessed 23-Jan-2017. The guide to sound effects.
  • Font et al. (2014) Frederic Font, Sergio Oramas, György Fazekas, and Xavier Serra. 2014. Extending tagging ontologies with domain specific knowledge. In Proceedings of the International Semantic Web Conference.
  • Font et al. (2013) Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia.
  • Ganchev et al. (2005) Todor Ganchev, Nikos Fakotakis, and George Kokkinakis. 2005. Comparative evaluation of various mfcc implementations on the speaker verification task. In Proceedings of the SPECOM.
  • He et al. (2016) Wanjia He, Weiran Wang, and Karen Livescu. 2016. Multi-view recurrent neural acoustic word embeddings. arXiv preprint arXiv:1611.04496.
  • Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015.

    Simlex-999: Evaluating semantic models with (genuine) similarity estimation.

    Computational Linguistics.
  • Kiela and Clark (2015) Douwe Kiela and Stephen Clark. 2015. Multi-and cross-modal semantics beyond vision: Grounding in auditory perception. In

    Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

    .
  • Kottur et al. (2016) Satwik Kottur, Ramakrishna Vedantam, Jose M. F. Moura, and Devi Parikh. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    .
  • Lampert et al. (2009) Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by betweenclass attribute transfer. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lazaridou et al. (2015) Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.
  • Lopopolo and van Miltenburg (2015) Alessandro Lopopolo and Emiel van Miltenburg. 2015. Sound-based distributional models. IWCS 2015.
  • Melamud et al. (2016) Oren Melamud, David McClosky, Siddharth Patwardhan, and Mohit Bansal. 2016. The role of context types and dimensionality in learning word embeddings. arXiv preprint arXiv:1601.00893.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS).
  • Owens et al. (2016) A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. 2016. Ambient Sound Provides Supervision for Visual Learning. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Parikh and Grauman (2011) Devi Parikh and Kristen Grauman. 2011. Relative Attributes. In Proceedings of IEEE International Conference on Computer Vision (ICCV).
  • Plomp and Levelt (1965) Reinier Plomp and Willem Johannes Maria Levelt. 1965. Tonal consonance and critical bandwidth. The journal of the Acoustical Society of America.
  • Ricard (2004) Julien Ricard. 2004. Towards computational morphological description of sound. DEA pre-thesis research work, Universitat Pompeu Fabra, Barcelona.
  • Settle and Livescu (2016) Shane Settle and Karen Livescu. 2016. Discriminative acoustic word embeddings: Recurrent neural network-based approaches. arXiv preprint arXiv:1611.02550.
  • Singer (accessed 23-Jan-2017) Philip R. Singer. accessed 23-Jan-2017. Art of foley.
  • Sundaram and Narayanan (2006) Shiva Sundaram and Shrikanth Narayanan. 2006. Vector-based representation and clustering of audio using onomatopoeia words. In In Proceedings of AAAI 2006 Fall Symposia.
  • Wake and Asahi (1998) Sanae Wake and Toshiyuki Asahi. 1998. Sound retrieval with intuitive verbal expressions. In Proceedings of the 5th International Conference on Auditory Display (ICAD).