Semantic keyword spotting by learning from images and speech

10/05/2017
by Herman Kamper, et al.

We consider the problem of representing semantic concepts in speech by learning from untranscribed speech paired with images of scenes. This setting is relevant in low-resource speech processing, robotics, and human language acquisition research. We use an external image tagger to generate soft labels, which serve as targets for training a neural model that maps speech to keyword labels. We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic keyword spotting, where the goal is to search for spoken utterances that are semantically relevant to a given text query. Without seeing any text, the model trained on parallel speech and images achieves a precision of almost 60%. Compared to a model trained on transcriptions, our model matches human judgements better by some measures, especially in retrieving non-verbatim semantic matches.
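To make the described setup concrete, below is a minimal PyTorch sketch, not the authors' implementation: a speech network trained against an image tagger's soft tag probabilities, then used to rank utterances for a text query keyword. The architecture, feature dimensions, vocabulary size, and the random tensors standing in for speech features and tagger outputs are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1000        # hypothetical keyword vocabulary size
SPEECH_FEAT_DIM = 39     # hypothetical per-frame speech feature dimension

class SpeechKeywordModel(nn.Module):
    """Maps a speech utterance to per-keyword logits.

    A stand-in for the paper's speech network: mean-pooled frame
    features -> MLP -> one output per keyword in the vocabulary.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SPEECH_FEAT_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, VOCAB_SIZE),
        )

    def forward(self, frames):
        # frames: (batch, n_frames, SPEECH_FEAT_DIM); pool over time.
        pooled = frames.mean(dim=1)
        return self.net(pooled)

model = SpeechKeywordModel()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()  # accepts soft targets in [0, 1]

# Training step: the image tagger's soft tag probabilities for the
# paired image act as the (noisy) supervision for the speech input.
frames = torch.randn(8, 200, SPEECH_FEAT_DIM)  # dummy utterance batch
soft_labels = torch.rand(8, VOCAB_SIZE)        # stand-in tagger output

optimiser.zero_grad()
loss = loss_fn(model(frames), soft_labels)
loss.backward()
optimiser.step()

# Semantic keyword spotting: rank utterances by the model's
# probability for the written query keyword.
query_id = 42  # index of the query keyword in the vocabulary
with torch.no_grad():
    scores = torch.sigmoid(model(frames))[:, query_id]
ranking = scores.argsort(descending=True)  # most relevant first
```

Note that no text supervision enters training; the written query is only used at search time to pick out the corresponding output unit, which is what lets the model retrieve semantic rather than purely verbatim matches.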
