Björn Schuller

Chair of Complex and Intelligent Systems, University of Passau; Full Professor and Head of the Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; Reader (Associate Professor) in Machine Learning, Group on Language, Audio & Music, Imperial College London, London, U.K.; Chief Executive Officer (CEO) and Co-Founder, audEERING GmbH, Gilching, Germany; Visiting Professor, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China.

  • Synthesising 3D Facial Motion from "In-the-Wild" Speech

    Synthesising 3D facial motion from speech is a crucial problem that arises in many applications, such as computer games and movies. Recently proposed methods tackle this problem only for speech recorded in controlled conditions. In this paper, we introduce the first methodology for 3D facial motion synthesis from speech captured in arbitrary recording conditions ("in-the-wild") and independent of the speaker. For our purposes, we captured 4D sequences of people uttering 500 words contained in Lip Reading Words (LRW), a publicly available large-scale in-the-wild dataset, and built a set of 3D blendshapes appropriate for speech. We correlate the 3D shape parameters of the speech blendshapes to the LRW audio samples by means of a novel time-warping technique, named Deep Canonical Attentional Warping (DCAW), that can simultaneously learn hierarchical non-linear representations and a warping path in an end-to-end manner. We thoroughly evaluate our proposed methods and show that a deep learning model can synthesise 3D facial motion while handling different speakers and continuous speech signals in uncontrolled conditions.

    04/15/2019 ∙ by Panagiotis Tzirakis, et al.

  • Multi-modal Active Learning From Human Data: A Deep Reinforcement Learning Approach

    Human behavior expression and experience are inherently multi-modal, and characterized by vast individual and contextual heterogeneity. To achieve meaningful human-computer and human-robot interactions, multi-modal models of the users' states (e.g., engagement) are therefore needed. Most existing works that try to build classifiers for the users' states assume that the data used to train the models are fully labeled. Nevertheless, data labeling is costly and tedious, and also prone to subjective interpretations by the human coders. This is even more pronounced when the data are multi-modal (e.g., some users are more expressive with their facial expressions, some with their voice). Thus, building models that can accurately estimate the users' states during an interaction is challenging. To tackle this, we propose a novel multi-modal active learning (AL) approach that uses deep reinforcement learning (RL) to find an optimal policy for actively selecting the user data needed to train the target (modality-specific) models. We investigate different strategies for multi-modal data fusion, and show that the proposed model-level fusion coupled with RL outperforms the feature-level and modality-specific models, as well as naive AL strategies such as random sampling and standard heuristics such as uncertainty sampling. We show the benefits of this approach on the task of engagement estimation from real-world child-robot interactions during autism therapy. Importantly, we show that the proposed multi-modal AL approach can be used to efficiently personalize the engagement classifiers to the target user using a small amount of actively selected user data.

    06/07/2019 ∙ by Ognjen Rudovic, et al.

  • SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild

    Natural human-computer interaction and audio-visual human behaviour sensing systems that achieve robust performance in-the-wild are needed more than ever, as digital devices are increasingly becoming an indispensable part of our lives. Accurately annotated real-world data are the crux in devising such systems. However, existing databases usually consider controlled settings, low demographic variability, and a single task. In this paper, we introduce the SEWA database of more than 2000 minutes of audio-visual data of 398 people coming from six cultures, 50% female, and uniformly spanning the age range of 18 to 65 years old. Subjects were recorded in two different contexts: while watching adverts and while discussing adverts in a video chat. The database includes rich annotations of the recordings in terms of facial landmarks, facial action units (FAU), various vocalisations, mirroring, and continuously valued valence, arousal, liking, agreement, and prototypic examples of (dis)liking. This database aims to be an extremely valuable resource for researchers in affective computing and automatic human sensing and is expected to push forward the research in human behaviour analysis, including cultural studies. Along with the database, we provide extensive baseline experiments for automatic FAU detection and automatic valence, arousal and (dis)liking intensity estimation.

    01/09/2019 ∙ by Jean Kossaifi, et al.

  • Voice command generation using Progressive Wavegans

    Generative Adversarial Networks (GANs) have become exceedingly popular in a wide range of data-driven research fields, due in part to their success in image generation. Their ability to generate new samples, often from only a small amount of input data, makes them an exciting research tool in areas with limited data resources. One less-explored application of GANs is the synthesis of speech and audio samples. Herein, we propose a set of extensions to the WaveGAN paradigm, a recently proposed approach for sound generation using GANs. The aim of these extensions - preprocessing, Audio-to-Audio generation, skip connections and progressive structures - is to improve the human likeness of synthetic speech samples. Scores from listening tests with 30 volunteers demonstrated a moderate improvement (Cohen's d coefficient of 0.65) in human likeness using the proposed extensions compared to the original WaveGAN approach.

    03/13/2019 ∙ by Thomas Wiest, et al.
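
The effect size quoted in the abstract above (Cohen's d) can be illustrated with a minimal sketch. It assumes two independent samples and a pooled standard deviation; the abstract does not say which variant of the statistic the authors computed, and the function name `cohens_d` and the scores used below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: independent groups. A paired listening test might call for a
    different variant of the statistic.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Hypothetical listening-test scores, for illustration only.
proposed = [3, 4, 5]
baseline = [1, 2, 3]
d = cohens_d(proposed, baseline)
```

By the usual convention, values of d around 0.5 are read as a medium effect and values around 0.8 as a large one, which is consistent with the abstract describing 0.65 as moderate.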

  • On Many-to-Many Mapping Between Concordance Correlation Coefficient and Mean Square Error

    The concordance correlation coefficient (CCC) is one of the most widely used reproducibility indices, introduced by Lin in 1989. In addition to its extensive use in assay validation, CCC serves various purposes in other multivariate population-related tasks. For example, it is often used as a metric to quantify inter-rater agreement. It is also often used as a performance metric for prediction problems. However, there has been hardly any attempt to design a CCC-based cost function for training predictive deep learning models. In this paper, we present a family of lightweight cost functions that aim to also maximise CCC when minimising the prediction errors. To this end, we first reformulate CCC in terms of the errors in the prediction and then, as a logical next step, in terms of the sequence of the fixed set of errors. To elucidate our motivation and the results we obtain through these error rearrangements, the data we use is the set of gold standard annotations from a well-known database called "Automatic Sentiment Analysis in the Wild" (SEWA), popular thanks to its use in the latest Audio/Visual Emotion Challenges (AVEC'17 and AVEC'18). We also present some new and interesting mathematical paradoxes we have discovered through this CCC reformulation endeavour.

    02/14/2019 ∙ by Vedhas Pandit, et al.
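
The relationship between CCC and mean square error described in the abstract above can be sketched as follows. This is an illustration built on the standard definition of CCC with population moments, not the authors' cost functions; the function names and the toy data are assumptions. It relies on the identity MSE = var(x) + var(y) + (mean(x) - mean(y))^2 - 2 cov(x, y), which gives CCC = 2 cov / (2 cov + MSE).

```python
import numpy as np


def ccc(y_true, y_pred):
    """Concordance correlation coefficient using population (biased) moments."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)


def ccc_from_mse(y_true, y_pred):
    """The same quantity rewritten in terms of the prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cov = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    mse = np.mean((y_true - y_pred) ** 2)
    return 2.0 * cov / (2.0 * cov + mse)


# Toy gold standard and one fixed set of errors in two different orders.
gold = np.array([0.0, 1.0, 2.0, 3.0])
errors_a = np.array([1.0, -1.0, 1.0, -1.0])
errors_b = np.array([-1.0, -1.0, 1.0, 1.0])

mse_a = np.mean(errors_a ** 2)
mse_b = np.mean(errors_b ** 2)
ccc_a = ccc(gold, gold + errors_a)
ccc_b = ccc(gold, gold + errors_b)
```

Both orderings have the same MSE, yet `ccc_a` and `ccc_b` differ, because rearranging the errors changes the covariance between gold standard and prediction. This is one way to see why a single MSE value can correspond to many CCC values.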

  • Tunable Sensitivity to Large Errors in Neural Network Training

    When humans learn a new concept, they might ignore examples that they cannot make sense of at first, and only later focus on such examples, when they are more useful for learning. We propose incorporating this idea of tunable sensitivity for hard examples in neural network learning, using a new generalization of the cross-entropy gradient step, which can be used in place of the gradient in any gradient-based training method. The generalized gradient is parameterized by a value that controls the sensitivity of the training process to harder training examples. We tested our method on several benchmark datasets. We propose, and corroborate in our experiments, that the optimal level of sensitivity to hard examples is positively correlated with the depth of the network. Moreover, the test prediction error obtained by our method is generally lower than that of the vanilla cross-entropy gradient learner. We therefore conclude that tunable sensitivity can be helpful for neural network learning.

    11/23/2016 ∙ by Gil Keren, et al.

  • Detecting Road Surface Wetness from Audio: A Deep Learning Approach

    We introduce a recurrent neural network architecture for automated road surface wetness detection from audio of tire-surface interaction. The robustness of our approach is evaluated on 785,826 bins of audio that span an extensive range of vehicle speeds, noises from the environment, road surface types, and pavement conditions including international roughness index (IRI) values from 25 in/mi to 1400 in/mi. The training and evaluation of the model are performed on different roads to minimize the impact of environmental and other external factors on the accuracy of the classification. We achieve an unweighted average recall (UAR) of 93.2% across all vehicle speeds, including 0 mph. The classifier still works at 0 mph because the discriminating signal is present in the sound of other vehicles driving by.

    11/22/2015 ∙ by Irman Abdić, et al.
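
Unweighted average recall (UAR), the metric reported in the abstract above, is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. The sketch below is a minimal illustration; the function name and the "wet"/"dry" labels are assumptions, not taken from the paper's code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, computed over the classes present in y_true."""
    totals = defaultdict(int)  # number of samples per true class
    hits = defaultdict(int)    # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: a classifier that always answers "dry".
y_true = ["wet"] * 2 + ["dry"] * 8
y_pred = ["dry"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
```

In this example the accuracy is 0.8 while the UAR is only 0.5, which shows why UAR is preferred when classes are imbalanced.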

  • The Principle of Logit Separation

    We consider neural network training in applications in which there are many possible classes, but at test-time the task is to identify only whether the given example belongs to a specific class, which can be different in different applications of the classifier. For instance, this is the case in an image search engine. We consider the Single Logit Classification (SLC) task: training the network so that at test-time, it would be possible to accurately identify if the example belongs to a given class, based only on the output logit for this class. We propose a natural principle, the Principle of Logit Separation, as a guideline for choosing and designing losses suitable for the SLC. We show that the cross-entropy loss function is not aligned with the Principle of Logit Separation. In contrast, there are known loss functions, as well as novel batch loss functions that we propose, which are aligned with this principle. In total, we study seven loss functions. Our experiments show that indeed in almost all cases, losses that are aligned with the Principle of Logit Separation obtain a 20%-35% relative performance improvement in the SLC task compared to losses that are not aligned with it. We therefore conclude that the Principle of Logit Separation sheds light on an important property of the most common loss functions used by neural network classifiers. Tensorflow code for optimizing the new batch losses is publicly available in

    05/29/2017 ∙ by Gil Keren, et al.

  • DeepCoder: Semi-parametric Variational Autoencoders for Automatic Facial Action Coding

    The human face exhibits an inherent hierarchy in its representations (i.e., holistic facial expressions can be encoded via a set of facial action units (AUs) and their intensity). Variational (deep) auto-encoders (VAE) have shown great results in unsupervised extraction of hierarchical latent representations from large amounts of image data, while being robust to noise and other undesired artifacts. Potentially, this makes VAEs a suitable approach for learning facial features for AU intensity estimation. Yet, most existing VAE-based methods apply classifiers learned separately from the encoded features. By contrast, the non-parametric (probabilistic) approaches, such as Gaussian Processes (GPs), typically outperform their parametric counterparts, but cannot deal easily with large amounts of data. To this end, we propose a novel VAE semi-parametric modeling framework, named DeepCoder, which combines the modeling power of parametric (convolutional) and nonparametric (ordinal GPs) VAEs, for joint learning of (1) latent representations at multiple levels in a task hierarchy, and (2) classification of multiple ordinal outputs. We show on benchmark datasets for AU intensity estimation that the proposed DeepCoder outperforms the state-of-the-art approaches, and related VAEs and deep learning models.

    04/07/2017 ∙ by Dieu Linh Tran, et al.

  • Acoustic Gait-based Person Identification using Hidden Markov Models

    We present a system for identifying humans by their walking sounds. This problem is also known as acoustic gait recognition. The goal of the system is to analyse sounds emitted by walking persons (mostly the step sounds) and identify those persons. These sounds are characterised by the gait pattern and are influenced by the movements of the arms and legs, but also depend on the type of shoe. We extract cepstral features from the recorded audio signals and use hidden Markov models for dynamic classification. A cyclic model topology is employed to represent individual gait cycles. This topology makes it possible to model and detect individual steps, leading to very promising identification rates. For experimental validation, we use the publicly available TUM GAID database, which is a large gait recognition database containing 3050 recordings of 305 subjects in three variations. In the best setup, an identification rate of 65.5% is achieved out of 155 subjects. This is a relative improvement of almost 30% compared to our previous work, which used various audio features and support vector machines.

    06/11/2014 ∙ by Jürgen T. Geiger, et al.

  • 6th International Symposium on Attention in Cognitive Systems 2013

    This volume contains the papers accepted at the 6th International Symposium on Attention in Cognitive Systems (ISACS 2013), held in Beijing, August 5, 2013. The aim of this symposium is to highlight the central role of attention in various kinds of performance in cognitive systems processing. It brings together researchers and developers from both academia and industry, from computer vision, robotics, perception psychology, psychophysics and neuroscience, in order to provide an interdisciplinary forum to present and communicate on computational models of attention, with the focus on interdependencies with visual cognition. Furthermore, it intends to investigate relevant objectives for performance comparison, to document and to investigate promising application domains, and to discuss visual attention with reference to other aspects of AI-enabled systems.

    07/22/2013 ∙ by Lucas Paletta, et al.