Neural Multisensory Scene Inference

10/06/2019 ∙ by Jae Hyun Lim, et al. ∙ Université de Montréal, Rutgers University

For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).






1 Introduction

Learning a world model and its representation is an effective way of solving many challenging problems in machine learning and robotics, e.g., via model-based reinforcement learning (Silver et al., 2016). One characteristic aspect of learning the physical world is that it is inherently multifaceted and that we can perceive its complete characteristics only through our multisensory modalities. Thus, incorporating different physical aspects of the world via different modalities should help build a richer model and representation. One approach to learning such multisensory representations is to learn a modality-invariant representation as an abstract concept representation of the world. This idea is well supported in both psychology and neuroscience. According to the grounded cognition perspective (Barsalou, 2008), abstract concepts such as objects and events can only be obtained through perceptual signals. For example, what represents a cup in our brain is its visual appearance, the sound it could make, the tactile sensation, etc. In neuroscience, the existence of concept cells (Quiroga, 2012) that respond only to a specific concept regardless of the modality sourcing the concept (e.g., by showing a picture of Jennifer Aniston or hearing her name) can be considered biological evidence for the metamodal brain perspective (Pascual-Leone & Hamilton, 2001; Yildirim, 2014) and the modality-invariant representation.

An unanswered question from the computational perspective (our particular interest in this paper) is how to learn such a modality-invariant representation of the complex physical world (e.g., 3D scenes populated with objects). We argue that it is a particularly challenging problem because the following requirements need to be satisfied by the learned world model. First, the learned representation should reflect the 3D nature of the world. Although there have been some efforts in learning multimodal representations (see Section 3), those works do not consider this fundamental 3D aspect of the physical world. Second, the learned representation should also be able to model the intrinsic stochasticity of the world. Third, for the learned representation to generalize, be robust, and be practical in many applications, it should be inferable from experiences of any partial combination of modalities. It should also facilitate the generative modelling of other arbitrary combinations of modalities (Yildirim, 2014), supporting the metamodal brain hypothesis, for which human evidence can be found in the phantom limb phenomenon (Ramachandran & Hirstein, 1998). Fourth, even though there is evidence for a metamodal representation, there still exist modality-dependent brain regions, revealing the modal-to-metamodal hierarchical structure (Rohe & Noppeney, 2016). A learning model can also benefit from such a hierarchical representation, as shown by Hsu & Glass (2018). Lastly, the learning should be computationally efficient and scalable, e.g., with respect to the number of possible modalities.

Motivated by the above desiderata, we propose the Generative Multisensory Network (GMN) for neural multisensory scene inference and rendering. In GMN, from an arbitrary set of source modalities we infer a 3D representation of a scene that can be queried for generation via an arbitrary target modality set, a property we call generalized cross-modal generation. To this end, we formalize the problem as a probabilistic latent variable model based on the Generative Query Network (Eslami et al., 2018) framework and introduce the Amortized Product-of-Experts (APoE). The prior and the posterior approximation using APoE make the model trainable with only a small subset of modality combinations, instead of the entire combination set. The APoE also resolves the inherent space complexity problem of the traditional Product-of-Experts model and improves computational efficiency. As a result, the APoE allows the model to learn from a large number of modalities without tight coupling among the modalities, a desired property in many applications such as Cloud Robotics (Saha & Dasgupta, 2018) and Federated Learning (Konečný et al., 2016). In addition, with the APoE the modal-to-metamodal hierarchical structure is easily obtained. In experiments, we show the above properties of the proposed model on 3D scenes with blocks of various shapes and colors along with a human-like hand.

The contributions of the paper are as follows: (i) We introduce a formalization of modality-invariant multisensory 3D representation learning using a generative query network model and propose the Generative Multisensory Network (GMN). (ii) We introduce the Amortized Product-of-Experts network that allows for generalized cross-modal generation while resolving the problems in the GQN and traditional Product-of-Experts. (iii) Our model is the first to extend multisensory representation learning to 3D scene understanding with human-like sensory modalities (such as haptic information) and cross-modal generation. (iv) We also develop the Multisensory Embodied 3D-Scene Environment (MESE) used to develop and test the model.

2 Neural Multisensory Scene Inference

2.1 Problem Description

Our goal is to understand 3D scenes by learning a metamodal representation of the scene through the interaction of multiple sensory modalities such as vision, haptics, and auditory inputs. In particular, motivated by human multisensory processing (Deneve & Pouget, 2004; Shams & Seitz, 2008; Murray & Wallace, 2011), we consider a setting where the model infers a scene from experiences of a set of modalities and then generates another set of modalities given a query for the generation. For example, we can experience a 3D scene where a cup is on a table only by touching or grabbing it from some hand poses and ask if we can visually imagine the appearance of the cup from an arbitrary query viewpoint (see Fig. 1). We begin this section with a formal definition of this problem.

Figure 1: Cross-modal inference using scene representation. (a) A single image context. (b) Haptic contexts. (c) Generated images for some viewpoints (image queries) in the scene, given the contexts. (d) Ground truth images for the same queries. Conditioning on an image context and multiple haptic contexts, a modality-agnostic latent scene representation, $\mathbf{z}$, is inferred. Given sampled $\mathbf{z}$s, images are generated using various queries; in (c), each row corresponds to the same latent sample. Note that the shapes of predicted objects are consistent across different samples $\mathbf{z}$, while the color pattern of the object changes except for the parts seen in the image context (a).

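A multisensory scene, simply a scene, $S$ consists of context $C$ and observation $O$. Given the set of all available modalities $\mathcal{M}$, the context and observation in a scene are obtained through the context modalities $\mathcal{M}_c(S) \subset \mathcal{M}$ and the observation modalities $\mathcal{M}_o(S) \subset \mathcal{M}$, respectively. In the following, we omit the scene index $S$ when the meaning is clear without it. Note that $\mathcal{M}_c$ and $\mathcal{M}_o$ are arbitrary subsets of $\mathcal{M}$ including the cases $\mathcal{M}_o \cap \mathcal{M}_c = \emptyset$, $\mathcal{M}_o = \mathcal{M}_c$, and $\mathcal{M}_o \cup \mathcal{M}_c \subsetneq \mathcal{M}$. We also use $\mathcal{M}_S$ to denote all modalities available in a scene, $\mathcal{M}_o \cup \mathcal{M}_c$.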

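The context and observation consist of sets of experience trials represented as query($\mathbf{v}$)-sense($\mathbf{x}$) pairs, i.e., $C = \{(\mathbf{v}_n, \mathbf{x}_n)\}_{n=1}^{N_c}$ and $O = \{(\mathbf{v}_n, \mathbf{x}_n)\}_{n=1}^{N_o}$. For convenience, we denote the set of queries and senses in the observation by $V$ and $X$, respectively, i.e., $O = (V, X)$. Each query $\mathbf{v}_n$ and sense $\mathbf{x}_n$ in a context consists of modality-wise queries and senses corresponding to each modality in the context modalities, i.e., $(\mathbf{v}_n, \mathbf{x}_n) = \{(\mathbf{v}_n^m, \mathbf{x}_n^m)\}_{m \in \mathcal{M}_c}$ (see Fig. S1). Similarly, the query and the sense in the observation $O$ are constrained to have only the observation modalities $\mathcal{M}_o$. For example, for modality $m = \text{vision}$, a unimodal query $\mathbf{v}_n^{\text{vision}}$ can be the viewpoint and the sense $\mathbf{x}_n^{\text{vision}}$ is the observation image obtained from the query viewpoint. Similarly, for $m = \text{haptics}$, a unimodal query $\mathbf{v}_n^{\text{haptics}}$ can be the hand position, and the sense $\mathbf{x}_n^{\text{haptics}}$ is the tactile and pressure senses obtained by a grab from the query hand pose. For a scene, we may have $\mathcal{M}_c = \{\text{haptics}, \text{auditory}\}$ and $\mathcal{M}_o = \{\text{vision}, \text{auditory}\}$. For convenience, we also introduce the following notation. We denote the context corresponding only to a particular modality $m$ by $C_m = \{(\mathbf{v}_n^m, \mathbf{x}_n^m)\}_{n=1}^{N_c^m}$ such that $N_c = \sum_m N_c^m$ and $C = \{C_m\}_{m \in \mathcal{M}_c}$. Similarly, $O_m$, $X_m$, and $V_m$ are used to denote the modality-$m$ part of $O$, $X$, and $V$, respectively.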
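As a rough illustration of the bookkeeping described above, the following sketch stores a scene as per-modality lists of query-sense pairs. The `Scene` class, its method names, and the flat-list encoding of queries and senses are hypothetical choices made for this example, not part of the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Queries (e.g. a viewpoint or wrist pose) and senses (e.g. pixels or touch
# readings) are modeled as flat lists of floats purely for illustration.
Query = List[float]
Sense = List[float]
Trials = List[Tuple[Query, Sense]]


@dataclass
class Scene:
    # Hypothetical container: modality name -> list of (query, sense) trials.
    context: Dict[str, Trials] = field(default_factory=dict)
    observation: Dict[str, Trials] = field(default_factory=dict)

    def context_modalities(self) -> Set[str]:
        return set(self.context)

    def observation_modalities(self) -> Set[str]:
        return set(self.observation)

    def scene_modalities(self) -> Set[str]:
        # All modalities available in the scene: context union observation.
        return self.context_modalities() | self.observation_modalities()

    def num_context_trials(self) -> int:
        # Total number of context trials, summed over modalities.
        return sum(len(trials) for trials in self.context.values())


# A scene touched twice and then queried for one image.
scene = Scene(
    context={"haptics": [([0.1, 0.2], [0.0, 1.0]), ([0.3, 0.4], [1.0, 0.0])]},
    observation={"vision": [([0.5, 0.6], [0.9, 0.8, 0.7])]},
)
```

Keying trials by modality makes the per-modality context directly available, which is the unit that the modality encoders introduced later consume.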

Given the above definitions, we formalize the problem as learning a generative model of a scene that can generate senses corresponding to queries of a set of modalities, provided a context from other arbitrary modalities. Given scenes from the scene distribution $(O, C) \sim P(S)$, our training objective is to maximize $\mathbb{E}_{(O,C) \sim P(S)}[\log P_\theta(X|V,C)]$, where $\theta$ denotes the model parameters to be learned.

2.2 Generative Process

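We formulate this problem as a probabilistic latent variable model where we introduce the latent metamodal scene representation $\mathbf{z}$ from a conditional prior $P_\theta(\mathbf{z}|C)$. The joint distribution of the generative process becomes:

$$\begin{aligned}
P_\theta(X, \mathbf{z} | V, C) &= P_\theta(X | V, \mathbf{z}) P_\theta(\mathbf{z} | C) \\
&= \prod_{n=1}^{N_o} P_\theta(\mathbf{x}_n | \mathbf{v}_n, \mathbf{z}) P_\theta(\mathbf{z} | C) = \prod_{n=1}^{N_o} \prod_{m \in \mathcal{M}_o} P_{\theta_m}(\mathbf{x}_n^m | \mathbf{v}_n^m, \mathbf{z}) P_\theta(\mathbf{z} | C).
\end{aligned}$$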
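The factorization over trials and modalities means that, for a fixed latent sample, the log-likelihood is a plain double sum. The sketch below shows this under loud assumptions: each modality has a hypothetical `decode(query, z)` callable that returns a predicted mean, and the likelihood is a unit-variance Gaussian. Neither choice is taken from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

Trial = Tuple[List[float], List[float]]  # (query, sense)
Decoder = Callable[[List[float], List[float]], List[float]]


def gaussian_log_density(x: List[float], mean: List[float], std: float = 1.0) -> float:
    # Log density of an isotropic Gaussian, summed over dimensions.
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def scene_log_likelihood(
    observation: Dict[str, List[Trial]],
    decoders: Dict[str, Decoder],
    z: List[float],
) -> float:
    # Sum of per-modality, per-trial log-likelihoods for one latent sample z.
    total = 0.0
    for modality, trials in observation.items():
        decode = decoders[modality]
        for query, sense in trials:
            total += gaussian_log_density(sense, decode(query, z))
    return total


# Toy decoders (assumptions): each predicts a one-dimensional sense.
toy_decoders = {
    "vision": lambda v, z: [z[0] + v[0]],
    "haptics": lambda v, z: [z[0] * v[0]],
}
toy_observation = {"vision": [([1.0], [3.0])], "haptics": [([2.0], [4.0])]}
log_lik = scene_log_likelihood(toy_observation, toy_decoders, z=[2.0])
```

Because each modality contributes its own term, a modality that is absent from the observation simply drops out of the sum.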

2.3 Prior for Multisensory Context

As the prior is conditioned on the context, we need an encoding mechanism of the context to obtain $\mathbf{z}$. A simple way to do this is to follow the Generative Query Network (GQN) (Eslami et al., 2018) approach: each context query-sense pair $(\mathbf{v}_n, \mathbf{x}_n)$ is encoded to a representation $\mathbf{r}_n$ and these are summed (or averaged) to obtain a permutation-invariant context representation $\mathbf{r} = \sum_n \mathbf{r}_n$. A ConvDRAW module (Gregor et al., 2016) is then used to sample $\mathbf{z}$ from $\mathbf{r}$.
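The permutation invariance of sum aggregation can be checked with a small sketch. Here `encode_pair` is a hand-written stand-in for the learned pair encoder (an assumption for illustration only); the point is that summing the per-pair encodings gives the same result for any ordering of the context.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[List[float], List[float]]  # (query, sense)


def encode_pair(query: List[float], sense: List[float]) -> List[float]:
    # Stand-in for a learned encoder: a fixed, hand-written feature map.
    return [sum(query), sum(sense), sum(q * s for q, s in zip(query, sense))]


def aggregate(context: Sequence[Pair]) -> List[float]:
    # Sum the per-pair encodings; the result ignores the order of the pairs.
    r = [0.0, 0.0, 0.0]
    for query, sense in context:
        r = [a + b for a, b in zip(r, encode_pair(query, sense))]
    return r


context = [
    ([1.0, 0.0], [0.5, 0.5]),
    ([0.0, 2.0], [1.0, 3.0]),
    ([1.0, 1.0], [2.0, 2.0]),
]
forward = aggregate(context)
backward = aggregate(list(reversed(context)))
```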

In the multisensory setting, however, this approach cannot be directly adopted due to a few challenges. First, unlike in GQN, the sense and query of each sensor modality have a different structure, and thus we cannot have a single, shared context encoder that deals with all the modalities. In our model, we therefore introduce a modality encoder for each modality $m \in \mathcal{M}$, which maps the modality context $C_m$ to a modality-level representation $\mathbf{r}^m$.

The second challenge stems from the fact that we want our model to be capable of generating from any context modality set to any observation modality set, a property we call generalized cross-modal generation (GCG). However, at test time we do not know which sensory modal combinations will be given as a context and as a target to generate. This requires collecting training data that contains all possible combinations of context-observation modalities $(\mathcal{M}_c, \mathcal{M}_o)$. This set equals the Cartesian product of $\mathcal{M}$'s powersets, i.e., $\mathcal{P}(\mathcal{M}) \times \mathcal{P}(\mathcal{M})$. This is a very expensive requirement as $|\mathcal{P}(\mathcal{M}) \times \mathcal{P}(\mathcal{M})|$ increases exponentially with respect to the number of modalities [1].

[1] The number of modalities or sensory input sources can be very large depending on the application. Even in the case of 'human-like' embodied learning, it is not only vision, haptics, auditory, etc. For example, given a robotic hand, the context input sources can be only a part of the hand, e.g., some parts of some fingers, from which we humans can imagine the senses of other parts.
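To make the exponential growth concrete, the sketch below enumerates every (context set, observation set) pair for a few modality counts. The helper names are made up for this example; the count is simply the square of the powerset size.

```python
from itertools import chain, combinations
from typing import FrozenSet, Iterable, List, Tuple


def powerset(modalities: Iterable[str]) -> List[FrozenSet[str]]:
    # All subsets of the modality set, including the empty set.
    items = list(modalities)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def context_observation_pairs(
    modalities: Iterable[str],
) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    # Cartesian product of the powerset with itself.
    subsets = powerset(modalities)
    return [(c, o) for c in subsets for o in subsets]


for size in (2, 3, 8):
    names = [f"m{i}" for i in range(size)]
    print(size, len(context_observation_pairs(names)))
```

With two modalities there are already 16 pairs, and with eight there are 65,536, which is why covering all combinations in the training data quickly becomes impractical.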

Although one might consider dropping-out random modalities during training to achieve the generalized cross-modal generation, this still assumes the availability of the full modalities from which to drop off some modalities. Also, it is unrealistic to assume that we always have access to the full modalities; to learn, we humans do not need to touch everything we see. Therefore, it is important to make the model learnable only with a small subset of all possible modality combinations while still achieving the GCG property. We call this the missing-modality problem.

To this end, we can model the conditional prior as a Product-of-Experts (PoE) network (Hinton, 2002) with one expert per sensory modality parameterized by $\theta_m$. That is, $P(\mathbf{z}|C) \propto \prod_{m \in \mathcal{M}_c} P_{\theta_m}(\mathbf{z}|C_m)$. While this could achieve our goal at the functional level, it comes at a computational cost of increased space and time complexity w.r.t. the number of modalities. This is particularly problematic when we want to employ diverse sensory modalities (as in, e.g., robotics) or if each expert has to be a powerful (hence expensive both in computation and storage) model, as in the 3D scene inference task (Eslami et al., 2018), where it is necessary to use the powerful ConvDraw network to represent the complex 3D scene.

2.4 Amortized Product-of-Experts as Metamodal Representation

To deal with the limitations of PoE, we introduce the Amortized Product-of-Experts (APoE). For each modality $m \in \mathcal{M}_c$, we first obtain a modality-level representation $\mathbf{r}^m$ using the modal-encoder. Note that this modal-encoder is a much lighter module than the full ConvDraw network. Then, each modal-encoding $\mathbf{r}^m$ along with its modality-id $m$ is fed into the expert-amortizer $P_\psi(\mathbf{z}|\mathbf{r}^m, m)$ that is shared across all modal experts through the shared parameter $\psi$. In our case, this is implemented as a ConvDraw module (see Appendix B for the implementation details). We can write the APoE prior as follows:

$$P(\mathbf{z}|C) = \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} | \mathbf{r}^m, m).$$

We can extend this further to obtain a hierarchical representation model by treating $\mathbf{r}^m$ as a latent variable:

$$P(\mathbf{z}, \{\mathbf{r}^m\} | C) = \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} | \mathbf{r}^m, m) P_{\theta_m}(\mathbf{r}^m | C_m),$$

where $\mathbf{r}^m$ is the modality-level representation and $\mathbf{z}$ is the metamodal representation. Although we can train this hierarchical model with the reparameterization trick and Monte Carlo sampling, for simplicity in our experiments we use a deterministic function for $P_{\theta_m}(\mathbf{r}^m|C_m)$, i.e., a Dirac delta function centered at the modal-encoder output. In this hierarchical version, the generative process becomes:

$$P(X, \mathbf{z}, \{\mathbf{r}^m\} | V, C) = P_\theta(X | V, \mathbf{z}) \prod_{m \in \mathcal{M}_c} P_\psi(\mathbf{z} | \mathbf{r}^m, m) P_{\theta_m}(\mathbf{r}^m | C_m).$$

An illustration of the generative process is provided in Fig. S2(b) in the Appendix. From the perspective of cognitive psychology, the APoE model can be considered a computational model of the metamodal brain hypothesis (Pascual-Leone & Hamilton, 2001), which posits the existence of a metamodal brain area (the expert-of-experts in our case) which performs a specific function not specific to input sensory modalities.

2.5 Inference

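Since the optimization of the aforementioned objective is intractable, we perform variational inference by maximizing the following evidence lower bound (ELBO) with the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014):

$$\log P_\theta(X|V,C) \geq \mathbb{E}_{Q_\phi(\mathbf{z}|C,O)}\left[\log P_\theta(X|V,\mathbf{z})\right] - \mathrm{KL}\left[Q_\phi(\mathbf{z}|C,O) \,\|\, P_\theta(\mathbf{z}|C)\right],$$

where $P_\theta(X|V,\mathbf{z}) = \prod_{n=1}^{N_o} \prod_{m \in \mathcal{M}_o} P_{\theta_m}(\mathbf{x}_n^m | \mathbf{v}_n^m, \mathbf{z})$ as in Section 2.2. This can be considered a cognitively-plausible objective as, according to the “grounded cognition” perspective (Barsalou, 2008), the modality-invariant representation of an abstract concept, $\mathbf{z}$, can never be fully modality-independent.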
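The structure of the bound, a reconstruction term minus a KL term from the approximate posterior to the conditional prior, can be sketched numerically. The sketch assumes that both distributions are single diagonal Gaussians, which is a simplification chosen for illustration and not the ConvDraw parameterization used in the paper; the function names are hypothetical.

```python
import math
from typing import List


def kl_diag_gaussians(
    mu_q: List[float], var_q: List[float], mu_p: List[float], var_p: List[float]
) -> float:
    # KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal covariances.
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def elbo(
    expected_log_likelihood: float,
    mu_q: List[float],
    var_q: List[float],
    mu_p: List[float],
    var_p: List[float],
) -> float:
    # Reconstruction term minus the KL from posterior to conditional prior.
    return expected_log_likelihood - kl_diag_gaussians(mu_q, var_q, mu_p, var_p)


# Posterior shifted by one unit from the prior, both with unit variance.
bound = elbo(-3.0, mu_q=[1.0], var_q=[1.0], mu_p=[0.0], var_p=[1.0])
```

In practice the expected log-likelihood would be estimated from reparameterized samples of the latent; it is passed in as a number here to keep the example small.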

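APoE Approximate Posterior. The approximate posterior $Q_\phi(\mathbf{z}|C,O)$ is implemented as follows. Following Wu & Goodman (2018), we first represent the true posterior as

$$P(\mathbf{z}|C,O) = \frac{P(O,C|\mathbf{z})P(\mathbf{z})}{P(O,C)} = \frac{P(\mathbf{z})}{P(C,O)} \prod_{m \in \mathcal{M}_S} P(C_m,O_m|\mathbf{z}) = \frac{P(\mathbf{z})}{P(C,O)} \prod_{m \in \mathcal{M}_S} \frac{P(\mathbf{z}|C_m,O_m)P(C_m,O_m)}{P(\mathbf{z})}.$$

After ignoring the terms that are not a function of $\mathbf{z}$, we obtain

$$P(\mathbf{z}|C,O) \propto \frac{\prod_{m \in \mathcal{M}_S} P(\mathbf{z}|C_m,O_m)}{P(\mathbf{z})^{|\mathcal{M}_S|-1}}.$$

Replacing the numerator terms with an approximation $P(\mathbf{z}|C_m,O_m) \approx Q_\phi(\mathbf{z}|C_m,O_m) P(\mathbf{z})^{(|\mathcal{M}_S|-1)/|\mathcal{M}_S|}$, we can remove the priors in the denominator and obtain the following APoE approximate posterior:

$$P(\mathbf{z}|C,O) \approx \prod_{m \in \mathcal{M}_S} Q_\phi(\mathbf{z}|C_m,O_m).$$

Although the above product is intractable in general, a closed-form solution exists if each expert is a Gaussian (Wu & Goodman, 2018). The mean and covariance of the APoE are, respectively, $\mu = (\sum_m \mu_m U_m)(\sum_m U_m)^{-1}$ and $\Sigma = (\sum_m U_m)^{-1}$, where $\mu_m$ and $U_m$ are the mean and the inverse of the covariance of each expert. The posterior APoE is implemented by first encoding $(C_m, O_m)$ into $\mathbf{r}^m$ and then feeding $\mathbf{r}^m$ and the modality-id $m$ into the amortized expert $Q_\phi(\mathbf{z}|\mathbf{r}^m, m)$, which is a ConvDraw module in our implementation. The amortized expert outputs $\mu_m$ and $U_m$ for $m \in \mathcal{M}_S$ while sharing the variational parameter $\phi$ across the modality-experts. Fig. 2 compares the inference network architectures of CGQN, PoE, and APoE.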
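The closed-form product of Gaussian experts can be sketched as follows for the diagonal-covariance case, where precisions add and the combined mean is the precision-weighted average of the expert means. The function name and the two toy experts are illustrative assumptions; in the model, each expert's mean and precision would come from the shared amortized network.

```python
from typing import List, Sequence, Tuple

Expert = Tuple[List[float], List[float]]  # (mean, variance) per latent dimension


def product_of_gaussian_experts(
    experts: Sequence[Expert],
) -> Tuple[List[float], List[float]]:
    # Combine diagonal Gaussian experts: precisions add, and the combined
    # mean is the precision-weighted average of the expert means.
    dim = len(experts[0][0])
    precision_sum = [0.0] * dim
    weighted_mean_sum = [0.0] * dim
    for mean, variance in experts:
        for d in range(dim):
            precision = 1.0 / variance[d]
            precision_sum[d] += precision
            weighted_mean_sum[d] += mean[d] * precision
    combined_variance = [1.0 / p for p in precision_sum]
    combined_mean = [w / p for w, p in zip(weighted_mean_sum, precision_sum)]
    return combined_mean, combined_variance


# Two toy experts that disagree on the mean but are equally confident.
vision_expert = ([0.0, 2.0], [1.0, 4.0])
haptic_expert = ([2.0, 0.0], [1.0, 4.0])
mean, variance = product_of_gaussian_experts([vision_expert, haptic_expert])
```

Adding an expert can only shrink the combined variance, which matches the intuition that each additional modality contributes evidence about the scene.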

Figure 2: Baseline model, PoE and APoE. In the baseline model (left), a single inference network (denoted as Encoder) takes as input the sum of all modality encoders' outputs. In PoE (middle), each of the experts contains an integrated network combining the modality encoder and a complex inference network like ConvDraw, resulting in $O(|\mathcal{M}|)$ space cost of inference networks. In APoE (right), the modality encoding and the inference network are detached, and the inference networks are integrated into a single amortized expert inference network serving all experts. Thus, the space cost of inference networks reduces to $O(1)$.
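The space argument in the caption can be illustrated with a back-of-the-envelope parameter count. The encoder and expert sizes below are invented round numbers, not the paper's measurements, and the helper functions are hypothetical; the sketch only shows how the count scales with the number of modalities.

```python
def poe_inference_params(
    num_modalities: int, encoder_params: int, expert_params: int
) -> int:
    # One encoder and one full expert network per modality.
    return num_modalities * (encoder_params + expert_params)


def apoe_inference_params(
    num_modalities: int, encoder_params: int, expert_params: int
) -> int:
    # One light encoder per modality, plus a single shared expert network.
    return num_modalities * encoder_params + expert_params


# Assumed sizes for illustration only: 1M per encoder, 10M per expert.
ENCODER = 1_000_000
EXPERT = 10_000_000
for m in (2, 8, 14):
    print(m, poe_inference_params(m, ENCODER, EXPERT),
          apoe_inference_params(m, ENCODER, EXPERT))
```

Under these assumed sizes the expensive expert is paid for once in APoE, so the gap to PoE widens as modalities are added; only the light encoders still grow linearly.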

3 Related Works

Multimodal Generative Models. Multimodal data are associated with many interesting learning problems, e.g., cross-modal inference, zero-shot learning, or weakly-supervised learning. For these, latent variable models have provided effective solutions: from a model with a global latent variable shared among all modalities (Suzuki et al., 2016) to hierarchical latent structures (Hsu & Glass, 2018) and scalable inference networks with Product-of-Experts (PoE) (Hinton, 2002; Wu & Goodman, 2018; Kurle et al., 2018). In contrast to these works, the current study addresses two additional challenges. First, this work aims at achieving any-modal to any-modal conditional inference regardless of modality configurations during training: it targets generalization under distribution shifts at test time. The previous studies, on the other hand, assumed full modality configurations in both training and test data. Second, the proposed model considers each source of information to be only partially observable, whereas previously each modality-specific datum has been treated as fully observable. As a result, the modality-agnostic metamodal representation is inferred from modality-specific representations, each of which is integrated from a set of partially observable inputs.

3D Representations and Rendering. Learning representations of 3D scenes or environments with partially observable inputs has been addressed by supervised learning (Choy et al., 2016; Wu et al., 2017; Shin et al., 2018; Mescheder et al., 2018), latent variable models (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018), and generative adversarial networks (Wu et al., 2016; Rajeswar et al., 2019; Nguyen-Phuoc et al., 2019). The GAN-based approaches exploit domain-specific functions, e.g. 3D representations, 3D-to-2D projection, and 3D rotations. Thus, they are hard to apply to non-visual modalities whose underlying transformations are unknown. On the other hand, neural latent variable models for random processes (Eslami et al., 2018; Rosenbaum et al., 2018; Kumar et al., 2018; Garnelo et al., 2018a, b; Le et al., 2018; Kim et al., 2019) have dealt with more generalized settings and studied order-invariant inference. However, these studies focus on single-modality cases, in contrast to our method, which addresses a new problem setting where qualitatively different information sources are available for learning the scene representations.

4 Experiment

Figure 3: Results on cross-modal density estimation. (a) Log-likelihood of target images (gray) vs. the number of haptic observations. (b) Log-likelihood of target images (RGB) vs. the number of haptic observations. (c) Log-likelihood of target haptic values vs. the number of image observations. The dotted lines show fully cross-modal inference where the context does not include any target modality. For inference with additional context from the target modality, the results are denoted as dashed, dashdot, and solid lines.

The proposed model is evaluated with respect to the following criteria: (i) cross-modal density estimation in terms of log-likelihood, (ii) ability to perform cross-modal sample generation, (iii) quality of learned representation by applying it to a downstream classification task, (iv) robustness to the missing-modality problem, and (v) space and computational cost.

To evaluate our model we have developed an environment, the Multisensory Embodied 3D-Scene Environment (MESE). MESE integrates MuJoCo (Todorov et al., 2012), MuJoCo HAPTIX (Kumar & Todorov, 2015), and the OpenAI gym (Brockman et al., 2016) for 3D scene understanding through multisensory interactions. In particular, from MuJoCo HAPTIX the Johns Hopkins Modular Prosthetic Limb (MPL) (Johannes et al., 2011) is used. The resulting MESE, equipped with vision and proprioceptive sensors, is particularly suitable for tasks related to human-like embodied multisensory learning. In our experiments, the visual input is an RGB image and the haptic input is a 132-dimensional vector consisting of the hand pose and touch senses. Our main task is similar to the Shepard-Metzler object experiments used in Eslami et al. (2018) but extends them with the MPL hand.

As a baseline model, we use a GQN variant (Kumar et al., 2018) (discussed in Section 2.3). In this model, following GQN, the representations from different modalities are summed and then given to a ConvDraw network. We also provide a comparison to PoE version of the model in terms of computation speed and memory footprint. For more details on the experimental environments, implementations, and settings, refer to Appendix A.

Cross-Modal Density Estimation. Our first evaluation is the cross-modal conditional density estimation. For this, we estimate the conditional log-likelihood $\log P(X|V,C)$ for $\mathcal{M} = \{\text{RGB-image}, \text{haptics}\}$, i.e. $|\mathcal{M}| = 2$. During training, we use both modalities for each sampled scene and use 0 to 15 randomly sampled context query-sense pairs for each modality. At test time, we provide uni-modal context from one modality and generate the other.

Fig. 3 shows results on 3 different experiments: (a) HAPTIC→GRAY, (b) HAPTIC→RGB and (c) RGB→HAPTIC. Note that we include HAPTIC→GRAY, although GRAY images are not used during training, to analyze the effect of color in haptic-to-image generation. The APoE and the baseline are plotted in blue and orange, respectively. In all cases our model (blue) outperforms the baseline (orange). This gap is even larger when the model is provided a limited amount of context information, suggesting that the baseline requires more context to improve the representation. Specifically, in the fully cross-modal setting where the context does not include any target modality (the dotted lines), the gap is largest. We believe that our model can better leverage modal-invariant representations from one modality to another. Also, when we provide additional context from the target modality (dashed, dashdot, solid lines), we still see that our model outperforms the baseline. This implies that our model can successfully incorporate information from different modalities without them interfering with each other. Furthermore, from Fig. 3(a) and (b), we observe that haptic information captures only shapes: the prediction in RGB has lower likelihood without any image in the context. However, for the GRAY image in (a), the likelihood approaches the upper bound.

Cross-Modal Generation. We now qualitatively evaluate the ability to perform cross-modal generation. Fig. 1 shows samples of our cross-modal generation for various query viewpoints. Here, we condition the model on 15 haptic context signals but provide only a single image. We note that the single image provides limited color information about the object (namely, that red and cyan are part of the object) and almost no information about the shape. We can see that the model is able to almost perfectly infer the shape of the object. However, it fails to predict the correct colors (Fig. 1(c)), which is expected given the limited visual information provided. Interestingly, the object part for which the context image provides color information has correct colors, while other parts have random colors for different samples, showing that the model captures the uncertainty in $\mathbf{z}$. Additional results provided in Appendix D further suggest that: (i) our model gradually aggregates evidence to improve predictions (Fig. S5) and (ii) our model successfully integrates distinctive multisensory information in its inference (Fig. S6).

Figure 4: Classification result.

To further evaluate the quality of the modality-invariant scene representations, we test on a downstream classification task. We randomly sample 10 scenes and from each scene we prepare held-out query-sense pairs to use as the input to the classifier. The models are then asked to classify which scene (1 out of 10) a given query-sense pair belongs to. We use Eq. (C) for this classification. To see how the provided multi-modal context contributes to obtaining a useful representation for this task, we test the following three context configurations: (i) image-query pairs only, (ii) haptic-query pairs only, and (iii) all sensory contexts.

In Fig. 4, both models use contexts to classify scenes and their performance improves as the number of contexts increases. APoE outperforms the baseline in classification accuracy, while both methods have similar ELBO (see Fig. S4). This suggests that the representation of our model tends to be more discriminative than that of the baseline. In APoE, the results with an individual modality (image only or haptic only) are close to the one with all modalities. The drop in performance with only haptic-query pairs is due to the fact that certain samples might have the same shape but different colors. On the other hand, the baseline shows worse performance when inferring a modality-invariant representation from a single sensory modality, especially for images. This demonstrates that the APoE model helps learn better representations for both modality-specific tasks (image-only and haptic-only contexts) and modality-invariant tasks (all contexts).

Missing-modality Problem. In practical scenarios, since it is difficult to assume that we always have access to all modalities, it is important that the model can learn when some modalities are missing. Here, we evaluate this robustness by providing unseen combinations of modalities at test time. This is done by limiting the set of modality combinations observed during training. That is, we provide only a subset of modality combinations for each scene $S$, i.e., $\mathcal{M}_S \subset \mathcal{M}$. At test time, the model is evaluated on every combination of all modalities, thus including the settings not observed during training. As an example, for a total of 8 modalities [2], we use $|\mathcal{M}_S| \in \{1, 2\}$ to indicate that each scene in the training data contains only one or two modalities. Fig. 5(a) and (b) show results with $|\mathcal{M}| = 8$ while (c) and (d) show results with $|\mathcal{M}| = 14$.

[2] e.g., left and right half of an image.
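One way to picture the restricted training setup is a sampler that exposes only a few modalities per scene. The sketch below is a guess at such a procedure for illustration: the function name, the modality labels `m0` to `m7`, and the uniform sampling are assumptions and are not taken from the paper's data pipeline.

```python
import random
from typing import List, Sequence


def sample_training_modalities(
    all_modalities: Sequence[str],
    allowed_sizes: Sequence[int],
    rng: random.Random,
) -> List[str]:
    # Pick how many modalities this scene exposes, then pick which ones.
    size = rng.choice(list(allowed_sizes))
    return sorted(rng.sample(list(all_modalities), size))


rng = random.Random(0)
modalities = [f"m{i}" for i in range(8)]
# Each training scene sees only one or two of the eight modalities.
train_subsets = [
    sample_training_modalities(modalities, (1, 2), rng) for _ in range(5)
]
```

At test time the evaluation would instead range over every subset of the eight modalities, so most evaluated combinations never appear in `train_subsets`.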

Figure 5: Results of missing-modality experiments for (a,b) $|\mathcal{M}| = 8$ and (c,d) $|\mathcal{M}| = 14$ environments. During training, limited combinations of all possible modalities are presented to the model. The size of exposed multimodal senses per scene is denoted as $|\mathcal{M}_S|$. For the validation dataset, the models are evaluated with the same limited combinations as in training, as well as with all combinations.

Fig. 5 (a) and (c) are results when a much more restricted number of modalities is available during training: 2 out of 8 and 4 out of 14, respectively. At test time, however, all combinations of modalities are used. We report performance both on the full configurations and on the limited modality configurations used during training. Fig. 5 (b) and (d) show the opposite setting where, during training, a large number of modalities (e.g., 7 or 8 modalities) are always provided together for each scene. Thus, the model has not been trained on small modality sets such as only one or two modalities, but we test on these configurations to see its ability to perform generalized cross-modal generation. For more results, see Appendix E.

Overall, in all cases our model shows good test-time performance on the unseen context modality configurations, whereas the baseline model mostly overfits severely (except in (c)) or converges slowly. This is because, in the baseline model, the summed representation of a context configuration unseen during training is itself likely to be unseen at test time, and thus the model overfits. In contrast, our model, as a PoE, is robust to this problem as all experts agree on a similar representation. The baseline results for case (c) seem less prone to this problem but converge much more slowly. As it converges slowly, we believe that it might still overfit in the end with a longer run.

Space and Time Complexity. The expert amortizer of APoE significantly reduces the inherent space problem of PoE while it still requires separate modality encoders. Specifically, in our experiments, for the case, PoE requires 53M parameters while APoE uses 29M. For , PoE uses 131M parameters while APoE uses only 51M. We also observed a reduction in computation time by using APoE. For model, one iteration of PoE takes, on average, 790 ms while APoE takes 679 ms. This gap becomes more significant for where PoE takes 2059 ms while APoE takes 1189 ms. This is partly due to the number of parameters. Moreover, unlike PoE, APoE can parallelize its encoder computation via convolution. For more results, see Table 1 in the Appendix.

5 Conclusion

We propose the Generative Multisensory Network (GMN) for understanding 3D scenes via modality-invariant representation learning. In GMN, we introduce the Amortized Product-of-Experts (APoE) in order to deal with the problem of missing-modalities while resolving the space complexity problem of standard Product-of-Experts. In experiments on 3D scenes with blocks of different shapes and a human-like hand, we show that GMN can generate any modality from any context configurations. We also show that the model with APoE learns better modality-agnostic representations, as well as modality-specific ones. To the best of our knowledge this is the first exploration of multisensory representation learning with vision and haptics for generating 3D objects. Furthermore, we have developed a novel multisensory simulation environment, called the Multisensory Embodied 3D-Scene Environment (MESE), that is critical to performing these experiments.


JL would like to thank Chin-Wei Huang, Shawn Tan, Tatjana Chavdarova, Arantxa Casanova, Ankesh Anand, and Evan Racah for helpful comments and advice. SA thanks Kakao Brain, the Center for Super Intelligence (CSI), and Element AI for their support. CP also thanks NSERC and PROMPT.


Appendix A Experiments

We start by describing the Multisensory Embodied 3D-Scene Environment (MESE) and the simulated datasets used in our experiments. We then explain the training settings.

Figure S1: Example multisensory scene (with single object) in MESE. The scene includes a set of visual and haptic observations, each of which is partially observable.

A.1 Multisensory Embodied 3D-Scene Environment (MESE)

Aiming at a development environment for 3D scene understanding through interaction, we build a multisensory 3D scene environment, equipped with visual and proprioceptive (haptic) sensors, called the Multisensory Embodied 3D-Scene Environment (MESE). MESE is similar to the Shepard-Metzler object experiments used in Eslami et al. (2018), but extends them with an MPL hand model from MuJoCo HAPTIX (Kumar & Todorov, 2015). The environment uses MuJoCo (Todorov et al., 2012) and the OpenAI gym (Brockman et al., 2016).

Scene. Adapted from Eslami et al. (2018), MESE generates a single Shepard-Metzler object with an arbitrary number of blocks per episode. Each block of the object is randomly colored in the HSV scheme. More precisely, hue and saturation are randomly selected within fixed ranges: hue is sampled from (0, 1) and saturation is sampled from (0, 0.75). Value (in HSV) is fixed to 1. The sampled HSV colors are converted to RGB.
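The color sampling described above can be sketched with Python's standard `colorsys` module. The function name is hypothetical, uniform sampling within the stated ranges is an assumption, and `random.uniform` may return the interval endpoints, so this only approximates the open ranges in the text.

```python
import colorsys
import random
from typing import List, Tuple


def sample_block_colors(
    num_blocks: int, rng: random.Random
) -> List[Tuple[float, float, float]]:
    # Sample one RGB color per block from the HSV ranges given in the text.
    colors = []
    for _ in range(num_blocks):
        hue = rng.uniform(0.0, 1.0)
        saturation = rng.uniform(0.0, 0.75)
        value = 1.0  # value is fixed to 1
        colors.append(colorsys.hsv_to_rgb(hue, saturation, value))
    return colors


block_colors = sample_block_colors(5, random.Random(0))
```

Fixing the value to 1 means the largest RGB channel of every block is 1, and capping saturation at 0.75 keeps the smallest channel at or above 0.25, so the blocks are always bright.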

Image. An RGB camera is defined in the environment for visual input. The position of the camera and its facing direction, i.e. are defined as actions for agents. We refer to a viewpoint as the position and facing direction combined. From a given viewpoint, a generated RGB image has dimension.

Haptic. For the proprioceptive (haptic) sense, the Johns Hopkins Modular Prosthetic Limb (MPL) (Johannes et al., 2011), which is a part of MuJoCo HAPTIX, is used. The hand model generates a 132-dimensional observation consisting of its actuator positions, velocities, accelerations, and touch senses. For more details about the MPL hand, please see Appendix C of Amos et al. (2018). The MPL hand model has 13 degrees of freedom to control. MESE adds 5 degrees of freedom, i.e. , to control the position and facing direction of the hand’s wrist, similar to camera control.

A.2 Datasets

Given that each scene has a single object at the origin, images and haptics are randomly generated. For an image, a camera viewpoint is sampled on a spherical surface with a fixed radius, with the camera facing the object. We refer to the camera viewpoint as the image query.
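A viewpoint of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the MESE code: the function name `sample_viewpoint`, the uniform distribution over the sphere, the yaw/pitch parameterization of the facing direction, and the radius value 3.0 are all hypothetical choices that the text does not specify.

```python
import math
import random


def sample_viewpoint(radius, rng):
    """Sample a camera position on a sphere plus the yaw and pitch that face the origin."""
    # Assumption: uniform direction on the sphere (z uniform in [-1, 1], azimuth uniform).
    z = rng.uniform(-1.0, 1.0)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    xy = math.sqrt(max(0.0, 1.0 - z * z))
    position = (
        radius * xy * math.cos(azimuth),
        radius * xy * math.sin(azimuth),
        radius * z,
    )
    # The viewing direction points from the camera to the object at the origin.
    dx, dy, dz = (-p for p in position)
    yaw = math.atan2(dy, dx)
    pitch = math.atan2(dz, math.hypot(dx, dy))
    return position, yaw, pitch


# Hypothetical radius; the text only says the radius is fixed.
position, yaw, pitch = sample_viewpoint(radius=3.0, rng=random.Random(0))
```

The position (three values) together with yaw and pitch (two values) gives a five-dimensional query, which matches the 5 degrees of freedom the text attributes to the wrist control.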

For the haptic data in each scene, we first sample a wrist pose of the hand, similar to generating camera viewpoints. Given the sampled wrist pose, a fixed deterministic policy is executed. The policy starts from a stretched hand pose and gradually moves to a grabbing posture without any stochasticity. Note that a haptic datapoint is a function of the wrist pose and the object, given the aforementioned fixed policy; thus, the wrist’s position and facing direction are used as the hand’s query. Each dimension of the haptic data is re-scaled to .

For the environment , also denoted as , 1M scenes are collected as training data. For each scene, a Shepard-Metzler object with 5 parts is randomly sampled as described in Section A.1. The number of unique shapes is 728 for the 5-part object dataset. In each scene, 15 queries and the corresponding sensory outputs are randomly sampled for each sensory modality. For validation and test data, 20k and 100k scenes are sampled, respectively.

For the environment whose is larger than 2, we slice the dimensions of the image and haptic data. For example, to build an environment , each image is split into its four quadrants, so that each image is one of {, , , }. In addition to these four visual modalities, haptic input is provided. Note that while we split each image into four, the corresponding experiment differs from image in-painting or de-noising tasks. In those image tasks, statistical regularities of images are heavily exploited, i.e. the statistics of local receptive fields are almost identical regardless of position. Many recent solutions to those problems resort to convolutional architectures as a practical way of sharing model parameters across arbitrary locations. As long as no model makes use of this inductive bias, it is valid to regard the quadrants as distinct random variables, each with different statistical characteristics; thus, they can be treated as multiple modalities.
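The quadrant split that turns one image into four visual modalities can be sketched in plain Python. The function name and the dictionary keys are hypothetical labels introduced for illustration, and the toy 4x4 single-channel image stands in for a real observation.

```python
def split_into_quadrants(image):
    """Split an image (a list of rows, each a list of pixels) into four quadrants."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    half_h, half_w = height // 2, width // 2
    top, bottom = image[:half_h], image[half_h:]
    # Each quadrant is treated as its own modality (keys are hypothetical names).
    return {
        "top_left": [row[:half_w] for row in top],
        "top_right": [row[half_w:] for row in top],
        "bottom_left": [row[:half_w] for row in bottom],
        "bottom_right": [row[half_w:] for row in bottom],
    }


# A toy 4x4 "image" whose pixel value encodes its position.
image = [[4 * r + c for c in range(4)] for r in range(4)]
quadrants = split_into_quadrants(image)
```

Each quadrant would then be paired with the same camera viewpoint as its query, so the four visual modalities plus the haptic input give five modalities in total.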

For , the image is cropped to and re-sized to due to the memory overhead. The image is split into left and right halves for each RGB channel; thus, we have {, , , , , } as different senses. The haptic dimension is also divided into two, i.e. and . The former corresponds to the thumb, index, and middle fingers; the latter corresponds to the ring and little fingers, as well as the palm.

For , the image is converted to as in , but it is sliced as to . With the haptic data divided into and , we get an environment with .

A.3 Training

For training, the Adam optimizer is used (Kingma & Ba, 2014). β-annealing is employed (here, β-annealing refers to annealing the weight on the KL term of the ELBO, as done in Higgins et al. (2017)); β is annealed from 0.1 to 1 during the first epoch and maintained at 1 for the rest of training. The learning rate is set to 0.0001. For stable training, the gradient is clipped to . Training is run for 10 epochs. The mini-batch size is set to 14 scenes for the environment and 24 scenes for the environments with 5, 8, and 14 modalities.
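The annealing of the KL weight can be sketched as below. This is a hedged illustration: it reads the schedule as rising from 0.1 to 1 over the first epoch and holding at 1 afterwards, and the linear shape of the ramp, the per-step granularity, and the names `kl_weight` and `negative_elbo` are assumptions that the text does not confirm.

```python
def kl_weight(step, steps_per_epoch, start=0.1, end=1.0):
    """Return the KL weight at a given training step."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    # Assumption: linear ramp over the first epoch, then constant at `end`.
    progress = min(step / steps_per_epoch, 1.0)
    return start + (end - start) * progress


def negative_elbo(reconstruction_nll, kl_divergence, weight):
    """Training loss: reconstruction term plus the weighted KL term."""
    return reconstruction_nll + weight * kl_divergence


# Hypothetical numbers, only to show how the two pieces combine.
loss_early = negative_elbo(120.0, 30.0, kl_weight(0, steps_per_epoch=1000))
loss_late = negative_elbo(120.0, 30.0, kl_weight(5000, steps_per_epoch=1000))
```

A small initial weight lets the model fit the reconstruction term before the KL penalty is fully applied; from the second epoch on, the loss is the ordinary negative ELBO.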

Appendix B Network Architectures

(a) Baseline model.
(b) APoE.
Figure S2: Computation graphs of generation processes for the proposed models. (a) Baseline model: Each instance of -th modality query-sense pairs feeds to , i.e. the representation network. All instances of representation s are summed up to get representation . Metamodal scene representation is inferred using the C-GQN decoder (or encoder in inference). Conditioning on the and a query , an instance of sensory data is generated using , i.e. the renderer. (b) APoE: Each instance of -th modality query-sense pairs feeds to , and they are summed up to get modality-specific representation . Metamodal scene representation is inferred via product-of-experts using the expert-amortizer network. Rendering follows the same process as in the baseline model.
Figure S3: Implementation details for the modified ConvDraw network architecture; a) baseline encoder and b) decoder. c) the proposed model’s encoder and d) decoder. Sampled latent will be passed to renderers. Note that in the PoE and APoE, the distribution of is estimated as the product of experts for each -th step.

Overall. We adopt the C-GQN network architecture from Kumar et al. (2018) for the proposed model as well as the baseline. This architecture can be thought of as a modified version of the ConvDraw encoder-decoder in which, unlike the original (Gregor et al., 2016), the posteriors have no feedback routes for the predicted inputs and the residuals between the target and the predictions. As a result, the same input is provided repeatedly at every step of the ConvLSTM iterations (see Fig. S3 (a)).

For the baseline, we use the C-GQN network; its generation process is depicted in Fig. S2 (a). Each instance of -th modality query-sense pairs feeds to , i.e. -th representation network. All instances of representation s are summed up to get representation . Metamodal scene representation is inferred using the C-GQN decoder (or encoder in inference). Conditioning on the and a query , a sensory datapoint is generated using , i.e. the renderer for the -th modality.

For APoE, multiple experts are modeled as a single network, called the expert-amortizer, in which a binary mask identifying the modality is used while inferring , e.g. in Eq. (2.4) where a binary mask. The expert-amortizer is built upon further modifications of the modified ConvDraw, as shown in Fig. S3 (b). For efficient computation, the expert-amortizers are implemented such that they perform convolution over .

See Fig. S2 (b) for APoE’s generation process. Identical to the baseline, each instance of -th modality query-sense pairs feeds to , and they are summed up to get modality-specific representation . However, the metamodal scene representation is inferred via product-of-experts using the expert-amortizer network.

For PoE, each expert is modeled as a single ConvDraw encoder-decoder with the corresponding modality encoder, and the rest of the implementation is identical to APoE.
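To make the product-of-experts combination concrete, the sketch below multiplies one-dimensional Gaussian experts in closed form. It is an illustration under assumptions, not the model's implementation: it assumes each expert outputs a Gaussian over a latent dimension, it omits any prior expert, and the function name `product_of_gaussians` is hypothetical. In the actual model the experts are ConvDraw-based networks, and the product is taken at each step.

```python
def product_of_gaussians(means, variances):
    """Combine 1-D Gaussian experts N(mean_i, var_i) into their normalized product."""
    if not means or len(means) != len(variances):
        raise ValueError("need one variance per mean and at least one expert")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # A product of Gaussians is Gaussian: precisions add, and the mean is
    # the precision-weighted average of the expert means.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision


# Two equally confident experts (say, one visual and one haptic).
both = product_of_gaussians([0.0, 2.0], [1.0, 1.0])
# A missing modality is handled by leaving its expert out of the product.
vision_only = product_of_gaussians([0.0], [1.0])
```

Because precisions add, an uncertain expert (large variance) has little influence on the result, and each added expert can only reduce the combined variance.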

Representation Network. To estimate the modality-specific representation for each instance of a query-sense pair, the tower representation network proposed in Eslami et al. (2018) is used. For a camera position-image pair, convolutional layers are used in the tower representation network. Similarly, an MLP is applied to a haptic observation and its corresponding query. The same representation network architectures are used for the baseline, PoE, and APoE.

Renderer. The renderer network is the part of the decoder that predicts each sensory modality. Each renderer takes a query and the modality-agnostic latent representation as inputs, and outputs the sensory data conditioned on them. For rendering images, a ConvLSTM with convolutional layers is used, as done in C-GQN. Similarly, a ConvLSTM is used for proprioceptive (haptic) data, but an MLP is employed instead of convolutional layers.

Appendix C Classification

For classification, we adopt a method from Lake et al. (2015). Suppose we have -number of context sets, for , each of which is a set that contains multisensory data-query pairs. Given an observation set obtained from one of the -scenes (more precisely, objects), we can predict from which scene the new observation set comes. The predicted label is obtained as follows: $\hat{k} = \arg\max_k \log P_\theta(X \mid V, C^{(k)})$, where each is estimated by using the log-likelihood estimators from Burda et al. (2016). This method doesn’t require an additional training process. To approximate each log-likelihood , 50 latent samples are used.
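The decision rule above can be sketched as follows, using an importance-weighted estimate of each log-likelihood in the style of Burda et al. (2016), i.e. the log of the mean importance weight over latent samples. The sketch assumes the per-sample log importance weights have already been computed by a trained model; how they are produced is model-specific and not shown. The function names and the toy numbers are hypothetical.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def classify(log_weights_per_scene):
    """Pick the context set whose estimated log-likelihood is largest.

    log_weights_per_scene[k] holds the log importance weights of the
    observation set under the k-th context set, one per latent sample.
    """
    scores = [log_mean_exp(weights) for weights in log_weights_per_scene]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return predicted, scores


# Toy example: 3 candidate scenes with 2 latent samples each
# (the text uses 50 latent samples per estimate).
predicted, scores = classify([[-5.0, -6.0], [-1.0, -2.0], [-9.0, -3.0]])
```

Subtracting the largest log weight before exponentiating avoids underflow when the log weights are large and negative, which is typical for image likelihoods.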

For the held-out dataset, 1000 additional Shepard-Metzler objects with 4 or 6 parts are generated; none of these objects is present in the training dataset. is set to 10. For all models, three different inference scenarios are considered: classification is performed by using (i) only image-query pairs from each scene (), (ii) haptic-only contexts (), and (iii) both sensory contexts ().

The results are shown in Fig. S4. To show that both models are well trained and have converged on the training dataset, the learning curves of the baseline and APoE models used in classification are also included.

Figure S4: Classification result for environment. (Left) learning curves for baseline and APoE models used in classification. (Right) classification results of the models from the left. Classification is performed in three different conditioning scenarios; each model is conditioned on image-only context (), haptic-only context () and all information ().

Appendix D Cross-modal Generation

D.1 Reducing Uncertainty with Aggregation of Evidences

In this task, we examine the uncertainty of modality-agnostic representations with respect to the number of contexts. Similar to Fig. 1, we provide a single image context but we condition a trained model on different numbers of haptic contexts. More precisely, the image context is given such that the model cannot recover the entire scene from the image.

The generated image samples are shown in Fig. S5 (a). As the number of haptic contexts increases, more accurate visual observations are predicted. We can also observe that the generated images in each column continue to improve over the previous column, corresponding to where additional haptic information is provided. Again, we observe that the part of the object for which the context image provides color information has similar colors, while the other parts have random colors.

Fig. S5 (b) describes the generated haptic samples from the same query. In this figure, the 95% confidence interval from 20 samples is also illustrated. Similar to the visual prediction, the haptic prediction improves with the number of haptic contexts. In addition, the uncertainty of the prediction decreases as the contexts aggregate.

Figure S5: Multi-sensory inference. (a) Upper row illustrates single visual context and various haptic cues, from empty observation to multiple observations. Note that the given visual context is insufficient to infer correct object shape. The model predicts visual observations for different viewpoints, i.e. , , and , using the visual and haptic contexts. -axis label indicates the indices of haptic contexts used when the model predicts the corresponding column. (b) The ground truth images for the same viewpoints. (c) The model predicts haptic observations for a haptic query. The ground truth values are denoted as red. -axis label indicates index of 132-dimensional output of the hand model. means the number of haptic contexts used for prediction.

D.2 Any-to-any Cross-modal Generation

Additional cross-modal generation experiments are performed for in order to explore multisensory integration under arbitrary context conditions. Given any context condition, a trained model is asked to generate all modality outputs (for a given set of queries), and these are combined for display. For instance, a model trained in generates outputs in all modalities, i.e. . The visual outputs are combined and displayed as shown in Fig. S6 (d). The haptic outputs are omitted to conserve space.

Three different context conditions are applied for each environment. For , a model is conditioned on (i) -only, (ii) , and (iii) contexts. For , a model is conditioned on (i) , (ii) , and (iii) . For , a model is conditioned on (i) , (ii) , and (iii) . Each context modality is provided with 5 query-sense pairs.

The results are shown in Fig. S6 (b)-(d), and the ground truth images are given in Fig. S6 (a). The provided context senses are illustrated in the first and second rows of each experiment. In general, haptic-related contexts are sufficient for the learned models to infer the shapes. With additional visual cues, the models start to correctly predict colors. For example, in the middle column of Fig. S6 (c) and (d), red-mixed colors are successfully inferred with -channel context; however, the model still fails to predict all color patterns because the other color information is missing. As more color information is given, our models successfully predict all color patterns, as shown in the right column of Fig. S6 (c) and (d).

Appendix E Missing-modality Problem

In addition to the experiments described in Fig. 5, more results are provided in Fig. S7. Note that the training loss is evaluated as a moving average over mini-batches, while the validation loss is estimated on the whole batch at the end of each epoch. This explains why the validation loss is sometimes lower than the training loss in the figures.

In general, all models tend to under-fit when they have never seen the full set of modalities during training. On the other hand, the models exposed to more modalities tend to yield a tighter negative ELBO.

We can observe a notable difference between the baseline and APoE in the settings that withhold individual modalities during training. Combined with the classification results in Fig. S4, we interpret this as PoE helping to train each individual expert. Inference in PoE has been understood as an agreement of all experts (Hinton, 2002); therefore, each expert becomes capable of performing inference independently as well as expressing its own uncertainty. On the other hand, the simple sum operation of the C-GQN (baseline) probably ends up relying on dominant signals and ignoring the rest, which leads to overfitting to the training distribution.

Appendix F Computational Time

Model      # of parameters            time per iter (ms)
           2     5     8     14       2     5     8      14
baseline   53M   28M   48M   51M      346   397   481    866
APoE       53M   29M   48M   51M      587   679   992    1189
PoE        58M   53M   92M   131M     486   790   1459   2059
Table 1: The number of parameters and computation time for each model for all experiments. Mini batch size is set to 1.

Table 1 shows the number of parameters and the computational time cost for all experiments. Each experiment is run with a single NVIDIA Tesla P100 GPU and four cores of an Intel Xeon E5-2650 2.20GHz CPU. PyTorch (Paszke et al., 2017), CUDA-9.0 (Nickolls et al., 2008), and cuDNN7 (Chetlur et al., 2014) are used for the implementations. All models share the same representation and renderer network architectures, and the same number of steps and hidden sizes are applied to the encoder and decoder architectures. For a fair comparison, the mini-batch size is set to 1 when measuring the costs.

In PoE, each expert contains a large network like ConvDraw, resulting in space cost for the inference networks. In APoE, the inference networks are integrated into a single expert-amortizer that serves all modalities. Thus, the space cost of the inference networks reduces to . As a result, the APoE model’s parameter size in the experiments is almost the same as the baseline’s, while it still provides the probabilistic information integration of PoE.

(a) Ground-truth images
(b) environment, i.e. .
(c) environment, i.e. .
(d) environment, i.e. .
Figure S6: Any-to-any cross-modal generation examples. Given context conditions, trained models are asked to generate all modality outputs (for a given set of queries) and these are combined to be displayed. The first row in each sub-figure displays five haptic context senses. The second row illustrates combined senses from different visual modalities. The third illustrates the predictions for given queries. For example, the second row in (b) depicts the five senses. The same row in (c) displays additional five senses combined with the ones in (b).

(a) environment, i.e. .
(b) environment, i.e. .
(c) environment, i.e. .
Figure S7: Results of missing-modality experiments for various multimodal scenarios. During training, limited combinations of modalities are presented. At test time, all combinations of the entire modalities are randomly selected.