I Introduction & Related Work
Active sensing (AS) is one of the most fundamental problems and challenges in mobile robotics; it seeks to maximize the efficiency of an estimation task by actively controlling the sensing parameters. It can roughly be divided into two subtasks: the identification of a point of interest (PoI, e.g. an object or place) for exploration, which answers the question "where to go next?", and the ability of a robot to navigate through the environment to reach a certain goal location without colliding with any obstacles en route. Of particular interest for heterogeneous robot teams is the case where AS is used to determine the locations to which the robots should move in order to acquire the most informative measurements.
The recent literature has seen a growing number of methods being proposed to tackle the task of autonomous navigation with deep reinforcement learning (DRL) algorithms. The works surveyed in [1] formulate the navigation problem as Markov decision processes (MDPs) or partially observable MDPs (POMDPs) that first take in the sensor readings (color/depth images, laser scans, etc.) as observations and stack or augment them into states, and then search for the optimal policy by means of a deep neural network (DNN) that is capable of guiding the agent to goal locations. The first approach that performed well on such a generic task definition was DQN [2]. Multiagent reinforcement learning (MARL) is the integration of multiagent systems with reinforcement learning (RL); it thus lies at the intersection of the game theory and RL communities [3]. However, since the automated discovery of structure in raw data by a DNN is a time-consuming and error-prone task, an additional preprocessing step that performs feature extraction on the data helps to make DQNs more robust and to downsize and even transfer them more easily.
Autoencoders (AEs), variational autoencoders (VAEs), and more recently disentangled variational autoencoders (β-VAEs) have had a considerable impact on the field of DRL, as they encode the data into latent space features that are (ideally) linearly separable [4].
The objective of this work is to investigate a reinforcement learning approach for distributed sensing based on the latent space derived from multimodal deep generative models, as depicted in Fig. 1. The main objective is to train a multimodal VAE that integrates the information from different sensor modalities into a joint latent representation and can then generate the data of one sensor from that of the other via this joint representation. This model can therefore exchange multiple sensor modalities bidirectionally, for example from laser scanner features to images and vice versa, and can learn a shared latent space distribution between the uni- and multimodal cases. Furthermore, we train a deep Q-network that controls robots equipped with unimodal sensors directly on the latent space, with the objective of reducing uncertainty regarding detected objects. Our approach performs better than naive multi-robot exploration.
Our contribution combines deep neural networks for feature extraction, deep generative models for latent representations, and deep Q-networks for optimal control in heterogeneous multiagent systems in order to achieve sufficient classification of objects in the environment. Since the generative models are the central feature enabling our approach, we focus on them in Section II. Section III provides an overview of our application within RL, which is then evaluated in Section IV.
II Generative Models
The VAE proposed by [5] is used in settings where only a single modality is present, in order to find a latent encoding (cf. Fig. 2, right). When multiple modalities are available, as shown in Fig. 2 (mid.), it is less clear how to use the model, since one would need to train a separate VAE for each modality. Suzuki et al. [6] therefore propose a joint multimodal VAE (JMVAE) that is trained on two unimodal encoders and a bimodal encoder/decoder, which share an objective function derived from the variation of information (VI) between the two modalities. The unimodal encoders are trained so that their distributions stay close to that of the joint encoder, which yields a coherent latent space posterior distribution.
With the JMVAE, we can extract joint latent features by sampling from the joint encoder at testing time. While the objective of [6] is to exchange modalities bidirectionally (from one modality to the other and vice versa), our primary concern is to find a meaningful posterior distribution; hence, we analyze their approach with a view to using all statistics provided by the encoder network.
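The requirement that the unimodal encoder distributions stay close to the joint encoder distribution can be quantified with a Kullback-Leibler divergence. The following is a minimal sketch, assuming diagonal Gaussian encoders; the function name `kl_diag_gaussians` and all numeric values are illustrative assumptions and are not taken from [6].

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for two Gaussians with diagonal covariance.

    All arguments are equal-length sequences holding the per-dimension
    means and (strictly positive) variances.
    """
    kl = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        kl += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

# Hypothetical encoder outputs for one observation (illustrative values only).
joint = ([0.2, -1.0], [0.05, 0.10])     # joint encoder: (means, variances)
unimodal = ([0.3, -0.4], [0.50, 0.80])  # unimodal encoder: (means, variances)

# Penalty that shrinks as the unimodal posterior approaches the joint one.
penalty = kl_diag_gaussians(*joint, *unimodal)
```

The divergence is zero only when both distributions coincide, so adding such a term to the training loss pulls each unimodal posterior towards the joint posterior.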
III Application to RL
One trend in current RL approaches is to first learn interpretable and factorized representations of the sensor data rather than to learn internal representations directly from raw observations [4]. Thus, (V)AEs are used to project high-dimensional input data into a representation in which either a single latent unit or a group of latent units is sensitive to changes in a single ground-truth factor. We follow this approach with the intention of using all the statistics provided by an encoder network to derive reasonable navigation goals in a multiagent setup.
The goal of an agent is to select navigation goals in the environment by choosing actions that lower the variance of the observations or, put differently, increase the amount of information about a PoI. We define the properties of the MDP as follows. The state is a global map of objects, each described by a feature. Each object can be observed by taking one action, or the game can be terminated before all PoIs have been observed by selecting no operation (NOP). Depending on the modality of the agent, an action samples from the posterior of the corresponding unimodal encoder. However, if a former observation by the other modality exists, the encoder is applied to perform sensor fusion in place by generating the missing modality (cf. Fig. 3). Furthermore, the reward is defined by the increase in information after an object has been observed by an action.
We apply a deep Q-network to derive the next action given the current map. A separate Q-network is trained for each agent or, more precisely, for each modality. There is no interaction between the agents; the learning process is therefore decoupled, since all robots act independently of one another.
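The MDP described above can be illustrated with a toy environment. This is a sketch under strong assumptions rather than the implementation used in this work: the class name `PoIMapEnv`, the scalar variance per object, the `shrink` factor that stands in for an observation (including any sensor fusion), and the reward values are all hypothetical.

```python
class PoIMapEnv:
    """Toy MDP whose state is a map of per-object uncertainties.

    Assumptions (hypothetical): uncertainty is one variance per object,
    an observation multiplies it by `shrink` down to `min_var`, and the
    reward values +1 / -1 / 0 are placeholders.
    """

    def __init__(self, n_objects=3, prior_var=1.0, shrink=0.5, min_var=0.1):
        self.n_objects = n_objects
        self.nop = n_objects  # index of the NOP action that ends the episode
        self.prior_var = prior_var
        self.shrink = shrink
        self.min_var = min_var
        self.reset()

    def reset(self):
        """Start with the same prior uncertainty for every object."""
        self.variances = [self.prior_var] * self.n_objects
        return tuple(self.variances)

    def step(self, action):
        """Apply one action and return (state, reward, done)."""
        if action == self.nop:
            return tuple(self.variances), 0.0, True
        old = self.variances[action]
        new = max(self.min_var, old * self.shrink)
        self.variances[action] = new
        # Positive reward only if the observation reduced the uncertainty.
        reward = 1.0 if new < old else -1.0
        return tuple(self.variances), reward, False


env = PoIMapEnv(n_objects=3)
state = env.reset()
state, reward, done = env.step(0)              # observe object 0
state, final_reward, done = env.step(env.nop)  # stop exploring
```

With these placeholder settings, repeatedly observing the same object pays off only until its variance reaches `min_var`; afterwards the agent is better off choosing another object or NOP.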
IV Experimental Setup & Evaluation
We performed experiments using two AMiRo mini-robots equipped with a camera and LiDAR [7] to classify PoIs in the environment. The overall mapping and decision architecture is depicted in Fig. 1. Every robot explores the environment for PoIs and encodes each detection into a common map that is shared among all entities. Based on this environmental model, a modality-specific DQN decides which robot has to pursue which PoI to increase the information in the map. Note that the DQN only selects the target PoI; it has neither spatial information nor direct control at the motor level. We use the DQN algorithm of [8] to train the deep Q-network and sample the training and test data from the Gazebo simulator.
Training the JMVAE on raw data gave insufficient results. We conjecture that this effect is caused by the necessarily larger network architecture, which causes the weights in deep layers to collapse, and by the imbalance in the VAE’s reconstruction loss term for modalities of differing dimensionality. To derive a proper latent distribution, we therefore apply neural-network-based feature extractors, among them scanline detectors, to derive a feature vector for each modality (cf. Fig. 3).
The experiment was performed on a comprehensible scenario with three classes (cf. Table I). The camera modality is derived from a feature extractor that distinguishes between red and green PoIs. The LiDAR modality is derived from a feature extractor that distinguishes between round and edgy PoIs. The ability to classify unambiguously per modality is shown in Table I. We designed the JMVAE as depicted in Fig. 3 with a Gaussian prior with unit variance, a Gaussian variational distribution that is parametrized by the encoder network, and a latent encoding that incorporates the sampled mean and variance for each dimension. All remaining layers are dense with 64 neurons and ReLU activation.
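The role of the Gaussian variational distribution and the unit-variance prior can be made concrete with a short sketch. It assumes that the encoder outputs a mean and a log-variance per latent dimension, which is a common convention but not confirmed for this architecture; the helper names `sample_latent`, `kl_to_unit_prior`, and `map_feature` are hypothetical.

```python
import math
import random

def sample_latent(mean, log_var, rng=random):
    """Reparameterised sample z = mean + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_unit_prior(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ), summed over all dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mean, log_var))

def map_feature(mean, log_var):
    """Concatenate means and variances so that both statistics enter the map."""
    return list(mean) + [math.exp(lv) for lv in log_var]
```

Storing the variance next to the mean, as in `map_feature`, is what lets a downstream controller distinguish a confident encoding from an ambiguous one.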
Fig. 4 shows that the JMVAE is able to detect the ambiguous classifications listed in Table I. Feeding in both modalities yields a clear separation of the three classes, while in the unimodal cases the distributions of ambiguous classes collapse to their mean. Furthermore, the JMVAE drives up the variance, because the collapsed distribution now incorporates the original ones. This property is reasonable, as the KL divergence attempts to find the best representative (i.e. the mean), while the reconstruction loss forces the variance to extend to the non-collapsed classes during training.
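The growth in variance can be illustrated by moment matching a single Gaussian to two class distributions. This is only an analogy under the assumption of equally weighted, one-dimensional classes; it is not the computation performed by the JMVAE, and the function name `moment_match` and the numbers are illustrative.

```python
def moment_match(means, variances):
    """Mean and variance of an equally weighted 1-D Gaussian mixture."""
    n = len(means)
    mean = sum(means) / n
    # Law of total variance: average variance plus the spread of the means.
    var = sum(variances) / n + sum((m - mean) ** 2 for m in means) / n
    return mean, var

# Two ambiguous classes centred at -1 and +1, each with variance 0.1.
mean, var = moment_match([-1.0, 1.0], [0.1, 0.1])
```

The matched Gaussian sits at the mean of the two classes, and its variance (about 1.1) is far larger than the 0.1 of either class, which mirrors the behaviour described above.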
The number of possible states of the map is 384 for this relatively comprehensible experiment, assuming only binary PoI encodings. With continuous encodings, as offered by the JMVAE, controlling the robots by means of handcrafted architectures based on the encodings becomes infeasible. Therefore, we train a DQN. The reward is shaped by three cases: an observation that leads to an increase of information, an observation without an increase in information, and quitting the exploration by NOP, each of which is assigned a fixed reward. The other crucial parameters are those of the DQN algorithm (cf. [8]). The Q-network has two dense layers with 24 neurons, ReLU activation, and a linear output layer, and is trained with the Adam optimization algorithm.
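Two ingredients of DQN training in the style of [8] are the one-step Q-learning target and epsilon-greedy exploration. The sketch below shows both in plain Python, with the network replaced by a list of Q-values; the function names and the example numbers are assumptions for illustration and do not reproduce the hyperparameters of this work.

```python
import random

def td_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target towards which Q(s, a) is regressed."""
    if done:
        return reward  # no bootstrapping after a terminal action such as NOP
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for three PoIs plus NOP in some map state.
q_next = [0.5, 2.0, -0.3, 0.0]
target = td_target(reward=1.0, next_q_values=q_next, gamma=0.9, done=False)
action = epsilon_greedy(q_next, epsilon=0.0)  # purely greedy choice
```

In a full DQN, `next_q_values` would come from a separate target network and the targets would be computed on minibatches drawn from a replay memory, as described in [8].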
We evaluated the training by the total reward the agent collects in an episode, averaged over 512 randomly sampled environments. The average total reward metric is shown in Fig. 5; it demonstrates the successful adaptation of each modality’s network to our task (cf. application video, https://goo.gl/Edi92T).
Table I: class vs. modality (✓ = class can be classified unambiguously)
class               camera   LiDAR   camera & LiDAR
green cylinder (1)  ✓        -       ✓
red cylinder (2)    -        -       ✓
red cube (3)        -        ✓       ✓
V Conclusions and Future Work
We have proposed a complete framework implementation that learns to control multiple robots with different sensor modalities in a comprehensible three-class example. Further work will adapt the VAEs to multiple classes and introduce a generic mapping approach.
References
 [1] L. Tai, J. Zhang, M. Liu, J. Boedecker, and W. Burgard, “A Survey of Deep Network Solutions for Learning Control in Robotics: From Reinforcement to Imitation,” vol. 14, no. 8, pp. 1–19, 2016.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” pp. 1–9, 2013.
 [3] A. Nowé, P. Vrancx, and Y.-M. De Hauwere, “Game Theory and Multiagent Reinforcement Learning,” 2012, vol. 12, no. January.
 [4] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving Zero-Shot Transfer in Reinforcement Learning,” 2017.
 [5] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-Supervised Learning with Deep Generative Models,” 2014.
 [6] M. Suzuki, K. Nakayama, and Y. Matsuo, “Joint Multimodal Learning with Deep Generative Models,” pp. 1–12, 2017.
 [7] S. Herbrechtsmeier, T. Korthals, T. Schöpping, and U. Rückert, “AMiRo: A modular & customizable open-source mini robot platform,” 2016.
 [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” vol. 518, no. 7540, pp. 529–533, 2015.