I Introduction & Related Work
Active sensing (AS) is one of the most fundamental problems and challenges in mobile robotics; it seeks to maximize the efficiency of an estimation task by actively controlling the sensing parameters. It can roughly be divided into two sub-tasks: the identification of a point of interest (PoI), e.g. an object or place, for exploration, i.e. answering the question "where to go next?", and the ability of a robot to navigate through the environment to reach a certain goal location without colliding with any obstacles en route. Of particular interest for heterogeneous robot teams is the case where AS is used to determine the locations to which the robots should move in order to acquire the most informative measurements.
The recent literature has seen a growing number of methods being proposed to tackle the task of autonomous navigation with deep reinforcement learning (DRL) algorithms.
These works formulate the navigation problem as a Markov decision process (MDP) or partially observable MDP (POMDP) that first takes in the sensor readings (color/depth images, laser scans, etc.) as observations and stacks or augments them into states, and then searches for the optimal policy by means of a deep neural network (DNN) that is capable of guiding the agent to goal locations. The first approach that performed well on such a generic task definition was the DQN. Multi-agent reinforcement learning (MARL) is the integration of multi-agent systems with reinforcement learning (RL) and thus lies at the intersection of the game theory and RL communities. However, while the automated discovery of structure in raw data by a DNN is a time-consuming and error-prone task, an additional pre-processing step that performs feature extraction on the data helps to make DQNs more robust, and to downsize and even transfer them more easily. Auto-encoders (AEs), variational auto-encoders (VAEs), and more recently disentangled variational auto-encoders have had a considerable impact on the field of DRL, as they encode the data into latent space features that are (ideally) linearly separable.
The objective of this work is to investigate a reinforcement learning approach for distributed sensing based on the latent space derived from multi-modal deep generative models, as depicted in Fig. 1. The main objective is to train a multi-modal VAE that integrates the information of different sensor modalities into a joint latent representation and then to generate one sensor modality from the corresponding other one via this joint representation. This model can therefore exchange multiple sensor modalities bi-directionally, for example translating features from laser scanner data to images and vice versa, and can learn a shared latent space distribution between uni- and multi-modal cases. Furthermore, we train a deep Q-network that controls robots equipped with uni-modal sensors directly on the latent space, with the objective of reducing uncertainty regarding detected objects. Our approach performs better than naive multi-robot exploration.
Our contribution makes use of deep neural networks for feature extraction, deep generative models for latent representations, and deep Q-networks for optimal control in heterogeneous multi-agent systems in order to achieve sufficient classification of objects in the environment. Since this contribution concentrates on the generative models as the central feature enabling our approach, we stress this part in Section II. Section III provides an overview of our application within RL, which is then evaluated in Section IV.
II Generative Models
The VAE is used in settings where only a single modality is present, in order to find a latent encoding (c.f. Fig. 2 right). On the other hand, when multiple modalities are available, as shown in Fig. 2 (mid.), it is less clear how to use the model, as one would need to train a VAE for each modality. Therefore, Suzuki et al. propose the joint multi-modal VAE (JMVAE), which trains two uni-modal encoders and a bi-modal encoder/decoder that share an objective function derived from the variation of information (VI) between the modalities. The uni-modal encoders are trained so that their distributions are close to that of the joint encoder, in order to build a coherent latent space posterior distribution.
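The coherence term of the JMVAE objective can be sketched for diagonal-Gaussian encoders as a KL regularizer that pulls the uni-modal encoders towards the joint encoder. The following is a minimal numpy sketch under that assumption; all function and variable names are our own, and the reconstruction terms are omitted:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + np.log(var_p) - np.log(var_q))

def jmvae_kl_regularizer(mu_joint, var_joint,
                         mu_x, var_x, mu_w, var_w, alpha=0.1):
    """Hypothetical sketch of the JMVAE regularizer: the usual prior KL
    on the joint encoder, plus terms that keep the uni-modal encoders
    q(z|x) and q(z|w) close to the joint encoder q(z|x,w)."""
    prior_kl = kl_diag_gauss(mu_joint, var_joint,
                             np.zeros_like(mu_joint), np.ones_like(var_joint))
    coherence = (kl_diag_gauss(mu_joint, var_joint, mu_x, var_x)
                 + kl_diag_gauss(mu_joint, var_joint, mu_w, var_w))
    return prior_kl + alpha * coherence
```

When all three encoders coincide with the unit-variance prior, the regularizer vanishes; any mismatch between the uni-modal and joint encoders adds a positive penalty weighted by `alpha`.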
With the JMVAE, we can extract joint latent features by sampling from the joint encoder at test time. While the objective of the original work is to exchange modalities bi-directionally, our primary concern is to find a meaningful posterior distribution; hence, we analyze their approach with a view to using all statistics provided by the encoder network.
III Application to RL
One trend in current RL approaches is to first learn interpretable and factorized representations of the sensor data, rather than directly learning internal representations from raw observations. Thus, (V)AEs are used to project high-dimensional input data into a representation in which either a single or a group of latent units are sensitive to changes in single ground-truth factors. We follow this approach with the intention of using all the statistics provided by an encoder network to derive reasonable navigation goals in a multi-agent setup.
The goal of an agent is to select navigation goals in the environment, i.e. to select actions in a manner that lowers the variance of observations or, equivalently, increases the amount of information about a PoI. We define the properties of the MDP as follows: the state is a global map of objects with their feature encodings. Each object can be observed by taking one action, or the game can be terminated before observing all PoIs by selecting a no-operation (NOP) action. Depending on the modality, an action samples from the corresponding uni-modal posterior. However, if there is a former observation by the respective other modality, the bi-modal encoder is applied to perform sensor fusion in place by generating the missing modality (c.f. Fig. 3). Furthermore, the reward is defined by the increase in information in the map after the observation of an object.
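One way to make the "increase in information" reward concrete is as the entropy reduction of a PoI's latent Gaussian after a new observation. This is a hypothetical formulation consistent with the Gaussian encodings used throughout; the function names are our own:

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy (in nats) of a diagonal Gaussian
    with per-dimension variances `var`."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))

def information_gain(var_before, var_after):
    """Candidate reward: reduction in entropy of a PoI's latent
    encoding after observing it (positive when variance shrinks)."""
    return gaussian_entropy(var_before) - gaussian_entropy(var_after)
```

Halving every variance component, for instance, yields a gain of 0.5·log 2 per latent dimension, while an observation that leaves the encoding unchanged yields zero reward.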
We apply a deep Q-network to derive the next action given the current map. For each agent, or more precisely each modality, a separate Q-network is trained. It is worth mentioning that there is no interaction between the agents; the learning process is thus decoupled, since all robots act independently of one another.
IV Experimental Setup & Evaluation
We performed experiments using two AMiRo mini-robots equipped with a camera and LiDAR to classify PoIs in the environment. The overall mapping and decision architecture is depicted in Fig. 1. Every robot explores the environment for PoIs and encodes its detections into a common map that is shared among all entities. Based on this environmental model, a modality-specific DQN decides which robot has to pursue which PoI to increase the information in the map. It is worth mentioning that the DQN only selects the target PoI and has neither spatial information nor direct control at the motor level. We use the DQN algorithm for training the deep Q-network and sample the test and training data from the Gazebo simulator.
Training the JMVAE on raw data yielded insufficient results. We conjecture that this effect is caused by the necessarily larger network architecture, which causes the weights in deep layers to collapse, and by the imbalance in the VAE's reconstruction loss term for varying modality dimensionality. To derive a proper latent distribution, we therefore apply neural-network-based scanline detectors to derive feature vectors for each modality (c.f. Fig. 3).
The experiment was performed on a comprehensive scenario with three classes (c.f. Table I). One modality is derived from a camera-based feature extractor that distinguishes between red and green PoIs; the other modality is derived from a LiDAR-based feature extractor that distinguishes between round and angular PoIs. The ability to unambiguously classify per modality is shown in Table I. We designed the JMVAE as depicted in Fig. 3, with a Gaussian prior with unit variance, a Gaussian variational distribution that is parametrized by the encoder network, and a latent space for which the encoder outputs the mean and variance of each dimension. All remaining layers are dense with 64 neurons and ReLU activation.
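The described encoder (dense ReLU trunk, Gaussian head emitting a mean and log-variance per latent dimension) can be sketched as a numpy forward pass. The feature and latent dimensionalities below are placeholders, not the values used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, relu=False):
    """Single dense layer, optionally with ReLU activation."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

# Hypothetical sizes: per-modality feature dim 4, latent dim 2.
feat_dim, hidden, latent = 4, 64, 2
W1 = rng.normal(0, 0.1, (feat_dim, hidden)); b1 = np.zeros(hidden)
W_mu = rng.normal(0, 0.1, (hidden, latent)); b_mu = np.zeros(latent)
W_lv = rng.normal(0, 0.1, (hidden, latent)); b_lv = np.zeros(latent)

def encode(x):
    """Uni-modal encoder: 64-unit dense ReLU layer, then a Gaussian
    head producing (mean, log-variance) for each latent dimension."""
    h = dense(x, W1, b1, relu=True)
    return dense(h, W_mu, b_mu), dense(h, W_lv, b_lv)

def sample(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
```

At test time, the mean and variance outputs are exactly the per-PoI statistics that are written into the shared map.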
Fig. 4 shows that the JMVAE is able to detect ambiguous classifications, as listed in Table I. Feeding in both modalities yields a clear separation of the three classes, while in the uni-modal cases the distributions of ambiguous classes collapse to their mean. Furthermore, the JMVAE drives up the variance, since the now collapsed distribution has to cover the original ones. This property is reasonable: the KL-divergence attempts to find the best representative (i.e. the mean), while the reconstruction loss forces the variance to extend over the non-collapsed classes during training.
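The variance increase for collapsed ambiguous classes follows directly from the law of total variance: a single Gaussian that moment-matches two class distributions inherits both their average variance and the spread of their means. A small numeric illustration (the values are made up, one latent dimension):

```python
import numpy as np

# Two classes that a uni-modal encoder cannot tell apart,
# each encoded as N(mu_i, var_i) in one latent dimension.
mus = np.array([-1.0, 1.0])
vars_ = np.array([0.25, 0.25])

# Moment-matched single Gaussian covering both classes:
mu_merged = mus.mean()
# law of total variance: E[var] + Var[mean]
var_merged = vars_.mean() + mus.var()
```

Here `var_merged` is 1.25, five times larger than either class variance, which mirrors the inflated uni-modal variances observed in Fig. 4.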
The number of possible states of the map is 384 for this relatively comprehensible experiment, assuming only binary PoI encodings. With continuous encodings, as offered by the JMVAE, the task of controlling the robots by means of handcrafted architectures based on the encodings becomes infeasible. Therefore, we train a DQN on the latent encodings. The reward is shaped as follows: the agent is rewarded if an observation led to an increase of information and penalized if there is no increase in information, while quitting the exploration by NOP always results in a fixed reward. The remaining crucial DQN hyperparameters (e.g. discount factor and exploration schedule) follow the reference implementation. The Q-network has two dense layers with 24 neurons each, ReLU activation, and a linear output layer trained by the Adam optimization algorithm.
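The per-modality Q-network described above is small enough to write out as a plain numpy forward pass. The state and action dimensionalities below are placeholders (the state would be the map encoding, the actions one per PoI plus NOP):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical sizes: map encoding of dim 8, 4 PoIs + NOP = 5 actions.
state_dim, n_actions = 8, 5
W1 = rng.normal(0, 0.1, (state_dim, 24)); b1 = np.zeros(24)
W2 = rng.normal(0, 0.1, (24, 24));        b2 = np.zeros(24)
W3 = rng.normal(0, 0.1, (24, n_actions)); b3 = np.zeros(n_actions)

def q_values(state):
    """Forward pass of the per-modality Q-network: two dense ReLU
    layers with 24 neurons each and a linear output head."""
    h = relu(state @ W1 + b1)
    h = relu(h @ W2 + b2)
    return h @ W3 + b3

def greedy_action(state):
    """Greedy policy: pick the PoI (or NOP) with the highest Q-value."""
    return int(np.argmax(q_values(state)))
```

Each modality trains its own copy of these weights; since the agents never interact, the two networks can be trained entirely independently.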
We evaluated the training by the total reward the agent collects in an episode, averaged over 512 randomly sampled environments. The average total reward metric is shown in Fig. 5; it demonstrates the successful adaptation of each modality's network to our task (c.f. application video: https://goo.gl/Edi92T).
| class vs. modality | camera | LiDAR | camera & LiDAR |
| --- | --- | --- | --- |
| green cylinder (1) | ✓ | | ✓ |
| red cylinder (2) | | | ✓ |
| red cube (3) | | ✓ | ✓ |
V Conclusions and Future Work
A complete framework implementation has been proposed that is able to learn and control multiple robots with various sensor modalities in a comprehensive three-class example. Further work will adapt VAEs to multiple classes and introduce a generic mapping approach.
-  L. Tai, J. Zhang, M. Liu, J. Boedecker, and W. Burgard, “A Survey of Deep Network Solutions for Learning Control in Robotics: From Reinforcement to Imitation,” vol. 14, no. 8, pp. 1–19, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” pp. 1–9, 2013.
-  A. Nowe, P. Vrancx, and Y.-M. D. Hauwere, “Game Theory and Multi-agent Reinforcement Learning,” 2012, vol. 12, no. January.
-  I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving Zero-Shot Transfer in Reinforcement Learning,” 2017.
-  D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-Supervised Learning with Deep Generative Models,” 2014.
-  M. Suzuki, K. Nakayama, and Y. Matsuo, “JOINT MULTIMODAL LEARNING WITH DEEP GENERATIVE MODELS,” pp. 1–12, 2017.
-  S. Herbrechtsmeier, T. Korthals, T. Schöpping, and U. Rückert, “AMiRo: A modular & customizable open-source mini robot platform,” 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” vol. 518, no. 7540, pp. 529–533, 2015.