As robots move into increasingly dynamic real-world environments, learning mechanisms become ever more important. However, learning robots are held back by several issues, including the potential unreliability of the learning process, long learning times, and the need for intensive human supervision. In light of these issues, an interesting learning mechanism is Self-Supervised Learning (SSL), which focuses on augmenting a robot's perception capabilities. Typically, in SSL the robot uses a trusted, primary sensor cue to train a secondary sensor cue with supervised learning. If learning is successful, the secondary sensor cue will give outputs similar to those of the primary sensor cue. For example, Stanley, the car that won the DARPA Grand Challenge,
used a laser scanner as the primary sensor to classify areas ahead as being part of the road or not. It used a color camera as the secondary sensor, and learned a mapping from colors to the class labels “road” or “not road”. Since the camera could evaluate the terrain much further into the distance than the (range-limited) laser scanner, Stanley could drive faster. This was important for winning the race.
SSL has the following beneficial properties: (i) the robot always retains access to the trusted primary cue, which can be used to ensure the safety of the system during and after learning, (ii) learning is supervised, which means that it is relatively fast and can build on an enormous amount of research in machine learning, (iii) since the supervised targets are provided by a robotic sensor, no human supervision is required and an ample amount of training data is available for machine learning algorithms such as deep neural networks (e.g., [7, 9]).
The main reason to perform SSL is that the two sensor cues have complementary properties. The example of Stanley showed that the secondary cue may have a longer range than the primary cue (see also [10, 11, 4, 12, 8]). In the literature, other types of complementarity have been studied as well. For instance, in [1] a robot first judges terrain traversability by means of haptic interaction, and uses SSL to learn this same capability with the camera. Another interesting example is given in [6], in which a flying robot selects a landing site by making use of optical flow from an onboard camera as the primary cue. The secondary sensor cue consists of image appearance features, which, after learning, allow the robot to select a landing site without moving.
Until now, studies on SSL have kept the two sensor cues separated. For example, Stanley did not fuse close-by vision-based road classifications with the laser-based classifications. Fusion in SSL raises several questions. For example, given that the secondary cue is learned on the primary cue's outputs, will its estimates not always be (much) worse? Will the estimates of the secondary cue not be statistically (too) dependent on the primary cue? Given that a ground truth is not available, can the robot determine the uncertainty of the secondary cue reliably enough for successful fusion? The answers to these questions cannot come only from empirical studies on SSL fusion. To answer them in a more generic way, a theoretical investigation is required.
The main contribution of this article lies in a theoretical analysis of the fusion of the primary and secondary cue in SSL. Employing a minimal model of SSL, a theoretical proof is provided that (1) shows that fusion in SSL can indeed lead to better results, and (2) for the given model determines the conditions on the estimation accuracy of the two cues under which fusion is indeed beneficial (Section 2). An additional contribution is the verification of the proposed SSL fusion scheme on robotic data (Section 3). In particular, SSL is applied to a scenario in which a drone has to estimate its height based on a barometer and a sonar sensor (see Figure 1).
2 Fusion in self-supervised learning
2.1 A minimal model for fusion in self-supervised learning
Figure 2-(a) shows the graphical probabilistic model used in the proof. The robot has two observations, $z_1$ and $z_2$. From these observations, it will have to infer the state $x$, which is not observed and is therefore shaded in gray. The graphical model shows that $z_1$ and $z_2$ are independent of each other given $x$. Not shown in the figure are the types of distributions from which the variables are drawn. For our minimal model, we will have: $x \sim N(0, \sigma_x^2)$, $z_1 \sim N(x, \sigma_1^2)$, and $z_2 \sim N(x, \sigma_2^2)$. Figure 2-(a) is a standard graphical model as can be found in the machine learning literature (e.g., [2]). Typically, it represents the assumptions that the designer, and hence the robot, has on the structure of the observation task. For the given ground-truth model, the optimal fusion estimate would be $E[x \mid z_1, z_2]$, but this supposes that the robot knows all parameters.
In self-supervised learning, the robot does not have any prior idea of the distribution of the complementary sensory cue. Often, it may not have any idea of the distribution of the variable to be estimated either. The reason is that both distributions will likely depend on the unknown environment in which the robot will operate. We represent variables for which the robot does not know the distribution by means of dashed lines in the graphical model. Figure 2-(b) shows that in our minimal model, the robot does not know the distributions of $x$ and $z_2$. It does know that $z_1 \sim N(x, \sigma_1^2)$.
Self-supervised learning has the robot learn a mapping $f$ from the complementary cue $z_2$ to the trusted cue $z_1$. This leads to a new variable $\hat{x}_2 = f(z_2)$. For this proof, we will have the robot fuse this variable with $z_1$ by making an assumption on the distribution of $\hat{x}_2$. Hence, in Figure 2-(c), $\hat{x}_2$ is shown with a solid line. The dependency of $\hat{x}_2$ on $z_2$ passes through the function $f$, which in our minimal model is linear with a single parameter $a$: $\hat{x}_2 = f(z_2) = a z_2$. In our case, the robot will assume that $\hat{x}_2 \sim N(x, \sigma_f^2)$. The assumption that $\hat{x}_2$ is centered on $x$ is, as we will see further below, incorrect, given the ground-truth model. For fusion, the robot needs to know the variance $\sigma_f^2$. Since it does not know the distributions of $x$ and $z_2$, it will estimate $\sigma_f^2$ on the basis of the data encountered. The main difficulty here is that the robot evidently does not know what $x$ is for each sample, so it will have to use a proxy for the real $x$. In our minimal model, the robot will use $z_1$ as a proxy for $x$. Finally, please note that the fact that $f$ is learned with the help of $z_1$ does not mean that $\hat{x}_2$ is conditionally dependent on $z_1$: given $z_2$ or $x$, $\hat{x}_2$ is independent of $z_1$.
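As an illustration, the generative side of this minimal model can be sketched in a few lines of Python. The zero-mean Gaussian prior and the standard deviation values below are illustrative assumptions of the sketch, not values taken from the experiments:

```python
import random

def sample_minimal_model(sigma_x=1.0, sigma_1=0.5, sigma_2=0.5):
    """Draw one sample (x, z1, z2) from the assumed ground-truth model."""
    x = random.gauss(0.0, sigma_x)    # hidden variable to estimate
    z1 = random.gauss(x, sigma_1)     # trusted primary cue
    z2 = random.gauss(x, sigma_2)     # complementary secondary cue
    return x, z1, z2

random.seed(0)
samples = [sample_minimal_model() for _ in range(50_000)]
```

Note that, given $x$, the two cues are sampled independently; this conditional independence is what allows $z_1$ to serve both as a training target for $f$ and as a proxy for $x$.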
2.2 Proof under which conditions fusion of $z_1$ and $\hat{x}_2$ leads to better estimates than $z_1$ alone
Here we will give a closed-form solution for the conditions under which a fused estimate $\hat{x}_{12}$ leads to a lower expected squared error than the estimate relying only on $z_1$, denoted by $\hat{x}_1$. We first determine what the estimates are for the two different cases. Please note that in both cases, the robot does not know the distribution of $x$. So, when only using $z_1$, $x$ is estimated by optimizing the likelihood:

$\hat{x}_1 = \arg\max_x \, p(z_1 \mid x) = z_1, \qquad (1)$

where we made use of the fact that the robot knows that $z_1 \sim N(x, \sigma_1^2)$. When fusing both cues, $x$ is estimated by optimizing the likelihood of both $z_1$ and $\hat{x}_2$:

$\hat{x}_{12} = \arg\max_x \, p(z_1 \mid x)\, p(\hat{x}_2 \mid x) = \frac{\sigma_f^2 z_1 + \sigma_1^2 \hat{x}_2}{\sigma_1^2 + \sigma_f^2}. \qquad (2)$
In the next subsections, we will use our knowledge of the ground-truth model to determine the associated expected estimation errors. The crux is that this knowledge allows us to predict what function $f$ and what estimate of $\sigma_f$ the robot will converge to, given sufficient data.
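The two estimators can be sketched in Python; the sketch assumes the variances $\sigma_1^2$ and $\sigma_f^2$ are already available:

```python
def estimate_primary(z1):
    # Maximum-likelihood estimate from the primary cue alone: x_hat_1 = z1.
    return z1

def estimate_fused(z1, x2_hat, var_1, var_f):
    # Inverse-variance weighted fusion of z1 and the learned cue x2_hat = f(z2),
    # treating both as Gaussian and centered on x.
    return (var_f * z1 + var_1 * x2_hat) / (var_1 + var_f)
```

With equal variances, the fused estimate reduces to the plain average of the two cues; as the assumed variance of one cue grows, its weight in the fusion shrinks accordingly.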
2.2.1 Expected squared error when using only $z_1$
Here, we determine the expected error when the robot only uses $z_1$:

$E[(x - \hat{x}_1)^2] = E[(x - z_1)^2] = E[x^2] - 2 E[x z_1] + E[z_1^2],$

which we can split into the following three parts. First:

$E[x^2] = \sigma_x^2,$

where $\sigma_x^2$ is a shorthand for the variance of the prior $p(x)$. Second:

$E[x z_1] = E[x (x + \epsilon_1)] = \sigma_x^2,$

where we made use of $z_1$ being Gaussian and centered on $x$: $z_1 = x + \epsilon_1$, with $\epsilon_1 \sim N(0, \sigma_1^2)$ independent of $x$. Third:

$E[z_1^2] = E[(x + \epsilon_1)^2] = \sigma_x^2 + \sigma_1^2.$

The three parts together lead to the expected squared error:

$E[(x - \hat{x}_1)^2] = \sigma_x^2 - 2\sigma_x^2 + \sigma_x^2 + \sigma_1^2 = \sigma_1^2. \qquad (7)$
2.2.2 Expected squared error when fusing $z_1$ and $\hat{x}_2$
Here we determine the expected error if the robot fuses $z_1$ with $\hat{x}_2$. The procedure is as follows. First we express $\hat{x}_2$ as a function of $z_2$. Then we retrieve the expression of the fused estimate $\hat{x}_{12}$, and finally we calculate the corresponding expected error $E[(x - \hat{x}_{12})^2]$.
The function $f$ maps $z_2$ to $z_1$. In the case of our ground-truth model, Figure 2-(a), what would the parameter $a$ converge to in $f(z_2) = a z_2$ if the robot has enough data? Well, we know that $z_1 \sim N(x, \sigma_1^2)$, so $E[z_1 \mid z_2] = E[x \mid z_2]$. So after many samples, we would expect the function $f$ to try and map $z_2$ to $E[x \mid z_2]$. The answer to the question then lies in the calculation of $E[x \mid z_2]$. Following [2] (p. 93), given the distributions $p(x)$ and $p(z_2 \mid x)$:

$p(x \mid z_2) = N\!\left(x \,\middle|\, \frac{\sigma_x^2}{\sigma_x^2 + \sigma_2^2}\, z_2, \; \left(\frac{1}{\sigma_x^2} + \frac{1}{\sigma_2^2}\right)^{-1}\right),$

implying that

$a = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_2^2} \qquad (9)$

and hence:

$\hat{x}_2 = f(z_2) = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_2^2}\, z_2.$
We have discussed before that, for fusion, the robot will assume that $\hat{x}_2$ is normally distributed and centered on $x$. This is actually incorrect, as $\hat{x}_2 = a z_2$, with $a = \sigma_x^2 / (\sigma_x^2 + \sigma_2^2) < 1$. This mapping leads to $\hat{x}_2 \mid x \sim N(a x, a^2 \sigma_2^2)$, which is not centered on $x$. Please remark that this actually makes $\hat{x}_2$ a better estimate of $x$ than $z_2$ itself, as $f$ implicitly takes into account the prior distribution of $x$.
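The convergence of the least-squares parameter to this shrinkage factor $\sigma_x^2/(\sigma_x^2 + \sigma_2^2)$ can be checked numerically. The sketch below assumes the same zero-mean Gaussian model, with illustrative standard deviations:

```python
import random

random.seed(1)
sigma_x, sigma_1, sigma_2 = 2.0, 0.5, 1.0
n = 100_000

xs  = [random.gauss(0.0, sigma_x) for _ in range(n)]
z1s = [random.gauss(x, sigma_1) for x in xs]   # self-supervised targets
z2s = [random.gauss(x, sigma_2) for x in xs]   # regression inputs

# Least-squares fit of f(z2) = a * z2 on the primary-cue targets z1.
a_hat = sum(z1 * z2 for z1, z2 in zip(z1s, z2s)) / sum(z2 * z2 for z2 in z2s)
a_theory = sigma_x**2 / (sigma_x**2 + sigma_2**2)   # = 0.8 for these values
print(a_hat, a_theory)
```

With enough samples, `a_hat` approaches `a_theory`, even though the regression never sees $x$ itself: the noise of the target $z_1$ averages out, while the prior over $x$ shrinks the slope below 1.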
Next we express the proxy variance $\sigma_f^2 = E[(z_1 - \hat{x}_2)^2]$ in terms of $\sigma_x$, $\sigma_1$, and $\sigma_2$. From [2] (p. 89) follows the marginal distribution of $z_2$:

$p(z_2) = N(z_2 \mid 0, \, \sigma_x^2 + \sigma_2^2).$

The variance of $\hat{x}_2$ is:

$\mathrm{Var}(\hat{x}_2) = a^2 (\sigma_x^2 + \sigma_2^2) = a \sigma_x^2,$

because $\hat{x}_2 = a z_2$, with $a$ from Eq. 9. The variance of $z_1$ is:

$\mathrm{Var}(z_1) = \sigma_x^2 + \sigma_1^2.$

Finally, the covariance between $z_1$ and $\hat{x}_2$ is:

$\mathrm{Cov}(z_1, \hat{x}_2) = a \, \mathrm{Cov}(z_1, z_2) = a \sigma_x^2.$

Putting all of this together, we have:

$\sigma_f^2 = \mathrm{Var}(z_1) + \mathrm{Var}(\hat{x}_2) - 2\,\mathrm{Cov}(z_1, \hat{x}_2) = \sigma_1^2 + (1 - a)\, \sigma_x^2 = \sigma_1^2 + \frac{\sigma_x^2 \sigma_2^2}{\sigma_x^2 + \sigma_2^2}.$
The formula for the robot's expected squared error, $E[(x - \hat{x}_{12})^2]$, can again be split into three parts. The first part follows from the expected squared error of $z_1$, and the second from that of $\hat{x}_2$. And third: $E[(x - z_1)(x - \hat{x}_2)] = 0$, since $\epsilon_1$ is independent of both $x$ and $\epsilon_2$. Putting these formulas together into Eq. 20 and simplifying gives the expected squared error of the fused estimate (Eq. 24).
2.2.3 When fusion is better than just using $z_1$
In order to prove that a robot employing self-supervised learning can obtain better estimates of $x$ than when using only $z_1$, we only need to show that there are conditions under which the expected error of Eq. 24 is smaller than that of Eq. 7. Given the ground-truth model, the expected fused error is smaller if the condition of Eq. 25 holds, or else if the condition of Eq. 26 holds.
Intuitively, these conditions correspond to (i) having a strong prior (Eq. 25) or (ii) $\hat{x}_2$ being sufficiently informative on $x$ (Eq. 26). The first case of the strong prior may not be easy to understand. It helps to consider that the learned secondary cue takes the prior into account, while $z_1$ does not (as we assume that the robot does not know anything about the prior distribution of $x$). Therefore, fusion with $\hat{x}_2$ is more advantageous if the prior is stronger.
2.2.4 Computational Verification
The theoretical findings above were verified with computational experiments, in which a data set was generated according to the ground-truth model from Figure 2-(a). Then, the program first learned the parameter $a$ of the function $f$ with least-squares regression. Subsequently, it estimated $\sigma_f$ by determining the variance of $z_1$ for samples in which $\hat{x}_2$ lies in a given interval. Finally, it fused the observations $z_1$ and $\hat{x}_2$ according to Eq. 2.
The fused error is compared to that of using $z_1$ alone. Given a large enough number of samples, the results converge to the values predicted by the theoretical analysis, for both the fused estimate and the estimate based on $z_1$ alone. The theoretical threshold on $\sigma_2$ beyond which fusion is no longer useful is a rather benign condition, as $\sigma_2$ can be more than 11 times as large as $\sigma_1$ in this case. Table 1 shows results for four different instances. The bottom case illustrates that fusion helps if the prior is strong enough, even if $\sigma_2$ is high. The MATLAB code is part of the supplementary material.
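A Python analogue of this verification (the original supplementary code is in MATLAB) can be sketched as follows. For simplicity, $\sigma_f$ is estimated here from the residuals $z_1 - \hat{x}_2$ rather than per interval, and all parameter values are illustrative; for these values, fusion lowers the squared error, in line with the analysis:

```python
import random

random.seed(2)
sigma_x, sigma_1, sigma_2 = 1.0, 0.5, 0.5
n = 200_000

xs  = [random.gauss(0.0, sigma_x) for _ in range(n)]
z1s = [random.gauss(x, sigma_1) for x in xs]
z2s = [random.gauss(x, sigma_2) for x in xs]

# Self-supervised learning: least-squares fit of f(z2) = a * z2 on z1.
a = sum(z1 * z2 for z1, z2 in zip(z1s, z2s)) / sum(z2 * z2 for z2 in z2s)
x2s = [a * z2 for z2 in z2s]

# Estimate sigma_f^2 with z1 as a proxy for the unknown x.
var_f = sum((z1 - x2) ** 2 for z1, x2 in zip(z1s, x2s)) / n

# Fuse the two cues with inverse-variance weighting (Eq. 2).
var_1 = sigma_1 ** 2
fused = [(var_f * z1 + var_1 * x2) / (var_1 + var_f)
         for z1, x2 in zip(z1s, x2s)]

def mse(est):
    """Mean squared error of an estimate list against the hidden xs."""
    return sum((x - e) ** 2 for x, e in zip(xs, est)) / n

print(mse(z1s), mse(x2s), mse(fused))
```

Note that `var_f` overestimates the true error variance of $\hat{x}_2$, because the proxy $z_1$ adds its own noise to the residuals; this only shifts the fusion weight toward the primary cue.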
|Error primary (theory)|Fusion error (theory)|
3 Case study: Height estimation with a barometer and sonar
In this section, we apply SSL fusion to a case study, in which a flying robot uses a barometer and sonar to estimate its height. As human designers we know how these sensors relate to the height, but in the case study we will assume that the robot only knows how to relate one of the sensors to the height (assumed to be the trusted cue $z_1$), and will regard the other sensor as the "unknown" $z_2$. The main goal of the case study is to see if fusion in an SSL setup can be beneficial in a real-world case, which may not comply with the assumptions of the theoretical analysis. A scenario with two scalar measurements was chosen in order to allow a direct comparison with the theoretical model.
3.1 Experimental setup
A Parrot AR drone is used for gathering the experimental data. The drone is flown inside of a motion tracking arena. It uses the open-source autopilot Paparazzi [5] to log the relevant sensor data, consisting of the pressure, the sonar readings, and the height provided by the motion tracking system. The height from the Optitrack motion tracking system is considered the most reliable of the three sensors and is hence used in this case as the 'ground-truth' value $x$.
The sonar measurements can be directly used as primary cue. If the pressure measurements are used as primary cue, they are mapped to a height estimate in meters with the following formula:

$h = \frac{R\, T_0}{M g} \ln\!\left(\frac{p_0}{p}\right),$

where $R$ is the gas constant, $T_0$ is the sea-level temperature, $M$ is the molar mass of the Earth's air, $g$ the gravitational acceleration, $p_0$ the sea-level pressure, and $p$ the measured pressure. After this conversion to height, there is still an offset and scaling factor, due to the fact that the drone has not been flying in the exact circumstances represented by the constants (at sea level, for instance). Typically, this offset is taken into account by calibrating the pressure measurement at take-off. Here, $h$ is mapped with a linear function to the Optitrack height (on the training set). The resulting heights are then used as the target values in the self-supervised learning, i.e., as the $z_1$ in the theoretical analysis.
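The conversion can be sketched as follows; the constant values below are standard-atmosphere values and are assumptions of this sketch, not the calibration used on the drone:

```python
import math

# Standard-atmosphere constants (assumed values for this sketch).
R  = 8.314       # gas constant, J/(mol K)
T0 = 288.15      # sea-level temperature, K
M  = 0.02896     # molar mass of air, kg/mol
G  = 9.80665     # gravitational acceleration, m/s^2
P0 = 101325.0    # sea-level pressure, Pa

def pressure_to_height(p):
    """Map a pressure measurement (Pa) to a height estimate (m)."""
    return (R * T0) / (M * G) * math.log(P0 / p)

print(pressure_to_height(101325.0))  # 0.0 at sea-level pressure
```

The residual offset and scale discussed above would then be removed by the linear calibration against the Optitrack height on the training set.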
Figure 3 gives insight into the data. The left plot shows the Optitrack ground-truth height (thick black line), the sonar measurements (purple line), and the corrected barometer measurements when used as primary cue (dark yellow line). The right plot shows the untransformed pressure measurements. These 'raw' measurement values are used when the barometer represents the secondary cue $z_2$. The magnitude of these measurements already shows that the distribution of the pressure measurements is not centered on the true height $x$, as assumed in the theoretical analysis.
The secondary cue $z_2$ is mapped to the primary cue $z_1$ with a machine learning method, which performs regression on the training set. Here we use a $k$-nearest-neighbor approach, so that possible non-linear relations can be captured (for instance, from the raw pressure measurements to the sonar height). Furthermore, in the experiment the standard deviation $\sigma_1$ of the primary cue is assumed to be known, and is here determined with the help of the ground truth from Optitrack on the training set. The conditional standard deviation $\sigma_f$ is determined on the validation set, only using the variables observed by the robot: $\hat{x}_2$ (the secondary cue mapped with the learned regression function) and $z_1$ (the primary cue). For each condition, we perform multiple experiments on the data. In each experiment, part of the data set is used for training, part is used for validation (determining the conditional standard deviation $\sigma_f$), and part is used for testing. For each experiment, we determine the mean absolute error of the primary cue $z_1$, that of $\hat{x}_2$, and that of their fusion.
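The regression step can be sketched with a simple nearest-neighbor estimator on scalar inputs. The value of $k$ and the toy data below are assumptions of the sketch (the number of neighbors used in the experiment is not restated here):

```python
def knn_regress(train_in, train_out, query, k=5):
    """Predict the target for `query` as the mean target of its k nearest
    training inputs (scalar inputs, absolute distance)."""
    pairs = sorted(zip(train_in, train_out), key=lambda p: abs(p[0] - query))
    nearest = pairs[:k]
    return sum(out for _, out in nearest) / k

# Toy usage: learn a mapping from a 'raw pressure'-like input to a height.
z2_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
z1_train = [2 * z for z in z2_train]   # some (here linear) relation
print(knn_regress(z2_train, z1_train, 2.5, k=2))  # prints 5.0
```

Because the prediction is a local average of training targets, the estimator can follow non-linear relations without an explicit model, at the cost of needing enough training samples near each query.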
3.2 Experimental results
There are two different experimental conditions: (i) the sonar is the primary cue and the barometer the secondary cue, and (ii) vice versa. Table 2 shows the main results from the experiments. In both conditions, the SSL fusion consistently gives (slightly) better results than just using the primary cue.
|Mean Absolute Error (m)|
|Primary sensor|Primary ($z_1$)|Secondary ($\hat{x}_2$)|Fusion|Successful fusion|
Let us analyze the case where the sonar is the primary cue, in order to see how well the distributions of the involved variables correspond to the assumptions in our minimal model. Figure 4 (top row) shows the distributions of the Optitrack height $x$, the error of the sonar height $z_1 - x$, and the error of the pressure-based height estimate learned with SSL, $\hat{x}_2 - x$. The corresponding means and standard deviations show that both the primary and the secondary cue are centered on $x$, and that the accuracy of the secondary cue is actually better than that of the primary cue ($\sigma_f < \sigma_1$). We compared each distribution against a normal distribution with the same mean and standard deviation. The Chi-square values confirm that the secondary cue indeed resembles its corresponding normal distribution most. However, a randomized statistical test [3] shows that even the histogram of $\hat{x}_2$ is unlikely to come from the corresponding normal distribution (with a very low $p$-value).
An analysis with the pressure as primary cue paints a similar picture. Two observations are interesting, though. The first is that in this condition the secondary cue is less accurate than the primary cue ($\sigma_f > \sigma_1$). The second follows from the bottom row of Figure 4, which shows the distributions when the pressure is the primary cue. The right plot shows the distribution of the sonar as secondary cue, which in this condition seems much more normally distributed than when the sonar is the primary cue (top row). Indeed, the Chi-square values are comparable for the primary and the secondary cue in this condition (still with a low $p$-value).
To summarize the findings of the analysis: the variables in the real-world experiment deviate from the model's assumptions on how they are distributed. Despite this, fusion still leads to better results. It may be, though, that the threshold value differs from the theoretical one. This is akin to using a Kalman filter when the involved distributions are not normal; the filter will usually still give a reasonable result, but estimation optimality is no longer guaranteed. Interestingly, this case study shows that the threshold expressed in Eq. 26 often cannot be validated directly. It would, for instance, not be very useful to look at $\sigma_2$ when the pressure is the secondary cue, as the raw pressure has wildly different values from the height $x$. It may be better to express the threshold in Eq. 26 in terms of $\sigma_f$. This can be done by using the relation $\sigma_f^2 = \sigma_1^2 + (1 - a)\sigma_x^2$, with $a$ defined in Eq. 9. If the terms on the right-hand side of Eq. 26 are represented by a single variable, this leads to a corresponding threshold on $\sigma_f$.
In this article, a theoretical analysis was performed to determine under which conditions it is favorable to fuse the primary and secondary cue in self-supervised learning. This analysis shows that fusion of the cues with the robot's knowledge is favorable when (i) the prior on the target value is strong, or (ii) the secondary cue is sufficiently accurate. When the assumptions of the analysis are valid, the conditions for the usefulness of fusion are rather benign. In the studied model, the standard deviation of the secondary cue can be more than ten times that of the primary cue, while fusion still gives better results. Although the employed model is rather minimal, the result that fusion can lead to better estimates extends to more complex cases, as is confirmed by the real-world case study. However, violations of the assumptions will likely change the threshold on the secondary cue's accuracy.
Given that normal distributions approximate various real-world phenomena quite well, the theoretical analysis may be applicable to a wide range of cases. Still, the generalization of the main finding, that SSL fusion can give better results than the primary cue alone, to more complex cases should be investigated further. To this end, future work could employ the current proof as a template. Moreover, it would be interesting to apply SSL fusion to a more complex, relevant case study than the one studied here. For instance, it would be highly interesting if SSL fusion could improve the performance of complex senses such as robotic vision.
-  José Baleia, Pedro Santana, and José Barata. On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances. Journal of Intelligent & Robotic Systems, 80(3-4):455–474, 2015.
-  C.M. Bishop. Pattern recognition and machine learning. Springer Science and Business Media, LLC, New York, NY, 2006.
-  Paul R. Cohen. Empirical methods for artificial intelligence. MIT Press, Cambridge, MA, 1995.
-  Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Marco Scoffier, Koray Kavukcuoglu, Urs Muller, and Yann LeCun. Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, 26(2):120–144, 2009.
-  G. Hattenberger, M. Bronz, and M. Gorraz. Using the paparazzi uav system for scientific research. In IMAV 2014, International Micro Air Vehicle Conference and Competition 2014, 2014.
-  HW Ho, C De Wagter, BDW Remes, and GCHE de Croon. Optical flow for self-supervised learning of obstacle appearance. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 3098–3104. IEEE, 2015.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Kevin Lamers, Sjoerd Tijmons, Christophe De Wagter, and Guido de Croon. Self-supervised monocular distance learning on a lightweight micro air vehicle. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 1779–1784. IEEE, 2016.
-  Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
-  David Lieb, Andrew Lookingbill, and Sebastian Thrun. Adaptive road following using self-supervised learning and reverse optical flow. In Robotics: Science and Systems, pages 273–280, 2005.
-  Andrew Lookingbill, John Rogers, David Lieb, J. Curry, and Sebastian Thrun. Reverse optical flow for self-supervised adaptive autonomous robot navigation. International Journal of Computer Vision, 74(3):287–302, 2007.
-  Urs A Muller, Lawrence D Jackel, Yann LeCun, and Beat Flepp. Real-time adaptive off-road vehicle navigation and terrain classification. In SPIE Defense, Security, and Sensing, pages 87410A–87410A. International Society for Optics and Photonics, 2013.
-  S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 23(9):661–692, 2006.