Self-awareness models make it possible for an agent to evaluate whether the situations it faces at a given time correspond to previous experiences. Self-aware computational models have been studied and several architectures have been introduced [1, 2, 3, 4]. Such models have to provide a framework in which autonomous decisions and/or teleoperation by a human can be integrated, as a capability of the device itself to dynamically evaluate the contextual situation. Recent progress in signal processing and machine learning allows an agent to obtain a self-awareness model from stored multi-sensorial data coming from previous, successfully completed experiences. Self-awareness layers model situations perceived through different sensorial modalities and can be integrated in order to build a uniform structure of cross-modal self-awareness for an agent. Using such models, the agent gains the ability either to predict the future evolution of a situation (e.g. for internal resource modulation) or to detect potentially unmanageable situations. This “sense of the limit” allows an agent that predicts potential abnormalities with respect to previous experiences to involve a human operator for support in due time. In this sense, the capability of detecting abnormal situations is an important feature of self-awareness models, as it allows autonomous systems to anticipate in time their situation/contextual awareness about the effectiveness of the decision-making sub-modules [5, 6].
A two-layer self-awareness model has previously been proposed, consisting of a Shared Layer (SL) and a Private Layer (PL). The analysis of observed moving agents for understanding the normal/abnormal dynamics in a given scene from an external viewpoint is an active and emerging research field [7, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. One common approach is to detect abnormalities as deviations from externally observed Environment Centered (EC) models. EC models can be considered shared, as they concern observation variables that are externally accessible to multiple agents. However, when a mobile observer agent is placed in the same EC reference system [7, 5, 8, 18], the observer can also use its own private variables to detect the normal/abnormal flow of a situation from its own viewpoint. The agent relates its perceived private variables to the actions that it performs at a given location of the environment. An abnormality can be perceived by EC models as a location-dependent behaviour that the model did not predict.
However, an agent can also detect an abnormality with respect to what it is used to observing from a first-person viewpoint while performing the same task. This knowledge can only be estimated by the agent itself, by means of its private perception variables. A self-awareness model learned from multi-sensorial data acquired in first person by the agent makes it possible to detect abnormalities while the agent performs the same task for which an SL self-awareness model has been obtained. Such a model can be described as the PL of self-awareness. An external observer has no access to such information (unless an explicit communication link is established with the agent) and will not be able to detect PL abnormalities, although it can still detect abnormalities using the SL model, if available. Thus, a well-trained model for PL self-awareness allows an agent to evaluate abnormalities with a more complete information set, based on the joint availability of PL and SL models, as shown in previous work. However, previous works mostly rely on a high level of supervision to learn PL self-awareness models [19, 7, 6, 12, 20, 4], while in this work we propose a weakly-supervised method based on a hierarchy of cross-modal Generative Adversarial Networks (GANs) for establishing self-awareness on the PL. This model can not only be trained in a self-supervised manner but can also provide information that boosts the SL model: it can provide a further normality representation that enriches the one obtained in an unsupervised way from the SL, and vice versa. Both the SL and PL learned models can be used to predict the dynamics of a vehicle performing a task. They can also be used to describe the normal behavioural conditions of the agent with respect to the two types of observations. Furthermore, a cross-correlation of the private and shared perspectives over the same phenomenon would provide a more complete capability for detecting incoming anomalies [19, 7].
This paper proposes a novel method to learn the PL model using an incremental hierarchy of GANs. GANs are deep networks commonly used to generate data, and they have shown good performance in learning the distribution of data [22, 23, 24, 25]. Despite the great power of GANs for modeling data distributions, it has been observed that GANs fail to learn highly complex distributions in which the data is very diverse. In other words, the GAN fails to learn the diversity: the generator fails to create diverse samples and the discriminator is not able to classify them as fake. This is a well-known problem in training GANs and several works try to tackle it [27, 28]. A natural solution for estimating a data distribution with huge diversity is to break it down into smaller sets and estimate it as a mixture of multiple small distributions [29, 30, 31]. The nature of an autonomous agent's self-awareness model demands learning a highly diverse distribution. The challenge in such tasks is not only learning a complex data distribution, but also that the source of supervision is limited. The concept of learning a mixture of multiple distributions, starting from simple and moving to complex, has been explored before [32, 7, 33, 34, 28], but these works mostly rely on a high level of supervision. Inspired by [28, 33], we propose a self-supervised method that uses a set of cross-modal GANs to solve small sub-problems. These GANs are stacked hierarchically and learn the small distributions at different levels of the hierarchy. The first level is trained using a provided subset of data that includes the simplest situations; more complex situations that emerge over time will be abnormalities with respect to such models, and their description can generate additional models. The next levels of the hierarchy are trained with the supervision provided by the first level. Namely, the scores of the first-level discriminator are used to approximate the complexity of the data. This can be seen as supervision for detecting whether a new model needs to be learned and, when this happens, for estimating the new model that has to be incrementally added to the hierarchy.
The main novelty of the proposed approach is a weakly-supervised strategy to divide and solve a complex problem using GANs. Furthermore, to the best of our knowledge, this is the first time that discriminator scores are used to approximate the complexity of a distribution. The discriminator scores correspond to the error (i.e., the innovations with respect to the other models already present in the hierarchy of GANs). As a result, the method is able to model highly diverse distributions, and it can be seen as a tool for finding a sort of set of orthogonal basis functions by subdividing the data into simpler sets.
2 Agent Embodied Self-Awareness Model
The PL of the self-awareness model consists of a hierarchical structure of cross-modal GANs. These GANs are trained to learn the normality using a sequence of images synchronously collected from a first-person viewpoint, paired with their corresponding optical-flow maps. In order to understand the relation between these two modalities, the hierarchy of cross-modal GANs is adopted and trained in a weakly-supervised manner. The only supervision here is the provided subset of normal data used to train the first level of the hierarchy, which we call the Base GAN. The Base GAN provides the reference for the next levels of the hierarchy, and all further levels are trained in a self-supervised manner. The rest of this section explains the procedure for learning a single cross-modal GAN, the construction of the hierarchy of GANs, and finally the criteria for anomaly detection using this hierarchical structure.
Cross-modal GANs for learning the normality: GANs are deep networks commonly used to generate data (e.g., images) and are trained using only unsupervised data. The supervisory information in a GAN is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$). This competition between $G$ and $D$ helps boost the ability of both. To learn the normal pattern of the observed scene, two channels are used, appearance (i.e., raw pixels) and motion (optical-flow images), for two cross-channel tasks. In the first task, optical-flow images are generated from the original frames, while in the second task appearance information is estimated from an optical-flow image. Specifically, let $F_t$ be the $t$-th frame of a training video and $O_t$ the optical flow obtained using $F_t$ and $F_{t+1}$. $O_t$ is computed using the optical-flow estimation method of Brox et al. Two networks are trained: $N^{F \to O}$, which is trained to generate optical flow from frames, and $N^{O \to F}$, which generates frames from optical flow. In both cases, inspired by [23, 24], our networks are composed of a conditional generator $G$ and a conditional discriminator $D$. $G$ takes as input an image $x$ and outputs an image $r$ of the same dimensions as $x$ but represented in a different modality. In the proposed architecture both $G$ and $D$ are fully-convolutional networks. The $G$ network is a U-Net architecture, an encoder-decoder with skip connections that help preserve important local information. For $D$, the PatchGAN discriminator is adopted, which is based on a “small” fully-convolutional discriminator $\hat{D}$. The output of $\hat{D}$ is a score map, which can be seen as the encoded representation of the discriminator. Additional details about the training can be found in [23, 24].
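For orientation, the standard conditional-GAN objective of the image-to-image translation framework that [23, 24] build on can be sketched as follows. Here $G$ and $D$ are the conditional generator and discriminator, $x$ is the input image (a frame or an optical-flow map), $y$ is the corresponding target in the other modality, and $\lambda$ weights the L1 reconstruction term. The noise input of the generator is omitted, and the exact losses and weights used in this work are not stated in this section, so this should be read as the generic formulation rather than the precise training objective.

```latex
\begin{align}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big], \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_1\big], \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G).
\end{align}
```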
Hierarchy of cross-modal GANs: As reviewed in Sec. 1, the assumption is that the distribution of the normality patterns is highly diverse. In order to learn such a distribution, we suggest a hierarchical strategy that splits the different distributions among the hierarchical levels, so that each subset of the training data is used to train a GAN for a different level. To construct the proposed hierarchy of GANs, a recursive procedure is adopted.
As shown in Alg. 1, the procedure takes two sets as input. $\mathcal{Z}$ is the entire normal sequence of training data, a set of coupled Frame-Motion maps $\mathcal{Z} = \{(F_t, O_t)\}_{t=1,\dots,N}$, where $F_t$ is the $t$-th frame, $O_t$ is its optical-flow map, and $N$ is the total number of training samples. $\mathcal{S}_l$ is a subset of $\mathcal{Z}$, provided to train the GANs at level $l$ of the hierarchy. For instance, for the first-level GANs, the initial set $\mathcal{S}_0$ is used to train the two cross-modal networks $N^{F \to O}$ and $N^{O \to F}$ (jointly denoted by $\mathcal{G}_0$). Note that the only supervision here is the initial set $\mathcal{S}_0$ used to train the first level of the hierarchy; the next levels are built using the supervision provided by the first level of GANs.
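After training the first-level GANs $\mathcal{G}_0$, we feed their two generators, $G^{F \to O}$ and $G^{O \to F}$, with each frame $F_t$ of the entire set $\mathcal{Z}$ and its corresponding optical-flow image $O_t$, respectively. The generators produce the predicted Frame-Motion couples $\hat{\mathcal{Z}} = \{(\hat{F}_t, \hat{O}_t)\}_{t=1,\dots,N}$, where $\hat{O}_t$ and $\hat{F}_t$ are the $t$-th predicted optical flow and predicted frame, respectively.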
During the training phase of the GANs, the output of the patch-based discriminator (the encoded representation of the discriminator) over all grid positions is averaged, and this provides the final score of the discriminator with respect to the input. For constructing the next level of the hierarchy, we directly use the averaged discriminator scores as a “detector”, which is run over the grid to detect abnormality in the input frame. Each discriminator models a decision boundary in the learned feature space that separates the densest area of the distribution from the rest of the space. Outside this area lie both non-realistic generated images and real, unseen events. Our hypothesis is that the latter lie outside the discriminator’s decision boundaries because they represent situations never observed during training, and are hence treated by the discriminator as outliers. In other words, the distance between the discriminator encoded representation of the prediction and the encoded representation obtained by observing the new image/optical-flow couple is higher when the input sample represents a situation never observed during training. This makes it possible to use the scores as information for detecting the different distributions of data on which the next levels of GANs are trained.
In light of the above, the distance maps between the observations (the Frame-Motion couples of the training sequence) and the predictions (the couples generated by the GANs) are computed and then clustered by a self-organizing map (SOM). Clusters with high average scores are considered new distributions. The newly detected distributions form the new subsets used to train new GANs. All the networks are stacked into the next levels to construct the entire hierarchical structure. This incremental nature makes the proposed method a powerful model for learning a very complex distribution of data in a self-supervised manner.
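The recursive construction described above can be sketched in a few lines. This is only an illustrative skeleton under stated assumptions: `train`, `distance`, and `cluster` are hypothetical callables standing in for GAN training, the prediction/observation score distance, and SOM clustering; `score_threshold` and `max_levels` are hypothetical parameters; and the choice to re-examine only the newly detected subset at the next level is an assumption, since Alg. 1 is not reproduced in this text. The toy stand-ins treat each sample as a single number.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def build_hierarchy(
    data: Sequence[Any],
    subset: Sequence[Any],
    train: Callable[[Sequence[Any]], Any],
    distance: Callable[[Any, Any], float],
    cluster: Callable[[List[float]], List[int]],
    score_threshold: float,
    max_levels: int = 5,
) -> List[Any]:
    """Stack one model per detected distribution (hypothetical sketch)."""
    if not subset or max_levels == 0:
        return []
    model = train(subset)                          # train the GANs of this level
    dists = [distance(model, s) for s in data]     # prediction/observation distance
    labels = cluster(dists)                        # e.g. SOM neuron index per sample
    groups: Dict[int, List[Tuple[Any, float]]] = {}
    for sample, d, label in zip(data, dists, labels):
        groups.setdefault(label, []).append((sample, d))
    hierarchy = [model]
    for members in groups.values():
        mean_d = sum(d for _, d in members) / len(members)
        if mean_d > score_threshold:               # high-score cluster: new distribution
            new_subset = [s for s, _ in members]
            hierarchy += build_hierarchy(
                new_subset, new_subset, train, distance, cluster,
                score_threshold, max_levels - 1,
            )
    return hierarchy


# Toy stand-ins (assumptions, not the networks of the paper).
def toy_train(subset: Sequence[float]) -> float:
    return sum(subset) / len(subset)               # the "model" is the subset mean


def toy_distance(model: float, sample: float) -> float:
    return abs(sample - model)


def toy_cluster(dists: List[float]) -> List[int]:
    return [1 if d > 0.5 else 0 for d in dists]    # two clusters: low / high distance


straight = [0.0, 0.1, -0.1, 0.05]                  # simple situations (initial subset)
curves = [1.0, 1.1, 0.9]                           # more complex situations
levels = build_hierarchy(straight + curves, straight,
                         toy_train, toy_distance, toy_cluster, 0.5)
```

With these toy inputs the first model is fitted on the "straight" samples, the "curve" samples form the only high-distance cluster, and a second model is trained on them, after which no cluster exceeds the threshold and the recursion stops.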
In our experiments, we use a hierarchy of GANs based on the distributions identified from the discriminator scores. It consists of two levels: the first level $\mathcal{G}_0$, or Base GAN, which is trained with a provided initialization subset representing the empty straight path, and the second level $\mathcal{G}_1$, which is trained over the high-scored cluster (in our case this cluster semantically represents the curves). Note that choosing as reference the images and the optical-flow dynamics related to situations in which the agent moves straight in an environment with no obstacle on its path is quite straightforward. In fact, different variations in the motion of the agent with respect to such a situation can be semantically associated with different interactions with the environment. By assuming that deviations of the discriminator scores present a self-similarity whenever the same interaction happens, it can be expected that clustering such scores into different classes leads to an unsupervised segmentation of the different interaction situations. In our formulation, the SOM’s output consists of a set of neurons that encode the main information from the prediction errors and cluster them into a set of prototypes. Each such cluster tells us how images and optical-flow data can be grouped depending on their similarity and on the similarity of the prediction error produced by a GAN trained on a situation in which the agent moves in an empty space. The decision boundaries of the detector, which describe configurations where the corrections to the generative model should be high, can therefore generate multiple regions, each characterized by a different type of self-similarity. For example, curving in an empty space can generate different prediction errors from curving to avoid an obstacle. Similarly, the discriminator’s learned decision boundaries can also be used to better predict and semantically tag events at testing time, as explained in the next section.
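To make the clustering step concrete, the following is a minimal self-organizing map over a 1-D chain of neurons, written with the standard library only. It is a sketch under assumptions: the number of neurons, the linearly decaying learning rate and neighbourhood width, and the deterministic initialisation between the data minimum and maximum are illustrative choices, not settings taken from the experiments, and the toy inputs are scalar scores rather than full distance maps.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def best_matching_unit(prototypes: Sequence[Vector], x: Vector) -> int:
    """Index of the prototype closest to x (squared Euclidean distance)."""
    def sq_dist(w: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(w, x))
    return min(range(len(prototypes)), key=lambda j: sq_dist(prototypes[j]))


def train_som(data: Sequence[Vector], n_neurons: int = 2, epochs: int = 50,
              lr0: float = 0.5, sigma0: float = 1.0) -> List[List[float]]:
    """Train a 1-D SOM; returns one prototype (centroid) per neuron."""
    dim = len(data[0])
    lo = [min(x[k] for x in data) for k in range(dim)]
    hi = [max(x[k] for x in data) for k in range(dim)]
    # Deterministic initialisation: spread neurons between the data min and max.
    prototypes = []
    for j in range(n_neurons):
        t = j / (n_neurons - 1) if n_neurons > 1 else 0.5
        prototypes.append([lo[k] + t * (hi[k] - lo[k]) for k in range(dim)])
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                       # learning rate shrinks over time
        sigma = max(sigma0 * decay, 1e-3)      # neighbourhood width shrinks too
        for x in data:
            bmu = best_matching_unit(prototypes, x)
            for j, w in enumerate(prototypes):
                # Gaussian neighbourhood on the neuron chain around the winner.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return prototypes


# Toy scores: low distances ("straight") and high distances ("curves").
scores = [[0.05], [0.1], [0.08], [0.9], [1.0], [1.1]]
protos = train_som(scores)
low_cluster = best_matching_unit(protos, [0.07])
high_cluster = best_matching_unit(protos, [1.0])
```

After training, each neuron's prototype acts as the centroid of one cluster, so assigning a new score to a cluster reduces to a nearest-prototype lookup.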
Anomaly detection: At testing time, the discriminators are used to detect abnormality. More specifically, the test sample is input to the first level of the hierarchy of GANs. Let $\hat{D}^{F \to O}$ and $\hat{D}^{O \to F}$ be the patch-based discriminators trained on the two channel-transformation tasks. Given a test couple $(F, O)$, where $F$ is a test frame and $O$ is its corresponding optical-flow image, we first produce the prediction couple $(\hat{F}, \hat{O})$, where $\hat{O}$ and $\hat{F}$ are reconstructed using the first-level generators $G^{F \to O}$ and $G^{O \to F}$, respectively. Then the patch-based discriminators $\hat{D}^{F \to O}$ and $\hat{D}^{O \to F}$ are applied for the first and the second task, respectively. This operation results in one pair of discriminator representations for the ground-truth observation and one pair for the prediction. Note that a possible abnormality in the observation (e.g., an unusual object or an unusual movement) corresponds to an outlier with respect to the data distribution learned by the two discriminators during training. The presence of an anomaly results in a low value of the discriminator encoded representations of the prediction, but a high value of the discriminator encoded representations of the observation. Hence, in order to decide whether an observation is normal or abnormal with respect to the scores from the current level of the hierarchy, we simply measure the distance between prediction and observation. The distance from normality of the observation is defined as the distance between these two pairs of representations.
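As an illustration of this measurement, the sketch below combines the patch-wise score maps of the two channels into a single distance map. The exact expression of the distance is not recoverable from this text, so the form used here (observation score minus prediction score, summed over the two channel-transformation tasks) is an assumption, as are the function names and the toy score values.

```python
from typing import List

ScoreMap = List[List[float]]  # patch-wise discriminator outputs on a grid


def map_difference(obs: ScoreMap, pred: ScoreMap) -> ScoreMap:
    """Element-wise difference: observation score minus prediction score."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(obs, pred)]


def normality_distance(d_fo_obs: ScoreMap, d_fo_pred: ScoreMap,
                       d_of_obs: ScoreMap, d_of_pred: ScoreMap) -> ScoreMap:
    """Assumed distance map: sum of the two per-channel differences."""
    a = map_difference(d_fo_obs, d_fo_pred)   # frame -> optical-flow task
    b = map_difference(d_of_obs, d_of_pred)   # optical-flow -> frame task
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mean_score(score_map: ScoreMap) -> float:
    """Average a score map over all grid positions."""
    values = [v for row in score_map for v in row]
    return sum(values) / len(values)


# Toy 1x2 score maps (hypothetical values).
dist_map = normality_distance(
    d_fo_obs=[[0.75, 0.5]], d_fo_pred=[[0.25, 0.25]],
    d_of_obs=[[1.0, 0.5]], d_of_pred=[[0.5, 0.5]],
)
frame_distance = mean_score(dist_map)
```

Keeping the result as a map preserves where on the grid the deviation occurs; averaging it gives one scalar per frame.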
The computed distance also determines the cluster of the test sample, by comparing it with the encoded prototypes stored in the SOM’s neurons. Each trained neuron can be described by a centroid that encodes a known (normal) situation: if the sample belongs to a normal cluster, it is tagged as a normal sample; otherwise, it is passed to the next level of the hierarchy. A similar procedure is applied to the sample at the next levels, and at the last level an error threshold is defined to detect abnormal events: when all the levels in the hierarchy of GANs tag the sample as abnormal and the measurement is higher than this threshold, the current measurement is considered an abnormality.
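The test-time decision cascade can be sketched as follows. This is a simplified illustration under assumptions: each level is reduced to a hypothetical `distance` callable that returns a scalar, SOM prototypes are scalars, and the `Level` container, its field names, and the numeric values are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Level:
    distance: Callable[[Any], float]   # sample -> distance from normality
    prototypes: Sequence[float]        # SOM centroids (scalars in this toy)
    normal: Sequence[bool]             # True if a prototype encodes a normal cluster


def classify(sample: Any, levels: List[Level], final_threshold: float) -> str:
    """Pass a sample down the hierarchy until some level tags it as normal."""
    if not levels:
        raise ValueError("the hierarchy must contain at least one level")
    last_distance = 0.0
    for level in levels:
        d = level.distance(sample)
        nearest = min(range(len(level.prototypes)),
                      key=lambda j: abs(d - level.prototypes[j]))
        if level.normal[nearest]:
            return "normal"            # matched a known situation at this level
        last_distance = d              # otherwise hand over to the next level
    # Every level tagged the sample as abnormal: apply the final threshold.
    return "abnormal" if last_distance > final_threshold else "normal"


# Toy two-level hierarchy: level 0 models values near 0, level 1 values near 1.
toy_levels = [
    Level(distance=lambda x: abs(x - 0.0), prototypes=[0.05, 1.0], normal=[True, False]),
    Level(distance=lambda x: abs(x - 1.0), prototypes=[0.05, 2.0], normal=[True, False]),
]
straight_tag = classify(0.02, toy_levels, final_threshold=1.5)
curve_tag = classify(1.0, toy_levels, final_threshold=1.5)
unseen_tag = classify(3.0, toy_levels, final_threshold=1.5)
```

In this toy run the first sample is accepted by the first level, the second is rejected by the first level but accepted by the second, and the third is rejected by both levels and exceeds the final threshold.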
3 Experimental Results
Dataset: The proposed dataset was captured by an onboard camera on a real vehicle, ‘iCab’, during a perimeter monitoring task. Two scenarios are defined. In the first, the vehicle performs standard perimeter monitoring under normal conditions. In the second, an anomaly is present during the perimeter monitoring task: the vehicle performs a manoeuvre to avoid a static pedestrian and then continues standard patrolling.
Training the hierarchy of GANs: The first level of GANs (the Base GAN) is trained on a selected subset of normal samples from the training sequence. This subset contains the sequences captured while the vehicle moves on a straight path with an empty road, where the expected behaviour is the vehicle moving straight (normal situation). When the pair of first-level discriminators detects an abnormality with respect to the set on which it was trained, the corresponding observations are expected to be outliers. This is confirmed by testing the first level over the entire training sequence and observing the distances between the discriminator scores of the prediction and of the observation. Fig. 1 shows the results of this training. In Fig. 1 (a), where the test is performed using only the first level, the model detects the straight path (white background area) perfectly, while it fails when the vehicle curves (green bars) and recognizes curving as an abnormal event. As expected, the distances between the discriminator scores of the prediction and of the observation are higher over the curving areas. However, after training the second-level GANs on this subset of data (red cluster) and applying the entire hierarchy, the model recognizes the entire training sequence as normal. This happens because the samples from a different distribution, which are tagged as abnormal by the first level, are passed to the second level of the hierarchy, where they are recognized as normal samples.
Evaluations over the testing scenario: In order to evaluate the proposed model, we apply the trained hierarchy of GANs to an unseen testing sequence. The scenario is the moving vehicle performing the perimeter monitoring task in the presence of abnormal events. In this sequence the vehicle performs an avoidance manoeuvre around a static pedestrian and continues standard monitoring afterwards. The goal is to detect the abnormality, which is the presence of the pedestrian. Fig. 3 reports a normalized abnormality signal over the test sequence. The abnormality signal is calculated from the distances between the prediction and the observation score maps. The red bars show the presence of an abnormal situation, which in this case is the static pedestrian. The abnormality areas start from the first sight of the pedestrian and continue until the avoidance manoeuvre finishes. Note that the abnormality signal in Fig. 3 is computed by averaging over the distance maps: when an abnormality begins, this value does not change much, since a small local abnormality (see Fig. 2(c)) cannot change the average value significantly. However, as soon as the pedestrian is in full sight and the vehicle starts the avoidance action, the abnormality signal becomes higher, since both the observed appearance and the action represent an unseen situation. This situation is shown in Fig. 2(d,e).
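The effect of averaging described above can be reproduced with a small numerical sketch. The min-max normalization, the 4×4 grid, and the numeric values are assumptions chosen for illustration; the text only states that the signal is normalized and obtained by averaging the distance maps.

```python
from typing import List

DistanceMap = List[List[float]]


def frame_score(distance_map: DistanceMap) -> float:
    """Average a distance map over all grid positions."""
    values = [v for row in distance_map for v in row]
    return sum(values) / len(values)


def normalize(signal: List[float]) -> List[float]:
    """Min-max normalize a signal to [0, 1] (assumed normalization)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]
    return [(v - lo) / (hi - lo) for v in signal]


def abnormal_frames(signal: List[float], threshold: float) -> List[int]:
    """Indices of frames whose normalized signal exceeds the threshold."""
    return [i for i, v in enumerate(signal) if v > threshold]


# Three toy frames on a 4x4 grid (hypothetical distance values).
normal_map = [[1.0] * 4 for _ in range(4)]
local_map = [row[:] for row in normal_map]
local_map[0][0] = 5.0                        # small local abnormality in one patch
full_map = [[5.0] * 4 for _ in range(4)]     # abnormality covering the whole frame

raw_signal = [frame_score(m) for m in (normal_map, local_map, full_map)]
signal = normalize(raw_signal)
flagged = abnormal_frames(signal, threshold=0.5)
```

The frame with a single abnormal patch moves the averaged signal only slightly, while the frame in which the abnormality fills the view drives it to its maximum, which mirrors the behaviour described for Fig. 2(c) versus Fig. 2(d,e).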
Hierarchy of GANs vs. single GAN: It is interesting to compare our hierarchy of GANs with a single GAN, in order to evaluate the performance of the proposed hierarchy. In this experiment we kept all training parameters as in the original setup; the only difference is that the entire sequence is used to train the cross-modal GANs. In other words, the assumption is that a single GAN should learn the entire distribution of normal patterns. The resulting abnormality signal is shown in Fig. 4 (a). Although the single GAN can detect the peak of abnormality, the number of misdetections is higher than for the hierarchy of GANs, specifically over the curves. This could be due to the mode collapse effect, which is a common issue in training GANs: the single GAN collapsed onto the moving-straight samples and failed to generate/discriminate the curving action.
Furthermore, a frame-level anomaly detection is performed for the test scenario. An abnormality label is predicted for a given test frame if at least one abnormal pixel is predicted in that frame. This evaluation procedure is iterated over a range of confidence thresholds in order to build the corresponding ROC curve. In our case, these confidence thresholds are directly applied to the abnormality signal defined in Sec. 2. Fig. 4 (b) shows the ROC curves. In terms of Equal Error Rate (EER) and Area Under Curve (AUC), the results for the single GAN are significantly worse than those of our method based on the hierarchy of GANs.
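The threshold sweep behind the ROC curve, and the two summary metrics, can be sketched as below. This is a generic illustration, not the evaluation code used for Fig. 4 (b): the function names are hypothetical, the EER is approximated by the ROC point where the false-positive and false-negative rates are closest, and the scores and labels are toy values.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false-positive rate, true-positive rate)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Sweep a threshold over the scores; label 1 marks an abnormal frame."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one normal and one abnormal frame")
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y != 1)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def equal_error_rate(points: Sequence[Point]) -> float:
    """Approximate EER: the point where FPR and FNR (= 1 - TPR) are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2.0


# Toy abnormality signal for four frames and their ground-truth labels.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_roc = roc_points(toy_scores, toy_labels)
toy_auc = auc(toy_roc)
toy_eer = equal_error_rate(toy_roc)
```

A lower EER and a higher AUC both indicate better separation between normal and abnormal frames, which is the sense in which the two models are compared.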
In this paper, a method is proposed for learning a complex data distribution based on a hierarchy of GANs in a weakly-supervised manner. This model is used to represent the PL self-awareness model of autonomous embodied agents. The scores of the discriminator network are used to approximate the complexity of the data, which is one of our novelties. Namely, a set of distance maps between the prediction and the observation scores is used as a criterion for creating another level in the hierarchical structure. This technique makes it possible to break down and solve a complex data distribution in an incremental fashion. The experimental results on a semi-autonomous ground vehicle show good performance of our method.
-  M. Baydoun, D. Campo, V. Sanguineti, L. Marcenaro, A. Cavallaro, and C. Regazzoni, “Learning switching models for abnormality detection for autonomous driving,” in FUSION, 2018.
-  M. Ravanbakhsh, M. Baydoun, D. Campo, L. Marcenaro, and C. Regazzoni, “Learning multi-modal self-awareness models for autonomous vehicles from human driving,” in FUSION, 2018.
-  P. R. Lewis, M. Platzner, B. Rinner, J. Tørresen, and X. Yao, Self-aware Computing Systems, 2016.
-  M. Baydoun, M. Ravanbakhsh, D. Campo, P. Marín, D. Martín, L. Marcenaro, A. Cavallaro, and C. S. Regazzoni, “A multi-perspective approach to anomaly detection for self-aware embodied agents,” in ICASSP, 2018.
-  D. Campo, A. Betancourt, L. Marcenaro, and C. Regazzoni, “Static force field representation of environments based on agents’ nonlinear motions,” EURASIP, 2017.
-  J. S. Olier, P. Marín, D. Martín, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni, “Dynamic representations for autonomous driving,” in AVSS, 2017.
-  V. Bastani, L. Marcenaro, and C. S. Regazzoni, “Online nonparametric bayesian activity mining and analysis from surveillance video,” TIP, 2016.
-  B. T. Morris and M. M. Trivedi, “A survey of vision-based trajectory learning and analysis for surveillance,” TCSVT, 2008.
-  G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. P. Seidel, and B. Rosenhahn, “Multisensor-fusion for 3d full-body human motion capture,” in CVPR, 2010.
-  E.B. Ermis, V. Saligrama, P.-M. Jodoin, and J. Konrad, “Abnormal behavior detection and behavior matching for networked cameras,” in ICDSC, 2008.
-  R. Emonet, J. Varadarajan, and J.-M. Odobez, “Multi-camera open space human activity discovery for anomaly detection,” in AVSS, 2011.
-  D. Sirkin, N. Martelaro, M. Johns, and W. Ju, “Toward measurement of situation awareness in autonomous vehicles,” in CHI, 2017.
-  D. Campo, M. Baydoun, L. Marcenaro, A. Cavallaro, and C. Regazzoni, “Unsupervised trajectory modeling based on discrete descriptors for classifying moving entities in video sequences,” in ICIP, 2018.
-  M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe, “Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection,” WACV, 2018.
-  M. Sabokrou, M. Fayyaz, M. Fathy, et al., “Fully convolutional neural network for fast anomaly detection in crowded scenes,” CVIU, 2018.
-  H. Rabiee, J. Haddadnia, H. Mousavi, M. Nabi, V Murino, and N. Sebe, “Crowd behavior representation: an attribute-based approach,” SpringerPlus, 2016.
-  M. Nabi, A. Bue, and V. Murino, “Temporal poselets for collective activity detection and recognition,” in ICCV Workshops, 2013.
-  D. Campo, M. Baydoun, L. Marcenaro, A. Cavallaro, and C. Regazzoni, “Modeling and classification of trajectories based on a gaussian process decomposition into discrete components,” in AVSS, 2017.
-  K. Kim, D. Lee, and I. Essa, “Gaussian process regression flow for analysis of motion trajectories,” in ICCV, 2011.
-  D. M. Ramík, C. Sabourin, R. Moreno, and K. Madani, “A machine learning based intelligent vision system for autonomous object detection and recognition,” Applied Intelligence, 2014.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2016.
-  M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in ICIP, 2017.
-  M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” arXiv preprint arXiv:1706.07680, 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in NIPS, 2016.
-  A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, “Veegan: Reducing mode collapse in gans using implicit variational learning,” in NIPS, 2017.
-  I. O. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, “Adagan: Boosting generative models,” in NIPS, 2017.
-  D. Reynolds, Encyclopedia of biometrics, 2015.
-  P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Transactions on Computers, 1977.
-  S. N. Kadir, D. F. M. Goodman, and K. D. Harris, “High-dimensional cluster analysis with the masked em algorithm,” Neural Computation, 2014.
-  A. Abad, M. Nabi, and A. Moschitti, “Autonomous crowdsourcing through human-machine collaborative learning,” in SIGIR, 2017.
-  E. Sangineto, M. Nabi, D. Culibrk, and N. Sebe, “Self paced deep learning for weakly supervised object detection,” TPAMI, 2018.
-  A. Abad, M. Nabi, and A. Moschitti, “Self-crowdsourcing training for relation extraction,” in ACL, 2017.
-  T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in ECCV, 2004.
-  T. Kohonen, Self-Organizing Maps, Physics and astronomy online library. 2001.
-  P. Marín-Plaza, J. Beltrán, A. Hussein, B. Musleh, D. Martín, A. de la Escalera, and J. M. Armingol, “Stereo vision-based local occupancy grid map for autonomous navigation in ros,” in VISIGRAPP, 2016.