Learning Multi-Modal Self-Awareness Models for Autonomous Vehicles from Human Driving

This paper presents a novel approach for learning self-awareness models for autonomous vehicles. The proposed technique is based on the availability of synchronized multi-sensor dynamic data related to different maneuvering tasks performed by a human operator. It is shown that different machine learning approaches can be used to first learn single-modality models using coupled Dynamic Bayesian Networks (DBNs); such models are then correlated at the event level to discover contextual multi-modal concepts. In the presented case, visual perception and localization are used as modalities. Cross-correlations among modalities in time are discovered from data and are described as probabilistic links connecting shared and private multi-modal DBNs at the event (discrete) level. Results are presented on experiments performed on an autonomous vehicle, highlighting the potential of the proposed approach to allow anomaly detection and autonomous decision making based on learned self-awareness models.




I Introduction

Self-awareness refers to a system's capability to recognize and predict its own state, its possible actions, and the results of these actions on the system itself and on its environment [1]. Recent developments in signal processing and machine learning make it possible to design autonomous systems equipped with a self-awareness module that measures how similar the current realization of a task is to a previous experience of the same type. The capability of predicting task evolution in normal conditions (i.e., when the task follows the rules learned in the previous experience) and of jointly detecting abnormal situations that can arise is important, because it allows autonomous systems to increase their situational awareness and the effectiveness of their decision-making sub-modules [2, 3]. Models of different self-awareness layers can be integrated in order to build up a structured, multi-modal self-aware behavior for an agent.

In [3] a self-awareness model was introduced that consists of two layers: the Shared Level (SL) and the Private Layer (PL). The analysis of observed moving agents for learning models of normal/abnormal dynamics in a given scene from an external viewpoint represents an emerging research field [4, 5, 6, 7, 8, 9, 10, 11, 12]. The planned activity of an entity is one such model; it can be defined as the sequence of organized state changes (actions) that an entity has to perform in a specific context to achieve a task. This set of actions can be learned from examples and clustered into sequential discrete patterns of motion. The availability of a plan that associates the current state with an action class makes it possible to detect normal/abnormal situations in future repetitions of the same task. In general, computational models for abnormality detection are trained on a set of observations corresponding to standard behaviors. Accordingly, abnormalities can be defined as observations that do not match the patterns previously learned as regular, i.e., behaviors that have not been observed before [13].

An active self-aware action plan can be considered as a filter that makes it possible to predict and estimate state behaviors using linear and non-linear dynamic and observation models. Switching Dynamical Systems (SDSs) are well-known Probabilistic Graphical Models (PGMs) that are capable of managing discrete and continuous dynamic variables jointly in a single dynamical filter. Such systems have been used successfully to improve decision making and tracking capabilities. In SDSs, each dynamical model for the continuous state variable in successive time instants is associated with one of a discrete set of values of a random variable that represents a higher-level motivation of that dynamic model. Externally shareable observations related to actions that are a function of agent position (plan) can be observed, and active probabilistic plan models that include hidden continuous and discrete states can be learned. The most used algorithm to take advantage of learned hierarchical probabilistic knowledge in the online phase is Markov Jump Linear Systems (MJLS). MJLS uses a combination of Kalman filter (KF) and particle filter (PF) to predict and update the continuous and discrete state space posterior probabilities. In this paper, an MJLS is used to exploit the learned knowledge for the SL, including the capability to self-detect abnormal situations.

However, an agent can also infer dynamic models with respect to inner variables that it can observe from a first-person viewpoint while performing the same task; this private knowledge is directly accessible only to the agent itself. Abnormalities can therefore also be detected with a self-awareness model learned from private multisensorial first-person data acquired by the agent while it performs the same task for which the SL model has been obtained. Such a model can be defined as the Private Layer (PL) of self-awareness. An external observer has no access to such information and so cannot directly detect PL abnormalities, although it can still evaluate SL abnormalities using third-person models that include shared variables. A PL model allows an agent to evaluate abnormalities related to both PL and SL models, as was shown in [3]. However, previous works mostly rely on a high level of supervision to learn PL self-awareness models [15, 13], while in this work we propose a weakly-supervised method based on a hierarchy of cross-modal Generative Adversarial Networks (GANs) [16] for estimating PL models. Weakly supervised PL models can not only provide additional information to boost the SL model, but can also provide a joint self-aware multisensorial modality to cross-predict heterogeneous multimodal anomalies related to the same task execution. This paper describes a novel method to learn the PL model using an incremental hierarchy of GANs [16, 17]. The first-person images acquired by the agent's private camera during task execution are used together with the related optical-flow maps to learn models based on GANs, which are more appropriate for reducing the high dimensionality of the visual modality.

II Multi-modal self-awareness models

This section describes the two levels of self-awareness. The first is called the private layer (PL), whereas the second is called the shared level (SL), as proposed in [3]. The two levels are learned from visual perception and localization, respectively. A probabilistic framework based on a switching dynamic system is used to learn the SL, while an incremental hierarchy of Generative Adversarial Networks (GANs) is used to learn the PL model.

II-A Shared Level of self-awareness

This level of self-awareness focuses on the analysis of observed moving agents for understanding their dynamics in a given scene. This model is said to be shared as corresponding measurements can be directly observed by the agent itself or external observers.

II-A1 Learning of spatial activities

Let $Z_k$ be the measured location of an object in a given reference system at time instant $k$. A Kalman Filter (KF) based on an “unmotivated model” is used for tracking agents’ motions. Such a dynamic filter assumes that an observed object moves according to a random dynamic model described as:

$$X_{k+1} = A X_k + w_k, \qquad (1)$$

where $X_k$ represents the agent’s generalized state, composed of its coordinate positions and velocities at a time instant $k$, such that $X_k = [x \;\; \dot{x}]^\top$, with $x \in \mathbb{R}^d$ and $\dot{x} \in \mathbb{R}^d$. $d$ represents the dimensionality of the environment. $A \in \mathbb{R}^{2d \times 2d}$ is a dynamic model matrix: $A = [A_1 \;\; A_2]$, with $A_1 = [I_d \;\; 0_{d,d}]^\top$ and $A_2 = 0_{2d,d}$. $I_n$ represents a square identity matrix of size $n$ and $0_{l,m}$ is an $l \times m$ null matrix. $w_k$ represents the prediction noise, which is here assumed to be zero-mean Gaussian for all variables in $X_k$ with a covariance matrix $Q$, such that $w_k \sim \mathcal{N}(0, Q)$. The filter is called unmotivated because it assumes that the agent remains fixed; in other words, the agent moves only due to random noisy fluctuations associated with $w_k$. By considering a 2-dimensional reference system, i.e., $d = 2$, the generalized state of an agent can be described as $X_k = [x \;\; y \;\; \dot{x} \;\; \dot{y}]^\top$.

The agents’ generalized states $X_k$ produced by the unmotivated filter are used as input for a Self-Organizing Map (SOM) that clusters similar information (quasi-constant velocities) based on a weighted distance. This clustering process prioritizes similar velocities (actions) by giving more weight to the velocity components. Accordingly, the following distance function, which uses the weights $\beta$ and $\alpha$, is used for training the SOM:

$$d(X, Y) = \sqrt{(X - Y)^\top D \, (X - Y)},$$

where $D = \mathrm{diag}(\beta, \beta, \alpha, \alpha)$ weights position differences by $\beta$ and velocity differences by $\alpha$, with $\alpha > \beta$. $X$ and $Y$ are both 4-dimensional vectors of the form $[x \;\; y \;\; \dot{x} \;\; \dot{y}]^\top$. The SOM’s output consists of a set of neurons encoding the main information from the observed data in a prototype structure that has the same form as the generalized states. Trained neurons represent a set of zones that segment the continuous state space into a set of regions, here called superstates. Each zone is composed of a set of clustered positions and their corresponding velocities, which define a quasi-constant velocity model used for tracking and prediction purposes. Such a quasi-constant velocity model can be defined as $U_{S_k} = [\dot{x}_{S_k} \;\; \dot{y}_{S_k}]^\top$, where $S_k$ identifies the region (zone) corresponding to the superstate at time $k$. We can define a linear dynamic model for each superstate as follows:

$$X_{k+1} = A X_k + B U_{S_k} + w_k, \qquad (2)$$

where $A$ is a transition matrix that maps entities’ positions as constant with respect to the previous state, such that $A = [A_1 \;\; A_2]$ with $A_1 = [I_2 \;\; 0_{2,2}]^\top$ and $A_2 = 0_{4,2}$. $B = [I_2 \Delta k \;\; I_2]^\top$ is a control input model, $k$ indexes the time, $\Delta k$ is the sampling time and $w_k$ is the process noise. The variable $U_{S_k}$ is a control vector that encodes the expected entity’s velocity when its state falls in the discrete region $S_k$.
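As a rough illustration of the two steps described in this subsection (assigning a generalized state to its closest prototype, then predicting with the quasi-constant velocity model of that superstate), consider the minimal Python sketch below. It is not the authors' implementation: the function names, the weights `beta=0.3` and `alpha=0.7`, and the weighted Euclidean form of the distance are assumptions made for illustration, and the process noise is omitted so that only the mean prediction is computed.

```python
import math

def weighted_distance(x, y, beta=0.3, alpha=0.7):
    """Weighted distance between two states [px, py, vx, vy].

    Assumption: position differences are weighted by beta and velocity
    differences by alpha (alpha > beta favours similar velocities).
    """
    weights = (beta, beta, alpha, alpha)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def assign_superstate(state, prototypes, beta=0.3, alpha=0.7):
    """Return the index of the prototype (neuron) closest to the state."""
    return min(
        range(len(prototypes)),
        key=lambda i: weighted_distance(state, prototypes[i], beta, alpha),
    )

def predict_next_state(state, prototype, dt):
    """Mean prediction of a quasi-constant velocity model (noise omitted).

    The position is carried over and advanced by the prototype velocity;
    the predicted velocity is the prototype velocity itself.
    """
    px, py = state[0], state[1]
    ux, uy = prototype[2], prototype[3]
    return [px + dt * ux, py + dt * uy, ux, uy]

# Hypothetical prototypes: one moving along x, one moving along y.
prototypes = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
state = [0.2, 0.1, 0.9, 0.1]
zone = assign_superstate(state, prototypes)
prediction = predict_next_state(state, prototypes[zone], dt=0.5)
```

In this toy case the state moves mostly along x, so it is assigned to the first prototype and its predicted position advances along x only.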

II-A2 Online testing MJPF

The SL model can be represented by means of a DBN switching model. Such a model includes a discrete set of state-space regions corresponding to the vocabulary of switching variables. Each quasi-constant velocity model described before (Eq. (2)) is associated with one of these regions and describes a possible alternative relation between consecutive temporal states. A further learning step yields the temporal transition matrices between superstates. This allows the system to estimate not only the next superstate but also the moment at which discrete transitions take place. Fig. 1-a shows the proposed DBN used for modeling agents’ behaviors, where arrows represent conditional probabilities: vertical arrows introduce causalities between both (discrete and continuous) levels of inference and the observed measurements, while horizontal arrows explain temporal causalities between hidden variables. A Markov Jump Particle Filter (MJPF) is used to infer posterior probabilities on discrete and continuous states iteratively. The MJPF essentially consists of a particle filter (PF) working at the discrete level, with a Kalman filter (KF) embedded in each particle. Consequently, each particle has an attached KF that depends on the particle’s superstate $S_k$ (see Eq. (2)). This filter is used to obtain the prediction for the continuous state associated with the particle’s superstate $S_k$, that is $p(X_k \mid X_{k-1}, S_k)$; the posterior probability $p(X_k, S_k \mid Z_k)$ is then estimated according to the current observation $Z_k$.
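The learning step that produces temporal transition matrices between superstates can be sketched as simple transition counting over the sequence of superstate labels observed during training. The snippet below is only an illustrative sketch under that assumption; the function name and the choice of a uniform row for superstates that never appear as a source are hypothetical and not taken from the paper.

```python
def estimate_transition_matrix(labels, n_states):
    """Estimate row-stochastic transition probabilities between superstates.

    labels   : sequence of integer superstate indices over time
    n_states : total number of superstates
    """
    counts = [[0] * n_states for _ in range(n_states)]
    # Count how often superstate a is followed by superstate b.
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: no evidence for this superstate, use a uniform row.
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix

# Toy label sequence over two superstates.
transitions = estimate_transition_matrix([0, 0, 1, 1, 1, 0], n_states=2)
```

Each row of the result sums to one; the diagonal entries indicate how long the agent tends to stay in a superstate, which is what allows the moment of a discrete transition to be anticipated.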

Abnormalities can be seen as deviations between the predictions that the MJPF makes using the learned models embedded in its switching model and newly observed trajectories in which the interaction with the environment differs from the training data. Since a probabilistic filtering approach is considered, two main moments can be distinguished: (i) Prediction, which corresponds to an estimation of future states at a given time $k$; and (ii) Update, the computation of the generalized state posterior probability based on the comparison between predicted states and new measurements. Accordingly, abnormal behaviors can be measured in the update phase, i.e., when predictions are far from observations. As is well known, innovations in KFs are defined as:

$$\epsilon_{k, S_k} = Z_k - H \hat{X}_{k \mid k-1}, \qquad (3)$$

where $\epsilon_{k, S_k}$ is the innovation generated in the zone $S_k$ where the agent is located at time $k$. $Z_k$ represents the observed spatial data and $\hat{X}_{k \mid k-1}$ is the KF estimation of the agent’s state at the future time $k$, calculated at the time instant $k-1$ through Eq. (2). Additionally, $H$ is the observation model that maps states into measurements, such that $Z_k = H X_k + v_k$, where $v_k \sim \mathcal{N}(0, R)$ and $R$ is the covariance of the observation noise.

Abnormalities can be seen as moments when a tracking system fails to predict subsequent observations, so that new models are necessary to explain new observed situations. A weighted norm of innovations is employed for detecting abnormalities, such that:


In the MJPF, the expression in Eq. (4) is computed for each particle, and the median of these values is used as the global anomaly measurement of the filter. Further details about the implemented method can be found in [18].
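The per-particle computation and the median aggregation can be sketched as follows. This is a hedged illustration rather than the method of [18]: the weighted norm is assumed to be a weighted Euclidean norm, the per-particle predictions are assumed to be already mapped into observation space, and all names are hypothetical.

```python
import math
from statistics import median

def innovation(observation, predicted_observation):
    """Difference between the observation and a particle's predicted observation."""
    return [z - p for z, p in zip(observation, predicted_observation)]

def weighted_norm(vector, weights):
    """Weighted Euclidean norm (assumed form of the abnormality measure)."""
    return math.sqrt(sum(w * v * v for w, v in zip(weights, vector)))

def filter_abnormality(observation, particle_predictions, weights):
    """Median over particles of the weighted innovation norm."""
    norms = [
        weighted_norm(innovation(observation, prediction), weights)
        for prediction in particle_predictions
    ]
    return median(norms)

# Three hypothetical particles predicting a 2-D observation.
signal = filter_abnormality(
    observation=[1.0, 1.0],
    particle_predictions=[[1.0, 1.0], [0.0, 1.0], [1.0, 3.0]],
    weights=[1.0, 1.0],
)
```

Using the median rather than the mean keeps a few badly placed particles from dominating the global abnormality signal.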

Fig. 1: Proposed DBN switching models for: (a) SL, (b) PL. Panel labels: Particle Filter, Kalman Filter.

II-B Private Layer of self-awareness

To model the PL of self-awareness, a hierarchical structure of cross-modal GANs is employed. This set of cross-modal GANs [16] learns normality from a sequence of observed images ($F$), collected from a first-person viewpoint synchronously with the SL position data, paired with their corresponding optical-flow maps ($O$), which are the direct observation of the image changes caused by the joint motion of the agent and the environment.

In order to understand the relation between these two modalities, the hierarchy of cross-modal GANs is trained in a weakly-supervised manner. The only supervision is provided by a subset of normal data related to a reference situation (corresponding to the unmotivated filter for the SL), which is used to train the first level of the hierarchy, called the base GAN. The base GAN provides a reference for the next levels of the hierarchy, all of which are trained in a self-supervised manner. The source of this self-supervision is the criterion provided by the base GAN. The rest of this section explains the procedure for learning a single cross-modal GAN, the construction of the hierarchy of GANs, and finally the online application of the learned model for prediction and anomaly detection.

II-B1 Learning the cross-modal representation

GANs are generative deep networks that are trained using only unsupervised data. The supervisory information in a GAN is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$). During training, $G$ generates new data and $D$ tries to distinguish whether its input is real (i.e., it is a training image) or was generated by $G$. This competition between $G$ and $D$ is helpful in boosting the ability of both $G$ and $D$. To learn the normal pattern, two channels are used as observations, appearance (i.e., raw pixels) and motion (optical-flow images), for two cross-channel tasks. In the first task, optical-flow images are generated from the original frames, while in the second task appearance information is estimated from an optical-flow image. Specifically, let $F_t$ be the $t$-th frame of a training video and $O_t$ the optical flow obtained using $F_t$ and $F_{t+1}$. $O_t$ is computed using [19]. Two networks are trained: $\mathcal{N}^{F \to O}$, which is trained to generate optical flow from frames (task 1), and $\mathcal{N}^{O \to F}$, which generates frames from optical flow (task 2). In both cases, inspired by [20, 17, 21], our architecture is composed of two fully-convolutional networks: the conditional generator $G$ and the conditional discriminator $D$. The network $G$ is the U-Net architecture [20], an encoder-decoder with skip connections that help to preserve important local information. For $D$, the PatchGAN discriminator [20, 22] is proposed, which is based on a “small” fully-convolutional discriminator.

Hierarchy of cross-modal GANs: As described in Sec. I, the assumption is that the distribution of the normality patterns has a high degree of diversity. In order to learn such a distribution we suggest a hierarchical strategy that encodes the different distributions into different hierarchical levels, in which each subset of the training data is used to train a different GAN. To partition the data into subsets, an approach similar to the one used for the SL is adopted: the information about the distance between the predicted (generated) frames and optical-flow images and the newly acquired images and optical flow at successive time instants is segmented into regions. To construct the proposed hierarchy of GANs, a recursive procedure is adopted. As shown in Alg. 1, the inputs of the procedure are two sets. The first can be seen as the set of observation vectors, which includes all the observations from the normal sequences of the training data; specifically, it is a set of coupled Frame-Motion maps $(F_t, O_t)$, $t = 1, \dots, N$, where $F_t$ is the $t$-th frame, $O_t$ its optical-flow map and $N$ the total number of training samples. The second input is a subset of the first, provided to train the GANs of each individual level of the hierarchy. For instance, in the case of the first-level GANs, which act as the reference dynamic model, the initial subset is used to train the two cross-modal networks (frame-to-flow and flow-to-frame). Note that the only supervision here is the initial subset used to train the first level of the hierarchy; the next levels are built using the supervision provided by the first level of GANs. After training the two first-level networks, we feed their generators $G^{F \to O}$ and $G^{O \to F}$ with each frame $F_t$ of the entire set and its corresponding optical-flow image $O_t$, respectively. The generators predict Frame-Motion couples as:

$$\hat{O}_t = G^{F \to O}(F_t), \qquad \hat{F}_t = G^{O \to F}(O_t), \qquad (5)$$

where $\hat{F}_t$ and $\hat{O}_t$ are the $t$-th predicted image and predicted optical flow, respectively. The distance maps between the observations and the predictions for both channels are computed by the discriminators:


The distance maps can be seen as the coupled image-motion innovation: for each sample, one map represents the innovation on the portion of the generalized state associated with the image and the other the innovation on the portion associated with the optical flow. The joint innovations are input to a self-organizing map (SOM) [23] in order to cluster similar innovations on appearance-motion information. Similarly to the clustering of position-velocity information in the shared level, here the clustering discretizes the innovations (i.e., variations with respect to the reference GAN) on appearance and motion into a set of super-states. Specifically, the SOM’s output is a set of neurons encoding the innovation information into a set of prototypes. The detected prototypes (clusters) provide the discretization that defines the set of super-states, one per detected cluster. Each of these clusters can also memorize the input images and optical-flow instances that are related to a given innovation, thus mirroring the position-velocity couples of the SL.

It is expected that the clusters containing the training data obtain lower scores, since the innovation between prediction and observation is lower on the associated image and optical-flow set. This is the criterion used to detect the new distributions for learning new GANs: clusters with high, distinct and self-similar average scores are considered new distributions. The data attached to the newly detected distributions form the new subsets used to train the new networks for the next level of the hierarchy. This procedure continues until no new distribution is detected. The GANs and the detected super-states of each level are then stacked incrementally to construct the entire hierarchical structure of GANs. The incremental nature of the proposed method is similar to the one used for learning different KFs for the SL, although GAN dynamic models are more generally applicable, and it makes the method a powerful way to learn a very complex distribution of data in a self-supervised manner.

Algorithm 1: Constructing the hierarchy of GANs. Only fragments of the listing are recoverable:
    Input: entire training sequences
    for each identified cluster do
        if [condition not recoverable] then
            go to train
    return
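Since only fragments of Algorithm 1 survive, the following Python sketch shows one possible reading of the incremental procedure described in the text, not the authors' algorithm. All names are hypothetical. To keep the example self-contained, the GANs are replaced by a trivial stand-in model (the mean of 1-D samples), the discriminator-based distance by an absolute difference, and the SOM by a fixed threshold split. Tracking, for each sample, the lowest innovation obtained by any level trained so far is also an assumption, introduced so that the loop stops once every sample is explained.

```python
def build_hierarchy(all_data, init_subset, train, innovation, cluster, is_new,
                    max_levels=10):
    """Incrementally train one model per level until no new distribution appears."""
    hierarchy = []
    best = [float("inf")] * len(all_data)  # lowest innovation seen per sample
    subset = list(init_subset)
    while subset and len(hierarchy) < max_levels:
        model = train(subset)
        best = [min(b, innovation(model, s)) for b, s in zip(best, all_data)]
        clusters = cluster(best)  # lists of sample indices
        hierarchy.append({"model": model, "clusters": clusters})
        # Samples in clusters flagged as a new distribution feed the next level.
        subset = [all_data[i]
                  for c in clusters if is_new([best[i] for i in c])
                  for i in c]
    return hierarchy

# --- Stand-ins for the GAN, the distance maps and the SOM (assumptions) ---
def mean_model(samples):
    return sum(samples) / len(samples)

def abs_innovation(model, sample):
    return abs(sample - model)

def threshold_clusters(scores, cut=1.0):
    low = [i for i, s in enumerate(scores) if s <= cut]
    high = [i for i, s in enumerate(scores) if s > cut]
    return [c for c in (low, high) if c]

def high_average(cluster_scores, cut=1.0):
    return sum(cluster_scores) / len(cluster_scores) > cut

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]      # two toy "behaviours"
levels = build_hierarchy(data, data[:3], mean_model, abs_innovation,
                         threshold_clusters, high_average)
```

On this toy data the first level is trained on the samples around 0, the samples around 5 form a high-innovation cluster that trains a second level, and the procedure then stops because no cluster remains unexplained.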

In our experiments, we use a hierarchy of GANs that consists of two levels: the first level, or base GAN (reference), which is trained with a provided initialization subset representing the empty straight path (and plays here the role of the unmotivated filter for the SL), and the second level, which is trained over the clusters of high-scored innovations. In our case these clusters semantically represent curves, i.e., situations where the dynamic model describing changes in images and optical flow differs from the one predicted by a model that assumes straight motion in free space. Note that the straight empty path (in the normal situation) is selected as the initialization subset for the sake of simplicity: it is the most efficient way to reach a target, so variations with respect to this behavior can be described as differential actions with respect to going straight. The discriminators’ learned decision boundaries are also used to detect abnormal events at testing time, as explained in the next section.

II-B2 Online testing GANs

Once the GANs hierarchy is trained, it can be used for online prediction and anomaly detection tasks. Here we describe the testing phase for state/label estimation and detecting the possible abnormalities.

Label estimation: At testing time we aim to estimate the state and detect possible abnormalities with respect to the training set. More specifically, the test sample is input into the first level of the hierarchy of GANs. Let $D^{F \to O}$ and $D^{O \to F}$ be the patch-based discriminators trained on the two channel-transformation tasks. Given a test frame $F_t$ and its corresponding optical-flow image $O_t$, we first produce the reconstructed $\hat{O}_t$ and $\hat{F}_t$ using the first-level generators $G^{F \to O}$ and $G^{O \to F}$, respectively. Then, the pair of patch-based discriminators $D^{F \to O}$ and $D^{O \to F}$ is applied for the first and the second task, respectively. This operation results in two score maps for the observation and two score maps for the prediction (the reconstructed data). In order to estimate the state, we use Eq. 7 to generate the joint representation, where:


Accordingly, to estimate the current super-state we use the innovation with respect to the empty straight motion GAN to find the closest prototype detected by the SOM. This procedure is repeated for all the levels in the hierarchy. This discrete situation estimation can be seen as a way to explore a switching model (see Fig. 1-b), where at the continuous level a hierarchy of GANs associated with different neurons estimates the states, and the discrete level can be modeled by an HMM working on the neurons that classify innovations and their time transitions.

Anomaly detection: Note that a possible abnormality in the observation (e.g., an unusual object or an unusual movement) corresponds to an outlier with respect to the data distribution learned by the two cross-modal networks during training. The presence of the anomaly results in low values of the two discriminator score maps on the predictions, but high values of the two score maps on the observation. Hence, in order to decide whether an observation is normal or abnormal with respect to the scores from the current hierarchy level of GANs, we simply calculate the average value of the innovations between the prediction and observation maps for both modalities, which is obtained from:


The final representation of the private layer for an observation consists of the computed average innovation and the estimated super-state. We define an error threshold to detect abnormal events: when all the levels in the hierarchy of GANs tag the sample as abnormal (e.g., dummy super-state) and the measurement is higher than this threshold, the current measurement is considered an abnormality. Note that the process is closely aligned with the one followed for the SL, although GANs are more powerful, as they can deal with high-dimensional inputs as well as with non-linear dynamic models at the continuous level. This complexity is required by the video variables involved in the PL, as opposed to the low-dimensional positional variables involved in the SL.
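The decision rule can be sketched in a few lines of Python. This is one possible reading of the text, not the authors' code: score maps are plain nested lists, the innovation is taken as the observation score minus the prediction score averaged over both modalities, and a sample is flagged only if every level assigns it a dummy super-state and reports an innovation above the threshold. All names and the threshold value are hypothetical.

```python
def average_innovation(observation_maps, prediction_maps):
    """Mean difference between observation and prediction score maps.

    Both arguments are lists of 2-D score maps (one per modality), given as
    nested lists of equal shape.
    """
    diffs = []
    for obs_map, pred_map in zip(observation_maps, prediction_maps):
        for obs_row, pred_row in zip(obs_map, pred_map):
            diffs.extend(o - p for o, p in zip(obs_row, pred_row))
    return sum(diffs) / len(diffs)

def is_abnormal(level_results, threshold):
    """Flag a sample when all levels agree that it is abnormal.

    level_results: one (is_dummy_superstate, innovation) pair per level.
    """
    return all(dummy and value > threshold for dummy, value in level_results)

# Toy 2x2 score maps for the two modalities (frame task, flow task).
observed = [[[0.9, 0.8], [0.9, 0.8]], [[0.9, 0.8], [0.9, 0.8]]]
predicted = [[[0.1, 0.2], [0.1, 0.2]], [[0.1, 0.2], [0.1, 0.2]]]
signal = average_innovation(observed, predicted)

flag = is_abnormal([(True, signal), (True, 0.9)], threshold=0.5)
```

Averaging over whole score maps means that a small, local anomaly changes the signal only slightly, which matches the behavior reported for a partially visible pedestrian in the experiments.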

III Experimental Dataset

In our experiments an iCab vehicle [24] driven by a human operator is used to collect the dataset (see Fig. 2). We obtained the vehicle’s position, mapped into Cartesian coordinates, from the odometry manager [24], and captured first-person video footage with the vehicle’s built-in camera. The observations are generated by taking the state-space position of the iCab and estimating its flow components over time. Furthermore, for the cross-modal GANs in the PL self-awareness model, we input the captured video frames and their corresponding optical-flow maps.

Fig. 2: Proposed moving entity and closed scene. Panel labels: autonomous vehicle “iCab”, infrastructure.
Fig. 3: Three different action scenarios: (a) perimeter monitoring under the normal situation (training set), and performing the perimeter monitoring task in the presence of an abnormality (test sets): (b) U-turn, and (c) emergency stop.

We aim to detect dynamics that have not been seen previously, based on the normal situation (Scenario I) learned with the proposed method. Scenarios II and III include unseen manoeuvres caused by the presence of pedestrians while the vehicle performs a perimeter control task. Accordingly, three situations (experiments) are considered in this work. Scenario I, or normal perimeter monitoring: the vehicle follows a rectangular trajectory around a building (see Fig. 3-a). Scenario II, or U-turn: the vehicle performs perimeter monitoring and is faced with a pedestrian, so it makes a U-turn to continue the task in the opposite direction (see Fig. 3-b). Scenario III, or emergency stop: the vehicle encounters pedestrians crossing its path and needs to stop until the pedestrians leave its field of view (see Fig. 3-c).

Situations II and III can be seen as deviations from the perimeter monitoring dynamics. When an observation falls outside the superstates, the learned models are not applicable; a dummy neuron is then used to represent the unavailability of an action from the learned experience, and a random filter, in which the control vector in Eq. (2) is set to zero, is considered for prediction to represent the uncertainty over the state derivatives.

IV Experimental results

Representation of normality: The two levels of the proposed self-awareness model, the shared level (modeled by the MJPF) and the private layer (modeled as a hierarchy of GANs), are able to learn the normality. In our experiments normality is defined as Scenario I (Fig. 3-a), which is used to learn both models. As reviewed in Sec. II, both SL and PL represent their situation awareness by a set of super-states together with abnormality signals. We select a period of the normal perimeter monitoring task (see Fig. 5-a) as a test scenario. The result for PL and SL is shown in Fig. 4, which visualizes the learned normality representations. The ground truth labels are shown in Fig. 4-a, and the color-coded detected super-states from PL and SL are illustrated in Fig. 4-b and Fig. 4-c, respectively. The figure shows not only that the patterns of super-states are repetitive and highly correlated with the ground truth, but also that there is a strong correlation between the sequences of PL and SL super-states.






Fig. 4: Normality representations from PL and SL: (a) ground truth labels, where green represents moving straight and blue bars represent curving. The color-coded super-state sequences from PL and SL are shown in (b) and (c), respectively; they are highly correlated with the agent’s real status (a). (d) and (e) show the abnormality signals from PL and SL, respectively. The horizontal axis represents the sample number, and the vertical axis shows the innovation values (abnormality signal).

Note that the abnormality signals are stable for both PL and SL, while in the case of an abnormal situation we expect to observe corresponding spikes in both signals. In order to study such abnormal situations we apply the trained self-awareness model to unseen test sequences. Two different scenarios are selected, in which the moving vehicle performing the perimeter monitoring task has to face abnormal events. In each scenario the agent performs different actions in order to resolve the abnormal situation. The goal of this set of experiments is to evaluate the performance of the proposed self-awareness model (SL and PL) in detecting abnormalities.

Fig. 5: Sub-sequence examples from the testing scenarios reported in the experimental results: (a) perimeter monitoring, (b) U-turn, (c) emergency stop.






Fig. 6: Abnormality in the U-turn avoiding scenario: (a) ground truth labels. (b) and (c) color-coded transitions of states from PL and SL, respectively. (d) and (e) generated abnormality signals (innovation) from PL and SL, respectively. The horizontal axis represents the sample number, and the vertical axis shows the innovation values (abnormality signal).

Avoiding a pedestrian by a U-turn action: In this scenario, which is illustrated in Fig. 3-b, the vehicle avoids a static pedestrian by making a U-turn and then continues the standard monitoring. The goal is to detect the abnormality, which is the presence of the pedestrian and, consequently, the unexpected action of the agent with respect to the normality learned during perimeter monitoring. Fig. 6 shows the result of anomaly detection from the PL and SL representations. The results refer to the highlighted time slice of testing Scenario II (Fig. 5-b).

In Fig. 6-a the green background means that the vehicle moves on a straight line, the blue bars indicate curving, and the red bars show the presence of an abnormal situation (which in this case is the static pedestrian). The abnormality area starts at the first sight of the pedestrian and continues until the avoiding maneuver finishes (end of the U-turn). The sequences of states from PL and SL in Fig. 6-(b, c) follow the same pattern. While the situation is normal, the super-states repeat the expected normal pattern, but as soon as the abnormality begins the super-state patterns change in both PL and SL (e.g., dummy super-states). Furthermore, in the abnormal super-states the abnormality signals show higher values. The abnormality signal generated by SL is shown in Fig. 6-e, and represents the innovation between the prediction and the observation. The abnormality produced by the vehicle is higher while it moves through the path indicated by red arrows in Fig. 5-b. This is due to the fact that the observations are outside the domain on which the superstates were trained; namely, such a state-space configuration is never observed during training. The KF innovation therefore becomes higher in the same time interval because the velocity is opposite to the normal behavior of the model. More specifically, the KF innovation is high due to the difference between the prediction (which assigns a higher probability to going straight) and the likelihood of the observed behavior on a curved path.

The abnormality signal generated by PL, Fig. 6-c, is computed by averaging over the distance maps between the prediction and the observation score maps. When an abnormality begins, this value does not change much, since a small local abnormality (see Fig. 8-c) cannot change the average significantly. However, once the pedestrian is in full sight and the vehicle starts the avoidance action, the abnormality signal rises, since both the observed appearance and the action present an unseen situation. This situation is shown in Fig. 8-(d,e). As soon as the agent returns to a known situation (e.g., curving), the abnormality signal drops.
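The effect of averaging can be made concrete with a small sketch. It assumes the score maps are 2-D grids of values in [0, 1] and uses the mean absolute difference as the distance; the grid size, the patch sizes, and the helper names `mean_distance` and `map_with_patch` are hypothetical and chosen only for illustration.

```python
def mean_distance(pred_map, obs_map):
    """Average of the element-wise absolute differences of two score maps."""
    diffs = [
        abs(p - o)
        for pred_row, obs_row in zip(pred_map, obs_map)
        for p, o in zip(pred_row, obs_row)
    ]
    return sum(diffs) / len(diffs)


def map_with_patch(size, patch):
    """size x size map of zeros with a patch x patch block of ones."""
    return [
        [1.0 if r < patch and c < patch else 0.0 for c in range(size)]
        for r in range(size)
    ]


predicted = map_with_patch(10, 0)      # prediction: nothing unusual
partial_view = map_with_patch(10, 2)   # small local deviation
full_view = map_with_patch(10, 6)      # deviation covers much of the map

print(mean_distance(predicted, partial_view))
print(mean_distance(predicted, full_view))
```

A deviation that covers 4 of 100 cells moves the average by only 0.04, whereas one that covers 36 cells moves it by 0.36. This is why, under these assumptions, the signal stays low at the first partial sight of the pedestrian and rises once the deviation occupies a large part of the frame.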






Fig. 7: Abnormality in the emergency stop scenario: (a) ground truth labels; (b) and (c) color-coded transitions of states from PL and SL, respectively; (d) and (e) generated abnormality signals (innovation) from PL and SL, respectively. The horizontal axis represents the sample number, and the vertical axis shows the innovation values (abnormality signal).

Emergency stop maneuver: This scenario is shown in Fig. 3-c, where the agent performs an emergency stop to let a pedestrian cross. The results of abnormality detection for the highlighted time slice in Fig. 5-c are shown in Fig. 7. In Fig. 7-a the red bars indicate the abnormality areas, where the agent has stopped and waits until the pedestrian has crossed. These areas are represented as dummy super-states in PL (light-blue color in Fig. 7-b) with high scores in the abnormality signal (see Fig. 7-d). The abnormality signal generated by PL increases smoothly as the agent gets a better view of the pedestrian, and reaches its peak when the agent stops with the pedestrian in full view. Once the pedestrian has passed and the agent resumes its straight path, the signal drops sharply. Similarly, the abnormality signal from the SL representation model (see Fig. 7-e) shows two peaks that correspond to high innovation. These peaks represent the abnormality patterns associated with the emergency stop maneuver. In contrast with the PL signal, the SL signal reaches its peak sharply and then returns smoothly to the normal level. This pattern indicates that the vehicle stops immediately and waits for a while, then starts moving again and steadily increases its velocity to continue along the straight path under normal conditions. As a consequence, motion patterns that differ from the predicted ones are detected by means of the innovation. This is also confirmed by the color-coded super-states in Fig. 7-c, where the green and dark-blue states last longer than expected with respect to the normal pattern learned from previous observations.






Fig. 8: Visualization of abnormality: the first column shows the localization over the original frame, the second column shows the predicted frame, and the last column shows the pixel-by-pixel distance over the optical-flow maps. (a) moving straight, (b) curving, (c) first observation of the pedestrian, (d) and (e) performing the avoidance action.

V Discussion

The cross-modal representation: One of the novelties of this paper is the use of GANs for a multi-channel data representation. Specifically, we use both appearance and motion (optical-flow) information: a two-channel approach that previous works have shown to be empirically important. Moreover, we propose a cross-channel approach in which we train two networks that transform raw-pixel images into optical-flow representations and vice versa. The rationale is that the architecture of our conditional generators is based on an encoder-decoder (see Sec. II-B), and one advantage of such channel-transformation tasks is that they prevent the networks from learning a trivial identity function and force them to construct sufficiently informative internal representations.

Private layer and shared layer cross-correlation: The PL and SL levels provide complementary information about situation awareness. For instance, we observed that the PL super-states are invariant to the agent’s location, while the SL super-state representation is sensitive to such spatial information. In other words, the PL representation can be seen as the semantic aspect of the agent’s situation awareness (e.g., moving straight, curving), regardless of the agent’s current location. This is why the pattern of super-state sequences repeats according to the action taken by the agent (see Fig. 4-b). The SL representation, in contrast, includes spatial information, so the sequence of super-states can differ depending on the agent’s location (see Fig. 4-c). This is most evident in the U-turn avoidance experiment (see Fig. 6): after the avoidance maneuver, the PL abnormality signal and super-state sequences return to normal, while the SL abnormality signal remains high because of this dependency on spatial position. In light of the above, the two representations carry complementary information, and finding a cross-correlation between the PL and SL situation representations (e.g., using coupled Bayesian networks) could increase the ability to detect anomalies and consequently strengthen the entire self-awareness model.
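One simple way to picture such a link between the two layers is a table of conditional frequencies estimated from time-aligned super-state sequences. The sketch below is only an assumption about how this could be prototyped, not the coupled Bayesian network itself; the state labels, the function name `conditional_table`, and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_table(pl_states, sl_states):
    """Estimate P(SL super-state | PL super-state) from aligned sequences."""
    counts = defaultdict(Counter)
    for pl, sl in zip(pl_states, sl_states):
        counts[pl][sl] += 1
    return {
        pl: {sl: n / sum(c.values()) for sl, n in c.items()}
        for pl, c in counts.items()
    }


# Toy normal data: PL labels repeat with the action, while SL labels
# also depend on where along the path the action takes place.
pl_seq = ["straight", "straight", "curve", "straight", "straight", "curve"]
sl_seq = ["S1", "S1", "S2", "S3", "S3", "S4"]

table = conditional_table(pl_seq, sl_seq)
print(table["straight"])  # which SL states co-occur with "straight"
print(table["curve"].get("S1", 0.0))  # pair never seen in normal data
```

A PL and SL pair that was never observed together during normal operation receives zero estimated probability, which is one way an event-level mismatch between the two layers could be flagged.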

Modalities alignment: In our work, the alignment between the two modalities, localization (in SL) and visual perception (in PL), is provided by the synchronous time stamps assigned by the sensors. Since both sensors (odometer and camera) share the same time reference, aligned multi-modal data can be collected. However, another advantage of modeling the cross-correlation between the two layers (PL and SL) is that it could provide an extra reference for finding such alignments: the cross-correlation of repetitive patterns in PL and SL can be used for data alignment. With asynchronous sensors, this could help to find the exact alignment.
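The alignment idea can be sketched as a search for the lag that maximizes the cross-correlation between two indicator signals, one per layer. This is a minimal illustration under the assumption that each layer yields a binary "curving" indicator per sample; the function name `best_lag`, the signals, and the lag range are hypothetical.

```python
def best_lag(a, b, max_lag):
    """Lag (in samples) that maximizes sum_t a[t] * b[t + lag]."""
    def score(lag):
        return sum(
            a[t] * b[t + lag]
            for t in range(len(a))
            if 0 <= t + lag < len(b)
        )
    return max(range(-max_lag, max_lag + 1), key=score)


# Toy indicators: the SL signal is the PL signal delayed by two samples.
pl_curving = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
sl_curving = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]

print(best_lag(pl_curving, sl_curving, max_lag=5))  # → 2
```

The recovered lag could then be used to shift one stream before pairing samples. With real data the signals would normally be mean-centered first, and strictly periodic patterns can make the maximum ambiguous.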

VI Conclusion

In this work we presented a multi-perspective approach to anomaly detection for moving agents, with the aim of improving the self-awareness model. The presented self-awareness model considers two levels. The shared level uses a state-space representation from an external observer placed in the EC reference system. The private layer is built from a hierarchy of cross-modal GANs that learn a complex data distribution in a weakly-supervised manner. This model is used to represent the PL self-awareness model of autonomous embodied agents, where we face a highly diverse data distribution. The scores of the Discriminator network are used to approximate the complexity of the data: a set of distance maps between the prediction and the observation scores is used as a criterion for creating another level in the hierarchical structure. This technique makes it possible to break down and solve a complex data distribution incrementally. The experimental results on semi-autonomous ground vehicles show the capability of our methodology to recognize anomalies from multiple viewpoints, namely PL and SL. A future research path could consist of combining information from different sources for decision making and for making the proposed self-awareness model more robust. In particular, situational awareness and self-reactions could be improved by modeling the cross-correlation between PL and SL with multi-modal DBNs.


  • [1] J. Schlatow, M. Moostl, R. Ernst, M. Nolte, I. Jatzkowski, M. Maurer, C. Herber, and A. Herkersdorf, “Self-awareness in autonomous automotive systems,” in DATE, 2017.
  • [2] D. Campo, A. Betancourt, L. Marcenaro, and C. Regazzoni, “Static force field representation of environments based on agents’ nonlinear motions,” EURASIP Journal on Advances in Signal Processing, 2017.
  • [3] M. Baydoun, M. Ravanbakhsh, D. Campo, P. Marin, D. Martin, L. Marcenaro, A. Cavallaro, and C. S. Regazzoni, “A multi-perspective approach to anomaly detection for self-aware embodied agents,” in ICASSP, 2018.
  • [4] V. Bastani, L. Marcenaro, and C. Regazzoni, “Online nonparametric bayesian activity mining and analysis from surveillance video,” Trans. Image Process., 2016.
  • [5] B. T. Morris and M. M. Trivedi, “A survey of vision-based trajectory learning and analysis for surveillance,” TCSVT, 2008.
  • [6] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. P. Seidel, and B. Rosenhahn, “Multisensor-fusion for 3d full-body human motion capture,” in CVPR, 2010.
  • [7] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, and N. Sebe, “Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection,” WACV, 2018.
  • [8] M. Nabi, A. Bue, and V. Murino, “Temporal poselets for collective activity detection and recognition,” in ICCV Workshops, 2013.
  • [9] W. Lin, Y. Zhou et al., “A tube-and-droplet-based approach for representing and analyzing motion trajectories,” PAMI, 2017.
  • [10] M. Nabi, H. Mousavi, H. Rabiee, M. Ravanbakhsh, V. Murino, and N. Sebe, “Abnormal event recognition in crowd environments,” in Applied Cloud Deep Semantic Recognition, 2018.
  • [11] M. Sabokrou, M. Fayyaz, M. Fathy et al., “Fully convolutional neural network for fast anomaly detection in crowded scenes,” CVIU, 2018.
  • [12] H. Rabiee, J. Haddadnia, H. Mousavi, M. Nabi, V. Murino, and N. Sebe, “Crowd behavior representation: an attribute-based approach,” SpringerPlus, 2016.
  • [13] D. M. Ramík, C. Sabourin, R. Moreno, and K. Madani, “A machine learning based intelligent vision system for autonomous object detection and recognition,” Applied Intelligence, 2014.
  • [14] A. Doucet, N. Gordon, and V. Krishnamurthy, “Particle filters for state estimation of jump markov linear systems,” Trans. Signal Process., 2001.
  • [15] J. S. Olier, P. Marín-Plaza, D. Martín, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni, “Dynamic representations for autonomous driving,” in AVSS, 2017.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014.
  • [17] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in ICIP, 2017.
  • [18] M. Baydoun, D. Campo, V. Sanguineti, L. Marcenaro, A. Cavallaro, and C. Regazzoni, “Learning switching models for abnormality detection for autonomous driving,” FUSION, 2018.
  • [19] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in ECCV, 2004.
  • [20] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2016.
  • [21] E. Sangineto, M. Nabi, D. Culibrk, and N. Sebe, “Self paced deep learning for weakly supervised object detection,” TPAMI, 2018.
  • [22] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” arXiv preprint arXiv:1706.07680, 2017.
  • [23] T. Kohonen, Self-Organizing Maps, ser. Physics and astronomy online library, 2001.
  • [24] P. Marín-Plaza, J. Beltrán, A. Hussein, B. Musleh, D. Martín, A. de la Escalera, and J. M. Armingol, “Stereo vision-based local occupancy grid map for autonomous navigation in ros,” VISIGRAPP, 2016.