Fully autonomous systems need perception to navigate through scenes and recognize objects in real environments [1]. Recent advances in signal processing and machine learning can be used to design autonomous systems equipped with a self-awareness module that helps them recognize contextual information while a given task is executed. The capability of detecting abnormal situations based on such self-awareness allows autonomous systems to increase their situational awareness and the effectiveness of their decision-making submodules.
The analysis of observed moving agents for understanding normal/abnormal dynamics in a given scene represents an emerging research field [3, 2, 4]. This paper proposes a methodology for abnormality detection based on multiple sensors that observe the same phenomenon from different perspectives. Abnormalities can first be detected as deviations from Environment Centered (EC) models, i.e., from an observer viewpoint that does not have access to internal agent variables. Such a layer can be defined as a Shared Level (SL) of self-awareness, since the observed information, e.g., position and velocity, can be measured easily by an external observer.
An observed agent can also have further information corresponding to what it can observe from a first person viewpoint (FPV) while a task is performed. Abnormalities related to unexpected observations acquired while performing a task can be considered the essential information to define a Private Layer (PL) of self-awareness: such experiences are available only to the agent itself. Accordingly, an external observer cannot access such information and has to rely solely on SL information.
Analyzing phenomena from different sensory data is definitely not a new problem. In [5], researchers use video data together with orientation information to capture the 3D motion of a human body. In [6], a multi-sensor monitoring system is proposed to prevent accidents and detect falls. Additionally, several researchers have used multiple cameras to recognize abnormalities [7, 8].
One of the main novelties of this work is a strategy that processes data from first-person and external viewpoints and facilitates a subdivision of the subject's behaviors into basic dynamical models (activities). Dynamic normality models and related algorithms can detect abnormalities by fusing shared and private agent information. Identified anomalies are here defined as patterns that have not been seen or learned in previous experiences [9, 3], taking into consideration private and shared perspectives of the same phenomenon.
Observations acquired from the external viewpoint (EV) are composed of agents' positions and velocities with respect to a fixed reference system. Locations and velocities (actions) are analyzed by a Gaussian Process (GP) regression that estimates the most probable agent's action at each position of the whole scene. Results from the GP regression are then clustered into zones through the superpixel algorithm introduced in [12], which encodes action patterns, e.g., going straight or curving.
Images collected synchronously from an FPV are used to compute optical flow at each frame. Video sequences can be seen as PL data related to the reference system of the moving agent itself. The optical flow between video frames can be seen as a private representation of the agent's action in the PL. Accordingly, in order to understand the relation between a given FPV image and its optical flow, a Generative Adversarial Nets (GANs) approach [11] is adopted for training deep networks in a supervised way. Supervision here consists in assuming that the action patterns previously obtained by the GP approach for defining EC normality can be used to train normal visual models.
Both the SL and PL learned models can be used to predict the dynamics of a vehicle performing a task. The models produced by each approach can be used to describe how these two representations are related to each other in normal conditions. This paper shows that the PL representation learned in a supervised way can provide further normality descriptors, enriching the ones obtained in an unsupervised way from the GP for the EC normality representation.
The remainder of the paper is organized as follows: section 2 describes the GP approach (section 2.1) and the GAN (section 2.2.1) for SL and PL self-awareness, respectively. Section 3 presents the dataset used to obtain the results that are detailed in sections 4.1 and 4.2. Conclusions and future research directions are presented in section 5.
2 Proposed Method
2.1 Representation of observed dynamic motion
To model the SL self-awareness, we use a state space representation from an external observer placed in the EC reference system. Accordingly, the state subspace $X$ represents the location of an agent in the environment, whereas $\dot{X}$ represents its velocity. The whole scene can be seen as a grid of possible locations where agents can be. Let such a spatial grid be denoted by $X^*$, i.e., a set of locations that covers the whole environment. Given a set of observations from a moving agent, it is possible to express its positions and velocities as the subsets $X_N$ and $\dot{X}_N$, where $N$ is the total number of observations of the agent. By using $X_N$ and $\dot{X}_N$, it is possible to use a GP approach that approximates velocity information over the spatial grid $X^*$, such that:
$$\dot{X}^* = f(X^*) + \nu, \tag{1}$$
where $\dot{X}^*$ represents an estimation of velocity information for each point of the spatial grid and $f(\cdot)$ takes location information and estimates the expected motion (action) at such position for a given activity. Since agents' actions can be seen as velocities, it is possible to describe them as a GP, where $\nu$ is a zero-mean Gaussian white noise.
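As an illustration of how a velocity field can be regressed over the spatial grid, the following minimal sketch implements standard GP regression in plain Python. The squared-exponential kernel, the hyperparameters `length` and `noise`, the independent treatment of the two velocity components, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 2-D locations (assumed kernel)."""
    d2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-0.5 * d2 / length ** 2)

def solve(K, y):
    """Solve K a = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(K)
    M = [row[:] + [y[i]] for i, row in enumerate(K)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * z for x, z in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_velocity_field(X_obs, V_obs, grid, noise=1e-2, length=1.0):
    """Return (mean velocity, predictive variance) for every grid location."""
    n = len(X_obs)
    K = [[rbf(X_obs[i], X_obs[j], length) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # One weight vector per velocity component (x and y handled independently).
    alphas = [solve(K, [v[c] for v in V_obs]) for c in range(2)]
    field = []
    for g in grid:
        k_star = [rbf(g, x, length) for x in X_obs]
        mean = tuple(sum(k * a for k, a in zip(k_star, alpha)) for alpha in alphas)
        # Predictive variance: k(g, g) - k_*^T K^{-1} k_*
        var = rbf(g, g, length) - sum(
            k * w for k, w in zip(k_star, solve(K, k_star)))
        field.append((mean, var))
    return field

# Toy data: an agent moving along the x axis with unit velocity.
X_obs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
V_obs = [(1.0, 0.0)] * 4
field = gp_velocity_field(X_obs, V_obs, [(1.5, 0.0), (10.0, 10.0)])
for (vx, vy), var in field:
    print(f"velocity=({vx:.2f}, {vy:.2f})  variance={var:.2f}")
```

In this toy case the grid point lying between observations receives a velocity close to the observed one with low variance, whereas the distant grid point falls back to the zero-mean prior with high variance, which is the kind of uncertainty that can be used to discard poorly supported regions.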
In this work, a 2-dimensional case is considered. Therefore, spatial coordinates and time derivatives consist of two components each, $X = [x, y]^T$ and $\dot{X} = [\dot{x}, \dot{y}]^T$, respectively. Accordingly, it is possible to represent $X^*$ as the pixels of an image whose corresponding color values carry information about agents' actions $\dot{X}^*$. In particular, RGB images are considered here, where the Red and Blue channels encode $\dot{x}$ and $\dot{y}$, respectively, and the Green channel is disregarded.
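The image encoding described above can be sketched as follows. The linear mapping of each velocity component from a hypothetical range `[-v_max, v_max]` to 8-bit channel values is an assumption of this example; the paper does not specify the scaling.

```python
def encode_velocity_image(field, v_max):
    """Map field[r][c] = (vx, vy) to an RGB pixel (R, G, B).

    R encodes the x velocity, B encodes the y velocity and G is left at
    zero. The 8-bit linear scaling is an illustrative assumption.
    """
    def to_byte(v):
        v = max(-v_max, min(v_max, v))           # clip to the assumed range
        return round((v + v_max) / (2 * v_max) * 255)

    return [[(to_byte(vx), 0, to_byte(vy)) for vx, vy in row] for row in field]

# A 1x2 toy field: one cell moving right and down, one cell at rest.
image = encode_velocity_image([[(1.0, -1.0), (0.0, 0.0)]], v_max=1.0)
print(image)
```

Once the field is stored as an image, any image-segmentation routine (such as a superpixel algorithm) can be applied to it directly.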
Uncertainties generated by the GP are used to remove information where not enough evidence is observed. Since an image that encodes the GP is available, a superpixel algorithm [12] is applied to discretize the image plane into zones. Each of these zones, $S_m$, is characterized by a quasi-constant velocity model $U_{S_m}$, where $m = 1, \dots, M$ indexes the identified zones. Finally, a linear dynamic model can be defined for each zone as follows:
$$X_{k+1} = A X_k + B U_{S_m} + w_k, \tag{2}$$
where $A$ and $B$ are the state-transition and control-input matrices, $k$ indexes the time, $\Delta k$ is the sampling time (which scales the control input through $B$) and $w_k$ is the process noise. The variable $U_{S_m}$ is a control input that encodes the action (motivation) of the agent. The process for identifying zones where quasilinear models are valid is summarized in the block diagram in Fig. 1.
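A per-zone linear model of this kind can be sketched as below. For illustration only, the state is taken to be the 2-D position, the transition matrix is the identity, the control matrix is the sampling time times the identity, and zones are axis-aligned rectangles; real superpixel zones are irregular, and the zone layout and velocities here are hypothetical.

```python
# Hypothetical zones: bounding box (x0, y0, x1, y1) and quasi-constant velocity.
ZONES = [
    {"box": (0.0, 0.0, 10.0, 2.0), "u": (1.0, 0.0)},   # straight segment
    {"box": (10.0, 0.0, 12.0, 2.0), "u": (0.5, 0.5)},  # corner
]

def zone_velocity(pos, zones):
    """Return the control input of the zone containing pos, or None."""
    for zone in zones:
        x0, y0, x1, y1 = zone["box"]
        if x0 <= pos[0] < x1 and y0 <= pos[1] < y1:
            return zone["u"]
    return None

def predict(pos, zones, dt):
    """One noise-free step of the assumed model: next = pos + dt * u."""
    u = zone_velocity(pos, zones)
    if u is None:
        raise ValueError("position lies outside every known zone")
    return (pos[0] + dt * u[0], pos[1] + dt * u[1])

step1 = predict((9.5, 1.0), ZONES, dt=1.0)   # still in the straight zone
step2 = predict(step1, ZONES, dt=1.0)        # now governed by the corner zone
print(step1, step2)
```

The prediction switches velocity model as soon as the agent crosses a zone boundary, which is what makes a piecewise-linear description of a curved path possible.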
2.1.1 Abnormality detection by using Kalman filter method
It is possible to build a set of Kalman Filters (KFs) based on the built dynamical models shown in equation (2). Each KF is designed for tracking linear motions with low error (innovation) when observed data follows already characterized (normal) behaviors inside identified zones.
As is well known, KFs’ innovations represent residual values produced by measurements while assuming a specific normal model. Such values can be used to express abnormalities since they quantify the deviations from normal learned models in the environment. Innovations can be expressed as:
$$\epsilon_{k,m} = Z_k - \hat{X}_{k|k-1}, \tag{3}$$
where $\epsilon_{k,m}$ is the innovation generated in the zone $S_m$ where the agent is located, $Z_k$ represents the observed spatial data and $\hat{X}_{k|k-1}$ is the KF estimation of the agent's location at the future time $k$, calculated at the time instant $k-1$ using the model in equation (2).
In this work, innovation vectors are composed of two components (one for each axis), and the magnitude of those vectors can be considered a final measure of abnormality, $\theta_k$. Assuming that the observed agent is inside the region $S_m$, it is possible to write:
$$\theta_k = \lVert \epsilon_{k,m} \rVert.$$
In order to evaluate whether an observation is abnormal with respect to the current bank of KFs that encodes the normality in an environment (eq. (2)), an error threshold is defined to distinguish between abnormal and normal behaviors at each time instant. Accordingly, if a certain $\theta_k$ exceeds such threshold, the system considers the current measurement an anomaly.
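The innovation-based test can be sketched with a deliberately simplified Kalman filter. The sketch assumes a position-only state observed directly, isotropic noise (so the covariance reduces to a scalar), and illustrative values for the noise levels and the threshold; the class name `ZoneKF` and all parameters are hypothetical.

```python
import math

class ZoneKF:
    """Minimal KF for one zone: next position = position + dt * u + noise."""

    def __init__(self, x0, u, dt=1.0, q=0.01, r=0.1, p0=1.0):
        self.x, self.u, self.dt = list(x0), u, dt
        self.q, self.r, self.p = q, r, p0   # process, measurement, state variance

    def step(self, z):
        """Process one measurement and return the innovation magnitude."""
        x_pred = [xi + self.dt * ui for xi, ui in zip(self.x, self.u)]
        p_pred = self.p + self.q
        eps = [zi - xi for zi, xi in zip(z, x_pred)]    # innovation vector
        gain = p_pred / (p_pred + self.r)               # scalar Kalman gain
        self.x = [xi + gain * e for xi, e in zip(x_pred, eps)]
        self.p = (1.0 - gain) * p_pred
        return math.hypot(*eps)

# Straight motion at unit speed, followed by a lateral deviation.
kf = ZoneKF(x0=(0.0, 0.0), u=(1.0, 0.0))
measurements = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 1.5)]
THRESHOLD = 0.5   # illustrative value, not the one used in the experiments
theta = [kf.step(z) for z in measurements]
flags = [t > THRESHOLD for t in theta]
print(theta, flags)
```

In a full bank of filters, the filter associated with the zone that currently contains the agent would be the one queried at each time instant.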
2.2 Representation of the agent embodied self awareness
In order to represent the PL of agent self-awareness, Generative Adversarial Networks (GANs) [11] are proposed to learn the normality pattern of the observed scene. GANs are deep networks commonly used to generate data (e.g., images) and are trained using only unsupervised data. The supervisory information in a GAN is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$). During training, $G$ generates new data and $D$ tries to understand whether its input is real (i.e., it is a training image) or produced by $G$. The competition between $G$ and $D$ is helpful for boosting the ability of both $G$ and $D$.
2.2.1 Learning the normal pattern of the observed scene
Two channels are used to learn the normal pattern of the observed scene: appearance (i.e., raw-pixels) and motion (optical flow images) for two cross-channel tasks. In the first task, optical-flow images are generated from the original frames. In the second task, appearance information is estimated from an optical flow image.
Let $F_t$ be the $t$-th frame of a training video and $O_t$ the optical flow obtained using $F_t$ and $F_{t+1}$. $O_t$ is computed using [13]. Fig. 2 shows two networks: $N^{F \to O}$, which is trained to generate optical flow from frames (task 1), and $N^{O \to F}$, which generates frames from optical flow (task 2). In both cases, inspired by [14, 15], our networks are composed of a conditional generator $G$ and a conditional discriminator $D$. $G$ takes as input an image $x$ and a noise vector $z$ (drawn from a noise distribution $\mathcal{Z}$) and outputs an image of the same dimensions as $x$ but represented in a different channel.
For $G$ in both $N^{F \to O}$ and $N^{O \to F}$, we adopt the U-Net architecture, which is an encoder-decoder. $D$ is a PatchGAN discriminator, which is based on a “small” fully-convolutional discriminator. Additional details about the training procedure can be found in [16, 15]. During training, the output of $D$ is averaged over all the grid positions, so that a final score of $D$ is obtained with respect to the input. For testing purposes, we directly use the averaged scores of $D$ as a “detector” which is run over the grid to detect the abnormality in the input frame (see Sec. 2.2.2).
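To make the notion of a score grid concrete, the sketch below computes the size of the output grid of a small fully-convolutional discriminator and averages the per-patch scores into a single frame-level score. The layer configuration (kernel, stride, padding) and the input size are illustrative assumptions, not values reported in the paper.

```python
def conv_out(n, kernel, stride, pad):
    """Spatial output size of a convolution applied to an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

def patch_grid_size(n, layers):
    """Side length of the score grid produced by a stack of convolutions."""
    for kernel, stride, pad in layers:
        n = conv_out(n, kernel, stride, pad)
    return n

def frame_score(patch_scores):
    """Average the per-patch discriminator scores over the whole grid."""
    flat = [s for row in patch_scores for s in row]
    return sum(flat) / len(flat)

# Assumed stack: three stride-2 and two stride-1 convolutions, 4x4 kernels.
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]
grid = patch_grid_size(256, LAYERS)
scores = [[0.5] * grid for _ in range(grid)]   # placeholder patch scores
print(grid, frame_score(scores))
```

Because every grid cell only sees a limited receptive field of the input, individual cells can also be inspected to localize where in the frame the discriminator reacts.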
It is important to highlight that both $\{F_t\}$ and $\{O_t\}$ are here collected using only frames from normal scenarios in the identified zones provided by the GP. The absence of abnormal events in the training phase makes it possible to train the discriminators corresponding to our two tasks without the need for supervised training data: $G$ acts as an implicit supervision for $D$. We hypothesize that abnormal events lie outside the discriminator's decision boundaries, because they represent situations never observed during training and are hence treated by $D$ as outliers. We use a bank of discriminators based on the identified zones provided by the GP, which is grouped into two sets: one trained on straight paths and one trained on curves. The discriminator's learned decision boundaries can be used to detect unseen events, as explained in the next section.
2.2.2 Anomaly detection
Discriminators are used at the testing phase. More specifically, let $D^{F \to O}$ and $D^{O \to F}$ be the patch-based discriminators trained using the two channel-transformation tasks (see Sec. 2.2.1). Given a test frame $F_t$ and its corresponding optical-flow image $O_t$, we first produce the reconstructed $p_O$ and $p_F$ using $G^{F \to O}$ and $G^{O \to F}$, respectively. Then, the patch-based discriminators $D^{F \to O}$ and $D^{O \to F}$ are applied to the first and second tasks, respectively. Such operation results in two scores for the ground-truth observation, $D^{F \to O}(F_t, O_t)$ and $D^{O \to F}(O_t, F_t)$, and two scores for the prediction (reconstructed data), $D^{F \to O}(F_t, p_O)$ and $D^{O \to F}(O_t, p_F)$. The two scores are summed, $S_O = D^{F \to O}(F_t, O_t) + D^{O \to F}(O_t, F_t)$ and $S_P = D^{F \to O}(F_t, p_O) + D^{O \to F}(O_t, p_F)$, and the values in $S_O$ and $S_P$ are normalized into the range $[0, 1]$. Note that a possible abnormality in the observation (e.g., an unusual object/movement) corresponds to an outlier with respect to the data distribution learned by $N^{F \to O}$ and $N^{O \to F}$ during training. The presence of the anomaly results in low values of the two prediction scores, but high values of the two observation scores.
Hence, in order to decide whether an observation is abnormal with respect to the scores from the current bank of discriminators, we simply measure the distance between the prediction scores and the observation scores, as shown in equation (4):
$$\theta^{PL}_t = \lvert S_O - S_P \rvert. \tag{4}$$
An error threshold is defined to detect the abnormal events: when $\theta^{PL}_t$ is higher than such threshold, the current agent's measurement is considered an abnormal situation.
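The scoring procedure can be sketched as follows, starting from already computed discriminator outputs. The joint min-max normalization over the whole sequence, the absolute difference as distance, and the toy scores and threshold are assumptions of this example; the discriminator outputs themselves are placeholders rather than results of a trained network.

```python
def normalize_jointly(a, b):
    """Min-max normalize two score sequences into [0, 1] with a shared range."""
    lo, hi = min(a + b), max(a + b)
    if hi == lo:
        return [0.0] * len(a), [0.0] * len(b)
    scale = hi - lo
    return [(v - lo) / scale for v in a], [(v - lo) / scale for v in b]

def pl_abnormality(obs_scores, pred_scores, threshold):
    """obs_scores / pred_scores: per-frame (task 1, task 2) discriminator scores."""
    s_obs = [t1 + t2 for t1, t2 in obs_scores]     # summed observation scores
    s_pred = [t1 + t2 for t1, t2 in pred_scores]   # summed prediction scores
    s_obs, s_pred = normalize_jointly(s_obs, s_pred)
    distance = [abs(o - p) for o, p in zip(s_obs, s_pred)]
    return distance, [d > threshold for d in distance]

# Two normal frames followed by one frame with an unseen event.
obs = [(0.4, 0.4), (0.45, 0.4), (0.9, 0.9)]
pred = [(0.4, 0.4), (0.42, 0.4), (0.1, 0.1)]
distance, flags = pl_abnormality(obs, pred, threshold=0.5)
print(distance, flags)
```

On normal frames the observation and prediction scores stay close, so the distance remains near zero; on the last frame they diverge and the frame is flagged.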
The proposed approach is validated with data acquired from a real vehicle during a perimeter monitoring task. The ’iCab’ vehicle is equipped with several heterogeneous sensors [17]. In this work, we consider data related to the vehicle's position and images grabbed from a frontal on-board camera. Two different scenarios are considered, consisting of a standard perimeter surveillance task and anomalies while executing it.
Scenario 1: vehicle performs a rectangular path around the environment (perimeter monitoring), see Fig.2(a).
In both scenarios, the vehicle executes the corresponding task several times in each experiment.
This section presents the results of the proposed methods applied to the dataset presented in section 3. The normal behavior corresponds to the perimeter monitoring defined in Scenario 1.
4.1 Shared Level Self Awareness abnormality detection
As discussed in subsection 2.1, Fig. 7 shows the segmentation of the GP into zones. In each zone, quasi-constant velocity models are valid. Large and small zones represent the action patterns for going straight and curving, respectively. Additionally, each zone is used to create a KF valid in that specific area. As explained in subsection 2.1.1, by considering innovations generated by the bank of KFs based on the perimeter control task, it is possible to identify abnormalities simply by observing new trajectory data that do not correspond to the already characterized models. The innovation magnitude $\theta_k$ measures the abnormality level at the time instant $k$. High innovation values indicate the presence of possible anomalies in the scene. By processing position measurements from Scenario 2 (section 3) and analyzing innovations with respect to the normality model, it is possible to detect anomalies. An abnormality threshold is applied, and the resulting anomaly detection results are shown in Fig. 7. It is possible to find a pattern composed of two abnormal peaks that are associated with the avoidance of the standing pedestrian. A uniform anomaly pattern is not formed because of the straight component of the avoidance maneuver, which follows the regular perimeter control behavior. In addition, below the threshold, two other behaviors can be recognized, i.e., the straight and curved tracks performed by the vehicle. The lowest abnormality levels correspond to the straight parts of the track. Some abnormality peaks below the threshold value are created when the vehicle turns, due to slightly different turning angles across the experiment laps.
4.2 Private Level Self Awareness abnormality detection
The bank of GANs is trained on the subsets of data based on the GP zones. In our experiments, the bank of GANs is composed of two major subsets, one for straight paths and one for curves, see Fig. 7. Each GAN detects the abnormality in the corresponding set on which it is trained. The self-awareness model is tested on the second scenario discussed in section 3. Anomaly detection results associated with the PL, using the proposed bank of GANs, are shown in Fig. 6. Three signals are shown in Fig. 6: the green and blue signals show the scores computed by the two GANs of the bank, each trained on one of the two subsets. The red signal indicates the final abnormality measurement, which is defined as the minimum of the two GAN signals. As expected, the abnormality measurement obtained in the PL is aligned with the SL results shown in Fig. 5.
Different parts of the curve can be associated and explained by considering the corresponding images acquired from the on-board sensor. Specifically, the small peak identified with number can be justified by the presence of the pedestrian in the field of view of the camera: the vehicle has not yet started the avoidance maneuver; thus, it can be seen as a pre-alarm. The small peak in corresponds to peak in , the latter is smaller due to the posture of the pedestrian, see corresponding images and . The areas of the curve identified with numbers and or and correspond to the starting point of the abnormal maneuver and the avoiding behavior itself: it can be seen that peaks and are higher than the selected threshold and therefore correspond to an anomaly. After the small peak , which corresponds to the closing part of the avoidance turn, the vehicle goes back to the standard behavior. In particular, at this point of the curve, the vehicle is actually turning. In the wider area (from to secs.), the ’iCab’ is moving straight. The slightly higher level of the abnormality curve in straight areas can be explained by noise related to the vibration of the on-board camera caused by the fast movement of the vehicle when increasing its speed.
It is notable that the signal generated by the GAN trained only on straight paths becomes higher in the curving areas. Similarly, the GAN trained on curves generates higher scores on the straight path. However, both GANs can detect the abnormality area (pedestrian avoidance), where both generate a high abnormality score.
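The fusion rule that takes the minimum over the bank can be sketched as below. The signal values and the threshold are made up for illustration and do not reproduce the experimental curves.

```python
def fuse_bank(signals):
    """Final abnormality signal: per-frame minimum over all models of the bank."""
    return [min(values) for values in zip(*signals)]

# Toy signals: each model scores high outside the zone type it was trained on.
gan_straight = [0.1, 0.1, 0.7, 0.9, 0.1]   # frame 2 is a regular curve
gan_curve = [0.6, 0.6, 0.1, 0.8, 0.6]      # frame 3 is the unseen maneuver
fused = fuse_bank([gan_straight, gan_curve])
THRESHOLD = 0.5   # illustrative value
anomalous_frames = [i for i, v in enumerate(fused) if v > THRESHOLD]
print(fused, anomalous_frames)
```

Taking the minimum means a frame is reported as abnormal only when no model of the bank can explain it, which suppresses the high scores that each model produces outside its own training zones.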
We presented a multi-perspective approach to detect anomalies for moving agents. The obtained results demonstrate the capability of our methodology to recognize anomalies using multiple viewpoints, namely the PL and the SL. A future research path could consist in combining information from different sources for decision making and in making the proposed self-awareness model more robust. In particular, situational awareness and self-reactions could be improved with respect to the existing literature.
- [1] D. M. Ramík, C. Sabourin, R. Moreno, and K. Madani, “A machine learning based intelligent vision system for autonomous object detection and recognition,” Applied Intelligence, vol. 40, no. 2, pp. 358–375, Mar 2014.
- [2] D. Campo, A. Betancourt, L. Marcenaro, and C. Regazzoni, “Static force field representation of environments based on agents’ nonlinear motions,” Eurasip Journal on Advances in Signal Processing, vol. 2017, no. 1, 2017.
- [3] V. Bastani, L. Marcenaro, and C. S. Regazzoni, “Online nonparametric bayesian activity mining and analysis from surveillance video,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2089–2102, May 2016.
- [4] B. T. Morris and M. M. Trivedi, “A survey of vision-based trajectory learning and analysis for surveillance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1114–1127, Aug 2008.
- [5] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. P. Seidel, and B. Rosenhahn, “Multisensor-fusion for 3d full-body human motion capture,” June 2010, pp. 663–670.
- [6] Y. Charlon, W. Bourennane, F. Bettahar, and E. Campo, “Activity monitoring system for elderly in a context of smart home,” IRBM, vol. 34, no. 1, pp. 60–63, 2013, Digital Technologies for Healthcare.
- [7] E. B. Ermis, V. Saligrama, P.-M. Jodoin, and J. Konrad, “Abnormal behavior detection and behavior matching for networked cameras,” 2008.
- [8] R. Emonet, J. Varadarajan, and J.-M. Odobez, “Multi-camera open space human activity discovery for anomaly detection,” 2011, pp. 218–223.
- [9] K. Kim, D. Lee, and I. Essa, “Gaussian process regression flow for analysis of motion trajectories,” 2011, pp. 1164–1171.
- [10] D. Campo, M. Baydoun, L. Marcenaro, A. Cavallaro, and C. Regazzoni, “Modeling and classification of trajectories based on a gaussian process decomposition into discrete components,” in 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017, IEEE Computer Society, 2017.
- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., pp. 2672–2680. Curran Associates, Inc., 2014.
- [12] Z. Li and J. Chen, “Superpixel segmentation using linear spectral clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1356–1363.
- [13] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in ECCV 2004, 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004, Proceedings, 2004.
- [14] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” arXiv preprint arXiv:1706.07680, 2017.
- [15] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, and N. Sebe, “Abnormal event detection in videos using generative adversarial nets,” in 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017, 2017.
- [16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [17] P. Marín-Plaza, J. Beltrán, A. Hussein, B. Musleh, D. Martín, A. de la Escalera, and J. M. Armingol, “Stereo vision-based local occupancy grid map for autonomous navigation in ros,” in 11th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2016, 2016.