Perceptual Attention-based Predictive Control

by Keuntaek Lee, et al.
Georgia Institute of Technology

In this paper, we present a novel information processing architecture for end-to-end visual navigation of autonomous systems. The proposed information processing architecture is used to support a perceptual attention-based predictive control algorithm that leverages model predictive control, convolutional neural networks and uncertainty quantification methods. The key idea relies on using model predictive control to train convolutional neural networks to predict regions of interest in the input visual information. These regions of interest are then used as input to the Macula-Network, a 3D convolutional neural network that is trained to produce control actions as well as estimates of epistemic and aleatoric uncertainty in the incoming stream of data. The proposed architecture is tested on simulated examples and a 1:5 scale terrestrial vehicle. Experimental results show that the proposed architecture outperforms previous approaches on early detection of novel object/data which are outside of the initial training set. The proposed architecture is a first step towards using end-to-end perceptual control policies in safety-critical domains.



I Introduction

For autonomous systems to operate in uncertain environments, they must be equipped with robust decision-making capabilities that use a variety of perceptual modalities, including vision. Recent advancements in Artificial Intelligence and Deep Learning have facilitated the development of algorithms that integrate perception and control in a holistic fashion. The resulting perceptual control policies offer unique capabilities with respect to generalization, representation, and performance in tasks such as vision-based navigation.

Prior work on vision-based navigation relies on image classification through object detection and segmentation. Shelhamer et al. [21], Ren et al. [19], and He et al. [8] introduced ways to improve performance in instance segmentation, including object detection and semantic segmentation. The Region Proposal Network (RPN) [19] determines the Regions of Interest (ROIs) on the image, and for each ROI the network determines the class label of the object by ROI-pooling [7]. Because of the bounding box refinement step in the RPN, the ROI boxes have different sizes. Classifiers do not handle variable input sizes well because they usually require the input size to be fixed. This is where ROI-pooling comes into play. He et al. [8] introduced the ROIAlign technique to solve the quantized stride problem of ROI-pooling and showed excellent results in instance segmentation. ROIAlign preserves the spatial orientation of features without loss of data.

Alternative methodologies for performing vision-based navigation are via Imitation Learning (IL), also referred to as learning from demonstration. In the IL framework, the learning algorithm has access to an expert policy to take advantage of. This expert policy can come, for instance, from human demonstrations or from a Model Predictive Control (MPC) controller. For an autonomous driving task, Bojarski et al. [3] proposed an approach for learning to drive a full-size car autonomously directly from vision data. Moreover, Pan et al. [17] demonstrated an online approach of end-to-end IL using DAgger [20] for the high-speed autonomous driving task.

The tremendous success of these methods, however, cannot diminish the importance of safety, because conventional deterministic Deep Neural Network policies are vulnerable to adversarial attacks [9, 23, 16]. Lee et al. [13] address the problem of incorporating safety by using the Bayesian approximation approach to quantify uncertainty in the output network control policy. By using a Bayesian neural network, the authors were able to pass the control authority to an optimal controller when the network outputs a large variance/uncertainty. One of the major limitations of this approach is that the uncertainty at the output of the Bayesian neural network increases only when the vehicle is very close to the object. As a result, the time horizon within which the control authority is passed to an expert or fully observable predictive controller is small. This characteristic may result in an abrupt change in the controls or aggressive maneuvers, which in turn may result in instabilities and ultimately accidents.

In this work, we view perceptual control policies as Information Processing Architectures (IPAs) and propose a novel architecture. The proposed IPA is used to support a Perceptual Attention-based Predictive Control (PAPC) algorithm that is capable of detecting objects at far distances while performing control using vision. Our approach has the following ingredients: (1) it uses IL to learn a perceptual controller; (2) it builds upon the Bayesian approach; and (3) it incorporates a novel attention mechanism that robustifies the detection of new objects even when these objects are located far from the vehicle. PAPC takes advantage of an MPC expert by using the future state trajectories to determine ROIs on the input image. In PAPC, decision-making and perception are tightly coupled, since predicted state trajectories force perception to focus on the areas of the input visual information that are relevant to the future motion of the vehicle. This attention mechanism enables early detection of unseen situations, such as the cases when a new obstacle appears in the driving lane. When such situations arise, the network policy is to concede control to a safe policy or expert, such as a fully observable MPC controller.

In summary, the contributions of this work are provided as follows:

  • We introduce the Model Prediction-Network (MP-Net) for learning trajectories represented as splines. The MP-Net is trained using input-output pairs that consist of images and state trajectories generated by MPC.

  • We use MP-Net to determine ROIs in the input visual information. These ROIs are areas smaller than the initial image, with variable resolution determined by the MP-Net predicted trajectory.

  • We introduce the Macula-Network (Macula-Net), a 3D Convolutional Neural Network (CNN). The Macula-Net uses as input the aforementioned ROIs and generates the corresponding controls as well as estimates of aleatoric and epistemic uncertainty. The Macula-Net is trained in a Bayesian fashion using input-output pairs consisting of ROIs and corresponding control commands generated by MPC.

  • We integrate all the aforementioned blocks and create the PAPC algorithm. PAPC detects novel objects at far distances from the vehicle while navigating it in an off-road environment. Detection of these objects is performed without any image classification or object detection.

  • The PAPC algorithm is tested and compared against prior state-of-the-art solutions. Experiments are performed in simulation as well as on real hardware and demonstrate that PAPC outperforms these solutions.

The remainder of the paper is organized as follows: In Section II, we briefly review some preliminaries used in our work. In Section III, we introduce the Model Prediction Network, which predicts the future location of the vehicle in pixel coordinates used to construct ROI windows. In Section IV, we introduce the Macula-Net, which processes the ROIs and outputs a control mean and variance. We also detail our Perceptual Attention-based Predictive Control (PAPC) algorithm. Section V details simulation and real hardware experiments with analysis and comparisons of the proposed methods. Finally, we conclude and discuss future directions in Section VI and Section VII.

II Preliminaries

In this section, we provide the building blocks of the proposed Information Processing Architecture (IPA) for perceptual control. The aforementioned building blocks are MPC, Imitation Learning (IL), Bayesian Neural Networks, and B-Splines for trajectory representation.

II-A Model Predictive Optimal Control

MPC-based optimal controllers (e.g. iterative Linear Quadratic Gaussian/Model Predictive Control Differential Dynamic Programming (iLQG/MPC-DDP) [24], MPPI [26]) provide planned control trajectories, given an initial state and a cost function by solving the optimal control problem. An optimal control problem whose objective is to minimize a task-specific cost function can be formulated as follows:

$$V(x(t_0), t_0) = \min_{u(t)} \left[ \phi(x(T), T) + \int_{t_0}^{T} \ell(x(t), u(t), t)\, dt \right]$$

subject to dynamics $\dot{x}(t) = F(x(t), u(t), t)$, where $x$ represents the system states, $u$ represents the control, $\phi$ is the state cost at the final time $T$, $\ell$ is the running cost, and $V$ is the value function. By solving this optimization problem, we get the future optimal state trajectories from the optimal control trajectories. In this paper, we take advantage of the optimal state and control trajectories provided by MPC to train perceptual control policies and design an attention mechanism. As will be explained later, the state trajectories are used to train CNNs to predict ROIs from raw images, while the control trajectories are used to train another CNN to predict control from the aforementioned ROIs.

II-B End-to-End Control via Imitation Learning

IL is one way to learn how to do a specific task by imitating a teacher’s or an expert’s control policy. In IL settings, it is usually assumed that the expert is perfect and always makes optimal decisions. The IL framework allows us to do end-to-end control since it bypasses all the burdensome steps in navigation (perception, filtering, localization, path planning, etc.) and directly applies control only with given observations.

The goal of IL is to learn a policy that minimizes the difference between the task-specific cost incurred by the expert’s policy and that incurred by the learner. To achieve this goal, the learned policy should converge to the expert’s policy. For deep learning based IL, one loss function that can be used to train the learner network for this regression problem is the mean squared error between the network’s predictions and the ground truth control actions labeled by an expert.

II-C Bayesian Neural Networks

The ability to assess how far a DNN's input is from its training set is essential for deploying DNNs in safety-critical applications. To incorporate this capability into perceptual end-to-end control policies, we use Bayesian neural networks. Currently, in the field of Bayesian neural networks, Bayes by backpropagation [2] and Monte Carlo (MC) dropout [6] are widely used across applications. The MC-dropout method uses the dropout technique to build a probability distribution over network weights, which allows us to obtain the distribution of the prediction at test time. The spread of this output distribution reflects the input: the trained Bayesian network outputs a large variance if the input distribution at test time differs substantially from the training input distribution.
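The sampling mechanics of MC-dropout can be sketched as follows. This is a minimal illustration only: the one-layer linear "network", the dropout rate, and every function name are assumptions made for the example and are not the models used in this work. It shows that keeping dropout active at test time and repeating the forward pass yields a sample mean and a sample variance for the prediction.

```python
import random
import statistics


def dropout_forward(x, weights, p_drop, rng):
    """One stochastic forward pass of a toy linear model with dropout on the weights."""
    keep = 1.0 - p_drop
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            # Inverted-dropout scaling keeps the expected output unchanged.
            total += xi * wi / keep
    return total


def mc_dropout_predict(x, weights, p_drop=0.2, n_samples=25, seed=0):
    """Run n_samples stochastic passes and summarise them as (mean, variance)."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pvariance(samples)


weights = [0.5, -0.3, 0.8]  # hypothetical trained weights
mean, variance = mc_dropout_predict([1.0, 2.0, 3.0], weights)
```

In this toy model the variance only reflects the random masks; in a trained deep network the same procedure is what produces the larger spread on inputs that differ from the training data.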

Kendall and Gal [11] introduced the heteroscedastic loss function, which provides two different notions of uncertainty, aleatoric and epistemic. Aleatoric uncertainty captures noise inherent in the observations (for example, sensor noise or partial observability of the environment), whereas epistemic uncertainty captures uncertainty in the model that arises from limited training data. In this work, we use this heteroscedastic loss function to train our Bayesian network and use the network’s output variance for the early detection of novel inputs.

II-D B-spline

The B-spline is a collection of Bezier splines that are defined by a set of knot coordinates around which each spline is centered [4]. This set of splines has the following continuity requirements: i) The end of the previous curve must have the same value as the start of the next. ii) The first and second derivatives must be conserved between the intersecting points. Therefore, a high-degree B-spline can smoothly approximate a curve. The equation for a k-degree B-Spline is formulated below.

$$S(t) = \sum_{i=0}^{n} P_i \, B_{i,k}(t)$$

where $P_i$ ($i = 0, \dots, n$) are control points and $B_{i,k}(t)$ are the basis functions defined using the recursive Cox-de Boor formula [5]. In this work, the B-spline coefficients were used to train the Model Prediction Network described in the next section.
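The Cox-de Boor recursion and the evaluation of a spline at chosen parameter values can be sketched as below. This is an illustrative sketch under stated assumptions: the clamped uniform knot vector, the parameter range [0, 1], the cubic degree, and the four 2-D control points are all choices made for the example, not details taken from this work.

```python
def basis(i, k, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    # Terms with a zero denominator (repeated knots) are defined to be zero.
    left = 0.0 if left_den == 0 else (t - knots[i]) / left_den * basis(i, k - 1, t, knots)
    right = 0.0 if right_den == 0 else (
        (knots[i + k + 1] - t) / right_den * basis(i + 1, k - 1, t, knots))
    return left + right


def clamped_knots(n_ctrl, k):
    """Clamped uniform knots on [0, 1]; the curve then starts and ends at the end control points."""
    n_inner = n_ctrl - k - 1
    inner = [(j + 1) / (n_inner + 1) for j in range(n_inner)]
    return [0.0] * (k + 1) + inner + [1.0] * (k + 1)


def evaluate(ctrl_pts, k, t):
    """Evaluate the degree-k B-spline with 2-D control points at parameter t in [0, 1]."""
    knots = clamped_knots(len(ctrl_pts), k)
    t = min(t, 1.0 - 1e-12)  # spans are half-open, so nudge t = 1 into the last span
    w = [basis(i, k, t, knots) for i in range(len(ctrl_pts))]
    x = sum(wi * p[0] for wi, p in zip(w, ctrl_pts))
    y = sum(wi * p[1] for wi, p in zip(w, ctrl_pts))
    return (x, y)


# Hypothetical control points in pixel coordinates (four 2-D points, eight numbers).
ctrl = [(0.0, 0.0), (0.0, 10.0), (10.0, 20.0), (30.0, 30.0)]
focal_points = [evaluate(ctrl, 3, t) for t in (0.25, 0.5, 0.75, 1.0)]
```

Because the curve is defined for every parameter value, the number of sampled points can be changed without changing the control points.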

III Model Prediction Network

The MPC-based controllers provide the future state (e.g. position, velocity) trajectories. Inspired by MPC, we introduce a network which predicts a robot’s future positions in the image space. Based on the predicted future trajectories, we find ROIs that the robot can focus on.

In our attention mechanism, we find the ROIs on the image by mapping the MPC’s state trajectory from the original state space to a corresponding trajectory in pixel coordinates, as seen in Fig. 1. This allows us to “see” where in the image the car will be in future timesteps. We then pick a specific number of focal points (Fig. 1 (D)) along the obtained pixel trajectory, which we use to create the ROI windows. This attention mechanism allows our safe policy to be sensitive even to unseen obstacles that are far away and small in its image view, as long as they are in the way of its trajectory.

Fig. 1: An overview of the Model Prediction Network (MP-Net). (B) From IMU/GPS data, the system states are estimated and the model predictive controller generates the model’s future path/position in the world coordinates. From (A) the input image, the MP-Net is trained to predict (C) the B-spline coefficients of the predicted state trajectory in the pixel coordinates. From the spline coefficients predicted by the MP-Net, we reconstruct the spline and (D) choose focal points from the reconstructed spline.

The mapping of the state trajectory from state space to pixel coordinates is implicitly learned with a deep convolutional neural network (CNN), which we refer to as the Model Prediction Network (MP-Net). MP-Net has a network structure similar to VGG16 [22], a widely used deep convolutional neural network structure which learns the mapping between input images and the corresponding class labels. In MP-Net, since we deal with a regression problem, we train the network with the mean squared error loss. This network takes an image as input and outputs a trajectory in pixel coordinates. In Section III-A, we describe how we obtain the targets to train this network, and in Section III-B, we detail how the MP-Net is trained on those targets.

Fig. 2: An overview of the Perceptual Attention-based Predictive Control (PAPC) algorithm. (A) Input image with size (64, 128, 3), normalized from [0, 255] to [0, 1]. (B) Predicted focal points (yellow) from MP-Net trained with MPC outputs. (C) Constructed ROIs from the predicted focal points. (D) Resized ROIs with the same size (32, 32, 3). Bigger ROI loses more resolution by downsampling. (E) Stacked 2D images into 3D data (4, 32, 32, 3). This 3D data resembles the input to the macula.

III-A Targets for MP-Net using Coordinate Transformation

As seen in Fig. 1, MP-Net needs to project the vehicle’s future state trajectory described in the world coordinates onto a 2D image in a moving frame of reference. This coordinate transformation technique is widely used in 3D computer graphics [25]. The coordinate transformation consists of 4 steps: world to robot, robot to camera, camera to film, and film to pixel coordinates.

In this work, we follow the convention in the computer graphics community and set the Z (optic)-axis as the vehicle’s longitudinal (roll) axis, the Y-axis as the axis normal to the road, the positive direction being upwards, and the X-axis as the axis perpendicular to the vehicle’s longitudinal axis, the positive direction pointing to the left side of the vehicle.

Let us define roll, pitch, yaw angles as and the camera (vehicle) position in the world coordinates. The camera focal length is defined as . Then, we construct the rotation matrices around the U, V, W-axis , the translation matrix , the robot-to-camera coordinate transformation matrix and the projection matrix as:

where the projection matrix projects the point in the camera coordinates into the film coordinates using the perspective projection equations [25] and the offsets and transform the film coordinates to the pixel coordinates by shifting the origin.

The total rotation matrix is computed as

and the matrix , transforming the world coordinates to the robot coordinates by translation and rotation, is calculated as

Then, after converting the X, Y, Z-axes to follow the convention in the computer vision community through , the projection matrix converts the camera coordinates to the pixel coordinates. Finally, we get the matrix, which transforms the world coordinates to the pixel coordinates:

To obtain the vehicle (camera) position in the pixel coordinates (u,v):


However, this coordinate-transformed point in the pixel coordinates has the origin at the top left corner of the image. In our work, as we deal with the state trajectory of the vehicle, we define the new origin at the bottom center of the image , where and represents the height and width of the image, and rotate the axes by switching and . Finally, we subtract from and get the final :
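The four-step chain (world to robot, robot to camera, camera to film, film to pixel) can be illustrated with a simple pinhole model. The sketch below is not the transformation used in this work: it assumes a world frame with x forward, y left and z up, a yaw-only vehicle rotation, a camera looking along the heading, and pixel coordinates with the origin at the top left. The focal length `f` and principal point `(cx, cy)` are hypothetical parameters, and the final re-origin to the bottom center of the image is omitted.

```python
import math


def world_to_pixel(point_w, cam_pos_w, yaw, f, cx, cy):
    """Project a 3-D world point to (u, v) pixel coordinates, or None if behind the camera."""
    # Step 1: world -> robot. Translate to the camera position, then undo the yaw.
    dx = point_w[0] - cam_pos_w[0]
    dy = point_w[1] - cam_pos_w[1]
    dz = point_w[2] - cam_pos_w[2]
    c, s = math.cos(yaw), math.sin(yaw)
    forward = c * dx + s * dy
    left = -s * dx + c * dy
    up = dz
    # Step 2: robot -> camera (x right, y down, z along the optical axis).
    xc, yc, zc = -left, -up, forward
    if zc <= 0:
        return None
    # Step 3: camera -> film (perspective division by depth).
    # Step 4: film -> pixel (shift the origin by the principal point).
    u = f * xc / zc + cx
    v = f * yc / zc + cy
    return (u, v)


# A ground point 5 m ahead of a camera mounted 1 m above the ground.
pixel = world_to_pixel((5.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.0, f=100.0, cx=64.0, cy=32.0)
```

Applying this function to every point of a planned trajectory gives the pixel trajectory to which a spline can then be fitted.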


III-B Training MP-Net

Instead of training the MP-Net to predict the entire trajectory in the pixel coordinates, we train it to learn the spline coefficients of the trajectory. This is possible because the MPC trajectories are simple and smooth enough to be represented with splines. This greatly simplifies the regression problem, without jeopardizing performance. To train for spline coefficients, we first fit a spline through the pixel trajectory (obtained as detailed in the previous section) and we regress on the spline coefficients. From predicted spline coefficients, we reconstruct a spline and sample a fixed number of focal points to create ROIs.

Another way to get the focal points is directly regressing them in pixel space. However, this is not flexible to changes in the number of focal points, as the network would have to be re-trained for a different number of points. Our spline-learning approach allows us to generate any number of focal points, which proved to be very useful during experimentation.

We compared the prediction error of the spline-learning and the direct focal points learning method. For a fair comparison, we used the same CNN architecture for both methods. For spline-learning, we trained the network to predict the eight B-spline coefficients, and for direct focal points learning, we trained the same model to predict the four focal points in the pixel coordinates. Note that although the spline-learning method outputs spline coefficients, we can still evaluate the spline at specific locations to obtain the focal points. Our experiments showed that the spline-learning method required far fewer training epochs and clearly outperformed the direct focal points learning approach, which was trained with 100 times more training epochs. The average testing error (MSE) in pixel space was 0.4 for the spline-learning method and 25.2 for the direct focal points learning method. Here, we argue that even with the same number of values, the spline coefficients carry much more information than pixel position values do.

Fig. 3: The Macula-Network structure having 3D image data as an input and control action (mean and variance) as an output. The network is the 3D version of VGG16 Network [22] with a Bayesian scheme by dropout.

IV Perceptual Attention-based Predictive Control

As described in Fig. 2, once we obtain the focal points with the MP-Net, we construct ROI windows according to these focal points and feed them into the Macula-Net, named after the central part of the retina, where vision is clearest and resolution is highest. The Macula-Net takes these ROIs and outputs a control mean and variance via the Bayesian dropout method [13, 6]. For the architecture of the Macula-Net (Fig. 3), we adopted the 3D version of VGG16 [22], even though researchers have recently developed smaller network structures with better accuracy. We chose it for the simplicity of the VGG structure, on top of which we can directly apply the Concrete Dropout [27] method. Even though the network is not small compared to other structures, we are still able to get a good approximation of the Bayesian network via Monte Carlo sampling with around 25 samples in real time (20 Hz).

The Macula-Net is trained using the heteroscedastic loss function [11] to produce the distribution as an output. This loss function is defined as


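$$\mathcal{L} = \frac{1}{2\sigma^{2}} \lVert y - \mu \rVert^{2} + \frac{1}{2} \log \sigma^{2}$$

where $y$ is the target data, and $\mu$ and $\sigma$ represent the mean and standard deviation of the predictive distribution.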
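A heteroscedastic regression loss of the kind introduced by Kendall and Gal [11] can be sketched as follows. The sketch assumes the common Gaussian negative log-likelihood form with a network that outputs the log-variance for numerical stability; the function name and the scalar-output setting are illustrative choices, not details confirmed by this work.

```python
import math


def heteroscedastic_loss(targets, means, log_vars):
    """Mean Gaussian negative log-likelihood with a predicted variance per sample.

    Each term is 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, where s = log(sigma^2).
    """
    total = 0.0
    for y, mu, s in zip(targets, means, log_vars):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(targets)


# Same prediction error, two different predicted variances.
confident = heteroscedastic_loss([2.0], [0.0], [0.0])           # sigma^2 = 1
uncertain = heteroscedastic_loss([2.0], [0.0], [math.log(4.0)])  # sigma^2 = 4
```

The first term lets the network discount a large residual by predicting a large variance, while the log term penalizes predicting a large variance everywhere.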

The ROIs are constructed as follows:

  1. Define the fovea focal point $p_f$ as the farthest focal point along the spline.

  2. Construct the smallest ROI, referred to as fovea, as a window of size 32x32 with $p_f$ as its center.

  3. For each of the other focal points $p_i$, construct an ROI with center


    and a window size of (), where


In this way, each ROI can cover the corresponding focal point and the fovea with some margin, defined manually.
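One way to build windows with the stated property (each window covers its own focal point and the fovea focal point, with a manual margin) is sketched below. The centre and size rules for the non-fovea windows are illustrative guesses, not the formulas of this work: the centre is taken as the midpoint between the focal point and the fovea focal point, and the side length as their largest axis-wise separation plus a margin on both sides. The function name and the margin value are hypothetical.

```python
def build_rois(focal_points, fovea_size=32, margin=8):
    """Return square windows as (center_u, center_v, size), with the fovea first.

    ASSUMPTION: focal_points are ordered along the spline, farthest point last.
    ASSUMPTION: midpoint centre and extent-plus-margin size are illustrative rules.
    """
    fovea = focal_points[-1]
    rois = [(fovea[0], fovea[1], fovea_size)]
    for p in focal_points[:-1]:
        center_u = (p[0] + fovea[0]) / 2.0
        center_v = (p[1] + fovea[1]) / 2.0
        extent = max(abs(p[0] - fovea[0]), abs(p[1] - fovea[1]))
        size = max(fovea_size, extent + 2 * margin)
        rois.append((center_u, center_v, size))
    return rois


# Hypothetical focal points in pixel coordinates, nearest first.
points = [(60, 60), (62, 50), (64, 42), (66, 36)]
rois = build_rois(points)
```

In practice the windows would also need to be clipped to the image bounds before cropping and resizing.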

As we can see in (D) of Fig. 2, unimportant features (e.g. buildings, sky, trees, etc.) have been removed from the input image. This is one of the main advantages of the approach: the network focuses on the important, task-related regions of the image while irrelevant parts are eliminated.

We resize all ROIs into the same size as the smallest ROI, which is constructed from the farthest focal point generated by the MP-Net. By this step, with 4 ROIs, the concatenated 3D image has a size (4, 32, 32, 3), two times smaller than the original input (64, 128, 3).

The resizing step is inspired by the Glimpse Sensor [15], where multiple resolution patches were used to improve classification performance. Unlike the Glimpse Sensor, we do not use a simple fully connected layer after the concatenated multi-resolution 2D images. Rather, we apply 3D convolutions to extract attention-based 3D information among the stacked images. Through the resizing step, the smallest ROI, from the farthest focal point, maintains its resolution, while the other ROIs are downsampled to the fixed size. As a result, bigger ROIs get lower resolution due to the downsampling, and this is where the network resembles the parafovea/perifovea area of the macula. When we see something, we can focus on a specific region clearly with high resolution (fovea), while the regions outside it are blurred and thus have lower resolution (parafovea/perifovea). We can think of this resizing step as placing more weight on the smallest and most important ROI.

We stack the images in 3D, which adds another dimension, z. Because the number of stacked images is small, we do not want a pooling layer to reduce this dimension and lose information. Therefore, the 3D max-pooling layers in the network act like 2D max-pooling layers: they do not pool along the z-dimension. For 3D filters, we used (3, 3, 3) kernels.
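The effect of a 3D max-pooling layer that leaves the stacking dimension untouched can be sketched in plain Python. The sketch assumes a (1, 2, 2) pooling window with matching stride and a single-channel stack stored as nested lists; both are illustrative simplifications, and the window size is an assumption rather than a value stated in this work.

```python
def max_pool_keep_depth(volume):
    """Max-pool with a (1, 2, 2) window: halve height and width, keep the stack depth.

    `volume` is indexed as volume[z][row][col], one slice per stacked ROI.
    """
    pooled = []
    for image in volume:
        rows = []
        for r in range(0, len(image) - 1, 2):
            row = []
            for c in range(0, len(image[r]) - 1, 2):
                row.append(max(image[r][c], image[r][c + 1],
                               image[r + 1][c], image[r + 1][c + 1]))
            rows.append(row)
        pooled.append(rows)
    return pooled


# A toy stack of four 4x4 slices; each value encodes its (z, row, col) position.
stack = [[[z * 100 + r * 10 + c for c in range(4)] for r in range(4)] for z in range(4)]
out = max_pool_keep_depth(stack)
```

The output still has four slices, each reduced from 4x4 to 2x2, so information is never mixed or discarded across the stacked ROIs by pooling.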

Finally, we combine MPC and the attention-based image processing, using the MP-Net (Fig. 1) and the Macula-Net (Fig. 3), into the PAPC algorithm, described in Algorithm 1.

Algorithm 1 Perceptual Attention-based Predictive Control (PAPC)

  for t = 1, ..., N do
     (x*_t, u*_t) ← MPC(x_t)
  end for
  p ← CoordinateTransform(x*)
  c ← Spline(p)
  ROIs ← GenerateROIs(I, c)
  while Training MP-Net do
     ĉ ← MP-Net(I)
     Compute the mean squared error between ĉ and c
     Update MP-Net
  end while
  while Training Macula-Net do
     (μ, σ²) ← Macula-Net(ROIs)
     Compute the heteroscedastic loss between (μ, σ²) and u*
     Update Macula-Net
  end while
  while Testing do
     (μ, σ²) ← Macula(GenROIs(I, MP-Net(I)))
  end while

V Experiments/Results

V-A End-to-End Autonomous Driving with Anomaly Detection

First, we collect data from our model predictive controller, driving a 1/5 scale vehicle around an oval track for 100 laps. Then, we produce one more piece of data, which is the stacked 3D image at every time. This is generated from the coordinate transformation of the MPC’s trajectories and the B-spline method as described in Section III-A. When training, we train both MP-Net and Macula-Net separately with the same set of data, but different loss functions. The MP-Net requires pairs of the original image and the spline coefficients of the planned path and the Macula-Net needs 3D stacked images and control action pairs. The control action in our experiments is the vehicle steering command which is a continuous real number between .

We observed in the experiments that, of the two uncertainties we get from our Bayesian network trained with the heteroscedastic loss function, the epistemic uncertainty changed drastically in reaction to a novel observation, whereas the aleatoric uncertainty did not change much. This is reasonable because the epistemic uncertainty comes from the lack of data and yields a large variance for novel input data. Therefore, we used the epistemic uncertainty as the network’s output variance, which serves as the safety signal.

We tested our algorithm in a simulated autonomous driving environment in ROS [18] as well as on our real hardware. We conducted 100 test runs with 5 different obstacles in ROS to evaluate PAPC’s performance compared to the state of the art. In the real hardware experiments, we conducted 10 trials per obstacle for each network for comparison. The real hardware experiments were run on an NVIDIA GeForce GTX 1050 Ti GPU and the simulation experiments on a GeForce GTX 1060 GPU.

Fig. 4: The autonomous driving simulation environment. Top Left: Vehicle’s view from its onboard camera. Note that the construction cone, a novel object, is 7m away from the vehicle.

Both MP-Net and Macula-Net were trained with the Adam [12] optimizer in TensorFlow [1]. We used Concrete Dropout [27] to find the optimal dropout probability for each layer in our Bayesian network. After every convolution and fully connected layer, we performed batch normalization [10] to speed up training. No data aggregation was involved except for the 3D stacking step, and all models were trained in batch.

We set a threshold for the output variance signal to 3-10 times larger than the maximum value of the usual variance in the normal case, depending on the number of samples we choose for the MC-dropout. The usual output variance in the normal situation without any novel object in the scene was between and , depending on the number of samples as well. We used this safety threshold for an emergency stop of the autonomous vehicle. If the output variance is larger than the threshold, the vehicle will be stopped.
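The thresholding rule described above can be sketched as a small monitor. The function name, the default factor of 5 (chosen from within the stated 3-10 range), and the example variance values are all hypothetical; only the rule itself, stopping when the output variance exceeds a multiple of the largest variance seen in normal driving, comes from the text.

```python
def make_safety_monitor(nominal_variances, factor=5.0):
    """Build an emergency-stop check from variances recorded without novel objects.

    The threshold is `factor` times the largest nominal variance.
    """
    threshold = factor * max(nominal_variances)

    def should_stop(variance):
        # Stop the vehicle when the output variance exceeds the threshold.
        return variance > threshold

    return threshold, should_stop


# Hypothetical variances logged during normal laps.
threshold, should_stop = make_safety_monitor([0.001, 0.002, 0.0015], factor=5.0)
stop_now = should_stop(0.05)
keep_going = should_stop(0.003)
```

Since the variance estimate depends on the number of MC-dropout samples, the nominal variances and the factor would need to be recalibrated whenever that number changes.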

Fig. 5: Left: The 1/5 scale vehicle with onboard cameras used for the experiments. Right: An obstacle (cardboard box) and the vehicle on the track.

The MP-Net takes the full view of the image as input and is trained to predict the planned path by imitating our model predictive controller. Through experiments, we found that the VGG-like network learns to map important and relevant features to the target output, which is the planned trajectory in MP-Net. As we can see in the activated feature maps at each layer, the extracted important features were the line information of the track. In other words, the VGG-like MP-Net is trained to map the track part of the image to the corresponding output (the future path a few seconds ahead). As the MP-Net gets the full view of the image as input, a new obstacle does not affect or fool the network output most of the time, until the obstacle dominates the track in the image. We tested this by putting different kinds of objects on the track and by removing some features such as trees or buildings (in simulation), but neither change could fool the network because the most important features, the track boundaries, still existed in the image. Therefore, we can trust the MP-Net to predict the correct trajectories.

From the cropped, re-sized, and stacked 3D image data, the model is trained to drive the vehicle autonomously in an end-to-end fashion. Our policy not only performs the original task, but also it is able to quantify uncertainty in its control policy when novel obstacles are put in the track, even far-off from the vehicle.

We place a novel obstacle (Table I), which was never seen in the training data, on the vehicle’s path. Lee et al. [13] showed in their previous work that the variance signal from the Bayesian neural network increases when a vehicle sees a novel object on the road. However, as they mentioned in their work, the increase of the variance signal was gradual and not fast enough to take action to avoid new obstacles at high speed. Here, with our PAPC algorithm, we show faster anomaly detection, which gives the vehicle enough time to avoid the obstacle without any collision (Fig. 8).

TABLE I: Distance left from the object when the network detects it. Top 5 objects are tested in the ROS simulator and the bottom 3 objects are tested with our real hardware.

New Object        DropoutVGG [13]                  PAPC [Ours]
                  Min       Avg       Max          Min        Avg        Max
1 (simulation)    0.37 m    0.39 m    0.42 m       4.28 m     6.87 m     9.22 m
2 (simulation)    2.20 m    2.54 m    2.86 m       4.81 m     5.48 m     6.25 m
3 (simulation)    0.00 m    0.00 m    0.00 m       6.80 m     7.25 m     7.83 m
4 (simulation)    2.12 m    2.25 m    2.44 m       7.62 m     6.87 m     8.33 m
5 (simulation)    1.28 m    2.06 m    2.44 m       6.55 m     7.51 m     8.17 m
6 (hardware)      0.00 m    0.63 m    2.51 m       10.58 m    11.28 m    14.96 m
7 (hardware)      0.00 m    0.26 m    1.29 m       6.91 m     12.55 m    14.63 m
8 (hardware)      0.00 m    0.67 m    4.11 m       6.17 m     10.09 m    13.25 m

Using only the smallest ROI around the furthest focal point, it is hard for a network to learn a task since the smallest ROI does not contain enough information related to the task. However, even with a low resolution, we have bigger ROIs which have some information related to given tasks. With a combination of this smallest ROI and the largest ROI, the model can learn the task as well as focus on important regions of the image to pay attention to and report safety threats.

We can see that, through the PAPC algorithm, the input image data to the Macula-Net excludes unimportant features (e.g. buildings, trees, sky, etc.) and focuses only on the part of the image that the MPC guides the network toward.

To stop a high-speed vehicle before a collision or to avoid an obstacle, a vehicle driving at around 5 m/s requires 1-2 seconds (5-10 m) to take proper action. Within 1-2 seconds, a human expert or a model predictive controller can take control of the vehicle and avoid the obstacle, since model predictive controllers perform optimal path planning and control within 20-50 ms.

Fig. 6: Input 3D image (stacked ROIs) and the averaged feature activation maps after the first two max-pooling layers. Lighter color represents more activation. The smallest ROI, ROI 1, is more activated by the new object than the track/road boundary. As the ROI becomes larger, it tends to be activated from the track/road boundary than from the new object.

We analyze our method by plotting, for each ROI, the averaged feature maps after the first two max-pooling layers (Fig. 6). ROI 1 (fovea) is activated more by the features of a new obstacle than by the track boundary. Interestingly, however, as the ROI becomes larger and more heavily downsampled, it is activated more by the track boundary. With this combination of multi-resolution 3D inputs, the PAPC algorithm is able to detect a new object while focusing on the task-related features in a normal situation, where abnormal objects do not appear. We can also see in Fig. 6 that the deeper layer (closer to the output layer) tends to focus more on a single feature. For example, in ROI 2, after the first max-pooling layer, the neurons are activated almost equally by the new object and the track boundary. However, after two more convolutional layers and a max-pooling layer, the neurons are activated only by a single feature, the track boundary. This pattern is also seen in the other ROIs.


In Fig. 6, we can also see that the resolution of the ROIs decreases as the ROI becomes larger, since a larger ROI needs more downsampling. This downsampling causes a visible loss of information in the bigger ROIs, where the brightness has changed.

Fig. 7: Top: A screenshot of the 3D t-SNE plot of the ROI images given to the Macula-Net. Bottom: A 2D t-SNE plot of the 3rd max-pooling layer in the Macula-Net. The colorful cluster shows how the Macula-Net is able to abstract the presence of an obstacle in its way. The 3D plot is best viewed in the video:

Fig. 8: Output variance plot from the Bayesian networks, showing the epistemic uncertainty estimated with MC-dropout. PAPC detects the novel obstacle earlier than DropoutVGG [13].

V-B Failure Cases

Our network sometimes fails, depending on the size or color of the object. PAPC was not able to detect a can in the simulation environment or a detergent container in the real world (Fig. 9). We argue that the distributions of these inputs were not different enough from the training data, even though the fovea ROI captured the object correctly. Moreover, by the time the vehicle gets closer to the object, the fovea ROI has already moved past it, so no strong attention remains on it. We believe that such smaller objects, or objects whose distribution is similar to that of the obstacle-free training data, can be detected by increasing the number of focal points and ROIs and giving the fovea a smaller window. However, having more ROIs will require faster GPUs and smaller network structures to run the network in real time.

All of the simulation and real world experiments can be found in the video:

Fig. 9: Failure cases: a novel beer can in the ROS simulator and a detergent container on the real track. Even though the fovea (red box) captured the objects, PAPC did not output any meaningful signal.

V-C Data Distribution Visualization with t-SNE

Because the Macula-Network uses the Bayesian dropout [6] approach to determine whether an ROI contains a new obstacle, it is useful to analyze how the distributions of the ROIs with and without an obstacle differ from one another. We use the t-SNE technique for dimensionality reduction in order to visualize the high-dimensional ROI images that the Macula-Network takes as input. We run t-SNE with perplexity values around 60 on simulated and real-world ROI data. We run t-SNE not only on the ROI images but also on the output of the first fully-connected layer in the Macula-Net. Using a middle layer allows us to visualize how the Macula-Net itself interprets the ROI images. In Fig. 7, we show that the Macula-Net features of the images with obstacles lie in a noticeably different distribution (colorful cluster) from those of the images with no obstacles. This shows that the Macula-Net was able to capture a change in the distribution of the ROIs when an obstacle is placed on the track. Because these image samples with obstacles lie in a different distribution from the training data, the Bayesian network inside the Macula-Net will output control values with high variance, thus indicating to the policy that it is no longer safe to drive. In this manner, the algorithm will be able to tell when to concede control to a safer controller.
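The handover logic described above can be illustrated with a minimal sketch. It assumes the policy returns several stochastic steering samples per frame and that a fixed variance threshold triggers the switch. The function name, the threshold value and the sample values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pvariance


def select_controller(steering_samples, variance_threshold):
    """Pick the active controller from MC-dropout steering samples.

    Returns (controller, steering_command, variance). The command is None
    when control is handed over to the fallback controller.
    """
    variance = pvariance(steering_samples)
    if variance > variance_threshold:
        return "fallback", None, variance
    return "policy", mean(steering_samples), variance


# Hypothetical samples: tight agreement on a familiar scene,
# wide disagreement when a novel obstacle enters the ROI.
familiar = [0.10, 0.11, 0.09, 0.10, 0.10]
novel = [0.10, -0.30, 0.45, 0.02, -0.20]

familiar_decision = select_controller(familiar, variance_threshold=0.01)
novel_decision = select_controller(novel, variance_threshold=0.01)
```

In this toy setting the familiar scene keeps the learned policy in control, while the high-variance samples hand control to the fallback controller. In practice the threshold would have to be calibrated on obstacle-free data.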

VI Discussion

For future work, we would like to explore network structures smaller than the VGG-based network. We were able to run our robot in real time (15-20 Hz), but at that rate we could draw only around 25 samples from the Monte Carlo dropout. We also tested with more samples at a lower control frequency and observed a less noisy anomaly detection signal in the output variance of our Bayesian network.
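The effect of the sample count on the noise of the variance signal can be reproduced with a toy experiment. The sketch below is not the Macula-Net: it assumes a four-weight linear model with inverted dropout on its inputs, and the weights, features, keep probability and trial count are hypothetical.

```python
import random
from statistics import pstdev, pvariance

WEIGHTS = [0.4, -0.2, 0.7, 0.1]   # toy linear "network" (hypothetical)
FEATURES = [1.0, 2.0, -1.0, 0.5]  # toy input features (hypothetical)


def dropout_forward(rng, keep_prob=0.5):
    """One stochastic forward pass: keep each feature with prob keep_prob."""
    total = 0.0
    for w, x in zip(WEIGHTS, FEATURES):
        if rng.random() < keep_prob:
            total += w * x / keep_prob  # inverted-dropout scaling
    return total


def mc_dropout_variance(rng, num_samples):
    """Variance of the output over num_samples stochastic passes."""
    return pvariance([dropout_forward(rng) for _ in range(num_samples)])


def estimate_spread(num_samples, trials=200, seed=0):
    """Standard deviation of the variance estimate across repeated trials."""
    rng = random.Random(seed)
    return pstdev([mc_dropout_variance(rng, num_samples)
                   for _ in range(trials)])


spread_25 = estimate_spread(25)    # sample count reachable in real time
spread_200 = estimate_spread(200)  # larger count at a lower control rate
```

Here `spread_200` comes out smaller than `spread_25`, which matches the observation that more Monte Carlo samples give a less noisy variance signal, at the cost of more forward passes per control step.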

While PAPC is able to detect novel objects and increase the predicted output uncertainty, the detection is only momentary. This is because the detection depends on the time interval during which the highest-resolution ROI, located at the tip of the trajectory predicted by the MP-Net, overlaps with the new object. In future work, we plan to combine tracking mechanisms with PAPC so that the predicted uncertainty of the Macula-Net stays elevated for as long as the new object is in the field of view of the vehicle.

We would like to emphasize here that the proposed IPA, namely the PAPC architecture, can be used in any autonomous system that performs navigation using vision sensors for safe path planning and control (e.g. visuomotor control for manipulation [14]). In addition, while our initial goal is to use the PAPC architecture as the main system for navigation, it can also serve as a secondary safety controller that performs anomaly detection on the observation side while the main controller is running. PAPC can raise an emergency flag when it sees something new. This flag can be used in the decision-making module for designing a robust/adaptive controller. This anomaly detection approach can also be used for improving data aggregation during learning or for intelligent exploration in unknown or partially known environments.

VII Conclusion

In this work, we view perceptual control policies as Information Processing Architectures, or IPAs for short, and propose a new architecture to support an algorithm for perceptual attention-based control. The architecture for PAPC consists primarily of two CNNs, namely the MP-Net and the Macula-Net, the latter of which is introduced for the first time in this paper. The MP-Net is trained to generate predicted state trajectories from vision. These trajectories determine ROIs of variable resolution in the image. Taking these multi-resolution ROIs as input, the Macula-Net is able to focus on the task-relevant areas of the input visual information, detect novel objects and control the vehicle. We validated our proposed attention-based deep Bayesian network in both the ROS simulator and on real hardware for a safety-aware autonomous driving task. PAPC was able to detect different obstacles earlier than the state-of-the-art end-to-end Bayesian network approach [13].


This research was supported by the Amazon Web Services Machine Learning Research Awards.