I Introduction
We introduce a novel mechanism that integrates vision into a model predictive control (MPC) framework.
Deep learning (DL)-based perceptual control using end-to-end imitation learning has shown great success in many robotics disciplines, including autonomous driving [17, 2, 25], manipulation [13], and autonomous drone flying [6, 22].
In this paper, instead of taking a fully end-to-end approach ([6, 22]), we deploy the power of DL in novel system modeling. In traditional (not end-to-end) navigation, a DL-aided vision pipeline plays a big role in detecting objects and obstacles as a perception module, and sometimes as part of state estimation (e.g. V-SLAM [14]). A controller then performs its task of navigation, avoidance, or tracking using the information provided by the vision module [10].
Visual object tracking and visual servoing technologies have been developed over the past few decades and can be found in some commercial drone products. However, most of the work in the literature [6, 22, 11] is based on reactive controllers: the robot turns left if the object is on the right side of the robot’s view, and vice versa. Such reactive visual servoing requires the drone to fly at a slow speed or hover until it finishes servoing. Here we propose a predictive visual tracking controller for high-speed racing with a data-driven optical flow dynamics model composed of optical flow and robot dynamics.
In a drone racing scenario, the optical flow mostly comes from a moving camera and a static environment. Since the controller moves the robot through space, the changes in scene, the optical flow, can be thought of as indirect dynamics.
Recently, there has been a lot of progress in DL-based optical flow techniques [8, 23, 19]. However, all of this prior work relies on large convolutional neural networks with many parameters to estimate the optical flow of the entire image. In our work, the optical flow is used to predict the relative motion of a ‘single’ pixel, so we use a small fully connected feedforward network.
The main problem we address in this paper is the visibility/field of view of a moving camera, especially in high-speed racing. The more the robot observes through its camera, the more information is available for accurate state estimation and navigation. Therefore, it is important to control the robot to see more information, for example, by pitching up or rolling/yawing. However, this conflicts with the high-speed flying task because a quadrotor needs to pitch down to fly at high speed, which results in losing visual information.
To address the limited field of view in visual servoing, [18, 15] proposed a Sequential Quadratic Programming-based approach in which visibility is formulated as hard constraints. However, these methods do not fit our problem formulation, which requires real-time planning and control.
In the visual servoing literature, to the best of our knowledge, the only real-time predictive controllers used for a visual tracking task are [16, 4]. Although [16] formulated an MPC problem for viewpoint optimization, the goal of that paper was to control the drone so as to stabilize a gimbal and obtain good video quality. In [4], the work most relevant to ours, the authors derived the target pixel velocity from the relative 3D position of the target and the robot. With the pixel velocity information, the authors were able to form an MPC problem that includes vision and to perform visual object tracking control in a predictive way.
However, in our work, we implement a data-driven deep-learning approach that does not require any prior information about camera intrinsics, extrinsics, or the 3D global position of the target. Instead, our algorithm requires an object detector that detects a target in image space. Thanks to the great success in the field of computer vision, we can use real-time object detectors [20, 21] with GPUs. Although our method requires prior knowledge of the image of the targets (gates) and a trained detector, we believe this is less restrictive than full knowledge of the global 3D position of features as in [4]. Furthermore, we believe our case is less restrictive since our proposed approach can be used for any moving target objects located anywhere in the scene.
In summary, the contributions of this work are twofold:


We introduce data-driven Deep Optical Flow (DOF) dynamics, learned from the optical flow of consecutive images and robot dynamics. DOF dynamics are efficient in memory and computation.

We introduce the Pixel Model Predictive Control (PixelMPC) algorithm, which predicts the relative motion of pixels and actuates the robot to visually track important features (targets) while accomplishing high-level tasks (e.g. racing or chasing). The algorithm makes vision-based state estimation more robust, as it explicitly allows the control algorithm to prioritize visual information.
The remainder of the paper is organized as follows: In Section II, we briefly review some preliminaries used in our work. The DOF dynamics are introduced in Section III, and in Section IV we introduce our PixelMPC algorithm. Section V details vision-based drone racing and state estimation experiments with analysis and comparisons of the proposed methods. Finally, we conclude and discuss future directions in Section VI.
II Preliminaries
In this section, we provide the building blocks of the proposed Pixel Model Predictive Control (PixelMPC): MPC, the drone dynamics model, and optical flow.
II-A Model Predictive Optimal Control
Model Predictive Control (MPC)-based optimal controllers (e.g. Model Predictive Path Integral (MPPI) [24]) provide planned control trajectories given an initial state and a cost function by solving the optimal control problem. An optimal control problem whose objective is to minimize a task-specific cost function can be formulated as follows:
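(1)  $V(\mathbf{x}(t_0), t_0) = \min_{\mathbf{u}} J(\mathbf{x}, \mathbf{u})$
(2)  $\phantom{V(\mathbf{x}(t_0), t_0)} = \min_{\mathbf{u}} \left[ \phi(\mathbf{x}(t_f), t_f) + \int_{t_0}^{t_f} L(\mathbf{x}(t), \mathbf{u}(t), t)\, dt \right]$
subject to dynamics
(3)  $\dot{\mathbf{x}}(t) = F(\mathbf{x}(t), \mathbf{u}(t), t)$
where $\mathbf{x}$ represents the system states, $\mathbf{u}$ represents the control, $\phi$ is the state cost at the final time $t_f$, $L$ is the running cost, and $V$ is the value function. By solving this local optimization problem, we get the optimal control sequences. This can be solved in a receding horizon fashion in an MPC framework, which gives us a real-time optimal controller with feedback.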
In our work, a sampling-based receding-horizon stochastic optimization algorithm, the MPPI controller [24], is used as the MPC controller. We chose MPPI for several reasons, the first being the generality of the cost functions and dynamics it allows. Most variants of MPC require a convex cost function and first- or second-order approximations of the dynamics. MPPI has neither of these requirements. Therefore, we can directly encode our task into the cost function without any modifications to the high-level objective. Second, MPPI has been shown to be highly successful at aggressive autonomous racing on ground vehicles with general cost functions and neural network dynamics [24].
As a short algorithmic summary, MPPI samples a batch of trajectories by injecting noise into the control channels and forward propagating the dynamics. Each sample can be rolled out in parallel, and then the trajectories and their corresponding costs are combined to generate a final control vector. The optimization can be iterated several times to further refine the solution before executing it. The previous control solution is used as the center value of the Gaussian sampling to warm start the optimization in each round.
II-B Quadrotor Dynamics
We use the quadrotor dynamics model provided in the FlightGoggles simulator [7], which is the simulator used in this paper. The model has 10 states: the world-coordinate position vector (3 components), the vehicle attitude unit quaternion (4 components), and the world-coordinate linear velocity vector (3 components). The vehicle dynamics are given by
(4)  
(5) 
where the quantities involved are the gravitational acceleration, the quadrotor mass, the rotation matrix from body to world frame, the total thrust, the aerodynamic drag, and a stochastic force vector that captures unmodeled dynamics (e.g. vibrations and turbulence). The rotation matrix from body to world frame is
(6) 
and the relation between quaternions and the angular rates is
(7) 
where the body angular rates are part of the control inputs, along with the total thrust. The control therefore consists of the three angular rates and the total thrust. We make a small assumption here that the model immediately follows the control inputs, especially the angular rates. Indeed, the quadrotor in FlightGoggles takes these commands as input, and a low-level PID controller drives the robot to follow them. Since we directly command the angular rates, we do not use the angular-rate dynamics described in [7] when we propagate the model in MPC. The robot dynamics used in this paper are also illustrated in Fig. 4.
II-C Optical Flow
Optical flow estimates the instantaneous motion of objects and features in a visual scene from a sequence of ordered images. The motion comes from the relative motion between an observer and a scene. In our case, the motion comes from a moving observer (a camera attached to a robot) and a static environment. To compute the optical flow, two strict assumptions are required: 1) the brightness of any observed object point in the images is constant over time, and 2) in the image plane, neighboring points move with similar velocity. The first constraint can be written as:
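(8)  $I(x, y, t) = I(x + \Delta x,\, y + \Delta y,\, t + \Delta t)$
where $I$ represents the intensity of a pixel and $(\Delta x, \Delta y)$ represent the displacement of the pixel position between two consecutive images observed at time $t$ and $t + \Delta t$. This equation can be expanded in a Taylor series by assuming that the movement is small:
(9)  $I(x + \Delta x,\, y + \Delta y,\, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \text{higher-order terms}$
which results in
(10)  $\frac{\partial I}{\partial x} V_x + \frac{\partial I}{\partial y} V_y + \frac{\partial I}{\partial t} = 0, \quad V_x = \frac{\Delta x}{\Delta t},\ V_y = \frac{\Delta y}{\Delta t}$
However, it is impossible to estimate the two unknowns $V_x$ and $V_y$ with only one equation, so all optical flow calculation methods make additional assumptions to estimate the actual flow.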
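To make the constraint concrete, the following is a minimal NumPy sketch that checks the linearized brightness-constancy relation on a synthetic image pair and then recovers the flow by stacking the per-pixel equations. The image pattern, the shift, and all variable names are illustrative assumptions; the global least-squares solve is a simple stand-in for the "additional assumptions" and is not the polynomial-expansion method of [5].

```python
import numpy as np

# Synthetic smooth image I(x, y); the pattern and the shift are illustrative assumptions.
H, W = 48, 64
y, x = np.mgrid[0:H, 0:W].astype(float)

def image(xs, ys):
    return np.sin(0.1 * xs) + np.cos(0.07 * ys)

true_flow = (0.3, -0.2)                      # (V_x, V_y) in pixels per frame
frame0 = image(x, y)
frame1 = image(x - true_flow[0], y - true_flow[1])   # scene content moved by true_flow

# Spatial and temporal derivatives by finite differences.
I_y, I_x = np.gradient(frame0)               # np.gradient returns d/d(row), d/d(col)
I_t = frame1 - frame0

# Drop the border, where np.gradient falls back to one-sided differences.
inner = (slice(1, -1), slice(1, -1))
I_x, I_y, I_t = I_x[inner], I_y[inner], I_t[inner]

# Residual of the linearized constraint evaluated at the true flow.
residual = I_x * true_flow[0] + I_y * true_flow[1] + I_t
max_residual = float(np.abs(residual).max())

# One pixel gives one equation in two unknowns; stacking many pixels makes it solvable.
A = np.stack([I_x.ravel(), I_y.ravel()], axis=1)
b = -I_t.ravel()
estimated_flow, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The residual stays small only because the shift is a fraction of a pixel, which is the small-motion assumption behind the Taylor expansion.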
We used one of the most popular algorithms [5] to calculate the dense optical flow. The algorithm approximates each neighborhood of both frames by quadratic polynomials. The details of the algorithm can be found in [5], and an implementation is available in OpenCV [3]. For a better calculation of the dense optical flow, we used a sequence of downsized grayscale images instead of the original RGB images. The parameters used for calculating optical flow with [5] were: pyr_scale=0.5, levels=10, winsize=51, iterations=15, poly_n=5, poly_sigma=1.1. A visualization of the dense optical flow, as vectors or in color, can be found in Fig. 2, and the supplementary video includes the optical flow of a full run of racing.
III Deep Optical Flow Dynamics
By taking advantage of algorithms [5] that calculate the optical flow, deep optical flow learning becomes self-supervised learning, which does not require any manual labeling. Our proposed neural network-based Deep Optical Flow (DOF) dynamics have two major selling points:
III-1 Computationally efficient
DOF dynamics predict the optical flow vector of a single pixel, while most DL-based optical flow methods [8, 23, 19] predict the next timestep’s optical flow image, with the same size as the input images. This allows us to have a very small network, so we can use the model in a real-time optimal controller that performs optimization within 20-50 ms. If we built a U-Net-like convolutional neural network, which predicts an image from an input image, we would have to propagate the deep CNN at every timestep in the MPC framework to generate optical flow, which is computationally very expensive and slow. For the parameters used in the paper, our MPC algorithm samples over a million times per second.
III-2 Data-efficient
Given an image of size W×H, DOF can use W×H data points for training, whereas typical DL-based optical flow [8, 23, 19] only uses a single data point (an image of the whole optical flow).
DOF dynamics predict, just like typical robot dynamics models, the derivative of the states; here, the velocity of a pixel. DOF takes 3 components as input: the pixel state (position), the control actions, and the robot orientation. The pixel position is expressed in the (u, v) coordinate system on the image plane, where the top left corner is the origin (0, 0). The control actions command angular velocities and total thrust, which affect both the robot motion/acceleration and the image stream. The key element of the DOF input is the robot orientation. We incorporate the orientation of the robot into the DOF dynamics because, even with the same control input, the optical flow changes depending on the roll, pitch, and yaw angles of the robot, as shown in Fig. 3.
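We train the DOF dynamics as a neural network (NN) model that predicts the magnitude $r$ and the angle $\theta$ of a single optical flow vector. By defining the state of the pixel as $\mathbf{p} = [u, v]^\top$, we can write the optical flow as
(11)  $\dot{\mathbf{p}} = [\dot{u}, \dot{v}]^\top = [r\cos\theta,\ r\sin\theta]^\top$
where $r$ and $\theta$ are the magnitude and angle of the optical flow vector predicted by the DOF. Therefore, the final DOF dynamics is
(12)  $[r, \theta] = \mathrm{DOF}(\mathbf{p}, \mathbf{u}, \mathbf{q})$
(13)  $\dot{\mathbf{p}} = \mathrm{PolarToEuler}(r, \theta)$
where $\mathbf{u}$ is the control action, $\mathbf{q}$ is the robot orientation, and the PolarToEuler mapping is Eq. 11.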
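The polar-to-Cartesian mapping and one propagation step of a pixel can be sketched as below. This is a minimal illustration: the function names, the explicit Euler integrator, and the numeric values are assumptions, and the flow is treated as a velocity in normalized image units per second.

```python
import math

def polar_to_cartesian(magnitude, angle):
    """Map a flow vector given as (magnitude, angle in radians) to (du/dt, dv/dt)."""
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

def euler_step(pixel, magnitude, angle, dt):
    """Advance a pixel position (u, v) by one explicit Euler step of length dt."""
    du, dv = polar_to_cartesian(magnitude, angle)
    return pixel[0] + du * dt, pixel[1] + dv * dt

# Hypothetical prediction: flow of magnitude 0.2 pointing along +v
# (downwards in image coordinates), applied for one 0.025 s step.
next_pixel = euler_step((0.5, 0.5), 0.2, math.pi / 2, 0.025)
```

Repeating `euler_step` with a fresh network prediction at every step gives the multistep pixel trajectory that the controller evaluates.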
Algorithm 1 describes the training process of DOF dynamics. In the first for-loop of Algorithm 1, the robot state can be either ground truth or estimated states. The first for-loop describes collecting the training dataset of robot states, optimal control actions, images, and optical flow between two consecutive images. The following for-loops then update the weights and biases of the DOF dynamics model with respect to the mean squared error (MSE) loss between the target magnitude and angle of the optical flow and the prediction.
We normalize the pixel state into the [0.0, 1.0]×[0.0, 1.0] space and perform regression. This turns the original discrete image space [0, W]×[0, H] into a continuous 2D space [0.0, 1.0]×[0.0, 1.0], and the same holds for the pixel state space.
We designed a feedforward NN with 5 layers having [10, 128, 128, 128, 2] neurons each, where 10 is for the input layer and 2 is for the output layer. The Rectified Linear Unit (ReLU) function, ReLU(x) = max(0, x), is used as the activation function in layers 1-4, and the output layer has a linear activation. All the layers are fully connected, with regularization via 10% dropout. The number of neurons was chosen to achieve real-time performance with MPPI while keeping the model accurate. These two goals conflict: a more accurate model needs more neurons and more layers, but this results in slower inference and would prevent real-time usage in MPPI. We therefore chose the architecture empirically so that it was accurate and could still be run in real time. For training the neural network, the Adam [12] optimizer was used with TensorFlow [1].
The usage of the trained model can be found in Algorithm 2. Given the center of an object from a detector (e.g. YOLOv3 [20]), the trained DOF dynamics model takes the center position, robot orientation, and control action as input. The output of the trained DOF dynamics is the magnitude and the angle of the predicted optical flow of that single point. From the predicted magnitude and angle, the velocity of the single point is calculated as in Eq. 11.
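As a rough illustration of how small this network is, the sketch below builds the [10, 128, 128, 128, 2] fully connected architecture in plain NumPy and counts its parameters. It is not the authors' TensorFlow code: the weights are random, dropout is omitted (as it would be at inference time), and the split of the 10 inputs into 2 pixel coordinates, 4 control values, and 4 quaternion components, as well as their ordering, is an assumption.

```python
import numpy as np

LAYER_SIZES = [10, 128, 128, 128, 2]   # input, three hidden layers, output

def init_params(rng):
    """Random weights and zero biases for each fully connected layer."""
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def forward(params, inputs):
    """ReLU on hidden layers, linear output: one (magnitude, angle) pair per row."""
    h = inputs
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return h @ W + b

params = init_params(np.random.default_rng(0))
num_params = sum(W.size + b.size for W, b in params)

# One query: normalized pixel (2) + control (4) + orientation quaternion (4) = 10 inputs.
pixel = np.array([0.5, 0.5])
control = np.array([0.0, 0.0, 0.0, 9.81])      # hypothetical rates and thrust
quaternion = np.array([1.0, 0.0, 0.0, 0.0])
prediction = forward(params, np.concatenate([pixel, control, quaternion])[None, :])
```

The weight and bias count of this layout, `num_params`, comes out to 34,690, which matches the parameter count reported for the DOF dynamics NN.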
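TABLE I: Runtime [ms]

Model | Input size [pixels] | Samples | Time steps | Runtime [ms] | max [ms]
YOLOv3 | 204×153 | 1 | 1 | 14.9 ± 5.6 | 21.1
DOF | 1×1 | 1 | 1 | 1.5 ± 0.4 | 2.1
DOF | 1×1 | 1 | 80 | 6.7 ± 2.9 | 10.1
DOF | 1×1 | 512 | 1 | 1.6 ± 0.7 | 2.8
DOF | 1×1 | 512 | 80 | 8.6 ± 3.0 | 11.9
DOF | 204×153 | 1 | 1 | 6.9 ± 1.5 | 9.4
DOF | 204×153 | 1 | 80 | 327.0 ± 10.6 | 340.3
DOF | 204×153 | 512 | 1 | OOM | OOM
SpyNet | 192×160 | 1 | 1 | 3.3 ± 0.5 | 4.0
SpyNet | 192×160 | 512 | 1 | OOM | OOM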
We have included a comparison table, Table I, that shows the runtimes of our optical flow prediction and of another state-of-the-art network. Although our DOF dynamics approach cares about accuracy, our primary constraint was speed. Therefore, we compared our network with the fastest and smallest of the state-of-the-art networks, SpyNet [19]. We refer to Table 9 in [9] for benchmark results for optical flow; that table shows the accuracy and the runtime of the state-of-the-art approaches ([8, 23, 19], etc.). Note that the total number of parameters in the DOF dynamics NN is 34,690, whereas SpyNet has 1,200,250 parameters. We tested DOF dynamics both in the single-pixel prediction case and in the whole-image prediction case (31,212 pixels). We clearly see that for multistep prediction (80 time steps), running DOF dynamics for a whole image (204×153) to predict the optical flow is too slow (3 Hz) and does not fit into real-time MPC algorithms. Since SpyNet requires the width and height of the input image pairs to be multiples of 32, we resized the image to a size similar to our training data set: 192×160=30,720 pixels. From Table I, it is apparent that only the single-pixel approach with our DOF dynamics can fit into a real-time “sampling-based” MPC framework.
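The rates quoted here follow from simple arithmetic, sketched below using only figures given in the text (327.0 ms mean runtime; 512 samples, 80 time steps, and 40 Hz from Section V-B). Reading "samples over a million times per second" as the number of dynamics evaluations per second is an interpretation, not something the text states explicitly.

```python
# Whole-image, 80-step prediction takes 327.0 ms on average (Table I).
whole_image_rate_hz = 1000.0 / 327.0

# MPPI settings quoted in Section V-B: 512 samples, 80 time steps, 40 Hz replanning.
samples, horizon_steps, control_rate_hz = 512, 80, 40
dynamics_evals_per_second = samples * horizon_steps * control_rate_hz

# Time budget per control cycle at 40 Hz, in milliseconds.
cycle_budget_ms = 1000.0 / control_rate_hz
```

Under this reading the controller performs about 1.6 million single-pixel evaluations per second, and the 8.6 ms mean runtime of the 512-sample, 80-step DOF case fits inside the 25 ms cycle budget.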
We believe that comparing the accuracy of our method with the standard full-image optical flow method is unfair because the two approaches use different information to predict the optical flow. While the full-image approach uses more perceptual information, our DOF approach uses more non-visual information: the robot orientation and controls.
On the test dataset, our DOF dynamics achieve a prediction error of 2.45 in Average Endpoint Error (AEE). The endpoint error is the Euclidean distance between the ground truth optical flow vectors and the predicted vectors. In the optical flow literature, depending on the training dataset, the state-of-the-art methods report AEE of 0.5-10.0.
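For reference, the metric can be computed as in the short sketch below; the function name and the toy vectors are illustrative, and the flow vectors are assumed to be given in Cartesian components.

```python
import numpy as np

def average_endpoint_error(predicted, target):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    Both inputs have shape (N, 2), with one Cartesian flow vector per row.
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.linalg.norm(predicted - target, axis=1).mean())

# Two toy vectors: one is off by (3, 4), i.e. a distance of 5; the other is exact.
aee = average_endpoint_error([[3.0, 4.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```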
IV Pixel Model Predictive Control
In this section, we introduce the Pixel Model Predictive Control (PixelMPC) algorithm for visual object tracking and autonomous racing. PixelMPC literally predicts the future state trajectory of a “pixel model”, the deep optical flow (DOF) dynamics, and calculates the optimal control sequence (Fig. 1).
Assume we have a visual object detector, for example one that detects custom classes of objects using the You Only Look Once (YOLO) [20] algorithm. Given some detected objects, we can predict the future trajectories of their center points/pixels. For a visual tracking task, one cost function for the optimal control of the DOF dynamics can be the L1 distance between the object pixel position and the center of the image. This L1 cost function forces the pixel to stay close to the center of the image:
(14) 
This cost function is reasonable for visual object tracking because the closer the target is to the center of the image, the longer we observe the target. In addition, the center of the image has the least distortion, which means the least loss of information.
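A minimal sketch of such a pixel cost is given below. It assumes normalized pixel coordinates, so the image center is (0.5, 0.5); the function names and the `weight` argument are illustrative and do not reproduce the exact form of Eq. 14.

```python
def pixel_cost(pixel, center=(0.5, 0.5)):
    """L1 distance between a pixel (u, v) and the image center."""
    return abs(pixel[0] - center[0]) + abs(pixel[1] - center[1])

def trajectory_pixel_cost(pixels, weight=1.0):
    """Weighted sum of the L1 pixel cost along a predicted pixel trajectory."""
    return weight * sum(pixel_cost(p) for p in pixels)

centered = pixel_cost((0.5, 0.5))   # target exactly at the image center
corner = pixel_cost((0.0, 0.0))     # target at the top left corner
```

With normalized coordinates the cost ranges from 0 at the center to 1 at a corner, which is why a large coefficient is needed to balance it against the racing cost.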
The autonomous racing task-related cost function for a finite-horizon optimal control problem can be designed in the form of Eq. 1. For example, to follow the desired position, orientation, and velocity:
(15) 
where the control cost is ignored and the crash term is an indicator function that returns 1,000 if the robot crashes into a gate, or otherwise a value in [-1, 1]. A smaller return value means the robot is closer to the desired path. The ordered waypoints (gates) are assumed to be given with the map of the entire racing track (e.g. Fig. 5). Note that this information, including the position of the targets, is only used for the racing task along with real-time path planning, not for the visual-servoing task. If the task of PixelMPC is similar to [4], where the task is following given waypoints, then prior information about target locations is not required.
Now, the total cost function for the optimization Eq. 2 is formed as
(16) 
where a new state is defined by augmenting the robot state with the pixel state.
The total dynamics used to optimize Eq. 16 can be written as a combination of two dynamics, Eqs. (4)-(7) and Eq. 13. Our formulation allows us to emphasize one task over another by tuning the cost function. If we want to achieve a faster speed instead of more visibility, then we can weight the racing cost more heavily.
Algorithm 3 shows the PixelMPC algorithm. Either from ground truth or from a state estimator, we receive a new robot state and an image from a monocular camera. A detector (e.g. YOLOv3 [20]) detects the center of a target (a gate, in a racing scenario) in image space, and an optimal model predictive controller solves the optimization problem with respect to the total cost, Eq. 16, over a receding time horizon. After propagating the combined model dynamics and running the optimization step, we execute the first control action and use the remaining control trajectory solution as a warm start for the next optimization loop. Then we again receive a new robot state with an image and repeat the optimization at a rate of 40 Hz.
V Experiments/Results
V-A Experimental Setup
We tested our algorithm in the FlightGoggles simulator [7], which was developed for agile flight simulation with high fidelity. The racing scenario is from the AlphaPilot–Lockheed Martin AI Drone Racing Innovation Challenge (https://www.herox.com/alphapilot/852019virtualqualifiertests) (Fig. 5).
We used the quadrotor’s dynamics model introduced in Section II-B, Eqs. (4)-(7).
To derive our DOF dynamics from optical flow data, we collected 10 rounds of autonomous flight using a nominal MPPI controller, each of which took around 30 seconds. To fully explore the state space, we varied the target speed between 6 m/s and 14 m/s across rounds. The timestep in MPPI was 0.025 seconds. In total, 14,000 images from a monocular camera were collected, along with drone states and controls. The images were each downsized to a size of [204, 153], which provided 204×153=31,212 data points per image. As a result, around 437 million data points for training DOF were collected from 5 minutes of flying data. The states are collected from the ground truth provided by the FlightGoggles simulator.
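The dataset size quoted above can be checked with a few lines of arithmetic, using only the numbers given in this paragraph:

```python
image_width, image_height = 204, 153
num_images = 14_000

pixels_per_image = image_width * image_height        # one training point per pixel
total_data_points = num_images * pixels_per_image

# 10 rounds of roughly 30 s each.
approx_flight_minutes = 10 * 30 / 60
```

The product is 436,968,000, which rounds to the 437 million data points stated in the text.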
V-B Model Predictive Path Integral control (MPPI)
In this work, out of many real-time MPC algorithms, we adopt the sampling-based stochastic optimal control algorithm, Model Predictive Path Integral control (MPPI) [24]. MPPI allows us to handle stochasticity and makes it easy to design and tune non-quadratic cost functions, whereas most other optimal control algorithms require a quadratic cost function.
For the drone racing task combined with the visual object tracking task, the cost function parameters used for MPPI are =0.025 (), =400, =250, =8.0, =80, and =1. The control variance had noise profiles: =0.2, =0.2, =0.3, and =2.2. The pixel cost coefficient was tuned between 3.0e+6 and 9.0e+6, and the resulting different behaviors are reported in Table II and Table III. This coefficient is chosen 4 orders of magnitude higher than all other cost function parameters because we only normalized the pixel position term, from [0, W]×[0, H] to [0.0, 1.0]×[0.0, 1.0]. A total of 512 samples were used to propagate the 12 states with a time horizon of 80 steps at 40 Hz, which results in a 2 second trajectory. The number of samples depends on the hardware (GPU, CPU, RAM, etc.) and the size of the DOF NN dynamics. The nominal MPPI case for the racing task used a cost function composed only of Eq. 15, and the same parameters described above were used to give a fair comparison.
Although the drone dynamics we introduced in this paper are the simplified linear dynamics from the simulator [7], any nonlinear dynamics model can also be used as the robot dynamics model in PixelMPC.
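The sampling procedure summarized in Section II-A, with the sample count and horizon quoted above, can be sketched roughly as follows. This is a simplified NumPy illustration on a toy double integrator, not the authors' GPU implementation: `mppi_step`, `toy_dynamics`, `toy_cost`, the noise level, and the temperature are all assumptions made for the example.

```python
import numpy as np

def mppi_step(x0, nominal_u, dynamics, cost, rng, num_samples=512, dt=0.025,
              noise_std=0.5, temperature=1.0):
    """One MPPI iteration: sample, roll out, weight by cost, update the controls."""
    horizon, control_dim = nominal_u.shape
    noise = rng.normal(0.0, noise_std, size=(num_samples, horizon, control_dim))
    controls = nominal_u[None, :, :] + noise

    states = np.repeat(x0[None, :], num_samples, axis=0)
    total_cost = np.zeros(num_samples)
    for t in range(horizon):
        # All samples are propagated together (vectorized stand-in for GPU rollouts).
        states = states + dt * dynamics(states, controls[:, t, :])
        total_cost += cost(states)

    # Softmin weights: low-cost rollouts dominate the update.
    weights = np.exp(-(total_cost - total_cost.min()) / temperature)
    weights /= weights.sum()
    return nominal_u + np.einsum("k,ktc->tc", weights, noise)

# Toy problem (illustration only): a 1D double integrator with a cost on position and speed.
def toy_dynamics(states, controls):
    return np.stack([states[:, 1], controls[:, 0]], axis=1)

def toy_cost(states):
    return states[:, 0] ** 2 + 0.1 * states[:, 1] ** 2

rng = np.random.default_rng(0)
state = np.array([1.0, 0.0])
plan = np.zeros((80, 1))                      # 80 steps of 0.025 s = 2 s horizon
for _ in range(100):
    plan = mppi_step(state, plan, toy_dynamics, toy_cost, rng)
    # Execute the first control, then shift the plan to warm start the next cycle.
    state = state + 0.025 * toy_dynamics(state[None, :], plan[:1])[0]
    plan = np.vstack([plan[1:], plan[-1:]])
```

In PixelMPC the `dynamics` callable would propagate the combined 12-dimensional robot and pixel state, and `cost` would be the total cost of Eq. 16; both are replaced by toy functions here.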
V-C Drone Racing with Object Tracking
We compare visibility as a percentage, i.e. how long the robot keeps the target in its view. In the PixelMPC framework, there are some additional DOF dynamics-related parameters we can tune: 1) the time horizon considered for the pixel cost and 2) the cost coefficient. A time horizon of 1.0 means that the pixel cost Eq. 14 only penalizes the pixel trajectory within the first 1.0 second.
Table II shows the time, in seconds, during which the drone loses the target. We consider the ‘loss’ as visually losing the target after the robot first sees it. In this experiment, we used the ground truth provided by FlightGoggles for the robot states.
In the nominal case, without considering DOF dynamics in MPPI, the time during which the robot has less than 50% visibility of a target was 13.6 s, which is more than 42% of the total flying time (31.8 s). With PixelMPC, we can decrease it to 1.5 s, 4.5% of the flying time (33.5 s). The time during which the robot has 0% visibility of a target also decreased, from 3.6 s (4% of flying time) to 0.2 s (less than 1% of flying time). Notice that in both the 0% and 50% cases, the 2σ of the lost time is very large in the nominal case compared to the PixelMPC cases. This can be explained by Fig. 8, where the plots show how smooth the movement is when we use PixelMPC. Also, compared to the nominal MPPI, PixelMPC showed a 29% decrease in mean linear and angular accelerations, which resulted in a slower speed but much smoother behavior; please see Fig. 8 (best shown in the supplementary video). However, the smoothness of PixelMPC is a byproduct of the visual target tracking, not the main goal. Also, visual target tracking cannot be accomplished by simply applying smoothing/filtering to a controller.
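TABLE II: Time (s) during which the target is lost, mean ± 2σ. Rows: time horizon of the pixel cost (s). Columns: pixel cost coefficient.

Time (s) of less than 50% visibility
Horizon | 0.0 | 3.0e+6 | 6.0e+6 | 9.0e+6
0.0 | 13.6 ± 3.6 | - | - | -
1.0 | - | 3.6 ± 0.6 | 1.9 ± 0.6 | 1.5 ± 0.2
2.0 | - | 3.1 ± 0.9 | 2.0 ± 0.6 | 1.9 ± 1.2

Time (s) of 0% visibility
Horizon | 0.0 | 3.0e+6 | 6.0e+6 | 9.0e+6
0.0 | 3.6 ± 1.1 | - | - | -
1.0 | - | 1.0 ± 0.3 | 1.1 ± 0.5 | 0.2 ± 0.1
2.0 | - | 0.7 ± 0.4 | 0.6 ± 0.2 | 0.2 ± 0.1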
In Table III, we compare the race time for each case to see how much lap time we have to pay for more visual information. Table III shows the mean and 2 standard deviations from 10 laps of racing for each case. As expected, PixelMPC loses lap time in exchange for more visibility of the racecourse. However, we believe it is worth paying 1.7 s, sometimes less than 0.2 s, to reduce the share of time in which the robot loses important information in its view from 42% to 4.5%.
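The trade-off can be reproduced from the tabulated means, as in the small sketch below. It uses the nominal MPPI case and the PixelMPC case with a 1.0 s pixel-cost horizon and a coefficient of 9.0e+6; the variable names are illustrative.

```python
nominal_lap_s, pixelmpc_lap_s = 31.8, 33.5      # Table III, mean lap times
nominal_lost_s, pixelmpc_lost_s = 13.6, 1.5     # Table II, less than 50% visibility

lap_time_penalty_s = pixelmpc_lap_s - nominal_lap_s
nominal_lost_fraction = nominal_lost_s / nominal_lap_s
pixelmpc_lost_fraction = pixelmpc_lost_s / pixelmpc_lap_s
```

This gives a lap time penalty of 1.7 s, with the lost-target share of the lap dropping from about 42.8% to about 4.5%.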
We also report the multistep prediction error of the DOF dynamics in Fig. 7. We see that the predicted pixel trajectory is generally shorter than the actual trajectory, but the predicted trajectory closely follows the actual trajectory direction-wise. Note that our MPC scheme addresses this compounding error problem with feedback and real-time optimization.
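TABLE III: Lap time (s), mean ± 2σ. Rows: time horizon of the pixel cost (s). Columns: pixel cost coefficient.

Horizon | 0.0 | 3.0e+6 | 6.0e+6 | 9.0e+6
0.0 | 31.8 ± 1.0 | - | - | -
1.0 | - | 32.7 ± 0.4 | 33.2 ± 0.2 | 33.5 ± 0.2
2.0 | - | 33.2 ± 0.1 | 34.2 ± 0.3 | 34.0 ± 0.5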
V-D Vision-based State Estimation with Particle Filter
For state estimation with sensors (IMU, cameras, etc.), having more visual information and smooth flying behavior will benefit the state estimation and result in fewer failures overall. The most likely cause of a collision is an inaccurate state estimate. In a racing scenario, we can still assume that the racing map, i.e. the gates’ location information is given. Then, one of the biggest challenges will be estimating the robot’s state, to perform accurate path planning and control.
For estimating the robot’s state, we use a particle filter with an observation model using gate information from observed images. The particle filter is run with 6400 particles and uses the GPU to parallelize the motion and sensor updates.
V-D1 Motion Update
The motion update of the particle filter is done by integrating the IMU measurements directly. Additional Gaussian noise is injected into the filter with mean and variance directly on position []. In addition to that, Gaussian noise is added to the IMU measurements directly both with mean and variance for acceleration and variance for angular rates. These tunings allow the particle filter to quickly jump to whatever sensor update occurs, but make the state estimate very unstable. The filter’s covariance will quickly balloon without any feature detections.
V-D2 Sensor Update
The only sensor model of the particle filter takes the nominal locations of the gate corners in the 3D world and projects them into the image plane. We then compute the difference between the detected corners and the expected ones. Any missing detection is penalized heavily by W, where W is the width of the camera image. Our custom YOLOv3 [20] gate detector is used to generate the detections of the 2D positions from an image along with a bounding box, which provides the third (depth) information.
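The comparison between expected and detected corners, including the penalty for missing detections, might look like the sketch below. The function name, the use of the Euclidean pixel distance as the "difference", and the example coordinates are assumptions; the projection of the 3D corners and the mapping from this error to particle weights are not shown.

```python
import math

def gate_corner_error(expected_corners, detected_corners, image_width):
    """Sum of pixel distances between expected and detected gate corners.

    A corner with no detection (None) contributes a fixed penalty of image_width.
    """
    total = 0.0
    for expected, detected in zip(expected_corners, detected_corners):
        if detected is None:
            total += image_width
        else:
            total += math.hypot(expected[0] - detected[0], expected[1] - detected[1])
    return total

# Hypothetical corners for one particle: one is off by (3, 4) pixels, one is missing.
expected = [(100.0, 50.0), (140.0, 50.0), (140.0, 90.0), (100.0, 90.0)]
detected = [(103.0, 54.0), (140.0, 50.0), None, (100.0, 90.0)]
error = gate_corner_error(expected, detected, image_width=204)
```

Because a single missing corner costs as much as the full image width, particles whose predicted view does not match the detections are penalized far more than particles with small reprojection offsets.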
Table IV shows that, with a target speed of 14 m/s, the success rate of both cases is the same (80%), but if we increase the target speed to 16 m/s with the same cost parameters, PixelMPC reports a higher success rate. The failure (crash) cases came from losing target visibility, which resulted in the divergence of the state estimation. The 25 trajectories obtained by running PixelMPC (time horizon 1.0 s, coefficient 9.0e+6) and the nominal MPPI with a particle filter are shown in Fig. 9. Since the racetrack we used only allows a few seconds of flying between two consecutive gates, it is hard to see whether PixelMPC decreases the particle filter covariance, because even the nominal MPPI could see the target gates very often. Therefore, we did one more straight-line flying test to fully see the effect of PixelMPC on state estimation. In this case, we increased the target speed to 20 m/s, where MPPI has to pitch down a lot to hit the target speed. As soon as the detector detects the gates, PixelMPC tries to keep the feature in its view, and this results in a smaller covariance of the particle filter. The last column of Table IV shows the maximum covariance of position from 25 runs of nominal MPPI and PixelMPC.
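TABLE IV: [Success rate (%), Lap time (s)] at target speeds of 14 m/s and 16 m/s, and maximum position covariance of the particle filter at 20 m/s

Ctrl | 14 m/s | 16 m/s | 20 m/s (max. covariance)
MPPI | [80, 31.8 ± 0.7] | [52, 29.6 ± 0.8] | 9.2
PixelMPC | [80, 33.0 ± 0.8] | [60, 30.6 ± 0.7] | 5.7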
VI Conclusion
By fusing vision, path planning, and control into a single optimization framework, high-speed racing can be accomplished with more stable state estimation along with more visual information. Our algorithm can be used in any camera-based robot system for visual servoing. Testing our algorithm on real hardware will be our next step, but there is still room for improvement. The suggested deep optical flow (DOF) dynamics does not take the depth/distance of the target pixel or the robot’s velocity into account. The current DOF approach works well thanks to the generalization property of the deep neural network, but incorporating the target pixel’s depth information will result in more robust dynamics propagation. Another direction for making the suggested dynamics more robust is to propagate the target bounding box, i.e. its 4 corners, as in a particle filter approach. Lastly, although the constant target velocity settings for racing and other inputs indirectly include the velocity information, directly incorporating the velocity will also be helpful for other non-racing tasks that deal with variable speed and other specific maneuvers.
References
 [1] (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
 [2] (2016-04) End to End Learning for Self-Driving Cars. arXiv: 1604.07316.
 [3] (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
 [4] (2018) PAMPC: Perception-Aware Model Predictive Control for Quadrotors. In 2018 IEEE International Conference on Intelligent Robots and Systems (IROS).
 [5] (2003-06) Two-Frame Motion Estimation Based on Polynomial Expansion. In Scandinavian Conference on Image Analysis, Vol. 2749, pp. 363–370.
 [6] (2016) A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots. IEEE Robotics and Automation Letters.
 [7] (2019) FlightGoggles: Photorealistic Sensor Simulation for Perception-driven Robotics using Photogrammetry and Virtual Reality. arXiv.
 [8] (2017-07) FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [9] (2018) Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation. European Conference on Computer Vision (ECCV).
 [10] (2019) Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing. 2019 IEEE International Conference on Robotics and Automation (ICRA).
 [11] (2018, 29–31 Oct) Deep Drone Racing: Learning Agile Flight in Dynamic Environments. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, pp. 133–145.
 [12] (2014) Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). arXiv: 1412.6980.
 [13] (2016) End-to-End Training of Deep Visuomotor Policies. Journal of Machine Learning Research 17 (39), pp. 1–40.
 [14] (2017) ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
 [15] (2019-07) Perception-aware trajectory generation for aggressive quadrotor flight using differential flatness. In 2019 American Control Conference (ACC), pp. 3936–3943.
 [16] (2017-07) Real-Time Motion Planning for Aerial Videography With Dynamic Obstacle Avoidance and Viewpoint Optimization. IEEE Robotics and Automation Letters 2 (3), pp. 1696–1703.
 [17] (2018) Agile Autonomous Driving using End-to-End Deep Imitation Learning. Robotics: Science and Systems.
 [18] (2017-09) Vision-based minimum-time trajectory generation for a quadrotor UAV. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6199–6206.
 [19] (2017) Optical flow estimation using a spatial pyramid network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [20] (2018) YOLOv3: An Incremental Improvement. arXiv.
 [21] (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 91–99.
 [22] (2017) Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4241–4247.
 [23] (2018-06) PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 [24] (2017-05) Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721.
 [25] (2017-06) Query-efficient imitation learning for end-to-end simulated driving. In Thirty-First AAAI Conference on Artificial Intelligence.
Citations
Plain Text:
K. Lee, J. Gibson, and E. A. Theodorou, “Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC,” in IEEE Robotics and Automation Letters, 2020.
BibTeX:
@ARTICLE{lee2020pixelmpc,
  author={Keuntaek Lee and Jason Gibson and Evangelos A. Theodorou},
  journal={IEEE Robotics and Automation Letters},
  title={Aggressive Perception-Aware Navigation using Deep Optical Flow Dynamics and PixelMPC},
  year={2020}
}