I Introduction
Autonomous robots, such as selfdriving cars and aerial robots, are on the rise [Timothy2017, medicalorg, 8373043, searchandrescueorg, surveillanceorg, wan2020survey]. Building computing systems for these domains is challenging because autonomous robots differ from traditional computing systems (embedded systems, servers, etc.) in that the robots must sense the environment through its sensors, make realtime decisions (e.g., detection and evasion) with the available onboard computing and actuate itself within the environment (e.g., evade an obstacle). These robots have cyber components (sensor/compute) and physical components (such as frames/rotors) that interact with oneanother to work as one coherent system. Hence, autonomous robots are cyberphysical systems (CPS) and the traditional computing platform is just one component among many others. To design the optimal onboard compute we need to do cyberphysical codesign. The selection of the cyber and physical components affects system “performance” (i.e., velocity, mission time, energy) of the aerial robot. For instance, cyber quantities, such as sensor framerate and process rate of the sensor data, determine how fast the aerial robot reacts in a dynamic environment, which in turn, determines the safe velocity. Physical quantities, such as weight (frame, payload), determine if the physics allows it to accelerate and move faster. To perform cyberphysical codesign, we must first understand the role of computing (specifically in autonomous aerial robots), and then we design domainspecific architectures. To intuitively understand the role of computing in such a cyberphysical system, we introduce the “Formula1” (F1) visual performance model to guide the design of optimal systems for a given robot task. 
F1 determines which of the cyberphysical components (compute, sensor, body) determines the safe operating velocity; safe highspeed autonomous navigation remains one of the key challenges in enabling aerial robot applications [darpa, vijaykumar1, vijaykumar2, droneracing]. Safety ensures that the control algorithm is reactive to a dynamic environment, while highspeed navigation ensures that the aerial robot finishes tasks quickly, thereby lowering mission time and energy [mavbench]. Using F1, we show that performant aerialrobot requires careful codesign of the autonomy algorithm, as well as the underlying hardware along with the cyberphysical parameters of the aerial robot. We evaluate two popular learningbased autonomy algorithms, DroNet [dronet] and Vgg16 (CAD2RL)[cad2rl]
, on computing platforms used in aerial robots, namely, Nvidia Xavier, Nvidia TX2, Intel NCS, and RasPi. Our observations show that the adhoc selection of autonomy algorithms or onboard computing platforms is far from optimal. To efficiently design domainspecific architectures while being cognizant of the cyberphysical parameters, we introduce AutoPilot—an intelligent cyberphysical design space exploration framework that uses the F1 model to automatically generate the optimal autonomy algorithm (learningbased) and its associated hardware accelerator from a highlevel userdefined robot task, platform constraints and optimization target specifications. AutoPilot consists of two parts: (1) a learningbased autonomy algorithm generator and (2) a multiobjective algorithmhardware tuner. The algorithm generator focuses on generating a functionally correct learningbased autonomy algorithm. AutoPilot automatically trains and tests the neural networkbased autonomy algorithm for a given aerial robot task using deep reinforcement learning (RL)
[airlearning]. The tuner uses multiobjective Bayesian optimization [bayesopt]to automatically tune the learningbased autonomy algorithms hyperparameters and the hardware accelerator parameters simultaneously to meet the optimization target (e.g., high safe velocity, lower mission energy) specified in the highlevel specification. We use AutoPilot to automatically generate the Paretooptimal design points for aerial robot navigation tasks. We show AutoPilot’s ability to generate these design points for three different target drones platforms (miniUAV, microUAV, and nanoUAV) with sensor framerate of 30 FPS and 60 FPS. We show that AutoPilot’s generated optimal design point achieves up to 2
, 1.54, and 1.81 lower mission energy for miniUAV, microUAV, and nanoUAV compared to using commercially offtheshelf accelerators (Nvidia TX2) or other nonoptimal design points generated by AutoPilot. Our results show the importance of cyberphysical codesign, as opposed to the adhoc standalone design of the onboard computing platforms and its implication of selecting optimal design points on mission time and mission energy. In summary, we make the following contributions:
F1, a visual performance model to understand the role of a computing platform in aerial robots while considering other components such as sensor and body dynamics.

AutoPilot, an intelligent cyberphysical design space exploration framework that allows us to automatically codesign a learningbased control algorithm with the accelerator from a highlevel user specification.

Exploit cyberphysical codesign to maximize safe flying velocity while minimizing the overall mission energy.
Ii Autonomous Aerial Robot Background
This section provides a background on the key components in aerial robots, the role of flight controller,and brief overview of the two control algorithm paradigms, namely the “SenseActPlan” and “EndtoEnd learning”.
Iia Aerial Robot Components
Autonomous aerial robots typically have three key components, namely rotors, sensors, and an onboard computing platform. Rotors determine the thrust that the aerial robot can generate. The sensor (e.g., camera) allows the aerial robot to sense the environment. The onboard compute executes the autonomy algorithm to process the sensor data. The size of an aerial machine plays an important role in component selection.
IiB Flight Controller
The task of flight controller is stabilization and control of the aerial robots. It is designed in a multilevel hierarchical fashion and is realized using PID controllers. The flight controller firmware stack is computationally light and is typically run on the microcontrollers [ucontrollerfc1, ucontrollerfc2]. To stabilize the drone from unpredictable errors (sudden winds or damaged rotors), the innerloop typically runs at closedloop frequencies of up to 1 kHz [1khzcontrol, koch2019neuroflight].
IiC Onboard Compute
In addition to the flight controller, there is a separate and dedicated computer responsible for generating highlevel actions from various autonomy algorithms (which we describe later in Section IID). In nanoUAVs, due to their size and weight, typically use microcontrollers as the onboard computing platform. For example, CrazyFlie [crazyflie] weighs less than 27 g and is powered by an ARM CortexM4 microcontroller. On the other end are miniUAVs, which are bigger and have a higher weight (payload capacity). MiniUAV typically uses a generalpurpose computing platform such as Intel NUC or Nvidia Jetson TX1/TX2. For example, AscTec Pelican [highspeeddrone], which weighs 1.6 Kgs is powered by an Intel NUC platform.
IiD Autonomy Algorithms
Autonomous behaviour of aerial robot is achieved by algorithms that classify into two broad categories, namely “SensePlanAct” and “EndtoEnd Learning”. In “SensePlanAct,” the algorithm is broken into three or more distinct stages, namely the sensing stage, the planning stage, and the control stage. In the sensing stage, the sensor data is used to create a map
[rusu20113d, elfes1989using, dissanayake2001solution] of the environment. The planning stage [rrt, motionplanningsurvey] processes the map, to determine the best trajectory (e.g., collisionfree). The trajectory information is used by the control stage, which actuates the rotor, so the robot stays within the trajectory. The execution time for these algorithms varies from hundreds of milliseconds to few seconds [mavbench]. EndtoEnd learning methods, which we focus on in this work, directly process raw input sensor information (e.g., RGB, Lidar, etc.) and use a neural network model to produce output actions directly. Unlike the SensePlanAct paradigm, the endtoend learning methods do not require maps or separate planning stages and hence are much faster compared to nonNN based autonomy algorithms [pulpdronet, vggtx2]. The model can be trained using supervised learning
[e2envidia, dronet, trailnet0, trailnet] or reinforcement learning [cad2rl, qtopt, rlcar].Iii F1 Performance Model
In this section, we introduce the F1 visual performance model that helps computer architects understand whether a robot’s performance is bottlenecked by the selection (or design) of compute (and autonomy algorithm), or by other components in the aerial robot such as sensor or its bodydynamics (laws of physics). We first start with the F1 model overview to understand it as a performance model, and explain how it can be useful. Then we describe how we construct the F1 model.
The F1 model visually resembles that of a traditional computer system roofline model [roofline], albeit the parameters in the F1 model quantifies the aerialrobot as a holistic system as opposed to compute system in isolation. Similar to the roofline model, the F1 model can be used by computer architects in two ways. First, it can be used as a visual performance model to understand various bounds and bottlenecks. Second, it can identify an optimal system (autonomy algorithm + onboard compute) for an aerial robot.
Iiia Need for a CyberPhysical Performance Model
The rate at which motion decisions are made in a drone depends on the speeds of components within the sensorcomputecontrol pipeline (Fig. 0(b)): the sensor capturing a snapshot (e.g., image) of the environment, the computer processing the sensor data to generate highlevel decisions, and the controller realizing the final decisions. The slowest of the sensor, compute, and control subsystems create the upper bound on the rate at which final decisions are generated. The decisionmaking rate determines how fast an intelligent agent (biological or mechanical) can travel while maintaining maneuverability. For example, consider the case of a drone flying through a crowded obstacle course. While the drone’s response time to new stimuli is governed by the total latency of the entire sensorcomputecontrol pipeline, the rate at which the drone can output motor actions is tied to the maximum throughput of that pipeline (Fig. 0(b)). As long as the total latency is fast enough to perceive and track objects in the environment (e.g., obstacles, other drones), then the speed with which the drone can maneuver through obstacles with agility is limited by the rate at which valid decision actions can be output by the pipeline (i.e., the throughput). Our insight is that this problem resembles the canonical ratematching problem in computer systems. Computer architects are familiar with how to model this using analytical model such as bottleneck analysis [bottleneckanalysis], Roofline [roofline], and Gables [hill2019gables]. However, to achieve highspeed agility for drones, one must also consider the effect of physical quantities (governed by physics) and how it affects the selection of sensor, compute, and control subsystems. The traditional computer architecture models fall short of capturing these effects. Hence to design an agile highspeed drone, one must factor in both the physical quantities and rate matching of individual subsystems. 
The F1 model unifies the parameters determining the decisionmaking rate and the parameters that determine the drone’s physics to realize agile highspeed velocity effectively.
IiiB Using the F1 Model
An F1 model defines the upperbound on the safe velocity, considering the maximum rate at which the drone’s sensorcomputecontrol pipeline can make a decision. Responsivness within a safe perceptual operating regime is the typical use case for most drones, and to ensure that the drone stays in that safe regime, it can be programmed to invoke a stopping policy [highspeeddrone, stoppolicy2, stoppolicy3]. Our work focuses on operating efficiently within the safe regime to maximize agile velocity, and thus minimize mission time and battery energy.The F1 model is a logscale plot between safe velocity (V) on the axis and “Action Throughput (f),” on the axis (Fig. 1(a)). The Action throughput is the throughput of the sensorcomputecontrol pipeline, i.e., the rate at which decisions (e.g., move forward, turn left etc) are generated. Safe velocity (V) is the defined as the velocity an aerial robot can travel without colliding with an obstacle. Any speeds less than or equal to V guarantees safety, while any speeds exceeding V is considered unsafe.The F1 model shows that a robot’s velocity increases with improving the throughput of its sensecomputecontrol pipeline only up to a point, after which it is independent of the pipeline’s throughput. We define the decisionmaking rate of the robot as f, and its inverse, the control period, as T. Because the stages in the sensorcomputecontrol pipeline can be run concurrently, we see that the minimum control period of the pipeline can never be smaller than the maximum latency of each component in the subsystem:
(1) 
If the stages of the pipeline are not fully overlapped, the smallest practical control period may approach the total pipeline latency:
(2) 
While the perceptual responsiveness to new stimuli (i.e., latency) is fixed at the upper bound in Eq. 2, through successful pipelining, we can output new control actions at a higher rate, approaching the lower bound in Eq. 1. As long as the robot’s perceptual responsiveness is within a safe operating regime, as mentioned earlier, this allows the robot to execute complicated maneuvers at a higher traveling velocity – making for a more agile drone with shorter mission times. The upperbound on the Action Throughput (f) for a pipelined scenario can be defined from Eq. 1:
(3) 
where, T = 1/f is the latency to sample data from the sensor. If the aerial robot has 60 FPS camera, it means that the sensor data can be sample at 16.67 ms interval, which becomes the sensor latency. T
is the latency of the autonomy algorithm to estimate the highlevel action commands. The algorithm running on the computing system feeds on the sensor data. Compute throughput is a function of the autonomy algorithm (Section
IID) as well as the underlying hardware architecture.T = 1/f is the latency to generate the lowlevel actuation commands. The typical values of f is upwards of 1 kHz [koch2019neuroflight]. With these terms in place, the F1 visual performance model can be used to perform a boundandbottleneck analysis to determine if the safe velocity is affected by the onboard sensor/compute. Any point to the left of the “kneepoint” in F1 (Fig. 1(a)) denotes that the safe velocity is bounded by the choice of compute (and autonomy algorithms) or sensor and any point to the right of the kneepoint denotes the velocity is bounded by bodydynamics of the aerial robot. Ideally, to achieve the optimal pipeline design, it’s action throughput should be equal to that of the kneepoint. BodyDynamics Bound. An aerial robot’s physical properties such as weight, thrust produced by its rotors will determine how fast it can move and hence the ultimate bound on the safe velocity (V) will be determined by its bodydynamics. We call the region to the right of the kneepoint (i.e., when sensetoact throughput is greater than or equal to ) as bodydynamics bound. In this region, unless the physical components are improved (e.g., increasing thrusttoweight ratio), the velocity cannot exceed the current peak safe velocity no matter how fast a decision is made (i.e., selection of faster compute/sensor). Sensor Bound. The choice of onboard sensors limits the decisionmaking rate (f) which can limit the safe velocity (V). As shown in Fig. 1(a), a robot’s velocity is sensorbound if its action throughput is equal to the sensor’s frame rate () but less than the kneepoint throughput (). The sensorbound case occurs when the compute throughput (f) is less than or equal to the sensor throughput (f) (i.e. action throughput is equal to according to Equation 3), and . In this scenario, the sensor adds a new ceiling to the F1 model, thus, bounding the velocity under . 
In this region, unless the sensor throughput is improved (e.g., higher FPS sensor), the velocity cannot exceed the sensorbound ceiling () no matter how fast onboard compute can process the sensor input.Compute Bound. The choice of onboard compute (or autonomy algorithm) also affects the decision making rate (f). As also shown in Fig. 1(a), a robot’s velocity is computebound if its action throughput () is less than the sensor’s frame rate () and the kneepoint throughput (). In this case, the computing platform adds a new ceiling to the roofline model, bounding the velocity under this limit (). In this scenario, the sensor adds a new ceiling to the F1 model, thus, bounding the velocity under . In this region, unless the compute performance is improved (e.g., hardware accelerators/ algorithmhardware codesign) the velocity cannot exceed . Optimal Design. The F1 model can identify system designs that achieve an optimal/balanced overall system capability. Fig. 1(b) shows how understanding the bounds on safe velocity using F1 can help designing an optimal system for aerial robots. For a given robot with fixed mechanical properties, changing the sensor type or onboard compute impacts the f. Consequently, the optimal design point is when the sensor throughput and compute throughput result in a action throughput that is equal to the kneepoint throughput (). OverOptimal Design. If the action throughput is f such that f f, then either the sensor/computer is overoptimized since any value greater than f yields no improvement in the velocity of the aerial robot. Such an overdesigned computing/sensor platform involves not only extra optimization effort but also burns additional power that further increases the drone’s total power, decreasing its overall battery life. SubOptimal Design. The F1 model can help architects understand the performance gap between the current compute design and optimal design. 
For instance, if the action throughput is f, such that f f, then the sensor/computer is underoptimized, which signifies that current system if off by (f f) and there is scope for improvement through a better algorithm or selection (or design) of the computing system.IiiC Constructing the F1 Model
In this section, we describe how we construct the F1 model starting from prior work [highspeeddrone] that has established and validated the relationship between the cyberphysical parameters and the safe velocity of the aerial robot as described by Eq. 4. Eq.4 states that if the robot’s bodydynamics (physics) can permit it to accelerate at most by a, its compute and sensors permit it to sense and act at an interval of T (1/f), and its sensor(s) can sense the environment as far as ‘d’ meters, then robot can travel as fast V. For instance, Fig. 3 depicts an aerial robot with its field of view (FoV) [fov] and an obstacle (e.g., tree or a bird) within the FoV. FoV is the region that the sensor can observe in an environment. In this scenario, the aerial robot can travel at most by V and stop without colliding with the obstacle.
To construct the model, we sweep the T from 0 5 seconds along with typical accelerations values (a = 50 ) and the sensor range (d = 10m), as shown in Fig. 3(a). We observe an asymptotic relation between velocity and T such that as T 0, the velocity 32 (as seen in the magnified portion of Fig. 3(a)). Likewise, as the T , the velocity 0. We also plot the f (inverse of T) on the axis and velocity on the axis in Fig. 3(b). Both the axis and axis are plotted on a linear scale. As T decreases (or 1/T increases), there is a sudden transition in velocity (0 to 31 m/s) and saturates thereafter. We see that there is a point beyond which increasing f does not increase the velocity, showing a saturation or a roofline. Fig. 3(c), plots the xaxis on log scale. Plotting the axis on logscale allows to observe the transition that was not evident in the linear scale (Fig. 3(b)) or in the orignal CPS relation (Fig. 3(a)). We also annotate the three plots with two sample points denoted as point ‘A’ and ‘kneepoint’. The point A has a f of 1 Hz while the kneepoint has a f of 100 Hz. Between point A to kneepoint denotes 100 improvement in action throughput and translates to increase in velocity from 10 m/s to 30 m/s. Whereas even 100 improvement in f after the kneepoint results in 1.0004 improvement in velocity (signifying no improvement in velocity). Hence, increasing the action throughput (e.g., faster computing platform, faster sensor etc.) beyond a certain point will yield no improvement in the velocity. To visualize the F1 model (Fig. 1(a)), we need to show two regions: (i) where a robot’s velocity depends on f, and (ii) where the velocity is independent of f.
IiiD Effects of Cyber and Physical Component Interaction
In this section, we show how the parameters in Eq. 4 couples the cyber and physical components interaction in an aerial robot. The cyber components integrate the sensing, computation, and control pipeline in drones. The effect of cyber components can be abstracted by the T (1/f) in Eq. 4. The physical components in an aerial robot, such as the mass of sensor/compute/body frame/battery, the thrusttoweight ratio, the aerodynamic effects such as drag [drag1], sensing quality etc can be abstracted by the a and d parameters in Eq 4. The three parameters (T, a, d) in Eq. 4, can be used to capture overheads of improving safety, reliability, and redundancy. For instance, safety of autonomous vehicles can be improved by increasing its FOV [fov] (i.e., reducing the blind spot) [blindspot], or designing better tracking algorithms [fusion1, tracking1, tracking2] and/or adding redundancy in compute [redundancy1, redundancy2].The a parameter captures the physical effects of adding payload (sensor, onboard compute, battery, etc.) to the aerial robot. The payload weight affects the thrusttoweight [thrustweight] ratio which lowers the a [lowamax]. The F1 model captures the impact of varying a on V: a higher a leads to a higher V (with roofline shifting upwards), as shown in Fig. 1(c). The d parameter captures the sensing quality of the aerial robot. For instance, a laser based sensor can provide a higher sensing range, whereas a camera array based depth sensor has a limited range [sensorrange]. The F1 model captures the impact of varying d on V: a higher d leads to higher V (with roofline and slope shifting upwards), as shown in Fig. 1(d). Lastly, the f parameter captures the effect of sensor framerate, improvements to autonomy algorithm, or onboard compute. The additional latency incurred due to extra sensor/computation (e.g., sensorfusion) affects the f based on Eq. 3. The F1 model captures the impact of varying f by adding new ceilings which will limit the V. In summary, Eq. 
4 couples the cyber and physical components and its associated effects into a single relationship. Thus F1 model which is built based on Eq. 4 provides a unified performance model for computer architects to design onboard compute while taking into account the cyberphysical effects.
IiiE Validation and Generalizability
The F1 model is derived by plotting the CPS relationship between safe velocity (V) and throughput (f). The CPS relationship is validated on different environments with varying number of obstacles density in both simulation as well as on a realworld with wind speeds up to 7 m/s on a quadcopter. The F1 model applies to both the autonomy algorithm paradigms (Section IID) and quadrotors of all different sizes. As we show later, it is useful for nano, micro and mini UAVs analysis.
Iv F1 Analysis of Offtheshelf Compute
We use F1 to characterize the performance of commonlyused learningbased autonomy algorithms running on realworld computing platforms that are used in aerial robots. We show that commonlyused autonomy algorithms and hardware platforms do not lead to optimal robot velocity, indicating that the choice of the (1) onboard computing platform and (2) autonomy algorithm affect the safe maximum velocity of the robot, thus confirming the need for cyberphysical codesign. We consider a baseline aerial robot that has a thrusttoweight ratio of 2.4 [velocitymodel], equipped with a camera sensor of 60 FPS, and weighs 1350 g, including the weight of the sensor, body frame, and battery. The robot is human teleoperated; it comes with a microcontroller unit but has limited computing and memory capacity for autonomy algorithms other than the flight controller stack. Since this onboard compute system does not use a hardware accelerator, we refer to this baseline as “NoAcc”. Such a robot can achieve a max acceleration of 15.95 . This is annotated as “Body Roof” in Fig. 5. The vertical red dotted line in Fig. 5 denotes the sensor throughput (f). We augment the baseline robot configuration with four different offtheshelf accelerators that have varying compute capabilities: Nvidia Xavier, Nvidia TX2, Intel NCS, and RasPi 3b. These systems are selected as they are used in real aerial robots [drlagx, tx2drone, raspidrone, movidiusdrone]. Therefore, in addition to the “NoAcc” baseline, we create four other robot configurations: each using a different accelerator, while the rest of the mechanical parameters (e.g., sensor) remain the same as the “NoAcc” baseline. Two autonomy algorithms that have been used for aerial robots in prior works are selected to run on these four configurations: VGG16 [cad2rl] and DroNet [dronet].
Platform 





Control Algorithm  
NoAcc  <500mW      1350  15.95  Human teleoperated  
Xavier  <30W  280 [agxweight]  162  1350  11.58  Vgg16, DroNet  
TX2  <15W  85 [tx2weight]  81  1350  14.40  Vgg16, DroNet  
RasPi  <1.5W  18 [raspiweight]  8.1  1350  15.60  DroNet  
Intel NCS  <1W  42 [ncsweight]  5.4  1350  15.10  Vgg16, DroNet 
Compute is heavy, and weighs down the aerial robot’s agility. Highperformance onboard compute can process the autonomy algorithms faster but it trades off performance with higher TDP and weight which in turn lowers the maximum acceleration (a). Table I shows the maximum acceleration for each of the four robot configurations when using the different acceleratorbased computing platforms. Since Xavier (highperformance highTDP larger heatsink) is the heaviest of the four, it shows the lowest acceleration, while RaspPi/Intel NCS (low performance lowpower lighter heatsink) achieve the highest. However, these peak acceleration values are still lower than the “NoAcc” baseline acceleration of 16 , thus implying it is important to consider the effect of compute weight on a robot’s max acceleration. HighPerformance compute does not imply a highperformance aerial robot. Highperformance onboard compute platform does not always translate to higher robot performance (e.g., velocity or missionenergy etc). For instance, Fig. 4(a) shows running DroNet on four different onboard compute platforms. In this case, lowperformance NCS can achieve higher velocity compared to the highperformance TX2 and Xavier as shown by their rooflines. This is because both TX2 and Xavier has higherTDP thus has higher heatsink weight which lowers the maximum acceleration which in turn lowers the velocity. In the case of NCS, it is overdesigned for the performance but a lower power (such that f is to the right of its kneepoint) thus achieves higher velocity by being lighter(compared to TX2 and Xavier). However, in the case of RasPi, even though it is lighter compared to TX2 and Xavier, its performance is lower (f left of the kneepoint) thus making it computebound which lowers the velocity. Computationallyintensive algorithms need highperformance compute. Fig. 4(b) shows ceilings for the platforms (RasPi 3b runs out of memory for VGG16). 
The action throughput of Xavier, TX2, and NCS are dominated by their compute latencies as they are higher than the sensor latency. Xavier achieves higher action throughput of 28 Hz compared to TX2 (10 Hz) and NCS (1.3 Hz). For Xavier, TX2, and NCS, the velocity is bounded by compute as its action throughput is to the left of to its roofline’s knee point. However, Xavier is the least computebound among these accelerators since its action throughput is closest (within 3.5%) to its roofline’s kneepoint. As a result, Xavier achieves a higher max velocity than other accelerators. However, it is still not an optimal choice of compute as its velocity (9.56 m/s) is far from the baseline NoAcc max velocity (11.64 m/s) due to its weight.
Takeaway. While high performance ensures that velocity is not computebound, low power dissipation translates in lower weight (smaller heatsink), hence able to support higher a (higher roofline). Given that the action throughput of these commonlyused autonomy algorithms and computing platforms are not optimal, we need algorithmhardware codesign to achieve design points close to the kneepoint.
V AutoPilot
Our F1 analysis motivates the need to determine the best platform (i.e., autonomy algorithm and accelerator design) that will result in a kneepoint action throughput while considering drone body dynamics and sensor type. To this end, we introduce the AutoPilot cyberphysical codesign framework. For a given robot’s highlevel specification such as its thrusttoweight ratio, sensor type, target task/environment, the tool automatically finds the optimal NN policy and its accelerator to ensure robust navigation and maximize safe velocity. AutoPilot is made up of three phases (Fig. 6). Phase 1 of AutoPilot takes an input specification of the robot and trains various Neural Network (NN) policies for a given task/environment and measures the effectiveness of these policies in terms of success rate. Phase 2 performs an automated design space exploration (DSE) to find the candidate NN policies and accelerator architectures that are optimal in terms of success rate and hardware power/performance. Phase 3 then uses the F1 performance model to find the NN policy and its accelerator design, from the various candidates from phase 2, that maximize the velocity and success rate.
Va Phase 1: Specification and Training
In Phase 1, the user provides an input Specification and configures the NN training environment via the Air Learning NN training gym. The specification consists of all the inputs to the AutoPilot framework, such as the robot task, environment, optimization target (velocity), robot’s physical properties, etc. The Air Learning training simulator [krishnan2019] is used to train different NN policies for a given environment.Specification. There are three main categories within the specification. The first category is the robot tasklevel specification, such as the success rate. The second category includes specifications about the target CPS system: the sensor framerate, the rigid bodydynamics (thrusttoweight ratio), power of rotors/body/sensors, etc. The last category is the optimization target, such as maximize velocity and number of missions, which is used by AutoPilot to determine the final NN policy and the hardware accelerator architecture. ReinforcementLearning Training. AutoPilot uses Air Learning [krishnan2019] to train and validate learningbased autonomy algorithms for a given robot task. Air Learning provides a highquality implementation of reinforcement learning algorithms that can be used to train an NN policy for aerial robot navigation tasks.Air Learning includes a configurable environment generator [airlearninggithub] with domain randomization [domainrand1] support that allows changing various parameters such as the number of obstacles, size of the arena, etc. We customize these parameters to generate different environments, with a varying number of obstacles, in order to denote the change in the task complexity. To determine the NN policy for each robot task (environment complexity in obstacles, congestion, etc.), we start with the basic template used in Air Learning [airlearning] and vary its hyperparameters (number of layers/filters) to create many candidate NN policies. 
Based on the specified robot task and the desired success rate, AutoPilot launches several Air Learning training instances in parallel for the different NN policy candidates. Each NN policy that achieves the required success rate is evaluated in a random environment to validate its task-level functionality. The validated NN policies are added to an Air Learning database along with their success rates, which are then used by Bayesian optimization in the next (DSE) phase.
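The success-rate filtering and database-update step just described can be sketched as follows; the record format and threshold are illustrative assumptions, not the tool's actual storage layout:

```python
# Sketch of Phase 1's filtering step: keep only trained policies that meet
# the required success rate and record them for the DSE phase.

def validated_policies(training_results, required_success_rate):
    """Return database entries for policies meeting the success-rate bar."""
    database = []
    for policy_id, hyperparams, success_rate in training_results:
        if success_rate >= required_success_rate:
            database.append({"id": policy_id,
                             "hyperparams": hyperparams,
                             "success_rate": success_rate})
    return database

# Hypothetical training outcomes: (policy id, hyperparameters, success rate)
results = [("nn_a", {"layers": 3, "filters": 32}, 0.86),
           ("nn_b", {"layers": 2, "filters": 32}, 0.71),
           ("nn_c", {"layers": 5, "filters": 48}, 0.84)]
db = validated_policies(results, 0.80)
```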
V-B Phase 2: Design Space Exploration
In Phase 2, an automated multi-objective DSE is performed to find NN policies and hardware accelerator architectures that are optimal in terms of success rate and accelerator performance/power for a target environment. The success rate is affected only by the NN hyperparameters (e.g., number of layers/filters). The accelerator's runtime and power depend on both the NN and the accelerator microarchitecture parameters (number of processing elements, on-chip memory, etc.). Success rates for the NN policies are retrieved from the Air Learning database, while a cycle-accurate simulator is used to evaluate accelerator performance/power for the different policies and hardware configurations. To achieve rapid convergence to optimal solutions without performing an exhaustive search, Bayesian optimization is used to tune the different parameters.

Air Learning Database. This database stores the training results for the various NN policies trained using Air Learning. Each entry has an NN policy identifier, the hyperparameters used for training, and the performance of the NN policy validated for a given task. Example performance metrics include the success rate and the number of steps the aerial robot takes to reach the goal.

Cycle-Accurate Hardware Simulator. AutoPilot uses SCALE-Sim, a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator [samajdar2018scale]. It exposes various microarchitectural parameters, such as the array size (number of MAC units), array aspect ratio (array height vs. width), scratchpad memory sizes for the input feature maps (ifmaps), filters, and output feature maps (ofmaps), and dataflow mapping strategies, as well as system integration parameters, e.g., memory bandwidth. Taking these architectural parameters, the filter dimensions of each DNN layer, and the image size as inputs, SCALE-Sim generates the latency, utilization, SRAM accesses, DRAM accesses, and DRAM bandwidth requirement.
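The notion of optimality in this multi-objective DSE is Pareto dominance over (success rate, power, latency). A minimal dominance filter, standing in for the selection that the Bayesian-optimization loop performs internally, might look like:

```python
# Minimal Pareto-dominance filter over (success rate, power, latency).
# Higher success is better; lower power and latency are better.
# The design records are illustrative, not AutoPilot's internal format.

def dominates(a, b):
    """True if design a is at least as good as b on every objective
    and strictly better on at least one."""
    better_or_equal = (a["success"] >= b["success"] and
                       a["power"] <= b["power"] and
                       a["latency"] <= b["latency"])
    strictly_better = (a["success"] > b["success"] or
                       a["power"] < b["power"] or
                       a["latency"] < b["latency"])
    return better_or_equal and strictly_better

def pareto_frontier(designs):
    """Keep every design not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(o, d) for o in designs if o is not d)]

designs = [
    {"name": "A", "success": 0.85, "power": 2.0, "latency": 10.0},
    {"name": "B", "success": 0.85, "power": 4.0, "latency": 10.0},  # dominated by A
    {"name": "C", "success": 0.90, "power": 6.0, "latency": 5.0},
]
frontier = pareto_frontier(designs)
```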
While SCALE-Sim only generates performance metrics for the hardware accelerator, we augmented it with power models. The SRAM power is estimated using CACTI [cacti], and the DRAM power is estimated using Micron's DDR4 power calculator [dram]. We assume that the accelerator is integrated into the final SoC; the details of the SoC-level integration and the estimation of SoC power are in Section VI.

Bayesian Optimization. AutoPilot uses Bayesian optimization [bayesopt] for multi-objective DSE to generate task-system Pareto frontiers. Bayesian optimization has been shown to be highly effective for optimizing black-box functions [SnoekLA12, ShahriariSWAF16] that are expensive to evaluate and cannot be expressed as closed-form expressions. BayesOpt can achieve faster convergence than genetic algorithms when optimizing multiple objectives [ReagenHAGWWB17]. In AutoPilot, BayesOpt optimizes three objective functions: (i) task success rate, (ii) SoC power, and (iii) accelerator inference latency (runtime). A Pareto-optimal design is one that achieves the maximum task success rate and the minimum inference latency and SoC power. The algorithm tunes the NN policy hyperparameters (e.g., number of layers/filters) and the accelerator hardware parameters (e.g., number of processing elements, SRAM sizes) to converge to Pareto-optimal NN policies and accelerator architectures. An open-source BayesOpt implementation [bayesopt] is used in AutoPilot.

V-C Phase 3: Cyber-Physical Co-Design with F1
The goal of Phase 3 is to find a design point (policy and accelerator) with optimal success rate and velocity. There are two steps involved: CPS co-design and architectural tuning.

CPS Co-Design. First, the designs with the highest success rates (the minimum success rate is user-specified) among the Phase 2 designs are selected. Then, the velocities of these designs are computed using the CPS relation (Section III-C), which accounts for the effect of the weight of the different components, including the compute, on velocity. Next, AutoPilot constructs the F1 roofline plot (following Section III-C), which consists of a roof corresponding to the baseline robot (i.e., human-operated, carrying no onboard NN accelerator) and other roofs corresponding to the velocities of the success-rate-filtered designs. The latter roofs are close to or lower than the base roof due to the added weight of the accelerators. Finally, the design is selected that achieves the maximum velocity, equivalent to that of the human-operated base robot, and whose action throughput equals the base knee-point throughput.

Architectural Fine-Tuning. When no design achieves the base knee-point velocity, architectural tuning may be required to shift a design closer to the knee-point. AutoPilot provides two options for which points to consider for optimization: (i) they can be user-defined, or (ii) the design point closest to the knee-point can be selected. The tuning applies a variety of optimizations until the optimized design sits at (or very close to) the base knee-point in the F1 roofline. We employ a bag of architectural optimizations in this process. AutoPilot ships with two techniques, namely frequency scaling and technology scaling. In frequency scaling, we increase or decrease the operating frequency to trade off the performance and power of the hardware accelerator.
Lowering the frequency lowers power (TDP), which reduces the heatsink weight and increases the robot's acceleration (a) and velocity. This optimization is useful when a design is body-dynamics bound and over-designed. Likewise, increasing the operating frequency improves accelerator runtime and can be used when a design is under-optimized and compute-bound. In technology scaling, we evaluate designs in different process technology nodes to see whether a design can be moved closer to the knee-point.

Summary. The AutoPilot methodology is general (ML-based multi-objective DSE) and can be extended to other autonomous vehicles such as cars (with their CPS models [intelrss, nvidiacpscars]), other autonomy algorithms (Section II-D), and other hardware targets (e.g., FPGAs, CGRAs, multicores, systolic/non-systolic arrays, etc.). Within a fixed accelerator target, any other architectural optimization technique that trades off power and performance (e.g., policy quantization [quarl], model compression [han2015deep], memory optimizations [maxnvm]) can be added to the bag of architectural optimizations.
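The frequency-scaling knob can be sketched to first order: dynamic power scales roughly linearly with frequency (at fixed voltage) and accelerator runtime scales inversely with it. This is a simplified model for intuition, not AutoPilot's actual power estimator, and the baseline numbers are illustrative:

```python
# First-order frequency-scaling model for the fine-tuning step:
# dynamic power ~ f (voltage held fixed), runtime ~ 1/f.

def scale_frequency(base_freq_hz, base_power_w, base_latency_s, new_freq_hz):
    """Estimate power and latency after changing the clock frequency."""
    ratio = new_freq_hz / base_freq_hz
    return {"power_w": base_power_w * ratio,     # dynamic power scales with f
            "latency_s": base_latency_s / ratio} # runtime scales with 1/f

# Scaling an illustrative 8 W, 10 ms-per-inference design from 1 GHz to 125 MHz
d = scale_frequency(1e9, 8.0, 0.010, 125e6)
```

Under this model, an 8x frequency reduction cuts power 8x at the cost of 8x longer inference latency, which is exactly the trade used to shed heatsink weight on body-dynamics-bound designs.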
VI Experimental Setup
Air Learning Training Environments. We generate two environments with varying degrees of clutter using the Air Learning environment generator. The arena size is typical and is twice the arena sizes used in aerial robotics testbeds [flyingarena1, flyingarena2, flyingarena3, flyingarena4]. The NN is trained using Deep Q-Networks (DQN) [dqn], which work well on high-level navigation tasks for aerial robots [dqnuav1, dqnuav2]. We use the same reward function and other hyperparameters as the authors of Air Learning [krishnan2019]. Training terminates after 1 M steps or upon reaching the required success rate.

NN Policy Architecture Search. We use the Air Learning model architecture as the baseline template and change its hyperparameters. The NN policy is multi-modal, and prior work [krishnan2019] has shown that each input modality contributes to the success rate for the task. The basic template of the architecture used in that work is shown in Fig. 6(a). We made additional changes to the base template, such as the choice of filter sizes and strides. We choose a filter size of 3x3 and a stride of 1, with the ReLU activation function [relu] and no pooling.

SoC Power Estimation. We assume an SoC that includes the hardware accelerator architecture template shown in Fig. 6(b). To estimate the total SoC power, we add the power of the individual SoC components. To estimate the power of the hardware accelerator, we run a given NN policy on a cycle-accurate simulator, which produces SRAM traces, DRAM traces, and the number of read/write accesses to SRAM and DRAM. Using the SRAM and DRAM trace information, we model the SRAM power with CACTI [cacti] and the DRAM power with the Micron DRAM model [dram]. To estimate the power of the systolic array, we multiply the array size by the energy of a PE; the PE power is modeled after the breakdown in [limemdsedac]. For the ULP camera, we assume a camera capable of sustaining frame rates of up to 60 FPS at 144 x 256 image resolution, at a low power of less than 100 mW and a form factor of 6.24 mm x 3.84 mm [ulpcamera]. We account for the camera power in our overall power calculation. We also assume that the camera is interfaced with the system using a camera parallel interface [parallel] or MIPI [MIPI], similar to prior work [pulpdronet], from which the accelerator subsystem can directly fetch the input images to process. We further assume that the filter weights are loaded into system memory as a one-time operation. We assume the SoC contains two low-power, MCU-class cores to run the flight controller stack, which is typically a PID controller commanding the four rotors. The flight controller stack runs bare-metal on the MCU, as in the Bitcraze Crazyflie aerial robots [crazyflie]. For the MCU-class cores, we use Cortex-M cores implementing the ARMv8-M ISA [armv8misa]. Each MCU core consumes about 0.38 mW in a 28 nm process clocked at 100 MHz [armm33productpage]. We account for the power of the ultra-low-power cores in our final power numbers.
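Putting these pieces together, the SoC power roll-up and the downstream heatsink-weight estimate (described next) can be sketched as below. The camera and MCU figures come from the text; the accelerator/SRAM/DRAM inputs, the heatsink volume-per-watt coefficient, and the function names are illustrative assumptions, since the real flow relies on CACTI, the Micron model, and a heatsink calculator:

```python
# Sketch of the SoC power roll-up and compute-weight estimate.
# Camera (<100 mW) and MCU (~0.38 mW x 2 cores) figures are from the text;
# everything else here is an illustrative stand-in for the real tools.

ALUMINUM_DENSITY_G_PER_CM3 = 2.7       # common heatsink material
PCB_AND_COMPONENTS_G = 20.0            # typical for RasPi/CORAL-class boards

def soc_power_w(accel_w, sram_w, dram_w,
                camera_w=0.1, mcu_w=0.00038, n_mcu=2):
    """Total SoC power as the sum of its components."""
    return accel_w + sram_w + dram_w + camera_w + n_mcu * mcu_w

def compute_weight_g(soc_w, heatsink_cm3_per_w=2.0):
    """Heatsink weight (volume x aluminum density) plus PCB weight."""
    heatsink_g = soc_w * heatsink_cm3_per_w * ALUMINUM_DENSITY_G_PER_CM3
    return heatsink_g + PCB_AND_COMPONENTS_G

p = soc_power_w(accel_w=2.0, sram_w=0.5, dram_w=0.5)
w = compute_weight_g(3.0)
```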
The MCU cores receive high-level action commands from the accelerator subsystem through the system bus after each frame is run through the NN policy. The NN produces an action, which the flight controller interprets to generate the low-level motor actuation signals that control the aerial robot.

Compute Weight Estimation. Using the SoC power as the heat source, we calculate the required heatsink volume with a heatsink calculator [heatsink]. The weight of the heatsink is the estimated volume multiplied by the density of aluminum (a common heatsink material). We also assume that the final SoC is mounted on a PCB, along with all electrical components, weighing 20 g (which, per our analysis, is typical for RasPi [raspiweight] and CORAL [coralweight] class systems).

VII Evaluation
We present the results and analysis of AutoPilot (i.e., compute DSE, CPS co-design, and architectural fine-tuning). We then show that SoCs optimized for velocity increase the total mission count.
VII-A Compute Design Space Exploration (DSE)
Since off-the-shelf components fall short of optimal, we demonstrate that AutoPilot can automatically explore a large design space to find optimal NN policies and accelerator designs. We show the system's ability to generate a variety of policies and architectures by subjecting AutoPilot to environments with varying levels of obstacle density. Increasing complexity affects both the NN policy (deeper policies) and the hardware accelerator design. Fig. 8 shows the designs obtained using AutoPilot for two task complexities (low and high obstacle density). Each design point represents the SoC power, the DNN accelerator inference latency, and the success rate (color map). As described in Section V, AutoPilot uses Bayesian optimization to tune the various parameters until convergence while optimizing the costs (performance, power, and success rate). While the NN policy determines the success rate, the accelerator power and performance depend on both the policy and the hardware parameters. AutoPilot converges to optimal accelerator designs by sampling less than 0.5% of the total design space.

AutoPilot tunes the NN policies to have 2-6 layers, with each layer having 32, 48, or 64 filters. For the complex task, AutoPilot automatically selects deeper NN policies because they yield a higher success rate. For instance, while 32 filters (and 3-5 layers) suffice to achieve a success rate above 80% for low obstacle density, 48 filters are required at high obstacle density to reach a similar success rate. AutoPilot tunes the hardware accelerator parameters to generate designs ranging from low-power to high-performance. We specifically tune the array height/width between 16 and 128 and the SRAM (ifmap/ofmap/filter) sizes between 32 KB and 2 MB. Fig. 8 highlights three regions in the DSE to demonstrate how AutoPilot can generate hardware accelerator candidates under certain power-performance bounds irrespective of task complexity.
Regions A, B, and C denote bounds that are under 2 W (25 FPS), 42 W (50 FPS), and 48 W (100 FPS), respectively.
As task complexity changes, AutoPilot can generate a multitude of design candidates within the same power-performance bounds. Because we co-design the cyber-physical parameters, having multiple design candidates translates to greater scalability of the methodology: optimal compute platforms can be selected as the sensor or body dynamics change (Section VII-B).
VII-B Cyber-Physical Co-Design
While the compute DSE generates a large spread of architectural designs, not all points are suited for deployment on an aerial robot seeking a balanced system (as shown in Section IV using F1). Hence, in this section we show that (1) the F1 model is essential for finding, from a user specification (e.g., drone type, sensor framerate), the accelerator architecture that leads to optimal robot velocity, and (2) architectures optimized for raw performance or low power do not necessarily land at the optimal knee-point (maximum velocity). For a comprehensive analysis, we perform CPS co-design with three aerial robots, namely the AscTec Pelican (mini-UAV), the DJI Spark (micro-UAV), and a nano-UAV [nanouav], which have thrust-to-weight ratios (including battery/sensor) of 2.4, 1.9, and 3.1, respectively, representing a range of body dynamics. We also consider sensor framerates of 30 and 60 FPS. Fig. 9 shows CPS co-design for the navigation task in the high-density environment. We filter the design points from Fig. 7(b) by high success rates, as shown in Fig. 8(a). These designs represent various accelerator candidates for the NN policy that achieves a success rate of at least 83.4% (4 layers and 32 filters); a success rate greater than 80% is nominal [trailnet, accuracy1] in aerial robot navigation tasks. Out of the many accelerator design candidates, we highlight four, denoted '1' (lowest power and slowest runtime), '2' (AutoPilot selected), '3' (highest performance and highest power), and '4' (AutoPilot selected). The architectural details of these design points, such as systolic array size and IFM/filter memory, are annotated in Fig. 8(a). Using these four points, we demonstrate the need for the F1 model in designing onboard compute for aerial robots.
We also show that cyber-physical co-design is critical to achieving compute platforms that maximize velocity, rather than designing hardware in isolation for objectives such as high performance, low power, or energy efficiency.

The F1 model identifies optimal design points. Plotting the four architectural design points on the F1 roofline model for the AscTec Pelican (Fig. 8(b)), DJI Spark (Fig. 8(c)), and nano-UAV [nanouav] (Fig. 8(d)) with 30 FPS and 60 FPS sensor framerates, we observe that the balanced, high-performance, and low-power design points are all far from the optimal knee-point of their respective aerial robots. Instead, design point '2', selected by AutoPilot, is the optimal knee-point for the AscTec Pelican with a 60 FPS sensor. For the 30 FPS sensor on the AscTec Pelican, design point '4' is the optimal compute design: any further improvement in compute performance yields no improvement in velocity, because performance is bound by the sensor framerate (30 FPS). For the DJI Spark with 30 FPS and 60 FPS sensors and the nano-drone [nanouav] with a 30 FPS sensor, AutoPilot selects '4' as the optimal compute design. However, for the nano-drone with a 60 FPS sensor, '4' is not the optimal knee-point and results in a compute-bound scenario. Using the F1 model in the CPS co-design phase, we show that ad-hoc selection of high-performance compute designs such as '3' can degrade the overall performance (e.g., high-speed velocity) of the drone. For instance, the highest-performance accelerator design point '3' (highest power) decreases the safe velocity of the DJI Spark and the nano-UAV [nanouav] by 13.2% and 44%, respectively, due to the added weight of the heatsink for cooling, compared to the baseline No-Acc case. Thus, the F1 performance model allows us to pick the optimal design rather than the typical low-power, high-performance, or balanced architecture that would often result if the compute were designed in isolation without CPS co-design.

One-size compute does not fit all.
For the AscTec Pelican (Fig. 8(b)) with 30 FPS and 60 FPS sensor framerates, the optimal design points are '4' and '2', respectively. Interestingly, '4' (optimal for the 30 FPS framerate), if chosen as the compute platform, becomes compute-bound for the AscTec Pelican at 60 FPS, which drops the maximum safe velocity by 15% compared to design point '2'.

Takeaway. Choosing the computing platform (whether general-purpose or custom-designed) in an ad-hoc fashion can deteriorate the physical performance of the robot, which in turn has implications for mission energy (discussed in Section VII-D). Hence, while designing (or selecting) the computing platform for an aerial robot, one must account for the robot's cyber-physical parameters to achieve maximum performance.
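The selection logic behind these observations can be sketched as follows: the usable action throughput is the minimum of the sensor framerate and the accelerator's inference rate, and velocity grows with throughput only up to the body-dynamics roof. The linear velocity-vs-throughput slope below is an illustrative stand-in for the paper's CPS relation, not the actual model:

```python
# Sketch of F1-style design selection: throughput is capped by sensor and
# accelerator rates; velocity grows with throughput up to the body roof.

def achievable_velocity(sensor_fps, accel_fps, v_body_max, v_per_fps=0.2):
    """Safe velocity under sensor, compute, and body-dynamics bounds.
    The linear v_per_fps slope is a toy stand-in for the CPS relation."""
    throughput = min(sensor_fps, accel_fps)   # actions/second actually usable
    return min(v_per_fps * throughput, v_body_max)

def pick_design(designs, sensor_fps, v_body_max):
    """Pick the design maximizing velocity; break ties with lower power."""
    return max(designs,
               key=lambda d: (achievable_velocity(sensor_fps, d["fps"],
                                                  v_body_max),
                              -d["power"]))

# Hypothetical candidates: a slow low-power design, a knee-point design,
# and an over-provisioned high-performance design.
designs = [{"fps": 25, "power": 1.0},
           {"fps": 60, "power": 3.0},
           {"fps": 100, "power": 8.0}]
best = pick_design(designs, sensor_fps=60, v_body_max=10.0)
```

With a 60 FPS sensor, the 100 FPS accelerator buys no extra velocity over the 60 FPS one, so the lower-power knee-point design wins, mirroring why design '3' loses to '2'/'4' above.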
VII-C Architectural Fine-Tuning
To show the effectiveness of architectural fine-tuning, we consider the AscTec Pelican with a 60 FPS sensor, but assume that the knee-point (i.e., design '2' in Fig. 8(b)) was not achieved. In this case, using the bag of architectural optimizations (frequency/node scaling), we are able to move the suboptimal body-dynamics-bound and compute-bound designs (points '3' and '4' in Fig. 9(a)) to the knee-point.

Body-Dynamics Bound. Design '3' (high power, high performance, and body-dynamics bound), clocked at 1 GHz in a 45 nm process node, has a compute throughput 3x higher than the knee-point for the AscTec Pelican. By scaling its frequency down to 125 MHz, AutoPilot brings this suboptimal point closer to the knee-point (denoted by '3') as shown in Fig. 9(b). Lowering the frequency from 1 GHz to 125 MHz reduces power consumption from 7.5 W to 1 W (Fig. 9(a)). The power reduction shrinks the heatsink requirement, making this design lighter and near-optimal.

Compute-Bound. Design '4', clocked at 1 GHz in the 45 nm node, results in a compute-bound case for the AscTec Pelican. To bring this design to the knee-point, AutoPilot increases the accelerator's throughput by 1.6x without significantly increasing its power consumption. We show that by scaling the process to 22 nm and clocking at 4 GHz, AutoPilot brings it closer to the knee-point (Fig. 9(b)).

Takeaway. When the CPS co-design step (Section VII-B) is unable to generate the optimal knee-point design, the architectural fine-tuning engine can be launched, applying various optimization techniques to deliver the final knee-point design. The level of flexibility allowed by AutoPilot, and the tradeoffs it may make, are configurable by the end user.
VII-D Mission Time/Energy Implications of the Optimal System
The end goal of AutoPilot is choosing an onboard compute system (design point) that minimizes mission time and energy. To this end, we evaluate three robots: the AscTec Pelican (mini-UAV), the DJI Spark (micro-UAV), and the nano-UAV used in Zhang et al. [nanouav]. We show that the optimal design point (i.e., the knee-point) generated by AutoPilot always outperforms the non-optimal designs (other AutoPilot-generated designs or ad-hoc selections of onboard compute, e.g., a TX2).

Mission Time Comparisons. To estimate mission time, we assume a package delivery scenario in which a radius of 100 m separates the source and destination. We pick two categories of points, namely the knee-point and others (compute-bound, body-dynamics bound, and ad-hoc selections of computing platform), for the three aerial robots. For each design point, we estimate the maximum velocity the robot achieves when using that design as its onboard compute. Fig. 10(a) shows mission times (lower is better) for the five different computing platforms across the AscTec Pelican (mini-UAV), DJI Spark, and nano-UAV. The AutoPilot-generated optimal design (knee-point) always achieves the lowest mission time. It is worth noting that the selection of the knee-point design becomes more critical as the aerial robot is miniaturized (mini-UAV → micro-UAV → nano-UAV). For instance, between the AutoPilot-generated knee-point and the body-bound design point (also generated by AutoPilot) on the AscTec Pelican (mini-UAV), the mission time improvement is only 5%, whereas for the micro-UAV and nano-UAV the difference is 20% and 80%, respectively. The improvement for the AscTec Pelican is marginal because it is a larger drone with a higher payload capacity; the 4 W TDP difference between the knee-point and the body-bound design point, and the associated extra heatsink weight, is too small to significantly degrade its body dynamics (a) and safe velocity (Eq. 4).
However, for the DJI Spark and the nano-UAV [nanouav], the payload capacity is lower, so the extra heatsink weight (from the compute TDP) can significantly lower the acceleration (a) and the safe velocity.
Mission Energy Comparisons. Fig. 10(b) shows the mission energy for the three drone platforms with five different compute platforms. The knee-point design always lowers the mission energy compared to the other selections. Mission energy (E) is related to mission time as follows:
E = t × (P_rotors + P_compute + P_others)   (5)

where t is the time it takes to complete the mission, and P_rotors, P_compute, and P_others are the power consumed by the rotors, the onboard compute, and the other components (sensor, flight controller, etc.) of the aerial robot. It is important to note that P_rotors consumes more than 95% of the total power [mavbench, pulpdronet], but a higher P_compute (higher TDP → heavier heatsink) can lower the acceleration (a), which lowers V (and raises t). Thus, a knee-point design lowers the mission energy by minimizing t (higher V → lower t) and minimizing P_compute, compared to the other design points (compute-/body-bound or ad-hoc selections). The optimal designs generated by AutoPilot for the AscTec Pelican, DJI Spark, and nano-UAV achieve 2x, 1.54x, and 1.81x lower mission energy than the other designs.
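As a worked sketch of Eq. 5, consider a 100 m delivery with rotor power dominating, as the text notes. All numbers here are illustrative, not the paper's measured values:

```python
# Worked example of Eq. 5: E = t * (P_rotors + P_compute + P_others).
# A knee-point design with slightly higher compute power but higher safe
# velocity still wins on energy, because rotor power dominates and t shrinks.

def mission_time_s(distance_m, velocity_mps):
    return distance_m / velocity_mps

def mission_energy_j(t_s, p_rotors_w, p_compute_w, p_others_w):
    return t_s * (p_rotors_w + p_compute_w + p_others_w)

# Illustrative numbers: knee-point flies at 10 m/s with 3 W compute;
# an ad-hoc pick flies at 6 m/s (heavier heatsink) with 8 W compute.
e_knee  = mission_energy_j(mission_time_s(100, 10.0), 150.0, 3.0, 5.0)
e_adhoc = mission_energy_j(mission_time_s(100, 6.0), 150.0, 8.0, 5.0)
```

Even though the knee-point's compute draws less than 2% of total power, the shorter mission time dominates the energy savings.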
VIII Related Work
Performance Models. Analytical performance models, such as multicore Amdahl's law [amdahlmulticore], the Roofline model [roofline], Gables [hill2019gables], and several others [chitaanalyticalintro], are useful for guiding the design of an optimal system for a given workload. These models apply to traditional compute and are not explicitly targeted at robots, which have both cyber and physical components. Our work proposes a roofline-like model to help understand the role of computing in aerial robots. In the context of performance modeling for complex systems (i.e., beyond compute-only systems), cote [orbitalcomputing] is a full-system model for the design and control of nanosatellites. The cote model takes into account orbital mechanics and physical bounds on communication, computation, and data storage to design a cost-effective, low-latency, and scalable nanosatellite system. The F1 model has a similar objective: it combines the interactions between compute/sensor (cyber components) and body dynamics (physical components) to expose the bottlenecks in building an optimal system.

Accelerators for Robots. Recently, a low-power accelerator [pulpdronet] was proposed for neural-network-based control, but that work is customized for nano-drones running DroNet [dronet]. Our work provides a general methodology to generate multiple NN policies and hardware accelerator designs from a high-level specification. Navion [navion] is a specialized accelerator for aerial robots in the sense-plan-act control paradigm, targeting improved visual-inertial odometry; we focus on end-to-end control algorithms, an emerging autonomy paradigm. RoboX [robox] generates an accelerator for motion predictive control from a high-level DSL. Though the high-level goal is the same, our work differs in that RoboX does not consider the effect of the cyber-physical parameters on the computing platform.
We instead contribute the F1 model to quantify the optimality of our designs. Outside of aerial robots, prior work [dansorin1, dansorin2] has shown the benefits of designing custom hardware accelerators for motion planning algorithms for robotic arms. Though the robots differ, AutoPilot, guided by the F1 model, can build similarly optimal motion planning hardware accelerators targeted at aerial robots.
IX Conclusion
AutoPilot is a push-button solution that automates cyber-physical co-design, automatically generating an optimal control algorithm (NN policy) and its hardware accelerator from a high-level user specification. The concepts developed for AutoPilot, such as cyber-physical co-design, the F1 model for identifying the optimal design point, architectural fine-tuning, and selecting optimal design points by their effect on the overall mission, can be adapted to other types of autonomous robots, such as self-driving cars.