Autonomous robots, such as self-driving cars and aerial robots, are on the rise [Timothy2017, medical-org, 8373043, search-and-rescue-org, surveillance-org, wan2020survey]. Building computing systems for these domains is challenging because autonomous robots differ from traditional computing systems (embedded systems, servers, etc.): a robot must sense the environment through its sensors, make real-time decisions (e.g., detection and evasion) with the available onboard compute, and actuate itself within the environment (e.g., evade an obstacle). These robots have cyber components (sensor/compute) and physical components (such as frames/rotors) that interact with one another to work as one coherent system. Hence, autonomous robots are cyber-physical systems (CPS), and the traditional computing platform is just one component among many others. To design the optimal onboard compute, we need to perform cyber-physical co-design. The selection of the cyber and physical components affects the system “performance” (i.e., velocity, mission time, energy) of the aerial robot. For instance, cyber quantities, such as the sensor framerate and the processing rate of the sensor data, determine how fast the aerial robot reacts in a dynamic environment, which in turn determines the safe velocity. Physical quantities, such as weight (frame, payload), determine whether the physics allows it to accelerate and move faster. To perform cyber-physical co-design, we must first understand the role of computing (specifically in autonomous aerial robots), and then design domain-specific architectures. To intuitively understand the role of computing in such a cyber-physical system, we introduce the “Formula-1” (F-1) visual performance model to guide the design of optimal systems for a given robot task.
F-1 determines which of the cyber-physical components (compute, sensor, body) determines the safe operating velocity; safe high-speed autonomous navigation remains one of the key challenges in enabling aerial robot applications [darpa, vijaykumar-1, vijaykumar-2, droneracing]. Safety ensures that the control algorithm is reactive to a dynamic environment, while high-speed navigation ensures that the aerial robot finishes tasks quickly, thereby lowering mission time and energy [mavbench]. Using F-1, we show that a performant aerial robot requires careful co-design of the autonomy algorithm and the underlying hardware, along with the cyber-physical parameters of the aerial robot. We evaluate two popular learning-based autonomy algorithms, DroNet [dronet] and Vgg16 (CAD2RL) [cad2rl], on computing platforms used in aerial robots, namely, Nvidia Xavier, Nvidia TX2, Intel NCS, and Ras-Pi. Our observations show that the ad-hoc selection of autonomy algorithms or onboard computing platforms is far from optimal. To efficiently design domain-specific architectures while being cognizant of the cyber-physical parameters, we introduce AutoPilot, an intelligent cyber-physical design space exploration framework that uses the F-1 model to automatically generate the optimal (learning-based) autonomy algorithm and its associated hardware accelerator from a high-level user-defined robot task, platform constraints, and optimization target specification. AutoPilot consists of two parts: (1) a learning-based autonomy algorithm generator and (2) a multi-objective algorithm-hardware tuner. The algorithm generator focuses on generating a functionally correct learning-based autonomy algorithm. AutoPilot automatically trains and tests the neural-network-based autonomy algorithm for a given aerial robot task using deep reinforcement learning (RL) [airlearning]. The tuner uses multi-objective Bayesian optimization [bayesopt] to automatically tune the learning-based autonomy algorithm's hyperparameters and the hardware accelerator parameters simultaneously to meet the optimization target (e.g., high safe velocity, lower mission energy) specified in the high-level specification. We use AutoPilot to automatically generate the Pareto-optimal design points for aerial robot navigation tasks. We show AutoPilot's ability to generate these design points for three different target drone platforms (mini-UAV, micro-UAV, and nano-UAV) with sensor framerates of 30 FPS and 60 FPS. We show that AutoPilot's generated optimal design point achieves up to 2×, 1.54×, and 1.81× lower mission energy for mini-UAV, micro-UAV, and nano-UAV, respectively, compared to using commercial off-the-shelf accelerators (Nvidia TX2) or other non-optimal design points generated by AutoPilot. Our results show the importance of cyber-physical co-design, as opposed to the ad-hoc stand-alone design of the onboard computing platform, and the implication of selecting optimal design points on mission time and mission energy. In summary, we make the following contributions:
F-1, a visual performance model to understand the role of a computing platform in aerial robots while considering other components such as sensor and body dynamics.
AutoPilot, an intelligent cyber-physical design space exploration framework that allows us to automatically co-design a learning-based control algorithm with the accelerator from a high-level user specification.
A demonstration of how cyber-physical co-design maximizes safe flying velocity while minimizing overall mission energy.
II Autonomous Aerial Robot Background
This section provides background on the key components of aerial robots, the role of the flight controller, and a brief overview of the two control algorithm paradigms, namely “Sense-Plan-Act” and “End-to-End Learning”.
II-A Aerial Robot Components
Autonomous aerial robots typically have three key components, namely rotors, sensors, and an onboard computing platform. Rotors determine the thrust that the aerial robot can generate. The sensor (e.g., camera) allows the aerial robot to sense the environment. The onboard compute executes the autonomy algorithm to process the sensor data. The size of an aerial machine plays an important role in component selection.
II-B Flight Controller
The task of the flight controller is the stabilization and control of the aerial robot. It is designed in a multi-level hierarchical fashion and is realized using PID controllers. The flight controller firmware stack is computationally light and typically runs on microcontrollers [ucontroller-fc-1, ucontroller-fc-2]. To stabilize the drone against unpredictable disturbances (sudden winds or damaged rotors), the inner loop typically runs at closed-loop frequencies of up to 1 kHz [1khz-control, koch2019neuroflight].
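To make the control loop concrete, the following is an illustrative sketch (not taken from any flight-controller firmware) of a single-axis PID step of the kind the inner loop executes at roughly 1 kHz; the gains and the first-order plant are arbitrary toy values.

```python
# Illustrative sketch: one axis of a PID attitude loop, stepped at ~1 kHz.
# Gains and the toy plant below are assumed values for demonstration only.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=0.5, kd=0.1)
dt = 1e-3                                # 1 kHz inner loop -> 1 ms period
angle = 0.0
for _ in range(1000):                    # one second of simulated control
    torque = pid.step(10.0, angle, dt)   # drive roll angle toward 10 degrees
    angle += torque * dt                 # toy first-order plant
```

In real firmware this loop is cascaded (rate loop inside attitude loop) and driven by IMU measurements rather than a simulated plant.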
II-C Onboard Compute
In addition to the flight controller, there is a separate, dedicated computer responsible for generating high-level actions from various autonomy algorithms (which we describe later in Section II-D). Nano-UAVs, due to their size and weight constraints, typically use microcontrollers as the onboard computing platform. For example, CrazyFlie [crazyflie] weighs less than 27 g and is powered by an ARM Cortex-M4 microcontroller. On the other end are mini-UAVs, which are bigger and have a higher payload capacity. Mini-UAVs typically use a general-purpose computing platform such as the Intel NUC or Nvidia Jetson TX1/TX2. For example, the AscTec Pelican [high-speed-drone], which weighs 1.6 kg, is powered by an Intel NUC platform.
II-D Autonomy Algorithms
Autonomous behavior in aerial robots is achieved by algorithms that fall into two broad categories, namely “Sense-Plan-Act” and “End-to-End Learning”. In “Sense-Plan-Act,” the algorithm is broken into three or more distinct stages, namely the sensing stage, the planning stage, and the control stage. In the sensing stage, the sensor data is used to create a map [rusu20113d, elfes1989using, dissanayake2001solution] of the environment. The planning stage [rrt, motion-planning-survey] processes the map to determine the best (e.g., collision-free) trajectory. The trajectory information is used by the control stage, which actuates the rotors so the robot stays on the trajectory. The execution time for these algorithms varies from hundreds of milliseconds to a few seconds [mavbench]. End-to-End learning methods, which we focus on in this work, directly process raw input sensor information (e.g., RGB, Lidar, etc.) with a neural network model to produce output actions. Unlike the Sense-Plan-Act paradigm, end-to-end learning methods do not require maps or separate planning stages and hence are much faster than non-NN-based autonomy algorithms [pulp-dronet, vgg-tx2]. The model can be trained using supervised learning [e2e-nvidia, dronet, trailnet0, trail-net] or reinforcement learning [cad2rl, qt-opt, rl-car].
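The end-to-end structure can be made concrete with a schematic sketch: image in, action out, no intermediate map or planner. This is not DroNet or CAD2RL; it is a toy stand-in (random weights, a single hidden layer) whose two output heads merely mirror the steering-angle/collision-probability interface that DroNet exposes.

```python
# Schematic stand-in for an end-to-end policy: a raw camera frame maps
# directly to an action, with no map or planning stage in between.
# Weights are random; the two-head output is DroNet-style by analogy only.
import numpy as np

rng = np.random.default_rng(0)

def end_to_end_policy(frame, w1, w2):
    """Toy learned policy: flatten the image, one ReLU layer, two heads."""
    x = frame.reshape(-1) / 255.0            # normalize pixels to [0, 1]
    h = np.maximum(w1 @ x, 0.0)              # hidden layer (ReLU)
    steer, collision_logit = w2 @ h          # steering head, collision head
    return steer, 1.0 / (1.0 + np.exp(-collision_logit))  # sigmoid -> prob

frame = rng.integers(0, 256, size=(20, 20))  # tiny grayscale frame
w1 = rng.normal(size=(16, 400)) * 0.05
w2 = rng.normal(size=(2, 16)) * 0.05
steer, p_collision = end_to_end_policy(frame, w1, w2)
```

A real policy would replace the random matrices with convolutional weights trained via the supervised or reinforcement learning methods cited above.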
III F-1 Performance Model
In this section, we introduce the F-1 visual performance model, which helps computer architects understand whether a robot's performance is bottlenecked by the selection (or design) of the compute (and autonomy algorithm) or by other components of the aerial robot, such as the sensor or its body dynamics (laws of physics). We start with an overview of F-1 as a performance model and explain how it can be useful. Then we describe how we construct the F-1 model.
The F-1 model visually resembles the traditional computer-system roofline model [roofline], albeit the parameters in the F-1 model quantify the aerial robot as a holistic system as opposed to the compute system in isolation. Similar to the roofline model, the F-1 model can be used by computer architects in two ways. First, it can be used as a visual performance model to understand various bounds and bottlenecks. Second, it can identify an optimal system (autonomy algorithm + onboard compute) for an aerial robot.
III-A Need for a Cyber-Physical Performance Model
The rate at which motion decisions are made in a drone depends on the speeds of components within the sensor-compute-control pipeline (Fig. 0(b)): the sensor capturing a snapshot (e.g., image) of the environment, the computer processing the sensor data to generate high-level decisions, and the controller realizing the final decisions. The slowest of the sensor, compute, and control subsystems creates the upper bound on the rate at which final decisions are generated. The decision-making rate determines how fast an intelligent agent (biological or mechanical) can travel while maintaining maneuverability. For example, consider the case of a drone flying through a crowded obstacle course. While the drone's response time to new stimuli is governed by the total latency of the entire sensor-compute-control pipeline, the rate at which the drone can output motor actions is tied to the maximum throughput of that pipeline (Fig. 0(b)). As long as the total latency is low enough to perceive and track objects in the environment (e.g., obstacles, other drones), the speed with which the drone can maneuver through obstacles with agility is limited by the rate at which valid decision actions can be output by the pipeline (i.e., the throughput). Our insight is that this problem resembles the canonical rate-matching problem in computer systems. Computer architects are familiar with modeling this using analytical models such as bottleneck analysis [bottleneck-analysis], Roofline [roofline], and Gables [hill2019gables]. However, to achieve high-speed agility for drones, one must also consider the physical quantities (governed by physics) and how they affect the selection of the sensor, compute, and control subsystems. Traditional computer architecture models fall short of capturing these effects. Hence, to design an agile high-speed drone, one must factor in both the physical quantities and the rate matching of the individual subsystems.
The F-1 model unifies the parameters that determine the decision-making rate with the parameters that determine the drone's physics to realize agile, high-speed flight.
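The rate-matching view above can be sketched in a few lines. This is a minimal illustration, not AutoPilot code: with the sensor, compute, and control stages pipelined, action throughput is set by the slowest stage, and comparing it against a knee-point throughput (introduced formally in the next subsection) classifies the bottleneck. All Hz values are illustrative.

```python
# Minimal sketch of the rate-matching view of the sensor-compute-control
# pipeline. f_knee is the knee-point throughput from the F-1 model;
# all numeric rates below are illustrative, not measured.

def action_throughput(f_sensor, f_compute, f_control):
    """Pipelined sensor-compute-control loop: the slowest stage wins."""
    return min(f_sensor, f_compute, f_control)

def bottleneck(f_sensor, f_compute, f_control, f_knee):
    """Classify which subsystem bounds safe velocity (F-1 regions)."""
    f_sa = action_throughput(f_sensor, f_compute, f_control)
    if f_sa >= f_knee:
        return "body-dynamics bound"     # faster decisions would not help
    return "compute bound" if f_compute <= f_sensor else "sensor bound"

# 60 FPS camera, 28 Hz autonomy algorithm, 1 kHz flight controller:
print(bottleneck(f_sensor=60, f_compute=28, f_control=1000, f_knee=100))
# -> "compute bound": the 28 Hz autonomy loop is the slowest stage
```

The control stage runs at kHz rates in practice (Section II-B), so in this sketch it is effectively never the limiter.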
III-B Using the F-1 Model
The F-1 model defines the upper bound on the safe velocity, considering the maximum rate at which the drone's sensor-compute-control pipeline can make a decision. Responsiveness within a safe perceptual operating regime is the typical use case for most drones, and to ensure that the drone stays in that safe regime, it can be programmed to invoke a stopping policy [high-speed-drone, stop-policy-2, stop-policy-3]. Our work focuses on operating efficiently within the safe regime to maximize agile velocity, and thus minimize mission time and battery energy. The F-1 model is a log-scale plot of safe velocity (V_safe) on the y-axis against “action throughput” (f_sa) on the x-axis (Fig. 1(a)). The action throughput is the throughput of the sensor-compute-control pipeline, i.e., the rate at which decisions (e.g., move forward, turn left, etc.) are generated. Safe velocity (V_safe) is defined as the velocity at which an aerial robot can travel without colliding with an obstacle. Any speed less than or equal to V_safe guarantees safety, while any speed exceeding V_safe is considered unsafe. The F-1 model shows that a robot's velocity increases with the throughput of its sense-compute-control pipeline only up to a point, after which it is independent of the pipeline's throughput. We define the decision-making rate of the robot as f_sa, and its inverse, the control period, as T_sa. Because the stages in the sensor-compute-control pipeline can run concurrently, the minimum control period of the pipeline can never be smaller than the maximum latency of any component in the pipeline:

T_sa ≥ max(T_sensor, T_compute, T_control)    (1)
If the stages of the pipeline are not fully overlapped, the smallest practical control period may approach the total pipeline latency:

T_sa ≤ T_sensor + T_compute + T_control    (2)
While the perceptual responsiveness to new stimuli (i.e., latency) is fixed at the upper bound in Eq. 2, through successful pipelining we can output new control actions at a higher rate, approaching the lower bound in Eq. 1. As long as the robot's perceptual responsiveness is within a safe operating regime, as mentioned earlier, this allows the robot to execute complicated maneuvers at a higher traveling velocity, making for a more agile drone with shorter mission times. The upper bound on the action throughput (f_sa) for a pipelined scenario follows from Eq. 1:

f_sa ≤ min(f_sensor, f_compute, f_control)    (3)
where T_sensor = 1/f_sensor is the latency to sample data from the sensor. If the aerial robot has a 60 FPS camera, the sensor data can be sampled at a 16.67 ms interval, which becomes the sensor latency. T_compute is the latency of the autonomy algorithm to estimate the high-level action commands. The algorithm running on the computing system consumes the sensor data; compute throughput is a function of the autonomy algorithm (Section II-D) as well as the underlying hardware architecture. T_control = 1/f_control is the latency to generate the low-level actuation commands. Typical values of f_control are upwards of 1 kHz [koch2019neuroflight]. With these terms in place, the F-1 visual performance model can be used to perform a bound-and-bottleneck analysis to determine whether the safe velocity is affected by the onboard sensor/compute. Any point to the left of the “knee-point” in F-1 (Fig. 1(a)) denotes that the safe velocity is bounded by the choice of compute (and autonomy algorithm) or sensor, and any point to the right of the knee-point denotes that the velocity is bounded by the body dynamics of the aerial robot. Ideally, an optimal pipeline design has an action throughput equal to that of the knee-point (f_knee). Body-Dynamics Bound. An aerial robot's physical properties, such as its weight and the thrust produced by its rotors, determine how fast it can move; hence the ultimate bound on the safe velocity (V_safe) is determined by its body dynamics. We call the region to the right of the knee-point (i.e., when the sense-to-act throughput is greater than or equal to f_knee) body-dynamics bound. In this region, unless the physical components are improved (e.g., increasing the thrust-to-weight ratio), the velocity cannot exceed the current peak safe velocity no matter how fast a decision is made (i.e., regardless of the selection of faster compute/sensor). Sensor Bound. The choice of onboard sensors limits the decision-making rate (f_sa), which can limit the safe velocity (V_safe). As shown in Fig. 1(a), a robot's velocity is sensor-bound if its action throughput is equal to the sensor's frame rate (f_sensor) but less than the knee-point throughput (f_knee).
The sensor-bound case occurs when the sensor throughput (f_sensor) is less than or equal to the compute throughput (f_compute) (i.e., the action throughput is equal to f_sensor according to Eq. 3), and f_sensor < f_knee. In this scenario, the sensor adds a new ceiling to the F-1 model, thus bounding the velocity under this sensor ceiling. In this region, unless the sensor throughput is improved (e.g., a higher-FPS sensor), the velocity cannot exceed the sensor-bound ceiling no matter how fast the onboard compute can process the sensor input. Compute Bound. The choice of onboard compute (or autonomy algorithm) also affects the decision-making rate (f_sa). As also shown in Fig. 1(a), a robot's velocity is compute-bound if its action throughput (f_sa) is less than both the sensor's frame rate (f_sensor) and the knee-point throughput (f_knee). In this case, the computing platform adds a new ceiling to the F-1 model, bounding the velocity under this limit. In this region, unless the compute performance is improved (e.g., hardware accelerators or algorithm-hardware co-design), the velocity cannot exceed the compute-bound ceiling. Optimal Design. The F-1 model can identify system designs that achieve an optimal/balanced overall system capability. Fig. 1(b) shows how understanding the bounds on safe velocity using F-1 can help in designing an optimal system for aerial robots. For a given robot with fixed mechanical properties, changing the sensor type or onboard compute impacts f_sa. Consequently, the optimal design point is when the sensor throughput and compute throughput result in an action throughput equal to the knee-point throughput (f_knee). Over-Optimal Design. If the action throughput f_sa is such that f_sa > f_knee, then the sensor/compute is over-optimized, since any value greater than f_knee yields no improvement in the velocity of the aerial robot.
Such an over-designed computing/sensor platform involves not only extra optimization effort but also burns additional power, which further increases the drone's total power and decreases its overall battery life. Sub-Optimal Design. The F-1 model can help architects understand the performance gap between the current compute design and the optimal design. For instance, if the action throughput f_sa is such that f_sa < f_knee, then the sensor/compute is under-optimized, which signifies that the current system is off by (f_knee − f_sa) and that there is scope for improvement through a better algorithm or the selection (or design) of the computing system.
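The over/sub-optimal classification above amounts to comparing f_sa against f_knee; the following sketch (illustrative, with assumed numbers) expresses the headroom as a ratio, which is often more useful to an architect than the absolute gap.

```python
# Hedged sketch: quantify how far a design sits from the knee-point.
# The 1.3 Hz / 28 Hz figures below are illustrative sample values.

def design_headroom(f_sa, f_knee):
    """Classify a design point and return its distance from the knee
    as a throughput ratio (1.0 means exactly at the knee-point)."""
    if f_sa < f_knee:
        return ("sub-optimal", f_knee / f_sa)    # speedup still pays off
    if f_sa > f_knee:
        return ("over-optimal", f_sa / f_knee)   # wasted throughput/power
    return ("optimal", 1.0)

label, ratio = design_headroom(f_sa=1.3, f_knee=28.0)
print(f"{label}: ~{ratio:.1f}x away from the knee-point")
```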
III-C Constructing the F-1 Model
In this section, we describe how we construct the F-1 model starting from prior work [high-speed-drone] that has established and validated the relationship between the cyber-physical parameters and the safe velocity of the aerial robot, as described by Eq. 4:

V_safe = a_max (√(T_sa² + 2d/a_max) − T_sa)    (4)

Eq. 4 states that if the robot's body dynamics (physics) permit it to accelerate at most by a_max, its compute and sensors permit it to sense and act at an interval of T_sa (1/f_sa), and its sensor(s) can sense the environment as far as d meters, then the robot can travel at most as fast as V_safe. For instance, Fig. 3 depicts an aerial robot with its field of view (FoV) [fov] and an obstacle (e.g., a tree or a bird) within the FoV. The FoV is the region of the environment that the sensor can observe. In this scenario, the aerial robot can travel at most at V_safe and still stop without colliding with the obstacle.
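Eq. 4 follows from a stopping-distance argument; the short derivation below is a reconstruction from the cited prior work, consistent with the numbers in this section (e.g., √(2·a_max·d) ≈ 32 m/s for a_max = 50 m/s², d = 10 m):

```latex
% Worst case: the robot travels blindly for one decision interval T_{sa},
% then brakes at its maximum deceleration a_{max}. Both phases must fit
% within the sensed range d:
d = V_{safe}\, T_{sa} + \frac{V_{safe}^{2}}{2\, a_{max}}
% Solving this quadratic in V_{safe} and keeping the positive root
% yields Eq. 4. The limiting cases match the sweep in Sec. III-C:
\lim_{T_{sa} \to 0} V_{safe} = \sqrt{2\, a_{max}\, d}, \qquad
\lim_{T_{sa} \to \infty} V_{safe} = 0
```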
To construct the model, we sweep T_sa from 0 to 5 seconds along with typical acceleration values (a_max = 50 m/s²) and sensor range (d = 10 m), as shown in Fig. 3(a). We observe an asymptotic relation between velocity and T_sa such that as T_sa → 0, the velocity → 32 m/s (as seen in the magnified portion of Fig. 3(a)). Likewise, as T_sa → ∞, the velocity → 0. We also plot f_sa (the inverse of T_sa) on the x-axis and velocity on the y-axis in Fig. 3(b). Both the x-axis and y-axis are plotted on a linear scale. As T_sa decreases (or 1/T_sa increases), there is a sudden transition in velocity (0 to 31 m/s), which saturates thereafter. We see that there is a point beyond which increasing f_sa does not increase the velocity, showing a saturation or a roofline. Fig. 3(c) plots the x-axis on a log scale. Plotting the x-axis on a log scale allows us to observe the transition that was not evident on the linear scale (Fig. 3(b)) or in the original CPS relation (Fig. 3(a)). We also annotate the three plots with two sample points, denoted point ‘A’ and the ‘knee-point’. Point A has an f_sa of 1 Hz, while the knee-point has an f_sa of 100 Hz. Going from point A to the knee-point denotes a 100× improvement in action throughput and translates to an increase in velocity from 10 m/s to 30 m/s, whereas even a 100× improvement in f_sa beyond the knee-point results in only a 1.0004× improvement in velocity (signifying no improvement). Hence, increasing the action throughput (e.g., a faster computing platform, a faster sensor, etc.) beyond a certain point yields no improvement in velocity. To visualize the F-1 model (Fig. 1(a)), we need to show two regions: (i) where a robot's velocity depends on f_sa, and (ii) where the velocity is independent of f_sa.
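The sweep described above can be reproduced in a few lines using Eq. 4 with the same parameter values (a_max = 50 m/s², d = 10 m); the saturation toward √(2·a_max·d) ≈ 31.6 m/s is the body-dynamics roofline.

```python
# The CPS relation of Eq. 4 (safe velocity vs. decision interval), swept
# with the same parameters as in the text: a_max = 50 m/s^2, d = 10 m.
import math

def v_safe(t_sa, a_max=50.0, d=10.0):
    """Eq. 4: max velocity that still allows stopping within range d."""
    return a_max * (math.sqrt(t_sa**2 + 2 * d / a_max) - t_sa)

# As T_sa -> 0 the velocity saturates at sqrt(2 * a_max * d) ~ 31.6 m/s
# (the body-dynamics roofline); at f_sa = 1 Hz it is only ~9.2 m/s.
for f_sa in (1, 10, 100, 1000):           # action throughput in Hz
    print(f"{f_sa:>5} Hz -> {v_safe(1 / f_sa):5.2f} m/s")
```

Plotting these values with a log-scale x-axis reproduces the knee shape of Fig. 3(c).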
III-D Effects of Cyber and Physical Component Interaction
In this section, we show how the parameters in Eq. 4 couple the cyber and physical component interactions in an aerial robot. The cyber components integrate the sensing, computation, and control pipeline in drones; their effect can be abstracted by T_sa (1/f_sa) in Eq. 4. The physical components of an aerial robot, such as the mass of the sensor/compute/body frame/battery, the thrust-to-weight ratio, aerodynamic effects such as drag [drag-1], sensing quality, etc., can be abstracted by the a_max and d parameters in Eq. 4. The three parameters (T_sa, a_max, d) in Eq. 4 can be used to capture the overheads of improving safety, reliability, and redundancy. For instance, the safety of autonomous vehicles can be improved by increasing the FoV [fov] (i.e., reducing the blind spot) [blind-spot], designing better tracking algorithms [fusion1, tracking-1, tracking-2], and/or adding redundancy in compute [redundancy-1, redundancy-2]. The a_max parameter captures the physical effects of adding payload (sensor, onboard compute, battery, etc.) to the aerial robot. The payload weight affects the thrust-to-weight ratio [thrust-weight], which lowers a_max [low-amax]. The F-1 model captures the impact of varying a_max on V_safe: a higher a_max leads to a higher V_safe (with the roofline shifting upwards), as shown in Fig. 1(c). The d parameter captures the sensing quality of the aerial robot. For instance, a laser-based sensor can provide a higher sensing range, whereas a camera-array-based depth sensor has a limited range [sensor-range]. The F-1 model captures the impact of varying d on V_safe: a higher d leads to a higher V_safe (with the roofline and slope shifting upwards), as shown in Fig. 1(d). Lastly, the f_sa parameter captures the effect of the sensor framerate and of improvements to the autonomy algorithm or onboard compute. The additional latency incurred due to extra sensing/computation (e.g., sensor fusion) affects f_sa per Eq. 3. The F-1 model captures the impact of varying f_sa by adding new ceilings that limit V_safe.
In summary, Eq. 4 couples the cyber and physical components and their associated effects into a single relationship. Thus the F-1 model, which is built on Eq. 4, provides a unified performance model for computer architects to design onboard compute while taking the cyber-physical effects into account.
III-E Validation and Generalizability
The F-1 model is derived by plotting the CPS relationship between safe velocity (V_safe) and throughput (f_sa). The CPS relationship has been validated in different environments with varying obstacle densities, both in simulation and in the real world on a quadcopter with wind speeds up to 7 m/s. The F-1 model applies to both autonomy algorithm paradigms (Section II-D) and to quadrotors of all sizes. As we show later, it is useful for nano-, micro-, and mini-UAV analysis.
IV F-1 Analysis of Off-the-shelf Compute
We use F-1 to characterize the performance of commonly used learning-based autonomy algorithms running on real-world computing platforms used in aerial robots. We show that commonly used autonomy algorithms and hardware platforms do not lead to optimal robot velocity, indicating that the choice of (1) the onboard computing platform and (2) the autonomy algorithm affects the safe maximum velocity of the robot, thus confirming the need for cyber-physical co-design. We consider a baseline aerial robot that has a thrust-to-weight ratio of 2.4 [velocity-model], is equipped with a 60 FPS camera sensor, and weighs 1350 g, including the weight of the sensor, body frame, and battery. The robot is human-teleoperated; it comes with a microcontroller unit but has limited computing and memory capacity for autonomy algorithms other than the flight controller stack. Since this onboard compute system does not use a hardware accelerator, we refer to this baseline as “No-Acc”. Such a robot can achieve a max acceleration of 15.95 m/s². This is annotated as “Body Roof” in Fig. 5. The vertical red dotted line in Fig. 5 denotes the sensor throughput (f_sensor). We augment the baseline robot configuration with four different off-the-shelf accelerators that have varying compute capabilities: Nvidia Xavier, Nvidia TX2, Intel NCS, and Ras-Pi 3b. These systems are selected as they are used in real aerial robots [drl-agx, tx2-drone, raspi-drone, movidius-drone]. Therefore, in addition to the “No-Acc” baseline, we create four other robot configurations, each using a different accelerator, while the rest of the mechanical parameters (e.g., sensor) remain the same as the “No-Acc” baseline. Two autonomy algorithms that have been used for aerial robots in prior work are selected to run on these four configurations: VGG-16 [cad2rl] and DroNet [dronet].
Table I: Robot configurations with different off-the-shelf compute platforms.

| Platform  | TDP   | Weight (g)       | Heat-sink weight (g) | Robot weight (g) | Max accel. (m/s²) | Algorithms    |
| Xavier    | <30 W | 280 [agx-weight] | 162                  | 1350             | 11.58             | Vgg16, DroNet |
| TX2       | <15 W | 85 [tx2-weight]  | 81                   | 1350             | 14.40             | Vgg16, DroNet |
| Intel NCS | <1 W  | 42 [ncs-weight]  | 5.4                  | 1350             | 15.10             | Vgg16, DroNet |
Compute is heavy, and weighs down the aerial robot's agility. High-performance onboard compute can process the autonomy algorithms faster, but it trades off performance against higher TDP and weight, which in turn lowers the maximum acceleration (a_max). Table I shows the maximum acceleration for each of the four robot configurations when using the different accelerator-based computing platforms. Since Xavier (high performance, high TDP, larger heat sink) is the heaviest of the four, it shows the lowest acceleration, while Ras-Pi and Intel NCS (low performance, low power, lighter heat sinks) achieve the highest. However, these peak acceleration values are still lower than the “No-Acc” baseline acceleration of 16 m/s², implying that it is important to consider the effect of compute weight on a robot's max acceleration. High-performance compute does not imply a high-performance aerial robot. A high-performance onboard compute platform does not always translate to higher robot performance (e.g., velocity or mission energy). For instance, Fig. 4(a) shows DroNet running on the four different onboard compute platforms. In this case, the low-performance NCS can achieve a higher velocity than the high-performance TX2 and Xavier, as shown by their rooflines. This is because both TX2 and Xavier have higher TDPs and thus heavier heat sinks, which lower the maximum acceleration and, in turn, the velocity. The NCS is over-designed in performance at lower power (such that f_sa is to the right of its knee-point) and thus achieves a higher velocity by being lighter (compared to TX2 and Xavier). However, in the case of Ras-Pi, even though it is lighter than TX2 and Xavier, its performance is lower (f_sa lies to the left of the knee-point), making it compute-bound, which lowers the velocity. Computationally intensive algorithms need high-performance compute. Fig. 4(b) shows the ceilings for the platforms (Ras-Pi 3b runs out of memory for VGG-16).
The action throughput of Xavier, TX2, and NCS is dominated by their compute latencies, as these are higher than the sensor latency. Xavier achieves a higher action throughput of 28 Hz compared to TX2 (10 Hz) and NCS (1.3 Hz). For Xavier, TX2, and NCS, the velocity is compute-bound, as each action throughput lies to the left of its roofline's knee-point. However, Xavier is the least compute-bound among these accelerators since its action throughput is closest (within 3.5%) to its roofline's knee-point. As a result, Xavier achieves a higher max velocity than the other accelerators. However, it is still not an optimal choice of compute, as its velocity (9.56 m/s) is far from the baseline No-Acc max velocity (11.64 m/s) due to its weight.
Takeaway. While high performance ensures that velocity is not compute-bound, low power dissipation translates into lower weight (a smaller heat sink) and hence supports a higher a_max (a higher roofline). Given that the action throughputs of these commonly used autonomy algorithms and computing platforms are not optimal, we need algorithm-hardware co-design to achieve design points close to the knee-point.
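This compute-weight trade-off can be explored numerically by combining Table I with Eq. 4. The sketch below is hedged: the compute rates (for VGG-16) and max accelerations come from this section, but the sensing range d = 4.5 m is an assumed value chosen for illustration, not a number from the paper, so the printed velocities are indicative only.

```python
# Hedged sketch: rank robot configurations by safe velocity, combining
# Table I (a_max) with Eq. 4. Compute rates are VGG-16 figures from the
# text; sensing range d = 4.5 m is an ASSUMED value for illustration.
import math

def v_safe(t_sa, a_max, d=4.5):
    """Eq. 4 with an assumed sensing range d."""
    return a_max * (math.sqrt(t_sa**2 + 2 * d / a_max) - t_sa)

F_SENSOR = 60.0                        # 60 FPS camera
configs = {                            # name: (f_compute Hz, a_max m/s^2)
    "No-Acc": (None, 15.95),           # baseline, no autonomy workload
    "Xavier": (28.0, 11.58),
    "TX2":    (10.0, 14.40),
    "NCS":    (1.3,  15.10),
}

for name, (f_compute, a_max) in configs.items():
    # Pipelined action throughput (Eq. 3), ignoring the kHz control stage.
    f_sa = F_SENSOR if f_compute is None else min(F_SENSOR, f_compute)
    print(f"{name:>7}: f_sa = {f_sa:5.1f} Hz, "
          f"v_safe = {v_safe(1 / f_sa, a_max):5.2f} m/s")
```

Even with an assumed d, the structural point survives: a lighter but slower platform and a heavier but faster one can land on either side of the knee-point, which is exactly the co-design tension F-1 exposes.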
V AutoPilot
Our F-1 analysis motivates the need to determine the best platform (i.e., autonomy algorithm and accelerator design) that results in a knee-point action throughput while considering the drone's body dynamics and sensor type. To this end, we introduce the AutoPilot cyber-physical co-design framework. For a given robot's high-level specification, such as its thrust-to-weight ratio, sensor type, and target task/environment, the tool automatically finds the optimal NN policy and its accelerator to ensure robust navigation and maximize safe velocity. AutoPilot is made up of three phases (Fig. 6). Phase 1 takes an input specification of the robot, trains various neural network (NN) policies for a given task/environment, and measures the effectiveness of these policies in terms of success rate. Phase 2 performs an automated design space exploration (DSE) to find candidate NN policies and accelerator architectures that are optimal in terms of success rate and hardware power/performance. Phase 3 then uses the F-1 performance model to find, among the candidates from Phase 2, the NN policy and accelerator design that maximize velocity and success rate.
V-A Phase 1: Specification and Training
In Phase 1, the user provides an input specification and configures the NN training environment via the Air Learning NN training gym. The specification consists of all the inputs to the AutoPilot framework, such as the robot task, environment, optimization target (velocity), robot's physical properties, etc. The Air Learning training simulator [krishnan2019] is used to train different NN policies for a given environment. Specification. There are three main categories within the specification. The first category is the robot task-level specification, such as the success rate. The second category includes specifications of the target CPS system: the sensor framerate, the rigid-body dynamics (thrust-to-weight ratio), the power of the rotors/body/sensors, etc. The last category is the optimization target, such as maximizing velocity and the number of missions, which is used by AutoPilot to determine the final NN policy and the hardware accelerator architecture. Reinforcement-Learning Training. AutoPilot uses Air Learning [krishnan2019] to train and validate learning-based autonomy algorithms for a given robot task. Air Learning provides a high-quality implementation of reinforcement learning algorithms that can be used to train an NN policy for aerial robot navigation tasks. Air Learning includes a configurable environment generator [airlearning-github] with domain randomization [domain-rand1] support that allows changing various parameters such as the number of obstacles, the size of the arena, etc. We customize these parameters to generate different environments with varying numbers of obstacles in order to vary the task complexity. To determine the NN policy for each robot task (environment complexity in obstacles, congestion, etc.), we start with the basic template used in Air Learning [airlearning] and vary its hyperparameters (number of layers/filters) to create many candidate NN policies.
Based on the specified robot task and the desired success rate, AutoPilot launches several Air Learning training instances in parallel for the different NN policy candidates. Each NN policy that achieves the required success rate is evaluated in a random environment to validate its task-level functionality. The validated NN policies are added to an Air Learning database along with their success rates, which are then used by Bayesian optimization in the next (DSE) phase.
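As a minimal sketch, the success-rate filtering step described above might look as follows; the function name, database layout, and success rates are hypothetical, not AutoPilot's actual code:

```python
# Sketch of the Phase 1 filtering step (hypothetical structure; the real
# pipeline trains these policies in parallel via Air Learning).

def filter_policies(trained, min_success_rate):
    """Keep only the NN policies meeting the task-level success-rate spec.

    `trained` maps a policy id to its hyper-parameters and the success
    rate measured in a randomized validation environment.
    """
    return {
        pid: info for pid, info in trained.items()
        if info["success_rate"] >= min_success_rate
    }

# Toy Air Learning database of trained candidates (illustrative numbers).
air_learning_db = {
    "policy_3x32": {"layers": 3, "filters": 32, "success_rate": 0.81},
    "policy_4x32": {"layers": 4, "filters": 32, "success_rate": 0.86},
    "policy_2x32": {"layers": 2, "filters": 32, "success_rate": 0.64},
}

validated = filter_policies(air_learning_db, min_success_rate=0.80)
```

The surviving policies and their metadata would then feed the design space exploration in Phase 2.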
V-B Phase 2: Design Space Exploration
In Phase 2, an automated multi-objective DSE is performed to find NN policies and hardware accelerator architectures that are optimal in terms of success rate and accelerator performance/power for a target environment. The success rate is affected only by the NN hyper-parameters (e.g., number of layers/filters). The accelerator's runtime and power depend on both the NN and the accelerator microarchitecture parameters (number of processing elements, on-chip memory, etc.). Success rates for the NN policies are retrieved from the Air Learning database, while a cycle-accurate simulator is used to evaluate accelerator performance/power for the different policies and hardware configurations. To achieve rapid convergence to optimal solutions without performing an exhaustive search, Bayesian optimization is used to tune the different parameters.

Air Learning Database. This database stores the training results for the various NN policies trained using Air Learning. Each entry in the database has an NN policy identifier, the hyper-parameters used for training, and the performance of the NN policy validated for a given task. Example performance metrics include the success rate and the number of steps the aerial robot takes to reach the goal.

Cycle-Accurate Hardware Simulator. AutoPilot uses SCALE-Sim, a configurable systolic-array-based cycle-accurate DNN accelerator simulator [samajdar2018scale]. It exposes various microarchitectural parameters, such as the array size (number of MAC units), the array aspect ratio (array height vs. width), the scratchpad memory sizes for the input feature maps (ifmaps), filters, and output feature maps (ofmaps), and the dataflow mapping strategies, as well as system integration parameters, e.g., memory bandwidth. Taking these architectural parameters, the filter dimensions of each DNN layer, and the image size as input, SCALE-Sim generates the latency, utilization, SRAM accesses, DRAM accesses, and DRAM bandwidth requirement.
While SCALE-Sim only generates performance metrics for the hardware accelerator, we augmented it with power models. The SRAM power is estimated using CACTI [cacti], and the DRAM power is estimated using Micron's DDR4 power calculator [dram]. We assume that the accelerator is integrated into the final SoC. The details about the SoC-level integration and the estimation of SoC power are in Section VI.

Bayesian Optimization. AutoPilot uses Bayesian optimization [bayesopt] for multi-objective DSE to generate task-system Pareto frontiers. Bayesian optimization has been shown to be highly effective for optimizing black-box functions [SnoekLA12, ShahriariSWAF16] that are expensive to evaluate and cannot be expressed as closed-form expressions. BayesOpt can achieve faster convergence than genetic algorithms when optimizing multiple objectives [ReagenHAGWWB17]. In AutoPilot, BayesOpt optimizes three objective functions: (i) task success rate, (ii) SoC power, and (iii) accelerator inference latency (runtime). A Pareto-optimal design is one that achieves the maximum task success rate and the minimum inference latency and SoC power. The algorithm tunes the NN policy hyper-parameters (such as the number of layers/filters) and the accelerator hardware parameters (e.g., number of processing elements, SRAM sizes, etc.) to converge to Pareto-optimal NN policies and accelerator architectures. An open-source BayesOpt implementation [bayesopt] is used in AutoPilot.
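The Pareto-dominance criterion used to build these frontiers can be made concrete with a small filter. This is a generic sketch (the design names and numbers are illustrative), not AutoPilot's implementation, which couples the filter with Bayesian optimization:

```python
def dominates(a, b):
    """Design a dominates b if it is no worse on every objective and
    strictly better on at least one. Objectives: maximize success rate,
    minimize inference latency and SoC power."""
    no_worse = (a["success"] >= b["success"]
                and a["latency"] <= b["latency"]
                and a["power"] <= b["power"])
    strictly_better = (a["success"] > b["success"]
                       or a["latency"] < b["latency"]
                       or a["power"] < b["power"])
    return no_worse and strictly_better

def pareto_front(designs):
    """Keep the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(o, d) for o in designs if o is not d)]

candidates = [
    {"name": "A", "success": 0.84, "latency": 40.0, "power": 2.0},
    {"name": "B", "success": 0.84, "latency": 20.0, "power": 4.0},
    {"name": "C", "success": 0.80, "latency": 45.0, "power": 4.5},  # dominated
]
front = pareto_front(candidates)
```

Here "A" (lower power) and "B" (lower latency) both survive, while "C" is dominated by either and is dropped.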
V-C Phase 3: Cyber-Physical Co-Design with F-1
The goal of Phase 3 is to find a design point (policy and accelerator) with the optimal success rate and velocity. There are two steps involved: CPS co-design and architectural fine-tuning.

CPS Co-Design. First, the designs with the highest success rates (the minimum success rate is user-specified) among the Phase 2 generated designs are selected. Then, the velocities for these designs are computed using the CPS relation (Section III-C), which takes into account the effect of the weight of the different components, including the compute, on velocity. Next, AutoPilot constructs the F-1 roofline plot (following Section III-C), which consists of a roof corresponding to the baseline robot (i.e., human-operated, which does not use any onboard NN accelerator) and other roofs corresponding to the velocities of the success-rate-filtered designs. The latter roofs are close to or lower than the base roof due to the added weight of the accelerators. Finally, the design is selected that achieves the maximum velocity, equivalent to the human-operated base robot, and whose action throughput is equal to the base knee-point throughput.

Architectural Fine-Tuning. When no design exists that achieves the base knee-point velocity, some architectural tuning may be required to shift a design closer to the knee-point. AutoPilot provides two options for which points to consider for optimization: (i) they can be user-defined, or (ii) the design point closest to the knee-point can be selected. The architectural tuning can be performed using a variety of optimizations until the optimized design is at (or very close to) the base knee-point in the F-1 roofline. We employ a bag of architectural optimizations in the tuning process. AutoPilot comes with two techniques, namely frequency scaling and technology scaling. In frequency scaling, we increase or decrease the operating frequency to trade off the performance and power of the hardware accelerator.
Lowering the frequency leads to lower power (TDP), which reduces the heat-sink weight and increases the robot's acceleration (a) and velocity. This optimization is useful when a design is body-dynamics bound and over-designed. Likewise, increasing the operating frequency improves the accelerator runtime and can be used when a design is under-optimized and compute-bound. In technology scaling, we evaluate the designs in different process technology nodes to see if we can move a design closer to the knee-point.

Summary. The AutoPilot methodology is general (ML-based multi-objective DSE) and can be extended in scope to include other autonomous vehicles such as cars (with their CPS models [intel-rss, nvidia-cps-cars]), other autonomy algorithms (Section II-D), and other hardware targets (e.g., FPGAs, CGRAs, multi-cores, systolic/non-systolic-array based, etc.). Within a fixed accelerator target, any other architectural optimization technique (e.g., policy quantization [quarl], model compression [han2015deep], memory optimizations [max-nvm]) that trades off power and performance can be part of the bag of architectural optimizations.
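The knee-point selection in the CPS co-design step can be illustrated with a toy model. The weight penalty and the throughput-to-velocity scaling factor below are assumed placeholders, not the paper's actual F-1 equations:

```python
def safe_velocity(design, base_v, sensor_fps, k=0.1):
    """Toy stand-in for the F-1 roofline: achievable velocity grows with
    action throughput (capped by the sensor framerate) until it hits the
    body-dynamics roof, which the accelerator's heat-sink weight lowers.
    The 5%-per-kg penalty and k (m/s per FPS) are assumptions."""
    roof = base_v * (1.0 - design["heatsink_kg"] * 0.05)
    throughput = min(design["fps"], sensor_fps)
    return min(roof, k * throughput)

def select_knee_design(designs, base_v, sensor_fps):
    """Pick the design reaching the highest safe velocity; among ties,
    prefer lower power (a lighter, cooler design)."""
    return max(designs,
               key=lambda d: (safe_velocity(d, base_v, sensor_fps),
                              -d["power_w"]))

designs = [
    {"name": "low-power", "fps": 25,  "power_w": 1.5, "heatsink_kg": 0.00},
    {"name": "knee",      "fps": 60,  "power_w": 3.0, "heatsink_kg": 0.02},
    {"name": "high-perf", "fps": 100, "power_w": 7.5, "heatsink_kg": 0.10},
]
best = select_knee_design(designs, base_v=6.0, sensor_fps=60)
```

In this sketch, the low-power design is compute-bound, while the high-performance design pays a heat-sink weight penalty without exceeding the sensor-limited throughput, so the middle design lands at the knee.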
VI Experimental Setup
Air Learning Training Environments. We generate two environments using the Air Learning environment generator with varying degrees of clutter. The arena size is typical and is twice the arena size used in aerial robotics testbeds [flying-arena1, flying-arena2, flying-arena3, flying-arena4]. The NN is trained using Deep Q-Networks (DQN) [dqn]. DQN works well on high-level navigation tasks for aerial robots [dqn-uav1, dqn-uav2]. We use the same reward function and other hyperparameters as the authors of Air Learning [krishnan2019]. Training is terminated after 1M steps or when the required success rate is reached.

NN Policy Architecture Search. We use the Air Learning model architecture as the baseline template and change its hyperparameters. The NN policy is multi-modal, and prior work [krishnan2019] has shown that each input modality contributes to the success rate for the task. The basic template of the architecture used in that work is shown in Fig. 6(a). We made additional changes to the base template, such as the choice of filter sizes, strides, etc. We choose a filter size of 3×3 and a stride of 1, with the ReLU [relu] activation function and no pooling.

SoC Power Estimation. We assume an SoC that includes the hardware accelerator architecture template shown in Fig. 6(b). To estimate the total SoC power, we add the power of the individual components in the SoC. To estimate the power of the hardware accelerator, we run a given NN policy on the cycle-accurate simulator, which produces SRAM traces, DRAM traces, and the number of read/write accesses to the SRAM and DRAM. Using the SRAM and DRAM trace information, we model the SRAM power in CACTI [cacti] and the DRAM power with Micron's DRAM model [dram]. To estimate the power of the systolic array, we multiply the array size by the energy of a PE. The PE power is modeled after the breakdown in [li-memdse-dac]. For the ULP camera, we assume the camera can sustain frame rates of up to 60 FPS at 144 x 256 image size at a low power of less than 100 mW and a form factor of 6.24 mm × 3.84 mm [ulp-camera]. We account for the camera power in our overall power calculation. We also assume that the camera is interfaced with the system using a camera parallel interface [parallel] or MIPI [MIPI], similar to [pulp-dronet], so that the accelerator sub-system can directly fetch the inputs to process the images. We also assume that the filter weights are loaded into the system memory as a one-time operation. We assume that there are two low-power MCU-class cores in the SoC to run the flight controller stack, which is typically a PID controller that controls the four rotors. The flight controller stack runs bare-metal on the MCUs, similar to the Bitcraze Crazyflie aerial robots [crazyflie]. For the MCU-class cores, we use Cortex-M cores that implement the ARMv8-M ISA [arm-v8m-isa]. Each MCU core consumes about 0.38 mW in a 28 nm process clocked at 100 MHz [arm-m33-productpage].
We account for the power of the ultra-low-power cores in our final power numbers. The MCU cores receive high-level action commands from the accelerator sub-system through the system bus after each frame is run through the NN policy. The NN produces the action, which the flight controller interprets to generate low-level motor actuation signals to control the aerial robot.

Compute Weight Estimation. Using the SoC power as the heat source, we calculate the required heat-sink volume with a heat-sink calculator [heat-sink]. The weight of the heat-sink is determined by multiplying the estimated volume by the density of aluminum (a commonly used heat-sink material). We also assume that the final SoC is mounted on a PCB, along with all electrical components, weighing 20 g (which per our analysis is typical for Raspberry Pi [ras-pi-weight] and CORAL [coral-weight] like systems).
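Assuming a simple volume-per-watt heat-sink sizing factor (the actual volume comes from the heat-sink calculator, and the per-PE power here is an illustrative placeholder), the SoC power and compute-weight bookkeeping described above can be sketched as:

```python
def soc_power_w(array_h, array_w, pe_power_w, sram_w, dram_w,
                camera_w=0.1, mcu_w=0.00038, n_mcu=2):
    """Sum the SoC component powers. Systolic-array power is the array
    size times the per-PE power; the <100 mW camera and ~0.38 mW
    Cortex-M MCU figures follow the text, pe_power_w is an assumption."""
    array_power = array_h * array_w * pe_power_w
    return array_power + sram_w + dram_w + camera_w + n_mcu * mcu_w

def compute_weight_g(tdp_w, cm3_per_w=5.0, al_density_g_cm3=2.7, pcb_g=20.0):
    """Heat-sink weight from TDP. The 5 cm^3/W sizing factor is an
    assumed stand-in for the heat-sink calculator; the aluminum density
    and the 20 g PCB follow the text."""
    return tdp_w * cm3_per_w * al_density_g_cm3 + pcb_g

# Illustrative 32x32 array at 1 mW/PE with toy SRAM/DRAM power figures.
p = soc_power_w(array_h=32, array_w=32, pe_power_w=0.001,
                sram_w=0.4, dram_w=0.5)
w = compute_weight_g(p)
```

This weight (heat-sink plus PCB) is what the F-1 CPS relation charges against the robot's payload capacity when computing the achievable velocity.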
We present the results and analysis associated with AutoPilot (i.e., compute DSE, CPS co-design, and architectural fine-tuning). Then we show that the SoCs optimized for velocity lead to an increase in the total mission count.
VII-A Compute Design Space Exploration (DSE)
Since off-the-shelf components fall short of being optimal, we demonstrate that AutoPilot can automatically explore a large design space to find optimal NN policies and accelerator designs. We show the system's ability to generate a variety of different policies and architectures by subjecting AutoPilot to environments with varying levels of obstacle density. Increasing complexity affects the NN policy (deeper policies), as well as the hardware accelerator design. Fig. 8 shows the designs obtained using AutoPilot for two different task complexities (low and high obstacle density). Each design point represents the SoC power, DNN accelerator inference latency, and success rate (color map). As described in Section V, AutoPilot uses Bayesian optimization to tune the various parameters until convergence while optimizing the costs (performance, power, and success rate). While the NN policy determines the success rate, the accelerator power (performance) depends on both the policy and the hardware parameters. AutoPilot converges to optimal accelerator designs by sampling less than 0.5% of the total design space. AutoPilot tunes the NN policies such that they have 2-6 layers, with each layer having either 32, 48, or 64 filters. For the complex task, AutoPilot automatically selects deeper NN policies because they achieve higher success rates. For instance, while 32 filters (and 3-5 layers) are sufficient to achieve a success rate higher than 80% for the low obstacle density, 48 filters are required for the high obstacle density to reach a similar success rate. AutoPilot tunes the hardware accelerator parameters to generate designs ranging from low-power to high-performance. We specifically tune the array height/width between 16-128 and the SRAM (ifmap/ofmap/filter) sizes between 32 KB-2 MB. Fig. 8 highlights three regions in the DSE to demonstrate how AutoPilot can generate hardware accelerator candidates under certain power-performance bounds irrespective of task complexity.
Regions A, B, and C denote bounds that are under 2 W (25 FPS), 2-4 W (50 FPS), and 4-8 W (100 FPS), respectively.
Using Bayesian optimization, AutoPilot converges to optimal accelerator designs by sampling less than 0.5% of the design points in the entire design space. As the task complexity changes, it can generate a multitude of design candidates within the same power-performance bounds. As we co-design the cyber-physical parameters, having multiple design candidates translates to greater scalability of the methodology in selecting optimal compute platforms as the sensor or body dynamics change (Section VII-B).
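For a rough sense of scale, a coarse discretization of the stated parameter ranges already yields tens of thousands of design points, of which Bayesian optimization samples only a few hundred. The per-parameter step counts below are assumptions for illustration, not AutoPilot's exact grid:

```python
# Rough size of a discretized version of the search space above.
n_layers  = 5       # 2-6 layers
n_filters = 3       # 32, 48, or 64 filters per layer (assumed uniform)
n_array   = 4 * 4   # array height/width each in {16, 32, 64, 128}
n_sram    = 7 ** 3  # ifmap/ofmap/filter SRAM each in 32 KB..2 MB
                    # (7 power-of-two sizes per scratchpad, assumed)

total_designs = n_layers * n_filters * n_array * n_sram
sampled = int(total_designs * 0.005)  # <0.5% sampled by BayesOpt
```

Even this coarse grid has over 80,000 points, so sampling under 0.5% means evaluating only a few hundred candidate designs.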
VII-B Cyber-Physical Co-Design
While the compute DSE generates a large spread of architectural designs, not all points are ideally suited for deployment on an aerial robot to achieve a balanced system (as shown in Section IV using F-1). Hence, in this section, we show that (1) the F-1 model is essential for finding the accelerator architecture, based on a user specification (e.g., drone type, sensor framerate), that will lead to the optimal robot velocity, and (2) architectures optimized for raw performance or low power do not necessarily result in the optimal knee-point (maximum velocity). For a comprehensive analysis, we perform CPS co-design with three aerial robots, namely the AscTec Pelican (mini-UAV), the DJI Spark (micro-UAV), and a nano-UAV [nano-uav], which have thrust-to-weight ratios (including battery/sensor) of 2.4, 1.9, and 3.1, respectively, denoting a change in body dynamics. We also consider sensor framerates of 30 and 60 FPS. Fig. 9 shows CPS co-design for the navigation task in the high-density environment. We filter the design points from Fig. 7(b) based on high-scoring success rates, as shown in Fig. 8(a). These designs represent various accelerator candidates designed for the NN policy that achieves a success rate of at least 83.4% (4 layers and 32 filters). A success rate greater than 80% is nominal [trail-net, accuracy-1] in aerial robot navigation tasks. Out of the many accelerator design candidates, we highlight four designs, denoted as '1' (lowest power and slowest runtime), '2' (AutoPilot selected), '3' (highest performance and highest power), and '4' (AutoPilot selected). The architectural details of these design points, such as the systolic array size and IFM/filter memory, are annotated in Fig. 8(a). Using these four selected points, we demonstrate the need for the F-1 model in designing the onboard compute for aerial robots.
We also show that cyber-physical co-design, rather than an isolated hardware design objective such as high performance, low power, or energy efficiency, is critical to achieving computing platforms that maximize velocity.

The F-1 model identifies optimal design points. Plotting the four architectural design points on the F-1 roofline model for the AscTec Pelican (Fig. 8(b)), the DJI Spark (Fig. 8(c)), and the nano-UAV [nano-uav] (Fig. 8(d)) with 30 FPS and 60 FPS sensor framerates, we observe that the balanced, high-performance, and low-power design points are all far from the optimal knee-point for their respective aerial robots. Instead, design point '2', selected by AutoPilot, is the optimal knee-point for the AscTec Pelican with a 60 FPS sensor. For the AscTec Pelican with a 30 FPS sensor, design point '4' is the optimal design in terms of compute, and any further improvement in compute performance will not improve velocity since the performance is bound by the sensor framerate (30 FPS). For the DJI Spark with a 30 FPS or 60 FPS sensor and the nano-drone [nano-uav] with a 30 FPS sensor, AutoPilot selects '4' as the optimal compute design. However, for the nano-drone with a 60 FPS sensor, '4' is not the optimal knee-point and results in a compute-bound scenario. Using the F-1 model in the CPS co-design phase, we show that ad-hoc selection of high-performance compute designs such as '3' can degrade the overall performance (e.g., high-speed velocity) of the drone. For instance, the highest-performance accelerator design point '3' (highest power) decreases the safe velocity of the DJI Spark and the nano-UAV [nano-uav] by 13.2% and 44%, respectively, due to the added weight of the heat-sink needed for cooling, compared to the baseline No-Acc case. Thus, the F-1 performance model allows us to pick the optimal design rather than the typical low-power, high-performance, or balanced architecture, which would often be the choice if the compute were designed in isolation without CPS co-design.
One-size compute does not fit all. For the AscTec Pelican (Fig. 8(b)) with 30 FPS and 60 FPS sensor framerates, we observe that the optimal design points are '4' and '2', respectively. Interestingly, '4' (optimal for the 30 FPS sensor framerate), if chosen as the compute platform, becomes compute-bound for the AscTec Pelican with a 60 FPS framerate, which drops the maximum safe velocity by 15% compared to design point '2'.

Takeaway. Choosing the computing platform (either general-purpose or custom-designed) in an ad-hoc fashion can deteriorate the physical performance of the robot, which then has implications for the mission energy (discussed in Section VII-D). Hence, while designing (or selecting) the computing platform for an aerial robot, one must account for the cyber-physical parameters of the robot to achieve maximum performance.
VII-C Architectural Fine-Tuning
To show the effectiveness of architectural fine-tuning, we consider the AscTec Pelican with a 60 FPS sensor, but assume that the knee-point (i.e., design '2' in Fig. 8(b)) was not achieved. For this case, using the bag of architectural optimizations (frequency/node scaling), we are able to move the sub-optimal body-dynamics-bound and compute-bound designs (points '3' and '4' in Fig. 9(a)) to the knee-point.

Body-Dynamics Bound. Design '3' (high-power, high-performance, and body-dynamics bound), clocked at 1 GHz in a 45 nm process node, has a compute throughput 3× higher than the knee-point for the AscTec Pelican. By scaling down its frequency to 125 MHz, AutoPilot brings this sub-optimal point closer to the knee-point (denoted by '3') as shown in Fig. 9(b). Lowering the frequency from 1 GHz to 125 MHz reduces the power consumption from 7.5 W to 1 W (Fig. 9(a)). The power reduction lowers the heat-sink requirement, making this design lighter and near-optimal.

Compute-Bound. Design '4', clocked at 1 GHz in the 45 nm node, results in a compute-bound case for the AscTec Pelican. To get this design to the knee-point, AutoPilot increases the throughput of this accelerator by 1.6× without significantly increasing its power consumption. We show that by scaling the process to 22 nm and clocking at 4 GHz, AutoPilot brings it closer to the knee-point (Fig. 9(b)).

Takeaway. When the CPS co-design step (Section VII-B) is unable to generate the optimal knee-point design, the architectural fine-tuning engine can be launched, which uses various optimization techniques to deliver the final knee-point design. The level of flexibility allowed by AutoPilot and the trade-offs it can make are configurable by the end-user.
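The frequency-scaling optimization can be approximated with a first-order dynamic-power model (P proportional to f at a fixed voltage; static power is ignored, so this is only a sketch, not the tool's actual power model):

```python
def scaled_dynamic_power(p_base_w, f_base_hz, f_new_hz):
    """First-order model: at a fixed voltage, dynamic power scales
    roughly linearly with frequency (P = alpha * C * V^2 * f).
    Static/leakage power is ignored in this sketch."""
    return p_base_w * (f_new_hz / f_base_hz)

# Design '3': 7.5 W at 1 GHz, scaled down to 125 MHz.
p_low = scaled_dynamic_power(7.5, 1e9, 125e6)
```

This yields about 0.94 W, consistent with the roughly 1 W reported for design '3' after scaling to 125 MHz; the lower TDP in turn shrinks the heat-sink and lightens the design.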
VII-D Mission Time/Energy Implications of Optimal System
The end goal of AutoPilot is to choose an onboard compute system (design point) that minimizes the mission time and energy. To this end, we evaluate three robots: the AscTec Pelican (mini-UAV), the DJI Spark (micro-UAV), and the nano-UAV used in Zhang et al. [nano-uav]. We show that the optimal design point (i.e., the knee-point) generated by AutoPilot always outperforms the other non-optimal designs (other AutoPilot-generated designs or an ad-hoc selection of onboard compute, e.g., a TX2).

Mission Time Comparisons. To estimate the mission time, we assume a package delivery mission scenario, where a radius of 100 m separates the source and destination. We pick two categories of points, namely the knee-point and others (compute-bound, body-dynamics bound, and ad-hoc selection of computing platform), for the three aerial robots. For each design point, we estimate the maximum velocity the robot achieves if it uses that design as its onboard compute. Fig. 10(a) shows the mission times (lower is better) for the five different computing platforms across the AscTec Pelican (mini-UAV), the DJI Spark, and a nano-UAV. The AutoPilot-generated optimal design (knee-point) always achieves the lowest mission time. It is worth noting that the selection of the knee-point design becomes more critical as we miniaturize the aerial robot (mini-UAV → micro-UAV → nano-UAV). For instance, for the AscTec Pelican (mini-UAV), between the AutoPilot-generated knee-point and the body-dynamics-bound design point (also generated by AutoPilot), the mission time improvement is only 5%, whereas for the micro-UAV and nano-UAV the difference is 20% and 80%, respectively. The improvement is marginal for the AscTec Pelican because it is a larger drone with a higher payload carrying capability. Hence, the 4 W TDP difference between the knee-point and the body-dynamics-bound design point, and the additional heat-sink weight it entails, is too small to cause any significant degradation to its body dynamics (a) and its safe velocity (Eq. 4). However, for the DJI Spark and the nano-UAV [nano-uav], the payload carrying capacity is lower, and the extra heat-sink weight (from the compute TDP) can significantly lower the acceleration (a) and thus the safe velocity.
Mission Energy Comparisons. Fig. 10(b) shows the mission energy for the three drone platforms with five different compute platforms. We show that the knee-point design always lowers the mission energy compared to the other selections. Mission energy (E) is related to the mission time (t_mission) as follows:

E = t_mission × (P_rotors + P_compute + P_others)

where t_mission is the time it takes to complete the mission, and P_rotors, P_compute, and P_others are the power consumed by the rotors, the onboard compute, and the other components (sensor, flight controller, etc.) of the aerial robot, respectively. It is important to note that P_rotors consumes more than 95% of the total power [mavbench, pulp-dronet], but a higher P_compute (higher TDP → heavier heat-sink) can lower the acceleration (a), which can lower V (and hence raise t_mission). Thus, a knee-point design lowers the mission energy by minimizing t_mission (higher V → lower t_mission) and minimizing P_compute compared to the other design points (compute-bound, body-dynamics bound, or ad-hoc selection). The optimal designs generated by AutoPilot for the AscTec Pelican, the DJI Spark, and the nano-UAV achieve 2×, 1.54×, and 1.81× lower mission energy compared to the other designs.
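Plugging illustrative numbers into the mission-energy relation makes the trade-off concrete (all power and velocity values below are made up for illustration; only the relation itself follows the text):

```python
def mission_energy_j(distance_m, velocity_mps,
                     p_rotors_w, p_compute_w, p_other_w):
    """E = t_mission * (P_rotors + P_compute + P_others), with
    t_mission = distance / velocity."""
    t_mission = distance_m / velocity_mps
    return t_mission * (p_rotors_w + p_compute_w + p_other_w)

# Knee-point design: lighter heat-sink -> higher safe velocity.
e_knee = mission_energy_j(100.0, 8.0,
                          p_rotors_w=150.0, p_compute_w=3.0, p_other_w=5.0)

# Body-dynamics-bound design: heavier heat-sink -> lower safe velocity.
e_bound = mission_energy_j(100.0, 5.0,
                           p_rotors_w=150.0, p_compute_w=7.5, p_other_w=5.0)
```

Even though the compute power differs by only a few watts against ~150 W of rotor power, the velocity loss from the heavier design dominates the energy gap, which is why the knee-point wins.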
VIII Related Work
Performance Models. Analytical performance models, such as multicore Amdahl's law [amdahl-multicore], the Roofline model [roofline], Gables [hill2019gables], and several others [chita-analytical-intro], are useful for guiding the design of an optimal system for a given workload. These performance models apply to traditional compute and are not explicitly targeted at robots that have both cyber and physical components. Our work proposes a roofline-like model to help understand the role of computing in aerial robots. In the context of performance modeling for complex systems (i.e., beyond compute-only systems), cote [orbital-computing] is a full-system model for the design and control of nano-satellites. The cote model takes into account orbital mechanics and physical bounds on communication, computation, and data storage to design a cost-effective, low-latency, and scalable nano-satellite system. The F-1 model has a similar objective: it combines the interactions between the compute/sensor (cyber components) and the body dynamics (physical components) to understand the various bottlenecks in building an optimal system.

Accelerators for Robots. Recently, a low-power accelerator [pulp-dronet] was proposed for neural-network-based control, but the work is customized for nano-drones running DroNet [dronet]. Our work provides a general methodology to generate multiple NN policies and hardware accelerator designs from a high-level specification. Navion [navion] is a specialized accelerator for aerial robots in the sense-plan-act control paradigm for improving visual-inertial odometry. We focus on end-to-end control algorithms, an emerging autonomy algorithm paradigm. RoboX [robox] generates an accelerator for motion predictive control from a high-level DSL. Though the high-level goal is the same, our work differs from theirs in that they do not consider the effect of the cyber-physical parameters on the computing platform.
We instead contribute the F-1 model to quantify the optimality of our designs. Outside of aerial robots, prior work [dan-sorin-1, dan-sorin-2] has shown the benefits of designing custom hardware accelerators for motion planning algorithms for robotic arms to improve the robots' performance. Though the robots are different, AutoPilot, guided by the F-1 model, can build similarly optimized motion planning hardware accelerators targeted at aerial robots.
AutoPilot is a push-button solution that automates cyber-physical co-design, automatically generating an optimal control algorithm (NN policy) and its hardware accelerator from a high-level user specification. The concepts we have developed for AutoPilot, such as cyber-physical co-design, the F-1 model for identifying the optimal design point, architectural fine-tuning, and selecting design points by their effect on the overall mission, can be adapted to other types of autonomous robots, such as self-driving cars.