I Introduction
Unmanned aerial vehicles (UAVs) are emerging as critical tools for mapping large areas, patrolling, search, and rescue applications. These tasks are usually dangerous, repetitive, and carried out in extreme conditions, making them ideal for autonomous drones. Self-navigation and collision-avoidance applications are key for UAVs to operate autonomously, and they rely on high-performance, low-power edge computing hardware.
The importance of the performance of flight control applications cannot be stressed enough. In a recent investigation Natalie (2019), the Federal Aviation Administration discovered that the lack of data-processing speed of a specific flight control computer chip led to the two Boeing 737 Max crashes in 2019 that killed 346 people. At the same time, low-power design is critical for UAVs as well. One reason is that high power dissipation brings tremendous cooling challenges to maintain the hardware at a suitable temperature. Another is that batteries are the only energy source for drones, limiting their flight time.
In order to push the performance and energy boundaries of systems-on-chip, Dally and Towles Dally (2001) proposed the tile-based Network-on-Chip (NoC) as the ideal architecture for scalable and low-power on-chip communication. Such chips use tiles, such as CPUs, GPUs, ASICs, and memory, as building blocks. A standard interface is embedded into each tile to route flits for communication. There have been many previous studies on energy-aware NoC designs. In contrast to prior NoC work, the goal of this paper is to investigate the parallelization of the UAV perception and navigation intelligence (PNI) while taking both computation and communication power consumption into consideration. As shown in Fig. 1, we first compile the navigation program into LLVM IR and construct the data dependency graph (DDG), where each node denotes a useful instruction annotated with its power consumption and each edge represents a data dependency whose weight is data size times latency. Second, based on the DDG, we propose a scheduling algorithm to partition the PNI application into clusters such that (1) inter-cluster communication is minimized, (2) NoC energy is reduced, and (3) the workloads of different cores are balanced for maximum parallel execution. Finally, we incorporate topological sort into our energy-aware mapping scheme to further reduce the static power consumption resulting from congestion.
Towards this end, the main contributions of this paper are as follows:

To the best of our knowledge, our work is the first to incorporate static energy consumption analysis of an application into a compiler-based task partitioning.

Besides communication volume, we propose a mapping strategy that also considers the timing of inter-core communications, reducing the congestion time and the static energy consumption of hardware resources.
The rest of the paper is organized as follows: Section II discusses the related work. Section III introduces the basics of UAV control. Section IV illustrates the energy model for NoCs, the load-balancing and energy-aware community detection algorithm, and the low-power mapping. Section V validates the framework and shows experimental results compared to the baseline model.
II Related Work
There has been a significant amount of previous research on energy-aware and load-balancing scheduling and mapping on multicore embedded systems. From a mathematical and control perspective, Bogdan et al. in Bogdan (2015, 2015) provide a complex approach to dynamically characterize the workload of multicore systems for performance and power optimization. Xiao et al. propose a complex-network-inspired application partitioning tool to improve multicore parallelization Xiao (2017). Tan et al. develop a low-power customizable many-core architecture for wearables using a lightweight message-passing scheme Tan (2018). Navion Suleiman (2019) is an energy-efficient accelerator that fully integrates visual-inertial odometry on a system-on-chip, eliminating expensive off-chip processing and storage for the autonomous navigation of drones. In terms of mapping and routing, an efficient branch-and-bound algorithm proposed by Hu et al. Hu (2003) automatically maps IPs onto a generic NoC so that the communication cost is minimized while the timing constraints are met. In contrast to prior work, we present an energy-aware load-balancing community detection algorithm together with a mapping strategy and test it on a UAV self-navigation application.
III Brief Overview of the Basics of the UAV Navigation Controller
Fig. 1(A) shows a UAV with six degrees of freedom. Three degrees of freedom describe the translational motions ($x$, $y$, $z$) and the other three the rotational motions (roll, pitch, yaw). Each of the four propellers is equipped with a rotor providing an angular velocity. These four angular velocities, $\omega_1, \ldots, \omega_4$, correspond to the inputs of the quadrotor. Twelve outputs are generated from the quadrotor, corresponding to the translational and rotational positions and their corresponding velocities Corke (2017). For real-time applications, the error between the actual UAV position, estimated by a navigation system, and the desired position is fed into a PD controller to determine the required control inputs. The required rotor speeds are then calculated from the respective torques using:

(1)   $\begin{pmatrix} T \\ \tau_x \\ \tau_y \\ \tau_z \end{pmatrix} = \begin{pmatrix} b & b & b & b \\ 0 & -db & 0 & db \\ -db & 0 & db & 0 \\ k & -k & k & -k \end{pmatrix} \begin{pmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{pmatrix}$

where $T$ is the total thrust produced by the propellers, $\tau = (\tau_x, \tau_y, \tau_z)$ is the torque vector applied to the airframe, $b$ represents the lift constant, $d$ is the distance from the rotor to the center of mass, and $k$ is the secondary lift constant. The control structure employed to fly the quadrotor can be found in Corke (2017); Armah (2016), and is based on proportional-derivative action to control the quadrotor's attitude (roll, pitch, yaw) and altitude.

IV Parallelization Discovery and Energy Optimization Approach
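As a concrete illustration, the mixer relation of Eq. (1) can be inverted numerically to recover the rotor speeds from a desired thrust and torque. The constants b, d, k and the mass m below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative constants (assumptions, not values from the paper):
# lift constant b, rotor arm length d, secondary lift constant k, mass m.
b, d, k, m = 1.0e-5, 0.25, 1.0e-7, 1.5

# Mixer matrix of Eq. (1): maps squared rotor speeds to total thrust T
# and body torques (tau_x, tau_y, tau_z) for a '+'-configured quadrotor.
M = np.array([
    [ b,    b,   b,   b  ],
    [ 0.0, -d*b, 0.0, d*b],
    [-d*b,  0.0, d*b, 0.0],
    [ k,   -k,   k,  -k  ],
])

def rotor_speeds(T, tau):
    """Invert Eq. (1): recover the four rotor speeds from thrust and torques."""
    w_sq = np.linalg.solve(M, np.array([T, *tau]))
    return np.sqrt(np.clip(w_sq, 0.0, None))  # rotor speeds are non-negative

# Hover: thrust balances weight and all torques are zero,
# so the solver returns four identical rotor speeds.
omega = rotor_speeds(T=9.81 * m, tau=(0.0, 0.0, 0.0))
```

At hover the symmetry of the mixer matrix forces all four speeds to be equal; any nonzero torque request unbalances the corresponding rotor pair.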
A Energy Model
Both IP cores and the interconnect consume energy. While most mapping algorithms based on the one in Hu (2003) only compute dynamic energy, our model considers both static and dynamic power dissipation. N. Grech et al. Grech (2015) propose a static energy analysis technique that determines the instruction energy model directly at the LLVM IR level. Through analysis and measurement of a large set of target ISA instructions, it was found that LLVM IR instructions can be divided roughly into four groups: memory ($G_{mem}$), program flow ($G_{flow}$), division ($G_{div}$), and all other instructions ($G_{other}$). This yields an energy model of a program executed sequentially in a computing node:

(2)   $E_{comp} = \sum_{g \in G} \varepsilon_g N_g, \quad G = \{mem, flow, div, other\}$

where $\varepsilon_g$ is the energy cost of a single instruction in group $g$ and $N_g$ is the number of instructions executed in that group.
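A minimal sketch of Eq. (2), assuming made-up per-group instruction energies (the real coefficients come from ISA-level measurement as in Grech et al.):

```python
# Hypothetical per-group instruction energies in nanojoules (assumptions);
# Grech et al. (2015) derive these from measurements of the target ISA.
ENERGY_NJ = {"mem": 2.5, "flow": 1.2, "div": 4.0, "other": 1.0}

def program_energy(group_counts):
    """Eq. (2): sequential energy = sum over groups of eps_g * N_g."""
    return sum(ENERGY_NJ[g] * n for g, n in group_counts.items())

# Example instruction mix for one cluster (made-up counts).
e = program_energy({"mem": 120, "flow": 40, "div": 5, "other": 300})
```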
Using the bit energy concept proposed by Ye et al. in Ye (2002), the total dynamic energy consumption can be computed as:

(3)   $E_{dyn} = \sum_{i=1}^{M} \sum_{j=1}^{M} V(t_i, t_j) \left[ n_{ij} E_{S} + (n_{ij} - 1) E_{L} \right]$

where $E_{S}$ and $E_{L}$ represent the energy consumed per bit by a switch and a link, respectively; $n_{ij}$ is the number of routers the packet from tile $t_i$ to tile $t_j$ passes through along the way; $V(t_i, t_j)$ is the size of the packet; and $M$ is the total number of tiles on the chip.
The static energy characterizes the energy consumed when packets are stalled in buffers due to congestion. For simplicity, it is defined as:

(4)   $E_{static} = \sum_{i=1}^{C} e_{buf}\, s_i\, t_i$

where $C$ is the number of times that congestion occurs; $e_{buf}$ is the energy consumed by one bit of data stored in a buffer for one unit of time; $s_i$ is the data size of the $i$-th congestion; and $t_i$ is the duration of the $i$-th congestion. Equation (5) gives the total energy consumption of the interconnect:

(5)   $E_{comm} = E_{dyn} + E_{static}$
Finally, given the total number of tiles $M$, the energy consumption of the entire chip is computed as:

(6)   $E_{total} = \sum_{i=1}^{M} E_{comp,i} + E_{comm}$
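The interconnect and chip-level energy models of Eqs. (3)-(6) can be sketched as follows for an XY-routed 2D mesh; the mesh coordinates, traffic, congestion events, and energy constants are illustrative assumptions:

```python
# Illustrative energy constants (assumptions): switch/link energy per bit,
# and buffer energy per bit per unit of stall time.
E_S, E_L, E_BUF = 0.1, 0.05, 0.01

def hops(src, dst):
    """Routers traversed between two tiles under XY routing
    (Manhattan distance plus the source router)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1

def dynamic_energy(traffic):
    """Eq. (3): sum over packets of V * (n*E_S + (n-1)*E_L)."""
    return sum(v * (hops(s, d) * E_S + (hops(s, d) - 1) * E_L)
               for s, d, v in traffic)

def static_energy(congestions):
    """Eq. (4): buffered bits times stall duration."""
    return sum(E_BUF * size * time for size, time in congestions)

# Eqs. (5)/(6): interconnect energy plus per-tile computation energy.
traffic = [((0, 0), (1, 1), 64), ((1, 0), (0, 0), 32)]   # (src, dst, bits)
e_comm = dynamic_energy(traffic) + static_energy([(64, 3)])
e_total = e_comm + sum([668.0, 512.0])  # plus Eq. (2) energy of each tile
```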
B Compiler Analysis and Model of Computation Extraction
In order to generate the data dependency graph (DDG), we adopt the LLVM IR Lattner (2004). The rationale is that LLVM is a language-independent system that exposes the commonly used primitives needed to implement high-level language features, which makes it easy to generate a back end for any target platform.
With the help of Clang, C/C++ applications are compiled into a dynamic IR execution trace. We developed a parser to construct a data dependency graph from the IR trace. The parser analyzes memory operations to obtain latencies and data sizes. Because execution times and energy vary with data sizes and with where the data resides, taking those values into account can reduce inter-core communication by grouping the source and destination instructions of a register into one cluster. Three hash tables are created and updated during parsing: the source table, the destination table, and the dependency table. The source/destination tables keep track of source/destination registers, with keys being source or destination registers and values being the corresponding line numbers. The dependency table stores dependencies between nodes, with keys being the line number of the current instruction and values being the clock cycles, data sizes, and line numbers of previous instructions that define the same virtual register.
For example, in Table I, an LLVM IR snippet is extracted from an application compiled by the Clang front end. As the parser reads the first line, a source table and a destination table are created. The source table is updated with the key being %5 and the value being 1, and the destination register is hashed into the destination table with the key being %1 and the value being 1. When line 2 is read, the source register %1 happens to be the destination register of line 1. A dependency table is created and updated with the key being 2 (the line number of the current instruction) and the value being 1 (the line number of the dependent instruction). Following the same procedure, the three hash tables end up as shown in Table I.
LLVM IR trace:
1: store double %5, double* %1, align 8
2: %2 = load double, double* %1, align 8
3: %3 = load double, double* %6, align 8
4: %4 = fcmp oeq double %2, %3

Src Table         Dest Table        Dependency Table
Key     Value     Key     Value     Key     Value
%5      1         %1      1         2       1
%1      2         %2      2         4       2, 3
%6      3         %3      3
%2, %3  4         %4      4

Table I: The source, destination and dependency tables
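The table-building procedure described above can be sketched as a simplified parser that tracks only register-to-register dependencies (the full parser also records clock cycles and data sizes per dependency):

```python
import re

def build_tables(ir_lines):
    """Build source, destination, and dependency tables from an IR trace."""
    src, dest, dep = {}, {}, {}
    for lineno, line in enumerate(ir_lines, start=1):
        regs = re.findall(r"%\d+", line)
        if "=" in line:                  # e.g. "%2 = load ... %1": dest first
            d, sources = regs[0], regs[1:]
        else:                            # e.g. "store %5 ... %1": dest last
            d, sources = regs[-1], regs[:-1]
        for s in sources:
            if s in dest:                # source was defined by an earlier line
                dep.setdefault(lineno, []).append(dest[s])
        src[", ".join(sources)] = lineno
        dest[d] = lineno
    return src, dest, dep

# The snippet from Table I.
trace = [
    "store double %5, double* %1, align 8",
    "%2 = load double, double* %1, align 8",
    "%3 = load double, double* %6, align 8",
    "%4 = fcmp oeq double %2, %3",
]
src, dest, dep = build_tables(trace)
```

Running this on the Table I snippet reproduces the three hash tables, including the dependency entries 2 → 1 and 4 → 2, 3.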
C Discovering the Processing Community Structure
To formulate this problem, we introduce the following concepts:

Definition 1.
A data dependency graph (DDG) is a weighted directed graph $G = (V, E)$ where each vertex $v_i \in V$ represents one LLVM IR instruction; each edge $e_{ij} \in E$ with weight $w_{ij}$ characterizes either the data dependency from $v_i$ to $v_j$ or the control flow, such as jumps or branches from one block to another; and $E_{comp}(v_i)$ stands for the estimated energy of vertex $v_i$ given in Section IV.A.

Definition 2.
A weight $w_{ij}$ between $v_i$ and $v_j$ is calculated as latency times data size. Latency characterizes the delay from $v_i$ to $v_j$ based on the timing information. Data size represents the number of bytes transferred.

Definition 3.
A quality function $Q$ determines how efficiently the LLVM IR instructions are grouped together in terms of energy consumption, parallelism, load balancing, hardware utilization, and inter-cluster data movement.
The discovery of the processing community structure problem can now be formulated as follows: Given a DDG, find $k$ non-overlapping processing communities $C_1, \ldots, C_k$ which maximize the quality function:

(7)   $Q = \alpha \sum_{c=1}^{k} \left( W_c^{in} - W_c^{cut} \right) - \beta \sum_{c=1}^{k} \left( W_c - \bar{W} \right)^2 - \gamma \left( \sum_{v \in V} E_{comp}(v) + E_{comm} \right)$

and satisfy:

(8)   $\bigcup_{c=1}^{k} C_c = V, \quad C_i \cap C_j = \emptyset \;\; \forall\, i \neq j$
The first term in Equation (7) confines the data flow within each cluster as much as possible. It is the difference between the sum of the weights inside a cluster, $W_c^{in}$, and the sum of the weights of the edges connecting the cluster to other clusters, $W_c^{cut}$. The greater this term is, the fewer the inter-cluster data movements and the more energy is saved.

The second term in Equation (7) measures the squared deviation between the sum of weights $W_c$ in cluster $c$ and the average sum of weights $\bar{W}$ over all clusters. Minimizing this term ensures load balancing and fully takes advantage of parallel execution.

The third term in Equation (7) characterizes the energy model of the application, where $E_{comp}(v)$ calculates the energy consumed at each node using Equation (2) and $E_{comm}$ computes the energy consumption of the communication transactions. To maximize the quality $Q$, this term needs to be minimized in order to save energy.
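A sketch of evaluating the quality function of Eq. (7) on a toy partition; the weighting coefficients alpha, beta, gamma, the toy DDG, and the per-node energies are illustrative assumptions:

```python
def quality(clusters, edges, node_energy, alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. (7): intra-cluster weight minus cut weight, minus a load-balance
    penalty, minus the application energy (computation + communication)."""
    of = {v: i for i, c in enumerate(clusters) for v in c}  # vertex -> cluster
    w_in = [0.0] * len(clusters)
    w_cut = [0.0] * len(clusters)
    e_comm = 0.0
    for (u, v), w in edges.items():
        if of[u] == of[v]:
            w_in[of[u]] += w
        else:                       # cut edge counts against both endpoints
            w_cut[of[u]] += w
            w_cut[of[v]] += w
            e_comm += w             # inter-cluster traffic costs energy
    w_tot = [w_in[i] + w_cut[i] for i in range(len(clusters))]
    avg = sum(w_tot) / len(clusters)
    balance = sum((w - avg) ** 2 for w in w_tot)
    energy = sum(node_energy.values()) + e_comm
    return (alpha * sum(w_in[i] - w_cut[i] for i in range(len(clusters)))
            - beta * balance - gamma * energy)

# Toy DDG: four nodes, one cut edge between the two balanced clusters.
edges = {(1, 2): 4.0, (3, 4): 4.0, (2, 3): 1.0}
q = quality([{1, 2}, {3, 4}], edges, {v: 1.0 for v in range(1, 5)})
```

A partition that moved node 3 into the first cluster would raise the cut weight and unbalance the clusters, lowering Q, which is the behavior the community detection algorithm exploits.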
D Compact Intelligence Mapping into Constrained Hardware
The tile to which each cluster is mapped significantly affects the power consumption of the application since it determines the dynamic and static communication cost. Consequently, we propose an approach similar to the one in Hu (2003), but one that also takes cluster ordering into consideration so that it reduces the static energy consumption caused by congestion and contention for hardware resources.
Definition 4.
A task graph (TG) $G' = (V', E')$ is a weighted directed acyclic graph where each vertex $c_i \in V'$ represents a cluster of LLVM IR instructions that are grouped together by our community detection algorithm, and each edge $e'_{ij} \in E'$ represents communication from cluster $c_i$ to cluster $c_j$:

$v(e'_{ij})$: data size sent from $c_i$ to $c_j$.

$b(e'_{ij})$: bandwidth requirement from $c_i$ to $c_j$.

Definition 5.
An architecture graph (AG) $A = (T, P)$ is a directed graph where each vertex $t_i \in T$ represents a tile, and each edge $p_{ij} \in P$ represents a routing path from $t_i$ to $t_j$:

$e(p_{ij})$: energy consumption from $t_i$ to $t_j$.

$L(p_{ij})$: set of links that make up $p_{ij}$.
In order to exploit parallelism and pipelining, we apply topological sort to the task graph before mapping. The depth of a cluster is defined as the maximum number of edges on any path from the root to that cluster. In Fig. 2, a cluster cannot execute before its predecessor clusters because it needs data from all of them; however, clusters at the same depth can execute in parallel.
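The depth computation used before mapping can be sketched with Kahn's topological sort; the diamond-shaped task graph below is a hypothetical example, not the paper's Fig. 2:

```python
from collections import deque

def cluster_depths(n, edges):
    """Depth of each cluster = longest path (in edges) from a root,
    computed with Kahn's topological sort over the task graph."""
    succ = {i: [] for i in range(n)}
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    depth = [0] * n
    queue = deque(i for i in range(n) if indeg[i] == 0)  # roots at depth 0
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return depth

# Diamond-shaped TG: clusters 1 and 2 share depth 1 and may run in parallel;
# cluster 3 needs data from both and must wait at depth 2.
depths = cluster_depths(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
```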
D.1 Energy and Congestion Analysis
The energy-aware mapping proposed in Hu (2003) (we refer to it as H's mapping) fails to consider the order of the clusters, leading to significant potential congestion and static energy consumption in NoCs. This section shows how our algorithm mitigates this problem.
For illustration purposes, we make simplifying assumptions about the energy parameters. Applying H's mapping to the TG in Fig. 2 may yield the two different mappings shown in Table II. For instance, using Equation (3), both mappings incur the same dynamic energy cost.
In terms of static energy, we assume the execution time is the same for all clusters, and that a flit takes a fixed time to pass through a switch and a link. Fig. 3 shows the timing diagram of all computations and all packet deliveries for both mappings. For instance, in H's mapping, the first flit of a packet incurs a routing delay to arrive, while the rest of the packet needs an additional packet (serialization) delay.
Dynamic energy (J)
H's mapping    Our mapping

Table II: Mapping comparison: dynamic energy
In H's mapping, when a cluster finishes execution and is about to route its packet, the destination router's input buffer is still busy with earlier packet transmissions; thus, the packet must wait until those transfers are done. While the two mappings yield the same execution time, the packets in H's mapping experience a longer congestion delay, hence consuming more static energy. Applying Equations (4) and (5), H's mapping consumes more energy in the interconnect.
V Experimental Results
We use gem5 Binker (2016) together with McPAT Li (2009) for architectural and power simulation. Our baseline model is a 2-core ARM processor connected in a 2D mesh NoC Agarwal (2009) with the MESI cache coherence protocol. Detailed parameters are listed in Table III.
Cores               2 in-order ARM cores at 500 MHz
L1 Private Cache    32 KB, 4-way, 32-byte blocks
L2 Shared Cache     128 KB, 8-way
Topology            2D mesh with XY routing

Table III: Simulation parameters of the baseline processor
First, we examine the computational complexity of our processing community discovery algorithm (Fig. 4) as the number of cores grows. The processing community discovery is done offline (only once), so a run time of around two minutes does not affect the controller speed during UAV navigation. For system sizes under 256 cores, the run time is roughly related only to the map size and remains constant as the core count increases. Once the core count passes 256, the run time rises significantly.
Row   Cores   Inst/core (SD)   (Inter-core) flits   …        …
1     1       16637            24324                2.31     13.87
2     BL      514.9            12023                2932     15.87
3     2       415.9            8497                 2646.5   11.48
4     4       199.6            6391                 1932.2   12.64
5     8       104.8            4213                 748.5    11.88
6     16      53.2             3531                 593.5    10.67
7     32      29.1             2919                 713.8    9.73
8     64      12.5             1823                 293.3    10.58
9     128     13.4             3769                 252.9    7.84
10    256     7.3              5322                 120.3    12.39
11    512     4.3              14483                45.9     16.99

Table IV: TGs of different core counts
The statistics of the generated clusters are shown in Table IV. Row 1 of the Inst/core (SD) column gives the total number of instructions of the application; from row 2 onward, the column records the standard deviation of the number of instructions partitioned into each cluster. Row 1 of the (Inter-core) flits column gives the total number of edges in the application; from row 2 onward, it records the total number of flits that need to be transported between cores. The baseline (BL) is randomly partitioned. As the number of cores increases, the inter-core communication first drops to 913 edges at 64 cores (an 86.4% reduction compared to the baseline) and then soars to 7482 at 512 cores (11.3% more than the baseline). Similarly, on 2 cores, our algorithm reduces the edge count by 21.2%. The reason is that our algorithm effectively lowers inter-core communication when the core count is below 64; beyond 64 cores, as fewer and fewer instructions run on each core, inter-core message passing increases dramatically.
Next, we evaluate the speedup and power consumption of our design (Fig. 6). The power values are collected by feeding the outputs of gem5 into McPAT. Having fully taken advantage of parallel execution, load balancing, and optimized inter-community communication, our design achieves a maximum speedup of around 10.5x on the 64-core architecture and energy savings of 8.4x at 32 cores. The scalability of this application roughly ends at 64 cores due to the relatively small number of instructions. Mapping to 512 cores even yields longer run times and higher energy consumption because more flits need to be routed between cores. The delay in Fig. 6 refers to the time to run one iteration of the next-target-position calculation. The minimum power-delay product (PDP) is achieved by the 32-core configuration at 5.56, 39.3x lower than the baseline PDP of 219.6. It is noted that map size hardly affects run time and power, as simulations run on three different map sizes produce approximately the same results.
Model             Power
DJI ACE ONE       5 W
DJI NAZA-H        3.2 W
DJI NAZA-M LITE   1.5 W (max), 0.6 W (normal)
DJI NAZA-M V2

Table V: Power consumption of DJI flight controllers
Finally, we illustrate the potential of our design by comparing it with the state-of-the-art flight controllers used in DJI drones. As shown in Table V, the NAZA-M LITE has the lowest power consumption among these controllers, with a max power of 1.5 W and a normal power of 0.6 W. Our design consumes significantly less energy than DJI's controllers.
VI Conclusion
In this paper, we first develop an LLVM IR parser to construct the DDG of a UAV autonomous navigation application. Next, we analyze the DDG structure and discover its best parallelization degree by applying our load-balancing and energy-aware processing community discovery algorithm, so that data movement is confined within clusters and static energy consumption is minimized. Finally, a congestion-aware mapping scheme based on topological sort is proposed to map the clusters onto the NoC for parallel execution. Simulations show that our optimal 32-core design achieves an average of 8.4x energy savings and that the 64-core configuration achieves a 10.5x performance speedup.
References
 Agarwal [2009] N. Agarwal, T. Krishna, L.S. Peh, and N. K. Jha. Garnet: A detailed onchip network model inside a fullsystem simulator. In 2009 ISPASS, pages 33–42. IEEE, 2009.
 Armah [2016] S. Armah, S. Yi, W. Choi, and D. Shin. Feedback control of quadrotors with a matlabbased simulator. American Journal of Applied Sciences, 2016.
 Binker [2016] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.
 Bogdan [2015] P. Bogdan. Mathematical modeling and control of multifractal workloads for datacenteronachip optimization. In Proceedings of the 9th NOCS, page 21. ACM, 2015.
 Bogdan [2015] P. Bogdan and Y. Xue. Mathematical models and control algorithms for dynamic optimization of multicore platforms: A complex dynamics approach. In Proceedings of the ICCAD, pages 170–175. IEEE Press, 2015.
 Corke [2017] P. Corke. Flying Robots Book: Robotics, Vision and Control. Springer, 2017.
 Dally [2001] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th DAC, pages 684–689. ACM, 2001.
 Grech [2015] N. Grech, K. Georgiou, J. Pallister, S. Kerrison, J. Morse, and K. Eder. Static analysis of energy consumption for llvm ir programs. In Proceedings of the 18th SCOPES, pages 12–21. ACM, 2015.
 Hu [2003] J. Hu and R. Marculescu. Exploiting the routing flexibility for energy/performance aware mapping of regular noc architectures. In 2003 DATE, pages 688–693. IEEE, 2003.
 Lattner [2004] C. Lattner and V. Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In Proceedings of the CGO’04, page 75. IEEE Computer Society, 2004.
 Li [2009] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd MICRO, pages 469–480. ACM, 2009.
 Natalie [2019] N. Kitroeff. Boeing's 737 Max suffers setback in flight simulator test, 2019.
 Suleiman [2019] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze. Navion: A 2mw fully integrated realtime visualinertial odometry accelerator for autonomous navigation of nano drones. IEEE Journal of SolidState Circuits, 2019.
 Tan [2018] C. Tan, A. Kulkarni, V. Venkataramani, M. Karunaratne, T. Mitra, and L.S. Peh. Locus: Lowpower customizable manycore architecture for wearables. TECS, 17(1):16, 2018.
 Xiao [2017] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach. In Proceedings of the 36th ICCAD, pages 217–224. IEEE Press, 2017.
 Ye [2002] T. T. Ye, L. Benini, and G. De Micheli. Analysis of power consumption on switch fabrics in network routers. In Proceedings 2002 DAC, pages 524–529. IEEE, 2002.