
Network Offloading Policies for Cloud Robotics: a Learning-based Approach

by   Sandeep Chinchali, et al.

Today's robotic systems are increasingly turning to computationally expensive models such as deep neural networks (DNNs) for tasks like localization, perception, planning, and object detection. However, resource-constrained robots, like low-power drones, often have insufficient on-board compute resources or power reserves to scalably run the most accurate, state-of-the-art neural network compute models. Cloud robotics allows mobile robots the benefit of offloading compute to centralized servers if they are uncertain locally or want to run more accurate, compute-intensive models. However, cloud robotics comes with a key, often understated cost: communicating with the cloud over congested wireless networks may result in latency or loss of data. In fact, sending high data-rate video or LIDAR from multiple robots over congested networks can lead to prohibitive delay for real-time applications, which we measure experimentally. In this paper, we formulate a novel Robot Offloading Problem: how and when should robots offload sensing tasks, especially if they are uncertain, to improve accuracy while minimizing the cost of cloud communication? We formulate offloading as a sequential decision making problem for robots, and propose a solution using deep reinforcement learning. In both simulations and hardware experiments using state-of-the-art vision DNNs, our offloading strategy improves vision task performance by 1.3-2.6x over benchmark offloading strategies, allowing robots the potential to significantly transcend their on-board sensing accuracy but with limited cost of cloud communication.





I Introduction

Fig. 1: Autonomous mobile robots are faced with a key tradeoff. Should they rely on local compute models, which could be fast, power-efficient, but less accurate? Or, should they offload computation to a more accurate model in the cloud, which increases latency due to congested networks? In this paper, we propose a novel algorithmic framework to address such tradeoffs.

For autonomous mobile robots such as delivery drones to become ubiquitous, the amount of onboard computational resources will need to be kept relatively small to reduce energy usage and manufacturing cost. However, simultaneously, perception and decision-making systems in robotics are becoming increasingly computationally expensive [51] (see Footnote 1). In addition to restrictions on computation, autonomous robotic systems may also have storage limitations for, e.g., cached maps.

Footnote 1: For example, deep neural network-based vision systems running on a consumer GPU are able to perform detection at a rate of approximately 180 frames per second (fps), but the GPU consumes approximately 150W. In contrast, mobile-optimized GPUs such as the Nvidia Jetson TX1 are only capable of running the detection pipeline at 70fps, while still consuming 10W of power [49]. As a point of comparison, consumer drones typically consume in the range of 20-150W during hover [7]. As a result, reaching practical detection rates on small mobile robots will result in large power demands that will substantially reduce the operational time of the robot.

To avoid these restrictions, it is possible for a robot to offload computation or storage to the cloud, where resources are effectively limitless. This approach, which is commonly referred to as cloud robotics [31], imposes a set of trade-offs that have hitherto only been marginally addressed in the literature. Specifically, while offloading computation (for example) to the cloud reduces the onboard computing requirements, it may result in latency that could severely degrade performance, as well as information loss or total failure if a network is highly congested. Indeed, even economical querying of cloud resources may quickly overload a network in the case where the data transfer requirements are relatively large (such as high definition (HD) video or LIDAR point clouds) or where multiple robots are operating.

In this work, we formally study the decision problem associated with offloading to cloud resources for robotic systems. Given the limitations of real-world networks, we argue that robots should offload only when necessary or highly beneficial, and should incorporate network conditions into this calculus. We view this problem as a (potentially partially-observed) Markov Decision Process (MDP) [11], where an autonomous system decides whether to offload at every time step.

Contributions and Organization

In Section II, we survey existing work on the offloading problem in robotics and find that it under-emphasizes key costs of cloud communication such as increased latency, network congestion, and load on cloud compute resources, which in turn adversely affect a robot. We further show experimentally that current cloud robotics systems can lead to network failure and/or performance degradation, and discuss how this problem will become more severe in the future without intervention. To address this gap, we formulate a novel cost-based cloud offloading problem in Section III, and describe characteristics of this problem that make it difficult to solve with simple heuristics. In Section IV, we propose solutions to this problem based on deep reinforcement learning [52, 53], which are capable of handling diverse network conditions and flexibly trading off robot and cloud computation. In Section V, we demonstrate that our proposed approach allows robots to intelligently, but sparingly, query the cloud for better perception accuracy, both in simulations and hardware experiments. To our knowledge, this is the first work that formulates the general cloud offloading problem as a sequential decision-making problem under uncertainty and presents general-purpose, extensible models for the costs of robot/cloud compute and network communication.

II Background & Related Work

II-A Cloud Robotics

Cloud robotics has been proposed as a solution to limited onboard computation in mobile robotic systems, and the term broadly refers to the process of offloading to cloud-based computational resources [31, 21, 55, 30]. For example, a robot may offload video processing and associated perception, audio and natural language processing, or other sensory inputs such as LIDAR. Concretely, this approach has been used in mapping [39] and localization [44], perception [46], grasping [29], visuomotor control [56], speech processing [50], and other applications [43]. For a review of work in the field, we refer the reader to [30]. Cloud robotics can also include offloading complex decision making to a human, an approach that has been used in path planning [23, 25], and as a backup option for self-driving cars in case of planning failures [35].

In general, the paradigm is useful in any scenario in which there is a tradeoff between performance and computational resources. A notable example of this tradeoff is in perception, a scenario we use as a case-study in this paper. Specifically, vision Deep Neural Networks (DNNs) are becoming the de facto standard for object detection, classification, and localization for robotics. However, as shown in Table I, different DNNs offer varied compute/accuracy tradeoffs. Mobile-optimized vision DNNs, such as MobileNets [47] and ShuffleNets [57], often sacrifice accuracy to be faster and use less power. While MobileNet has lower accuracy, it has significantly fewer parameters and operations than the more accurate Mask R-CNN model, and thus might be favored for an on-robot processing model. A cloud-robotics framework would give improved performance by allowing the robot to query a cloud server running the compute-intensive Mask R-CNN model as needed, and using the onboard model when the lower accuracy is tolerable.

II-B Costs of Offloading

While offloading computation or storage to the cloud has the potential to enable cheap mobile robots to perform increasingly complex tasks, these benefits come at a cost. Querying the cloud is not instant, and there are costs associated with this latency. Furthermore, mobile robots largely use wireless networks (e.g., cellular or WiFi networks), which can be highly stochastic and low bandwidth [45]. Often, the offloaded data can be large relative to this bandwidth: HD video (e.g., for detecting obstacles) from a single robot can be over 8 megabits per second (Mbps) [58], while cellular networks are often uplink-limited and have between 1-10 Mbps [45, 33] to share across users.
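The bandwidth mismatch described above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the helper `max_concurrent_streams` is a hypothetical name, and it assumes an idealized link with no protocol overhead, fading, or contention, so real capacity would be lower.

```python
def max_concurrent_streams(uplink_mbps: float, stream_mbps: float) -> int:
    """Number of full-rate streams that fit into a shared uplink.

    Idealized assumption: the whole uplink is usable and there is no
    protocol overhead, so this is an optimistic upper bound.
    """
    if stream_mbps <= 0:
        raise ValueError("stream_mbps must be positive")
    return int(uplink_mbps // stream_mbps)


# Per-robot HD video rate quoted in the text (about 8 Mbps).
HD_VIDEO_MBPS = 8.0

# Uplink capacities spanning the 1-10 Mbps cellular range quoted in the text.
for uplink_mbps in (1.0, 5.0, 10.0):
    n = max_concurrent_streams(uplink_mbps, HD_VIDEO_MBPS)
    print(f"{uplink_mbps:>4} Mbps uplink -> {n} HD stream(s)")
```

Even at the top of the quoted cellular range, a single robot streaming HD video nearly saturates the uplink under these assumptions, which is the motivation for offloading selectively.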

| DNN | Accuracy (mAP) | Size | CPU Infer. | GPU Infer. |
| MobileNet v1 | 18 | 18 MB | 270 ms | 26 ms |
| MobileNet v2 | 22 | 67 MB | 200 ms | 29 ms |
| Mask R-CNN | 45.2 | 1.6 GB | 325 ms | 18 ms |

TABLE I: Accuracy, size, and speed tradeoffs of deep neural networks, where accuracy is the standard mean average precision (mAP) metric on the MS-COCO visual dataset [32].

Current state-of-the-art methods in cloud robotics largely fail to adequately consider these costs. For example, to limit network bandwidth utilization, [39] offload only key-frames (as opposed to all data) in mapping and localization. These key-frames are determined without considering the state of the network connection, and are sent at a predetermined frequency. In [43], the authors factor in the current state of the system, and hand-design a one-stage decision rule. However, designing such decision rules involves a number of trade-offs and can be difficult. In [46], the authors present a detailed comparison of offloading characteristics in cloud robotics to inform this design process. Nevertheless, hand-engineering heuristic solutions remains difficult and very domain-specific, and it is unclear whether these approaches can scale to higher data requirements where the costs of offloading become more significant.

Related offloading architectures have been employed outside of cloud robotics in the Internet-of-Things (IoT) community, especially as machine learning models have increased in complexity [41, 16]. However, offloading in this field relies on similar techniques, e.g., determining and offloading key-frames for object detection and utilizing heuristics such as frame differences to select the key-frames [15]; the policy is not optimized with system-level performance in mind. Alternative approaches include splitting neural network models across edge devices (in our application, the robot) and the cloud [28], but these may perform poorly under variable network conditions that are likely as a mobile robot navigates through regions of varying signal quality.

In this paper, we present an approach that addresses these hitherto under-emphasized costs in cloud robotics by incorporating features of the input stream and network conditions into the system-level decision problem of whether or not to offload.

Fig. 3: Streaming LIDAR over WiFi using the Robot Operating System (ROS) produces high-bitrate point cloud streams, which can lead to a receiver obtaining only half the data sent (as pictured) or even network failure in the multi-robot setting.

II-C Case Study: Costs of Offloading LIDAR Data

Network experiments with ROS

To motivate our contributions, we measure the impact of streaming LIDAR over a real network, which might occur in a cloud-based mapping scenario. The hurdles of streaming LIDAR over wireless networks have previously been informally described in online forums [3], but, to our knowledge, never rigorously measured in academic literature. We measured the effective data-rate when sending LIDAR over a WiFi connection between an embedded compute platform useful for robotics (the NVIDIA Jetson Tx2) and a central server. A Velodyne VLP-16 LIDAR was connected to the source (NVIDIA Jetson Tx2) and produced point clouds of average size Kilobytes (KB) at a rate of 10 Hz, as expected from Velodyne specifications. Using the standard Robot Operating System (ROS) [42] as the message passing interface, point clouds were sent in real-time on an uncongested wireless network, with only one other dormant connected machine, to a central server.

At the source, we measured a median Mbps data-rate, as expected from our measured average data size of (KB) KiloBytes at 10 Hz. However, at the receiver, we measured a median data-rate of Mbps (less than half of the sender’s) with a received sampling frequency of only Hz. Fig. 3 contrasts sender (red) and receiver (blue) data-rates. Since the WiFi link had regular delay statistics, we attribute the low received data-rate to inefficiencies of ROS in handling a large ( Mbps) sustained stream. In fact, official ROS documentation for the bandwidth measurement tool rostopic bw acknowledges that poor network connectivity and Python, not faster C++, code could be the cause of the receiver not keeping pace with the sender. Though anecdotal, we noticed several users on ROS forums with similar issues for both LIDAR and video (see Footnote 2).

Footnote 2: The post "ROS Ate My Network Bandwidth!" [3] details similar behaviors.

For a single sender-receiver pair, the problem we measured in Fig. 3 may be solved by optimizing ROS receiver code. Or, one could state-fully encode differences in LIDAR point clouds, inspired by today’s video encoders [26]. However, the point cloud stream of Mbps is disconcertingly large, since WiFi networks often attain only 5-100 Mbps [54, 36] while uplink-limited cellular networks often only sustain 1-10 Mbps [1, 45, 33] across users due to congestion or fading environments.

Indeed, to stress test the above scenario, we streamed data from several Velodyne sensors over a previously uncongested WiFi network and observed severe delay, dropped ROS messages, and network outages before we could take rigorous measurements.

Our experiments bear a striking resemblance to issues faced in the computer systems community with streaming HD video to the cloud for computer vision [41, 16, 18]. In the context of robotics, an informal report from Intel estimates that self-driving cars will generate 4 Terabytes of sensor data per day, much more than served by today’s cell networks [2]. Even if this data could be streamed to the cloud, it would place an enormous load on cloud compute services, as illustrated by the recent widespread outage of Amazon’s Alexa speech-processing agent due to an influx of new devices on Christmas day [34]. As more robotics platforms turn to the cloud, such as the Anki toy robot which offloads the bulk of its interactive speech processing [4], robots will have a huge incentive to minimize their network impact.

Indeed, sharing the network will allow swarms of robots to reap the benefits of the cloud and, importantly, reduce the latency they experience in receiving informative responses from the cloud due to network congestion. We now propose an algorithmic framework on how robots can combine local compute, active data filtering, and beneficial querying of the cloud.

III Problem Statement

In this paper, we focus on an abstract cloud robotics scenario, in which a robot experiences a stream of sensory inputs that it must process. At each timestep, it must choose to process the input onboard or to offload the processing to the cloud over a network. In this section, we offer practically motivated abstractions of these ideas, and combine them to formally define the robot-offloading problem.

Fig. 5: While our framework is general, we demonstrate it on face recognition from video, a common task for personal assistance robots or search-and-rescue drones. Video surveillance occurs over a finite horizon episode where a robot can use either an optimized local model or query the cloud if uncertain.

Sensory Input

We model the raw sensory input into the robot as the sequence $\{x_t\}_{t=0}^{T}$, where $x_t$ represents the data, such as a video frame or LIDAR point cloud, that arrives at time $t$. While the robot cannot know this sequence in advance, there may be properties of the distribution over these inputs that may guide the robot’s offloading policy. For example, this stream may have temporal coherence (see Fig. 5), such as when objects are relatively stationary in video [27, 26], which implies that $x_t$ is similar to $x_{t-1}$. As an example, a person will rarely appear in a video for only a single frame, and instead will be present for an extended time period, and the image of the person will likely change slowly. However, building a model of coherence can be difficult, so even a model-based approach built on relatively simple heuristics may be hard to design. Instead, we turn to model-free approaches, which sidestep modelling temporal coherence and instead can directly learn (comparatively) simple decision rules.

Computation Models

The computation that we consider offloading to the cloud is the process of estimating some output $y_t$ given some input $x_t$. For example, in the scenario of processing a video stream, $y_t$ could be a semantically separated version of the input frame (e.g., object detection), useful for downstream decision making. For the sake of generality, we make no assumptions on what this computation is, and only assume that both the robot and the cloud have models, $f_{\text{robot}}$ and $f_{\text{cloud}}$, that map a query $x_t$ to a prediction $\hat{y}_t$ and, importantly, a score of their confidence $\text{conf}_t$:

$$\hat{y}_t^{\text{robot}},\ \text{conf}_t^{\text{robot}} = f_{\text{robot}}(x_t), \qquad \hat{y}_t^{\text{cloud}},\ \text{conf}_t^{\text{cloud}} = f_{\text{cloud}}(x_t).$$

Typically, $f_{\text{robot}}$ is a computationally efficient model suitable for resource-constrained mobile robots. In contrast, $f_{\text{cloud}}$ represents a more accurate model which cannot be deployed at scale, for example a large DNN or the decision making of a human operator. The accuracy of these models can be measured through a loss function $\mathcal{L}(y_t, \hat{y}_t)$ that penalizes differences between the predictions and the true results, e.g., the cross entropy loss for classification problems or root mean squared error (RMSE) loss for regression tasks. In the experiments in this paper, we operate in a classification setting, in which confidences are easy to characterize (typically via softmax output layers). However, in the regression setting, there are also a wide variety of models capable of outputting prediction confidence [12, 20, 22]. The use of separate, modular robot and cloud models allows a robot to operate independently in case of network failure.
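As a rough illustration of how a classifier can return both a prediction and a confidence score, the sketch below applies a softmax to raw class scores and reports the top probability as the confidence. The function name `predict_with_confidence` and the example scores are hypothetical; the paper only states that confidences typically come from softmax output layers.

```python
import math
from typing import Sequence, Tuple


def predict_with_confidence(logits: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted class index, confidence) from raw class scores."""
    # Subtract the maximum score before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The prediction is the most probable class; its probability is the confidence.
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    return y_hat, probs[y_hat]


# Hypothetical scores for a three-class problem.
y_hat, conf = predict_with_confidence([2.0, 0.5, -1.0])
print(y_hat, round(conf, 3))
```

A low confidence from the on-robot model is one natural signal that offloading the same input to the cloud model may be worthwhile.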

Offload Bandwidth Constraints

The volume of data that can be offloaded is limited by bandwidth, either of the network, or a human operator. We abstract this notion by giving the robot a finite query budget of $N_{\text{budget}}$ samples that a robot can offload over a finite horizon of $T$ timesteps. This formalism flexibly allows modeling network bandwidth constraints, or rate-limiting queries to a human. Indeed, the fraction of samples a robot can offload in the finite horizon, $N_{\text{budget}} / T$, can be interpreted as the robot’s “fair-share” of a network link to limit congestion, a metric used in network resource allocation [19, 40].

These factors impose a variety of tradeoffs to consider when designing an effective offloading policy. Indeed, robot offloading can be seen as a sequential decision making problem under uncertainty. Thus, we formulate this problem as a Markov Decision Process (MDP), allowing us to naturally express desiderata for an offloading policy through the design of a cost function, and from there guide the offloading policy design process.

III-A The Robot Offloading Markov Decision Process

In this section, we express the generic robot offloading problem as an MDP

$$\mathcal{M}_{\text{offload}} = (\mathcal{S}_{\text{offload}}, \mathcal{A}_{\text{offload}}, R_{\text{offload}}, P_{\text{offload}}, T), \tag{1}$$

where $\mathcal{S}_{\text{offload}}$ is the state space, $\mathcal{A}_{\text{offload}}$ is the action space, $R_{\text{offload}}$ is a reward function, $P_{\text{offload}}$ defines the stochastic dynamics, and $T$ is the problem horizon. In the following section, we define each of these elements in terms of the abstractions of the robot offloading problem discussed earlier. Figure 6 shows the interplay between the agent (the robot), the offloading policy, and the environment, consisting of the sensory input stream and the robot and cloud prediction models.

Action Space

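We consider the offloading decision problem to be the choice of which prediction $\hat{y}_t$ to use for downstream tasks at time $t$. The offloading system can either (A) choose to use past predictions and exploit temporal coherence to avoid performing computation on the new input $x_t$, or (B) incur the computation or network cost of using either the on-device model $f_{\text{robot}}$ or querying the cloud model $f_{\text{cloud}}$. Specifically, we have four discrete actions:

$$a_t = \begin{cases} 0, & \text{use past robot prediction } \hat{y}_t = f_{\text{robot}}(x_{\tau_{\text{robot}}}) \\ 1, & \text{use past cloud prediction } \hat{y}_t = f_{\text{cloud}}(x_{\tau_{\text{cloud}}}) \\ 2, & \text{use current robot prediction } \hat{y}_t = f_{\text{robot}}(x_t) \\ 3, & \text{use current cloud prediction } \hat{y}_t = f_{\text{cloud}}(x_t) \end{cases} \tag{2}$$

where $\tau_{\text{robot}} < t$ is the last time the robot model was queried, and $\tau_{\text{cloud}} < t$ is the last time the cloud model was queried.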

State Space

We define the state in the offload MDP to contain the information needed to choose between the actions outlined above. Intuitively, this choice should depend on the current sensory input $x_t$, the stored previous predictions, a measure of the “staleness” of these predictions, and finally, the remaining query budget. We choose to measure the staleness of the past predictions by their age, defining $\Delta t_{\text{robot}} = t - \tau_{\text{robot}}$ and $\Delta t_{\text{cloud}} = t - \tau_{\text{cloud}}$. Formally, we define the state in the offloading MDP to be:

$$s_t = \left[\phi(x_t),\ f_{\text{robot}}(x_{\tau_{\text{robot}}}),\ f_{\text{cloud}}(x_{\tau_{\text{cloud}}}),\ \Delta t_{\text{robot}},\ \Delta t_{\text{cloud}},\ \Delta N_{\text{budget}},\ T - t\right], \tag{3}$$

where $\Delta N_{\text{budget}}$ is the remaining query budget and $T - t$ is the time remaining. Note that the sensory input $x_t$ may be high-dimensional, and including it directly in the planning problem state could yield an extremely large state-space. Instead, we consider including features $\phi(x_t)$ that are a function of the inputs. We note that in place of our choice of input representation, these state elements may be any summary of the input stream. The specific choice is context dependent and depends on the expense associated with utilizing the chosen features, as well as standard encodings or feature mappings. We describe the choice of practical features in Section V.


Dynamics

The dynamics in the robot offloading MDP capture both the stochastic evolution of the sensory input, as well as how the offloading decisions impact the other state elements such as the stored predictions and the query budget. The evolution of $x_t$ is independent of the offloading action, and follows a stochastic transition model that is domain-specific. For example, the evolution of video frames or LIDAR point clouds depends on the coherence of the background scene and robot mobility. The time remaining $T - t$ deterministically decrements by 1 at every timestep. The other state variables’ transitions depend on the chosen action.

If $a_t \in \{0, 1\}$, then the past prediction elements of the state do not change, but we increment their age by one. If $a_t = 2$, meaning we used the current on-robot model, then we update the stored robot model prediction and reset its age to $\Delta t_{\text{robot}} = 0$. Similarly, if we choose to query the cloud model, $a_t = 3$, then we update the stored cloud prediction and reset its age to $\Delta t_{\text{cloud}} = 0$, and also decrement the query budget by 1.
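The action-dependent part of these dynamics can be sketched as a small transition function. This is a minimal illustration under stated assumptions, not the authors' environment: the names `OffloadState` and `step` are hypothetical, the input features are omitted from the state, the age of the prediction that is not refreshed is assumed to keep growing by one, and a cloud query with no budget left is remapped to the on-robot model (as the paper does later for its learned policy).

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Action indices: 0/1 reuse stored predictions, 2/3 run a model on the new input.
USE_PAST_ROBOT, USE_PAST_CLOUD, QUERY_ROBOT, QUERY_CLOUD = 0, 1, 2, 3


@dataclass(frozen=True)
class OffloadState:
    robot_pred: Optional[Any]  # last stored robot prediction
    cloud_pred: Optional[Any]  # last stored cloud prediction
    robot_age: int             # timesteps since the robot model was queried
    cloud_age: int             # timesteps since the cloud model was queried
    budget_left: int           # remaining cloud queries
    time_left: int             # remaining timesteps in the episode


def step(state: OffloadState, action: int, x_t: Any,
         f_robot: Callable[[Any], Any],
         f_cloud: Callable[[Any], Any]) -> OffloadState:
    """Apply one offloading action and return the next state."""
    # Assumption: an offload request with an empty budget falls back to the robot model.
    if action == QUERY_CLOUD and state.budget_left <= 0:
        action = QUERY_ROBOT

    # By default every stored prediction becomes one step older.
    robot_pred, robot_age = state.robot_pred, state.robot_age + 1
    cloud_pred, cloud_age = state.cloud_pred, state.cloud_age + 1
    budget_left = state.budget_left

    if action == QUERY_ROBOT:
        # Refresh the robot prediction and reset its age.
        robot_pred, robot_age = f_robot(x_t), 0
    elif action == QUERY_CLOUD:
        # Refresh the cloud prediction, reset its age, and spend one query.
        cloud_pred, cloud_age = f_cloud(x_t), 0
        budget_left -= 1

    return OffloadState(robot_pred, cloud_pred, robot_age, cloud_age,
                        budget_left, state.time_left - 1)
```

For example, starting from `OffloadState(None, None, 0, 0, 1, 3)`, one cloud query spends the whole budget, and a second cloud request is then served by the on-robot model instead.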

The modelling of the network query budget is directly based on our measurements (Fig. 3) and recent work in the systems community on network congestion [45, 41, 16]. Our use of sequential features is inspired by the coherence of video frames [27, 26], which we also measured experimentally and observed for LIDAR point clouds.

Fig. 6: We formulate a novel Robot Offloading MDP, depicted above, where a robot uses an on-board offloading policy to select if it should use cached predictions, query a local model, or incur the cost, but also accuracy benefits, of querying the cloud.


Reward

We choose the reward function in the MDP to express our objective of achieving good prediction accuracy while minimizing both on-robot computation and network utilization. We can naturally express the goal of high prediction accuracy by adding a penalty proportional to the loss function $\mathcal{L}(y_t, \hat{y}_t)$ under which the cloud and robot models are evaluated. We note, however, that this choice of loss is arbitrary, and a loss derived from some downstream application may be used instead. Indeed, if a scenario is such that mis-classification will result in high cost (e.g., mis-classifying a human as a stationary object during path planning), this may be incorporated into the MDP reward function. To model the cost of network utilization and computation, we add action costs. This gives us the reward function

$$R_{\text{offload}}(s_t, a_t) = -\alpha_{\text{accuracy}}\, \mathcal{L}(y_t, \hat{y}_t) - \beta_{\text{cost}}\, \text{cost}(a_t), \tag{4}$$

where $\alpha_{\text{accuracy}}$, $\beta_{\text{cost}}$ are weights. The costs for network utilization are best derived from the economic analysis of onboard power usage and the cost of bandwidth utilization. For example, a mobile robot with a small battery might warrant a higher cost for querying the onboard model than a robot with a higher battery capacity.
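A reward of this form, a weighted loss penalty plus a weighted action cost, can be sketched in a few lines. The function `offload_reward`, the zero-one default loss, and every number in `EXAMPLE_COSTS` are illustrative assumptions rather than the values used in the paper; the only properties taken from the text are that reusing past predictions is free and that querying the cloud costs more than querying the robot model.

```python
from typing import Any, Callable, Mapping


def zero_one_loss(y_true: Any, y_hat: Any) -> float:
    """1.0 for an incorrect prediction, 0.0 for a correct one."""
    return 0.0 if y_hat == y_true else 1.0


def offload_reward(y_true: Any, y_hat: Any, action: int,
                   action_costs: Mapping[int, float],
                   alpha_accuracy: float = 1.0,
                   beta_cost: float = 1.0,
                   loss_fn: Callable[[Any, Any], float] = zero_one_loss) -> float:
    """Negative weighted loss minus weighted action cost."""
    return -alpha_accuracy * loss_fn(y_true, y_hat) - beta_cost * action_costs[action]


# Hypothetical costs: reusing past predictions (actions 0, 1) is free,
# the robot model (2) is cheap, and a cloud query (3) is expensive.
EXAMPLE_COSTS = {0: 0.0, 1: 0.0, 2: 1.0, 3: 10.0}

print(offload_reward("alice", "alice", 0, EXAMPLE_COSTS))  # correct, cached
print(offload_reward("alice", "bob", 3, EXAMPLE_COSTS))    # wrong, cloud query
```

Passing a different `loss_fn` is one way to encode the application-specific misclassification penalties mentioned above.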

III-B The Robot Offloading Problem

Having formally defined the robot offloading scenario as an MDP, we can quantify the performance of an offloading policy in terms of the expected total reward it obtains in this MDP. This allows us to formally describe the general robot offloading problem as:

Problem 1 (Robot Offloading Problem)

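Given robot model $f_{\text{robot}}$, cloud model $f_{\text{cloud}}$, a cloud query budget of $N_{\text{budget}}$ over a finite horizon of $T$ steps, and an offloading MDP $\mathcal{M}_{\text{offload}}$ (Equation 1), find the optimal offloading control policy $\pi^{*}_{\text{offload}}$ that maximizes the expected cumulative reward $R_{\text{offload}}$:

$$\pi^{*}_{\text{offload}} \in \arg\max_{\pi_{\text{offload}}} \ \mathbb{E}_{x_0, \ldots, x_T}\left[\sum_{t=0}^{T} R_{\text{offload}}(s_t, a_t)\right],$$

where $a_t = \pi_{\text{offload}}(s_t)$.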

Our MDP formulation, depicted in Fig. 6, is based both on experimental insights and practical engineering abstractions. A key abstraction is the use of separate, modular robot and cloud perception models. Thus, a designer can flexibly trade-off accuracy, speed, and hardware cost, using a suite of pre-trained models available today [47], as alluded to in Table I. Importantly, the robot can always default to its local model in case of network failure, which provides a guarantee on minimum performance.

While we have framed this problem as an MDP, we cannot easily apply conventional tools for exactly solving MDPs such as dynamic programming, as many of the aspects of this problem are hard to analytically characterize, notably the dynamics of the sensory input stream. This motivates studying approximate solution techniques to this problem, which we discuss in the following section.

We emphasize that the framework we present is agnostic to the sensory input modality, and is capable of handling a wide variety of data streams or cost functions. Moreover, the action space can be simply extended if multiple offloading options exist. As such, it describes the generic offloading problem for robotic systems.

IV A Deep RL Approach to Robot Offloading

Our Approach

We approach the offloading problem using deep reinforcement learning (RL) [52, 53, 37] for several reasons. First and foremost, model-free policy search methods such as reinforcement learning avoid needing to model the dynamics of the system. While most of the dynamics of the system are relatively simple, it is extremely difficult to model the evolution of the incoming sensory inputs. The model-free approach is capable of learning optimal offloading policies based solely on the features included in the state, and may avoid trying to predict incoming images, for example. Moreover, the use of a recurrent policy allows better estimation of latent variables defining the context of the incoming images. This recurrent policy accounts for possible non-Markovianity of the state. Indeed, since the state vector only includes features from the two most recent inputs, a Markovian policy likely cannot accurately model the sensory stream.

There are several other advantages to using RL to compute good offloading policies. RL enables simple methods to handle stochastic rewards. We have chosen a relatively general reward function in the previous section, which may be stochastic due to variable costs associated with network conditions or variable cost of computation due to other processes. Finally, an RL based approach allows inexpensive evaluation of the policy, as it is not necessary to evaluate dynamics and perform optimization-based action selection as in, e.g., model predictive control [14]. In contrast to these approaches, a deep RL-based approach requires only evaluating a neural network. Because this policy evaluation is performed as an intermediate step to perception onboard the robot, efficient evaluation is critical to achieving low latency.

We represent the RL offloading policy as a deep neural network and train it using the Advantage Actor-Critic (A2C) algorithm [38]. We discuss the details of the training procedure in the next section. We refer to the policy trained via RL as $\pi_{\text{offload}}^{\text{RL}}$.

Baseline Approaches

We compare the RL-based policy against the following baseline policies:

  1. Random Sampling

    This extremely simple benchmark chooses a random action $a_t \in \{0, 1, 2, 3\}$ when the cloud query budget is not saturated and, afterwards, chooses randomly from actions $\{0, 1, 2\}$.

  2. Robot-only Policy

    The robot-only policy chooses $a_t = 2$ at every time-step to query the robot model and can optionally use past robot predictions ($a_t = 0$) in between.

  3. Cloud-only Policy

    The cloud-only policy chooses $a_t = 3$ uniformly every $T / N_{\text{budget}}$ steps (queries the cloud model) and uses the past cloud predictions ($a_t = 1$) in between. Essentially, we periodically sample the cloud model and hold the prediction.

  4. Robot-uncertainty Based Sampling

    This policy uses the robot confidence to offload the least-confident percentile of samples to the cloud as long as the remaining cloud query budget allows.

While approaches 2 and 3 may seem simple, we note that these are the de-facto strategies used in either standard robotics (all robot computations) or standard cloud robotics (all offloading with holds to reduce bandwidth requirements). Robot-uncertainty based sampling is a heuristic that may be used for key-frame selection, analogously to [39].
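The baselines above can be sketched as simple decision rules over the four action indices (0 and 1 reuse past robot or cloud predictions, 2 queries the robot model, 3 queries the cloud). Everything below is a hedged illustration: the function names are hypothetical, the cloud-only period is assumed to be the horizon divided by the budget and rounded up so the budget is never exceeded, and the uncertainty baseline is approximated with a fixed confidence threshold (`conf_threshold`), which one might set from a percentile of robot confidences on held-out data.

```python
import random

USE_PAST_ROBOT, USE_PAST_CLOUD, QUERY_ROBOT, QUERY_CLOUD = 0, 1, 2, 3


def random_policy(budget_left: int, rng: random.Random) -> int:
    """Uniformly random action; the cloud is excluded once the budget is spent."""
    actions = [USE_PAST_ROBOT, USE_PAST_CLOUD, QUERY_ROBOT]
    if budget_left > 0:
        actions.append(QUERY_CLOUD)
    return rng.choice(actions)


def robot_only_policy() -> int:
    """Always run the on-robot model on the new input."""
    return QUERY_ROBOT


def cloud_only_policy(t: int, horizon: int, budget: int) -> int:
    """Query the cloud periodically and hold its prediction in between."""
    # Ceiling division keeps the number of queries within the budget (budget >= 1).
    period = -(-horizon // budget)
    return QUERY_CLOUD if t % period == 0 else USE_PAST_CLOUD


def uncertainty_policy(robot_conf: float, conf_threshold: float,
                       budget_left: int) -> int:
    """Offload only low-confidence inputs while cloud budget remains."""
    if robot_conf < conf_threshold and budget_left > 0:
        return QUERY_CLOUD
    return QUERY_ROBOT


# With a horizon of 10 and a budget of 3, the cloud is queried at t = 0, 4, 8.
schedule = [cloud_only_policy(t, horizon=10, budget=3) for t in range(10)]
print(schedule)
```

None of these rules looks at network state or input coherence, which is the gap the learned policy is meant to close.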

V Experimental Performance of Our Deep RL Offloader

We benchmark our proposed RL-based cloud offloading policy within a realistic and representative setting for cloud robotics. Specifically, we focus on a face detection scenario using cutting-edge vision DNNs. This scenario is prevalent in robotics applications ranging from search and rescue to robots that assist humans in commercial or industrial settings. More generally, it is representative of an object detection task that is a cornerstone of virtually any robotics perception pipeline. We test this system with both a simulated input image stream with controlled temporal coherence as well as on a robotic hardware platform with real video streams, and find that in both cases, the RL policy intelligently, and sparingly, queries the cloud to achieve high prediction accuracy while incurring low query costs, outperforming baselines.

Face-detection Scenario

We formulate this scenario, depicted in Fig. 5, in terms of the general abstractions we introduced in Section III. Here, the sensory input stream is a video, where each $x_t$ is a still frame from that video. To avoid training a policy over the large image space directly, we choose the feature encoding $\phi(x_t)$ that is used in the state space of the offloading MDP to be the sum of absolute differences between sequential frames. For the on-robot prediction model $f_{\text{robot}}$, we use a combination of FaceNet [48], a widely-used pre-trained face detection model which embeds faces into embedding vectors, together with an SVM classifier over these embeddings. This model has seen success for face detection in live streaming video on embedded devices [9]. For the cloud model $f_{\text{cloud}}$, we use a human oracle, which always gives an accurate prediction with a high confidence measure. We used a zero-one loss function to measure the accuracy of the predictions, with $\mathcal{L}(y_t, \hat{y}_t) = 1$ if the prediction was incorrect, and $\mathcal{L}(y_t, \hat{y}_t) = 0$ if it was correct.
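The sum-of-absolute-differences feature between sequential frames can be sketched as follows. This is an assumption-laden illustration: the helper name `frame_difference_feature` is hypothetical, frames are assumed to be equally shaped 8-bit NumPy arrays, and no normalization or downsampling is applied since the text does not specify any.

```python
import numpy as np


def frame_difference_feature(prev_frame: np.ndarray, frame: np.ndarray) -> int:
    """Sum of absolute pixel differences between two consecutive frames."""
    if prev_frame.shape != frame.shape:
        raise ValueError("frames must have the same shape")
    # Cast to a signed type first: subtracting uint8 arrays would wrap around.
    diff = frame.astype(np.int64) - prev_frame.astype(np.int64)
    return int(np.abs(diff).sum())


# Two tiny hypothetical grayscale frames.
frame_a = np.zeros((2, 2), dtype=np.uint8)
frame_b = np.full((2, 2), 3, dtype=np.uint8)
print(frame_difference_feature(frame_a, frame_b))  # 4 pixels, each differing by 3
```

A small value suggests the scene has barely changed, so a cached prediction is likely still valid; a large value hints that a fresh robot or cloud query may be needed.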

We choose the reward function to balance prediction accuracy against onboard computation and queries to the human operator through the network. Reusing a past robot or cloud prediction was assigned zero cost, the robot model was assigned a nonzero cost, and the cloud model was assigned a higher cost still, to especially penalize querying the human oracle, who will have limited bandwidth. We tested different weightings of the accuracy and cost terms in the reward function (Eqn. 4) and report results for a weighting that yielded good performance for our specific cost setup. These costs were chosen to incentivize reasonably rational behavior; in real robotic systems they could be computed through an economic cost-benefit analysis. (Footnote 3: We provide the offloading simulation environment, robot and cloud FaceNet models, and MDP dynamics outlined in Eqns. 3 - 4 as a standard OpenAI gym [13] environment.)
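The trade-off described above can be sketched as a per-step reward that subtracts a weighted prediction loss and a weighted action cost. This is an illustrative form only: the exact reward is given by Eqn. 4, which is not reproduced here, and the action indices, cost values, and weight names below are placeholders rather than the settings used in the experiments. The only detail taken from the text is that reusing a cached prediction costs nothing and that the cloud is the most expensive choice.

```python
# Hypothetical action indices: 0 and 1 reuse cached predictions,
# 2 runs the robot model, 3 queries the cloud model.
PAST_ROBOT, PAST_CLOUD, RUN_ROBOT, QUERY_CLOUD = 0, 1, 2, 3

# Placeholder costs (assumed values, chosen only for illustration).
ACTION_COST = {
    PAST_ROBOT: 0.0,
    PAST_CLOUD: 0.0,
    RUN_ROBOT: 1.0,
    QUERY_CLOUD: 5.0,
}


def step_reward(action: int, loss: int,
                w_accuracy: float = 1.0, w_cost: float = 1.0) -> float:
    """Negative weighted sum of the zero-one loss and the cost of the chosen action."""
    return -w_accuracy * loss - w_cost * ACTION_COST[action]


print(step_reward(QUERY_CLOUD, loss=0))  # correct but expensive cloud query
print(step_reward(PAST_ROBOT, loss=1))   # free cached prediction that was wrong
```

Changing the two weights shifts the policy between favoring accuracy and favoring low compute and network usage.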

Offloading Policy Architecture

In practice, the input query sequence may show time-variant patterns, and the MDP may become nonstationary if the agent only knows the current state. To address this problem with a recurrent policy, we use a Long Short Term Memory (LSTM) [24] as the first hidden layer in the offloader policy to extract a representation over a short history of states. In particular, the actor (or critic) DNN has an LSTM first layer of 64 units, a fully-connected second layer of 256 units, and a softmax (or linear) output layer. We softly enforce the constraint that disallows the offloading action once the budget is depleted by mapping action 3 to action 2 when no query budget remains.
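The soft action constraint can be illustrated with a small post-processing step applied to whatever action the policy proposes. The sketch below assumes that action 3 is the cloud query and action 2 is the robot model, as the mapping in the text suggests; the function names and the budget bookkeeping are hypothetical.

```python
# Hypothetical action indices inferred from the text.
PAST_ROBOT, PAST_CLOUD, RUN_ROBOT, QUERY_CLOUD = 0, 1, 2, 3


def enforce_budget(action: int, budget_left: int) -> int:
    """Map the offloading action to the robot-model action when no budget remains."""
    if action == QUERY_CLOUD and budget_left <= 0:
        return RUN_ROBOT
    return action


def apply_actions(proposed, budget):
    """Apply the constraint to a sequence of proposed actions, spending one unit per cloud query."""
    executed = []
    for action in proposed:
        action = enforce_budget(action, budget)
        if action == QUERY_CLOUD:
            budget -= 1
        executed.append(action)
    return executed


# With a budget of 2, the third cloud request falls back to the robot model.
print(apply_actions([3, 3, 0, 3, 2], budget=2))  # [3, 3, 0, 2, 2]
```

Because the remapping happens outside the network, the policy can still emit action 3 at any time; it simply has no effect once the budget is spent.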

We use standard hyper-parameters for A2C training, with an orthogonal initializer and RMSprop gradient optimizer. Specifically, we fix the actor learning rate, critic learning rate, minibatch size, entropy loss coefficient, and gradient norm clipping threshold. We train A2C over 1 million episodes with a fixed discount factor and episode length. We observed stable convergence, consistent over different weightings of the accuracy and loss terms in the reward.

A key aspect of this problem is how the coherence of the input stream allows the offloading policy to leverage cached predictions to avoid excessively querying the cloud model. In order to test this, we applied the deep RL approach in two scenarios: a synthetic stream of images where coherence was controlled, as well as an on-hardware demo which used real video data. In the following subsections, we detail the training and testing procedure for each scenario, and discuss the results.

V-A Synthetic Input Stream Experiments

To model the coherence of video frames we observed experimentally, we divided each episode into “coherent” sub-intervals, where only various frames of the same person appear within one contiguous sub-interval, albeit with different background and lighting conditions. Then, the input stochastically switches to a new individual, who could be unknown to the robot model. As such, we simulate a coherent stream of faces that is a diverse mixture of faces known and unknown to the robot, as shown at the top of Fig. 5. The length of a coherence interval was varied as a fraction of the episode duration to show a diversity of faces in an episode.

Each training trace (episode of the MDP) lasted a fixed number of steps, with a face image (query) arriving at each timestep. To test RL on a diverse set of network usage limits, we randomly sampled a query budget at the start of each trace.
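A trace with this kind of temporal coherence could be generated along the following lines. This is a sketch under stated assumptions: the identity labels, the interval length bounds, the episode length, and the uniform budget sampling are all placeholders, since the values used in the experiments are not given here.

```python
import random
from itertools import groupby


def coherent_stream(num_steps, identities, min_len, max_len, rng):
    """Build a label sequence made of contiguous runs, each showing a single identity."""
    stream = []
    current = None
    while len(stream) < num_steps:
        # Switch to a different identity at every interval boundary.
        current = rng.choice([p for p in identities if p != current])
        run_length = rng.randint(min_len, max_len)
        stream.extend([current] * run_length)
    return stream[:num_steps]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
people = ["known_a", "known_b", "unknown_c"]  # hypothetical identities
stream = coherent_stream(20, people, min_len=3, max_len=6, rng=rng)
budget = rng.randint(0, len(stream))  # query budget sampled once per trace

run_lengths = [len(list(group)) for _, group in groupby(stream)]
print(stream)
print(run_lengths, budget)
```

In a real setup each label would index into a pool of face images of that person, so that frames within a run differ in background and lighting while the identity stays fixed.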

Fig. 8: RL beats benchmark offloading policies in diverse test episodes over a mixture of network conditions.
Fig. 10: The reward trades off offloading costs, which penalize network and cloud usage, with classification loss.


We evaluated the RL policy and the benchmarks on 100 diverse testing traces each, where the face pictures present in each trace were distinct from those in the set of training traces. To test an offloader’s ability to adapt to various network bandwidth constraints, we evaluated each trace with four trials on each of several query budget fractions, simulating budgets ranging from highly constrained to unconstrained networks.

We show RL test results for the same representative reward function parameters described above in Section V-A.

RL Intelligently, but Sparingly, Queries the Cloud

Figure 8 shows the distribution of rewards attained by the different offloader policies on all test traces, where our RL approach is depicted in the yellow boxplot. Then, we break down the mean episode reward into its components of prediction accuracy and offloading cost, and show the mean performance over all test traces for each policy in Fig. 10.

Benchmark policies of random sampling, all-robot compute, periodic cloud compute, and the best confidence-threshold based heuristic policy are shown in the left four boxplots (red to purple). An oracle upper-bound solution, which is unachievable in practice since it perfectly knows the robot and cloud predictions and the future timeseries, is depicted in gray in Figs. 8 - 10.
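To make the deterministic benchmarks concrete, the sketch below writes three of them as simple functions of the timestep, the robot model's confidence, and the remaining budget. The function signatures, the period, the threshold, and the action indices are assumptions for illustration and are not the settings used in the experiments; a real environment would also need a fallback when a cached cloud prediction is requested before any cloud query has been made.

```python
# Hypothetical action indices inferred from the text.
PAST_ROBOT, PAST_CLOUD, RUN_ROBOT, QUERY_CLOUD = 0, 1, 2, 3


def all_robot(t, confidence, budget_left):
    """Always run the on-robot model."""
    return RUN_ROBOT


def periodic_cloud(t, confidence, budget_left, period=2):
    """Query the cloud every `period` steps while budget remains, else hold the last cloud result."""
    if t % period == 0 and budget_left > 0:
        return QUERY_CLOUD
    return PAST_CLOUD


def confidence_threshold(t, confidence, budget_left, threshold=0.6):
    """Offload only when the robot model is unsure and budget remains."""
    if confidence < threshold and budget_left > 0:
        return QUERY_CLOUD
    return RUN_ROBOT


def run_policy(policy, confidences, budget):
    """Roll a policy over a sequence of robot confidences, spending budget on cloud queries."""
    actions = []
    for t, conf in enumerate(confidences):
        action = policy(t, conf, budget)
        if action == QUERY_CLOUD:
            budget -= 1
        actions.append(action)
    return actions


confidences = [0.9, 0.3, 0.2, 0.8, 0.1]
print(run_policy(all_robot, confidences, budget=2))             # [2, 2, 2, 2, 2]
print(run_policy(periodic_cloud, confidences, budget=2))        # [3, 1, 3, 1, 1]
print(run_policy(confidence_threshold, confidences, budget=2))  # [2, 3, 3, 2, 2]
```

In this toy run the threshold heuristic spends its whole budget early and has none left for the least confident frame at the end, which is the kind of short-sightedness a learned policy can avoid.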

Fig. 8 shows that RL has a higher median episode reward than the benchmarks and is competitive with the upper-bound oracle solution. This is because the RL policy sparingly queries the costly cloud model, in contrast to an all-cloud policy that incurs significantly higher model query and network cost, as shown in Fig. 10, which plots the mean reward terms and the 95% standard error estimates to display our certainty in the mean estimate. Essentially, RL learns to judiciously query the cloud when the robot model is highly uncertain, which allows it to improve the overall system accuracy and achieve a low prediction loss (Fig. 10). Interestingly, it has better prediction accuracy than an “all-cloud” scheme, since bandwidth limits force that scheme to sample the cloud periodically but sparsely and hold past cloud predictions in between. RL learns to conserve precious cloud queries when the query budget is low and use them when they help prediction accuracy the most, thereby achieving low prediction error in Fig. 10.

V-B Hardware Experiments

Inspired by deep RL’s promising performance on synthetic input streams, we built an RL offloader that runs on the NVIDIA Jetson TX2 embedded computer, which is optimized for deep learning and used in mobile robotics. The RL offloader takes in live video from the Jetson camera and runs the small nn4.small2.v1 FaceNet DNN from the popular OpenFace project [9] and an SVM classifier on selected frames as the robot model. OpenFace [9] provides four pre-trained FaceNet models, and we chose the smallest, fastest model for the robot to consider cases where a robot has limited hardware. The smallest model has half the parameters of the largest FaceNet model and is faster on a GPU [9].

If the RL agent deems a face needs to be offloaded due to local uncertainty, it can send either the concise FaceNet embedding (a 128-dimensional byte vector) or the face image to a central server that can run a larger FaceNet DNN and/or SVM classifier trained on many more humans of interest as the cloud model. We use OpenCV for video frame processing, a PyTorch OpenFace model, and TensorFlow [5] for the RL offloader neural network policy to achieve real-time processing performance on the embedded Jetson platform.

Data Collection

We captured 6 training and 3 testing videos spanning 9 volunteers, consisting of over 2600 frames of video where offloading decisions could be made. The nine volunteers were known to the robot model but our dataset had 18 distinct, annotated people. Collectively, the training and test datasets showed diverse scenarios, ranging from a single person moving slowly, to dynamic scenarios where several humans move amidst background action.

The Jetson was deployed with a robot FaceNet model and an SVM trained on a subset of images from only the people known to it, while the cloud model was trained on several more images of all volunteers to be more accurate. The state variables in Eq. 3 were input to our deep RL offloader, where the sum of frame differences, rather than a full image, was an important indicator of how quickly video content was changing. Frame differences, depicted in Fig. 13, help the offloader subtract background noise and home in on rapidly changing faces.

Evaluation and Discussion

As expected, the trained RL offloader queries the cloud when robot model confidence is low, the video is chaotic (indicated by large pixel difference scores), and several hard-to-detect faces appear in the background.

However, it weights such factors together to decide an effective policy and hence is significantly better than a confidence-threshold based heuristic. In fact, the decision to offload to the cloud is only 51.7% correlated (Spearman correlation coefficient) with robot model confidence, showing several factors influence the policy.
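A Spearman correlation of this kind can be computed by ranking both series (averaging the ranks of ties) and taking the Pearson correlation of the ranks. The sketch below uses made-up toy data, not the hardware logs; the variable names are hypothetical. In this toy example the coefficient is negative because low confidence tends to coincide with offloading, and no claim is made about the sign convention behind the figure reported in the text.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: 1 means the frame was offloaded, paired with the robot model's confidence.
offloaded = [1, 1, 0, 0, 1, 0]
confidence = [0.2, 0.4, 0.9, 0.7, 0.8, 0.6]
rho = spearman(offloaded, confidence)
print(round(rho, 3))
```

A coefficient well short of perfect correlation indicates that confidence alone does not determine the offloading decision.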

(a) FaceNet on a live video stream.
(b) Offload yellow faces.
Fig. 13: Hardware Experiments: Our offloader, depicted on frame pixel differences, interleaves FaceNet predictions on a robot (red box) and cloud model (yellow) when uncertain.

In our hardware experiments on real streaming video, our offloader achieved higher reward than both an all-robot policy and an all-cloud policy, and a fraction of the reward of an oracle upper-bound solution. Further, it attained higher reward than confidence-based threshold heuristics, for which we did a linear sweep over threshold confidences, which depend on the specific faces in the video datasets.
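A linear sweep over confidence thresholds can be sketched as evaluating the heuristic on a trace for each candidate threshold and keeping the best one. Everything numeric below is a placeholder: the costs, the toy confidences, the budget, and the assumption that the cloud is always correct are chosen only to make the example self-contained.

```python
def episode_reward(threshold, confidences, robot_correct, budget,
                   robot_cost=0.1, cloud_cost=0.5):
    """Total reward of a confidence-threshold heuristic on one trace (cloud assumed always correct)."""
    total = 0.0
    for conf, correct in zip(confidences, robot_correct):
        if conf < threshold and budget > 0:
            budget -= 1
            total -= cloud_cost  # zero loss, but the cloud query is paid for
        else:
            total -= robot_cost + (0.0 if correct else 1.0)  # robot cost plus zero-one loss
    return total


def sweep_thresholds(confidences, robot_correct, budget, steps=10):
    """Evaluate evenly spaced thresholds in [0, 1] and return the best (lowest on ties)."""
    thresholds = [i / steps for i in range(steps + 1)]
    return max(thresholds,
               key=lambda th: episode_reward(th, confidences, robot_correct, budget))


confidences = [0.9, 0.2, 0.8, 0.3]
robot_correct = [True, False, True, False]
best = sweep_thresholds(confidences, robot_correct, budget=2)
print(best)  # 0.4 on this toy trace
```

The best threshold depends on the confidence values in the trace, which is consistent with the observation that tuned thresholds are specific to the faces in a given dataset.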

Our video dataset, robot/cloud models, and offloader code will be made publicly available. In particular, videos of the offloading policy show how robot computation can be effectively juxtaposed with cloud computation, as in Fig. 13. Finally, since the offloader has a concise state space and takes a sum of pixel differences rather than full images as input, it is extremely small. Essentially, it is an order of magnitude smaller than even optimized vision DNNs (Table I), allowing it to be run scalably on a robot without interfering with perception or control.

VI Discussion and Conclusions

In this work we have presented a general mathematical formulation of the cloud offloading problem, tailored to robotic systems. Our formulation as a Markov Decision Problem is both general and powerful. We have demonstrated deep reinforcement learning may be used within this framework effectively, outperforming common heuristics. However, we wish to emphasize that RL is likely an effective choice to optimize offloading policies even for modifications of the offloading problem as stated.

Future Work

While there are many theoretical and practical avenues of future work within the cloud robotics setting (and more specifically within offloading), we wish to emphasize two problems that we believe are important for improved performance and adoption of cloud robotics techniques. First, we have characterized the offloading problem as an MDP, in which factors such as latency correspond to costs. However, for safety-critical applications such as self-driving cars, one may want to include hard constraints, such as bounding the distribution of latency times. This approach would fit within the scope of Constrained MDPs [8], which have seen recent research activity within deep reinforcement learning [17, 6].

Secondly, we have dealt with input streams that are independent of our decisions in this work. However, the input streams that robotic systems receive are a consequence of the actions that they take. Therefore, a promising extension to improve performance of cloud robotics systems is considering the offloading problem and network characteristics during action selection (e.g., planning or control). Conceptually, this is related to active perception [10], but also incorporates information about network conditions or input stream staleness.


  • ver [2013] 4g lte speeds vs. your home network., 2013. [Online; accessed 31-Jan.-2019].
  • int [2016] Data is the new oil in the future of automated driving., 2016. [Online; accessed 30-Jan.-2019].
  • ROS [2017] Ros ate my network bandwidth!, 2017. [Online; accessed 31-Jan.-2019].
  • ank [2019] The new robot kickstarter by anki is powered by qualcomm., 2019. [Online; accessed 02-Jan.-2019].
  • Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In Proceedings of the OSDI 2016. Savannah, Georgia, USA, 2016.
  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
  • [7] Rhett Allain. The physics of why bigger drones can fly longer. Wired Magazine.
  • Altman [1999] Eitan Altman. Constrained Markov decision processes. CRC Press, 1999.
  • Amos et al. [2016] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. Openface: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.
  • Bajcsy [1988] Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 1988.
  • Bellman [1957] R. Bellman. A markovian decision process. Technical report, DTIC Document, 1957.
  • Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. International Conference on Machine Learning (ICML), 2015.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv:1606.01540, 2016.
  • Camacho and Alba [2013] Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
  • Chen et al. [2015] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015.
  • Chinchali et al. [2018] Sandeep P. Chinchali, Eyal Cidon, Evgenya Pergament, Tianshu Chu, and Sachin Katti. Neural networks meet physical networks: Distributed inference between edge devices and the cloud. In ACM Workshop on Hot Topics in Networks (HotNets), 2018.
  • Chow et al. [2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. Neural Information Processing Systems (NIPS), 2018.
  • Chowdhery and Chiang [2018] A. Chowdhery and M. Chiang. Model predictive compression for drone video analytics. In 2018 IEEE International Conference on Sensing, Communication and Networking (SECON Workshops), 2018.
  • Forouzan and Fegan [2002] Behrouz A Forouzan and Sophia Chung Fegan. TCP/IP protocol suite. McGraw-Hill Higher Education, 2002.
  • Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, 2017.
  • Goldberg and Kehoe [2013] Ken Goldberg and Ben Kehoe. Cloud robotics and automation: A survey of related work. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2013-5, 2013.
  • Harrison et al. [2018] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online bayesian regression. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2018.
  • Higuera et al. [2012] Juan Camilo Gamboa Higuera, Anqi Xu, Florian Shkurti, and Gregory Dudek. Socially-driven collective path planning for robot missions. IEEE Conference on Computer and Robot Vision, 2012.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Jain et al. [2015] Ashesh Jain, Debarghya Das, Jayesh K Gupta, and Ashutosh Saxena. Planit: A crowdsourcing approach for learning to plan paths from large scale preference feedback. IEEE International Conference on Robotics and Automation (ICRA), 2015.
  • Kalva [2006] Hari Kalva. The h. 264 video coding standard. IEEE multimedia, 13(4):86–90, 2006.
  • Kang et al. [2017a] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. Noscope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 2017a.
  • Kang et al. [2017b] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN Notices, 52(4):615–629, 2017b.
  • Kehoe et al. [2013] Ben Kehoe, Akihiro Matsukawa, Sal Candido, James Kuffner, and Ken Goldberg. Cloud-based robot grasping with the google object recognition engine. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4263–4270. IEEE, 2013.
  • Kehoe et al. [2015] Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Trans. Automation Science and Engineering, 12(2):398–409, 2015.
  • Kuffner [2010] J Kuffner. Cloud-enabled robots. In IEEE-RAS International Conference on Humanoid Robots. Piscataway, NJ: IEEE, 2010.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • Mao et al. [2017] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 197–210. ACM, 2017.
  • Marsh [2018] Sarah Marsh. Amazon alexa crashes after christmas day overload., 2018. [Online; accessed 20-Jan.-2019].
  • [35] Aarian Marshall. Starsky robotics unleashes its truly driverless truck in florida. Wired Magazine.
  • Mitchell [2018] Bradley Mitchell. Learn exactly how fast a wi-fi network can move., 2018. [Online; accessed 31-Jan.-2019].
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • Mohanarajah et al. [2015] Gajamohan Mohanarajah, Vladyslav Usenko, Mayank Singh, Raffaello D’Andrea, and Markus Waibel. Cloud-based collaborative 3d mapping in real-time with low-cost robots. IEEE Transactions on Automation Science and Engineering, 2015.
  • Padhye et al. [1999] Jitendra Padhye, Victor Firoiu, and Don Towsley. A stochastic model of tcp reno congestion avoidance and control. 1999.
  • Pakha et al. [2018] Chrisma Pakha, Aakanksha Chowdhery, and Junchen Jiang. Reinventing video streaming for distributed vision analytics. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 18), Boston, MA, 2018. USENIX Association.
  • Quigley et al. [2009] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.
  • Rahman et al. [2016] Akhlaqur Rahman, Jiong Jin, Antonio Cricenti, Ashfaqur Rahman, and Dong Yuan. A cloud robotics framework of optimal task offloading for smart city applications. IEEE Global Communications Conference (GLOBECOM), 2016.
  • Riazuelo et al. [2014] Luis Riazuelo, Javier Civera, and JM Martínez Montiel. C2tam: A cloud framework for cooperative tracking and mapping. Robotics and Autonomous Systems, 62(4):401–413, 2014.
  • Riiser et al. [2013] Haakon Riiser, Paul Vigmostad, Carsten Griwodz, and Pål Halvorsen. Commute path bandwidth traces from 3g networks: Analysis and applications. In Proceedings of the 4th ACM Multimedia Systems Conference, MMSys ’13, pages 114–118, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1894-5.
  • Salmerón-García et al. [2015] Javier Salmerón-García, Pablo Íñigo-Blasco, Fernando Díaz-del-Río, Daniel Cagigas-Muñiz, et al. A tradeoff analysis of a cloud-based robot navigation assistant using stereo image processing. IEEE Transactions on Automation Science and Engineering, 12(2):444–454, 2015.
  • [47] Mark Sandler and Andrew Howard. Mobilenetv2: The next generation of on-device computer vision networks.
  • Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • Song Han [2016] Zuozhen Liu Song Han, William Shen. Deep drone: Object detection and tracking for smart drones on embedded system. Stanford University CS231a Class Project Report, 2016.
  • Sugiura and Zettsu [2015] Komei Sugiura and Koji Zettsu. Rospeex: A cloud robotics platform for human-robot spoken dialogues. IEEE International Conference on Intelligent Robots and Systems (IROS), 2015.
  • Sünderhauf et al. [2018] Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, et al. The limits and potentials of deep learning for robotics. The International Journal of Robotics Research, 2018.
  • Sutton and Barto [1998] R.S. Sutton and A.G. Barto. Reinforcement learning: an introduction. Neural Networks, IEEE Transactions on, 9(5):1054–1054, 1998.
  • Szepesvári [2010] C. Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.
  • Verma et al. [2013] Lochan Verma, Mohammad Fakharzadeh, and Sunghyun Choi. Wifi on steroids: 802.11 ac and 802.11 ad. IEEE Wireless Communications, 20(6):30–35, 2013.
  • Wan et al. [2016] J. Wan, S. Tang, H. Yan, D. Li, S. Wang, and A. V. Vasilakos. Cloud robotics: Current status and open issues. IEEE Access, 4:2797–2807, 2016. ISSN 2169-3536.
  • Wu et al. [2013] Haiyan Wu, Lei Lou, Chih-Chung Chen, Sandra Hirche, and Kolja Kuhnlenz. Cloud-based networked visual servo control. IEEE Transactions on Industrial Electronics, 2013.
  • Xiangyu et al. [2017] Z Xiangyu, Z Xinyu, L Mengxiao, and S Jian. Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Computer Vision and Pattern Recognition, 2017.
  • [58] Youtube. Youtube: Recommended upload encoding settings.