ACES -- Automatic Configuration of Energy Harvesting Sensors with Reinforcement Learning

09/04/2019 ∙ by Francesco Fraternali, et al. ∙ Carnegie Mellon University ∙ University of California, San Diego

The Internet of Things forms the backbone of modern building applications. Wireless sensors are being increasingly adopted for their flexibility and reduced cost of deployment. However, most wireless sensors are powered by batteries today and large deployments are inhibited by manual battery replacement. Energy harvesting sensors provide an attractive alternative, but they need to provide adequate quality of service to applications given uncertain energy availability. We propose using reinforcement learning to optimize the operation of energy harvesting sensors to maximize sensing quality with available energy. We present our system ACES that uses reinforcement learning for periodic and event-driven sensing indoors with ambient light energy harvesting. Our custom-built board uses a supercapacitor to store energy temporarily, senses light and motion events, and relays them using Bluetooth Low Energy. Using simulations and real deployments, we show that our sensor nodes adapt to their lighting conditions and continuously send measurements and events across nights and weekends. We use deployment data to continually adapt sensing to changing environmental patterns and transfer learning to reduce the training time in real deployments. In our 60 node deployment lasting two weeks, we observe a dead time of 0.1% and a mean sampling period of 90 seconds, and the event sensors that detect motion with PIR captured 86% of the events compared to a battery-powered node.




1. Introduction

Buildings are an essential part of modern society and benefit immensely from Internet of Things (IoT) technologies (Gubbi et al., 2013). Networked sensors are the bedrock of modern building services such as security, fire safety, energy, and lighting. Sensors penetrate the nooks and corners of a modern building to sense temperature, light, smoke, occupancy and energy use; it is common to have hundreds to thousands of sensors in a medium-sized building (Khan and Hornbæk, 2011). While traditional buildings use wired sensors, wireless technology is being increasingly adopted due to lower deployment cost and flexibility of placement (for Demand Response and Efficiency, 2015). Many of the wireless sensors on the market are battery powered (lin, [n. d.]) and manual battery replacement is a key bottleneck that inhibits large scale deployments (Howell, 2017). Energy harvesting sensors provide an attractive alternative, but their design needs to ensure adequate Quality of Service (QoS) given limited and uncertain energy availability. Many innovative battery-free solutions have been proposed in prior works (Campbell et al., 2016; Campbell and Dutta, 2014; Naderiparizi et al., 2015b; Talla et al., 2017; Fraternali et al., 2018a).

Energy harvesting systems have to make a careful tradeoff in sensing, communication, and computation to maximize application utility with available energy (Hsu et al., 2006; Dias et al., 2016). These design tradeoffs change depending on hardware, application requirements and the energy availability in the environment. Prior works either perform manual configuration or use heuristics to identify the operating points (, 2018;, 2018;, 2017;, 2019). For example, EnOcean launched a commercial energy harvesting sensor in June 2019 that uses “a simple user interface consisting of one button and one LED allows for simple configuration without additional tools” (, 2019). Manual configuration does not scale and heuristics do not generalize well to every context. To overcome these limitations, we present our system ACES that uses Reinforcement Learning (RL) (Sutton et al., 1998) to automatically optimize the operation of sensor nodes to maximize QoS under uncertain energy availability.

In RL, an agent interacts with an environment and learns to make optimal decisions with experience. The domain expert identifies the objective (i.e. the reward function in RL terminology) and the inputs (i.e. the state) that affect the decisions (i.e. the actions) of the agent. The agent tries out different actions and learns from the feedback (rewards) received. RL algorithms are good at learning sequences of actions that maximize the long-term expected cumulative reward. Hence, in the case of energy harvesting, the long-term objective ensures that the agent learns patterns in energy availability across days, nights and weekends. Energy availability is not perfectly predictable, especially in indoor conditions. The RL agent learns a strategy that works in expectation and is robust to noise. However, convergence of an RL algorithm to a good policy can be slow and exploratory actions can lead to poor QoS during training. We use real-data-assisted simulations to reduce convergence time and show that the learned policies work demonstrably well in the real world at scale. Using an Intel Core-i7 CPU clocked at 3 GHz, our algorithm converges in <30 minutes of wall-clock time and automatically adapts to changing lighting conditions.

We implemented and evaluated ACES in a real deployment of energy harvesting sensors in our department building. We built a Bluetooth Low Energy (BLE) based energy harvesting sensor platform that uses a solar panel to harvest energy and a super-capacitor to store energy. We use the node to sense light periodically and detect motion events with a PIR sensor. We deployed 5 sensor nodes in different indoor lighting conditions collecting measurements such as light intensity and super capacitor voltage level. Using these data traces, we developed a simulator that models the essential aspects of our sensor-node and the environment. We use the simulator to train the ACES RL agent using the Q-learning algorithm (Watkins, 1989). We performed multiple deployments that traded off learning a one-time policy from historical data versus learning a new policy each day. Our real-world deployments show that ACES effectively learns from observing environmental changes and appropriately configures each mote to maximize sensing quality while avoiding energy storage depletion.

To the best of our knowledge, ours is the first real-world deployment that demonstrates an RL based sensing mechanism for energy harvesting sensors. We used 15 days of light measurements to learn a policy in the simulator. We deployed the nodes with the learned policy for 31 days. The nodes achieved a mean sampling period of 56 seconds, opportunistically collecting up to 1.7x more data when energy was available, and had 0% dead time across nights and weekends, when they sampled sensors at 15 minute intervals. However, one-time learning of a policy requires a data collection phase and can be susceptible to changes in the environment. Hence, we introduce a new training strategy that does not require any historical data. We start with a default periodic policy on the first day and then train a new policy each day based on data collected so far. This ‘day-by-day’ training strategy learns a stable policy within a few days and achieves near zero dead time. To alleviate the initial training phase of a few days, we further reduce training time with transfer learning when multiple nodes are deployed. Finally, we generalize the ACES formulation to event-driven sensing. We deployed 45 nodes with PIR event sensing and periodic light sensing for two weeks. Compared to battery-powered nodes, ACES nodes could detect 86% of the events on average. We expanded our deployment to 15 more nodes with just periodic light sensing. The 60 nodes sent light measurements every 90 seconds on average and had 0.1% dead time across 2 weeks.

2. Related Work

A multitude of innovative energy harvesting solutions have been proposed in the literature (Sudevalayam and Kulkarni, 2011; Ulukus et al., 2015). Many of the proposed sensors use up energy as soon as it becomes available, e.g., harvesting AC power lines for energy metering (DeBruin et al., 2013), an RFID based battery-free camera (Naderiparizi et al., 2015a), and a thermoelectric harvesting based flow sensor (Martin et al., 2012). Backscatter sensors are a special case that eliminates communication-based energy expenditure by using existing RF signals for both communication and harvesting (Kellogg et al., 2014; Ensworth and Reynolds, 2017; Talla et al., 2017). However, sensing whenever energy is available is not ideal for all applications: sensors may miss important events because they did not preserve enough energy, or send too much data when it is not needed (Jayakumar et al., 2014; Lawson and Ramaswamy, 2015). A solar panel temperature sensor should work on nights and weekends even when energy availability is low, and a backscatter-based motion sensor should capture events when there are no RF transmissions. Hence, these sensor nodes need to be carefully configured to increase their quality of service and compete with existing battery-based solutions. Several works are tackling intermittent operation of battery-less energy harvesting systems (Hester and Sorber, 2017; Hester et al., 2017; Maeng et al., 2017; Lucia et al., 2017). With this work we solve an orthogonal problem: we optimize our system to do sensing, computation and communication while managing limited resources.

Reducing manual intervention for sensor configuration is an important task, as underlined by many works (Chi et al., 2014; Udenze and McDonald-Maier, 2009; Moser et al., 2010; Dias et al., 2016; Hsu et al., 2009a; Hsu et al., 2009b; Fraternali et al., 2018a). Adaptive duty cycling of energy harvesting sensors has been used to achieve energy-neutral operation (Kansal et al., 2007; Hsu et al., 2006; Moser et al., 2010; Vigorito et al., 2007): nodes adjust their duty-cycle parameters based on the predicted energy availability to increase lifetime and application performance (Sudevalayam and Kulkarni, 2011). Moser et al. (Moser et al., 2010) adapt parameters of the application itself to maximize the long-term utility based on future energy availability prediction. Reinforcement learning (RL) also predicts energy availability, but unlike the above solutions it also learns an optimal policy that maximizes long-term reward. Hence, it makes better decisions compared to heuristics, as we show in our evaluation.

RL has been identified by many prior works as a way to configure wireless sensors and has shown promising results based on simulations (Yau et al., 2012). Zhu et al. (Hsu et al., 2014) apply RL for sustaining perpetual operation and satisfying the throughput demand requirements for energy harvesting nodes. However, their algorithm is built and tested in an outdoor environment, where sunlight patterns are consistent throughout the day (light is available from sunrise to sunset). We focus on indoor sensing where the daylight patterns are less well-defined and the light availability is affected by human occupancy. Shaswot et al. (Shresthamali et al., 2017) use solar energy harvesting sensor nodes powered by a battery to teach an RL system to achieve energy-neutral operation. They use the SARSA algorithm (Sutton et al., 1998) to study the impact of weather, battery degradation and changes to hardware. But their work is also limited to simulations of outdoor environments.

Simulation results by Dias et al. (Dias et al., 2016) optimize energy efficiency based on data collected during five days by five sensor nodes. However, their reward function does not depend on battery level or energy consumed, and thus does not capture realistic conditions. They also assume a fixed 12 hour period as the time needed by the Q-learning algorithm to learn the action-value function for the rest of the 4.5 day experiment. But a one-time training phase cannot capture all kinds of environmental changes. In this work, we propose periodic training that maximizes system performance even in the presence of environmental changes.

Aoudia et al. present RLMan (Aoudia et al., 2018), which uses the actor-critic algorithm with linear function approximation (Grondman et al., 2012). They use existing indoor and outdoor light measurements for simulations. While they claim their linear function approximations facilitate RL training in a wireless node, they do not actually deploy RL on a real node nor do they report their memory and compute requirements. Our formulation performs RL training on the base station that the sensor node talks to and eliminates the need for putting RL inside the node. We use the Q-learning algorithm; our Q-table is only 30kB in size, and we demonstrate that it can be embedded in the sensor node as well.

Our own prior work (Fraternali et al., 2018b) used the Q-learning algorithm for adapting the sampling rate. The paper focused on methods to scale RL to large deployments and reduce training time. Using simulations, we showed that training a new policy periodically based on real light data improved the adaptability to changing conditions. We also showed that transfer reinforcement learning works well and that it is possible to share the learned policies among sensor nodes that share similar lighting conditions. However, all of our results were based on simulations from 5 sensor nodes in multiple lighting conditions. In this work, we show that these results transfer to the real world with multiple large scale deployments. We have also extended the RL formulation to event-driven sensors and evaluate their performance in real deployments. In addition, we analyze the effect of changing the state and action space and demonstrate the energy neutrality of the learned policies.

To the best of our knowledge, this is the first work that exploits reinforcement learning for adaptive sampling in energy harvesting nodes in a real deployment. The RL design needs to take into account changing environmental conditions, imprecise state estimates and stochastic state transitions. We have formulated our problem and performed simulation modeling to capture realistic conditions that transfer to the real world. In addition, we have extended our problem formulation to accommodate event-driven sensors such as PIR motion detectors and magnetic door sensors. The sensor node needs to preserve enough energy to communicate events as they occur. We show that ACES successfully learns typical event firing patterns and curtails its sensing period to account for the additional energy use.

3. ACES Design and Implementation

3.1. Problem Statement

In buildings, sensors are used for applications such as environment (e.g. air conditioning) control, safety, security or convenience. We broadly categorize the sensing as periodic (e.g. a light intensity sensor) or event driven (e.g. a motion sensor). For periodic sensors, the higher the sampling frequency, the more responsive the control systems and the better the QoS. For event driven sensors, the fewer the missed events, the better the QoS. However, the QoS is extremely poor if the sensors are not operational for hours at a time: control systems lose their feedback loop and event driven applications become non-functional. Hence, our objective is to maximize the sensor sampling rate for periodic sensors and minimize missed events for event-driven sensors while ensuring the energy harvesting node remains alive. The objective is especially suited to solar harvesting sensors, as energy is available during periods of high human activity and systems can operate at a minimal QoS at night.

Nodes can be placed in different locations in a building. Each sensor node will be subject to different lighting patterns that are determined by human behavior (i.e. lights turned on and off) or by natural light whenever the node is close to a natural source of light (i.e. window). Light availability will vary from weekdays to weekends, from winter to summer and with changes in usage patterns, e.g. a conference room vs a lobby. The sensor node needs to adapt itself to these changing conditions so as to maximize the utility to its applications.

3.2. Reinforcement Learning and Q-Learning

In a typical reinforcement learning (RL) problem (Sutton et al., 1998), an agent starts in a state $s_t$ and, by choosing an action $a_t$, receives a reward $r_{t+1}$ and moves to a new state $s_{t+1}$. This process is repeated until a final state is reached. The RL agent's goal is to find the sequence of actions that maximizes the long-term reward. The way the agent chooses actions in each state is called its policy $\pi$:

$$\pi(s): S \rightarrow A \tag{1}$$

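For each given state $s_t$ and action $a_t$, we define a function $Q^{\pi}(s_t, a_t)$ that returns an estimate of the total discounted reward we would achieve by starting at state $s_t$, taking the action $a_t$ and then following a given policy $\pi$ until a final state is reached:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \right] \tag{2}$$

where $\gamma \in [0, 1]$ is called the discount factor. It determines how much the function $Q$ in state $s_t$ depends on future rewards (i.e. how far into the future the agent looks), since each term in the sum diminishes exponentially the further it lies in the future. Equation (2) can be rewritten in a recursive form known as the Bellman equation:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \right], \quad a_{t+1} = \pi(s_{t+1}) \tag{3}$$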

The Q-learning algorithm starts with a randomly initialized Q-value for each state-action pair and an initial state $s_0$. An episode is defined as a sequence of state transitions from the initial state to the terminal state. The algorithm follows an $\epsilon$-greedy policy, where for each state it picks the action that has the maximum Q-value with probability $1 - \epsilon$ and a random action otherwise. The reward obtained by selecting the action is used to update the Q-value with a small learning rate $\alpha$. Under the condition that each of the state-action pairs is visited infinitely often, the Q-learning algorithm is proven to converge to the optimal function $Q^{*}$ (Watkins and Dayan, 1992). The optimal policy takes the action that maximizes $Q^{*}$ in each state. The $\epsilon$-greedy policy is used as an exploration-exploitation trade-off (Sutton et al., 1998), where the occasional random actions encourage the agent to explore the state-action space. It is typical to use a high value of $\epsilon$ at the start of training and gradually reduce it over time to increase exploitation.
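The action selection and Q-value update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `epsilon_greedy` and `q_update`, the dictionary-backed Q-table with zero defaults, and the default values of `alpha` and `gamma` are all chosen for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(q_table, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q_table[(state, a)] for a in range(n_actions)]
    return values.index(max(values))

def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the Bellman target."""
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]

# Hypothetical usage: Q-values default to 0.0 here for reproducibility,
# whereas the text describes a random initialization.
q = defaultdict(float)
q[("s1", 2)] = 10.0
new_value = q_update(q, "s0", 1, reward=1.0, next_state="s1", n_actions=4)
greedy_action = epsilon_greedy(q, "s1", n_actions=4, epsilon=0.0)
```

With `epsilon=0.0` the selection is purely greedy, and with `epsilon=1.0` it is purely random; a training run would move between these two extremes as exploration is reduced.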

Among RL algorithms, we choose Q-learning because it converges to the optimal policy under the conditions above and is easy to use when there is a set of discrete states and actions. Q-learning is an off-policy algorithm, which means we can learn a policy (and a Q function) from historical data when available. By exploiting the off-policy nature of Q-learning, we introduce two additional variants of the algorithm to reduce convergence time: day-by-day learning and transfer learning. In day-by-day learning, we learn a new Q function each day based on data collected in the past day. The Q function gradually improves each day and can accommodate changing environmental conditions. With transfer learning, we use a Q function learned for another sensor node as an initial policy for a new node.

3.3. RL Problem Formulation for Energy Harvesting Sensors

Figure 1 shows the overview of our RL formulation for configuration of energy harvesting sensor nodes. A sensor node collects energy from a solar panel and sends sensor data to a base station. The base station refers to the RL policy and determines the sensing rate based on data provided by the sensor-node.

Figure 1. Reinforcement Learning Communication Process: Block Diagram.

Agent: The agent is the program in the basestation that takes in measurements from the sensor, outputs the sensing rate to use and updates the RL policy based on rewards.

Environment: It is everything outside the agent, which includes sensor nodes, the wireless channel, the lighting conditions and events that trigger the sensors.

| Action Index | Sensing Period [s] |
|---|---|
| 3 | 15 |
| 2 | 60 (1 min) |
| 1 | 300 (5 min) |
| 0 | 900 (15 min) |

Table 1. Sensing Period based on Action Index

State: We use: (i) light intensity, (ii) energy storage level, (iii) weekend/weekdays. We discretize light intensity and energy storage levels to 10 values each. The discretization helps reduce the state space, and hence, decreases the convergence time of the Q-learning algorithm.

Action: Typical commercial IoT devices for buildings use sensing periods on the order of minutes, ranging from tens of seconds to 1 hour (Finnigan et al., 2017; acurite, 18;, 2018). We discretize the sensing periods as reported in Table 1. For periodic sensors, the action selects the sensing period to use, e.g. action 2 corresponds to sending data every minute. For event-driven sensors, we observe that once an event is triggered, a subsequent event within a short amount of time is inconsequential. Hence, we keep the node alive until an event occurs, after which it sends the event packet and sleeps for the period indicated by the action; any events during the sleep time are missed.

State Transitions:

The agent observes state transitions and takes actions, i.e. sends commands to change the sensing rate, every 15 minutes. A small timestep for state transitions increases the communication overhead between the base station and sensor node and a large timestep misses the opportunity to tune the sensing rate in a fine-grained manner. We select a timestep of 15 mins as a design trade-off between these factors. We use 24 hours as our episode, starting from the moment the node is turned on. We will consider dynamic transitions based on sensing period in future work.

Reward Function: The goal of the system is twofold:

  • Maximize sensing of periodic sensors. Minimize missed events for event-driven sensors.

  • Minimize dead time. If the sensor node dies, it enters a cold start phase and does not send data for a few hours.

The reward function needs to trade off maximizing sensing, which consumes energy, against penalizing dead time. We assign the rewards as follows:

  • Reward = Action index (i.e. 0, 1, 2 or 3)

  • Reward = -300 if energy storage level reaches 0.

The penalty should be such that the benefit of maximizing sensing does not outweigh the detriment of dead time lasting several hours. With the above reward function, the maximum total reward in 24 hours will be < 300 (24 * 4 * 3 = 288).
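The reward rules and the bound on the daily reward can be checked with a short Python sketch. The constant and function names below are hypothetical; the values come from Table 1 and the reward definition above, and the sketch assumes the storage level is the discretized value in which 0 means depleted.

```python
# Action index -> sensing period in seconds (Table 1).
ACTION_PERIOD_S = {3: 15, 2: 60, 1: 300, 0: 900}
DEAD_PENALTY = -300          # reward when the energy storage level reaches 0
STEP_S = 15 * 60             # the agent acts every 15 minutes
EPISODE_S = 24 * 60 * 60     # one episode lasts 24 hours

def reward(action_index, storage_level):
    """Return the action index, or the penalty when storage is depleted."""
    if storage_level <= 0:
        return DEAD_PENALTY
    return action_index

steps_per_episode = EPISODE_S // STEP_S
# Best case: the fastest action (index 3) is rewarded at every step.
max_episode_reward = steps_per_episode * max(ACTION_PERIOD_S)
```

Because the best possible daily reward (288) is smaller in magnitude than the penalty (300), a single depletion costs more than a full day of maximum-rate sensing can earn.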

The general principle here is to identify the objective function of the problem and formulate it as a reward function. Any constraint, such as not depleting the energy in storage, can be added as a penalty. While we use a heuristic to identify the penalty coefficient, one can also identify the coefficient automatically using Lagrangian relaxation (Bohez et al., 2019). Q-learning requires the use of discrete states and actions, and the range of actions can be decided based on the desired application requirements. Fine-grained discretization can lead to a better policy, but increases convergence time. We study the design trade-off between different discretizations in Section 4.2.2. Our problem formulation can be easily adapted to different sensors, energy harvesters and applications.

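1:  Initialize $Q$ as an empty set
2:  Initialize control action $a$, state $s$ and $s'$
3:  $\epsilon = 1$
4:  $s \leftarrow$ Sense Environment
5:  while time passed $<$ 24 hours do
6:     $a \leftarrow$ $\epsilon$-greedy action for state $s$
7:     wait 15 mins
8:     $s' \leftarrow$ Sense Environment
9:     $r$ = reward($s$, $a$, $s'$)
10:    $q_{predict} = Q(s, a)$
11:    $q_{target} = r + \gamma \max_{a'} Q(s', a')$
12:    $Q(s, a) \leftarrow Q(s, a) + \alpha (q_{target} - q_{predict})$
13:    $\epsilon \leftarrow \max(\epsilon - \Delta\epsilon, \epsilon_{min})$
14:    $s \leftarrow s'$
15:  end while
Algorithm 1 RL Algorithm for Energy Harvesting Sensors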

3.4. RL Algorithm

Algorithm 1 details the use of Q-learning in ACES. After initialization (lines 1-3), the agent receives the node state - light, energy storage voltage and current action index (line 4). The algorithm starts the episode and selects an action following the $\epsilon$-greedy policy (line 6). Because $\epsilon$ is initialized to 1, the algorithm selects mostly random actions in the first phase of learning. At each time step (15 mins), $\epsilon$ is decreased by $\Delta\epsilon$ (line 13), causing the policy to select more exploitative actions over time.

The sensor node updates its sensing period and sends its state again after 15 minutes (lines 7-8). We calculate the reward with the current state, action and next state (line 9). The algorithm calculates the difference between $q_{target}$ and $q_{predict}$, where $q_{predict}$ is the value in the Q-table for the state-action pair and $q_{target}$ is the reward obtained by taking the action plus the discounted rewards obtained by picking the actions with the highest Q-value until the end of the episode. Their difference is scaled by a learning factor $\alpha$ and added to the Q-value of the corresponding state-action pair. Finally, on line 14 we save the value of the next state to the current state and repeat the while loop until the end of the episode is reached. We assume our system has reached convergence when the mean of all the values in the Q-table changes by no more than 5%.
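The episode loop described above can be sketched in runnable Python. This is an illustrative sketch, not the ACES code: `ToyNode` and `toy_reward` are hypothetical stand-ins for the simulator and reward function, the `sense` callback interface is an assumption, and the lower bound `EPS_MIN` on exploration is assumed from the hyper-parameter table.

```python
import random

N_ACTIONS = 4
GAMMA, ALPHA = 0.99, 0.1                        # discount factor, learning rate
EPS_MAX, EPS_MIN, EPS_STEP = 1.0, 0.1, 0.0004   # exploration schedule
STEPS_PER_EPISODE = 24 * 4                      # 24 hours of 15-minute steps

def run_episode(q, sense, reward_fn, epsilon, rng):
    """Run one 24-hour episode and return the decayed epsilon.

    `sense(action)` is a hypothetical callback: it applies the action for one
    15-minute step and returns the next state (None requests the first state).
    """
    state = sense(None)
    for _ in range(STEPS_PER_EPISODE):
        row = q.setdefault(state, [0.0] * N_ACTIONS)
        # Epsilon-greedy selection: explore with probability epsilon.
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = row.index(max(row))
        next_state = sense(action)
        r = reward_fn(state, action, next_state)
        # Temporal-difference update toward reward plus discounted best value.
        q_predict = row[action]
        next_row = q.setdefault(next_state, [0.0] * N_ACTIONS)
        q_target = r + GAMMA * max(next_row)
        row[action] += ALPHA * (q_target - q_predict)
        epsilon = max(EPS_MIN, epsilon - EPS_STEP)
        state = next_state
    return epsilon

class ToyNode:
    """Hypothetical environment whose only state is a storage level 0..10."""
    def __init__(self):
        self.level = 10
    def sense(self, action):
        if action is None:
            self.level = 10
        else:
            # Faster sensing drains storage; the slowest rate lets it recover.
            self.level = max(0, min(10, self.level + 1 - action))
        return self.level

def toy_reward(state, action, next_state):
    return -300 if next_state == 0 else action

rng = random.Random(0)
q, node, eps = {}, ToyNode(), EPS_MAX
for _ in range(50):
    eps = run_episode(q, node.sense, toy_reward, eps, rng)
```

In the toy environment the Q-table is a dictionary keyed by state, so only visited states are stored; a fixed-size array indexed by the discretized state would work equally well.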

We progressively improved our policy training strategy to reduce time to convergence.

One-Time Learning: We train a policy in a simulator based on sensor measurements collected for 15 days. This is a typical setting used in prior works (Aoudia et al., 2018).

Day-by-Day Learning: We train a policy every day based on the sensor measurements collected in the past 24 hours. Each day’s training starts from the policy learned in the previous day. The policy converges within a few days to achieve energy neutral operation. The daily training also adjusts the policy to changing environment conditions (non-stationary environments). This is called Batch RL (Kalyanakrishnan and Stone, 2007).

Transfer Learning: Instead of learning a policy from scratch, the initial policy is borrowed from another node’s converged policy. We achieve 0% dead time from the first day of deployment with transfer learning. We continue to use the day-by-day learning procedure for iterative improvement of the policy.
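The three training strategies differ mainly in how the Q-table is initialized and how often it is retrained, which the following Python sketch tries to make concrete. All names here are hypothetical: `train_fn` stands for any routine that replays one day of measurements in the simulator and returns an updated table, and the list-of-lists Q-table layout is an assumption.

```python
import copy

def init_q_table(n_states, n_actions, donor=None):
    """Start from a donor node's Q-table (transfer) or from zeros (scratch)."""
    if donor is not None:
        return copy.deepcopy(donor)  # never mutate the donor's policy
    return [[0.0] * n_actions for _ in range(n_states)]

def day_by_day(q_table, daily_traces, train_fn):
    """Retrain once per day, warm-starting from the previous day's table."""
    for trace in daily_traces:
        q_table = train_fn(q_table, trace)
    return q_table

# Hypothetical trainer used only to exercise the control flow: it adds the
# number of samples in the day's trace to every Q-value.
def dummy_train(q_table, trace):
    return [[value + len(trace) for value in row] for row in q_table]

donor = [[1.0, 2.0], [3.0, 4.0]]
q_new = init_q_table(2, 2, donor=donor)
q_new[0][0] = 99.0
q_after = day_by_day(init_q_table(2, 2), [[1, 2, 3], [4]], dummy_train)
```

One-time learning corresponds to a single call of the trainer on the full 15-day trace, while transfer learning followed by day-by-day learning corresponds to passing a donor table to `init_q_table` and then calling `day_by_day`.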

Table 2 reports the hyper-parameters used for the Q-learning algorithm for our simulations.

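| Hyper-Parameter | Value |
|---|---|
| Reward decay ($\gamma$) | 0.99 |
| Epsilon max ($\epsilon_{max}$) | 1 |
| Epsilon min ($\epsilon_{min}$) | 0.1 |
| Epsilon decrement ($\Delta\epsilon$) | 0.0004 |
| Learning rate ($\alpha$) | 0.1 |

Table 2. Q-Learning Hyper-parameters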

3.5. Hardware and Communication Process

3.5.1. Sensor Node:

We developed a general purpose energy harvesting battery-less sensor (Figure 2) (Fraternali et al., 2018a). We use a solar panel as our energy harvester and store energy in a super-capacitor. Once the voltage reaches a usable level, an energy management board powers the micro-controller (MCU) to start its operations. We use a 1F super-capacitor with a 5.5V nominal voltage. The MCU uses a high resistance voltage divider (10MOhm) to minimize leakage current. We use BLE for communication and the node has light and PIR sensors. The sensor node achieves up to a week of lifetime without light when it sends one sensor measurement every 10 minutes.

Figure 2. General Energy Harvesting Sensor-Node.

3.5.2. Wireless Sensor Network Architecture:

The base station uses BLE gatttool functions to exchange data with the sensor nodes deployed around the building. As soon as a sensor node wakes up, it starts advertising. The base station reads the advertisement, connects to the sensor node and exchanges data. During the connection, the base station communicates the next action to the sensor node while the sensor node communicates the sensor values it has read (light, temperature, PIR) to the base station. The base station stores the data and sends it to the cloud for post-processing using a Wi-Fi connection. All the RL learning is done on the cloud side and the sensor nodes ‘execute’ the actions decided. In our system, the base station is composed of a Raspberry Pi equipped with a BLE USB dongle. In our deployment we connected up to 15 nodes to a single base station. To facilitate large deployments, the nodes do not remain connected to the base station as in a typical Bluetooth connection; instead, they always disconnect and reconnect using the advertisement. In this way, we are not limited by the number of simultaneous BLE connections.

4. Simulations

We use the simulator to speed up the RL training since it can take thousands of episodes to converge without it. Our objective is to model the environment to sufficiently capture real-world characteristics, while keeping its complexity low to allow for fast simulations.

4.1. Modeling the Simulation Environment

When the agent acts on the environment, the simulation needs to respond with the next state and reward defined in Section 3.3. To create a simulation environment, we need to identify how much energy will be consumed and harvested under a given environment condition and sensing quality. We start with modeling the sensor platform, where we need to identify the energy consumed in different modes of operation. The energy consumption can be calculated from the datasheets of the individual components used or by direct measurements in different operating modes (Wang et al., 2006). We use a combination of both. For more complex platforms where individual component analysis is not feasible, we can fit a model based on power consumption in various modes of operation (Rivoire et al., 2008). The energy gained is a function of both the efficiency of the harvester module and the energy available in the environment. We use raw light measurements collected by sensors in different settings to capture environment characteristics.

4.1.1. Current Measurements:

We take the figures for the solar panel and the PIR from their respective data-sheets and measure the current consumption of the other components to increase the quality of our simulator. We measure the current consumed by using the National Instruments USB-6210 with MATLAB (16-bit datum per minute). Table 3 shows the current of the sensor node's main components when using a super-capacitor charged at 3V. For the sensing operations (e.g. light intensity measurements), the current includes the sensor reading and transmission of the data using BLE.

| Feature | Current [μA] |
|---|---|
| Board Leakage + MCU in Sleep Mode | 3.5 |
| Read Light Sensor + BLE Transmission | 199 |
| Board + PIR and MCU in Sleep Mode | 4.5 |
| PIR Detection | 102 |
| Solar Panel at 200 lux | 31 |
| Solar Panel at 50 lux | 7.75 |

Table 3. Sensor-Node Current Features

4.1.2. Light Measurements:

We placed a node in different types of locations and measured light intensity and supercapacitor voltage at 5 minute intervals for 15 days. The locations are: (i) a windowless Conference Room where light is On only when people occupy the room; (ii) a Staircase where internal lights are always On for security reasons; (iii) the Middle of an Office room, mainly subject to internal lights; (iv) a node subject to natural light from a Window; and (v) a node placed close to the Door of an office room where light intensity is low.

4.1.3. Modeling the Energy Storage Level:

The super-capacitor accumulates energy when light is available and depletes energy with node operations.

Energy Produced:

$$E_{prod} = P_{lux} \cdot L \cdot \Delta t \tag{4}$$

where $P_{lux}$ is the power generated by the solar panel per lux, $L$ is the light intensity in lux and $\Delta t$ is the elapsed time. From the solar panel data-sheet (CO, 2007), the power generation per lux of light is 0.23 μW/lux. For example, the energy produced with 200 lux of light for 600 seconds is 27.6 mJ. As we measure light intensity only once per 5 minutes, we miss light fluctuation events. We ignore solar panel inclination, the wavelength of light and reflection of light to keep the model simple. Hence, the energy estimated is an approximation. However, we show that the model is sufficient to learn a policy that works in the real world.
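The worked example above can be reproduced with a short Python sketch. It assumes the data-sheet figure is in microwatts per lux, which is the unit that yields the quoted 27.6 mJ, and that light stays constant over the interval; the names `PANEL_UW_PER_LUX` and `energy_produced_mj` are hypothetical.

```python
PANEL_UW_PER_LUX = 0.23  # solar panel output per lux, in microwatts (assumed unit)

def energy_produced_mj(light_lux, duration_s):
    """Harvested energy in millijoules for constant light over duration_s."""
    power_uw = PANEL_UW_PER_LUX * light_lux
    return power_uw * duration_s / 1000.0  # microjoules -> millijoules

# 200 lux held for 600 seconds, as in the example in the text.
example_mj = energy_produced_mj(200, 600)
```

Since light is sampled only once every 5 minutes, a simulator built on this function would apply it per sampling interval with the most recent lux reading.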

Energy Consumed:

$$E_{cons} = E_{sense} + E_{sleep} \tag{5}$$

$E_{sense}$ is the energy consumed to read and transmit a sensor data packet, and $E_{sleep}$ is the energy consumed by the sensor node in between two transmissions.
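To show how produced and consumed energy could be turned into a super-capacitor voltage, here is a minimal Python sketch. It assumes an ideal capacitor storing $E = \frac{1}{2} C V^2$ with the 1F capacitance and 5.5V maximum given in the text; the function names, the split of consumption into per-packet energy plus sleep power, and all numeric arguments in the usage lines are hypothetical and not taken from the paper.

```python
import math

CAPACITANCE_F = 1.0   # 1 F super-capacitor
V_MAX = 5.5           # maximum super-capacitor voltage in volts

def stored_energy_j(voltage):
    """Energy held by an ideal capacitor: E = 1/2 * C * V^2."""
    return 0.5 * CAPACITANCE_F * voltage ** 2

def energy_consumed_j(n_packets, e_sense_j, p_sleep_w, duration_s):
    """Hypothetical split: per-packet energy plus sleep power over time."""
    return n_packets * e_sense_j + p_sleep_w * duration_s

def next_voltage(voltage, produced_j, consumed_j):
    """Apply the energy balance for one interval and clamp to [0, V_MAX]."""
    energy = stored_energy_j(voltage) + produced_j - consumed_j
    energy = max(0.0, energy)
    return min(V_MAX, math.sqrt(2.0 * energy / CAPACITANCE_F))

# Illustrative numbers only: 2 packets of 5 mJ and 10 uW of sleep power
# over a 600 s interval, then a 2.5 J drain from a 3 V starting point.
used_j = energy_consumed_j(2, 0.005, 1e-5, 600)
v_after = next_voltage(3.0, 0.0, 2.5)
```

Working in energy rather than voltage keeps the balance linear; the square root is only applied when the simulator needs the voltage for the state.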

4.1.4. Validation of Modeling

We validate our discharge model when no light is available. Figure 3 compares the lifetime of the sensor-nodes in simulations (black) and real data (red) using different sensing rates. The two trends capture the high-level characteristics well, but show small differences due to real-world events that are difficult to model. The sensor-node lasts up to 9 hours by collecting data every 15 seconds, up to 34 hours at 1-minute sending rate, and up to 6.25 days with 10-minute sending-rate. Hence, the RL agent needs to take the right action based on the current environment state to maximize the sensing rate while avoiding energy depletion. We validate our charging model with current measurements in different light conditions as shown in Table 3.

Figure 3. Super-Capacitor Discharge Comparison between our Simulator and Real-World Using Different Sensing Rates

4.1.5. RL Environment Setup

The simulator interacts with the RL agent as explained in Algorithm 1. Equations 4 and 5 keep track of the energy storage voltage level, and the light intensity is taken from real-world measurements. We model the environment state as follows:

Light intensity: We normalize the light intensity to a range of 0 to 10, where 0 represents no light and 10 represents 2000 lux or above. We select 2000 lux as the maximum value after checking typical indoor light intensity in buildings.

Energy storage level: We scale the energy storage voltage from 0 (min voltage 2.1V) to 10 (max voltage 5.5V).

Weekend/Weekday: Buildings indoor lights patterns are strongly dependent on the presence of people (Campbell et al., 2016). Hence, we consider a binary state to capture weekdays and weekends.
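The three state variables above can be sketched as a small encoding function. This is an illustration under stated assumptions: the text gives the ranges (0 to 2000 lux, 2.1 V to 5.5 V) but not the rounding rule, so linear scaling with rounding to the nearest integer level is assumed here, as is encoding a weekend as 1. All names are hypothetical.

```python
def scale_to_level(value: float, low: float, high: float, levels: int = 10) -> int:
    """Clamp value to [low, high] and map it linearly onto integer levels 0..levels."""
    clamped = min(max(value, low), high)
    # Assumption: round to the nearest level (the text does not state the rule).
    return int(round((clamped - low) / (high - low) * levels))


def encode_state(lux: float, sc_volts: float, is_weekend: bool) -> tuple:
    """Build the (light level, storage level, weekend flag) state tuple."""
    light_level = scale_to_level(lux, 0.0, 2000.0)   # 10 means 2000 lux or above
    sc_level = scale_to_level(sc_volts, 2.1, 5.5)    # 0 is 2.1 V, 10 is 5.5 V
    return (light_level, sc_level, int(is_weekend))


# 200 lux maps to light level 1 and a full super-capacitor to level 10.
print(encode_state(200, 5.5, False))
```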

Once the super-capacitor voltage drops below 2.1V, the node terminates all its operations and energy recovery can take hours. To avoid long communication gaps between the sensor-node and the base station, we penalize the RL agent with a high negative reward (-300) when the super-capacitor voltage is below 3V.
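A minimal sketch of a reward function consistent with this description follows. The reward otherwise being equal to the selected action index follows the statement in Section 4.2.2; whether the -300 penalty replaces that reward or is added to it is not stated, and this sketch assumes it replaces it. The names are illustrative.

```python
LOW_VOLTAGE_THRESHOLD_V = 3.0   # from the text: penalize below 3 V
LOW_VOLTAGE_PENALTY = -300      # from the text


def reward(action_index: int, sc_volts: float) -> int:
    """Return the per-step reward for the chosen action and resulting voltage."""
    if sc_volts < LOW_VOLTAGE_THRESHOLD_V:
        # Assumption: the penalty replaces the action reward for this step.
        return LOW_VOLTAGE_PENALTY
    # Higher action indices correspond to higher sensing rates.
    return action_index


print(reward(3, 4.2), reward(3, 2.9))
```

Placing the penalty threshold at 3 V, above the 2.1 V cut-off, gives the agent a margin in which to slow down before the node actually shuts off.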

4.2. Simulation Results

We present results of policies learned with our 15-day light measurement data. We run 5 different simulations, one for each lighting condition. Figure 4 shows the results of the simulation.

Figure 4. Simulation Result on Different Light Conditions

From Figure 4-left for the Conference Room, we see that ACES uses the maximum sensing-rate (action 3) when lights are on and the energy storage is almost full (SC voltage level is 9), but as soon as lights turn off, it reduces the sensing-rate (action 0) to save energy. The conference room has no windows, has long periods of time with lights off and hence, light patterns are sporadic. The system learns that to avoid energy storage depletion, it is better to save energy as soon as lights turn off. When the lights turn on again and the energy levels are not full, ACES switches between action 2 and 3 to allow the energy-storage to recover to full charge.

From Figure 4-center for the Stairs Access, we notice that light is always on at level 1 (200 lux) due to security reasons, but the intensity is not enough to keep the super-capacitor charged. Hence, ACES uses a low sensing-rate that allows slow charging of the super-capacitor over time. When the voltage level reaches the maximum (level 10), ACES uses a higher sensing-rate (action 2 or 3). That drops the super-capacitor voltage and forces ACES to use lower actions again.

From Figure 4-right for the Window, we notice that ACES uses the highest sensing rate (action 3) when the light is on and energy-storage is full, but it starts using lower actions as soon as lights go off and the super-capacitor reduces its voltage. But compared to the conference room case, the sensing-rate reduction is gradual as ACES learns light patterns from sunset to sunrise. It switches to higher sensing-rate even when there are no lights, forecasting that the light will become available in the next few time-steps.

4.2.1. Convergence Time

For each simulation, we collect the total reward obtained at the end of each episode and average them to show the convergence of the algorithm over time. In Figure 5, we show an example of the average reward convergence while running a simulation using ε equal to 0.1 in the Window location. The convergence happens around 12500 episodes. The entire simulation takes about 30 minutes of wall-clock time using an Intel Core-i7 CPU with our Python implementation.

Figure 5. Average Reward Simulation Results for Window using ε equal to 0.1. It takes less than 30 minutes for our algorithm to converge on a typical Intel Core-i7 CPU.

4.2.2. Input/Action Space Analysis

The dimension of the Q-Table is the product of the sizes of the input state space and the action space. Increasing the number of states (i) lets us represent the input and output variables more accurately, but (ii) requires more iterations for the Q-learning algorithm to converge to an optimal solution. To better understand how this trade-off behaves in our system, Tables 4 and 5 show the rewards achieved while selecting different inputs and actions, respectively.
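To make the size trade-off concrete, the following sketch counts Q-Table entries for a few input choices. It assumes 11 levels (0 to 10) for light and super-capacitor voltage, 2 values for weekday/weekend and 24 hourly values for time; these counts describe the full grid of level combinations and are therefore an upper bound, since the states actually visited at a given location can be far fewer.

```python
from math import prod


def q_table_entries(state_levels, n_actions):
    """Number of Q-values when every combination of state levels is stored."""
    return prod(state_levels) * n_actions


N_ACTIONS = 4
sizes = {
    "SC": q_table_entries([11], N_ACTIONS),
    "SC-Light-Week": q_table_entries([11, 11, 2], N_ACTIONS),
    "SC-Light-Week-Time": q_table_entries([11, 11, 2, 24], N_ACTIONS),
}
for name, entries in sizes.items():
    print(name, entries)
```

Under these assumptions, adding the hour of the day multiplies the table by 24, which illustrates why convergence becomes much slower for the largest input.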

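| Reward | Door | Stair | Middle | Conf | Window | Avg |
|---|---|---|---|---|---|---|
| SC | 392 | 723 | 322 | 551 | 1310 | 670 |
| SC-Week | 179 | 723 | 322 | 353 | 1510 | 617 |
| SC-Light | 135 | 699 | 527 | 611 | 1270 | 648 |
| SC-Light-Week | 161 | 721 | 832 | 1040 | 1413 | 833 |
| SC-Light-Week-Time | 319 | 722 | 0 | 0 | 1304 | 0 |

SC = Super Capacitor Voltage; Time = hours of the day
Table 4. Input Space Analysis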

Table 4 reports the reward achieved by ACES when the input uses (i) only the super-capacitor voltage (i.e. SC); (ii) the SC and the Light measurement; (iii) the SC and weekday/weekend (i.e. Week); (iv) SC, Light and Week, as in ACES; and (v) SC, Light, Week and Time of the day expressed in hours. For these simulations we use an action space of size 4. Using only the super-capacitor voltage as the state gives high reward in places where light intensity is low (Door, Stair). Introducing Light as an input state increases rewards where light has high variability throughout the day (Conference and Window). Averaged over the five lighting conditions, the combination of SC, Light and Week yields the highest reward. The increase in rewards comes at the cost of an input space that is, on average, 8 times larger than when using only the SC. Using SC-Light-Week-Time as the input leads the system to achieve negative reward in 2 places. This is because the input state space is very large, and hence ACES could not converge to a good solution within the 24 hours for which we ran the simulator. To accommodate such larger state spaces, we would need to use function approximation algorithms such as Deep Q-Networks (Mnih et al., 2015).

Table 5 shows the effect of changing the action space. Since the reward collected is equal to the action selected by the system at each step, for this experiment we normalized all the rewards to the range 0 to 3 (i.e., as if there were 4 actions). We use SC-Light-Week as the input state. Increasing the number of actions increases the final reward on average. We use 4 actions in our real-world experiments to keep the Q-Table small.

| Reward | Door | Stair | Middle | Conf | Window | Avg |
|---|---|---|---|---|---|---|
| Action = 2 | 607 | 147 | 81 | 240 | 1359 | 486 |
| Action = 4 | 161 | 721 | 832 | 1040 | 1413 | 833 |
| Action = 8 | 435 | 1097 | 649 | 1235 | 1923 | 1068 |

Table 5. Action Space Analysis

4.2.3. Event-Driven Applications

The PIR sensor increases the sleep current of the node by 1 µA; each event consumes an additional 102 µA (Table 3) and lasts for 2.5 seconds. ACES needs to account for this additional energy use when selecting the sensing rate. We simulate stochastic events in the environment with an average of 50 events per day during weekdays and 20 events per day during the weekends, a conservative estimate of daily events as reported by Agarwal et al. (Agarwal et al., 2010).
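The extra load introduced by the PIR sensor can be estimated with a short calculation. This sketch assumes the currents are in microamperes (1 µA of added sleep current and 102 µA drawn for 2.5 seconds per event) and uses the average daily event counts quoted above; the function and constant names are illustrative.

```python
SLEEP_CURRENT_INCREASE_UA = 1.0    # assumed unit: microamperes
EVENT_CURRENT_UA = 102.0           # assumed unit: microamperes
EVENT_DURATION_S = 2.5
SECONDS_PER_DAY = 24 * 60 * 60


def pir_daily_charge_uc(events_per_day: float) -> float:
    """Extra charge (in microcoulombs) drawn per day by the PIR sensor."""
    standby = SLEEP_CURRENT_INCREASE_UA * SECONDS_PER_DAY
    events = EVENT_CURRENT_UA * EVENT_DURATION_S * events_per_day
    return standby + events


print(pir_daily_charge_uc(50))   # weekday average
print(pir_daily_charge_uc(20))   # weekend average
```

Under these assumptions the always-on standby current dominates the per-event cost, so the PIR overhead changes only modestly between weekdays and weekends.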

| Node Placement | Periodic-Sense [data sent] | Event-Driven [data sent] | Percentage [%] |
|---|---|---|---|
| Conference Room | 33117 | 32613 | 98 |
| Window | 43411 | 41572 | 96 |
| Middle | 9561 | 9352 | 97 |
| Door | 10207 | 7465 | 72 |
| Stairs Access | 15879 | 13647 | 85 |

Table 6. ACES comparison between Periodic Sensing Applications and Event-Driven Applications

Table 6 compares the data packets sent by the final policy with and without the PIR sensor. In all the scenarios, ACES successfully accounts for the additional energy expenditure from event-driven sensors and continues to maintain perpetual operation. When abundant light is available throughout the day (i.e. window, conference room and center of office), the number of data packets sent is similar to the baseline periodic sensing (96% to 98%). However, at the stairs access and the door, light availability is low, and hence the number of data samples drops to 85% and 72% respectively.

5. Real-World Experimental Results

5.1. One-Time Learning

We learn policies using the simulator with 15 days of light data and deploy the resulting Q-Tables on real sensor nodes in the respective locations. We continue to update the policies based on real-world data. The Q-learning updates use the same parameters reported in Table 2 except for ε, which is set to 0.9. Thus, the agent takes the majority of its actions based on the Q-values learned in the simulator, but occasionally takes random actions to learn changing patterns. If the agent encounters a state not listed in the Q-table, it takes a random action.
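The action-selection rule described above, including the fallback for unseen states, can be sketched as follows. The Q-table is assumed to be a dictionary mapping a state tuple to a list of Q-values, and `explore_prob` is a hypothetical parameter giving the probability of a random action; how the exploration parameter of Table 2 maps onto this probability is not specified here.

```python
import random


def select_action(q_table, state, n_actions, explore_prob, rng=random):
    """Pick an action: random if the state is unseen or when exploring, else greedy."""
    if state not in q_table or rng.random() < explore_prob:
        # Unseen state or exploration step: fall back to a uniformly random action.
        return rng.randrange(n_actions)
    values = q_table[state]
    # Greedy step: index of the largest Q-value for this state.
    return max(range(n_actions), key=lambda a: values[a])


q_table = {(1, 10, 0): [0.0, 0.5, 2.0, 1.0]}
print(select_action(q_table, (1, 10, 0), 4, explore_prob=0.0))
```

With `explore_prob` set to zero the rule is purely greedy for known states, which corresponds to running a deployed policy without exploration.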

Figure 6 shows the results of three of the five nodes. The behavior in the real deployment is very similar to the simulation results. There are several outliers in the sequence of actions due to ε-greedy exploration. In the Door location (Figure 6-left), light intensity is low during the day (it reaches level 3 at most) and ACES gradually selects higher actions when enough light is available. For the Stair Access case (Figure 6-middle), the light availability is low, hence ACES picks lower sensing rate actions (i.e. 0 or 1).

Figure 6. ACES Real-World Results in Different Lighting Conditions. As a first experiment we use ε equal to 0.9 to let the system explore in the real world. This is why we can see several outliers in the middle of a constant sequence of actions.

For the Window location (Figure 6-right), ACES uses the highest sensing rate (action 3) during the day, and decreases the action to 2 when the light goes down. However, the action is never lowered to action index 1 as observed in simulations because the energy-storage level never drops below 6. Upon further investigation, we found that the communication between the sensor node and the Base Station took more time on average than in simulation. Hence, data is exchanged less frequently (every 17 seconds instead of 15) and the power used is lower as well. The policy automatically adjusted to this difference in environment to maximize its rewards.

The Center of Office results are similar to the Door node results since the two locations have similar light patterns. The Conference Room node also performs close to the simulation results.

5.1.1. Comparison with Fixed Periodic Sensing

| Node Placement | Battery-Power [data sent] | ACES [data sent] | Percent [%] | Dead Time (%) |
|---|---|---|---|---|
| Conference Room | 48936 | 47146 | 96 | 0 |
| Window | 51856 | 88014 | 170 | 0 |
| Center of Office | 47610 | 34743 | 73 | 0 |
| Door | 41760 | 18374 | 44 | 0 |
| Stairs Access | 48924 | 23788 | 49 | 0 |

Table 7. Placement-QoS comparison between a battery-powered system and ACES

Table 7 compares the data sent by ACES with that of a battery-powered node using a fixed sensing period of 1 minute, as commonly used in buildings. ACES sends more data packets than the fixed sensing period when there is a consistent amount of light, as the agent learns the daylight patterns and uses a sensing period of 15 seconds when light is available (Window case). The number of data packets sent is similar (96%) for the node placed in the conference room. This value is closely related to the amount of light available and the presence of people in the environment. Percentages are lower for Center of Office, Door and Stairs Access as less light is available. Even for these locations, sensor nodes send roughly half as many packets as the one-minute baseline (Stairs Access: 49%, Door: 44%). All 5 nodes avoided energy depletion and had 0% dead time throughout the 31-day real-world experiment.

5.2. Day-by-Day Learning

Instead of learning from the simulator just once with 15 days of data, we switch to learning a new policy from the simulator every day. In our One-Time Learning deployment, we observed that letting the RL agent explore new sequences of actions can decrease the quality of service of applications whenever it takes a random action. Hence, it is better to avoid exploration while running the sensor node in the real world.

On the first day, we start with a fixed policy of collecting data every 15 minutes. At the end of the day, the collected data is used to learn a new Q-table in the simulator by running Algorithm 1 with the hyper-parameters reported in Table 2. The learned Q-Table is used to collect data the next day with ε set to 0. From the second day, the training starts from the previous day's Q-table, which summarizes the learning until that day. Thus, the Q-Table is updated day-by-day from the first day of the deployment.
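The warm-start structure of day-by-day learning can be illustrated with a simplified sketch. It is not Algorithm 1: the simulator rollout is replaced here by a hypothetical list of precomputed (state, action, reward, next state) transitions per day, and the learning rate and discount factor are placeholder values rather than those of Table 2. What the sketch shows is that each day's training continues from the previous day's table instead of starting from scratch.

```python
import copy


def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place (alpha, gamma are placeholders)."""
    row = q_table.setdefault(state, [0.0] * n_actions)
    next_row = q_table.get(next_state, [0.0] * n_actions)
    target = reward + gamma * max(next_row)
    row[action] += alpha * (target - row[action])


def train_day_by_day(daily_transitions, n_actions=4):
    """Train on each day's data, warm-starting from the previous day's Q-table."""
    q_table = {}      # day one starts from an empty table
    snapshots = []    # policy that would be deployed after each day
    for transitions in daily_transitions:
        for state, action, reward, next_state in transitions:
            q_update(q_table, state, action, reward, next_state, n_actions)
        snapshots.append(copy.deepcopy(q_table))
    return snapshots


day = [((0,), 1, 1.0, (0,))]
snapshots = train_day_by_day([day, day])
print(snapshots[0][(0,)], snapshots[1][(0,)])
```

Each snapshot corresponds to the table that would drive the node on the following day with exploration disabled.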

Figure 7. ACES Real-World Results in Different Lighting Conditions using Day-by-Day learning.

Figure 7 shows the rewards obtained by the five nodes placed in the five different lighting conditions for 15 days. The figure cuts off at a value of -10, which indicates that a node died because of energy storage depletion. 3 out of the 5 nodes deployed (i.e. Middle Office, Window and Stair Access) had 0% dead time, while two nodes (i.e. Conference Room and Door) depleted their energy in the first days of the experiment. The lighting energy in the latter cases is mainly subject to human behavior (i.e. turning on lights or blinds) that can change between days. ACES needs some time to understand those patterns and adapt to them. On the other hand, the nodes that maintain continuous operations are the ones that are mainly subject to a constant light pattern - light is available from sunrise to sunset for the Window node and the light is always on for security reasons near the Stair Access node. Overall, the dead time in this experiment was 4%.

To further validate Day-by-Day learning, we deployed 10 additional nodes for 15 days. We upgraded our boards to use a 1.5F super-capacitor to extend node lifetime in case there is no light available. The nodes achieved 0% dead time in 15 days of deployment.

5.3. Transfer-Learning

We speed up the learning process for a multi-node deployment with transfer learning (Taylor and Stone, 2009).

5.3.1. General Q-Table

We use the first week of light data for all the five different lighting conditions to run the ACES simulator and build a general Q-Table. Then, we place five new sensor nodes in similar lighting conditions as the previous nodes, i.e. a node close to a window, a node in another conference room and so on. We run ACES on the new nodes while starting them with the general Q-Table instead of learning everything from scratch.

| Num Data-Sent (Num Node Died) | Empty Q-Table | General Q-Table | Percentage [%] |
|---|---|---|---|
| Conference Room | 2670 (1) | 4032 (0) | 151 |
| Avg Light [lux] | 520 | 455 | 89 |
| Window | 10126 (0) | 9894 (0) | 98 |
| Avg Light [lux] | 4518 | 3888 | 87 |
| Middle Office | 2560 (0) | 2783 (0) | 108 |
| Avg Light [lux] | 281 | 340 | 121 |
| Door | 745 (2) | 860 (1) | 115 |
| Avg Light [lux] | 117 | 99 | 85 |
| Stairs Access | 2881 (0) | 2784 (0) | 97 |
| Avg Light [lux] | 184 | 181 | 98 |

Table 8. ACES results with transfer learning for the first 3 days of deployment. In parentheses we show the number of times the nodes died during deployment.

Table 8 reports the number of data packets sent in the five different places during the first 3 days of deployment when starting ACES with a general Q-Table versus an empty one. In parentheses, we report the number of times the nodes died. With the transfer learning approach, the nodes perform better, as indicated by the increase in the number of packets sent. The nodes subject to easy-to-predict patterns (Stair Access and Window) have similar results regardless of whether they start from an empty table or a general table, as ACES is able to find the best sequence of actions after only one day. The situation is different for the placements subject to human occupancy patterns (Conference Room, Middle Office, and Door): in these cases, the general Q-Table helps speed up the learning process. The node placed in the conference room sent up to 1.5x more data. Most importantly, the general Q-Table helps reduce the number of times the nodes die: in the Conference Room case, this is reduced from 1 to 0, while in the Door case it is reduced from 2 to 1.

5.3.2. Similar Lighting Q-Table

We also tried transfer learning by starting 5 different nodes while using a Q-Table already generated by nodes running in similar lighting conditions (e.g. Window to Window). All the nodes benefited from the initial policy and achieved 0% dead time across 2 weeks of deployment.

5.4. Comparison with State of the Art Solutions

We compare the sensing-rate for periodic sensors achieved by ACES using a day-by-day learning policy with (i) an energy manager based on reinforcement learning for energy harvesting wireless sensor networks (RLMan (Aoudia et al., 2018)), (ii) a local power management algorithm that aims to maximize sensing-rate while avoiding energy depletion on energy-harvesting battery-less motes (i.e. Mote-Local (Fraternali et al., 2018a)) and (iii) a battery-powered system that sends data every minute (lin, [n. d.]). We consider these 3 architectures since they cover current solutions (iii), a heuristic from the literature (ii) and RL-based methods (i). We used the first week of lighting data to compare the 4 systems. Table 9 shows the comparison for each lighting condition.

RLMan uses an actor-critic algorithm with function approximation and learns a policy based on historical data with only the energy storage level as state. We use the same state, but use the Q-learning algorithm to learn a policy. Q-learning is guaranteed to converge to an optimal policy and we run the algorithm until it converges in the simulator, hence the change in algorithm will not affect the final policy learned. To simulate the Mote-Local method, we used the power management algorithm described in (Fraternali et al., 2018a): the system increases the sensing-rate if light is available and the super-capacitor voltage is increasing over time, and decreases the sensing-rate when the light is off or the voltage is decreasing or staying the same over time.
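The Mote-Local rule as summarized above can be sketched in a few lines. This is an illustration of the description in this section rather than the code of (Fraternali et al., 2018a): the step size of one action index and the four-action range are assumptions, and the names are hypothetical.

```python
def mote_local_step(action, light_on, voltage_now, voltage_prev, n_actions=4):
    """Return the next sensing-rate index under the heuristic described above."""
    if light_on and voltage_now > voltage_prev:
        # Light is available and the super-capacitor is charging: sense faster.
        return min(action + 1, n_actions - 1)
    # Light is off, or the voltage is falling or flat: sense slower.
    return max(action - 1, 0)


print(mote_local_step(1, True, 4.0, 3.9))    # charging under light
print(mote_local_step(1, False, 4.0, 3.9))   # lights off
```

Unlike the RL policies, this rule reacts only to the most recent voltage change and has no way to anticipate that light will return, which is consistent with the comparison in Table 9.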

| Placement | Window | Conf. Room | Middle Office | Door | Stairs Access |
|---|---|---|---|---|---|
| Mote-Local (Fraternali et al., 2018a) | 10651 | 4064 | 4725 | 1792 | 6433 |
| RLMan (Aoudia et al., 2018) | 19972 | 4538 | 4025 | 1472 | 4431 |
| Battery-Powered (lin, [n. d.]) | 10080 | 10080 | 10080 | 10080 | 10080 |
| ACES | 24987 | 4905 | 6079 | 1842 | 9540 |

Table 9. Sensing-rate comparison between different methods w.r.t. ACES on different indoor lighting conditions

ACES sends more data packets than the RLMan and Mote-Local techniques. RL beats the Mote-Local technique because it learns the impact of each action it takes from the rewards received; the formulation teaches the agent to forecast conditions based on historical data and to optimize for a higher sensing rate. ACES beats RLMan because it uses additional state information such as light intensity and weekday/weekend. Hence, the additional states we include in ACES make a measurable impact on the performance of the node.

5.5. Real-World PIR Event + Periodical Sensing

We deployed 45 nodes with a PIR sensor to evaluate ACES in the real world. We place 9 nodes in each of the five lighting conditions. The nodes send both PIR events and light intensity data. 40 nodes use the day-by-day learning policy and 5 nodes use the transfer-learning policy. As event sensing nodes remain awake much of the time, we increased the super-capacitor size to 1.5F to account for the additional energy expenditure. A node now lasts up to 9.6 hours when it sends a packet every 15 seconds with the PIR always on. It lasts up to 8 days with a 10-minute sensing period. Table 10 reports a summary of the nodes deployed in the different lighting conditions, including the number of times the nodes died and the time in hours that they were off. For the Door case, we placed two of the 9 nodes close to the ceiling near a source of light to facilitate detection of people entering or leaving the room, and hence the average light the nodes receive in this location is higher than the other averages.

| Node Placement | Avg Light [lux] | Avg PIR [event] | Peak PIR [event] | Avg Sensing [packet] | Dead Time [h] | Node Dead [num] |
|---|---|---|---|---|---|---|
| Conference | 1139 | 43 | 83 | 726 | 24 | 1 |
| Window | 4301 | 83 | 199 | 1098 | 0 | 0 |
| Middle | 423 | 61 | 175 | 807 | 0 | 2 |
| Door | 554* | 33 | 85 | 781 | 0 | 0 |
| Corridors | 479 | 81 | 154 | 581 | 0 | 1 |

* 2 nodes out of 9 in the ceiling near internal lights
Table 10. A summary of the 45 nodes we deployed for PIR event-detection. 9 nodes were deployed for each lighting condition. Numbers are averaged per day during a two-week experiment.

3 nodes died in the Middle and Corridor cases. Upon investigation, we found that the nodes were defective and did not charge even when lights were on. For the Conference case, most nodes performed well, but one of the nodes ran out of energy for about 24 hours. Upon checking its historical data, we found that the light in the room remained off for several days and the node died during the weekend. After people entered the room and turned on the light, the node resumed operations in just 15 minutes. Results are positive for the Stairs/Corridor case, where nodes detected an average of 154 events per day and sent light data 581 times per day. The Window location performed best, with sensor data sent 1098 times per day and 200 motion events per day on average.

5.5.1. PIR Detection Accuracy

To evaluate how many events are missed by ACES, we placed 15 battery-powered nodes as ground truth for the events detected. We count at most one motion event every 2 minutes for both the ACES and ground truth sensor nodes. Table 11 compares the events detected in the different lighting conditions with respect to the ground truth nodes. The table reports the average number of events per day.
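The rule of counting at most one motion event every 2 minutes can be sketched as a simple filter over event timestamps. The text does not say whether fixed 2-minute bins or a window measured from the last counted event is used; this sketch assumes the latter, and the function name is illustrative.

```python
def count_events(timestamps_s, window_s=120):
    """Count events, keeping at most one per window after the last counted event."""
    count = 0
    last_counted = None
    for t in sorted(timestamps_s):
        if last_counted is None or t - last_counted >= window_s:
            count += 1
            last_counted = t
    return count


# Triggers at 0, 120 and 500 seconds are counted; the others fall inside a window.
print(count_events([0, 30, 119, 120, 130, 500]))
```

Applying the same filter to both the ACES nodes and the ground-truth nodes keeps the comparison fair when a single person triggers the PIR several times in quick succession.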

| Node Placement | Ground-Truth [events] | ACES [events] | Percentage [%] |
|---|---|---|---|
| Conference Room | 52 | 48 | 91 |
| Window | 110 | 109 | 99 |
| Middle | 125 | 98 | 79 |
| Door | 63 | 54 | 86 |
| Stairs/Corridors | 154 | 112 | 73 |

Table 11. Event detection comparison between ACES and ground-truth nodes. Events averaged per day.

Table 11 shows that for the Window case, ACES detects 99% of the events detected by a battery-powered node. This is not surprising as the majority of events happen during the day. Conference nodes also capture 91% of the events since, as soon as people enter the room, the super-capacitor gets fully charged in a few minutes. The Corridor case is the most challenging: there is a continuous stream of events due to people moving, and ACES catches only 73% of them (a mean of 112 events per day). This can be mitigated with a different event detection strategy for locations with high activity, e.g. rewarding events detected every 10 minutes.

5.5.2. Continuous Operation and Morning First Event Detection

Barring the 3 sensors that were defective, 41 nodes out of the 42 deployed maintained continuous working operations. More importantly, they were operational throughout the night and detected the first PIR event in the morning when people come into their office. The first morning PIR event is important for convenience as users expect a fast response from building systems when they enter their office. ACES detects 99% of these events and only one was delayed by 15 minutes.

5.5.3. Transfer Learning

We deployed 5 of the nodes using transfer learning with a Q-Table from a node in a similar lighting condition, as described in Section 5.3.2. All the nodes benefited from the initial policy and achieved 0% dead time.

5.6. Energy Neutral Operations

It is possible that the ACES nodes perform well in our multi-week deployments, but consume incrementally more energy than is available and die out in a longer deployment. We evaluate the energy neutrality of ACES nodes by monitoring the super-capacitor voltage level of five randomly picked nodes, one for each type of location. If the super-capacitor voltage level steadily decreases over time, ACES nodes will die more frequently in longer deployments. Table 12 shows the super-capacitor voltage percentage at midnight on two consecutive days for the five nodes.

| Node Placement | Midnight Day1 [SC Volt in %] | Midnight Day2 [SC Volt in %] | Difference [num] |
|---|---|---|---|
| Conference Room | 93 | 100 | 7 |
| Window | 88 | 87 | 1 |
| Middle | 88 | 87 | 1 |
| Door | 67 | 68 | 1 |
| Stairs/Corridors | 91 | 91 | 0 |

Table 12. Energy neutral operations after consecutive days for different lighting conditions

For 4 out of the 5 nodes, the voltage level in the super-capacitor stays within 1% after 24 hours. All four nodes have a voltage level below 100%. Thus, the ACES agent learns to modulate its sensing rate such that it neither expends too much energy nor saves too much energy. Figure 8 shows energy-neutral operations for the Stairs case. In this case, the light is almost constant throughout the day, and ACES adapts the action accordingly to maintain a constant voltage level. At the end of two consecutive days, the super-capacitor voltage level percentage remains exactly the same.

Figure 8. Energy Neutral Operations for Stairs Case. The light is almost constant throughout the day, and ACES adapts the actions in order to maintain a constant voltage level. At the end of two consecutive days, the super capacitor voltage level percentage remains exactly the same.

For the Conference case, the situation is different and the super-capacitor voltage percentage reaches 100%. In this case, the ACES agent was conservative in spending its energy and did not send as much data as it could have. It is possible that the conservative strategy works better in the Conference room because of its unpredictable changes in light conditions.
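The energy-neutrality check discussed in this section can be reproduced from the Table 12 values with a short script. The tolerance of one percentage point mirrors the "within 1%" criterion in the text; the dictionary and function names are illustrative.

```python
# Midnight super-capacitor voltage (in %) on two consecutive days, from Table 12.
midnight_sc_percent = {
    "Conference Room": (93, 100),
    "Window": (88, 87),
    "Middle": (88, 87),
    "Door": (67, 68),
    "Stairs/Corridors": (91, 91),
}


def is_energy_neutral(day1: int, day2: int, tolerance: int = 1) -> bool:
    """True if the stored energy is essentially unchanged after 24 hours."""
    return abs(day2 - day1) <= tolerance


neutral = [name for name, (day1, day2) in midnight_sc_percent.items()
           if is_energy_neutral(day1, day2)]
print(neutral)
```

The Conference Room node is the only one outside the tolerance, and it deviates upward, which matches the conservative behavior described above rather than a slow drain.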

6. Real-World Experience and Limitations

We have demonstrated that ACES can successfully set the sensing rate of energy harvesting devices according to light availability. Our simulation and real deployment results are promising. Our current deployment consists of 60 nodes across our department building floor, with 15 nodes for periodic light sensing and 45 for PIR event-detection. We use 15 additional battery-powered nodes as a ground truth reference for the PIR nodes. Figure 9 shows the floor map and the position of the nodes. Based on our experience, we highlight the pertinent future research directions for using reinforcement learning at the edge.

Figure 9. 60 nodes deployed across our department building floor. 45 nodes send both PIR events and periodic light measurements, 15 nodes only sense light periodically. All the nodes use day-by-day learning and 10 nodes were initialized with transfer learning policy.

RL problem formulation: The key to the success of reinforcement learning is to formulate the states, actions, rewards and state transitions to capture the essence of the problem. We empirically tried several formulations before finalizing the current design. This is a one-time effort that can be applied to the deployment of numerous sensors, and is much better than manually configuring each sensor. However, tools to assist domain experts in the problem specification will be immensely useful for the adoption of ACES-like solutions.

Safe exploration in real-world deployments: In our one-time learning policy, we let the system keep exploring during the day. Sometimes the randomly chosen action can drastically reduce the QoS. We avoided exploration by using the day-by-day learning strategy, where we learn the policy in the simulator. We are able to do this because the effect of our actions on the environment was relatively easy to simulate. If that is not the case, we need a safe way to explore in the real-world deployment (Berkenkamp et al., 2017).

Accommodate larger state space: We carefully designed our states and actions to limit the search space of the Q-learning algorithm and make the problem tractable. However, in realistic IoT deployments where nodes can have tens of parameters, we may need to accommodate large state spaces and action choices. Recently proposed deep reinforcement learning algorithms such as DQN (Mnih et al., 2015) and TRPO (Schulman et al., 2015) can accommodate such large state spaces. As our agent resides in the power socket connected base station, it has enough computation power to deploy these algorithms. In future work, we plan to explore the computation versus performance tradeoffs among these algorithms.

Speeding-up learning: In our current setup, we initially learned a new policy for each sensor node from scratch. To scale to thousands of sensors, we can learn a single policy that generalizes to many different contexts. We showed how transfer-learning can be used to speed up the learning process. However, we can learn a single policy by expanding the state space to include contextual features so that the RL agent learns the actions that maximize the long-term rewards in a context-specific manner. For our use case, we can add state such as the type of room (conference room vs lobby), if the room has a window, etc. A single policy can learn from data collected by all the sensor nodes and hence will also speed up learning significantly.

Managing network failures: Communications between the nodes and the base station can fail in real-world environments due to unexpected events. During our deployment, our department network was subject to a reconfiguration of IP addresses by the building network manager, and the base stations failed to communicate correct actions to the nodes. 3 nodes died during a 4-day disconnection. To mitigate network failures in the future, the Q-learning policy can be executed inside the node to make the system more robust. The memory and compute requirements are low. The CC2650 MODA that is included in the board has 128 KB of programmable flash and the Q-Tables we learned are at most 25 KB. As a proof of concept, we implemented a full Q-table inside one node and the node could modulate its sensing period without the need for a base station. In future work, we will consider transferring the learned Q-tables from the base station to the node.

Simulation and Real World Differences: While in the simulations the best sequence of actions is learned by assuming that all nodes behave identically, during our real-world deployment we discovered a number of factors that differ between nodes and severely impact node behavior: (i) current leakage variability and (ii) environmental interference and distance from the base station, which cause disconnections and require more power for the node to reconnect. Taking these differences into account could increase node performance. Furthermore, these differences become non-negligible in a large-scale deployment. When we increased our deployment from 5 to 60 nodes, the disconnections in the network increased at a rapid rate. On average, 1 out of 20 packets is affected by disconnections and the node needs to retry the communication to exchange the data with the base station, causing increased energy consumption.

Placement of Nodes: One node never reached normal working functionality because the light available was never enough for the node to charge. The node was placed in a corridor where only 60 lux was available during the day. After moving it closer to the source of light (110 lux on average), the node started working as expected. Hence, placement of nodes should take into consideration the minimal energy required to operate the node.

7. Conclusion

Battery replacement of wireless sensors is a major bottleneck towards the adoption of large-scale deployments in buildings. We show that energy harvesting sensors provide a credible alternative when we optimize their operations using reinforcement learning (RL). Our system ACES uses RL to configure the sampling rate of solar panel based energy harvesting sensor nodes. Our sensor node uses Bluetooth low energy for communicating data to a nearby base station and senses both periodic measurements such as temperature and event-driven ones such as motion. Using both simulations and real-world deployments, we show that our Q-learning algorithm based RL policy learns to adapt to varying lighting conditions and sends sensor data across nights and weekends without depleting available energy. We explored deploying a one-time policy learned from historical data as well as learning a new policy based on data collected each day. Both of these strategies achieved perpetual operations in real-world deployments. We also show that transfer learning can be used to effectively decrease the training period in real deployments.


  • lin ([n. d.]) [n. d.]. ([n. d.]).
  • acurite (18) acurite. ’18. (’18).
  • Agarwal et al. (2010) Yuvraj Agarwal, Bharathan Balaji, Rajesh Gupta, Jacob Lyles, Michael Wei, and Thomas Weng. 2010. Occupancy-driven Energy Management for Smart Building Automation. In Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys ’10). ACM, New York, NY, USA, 1–6.
  • Aoudia et al. (2018) Fayçal Ait Aoudia, Matthieu Gautier, and Olivier Berder. 2018. RLMan: an Energy Manager Based on Reinforcement Learning for Energy Harvesting Wireless Sensor Networks. IEEE Transactions on Green Communications and Networking 2, 2 (2018), 408–417.
  • Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. 2017. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems. 908–918.
  • Bohez et al. (2019) Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell. 2019. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019).
  • Campbell et al. (2016) Bradford Campbell, Joshua Adkins, and Prabal Dutta. 2016. Cinamin: A Perpetual and Nearly Invisible BLE Beacon. In Proceedings of the 2016 International Conference on Embedded Wireless Systems and Networks (EWSN ’16). Junction Publishing, USA, 331–332.
  • Campbell and Dutta (2014) Bradford Campbell and Prabal Dutta. 2014. An Energy-harvesting Sensor Architecture and Toolkit for Building Monitoring and Event Detection. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings (BuildSys ’14). New York, NY, USA, 100–109.
  • Chi et al. (2014) Qingping Chi, Hairong Yan, Chuan Zhang, Zhibo Pang, and Li Da Xu. 2014. A reconfigurable smart sensor interface for industrial WSN in IoT environment. IEEE transactions on industrial informatics 10, 2 (2014), 1417–1425.
  • CO (2007) Sanio Semiconductor CO. 2007. (2007).
  • DeBruin et al. (2013) Samuel DeBruin, Bradford Campbell, and Prabal Dutta. 2013. Monjolo: An Energy-harvesting Energy Meter Architecture. In Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems (SenSys ’13). ACM, New York, NY, USA, Article 18, 14 pages.
  • Dias et al. (2016) Gabriel Martins Dias, Maddalena Nurchis, and Boris Bellalta. 2016. Adapting sampling interval of sensor networks using on-line reinforcement learning. In Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE, 460–465.
  • (2018) 2018. (2018).
  • Ensworth and Reynolds (2017) Joshua F Ensworth and Matthew S Reynolds. 2017. Ble-backscatter: ultralow-power IoT nodes compatible with bluetooth 4.0 low energy (BLE) smartphones and tablets. IEEE Transactions on Microwave Theory and Techniques 65, 9 (2017), 3360–3368.
  • Finnigan et al. (2017) S Mitchell Finnigan, AK Clear, Geremy Farr-Wharton, Kim Ladha, and Rob Comber. 2017. Augmenting Audits: Exploring the Role of Sensor Toolkits in Sustainable Buildings Management. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 2 (2017), 10.
  • for Demand Response and Efficiency (2015) Cypress Envirosystems. 2015. Retrofitting Existing Buildings for Demand Response and Energy Efficiency. 2015. (2015).
  • Fraternali et al. (2018a) Francesco Fraternali, Bharathan Balaji, Yuvraj Agarwal, Luca Benini, and Rajesh K. Gupta. 2018a. Pible: Battery-Free Mote for Perpetual Indoor BLE Applications. In Proceedings of the 5th ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys ’18). ACM.
  • Fraternali et al. (2018b) Francesco Fraternali, Bharathan Balaji, and Rajesh Gupta. 2018b. Scaling Configuration of Energy Harvesting Sensors with Reinforcement Learning. In Proceedings of the 6th International Workshop on Energy Harvesting & Energy-Neutral Sensing Systems (ENSsys ’18). ACM, New York, NY, USA, 7–13.
  • Grondman et al. (2012) Ivo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. 2012. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 1291–1307.
  • Gubbi et al. (2013) Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. 2013. Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems 29, 7 (2013), 1645–1660.
  • Hester and Sorber (2017) Josiah Hester and Jacob Sorber. 2017. Flicker: Rapid Prototyping for the Batteryless Internet-of-Things. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems. ACM, 19.
  • Hester et al. (2017) Josiah Hester, Kevin Storer, and Jacob Sorber. 2017. Timely Execution on Intermittently Powered Batteryless Sensors. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys ’17). ACM, New York, NY, USA, Article 17, 13 pages.
  • Howell (2017) Jenalea Howell. 2017. (2017).
  • Hsu et al. (2006) Jason Hsu, Sadaf Zahedi, Aman Kansal, Mani Srivastava, and Vijay Raghunathan. 2006. Adaptive duty cycling for energy harvesting systems. In Proceedings of the 2006 international symposium on Low power electronics and design. ACM, 180–185.
  • Hsu et al. (2009a) Roy Chaoming Hsu, Cheng-Ting Liu, and Wei-Ming Lee. 2009a. Reinforcement learning-based dynamic power management for energy harvesting wireless sensor network. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 399–408.
  • Hsu et al. (2014) R. C. Hsu, C. T. Liu, and H. L. Wang. 2014. A Reinforcement Learning-Based ToD Provisioning Dynamic Power Management for Sustainable Operation of Energy Harvesting Wireless Sensor Node. IEEE Transactions on Emerging Topics in Computing 2, 2 (June 2014), 181–191.
  • Hsu et al. (2009b) R. C. Hsu, C. T. Liu, K. C. Wang, and W. M. Lee. 2009b. QoS-Aware Power Management for Energy Harvesting Wireless Sensor Network Utilizing Reinforcement Learning. In 2009 International Conference on Computational Science and Engineering, Vol. 2. 537–542.
  • (2018) 2018. (2018).
  • Jayakumar et al. (2014) Hrishikesh Jayakumar, Kangwoo Lee, Woo Suk Lee, Arnab Raha, Younghyun Kim, and Vijay Raghunathan. 2014. Powering the internet of things. In Proceedings of the 2014 international symposium on Low power electronics and design. ACM, 375–380.
  • Kalyanakrishnan and Stone (2007) Shivaram Kalyanakrishnan and Peter Stone. 2007. Batch reinforcement learning in a complex domain. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems. ACM, 94.
  • Kansal et al. (2007) Aman Kansal, Jason Hsu, Sadaf Zahedi, and Mani B. Srivastava. 2007. Power Management in Energy Harvesting Sensor Networks. ACM Trans. Embed. Comput. Syst. 6, 4, Article 32 (Sept. 2007).
  • Kellogg et al. (2014) Bryce Kellogg, Aaron Parks, Shyamnath Gollakota, Joshua R Smith, and David Wetherall. 2014. Wi-Fi backscatter: Internet connectivity for RF-powered devices. In ACM SIGCOMM Computer Communication Review, Vol. 44. ACM, 607–618.
  • Khan and Hornbæk (2011) Azam Khan and Kasper Hornbæk. 2011. Big data from the built environment. In Proceedings of the 2nd international workshop on Research in the large. ACM, 29–32.
  • Lawson and Ramaswamy (2015) Victor Lawson and Lakshmish Ramaswamy. 2015. Data Quality and Energy Management Tradeoffs in Sensor Service Clouds. In Big Data (BigData Congress), 2015 IEEE International Congress on. IEEE, 749–752.
  • Lucia et al. (2017) Brandon Lucia, Vignesh Balaji, Alexei Colin, Kiwan Maeng, and Emily Ruppel. 2017. Intermittent Computing: Challenges and Opportunities. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 71. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  • Maeng et al. (2017) Kiwan Maeng, Alexei Colin, and Brandon Lucia. 2017. Alpaca: Intermittent Execution Without Checkpoints. Proc. ACM Program. Lang. 1, OOPSLA, Article 96 (Oct. 2017), 30 pages.
  • Martin et al. (2012) Paul Martin, Zainul Charbiwala, and Mani Srivastava. 2012. DoubleDip: Leveraging Thermoelectric Harvesting for Low Power Monitoring of Sporadic Water Use. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems (SenSys ’12). New York, NY, USA, 225–238.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • Moser et al. (2010) C. Moser, L. Thiele, D. Brunelli, and L. Benini. 2010. Adaptive Power Management for Environmentally Powered Systems. IEEE Trans. Comput. 59, 4 (April 2010), 478–491.
  • Naderiparizi et al. (2015a) Saman Naderiparizi, Aaron N Parks, Zerina Kapetanovic, Benjamin Ransford, and Joshua R Smith. 2015a. WISPCam: A battery-free RFID camera. In RFID (RFID), 2015 IEEE International Conference on. IEEE, 166–173.
  • Naderiparizi et al. (2015b) Saman Naderiparizi, Yi Zhao, James Youngquist, Alanson P. Sample, and Joshua R. Smith. 2015b. Self-localizing Battery-free Cameras. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’15). ACM, New York, NY, USA, 445–449.
  • (2017) 2017. (2017).
  • Rivoire et al. (2008) Suzanne Rivoire, Parthasarathy Ranganathan, and Christos Kozyrakis. 2008. A comparison of high-level full-system power models. HotPower 8, 2 (2008), 32–39.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In International Conference on Machine Learning. 1889–1897.
  • Shresthamali et al. (2017) Shaswot Shresthamali, Masaaki Kondo, and Hiroshi Nakamura. 2017. Adaptive Power Management in Solar Energy Harvesting Sensor Node Using Reinforcement Learning. ACM Transactions on Embedded Computing Systems (TECS) 16, 5s (2017), 181.
  • Sudevalayam and Kulkarni (2011) Sujesha Sudevalayam and Purushottam Kulkarni. 2011. Energy harvesting sensor nodes: Survey and implications. IEEE Communications Surveys & Tutorials 13, 3 (2011), 443–461.
  • Sutton et al. (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. MIT press.
  • Talla et al. (2017) Vamsi Talla, Bryce Kellogg, Shyamnath Gollakota, and Joshua R Smith. 2017. Battery-free cellphone. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 2 (2017), 25.
  • Taylor and Stone (2009) Matthew E Taylor and Peter Stone. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10, Jul (2009), 1633–1685.
  • Udenze and McDonald-Maier (2009) Adrian Udenze and Klaus McDonald-Maier. 2009. Direct reinforcement learning for autonomous power configuration and control in wireless networks. In Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on. IEEE, 289–296.
  • Ulukus et al. (2015) Sennur Ulukus, Aylin Yener, Elza Erkip, Osvaldo Simeone, Michele Zorzi, Pulkit Grover, and Kaibin Huang. 2015. Energy harvesting wireless communications: A review of recent advances. IEEE Journal on Selected Areas in Communications 33, 3 (2015), 360–381.
  • (2018) 2018. (2018).
  • Vigorito et al. (2007) Christopher M Vigorito, Deepak Ganesan, and Andrew G Barto. 2007. Adaptive control of duty cycling in energy-harvesting wireless sensor networks. In 2007 4th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks. IEEE, 21–30.
  • Wang et al. (2006) Qin Wang, Mark Hempstead, and Woodward Yang. 2006. A realistic power consumption model for wireless sensor network devices. In 2006 3rd annual IEEE communications society on sensor and ad hoc communications and networks, Vol. 1. IEEE, 286–295.
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
  • (2019) 2019. (2019).
  • Yau et al. (2012) Kok-Lim Alvin Yau, Peter Komisarczuk, and Paul D Teal. 2012. Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues. Journal of Network and Computer Applications 35, 1 (2012), 253–267.