Scaling Configuration of Energy Harvesting Sensors with Reinforcement Learning

11/27/2018 ∙ by Francesco Fraternali, et al. ∙ University of California, San Diego

With the advent of the Internet of Things (IoT), an increasing number of energy harvesting methods are being used to supplement or supplant battery based sensors. Energy harvesting sensors need to be configured according to the application, hardware, and environmental conditions to maximize their usefulness. As of today, the configuration of sensors is either manual or heuristics based, requiring valuable domain expertise. Reinforcement learning (RL) is a promising approach to automate configuration and efficiently scale IoT deployments, but it is not yet adopted in practice. We propose solutions to bridge this gap: reduce the training phase of RL so that nodes are operational within a short time after deployment and reduce the computational requirements to scale to large deployments. We focus on configuration of the sampling rate of indoor solar panel based energy harvesting sensors. We created a simulator based on 3 months of data collected from 5 sensor nodes subject to different lighting conditions. Our simulation results show that RL can effectively learn energy availability patterns and configure the sampling rate of the sensor nodes to maximize the sensing data while ensuring that energy storage is not depleted. The nodes can be operational within the first day by using our methods. We show that it is possible to reduce the number of RL policies by using a single policy for nodes that share similar lighting conditions.







1. Introduction

The number of connected Internet of Things (IoT) devices is expected to increase from 27 billion in 2017 to 125 billion in 2030 (Howell, 2017). Driven by the need to collect data about the environment and human behavior, an increasing number of applications have emerged that produce useful, scientifically relevant data. Energy harvesting sensors are a key part of the IoT ecosystem because they reduce dependence on batteries (Atzori et al., [n. d.]; Hester and Sorber, 2017; Renner, 2013; Hester et al., 2017). However, the performance of energy harvesting sensors is highly dependent on energy availability in the environment (Campbell et al., 2016; Campbell et al., 2014; Fraternali et al., 2018). To maximize performance, the sensors need to be tuned carefully according to the application requirements, hardware capabilities and environmental conditions. As IoT deployments scale, setting these configuration parameters manually or based on heuristics becomes infeasible. We explore Reinforcement Learning (RL) as a promising solution, as it can set these parameters dynamically through online learning.

We focus on indoor solar energy harvesting sensors and use RL to dynamically configure the sampling rate of the sensor. If the sampling rate is too high, the node expends available energy and reduces its uptime. If the rate is too low, it impacts the application performance. The objective is to configure the sensing period such that the number of sensor samples is maximized while ensuring that the sensor node does not run out of energy.

Several solutions propose machine learning techniques to automatically configure sensors (Dias et al., 2016; Hsu et al., 2014; Hsu et al., 2009b), but their results are limited: (i) they are based on short simulations and do not capture pragmatic aspects of a real-life environment; (ii) they do not consider the scaling of the system to thousands of nodes. To overcome these limitations, we conduct a series of simulation experiments to evaluate solutions for auto-configuration of sensors using RL. The domain expert specifies the parameters (i.e. actions in RL terminology) to be configured, the contextual features (i.e. states) which affect these parameters and the utility (i.e. reward) that they seek to maximize. RL then tries out different parameter values, observes their effect on the given utility and learns the sensor-node configuration that maximizes the long-term utility under the current environmental conditions.

We built a generic sensor node with a solar panel for energy harvesting, a supercapacitor for energy storage, a BLE radio for communication and general sensors such as light and temperature (Fraternali et al., 2018). We deployed five nodes in different indoor lighting conditions in our department building, collecting light intensity and supercapacitor voltage level for 3 months. We developed a simulator based on this data that models the essential aspects of the sensor nodes and their environment, and trained different RL agents using the Q-learning algorithm (Watkins, 1989). We show how the system adapts to environmental changes and appropriately configures each mote to maximize the sensing rate while avoiding energy storage depletion.

Typical RL solutions need significant historical data or an online training phase in which they explore the solution space randomly to learn an effective strategy. However, this is detrimental to sensor deployments, as historical data is expensive to collect and a long training phase makes the sensor node unusable immediately after deployment. To combat this, we propose an adaptive on-policy RL solution that reduces the training phase after deployment. We show that nodes can effectively operate, i.e. sense data periodically without depleting the stored energy, within the first day. We also show that similar results can be obtained by exploiting transfer learning. Finally, prior solutions use one RL policy for each sensor node, which limits scalability. We show that it is possible to use a single policy for sensors that share similar lighting conditions and still effectively configure the sensor nodes.

2. Related Work

The importance of automatic sensor configuration to reduce manual intervention is underlined by many works (Chi et al., 2014; Udenze et al., 2009; Moser et al., 2010; Dias et al., 2016; Hsu et al., 2009a; Hsu et al., 2009b). Due to the close relationship between data quality and energy consumption (Lawson and Ramaswamy, 2015), a sensor node should adapt its sensing to meet application requirements while avoiding energy depletion (Jayakumar et al., 2014). Prior works have proposed adaptive duty cycling on energy harvesting sensors to achieve energy neutral operations (Kansal et al., 2007; Hsu et al., 2006; Moser et al., 2010). Based on the predicted energy, nodes adjust their duty-cycle parameters to increase lifetime and application performance (Sudevalayam et al., 2011). Like prior work, RL uses prediction to make decisions, but unlike prior work the policy is learned automatically to converge to the optimal solution.

Machine learning techniques have been adopted to predict the future energy availability of a sensor node and select the correct sensor parameter configuration (Yau et al., 2012). For example, Dalamagkidis et al. (Dalamagkidis et al., 2007) and Udenze et al. (Udenze et al., 2009) show that Reinforcement Learning (RL) outperforms a traditional on/off controller and a Fuzzy-PD controller. RL has been widely adopted to improve wireless sensor network performance: to dynamically select at run time, from a pre-defined set of routing options, the routing protocol that provides the best performance (Nurchis et al., 2011); to bring wireless nodes to the lowest possible transmission power level while respecting the quality requirements of the overall network (Chincoli and Liotta, 2018); and to adapt sampling intervals in changing environments (Dias et al., 2016). Overall, RL promises to learn the optimal policy that is specific to each context and application. Hence, it helps push the boundaries of what is possible in sensor networks. RL can also be implemented locally on the device, since the Q-table does not require much memory or computation. However, prior works on RL-based sensor configuration do not consider many aspects of the design required for large-scale real-world deployments of thousands of nodes.

Simulation results by Dias et al. (Dias et al., 2016) optimize energy efficiency based on data collected during five days by five sensor nodes. This data collection period is too short to capture a wide range of real, changing environmental conditions. Moreover, they assume a fixed 12-hour period as the time needed by the Q-learning algorithm to "calibrate" the action-value function for the remaining 4.5 days of the experiment. But a fixed Q-learning training time cannot capture all kinds of environmental changes: faster environmental changes could require a longer calibration time, while slower changes could relax it. In this paper, we propose a dynamic on-policy training interval that varies the time between trainings. In a system that includes thousands of nodes, the time between two consecutive on-policy trainings is important because increasing it reduces computation and cost. Similarly, Yue et al. (Hsu et al., 2009a) show a dynamic power management method for increasing the battery life of mobile phones. Although they use realistic simulation models for battery use, their simulations use data from Linux network traces, which do not capture the environmental conditions of the target application. RL has also been used to improve energy utilization in energy harvesting wireless sensor networks (Hsu et al., 2009b; Hsu et al., 2014). Hsu et al. (Hsu et al., 2014) apply RL to sustain perpetual operation and satisfy the throughput demand requirements of today's energy harvesting wireless sensor nodes. However, their algorithm is built and tested in an outdoor environment, where sunlight patterns are consistent throughout the day (i.e., light is available from sunrise to sunset). In our work, we focus on indoor sensing, where the daylight patterns are less well-defined and the light availability is also affected by human occupancy. Furthermore, we focus on learning faster and scaling better with (1) adaptive policy learning, (2) transfer RL, and (3) a common policy for nodes in similar environments.

3. Problem Formulation

3.1. Problem Statement

Nodes can be placed in different locations in the building. Each sensor node will be subject to different light patterns that are determined by human behavior (i.e. lights turned on and off) or by natural light. Light availability will vary from weekdays to weekends, from winter to summer and with changes in usage patterns, e.g. a conference room vs a lobby. A solar panel based energy harvesting node needs to automatically adapt its sampling rate to these changing conditions so as to maximize the utility to its applications, i.e., maximize sensor sampling while keeping the node alive. We use sensing frequency as a measure of QoS, but RL can be adapted to other metrics by changing the reward function.

3.2. Reinforcement Learning and Q-Learning

In a typical reinforcement learning (RL) problem (Sutton and Barto, 1998), an agent starts in a state $s$ and, by choosing an action $a$, receives a reward $r$ and moves to a new state $s'$. The RL agent's goal is to find the sequence of actions that maximizes the cumulative long-term reward. The way the agent chooses actions in each state is called its policy $\pi$.

For each given state $s$ and action $a$, we define a function $Q^{\pi}(s, a)$ that returns an estimate of the expected total reward obtained by starting at state $s$, taking the action $a$ and then following a given policy $\pi$. $Q^{*}(s, a)$ is the value obtained using the optimal policy, which maximizes the expected cumulative long-term reward.

$$Q^{\pi}(s, a) = \mathbb{E}\left[ r + \gamma \, Q^{\pi}(s', \pi(s')) \right]$$

where $\gamma$ is called the discount factor and determines how much the function $Q$ in state $s$ depends on future actions: rewards diminish exponentially the further they are in the future. In Q-learning, the policy picks the action that has the highest Q-value in each state. Thus, we obtain the classic Q-function (Watkins, 1989):

$$Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \right]$$

The algorithm starts with a randomly initialized Q-value for each state-action pair and an initial state $s_0$. An episode is defined as a sequence of state transitions from the initial state to the terminal state. The Q-learning algorithm visits each state-action pair multiple times in an episode. It follows an $\epsilon$-greedy policy, where for each state it picks the action that has the maximum Q-value with probability $1 - \epsilon$ and a random action otherwise. The reward obtained by selecting the action is used to update the Q-value with a small learning rate. Under the condition that every state-action pair is visited infinitely often, the Q-learning algorithm is proven to converge to the desired $Q^{*}$ (Watkins and Dayan, 1992).
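The update and action-selection rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the dictionary-based Q-table and the zero initialization are assumptions made for brevity.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q[(state, a)] for a in range(n_actions)]
    return values.index(max(values))


def q_update(q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) a fraction alpha toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q[(s_next, b)] for b in range(n_actions))
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Q-table stored as a dictionary; unseen state-action pairs default to 0.0.
q = defaultdict(float)
q_update(q, s="s0", a=1, r=3.0, s_next="s1", n_actions=4)
# With epsilon = 0 the agent always exploits, so it now prefers action 1 in "s0".
best = epsilon_greedy(q, "s0", n_actions=4, epsilon=0.0)
```

A dictionary keyed by (state, action) keeps the sketch independent of how states are encoded; a dense array would work equally well for a small, fixed state space.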

Figure 1. Energy Harvesting Node Block Diagram.

3.3. RL for Configuring Sensors

Figure 2 shows the overview of our approach. Our battery-less sensor node (i.e. Pible) collects energy from a solar panel and sends sensor data to a base station. The base station determines the sampling rate to set based on data provided by the node.

Figure 2. RL Communication Process: Block Diagram.

Agent: The agent is the brain of our system and learns from the environment by collecting data from the sensor nodes and by communicating back to them what action they need to take next. In our setting, the agent resides in the base station that the sensor node communicates with and the compute intensive RL training is done at the base station level.

Environment: It is represented by the sensor nodes (i.e. Pible) that sense the environment and communicate information back to the agent. Data sent includes light intensity, voltage level of the energy-storage element, and the current performance state (i.e. sampling rate). Super-capacitor voltage data are used to track energy consumption trends and calibrate simulation parameters.

Performance State | Sensing Period [s]
3 | 15
2 | 60 (1 min)
1 | 300 (5 min)
0 | 900 (15 min)
Table 1. Sensing Rate based on Performance State

Actions: Typical building sensors use sensing rates on the order of minutes, ranging from tens of seconds to 1 hour (Finnigan et al., [n. d.];, 2018;, 18). Hence, we discretize the sensing rate to four performance states as reported in Table 1. The action selected corresponds to the performance state to use, e.g. taking action 2 corresponds to sending data every minute. The discretization reduces the action space and hence decreases the convergence time of the Q-learning algorithm.

State: The state is used by the agent to pick the actions. We use three state variables for our system: (i) light intensity level, (ii) energy storage level, (iii) weekend/weekday. We discretize light intensity and energy storage levels to 10 values each.

Reward Function: The goal of the system is twofold: (i) Send as much data as possible by increasing performance states. (ii) Avoid energy storage depletion. The reward function decides the trade-off between the two factors.

  • Reward = Performance States (i.e. 0, 1, 2 or 3)

  • Reward = -300 if energy storage level reaches 0 (node dies).

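The penalty of -300 follows from our 24-hour horizon, in which the RL agent selects an action every 15 minutes: an episode contains 96 decisions, so the reward accumulated by always sensing at the highest performance state is at most 96 × 3 = 288. The rewards accumulated by sensing rapidly therefore cannot exceed the energy depletion penalty.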
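The reward rule and the bound behind the depletion penalty can be written down directly. The sketch below is illustrative only; the function and constant names are assumptions, and the energy level is taken to be the discretized storage level from the state definition.

```python
DEPLETION_PENALTY = -300            # reward when the energy storage is empty
STEPS_PER_EPISODE = 24 * 60 // 15   # one decision every 15 min over 24 hours
MAX_PERFORMANCE_STATE = 3           # highest performance state in Table 1


def reward(performance_state, energy_level):
    """Return the performance state as reward, or the penalty if depleted."""
    if energy_level <= 0:
        return DEPLETION_PENALTY
    return performance_state


# Largest reward an episode can accumulate without ever depleting the storage.
best_case = STEPS_PER_EPISODE * MAX_PERFORMANCE_STATE
```

Because `best_case` stays below the magnitude of the penalty, a single depletion event outweighs a full day of sampling at the fastest rate.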

State Transitions: The agent observes state transitions and takes actions, i.e. sends commands to change the sampling rate, every 15 minutes.

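1:  Initialize q as an empty set
2:  Initialize control action a, state s and s'
3:  ε = 1
4:  s ← Sense Environment
5:  while time passed < 24 hours do
6:     a ← select action for s with the ε-greedy policy
7:     wait 15 mins
8:     s' ← Sense Environment
9:     r = reward(s, a, s')
10:    q_predict ← q(s, a)
11:    q_target ← r + γ max_a' q(s', a')
12:    q(s, a) ← q(s, a) + α (q_target − q_predict)
13:    ε ← max(ε_min, ε − ε_decrement)
14:    s ← s'
15:  end while
Algorithm 1 RL Algorithm for Energy Harvesting Sensors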

A limitation of the proposed method is that it is not robust to communication failures (i.e. base station and sensor node communication). But the Q-learning policy can be executed inside the node to make it robust. The memory and compute requirements are low.

3.4. RL Algorithm

Algorithm 1 lays out our Q-learning steps. After initializing the Q-table to a matrix of zeros and setting $\epsilon$ to its maximum value (lines 1-3), the agent receives the node status (i.e. light, energy storage voltage and performance state) on line 4. At this point, the algorithm enters a while loop that lasts for an episode (i.e. 24 hours) and starts by selecting an action (line 6). The action is selected by following the $\epsilon$-greedy policy. Because $\epsilon$ is initialized to 1, the algorithm selects mostly random actions (i.e. exploration) in the first phase of learning. In each iteration of the while loop, $\epsilon$ is decreased by the epsilon decrement (line 13), causing the policy to select more exploitative actions over time. Based on the action selected, the environment changes its status and sends the updated state to the agent after 15 minutes (lines 7-8). A reward is given to the agent based on the current status, the action and the next node status (line 9). The algorithm calculates the difference between $q_{target}$ and $q_{predict}$, where $q_{predict}$ is the value in the Q-table for the current state-action pair and $q_{target}$ is the reward obtained by taking the action plus the discounted rewards obtained by picking the actions with the highest Q-value until the end of the episode. Their difference is scaled by a learning factor and added to the Q-value of the corresponding state-action pair. Finally, on line 14 we assign the next state to the current state and repeat the while loop until the end of the episode. We assume our system has reached convergence when the mean of all Q-table values no longer changes over time. We want to underline that Q-learning learns from the observed transitions from one state (light, voltage levels) to another, so its Q-values reflect how often each transition occurs. Hence, it will be robust to spurious changes in voltage levels as long as they do not happen consistently.
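The loop described above can be sketched as a single Python function. This is a hedged illustration under stated assumptions, not the authors' code: `sense` is a hypothetical callback standing in for the base-station and node exchange, `make_toy_node` is an invented toy environment used only to exercise the loop, and the default hyper-parameters follow Table 2.

```python
import random
from collections import defaultdict

N_ACTIONS = 4   # performance states 0..3 (Table 1)
STEPS = 96      # 24 hours at one decision every 15 minutes


def run_episode(q, sense, epsilon, eps_min=0.1, eps_dec=0.0004,
                alpha=0.1, gamma=0.99, rng=random):
    """Run one 24-hour episode and return (total reward, final epsilon).

    `sense(action)` is a hypothetical callback: it applies the action,
    waits one step and returns (state, energy_level).
    """
    s, _ = sense(None)                                   # initial node status
    total = 0
    for _ in range(STEPS):
        if rng.random() < epsilon:                       # explore
            a = rng.randrange(N_ACTIONS)
        else:                                            # exploit
            a = max(range(N_ACTIONS), key=lambda b: q[(s, b)])
        s_next, energy = sense(a)                        # wait and sense
        r = -300 if energy <= 0 else a                   # reward function
        q_predict = q[(s, a)]
        q_target = r + gamma * max(q[(s_next, b)] for b in range(N_ACTIONS))
        q[(s, a)] += alpha * (q_target - q_predict)      # Q-table update
        epsilon = max(eps_min, epsilon - eps_dec)        # decay exploration
        s = s_next                                       # advance the state
        total += r
    return total, epsilon


def make_toy_node():
    """Invented toy node: each step gains 1 unit and spends `action` units."""
    node = {"energy": 10}

    def sense(action):
        if action is not None:
            node["energy"] = max(0, min(10, node["energy"] - action + 1))
        return (node["energy"],), node["energy"]

    return sense


q = defaultdict(float)
total, eps = run_episode(q, make_toy_node(), epsilon=1.0, rng=random.Random(0))
```

Passing the Q-table and epsilon in and out of the function makes it straightforward to chain episodes, which is what repeated training over several simulated days requires.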

3.5. Sensor Node

As a target, we used our energy harvesting platform built for general building applications (Fraternali et al., 2018). Pible's energy harvesting sensor architecture is depicted in Figure 1: a solar panel transfers power to an energy management board, which stores the accumulated energy in a supercapacitor. Once the accumulated energy reaches a usable voltage level, the energy management board powers the micro-controller (MCU), which starts its operations.

We made further improvements to the hardware design to increase operational time without light. We adopt a 1F super-capacitor with a higher nominal voltage (i.e. 5.5V) to store more energy (E = 0.5*C*V*V). To allow the MCU to read this high voltage, we introduced a voltage divider that uses very high resistor values (10 MOhm) to minimize leakage current. With these improvements, the Pible node achieves up to a week of lifetime without light when sampling one sensor every 10 minutes. For this work, we consider the same power consumption measurements as in (Fraternali et al., 2018).
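As a quick worked example of the storage formula, the snippet below evaluates the energy held by the 1F super-capacitor at its nominal 5.5V. The usable-energy figure additionally assumes that the node stops operating at 2.1V, which is the minimum voltage used when modeling the energy storage state in the simulation setup; the function name is illustrative.

```python
def capacitor_energy(voltage, capacitance=1.0):
    """Energy in joules stored in a capacitor: E = 0.5 * C * V^2."""
    return 0.5 * capacitance * voltage ** 2


full = capacitor_energy(5.5)            # energy at the nominal voltage
usable = full - capacitor_energy(2.1)   # energy available above 2.1 V
```

Under these assumptions the capacitor holds 15.125 J when full, of which about 12.92 J lies above the 2.1V floor.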

4. Scaling Methods

4.1. Day-by-Day Learning - Baseline

As a baseline, we learn a new Q-table daily based on light data collected each day (Hsu et al., 2009b; Hsu et al., 2014). The sensor node starts with a fixed policy on the first day, and collects data every 5 min. We use the collected data in our simulator to train the RL agent using Q-learning. The learned Q-table based policy is used to collect the data the next day. At the end of the day, the collected data is again used to create a new Q-table via Q-learning. Instead of initializing the Q-table to zero, the training starts from the previous day’s Q-table that summarizes the learning until that day. The learned Q-table is used as the deployment policy for the next day and the cycle continues.

4.2. Dynamic Policy Training Interval

The day-by-day learning trains a new policy every day. Instead of a static update interval, we propose a dynamic approach, where policy updates happen over shorter intervals initially, while the agent is learning, and the interval is increased as learning stabilizes. If environmental conditions change, the interval can be shortened again to encourage faster learning. Starting from an empty Q-Table, we run on-policy training simulations every hour. As soon as the Q-Table converges, we use the newly generated Q-Table to run a simulation that calculates the total reward based on the light data collected so far. If the reward is equal to or better than the reward obtained with the old Q-Table, we double the interval between on-policy trainings. Conversely, if the total reward achieved by the new Q-Table is lower than that of the old Q-Table, we halve the on-policy training interval, down to a minimum of 1 hour. We show that this simple adaptive strategy significantly speeds up training. The halving and doubling of the interval is itself a heuristic, but it can be generalized with meta-learning methods (Finn et al., 2017).
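The doubling and halving rule can be captured in one small function. This is a minimal sketch: the function name and the use of whole hours are assumptions, and no upper bound on the interval is applied because the text does not state one.

```python
def next_training_interval(interval_h, new_reward, old_reward, min_h=1):
    """Return the next on-policy training interval in hours."""
    if new_reward >= old_reward:
        return interval_h * 2              # learning is stable: train less often
    return max(min_h, interval_h // 2)     # reward dropped: train more often


# Example: three stable updates followed by one regression.
interval = 1
for new, old in [(10, 10), (12, 10), (12, 12), (8, 12)]:
    interval = next_training_interval(interval, new, old)
```

Starting from 1 hour, the interval in this example grows to 2, 4 and 8 hours and then falls back to 4 hours after the reward drops.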

4.3. Sharing RL Policy

Prior works (Dias et al., 2016; Hsu et al., 2014; Hsu et al., 2009b) and the methods proposed thus far use a different RL policy for each sensor node. As we scale to thousands of nodes, policy training becomes infeasible on embedded computers such as the Raspberry Pi that are typically used as the base station. We could potentially use a single policy across all the sensor nodes, but our experiments show that the performance of the nodes drops significantly when using a common policy, as the environmental conditions that each sensor is exposed to are vastly different. We propose using a single policy for sensor nodes that share similar energy availability. The sensor nodes can be clustered based on the lighting data collected. While we leave the clustering method to future work, we show in simulation that the nodes perform satisfactorily when a single policy is used for hundreds of sensor nodes that share lighting characteristics.

4.4. Transfer Learning

We study the effect of learning a new RL policy by exploiting a pre-learned Q-Table instead of learning everything from scratch (Taylor et al., 2009). This is important to speed up the learning process of the nodes.

5. Simulation Results

We build our simulator using real power consumption data. The base station uses BLE gatttool functions to exchange data with the sensor nodes. As soon as a sensor node wakes up, it starts advertising. The base station reads the advertisement, connects to the sensor node and exchanges data. The energy consumption of this communication process is taken into account in our simulation. The whole process lasts a few seconds and does not impact the performance of the system. Before running the simulations, we calibrate the simulation's parameters using real discharging measurements. To simulate charging, we use a linear model based on PV cell information from datasheets (CO, 2007).

5.1. Simulation Setup

5.1.1. Modeling the States for RL

  • Light intensity level: we normalize the light intensity to a range of 0 to 10, where 0 represents no light and 10 represents 2000 lux. We select 2000 lux as the maximum value after checking typical indoor light intensities in buildings; values above 2000 lux are mapped to 10.

  • Energy storage level: we scale the energy storage voltage to a value from 0 (i.e. minimum voltage available of 2.1V) to 10 (i.e. max voltage available of 5.5V).

  • Weekend/Weekday: indoor light patterns in buildings are strongly dependent on the presence of people (Campbell et al., 2016). Hence, we consider a binary state to capture weekdays and weekends.
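The three state variables above can be encoded as a small tuple. The sketch below is one possible discretization and not necessarily the one used in the simulator: rounding to the nearest integer and the function names are assumptions. Note that an inclusive 0 to 10 integer range yields 11 levels, so an implementation targeting exactly 10 values per variable would need a slightly different binning.

```python
def light_level(lux, max_lux=2000.0):
    """Map light intensity in lux onto 0..10; values above max_lux give 10."""
    clipped = min(max(lux, 0.0), max_lux)
    return round(10 * clipped / max_lux)


def energy_level(voltage, v_min=2.1, v_max=5.5):
    """Map the storage voltage onto 0 (2.1 V or below) .. 10 (5.5 V)."""
    clipped = min(max(voltage, v_min), v_max)
    return round(10 * (clipped - v_min) / (v_max - v_min))


def encode_state(lux, voltage, is_weekend):
    """Return the (light, energy, weekend) tuple used as the Q-table key."""
    return (light_level(lux), energy_level(voltage), int(is_weekend))


state = encode_state(lux=1000, voltage=3.8, is_weekend=True)
```

Here a reading of 1000 lux and 3.8V on a weekend maps to the state (5, 5, 1).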

5.1.2. RL Hyper-parameters

Table 2 reports the hyper-parameters used for the Q-learning algorithm for our simulations.

Hyper-parameter | Symbol | Value
Reward decay | γ | 0.99
Epsilon max | ε_max | 1
Epsilon min | ε_min | 0.1
Epsilon decrement | ε_decrement | 0.0004
Learn rate | α | 0.1
Table 2. Simulation RL Hyper-parameters
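To give a feel for these values, the sketch below stores them in a dictionary and computes how many decision steps the exploration rate needs to fall from its maximum to its minimum. It assumes a linear decrement applied once per decision step with a floor at the minimum, and 96 decisions per simulated 24-hour episode; the names are illustrative.

```python
HYPERPARAMS = {
    "gamma": 0.99,                # reward decay
    "epsilon_max": 1.0,
    "epsilon_min": 0.1,
    "epsilon_decrement": 0.0004,
    "alpha": 0.1,                 # learning rate
}


def epsilon_at(step, hp=HYPERPARAMS):
    """Exploration rate after `step` decisions, floored at epsilon_min."""
    return max(hp["epsilon_min"],
               hp["epsilon_max"] - step * hp["epsilon_decrement"])


steps_to_floor = round(
    (HYPERPARAMS["epsilon_max"] - HYPERPARAMS["epsilon_min"])
    / HYPERPARAMS["epsilon_decrement"]
)
episodes_to_floor = steps_to_floor / 96   # 96 decisions per 24-hour episode
```

Under these assumptions the exploration rate reaches its floor after 2250 decisions, which corresponds to roughly 23 simulated 24-hour episodes.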

5.2. Day-by-Day Learning - Baseline Results

We run 5 different simulations, one for each node in a different lighting condition. Figure 3 shows the results; each dot represents the reward achieved by the node at the end of a day.

Figure 3. Simulation Results on Different Lighting Conditions using Day-by-Day Learning (i.e. our Baseline)

Several nodes (i.e. Door, Window, Conference Room and Middle of Office) receive negative rewards (which we report as -100 to improve graph visibility) in the first few days, indicating that the Q-learning algorithm needs time to learn and adapt to changes in the environment. Furthermore, the nodes that still receive negative rewards after the RL agent has started learning are the ones subject to human-dependent light patterns, such as Conference Room (which has no windows, so light is only on when people enter the room) and Middle of Office. On the other hand, nodes subject to regular light patterns, such as Window (which receives ambient light from sunrise to sunset) and Door (which has a window in front of it), deplete their energy storage only on the first day. Finally, the Stair Access node, where light is always on for security reasons, always receives positive rewards thanks to the stability of its light pattern. Overall, after a week the system has learned all the light patterns, and the rewards follow a constant pattern that depends on the light availability at each node placement. On average, the nodes are operational within the first 2.6 days of deployment when training from scratch.

5.3. Dynamic Policy Training Interval Results

Figure 4. Dynamic Policy Training Interval Results

Figure 4 reports the results of our experiment. During the first week of training, the nodes use a short on-policy training interval, confirming that they are still exploring the environment. After 10 days, however, they are able to drastically reduce the number of on-policy trainings performed. Furthermore, all the nodes were able to maintain positive rewards and avoid energy storage depletion.

Number of trainings (days node died) | Window | Middle Office | Door | Conference Room | Stair Access
Baseline | 89 (1) | 89 (3) | 89 (1) | 89 (2) | 89 (0)
Dynamic | 27 (0) | 42 (0) | 41 (0) | 17 (0) | 20 (0)
Reduction in trainings | 69 % | 53 % | 54 % | 81 % | 77 %
Table 3. Comparison between the dynamic policy training interval experiment and the day-by-day approach: number of policy trainings performed and number of days the nodes receive a negative reward (in parentheses)

Table 3 compares the dynamic training approach with day-by-day training quantitatively. Our method drastically reduces the number of on-policy trainings: in the Conference Room case the number of trainings is reduced by 81%, while in the worst case (i.e. Middle Office) it is reduced by 53%. In the same table, we report in parentheses the number of days on which the system receives negative rewards. The dynamic policy training interval removes all the negative rewards accumulated in the first days by the baseline approach, and the nodes become operational within the first day.
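The reduction figures can be recomputed from the training counts in Table 3. The short check below is only a sanity calculation on the published numbers; the variable names are illustrative.

```python
BASELINE_TRAININGS = 89
DYNAMIC_TRAININGS = {
    "Window": 27,
    "Middle Office": 42,
    "Door": 41,
    "Conference Room": 17,
    "Stair Access": 20,
}

# Percentage of trainings saved relative to the day-by-day baseline.
reduction = {
    node: 100 * (BASELINE_TRAININGS - n) / BASELINE_TRAININGS
    for node, n in DYNAMIC_TRAININGS.items()
}
best_node = max(reduction, key=reduction.get)
worst_node = min(reduction, key=reduction.get)
```

The recomputed values agree with the table to within one percentage point, with Conference Room showing the largest saving and Middle Office the smallest.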

5.4. Sharing RL Policy Results

We exploit the 5 real data traces from our sensor nodes across 5 different lighting conditions to simulate up to 1000 sensor nodes. For each data trace, a new trace is generated by randomly adjusting the light intensity by 30% and shifting the sampling time by 3 hours. For each of the 5 light data traces, we build 200 new data traces. The 5 light data traces that we collected already cover a variety of indoor lighting conditions (i.e. Window, Door, Middle of Office, Conference Room and Stair Access), so the new nodes are placed in similar conditions but are subject to a lower or higher light intensity depending on the distance from the light source. As light patterns depend on human activity, shifting light availability in time captures variations in activities. With the 1000 simulated data traces, we use the first week of light data and calculate the mean light intensity for each node. We then group all the nodes into 5 different clusters based on mean light intensity and build a Q-Table (i.e. Cluster Q-Table) for each of the 5 clusters.
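The trace generation and grouping steps can be sketched as follows. Several details here are assumptions rather than the authors' procedure: the scale factor is drawn uniformly within plus or minus 30%, the shift is drawn uniformly within plus or minus 3 hours and applied circularly, traces are assumed to be sampled every 15 minutes, and nodes are grouped into equal-size clusters after sorting by mean light intensity.

```python
import random


def augment_trace(trace, rng, max_scale=0.30, max_shift_h=3, samples_per_h=4):
    """Return a scaled and time-shifted copy of a list of lux samples."""
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    max_shift = max_shift_h * samples_per_h
    shift = rng.randint(-max_shift, max_shift)
    n = len(trace)
    # Circular shift so that the synthetic trace keeps the same length.
    return [trace[(i - shift) % n] * scale for i in range(n)]


def cluster_by_mean(traces, k=5):
    """Label each trace 0..k-1 by the rank of its mean light intensity."""
    order = sorted(range(len(traces)),
                   key=lambda i: sum(traces[i]) / len(traces[i]))
    labels = [0] * len(traces)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(traces)
    return labels


rng = random.Random(1)
base = [0.0] * 48 + [500.0] * 48              # toy one-day trace, 15-min samples
synthetic = [augment_trace(base, rng) for _ in range(200)]
labels = cluster_by_mean(synthetic, k=5)
```

Rank-based grouping is only one simple choice; any clustering on the mean (or on richer features of the light trace) could be substituted without changing the rest of the pipeline.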

Total Reward | Window | Middle Office | Door | Conference Room | Stair Access
1000 Nodes | 1290 | 420 | 369 | 1527 | 238
Cluster | 1472 | 988 | 390 | 1580 | 475
Table 4. Total rewards achieved for each lighting condition with a Q-Table generated for all 1000 nodes and with a Q-Table generated after clustering the light data traces

Table 4 reports the total reward obtained by running the original 5 data traces on (i) a single general Q-Table built using the light traces of all 1000 nodes (i.e. 1000 Nodes Q-Table), and (ii) the Cluster Q-Table. The Q-Tables generated after clustering the nodes (i.e. Cluster Q-Table) outperform the Q-Table built using all 1000 nodes. In the Middle of Office case, the Cluster Q-Table achieves about 2.35 times the reward of the 1000 Nodes Q-Table (988 vs. 420, i.e. roughly 135% more). These results indicate that the Cluster Q-Table is able to store the individual characteristics of the 200 clustered nodes and to use this information to maximize the performance of the nodes. On the other hand, the 1000 Nodes Q-Table sacrifices maximum performance to allow the management of all 1000 nodes. We leave automated clustering of nodes based on lighting characteristics to future work.
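The relative gain of the Cluster Q-Table can be recomputed from the totals in Table 4. The snippet below is a sanity calculation on the published numbers only; the variable names are illustrative.

```python
REWARD_1000_NODES = {"Window": 1290, "Middle Office": 420, "Door": 369,
                     "Conference Room": 1527, "Stair Access": 238}
REWARD_CLUSTER = {"Window": 1472, "Middle Office": 988, "Door": 390,
                  "Conference Room": 1580, "Stair Access": 475}

# Ratio of cluster reward to single-table reward for each lighting condition.
ratio = {node: REWARD_CLUSTER[node] / REWARD_1000_NODES[node]
         for node in REWARD_CLUSTER}
largest_gain = max(ratio, key=ratio.get)
```

Every ratio is above 1, and the largest gain is in the Middle Office case, where the clustered table collects about 2.35 times the reward of the single shared table.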

5.5. Transfer Learning Results

We use the first week of light data for all the 5 different lighting conditions to train a general Q-Table. After convergence, we use this Q-Table to start the execution of the Day-by-Day Learning experiment instead of starting from an empty Q-Table. Results are reported in Figure 5.

Figure 5. Rewards Obtained with Transfer Learning

All 5 nodes collect positive rewards throughout, even during the first days. Compared to the Day-by-Day Learning experiment this is a notable improvement, since almost all the nodes depleted their energy storage at least once in that experiment, as reported in Table 3. This confirms that the information extracted from a pre-calculated Q-Table built using general lighting trends can be used to speed up learning for different lighting conditions.

6. Conclusion and Future Work

In this work, we proposed and applied several solutions to scale the configuration of energy harvesting sensors with reinforcement learning. We tested an adaptive on-policy RL solution that reduces the training phase after deployment. Results show that nodes can effectively adapt their sensing rate to different lighting conditions without depleting their stored energy, while reducing the number of on-policy trainings by up to 81% compared to a standard policy that runs on-policy training every day. We also show that transfer RL can reduce the training phase, making the nodes operational within the first day. Finally, prior solutions use one RL policy for each sensor node, which limits scalability. We show that a single policy for sensors that share similar lighting conditions can still effectively configure the sensor nodes. We focus on simulation as a proof of concept in this work and will perform real experiments in future work.


This work is supported by the National Science Foundation grant BD Spokes 1636879


  • (2018) 2018. (2018).
  • Udenze et al. (2009) Adrian Udenze et al. 2009. Direct reinforcement learning for autonomous power configuration and control in wireless networks. In Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on. IEEE, 289–296.
  • Atzori et al. ([n. d.]) Luigi Atzori, Antonio Iera, and Giacomo Morabito. [n. d.]. The Internet of Things: A survey. Computer Networks ([n. d.]).
  • Campbell et al. (2016) Bradford Campbell, Joshua Adkins, and Prabal Dutta. 2016. Cinamin: A Perpetual and Nearly Invisible BLE Beacon. In Proceedings of the 2016 International Conference on Embedded Wireless Systems and Networks (EWSN ’16). Junction Publishing, USA, 331–332.
  • Chi et al. (2014) Qingping Chi, Hairong Yan, Chuan Zhang, Zhibo Pang, and Li Da Xu. 2014. A reconfigurable smart sensor interface for industrial WSN in IoT environment. IEEE transactions on industrial informatics 10, 2 (2014), 1417–1425.
  • Chincoli and Liotta (2018) Michele Chincoli and Antonio Liotta. 2018. Self-learning power control in wireless sensor networks. Sensors 18, 2 (2018), 375.
  • CO (2007) Sanio Semiconductor CO. 2007. (2007).
  • Dalamagkidis et al. (2007) Konstantinos Dalamagkidis, Denia Kolokotsa, Konstantinos Kalaitzakis, and George S Stavrakakis. 2007. Reinforcement learning for energy conservation and comfort in buildings. Building and environment 42, 7 (2007), 2686–2698.
  • Dias et al. (2016) Gabriel Martins Dias, Maddalena Nurchis, and Boris Bellalta. 2016. Adapting sampling interval of sensor networks using on-line reinforcement learning. In Internet of Things (WF-IoT), 2016 IEEE 3rd World Forum on. IEEE, 460–465.
  • et al. (2014) B. Campbell et al. 2014. An Energy-harvesting Sensor Architecture and Toolkit for Building Monitoring and Event Detection. In Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings (BuildSys ’14).
  • et al. (2009) Matthew Taylor et al. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10, Jul (2009), 1633–1685.
  • et al. (2011) S. Sudevalayam et al. 2011. Energy harvesting sensor nodes: Survey and implications. IEEE Communications Surveys & Tutorials 13, 3 (2011), 443–461.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. (2017).
  • Finnigan et al. ([n. d.]) S Mitchell Finnigan, AK Clear, Geremy Farr-Wharton, Kim Ladha, and Rob Comber. [n. d.]. Augmenting Audits: Exploring the Role of Sensor Toolkits in Sustainable Buildings Management. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies ([n. d.]).
  • Fraternali et al. (2018) Francesco Fraternali, Bharathan Balaji, Yuvraj Agarwal, Luca Benini, and Rajesh K. Gupta. 2018. Pible: Battery-Free Mote for Perpetual Indoor BLE Applications. In Proceedings of the 5th ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys ’18). ACM.
  • Hester and Sorber (2017) Josiah Hester and Jacob Sorber. 2017. Flicker: Rapid Prototyping for the Batteryless Internet-of-Things. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys ’17). ACM, New York, NY, USA, Article 19, 13 pages.
  • Hester et al. (2017) Josiah Hester, Kevin Storer, and Jacob Sorber. 2017. Timely Execution on Intermittently Powered Batteryless Sensors. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys ’17). ACM, New York, NY, USA, Article 17, 13 pages.
  • Howell (2017) Jenalea Howell. 2017. (2017).
  • Hsu et al. (2006) Jason Hsu, Sadaf Zahedi, Aman Kansal, Mani Srivastava, and Vijay Raghunathan. 2006. Adaptive duty cycling for energy harvesting systems. In Proceedings of the 2006 international symposium on Low power electronics and design. ACM, 180–185.
  • Hsu et al. (2009a) Roy Chaoming Hsu, Cheng-Ting Liu, and Wei-Ming Lee. 2009a. Reinforcement learning-based dynamic power management for energy harvesting wireless sensor network. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 399–408.
  • Hsu et al. (2014) R. C. Hsu, C. T. Liu, and H. L. Wang. 2014. A Reinforcement Learning-Based ToD Provisioning Dynamic Power Management for Sustainable Operation of Energy Harvesting Wireless Sensor Node. IEEE Transactions on Emerging Topics in Computing 2, 2 (June 2014), 181–191.
  • Hsu et al. (2009b) R. C. Hsu, C. T. Liu, K. C. Wang, and W. M. Lee. 2009b. QoS-Aware Power Management for Energy Harvesting Wireless Sensor Network Utilizing Reinforcement Learning. In 2009 International Conference on Computational Science and Engineering, Vol. 2. 537–542.
  • Jayakumar et al. (2014) Hrishikesh Jayakumar, Kangwoo Lee, Woo Suk Lee, Arnab Raha, Younghyun Kim, and Vijay Raghunathan. 2014. Powering the internet of things. In Proceedings of the 2014 international symposium on Low power electronics and design. ACM.
  • Kansal et al. (2007) Aman Kansal, Jason Hsu, Sadaf Zahedi, and Mani B. Srivastava. 2007. Power Management in Energy Harvesting Sensor Networks. ACM Trans. Embed. Comput. Syst. 6, 4, Article 32 (Sept. 2007).
  • Lawson and Ramaswamy (2015) Victor Lawson and Lakshmish Ramaswamy. 2015. Data Quality and Energy Management Tradeoffs in Sensor Service Clouds. In Big Data (BigData Congress), 2015 IEEE International Congress on. IEEE, 749–752.
  • Moser et al. (2010) C. Moser, L. Thiele, D. Brunelli, and L. Benini. 2010. Adaptive Power Management for Environmentally Powered Systems. IEEE Trans. Comput. 59, 4 (April 2010), 478–491.
  • Nurchis et al. (2011) Maddalena Nurchis, Raffaele Bruno, Marco Conti, and Luciano Lenzini. 2011. A Self-adaptive Routing Paradigm for Wireless Mesh Networks Based on Reinforcement Learning. In Proceedings of the 14th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM ’11). ACM, New York, NY, USA, 197–204.
  • Renner (2013) Bernd-Christian Renner. 2013. Sustained Operation of Sensor Nodes with Energy Harvesters and Supercapacitors. BoD–Books on Demand.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
  • Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. Ph.D. Dissertation. King’s College, Cambridge.
  • Yau et al. (2012) Kok-Lim Alvin Yau, Peter Komisarczuk, and Paul D Teal. 2012. Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues. Journal of Network and Computer Applications 35, 1 (2012), 253–267.