With the proliferation of Internet of Things (IoT) devices and the seemingly endless connections among people, the demand for reliable and long-lasting devices is becoming critical. IoT devices are expected to be network connected at all times, even while simply waiting for an event to happen. However, conventional solutions incur significant costs even in this standby state, consuming on the order of a few milliwatts and reducing the useful device lifetime.
One emerging solution to extend battery life is the incorporation of a Low-Power Wake-up Radio (LP-WUR). An LP-WUR consumes only microwatts of power, usually below 100 µW, and is continuously on, listening for a trigger either on the same channel used for data communication or on a dedicated, out-of-band channel. Once a trigger is received, the LP-WUR activates the primary device, which can otherwise remain in a deep sleep state, saving energy. The extremely low power budget of a WUR limits the receiver complexity, the modulation scheme, and thus the overall receiver sensitivity of WUR designs, reducing the effective communication range.
The potential for LP-WUR is rapidly expanding into domains such as Wi-Fi access points, low-power wide area sensor networks, and wildlife monitoring. As LP-WUR moves towards a widely used practical technology, simulations and experimental deployments must be performed to validate proposed hardware designs, network architectures, and wireless communication protocols.
Problem. Current state-of-the-art WUR prototypes, most of which are custom in-lab designs, display significant diversity in their architectures, processing capabilities, energy consumption, and receiver sensitivities. Given such diversity in platform characteristics, choosing the right prototype and protocol for a specific application scenario is challenging. In addition, the available communication protocols are evaluated in restricted settings with ad-hoc experiments and without comparison to competing approaches, making it almost impossible to identify the appropriate protocol for a given prototype radio in a specific application. Moreover, results of experiments performed in one setting (e.g., topology, traffic pattern, interference) with one set of metrics may not hold in others. Currently, there exists no fixed set of accepted testing methods, parameters, or metrics to be applied to a WUR-based system under evaluation. This lack of standardization significantly increases the difficulty of developing new systems and/or applying existing technologies in novel domains.
Solution. To overcome these disparities, we identify the key parameters of a new evaluation methodology, WURBench, offering a step toward enabling accurate and repeatable profiling of WUR-based systems for IoT applications. Further, we outline the issues that must be addressed before full WUR benchmarking can become a reality.
The concept of benchmarking is not new and has been applied to areas such as wireless networking and CPUs to compare performance results. Recently, IoT-Connect, an industry-standard benchmark for embedded systems, has been introduced to evaluate micro-controllers with various connectivity interfaces such as Bluetooth, LoRa, and Wi-Fi. A benchmark typically outlines a set of specifications to follow when evaluating the performance of a system, making experiments repeatable and results directly comparable. These specifications include the definition of the parameters for the experimental setup and the output metrics reflecting the performance of the benchmarked system. Benchmarking WUR is non-trivial, as it requires not only evaluating the WUR prototypes and protocols, but also measuring or modeling the wireless environment, including interference sources that have the potential to significantly affect system performance.
Goals. The main goal of WURBench is to outline a benchmarking framework that will:
provide a set of recommended practices for performance evaluation.
offer reliable indicators in terms of key performance metrics, parameters, and tools for researchers to test and fairly compare new solutions against existing ones or baselines when implementations are not publicly available. WUR hardware designers can also utilize this framework to benchmark devices against competitors.
facilitate a repeatable test environment for WUR-based systems.
II Hardware Micro-benchmarking
The design phase of WUR-based low-power networking starts with choosing an appropriate WUR prototype and performing a series of hardware-specific micro-benchmarks. Micro-benchmarks are small test applications that iterate through the states of the component being tested, e.g., radio, MCU, or LEDs. As WURs are mostly custom designed, micro-benchmarking enables the identification of possible performance bottlenecks at the architecture level, allowing hardware designers to compare and assess design trade-offs. Most often, these micro-benchmarks are conducted in an ad-hoc fashion, limiting comparability. Here, we seek to provide a well-defined structure, defining “what to measure” and a “recommended practice” for measuring, allowing results to be compared across prototypes and the fidelity of the test platform to be verified.
II-A What to measure?
The first step is to define the metrics, i.e., the set of quantitative variables of interest in terms of which hardware performance is measured. At an abstract level, the metrics defined here are hardware-agnostic, making them comparable across various WUR prototypes.
Communication range: the achievable distance between the endpoints to establish a baseline for the performance of a given connection. For instance, evaluating transmit power vs. range is important for WUR deployments.
Successful wake-up rate: the communication reliability of the WUR module, measured as the fraction of triggers successfully received at the receiver over those sent by the sender; its complement is the frame loss. This metric depends on the effective communication range and the strength of the wake-up signal.
Energy consumption: currently there is no common way to universally benchmark the energy efficiency of a WUR, and one must either benchmark it independently or rely on the information provided in the literature. Energy consumption may, for instance, refer to the average power consumption of the WUR in the continuous channel monitoring state, which is of greater importance for IoT applications. Depending on the nature of the micro-benchmark, it may also refer to the energy consumed by the WUR while executing different tasks, e.g., the cost of signal transmission and processing.
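The three metrics above can be derived from a handful of raw measurements per trial. The following is a minimal sketch, not a reference implementation: the record `WakeupTrial`, its field names, and the 90% threshold used to define the achievable range are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class WakeupTrial:
    """Raw measurements from one micro-benchmark trial (hypothetical record)."""
    distance_m: float         # sender-receiver distance in meters
    triggers_sent: int        # wake-up triggers transmitted by the sender
    triggers_received: int    # triggers that successfully woke the receiver
    listen_current_a: float   # mean current while listening, in amperes
    supply_voltage_v: float   # supply voltage, in volts


def wakeup_rate(trial):
    """Fraction of sent triggers that were successfully received."""
    if trial.triggers_sent <= 0:
        raise ValueError("triggers_sent must be positive")
    return trial.triggers_received / trial.triggers_sent


def listen_power_uw(trial):
    """Average power in the channel monitoring state, in microwatts (P = V * I)."""
    return trial.supply_voltage_v * trial.listen_current_a * 1e6


def max_range_m(trials, min_rate=0.9):
    """Largest tested distance whose wake-up rate is at least min_rate.

    The threshold is an assumption; returns None if no trial meets it.
    """
    ok = [t.distance_m for t in trials if wakeup_rate(t) >= min_rate]
    return max(ok) if ok else None
```

Repeating the trials at several transmit powers and calling `max_range_m` for each set gives the transmit power vs. range relationship mentioned under communication range.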
II-B Recommended practice
To measure the defined metrics correctly, it is necessary to lay out the steps to follow while conducting these experiments and to specify the experimental parameters that characterize each micro-benchmark. The parameters are the configurations that control the execution of the micro-benchmark. This is a critical piece of the evaluation, as comparing WURs without outlining all the configurations brings into question the soundness of the comparison. The main parameters are identified as:
Physical layer (PHY) settings: Most WURs support various PHY settings that include bit rates, transmission power, and modulation. Often, some of these parameters are not identified, making it difficult to directly compare the prototypes, as seen in . Therefore, the configurations used in each trial must be reported with the results.
Antenna orientation: The radiation patterns of the antenna determine the performance of wireless devices. It thus becomes important to state the antenna orientation and type in combination with frequency and transmit power.
Trial duration: Wireless links change over time due to subtle changes in environmental conditions. Therefore, the evaluation should be spread throughout a 24-hour period, and long-duration experiments should be preferred.
Firmware: The firmware version used in the tests, as well as any functions that have been disabled, should be reported together with the results.
Environment: The micro-benchmarking process is incomplete without describing the characteristics of the set-up environment, which can be based on either a real-life use case or an artificial test environment. For a fair comparison, describing and documenting the scenarios for later analysis is critical. The test setups can be divided coarsely into two categories: i) shielded, where cabling or RF shielding techniques are used to attenuate external signals and noise; and ii) open-air, i.e., environments that mimic the actual use case for the WUR, such as indoor or outdoor environments with line-of-sight and non-line-of-sight links.
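One way to make sure that every parameter in this list accompanies the results is to record them in a structured form and reject incomplete descriptions. The sketch below is an illustration only: the class `MicroBenchmarkSetup`, its field names, and the JSON output are assumptions made here, not part of any existing tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MicroBenchmarkSetup:
    """Parameters to report with each trial; None means 'not reported'."""
    bit_rate_bps: Optional[int] = None
    tx_power_dbm: Optional[float] = None
    modulation: Optional[str] = None
    frequency_mhz: Optional[float] = None
    antenna_type: Optional[str] = None
    antenna_orientation: Optional[str] = None
    trial_duration_h: Optional[float] = None
    firmware_version: Optional[str] = None
    disabled_functions: Optional[list] = None  # empty list = nothing disabled
    environment: Optional[str] = None          # "shielded" or "open-air"


def unreported(setup):
    """Names of the parameters that were left unreported, in field order."""
    return [name for name, value in asdict(setup).items() if value is None]


def to_report(setup):
    """Serialize the setup to JSON, refusing incomplete descriptions."""
    missing = unreported(setup)
    if missing:
        raise ValueError("unreported parameters: " + ", ".join(missing))
    if setup.environment not in ("shielded", "open-air"):
        raise ValueError("environment must be 'shielded' or 'open-air'")
    return json.dumps(asdict(setup), sort_keys=True)
```

Storing the resulting JSON string next to the measured data keeps the configuration and the results together for later comparison across prototypes.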
III Benchmarking System as a Whole
While it is important to isolate functions for performance testing, benchmarking the software separately may not reveal all the functionalities or vulnerabilities of a system. As software alone may not be able to capture underlying hardware behavior such as interrupts, timings, and design flaws, a performance test that is as complete as possible must be conducted. In other words, benchmarking the whole system, including the interaction between hardware and software, provides the most realistic evaluation. For WSN systems, validation of the whole system is carried out using both simulations and testbeds, following these key steps: defining the application scenario, choosing or implementing a communication protocol, conducting a large set of experiments, measuring the performance in terms of defined metrics, and comparing the results. This seemingly simple task of benchmarking is surprisingly challenging for WUR networks, as it requires an evaluation of the entire system to capture real-world operational conditions. This in turn requires a large number of experiments, which can be tedious and error-prone. Moreover, the complexity of this evaluation is compounded by the lack of control over experimental conditions and the lack of evaluation tools. Furthermore, results obtained from ad-hoc experiments are difficult to compare with results gathered from different wireless testbeds and simulations, hindering repeatability. In this section, we focus on enabling benchmarking of WUR-based systems as a whole by presenting a set of testing methods, application scenarios, parameters, and metrics to be applied to a protocol under test, using either testbeds or simulation tools.
III-A Application settings
The first critical element to define for a system as a whole is the environment in which it is expected to be deployed, as this defines many of the key environmental parameters that influence a deployment. For example, the size of the area to be covered and the density of measurement points affect the network topology. Further, the application needs often direct the choice of the protocol; for example, data collection favors a unidirectional focus, while control systems emphasize latency. These choices must be outlined clearly, as they help focus the applicable metrics, which we address in the following section.
Next, we define key system performance and techno-economic metrics that we consider in the definition of WURBench. These can be divided into four categories:
Power consumption: computed using the amount of time a node keeps its radio on in different states such as Receive (RX), Transmit (TX), Idle, and Sleep. The computation should also consider the duty cycle patterns of both radios, i.e., the WUR and the main data transceiver, to detect even small deviations that may have a substantial effect on the device lifetime in real deployments.
Reliability: defined as the fraction of application data packets successfully received over those sent. This is an indication of the level of service provided to applications in delivering sensed data, especially relevant when WUR is considered an option for safety-critical systems.
Latency: defined as the end-to-end packet delivery delay from the time of generation to reception.
Cost: one limitation of WUR comes from its inherently low power budget in continuous listening mode, which limits the feasible distance between two devices. As such, the WUR needs to be combined with another system, typically a high-power node, resulting in a dense network. Replacing a standard node with a WUR-based one may therefore incur additional cost. The system cost should thus be calculated not only for the main sensor node but also for the extra WUR hardware.
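The first three metrics follow directly from logged state times and packet timestamps. The sketch below shows one possible way to compute them; the function names, the dictionary-based log format, and the choice of units (seconds, milliwatts) are assumptions made for illustration, and the per-state power draws must come from measurements or datasheets of the platform under test.

```python
def average_power_mw(state_time_s, state_power_mw):
    """Time-weighted average power of one radio over its states.

    state_time_s:   seconds spent in each state, e.g. {"RX": ..., "SLEEP": ...}
    state_power_mw: power draw of each state in milliwatts (same keys)
    """
    total_s = sum(state_time_s.values())
    if total_s <= 0:
        raise ValueError("total time must be positive")
    # mW * s = mJ, so this sum is the energy consumed over the experiment
    energy_mj = sum(state_time_s[s] * state_power_mw[s] for s in state_time_s)
    return energy_mj / total_s


def reliability(sent_ids, received_ids):
    """Fraction of application packets received over those sent."""
    sent = set(sent_ids)
    if not sent:
        raise ValueError("no packets were sent")
    return len(sent & set(received_ids)) / len(sent)


def latencies_s(generated_at, received_at):
    """End-to-end delay of each delivered packet: reception minus generation time."""
    return {
        pid: received_at[pid] - generated_at[pid]
        for pid in received_at
        if pid in generated_at
    }
```

For a node with two radios, `average_power_mw` can be evaluated once for the WUR and once for the main transceiver and the two results added, so that the duty cycle of each radio is accounted for separately.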
III-C Evaluation mechanisms
Before proceeding with field tests, simulations and testbeds are the main tools for performance analysis of wireless systems, allowing researchers to perform repeatable experiments.
Simulations. As WUR technology is still in its relative infancy, many simulators have been extended for evaluating LP-WUR protocols. For benchmarking, simulators offer many advantages over testbeds. Various network topologies such as single- and multi-hop with different traffic patterns can be implemented and optimized with easy data collection for extracting the metrics. Large-scale networks for scalability analysis can be easily modeled, which otherwise would be too expensive to realize using testbeds. Furthermore, repeatability is easily achievable in simulations. On the other hand, simulators are criticized for not being able to capture all details, especially at the PHY layer, such as path loss, fading, and interference, bringing into question the applicability of simulation results. Nevertheless, simulation can provide valuable results.
WUR simulators. Recent interest in WURs demands simulation support to allow systematic exploration of this novel technology. OMNeT++ extensions provide a modular simulation model for WURs, which employs the MiXiM framework and offers reliable primitives for wireless signal propagation, energy consumption, and a complete networking stack. Similarly, GreenCastalia simulates a power model for wake-up receivers. These simulators, however, do not offer code portability from simulation to a real system. For this purpose, COOJA, a network simulator widely used in the WSN community, has been augmented with WUR support in WaCo. COOJA supports node emulation for MSP430 and AVR platforms and uses binary, deployment-ready firmware, providing the ability to move between simulated and real experiments. It offers a full networking stack with various signal propagation models such as Multi-path Ray-tracing and Unit Disk Graph. Moreover, it allows simulation of multiple embedded operating systems, and WaCo is also the first open-source tool of its kind (https://github.com/waco-sim). As an example of the ease of comparing protocols with WaCo, Fig. 1 illustrates the benchmarking of three different MAC protocols, namely wake-up radio (W-MAC), duty cycling (ContikiMAC), and always-on (NullRDC), for a network of 100 nodes running the Collection Tree Protocol. For a fair comparison, the same application was run on top of all the MAC protocols with the same settings while varying the network traffic. As expected, the WUR solution not only improves the network reliability but also reduces the overall latency compared to the other MAC protocols, motivating further study of such systems using testbeds.
Testbeds. There is an increasing demand for experimentally-supported results to identify issues that cannot be captured through simulation or theory alone. This observation is reflected in the topics of the flagship conferences that increasingly encourage experimentally-driven research for validation.
Various shared and private testbeds exist, including the FIT IoT-LAB, FlockLab, and Indriya. These allow scheduling experiments remotely, executing protocols directly on hardware, as well as collecting and extracting metrics of interest from the logged data. However, none of these testbeds currently support WUR functionality. We discuss next some of the key functionalities that the testbeds need to offer and how they could be implemented for benchmarking WURs.
WUR interface: first and foremost, testbeds must offer hardware with a WUR interface. One cost-effective option is to support only a few nodes.
Experiment configurations: the testbeds must provide experiment scheduling capabilities with the ability to configure a number of system parameters such as network topology and size, traffic load and pattern, experiment duration, and physical layer settings for both the WUR and the main data transceiver. As noted, these configurations must be clearly reported to allow comparison.
Monitoring the environment: test facilities should provide information about the environmental conditions during the experiment. External wireless interference degrades network performance, and spectrum analysis is indispensable for investigating it. The testbed infrastructure should allow recording and replaying of wireless traces. Tools such as JamLab are key to producing repeatable interference. Temperature, in turn, affects the clock oscillators of the devices. TempLab offers temperature profiles for sensor nodes.
Data archiving and sharing: to extend the value of measurements beyond a specific case study, open-source data repositories are necessary. This facilitates archiving, publishing, and comparing of system performance data.
Result analysis: testbed infrastructure should be able to extract the key metrics such as power consumption, end-to-end reliability, and data latency. For instance, these metrics can be extracted non-intrusively on testbeds using tools such as IoT-Connect or D-Cube, avoiding the probing effects of instrumentation.
Ideally, experiments should be performed in multiple testbeds to achieve statistical significance. However, it is critical to do an “apples to apples” comparison in any benchmarking exercise, and failure to consider all variables can produce results that are misleading or even erroneous.
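When the same experiment is repeated over several runs or testbeds, reporting a mean together with a confidence interval makes the comparison more robust than a single number. The sketch below is one simple way to do this; the function name `summarize` is an assumption, and the interval uses a normal approximation, which is optimistic for a small number of runs (a Student's t interval would be wider).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(values, confidence=0.95):
    """Return (mean, lower, upper) for one metric measured over repeated runs.

    Uses a normal approximation of the confidence interval, which is an
    assumption made here for simplicity.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two runs are needed")
    m = mean(values)
    # two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(values) / sqrt(n)
    return m, m - half_width, m + half_width
```

Two protocols whose intervals for the same metric overlap substantially should not be declared different on the basis of their means alone.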
This paper offers the first steps toward WURBench, a benchmarking framework tailored to the unique properties of the emerging wake-up radio technology. Solidifying this framework and encouraging its use in the research community will contribute to the maturation and wide adoption of a technology that promises to revolutionize wireless systems.
-  R. Piyare, A. L. Murphy, C. Kiraly, P. Tosato, and D. Brunelli, “Ultra Low Power Wake-Up Radios: A Hardware and Networking Survey,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2117–2157, Fourthquarter 2017.
-  F. A. Aoudia, M. Gautier, M. Magno, M. Le Gentil, O. Berder, and L. Benini, “Long-short range communication network leveraging LoRa™ and wake-up receiver,” Microprocessors and Microsystems, vol. 56, pp. 184–192, 2018.
-  C. A. Boano, S. Duquennoy, A. Förster, O. Gnawali, R. Jacob, H.-S. Kim, O. Landsiedel, R. Marfievici, L. Mottola, G. P. Picco, X. Vilajosana, T. Watteyne, and M. Zimmerling, “IoTBench: Towards a Benchmark for Low-power Wireless Networking,” in Workshop on Benchmarking Cyber-Physical Networks and Systems, 2018.
-  CPU Benchmarks, accessed May 6, 2018, https://www.cpubenchmark.net/cpu_list.php.
-  EEMBC, ULP MARK, accessed May 15, 2018, https://www.eembc.org/iot-connect/about.php.
-  M. Wadhwa, M. Song, V. Rali, and S. Shetty, “The Impact of Antenna Orientation on Wireless Sensor Network Performance,” in 2nd IEEE International Conference on Computer Science and Information Technology, Aug 2009, pp. 143–147.
-  J. Oller, I. Demirkol, J. Casademont, J. Paradells, G. U. Gamm, and L. Reindl, “Has Time Come to Switch From Duty-Cycled MAC Protocols to Wake-Up Radio for Wireless Sensor Networks?” IEEE/ACM Transactions on Networking, vol. 24, no. 2, pp. 674–687, April 2016.
-  D. Spenza, M. Magno, S. Basagni, L. Benini, M. Paoli, and C. Petrioli, “Beyond duty cycling: Wake-up radio with selective awakenings for long-lived wireless sensing systems,” in IEEE INFOCOM, April 2015, pp. 522–530.
-  R. Piyare, T. Istomin, and A. L. Murphy, “WaCo: A Wake-Up Radio COOJA Extension for Simulating Ultra Low Power Radios,” in ACM International Conference on Embedded Wireless Systems and Networks, 2017, pp. 48–53.
-  FIT IOT-LAB, accessed May 17, 2018, https://www.iot-lab.info/.
-  R. Lim, F. Ferrari, M. Zimmerling, C. Walser, P. Sommer, and J. Beutel, “FlockLab: A testbed for distributed, synchronized tracing and profiling of wireless embedded systems,” in ACM/IEEE International Conference on Information Processing in Sensor Networks, April 2013, pp. 153–165.
-  M. Doddavenkatappa, M. C. Chan, and A. L. Ananda, “Indriya: A Low-Cost, 3D Wireless Sensor Network Testbed,” in Testbeds and Research Infrastructure. Development of Networks and Communities. Springer Berlin Heidelberg, 2012, pp. 302–316.
-  F. Sutton, B. Buchli, J. Beutel, and L. Thiele, “Zippy: On-Demand Network Flooding,” in 13th ACM Conference on Embedded Networked Sensor Systems, NY, USA, 2015, pp. 45–58.
-  C. A. Boano, T. Voigt, C. Noda, K. Römer, and M. Zúñiga, “JamLab: Augmenting sensornet testbeds with realistic and controlled interference generation,” in 10th ACM/IEEE International Conference on Information Processing in Sensor Networks, April 2011, pp. 175–186.
-  C. A. Boano, M. Zúñiga, J. Brown, U. Roedig, C. Keppitiyagama, and K. Römer, “TempLab: A testbed infrastructure to study the impact of temperature on wireless sensor networks,” in 13th International Symposium on Information Processing in Sensor Networks, April 2014, pp. 95–106.
-  M. Schuß, C. A. Boano, and K. Römer, “Moving Beyond Competitions: Extending D-Cube to Seamlessly Benchmark Low-Power Wireless Systems,” in Workshop on Benchmarking Cyber-Physical Networks and Systems, Apr. 2018.