Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

10/25/2018 ∙ by Thomas Kriechbaumer, et al. ∙ Technische Universität München

Electrical energy consumption has been an ongoing research area since the advent of smart homes and Internet of Things devices. Consumption characteristics and usage profiles are directly influenced by building occupants and their interaction with electrical appliances. Information extracted from these data can be used to conserve energy and increase user comfort levels. Data analysis together with machine learning models can be utilized to extract valuable information for the benefit of occupants themselves, power plants, and grid operators. Public energy datasets provide a scientific foundation to develop and benchmark these algorithms and techniques. With datasets exceeding tens of terabytes, we present a novel study of five whole-building energy datasets with high sampling rates, their signal entropy, and how a well-calibrated measurement can have a significant effect on the overall storage requirements. We show that some datasets do not fully utilize the available measurement precision, therefore leaving potential accuracy and space savings untapped. We benchmark a comprehensive list of 365 file formats, transparent data transformations, and lossless compression algorithms. The primary goal is to reduce the overall dataset size while maintaining an easy-to-use file format and access API. We show that with careful selection of file format and encoding scheme, we can reduce the size of some datasets by up to 73




1. Introduction

Home and building automation promise many benefits for the occupants and power utilities. From increased user comfort levels to demand response and lower electricity costs, Smart Homes offer a variety of assistance and informational gains. Internet of Things, a combination of sensors and actuators, can be intelligently controlled based on sensor data or external triggers. Power monitoring and smart metering are a key step to fulfill these promises. The influx of renewable energies and the increased momentum of changes in the power grid and its operations are a main driving factor for further research in this area.

Non-intrusive load monitoring (NILM) can be one solution to identify and disaggregate power consumers (appliances) from a single-point measurement in the building. Utilizing a centralized data acquisition system saves costs for hardware and installation in the electrical circuits under observation. The NILM community heavily relies on long-term measurement data, in the form of public datasets, to craft new algorithms, train models, and evaluate their accuracy on per-appliance energy consumption or appliance identification. In recent years these datasets grew significantly in size and sampling characteristics (temporal and amplitude resolution). Collecting, distributing, and managing large-scale data storage facilities is an ongoing research topic (Yuan et al., 2010; Deelman and Chervenak, 2008) and strongly depends on the environment and systems architecture.

High sampling rates are particularly interesting for NILM to extract waveform information from voltage and current signals (Kahl et al., 2017). Early datasets targeted at load disaggregation and appliance identification started with under (Kolter and Johnson, [n. d.]), whereas recently published datasets reach nearly of raw data (Kriechbaumer and Jacobsen, 2018). Working with such quantities requires specialized storage and processing techniques which can be costly and maintenance-heavy. Optimizing infrastructure costs for storage is part of ongoing research (Liu and Shen, 2017; Puttaswamy et al., 2012).

The data quality requirements typically define a fixed sampling rate and bit-resolution for a static environment. Removing or augmenting measurements might impede further research, therefore no filtering or preprocessing steps are performed before releasing the data.

Data compression techniques can be classified as lossy or lossless (Bookstein and Storer, 1992). Lossy algorithms allow for some margin of error when encoding the data and typically give a metric for the remaining accuracy or lost precision. For comparison, most audio, image, and video compression algorithms remove information not detectable by a human ear or eye. This allows for a data rate reduction in areas of the signal that a user cannot detect or perceives only at a reduced resolution due to typical human physiology. Depending on the targeted use case, certain aspects of the input signal are considered unimportant and might not be reconstructable. Encoding only the amplitude and frequency of the signal can lead to vast space savings, assuming phase alignment, harmonics, or other signal characteristics are not required for future analysis. In contrast, lossless encoding schemes guarantee a 1:1 representation of all measurement data with a reversible data transformation. If the intended use case or audience for a given dataset is not known or is very diverse in its requirements, only lossless compression can be applied to keep all data accessible for future use. Recent works pointed out an imbalance in the amount of research on steady-state versus waveform-based compression of electricity signals (de Souza et al., 2017).

Further consideration must be given to communication bandwidth (transmission to a remote endpoint) and in-memory processing (SIMD computation). The efficient use of network channels can be a key requirement for real-time monitoring of streaming data. In the case of one-time transfers (or burst transmissions), chunking is used to split large datasets into more manageable (smaller) files. However, choosing a maximum file size depends on the available memory and CPU (instruction set and cache size). Distributing large datasets as a single file creates an unnecessary burden for researchers and required infrastructure.

A suitable file format must be considered for raw data storage, as well as easy access to metadata, such as calibration factors, timestamps, and identifier tags. None of the existing datasets (NILM or related datasets with high sampling rates) share a common file format, chunk size, or signal sampling distribution. This heterogeneity makes it difficult to apply algorithms and evaluation pipelines on more than one dataset. Therefore, researchers working with multiple datasets have to implement custom importer and converter stages, which can be time-consuming and error-prone.

This work provides an in-depth analysis of public whole-building datasets, and gives a comprehensive evaluation of best-practice storage techniques and signal conditioning in the context of energy data collection. The key contributions of this work are:

  1. A numerical analysis of signal entropy and measurement calibration of public whole-building energy datasets by evaluating all signal channels with respect to their available resolution and sample distribution over the entire measurement period. The resulting entropy metrics further motivate our contributions and the need for a well-calibrated measurement system.

  2. An exhaustive benchmark of storage requirements and potential space savings with a comprehensive collection of 365 file formats, lossless compression techniques, and reversible data transformations. We re-encode and normalize data from all datasets to evaluate the effect of compression. We present the best-performing combinations and their overall space savings. The full ranking can be used to select the optimal file format and compression for offline storage of large long-term energy datasets.

  3. A full-scale evaluation of increasingly larger data chunks per file and their final compression ratio. The dependency between input size and achievable compression ratio is evaluated up to per file. The results provide an evidence-based guideline for future selection of chunk sizes and possible environmental factors for consideration.

We give an in-depth evaluation of file formats and signal characteristics that directly affect storage, encoding, and compression of such data. Each of the analyzed datasets was created with a dedicated set of requirements, therefore, a single best option does not exist. However, with this study, we want to help the community to better understand the fundamental causes of compression performance in the field of waveform-based whole-building energy datasets. We provide a definition of measurement calibration and its effects on the storage requirements based on signal entropy. Published datasets are self-contained and final, which allows us to prioritize the compression ratio and achievable space saving over other common compression metrics (CPU load, throughput, or latency). We define the achievable space saving and compression ratio as the only criterion when dealing with large (offline) datasets.

The rest of this paper is structured as follows: We discuss related work in Section 2. We describe the evaluated datasets in Section 3, which are then used in the experiments in Sections 4, 5, and 6. Finally, we present results in Section 7, before concluding in Section 8.

2. Related Work

NILM and related fields distinguish between low and high sampling rates to capture voltage and current measurements. Low sampling rates (or low-frequency) are typically or slower. High sampling rates (or high-frequency) are typically above (or at least the Nyquist–Shannon sampling theorem (Shannon, 1949)). Recording multiple channels with high sampling rates requires oscilloscopes or specialized data acquisition systems as presented in (Kriechbaumer et al., 2017; Haq et al., 2017; Meziane et al., 2016).

Low-frequency energy data can benefit greatly from compression when applied to smart meter data, as multiple recent works have shown (Ringwelski et al., 2012; Unterweger and Engel, 2015; Unterweger et al., 2015; Eichinger et al., 2015). Electricity smart meters can be a source of high data volume with measurement intervals of , , , or higher. Possible transmission and storage savings due to lossless compression have been evaluated in (Unterweger et al., 2015). While the achievable compression ratio increased with smaller sampling intervals, the benefits of compression vanish quickly above intervals. Various encodings (ASCII- and binary-based) have been evaluated for such low-frequency measurements, and in most cases, a binary encoding greatly outperforms an ASCII-based encoding. The need for smart data compression was discussed in (Nabeel et al., 2013), which further motivates in-depth research in this area. The main focus of the authors was smart meter data with low temporal resolution from 10,000 meters or more. Various compression techniques were presented and a fast-streaming differential compression algorithm was evaluated: removing steady-state power measurements () can save on average 62% of required storage space.

High-frequency energy data offers a significantly larger potential for lossless compression, due to the inherent repeating waveform signal. Tariq et al. (2015) utilized general-purpose compressors, such as LZMA and bzip2, and achieved good compression ratios on some datasets. Applying differential compression and omitting timestamps can yield size reductions of up to 98% on smart grid data, however, these results are not comparable as there is no generalized uniform data source. The presented results use a single data channel and an ASCII-based data representation as a baseline for their comparison, which contains an inherent encoding overhead. The SURF file format (Pereira et al., 2014) was designed to store NILM datasets and provide an API to create and modify such files. The internal structure is based on wave-audio and augments it with new types of metadata chunks. To the best of our knowledge, the SURF file format didn’t gain any traction due to its lack of support in common scientific computing frameworks. The recently published EMD-DF file format (Pereira, 2017), by the same authors, relies on the same wave-audio encoding, while extending it with more metadata and annotations. Neither SURF nor EMD-DF provides any built-in support for compression. The power grid community defined the PQDIF (Society, 2018) (for power quality and quantity monitoring) and COMTRADE (Association, 2018) (for transient data in power systems) file formats. Both specifications outline a structured view of numerical data in the context of energy measurements. Raw measurements are augmented with precomputed evaluations (statistical metrics), which can cause a significant overhead in required storage space. While PQDIF supports a simple LZ compression, COMTRADE does not offer such capabilities. To the best of our knowledge, these file formats never gained traction outside the power grid operations community.

Lossy compression can achieve compression ratios multiple orders of magnitude higher than lossless compression, with minimal loss of accuracy for certain use cases (Eichinger et al., 2015). Using piecewise polynomial regression, the authors achieved good compression ratios on three existing smart grid scenarios. The compressed parametric representation was stored in a relational database system. However, this approach only applies if the use case and expected data transformation are known before applying a lossy data reduction. A 2-dimensional representation for power quality data was proposed in (Gerek and Ece, 2004) and (Qing et al., 2011), which could then be used to employ compression approaches from image processing and other related fields. While both approaches can be categorized as lossy compression due to their numerical approximation using wavelets or trigonometric functions, they require a specialized encoder and decoder which is not readily available in scientific computing frameworks.

The NilmDB project (Paris et al., 2014) provides a generalized user interface to access, query, and analyze large time-series datasets in the context of power quality diagnostics and NILM. A distributed architecture and a custom storage format were employed to work efficiently with “big data”. The underlying data persistence is organized hierarchically in the filesystem and utilizes tree-based structures to reduce storage overhead. This internal data representation is capable of handling multiple streams and non-uniform data rates but lacks support for data compression or more efficient coding schemes. NILMTK (Batra et al., 2014), an open-source NILM toolkit, provides an evaluation workbench for power disaggregation and uses the HDF5 (Folk et al., 2011) file format with a custom metadata structure. Most available public datasets require a specialized converter to import them into a NILMTK-usable file format. While the documentation states that a zlib data compression is applied, some converters currently use bzip2 or Blosc (Alted, 2017).

3. Evaluated Datasets

While there is a vast pool of smart meter datasets (http://wiki.nilm.eu/datasets.html), i.e., low sampling rates of measurements every , , or , a majority of the underlying information is already lost (signal waveform). The raw signals are aggregated into single root-mean-squared voltage and current readings, frequency spectrums, or other metrics accumulated over the last measurement interval. This can already be classified as a type of lossy compression. For some use cases, this data source is sufficient to work with, while other fields require high sampling rates to extract more information from the signals.

All following experiments and evaluations were performed on publicly accessible datasets: the Reference Energy Disaggregation Data Set (REDD (Kolter and Johnson, [n. d.])), Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED (Anderson et al., 2012)), UK Domestic Appliance-Level Electricity dataset (UK-DALE (Kelly and Knottenbelt, 2015)), and the Building-Level Office eNvironment Dataset (BLOND (Kriechbaumer and Jacobsen, 2018)). We will refer to these datasets by their established acronyms: REDD, BLUED, UK-DALE, and BLOND. Based on the energy dataset survey provided by the NILM-Wiki (http://wiki.nilm.eu/datasets.html), these are all datasets of long-term continuous measurements with voltage and current waveforms from selected buildings or households. The data acquisition systems and data types are sufficiently comparable to warrant their use in this context (Table 1).

Measurement systems and their analog-to-digital converters (ADC) always output a unit-less integer number, either between for unipolar ADCs or for bipolar ADCs. During setup and calibration, a common factor is determined to convert raw values into a voltage or current reading. Some datasets publish raw values and the corresponding calibration factors, while others publish directly Volt- and Ampere-based readings as float values. Datasets only available as floating-point values are converted back into their original integer representation without loss of precision by reversing the calibration step from the analog-to-digital converter for each channel:
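The reversal of the calibration step can be sketched as follows. This is a minimal illustration, not the conversion code used for the study: it assumes a single multiplicative calibration factor per channel and a bipolar (signed) ADC, and the names `to_raw_adc` and `is_lossless` are hypothetical.

```python
def to_raw_adc(readings, calibration_factor, bits):
    """Convert calibrated float readings back to signed integer ADC codes.

    Assumes reading = code * calibration_factor (one factor per channel).
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    raw = []
    for value in readings:
        code = round(value / calibration_factor)
        if not lo <= code <= hi:
            raise ValueError(f"{value} maps outside the {bits}-bit range")
        raw.append(code)
    return raw


def is_lossless(readings, raw, calibration_factor, tolerance=1e-9):
    """Check that re-applying the factor reproduces the published floats."""
    return all(
        abs(code * calibration_factor - value) <= tolerance * max(1.0, abs(value))
        for code, value in zip(raw, readings)
    )
```

Running `is_lossless` over a converted channel is a cheap way to confirm that the integer representation carries the same information as the published floating-point values.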

Each of the mentioned datasets was published in a different (compressed) file format and encoding scheme. To allow for comparisons between these datasets, we decompressed, normalized, and re-encoded all data before analyzing them (raw binary encoding).

From REDD, we used the entire available High Frequency Raw Data: house_3 and house_5, each with 3 channels: current_1, current_2, and voltage. The custom file format encodes a single channel per file. In total, of raw data from 126 files were used.

From BLUED, we used all available waveform data (1 location, 16 sub-datasets) and 3 channels: current_a, current_b, voltage. The CSV-like text files contain voltage and two current channels and a dedicated measurement timestamp. In total, of raw data from 6430 files were used.

From UK-DALE, we selected house_1 from the most recent release (UK-DALE-2017-16kHz, the longest continuous recording). The compressed FLAC files contain 2 channels: current and voltage. In total, of raw data from 19491 files were used.

From BLOND, we selected the aggregated mains data of both sub-datasets: BLOND-50 and BLOND-250. The HDF5 files with gzip compression contain 6 channels: current{1-3} and voltage{1-3}. In total, of raw data from 61125 files of BLOND-50, and of raw data from 35490 files of BLOND-250 were used.

Dataset   | Current Channels | Voltage Channels | Sampling Rate | Values
REDD      | 2                | 1                |               | 24-bit
BLUED     | 2                | 1                |               | 16-bit
UK-DALE   | 1                | 1                |               | 24-bit
BLOND-50  | 3                | 3                |               | 16-bit
BLOND-250 | 3                | 3                |               | 16-bit
Table 1. Overview of evaluated datasets: long-term continuous measurements containing raw voltage and current waveforms.

The data acquisition systems (DAQ) of all datasets produce a linear pulse-code modulated (LPCM) stream. The analog signals are sampled in uniform intervals and converted to digital values (Figure 1). The quantization levels are distributed linearly in a fixed measurement range which requires a signal conditioning step in the DAQ system. ADCs typically cannot directly measure mains voltage and require a step-down converter or measurement probe. Mains current signals need to be converted into a proportional voltage.

4. Entropy Analysis

DAQ units provide a way to collect digital values from analog systems. As such, the quality of the data depends strongly on the correct calibration and selection of measurement equipment. Mains electricity signals are typically not compatible with modern digital systems, requiring an indirect measurement through step-down transformers or other metrics. Mains voltage can vary by up to during normal operation of the grid (European Committee for Electrotechnical Standardization, 1989; American National Standards Institute, 2016), making it necessary to design the measurement range with a safety margin. The expected signal, plus any margin for spikes, should be equally distributed on the available ADC resolution range. Leaving large areas of the available value range unused can be prevented by carefully selecting input characteristics and signal conditioning (step-down calibration). A rule of thumb for range calibration is that the expected signal should occupy 80-90%, leaving enough bandwidth for unexpected measurements. Input signals larger than the measurement range get recorded as the minimum/maximum value. Grossly exceeding the rated input signal level could damage the ADC, unless a dedicated signal conditioning and protection is employed.
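To illustrate the range calibration rule of thumb and the saturation behavior, the following sketch quantizes a sinusoidal mains voltage with a signed 16-bit ADC. All numbers are assumptions chosen for illustration (230 V RMS, 200 samples per cycle, the expected peak placed at 85% of the range), and `quantize` is a hypothetical helper, not part of any DAQ library.

```python
import math


def quantize(signal, full_scale, bits=16):
    """Map analog values to signed LPCM codes, saturating at the range limits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    codes = []
    for value in signal:
        code = round(value / full_scale * hi)
        codes.append(max(lo, min(hi, code)))  # out-of-range input is clipped
    return codes


# Assumed setup: 230 V RMS mains, measurement range sized so that the
# expected peak occupies roughly 85% of the available ADC range.
peak = 230 * math.sqrt(2)
full_scale = peak / 0.85
samples_per_cycle = 200
wave = [
    peak * math.sin(2 * math.pi * n / samples_per_cycle)
    for n in range(samples_per_cycle)
]
codes = quantize(wave, full_scale)

# Share of the 16-bit value range spanned by the recorded signal.
utilization = (max(codes) - min(codes) + 1) / (1 << 16)
```

Choosing a smaller `full_scale` increases the utilization but reduces the margin for spikes; any value beyond the range is recorded as the minimum or maximum code.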

Figure 1. Linear pulse-code modulation stream of a sinusoidal waveform sampled with a 16-bit ADC. The waveform corresponds to a 230 V mains voltage signal.

We extracted the probability mass function (PMF) of all evaluated datasets for the full bit-range (16- or 24-bit). The value histogram is a structure mapping each possible measurement value (integer) to the number of times this value was recorded. Ideally, the region between the lowest and highest value contains a continuous value range without gaps. However, the quantization level (step size) could cause a mismatch and results in skipped values. We then normalize this histogram to obtain the PMF and compute the signal entropy per channel, which gives an estimation of the actual information contained in the raw data and provides a lower bound for the achievable compression ratio based on the Kolmogorov complexity.
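The histogram, PMF, and entropy computation described above can be sketched with the Python standard library. This is a minimal sketch assuming integer ADC samples of one channel; the helper names are hypothetical, and the derived saving is only a zeroth-order estimate because it ignores correlation between neighboring samples.

```python
import math
from collections import Counter


def value_histogram(samples):
    """Map each observed ADC value to the number of times it was recorded."""
    return Counter(samples)


def shannon_entropy(hist):
    """Entropy in bits per sample of the PMF obtained by normalizing hist."""
    total = sum(hist.values())
    entropy = 0.0
    for count in hist.values():
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def zeroth_order_saving(hist, bits):
    """Estimated space saving if every sample used only `entropy` bits
    instead of the full `bits` of the storage type (assumption: samples
    are treated as independent)."""
    return 1.0 - shannon_entropy(hist) / bits
```

For example, a channel whose samples are uniformly spread over four distinct values carries 2 bits per sample, regardless of whether it is stored as a 16-bit or 24-bit integer.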

Each dataset is split into multiple files, making it necessary to merge all histograms into a total result at the end of the computing run. Since all histograms can be combined with a simple summation, the process can be parallelized and computed without any particular order. Computing and merging all histograms is, therefore, best accomplished in a distributed compute cluster with multiple nodes or similar environments.
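Because merging is a plain summation, per-file histograms can be combined in any order. The sketch below shows this property with in-memory sample lists standing in for dataset files; the function names are hypothetical, and distributing `file_histogram` across worker nodes is left to whatever job scheduler is available.

```python
from collections import Counter


def file_histogram(samples):
    """Histogram of one file; can run independently on any worker."""
    return Counter(samples)


def merge_histograms(histograms):
    """Sum per-file histograms into a total; the order does not matter."""
    total = Counter()
    for hist in histograms:
        total.update(hist)  # Counter.update adds counts per key
    return total


# Stand-ins for three dataset files of one channel.
files = [[1, 2, 2], [2, 3], [1]]
partial = [file_histogram(f) for f in files]
merged = merge_histograms(partial)
```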

5. Data Representation

Choosing a suitable file format for potentially large datasets involves multiple tradeoffs and decisions, including supported platforms, scientific computing frameworks, metadata, error correction, compression, and chunking. The available choices for data representation range from CSV data (ASCII-parsable) to binary file formats and custom encoding schemes. From the energy dataset survey and the evaluated datasets, it can be noted that every dataset uses a different file format, encoding scheme, and, optionally, compression.

Publishing and distributing large datasets requires storage systems capable of providing long-term archives of scientific measurement data. Lossless compression helps to minimize storage costs and distribution efforts. At the same time, other researchers accessing the data benefit from smaller files and shorter access times to download the data.

Electricity signals (current and voltage) contain a repetitive waveform with some form of distortion depending on the load. In an ideal power grid, the voltage would follow a perfect sinusoidal waveform without any offset or error. This would allow us to accurately predict the next voltage measurement. However, constant fluctuations in the supply and demand cause the signals to deviate. The fact that each signal is primarily continuous (without sudden jumps) can be beneficial to compression algorithms.

A delta encoding scheme only stores the numerical difference of neighboring elements in a time-series measurement vector. This can be useful for slow-changing signals because the difference of a signal might require fewer bytes to encode than the absolute value:
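A minimal sketch of such a delta encoding and its inverse is given below. It assumes the first element is stored as an absolute value so that the transformation is fully reversible; the function names are hypothetical.

```python
def delta_encode(samples):
    """Keep the first value, then store differences of neighboring values."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


def delta_decode(deltas):
    """Reverse delta_encode with a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```

For a slowly changing signal such as `[1000, 1003, 1007, 1006, 1002]`, the encoded vector is `[1000, 3, 4, -1, -4]`: apart from the first entry, all values fit into a single byte, which benefits a subsequent variable-length encoding or general-purpose compressor.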

We compare the original data representation (format, compression, encoding) of each dataset, reformat them into various file formats, and evaluate their storage saving based on a comprehensive list of lossless compression algorithms. This involves encoding raw data in a more suitable representation to compare their compressed size: , and the resulting space saving: . We define the main goal of reducing the overall required storage space for each dataset, and deliberately do not consider compression or decompression speed. The performance characteristics (throughput and speed) are well known for individual compression techniques (Arnold and Bell, 1997) and are of minor importance in the case of large static datasets which require only a single compression step before distribution. Performance metrics are important when dealing with repeated compression of raw data, which is not the case for static energy datasets. Repeated decompression is, however, relevant because researchers might want to read and parse the files repeatedly while analyzing them (if in-memory processing is not feasible). As noted in (Arnold and Bell, 1997), decompression speed and throughput are typically not a performance bottleneck in data analytics tasks.
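The two size metrics can be computed as in the following sketch. It assumes the conventional definitions (compression ratio as uncompressed size divided by compressed size, space saving as one minus the inverse of that ratio), which may differ in notation from the formulas used in this study; zlib merely serves as an example compressor.

```python
import zlib


def compression_ratio(uncompressed_size, compressed_size):
    """Conventional definition (assumed): uncompressed / compressed."""
    if uncompressed_size <= 0 or compressed_size <= 0:
        raise ValueError("sizes must be positive")
    return uncompressed_size / compressed_size


def space_saving(uncompressed_size, compressed_size):
    """Conventional definition (assumed): 1 - compressed / uncompressed."""
    if uncompressed_size <= 0 or compressed_size <= 0:
        raise ValueError("sizes must be positive")
    return 1.0 - compressed_size / uncompressed_size


# Example with a highly repetitive payload and zlib at its highest level.
raw = b"\x00\x01" * 4096
packed = zlib.compress(raw, 9)
ratio = compression_ratio(len(raw), len(packed))
saving = space_saving(len(raw), len(packed))
```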

Building a novel data compression scheme for energy data is counter-productive, since most scientific computing frameworks lack support and the idea suffers from the "not invented here" and "yet another standard" problems, both common anti-patterns in engineering, where new solutions are developed despite existing suitable approaches (Pereira et al., 2014; Kolter and Johnson, [n. d.]; Eichinger et al., 2015). Therefore, a key requirement is that each file format must be supported in common scientific computing systems to read (and possibly write) data files.

We selected four format types: raw binary, HDF5 (data model and file format for storing and managing data), Zarr (chunked, compressed, N-dimensional arrays), and audio-based PCM containers.

Raw binary formats provide a baseline for comparison. All samples are encoded as integer values (16-bit or 24-bit) and are compressed with a general-purpose compressor: zlib/gzip, LZMA, bzip2, and zstd, all with various parameter values. The input for each compressor is either raw-integer or variable-length encoded data (LEB128S (Group, 2018)), which is serialized either row- or column-based from all channels (interweaving). The LEB128S encoding is additionally evaluated with delta encoding of the input.
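The raw binary baseline can be approximated with compressors from the Python standard library. The sketch below serializes 16-bit samples either row-based (interleaved) or column-based and records the compressed size per combination; it omits zstd and LEB128S, which are not part of the standard library, and the parameter values are illustrative rather than those used in the benchmark.

```python
import bz2
import lzma
import struct
import zlib


def serialize(channels, layout="row"):
    """Pack equally long lists of int16 samples as little-endian bytes.

    row:    one sample of every channel at a time (interleaved)
    column: all samples of one channel, then the next channel
    """
    if layout == "row":
        values = [v for frame in zip(*channels) for v in frame]
    elif layout == "column":
        values = [v for channel in channels for v in channel]
    else:
        raise ValueError(f"unknown layout: {layout}")
    return struct.pack(f"<{len(values)}h", *values)


COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "lzma": lambda data: lzma.compress(data, preset=9),
    "bz2": lambda data: bz2.compress(data, 9),
}


def benchmark(channels):
    """Compressed size in bytes for every layout/compressor combination."""
    return {
        (layout, name): len(compress(serialize(channels, layout)))
        for layout in ("row", "column")
        for name, compress in COMPRESSORS.items()
    }
```

Calling `benchmark([current_samples, voltage_samples])` returns a dictionary that can be sorted by value to rank the combinations for a given file.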

The Hierarchical Data Format 5 (HDF5) (Folk et al., 2011) provides structured metadata and data storage, data transformations, and libraries for most scientific computing frameworks. All data is organized in natively-typed arrays (multi-dimensional matrices) with various filters for data compression, checksumming, and other reversible transformations before storing the data to a file. The API transparently reverses these transformations and compression filters while reading data. HDF5 is popular in the scientific community and used for various big-data-type applications (Blanas et al., 2014; Gosink et al., 2006; Dougherty et al., 2009; Sehrish et al., 2017). The public registry for HDF5 filters (https://support.hdfgroup.org/services/filters.html) currently lists 21 data transformations, most of them compression-related. Each HDF5 file is evaluated with and without the shuffle filter, zlib/gzip, lzf, MAFISC (Hübbe and Kunkel, 2013) with LZMA, szip (Group, 2017), Bitshuffle (Masui et al., 2015) with LZ4, zstd, and the full Blosc (Alted, 2017) compression suite, again all with various parameter values.
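To show why a shuffle filter can help a subsequent compressor, the sketch below reimplements the byte-shuffle idea in plain Python: the first bytes of all elements are stored together, followed by all second bytes, and so on. This is an illustration of the transformation only and does not use the HDF5 API; `shuffle` and `unshuffle` are hypothetical helper names.

```python
import struct


def shuffle(data, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[i::itemsize] for i in range(itemsize))


def unshuffle(data, itemsize):
    """Reverse shuffle() and restore the original element order."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    count = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * count:(i + 1) * count]
    return bytes(out)


# Four little-endian int16 values: low bytes vary, high bytes are all zero.
packed = struct.pack("<4h", 1, 2, 3, 4)
shuffled = shuffle(packed, 2)
```

After shuffling, the rarely changing high-order bytes of a waveform form long runs, which general-purpose compressors typically handle well.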

Zarr (Miles, 2018) organizes all data in a filesystem-like structure, which can be archived as a single zip-archive file or as tree-structure in the filesystem. Each channel is stored as a separate array (data stream) with optional chunk-based compression via zlib/gzip, LZMA, bzip2, or Blosc (with shuffle, Bitshuffle, or no-shuffle filter), again all with various parameter values. Each Zarr file is additionally evaluated with a delta filter to reduce the value range.

Audio-based formats use LPCM-type data encoding (PCM16 or PCM24) with a fixed precision and sampling rate. All channels are encoded into a single container using lossless compression formats: FLAC (Foundation, 2018), ALAC (Inc., 2018), and WavPack (Bryant, 2018). These formats do not provide tunable parameters.

Calibration factors, timestamps, and labels can augment the raw data in a single file while providing a unified API for accessing data and metadata. Raw binary formats lack this type of integrated support and require additional tooling and encoding schemes for metadata. Audio-based formats require a container format to store metadata, typically designed for the needs of the music and entertainment industry. Out of these formats, only HDF5 and Zarr provide support for encoding and storing arbitrary metadata objects (complex types or matrices) together with measurement data.

Most audio-based formats support at most 8 signal channels, while general-purpose formats such as HDF5 and Zarr have no restrictions on the total number of channels per file. The sampling rate can also be a limiting factor: FLAC supports at most and ALAC only . ADC resolution (bit depth) is mostly bound by existing technological limitations and will not exceed 32-bit in the foreseeable future. While these constraints are within the requirements for all datasets under evaluation, they need to be considered for future dataset collection and the design of measurement systems.

In total, we encoded the evaluated datasets with 365 different data representation formats (54 raw, 264 HDF5-based, 44 Zarr-based, and 3 audio-based) and gathered their per-file compressed size as a benchmark. The complete list, including all parameters and compression options, is available in the online appendix (available through the program chair due to double-blind review). The full analysis was performed in a distributed computing environment and consumed approx. CPU-core-hours (dual Intel Xeon E5-2630v3 machines with RAM and Ethernet interfaces).

6. Chunk Size Impact

Each dataset is provided in equally-sized files, typically based on measurement duration. Working with a single large file can be cumbersome due to main memory restriction or available local storage space. Assuming a typical desktop computer, with of main memory, is used for processing, a single file from a dataset must be fully loaded into memory before any computation can be done. Depending on the analysis and algorithms, multiple copies might be required for intermediary results and temporary copies. This means the main memory size is an upper bound for the maximum feasible chunk size.

Some file formats and data types support internal chunking or streamed data access, in which data can be read into memory sequentially or random-access. In such environments other factors will limit the usable chunk size, such as file system capabilities, network-attached storage, or other operating system limitations.

The evaluated datasets are distributed with the following chunk sizes of raw data: REDD: or , BLUED: or , UK-DALE: or , BLOND-50: or , BLOND-250: or . Measurement duration and file size are not strictly linked, causing a slight variation in file sizes across the entire measurement period of each dataset. Observed real-world time does not affect any of the compression algorithms under test and is therefore omitted. The sampling rate and channel count directly affect the data rate (bytes per time unit) and explain the non-uniform chunk sizes mentioned for each dataset.

We compare the best-performing data representation formats of each dataset from the previous experiment, benchmark them with different chunk sizes, and estimate their effect on the overall compression ratio. For this evaluation, we define the compression ratio as . The chunk sizes range from 1, 2, 4, 8, 16, 32, 64, , and then continue in steps of up to . To reduce the required computational effort, we greedily consume data from the first available dataset file, until the predefined chunk limit is fulfilled. The chunk size is determined using the number of samples (across all channels) and their integer byte count (2 or 3 bytes); only full samples for all channels are included in a chunk.
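The greedy chunk assembly can be sketched as follows. The example assumes that each file is available as a byte string of row-interleaved samples and that a chunk may only contain complete samples across all channels; `build_chunk` is a hypothetical helper, not the tooling used for the evaluation.

```python
def build_chunk(files, chunk_limit_bytes, channels, bytes_per_value):
    """Greedily consume data from files until the chunk limit is reached.

    files: iterable of bytes objects holding row-interleaved samples.
    Only full samples for all channels are included in the chunk.
    """
    frame_size = channels * bytes_per_value  # one sample of every channel
    target = (chunk_limit_bytes // frame_size) * frame_size
    chunk = bytearray()
    for payload in files:
        if len(chunk) >= target:
            break
        usable = len(payload) - len(payload) % frame_size
        take = min(usable, target - len(chunk))
        chunk += payload[:take]
    return bytes(chunk)
```

The resulting chunk can then be passed to each data representation format under test; if the files are exhausted before the limit is reached, the chunk is simply shorter than requested.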

7. Results

7.1. Entropy Analysis

Entropy is based on the probability of a given measurement (signal value). The histogram of an entire measurement channel shows the number of times each measurement value was seen in the dataset (Figure 3). The plots show the raw measurement bandwidth in ADC values on the x-axis and the number of occurrences of each value on a logarithmic y-axis. The raw ADC values are bipolar and centered on 0: ±2^15 for the 16-bit datasets BLUED, BLOND-50, and BLOND-250; ±2^23 for the 24-bit datasets REDD and UK-DALE.
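As a minimal sketch of how such a histogram translates into an entropy value, the following code estimates the entropy of a channel from its value counts. It assumes that H(x) denotes the Shannon entropy in bits per sample; the function name `shannon_entropy_bits` and the toy data are illustrative and not taken from the study.

```python
import math
from collections import Counter


def shannon_entropy_bits(samples):
    """Shannon entropy H(x) in bits per sample, estimated from the histogram."""
    counts = Counter(samples)          # histogram: ADC value -> occurrences
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy channel: four equally likely ADC values carry 2 bits per sample.
toy = [-2, -1, 0, 1] * 100
print(shannon_entropy_bits(toy))
```

A channel that uses few distinct values, or uses them very unevenly, yields a low entropy regardless of the bit width of its data type.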

The voltage histogram shows a distinctive sinusoidal distribution (peaks at the minimum and maximum values). The current histogram would show a similar distribution if the power draw were constant (purely linear or resistive loads). Instead, multiple levels of current values can be observed, indicating high activity and fluctuations. REDD and BLUED (Figure 3) show a center-biased distribution, indicating sub-optimal calibration and unused measurement bandwidth. UK-DALE, BLOND-50, and BLOND-250 (Figure 3) show a wide range of highly used values, with the voltage channels utilizing around 90% of the available bandwidth.
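The sinusoidal shape of the voltage histogram can be reproduced with a short sketch: sampling an ideal sine wave uniformly in time and counting how often each amplitude bin is hit. The sample count and the bin count below are arbitrary assumptions for illustration.

```python
import math


def sine_histogram(num_samples=100_000, bins=10):
    """Histogram of one sine period quantized into equally wide bins."""
    counts = [0] * bins
    for k in range(num_samples):
        x = math.sin(2 * math.pi * k / num_samples)    # value in [-1, 1]
        idx = min(int((x + 1) / 2 * bins), bins - 1)    # clamp x == 1
        counts[idx] += 1
    return counts


counts = sine_histogram()
print(counts)
```

The outermost bins collect the most samples because a sine wave changes slowly near its extremes, which matches the peaks at the minimum and maximum voltage values.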

Figure 2. Compression performance for the top-30 data representation formats and their transformation filters. Each data representation format was applied on a per-file basis to every dataset.

REDD and BLUED use only a small percentage of the available range, indicating a low entropy relative to the data type used. UK-DALE utilizes a reasonable slice, while BLOND covers almost the entire possible range (Table 2). Assuming a well-calibrated data acquisition system, the used percentage should reflect the expected measurement values. Low range usage (REDD, BLUED) leads to lost precision that would have been freely available with the given hardware, whereas high usage (UK-DALE, BLOND) means almost all available measurement precision is reflected in the raw data. Some channels utilize 100% of the available measurement range, while the REDD current channels use only 4% to 5%. A high range utilization does not result in an equally high usage, as the histogram can contain gaps (ADC values with 0 occurrences in the datasets).

Figure 3. Semi-logarithmic histogram of ADC values for each dataset and channel. Panels: REDD (24-bit), BLUED (16-bit), UK-DALE (24-bit), BLOND-50 (16-bit), BLOND-250 (16-bit). Current signals show distinct steps, corresponding to prolonged usage at certain power levels. For visualization reasons, the scatter plot was smoothed; the full histogram is available in the online appendix (available through the program chair; double-blind review).

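| Dataset | Channel | Values | Usage | Range | H(x) |
|---|---|---|---|---|---|
| REDD | current_1 | 87713 | 1% | 4% | 14.3 |
| REDD | current_2 | 85989 | 1% | 5% | 14.9 |
| REDD | voltage | 2925155 | 17% | 18% | 21.1 |
| BLUED | current_a | 5855 | 9% | 10% | 7.8 |
| BLUED | current_b | 7684 | 12% | 13% | 9.7 |
| BLUED | voltage | 11302 | 17% | 18% | 13.2 |
| UK-DALE | current | 6981612 | 42% | 81% | 19.0 |
| UK-DALE | voltage | 15135594 | 90% | 100% | 23.2 |
| BLOND-50 | current1 | 51122 | 78% | 100% | 12.6 |
| BLOND-50 | current2 | 49355 | 75% | 100% | 11.2 |
| BLOND-50 | current3 | 48658 | 74% | 100% | 11.3 |
| BLOND-50 | voltage1 | 58396 | 89% | 92% | 15.3 |
| BLOND-50 | voltage2 | 57975 | 88% | 91% | 15.4 |
| BLOND-50 | voltage3 | 59596 | 91% | 95% | 15.4 |
| BLOND-250 | current1 | 52721 | 80% | 100% | 12.4 |
| BLOND-250 | current2 | 51802 | 79% | 100% | 10.8 |
| BLOND-250 | current3 | 50989 | 78% | 100% | 11.6 |
| BLOND-250 | voltage1 | 58488 | 89% | 91% | 15.3 |
| BLOND-250 | voltage2 | 57912 | 88% | 92% | 15.4 |
| BLOND-250 | voltage3 | 59742 | 91% | 94% | 15.4 |

Table 2. Entropy analysis of whole-building energy datasets with high sampling rates. The number of unique measurement values for each channel is extracted, which corresponds to a usage percentage over the available measurement resolution. The lowest and highest observed values are used to determine the observed range.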
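The usage and range columns can be illustrated with a short sketch. It assumes that usage is the number of unique values divided by the number of available ADC levels, and that range is the span between the lowest and highest observed value divided by the same number of levels; the exact formula of the study may differ, and the function name `usage_and_range` is hypothetical.

```python
def usage_and_range(samples, bits):
    """Return (unique values, usage %, range %) for one channel."""
    levels = 1 << bits                      # available ADC levels
    unique = len(set(samples))
    usage = 100.0 * unique / levels
    span = max(samples) - min(samples) + 1  # levels between the extremes
    return unique, usage, 100.0 * span / levels


# Toy 8-bit channel with gaps: only even values between -100 and 100.
toy = list(range(-100, 101, 2))
print(usage_and_range(toy, bits=8))
```

Under this assumption, the 11302 unique values of the 16-bit BLUED voltage channel correspond to 11302 / 65536, or about 17% usage, in line with the table. The toy channel shows how gaps in the histogram make the usage (about 39%) fall well below the range (about 79%).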

7.2. Data Representation

The evaluation compares the compressed size (CS, the final file size after compression and file format encapsulation, in percent of the uncompressed size) of 365 data representation formats. For brevity, only the 30 best-performing formats are shown in Figure 2. Each of the 365 data representations was tested on all datasets, and the full evaluation is available in the online appendix. The following evaluation and benchmark use the raw data from each dataset as described in Section 3. In total, raw data with was re-encoded 365 times.

HDF5 and Zarr are general-purpose file formats for numerical data with broad support in scientific computing frameworks. They only support 16-bit and 32-bit integer values, so 24-bit samples must be widened to 32 bits, which causes a 1-byte overhead per value for REDD and UK-DALE. The baseline used for comparison is a raw concatenated byte string with dataset-native data types (16-bit and 24-bit). This allows us to obtain comparable evaluation results, whereas other published benchmarks compared ASCII-like encodings against binary representations, skewing the results significantly.
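The 1-byte overhead can be made concrete with a small sketch that serializes the same 24-bit samples once in their native width and once widened to 32 bits. The helper names and the little-endian byte order are assumptions for illustration only.

```python
def pack_int24(values):
    """Dataset-native baseline: 3 bytes per signed 24-bit sample."""
    return b"".join(v.to_bytes(3, "little", signed=True) for v in values)


def pack_int32(values):
    """Container representation: the same samples widened to 4 bytes."""
    return b"".join(v.to_bytes(4, "little", signed=True) for v in values)


samples = [-8_388_608, -1, 0, 1, 8_388_607]   # spans the signed 24-bit range
native = pack_int24(samples)
widened = pack_int32(samples)
print(len(native), len(widened), len(widened) / len(native))
```

Before any compression is applied, the widened representation is one third larger than the baseline, which a compressor has to win back before it shows a net reduction.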

Overall, all three audio-based formats performed well, as they are designed to compress waveforms with high temporal resolution. ALAC and FLAC achieved the best (lowest) overall CS across all datasets, followed by HDF5+MAFISC and HDF5+zstd, which can overcome the 1-byte overhead. Although the general-purpose compressors and their individual data representation formats were intended as a baseline for comparison with the more advanced schemes (HDF5, Zarr, and audio-based), even plain bzip2 or LZMA compression can achieve comparable results. A tradeoff to consider is the lack of metadata and internal structure, which might cause additional data handling overhead because easy-to-use import and parsing tools are not available. Variable-length encoding using LEB128S is a suitable input for the bzip2 and LZMA compressors when combined with a column-based storage format. Delta encoding resulted in comparably good CS in certain combinations.
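The combination of delta encoding, variable-length encoding, and a general-purpose compressor can be sketched as follows. The sketch assumes that LEB128S refers to the signed LEB128 encoding, uses a synthetic sine wave in place of real measurements, and stores a single channel contiguously as a stand-in for column-based storage; it is not the pipeline used in the study.

```python
import bz2
import math


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out


def sleb128_encode(value):
    """Signed LEB128: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                       # arithmetic shift keeps the sign
        done = bool((value == 0 and not byte & 0x40)
                    or (value == -1 and byte & 0x40))
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def sleb128_decode_all(data):
    values, result, shift = [], 0, 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:               # last byte of this value
            if byte & 0x40:               # sign bit set: negative number
                result -= 1 << shift
            values.append(result)
            result, shift = 0, 0
    return values


# Synthetic 16-bit "voltage" channel, stored as one contiguous column.
column = [round(30000 * math.sin(2 * math.pi * k / 1000)) for k in range(20000)]
encoded = b"".join(sleb128_encode(d) for d in delta_encode(column))
compressed = bz2.compress(encoded)
print(len(column) * 2, len(encoded), len(compressed))
```

Delta encoding turns the large waveform values into small differences, which the variable-length encoding then stores in one or two bytes each before bzip2 is applied.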

Some datasets are inherently more compressible than others. This follows from the entropy analysis and can be observed in the data representation evaluation as well. Compressing BLUED consistently yields smaller file sizes with most compressors than any other dataset. The benchmark shows that higher entropy correlates strongly with higher CS (larger compressed files) per dataset.
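The link between range usage and compressibility can be illustrated with two synthetic channels that share the same 16-bit data type but use different portions of the ADC range. The signal model (a noisy sine whose noise scales with the amplitude), the seed, and the choice of LZMA are assumptions for illustration and do not reproduce the study's measurements.

```python
import lzma
import math
import random
import struct


def compressed_size_percent(samples):
    """CS: compressed size in percent of the raw 16-bit encoding."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return 100.0 * len(lzma.compress(raw)) / len(raw)


def synthetic_channel(amplitude, noise, n=20000, seed=1):
    """Noisy sine wave; the amplitude controls how much ADC range is used."""
    rng = random.Random(seed)
    return [
        round(amplitude * math.sin(2 * math.pi * k / 1000) + rng.gauss(0, noise))
        for k in range(n)
    ]


low_range = synthetic_channel(amplitude=1500, noise=15)      # roughly 5% of range
full_range = synthetic_channel(amplitude=30000, noise=300)   # roughly 92% of range
print(compressed_size_percent(low_range), compressed_size_percent(full_range))
```

The channel that occupies only a small part of the range compresses to a smaller fraction of its raw size, at the price of a coarser representation of the same signal.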

While the majority of tested data representation formats achieves a data reduction, compared to the baseline, some formats are counter-productive and generate a larger output (CS over 100%). This behavior affects most HDF5- and Zarr-based formats, because of the 1-byte overhead (depending on the used compressor).

Choosing the best-performing data representation for each dataset, the following space savings (SS) can be achieved when applied to all data files as compared against the raw binary encoding: REDD: 48.3% or , BLUED: 73.0% or , UK-DALE: 40.5% or , BLOND-50: 51.3% or , BLOND-250: 55.4% or . It can be noted that REDD, UK-DALE, and both BLOND datasets perform at around 50-60% of CS, while BLUED shows a significantly smaller CS of below 30%, due to its very low signal entropy (Table 2). Variable-length encoding (LEB128S) and Delta encoding yield the largest space saving for such types of data (REDD and BLUED).
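The relation between the reported metrics can be checked with a few lines of arithmetic. The sketch assumes that space savings and compressed size are complementary (SS = 100% - CS) and that the compression ratio is the uncompressed size divided by the compressed size; both definitions are assumptions here, since they are not restated in this section.

```python
def compressed_size_from_savings(space_savings_percent):
    """CS in percent of the uncompressed size, assuming SS = 100% - CS."""
    return 100.0 - space_savings_percent


def compression_ratio(space_savings_percent):
    """CR = uncompressed size / compressed size (assumed definition)."""
    return 100.0 / compressed_size_from_savings(space_savings_percent)


# Space savings reported above for the best representation per dataset.
reported_ss = {"REDD": 48.3, "BLUED": 73.0, "UK-DALE": 40.5,
               "BLOND-50": 51.3, "BLOND-250": 55.4}
for name, ss in reported_ss.items():
    print(f"{name}: CS={compressed_size_from_savings(ss):.1f}%  "
          f"CR={compression_ratio(ss):.2f}")
```

For BLUED, a space saving of 73.0% corresponds to a CS of 27.0%, consistent with the stated CS of below 30%.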

Two out of the five evaluated datasets (REDD and BLUED) showed the highest space savings with a general-purpose compressor (bzip2) and variable-length encoding. ALAC and HDF5+MAFISC performed best on UK-DALE, BLOND-50, and BLOND-250, given their higher signal entropy and value range utilization.

When comparing the raw space savings against the actually published dataset, which typically is already compressed, we can achieve additional space savings: REDD: 61.2% or , BLUED: 96.4% or , UK-DALE: -1.3% or , BLOND-50: 23.3% or , BLOND-250: 26.0% or . All datasets show space savings, except for UK-DALE, which shows an insignificant increase in the overall dataset size. This means the originally published FLAC files are already compressed to a high extent; this is supported by Figure 2, showing FLAC among the highest ranking formats in this study. While an absolute space saving of for REDD might be insignificant in most use cases (desktop computing and data center), a more compelling reduction in storage space of up to for BLOND-250 can be substantially beneficial.

7.3. Chunk Size Impact

The chunk size evaluation (Figure 4) contains the averaged CR per chunk size for all datasets except REDD, as it only contains of data and was therefore omitted. A detailed per-dataset evaluation is available in the online appendix.

The evaluated chunk size range starts with very small chunks, which would not be recommended for large datasets because of the increased handling and container overhead. As such, chunk sizes starting with can be considered as viable storage strategy. The resulting CR ramps up quickly for most formats until it levels off between   to   . Above this mark, no significant improvement in CR can be achieved by increasing the chunk size. Some file formats even show a slight linear decrease in CR with very large chunk sizes (above approx. ). ALAC and FLAC compressors show a slight improvement (2-3%) in CR with larger chunk sizes. In most use cases this size reduction comes at a great cost in RAM requirement to process files above . HDF5 has its own concept of ”chunks”, used for I/O and the filter pipeline, with a default size of . Internal limitations do not allow for HDF5-chunks larger than , however, HDF5, in general, can be used for files larger than this limit. The MAFISC filter with LZMA compression experiences large fluctuations for neighboring chunk size steps and should, therefore, be tuned separately. Overall, increasing the chunk size has a negligible effect on the final compression ratio and only pushes up the RAM requirements for processing.
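A chunk size sweep of this kind can be sketched with a general-purpose compressor from the standard library. The synthetic waveform, the three chunk sizes, the seed, and the definition of CR as uncompressed size divided by compressed size are assumptions for illustration; the numbers it prints are not comparable to Figure 4.

```python
import bz2
import math
import random
import struct


def ratio_per_chunk_size(samples, chunk_sizes):
    """Average CR (uncompressed / compressed, assumed) per chunk size."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    results = {}
    for size in chunk_sizes:
        # Split the raw byte string into independent chunks of `size` bytes.
        chunks = [raw[i:i + size] for i in range(0, len(raw), size)]
        ratios = [len(c) / len(bz2.compress(c)) for c in chunks]
        results[size] = sum(ratios) / len(ratios)
    return results


# Synthetic noisy 16-bit waveform standing in for a measurement channel.
rng = random.Random(0)
wave = [round(30000 * math.sin(2 * math.pi * k / 1000) + rng.gauss(0, 30))
        for k in range(100_000)]
results = ratio_per_chunk_size(wave, [1 << 10, 1 << 14, 1 << 17])
for size, cr in results.items():
    print(f"{size:>7} bytes per chunk: CR = {cr:.2f}")
```

Each chunk is compressed independently, so the per-chunk container and header overhead weighs more heavily on small chunks than on large ones.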

Figure 4. Chunk size impact of different representations.

7.4. Summary and Recommendations

The entropy analysis shows a lack of measurement range calibration in some datasets. This leaves precision unused that would have been available with the given DAQ hardware. The used range directly affects the contained entropy, and therefore the achievable compression ratio. A well-calibrated measurement system is a key requirement to achieve the best signal range and resolution.

Choosing a file format for long-term whole-building energy datasets is a crucial component, directly affecting the visibility and accessibility of the data by other researchers. Using an unsupported encoding or requiring specialized tools to read the data is cumbersome and error-prone and should be avoided. We recommend using well-known file formats, such as HDF5 or FLAC, which are widely adopted and provide built-in support for metadata, compression, and error-detection. While ALAC and FLAC already provide internal compression, we recommend the MAFISC or zstd filters for HDF5, due to their superior compression ratio. The serialization orientation (row- or column-based) has only a minor effect.

Large datasets should be split into multiple smaller files to facilitate data handling and to reduce transfer and loading times when only short periods of data are needed. We have found that compression algorithms (together with the above-described file formats) yield higher space savings with chunk sizes above to . Small files show a modest compression ratio, while larger files require more transfer bandwidth and time before the data can be analyzed.

8. Conclusions

We presented a comprehensive entropy analysis of public whole-building energy datasets with waveform signals. Some datasets leave a majority of the available ADC range unused, causing lost precision and accuracy. A well-calibrated measurement system maximizes the achievable precision. Using 365 different data representation formats, we have shown that space savings of up to 73% are achievable by choosing a suitable file format and data transformation. Low-entropy datasets show higher achievable compression ratios. Audio-based file formats perform well, given the similarities between audio signals and electricity waveforms. Transparent data transformations, such as MAFISC and SHUFFLE-based approaches, are particularly beneficial. The achievable compressed size depends only weakly on the input size, with variations of a few percentage points (limited by RAM). Waveform data shows a nearly constant compression ratio, independent of the input chunk size. Splitting large datasets into multiple smaller files is important for data handling, but insignificant in terms of space savings.

