Automating Cluster Analysis to Generate Customer Archetypes for Residential Energy Consumers in South Africa

06/11/2020
by   Wiebke Toussaint, et al.
University of Cape Town

Time series clustering is frequently used in the energy domain to generate representative energy consumption patterns of households, which can be used to construct customer archetypes for long term energy planning. Selecting the optimal set of clusters however requires extensive experimentation and domain knowledge, and typically relies on a combination of metrics together with additional expert guidance through visual inspection of the clustering results. This can be time consuming, subjective and difficult to reproduce. In this work we present an approach that uses competency questions to elicit expert knowledge and to specify the requirements for creating residential energy customer archetypes from energy meter data. The approach enabled a structured and formal cluster analysis process, while easing cluster evaluation and reducing the time to select an optimal cluster set that satisfies the application requirements. The usefulness of the selected cluster set is demonstrated in a use case application that reconstructs a customer archetype developed manually by experts.


1 Introduction

Long term energy planning requires insights into the energy consumption behaviour of customers, such as residential households, to build demand forecasts. Customer behaviour is frequently approximated with load profiles or load curves, which are time-varying energy consumption patterns. A daily load profile captures the average load drawn from the electrical grid over a metered interval (e.g. 5 minutes). If a daily load profile averages consumer behaviour for a particular loading condition, such as a year, season, month or daytype, it is called a representative daily load profile (RDLP). Cluster analysis is a popular unsupervised machine learning technique with diverse applications. Time series clustering is frequently used in the energy domain to generate RDLPs and typically yields good results for consumers in the industrial and commercial sectors. Granular household energy consumption patterns are, however, inherently noisy, making it more challenging to produce meaningful clusters in the residential sector [29].

In practice, selecting the optimal set of clusters requires extensive experimentation and domain knowledge. A combination of metrics together with additional expert guidance and visual inspection of clustering results is often used during the experimental process to identify the best cluster set [20][11]. However, these qualitative approaches can be ad hoc and time consuming, subjective and difficult to reproduce, and biased by the expert’s interpretation of the visual representation [15]. This is further compounded in developing countries like South Africa, where there is limited availability of machine learning expertise outside the private sector for solving social problems. For domain experts without a background in machine learning, interpreting traditional clustering metrics is challenging.

The objective of this work is to use cluster analysis techniques to generate customer archetypes that represent the consumption behaviour of residential energy consumers in South Africa. We attempt to structure and automate aspects of the machine learning workflow to make the process of creating customer archetypes transparent and repeatable, and to reduce the time commitment of experts. This work builds on and extends our previous work, where we compared and analysed different clustering techniques for generating RDLPs [31]. In this work we show how competency questions from the ontology engineering community can be incorporated in the cluster analysis process to elicit and represent application requirements and guide cluster set selection. We demonstrate the usefulness of the approach by reconstructing a residential customer archetype that we compare against an archetype developed by experts.

The paper starts by reviewing relevant literature in Section 2, and presents the dataset in Section 3. In Section 4 we outline the experimental setup. In Section 5 we present our approach to formalising application requirements. The clustering results are presented in Section 6. Section 7 demonstrates the application of the clusters to creating customer archetypes. Finally, we discuss the results in Section 8 and conclude in Section 9.

2 Literature Review

2.1 Clustering Residential Load Profiles

A daily load profile describes the energy consumption pattern of a household over a 24 hour period. Representative daily load profiles (RDLPs) are indicative of distinct daily energy usage behaviour for different types of households. They have been well explored for generating customer archetypes that represent groupings of energy users consuming energy in a similar manner [13][23]. Cluster analysis is an unsupervised machine learning approach that is useful for finding groups in a dataset when no labelled training observations are available [28]. It is frequently used to create RDLPs. Traditionally, the most common approaches used for clustering load profiles are centroid-based approaches and variants of kmeans, self-organising maps (SOM) and hierarchical clustering [20]. The majority of studies that evaluated different clustering techniques found that the k-means algorithm performed the best [4][20][25][39]. Other studies showed that the SOM [8][23], k-medoids [20][30] and modified follow-the-leader [6][7] yielded the best results. Several variations of k-means [3][20][26] and hierarchical clustering [20][7][1] were identified as the best or amongst the best clustering algorithms in individual studies.

2.1.1 Data Representation and Processing

Fine-grained daily load profiles are frequently reduced using Piecewise Aggregate Approximation with 15, 30 or 60 minute windows to produce input vectors of 96, 48 or 24 dimensions respectively [27][38][8]. Xu et al. [39] represent daily load profiles as a normalised vector that sums consumption over time, to capture load shape as well as consumption levels. Granell et al. [16] investigate the impact of temporal resolution on clustering algorithms in the residential energy domain and suggest that cluster quality is best at a resolution of 8 or 15 minutes. For the k-means algorithm, performance is robust in a band of temporal resolutions between 4 and 60 minutes. Most studies normalise input data by scaling vectors with a min-max scaler so that patterns retain their shape but are scaled to a zero-one range [8][27][3], an approach that is sensitive to outliers. De-minning subtracts the daily minimum demand from each hourly value and then divides it by the de-minned daily total. It has been proposed as a more robust form of normalisation [21].
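The window-based reduction described above can be illustrated with a short sketch. It assumes a day metered at 5-minute intervals (288 readings), which is an illustrative choice and not a property of any particular dataset; the function name `paa` and the sample data are hypothetical.

```python
def paa(profile, window):
    """Piecewise Aggregate Approximation: replace each run of `window`
    consecutive readings with their mean."""
    if len(profile) % window != 0:
        raise ValueError("profile length must be a multiple of the window size")
    return [
        sum(profile[i:i + window]) / window
        for i in range(0, len(profile), window)
    ]


# Hypothetical day of 5-minute readings (288 values).
five_minute_day = [float(i % 12) for i in range(288)]

quarter_hourly = paa(five_minute_day, 3)   # 15-minute windows
half_hourly = paa(five_minute_day, 6)      # 30-minute windows
hourly = paa(five_minute_day, 12)          # 60-minute windows
```

With these window sizes the reduced vectors have 96, 48 and 24 dimensions, matching the dimensionalities quoted in the text.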

2.1.2 Clustering with Pre-binning

For residential consumers the variable nature of individual households makes the interpretation of clustering results ambiguous [29], a challenge that is exacerbated in highly diverse, developing country populations, where economic volatility, income inequality, geographic and social diversity contribute to increased variability of residential energy demand [18]. In other clustering studies of diverse populations pre-binning, or two-stage clustering, was implemented and showed promising results [4][39][35]. Xu et al. [39] used pre-binning to first cluster load profiles by overall consumption and then by load shape, to improve clustering results for highly variable households spread across the United States.

2.1.3 Considerations for Developing Countries

Very few studies have been conducted in developing countries. Certain assumptions around data representation and cleaning must be reconsidered when clustering energy consumers in this context. Very low consuming households are frequently treated as outliers and removed from the data [22][4]. While individual household consumption of these groups is low, they represent a significant percentage of households in our dataset. Moreover, the profiles typically belong to consumers living in rural or informal settings, and their inclusion is key if energy access is a concern. Their low consumption base also presents an opportunity for high growth, which has important implications for utilities.

2.2 Clustering Metrics

Common metrics that measure cluster compactness and distinctness, and that are used in the residential energy domain are the Davies-Bouldin Index (DBI) [9], the Cluster Dispersion Index (CDI) and Mean Index Adequacy (MIA) described in Chicco et al. [5] and the Silhouette Index [19]. It is well known that a single metric on its own is insufficient to adequately represent cluster performance [2], and many studies have indicated that these metrics do not discriminate clustering structures sufficiently. Several studies suggest a combination of measures together with additional expert validation to ensure optimal cluster selection [20][11][8]. Drawing on segmentation criteria from the marketing sector, Dent and Hons [12] propose additional metrics that require clusters to be accessible, differentiable, actionable, stable and familiar. Kwac et al. [22] propose the notion of entropy as a metric for capturing the variability of electricity consumption of a household. To evaluate the result of segmenting a large number of daily load profiles into interpretable consumption patterns, Xu et al. [39] use peak overlap, percentage error in overall consumption and entropy as metrics.

2.2.1 Competency Questions

Competency questions have been widely used in the ontology engineering community to formalise context-specific requirements and to compare candidate ontologies [17]. They can be used to represent a set of problems that characterise microtheories in a rigorous manner, enabling more precise evaluation of different conceptualisations of a domain [14]. Brainstorming, expert interviews and consultation of established sources of domain knowledge are processes that can be used to identify competency questions [10]. The techniques for developing competency questions and the questions themselves can be formal or informal. Informal competency questions can be expressed in natural language and connect a proposed ontology to its application scenarios, thus providing an informal justification for the ontology [36]. To our knowledge competency questions have not been used previously to evaluate clustering structures in terms of their fitness for purpose.

2.3 From Clusters to Customer Archetypes

Customer archetypes can be derived from RDLPs by classifying load profiles according to socio-demographic characteristics [27][23][37]. This work builds on the approach of McLoughlin et al. [23]. They cluster the daily hourly load profiles of 3941 Irish households and derive RDLPs by averaging the consumption of cluster members. The RDLP used by every customer on every day is captured in a Customer Class Index (CCI). The statistical mode of the CCI is assigned to each customer to obtain its most frequently occurring profile. Finally, multinomial logistic regression is used to classify the CCI by socio-demographic and appliance variables.
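The step of assigning each customer the statistical mode of its daily cluster labels can be sketched as follows. This is a minimal illustration, not the implementation of McLoughlin et al.; the household names, labels and the tie-breaking rule are assumptions.

```python
from collections import Counter


def most_frequent_cluster(daily_cluster_labels):
    """Return the cluster label that occurs most often for one customer.

    Ties are resolved in favour of the label seen first (an assumption;
    the source does not state a tie-breaking rule).
    """
    counts = Counter(daily_cluster_labels)
    return counts.most_common(1)[0][0]


# Hypothetical index: one cluster label per customer per day.
cci = {
    "household_a": [3, 3, 1, 3, 2],
    "household_b": [2, 2, 2, 1],
}

modal_profile = {
    household: most_frequent_cluster(labels)
    for household, labels in cci.items()
}
```

The resulting modal label per household is what a subsequent classifier, such as the multinomial logistic regression mentioned above, would take as its target variable.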

3 Data

The Domestic Electrical Load Metering Hourly (DELMH) [32] dataset contains 3 295 194 daily load profiles for 14 945 South African households over a period of 20 years from 1994 to 2014. The daily load profile of household $j$ on day $t$, denoted by $h_j^{(t)}$, is a 24 element vector representing the energy demand in Amperes for each hour in day $t$. For example, the first element, $h_0$, is the household’s average energy demand for the first hour of the day, i.e. 00:00:00 - 00:59:59. $H_j$ is an array containing all daily load profiles associated with household $j$, and $X$ (dim 3 295 194 $\times$ 24) is the array of all daily load profiles for all households.

$h_j^{(t)} = [h_0, h_1, \dots, h_{23}]$    (1)
$H_j = [h_j^{(1)}, h_j^{(2)}, \dots, h_j^{(T_j)}]$    (2)
$X = [H_1, H_2, \dots, H_{14945}]$    (3)

We can then use clustering to find an optimal clustering structure $K$, given the input dataset $X$. A single cluster $k_x$ is representative of individual daily load profiles that capture similar daily energy consumption behaviour. The centroid of the cluster is the mean daily load profile, also referred to as the representative daily load profile (RDLP), denoted as $r_x$. It represents the mean daily consumption pattern of all load profiles in cluster $k_x$. The RDLPs of the optimal cluster set can be used to generate customer archetypes for long term energy modelling applications.

3.1 Description of Sample Population

For 58% of the metered households (8656 households) detailed socio-demographic data was captured in an annual survey. (A harmonised version of the survey data used to provide descriptive statistics has been published as the Domestic Electrical Load Survey - Key Variables (DELSKV) dataset [33].) The majority of households have a low income of less than R5000 (about $340) per month. A fraction of households earns up to 50 times that amount. A similar distribution can be observed for dwelling size. Less than half the surveyed households have access to piped water in the home and less than a quarter of households live in dwellings with brick walls. More than half the households have a corrugated iron or zinc roof - a construction material that is particularly popular in rural and informal settlements due to its availability and low cost. Furthermore, the dataset covers a large number of newly electrified households. While the affluent households could be seen as outliers, it is important to include them in the analysis as they are disproportionately large energy consumers. Appendix A visualises the distribution of income, dwelling floor area, the number of years electrified and the proportion of wall materials, roof materials and water access points of survey respondents in Figures 5(a), 5(b), 5(c) and 6(a).

4 Load Profile Clustering

After an extensive literature survey on clustering residential load profiles, we selected Euclidean distance and the clustering algorithms that were most popular and successful in the domain. This section describes the pre-processing steps, clustering algorithms, parameters and quantitative metrics.

4.1 Experiment Design

An experiment run takes input array $X$ to produce cluster set $K$ and predict a cluster $k_j^{(t)}$ for each normalised daily load profile $n_j^{(t)}$ of household $j$ observed on day $t$. The output of the cluster evaluation is the selection of the clustering structure that is most suitable for our proposed use case. More specifically, the objective of the load profile clustering experiments is the selection of the experiment that produces the set of clusters $K$ that symbolise the best RDLPs for $X$, so that the RDLPs can be used to generate customer archetypes for long term energy planning.

Cluster $k_x$ symbolises the RDLP $r_x$, calculated from the mean of all de-normalised daily load profiles $h_j^{(t)}$ assigned to $k_x$:

$r_x = \frac{1}{|k_x|} \sum_{h_j^{(t)} \in k_x} h_j^{(t)}$    (4)

$R$ is the set of RDLPs for all clusters in $K$. Given the high variance of the dataset, preprocessing was an important component of the clustering process. Different normalisation and pre-binning algorithms were set up for comparison alongside clustering algorithms.
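As a minimal sketch of how an RDLP follows from cluster assignments, the element-wise mean of the member profiles can be computed per cluster. The function names and the shortened two-element profiles are illustrative assumptions; real profiles have 24 hourly values.

```python
def rdlp(member_profiles):
    """Element-wise mean of the daily load profiles assigned to one cluster."""
    count = len(member_profiles)
    if count == 0:
        raise ValueError("a cluster needs at least one member profile")
    return [sum(hour_values) / count for hour_values in zip(*member_profiles)]


def rdlps_by_cluster(profiles, labels):
    """Group de-normalised profiles by predicted cluster label and average them."""
    members = {}
    for profile, label in zip(profiles, labels):
        members.setdefault(label, []).append(profile)
    return {label: rdlp(group) for label, group in members.items()}


# Toy data: three shortened profiles and their predicted cluster labels.
toy_profiles = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
toy_labels = [0, 0, 1]
toy_rdlps = rdlps_by_cluster(toy_profiles, toy_labels)
```

Clustering runs on normalised profiles, while the mean is taken over the original (de-normalised) values, so the labels and the profiles passed in here come from different representations of the same days.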

4.1.1 Clustering Algorithms

Variations of kmeans, self-organising maps (SOM) and a combination of the two algorithms were implemented to cluster $X$. The kmeans algorithm was initialised with a range of $m$ clusters. The SOM algorithm was initialised as a square map with dimensions $s \times s$ for a range of values of $s$. Combining SOM and kmeans first creates an $s \times s$ map, which acts as a form of dimensionality reduction on $X$. For each $s$, kmeans then clusters the map into $m$ clusters. The mapping only makes sense if $s^2$ is greater than $m$. $m$ and $s$ are the algorithm parameters.

4.1.2 Normalisation

Early test runs indicated that normalisation has a considerable influence on clustering results. We compared four techniques from the literature (Table 1) against a baseline with no normalisation.

| Normalisation | Equation | Comments |
|---|---|---|
| Unit norm | $h / \lVert h \rVert$ | Scales input vectors individually to unit norm |
| De-minning | $(h_i - \min h) / \sum_i (h_i - \min h)$ | Subtracts daily minimum demand from each hourly value, then divides each value by deminned daily total; proposed by Jin et al. [20] |
| Zero-one | $(h_i - \min h) / (\max h - \min h)$ | Scales all values to a range [0, 1]; retains profile shape but is very sensitive to outliers; also known as min-max scaler |
| SA norm | $h_i / \bar{h}$ | Normalises all input vectors to mean of 1; retains profile shape but very sensitive to outliers; introduced for comparison, as it is frequently used by South African domain experts |

Here $h$ is a daily load profile with hourly values $h_i$ and mean $\bar{h}$.
Table 1: Data normalisation algorithms and descriptions
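The four normalisation techniques can be sketched directly from their descriptions in Table 1. This is an illustration under stated assumptions, not the authors' code: unit norm is taken to be the Euclidean (L2) norm, all scaling is applied per profile, and flat or all-zero profiles are mapped to zeros to avoid division by zero.

```python
import math


def _scale(values, divisor):
    # Flat or all-zero profiles would divide by zero; returning zeros
    # in that case is an assumption made for this sketch.
    if divisor == 0:
        return [0.0 for _ in values]
    return [value / divisor for value in values]


def unit_norm(profile):
    """Scale a profile to unit Euclidean length (L2 norm assumed)."""
    return _scale(profile, math.sqrt(sum(v * v for v in profile)))


def de_min(profile):
    """Subtract the daily minimum, then divide by the de-minned daily total."""
    shifted = [v - min(profile) for v in profile]
    return _scale(shifted, sum(shifted))


def zero_one(profile):
    """Min-max scale a profile to the range [0, 1]."""
    shifted = [v - min(profile) for v in profile]
    return _scale(shifted, max(profile) - min(profile))


def sa_norm(profile):
    """Scale a profile so that its mean equals 1."""
    return _scale(profile, sum(profile) / len(profile))


sample = [1.0, 2.0, 3.0, 6.0]  # shortened hypothetical profile
```

For the shortened sample, de-minning yields values that sum to 1, zero-one scaling maps the minimum to 0 and the maximum to 1, and SA norm leaves the shape unchanged while bringing the mean to 1.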

4.1.3 Pre-binning

We implemented two different approaches to pre-bin all daily load profiles in $X$. To pre-bin by average monthly consumption (AMC), we selected 8 expert-approved bin ranges based on South African electricity tariff ranges (see Appendix A for ranges). All the daily load profiles $h_j^{(t)}$ of household $j$ were assigned to one of the 8 bins based on the value of the household’s average monthly consumption, $AMC_j$. Individual household identifiers were removed from $X$ after pre-binning. AMC for household $j$ over one year is:

$AMC_j = \frac{1}{12} \sum_{\text{month}=1}^{12} (\text{total consumption of household } j \text{ in that month})$    (5)

Pre-binning by integral k-means draws on the work of Xu et al. [39], which resembles our use case. For the simple case where a daily load profile represents hourly values, pre-binning by integral k-means followed these steps:

  1. Construct a sequence from the cumulative sum of the profile normalised with unit norm

  2. Append to

  3. Gather all features in an array and remove individual household identifiers

  4. Use the kmeans algorithm to cluster the array into 8 bins (the same number of bins as created for AMC)

4.1.4 Summary of Clustering Experiments

Table 2 summarises the algorithms, parameters and pre-processing steps for each experiment. Zeros = True indicates that zero consumption values were retained in the input dataset. Each experiment was executed with all normalisation approaches.

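| Exp. | Algorithm | Parameters | Pre-bin | Zeros |
|---|---|---|---|---|
| 1 | kmeans | | | True |
| 1 | SOM | | | True |
| 1 | SOM+kmeans | | | True |
| 2 | kmeans | | | False |
| 2 | SOM | | | False |
| 2 | SOM+kmeans | | | False |
| 3 | kmeans | | AMC | True |
| 3 | SOM | | AMC | True |
| 3 | SOM+kmeans | | AMC | True |
| 4 | kmeans | | AMC | True |
| 4 | SOM+kmeans | | AMC | True |
| 5 | kmeans | | AMC | False |
| 6 | kmeans | | integral kmeans | True |
| 7 | kmeans | | integral kmeans | False |

Table 2: Summary of experiments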

4.2 Quantitative Metrics and CI Score

The Mean Index Adequacy (MIA), Davies-Bouldin Index (DBI) and the Silhouette Index were combined into a Combined Index (CI) score so that clustering performance can be evaluated across traditional analytical measures (see Appendix B for details on metrics). The CI is used as a relative index to enable simultaneous interpretation of multiple metrics. Distances between cluster centroids and cluster members were computed using Euclidean distance. The CI is calculated as follows:

$Ix = \frac{DBI \times MIA}{Silhouette}$    (6)
$CI = \log \left( \sum_{i=1}^{n_{bins}} Ix_{bin_i} \times \frac{N_{bin_i}}{N_{total}} \right)$    (7)

Ix is an interim score that computes the product of the DBI, MIA and inverse Silhouette Index. The CI is the log of the weighted sum of Ix across all $n_{bins}$ experiment bins, where $N_{bin_i}$ is the number of members of bin $i$ and $N_{total}$ is the number of members across all bins. DBI and MIA measure cluster compactness. Both metrics increase as cluster compactness deteriorates, thus increasing Ix and CI. The Silhouette Index ranges from -1 to 1 and is a measure of cluster distinctness and compactness. It is close to 1 when clusters are both distinct and compact. The closer the Silhouette Index is to 0, the greater Ix and the CI become. A lower CI is desirable and an indication of a better clustering structure. The logarithmic relationship between Ix and the CI means that the CI is negative when Ix is between 0 and 1, 0 when $Ix = 1$ and greater than 0 otherwise. For experiments with pre-binning, the experiment with the lowest Ix score in each bin is selected, as it represents the best clustering structure for that bin. For experiments without pre-binning, $n_{bins} = 1$ and $N_{bin_1} = N_{total}$. Weighting the Ix of each bin is important to account for the cluster membership size in that bin.
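The CI calculation described above can be sketched in a few lines. The function names and the tuple layout are illustrative assumptions, and the natural logarithm is assumed because it reproduces the published scores: with the rounded DBI, MIA and Silhouette values of the top-ranked run in Table 4 (an experiment without pre-binning), the sketch returns a CI of approximately 2.282.

```python
import math


def interim_score(dbi, mia, silhouette):
    """Ix: product of DBI, MIA and the inverse Silhouette Index."""
    return dbi * mia / silhouette


def combined_index(bins):
    """CI: log of the membership-weighted sum of Ix over all bins.

    `bins` is a list of (dbi, mia, silhouette, n_members) tuples, one per
    pre-binning bin. Without pre-binning the list has a single entry.
    The weighted sum must be positive, so runs with a non-positive
    Silhouette Index are not handled by this sketch.
    """
    n_total = sum(entry[3] for entry in bins)
    weighted_ix = sum(
        interim_score(dbi, mia, sil) * n_members / n_total
        for dbi, mia, sil, n_members in bins
    )
    return math.log(weighted_ix)  # natural log (assumption, see lead-in)


# Top-ranked run in Table 4: DBI 2.125, MIA 0.438, Silhouette 0.095.
ci_rank_1 = combined_index([(2.125, 0.438, 0.095, 1)])
```

Because the weights sum to 1, the member count passed for a single bin does not affect the result.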

5 Formalising Application Requirements

We used a combination of analysing existing standards and engagement with domain experts to formulate informal competency questions expressed in natural language. The Geo-based Load Forecasting Standard (2012) contains manually constructed load profiles and guiding principles for load forecasting in South Africa. The competency questions were developed after analysis of this standard and continuous engagement with a panel of five industry experts. There were initial interviews with all experts to elicit the usage requirements. Preliminary competency questions were presented at a workshop with key stakeholders in the community. The final version of the competency questions incorporated the feedback from the stakeholders. The competency questions were then used to construct associated qualitative evaluation measures and a cluster scoring matrix that weights these measures to provide a qualitative ranking of cluster sets in terms of the application requirements.

5.1 Eliciting Competency Questions

The following five competency questions were identified and expressed in natural language:

  1. Does the load shape deduced from clusters represent expected energy demand?

  2. Do clusters distinguish between low, medium and high demand consumers?

  3. Can clusters represent specific loading conditions for different day types and months?

  4. Can a zero-consumption profile be represented in the cluster set? (Deemed important for energy access in low income contexts, where households may go through periods of no consumption when they cannot afford to buy electricity.)

  5. Is the number of households assigned to clusters reasonable, given the sample population?

Based on these questions, we define a good cluster set as having expressive clusters and being usable within the context of the intended application. An expressive cluster must convey specific information related to particular socio-economic and temporal energy consumption behaviour. A usable cluster set must represent energy consumption behaviour that makes sense in relation to the application context, and carry the necessary information to make it pertinent to domain users. Next, qualitative evaluation measures are introduced to formalise the competency questions.

5.1.1 Cluster Expressivity

Current domain knowledge suggests that daily energy consumption behaviour is strongly influenced by daily routines, seasonal climatic variability and the energy demand (e.g. low, medium, high consumption) of a household. Beyond producing load profiles that exhibit specific features typically associated with load profiles (question 1), it is desirable that individual clusters convey specific information about the demand profiles of types of consumers (question 2), on different days of the week and months (question 3). Expressivity thus requires firstly that the RDLP of a cluster is representative of the energy consumption behaviour of the individual daily load profiles that are members of that cluster. Secondly, members of an expressive cluster must share the same context to have the ability to convey specific meaning, e.g. daily load profiles of low demand households on Sundays in June.

The mean demand errors of the total and peak consumption values measure the average deviation between the RDLP (centroid) and the cluster members’ load profiles. The mean peak coincidence ratio measures the deviation of the peak usage time between the RDLP and the daily load profiles in the cluster. Together these measures express the extent to which a RDLP is representative of the shape and demand of the cluster’s member profiles. Cluster entropy can be used to establish the information embedded in a cluster and thus its specificity. The lower the entropy, the more information is embedded in the cluster, the more specific (homogeneous) the cluster, the better the cluster. We calculate day type and monthly entropy to establish temporal specificity, and total and peak daily consumption entropy to establish demand specificity.

5.1.2 Cluster Usability

The attribute of cluster usability was derived from competency questions 4 and 5. Question 4 requires a manual evaluation based on expert judgement and is evaluated as being either true or false. Question 5 is calculated as the percentage of clusters whose membership exceeds a threshold value. Moreover, while we anticipate a relatively large number of clusters to represent the large variety of consumers, the following two factors should also be considered:

  1. Fewer clusters typically ease interpretation and are thus preferable to larger numbers of clusters

  2. The maximum number of clusters should be limited to 220, based on population diversity and existing expert models which account for 11 socio-demographic groups, 2 seasons, 2 daytypes and 5 climatic zones

5.2 Cluster Scoring Matrix

The qualitative measures translate the clustering attributes into quantifiable scores. Experiments are then ranked by their scores for each measure. The ranks are weighted by the relative importance that experts assigned to that measure. Finally, a cumulative score is calculated for each experiment by summing its weighted ranks. The lower the total score, the better the cluster set. Table 3 summarises the attributes, competency questions, qualitative measures and corresponding weights of the cluster scoring matrix. The total score of a qualitative measure for a cluster set is the mean of the individual measures of all clusters with more than 10490 members. (The threshold was selected as a value approximately equal to 5% of households using a particular cluster for 14 days.) Clusters with a small member size are excluded when calculating mean measures, as they tend to overestimate the performance of poor clusters. Moreover, cluster scores are weighted by cluster size to account for the overall effect that a particular cluster has on the set. For the mean demand error, experiments are ranked against four error metrics. The mean rank used in the cluster scoring matrix is then calculated across all errors.

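| Attribute | CQ | Qualitative measure | Weight |
|---|---|---|---|
| usable | 4 | zero-profile representation | 1 |
| usable | 5 | membership threshold ratio | 2 |
| expressive (representative) | 1 | mean demand error, total | 6 |
| expressive (representative) | 1 | mean demand error, peak | 6 |
| expressive (representative) | 1 | mean peak coincidence | 3 |
| expressive (specific) | 3 | temporal entropy, day type | 4 |
| expressive (specific) | 3 | temporal entropy, monthly | 4 |
| expressive (specific) | 2 | demand entropy, total daily | 5 |
| expressive (specific) | 2 | demand entropy, peak daily | 5 |

Table 3: Overview of qualitative evaluation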

5.3 Qualitative Evaluation Measures

5.3.1 Mean Demand Error

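The total daily demand and peak daily demand for a household and a cluster RDLP are calculated as the sum and maximum values of their respective daily load profiles $h_j^{(t)}$ and $r_x$ as follows:

$d_{total}(h_j^{(t)}) = \sum_{i=0}^{23} h_i$ and $d_{total}(r_x) = \sum_{i=0}^{23} r_i$    (8)
$d_{peak}(h_j^{(t)}) = \max_i h_i$ and $d_{peak}(r_x) = \max_i r_i$    (9)

Four error metrics are used to calculate the mean deviation between the RDLP’s peak and total daily demand and that of its members $h_j^{(t)}$. Mean absolute percentage error (MAPE) and median absolute percentage error (MdAPE) are well known error metrics. The median log accuracy ratio (MdLQ) [24] overcomes some of the drawbacks of the absolute percentage errors. The median symmetric accuracy (MdSymA) can be interpreted as a percentage error similar to MAPE, making it more intuitive than MdLQ. The demand error measures are given below and calculated for all daily load profiles $h$ assigned to cluster $k_x$ with RDLP $r_x$. Here $d$ stands for either the peak or the total daily demand.

Absolute Percentage Error
$MAPE = \frac{100}{|k_x|} \sum_{h \in k_x} \frac{|d(r_x) - d(h)|}{d(h)}$    (10)
$MdAPE = 100 \times \mathrm{median}_{h \in k_x} \left( \frac{|d(r_x) - d(h)|}{d(h)} \right)$    (11)
Median Log Accuracy ratio
$Q_h = \frac{d(r_x)}{d(h)}$    (12)
$MdLQ = \mathrm{median}_{h \in k_x} (\log Q_h)$    (13)
Median Symmetric Accuracy
$MdSymA = 100 \times \left( \exp\left( \mathrm{median}_{h \in k_x} |\log Q_h| \right) - 1 \right)$    (14)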
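The four demand error metrics can be sketched for a single cluster as follows. The sketch uses the standard definitions of MAPE and MdAPE and the usual definitions of the median log accuracy ratio and median symmetric accuracy, with the RDLP demand treated as the prediction and each member's demand as the observation. The function name and sample values are hypothetical, and member demands are assumed to be strictly positive.

```python
import math
from statistics import mean, median


def demand_errors(member_demands, rdlp_demand):
    """Error metrics between one RDLP demand value and its members' demands.

    The same function applies to peak or total daily demand.
    All demands must be > 0 (zero-demand profiles are not handled here).
    """
    abs_pct_errors = [abs(rdlp_demand - d) / d for d in member_demands]
    log_ratios = [math.log(rdlp_demand / d) for d in member_demands]
    abs_log_ratios = [abs(q) for q in log_ratios]
    return {
        "MAPE": 100 * mean(abs_pct_errors),
        "MdAPE": 100 * median(abs_pct_errors),
        "MdLQ": median(log_ratios),
        "MdSymA": 100 * (math.exp(median(abs_log_ratios)) - 1),
    }


# Hypothetical cluster: members with demands 50, 100 and 200, RDLP demand 100.
errors = demand_errors([50.0, 100.0, 200.0], 100.0)
```

In this example the RDLP over-estimates one member by a factor of two and under-estimates another by the same factor, so MdLQ is 0 (no systematic bias) while MdSymA is 100%.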

5.3.2 Mean Peak Coincidence Ratio

Peaks are all those values that are greater than half the maximum value of a daily load profile. Peak coincidence counts the times at which peak demand in a daily load profile coincides with peak demand in the RDLP of its cluster. Mean peak coincidence (MPC) is the average intersection of the actual and cluster peak times for all daily load profiles assigned to a cluster. The mean peak coincidence ratio is the ratio of mean peak coincidence to the count of peaks in the RDLP of the cluster. It has a value between 0 and 1. The magnitude of the peak is not considered in the mean peak coincidence ratio.
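A minimal sketch of the mean peak coincidence ratio, following the definition above: peak hours are those whose value exceeds half the profile maximum, and coincidence is the size of the intersection of peak hours. Function names and the shortened six-element profiles are illustrative; returning 0.0 for an RDLP without peaks is an assumption.

```python
def peak_hours(profile):
    """Indices of all values greater than half the profile's maximum."""
    threshold = max(profile) / 2
    return {hour for hour, value in enumerate(profile) if value > threshold}


def mean_peak_coincidence_ratio(member_profiles, rdlp):
    """Mean overlap of member and RDLP peak hours, relative to RDLP peak count."""
    rdlp_peaks = peak_hours(rdlp)
    if not rdlp_peaks or not member_profiles:
        return 0.0  # assumption: an all-zero RDLP has no peaks to coincide with
    coincidences = [
        len(peak_hours(profile) & rdlp_peaks) for profile in member_profiles
    ]
    mean_coincidence = sum(coincidences) / len(coincidences)
    return mean_coincidence / len(rdlp_peaks)


# Shortened hypothetical profiles; the RDLP has peaks at indices 2 and 3.
toy_rdlp = [0, 1, 4, 5, 1, 0]
toy_members = [
    [0, 0, 6, 1, 0, 0],  # one shared peak hour
    [0, 0, 3, 4, 0, 0],  # two shared peak hours
    [5, 0, 0, 0, 0, 0],  # no shared peak hour
]
ratio = mean_peak_coincidence_ratio(toy_members, toy_rdlp)
```

Only the timing of peaks enters the calculation, which is consistent with the statement that peak magnitude is not considered.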

5.3.3 Entropy as a Measure of Cluster Specificity

Entropy $H$ is used to quantify the specificity of clusters and is calculated as follows:

$H(f) = -\sum_{i=1}^{n} p(v_i) \log p(v_i)$    (15)

$f$ is a feature vector with possible values $\{v_1, \dots, v_n\}$. $p(v_i)$ is the probability that daily load profiles with value $v_i$ are assigned to cluster $k_x$. Day type entropy $H(f_{daytype})$ expresses the specificity of a cluster with regards to day of the week. Thus $f_{daytype}$ has possible values {Monday, Tuesday, ..., Sunday}. $p(\text{Sunday})$ is the likelihood that daily load profiles that are used on a Sunday are assigned to cluster $k_x$. $f_{month}$ has possible values {January, February, ..., December} and is used to calculate monthly entropy $H(f_{month})$. To calculate peak and total daily demand entropy, we created percentile demand bins. Thus the possible values of the demand features are the percentile bins. For peak demand, $p(v_{60})$ is the likelihood that daily load profiles with peak demand corresponding to that of the 60th peak demand percentile are assigned to cluster $k_x$.
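The entropy of a cluster with respect to one feature can be sketched as below. Two assumptions are made that the text does not fix: probabilities are estimated as the share of the cluster's members that carry each feature value (so they sum to 1 within the cluster), and the logarithm is taken to base 2. The function name and sample data are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(feature_values):
    """Entropy of the feature values observed among one cluster's members.

    `feature_values` holds one value per member profile, e.g. its day of
    the week, its month, or its demand percentile bin.
    """
    counts = Counter(feature_values)
    total = sum(counts.values())
    shares = [count / total for count in counts.values()]
    return -sum(share * math.log2(share) for share in shares)  # base 2 assumed


weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# A cluster used only on Sundays is maximally specific (entropy 0).
sunday_only = cluster_entropy(["Sun"] * 20)

# A cluster used equally on every day of the week carries no day type information.
every_day = cluster_entropy(weekdays * 5)
```

The lower value for the Sunday-only cluster reflects the reading given in Section 5.1.1: the lower the entropy, the more specific the cluster.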

6 Evaluation of Clustering Results

A total of 2083 experiment runs were conducted using the parameter values outlined in Table 2. Each run was first evaluated with the quantitative CI score. The best runs of the best experiments were then further evaluated with the cluster scoring matrix. We implemented our experiments in Python 3.6.5 using k-means algorithms from scikit-learn (0.19.1) and self-organising maps from the SOMOCLU (1.7.5) libraries. (The codebase is available online at https://github.com/anon.)

6.1 Quantitative Clustering Results

The CI scores for all experiments range from 2.282296 to 9.626502. Lower scores are better. Almost two thirds (65.5%) of experiments have a score below 4. These experiments have been normalised with unit norm, de-minning or zero-one. The remaining experiments have scores above 5 and have not been normalised, or have been normalised with SA norm. The top 10 ranked experiment runs based on the CI score are shown in Table 4. Closer analysis of the results confirms that normalisation significantly impacts clustering results. Almost all of the top experiments have been normalised with unit norm, with the exception of two experiments normalised with zero-one. The effects of pre-binning are less clear. Both pre-binning approaches and runs without pre-binning are represented in the top results. Kmeans is the uncontested best clustering algorithm. Four runs belong to exp. 1 (kmeans, unit norm), but were initialised with different parameters (m = 47, 35, 50 and 32).

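| Rank | CI | DBI | MIA | Sil. | Exp. | Alg. | m | Norm. |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.282 | 2.125 | 0.438 | 0.095 | 1 | kmeans | 47 | unit |
| 2 | 2.289 | 1.616 | 1.220 | 0.262 | 4 | kmeans | 17 | 0-1 |
| 3 | 2.296 | 1.616 | 1.220 | 0.260 | 3 | kmeans | 17 | 0-1 |
| 4 | 2.301 | 2.152 | 0.485 | 0.119 | 5 | kmeans | 82 | unit |
| 5 | 2.316 | 2.115 | 0.447 | 0.093 | 1 | kmeans | 35 | unit |
| 6 | 2.320 | 2.199 | 0.486 | 0.121 | 4 | kmeans | 71 | unit |
| 7 | 2.349 | 2.152 | 0.481 | 0.143 | 6 | kmeans | 49 | unit |
| 8 | 2.351 | 2.189 | 0.434 | 0.090 | 1 | kmeans | 50 | unit |
| 9 | 2.354 | 2.111 | 0.476 | 0.128 | 7 | kmeans | 59 | unit |
| 10 | 2.355 | 2.173 | 0.453 | 0.093 | 1 | kmeans | 32 | unit |

Table 4: Top 10 runs ranked by CI score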

For both the kmeans and SOM algorithms the batch fit time increases linearly with dimensionality. For SOM+kmeans the SOM is used for dimensionality reduction, and the dimensions explored are thus considerably greater. This significantly increases experiment run times, as shown in Table 5.

Algorithm Mean CI score Mean run time (s)
k-means 2.59 44.79
SOM 4.11 39.42
SOM + k-means 3.17 1498.77
Table 5: Summary of algorithm CI scores and run times

6.2 Qualitative Clustering Results

The results of the qualitative rescoring with the cluster scoring matrix are presented in Table 6 for the top runs of the top experiments. As with the CI score, lower scores are better. The CI rank is shown in the last column for comparison. Despite being ranked 9th by CI score, exp. 7 (kmeans, unit norm) is now ranked 1st. Table 7 shows a detailed view of the cluster scoring matrix, with rankings for individual qualitative measures. The second best run, exp. 4 (kmeans, unit norm), ranks highly for entropy and demand error measures, but has a poorer peak coincidence ratio. Exp. 5 (kmeans, unit norm) ranks third for most measures. While the top two runs lie only 8 points apart, they comfortably outperform the third best run, which has roughly double the score.

Rank Score Exp. Norm. Pre-binning Zeros CI rank
1 57.0 7 unit int. kmeans False 9
2 65.0 4 unit AMC True 6
3 117.5 5 unit AMC False 4
4 143.5 6 unit int. kmeans True 7
5 150.0 1 unit none True 1
6 205.0 4 0-1 AMC True 2
7 214.5 3 0-1 AMC True 3
Table 6: Top runs ranked by qualitative scores

Qualitative measure      W   Exp. 1   Exp. 3   Exp. 4   Exp. 4   Exp. 5   Exp. 6   Exp. 7
                             (unit)   (0-1)    (unit)   (0-1)    (unit)   (unit)   (unit)
threshold ratio          2   1        5        3        5        7        4        1
peak coincidence ratio   3   1        7        4        6        2        5        3
peak demand error        6   5.50     5.50     2.00     5.50     4.00     3.00     1.50
total demand error       6   5.00     6.25     2.00     6.00     3.25     3.75     1.00
peak demand entropy      5   5        7        2        6        3        4        1
total demand entropy     5   5        6        1        6        3        4        2
day type entropy         4   4        6        1        6        3        5        2
monthly entropy          4   4        6        1        6        3        5        2
SCORE                        150.0    214.5    65.0     205.0    117.5    143.5    57.0
Table 7: Cluster Scoring Matrix (W is the weight of each measure; entries are ranks per measure)
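The SCORE row of Table 7 can be reproduced as the weighted sum of the ranks in each column. The sketch below is a minimal illustration of that calculation for the three best runs, assuming the score is exactly the sum of weight times rank; the weights and ranks are copied from Table 7, while the dictionary and function names are hypothetical.

```python
# Weights (column W of Table 7) for each qualitative measure
WEIGHTS = {
    "threshold ratio": 2,
    "peak coincidence ratio": 3,
    "peak demand error": 6,
    "total demand error": 6,
    "peak demand entropy": 5,
    "total demand entropy": 5,
    "day type entropy": 4,
    "monthly entropy": 4,
}

# Ranks per measure (same order as WEIGHTS), copied from Table 7
RANKS = {
    "exp. 7 (unit)": [1, 3, 1.50, 1.00, 1, 2, 2, 2],
    "exp. 4 (unit)": [3, 4, 2.00, 2.00, 2, 1, 1, 1],
    "exp. 5 (unit)": [7, 2, 4.00, 3.25, 3, 3, 3, 3],
}


def weighted_score(ranks, weights=WEIGHTS):
    """Sum of weight * rank over all qualitative measures (lower is better)."""
    return sum(w * r for w, r in zip(weights.values(), ranks))


scores = {run: weighted_score(ranks) for run, ranks in RANKS.items()}
ranking = sorted(scores, key=scores.get)  # best (lowest score) first
```

With these inputs the scores are 57.0, 65.0 and 117.5, which match the values reported for exp. 7, exp. 4 and exp. 5 in Tables 6 and 7.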

The day type entropy of the best experiment, exp. 7 (kmeans, unit norm), is shown in Figure 1 to gain an intuition of its expressivity and usability. The figure visualises the likelihood that a cluster is used on a particular day of the week (see Eq. 15). The higher the peak of a line, the more likely that profiles assigned to that cluster are used on that day of the week. The lower the peak, the less likely that this is the case. Cluster 15 is a good example of a cluster that has a very high likelihood of being used on a Sunday, and a lower likelihood of being used on a Saturday or weekday. This cluster is thus specific to the Sunday day type, which is desirable.

Figure 1: Day type entropy for exp. 7 (kmeans, unit norm)

6.3 Contrasting Quantitative & Qualitative Results

Contrasting the RDLPs of exp. 4 (kmeans, zero-one) with those of exp. 7 (kmeans, unit norm) shows the potential of the qualitative evaluation measures. Exp. 4 (kmeans, zero-one) ranked second based on the CI score, but second last in the cluster scoring matrix. Exp. 7 (kmeans, unit norm) on the other hand ranked ninth by CI score, yet ranked first based on qualitative measures. Comparing the RDLPs in Figures 2 and 3 clearly shows that the latter have greater potential for generating customer archetypes. Exp. 4 (kmeans, zero-one) has only 18 clusters. The five smallest clusters combined have fewer than 1500 member profiles and are barely visible in the bar chart at the bottom of Figure 2. The ragged shapes of the RDLPs of Clusters 16, 17 and 18 indicate that very few profiles were aggregated in these RDLPs. Over half of all load profiles belong to only three clusters: Clusters 5, 6 and 9. As a whole, the individual RDLPs lack distinguishing features, making them neither expressive nor usable, and thus poor candidates for creating customer archetypes.

Figure 2: RDLPs and cluster membership of exp. 4 (kmeans, zero-one)

Exp. 7 (kmeans, unit norm) on the other hand has 59 clusters. With the exception of Cluster 33 which accounts for roughly 15% of all daily load profiles, cluster membership for the remaining clusters varies in a range from 15 000 to 100 000 members. Cluster 33 is one of only two clusters in a bin with large membership, due to the high number of low consumption households captured in our dataset population. Collectively, the individual RDLPs are representative and specific, which promises that they will be useful for constructing customer archetypes.

Figure 3: RDLPs and cluster membership of exp. 7 (kmeans, unit norm)

6.4 Discussion of Clustering Results

We found that while traditional analytical metrics provided a useful tool for identifying the most distinct and compact cluster sets, the CI score was limited for analysis and comparison within the application context. The relative difference between the CI scores of the first and tenth ranked runs is only 3.2% (2.282 versus 2.355). Selecting the best set of clusters based on the CI score alone does not provide insights on the expressivity and usability of clusters, and their potential for producing candidate RDLPs that can be used to generate customer archetypes. This confirms the conclusions drawn by previous studies [20], where expert judgement was still required to analyse and select the best cluster set. The qualitative scores span a greater range of values than the CI scores and are grounded in interpretable measures. This makes the results more meaningful and eases the analysis and final selection of the best cluster set.

As expected, normalisation significantly impacts clustering results. There is a distinct difference in performance between experiments normalised with algorithms that transform daily load profiles to values between 0 and 1 (unit norm, de-minning and zero-one normalisation) and those that do not (SA norm and unnormalised experiments). Unit norm was the best normalisation for most experiments. SA norm performed the worst. This was no surprise, as the Euclidean distance measure and the error metrics are severely impacted by the larger values that this normalisation permits. While pre-binning appears promising, more rigorous analysis is warranted to assess its effectiveness. Comparing the clustering algorithms, kmeans outperformed the SOM and SOM+kmeans techniques for almost all experiments. This result corresponds with the suggestions made in the cluster analysis literature and with the results of previous studies [20]. While the type of dataset is well suited to clustering with kmeans, alternative partitional clustering algorithms such as k-medoids, as well as alternative distance measures such as Dynamic Time Warping, should still be explored.

7 Application of Clusters to Construct Customer Archetypes

We now explore how effective the best cluster set, exp. 7 (k-means, unit norm), is for constructing customer archetypes. We describe a customer archetype created by experts, illustrate how our system can be used to create such an archetype and compare our archetype to the one created by the experts.

7.1 Expert Archetype for Lower Middle Class Customers in KZN

A customer archetype represents the expected energy consumption behaviour (RDLPs) of a given type of household, distinguished by its socio-demographic characteristics. We used the archetype of a lower middle class, long term electrified household in KwaZulu-Natal (KZN), South Africa as a use case. Figure 4 depicts a customer archetype developed by experts for such a household. KZN lies in the east of South Africa, and consequently has an earlier sunrise and sunset than most other parts of the country. Work day morning peaks are expected between 5am and 7am, and evening peaks between 5pm and 7pm. The climate is subtropical, with humid summers and warm winters.

Figure 4: Expert archetype for medium-term electrified lower middle class households in KZN

Table 8 shows the specific characteristics of different household types identified by experts. The lower middle class household described above is a household that has piped water access (tap in house), a floor area between 80 m² and 150 m², walls constructed from asbestos, blocks or bricks, and a monthly income between R7 800 and R11 600.

Archetype      Water                       Wall material             Floor area (m²)   Monthly income
rural          river/dam                   daub/mud/clay             0-50              R0-R1.8k
informal       street taps, tap in yard    corr. iron/zinc           0-50              R1.8k-R3.2k
township       tap in house                asbestos, blocks, brick   50-80             R3.2k-R7.8k
lower middle   tap in house                asbestos, blocks, brick   80-150            R7.8k-R11.6k
upper middle   tap in house                brick                     150-250           R19k-R24.5k
Table 8: Attributes of customer archetypes

7.2 System generated archetype for KZN Lower Middle Class Customers

We used the clusters from exp. 7 (k-means, unit norm) to reconstruct the above archetype with a simple multi-class regression model that maps the socio-demographic attributes of cluster members to their clusters. To train the model, we created a feature input vector of the socio-demographic and temporal attributes (i.e. day type and season) for each daily load profile belonging to a cluster. The socio-demographic household data was discretised into the ranges recommended by domain experts, as shown in Table 8. Each input vector was labelled with the cluster to which the daily load profile was assigned. The model outputs odds ratios that indicate the likelihood that a particular feature value (i.e. socio-demographic or temporal attribute) is correlated with a cluster. We selected attributes to characterise clusters if the odds ratio was equal to or greater than 1.05. The model was trained with WEKA’s multinomial logistic regression algorithm (footnote 5: https://www.cs.waikato.ac.nz/ml/weka/), but any appropriate classification method can be used. The full implementation details are in [34].
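The attribute selection step can be illustrated with a short sketch. In a logistic regression model the odds ratio of a feature value is the exponential of its fitted coefficient, so the 1.05 threshold can be applied directly once coefficients are available. The coefficients below are hypothetical values for a single cluster, not results from the trained WEKA model, and the feature names and function name are illustrative only.

```python
import math

ODDS_RATIO_THRESHOLD = 1.05  # selection threshold stated in the text

# Hypothetical coefficients of one cluster, one per discretised feature value
coefficients = {
    "water=tap in house": 0.40,
    "floor_area=80-150": 0.22,
    "income=R7.8k-R11.6k": 0.09,
    "season=winter": 0.03,
    "daytype=Sunday": -0.51,
}


def characteristic_attributes(coefs, threshold=ODDS_RATIO_THRESHOLD):
    """Return feature values whose odds ratio exp(coefficient) meets the threshold."""
    odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
    return {name: ratio for name, ratio in odds_ratios.items() if ratio >= threshold}


selected = characteristic_attributes(coefficients)
```

In this illustration the water access, floor area and income values are retained, while the winter and Sunday attributes fall below the threshold and would not be used to characterise the cluster.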

Winter                        Summer
Cluster   Day type            Cluster   Day type
3         weekday             1         Saturday, Sunday
35        weekday             4         weekday, Friday
36        Saturday, Sunday    5         Saturday, Sunday
38        weekday, Friday
Table 9: Temporal attributes of clusters in Fig 5

Figure 5: Clusters and RDLPs for medium-term electrified lower middle class households in KZN

Seven clusters showed a strong correlation with the socio-demographic attributes of this archetype. Table 9 shows the day type and seasonal attributes of these seven clusters. Each day type in each season is represented by at least one cluster. Full temporal coverage like this is desirable. Work day and weekend clusters, and winter and summer clusters, are mutually exclusive. There are three winter weekday clusters (Clusters 3, 35 and 38), one summer weekday cluster (Cluster 4), one winter weekend cluster (Cluster 36) and two summer weekend clusters (Clusters 1 and 5). Figure 5 shows the RDLPs for the clusters. The RDLPs of all weekday clusters resemble a typical ‘out of home’ shape, with either a high morning or evening peak and lower consumption throughout the day. This is expected for a lower middle class household, where adults are typically blue collar workers who have a fixed work routine. The RDLPs of Clusters 1, 5 and 36 show a strong correlation with weekends. Clusters 1 and 36 are indicative of a slow starting day when there is no job to rush to. Cluster 5 with its peak at 12pm is typical for families that have a strong tradition of a shared family lunch on weekends. The summer weekday RDLP of Cluster 4 has an earlier morning peak than those of the winter weekday clusters. The weekday RDLPs of Clusters 3, 4 and 35 show an earlier evening peak. With the exception of Cluster 3, the winter RDLPs have a higher energy demand throughout the day than the summer RDLPs.

As a whole, the cluster-constructed RDLPs of this archetype were found to resemble expected customer behaviour. However, some discrepancies exist in relation to the expert archetype. In contrast to the expert RDLPs in Figure 4, the shapes of the cluster-constructed RDLPs have only one distinct peak, either in the morning or evening. While the peak times correspond between the archetypes, the peak demand values of the expert archetype are approximately half the value of those of the cluster-constructed RDLPs. The expert archetype represents the aggregate consumption of a group of households and has only one RDLP for each temporal energy usage context. If we were to aggregate all the cluster-constructed RDLPs for common temporal contexts, for example the three winter weekday RDLPs, the single resultant profile shape and its peak demand would more closely resemble those of the expert archetype.
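The effect of aggregating RDLPs that share a temporal context can be sketched as follows. This is an illustration only: it assumes hourly RDLPs with 24 values and aggregates them as a membership-weighted mean, which is one plausible choice rather than the procedure used by the experts. The profiles, member counts and function names are hypothetical.

```python
HOURS = 24  # assumption: one mean demand value per hour of the day


def aggregate_rdlps(rdlps, member_counts):
    """Membership-weighted mean of several RDLPs of equal length."""
    if not rdlps or len(rdlps) != len(member_counts):
        raise ValueError("need exactly one member count per RDLP")
    total_members = sum(member_counts)
    return [
        sum(count * profile[hour] for profile, count in zip(rdlps, member_counts))
        / total_members
        for hour in range(len(rdlps[0]))
    ]


def single_peak_profile(peak_hour, base=1.0, peak=4.0):
    """Toy profile with constant base demand and one peak hour."""
    return [peak if hour == peak_hour else base for hour in range(HOURS)]


# Illustrative values, not data: one morning-peak and one evening-peak RDLP
morning = single_peak_profile(peak_hour=6)
evening = single_peak_profile(peak_hour=18)
combined = aggregate_rdlps([morning, evening], member_counts=[1000, 1000])
```

In this toy case the aggregated profile has both a morning and an evening peak, and each peak is lower than in the individual profiles (2.5 instead of 4.0), which is the direction of change described above.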

8 Discussion

In this work we present a method for creating customer archetypes for residential energy consumers from metered energy data. We used time series clustering techniques to generate 2083 candidate cluster sets. Like previous studies, we found traditional quantitative metrics insufficient for evaluating and selecting the most useful cluster set. We thus set out to formalise and automate the subsequent visual analysis of cluster sets, which is typically carried out manually by experts. Our approach used competency questions to elicit expert knowledge and to specify the requirements for a given clustering application to generate customer archetypes. This approach enabled us to reduce cluster analysis and evaluation time and made cluster selection less subjective and ad hoc. The usefulness of the selected cluster set is demonstrated in an application that uses the RDLPs generated from clusters to reconstruct a customer archetype previously developed by experts.

We found that even though competency questions were highly effective for engaging with experts and eliciting domain knowledge and requirements, they lack intrinsic support for evaluating and selecting cluster sets. We therefore introduced a collection of qualitative measures and a cluster scoring matrix to translate the competency questions into a ranking system for evaluating and comparing cluster sets. The cluster scoring matrix has been used to rank and guide the selection of a robust cluster set that satisfies the specified application requirements. It eases the scoring and ranking of experiments, while also making validation explicit, transparent and repeatable. Entropy in particular is a promising approach for evaluating the contextual specificity of clusters. While the results produced by the cluster scoring matrix are promising, the overall score does depend on manually selecting weights for the different measures and setting a minimum threshold count to filter out small clusters.

In our use case application, we evaluated a cluster-generated customer archetype against an expert archetype. The evaluation first assessed whether the RDLPs have sufficient temporal representative strength to characterise the customer archetype. This was done by examining the temporal coverage and the seasonal and day type exclusivity of the RDLPs. Second, the shape, peak time and energy demand of the archetype’s RDLPs were compared against those of the expert archetype. We found that the cluster-generated archetype compared favourably against the expert benchmark.

9 Conclusion

This paper presents an end-to-end approach for automatically creating residential energy customer archetypes from energy meter data in a highly diverse, developing country population. Our extensive comparison of clustering and pre-processing techniques has demonstrated that normalisation significantly impacts clustering results, while the benefit of pre-binning requires further analysis. Moreover, the typically time-consuming visual analysis of quantitative results is aided by using competency questions to formalise local domain expertise. The subsequent evaluation of the clusters in a real-world application shows that the approach has promise for automating the generation of customer archetypes for real-world, long-term energy planning. While this approach has only been evaluated in the residential energy sector, similar approaches may be promising in other residential utility domains, such as the water sector.

References

  • [1] F. Batrinu, G. Chicco, R. Napoli, F. Piglione, P. Postolache, M. Scutariu, and C. Toader (2005) Efficient iterative refinement clustering for electricity customer classification. 2005 IEEE Russ. Power Tech, PowerTech, pp. 1–7.
  • [2] J. C. Bezdek and N. R. Pal (1998) Some New Indexes of Cluster Validity. IEEE Trans. Syst. Man Cybern. B 28 (3), pp. 301–315.
  • [3] S. M. Bidoki, N. Mahmoudi-Kohan, M. H. Sadreddini, M. Z. Jahromi, and M. P. Moghaddam (2010) Evaluating different clustering techniques for electricity customer classification. 2010 IEEE PES Transm. Distrib. Conf. Expo. Smart Solut. a Chang. World, pp. 1–5.
  • [4] H. A. Cao, C. Beckel, and T. Staake (2013) Are domestic load profiles stable over time? An attempt to identify target households for demand side management campaigns. IECON Proc. (Industrial Electron. Conf.), pp. 4733–4738.
  • [5] G. Chicco, R. Napoli, and F. Piglione. A Review of Concepts and Techniques for Emergent Customer Categorisation.
  • [6] G. Chicco, R. Napoli, and F. Piglione (2003) Application of clustering algorithms and Self Organising Maps to classify electricity customers. 2003 IEEE Bol. PowerTech - Conf. Proc. 1, pp. 373–379.
  • [7] G. Chicco, R. Napoli, and F. Piglione (2006) Comparison Among Clustering Techniques for Electricity Customer Classification. IEEE Trans. Power Syst. 21 (2), pp. 1–7.
  • [8] T. Dang-Ha, R. Olsson, and H. Wang (2017) Clustering Methods for Electricity Consumers: An Empirical Study in Hvaler-Norway. NIK-2017. arXiv:1703.02502.
  • [9] D. L. Davies and D. W. Bouldin (1979) A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1 (2), pp. 224–227.
  • [10] A. De Nicola, M. Missikoff, and R. Navigli (2009) A software engineering approach to ontology building. Inf. Syst. 34 (2), pp. 258–275.
  • [11] I. Dent, T. Craig, U. Aickelin, and T. Rodden (2014) Variability of behaviour in electricity load profile clustering; Who does things at the same time each day? Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 8557 LNAI, pp. 70–84. arXiv:1409.1043.
  • [12] I. Dent (2015) Deriving knowledge of household behaviour from domestic electricity usage metering. (July).
  • [13] V. Figueiredo, F. Rodrigues, Z. Vale, and J. B. Gouveia (2005) An electric energy consumer characterization framework based on data mining techniques. IEEE Trans. Power Syst. 20 (2), pp. 596–602.
  • [14] M. S. Fox and M. Grüninger (1994) Ontologies for enterprise integration. In CoopIS, pp. 82–89.
  • [15] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos (2019) Comparing similarity perception in time series visualizations. IEEE Transactions on Visualization and Computer Graphics 25 (1), pp. 523–533.
  • [16] R. Granell, C. J. Axon, and D. C. H. Wallom (2015) Impacts of Raw Data Temporal Resolution Using Selected Clustering Methods on Residential Electricity Load Profiles. IEEE Trans. Power Syst. 30 (6), pp. 3217–3224.
  • [17] M. Grüninger and M. S. Fox (1995) The role of competency questions in enterprise engineering. In Benchmarking — Theory and Practice, A. Rolstadås (Ed.), pp. 22–31.
  • [18] S. Heunis and M. Dekenah (2014) Manual for Eskom Distribution Pre-Electrification Tool (DPET). Eskom Holdings Limited, Johannesburg.
  • [19] J. Han, M. Kamber, and J. Pei (2012) Data Mining Concepts & Techniques. Third edition, Morgan Kaufmann Publishers.
  • [20] L. Jin, D. Lee, A. Sim, S. Borgeson, K. Wu, C. A. Spurlock, and A. Todd (2017) Comparison of Clustering Techniques for Residential Energy Behavior Using Smart Meter Data. AAAI Work. Artif. Intell. Smart Grids Smart Build., pp. 260–266.
  • [21] L. Jin, A. Spurlock, S. Borgeson, D. Fredman, L. Hans, S. Patel, and A. Todd (2016) Load Shape Clustering Using Residential Smart Meter Data: a Technical Memorandum. (September), pp. 1–15.
  • [22] J. Kwac, J. Flora, and R. Rajagopal (2014) Household energy consumption segmentation using hourly data. IEEE Trans. Smart Grid 5 (1), pp. 420–430.
  • [23] F. McLoughlin, A. Duffy, and M. Conlon (2015) A clustering approach to domestic electricity load profile characterisation using smart metering data. Appl. Energy 141, pp. 190–199.
  • [24] S. K. Morley (2016) Alternatives to accuracy and bias metrics based on percentage errors for radiation belt modeling applications.
  • [25] S. Ramos, J. M. M. Duarte, J. Soares, Z. Vale, and F. J. Duarte (2012) Typical Load Profiles in the Smart Grid Context: A Clustering Methods Comparison. 2012 IEEE Power Energy Soc. Gen. Meet., pp. 1–8.
  • [26] T. Räsänen, D. Voukantsis, H. Niska, K. Karatzas, and M. Kolehmainen (2010) Data-based method for creating electricity use load profiles using large amount of customer-specific hourly measured electricity use data. Appl. Energy 87 (11), pp. 3538–3545.
  • [27] J. D. Rhodes, W. J. Cole, C. R. Upshaw, T. F. Edgar, and M. E. Webber (2014) Clustering analysis of residential electricity demand profiles. Appl. Energy 135, pp. 461–471.
  • [28] W. S. Sarle, A. K. Jain, and R. C. Dubes (1990) Algorithms for Clustering Data. Vol. 32.
  • [29] L. G. Swan and V. I. Ugursal (2009) Modeling of end-use energy consumption in the residential sector: A review of modeling techniques. Renew. Sustain. Energy Rev. 13 (8), pp. 1819–1835.
  • [30] T. Teeraratkul, D. O’Neill, and S. Lall (2018) Shape-Based Approach to Household Electric Load Curve Clustering and Prediction. IEEE Trans. Smart Grid 9 (5). arXiv:1702.01414.
  • [31] W. Toussaint and D. Moodley (2019) Comparison of clustering techniques for residential load profiles in South Africa. Proceedings of the South African Forum for AI Research.
  • [32] W. Toussaint (2019) Domestic electrical load metering, hourly data 1994-2014. Version 1. DataFirst.
  • [33] W. Toussaint (2019) Domestic electrical load survey - key variables 1994-2014. Version 1. DataFirst.
  • [34] W. Toussaint (2019) Evaluation of clustering techniques for generating household energy consumption patterns in a developing country.
  • [35] G. J. Tsekouras, N. D. Hatziargyriou, and E. N. Dialynas (2007) Two-stage pattern recognition of load curves for classification of electricity customers. IEEE Trans. Power Syst. 22 (3), pp. 1120–1128.
  • [36] M. Uschold and M. Gruninger (1996) Ontologies: principles, methods and applications. Knowledge Engineering Review 11, pp. 93–136.
  • [37] J. L. Viegas, S. M. Vieira, R. Melício, V. M.F. Mendes, and J. M.C. Sousa (2016) Classification of new electricity customers based on surveys and smart metering data. Energy 107, pp. 804–817.
  • [38] J. L. Viegas, S. M. Vieira, J. M.C. Sousa, R. Melício, and V. M.F. Mendes (2015) Electricity demand profile prediction based on household characteristics. Int. Conf. Eur. Energy Mark. EEM 2015-Augus, pp. 0–4.
  • [39] S. Xu, E. Barbour, and M. C. González (2017) Household Segmentation by Load Shape and Daily Consumption. Proc. of. ACM SigKDD 2017 Conf., pp. 1–9.

Appendix A Visualisations of descriptive statistics for input dataset

(a) Monthly income distribution
(b) Dwelling floor area distribution
(c) Years electrified distribution
Figure 6: Descriptive statistics of DEL survey respondents
(a) Proportioned survey responses for water access, wall and roof materials
(b) Histogram of mean daily household power consumption in 10kWh bins
Figure 7: Descriptive statistics of DEL survey respondents

Appendix B Supplementary Tables for Clustering Experiments

B.1 Bin ranges for AMC pre-binning

Bin   AMC range         Note
1     0 - 1 kWh         no consumption
2     2 - 50 kWh        lifeline tariff - free basic electricity
3     51 - 150 kWh
4     151 - 400 kWh
5     401 - 600 kWh
6     601 - 1200 kWh
7     1201 - 2500 kWh
8     2501 - 4000 kWh
Table 10: AMC bins based on South African electricity tariffs
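The bin ranges in Table 10 can be applied with a simple lookup, sketched below. The upper bin edges are taken from the table; treating each upper edge as inclusive, so that fractional values between two listed ranges fall into the higher bin, is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_left

# Upper bin edges in kWh for bins 1..8, taken from Table 10
AMC_UPPER_EDGES = [1, 50, 150, 400, 600, 1200, 2500, 4000]


def amc_bin(amc_kwh):
    """Return the Table 10 bin number (1-8) for an AMC value in kWh.

    Assumption: upper edges are inclusive, e.g. 50 kWh falls into bin 2.
    """
    if amc_kwh < 0:
        raise ValueError("AMC cannot be negative")
    index = bisect_left(AMC_UPPER_EDGES, amc_kwh)
    if index == len(AMC_UPPER_EDGES):
        raise ValueError("AMC exceeds the highest bin in Table 10")
    return index + 1


example_bins = [amc_bin(value) for value in (0, 50, 51, 400, 4000)]
```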

B.2 Clustering metrics

The Silhouette Index for an individual pattern $x_i$ in the dataset is:

$$ s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} \tag{16} $$

Compactness $a(x_i)$ is the average distance between $x_i$ and all other patterns in the same cluster. Distinctness $b(x_i)$ is the average distance between $x_i$ and all remaining patterns that are not in the same cluster.
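As a minimal sketch of this definition, the function below computes the silhouette value of one pattern as (distinctness − compactness) / max(compactness, distinctness), using Euclidean distance. Distinctness follows the wording above (all patterns outside the pattern's cluster); the more common definition uses only the nearest neighbouring cluster, and the two coincide when there are exactly two clusters. The toy patterns and the function name are hypothetical.

```python
import math


def silhouette(index, patterns, labels):
    """Silhouette value of patterns[index] given one cluster label per pattern."""
    x = patterns[index]
    same = [math.dist(x, p) for k, p in enumerate(patterns)
            if labels[k] == labels[index] and k != index]
    other = [math.dist(x, p) for k, p in enumerate(patterns)
             if labels[k] != labels[index]]
    if not same or not other:
        return 0.0  # convention for singleton clusters or a single cluster
    compactness = sum(same) / len(same)
    distinctness = sum(other) / len(other)
    largest = max(compactness, distinctness)
    return 0.0 if largest == 0 else (distinctness - compactness) / largest


# Toy one-dimensional patterns in two well separated clusters
toy_patterns = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_labels = [0, 0, 1, 1]
s_first = silhouette(0, toy_patterns, toy_labels)
```

For the first toy pattern the compactness is 2 and the distinctness is 11, giving a silhouette value of 9/11, close to the maximum of 1.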

The Davies Bouldin Index (DBI) for two clusters $C_i$ and $C_j$ is calculated as the ratio of the sum of the cluster dispersions, $S_i + S_j$, and the distance $d(c_i, c_j)$ between the two cluster centroids.

$$ R_{ij} = \frac{S_i + S_j}{d(c_i, c_j)} \tag{17} $$

Cluster dispersion can be calculated using different measures. A simple method is to compute it as the average distance between the centroid of a cluster and each pattern in the cluster. The DBI for the dataset is obtained by averaging the similarity measure of each cluster and its most similar cluster, $\max_{j \neq i} R_{ij}$, for all clusters. A small DBI value indicates that cluster dispersions are small and distances between clusters are large, which is desirable. When plotting the DBI against the number of clusters, the optimal number of clusters can be visually identified. It is possible for the DBI to have several local minima [9].
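The DBI calculation described above can be sketched in a few lines. The example uses Euclidean distance and the simple dispersion measure mentioned in the text (average distance between a cluster's centroid and its patterns); the toy clusters and function names are hypothetical, and at least two clusters with distinct centroids are assumed.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equally sized tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def davies_bouldin(clusters):
    """DBI: mean over clusters of the worst (largest) pairwise similarity ratio."""
    centroids = [centroid(points) for points in clusters]
    dispersions = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (dispersions[i] + dispersions[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)


# Toy one-dimensional clusters: centroids 1, 11, 32 and dispersions 1, 1, 2
toy_clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)], [(30.0,), (34.0,)]]
dbi = davies_bouldin(toy_clusters)
```

For these toy clusters the largest ratios are 0.2, 0.2 and 1/7, so the DBI is 19/105, roughly 0.18. Moving the clusters further apart or making them tighter lowers the value.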