1 Background and Previous Work
While cluster analysis is an established unsupervised machine learning technique, identifying the optimal set of clusters for a specific application requires extensive experimentation and domain knowledge. Cluster compactness and distinctness are two important attributes that characterise a good cluster set (Sarle et al., 1990), and different metrics, such as the Mean Index Adequacy (MIA), the Davies-Bouldin Index (DBI) and the Silhouette Index, have been proposed to measure them. In practice, a combination of measures together with additional expert guidance and visual inspection of clustering results is often used during the experimental process to identify the best cluster set (Jin et al., 2017; Dent et al., 2014). However, these qualitative approaches can be ad hoc and time consuming, subjective and difficult to reproduce, and biased by the expert's interpretation of the visual representation (Gogolou et al., 2019). This work shows how competency questions from the ontology engineering community can be used to guide cluster set selection for generating representative daily load profiles that are suitable for developing customer archetypes of residential consumers in South Africa. A daily load profile describes the energy consumption pattern of a household over a 24-hour period. Representative daily load profiles (RDLPs) are indicative of distinct daily energy usage behaviour for different types of households. Customer archetypes are developed to represent groupings of energy users that consume energy in a similar manner. RDLPs have been well explored for generating customer archetypes for applications in long-term energy modelling (Figueiredo et al., 2005; McLoughlin et al., 2015).
Traditionally, the most common approaches used for clustering load profiles are centroid-based methods and variants of k-means, self-organising maps (SOM) and hierarchical clustering (Jin et al., 2017). For residential consumers, the variable nature of individual households makes the interpretation of clustering results ambiguous (Swan and Ugursal, 2009), a challenge that is exacerbated in highly diverse, developing-country populations, where economic volatility, income inequality, and geographic and social diversity contribute to increased variability of residential energy demand (Heunis and Dekenah, 2014). Xu et al. (2017) have used pre-binning, which involves applying a two-stage clustering algorithm that first clusters load profiles by overall consumption and then by load shape, to improve clustering results for highly variable households spread across the United States. In addition to the general clustering metrics, Kwac et al. (2014) also propose the notion of entropy as a metric for capturing the variability of electricity consumption of a household. To evaluate the result of segmenting a large number of daily load profiles into interpretable consumption patterns, Xu et al. (2017) use peak overlap, percentage error in overall consumption and entropy as metrics. In ontology engineering, competency questions are an established methodology used to specify the requirements of an ontology and to evaluate the extent to which a particular ontology meets these requirements (Grüninger and Fox, 1995). Brainstorming, expert interviews and consultation of established sources of domain knowledge are processes that can be used to identify competency questions (De Nicola et al., 2009). Informal competency questions can be expressed in natural language and connect a proposed ontology to its application scenarios, thus providing an informal justification for the ontology (Uschold and Gruninger, 1996). To our knowledge, competency questions have not been used previously to evaluate clustering structures.
2 Data
The Domestic Electrical Load Metering Hourly (DELMH) (Toussaint, 2019) dataset contains 3 295 194 daily load profiles for 14 945 South African households over a period of 20 years from 1994 to 2014. The daily load profile $x_d^{(h)}$ is a 24-element vector representing the hourly consumption (measured in Amperes) of household $h$ on day $d$. Each interval is labelled by its start time, such that $x_1$ captures interval 00:00:00 - 00:59:59. $X^{(h)}$ is the array of all daily load profile vectors for household $h$, and $X$ (dim 3 295 848 x 24) is the array of all daily load profiles.

$x_d^{(h)} = (x_1, x_2, \ldots, x_{24})$  (1)

$X^{(h)} = \{ x_d^{(h)} \mid d = 1, \ldots, D_h \}$  (2)

$X = \{ X^{(h)} \mid h = 1, \ldots, 14945 \}$  (3)

We can then use clustering to find an optimal clustering structure $C$, given the dataset $X$.
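Assuming the raw metering data arrives in long format, with one row per hourly reading, the array of daily load profiles can be assembled with a pivot. This is an illustrative sketch only; the column names (`household_id`, `timestamp`, `amps`) are hypothetical and are not those of the published DELMH files.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format input: one hourly consumption reading per row.
readings = pd.DataFrame({
    "household_id": [1] * 48,
    "timestamp": pd.date_range("2014-01-01", periods=48, freq="h"),
    "amps": np.random.default_rng(0).random(48),
})
readings["date"] = readings["timestamp"].dt.date
readings["hour"] = readings["timestamp"].dt.hour

# One 24-element daily load profile per (household, day) row.
X = readings.pivot_table(index=["household_id", "date"],
                         columns="hour", values="amps")
```

Each row of `X` is then one daily load profile, ready for normalisation and clustering.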
3 Developing Competency Questions
We used a combination of analysing existing standards and engagement with domain experts to formulate informal competency questions expressed in natural language. The Geo-based Load Forecast Standard (Eskom, 2012) contains manually constructed load profiles and guiding principles for load forecasting in South Africa. The competency questions were developed after analysis of this standard and continuous engagement with a panel of five industry experts. Initial interviews were held with all experts to elicit the usage requirements. Preliminary competency questions were presented at a workshop with key stakeholders in the community, and the final version of the competency questions incorporated the stakeholders' feedback. The competency questions were then used to construct associated qualitative evaluation measures and a cluster scoring matrix that weights these measures to provide a qualitative ranking of cluster sets in terms of the application requirements.
The following five core competency questions were identified:

1. Can the load shape and demand be deduced from clusters?

2. Do clusters distinguish between low, medium and high demand consumers?

3. Can clusters represent specific loading conditions for different day types and seasons?

4. Can a zero-consumption profile be represented in the cluster set? (This was deemed important for considering energy access in low-income contexts, as households may go through periods where they cannot afford to buy electricity and thus have no consumption.)

5. Is the number of households assigned to clusters reasonable, given knowledge of the sample population?
Based on these questions, we defined a good cluster set as having expressive clusters and being usable. An expressive cluster must convey specific information related to particular socioeconomic and temporal energy consumption behaviour. A usable cluster set must represent energy consumption behaviour that makes sense in relation to the clustering context and that carries the necessary information to make it pertinent to domain users. Next we developed qualitative measures to assess the competency questions. They are explained briefly below and in detail in Appendix A.
Expressivity (from competency questions 2 and 3) requires that the RDLP of a cluster is representative of the energy consumption behaviour of the individual daily load profiles that are members of that cluster, as expressed by the mean consumption error of total and peak demand and the mean peak coincidence ratio. An expressive cluster must also have the ability to convey specific meaning, especially in contexts where populations are highly variable. Cluster entropy can be used as a measure to establish the information embedded in a cluster and thus its specificity. The lower the entropy, the more information is embedded in the cluster, the more specific (homogeneous) the cluster, the better the cluster. In a specific cluster all members share the same context, e.g. daily load profiles of low consumption households on Sundays in summer.
The characteristic of cluster usability was derived from competency questions 4 and 5. Question 4 requires a manual evaluation based on expert judgement and is evaluated as being either true or false. Question 5 is calculated as the percentage of clusters whose membership exceeds a threshold value of 10 490 members (a value approximately equal to 5% of households using a particular cluster for 14 days). Additional considerations are that fewer clusters typically ease interpretation and are thus preferable to larger numbers of clusters. The maximum number of clusters should be limited to 220, based on population diversity and existing expert models which account for 11 sociodemographic groups, 2 seasons, 2 day types and 5 climatic zones.
3.1 Cluster Scoring Matrix
The cluster scoring matrix in Table 1 presents a summary of the attributes and competency questions, the corresponding evaluation measures and their weights. The weights are based on the relative importance that experts assigned to the measure. Experiments are ranked by performance in each measure, with a score of 1 indicating the best cluster set. A weighted score is then computed for each experiment by multiplying its rank with the corresponding measure’s weight, and summing over all measures. The lower the total score, the better the cluster set.
Attribute  Qu.  Evaluation measure  Weight
usable  5  sensible count per cluster  2
usable  4  zero-profile representation  1
expressive, representative  1  mean consumption error (total)  6
expressive, representative  1  mean consumption error (peak)  6
expressive, representative  1  mean peak coincidence  3
expressive, specific  3  temporal entropy (weekday)  4
expressive, specific  3  temporal entropy (monthly)  4
expressive, specific  2  demand entropy (total daily)  5
expressive, specific  2  demand entropy (peak daily)  5
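The rank-weight-sum logic of the scoring matrix can be sketched as follows. The measure names (`m1`-`m3`), weights and values are placeholders rather than the paper's data, and every measure is treated as lower-is-better for simplicity.

```python
import pandas as pd

# Placeholder weights and measure values, all lower-is-better.
weights = pd.Series({"m1": 6, "m2": 3, "m3": 5})
results = pd.DataFrame(
    {"m1": [0.12, 0.30, 0.21],
     "m2": [0.40, 0.20, 0.30],
     "m3": [1.10, 0.90, 1.40]},
    index=["A", "B", "C"])

ranks = results.rank(method="min")     # rank 1 = best per measure
score = (ranks * weights).sum(axis=1)  # weighted score per experiment
best = score.idxmin()                  # lowest total score wins
```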
4 Clustering Experiments and Results
Various clustering experiments were performed to find a set of clusters that represents the best RDLPs for $X$. The clustering process was set up as a typical data processing pipeline, using hourly daily load profiles from DELMH as input. Depending on the experiment, different pre-processing steps were performed. These include the selection of pre-binning by average monthly consumption (AMC) or integral k-means, and retaining or dropping zero values. Each of the experiments was run with four different normalisation algorithms, and without normalisation. Algorithms were initialised with different parameter values to generate cluster sets with a range of membership sizes. Details on the algorithms, normalisation and pre-binning are provided in Appendix B.
4.1 Evaluation
Based on the experiment details defined in Table 2 in Appendix B, 2083 individual experiment runs were conducted across all parameters. Each run was first evaluated with traditional quantitative clustering metrics. To ease the quantitative evaluation process and allow for comparison across metrics, the Mean Index Adequacy (MIA), the Davies-Bouldin Index (DBI) and the Silhouette Index were combined into a Combined Index (CI) score. The top 10 ranked experiment runs based on the CI score are shown in Table 4 in Appendix C. The highest ranked experiments were then further evaluated with the cluster scoring matrix.
4.2 Qualitative Clustering Results
Table 5 in Appendix C summarises the scores and ranking produced by the cluster scoring matrix. The scores span a greater range of values than the CI scores and are grounded in interpretable measures, which makes the results more meaningful and eases the selection of the best experiment. While the top two runs lie only 8 points apart, they comfortably outperform the third best run, which has double the score. The potential of the qualitative evaluation measures is evident when contrasting the quantitative and qualitative results of exp. 5 (k-means, zero-one) with those of exp. 8 (k-means, unit norm). Exp. 5 had the second best run based on the CI score but was ranked second last in the cluster scoring matrix. Exp. 8, on the other hand, only ranked ninth by quantitative score, yet convincingly claimed the top position based on qualitative measures.
Comparing the RDLPs in Figure 1 in Appendix C gives confidence in the re-ranking. Exp. 5 (k-means, zero-one) has only 18 clusters; on average 2.125 clusters per bin. The five smallest clusters combined have fewer than 1500 member profiles and appear invisible in the bar chart of cluster sizes at the bottom of Figure 1(a). The ragged shapes of clusters 16, 17 and 18 are also an indication that very few profiles were aggregated in these RDLPs. Over half of all load profiles belong to only three clusters: clusters 5, 6 and 9. As a whole, the individual RDLPs lack distinguishing features and are neither expressive nor useful, making them poorly suited for creating customer archetypes.
Exp. 8 (k-means, unit norm), on the other hand, has 59 clusters, varying between 2 and 15 clusters per bin. With the exception of cluster 33, which accounts for roughly 15% of all daily load profiles, cluster membership for the remaining clusters varies in a range from 15 000 to 100 000 members. Cluster 33 is one of only two clusters in its bin, which has a large bin membership in line with expectations given our sample population. Collectively, the individual RDLPs are expressive, featured and distinct, which promises that they will be useful for constructing customer archetypes.
5 Discussion and Conclusion
This work formalises competency questions, formulated in consultation with domain experts, as quantifiable, qualitative evaluation measures. The qualitative measures are summarised in a cluster scoring matrix which weights, ranks and compares the measures across clustering experiments. By combining traditional clustering metrics and qualitative evaluation measures, clustering structures with good compactness and distinctness are thus ranked by their usability and expressivity, which guides our selection of a clustering structure that is useful for our intended application of creating customer archetypes in the residential energy sector in South Africa.
The cluster scoring matrix eases the scoring and ranking of experiments, while also making the reliance on expert validation explicit and repeatable. It clearly indicates that of the top 10 experiments, unit norm normalisation and pre-binning produced the most expressive and usable clusters. While the best experiment was pre-binned with integral k-means, pre-binning by average monthly consumption produced comparable scores. The difference in scores between the two pre-binning approaches was strongly influenced by the weights assigned to different evaluation measures and the threshold determining the minimum cluster membership. These are subjective constraints determined by our application context; in a different application, they may be set differently. The cluster scoring matrix could be improved by making it less susceptible to changes in the weights, the threshold and the ranking method. A limitation of the work is that we used well-established clustering techniques and have not tested more recent clustering algorithms or dynamic time warping.
Our work presents a novel application of machine learning in the energy domain in South Africa, with potential for application in other developing country contexts. The approach shows promise for generating clusters that are useful in a real-world, long-term energy planning scenario and demonstrates the use of cluster analysis techniques for building real-world systems.
Acknowledgements
This research was funded in part by the South African Centre for Artificial Intelligence Research (CAIR).
References
De Nicola, A., Missikoff, M. and Navigli, R. (2009). A software engineering approach to ontology building. Information Systems 34(2), pp. 258-275.
Dent, I., Craig, T., Aickelin, U. and Rodden, T. (2014). Variability of behaviour in electricity load profile clustering; who does things at the same time each day?. Lecture Notes in Computer Science 8557, pp. 70-84.
Figueiredo, V., Rodrigues, F., Vale, Z. and Gouveia, J.B. (2005). An electric energy consumer characterization framework based on data mining techniques. IEEE Transactions on Power Systems 20(2), pp. 596-602.
[4] Eskom (2012). Geo-based Load Forecast Standard. Technical report, Eskom, Johannesburg.
Gogolou, A., Tsandilas, T., Palpanas, T. and Bezerianos, A. (2019). Comparing similarity perception in time series visualizations. IEEE Transactions on Visualization and Computer Graphics 25(1), pp. 523-533.
Grüninger, M. and Fox, M.S. (1995). The role of competency questions in enterprise engineering. In Benchmarking — Theory and Practice, A. Rolstadås (Ed.), pp. 22-31.
Heunis, S. and Dekenah, M. (2014). Manual for Eskom Distribution Pre-Electrification Tool (DPET). Eskom Holdings Limited, Johannesburg.
Jin, L., Lee, D., Sim, A., Borgeson, S., Wu, K., Spurlock, C.A. and Todd, A. (2017). Comparison of clustering techniques for residential energy behavior using smart meter data. AAAI Workshop on Artificial Intelligence for Smart Grids and Smart Buildings, pp. 260-266.
Kwac, J., Flora, J. and Rajagopal, R. (2014). Household energy consumption segmentation using hourly data. IEEE Transactions on Smart Grid 5(1), pp. 420-430.
McLoughlin, F., Duffy, A. and Conlon, M. (2015). A clustering approach to domestic electricity load profile characterisation using smart metering data. Applied Energy 141, pp. 190-199.
Morley, S.K. (2016). Alternatives to accuracy and bias metrics based on percentage errors for radiation belt modeling applications. Technical report, Los Alamos National Laboratory.
Sarle, W.S. (1990). Algorithms for Clustering Data (A.K. Jain and R.C. Dubes). Technometrics 32.
Swan, L.G. and Ugursal, V.I. (2009). Modeling of end-use energy consumption in the residential sector: a review of modeling techniques. Renewable and Sustainable Energy Reviews 13(8), pp. 1819-1835.
Toussaint, W. (2019). Domestic electrical load metering, hourly data 1994-2014. Version 1. DataFirst.
Uschold, M. and Gruninger, M. (1996). Ontologies: principles, methods and applications. Knowledge Engineering Review 11, pp. 93-136.
Xu, S., Barbour, E. and González, M.C. (2017). Household segmentation by load shape and daily consumption. Proceedings of the ACM SIGKDD 2017 Conference, pp. 1-9.
Appendix A Qualitative Evaluation Measures
We use $c_k$ to denote a single cluster in clustering structure $C$. The score of a qualitative measure for cluster set $C$ is the mean of the scores of all clusters with more than 10 490 members. Clusters with a small membership were excluded when calculating mean measures, as they tend to overestimate the performance of poor clusters. Individual cluster performance is weighted by cluster size to account for the overall effect that a particular cluster has on the set.
A.1 Mean Consumption Error
The total daily demand and peak daily demand for an actual daily load profile $x_d$ and a predicted cluster representative daily load profile $\bar{x}_k$ are given by the equations below:

$A_d = \sum_{i=1}^{24} x_i$ and $A_d^{max} = \max_i x_i$  (4)

$P_k = \sum_{i=1}^{24} \bar{x}_i$ and $P_k^{max} = \max_i \bar{x}_i$  (5)

Four mean error metrics are calculated to characterise the extent of deviation between the total and peak demand of a cluster, and those of its member profiles. Mean absolute percentage error (MAPE) and median absolute percentage error (MdAPE) are well-known error metrics. The median log accuracy ratio (MdLQ) overcomes some of the drawbacks of the absolute percentage errors (Morley, 2016), as the log-transformation tends to induce symmetry in positively skewed distributions, thus reducing bias. Interpreting MdLQ is not intuitive, a problem overcome by the median symmetric accuracy (MdSymA), which can be interpreted as a percentage error, similar to MAPE. Peak and total consumption errors can be calculated using the same formulae and are equivalent to the corresponding demand errors.
The consumption error measures are calculated over the $n$ daily load profiles $x_d$ that are assigned to cluster $c_k$.
Absolute Percentage Error

$MAPE = \frac{100}{n} \sum_{d=1}^{n} \left| \frac{A_d - P_k}{A_d} \right|$  (6)

$MdAPE = \mathrm{median}\left( 100 \left| \frac{A_d - P_k}{A_d} \right| \right)$  (7)

Median Log Accuracy Ratio

$Q_d = \frac{P_k}{A_d}$  (8)

$MdLQ = \mathrm{median}(\ln Q_d)$  (9)

Median Symmetric Accuracy

$MdSymA = 100 \left( \exp\left( \mathrm{median}(|\ln Q_d|) \right) - 1 \right)$  (10)
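A sketch of the four error metrics with NumPy, assuming `actual` holds the demand values of a cluster's member profiles and `predicted` the corresponding RDLP demand values:

```python
import numpy as np

def consumption_errors(actual, predicted):
    """MAPE, MdAPE, MdLQ and MdSymA between member-profile demand values
    (actual) and the corresponding cluster RDLP demand values (predicted)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = 100 * np.abs((actual - predicted) / actual)
    log_q = np.log(predicted / actual)  # log accuracy ratio
    return {
        "MAPE": float(ape.mean()),
        "MdAPE": float(np.median(ape)),
        "MdLQ": float(np.median(log_q)),
        # median symmetric accuracy: interpretable as a percentage error
        "MdSymA": float(100 * (np.exp(np.median(np.abs(log_q))) - 1)),
    }
```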
A.2 Mean Peak Coincidence Ratio
For each daily load profile the peaks are identified as all those values that are greater than half the maximum daily load profile value. The Python package peakutils was used to extract the peak values and peak times for all daily load profiles and all representative daily load profiles. The mean peak coincidence ratio for a single cluster is a value between 0 and 1 that represents the ratio of the mean peak coincidence to the count of peaks in cluster $c_k$:

$MPCR_k = \frac{MPC_k}{|\mathcal{P}_k|}$  (11)

where $\mathcal{P}_d$ and $\mathcal{P}_k$ are the sets of peak times of profile $x_d$ and of the RDLP of cluster $c_k$ respectively. The magnitude of the peak is not taken into account in calculating the mean peak coincidence ratio. The mean peak coincidence (denoted as MPC) was calculated from the intersection of the actual and cluster peak times for all $x_d$ assigned to $c_k$:

$MPC_k = \frac{1}{n} \sum_{d=1}^{n} |\mathcal{P}_d \cap \mathcal{P}_k|$  (12)
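A minimal sketch of the peak coincidence calculation, using a simple threshold rule (values above half the profile maximum) as a stand-in for the peakutils peak detection:

```python
import numpy as np

def peak_times(profile):
    """Indices of values above half the profile maximum (a simplified
    stand-in for peakutils peak detection)."""
    profile = np.asarray(profile, dtype=float)
    return set(np.flatnonzero(profile > 0.5 * profile.max()))

def mean_peak_coincidence_ratio(member_profiles, rdlp):
    """Mean size of the peak-time intersection (MPC), divided by the
    number of peaks in the cluster's RDLP; a value in [0, 1]."""
    rdlp_peaks = peak_times(rdlp)
    coincidence = [len(peak_times(p) & rdlp_peaks) for p in member_profiles]
    mpc = float(np.mean(coincidence))
    return mpc / len(rdlp_peaks)
```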
A.3 Entropy as a Measure of Cluster Specificity
Entropy H is used to quantify the specificity of clusters and is calculated as follows:

$H(c, F) = -\sum_{j=1}^{J} p(v_j) \log_2 p(v_j)$  (13)

Here $v_1, \ldots, v_J$ are the values of a feature $F$ and $p(v_j)$ is the probability that daily load profiles with value $v_j$ for feature $F$ are assigned to cluster $c$. For example, $H(c, \text{weekday})$ expresses the specificity of a cluster with regard to the day of the week, with $v_j \in \{\text{Monday}, \ldots, \text{Sunday}\}$, where $p(\text{Sunday})$ is the likelihood that daily load profiles that are used on a Sunday are assigned to cluster $c$. To calculate peak and total daily demand entropy, we created percentile demand bins. Thus the values of feature $F$ are the demand percentiles, and $p(v_{60})$ is the likelihood that daily load profiles with peak demand corresponding to that of the 60th peak demand percentile are assigned to cluster $c$.
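The entropy measure can be sketched as below, implementing the definition literally; the choice of log base 2 is an assumption.

```python
import numpy as np
import pandas as pd

def cluster_entropy(feature_values, cluster_labels, cluster):
    """Entropy of a cluster with respect to a feature, per eq. (13):
    p(v_j) is the share of profiles with feature value v_j that fall in
    `cluster`; lower entropy means a more specific cluster."""
    df = pd.DataFrame({"value": feature_values, "cluster": cluster_labels})
    p = df.groupby("value")["cluster"].apply(lambda c: (c == cluster).mean())
    p = p[p > 0]  # drop zero terms (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```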
Appendix B Clustering Experiments
We implemented our experiments in Python 3.6.5, using k-means algorithms from the scikit-learn (0.19.1) library and self-organising maps from the SOMOCLU (1.7.5) library. The codebase is available online at https://github.com/wiebket/del_clustering.
Table 2 summarises the algorithms, parameters and pre-processing steps for each experiment, with Zeros = True indicating that zero consumption values were retained in the input dataset.
Exp.  Algorithm  Parameters  Pre-bin  Zeros
1  k-means  k  –  True
2  k-means  k  –  True
2  SOM  s  –  True
2  SOM+k-means  s, k  –  True
3  k-means  k  –  False
3  SOM  s  –  False
3  SOM+k-means  s, k  –  False
4  k-means  k  AMC  True
4  SOM  s  AMC  True
4  SOM+k-means  s, k  AMC  True
5  k-means  k  AMC  True
5  SOM+k-means  s, k  AMC  True
6  k-means  k  AMC  False
7  k-means  k  integral k-means  True
8  k-means  k  integral k-means  False
B.1 Clustering Algorithms
An experiment run takes the input array $X$ to produce a cluster set $C$ and predict a cluster for each normalised daily load profile of household $h$ observed on day $d$. Variations of k-means, self-organising maps (SOM) and a combination of the two algorithms were implemented to cluster $X$. The k-means algorithm was initialised with a range of cluster numbers $k$. The SOM algorithm was initialised as a square map with dimensions $s \times s$ for a range of values $s$. Combining SOM and k-means first creates a $s \times s$ map, which acts as a form of dimensionality reduction on $X$. For each $s$, k-means then clusters the map into $k$ clusters. The mapping only makes sense if $s^2$ is greater than $k$. $k$ and $s$ are the algorithm parameters.
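A minimal sketch of the k-means part of an experiment run with scikit-learn, fitting cluster sets over a range of cluster numbers. The profiles and the parameter range are synthetic stand-ins, not the paper's data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Synthetic stand-in for the normalised daily load profiles in X.
rng = np.random.default_rng(42)
X = normalize(rng.random((500, 24)))  # unit-norm rows

# Fit k-means for a range of cluster numbers k (illustrative range only).
models = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
          for k in range(4, 8)}
labels = {k: m.labels_ for k, m in models.items()}
```

Each fitted model yields one candidate cluster set; the cluster sets are then compared with the quantitative metrics and the cluster scoring matrix.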
B.2 Normalisation
The table below lists the normalisation techniques applied to a daily load profile $x$.
Normalisation  Equation  Comments
Unit norm  $x / \lVert x \rVert$  Scales input vectors individually to unit norm
Deminning  $(x - \min x) / \sum_i (x_i - \min x)$  Subtracts the daily minimum demand from each hourly value, then divides each value by the deminned daily total (proposed by Jin et al. (2017))
Zero-one  $(x - \min x) / (\max x - \min x)$  Scales all values to the range [0, 1]; retains profile shape but is very sensitive to outliers (also known as min-max scaling)
SA norm  $x / \bar{x}$  Normalises all input vectors to a mean of 1; retains profile shape but is very sensitive to outliers (introduced as a comparative measure, as it is frequently used by South African domain experts)
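The four normalisation techniques can be sketched as simple NumPy functions over a single daily load profile; these follow the descriptions above and are illustrative rather than the repository's implementation.

```python
import numpy as np

def unit_norm(x):
    """Scale the profile to unit (Euclidean) norm."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def deminning(x):
    """Subtract the daily minimum from each hourly value, then divide
    by the deminned daily total (after Jin et al., 2017)."""
    x = np.asarray(x, dtype=float)
    d = x - x.min()
    return d / d.sum()

def zero_one(x):
    """Min-max scale all values to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def sa_norm(x):
    """Normalise the profile to a mean of 1."""
    x = np.asarray(x, dtype=float)
    return x / x.mean()
```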
B.3 Pre-binning
B.3.1 Pre-binning by average monthly consumption (AMC)
To pre-bin by average monthly consumption, we selected 8 expert-approved bin ranges based on South African electricity tariff ranges. The average monthly consumption (AMC) for household $h$ over one year is:

$AMC^{(h)} = \frac{1}{12} \sum_{m=1}^{12} \sum_{d \in m} A_d^{(h)}$  (14)

where $A_d^{(h)}$ is the total daily demand of household $h$ on day $d$. All the daily load profiles $x_d^{(h)}$ of household $h$ were assigned to one of the 8 consumption bins based on the value of $AMC^{(h)}$. Individual household identifiers were removed from $X$ after pre-binning.
B.3.2 Pre-binning by integral k-means
Pre-binning by integral k-means is a data-driven approach that draws on the work of Xu et al. (2017). For the simple case where $x$ represents the hourly values of a daily load profile, pre-binning by integral k-means followed these steps:

1. Construct a new sequence $s$ from the cumulative sum of profile $x$ normalised with unit norm.

2. Append the profile maximum to $s$ – this ensures that both peak demand and relative demand increase are taken into consideration.

3. Gather all features in array $S$ and remove individual household identifiers.

4. Use the k-means algorithm to cluster $S$ into 8 bins, corresponding to the number of bins created for AMC pre-binning.
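The steps above can be sketched as follows; the feature construction covers steps 1-3 and k-means provides the binning of step 4. Synthetic profiles stand in for the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def integral_features(profiles):
    """Steps 1-3: cumulative sum of each unit-normalised profile,
    with the profile maximum appended as an extra feature."""
    X = normalize(np.asarray(profiles, dtype=float))     # unit norm (step 1)
    S = np.cumsum(X, axis=1)                             # integral sequence
    return np.hstack([S, X.max(axis=1, keepdims=True)])  # steps 2 and 3

# Synthetic stand-in profiles; step 4: cluster into 8 bins as for AMC.
profiles = np.random.default_rng(1).random((200, 24))
S = integral_features(profiles)
bins = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(S)
```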
Appendix C Cluster Evaluation
C.1 CI Score and Quantitative Results
To ease the quantitative evaluation process and allow for comparison across metrics, the Mean Index Adequacy (MIA), the Davies-Bouldin Index (DBI) and the Silhouette Index were combined into a Combined Index (CI) score. The interim score $I$ computes the product of the DBI, the MIA and the inverse of the Silhouette Index. The CI is the log of the weighted sum of $I$ across all experiment bins. A lower CI is desirable and an indication of a better clustering structure. The logarithmic relationship between $I$ and the CI means that the CI is negative when the weighted sum of $I$ is between 0 and 1, 0 when it equals 1, and greater than 0 otherwise. For experiments with pre-binning, the experiment run with the lowest score in each bin is selected, as it represents the best clustering structure for that bin. For experiments without pre-binning there is only a single bin, so the CI is simply the log of $I$. Table 4 shows the top ten experiments based on the CI score.
#  CI  DBI  MIA  Sil.  Exp.  Alg.  m  Norm.
1  2.282  2.125  0.438  0.095  2  k-means  47  unit
2  2.289  1.616  1.220  0.262  5  k-means  17  zero-one
3  2.296  1.616  1.220  0.260  4  k-means  17  zero-one
4  2.301  2.152  0.485  0.119  6  k-means  82  unit
5  2.316  2.115  0.447  0.093  2  k-means  35  unit
6  2.320  2.199  0.486  0.121  5  k-means  71  unit
7  2.349  2.152  0.481  0.143  7  k-means  49  unit
8  2.351  2.189  0.434  0.090  2  k-means  50  unit
9  2.354  2.111  0.476  0.128  8  k-means  59  unit
10  2.355  2.173  0.453  0.093  2  k-means  32  unit
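The Combined Index computation described in Section C.1 can be sketched as below, under the assumption that bins are weighted equally; the paper's exact bin weighting is not reproduced here.

```python
import numpy as np

def ci_score(dbi, mia, silhouette, weights=None):
    """Combined Index: I = DBI * MIA / Silhouette per bin; CI is the log
    of the weighted sum of I across bins (lower is better). Equal bin
    weights are an assumption here."""
    I = np.asarray(dbi) * np.asarray(mia) / np.asarray(silhouette)
    w = np.full(len(I), 1 / len(I)) if weights is None else np.asarray(weights)
    return float(np.log((w * I).sum()))
```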
C.2 Experiments Ranked by Qualitative Score
#  Score  Exp.  Norm.  Pre-binning  Zeros
1  57.0  8  unit  integral k-means  False
2  65.0  5  unit  AMC  True
3  117.5  6  unit  AMC  False
4  143.5  7  unit  integral k-means  True
5  150.0  2  unit  –  True
6  205.0  5  zero-one  AMC  True
7  208.0  4  zero-one  AMC  True