Using Metamorphic Relations to Verify and Enhance Artcode Classification

08/05/2021 ∙ by Liming Xu, et al. ∙ University of Cambridge ∙ University of Wollongong

Software testing is often hindered where it is impossible or impractical to determine the correctness of the behaviour or output of the software under test (SUT), a situation known as the oracle problem. An example of an area facing the oracle problem is automatic image classification, using machine learning to classify an input image as one of a set of predefined classes. An approach to software testing that alleviates the oracle problem is metamorphic testing (MT). While traditional software testing examines the correctness of individual test cases, MT instead examines the relations amongst multiple executions of test cases and their outputs. These relations are called metamorphic relations (MRs): if an MR is found to be violated, then a fault must exist in the SUT. This paper examines the problem of classifying images containing visually hidden markers called Artcodes, and applies MT to verify and enhance the trained classifiers. This paper further examines two MRs, Separation and Occlusion, and reports on their capability in verifying the image classification using one-way analysis of variance (ANOVA) in conjunction with three other statistical analysis methods: t-test (for unequal variances), Kruskal-Wallis test, and Dunnett's test. In addition to our previously-studied classifier, that used Random Forests, we introduce a new classifier that uses a support vector machine, and present its MR-augmented version. Experimental evaluations across a number of performance metrics show that the augmented classifiers can achieve better performance than non-augmented classifiers. This paper also analyses how the enhanced performance is obtained.




1 Introduction

Over the past two decades, machine learning techniques have been widely adopted by research communities (e.g., computer vision, bioinformatics, computational linguistics, and medical imaging) to solve a range of practical problems. For researchers in the machine learning and software testing communities, the ability to build accurate learning models and verify their quality is essential. Due to the nature of machine learning programs, test oracles (mechanisms to categorically determine if the software behaviour or output is correct) are generally very difficult to define. Hence, conventional software testing techniques may not be effective for detecting defects. The issue of how to ensure the quality of applications based on machine learning has become increasingly important (Xie et al., 2011).

Metamorphic testing (MT) is a testing technique that can alleviate the oracle problem (Chen et al., 1998, 2003), a major challenge in software testing. While conventional testing methods focus on verifying individual outputs, MT examines relations among the inputs and outputs of multiple executions of the software under test (SUT). These relations are called metamorphic relations (MRs). Since the first MT paper (Chen et al., 1998) was published in 1998, MT has been widely used to test software in various fields, including: scientific computing (Ding et al., 2016), numerical analysis (Chen et al., 2002), classification (Xie et al., 2011, 2009), cybersecurity (Chen et al., 2016), image processing (Mayer and Guderlei, 2006), compilers (Le et al., 2014; Donaldson et al., 2017), search engines (Zhou et al., 2016), web security (Mai et al., 2020), and visualisation (McNutt et al., 2020), among others. A body of literature also describes its integration with other testing techniques to improve their applicability and effectiveness. Comprehensive surveys about MT have also been recently published by Segura et al. (2016) and Chen et al. (2018).

More recently, MT has attracted growing interest in classic AI fields for testing systems powered by machine learning, including: machine translation (Zhou and Sun, 2018; He et al., 2019), autonomous driving (Zhang et al., 2018; Zhou and Sun, 2019), and generic NLP (natural language processing) models (Ma et al., 2020; Ribeiro et al., 2020). MT can have comparable bug-revealing effectiveness to model-based testing, and hence is a useful alternative to test an implementation, especially in situations where a model is expensive to construct (Hughes, 2020).

MT techniques have been used to test machine learning programs (Xie et al., 2011, 2009; Murphy et al., 2008). Machine learning techniques have also been used to automatically identify MRs, although so far only with simple MRs (Kanewala and Bieman, 2013). Xu et al. (2018) expanded the traditional role of MRs from software testing to a kind of post adjustor for a machine learning program, building a more accurate learning model using an example of the Artcode classification problem. Artcodes are visual codes whereby bespoke designs can be scanned to trigger the digital information attached to them. Artcodes may be disguised as normal images in the scene through their freeforms and complex aesthetic patterns — and they may appear as any instances of semantic objects. Therefore, it is not straightforward for people to build scan affordance without the support of an alert system that can recognise the presence of Artcodes in the context. The core part of such an alert system is Artcode classification, which determines whether or not the Artcode-based augmented reality applications can work effectively. We will present Artcode basics and Artcode classification in more detail in Section 2.2. More information about Artcode applications in augmented reality can be found in the literature, such as Meese et al. (2013), Xu et al. (2017), Benford et al. (2018), and Koleva et al. (2020).

Two MRs, Separation and Occlusion, identified based on the category of the inputs, were introduced by Xu et al. (2018), who reported on their ability to improve the performance of the original classifier. Initial experimental evaluations showed that MRs could enhance the performance in this case of supervised Artcode classification.

In this paper, we further explore the Separation and Occlusion MRs, present more detailed experimental analyses, and generalise the ability of MRs in both verification and enhancement. Experiments were conducted to show not only the applicability of MT in verifying the correctness of the classifier, but also the improved performance obtained by the MR-augmented framework regardless of the chosen classification methods. The new contributions of this paper are mainly threefold:

  1. We report on the capability of the two MRs to verify the correctness of the previously introduced classification model (Xu et al., 2017) using a set of complementary statistical test methods.

  2. We analyse and discuss how the improved performance of the MR-augmented classifiers is achieved, explaining how the post adjustor rectifies incorrect predictions.

  3. We introduce the use of a Support Vector Machine (SVM) as the classification algorithm in the original classifier and investigate its impact on the performance of the MR-augmented framework, comparing its performance with the MR-augmented classifier based on Random Forests (RF).

The rest of this paper is organised as follows. Section 2 gives a brief description of metamorphic testing and Artcode classification. Section 3 presents the MR-augmented classification framework. The experimental studies examining the MRs’ verification and enhancement capability are given in Section 4. Section 5 analyses how the improved performance is obtained by MR augmentation. Finally, Section 6 concludes the paper, highlighting some areas for future work.

2 Preliminaries

2.1 Metamorphic testing

In software testing, a mechanism that can determine whether a test has passed or failed is called an oracle. A situation where the oracle is not available, or is too expensive to be used, is known as the oracle problem (Barr et al., 2015). Metamorphic testing alleviates the oracle problem (Chen et al., 1998). It has been widely adopted in both academia and industry (Chen et al., 2003; Liu et al., 2014; Lindvall et al., 2015; Segura et al., 2016; Zhou et al., 2016; Donaldson et al., 2017; Zhang et al., 2018; He et al., 2019; Mai et al., 2020). MT has successfully detected defects in mature software, including in extensively tested systems (Chen et al., 2015). A central part of MT is a set of MRs, which are relations among several related inputs and their corresponding outputs. While conventional testing approaches uncover software problems by examining the outcome of an individual input, MT detects the presence of a fault by cross-checking multiple related inputs and outputs with respect to MRs.

We next use a database management system (DBMS) example to illustrate the idea of MT. Given two DBMS queries, such as the following:

  • Q1: select * from student where condition_A and condition_B;

  • Q2: select * from student where condition_B and condition_A;

the DBMS should return the same results — the outcome for a query with search conditions “A” and “B” and the query that swaps their order should be the same (which could represent an MR). Specifically, if the DBMS returns different results for the queries Q1 and Q2, then a fault must exist in the DBMS implementation.

As with all software testing, MT can only be used to check for the presence of bugs, not their absence (Dijkstra, 1972). For example, a faulty DBMS implementation may somehow return the same results for queries Q1 and Q2: thus, although violation of an MR means there must be some fault in the implementation, satisfaction of MRs cannot be taken to mean that the software is fault-free. A key step in MT is the identification of appropriate MRs, which normally requires a good understanding of the problem domain.
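The DBMS relation described above can be sketched as an executable check. This is only an illustration: the in-memory SQLite database, the `student` columns, the sample rows, and the two conditions are all hypothetical stand-ins for "condition_A" and "condition_B", and the helper names are not from the original work.

```python
import sqlite3


def run_query(conn, where_clause):
    """Return the sorted result rows of a query over the student table."""
    rows = conn.execute(f"SELECT * FROM student WHERE {where_clause}").fetchall()
    return sorted(rows)


def commutativity_mr_holds(conn, cond_a, cond_b):
    """MR: swapping the order of two AND-ed conditions must not change the result."""
    q1 = run_query(conn, f"({cond_a}) AND ({cond_b})")
    q2 = run_query(conn, f"({cond_b}) AND ({cond_a})")
    return q1 == q2


# Hypothetical table and data, used only to exercise the relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("Ann", 20, 3.6), ("Bob", 22, 2.9), ("Cy", 19, 3.8)],
)

# A False result would reveal a fault; a True result does not prove correctness.
print(commutativity_mr_holds(conn, "age >= 20", "gpa > 3.0"))
```

Note that no expected query result is needed to run the check, which is what makes the relation usable when an oracle is unavailable.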

2.2 Artcode basics and classification

Artcodes are human-designable topological visual markers that are both machine readable and meaningful to humans (Meese et al., 2013). As illustrated in Figure 1(a), a valid Artcode includes two parts: a recognisable foreground; and some background imagery. The recognisable foreground (the penguins annotated by the red circle) is a closed boundary that is split into several regions (usually five regions, annotated r1 to r5 in Figure 1(a)), with each region containing one or more blobs (solid objects disconnected from the region edge). The numbers of these blobs in each region are sorted and joined with a separator to form a string of numbers, which can then be used to represent the Artcode. For example, the code for Figure 1(a) is “1-1-2-3-5”, indicating that there are 1, 1, 2, 3, and 5 blobs found in the respective regions. Additionally, background imagery (B1 and B2 in Figure 1(a)) can be added, surrounding the recognisable foreground of an Artcode to enhance the aesthetics, but only if the background does not break the Artcode’s topological structure (Costanza and Huang, 2009; Meese et al., 2013). For example, the black solid blobs around the penguins were intentionally added to enhance the beauty, but are unconnected to the actual code.
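The way a code string is formed from per-region blob counts can be illustrated with a short sketch. The function name and the input format (a plain list of blob counts, one per region) are assumptions made for illustration; extracting the counts from an image is not shown.

```python
def artcode_string(blob_counts, separator="-"):
    """Sort the per-region blob counts and join them into a code string."""
    if any(count < 1 for count in blob_counts):
        raise ValueError("each region must contain one or more blobs")
    return separator.join(str(count) for count in sorted(blob_counts))


# Regions with 3, 1, 5, 1, and 2 blobs yield the code of the penguin example.
print(artcode_string([3, 1, 5, 1, 2]))  # prints 1-1-2-3-5
```

Because the counts are sorted before joining, the code does not depend on the order in which regions are visited.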

(a) Artcode components illustration.
(b) Region adjacency tree.
Figure 1: Illustration of the components of an Artcode (code: “1-1-2-3-5”) and the region adjacency tree of its recognisable foreground.

The actual code of an Artcode is represented by a region adjacency tree (RAT) (Costanza and Robinson, 2003); the RATs of the recognisable part of the penguin Artcode and the two background elements are shown in Figure 1(b). According to the Artcode system, the components are the sets of pixels that are connected to each other, and are known as connected components. These connected components are referred to as: root boundary, region, blob, and background imagery, depending on their use in the Artcode’s context. The root boundary (R) contains several holes (regions) with each having a number of connected components without holes. The number of components is determined by the containment relationship rather than geometrical shapes, as shown in Figure 1. The components can be any shape, and this freeform property, with little restriction on shapes, gives designers an opportunity to create aesthetic, interactive graphics. This property allows Artcode objects to look like an instance of any semantic object class: animals, flowers, and fish can be recognised as Artcodes if they are designed according to Artcode drawing rules (see Figure 6).

Redundancy is allowed in Artcode design — multiple Artcodes with the same topology but different geometry can appear in an Artcode. Artcodes have been explored in a wide range of contexts (Meese et al., 2013; Benford et al., 2015, 2015; Ng and Shaikh, 2016; Thorn et al., 2016; Benford et al., 2016; Preston et al., 2017; Benford et al., 2018; Koleva et al., 2020) since Costanza and Huang (2009) first proposed D-touch markers, whose drawing rules the Artcode system implements and extends.

(a) Artcodes-decorated dining context
(b) Artcode examples detected
Figure 2: Illustration of Artcode detection.
Artcode classification

Figure 2(a) shows Artcodes being used to augment a dining context, in which the surfaces of objects (menu, plate, and mat) are decorated with Artcodes. In order to alert people to the presence of Artcodes before triggering their further decoding, the first step is to determine whether or not an input image or an image patch contains Artcodes (see Figure 2(b)). This task of Artcode classification (Artcode classification and Artcode detection are used interchangeably in this paper) involves classifying an input image as either containing an Artcode or not, labelled Artcode or non-Artcode. There is, visually, no obvious difference in appearance or geometrical shape between the two classes (see the examples in Figures 5 and 6). The geometrical freeform property differentiates Artcodes from other well-known markers, such as barcodes (Woodland and Bernard, 1952), QR codes (International Organization for Standardization, 2015), ARTags (Fiala, 2005), or RUNE-tags (Bergamasco et al., 2011). Artcodes, as a type of augmented reality technique, have been adopted in many situations (as described in Section 2.2) to augment the meanings of the objects in aesthetic-centred contexts. The triggering of the digital information depends on whether or not the presence of Artcodes in the scene is recognised; therefore, Artcode classification is vital for the correct use of Artcode applications and can provide guidelines for other visual codes-based augmented reality techniques. More information about Artcode basics and classification can be found in work such as Costanza and Huang (2009) and Xu et al. (2017).

3 MR-augmented classification framework

Figure 3: MR-augmented classification framework. The framework includes three stages: Follow-up generation, Prediction, and Rectification.

Conventional classification typically involves two steps: first, create feature vectors that distinctively represent each class; and, second, train classification algorithms to predict the class of individual inputs. Xu et al. (2018) proposed two MRs, Separation and Occlusion, through examination of the differences in aggregated probability of image blocks being classified as Artcode between the Artcode and non-Artcode class. The two MRs were then used to enhance the classifier’s performance based on conventional classification methods by adding a step before and after classification by this base classifier (referred to as the original classifier), resulting in an MR-enhanced classifier.

We have refined the MR-augmented classifier previously proposed in Xu et al. (2018) to present a new use of MRs in the verification of its correctness. As shown in Figure 3, the MR-augmented classifier framework includes three stages: Follow-up generation, Prediction, and Rectification. Follow-up generation involves building inputs for the prediction stage using MR-defined image transformations. The second stage makes predictions about these inputs using commonly-used classification models. The third stage may adjust or rectify the results generated in the prediction stage. These three stages are described in detail in Sections 3.1 to 3.3.

(a) Separation masks
(b) Occlusion masks
Figure 4: Separation and occlusion masks.

3.1 Follow-up generation

The core activity of the follow-up generation stage is to identify MRs and to construct the inputs based on the defined MRs. The identification of MRs in image classification is often done by examining the different image transformations, such as translation, rotation, and scaling. Based on the observation that the image blocks of Artcode images are more likely to be classified as Artcode than the blocks of non-Artcodes, two MRs, Separation and Occlusion, were proposed, using straightforward image operations: uniform and non-uniform separation. This stage accepts an entire image as input, and outputs image blocks generated from the operations defined by the two MRs.

3.1.1 Separation MR

Separation involves splitting the input image uniformly into a number of sections, or blocks. For example, Figure 4(a) shows separation masks to generate four uniform blocks by intersecting them with input images. This MR is based on the observation that the blocks of Artcode images would be predicted to be Artcode with a higher likelihood than the blocks of non-Artcode images. If the number of blocks is appropriately selected, this difference in the aggregated likelihood (probability) of all blocks may provide clues for classification. The Separation MR can be formulated as:


$$\sum_{i=1}^{n} P(A_i) > \sum_{i=1}^{n} P(N_i) \qquad (1)$$

where $n$ is the number of image blocks; $P(\cdot)$ is the probability of a block being classified as an Artcode by the original classifier; and $A_i$ and $N_i$ denote the $i$th block of the Artcode and non-Artcode images after separation, respectively.
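A minimal sketch of uniform separation and of the comparison behind the Separation MR follows. It assumes an image is a list of equal-length pixel rows and that the blocks are obtained by cropping a regular grid rather than by intersecting with mask images; the function names and the probability lists are hypothetical, and the block probabilities would in practice come from the original classifier.

```python
def separate(image, rows=2, cols=2):
    """Split a 2-D image into rows * cols non-overlapping, equally sized blocks."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    block_h, block_w = height // rows, width // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            block = [
                row[c * block_w:(c + 1) * block_w]
                for row in image[r * block_h:(r + 1) * block_h]
            ]
            blocks.append(block)
    return blocks


def separation_mr_holds(artcode_block_probs, non_artcode_block_probs):
    """Compare the aggregated block probabilities of an Artcode and a non-Artcode image."""
    return sum(artcode_block_probs) > sum(non_artcode_block_probs)


# A 4x4 toy image split into four 2x2 blocks.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = separate(image)
print(len(blocks), blocks[0])
```

The grid size (2 by 2 here) corresponds to the four uniform blocks mentioned in the text; other grid sizes are possible if the image dimensions allow.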

3.1.2 Occlusion MR

Occlusion is similar to separation, except that the image blocks are not split uniformly: overlapping among image blocks is permitted. As shown in Figure 4(b), four occlusion masks are provided to intersect with the input image, so that the image blocks outlined by the white regions will be generated. Occluded Artcode images generally preserve the properties of the input Artcode images: half of an Artcode image usually has a higher likelihood of being classified as Artcode by the original classifier than a quarter of the image. This property may not be preserved for non-Artcode images: occluded non-Artcode images may have the same likelihood as the entire non-Artcode images of being predicted as non-Artcode. Based on this observation, MR Occlusion can be formulated as:


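$$A^{o}_j = O(I_A, M_j), \qquad N^{o}_j = O(I_N, M_j) \qquad (2)$$

$$\sum_{j=1}^{m} P(A^{o}_j) > \sum_{j=1}^{m} P(N^{o}_j) \qquad (3)$$

where $m$ is the number of masks; $O(I_A, M_j)$ and $O(I_N, M_j)$ output the overlapping areas of the Artcode and non-Artcode images, $I_A$ and $I_N$, and the $j$th mask $M_j$; and $A^{o}_j$ and $N^{o}_j$ denote the $j$th block of the Artcode and non-Artcode images generated after occlusion, respectively.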

The Separation and Occlusion MRs are both processed by comparing the aggregated likelihood of predicting the generated image blocks with the probability of predicting the entire input image. They are based on the observation that the topological structure of an Artcode image, as a global property, may be preserved, even after splitting. Splitting without overlapping (separation) and with overlapping (occlusion) enables the generated image blocks to cover the possible distribution of Artcodes in an image, especially considering their freeform geometric shapes. In addition, masks of varying sizes can adapt to the Artcodes’ scales. The two MRs therefore complement each other, and are combined to obtain a better augmentation performance.
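Occlusion by overlapping masks can be sketched in the same style. The four half-image masks used here (top, bottom, left, right) are a hypothetical choice made only to show overlapping blocks; the actual mask shapes are those of Figure 4(b), which this sketch does not attempt to reproduce. Pixels outside a mask are set to 0, which is also an assumption.

```python
def half_masks(height, width):
    """Four hypothetical overlapping binary masks: top, bottom, left, right halves."""
    top = [[1 if r < height // 2 else 0 for _ in range(width)] for r in range(height)]
    bottom = [[1 - value for value in row] for row in top]
    left = [[1 if c < width // 2 else 0 for c in range(width)] for _ in range(height)]
    right = [[1 - value for value in row] for row in left]
    return [top, bottom, left, right]


def occlude(image, masks):
    """Intersect the image with each mask, zeroing the pixels outside the mask."""
    return [
        [
            [pixel if keep else 0 for pixel, keep in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)
        ]
        for mask in masks
    ]


# The top and left masks both cover the top-left pixel, so the blocks overlap.
occluded = occlude([[1, 2], [3, 4]], half_masks(2, 2))
print(occluded)
```

Each occluded block would then be passed to the original classifier in the same way as the separation blocks.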

3.2 Prediction

In order to predict the class of an input image or block, a classification model that includes feature vector and classification algorithms (using random forests or support vector machines) needs to be built. The Artcode classification model is built using the Shape of Orientation Histograms (SOH) feature vector (Xu et al., 2017), which was specially designed for describing topological visual markers such as Artcodes. An SOH is constructed based on the translational symmetry and smoothness of the orientation histogram, which is a feature vector developed by McConnell (1986) for pattern analysis in both static and dynamic modes, and was adopted by Freeman and Adelson (1991) for recognising hand gestures.

Instead of describing the local geometry or structure, an SOH describes Artcodes by representing their topological structure. As previously reported (Xu et al., 2017), the orientation histogram of an Artcode displays horizontally translational symmetry, and is relatively smoother than that of a non-Artcode. The SOH is then constructed by quantifying these two aspects of the orientation histogram of the input images using similarity measurements, such as Procrustes distance (Moser, 1965) and $\chi^2$ distance (Greenacre, 2007). When all images are represented by their respective SOH vectors, classification algorithms using random forests or SVM are trained and used to predict the classes of the input images. The output of the prediction stage is a vector of labels of the input image blocks fed by the follow-up generation stage. This vector is referred to as the prediction vector $\mathbf{v}$.

3.3 Rectification

Unlike most deterministic software, classification is based on statistics, or is learned from past experience. Given an input, the output of a classifier is a probabilistic classification of belonging to a predefined class. In other words, before execution of the classifier, only the likelihood of the input being classified as a class or not is known beforehand. Therefore, in order to enable incorporation of the MRs described above, an augmented classifier integrating the MRs was designed based on probability, adding an adjustor (or rectifier) to the conventional classification pipeline (Xu et al., 2018).

As defined in Equations 1 to 3, the likelihood of image patches belonging to the two classes, generated in the follow-up generation stage, may be different. Therefore, a weight vector that contains different weight (i.e., likelihood) values is assigned to them. This vector, which has the same dimensionality as the prediction vector $\mathbf{v}$, is referred to as the weight vector $\mathbf{w}$.

Given a prediction vector $\mathbf{v} = (v_1, \ldots, v_{n+m})$ and a weight vector $\mathbf{w} = (w_1, \ldots, w_{n+m})$, where $v_i$ is the class of the $i$th image patch predicted by the original classifier, $w_i$ is the weight assigned to the $i$th image patch (which is, in fact, the weight of the separation or occlusion mask), and $n$ and $m$ are the numbers of image patches generated by the two MRs, the inner product of $\mathbf{v}$ and $\mathbf{w}$ is the aggregated likelihood of belonging to the Artcode class ($\tau$-value), which is defined as:

$$\tau = \mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{n+m} v_i w_i \qquad (4)$$

The aggregated likelihood is also known as the total probability (Xu et al., 2018; Xu, 2019). The augmented classifier predicts the label of the input by comparing the $\tau$-value with the given thresholds $\tau_l$ and $\tau_u$, using the following decision rules: if $\tau < \tau_l$, then it is a non-Artcode; if $\tau > \tau_u$, then it is an Artcode; otherwise, the input retains the original classifier’s prediction result.
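The rectification step can be sketched as follows. Several details are assumptions made for illustration: block predictions are encoded as 1 (Artcode) and 0 (non-Artcode), the threshold comparisons are strict, and the threshold and weight values in the example are hypothetical rather than the ones used in the experiments.

```python
def aggregated_likelihood(predictions, weights):
    """Inner product of the block predictions and their weights."""
    if len(predictions) != len(weights):
        raise ValueError("prediction and weight vectors must have the same length")
    return sum(p * w for p, w in zip(predictions, weights))


def rectify(original_label, predictions, weights, lower, upper):
    """Override the original label only when the aggregated likelihood is decisive."""
    value = aggregated_likelihood(predictions, weights)
    if value < lower:
        return "non-Artcode"
    if value > upper:
        return "Artcode"
    return original_label  # between the thresholds: keep the original prediction


# Eight blocks (four from separation, four from occlusion) with equal weights.
weights = [0.125] * 8
print(rectify("non-Artcode", [1, 1, 1, 0, 1, 1, 0, 1], weights, lower=0.2, upper=0.6))
```

In this example six of eight blocks are predicted as Artcode, so the aggregated likelihood exceeds the upper threshold and the original non-Artcode label is rectified.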

4 Experimental studies

This section presents the experimental study, including the evaluation dataset and the set-up of the experiment. The experimental results of verifying and enhancing the original classifier, and the performance comparison between the RF-based and SVM-based classifiers, are also described in this section.

Figure 5: Non-Artcode examples selected from the Artcode dataset.
Figure 6: Artcode examples selected from the Artcode dataset. Artcodes are visually “hidden” or even “invisible” markers. Similar to barcodes and QR codes, they can be scanned to trigger the digital information attached within. The code embedded in an Artcode is a string of numbers of blobs in each “hollow” region. For example, the code of the 6th image is “1-1-1-1-2”.

4.1 Dataset

In order to study the Artcode classification problem, a dataset containing 47 Artcode and 116 non-Artcode images was used for experimental study. To the best of our knowledge, this is the first dataset available for studying Artcode classification. The non-Artcode images (including logos, drawings, advertisements, paintings, and graphics) were all created by humans, and were intentionally selected such that they would appear very similar to actual Artcode images (Xu et al., 2018). This means that the dataset is very challenging for Artcode classification. As shown in Figures 5 and 6, Artcode examples look very similar to the non-Artcode images, which can make it very difficult to distinguish between the two classes through visual inspection alone. Because Artcodes are manually created by designers, the number of available Artcodes is currently small and the dataset is slightly imbalanced, but work is ongoing to extend the dataset. However, it is not possible to create hundreds of Artcode samples within a short time frame, much less increase the number to thousands or millions, as in other common image classification tasks. Rather than devoting the very large effort necessary to expand the size of the dataset, we accepted this situation (of a small, imbalanced dataset), and adopted measures to address it and mitigate its impact: 1) we used classification methods that are effective on small datasets; 2) we adopted a group of carefully-considered performance evaluation metrics that are capable of evaluating classifiers used on imbalanced datasets; 3) we employed cross-validation techniques for experimental evaluation; and 4) we applied appropriate statistical methods to verify whether or not the improved performance was indeed attributable to the MR augmentation.

4.2 Cross-validation

Cross-validation is a commonly-used model validation technique for assessing how a learning model will generalise to a dataset (Kohavi, 1995; Devijver and Kittler, 1982; Seni and Elder, 2010). A major reason for using cross-validation, rather than using the conventional validation method that partitions the dataset into two sets (70% for training and 30% for testing), is that sufficient data may not be available for training and testing the model without compromising its generalisation and prediction capability.

Considering the limited number of samples in the Artcode dataset, a 5-fold cross-validation was used to ensure sufficient training and testing set sizes for performance evaluation. A $k$-fold cross-validation involves randomly partitioning a dataset into $k$ equally-sized subsets, keeping one subset as validation data for testing the trained model, and using the remaining $k-1$ subsets as training data. The process is then repeated $k$ times (the folds), with each subset used once as the validation data.
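The partitioning described above can be sketched with the standard library alone. The function name and the fixed random seed are illustrative assumptions; the sample count of 163 is the dataset size reported in Section 4.1 (47 Artcode plus 116 non-Artcode images). When the dataset size is not divisible by the number of folds, the fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(163, k=5):
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so each image is used for validation once and for training in the other four folds.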

4.3 Study 1 – Verification

MT attempts to verify the software through examination of whether or not the identified MRs are violated: as explained in Section 2.1, violation of the Separation and/or Occlusion MR would indicate that the original classifier has not been correctly implemented.

Due to the uncertainty of a prediction by the original classifier, we explored its correctness by examining the weighted sum of the probabilities of all image blocks of an input image being classified as Artcode (the aggregated likelihood), checking whether Artcodes and non-Artcodes differed significantly in aggregated likelihood. Given input groups of Artcode and non-Artcode images, after the follow-up generation and prediction stages (Figure 3), the two classes then have two sets of aggregated-likelihood values ($\tau$-values) calculated based on Equation 4:

$$T_A = \{\tau_a : a \in A\}, \qquad T_N = \{\tau_b : b \in N\}$$

where $T_A$ and $T_N$ denote the sets of aggregated likelihoods of image samples of the Artcode ($A$) and non-Artcode ($N$) categories, respectively.

We then examined the implementation correctness by checking whether or not the relationship that the aggregated likelihoods of the Artcode and non-Artcode groups are significantly different was violated. Because of the probabilistic nature of the classifier, we used one-way analysis of variance (ANOVA) to assess the possible violation. ANOVA is a form of statistical hypothesis-testing that can be used to analyse whether or not there are statistically significant differences among the means of independent groups. We used ANOVA to examine if there was a statistically significant difference between the two groups under separation and occlusion: overall, the aggregated likelihood of the Artcode group may be significantly “greater” than that of the non-Artcode group from a statistical perspective. If not, the classifier may be incorrectly implemented. When employing one-way ANOVA, it is assumed that the variances of different groups are equal and that the aggregated-likelihood values are normally distributed. However, although the two groups were independently selected and the members of the groups were randomly selected, it was not certain that the normality and equal variance assumptions were satisfied in the experiment. Although one-way ANOVA is not very sensitive to deviations from normality, according to simulation results by McDonald (2009, pp. 157–164), we conducted further studies to consider situations of non-normality and unequal variances. Rather than examining whether the two assumptions were satisfied, we consolidated the experiment by introducing two more statistical test methods: the t-test (for unequal variances), which can be used to determine if the means of two groups are significantly different when the variances are unequal; and the Kruskal-Wallis test (Kruskal and Wallis, 1952) (also called one-way ANOVA on ranks, denoted ANOVA_ranks), which is suitable for studying the difference between the means of two groups under non-normality situations. Hence, one-way ANOVA in conjunction with the t-test and ANOVA_ranks can effectively evaluate the difference between the mean aggregated likelihoods of the two groups under the aforementioned situations. As the comparisons between the Artcode group and each non-Artcode group using these three methods were conducted separately, rather than simultaneously, we also used Dunnett’s test (Dunnett, 1955) as a post hoc test method. Dunnett’s test is a multiple comparison procedure that enables one-to-many comparisons simultaneously to check if significant differences exist between the Artcode group and each of the non-Artcode groups. The following sections present this verification examination, including detailing the experimental setting and results.
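As an illustration of the unequal-variance comparison, the sketch below computes the Welch t statistic and its Welch-Satterthwaite degrees of freedom using only the standard library. It stops short of the p-value, which requires the t distribution; the function name and the sample values are hypothetical, and this is not the analysis code used in the study.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance of each group divided by its size (squared standard errors).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical aggregated likelihoods for two small groups.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

In practice a statistics library would normally be used instead; for example, SciPy provides `scipy.stats.f_oneway`, `scipy.stats.ttest_ind` (with `equal_var=False`), `scipy.stats.kruskal`, and, in recent versions, `scipy.stats.dunnett`.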

4.3.1 Experimental setting

In order to examine the correctness of the classifier, we checked for violation of the MRs through examination of the variation of the aggregated likelihood between the two classes. Because the non-Artcode group is considerably larger than the Artcode group, as many elements as there are in the Artcode group were randomly selected from the non-Artcode group each time, with this process run multiple times to generate a set of non-Artcode groups, each of the same size as the Artcode group; this equal sizing was adopted to reduce variance. One-way ANOVA, t-test (for unequal variances), one-way ANOVA on ranks, and Dunnett’s test were conducted to examine if there was a significant difference between the Artcode group and each non-Artcode group. We used the RF-based original classifier as the SUT for study, and a 5-fold cross-validation to obtain the prediction results of the image blocks generated by the follow-up generation stage. The weights of the separation blocks, and likewise those of the occlusion blocks, were all assigned the same values, meaning that all image blocks generated by separation, or by occlusion, had the same weight (the same likelihood of containing Artcodes). The weights of the image blocks for separation may be different from those for occlusion.
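The group-generation step can be sketched as repeated random sampling. The sizes used below (116 non-Artcode values, groups of 47, 20 groups) follow the figures reported in the paper; sampling without replacement within each group, the fixed seed, the function name, and the placeholder values are assumptions.

```python
import random


def sample_groups(non_artcode_values, group_size, n_groups, seed=0):
    """Draw n_groups random subsets, each of group_size distinct elements."""
    rng = random.Random(seed)
    return [rng.sample(non_artcode_values, group_size) for _ in range(n_groups)]


# Placeholder aggregated likelihoods for 116 non-Artcode images.
non_artcode_values = [i / 1000 for i in range(116)]
groups = sample_groups(non_artcode_values, group_size=47, n_groups=20)
print(len(groups), len(groups[0]))
```

Each sampled group can then be compared against the Artcode group with the four statistical tests.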

Figure 7: Boxplot of the aggregated likelihood of the Artcode group and the non-Artcode groups. The dashed line in each box denotes the mean aggregated likelihood of the group. The grey arrowed box annotations show the mean, maximum, minimum, median, first quartile (q1) and third quartile (q3) of the Artcode group.

4.3.2 Results

Figure 7 presents a boxplot of the aggregated likelihoods of the Artcode group and the non-Artcode groups, and Table 1 shows the p-values for comparisons between the Artcode group and each non-Artcode group, according to the four tests. The average aggregated likelihoods (dashed line in Figure 7) of all images in Artcode and non-Artcode categories were calculated using the following formula:


where is the mean aggregated likelihood of group . The mean aggregated likelihood of all randomly generated non-Artcode groups is defined as:


The mean aggregated likelihood of all groups is defined as:


Here, the quantities involved are the number of groups randomly selected from the non-Artcode set, and the total numbers of Artcode and non-Artcode images in the Artcode dataset. The number of groups was set to 20, which means that 20 groups were randomly selected for study. The numbers of masks used in separation and in occlusion were both set to 4.

As shown in Figure 7, the aggregated likelihood of the Artcode group is much less dispersed than that of the non-Artcode groups, with less distance between the median and the mean. The mean aggregated likelihoods (denoted by dashed lines in the boxes) show that the mean of the Artcode group (0.136170) is greater than that of every non-Artcode group. This shows that, overall, the sum of probabilities of all image blocks of an Artcode image is greater than that of a non-Artcode image, indicating that the MRs have not been violated. Because of the uncertain nature of supervised classification, the aggregated likelihood of an individual Artcode image is not always greater than that of a non-Artcode image: the classifier may not predict inputs with 100% accuracy. However, the statistical analysis of variations between the Artcode and non-Artcode groups provides evidence for the difference between their mean aggregated likelihoods, indicating no violation of the MRs.

Group | One-way ANOVA | t-test (for unequal variances) | ANOVA_ranks (Kruskal-Wallis test) | Dunnett’s test
1 | 0.020599 | 0.020935 | 0.026708 | 0.137527
2 | 0.000929 | 0.001055 | 0.001055 | 0.006173
3 | 0.020941 | 0.021266 | 0.027382 | 0.005814
4 | 0.151530 | 0.151567 | 0.014303 | 0.022575
5 | 0.026392 | 0.026708 | 0.028257 | 0.016151
6 | 0.082988 | 0.082990 | 0.011025 | 0.025203
7 | 0.563244 | 0.563270 | 0.252606 | 0.011663
8 | 0.004583 | 0.004797 | 0.004104 | 0.023714
9 | 0.026213 | 0.026354 | 0.007432 | 0.024604
10 | 0.120316 | 0.120447 | 0.088321 | 0.004742
11 | 0.082720 | 0.083057 | 0.119343 | 0.198247
12 | 0.006003 | 0.006276 | 0.008328 | 0.013617
13 | 0.060633 | 0.060637 | 0.003846 | 0.040912
14 | 0.059978 | 0.060223 | 0.051375 | 0.029467
15 | 0.013469 | 0.013771 | 0.015104 | 0.034990
16 | 0.007758 | 0.008053 | 0.011156 | 0.089879
17 | 0.036771 | 0.037150 | 0.055180 | 0.005291
18 | 0.225829 | 0.226021 | 0.310495 | 0.042914
19 | 0.062216 | 0.062425 | 0.047335 | 0.018983
20 | 0.323313 | 0.323352 | 0.085770 | 0.029832
median | 0.048375 | 0.048687 | 0.027045 | 0.024159
mean | 0.094821 | 0.095018 | 0.058456 | 0.039115
min | 0.000929 | 0.001055 | 0.001055 | 0.004742
max | 0.563244 | 0.563270 | 0.310495 | 0.198247
std | 0.137489 | 0.137422 | 0.083444 | 0.047734
Table 1: Results of verification statistical analyses.

Table 1 presents the significance levels (p-values) of the difference between the Artcode group and each non-Artcode group under ANOVA, t-test (for unequal variances), ANOVA_ranks, and Dunnett’s test. Descriptive statistics for the p-values (median, mean, minimum, maximum, and standard deviation (std)) are also included. For ease of understanding, cells in the table are shaded in dark gray, light gray, or white according to their significance level. If the null hypothesis is defined as “an MR is violated”, then small p-values (typically below 0.05) indicate strong evidence against the null hypothesis: small p-values indicate that neither of the two MRs has been violated. On the other hand, large p-values indicate weak evidence against the null hypothesis: there is no significant difference between the mean aggregated likelihoods of the Artcode and non-Artcode groups under the chosen significance level, suggesting that one or both of the MRs may have been violated and, thus, that the RF-based original classifier may have defects.

As shown in Table 1, the ANOVA p-values range from 0.000929 to 0.563244, with a median of 0.048375. Half of the groups show p-values below 0.05, indicating that these non-Artcode groups are significantly different from the Artcode group under the significance level of 0.05 (α = 0.05). If we increase the alpha value to 0.1, then three quarters of the non-Artcode groups (15 of 20) have means that are significantly different from that of the Artcode group. This result provides evidence that the difference between the two groups is not due to sampling errors or chance. The p-values of the remaining pairs are greater than 0.05, ranging from 0.059978 to 0.563244, indicating that there is no significant difference between the mean aggregated likelihoods of the two groups under α = 0.05. This result can be explained by the diversity of the non-Artcode images in the Artcode dataset: some appear very similar to Artcode images, so-called “Artcode-like” images (Xu, 2019). Therefore, the significance of the difference between the Artcode group and a non-Artcode group may decrease if the latter includes many Artcode-like images. This will be discussed further in Section 5.
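The counts and summary statistics quoted for the ANOVA column can be re-derived directly from Table 1. The short sketch below uses the twenty ANOVA p-values exactly as listed in the table; only the helper name `count_below` is introduced here for illustration.

```python
import statistics

# One-way ANOVA p-values for non-Artcode groups 1-20, copied from Table 1.
anova_p = [
    0.020599, 0.000929, 0.020941, 0.151530, 0.026392,
    0.082988, 0.563244, 0.004583, 0.026213, 0.120316,
    0.082720, 0.006003, 0.060633, 0.059978, 0.013469,
    0.007758, 0.036771, 0.225829, 0.062216, 0.323313,
]


def count_below(p_values, alpha):
    """Count how many p-values fall below the significance level alpha."""
    return sum(p < alpha for p in p_values)


print(count_below(anova_p, 0.05))            # 10 of the 20 groups
print(count_below(anova_p, 0.10))            # 15 of the 20 groups
print(round(statistics.median(anova_p), 6))  # agrees with the tabulated median up to rounding
print(round(statistics.mean(anova_p), 6))    # 0.094821, as in the table
```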

The mean ANOVA p-value is 0.094821, which is considerably larger than the median value of 0.048375. This indicates the skewness of the p-values: most p-values lie close to the minimum, while a few large values raise the mean, as evidenced by the relatively high standard deviation (0.137489). Although the mean p-value is relatively high (greater than the commonly-used significance level of 0.05), the low median p-value is evidence against the null hypothesis, reflecting the observed differences between the Artcode group and most non-Artcode groups.

The one-way ANOVA results show that, even without assurance of equal variances and normality, the Artcode group is, to some extent, significantly different from the non-Artcode groups. Moreover, this significant difference was also observed under the assumptions of unequal variances and non-normality. The t-test (for unequal variances) has almost equivalent results to ANOVA (with only a negligible increase in p-values), thus supporting the same conclusion as ANOVA.

Table 1 also reports the results of the Kruskal-Wallis tests (ANOVA_ranks), which are suitable for non-normally distributed data. The ANOVA_ranks p-values are generally lower than those of ANOVA, ranging from 0.001055 to 0.310495, with a median of 0.027045 (which is less than the commonly-used α-value of 0.05). Thirteen groups (1-6, 8-9, 12-13, 15-16, and 19) have p-values below 0.05. Compared with the ANOVA and t-test (for unequal variances) results, ANOVA_ranks has a considerably lower mean p-value (0.058456), which is only slightly greater than the α-value of 0.05. The dispersion of p-values is also lower, with a smaller standard deviation of 0.083444. The ANOVA_ranks results confirm the significant differences between the Artcode and non-Artcode groups under the assumption of non-normality. This phenomenon could be explained by the ranked nature of the aggregated likelihood values: they are not completely continuous, or normally distributed, but to some extent behave like “ranks” in the proposed MR-augmented framework.
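To make the rank-based nature of the Kruskal-Wallis test concrete, the following is a minimal sketch for the two-group case, using only the standard library. The function names and sample values are hypothetical. It relies on the usual chi-square approximation, which for two groups has one degree of freedom, so the p-value can be obtained from the complementary error function; it also assumes the pooled values are not all identical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return the tie-corrected H statistic and its approximate p-value."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    sum_a = sum(ranks[:len(group_a)])
    sum_b = sum(ranks[len(group_a):])
    h = 12 / (n * (n + 1)) * (sum_a ** 2 / len(group_a) + sum_b ** 2 / len(group_b)) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    h /= 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    h = max(h, 0.0)  # guard against tiny negative rounding errors
    # Chi-square survival function with one degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2))
    return h, p_value


h, p = kruskal_wallis_two_groups([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(round(h, 4), round(p, 4))
```

Because only the ranks enter the statistic, the test is unaffected by how far apart the underlying values are, which is why it tolerates non-normal and quantised data.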

The p-values for one-way ANOVA, ANOVA_ranks, and t-test (for unequal variances) were calculated in separate comparisons. To alleviate the influence of this setting, and to consolidate the conclusion, we also conducted a multiple comparison test, Dunnett’s test, to compare the Artcode group and the 20 non-Artcode groups. Because the experiment studied the difference between the Artcode group and the non-Artcode groups, only the p-values for comparisons between the Artcode group and each non-Artcode group are presented in Table 1. As can be seen from the table, Dunnett’s test provides more evidence for significant differences between the Artcode and non-Artcode groups, with the p-values ranging from 0.004742 to 0.198247, and a median of 0.024159. 16 of the 20 groups were significantly different from the Artcode group. In terms of mean and standard deviation, Dunnett’s test had the lowest mean (0.039115) and standard deviation (0.047734) among all four tests. The results of Dunnett’s test thus confirm the significant difference between the Artcode group and non-Artcode groups.

Although none of the four test methods produced 20 p-values below 0.05, overall, the results in Table 1 show significant differences between the mean aggregated likelihoods of the Artcode and non-Artcode groups. Considering the uncertain nature of the predictor (the classifier) and the innate variance of random forests, the experimental results indicate no reason to consider the implementation faulty: the results indicate that neither MR has been violated. The next section presents the second study, which evaluates the MR-augmented classifiers and shows their enhanced performance over the non-augmented classifiers.

4.4 Study 2 – Enhancement

4.4.1 Experimental setting

According to the framework in Figure 3, we used Matlab to implement MR-augmented versions of classifiers that use random forests and support vector machines. The RF-based MR-augmented, SVM-based MR-augmented, RF-based non-MR-augmented (original) and SVM-based non-MR-augmented classifiers are denoted Aug-RF, Aug-SVM, Ori-RF and Ori-SVM, respectively. Cross-validation techniques were used to evaluate and compare the performance of these classifiers, with the Artcode dataset being used as the evaluation dataset.

Because random forests and SVM are used for the classification algorithms, the performance naturally has a certain level of variation in each execution, due to RF’s random variable selection from the feature vector, and SVM’s sub-optimisation because of the limited number of computational iterations. Multiple runs of cross-validation were therefore conducted to obtain the average performance. Because the dataset was imbalanced, with more non-Artcode than Artcode samples, we needed a group of measurements that could cope with imbalanced datasets and provide an informative view of the performance of the MR-augmented classifiers: precision, recall, accuracy, the TNR (true negative rate), the F_2 measure, and the MCC (Matthews Correlation Coefficient) (Matthews, 1975) were all employed as evaluation metrics.

Precision is a measure of the correctness of those classified as Artcodes, whereas recall is a measure of completeness (how many of the true Artcodes were correctly classified). These two measures focus on positive examples and predictions, and their importance varies from one learning task to another. With Artcode classification, recall is more important than precision because recognising the presence of all Artcodes in the scene is a prerequisite to the follow-up decoding that triggers the digital information.

TNR measures how many non-Artcode samples are correctly classified. Accuracy, F_2, and MCC measure the overall performance of the classifier. Accuracy is the overall proportion of correct predictions, for both the positives (Artcodes) and negatives (non-Artcodes). However, accuracy is sensitive to size differences among classes, and, in our study, may have been influenced by the imbalanced class sizes. The F_2 measure is a special instance of the F_β measure with β = 2, where β is a value allocating β times as much importance to recall as to precision. F_2 uses a weighted average of precision and recall to evaluate the classification effectiveness, giving twice (β = 2) as much importance to recall as to precision. In contrast to accuracy, the F_2 measure and MCC provide more insight into the performance of a classifier. However, compared with MCC, F_2 can be sensitive to data distribution. MCC is, in essence, a correlation coefficient between the observed and predicted classifications, incorporating true and false positives and negatives. It remains effective even if the dataset is imbalanced, and is generally regarded as one of the best measures for classification performance evaluation (Powers, 2011).
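For reference, the count-based metrics described above can be written out as a short sketch using their standard definitions. The function names and the confusion-matrix counts in the usage lines are hypothetical and do not come from the experiments; the recall-weighted F measure is shown in its general form with the weighting factor defaulting to 2.

```python
def precision(tp, fp):
    """Share of samples predicted as Artcode that really are Artcodes."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true Artcodes that were predicted as Artcode."""
    return tp / (tp + fn)


def true_negative_rate(tn, fp):
    """Share of true non-Artcodes that were predicted as non-Artcode."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_beta(tp, fp, fn, beta=2.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn, tn = 30, 10, 20, 40
print(precision(tp, fp), recall(tp, fn), true_negative_rate(tn, fp))
print(accuracy(tp, tn, fp, fn), f_beta(tp, fp, fn))
```

With these counts the recall-weighted F value (0.625) sits closer to recall (0.6) than to precision (0.75), which is the intended effect of weighting recall more heavily.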

Two thresholds, and , were studied in the experiment, as was their impact on the augmented classifiers. The given values in the weight vector affect the selection of the values of and . According to Equation 3, the weights of image blocks generated by occlusion are greater than those generated by separation. In this experiment, four masks were used for both separation and occlusion (), resulting in both the prediction vector and the weight vector being 8-dimensional. Based on empirical examinations of assigning different values to , we assigned a value of to both and , and a value of to both and . In order to achieve quantisation and computational convenience of the value of aggregated likelihood , the numbers and were used in the prediction vector to represent the Artcode and non-Artcode classes, respectively.

4.4.2 Results

Figure 8: Performance comparison between RF- and SVM-based classifiers for the first combination of threshold values. Panels: (a) Precision; (b) Recall; (c) Accuracy; (d) True negative rate (TNR); (e) F measure; (f) Matthews correlation coefficient (MCC).
Figure 9: Performance comparison between the RF- and SVM-based classifiers for the second combination of threshold values. Panels: (a) Precision; (b) Recall; (c) Accuracy; (d) True negative rate (TNR); (e) F measure; (f) Matthews correlation coefficient (MCC).

All performance metric values reported are the average values calculated from five executions of cross-validation. Two combinations of the two thresholds, in conjunction with different numbers of decision trees, were used to study the impact of the classifiers’ tuning parameters. Because the number of decision trees used in the RF-based classifiers is not a tuning parameter of the SVM classifiers, the SVM classifier values shown for each number of trees are, for the sake of comparison, simply the average of five runs of cross-validation. Higher values in Figures 8 and 9 indicate better performance. Figures 8 and 9 show consistent performance across different numbers of decision trees for all six evaluation metrics: the Aug-RF classifier (unbroken red) is not sensitive to changes in the number of trees, a characteristic inherited from the original RF classifier (dashed red).

MR-augmented versus non-MR-augmented classifiers

We studied the performance difference between the augmented (Aug-) and original (Ori-) classifiers, and also compared the performance of the classifiers based on random forests (-RF) with that of those based on support vector machines (-SVM).

Across the various numbers of decision trees, and with fixed values of the two thresholds, the MR-augmented classifiers (Aug-RF and Aug-SVM) outperformed the original classifiers (Ori-RF and Ori-SVM) in terms of precision, recall, accuracy, F measure, and MCC. In particular, they outperformed the original classifiers in terms of recall, precision, and F measure for both threshold combinations, showing improved predictive performance in classification of the positive class (Artcodes). This improvement is important because Artcode classification requires higher accuracy when predicting Artcodes.

When predicting the negative class (non-Artcodes), as measured by TNR, the MR-augmented classifiers appear slightly influenced by the values of the two thresholds, which can be seen in the slight difference in TNR values for the original and augmented classifiers in Figures 8(d) and 9(d): for one threshold combination, the augmented classifier TNR values are similar to those of the original; but for the other, they are lower. This differs from the other evaluation metrics, which all show that the augmented classifiers outperform the original ones for both threshold combinations. One reason for this, partly as described in Section 3.3, is that when the two thresholds are equal, the augmented classifier does not directly use the prediction result of the original classifier. Another reason is the careful selection of the threshold value: a lower value means that the augmented classifier predicts the input image from the MRs only when they can adjust the prediction with relatively high confidence; otherwise, the augmented classifier uses the original prediction result. Thus, the two thresholds can be used as tuning parameters for the performance of the MR-augmented classifier on both the positive and negative classes.

Accuracy and MCC assess the overall performance of the classifier. As shown in Figures 8(c) and 9(c), for both threshold combinations, the augmented classifiers have slightly better accuracy than the original classifiers, with an average increase of approximately 2-3%. Although the MR-augmented classifiers show improved performance in the Artcode class, the small percentage of Artcodes in the dataset does not contribute strongly to the overall accuracy in evaluation, which is determined by both true positives and true negatives. In contrast, MCC is a more informative measure of overall performance, even when the dataset is imbalanced. As shown in Figures 8(f) and 9(f), the augmented classifiers obtain about a 10-20% increase over the original classifiers. This improvement is much more noticeable when comparing Aug-SVM with the Ori-SVM classifier, showing an overall improved performance of the MR-augmented classifier. However, the values of the F measure and MCC for all classifiers are relatively low. This is due to the imbalance of the dataset used in the evaluation, with a much greater number of negative examples than positive ones.

Both the original and MR-augmented classifiers achieve high true negatives (TN), approximately 0.82–0.85, as presented in Figures 8(d) and 9(d). However, they also have very low true positives (TP), approximately 0.3–0.4, which can be observed from the low precision (Figures 8(a) and 9(a)) and recall (Figures 8(b) and 9(b)) results. If TP = 0.3 and TN = 0.85, then FN = 0.7 and FP = 0.15, and MCC can be calculated as:

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} = \frac{0.3 \times 0.85 - 0.15 \times 0.7}{\sqrt{0.45 \times 1 \times 1 \times 1.55}} \approx 0.1796 \]

The MCC is a very low value, 0.1796. This illustrates how MCC is an effective measurement for evaluating the performance of a classifier on an imbalanced dataset.
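The MCC figure quoted above can be checked numerically with the standard MCC formula. In the sketch below, the rates TP = 0.3 and TN = 0.85 are taken from within the ranges reported in this section, with FN and FP as their complements; choosing these particular values is an assumption, made because they reproduce the stated result of 0.1796. The function name is introduced only for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix entries (counts or rates)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


tp, tn = 0.3, 0.85        # assumed rates, within the reported ranges
fn, fp = 1 - tp, 1 - tn   # complements: 0.7 and 0.15
print(round(matthews_corrcoef(tp, tn, fp, fn), 4))  # 0.1796
```

Even with a true negative rate of 0.85, the coefficient stays low because the weak positive-class performance enters both the numerator and the denominator.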

As can be seen from Figures 8 and 9, the precision and recall values of all classifiers are relatively low, and the TNR values are comparatively high. This is due to the imbalance in the Artcode dataset, which includes many more negatives. On the one hand, more weight is given to the non-Artcode class by feeding more information to the classification model in the training stage, resulting in a classifier with low recall evaluation (good non-Artcode classification, but poorer Artcode classification). On the other hand, the small percentage of Artcodes in the dataset results in the low precision evaluation of both classifiers. Conversely, the large proportion of non-Artcode images in the dataset (and the good non-Artcode prediction of the classifier) lead to relatively high TNR values, as shown in Figures 8(d) and 9(d).

RF-based versus SVM-based classifiers

The SVM-based classifiers (blue lines) achieve better performance than the RF-based classifiers (red lines), as shown in Figures 8 and 9, with an approximately 5-10% increase in terms of almost all performance evaluation measurements (not for TNR). The tradeoff between the precision and TNR of the SVM-based classifiers can be adjusted by the misclassification matrix (Cortes and Vapnik, 1995) employed in SVM. Considering the greater importance of recall than precision in this application, this experiment assigned higher values to the cost of classifying an Artcode as a non-Artcode, resulting in a classifier that enables better Artcode prediction.

The better performance of the SVM-based classifiers is also evidenced by the higher values of the Aug-SVM classifier than the Aug-RF classifier. However, when the classifiers use the same classification method (SVM or RF), the MR-augmented version outperformed the original (non-augmented) version of the corresponding classifier. This indicates that the introduction of MRs into supervised classification models actually improves the performance of the original classifiers, regardless of whether SVM or RF is used.

Overall, the Aug-SVM classifier obtained the best performance, especially when considering that SVM runs much faster than the random forests classifier. The MR-augmented classifiers outperformed the original classifiers in terms of all the evaluation measures except TNR. This improved performance is sensitive to the values of the two thresholds, but not to the number of decision trees or the choice of classification method. As discussed in Section 4.4.2, the two thresholds influence the performance of the augmented classifier, with different combinations determining the impact the MRs have on adjusting the original classification. Careful selection of the values of these tuning parameters is therefore vital to fine-tune the results of the original classifier and obtain the enhanced performance.

5 Analysis and discussion

Class | Amount | Aug-RF correct rectifications | Aug-RF incorrect rectifications | Aug-SVM correct rectifications | Aug-SVM incorrect rectifications
Artcode | 47 | 13.3 (28.3%) | 1.9 (4.04%) | 13.8 (29.36%) | 4 (8.51%)
non-Artcode | 116 | 7.3 (6.3%) | 15.6 (13.45%) | 6.4 (5.52%) | 9.2 (7.93%)
Table 2: Results of rectification analysis.

5.1 Analysis of the rectification stage

Figure 10: Illustration of the rectification distribution of the RF-based MR-augmented classifier. The graph is generated from one round of cross-validation of the RF-based MR-augmented classifier. It is split into left and right areas by a green vertical line, where the left and right areas show the aggregated likelihood of Artcode and non-Artcode images, respectively. The horizontal green line marks the predefined thresholds and separates the graph into upper and lower zones. The samples in the upper zone are rectified as Artcodes, whereas the samples in the lower zone are labelled as non-Artcodes in the rectification stage of the MR-augmented classifier. Therefore, the two tuning parameters of the MR-augmented classifier, its thresholds, control whether or not to rectify more “Artcode-like” samples. Correctly and incorrectly rectified predictions are highlighted in red and blue, respectively.

In order to reveal how the fine-tuning (rectification layer) stage operates, and how the improved performance is achieved, we performed ten rounds of cross-validation using both the RF-based and SVM-based MR-augmented classifiers on all samples in the Artcode dataset. Table 2 shows the average correct and incorrect rectifications by the MR-augmented classifiers over these ten executions of 5-fold cross-validation. Figure 10 shows the aggregated likelihood values of all Artcodes and non-Artcodes for one execution of cross-validation, where correct and incorrect rectifications are highlighted in red and blue, respectively. As illustrated in Figure 10 and Table 2, the two MR-augmented classifiers correctly rectified an average of 28.3% and 29.36% of the Artcode predictions, but incorrectly adjusted an average of 4.04% and 8.51% of the Artcodes to non-Artcodes. This higher correct rectification percentage contributed to the higher true positive rate, a key factor in the evaluation of a classifier in terms of recall and precision. However, the classifiers performed slightly worse on the non-Artcode class: the RF-based MR-augmented classifier had an average of 6.3% correct and 13.45% incorrect rectifications, and the SVM-based MR-augmented classifier obtained an average of 5.52% correct and 7.93% incorrect rectifications. This explains why the MR-augmented classifiers have a relatively lower true negative rate (TNR), as shown in Figures 8(d) and 9(d), but higher precision (Figures 8(a) and 9(a)) and recall (Figures 8(b) and 9(b)). Overall, the average net correct rectification percentage (correct minus incorrect rectifications, as a share of all 163 samples) is 1.91% for the Aug-RF classifier and 4.29% for the Aug-SVM classifier, indicating that, on balance, 1.91% and 4.29% of the predictions made by the RF-based and SVM-based original classifiers were corrected by their respective MR-augmented classifiers. This explains how the improved performance of the MR-augmented classifiers was obtained: the rectification stage can rectify misclassifications (mainly false negatives) made by the original classifiers, albeit at the expense of a comparatively smaller number of incorrect rectifications of true negatives.
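The overall rectification percentages can be approximately re-derived from the averages in Table 2. The sketch below assumes that the overall figure is the net number of rectifications (correct minus incorrect, summed over both classes) divided by the total number of samples; this interpretation and the function name are assumptions, adopted because they reproduce the reported values to within the rounding of the table entries.

```python
def net_rectification_percentage(correct, incorrect, total_samples):
    """Net share of samples fixed by rectification: (correct - incorrect) / total, in percent."""
    return (sum(correct) - sum(incorrect)) / total_samples * 100


total_samples = 47 + 116  # Artcode and non-Artcode images (Table 2)

# Average rectification counts from Table 2, listed as [Artcode, non-Artcode].
aug_rf = net_rectification_percentage([13.3, 7.3], [1.9, 15.6], total_samples)
aug_svm = net_rectification_percentage([13.8, 6.4], [4.0, 9.2], total_samples)

print(round(aug_rf, 2))   # 1.9, close to the reported 1.91%
print(round(aug_svm, 2))  # 4.29, matching the reported 4.29%
```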

The superior rectification performance of the MR-augmented classifiers on the Artcode examples shows that Artcode blocks are more likely to preserve the topological structure than non-Artcode blocks. Therefore, although the two MRs may violate the properties of Artcode images, the aggregated predictions of image blocks of Artcodes are more informative than the predictions of the entire image. This property may not be preserved for non-Artcodes, which have no predefined topological characteristics. The MR-augmented classifier adjusts prediction results in the rectification stage only if the new evidence collected is strong enough to accept, which is determined by comparing the aggregated likelihood of the predictions of image blocks with the two given thresholds. Further discussion about how the MR-augmented classifier works is presented in the next section.
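The rectification decision described above can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the function and parameter names, the weighted-sum form of the aggregated likelihood, the encoding of block predictions as +1 and -1, the equal weights, the threshold values, and the treatment of values lying exactly on a threshold.

```python
def aggregated_likelihood(block_predictions, weights):
    """Weighted sum of per-block predictions (assumed form of the aggregated likelihood)."""
    return sum(w * p for w, p in zip(weights, block_predictions))


def rectify(original_label, likelihood, lower_threshold, upper_threshold):
    """Override the original prediction only when the block evidence is strong enough."""
    if likelihood >= upper_threshold:
        return "Artcode"
    if likelihood <= lower_threshold:
        return "non-Artcode"
    return original_label  # evidence too weak: keep the original classifier's result


# Eight blocks (four from separation, four from occlusion), hypothetical equal weights.
weights = [0.125] * 8
block_predictions = [1, 1, 1, -1, 1, 1, 1, 1]  # +1 = Artcode, -1 = non-Artcode (assumed encoding)

likelihood = aggregated_likelihood(block_predictions, weights)
print(likelihood)                                    # 0.75
print(rectify("non-Artcode", likelihood, -0.5, 0.5)) # Artcode
```

With the two thresholds set to the same value, the middle branch is never reached and the original prediction is always replaced, which is one way to see why the choice of thresholds governs how strongly the MRs affect the final classification.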

5.2 Discussion

Figure 11: Image blocks generated according to separation and occlusion.

As explained in Section 4.4.2, the MR-augmented classifiers obtained better recall and precision results than the original classifiers (with approximately 10-15% improvement). Recall and precision focus on the positive class (Artcode), with higher values indicating more confident and complete predictions of Artcodes: some Artcode misclassifications by the original classifier were corrected by the MR-augmented classifier in the rectification stage. The decision as to whether or not to rectify was based on the aggregated likelihood value given in Equation 4, a measure of the likelihood that an input image belongs to the Artcode class (Section 3.3).

As described in Section 3.1, the Separation and Occlusion MRs are based on the assumption that Artcode image blocks are more likely to be classified as Artcode than non-Artcode. The effectiveness of the two MRs was investigated by examining the prediction and rectification of image blocks for those images adjusted by the MR-augmented classifier (the red and blue points in Figure 10). The non-Artcodes that were incorrectly adjusted were images that were very similar in topology to Artcodes (containing a number of connected regions), and had repeated geometrical structures, such as the 2nd and 4th images in Figure 5. Repeated structures enabled the separate image blocks to inherit more topological structure from the original image, making their internal structures similar to those of Artcodes. Occlusion and separation sometimes strengthened their topological structure, because these operations may remove auxiliary structures such as background imagery. Accordingly, the MR-augmented classifiers are sensitive to this kind of Artcode-like image (such as the 4th image in Figure 5), that is, an image that is topologically very similar to an Artcode, which may result in incorrect rectifications.

Likewise, if separation and occlusion completely break the topological structure, Artcodes would be incorrectly rectified as non-Artcodes by the MR-augmented classifiers. Fortunately, Artcodes have a topological structure that includes a number of connected regions, and often include several repeated structures with the same topology (but different geometry). These two properties make it very likely that Artcode image blocks retain the original topology, even after separation and occlusion. An example is presented in Figure 11 for illustration: the image (the 5th in Figure 6) is split into eight blocks by intersecting with the eight separation and occlusion masks shown in Figure 4, where the left four image blocks in Figure 11 are from separation, and the right four are from occlusion. Almost all of these blocks retain a complete topological structure: they remain relatively complete Artcodes. Therefore, the MR-augmented classifier, based on the aggregated probability (aggregated likelihood value) of image blocks belonging to the Artcode class, can accumulate more information about this Artcode image than the original classifier, thereby achieving better overall predictions.

The two MRs are based on fundamental image processing operations, with the underlying rationale being whether or not the image blocks are able to retain the original structure’s properties after transformations. Artcodes, as topological markers enabling redundancy, naturally possess this property. The conventional use of MRs in metamorphic testing draws on intrinsic properties of the SUT. Likewise, the MRs used in Artcode classification also make use of intrinsic characteristics of Artcodes and non-Artcodes. Because domain characteristics may differ from task to task, and the repeated structures used in our two identified MRs may not exist in some contexts, it is likely that these MRs may not be directly applicable in some other image classification tasks. Nevertheless, this study has shown that MRs do have the potential to be used in image classification tasks (or even more general machine learning tasks), especially for those tasks with distinctive structural properties among different categories of learning data.

6 Conclusion

This paper has reported on an examination of two previously identified MRs to enhance image classification, using them not only to improve performance, but also to explore verification of the classifier. Considering the uncertainty of classification algorithms, the verification exploration involved four statistical tests: one-way ANOVA, t-test (for unequal variances), Kruskal-Wallis test, and Dunnett’s test. An effective and efficient MR-augmented classifier that uses SVM as the classification method, Aug-SVM, was introduced, and was compared with the Aug-RF classifier. The paper also examined the MR-augmented classification framework (Xu et al., 2018), and presented a method that could be applied to related image classification problems for verification and enhancement.

Our experimental studies showed the applicability of ANOVA in conjunction with t-test (for unequal variances), ANOVA_ranks, and Dunnett’s test to explore verification of the classifier based on the two MRs. The improved performance was not affected by the chosen classification method, demonstrating the potential to apply MT theories and techniques to general machine learning applications. Among the four classifiers in this paper (Ori-RF, Aug-RF, Ori-SVM, and Aug-SVM), Aug-SVM obtained the best performance in terms of both the evaluation metrics and the computational efficiency. The experimental results also showed the essential role of the two thresholds in tuning the MR-augmented classifier performance. In addition, a theoretical analysis and discussion of how the enhanced performance was achieved by the MR-augmented classifiers was presented.

Our future work will include further examination of other parameters, including the numbers of masks for separation and occlusion, the values in the weight vector, and the values of the two thresholds. A potential approach for choosing suitable threshold values will be to examine the relationship between the thresholds and the centroids of the Artcode and non-Artcode groups. Because the work presented here has only examined two straightforward image transformations for the MRs, exploring other possible MRs that draw on other transformations for general image classification tasks will also form part of our future work.

Although the two MRs employed in this work were straightforward, the results are promising, and clearly demonstrate the feasibility of MRs being used to augment classifiers. In order to fully investigate this new research area, more theoretical and practical work needs to be conducted, including exploration of connections between MRs and data augmentation, and case studies to examine the application of MRs to other well-studied image classification tasks (such as face and object detection) and even more broad machine learning problems. The concept of verifying machine learning software (the classifier in this paper) using MRs is still in an early stage of development, and more effort is also needed in the future. The proposed verification exploration based on ANOVA, t-test (for unequal variances), ANOVA_ranks, and Dunnett’s test, attempts to use statistical analyses to test probabilistic algorithms such as classification models: Further work is necessary to extend this approach to verification, and fully explore its applicability.


This work was supported in part by the National Natural Science Foundation of China, under grant no. 61872167, by the Australian Research Council’s Discovery Projects funding scheme (Project ID: DP210102447), and by a Western River entrepreneurship grant.

