There are a variety of invariance qualities associated with machine-learned (ML) models. Testing these qualities enables us to evaluate the robustness of a model in real-world applications, where the model may encounter variations that do not feature sufficiently in the training and testing data. Such testing also allows us to observe possible biases or spurious correlations that may have been learned by a model spurious_aaai and to anticipate whether the model can be deployed in other application domains domain_generalisation. This work is concerned with background invariance testing – a relatively challenging type of testing.
Many types of commonly-deployed invariance testing focus on variables that can be ordered easily, such as sizes and rotation angles. As illustrated by ml4ml_invariance, ordering the testing results is important for observing the level of robustness in relation to the likelihood of different variations (e.g., a slightly rotated car vs. an upside-down car in an image). However, in background invariance testing, the “background” is a multivariate variable and is commonly expressed qualitatively (e.g., an outdoor scene, in a desert, and so on), making it difficult to order the testing results. Therefore, one cannot judge if a model is robust against certain variations, or how different background scenes influence spurious correlations. Furthermore, without a mechanism for ordering background scenes consistently, the ML4ML approach for automated invariance testing ml4ml_invariance cannot be used.
In the literature, some previous work focused on the quality of background scenes, e.g., introducing black pixels or random noise color_bgtest into the background. While this allows the variations to be ordered, variations of image quality are very different from variations of background scenes. Other previous work focused on testing foreground objects against random background images bgchallenge. While this approach can provide an overall statistical indication of the invariance quality, the background images are randomly selected, and it thus does not support more detailed analysis, such as whether the level of robustness or bias is acceptable in an application when taking into account the probabilities of different background scenes.
In this paper, we present a technical solution to the need for ordering background scenes by utilizing semantic information. Our technical solution is built on the existing techniques of scene understanding in computer vision and those of ontological networks that are used in many text analysis applications. With this technique, we are able to:
(a) Search for different background scenes in a meaningful and efficient way based on the semantics encoded in each original image with a foreground object x;
(b) Control the distribution and sparsity of the background scenes according to their semantic distance to the original image containing x;
(c) Construct testing images from background scenes for each x and test an ML model with the testing images;
(d) Apply steps (a-c) to a large number of target images, and generate testing images;
(e) Test an ML model for object classification with the testing images, collect results or intermediate results at positions of the model, and transform the results to visual representations (referred to as variance matrices) in a consistently-ordered manner;
(f) Apply step (e) to a model repository of different ML models, and use the resultant variance matrices to train an ML4ML assessor for evaluating the invariance quality of ML models. With a trained ML4ML assessor, the process of background invariance testing can be automated since steps (a-e) can easily be automated.
2. Related Work
The invariance qualities of ML models have been studied for a few decades. In recent years, invariance testing has become a common procedure in invariant learning invariant_risk; water_bird; invariant_learning. Among different invariance qualities, background invariance has been attracting more attention.
In the literature, several types of variations were introduced in background invariance testing, e.g., by replacing the original background with random noise, color patterns, and randomly selected background images.
noise_bgtest tested object detection models by transforming the original background to random noise or black pixels. They reported that all tested models failed to perform correctly in at least one of their testing cases. Similarly, random_erase; aaai_occlusion1; aaai_occlusion2 replaced parts of the images with black or grey pixels for foreground invariance testing.
2004scenetest noticed that the association between a foreground object and its background scene affected object recognition and described such association as “consistency”. color_bgtest tested different models with consistent and inconsistent backgrounds, while using the term “semantically-related” to describe consistent association. In particular, they used color texture to replace the original background of the target image, and controlled the inconsistency using a parameterized texture model color_texture.
Several researchers experimented with swapping background scenes in studying background invariance, e.g., 2004scenetest. bgchallenge provided the Background Challenge database by overlaying foreground objects onto backgrounds extracted from other images. To prepare models (to be tested), they also provided a smaller version of ImageNet with nine classes (IN9). In this work, we train a small repository of models on IN9.
Like many invariance qualities (e.g., rotation, size, and intensity), it is relatively easy to control the variation of noise level, the size of the replacement patch, and the inconsistency level of color textures. However, it is not so easy to control the level of consistency or semantic association when one replaces one background scene with another. This work aims to address this research challenge.
Measuring semantic association between background scenes can benefit from existing scene understanding models, e.g., scene_aaai; scene_aaai2. We refer interested readers to a few comprehensive surveys on scene understanding, including scene_survey_indoor; scene_survey_crowd. In this work, we use two models, pre-trained on the ADE20k database ADE20k and the Places365 database place365 respectively, to extract semantic information from images.
Our solution also utilizes the techniques developed in other branches of AI and ML, including ontology ontology_aaai; doctorXAI and association analysis Apriori.
3. Definition, Overview, and Motivation
Let $I_i$ be the $i$-th image in a dataset and $x_i$ be the foreground object in $I_i$. Let $M$ be an ML model trained to recognise or classify $x_i$ from $I_i$. In general, the invariance quality of $M$ characterizes the ability of $M$ to perform consistently when a type of transformation is applied to $I_i$. For example, one may apply a sequence of rotation transformations $T_1, T_2, \ldots, T_k$ to $I_i$, and test $M$ with the resulting set of testing images $T_j(I_i)$. As reported by ml4ml_invariance, when the testing results are organised into a variance matrix (Figure 1a), the visual patterns in the variance matrix can be analyzed using an ML4ML assessor to evaluate the invariance quality of $M$.
The background invariance quality characterizes the ability of $M$ to recognize $x_i$ when it appears with different backgrounds. Hence the transformations of $I_i$ involve replacing the original background in $I_i$ with different background scenes $B_1, B_2, \ldots, B_k$. The transformations
$T_j(I_i) = x_i \oplus B_j, \quad j = 1, 2, \ldots, k \quad (1)$
generate the testing images $I_{i,j} = T_j(I_i)$. Similar to other invariance testing, the testing results can be summarized as a variance matrix. However, as these background scenes may be ordered arbitrarily, e.g., according to their locations in a database, different orderings may yield different variance matrices (Figure 1b). The visual patterns in such a variance matrix are not as meaningful as those resulting from rotation transformations (Figure 1a).
If we can find a way to consistently produce variance matrices for background transformations, we can adopt the ML4ML invariance testing framework proposed by ml4ml_invariance for background invariance testing. This motivates us to address the following challenges:
We need to introduce a metric for measuring the semantic distance between each background scene $B_j$ and the corresponding target image $I_i$.
We need to produce a variance matrix as a uniform data representation from non-uniform sampling of background scenes, as sampling background scenes will not be as uniform as sampling sizes, rotation angles, and many other types of variables in invariance testing.
We need to have an effective way to search for background scenes that will be distributed appropriately for producing variance matrices.
As illustrated in Figure 2, the upper part of the figure shows the previous framework ml4ml_invariance for invariance testing involving uniform sampling of transformations. In this work, we introduce a number of new technical components (the lower part of the figure) to address the aforementioned challenges. Once these challenges are addressed, background invariance testing can be integrated into the ML4ML invariance testing framework.
In this section, we follow the pathways in the lower part of Figure 2 to describe a series of technical solutions for enabling background invariance testing with non-uniform sampling of the transformations of the original images.
Consider a large collection of original images to be tested and a large number of background scenes in an image repository $R$. The first part of the sub-workflow identifies a set of background scenes suitable for transforming each original image (also called the target image) $I_i$ with a specific foreground object $x_i$. The ML models to be tested for background invariance quality are expected to recognize $x_i$ when it is combined with different background scenes, or to classify such combined images with the label of $x_i$.
Naturally, one may consider using conventional image similarity metrics (e.g., the cosine or $l_2$ similarity used in metric learning metric_learning_survey) to find background scenes similar to $I_i$. However, image similarity does not necessarily imply plausibility. Furthermore, as a suitable background scene may not (and often should not) contain the foreground object $x_i$, similarity metrics cannot easily handle the conflicting requirements of a similar background and a different foreground. We therefore focus on the semantic distance between images (Challenge (1) in the previous section).
Image Semantics from Scene Understanding.
Research on scene understanding aims to extract different kinds of semantic information from images. In this work, we represent the semantics of each image with keywords extracted by applying existing scene understanding techniques to both the original images to be tested and the background scenes to be used for transformation. From each image ($I_i$ or $B_j$), a scene understanding model identifies a set of objects that are recorded as a set of keywords $K$. We detail the scene understanding model used in this work in Appendix A. When the image is a target image $I_i$, we assume that its keyword set $K_i$ contains a keyword for the foreground object $x_i$.
Figure 3 shows two examples of foreground images and two examples of background scenes. For some images, scene understanding may result in many keywords, but in other cases, only 1-3 keywords (e.g., the 1st and 3rd images in Figure 3). It is therefore desirable to consider not only the keywords extracted from each image, but also the keywords related to them.
Ontology from Association Analysis.
An ontology is a graph-based knowledge representation, which is used in this work to store the relationships among different keywords. As illustrated in Figure 4, nodes represent keywords, and an edge between two nodes indicates that the two keywords have been extracted from the same image. The weight on the edge indicates how strong the association between the two keywords is. The ontology is constructed in a pre-processing step by mining association rules from the keywords extracted for all images in the repository $R$.
The Apriori algorithm Apriori is widely used for association analysis. When the dataset is large, the Frequent Pattern Growth algorithm fptree can run more efficiently. Consider the set of all possible keywords that can be extracted from the images in the repository $R$. The level of association between two keywords $a$ and $b$ can be described by two concepts. Consider three itemsets: $S_a = \{a\}$, $S_b = \{b\}$, and $S_{ab} = \{a, b\}$. The first concept, support for the itemset $S_{ab}$:
$\mathrm{supp}(S_{ab}) = |\{B \in R : S_{ab} \subseteq K_B\}| \,/\, |R|$
indicates the frequency of the co-occurrence of $a$ and $b$.
An association rule from one itemset to another, denoted as $S_a \Rightarrow S_b$, is defined by the second concept, confidence:
$\mathrm{conf}(S_a \Rightarrow S_b) = \mathrm{supp}(S_{ab}) \,/\, \mathrm{supp}(S_a) \quad (2)$
which indicates the confidence level of the inference that if the object with keyword $a$ appears in a scene, the object with keyword $b$ could also appear in that scene. Similarly, we can compute $\mathrm{conf}(S_b \Rightarrow S_a)$.
For a large image repository, the value of support is usually tiny, and it is easily changed by an increase in the number of images in the repository, the introduction of more keywords, or improvements to the scene understanding techniques. Hence, it is difficult to use the support values consistently. We therefore use the confidence values as the weights on the directed edges of the ontology.
In the ontology, the shortest path between two keywords indicates the level of association between them, typically facilitating two measures: (i) the number of edges along the path (i.e., hops) and (ii) an aggregated weight, e.g., the sum or the product of the edge weights along the path.
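As a concrete illustration, the support and confidence computations and the resulting confidence-weighted ontology can be sketched as follows. This is a minimal sketch assuming per-image keyword sets are already available; the function names (`build_ontology`, `hop_distance`) are ours, not from the paper, and a real implementation would use Apriori or FP-Growth over the full repository.

```python
from collections import Counter, deque
from itertools import combinations

def build_ontology(keyword_sets):
    """Build a directed, confidence-weighted graph from per-image
    keyword sets: graph[a][b] = conf(a => b) = supp({a,b}) / supp({a})."""
    n = len(keyword_sets)
    single, pair = Counter(), Counter()
    for kws in keyword_sets:
        kws = set(kws)
        single.update(kws)
        pair.update(frozenset(p) for p in combinations(sorted(kws), 2))
    graph = {}
    for p, count in pair.items():
        a, b = tuple(p)
        supp_ab = count / n
        graph.setdefault(a, {})[b] = supp_ab / (single[a] / n)
        graph.setdefault(b, {})[a] = supp_ab / (single[b] / n)
    return graph

def hop_distance(graph, src, dst):
    """Shortest number of edges between two keywords (measure (i) in
    the text); returns None if the keywords are not connected."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph.get(node, {}):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None
```

For example, if "sky" only ever co-occurs with "tree", then conf(sky ⇒ tree) is 1.0 even when conf(tree ⇒ sky) is much lower, which is why the edges are directed.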
Background Scenes from Semantic Search.
Given a target image $I_i$, to test if an ML model is background-invariant, we would like to find a set of background scenes that can be used to replace the background in $I_i$ while maintaining the foreground object $x_i$. The set of keywords $K_i$ extracted by the scene understanding model can be used to search for background scenes with at least one matching keyword. When there are many keywords in $K_i$, semantic search can work very well. However, as exemplified in Figure 4 (top), when an image has only two keywords, the search will likely yield a small number of background scenes, undermining the statistical significance of the test.
To address this issue (i.e., Challenge (3) in Section 3), we expand the keyword set by using the ontology that has acquired knowledge about keyword relationships in the preprocessing stage. As illustrated in Figure 4, the initial set $E_0$ has the keywords [sky, tree]. The ontology shows that Sky and Tree are connected to Earth, Field Road, Botanic Garden, Vegetable Garden, and Water, which form the level-1 expansion set $E_1$. Similarly, from $E_1$, the ontology helps us to find the level-2 expansion set $E_2$, and so on. The set of all keywords after the $l$-th expansion is:
$K^{(l)} = E_0 \cup E_1 \cup \ldots \cup E_l \quad (3)$
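The level-wise expansion of Eq. 3 amounts to a breadth-first traversal over the ontology. A sketch, assuming an ontology of the form produced above; the `min_conf` threshold is our assumption (the paper's candidate selection is detailed in its Appendix A):

```python
def expand_keywords(ontology, initial, levels, min_conf=0.0):
    """Compute the union E_0 ∪ E_1 ∪ ... ∪ E_l of expansion sets.
    ontology[a][b] holds conf(a => b); neighbours whose confidence
    falls below min_conf are skipped."""
    covered = set(initial)   # E_0 ∪ ... ∪ E_current
    frontier = set(initial)  # the most recent expansion set E_l
    for _ in range(levels):
        nxt = set()
        for kw in frontier:
            for nb, conf in ontology.get(kw, {}).items():
                if nb not in covered and conf >= min_conf:
                    nxt.add(nb)
        covered |= nxt
        frontier = nxt
    return covered
```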
Figure 5 shows three sets of example background scenes discovered for a target image (i.e., the fish image in the top-left corner of each set). The background scenes in the first set (left) are discovered by searching the image repository randomly. Those in the last set (right) are discovered using the initial set of keywords $E_0$. Those in the second set (middle) are discovered using an expanded set of keywords after the 4th expansion. While it is not necessary for every testing image in an invariance test to be realistic, the plausibility of a testing image reflects its probability of being captured in the real world. As discussed in ml4ml_invariance, invariance testing unavoidably involves testing images of different plausibility, and it is therefore important to convey and evaluate the testing results together with this plausibility information. An ideal set of background scenes should have a balanced distribution of scenes of different plausibility. Qualitatively, we can observe in Figure 5 that the random set has too many highly implausible images; the closest set has images biased towards the keywords painting, water, and tree, many of which are not quite plausible; while the expanded set has a better balance between more plausible and less plausible background scenes. In Appendix B, we measure plausibility quantitatively using semantic distance, and we show more details of the keyword expansion using the ontology and the candidate selection process in Appendix A.
Testing Images from Background Replacement.
The transformation process for sampling background variations is more complex than for other commonly-examined invariance qualities. It requires a segmentation tool to separate the foreground object $x_i$ from each target image $I_i$, and then superimposes $x_i$ onto the individual background scenes discovered in the previous step. As defined in Eq. 1, for different background scenes $B_j$, the transformation process produces testing images $I_{i,j} = x_i \oplus B_j$.
When a background scene $B_j$ also contains one or more objects of the same class label as the target object $x_i$, two problems arise. (i) If $x_i$ is superimposed onto $B_j$ without removing those similar objects, the validity of the ML testing is undermined, because when an ML model returns the label of $x_i$, it is unknown whether the model has recognized the superimposed $x_i$ or the similar objects in $B_j$. (ii) If those similar objects are removed, the resulting image would have holes that may not be covered by the superimposed $x_i$. For these two reasons, we filter out any background scene whose keyword set contains the keyword of $x_i$.
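The replacement step can be sketched as mask-based compositing plus the keyword filter described above. The segmentation mask is assumed to come from an external segmentation tool, and the function names are illustrative rather than from the paper:

```python
import numpy as np

def composite(foreground, mask, background):
    """Superimpose a segmented foreground onto a background scene.
    foreground, background: HxWx3 arrays; mask: HxW with 1 marking
    foreground pixels (plain alpha compositing)."""
    m = mask[..., None].astype(float)
    out = m * foreground + (1.0 - m) * background
    return out.astype(foreground.dtype)

def usable_backgrounds(scenes, scene_keywords, fg_keyword):
    """Filter out background scenes whose keyword set contains the
    foreground class keyword, to avoid duplicate target objects."""
    return [s for s, kws in zip(scenes, scene_keywords)
            if fg_keyword not in kws]
```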
Point Clouds from ML Testing.
Unlike the previous framework, where the transformation process samples variations uniformly (e.g., over rotation angles), the testing images here are a set of non-uniform sampling points. When we test an ML model $M$ against the testing images, we can measure the results and intermediate results of $M$ in many different ways. In ml4ml_invariance, the different measurements are controlled by the notions of position and modality. The position indicates where in $M$ the signal is captured, e.g., Max@CONF (the confidence vector of the final predictions) and Max@CONV-1 (the feature map of the last convolutional layer). The modality indicates what mathematical function is used to abstract the signal vector or map at a position to a numerical measure, e.g., Mean or Max. Hence, within the context of an ML model $M$, a fixed position $p$, and a fixed modality $m$, ML testing yields a numerical measure $v_{i,j}$ for each testing image $I_{i,j}$.
In addition, we can measure the semantic distance between each testing image $I_{i,j}$ and the target image $I_i$. As shown in Eq. 2, the confidence concept in association analysis is non-commutative. We therefore always measure the semantic distance starting from $I_i$. This places the value $v_{i,j}$ at a position $d_{i,j}$ away from $I_i$.
Consider two different testing images $I_{i,j}$ and $I_{i,k}$ and their corresponding semantic distances to $I_i$, namely $d_{i,j}$ and $d_{i,k}$. The difference between their numerical measures,
$\delta_{j,k} = |v_{i,j} - v_{i,k}| \quad (4)$
indicates the variation between the two testing results. As this variation corresponds to positions $d_{i,j}$ and $d_{i,k}$, it gives us a 2D data point at coordinates $(d_{i,j}, d_{i,k})$ with data value $\delta_{j,k}$. When we consider the testing results for all testing images $I_{i,j}$ as well as $I_i$ itself, we obtain a point cloud of data points in the context of $M$, $p$, and $m$. In Appendix B, we give details of the semantic distances obtained using the ontology.
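Assembling the point cloud for one target image is then a pairwise construction. A sketch, assuming the measures and semantic distances have already been computed (with the target image itself included at distance 0):

```python
from itertools import combinations

def point_cloud(values, distances):
    """For each pair of testing images j, k with measures v_j, v_k and
    semantic distances d_j, d_k to the target image, emit the 2D data
    point (d_j, d_k) with value |v_j - v_k| (absolute difference as
    the operator in Eq. 4)."""
    return [(distances[j], distances[k], abs(values[j] - values[k]))
            for j, k in combinations(range(len(values)), 2)]
```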
When we combine the testing results for all target images, we have a point cloud that can be visualized as a scatter plot. The first column in Figure 6 shows five examples of such point clouds. Because the number and the distribution of these points depend on the set of background scenes, we can observe that when the level of expansion OL (see Eq. 3) increases, the sampling has more data points and a better distribution.
Variance Matrices from RBF-based Resampling.
Because the sampling of background transformations is not uniform (i.e., Challenge (2) in Section 3), unlike the previous framework in Figure 2, we have to choose between training an ML4ML assessor with point clouds or converting the point clouds to variance matrices. We select the latter option primarily because, in the previous framework, ML4ML assessors are trained with ML experts’ annotations of invariance quality based on their observation of variance matrices. Replacing variance matrices with scatter plots in the annotation process would introduce an inconsistency into the framework in general and the annotation in particular. In the short term this would not be desirable, but in the longer term one should not rule out the possibility of training ML4ML assessors with scatter plots.
We use the common approach of radial basis functions (RBFs) to transform a point cloud into a variance matrix. For each element $e$ in a variance matrix, an RBF defines a circular area in 2D, facilitating the identification of all data points in the circle. Let these data points be $q_1, q_2, \ldots, q_s$ and their corresponding values be $\delta_1, \delta_2, \ldots, \delta_s$. As discussed earlier, the coordinates of each data point are determined by the semantic distances from the target image to two testing images. A Gaussian kernel
$\varphi(r) = e^{-(\varepsilon r)^2}$
is then applied to these data points, producing an interpolated value for element $e$ as
$v(e) = \sum_{t=1}^{s} \varphi(\|e - q_t\|) \, \delta_t \,\Big/\, \sum_{t=1}^{s} \varphi(\|e - q_t\|)$
However, when the RBF has a large radius, the computation can be costly, and when the radius is small, some circles may contain no data point at all. In order to apply the same radius consistently, we define a new data point at each element $e$ and use the $k$-nearest-neighbours algorithm to obtain its value. The interpolation function thus becomes
$v(e) = \sum_{q_t \in \mathrm{kNN}(e)} \varphi(\|e - q_t\|) \, \delta_t \,\Big/\, \sum_{q_t \in \mathrm{kNN}(e)} \varphi(\|e - q_t\|)$
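The kNN-constrained resampling can be sketched as follows. The rescaling of semantic distances to matrix coordinates and the normalised weighted average are our assumptions, and the default parameter names simply mirror the settings reported later:

```python
import numpy as np

def variance_matrix(points, size=32, eps=10.0, k=32):
    """Resample a non-uniform point cloud of (x, y, delta) triples onto
    a size x size variance matrix. Each matrix element takes a
    Gaussian-weighted average over its k nearest data points."""
    pts = np.asarray(points, dtype=float)
    xy, vals = pts[:, :2], pts[:, 2]
    # map semantic-distance coordinates into [0, size-1]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    xy = (xy - lo) / np.maximum(hi - lo, 1e-12) * (size - 1)
    k = min(k, len(vals))
    mat = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            d = np.hypot(xy[:, 0] - i, xy[:, 1] - j)
            nn = np.argsort(d)[:k]                  # k nearest data points
            w = np.exp(-(eps * d[nn] / size) ** 2)  # Gaussian kernel
            mat[i, j] = (w * vals[nn]).sum() / max(w.sum(), 1e-12)
    return mat
```

Restricting each element to its k nearest neighbours bounds the cost per element and guarantees a non-empty neighbourhood even in sparse regions of the point cloud.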
In Figure 6, we show the application of three different RBFs. The mixed green and yellow patterns in the rows for low expansion levels gradually become more coherent as OL increases: we can clearly see a green square at the centre and yellow areas towards the top and right edges.
ML4ML Assessor Training and Deployment.
With the variance matrices, we can train ML4ML assessors and deploy them to evaluate the background invariance quality of ML models in the same way as the previous framework. The same technical approaches in ml4ml_invariance can be adopted, including: (i) collecting variance matrices and ML models, (ii) splitting the model repository into training and testing sets and providing expert annotations of invariance quality based on the variance matrices, (iii) engineering imagery features for the variance matrices, (iv) training ML4ML assessors using different ML techniques, and (v) testing and comparing the ML4ML assessors.
Testing Image Generation.
We use the BG-20k bg20k database with 20,000 background images as the candidate background scenes. We train a small repository of 250 models on the IN9 (smaller ImageNet) database bgchallenge to align with previous attempts at background invariance testing (with randomly selected backgrounds).
For each model, we measure the signals at two positions, namely the final predictions and the last convolutional layer (or the last layer before the final MLP head for Vision Transformers), as these two positions are considered important and interesting by our annotators. We use a single fixed modality, and subtraction as the operator in Eq. 4. As a result, for each model, we obtain two original scatter plots (point clouds).
For RBF-based interpolation, we use the following parameters: $r = 32$, $\varepsilon = 10$, $k = 32$. For each model, we therefore obtain two variance matrices. For other settings of the interpolation, we refer interested readers to Appendix B.
Training ML4ML Assessors.
We build a small model repository of 250 models for object classification. The models were trained under different settings:
Architectures: VGG13bn, VGG13, VGG11bn, VGG11 vggnet, ResNet18 resnet, and Vision Transformer vit
Hyper-parameters: learning rate, batch size, epochs
Augmentation: rotation, brightness, scaling, using images with only foreground (black pixels as background)
Optimizers: SGD, Adam, RMSprop
Loss: cross-entropy loss, triplet loss, adversarial loss
We use our sub-workflow in Figure 2 to generate testing images using the target images in the IN9 dataset and the background scenes in the BG-20k database. We apply these models to the generated testing images. With the two measuring positions per model, we generate two variance matrices using the results from the ML Testing process.
For each model, we then provide professional annotations based on the original scatter plots and interpolated variance matrices, with three quality levels, namely, 1) not invariant, 2) borderline, and 3) invariant. In Appendix C, we show the statistics of the model repository, a small questionnaire, and more detail on the annotation process.
Finally, to automate the testing process, we extract the same set of statistical features as ml4ml_invariance, e.g., mean and standard deviation, from the interpolated variance matrices. We then use these statistical features and the professional annotations of the training set of the model repository to train an ML4ML assessor, e.g., a random forest or AdaBoost in our case.
To evaluate the feasibility of the automation process using ML4ML assessors, we split the model repository into a training set (2/3 of the models) and a testing set (1/3 of the models). We do not tune any hyper-parameters for the ML4ML assessors, and therefore do not further split the training set into training and validation sets. To make the results more statistically reliable, we randomly split the data, repeat the experiments ten times, and report the averaged results and the standard deviation.
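The evaluation protocol above can be sketched as follows. The feature set here (mean, standard deviation, max, min) is a simplified stand-in for the features used in ml4ml_invariance, and any classifier (e.g., a scikit-learn random forest) can be plugged in through the `fit_predict` callback, which is our own abstraction:

```python
import numpy as np

def matrix_features(vm):
    """Simple statistical features extracted from a variance matrix."""
    vm = np.asarray(vm, dtype=float)
    return np.array([vm.mean(), vm.std(), vm.max(), vm.min()])

def repeated_split_eval(features, labels, fit_predict, repeats=10, seed=0):
    """Repeatedly split 2/3 train / 1/3 test and report the mean and
    standard deviation of accuracy. fit_predict(Xtr, ytr, Xte) must
    return predicted labels for Xte."""
    rng = np.random.default_rng(seed)
    n, accs = len(labels), []
    for _ in range(repeats):
        idx = rng.permutation(n)
        cut = 2 * n // 3
        tr, te = idx[:cut], idx[cut:]
        pred = fit_predict(features[tr], labels[tr], features[te])
        accs.append(np.mean(pred == labels[te]))
    return float(np.mean(accs)), float(np.std(accs))
```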
Method          Automation Accuracy    IRR Score
Random Forest   79.7 ± 7.5%            0.649 ± 0.091
AdaBoost        74.8 ± 9.1%            0.599 ± 0.102
Results and Analysis
In Table 2, we show that the majority votes of the professional annotations have an IRR score of around 0.4 with respect to worst-case accuracy, which shows that the annotations are still aligned with the traditional accuracy metric. However, the professional labels are provided by considering the variance matrices, and they reflect decisions based on many different factors instead of relying on one single metric. Furthermore, Table 1 shows that the inter-rater reliability scores among the three professionals are around 0.65, which indicates that the annotations are consistent compared with many NLP tasks reported by irr. Therefore, we believe that such annotations are both desirable and reliable.
In Table 2, we also show that, using ML4ML assessors ml4ml_invariance, we achieve around 70-80% automation accuracy, which shows that the automation method can reach a satisfactory level of accuracy (~80%). Furthermore, the IRR scores between the predictions of the ML4ML assessors and the majority votes are similar to those among the three coders (~0.6). This shows that the proposed framework can work as a fully automated background invariance testing mechanism with sufficient accuracy. For more studies and details on the ML4ML assessors, we refer interested readers to Appendix C.
In this work, we propose a technical solution to address a major limitation of an invariance testing framework recently reported by ml4ml_invariance. The limitation is that background invariance testing could not be incorporated into the framework in the same way as many other invariance problems. Our technical solution brings several non-trivial techniques together to overcome the three challenges in Section 3. The introduction of an ontology is both novel and vital in making background invariance tests as meaningful as other commonly-seen invariance tests, e.g., rotation invariance. In this way, the previous framework has been expanded and improved significantly, paving the critical path for introducing other invariance tests with transformations that are not uniformly sampled, e.g., variations of clothing or hairstyles.