
Sill-Net: Feature Augmentation with Separated Illumination Representation

by   Haipeng Zhang, et al.

For visual object recognition tasks, the illumination variations can cause distinct changes in object appearance and thus confuse the deep neural network based recognition models. Especially for some rare illumination conditions, collecting sufficient training samples could be time-consuming and expensive. To solve this problem, in this paper we propose a novel neural network architecture called Separating-Illumination Network (Sill-Net). Sill-Net learns to separate illumination features from images, and then during training we augment training samples with these separated illumination features in the feature space. Experimental results demonstrate that our approach outperforms current state-of-the-art methods in several object classification benchmarks.


1 Introduction

Although deep neural network based models have achieved remarkable successes in various computer vision tasks (Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Russakovsky et al., 2015; He et al., 2016), vast amounts of annotated training data are usually required for a superior performance in many visual tasks. For the object classification task, the requirement for a large training set could be partially explained by the fact that many latent variables (e.g., positions/postures of the objects, the brightness/contrast of the image, and the illumination conditions) can cause significant changes in the appearance of objects. Although collecting a large training set to cover all possible values of these latent variables could improve the recognition performance, for rare latent values such as extreme illumination conditions it could be prohibitively time-consuming and expensive to collect enough training images.

Figure 1: Illustration of the key idea of our approach. The semantic and illumination representations are separated from the training image (mandatory straight). The illumination representation is used to augment the support sample (deer crossing).

In this paper we restrict our attention to illumination conditions. For many real-world computer vision applications (e.g., autonomous driving and video surveillance) it is essential to recognize the objects under extreme illumination conditions such as backlighting, overexposure and other complicated cast shadows. Thus, we reckon it is desirable to improve recognition models’ generalization ability under different illumination conditions in order to deploy robust models in real-world applications.

We propose a novel neural network architecture named Separating-Illumination Network (Sill-Net) to deal with such problems. The key idea of our approach is to separate the illumination features from the semantic features in images, and then augment the separated illumination features onto other training samples (hereinafter we name these samples as “support samples”) to construct a more extensive feature set for subsequent training (see Figure 1). Specifically, our approach consists of three steps. In the first step, we separate the illumination and semantic features for all images in the existing dataset via a disentanglement method, and use the separated illumination features to build an illumination repository. Then, we transplant the illumination repository to the support samples to construct an augmented training set and use it to train a recognition model. Finally, test images are fed into the trained model for classification. Our proposed approach could improve the robustness to illumination conditions since the support samples used for training are blended with many different illumination features. Thus after training, the obtained model would naturally generalize better under various illumination conditions.

Figure 2: Illustration of the architecture of Sill-Net. Sill-Net consists of three main modules: the separation module, the matching and reconstruction module, and the augmentation module. The semantic and illumination features are separated by the exchange mechanism in the first module. The semantic features are constrained to be informative by the matching and reconstruction module. The illumination features are stored in a repository. In the augmentation module, we use the illumination features in the repository to augment the support samples (e.g., template images) for training a generalizable model.

Our contributions are summarized as follows:

1) We develop an algorithm to separate the illumination features from semantic features for natural images. The separated illumination features can be used to construct an illumination feature repository.

2) We propose an augmentation method to blend support samples with the illumination feature repository, which could effortlessly enhance the illumination variety of the training set and thus improve the robustness to illumination conditions of the trained deep model.

3) We evaluate Sill-Net on several object classification benchmarks, i.e., two traffic sign datasets (GTSRB and TT100K) and three logo datasets (BelgaLogos, FlickrLogos-32, and TopLogo-10). Sill-Net outperforms the state-of-the-art (SOTA) methods by a large margin.

2 Proposed Method

In this section, we introduce our Separating-Illumination Network (Sill-Net). Sill-Net first learns to separate the semantic and illumination features of training images. Then the illumination features are blended with the semantic feature of each support sample to construct an augmented feature set. Finally, we train again on the illumination-augmented feature set for classification. The architecture of our method is illustrated in Figure 2.

Sill-Net mainly consists of the following modules: the separation module, the matching and reconstruction module, and the augmentation module. In detail, we implement the method in three steps:

1) The separation module is trained to separate the features into semantic parts and illumination parts for all training images. The matching and reconstruction module promotes the learning of better semantic feature representation. The learned illumination features are stored into an illumination repository. The details are illustrated in Section 2.1.

2) The semantic feature of each support image is combined with all illumination features in the repository to build an augmented feature set to train the classifier. The augmentation module is described in Section 2.2.


3) Test images are input into the well-trained model to be predicted in an end-to-end manner, referring to Section 2.3.

This approach assumes that the illumination distribution learned from training data is similar to that of test data. Thus the illumination features can be used as feature augmentation for sufficient training.

We can choose different support samples in different visual tasks. For instance, in conventional classification tasks, we use the real training images as support samples; in one-shot classification tasks, we construct the support set with template images (i.e., graphic symbols visually and abstractly representing semantic information).

2.1 Separate semantic and illumination features

Let represent the labeled dataset of training classes with images, where denotes the -th training image, is the one-hot label, and denotes the corresponding template image (or any image of the object without much deformation).

A feature extractor denoted by learns the separated features from images , where can be split along channels: . Here, is called the semantic feature, which represents the consistent information of the same category, while is called the illumination feature.
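As a minimal sketch of the channel split, assume features are stored channel-first as nested lists and that the channels are divided into two equal halves, semantic first. The function name `split_feature` and the toy shapes are illustrative and not taken from the paper.

```python
def split_feature(feature):
    """Split a channel-first feature map into semantic and illumination halves.

    `feature` is a list of channels; each channel is a 2D list (H x W).
    Assumption: the first half of the channels is semantic, the second half
    is illumination.
    """
    num_channels = len(feature)
    if num_channels % 2 != 0:
        raise ValueError("expected an even number of channels")
    half = num_channels // 2
    return feature[:half], feature[half:]


# Toy feature with 4 channels of size 2 x 2; channel c is filled with the value c.
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
z_sem, z_illu = split_feature(feature)
```

Because the split is a pure slicing operation, the spatial layout of every channel is preserved, which is what the matching step later relies on.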

Here we specify what illumination represents in our paper. In the narrow sense, illumination is one of the environmental factors that change the appearance of an object without changing its label. We use the term illumination features for the features that relate to any such environmental factor and are not category-specific. Technically, we divide the object feature into channels: one half, which determines the category label, is defined as the semantic feature, and the other half, which is unrelated to the category label, is defined as the illumination feature. Thus, the following three conditions should be satisfied:

1) The semantic feature is informative to reconstruct the corresponding template image.

2) The semantic feature can predict the label while the illumination feature cannot.

3) The illumination feature should not contain the semantic information.

To satisfy the above conditions, we build the following modules.

Matching and reconstruction module. We first construct a matching module (as shown in Figure 2) to make the semantic feature informative as required by the first condition. Since we design the extractor without downsampling operations to maintain the spatial information of the object, the semantic feature of one real image should be similar to that of its corresponding template image. However, the real image is usually deformed compared to the regular template image. Therefore, we use a spatial transformer (Jaderberg et al., 2015) to correct the deformation. We constrain the transformed semantic feature to be consistent with the template feature by the mean square error (MSE), that is:
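One form consistent with this description is sketched below. The notation is assumed here rather than taken from the paper: $z^{\mathrm{sem}}_{i}$ is the semantic feature of the $i$-th training image, $z^{\mathrm{sem}}_{t_i}$ is the semantic feature of its template, $T$ is the spatial transformer, and $N$ is the number of training images.

```latex
\mathcal{L}_{\mathrm{match}}
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\| T\!\left(z^{\mathrm{sem}}_{i}\right) - z^{\mathrm{sem}}_{t_i} \right\|_2^2
```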


Besides, we design a reconstructor (as shown in Figure 2) to retrieve the template image from the semantic feature to ensure that it is informative enough. We constrain the reconstructed template image by binary cross-entropy (BCE) loss:


where represents the -th pixel of the -th template image . Since the template images are composed of primary colors within the range of , binary cross-entropy (BCE) loss is sufficiently efficient for the retrieval (Kim et al., 2019). So far, the semantic feature is constrained to be consistent with the template feature and informative enough to be reconstructed to its template image.
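The two constraints can be illustrated numerically. The sketch below assumes flattened features and pixel values in plain Python lists; the helper names `mse` and `bce`, the clipping constant, and all numbers are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


# Matching: transformed semantic feature of a real image vs. template feature.
transformed_semantic = [0.2, 0.8, 0.5, 0.1]
template_semantic = [0.0, 1.0, 0.5, 0.0]
matching_loss = mse(transformed_semantic, template_semantic)

# Reconstruction: predicted template pixels vs. true template pixels.
reconstructed_pixels = [0.9, 0.1, 0.8]
template_pixels = [1.0, 0.0, 1.0]
reconstruction_loss = bce(reconstructed_pixels, template_pixels)
```

Both losses shrink toward zero as the semantic feature approaches the template feature and the reconstructed pixels approach the template pixels.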

Exchange mechanism. To ensure that the semantic feature can predict the label while the illumination feature cannot, we utilize a feature exchange mechanism inspired by Xiao et al. (2018) to separate the feature. As shown in Figure 3, the semantic feature of one image is blended with the illumination feature of another image to form a new one through feature mixup (Zhang et al., 2017):


where the proportion . As required by the condition, the blended feature retains the same label as the semantic feature. Hence through training, the semantic feature would learn information to predict the label while the illumination feature would not.
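The exchange step can be sketched as follows, assuming the mixup is a convex combination of the semantic feature of one image and the illumination feature of another. The name `mix_features`, the symbol `r` for the proportion, and the value 0.5 are illustrative choices, not values from the paper.

```python
def mix_features(semantic, illumination, r):
    """Blend a semantic feature with an illumination feature by mixup.

    Assumed form: r * semantic + (1 - r) * illumination, with 0 <= r <= 1.
    Both inputs are flat lists of the same length.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if len(semantic) != len(illumination):
        raise ValueError("features must have the same shape")
    return [r * s + (1.0 - r) * l for s, l in zip(semantic, illumination)]


# Image A contributes the semantic part and its label;
# image B contributes only its illumination part.
sem_a, label_a = [1.0, 0.0, 1.0], "mandatory straight"
illu_b = [0.2, 0.4, 0.6]

mixed = mix_features(sem_a, illu_b, r=0.5)  # r = 0.5 is an arbitrary example value
mixed_label = label_a  # the blended feature keeps the label of the semantic feature
```

Since the label always follows the semantic half, any label information stored in the illumination half can only hurt the classifier on exchanged pairs, which is the pressure that drives the separation.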

Figure 3: Illustration of the exchange mechanism. The semantic and illumination features are exchanged between random paired images with labels and . Then we obtain cross combined features labeled the same as the images corresponding to the semantic features. These features are then classified as the specified labels.

We implement the exchange process for random pairs of images, building a new exchanged feature set:


The mixed features are then input into a classifier for label prediction. We denote the distribution of the predicted label given the mixed feature by . Then we minimize the cross-entropy loss:
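A standard cross-entropy loss matching this description can be written as below. The notation is assumed here: $M$ recombined features $\tilde{z}_m$ with one-hot labels $y_m$, $C$ classes, and $p(c \mid \tilde{z}_m)$ the predicted probability of class $c$; averaging over $M$ is also an assumption.

```latex
\mathcal{L}_{\mathrm{cls}}
  = -\frac{1}{M} \sum_{m=1}^{M} \sum_{c=1}^{C}
    y_{m,c} \, \log p\!\left(c \mid \tilde{z}_{m}\right)
```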


where denotes the number of recombined features in the augmented feature set, represents the class number of all images for training and test, and is the -th element of the one-hot label .

The semantic feature retains the information to predict the label after training on the exchanged feature set. Besides, the semantic information in the illumination feature would be reduced, because otherwise it would impair the prediction when it is blended with the other semantic features.

Constraints on illumination features. As required by the third condition, it is essential to impose additional constraints on illumination features to reduce the semantic information. However, it is difficult to find suitable restrictions since the generally used datasets have no illumination labels.

Enlightened by the disentanglement metric proposed in (Suter et al., 2019), we design a constraint on illumination features by negative Post Interventional Disagreement (PIDA). Given a subset including images of the same label , we write the loss as follows:


here, is a proper distance function (e.g., -norm), is the illumination feature of image with the same label , is the expectation, and is the number of images in class .

According to Eq. (6), PIDA quantifies the distances between the illumination feature of each same-labeled image and their expectation when the illumination conditions are changed. In the subset , the semantic information of each image is similar while the illumination information is different. Suppose an undesirable situation that the illumination features capture much semantic information rather than illumination information. The expectation would strengthen the common semantic component and weaken the distinct illumination components, and thus PIDA would be small. It means that the smaller PIDA is, the more semantic information the illumination feature captures compared to illumination information. By maximizing PIDA (i.e., minimizing ), we can effectively reduce the common semantic information remaining in the illumination features.
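The behaviour described above can be checked on toy data. This sketch assumes that the expectation is estimated by the element-wise mean over the same-class subset and that the distance is the Euclidean norm; the function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flat feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pida(illumination_features, distance=l2_distance):
    """Mean distance between each illumination feature and their mean.

    `illumination_features` holds flat feature lists from images of ONE class.
    """
    n = len(illumination_features)
    dim = len(illumination_features[0])
    mean = [sum(f[k] for f in illumination_features) / n for k in range(dim)]
    return sum(distance(mean, f) for f in illumination_features) / n


def illumination_loss(illumination_features):
    """Negative PIDA: minimizing it spreads same-class illumination features."""
    return -pida(illumination_features)


# Identical features, as if they carried only the shared semantics: PIDA is 0.
collapsed = [[1.0, 1.0], [1.0, 1.0]]
# Diverse features, as if they carried distinct illumination: PIDA is larger.
diverse = [[0.0, 0.0], [2.0, 2.0]]
```

In this toy case the collapsed set has zero PIDA and the diverse set has a PIDA of the square root of two, so the negative-PIDA loss is lower for the diverse set, in line with the argument above.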

In summary, the overall loss function in the training phase can be written as:


Through the above training, the model learns to split the features into semantic and illumination features.

2.2 Augment samples by illumination repository

After the first training step, the illumination feature of each image can be separated. These features are collected to construct an illumination repository, expressed as follows:


We then use the illumination features to augment the support samples by a multiple of the repository size . Consider with images of label , here we assume that the template images constitute the support set. We combine all illumination features in the repository with the semantic feature of each template by feature mixup, building an augmented feature set as follows:


where .
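The construction of the augmented feature set can be sketched as a full cross product between template semantic features and repository illumination features. The helper names, the flat-list feature layout, and the mixup proportion `r=0.5` are assumptions made for illustration.

```python
def mix_features(semantic, illumination, r):
    """Assumed mixup form: r * semantic + (1 - r) * illumination."""
    return [r * s + (1.0 - r) * l for s, l in zip(semantic, illumination)]


def augment_support_set(templates, repository, r):
    """Pair every template semantic feature with every illumination feature.

    `templates` is a list of (semantic_feature, label) pairs.
    Returns (mixed_feature, label) pairs; the result has
    len(templates) * len(repository) entries.
    """
    augmented = []
    for semantic, label in templates:
        for illumination in repository:
            mixed = mix_features(semantic, illumination, r)
            augmented.append((mixed, label))
    return augmented


# Toy repository with three illumination features and two template classes.
repository = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
templates = [([1.0, 0.0], "deer crossing"), ([0.0, 1.0], "mandatory straight")]
augmented = augment_support_set(templates, repository, r=0.5)
```

Each support sample is thus multiplied by the repository size, so a single template per class already yields as many training features as there are stored illumination features.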

We train the model again on the feature set . So, if a few support samples are provided, the model can be trained on the augmented feature set blended with real illumination features, making it generalizable to test data. The classification loss of augmented training is expressed as follows:


where denotes the number of all recombined features in the augmented feature set.

Now, the model has been trained to be generalizable for test.

2.3 Inference

The feature extractor and classifier have been fully trained after the first two phases. Given the -th test image, the feature extractor firstly splits the semantic and illumination feature. Subsequently, the features are blended, and then the classifier outputs the category label , formulated as:


The inference is achieved in an end-to-end manner.
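A minimal sketch of this inference path is given below. It assumes a flat feature whose first half is semantic and second half is illumination, reuses the assumed convex mixup, and replaces the trained classifier with a hypothetical two-class linear stand-in (`toy_classifier`).

```python
def predict(image_feature, classifier, r):
    """Split a test feature, re-blend its own two halves, and classify."""
    half = len(image_feature) // 2
    semantic, illumination = image_feature[:half], image_feature[half:]
    blended = [r * s + (1.0 - r) * l for s, l in zip(semantic, illumination)]
    scores = classifier(blended)
    # Return the index of the highest-scoring class.
    return max(range(len(scores)), key=scores.__getitem__)


def toy_classifier(feature):
    """Hypothetical stand-in for the trained classifier: two linear scores."""
    weights = [[1.0, -1.0], [-1.0, 1.0]]
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


# Flat toy feature: first half semantic, second half illumination.
test_feature = [0.9, 0.1, 0.4, 0.4]
predicted_class = predict(test_feature, toy_classifier, r=0.5)
```

Unlike the exchange step in training, the two halves blended here come from the same test image.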

3 Experiments

3.1 Datasets and experimental settings

Datasets. We validate the effectiveness of our method on two traffic sign datasets, GTSRB (Stallkamp et al., 2012) and Tsinghua-Tencent 100K (TT100K) (Zhu et al., 2016), and three logo datasets, BelgaLogos (Joly & Buisson, 2009; Letessier et al., 2012), FlickrLogos-32 (Romberg et al., 2011) and TopLogo-10 (Su et al., 2017), because these datasets cover varied illumination conditions. Table 1 shows the size and number of classes of each dataset (we use the datasets provided in (Kim et al., 2019)). More details about the datasets are described in the supplementary material.

Dataset | GTSRB | TT100K | BelgaLogos | FlickrLogos-32 | TopLogo-10
Size | 51839 | 11988 | 9585 | 3404 | 848
Classes | 43 | 36 | 37 | 32 | 11
Table 1: Dataset specifications.

Evaluation tasks. Generally, we evaluate our model by the following steps. 1) Use the training dataset (or a subset) to separate out the illumination features. 2) Augment the support samples with the illumination features to form an augmented feature set. 3) Train a classifier on the augmented feature set. 4) Predict on the test dataset.

We validate our model on the following classification tasks.

1) One-shot classification

In this type of task, the training phase requires no real images of the test classes but one template image for each category. This task is similar to the one-shot classification.

We set up two scenarios for traffic sign classification. In the first scenario, we split GTSRB into a training subset with 22 classes and a test subset with the other 21 classes, where the template images constitute the support set. In the second scenario, we train on GTSRB and test on TT100K for cross-dataset evaluation. We exclude four common classes shared by GTSRB and TT100K in the test set. For convenience, we denote the first scenario by GTSRB→GTSRB and the second by GTSRB→TT100K, where the training set is on the left side of the arrow while the test set is on the right.

For logo classification, we use the largest logo dataset, BelgaLogos, as the training set and the remaining two as test sets, denoted by Belga→Flickr32 and Belga→Toplogos, respectively. As above, we remove four classes in FlickrLogos-32 and five in TopLogo-10 that are shared with BelgaLogos.

2) Cross-domain one-shot classification

To further validate the generalization of our method, we perform a cross-domain one-shot evaluation with two more experiments, where the model is trained on traffic sign datasets and tested on logo datasets. Specifically, we train the model on GTSRB and test on FlickrLogos-32 and TopLogo-10. We denote these two scenarios as GTSRB→Flickr32 and GTSRB→Toplogos. The setup is more challenging than the previous scenarios, since we train the model in the domain of traffic sign datasets while we test the model in an entirely different domain of logo datasets.

Architecture and parameter settings. We construct the extractor from six convolution layers to separate the semantic and illumination features (see the separation module in Figure 2). The reconstructor is built with the layers of the extractor in reverse order. The classifiers (see Figure 2 and Figure 3) are built with six convolution layers and three pooling layers. Due to space limitations, more details of the architecture are described in the supplementary material.

The networks are trained using the ADAM optimizer with learning rate , and . The mixup proportion is set to throughout the experiments. Limited by graphics card memory, we choose the mini-batch size of 16, which can be larger if conditions permit.

The matching and reconstruction loss functions are weighted by proportionality coefficients for optimal results. The weighted overall loss function is expressed as follows:
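A weighted sum consistent with this description is sketched below. The symbols are assumed here rather than taken from the paper: $\lambda_{\mathrm{match}}$ and $\lambda_{\mathrm{recon}}$ are the two proportionality coefficients, while the classification and illumination terms are left unweighted.

```latex
\mathcal{L}
  = \lambda_{\mathrm{match}} \, \mathcal{L}_{\mathrm{match}}
  + \lambda_{\mathrm{recon}} \, \mathcal{L}_{\mathrm{recon}}
  + \mathcal{L}_{\mathrm{cls}}
  + \mathcal{L}_{\mathrm{illu}}
```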


We can choose and in the range of . When they are too large, the model tends to learn false features with values close to zero. While when they are too small, the model is not able to learn informative semantic features. In our method, is set as and is set as .

Template image processing. Previous studies (Tabelini et al., 2020) have shown that basic image processing on template images (as support samples) helps the network’s generalization. In our experiment, we diversify the template images themselves using the following methods: geometric transformations, image enhancement (including brightness, color, contrast and sharpness adjustment), and blur. The template images are diversified and thus allow the model to learn more generalizable features. We observe that basic processing on template images improves model performance.
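The kind of basic processing listed above can be sketched on a tiny grayscale template. The pixel range [0, 1], the parameter ranges, and the specific operations (brightness scaling, contrast stretching, and a one-pixel horizontal shift as a simple geometric transform) are assumptions for illustration; the paper does not specify them here.

```python
import random


def clip01(value):
    """Clip a pixel value to the range [0, 1]."""
    return min(1.0, max(0.0, value))


def adjust_brightness(image, factor):
    """Scale every pixel by `factor` and clip to [0, 1]."""
    return [[clip01(p * factor) for p in row] for row in image]


def adjust_contrast(image, factor):
    """Stretch pixels around the image mean by `factor` and clip to [0, 1]."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[clip01(mean + factor * (p - mean)) for p in row] for row in image]


def horizontal_shift(image, offset):
    """Shift columns right by `offset` pixels, padding with zeros."""
    width = len(image[0])
    return [
        [row[c - offset] if 0 <= c - offset < width else 0.0 for c in range(width)]
        for row in image
    ]


def diversify(template, rng):
    """Apply one random brightness, contrast, and shift variation."""
    out = adjust_brightness(template, rng.uniform(0.7, 1.3))
    out = adjust_contrast(out, rng.uniform(0.7, 1.3))
    return horizontal_shift(out, rng.choice([-1, 0, 1]))


# A 3 x 3 cross-shaped toy template and four seeded variants of it.
template = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
variants = [diversify(template, random.Random(seed)) for seed in range(4)]
```

A shift is used instead of a mirror flip because flipping can change the meaning of a directional traffic sign.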

3.2 One-shot classification

We compare our method with Quadruplet networks (Kim et al., 2017) (QuadNet) and Variational Prototyping-Encoder (Kim et al., 2019) (VPE) for one-shot classification, with results reported in Tables 2 and 3. We quote the accuracies of the compared methods under their optimal settings; that is, VPE is implemented with augmentation and a spatial transformer (the VPE+aug+stn version) and QuadNet is implemented without augmentation. As shown in the tables, our method outperforms the compared methods in all scenarios.

Method | GTSRB→GTSRB | GTSRB→TT100K
No. support set | (22+21)-way | 36-way
QuadNet | 45.2 | N/A
VPE | 83.79 | 71.80
Sill-Net | 97.60 | 95.59
Sill-Net w/o aug | 46.25 | 45.94
Table 2: One-shot classification accuracy (%) on traffic sign datasets. The results of other methods are cited from (Kim et al., 2017, 2019). The best results are marked in blue.

In traffic sign classification, Sill-Net outperforms the second-best method, VPE, by large margins of 13.81 (accuracy improved from 83.79 to 97.60) and 23.79 (accuracy improved from 71.80 to 95.59) in the two scenarios, respectively (see Table 2). This indicates that training on features augmented with illumination information does help real-world classification, even though only one template image is provided. Notably, in the cross-dataset scenario GTSRB→TT100K, Sill-Net achieves performance comparable to that in the intra-dataset scenario GTSRB→GTSRB, while VPE performs much worse in the cross-dataset scenario. We surmise that this is because VPE learns latent embeddings generalizable to test classes in the same domain (GTSRB), but the generalization might be discounted when the target domain is slightly shifted (from GTSRB to TT100K). We observe that the illumination conditions in GTSRB are quite similar to those in TT100K; therefore, Sill-Net shows better generalization performance by making full use of the illumination information in GTSRB.

Method | Belga→Flickr32 | Belga→Toplogos
No. support set | 32-way | 11-way
QuadNet | 37.72 | 36.62
VPE | 53.53 | 57.75
Sill-Net | 65.21 | 84.43
Sill-Net w/o aug | 52.38 | 47.95
Table 3: One-shot classification accuracy (%) on logo datasets. The results of other methods are cited from (Kim et al., 2017, 2019). The best results are marked in blue.

In logo classification, Sill-Net improves the performance by 11.68 (from 53.53 to 65.21) and 26.68 (from 57.75 to 84.43) compared to VPE in the two scenarios, respectively (see Table 3). The improvement of accuracies in logo classification is not comparable to that in traffic classification, which might be due to the lower quality of the training logo dataset. GTSRB is the largest dataset and covers various illumination conditions. Moreover, its traffic signs are always complete and well localized in the images, so illumination features can be separated more easily. In contrast, the separation is harder for the logo datasets due to incomplete logos, color variations, and non-rigid deformation (e.g., logos on bottles).

We compare our method to the ordinary convolutional model consisting of a feature extractor and a classifier without feature augmentation, denoted as Sill-Net w/o aug (see the last row of Tables 2 and 3). For a fair comparison, the feature extractor and classifier share the same number of convolutional layers with Sill-Net. We train it on a synthetic dataset composed of the template images after basic processing (i.e., geometric transformations, image enhancement, and blur). The number of training samples is set to be the same as other methods. The results are reported as a reference to show how the Sill-Net performs without illumination feature augmentation. The unsatisfactory results show that the illumination feature augmentation does enhance the recognition ability of the model in one-shot classification.

3.3 Cross-domain one-shot classification

Sill-Net achieves the best results among all methods in cross-domain one-shot classification tasks, as shown in Table 4. It outperforms VPE by large margins of 23.63 (69.75 compared to 46.12) in GTSRB→Flickr32 and 39.86 (69.46 compared to 29.60) in GTSRB→Toplogos.

Method | GTSRB→Flickr32 | GTSRB→Toplogos
No. support set | 32-way | 11-way
QuadNet | 28.41 | 25.38
VPE | 46.12 | 29.60
Sill-Net | 69.75 | 69.46
Sill-Net w/o aug | 53.94 | 47.54
Table 4: Cross-domain one-shot classification accuracy (%). The models are trained on the traffic sign dataset (GTSRB) and tested on the logo datasets. The best results are marked in blue.

The results illustrate that our method is still generalizable when the domain is transferred from traffic signs to logos. The unsatisfactory results of VPE are predictable. VPE learns a generalizable similarity embedding space of the semantic information among the same or similar domain (i.e., from traffic signs to traffic signs or from logos to logos). However, the embeddings learned from traffic signs are difficult to generalize to logos. In contrast, our method learns well-separated semantic and illumination representations and augments the illumination features to the template images from novel domains to generalize the model.

3.4 Ablation study

In this section, we delve into the contribution of each component of our method. The components under evaluation include the exchange mechanism, the matching and reconstruction module, the illumination constraint, and template image processing, as shown in Table 5. We disable one component at a time and then record the performance to assess its importance. The experiments are implemented in the one-shot classification scenario GTSRB→GTSRB.

The results demonstrate that the exchange mechanism and matching module are the core components of our method. The accuracy of the model drops to 48.10 without the exchange mechanism. This is because the semantic and illumination features cannot be well separated without the exchange mechanism. The semantic information remaining in the illumination features is useless, and may even interfere with recognition when these features are combined with the semantic features of other objects during feature augmentation, hurting the performance of the model.

Meanwhile, the matching module, in cooperation with the separation module, can further separate the semantic and illumination features. The matching module corrects the deformation of the object features. It retains the concrete semantic information (e.g., the outline of the object and semantic details of the object contents) under the supervision of template images. Without the matching module, the semantic features would not be informative enough, so the separation module would have difficulty separating the illumination features from the semantic features. Therefore, the accuracy of the model drops to 54.27 when the matching module is removed.

Factor | Accuracy (decrement)
w/o exchange mechanism | 48.10 (-49.50)
w/o matching module | 54.27 (-43.33)
w/o reconstruction module | 80.74 (-16.86)
w/o illumination constraint | 90.73 (-6.87)
w/o template processing | 80.19 (-17.41)
full method | 97.60
Table 5: Ablation study results (%) in the one-shot classification scenario GTSRB→GTSRB. We disable one component at a time and record the performance of Sill-Net.
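The decrements in Table 5 follow directly from the reported accuracies. The short check below recomputes them with exact decimal arithmetic and ranks the components by the size of the drop; only the variable names are introduced here.

```python
from decimal import Decimal

# Accuracies (%) as reported in Table 5.
full_accuracy = Decimal("97.60")
ablations = {
    "exchange mechanism": Decimal("48.10"),
    "matching module": Decimal("54.27"),
    "reconstruction module": Decimal("80.74"),
    "illumination constraint": Decimal("90.73"),
    "template processing": Decimal("80.19"),
}

# Drop in accuracy when each component is disabled.
decrements = {name: full_accuracy - acc for name, acc in ablations.items()}

# Components ordered from largest to smallest drop.
ranking = sorted(decrements, key=decrements.get, reverse=True)
for name in ranking:
    print(f"w/o {name}: -{decrements[name]}")
```

The ranking puts the exchange mechanism and the matching module first, matching the statement that they are the core components.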

The accuracy of the model decreases by 16.86 without the reconstruction module. The reconstruction module also helps make the semantic features more informative. The matching module helps the model capture some level of the concrete semantic information, while the reconstruction module prompts it to retain finer details of the object.

The illumination constraint increases the model performance by 6.87. Intuitively, the constraint reduces the semantic information in illumination features and thus enhances their quality. Higher-quality illumination representation can improve our feature augmentation method’s effectiveness, which is consistent with the results.

Furthermore, template image processing improves model performance as expected. The processing methods (i.e., geometric transformations, image enhancement, and blur as introduced before) diversify the template images so that the trained model is more generalizable. Under the combined effect of the proposed illumination augmentation in the feature space and the variation of template images, the full model achieves the best results among the existing methods.

3.5 Visualization of features

Figure 4: Visualization of the separated features and the reconstructed template images from training and test classes. The first two rows show the input images and their corresponding template images. The third and fourth rows show the semantic and illumination features of the input images separated by our model. The last row shows the template images reconstructed from the semantic features. More visualization results are shown in the supplementary material.

Figure 4 shows the separated semantic and illumination features of the images from training and test classes in GTSRB, visualized in the third and fourth rows. Note that the training and test datasets share no common classes. As shown in the figure, the semantic features faithfully retain information consistent with the template images for both training and test classes. This is due to three factors. First, the extractor maintains the size and spatial information of the features. Second, although objects in the input images vary in size and position, the features are corrected to the canonical layout of the template images via the spatial transformer in the matching module. Third, the reconstruction module encourages the semantic feature to retain the details of the objects.

In contrast, the semantic information is effectively reduced in illumination features. These features reflect the illumination conditions in the original images to a certain extent. Intuitively, the pink parts in the features represent the bright illumination while the green parts represent the dark illumination. Such well-separated representations lay the foundation for the good performance of our model.

3.6 Template image reconstruction

While the reconstructor serves to obtain informative semantic features during training, it can also retrieve the template images in the inference phase. As shown in the last row of Figure 4, the reconstructor robustly generates the template images of both the training and test samples, regardless of illumination variance, object deformation, blur, and low resolution of the images. Not only the outlines of the symbol contents but also the fine details are well restored in the generated template images, improving on the reconstruction results of VPE. Our results further demonstrate that the proposed model has learned good representations of semantic information for both classification and reconstruction.

4 Discussions

So far, our studies have validated the feasibility and effectiveness of illumination-based feature augmentation. The idea of learning good semantic and illumination features before training a classifier is consistent with the thinking of decoupling representation and classifier (Zhang & Yao, 2020). Compared to the existing approaches (Kim et al., 2017, 2019), our method not only achieves the best results on the traffic sign and logo classifications, but also learns intuitively interpretable semantic and illumination representation and performs better reconstructions.

Our method can be widely applied to a series of training scenarios. In the case that the training samples with certain illumination conditions are limited in the dataset, we can augment these samples with that type of illumination features separated from other images (or simply use the illumination features in our repository). Besides, we can utilize the method to expand a few support samples or even only one (e.g., template images) to form a large training dataset, solving the problem of lacking annotated real data. Overall, the imbalance both in size and illumination conditions of the dataset could be alleviated since we can transplant illumination information to specific training samples with a limited number and illumination diversity.

One question is why we do not classify a test sample directly by its semantic feature after the separation. We have to train the classifier with the augmented samples because there are generally few support samples in few-shot or one-shot scenarios. If we trained the classifier with only a few support samples, it would generalize poorly due to the memorization of deep networks (Arpit et al., 2017). When we instead extend the volume and diversity of the feature set by illumination augmentation, the trained model becomes more generalizable.

Our work can be improved in the following aspects.

First, it should be noted that the illumination features learned by our model seem to reflect relative illumination intensity rather than fine details, limited by the lack of illumination supervision. The constraint used in our method improves the quality of illumination features to some extent and thus enhances the model performance. However, alternative disentanglement methods with more stringent constraints, or pretraining on illumination-supervised data, could be applied to obtain a more refined illumination representation.

Second, the spatial transformation network (STN) in the matching module can be substituted by other networks. In traffic sign classification, the STN corrects the semantic features well so that they are consistent with those of the templates. However, it sometimes struggles with the non-rigid deformation in logo datasets. Furthermore, general objects might differ from the templates in many aspects, such as color variation and changes in viewing angle. Two remedies can be considered. First, we can choose several different templates for different types of variation. Second, we can develop general networks to deal with such transformations. For instance, we can translate the objects along directions (e.g., color and viewing angle) in the feature space to the templates via semantic transformations (Wang et al., 2020).

5 Related works

Data augmentation is an effective data-space solution to the problem of limited data (Shorten & Khoshgoftaar, 2019). Augmentations based on data warping transform existing images while preserving the original labels (LeCun et al., 1998; Zheng et al., 2019). Oversampling augmentations enhance the datasets by generating synthetic training samples (Inoue, 2018; Bowles et al., 2018).

In this work, we propose a method of feature space augmentation. This kind of augmentation implements the transformation in a learned feature space rather than the input space (DeVries & Taylor, 2017). Recently, augmentation methods in the semantic feature space have been proposed to regularize deep networks (Wang et al., 2020; Bai et al., 2020). Unlike these methods, we augment the samples with an interpretable illumination representation in an easier way.

Few-shot learning. Early efforts for few-shot learning were based on generative models that sought to build a Bayesian probabilistic framework (Fei-Fei et al., 2006). Recently, more attention has been paid to meta-learning, which can be broadly grouped into five sub-categories: learn-to-measure (e.g., MatchNets (Vinyals et al., 2016), ProtoNets (Snell et al., 2017)), learn-to-finetune (e.g., MAML (Finn et al., 2017)), learn-to-remember (e.g., SNAIL (Mishra et al., 2018)), learn-to-adjust (e.g., MetaNets (Munkhdalai & Yu, 2017)) and learn-to-parameterize (e.g., DynamicNets (Gidaris & Komodakis, 2018)). In this work, we use tasks similar to one-shot learning to evaluate our method.

6 Conclusion

In this paper, we develop a novel neural network architecture named Separating-Illumination Network (Sill-Net). Sill-Net separates illumination features well from training images, and these features can be used to augment the support samples. Our method outperforms the state-of-the-art (SOTA) methods by a large margin on several benchmarks. In addition to these improvements in visual applications, the results demonstrate the feasibility of illumination-based augmentation in the feature space for object recognition, which is a potential research direction for data augmentation.