Mask-invariant Face Recognition through Template-level Knowledge Distillation
The emergence of the global COVID-19 pandemic poses new challenges for biometrics. Not only are contactless biometric identification options becoming more important, but face recognition has also recently been confronted with the frequent wearing of masks. These masks affect the performance of previous face recognition systems, as they hide important identity information. In this paper, we propose a mask-invariant face recognition solution (MaskInv) that utilizes template-level knowledge distillation within a training paradigm that aims at producing embeddings of masked faces that are similar to those of non-masked faces of the same identities. In addition to the distilled knowledge, the student network benefits from additional guidance by a margin-based identity classification loss, ElasticFace, using masked and non-masked faces. In a step-wise ablation study on two real masked face databases and five mainstream databases with synthetic masks, we demonstrate the rationale behind our MaskInv approach. Our proposed solution outperforms previous state-of-the-art (SOTA) academic solutions in the recent MFRC-21 challenge in both scenarios, masked vs masked and masked vs non-masked, and also outperforms the previous solution on the MFR2 dataset. Furthermore, we demonstrate that the proposed model can still perform well on unmasked faces with only a minor loss in verification performance. The code, the trained models, as well as the evaluation protocol on the synthetically masked data are publicly available: https://github.com/fdbtrs/Masked-Face-Recognition-KD.
The current COVID-19 pandemic has presented new challenges to biometric technologies. To reduce the risk of spreading the virus, the use of contactless biometrics has become increasingly important. Face recognition (FR) in particular had already established itself as a contactless biometric modality before the pandemic due to its performance and its passive, universal, and non-intrusive nature. However, FR systems have also been confronted with new realities, most notably the wearing of masks by the general public to prevent the spread of the contagious virus. The presence of a mask that hides facial features can weaken the FR system [35, 9, 10] and thus reduce confidence in the system's decisions. A study by the National Institute of Standards and Technology (NIST) as well as a study from the Department of Homeland Security concluded that the wearing of masks has a significant negative effect on the accuracy of FR systems. Further studies from the scientific community confirmed this [9, 10]. It is important to note that masks negatively affect not only the recognition performance of automated FR systems but also the performance of human operators, presentation attack detection, and quality estimation. Therefore, it is important to find technical solutions that make FR systems robust to the wearing of masks. The importance of this topic has recently led to the organization of several competitions addressing masked face recognition [48, 11, 5].
The challenge of masked faces for automatic FR came into scientific focus with the onset of the COVID-19 pandemic. Previously, some works dealt with the general problem of occluded faces, which also included sunglasses, other facial accessories, and other face occlusions [26, 43, 39]. In the same direction, the detection (not the recognition) of occluded or masked faces [16, 2, 29] was also discussed in the literature. Such works addressed the problem of detecting faces that are largely occluded or partially covered by masks. Only a few works addressed enhancing the recognition performance on masked faces. Li et al. proposed a cropping and attention-based approach to train a face recognition model on the periocular area of masked faces. Neto et al. also focus on the upper part of the face, using a constraint triplet loss to obtain optimized embeddings for masked face recognition. Another approach is to synthetically replace the area hidden by the mask using Generative Adversarial Networks (GANs) [41, 27, 18]. Li et al. proposed a solution to in-paint the masked area while trying to maintain the identity information. A different approach was taken by Boutros et al., who proposed a self-restrained triplet loss to train an on-the-top solution that learns to transform the templates of masked faces into templates that possess the properties of non-masked face templates. Several approaches were proposed as part of the MFRC-21 challenge. Most of the submissions used a ResNet architecture and the ArcFace loss as a foundation. Anwar and Raychowdhury trained a model on synthetically masked images using the Inception-ResNet v1 with the FaceNet triplet loss. Moreover, Zheng et al. used a large-scale web-collected database and corresponding tags without manual annotations, together with frequency domain information, to train an FR model that is more robust to masked faces.
Knowledge Distillation (KD) is a technique that is commonly used to improve the performance and generalizability of lightweight models. This is achieved by transferring knowledge learned by a teacher model to a (usually) smaller student model. The student model is guided by the teacher model to learn additional relationships discovered by the teacher that go beyond the information stored in the ground truth labels. For face recognition, KD has already been used to reduce the complexity of face recognition models and produce well-performing lightweight models [6, 33], or to counter the problem of low-resolution FR [42, 17]. Li et al. used KD between a teacher and a student model to distill the feature distribution of unmasked faces in order to recover identity information on inpainted masked faces.
In this work, we propose a mask-invariant face recognition solution, namely MaskInv. MaskInv utilizes knowledge from a pre-trained face recognition model using KD on the template level while also being guided by a margin-based identity classification loss (ElasticFace). The student model is trained with masked and non-masked face images so that it can deal with both cases, while the KD process ensures that the model produces embeddings of masked faces that are similar to those of non-masked faces of the same identities. We investigate a single-stage and a two-stage training paradigm, where the latter puts more emphasis on the KD at later training stages, which proves to enhance the masked FR performance. We additionally compare our solution against a baseline that uses the same training paradigm with mask-augmented face images, but without optimizing the embeddings using KD.
Our experiments demonstrate, in a stepwise ablation study, the accuracy gains of our MaskInv solution in different scenarios of masked face recognition. This study utilizes seven different benchmarks, two with real and five with simulated masked face images. We further show the superiority of our MaskInv solution by comparing the achieved results to the top academic performers of the MFR 2021 masked face recognition challenge, where MaskInv outperformed all the academic solutions in both the masked vs masked and the masked vs non-masked scenarios.
The paper is structured as follows: in Section II we detail and rationalize our proposed approach. The experimental setup, the databases used for training and evaluation, and the evaluation criteria are described in Section III. Subsequently, we present and discuss our results in Section IV, both in terms of a detailed ablation study and comparison with SOTA. In Section V we conclude our work.
In this section, we present and rationalize our proposed methodology to create mask-invariant face embeddings through our MaskInv solution. Our approach achieves this by jointly learning the correct identity classification of masked and non-masked face images and ensuring, through embedding-level KD, that the embeddings of masked faces are similar to those of non-masked images of the same identities. The KD teaches a student network to process a masked face in a manner that produces an embedding similar to the non-masked face embedding produced by the teacher, and thus to neglect the non-identity-related information introduced by the mask.
A well-performing face recognition model trained on non-masked faces acts as the teacher model in our knowledge-distillation architecture. A second FR system acts as the student network and is trained in interaction with the teacher model to be mask-invariant, and thus to produce embeddings from masked faces that are similar to those produced from non-masked faces. A schematic overview of the proposed learning scenario is presented in Figure 1. During training, the same images are simultaneously fed to both the teacher and the student model. On the images forwarded to the student model, we apply a synthetic mask with a probability $p$, while the images forwarded to the teacher network remain unaltered. The synthetic mask is created by placing a mask template on the face, depending on the landmarks used for the face alignment during the image pre-processing. The synthetic mask is applied only to a proportion of the images fed to the student model (probability $p$) to ensure that it still deals optimally with non-masked faces and to enable a more stable training process. For the teacher network, we use a pre-trained auxiliary network that guides the newly trained student network. To achieve our goal, the student network is trained not only to produce a correct classification decision but also to optimize the produced embedding to be similar to that of the teacher network (on non-masked faces). Thus, the student network is trained using a combined loss $L$, consisting of two different losses. Formally, we define the total loss as:
$$L = L_{ElasticArc} + \lambda L_{mse} \tag{1}$$
$L_{ElasticArc}$ refers to the recently published SOTA FR loss function ElasticFace-Arc and $L_{mse}$ to the mean squared error loss in the template-level KD process, weighted by $\lambda$. The ElasticFace-Arc loss function relaxes the fixed margin constraint of similar high-performing FR loss functions and therefore provides space for flexible identity separability. It outperforms several other SOTA FR loss functions, especially on hard cross-pose benchmarks. Formally, it can be defined as:
$$L_{ElasticArc} = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{s \cos(\theta_{y_i} + E(m, \sigma))}}{e^{s \cos(\theta_{y_i} + E(m, \sigma))} + \sum_{j=1, j \neq y_i}^{c} e^{s \cos(\theta_j)}} \tag{2}$$
where $N$ denotes the batch size, $c$ the number of identities, $s$ the scale parameter, $m$ the margin parameter, and $\sigma$ the standard deviation; $\theta_j$ is the angle between the embedding of sample $i$ and the center of class $j$, and $y_i$ is the ground-truth class of sample $i$. The function $E(m, \sigma)$ returns a random value from a Gaussian distribution with the mean $m$ and the standard deviation $\sigma$. All the parameters are set as defined in the original ElasticFace work. The $L_{mse}$ is calculated as part of the KD between the feature embeddings of the teacher network and the feature embeddings of the student network to optimize the embedding itself rather than the network classification behavior. This ensures that the embedding distortions caused by masks are kept to a minimum and therefore guides the KD process to produce a mask-invariant student network. The used loss, mean squared error, can be formalized as:
$$L_{mse} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{D} \sum_{j=1}^{D} \left( \Phi^{S}_{i,j} - \Phi^{T}_{i,j} \right)^{2} \tag{3}$$
where $\Phi^{S}_{i}$ and $\Phi^{T}_{i}$ are the feature representations of sample $i$ obtained from the embedding layer of the student and teacher model, respectively, and $D$ is the size of the embeddings.
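The combined objective described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their released code is in the linked repository): the function names, the list-based data layout, and the use of one random margin per sample are choices made here for readability, and the cosine logits are assumed to be precomputed.

```python
import math
import random


def elastic_arc_loss(cosines, labels, s, m, sigma, rng):
    """Mean ElasticFace-Arc-style loss for a batch of cosine logits.

    cosines[i][j] is assumed to hold cos(theta_j) between sample i and class centre j.
    """
    total = 0.0
    for row, y in zip(cosines, labels):
        margin = rng.gauss(m, sigma)                      # random margin drawn from N(m, sigma)
        theta_y = math.acos(max(-1.0, min(1.0, row[y])))  # angle to the true class centre
        logits = [s * c for c in row]
        logits[y] = s * math.cos(theta_y + margin)        # penalise only the true-class logit
        peak = max(logits)                                # log-sum-exp stabilisation
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[y]                      # -log softmax of the true class
    return total / len(labels)


def mse_loss(student, teacher):
    """Mean squared error between student and teacher embeddings."""
    total = 0.0
    for f_s, f_t in zip(student, teacher):
        total += sum((a - b) ** 2 for a, b in zip(f_s, f_t)) / len(f_s)
    return total / len(student)


def total_loss(cosines, labels, student, teacher, lam, s, m, sigma, rng):
    """Identity classification loss plus lam times the embedding-level KD loss."""
    cls = elastic_arc_loss(cosines, labels, s, m, sigma, rng)
    return cls + lam * mse_loss(student, teacher)
```

Setting `lam` to zero removes the distillation term, which corresponds to training on mask-augmented images without KD.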
Since the learned feature embeddings are normalized, the range of $L_{mse}$ is rather small and thus, as proposed in prior work, we weight it with a weight $\lambda = 100$ during the first training step. This enables the knowledge transfer by allowing the $L_{mse}$ to contribute to the overall loss while keeping the emphasis on learning identity classification through the $L_{ElasticArc}$.
We propose two different paradigms based on the methodology described above. In the MaskInv-HG (Mask-invariant High Guidance) approach, we increase the weighting to $\lambda = 3000$ once the loss has stabilized to further emphasize the adaptation of the network to the masked data. In the MaskInv-LG (Mask-invariant Low Guidance) approach, the weighting remains unchanged throughout the training process. For a detailed ablation study, we additionally present the results of a third solution, where $\lambda$ is set to zero and thus no KD is applied. Instead, the student network is trained independently with faces augmented with a synthetic mask with the probability $p$. This solution will be referred to as ElasticFace-Arc-Aug. All the training parameters are introduced in the next section.
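The three variants differ only in how the distillation weight evolves during training. A minimal sketch follows, using the values reported in the training setup (100, 3000, and the switch after 227k iterations); the function name and the choice to switch exactly at iteration 227,000 are assumptions made for illustration.

```python
def kd_weight(iteration, variant, low=100.0, high=3000.0, switch_iteration=227_000):
    """Return the weight applied to the KD (MSE) loss at a given training iteration."""
    if variant == "ElasticFace-Arc-Aug":
        return 0.0                       # mask augmentation only, no distillation
    if variant == "MaskInv-LG":
        return low                       # low guidance: weight never changes
    if variant == "MaskInv-HG":
        # high guidance: raise the weight once the loss has stabilised
        return high if iteration >= switch_iteration else low
    raise ValueError(f"unknown variant: {variant}")
```

For example, `kd_weight(250_000, "MaskInv-HG")` yields the high weight, while the same call for `"MaskInv-LG"` still yields the low one.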
As the teacher network, we use a pre-trained FR model based on the ResNet-100 architecture, trained on the MS1MV2 dataset with the ElasticFace-Arc loss, which has been made publicly available by the authors (https://github.com/fdbtrs/ElasticFace). For the training details of the teacher model, we refer to the original ElasticFace paper. We chose this model because it advanced the SOTA (such as ArcFace and MagFace) on six difficult mainstream benchmarks and is publicly available.
For the student model, we also use the ResNet-100 architecture, as it is widely used in SOTA FR approaches [31, 12, 3]. For the ElasticFace-Arc loss, we set the scale parameter $s$ similar to [12, 3, 24], and the margin parameter $m$ and the standard deviation $\sigma$ as in the original ElasticFace work. The mini-batch size is set to 512. The model is trained with the Stochastic Gradient Descent (SGD) optimizer. The learning rate is divided by 10 at 80k, 140k, and 210k training iterations. The total number of training iterations is set to 295k. These settings, together with the initial learning rate, momentum, and weight decay, follow the training process defined for ElasticFace. For the MaskInv-HG variant, we increase the weighting $\lambda$ from 100 to 3000 after 227k iterations. We do this to further sensitize the model to the masked data. For the MaskInv-LG variant, the weighting $\lambda$ remains unchanged throughout the training. In the experiments, we investigate the performance of both the student model with low guidance (MaskInv-LG) and the student model with high guidance (MaskInv-HG) in a detailed ablation study. As part of this study, we also evaluate our ElasticFace-Arc-Aug solution, where $\lambda$ is set to zero (no KD). Here, the student network is trained independently (with the same training procedure as the teacher) with faces augmented with a synthetic mask with the probability $p$.
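The step-wise learning rate schedule can be written down compactly. The sketch below only encodes the milestones stated in the text (division by 10 at 80k, 140k, and 210k iterations); the function name is hypothetical, and `base_lr` is left as a parameter because its value is not given here.

```python
def learning_rate(iteration, base_lr, milestones=(80_000, 140_000, 210_000), factor=10.0):
    """Step schedule: divide base_lr by `factor` at every milestone already reached."""
    passed = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr / (factor ** passed)
```

With three milestones and 295k total iterations, the final phase of training runs at one thousandth of the initial learning rate.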
Following recent trends [12, 3, 31, 24], the model is trained on the MS1MV2 dataset, which is the same data used to train the teacher model. MS1MV2 is a refined version of MS-Celeb-1M and contains 5.8M images of 85k identities. For the teacher network, the images are used unmodified, while for the student network, synthetic masks with random colors and random small deviations in shape are added with a probability of 0.5. The synthetically masked images were created by mapping a template mask image onto the extracted landmarks used for pre-processing, with small variations in the mapped key points. All the images are aligned and cropped to 112x112x3 using MTCNN and then normalized to have pixel values between -1 and 1. The simulated mask approach will be publicly provided to ensure comparability and reproducibility.
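The data flow for one training image can be sketched as follows. This is an illustrative assumption-laden example: `paint_mask` is a hypothetical stand-in that simply colours the lower half of the image, whereas the paper maps a mask template onto facial landmarks, and the normalization formula is one common way to reach the stated range of -1 to 1.

```python
import random


def normalize_pixel(value):
    """Map an 8-bit intensity in [0, 255] to [-1, 1] (assumed formula)."""
    return (value - 127.5) / 127.5


def paint_mask(image, color):
    """Hypothetical stand-in for the landmark-based mask template: colour the lower half."""
    height = len(image)
    return [
        [color if row >= height // 2 else pixel for pixel in line]
        for row, line in enumerate(image)
    ]


def student_input(image, rng, p=0.5):
    """Return the (possibly masked) student image and a flag; the teacher sees `image` unchanged."""
    if rng.random() < p:
        color = tuple(rng.randrange(256) for _ in range(3))  # random mask colour
        return paint_mask(image, color), True
    return image, False
```

Here an image is a list of rows of `(r, g, b)` tuples; a real pipeline would operate on tensors and apply the mask before normalization.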
To demonstrate the effect of our proposed masked face recognition approach, we investigate its behavior on both real and simulated masked face datasets. Furthermore, we observe the performance on non-masked benchmarks to show that our MaskInv solution can still operate well on non-masked data.
Regarding the performance on real masked data, we use the MFRC-21 database of the MFR competition as well as MFR2. The MFRC-21 dataset consists of images of 47 individuals taken in a collaborative but varying scenario. The authors of MFRC-21 provided two different evaluation scenarios: non-masked vs masked and masked vs masked. The evaluation on this database (MFRC-21) is essential as it enables a wide-ranging comparison to the SOTA solutions presented in the competition. In addition to MFRC-21, the MFR2 dataset is also used in this work. MFR2 consists of masked images of 53 identities collected in the wild, with a total of 269 images. The authors provide a list of 848 pairs of images to be compared.
We additionally opted to investigate the performance of our MaskInv solution on large-scale databases. To do that, we rely on simulated masks to adapt established large-scale face recognition benchmarks to the evaluation of masked face recognition performance. To generate the simulated masks, we follow the same synthetic mask creation process detailed in subsection III-A. However, to maintain a realistic appearance, we do not use the random shift in the mask position, but we retain the random color of the mask. Following the proposed scenarios of the MFRC-21 competition, we evaluate both masked vs masked and non-masked vs masked comparison pairs, where we apply the simulated mask either on both images or on only one image of the comparison pair. As a basis for our simulated large-scale masked dataset, we use five mainstream datasets that also cover various additional challenges. The five benchmarks are: LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW.
LFW is an unconstrained face verification benchmark and contains 13,233 images of 5749 different identities. We follow the standard protocol and apply the masks to the defined 6000 comparison pairs. The CFP-FP dataset addresses the comparison between frontal and profile faces, and its evaluation protocol contains 3500 genuine and 3500 imposter pairs. AgeDB is an in-the-wild dataset for age-invariant face verification. We use the most reported and most challenging scenario, AgeDB-30, with an age gap of 30 years between the images of the individuals. Both the CALFW dataset and the CPLFW dataset are based on the LFW dataset. While the CALFW dataset focuses on cross-age evaluation, the CPLFW dataset focuses on cross-pose evaluation. Both dataset protocols provide 3000 genuine and 3000 imposter comparison pairs. The imposter pairs of CALFW and CPLFW are selected from the same gender and ethnicity to reduce the effect of these attributes on the recognition performance.
In summary, we use seven databases to evaluate our MaskInv solution: two with real masks and five common face recognition benchmarks with simulated masks. Sample images from the used datasets are shown in Figure 2.
For the evaluation, we consider several metrics. For the sake of comparability and reproducibility, we follow the evaluation metrics used and proposed in the utilized benchmarks and datasets. We nevertheless acknowledge the evaluation metrics defined in the ISO/IEC 19795-1 standard.
The verification performance on the MFRC-21 dataset is evaluated and reported as the false non-match rate (FNMR) at two operation points. The operation points are denoted as FMR1000 and FMR100 and refer to the points that provide the lowest FNMR for a false match rate (FMR) < 0.1% and < 1.0%, respectively. As the FMR1000 has been proposed by the European Border and Coast Guard Agency (Frontex) as the best practice evaluation operation point for high-security scenarios, e.g. automatic border control systems, we rate it higher than FMR100. To get an indication of generalizability, we also report the Fisher Discriminant Ratio (FDR) between the genuine and imposter distributions as a separability measure.
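As an illustration of these metrics, the following sketch computes the FNMR at a given FMR limit and a Fisher Discriminant Ratio from lists of similarity scores. It is a simplified example under stated assumptions: the function names are hypothetical, the FMR limit is treated as inclusive, and the FDR is taken in its common form, the squared difference of the means divided by the sum of the (population) variances.

```python
import statistics


def fnmr_at_fmr(genuine, impostor, max_fmr):
    """Lowest FNMR over thresholds whose FMR does not exceed max_fmr.

    Scores are similarities: a pair is accepted when score >= threshold.
    """
    candidates = sorted(set(genuine) | set(impostor)) + [float("inf")]
    best = 1.0
    for threshold in candidates:
        fmr = sum(score >= threshold for score in impostor) / len(impostor)
        if fmr <= max_fmr:
            fnmr = sum(score < threshold for score in genuine) / len(genuine)
            best = min(best, fnmr)
    return best


def fisher_discriminant_ratio(genuine, impostor):
    """Separability of the genuine and impostor score distributions."""
    mean_gap = statistics.fmean(genuine) - statistics.fmean(impostor)
    spread = statistics.pvariance(genuine) + statistics.pvariance(impostor)
    return mean_gap ** 2 / spread
```

Under this convention, FMR1000 corresponds to `fnmr_at_fmr(genuine, impostor, 0.001)` and FMR100 to `fnmr_at_fmr(genuine, impostor, 0.01)`.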
On the MFR2 dataset, we follow the evaluation metrics reported in the original work. This includes the true positive rate (TPR) at the false acceptance rate (FAR) operation point denoted as FAR2000, the achieved accuracy at this operation point (ACC2000), and the maximum accuracy of the network (ACC). For the benchmarks LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW, we follow the original metric of the respective benchmarks, all of which report the verification performance as the verification accuracy in percent, as defined in the corresponding benchmark references.
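The difference between accuracy at a fixed operation point and maximum accuracy can be made concrete with a short sketch. It is a simplification with hypothetical function names: the thresholds are searched on the same pairs that are scored, whereas the benchmark protocols define their own procedures for selecting the threshold.

```python
def accuracy(scores, same_identity, threshold):
    """Fraction of pairs classified correctly when accepting score >= threshold."""
    correct = sum(
        (score >= threshold) == same
        for score, same in zip(scores, same_identity)
    )
    return correct / len(scores)


def max_accuracy(scores, same_identity):
    """Best accuracy over all thresholds that change the decision on some pair."""
    thresholds = sorted(set(scores)) + [float("inf")]
    return max(accuracy(scores, same_identity, t) for t in thresholds)
```

Because `max_accuracy` picks a different threshold for every model, two models compared on it are not evaluated at the same operation point.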
Table I: MFRC-21, masked vs non-masked.
|Mask vs No-Mask|FMR1000|FMR100|FDR|
|ElasticFace-Arc (teacher)|0.08112|0.06004|8.2282|

Table II: MFRC-21, masked vs masked.
|Mask vs Mask|FMR1000|FMR100|FDR|
|ElasticFace-Arc (teacher)|0.07109|0.05795|9.5872|

Table III: MFR2.
|Model|FAR2000|ACC2000|ACC|
|ElasticFace-Arc (teacher)|83.25%|91.63%|95.05%|
In this section, we present the evaluation results achieved by our MaskInv solution. We start with an extended ablation study to investigate the influence of our proposed solution in its two training paradigms MaskInv-LG and MaskInv-HG in comparison to the teacher baseline (ElasticFace-Arc) and the ElasticFace-Arc-Aug (trained without KD). Later on, we take a closer look at the performance of our approach in comparison to results published in the literature.
In the following ablation study, we investigate step by step the impact of our two-stage training paradigm by looking into (1) the teacher baseline, (2) the model trained with mask-augmented images but no KD ($\lambda = 0$), (3) the single-stage MaskInv-LG, and (4) the two-stage MaskInv-HG solutions. We additionally put our ablation study results in perspective with two further top-performing face recognition solutions besides ElasticFace-Arc (ArcFace and MagFace, both also based on ResNet-100 and trained on MS1MV2) to motivate mask-specific solutions, while the comparison to SOTA solutions that specifically target masked face recognition follows in the next section. The results of the different models on the MFRC-21 dataset are shown in Table I and Table II for the masked vs non-masked and the masked vs masked scenarios, respectively. We focus our discussion on the verification performance at FMR1000, as this has been proposed by Frontex as a best practice operation point for processes such as automatic border control. Table I and Table II show that the KD is beneficial to the verification performance in comparison to the teacher model and that the adjusted models outperform the traditional FR models on masked faces. While the ElasticFace-Arc-Aug shows a better performance than MaskInv-HG in a few cases, it lacks the same level of separability, measured in FDR, and thus the expected generalizability of the performance. Table III presents a similar ablation study on the MFR2 database, where the proposed MaskInv-HG again outperforms the teacher and the MaskInv-LG model by scoring an FAR2000 of 92.21%, in comparison to 91.98% and 83.25% for the MaskInv-LG and the teacher network, respectively. A similar trend can be seen for the ACC2000 metric.
The maximum accuracy ACC shows mixed results. However, this metric is not measured at a comparable operation point across different solutions and thus cannot lead to a fair comparison, which is also the reason why such metrics are not included in the verification performance metrics in ISO/IEC 19795-1. The poorer performance of the two well-known MagFace and ArcFace-based models on the real masked datasets (MFRC-21 and MFR2) shows that a tailored solution for MFR, such as the one we present here, is necessary.
In Table IV we present the results on five face recognition benchmarks. The experiments were performed on three different versions of the datasets: 1) on the unaltered data (No Mask), 2) on the dataset with synthetic masks applied to one of the images of each pair (Masked vs Non-Masked), and 3) on the dataset with synthetic masks on both images of each pair (Masked vs Masked). The results in Table IV show that all three mask-trained models (ElasticFace-Arc-Aug, MaskInv-LG, and MaskInv-HG) handle unmasked data better than masked data and outperform conventional FR models in the latter case. The proposed MaskInv solutions guided by the KD outperform the teacher model as well as the ElasticFace-Arc-Aug model on most synthetic mask benchmarks. However, MaskInv-LG and MaskInv-HG alternate in the top spot. For example, in the masked vs non-masked setting of the CFP-FP benchmark, the MaskInv-HG comes first with an accuracy of 96.94%, followed by 96.83% and 95.91% for the MaskInv-LG and ElasticFace-Arc-Aug, respectively. All three are far ahead of the teacher and the other baseline methods. Similar results can be seen in the masked vs masked setting, with the ElasticFace-Arc-Aug edging closer to the MaskInv solutions. This is the case because the masked vs masked setting benefits less from the main goal of the MaskInv solution, which is to create similar face representations for masked and unmasked faces, while the masked vs masked setting requires only the similarity between masked faces. The lower masked face recognition performance of the solutions based on MagFace, ArcFace, and ElasticFace-Arc (teacher) again indicates the need for a specifically designed solution such as our MaskInv.
To examine the influence of the proposed training paradigm in detail, we plot the two losses along the training iterations in Figure 3. In contrast to the MaskInv-LG, the weight $\lambda$ (see Equation 1) is increased for the MaskInv-HG from 100 to 3000 after 227k iterations, resulting in a minimal change in the ElasticFace-Arc loss but a large drop in the embedding optimization loss (KD loss). This corresponds to the targeted effect of not affecting the identity classification performance of the model while enhancing the similarity of the embedding (whether masked or not) to that of the teacher model.
In summary, the ablation study showed that our proposed MaskInv solution benefits from learning to produce similar embeddings for masked and non-masked faces through knowledge distillation. This is specifically beneficial when comparing masked to non-masked faces, as demonstrated and rationalized earlier, which is the more practical comparison scenario (e.g. a non-masked reference in a passport). The experiments performed on benchmarks addressing special challenges, such as cross-pose and cross-age, show that the proposed approach can also be transferred to these cases. Furthermore, the performance of traditional non-MFR models shows that a separate solution for masked faces is needed to achieve competitive results in masked face recognition.
Table IV: Verification accuracy (%) on the five benchmarks.
|Model|LFW|CFP-FP|AgeDB-30|CALFW|CPLFW|
|No Mask|
|ElasticFace-Arc (teacher)|99.80|98.67|98.35|96.17|93.27|
|Masked vs Non-Masked|
|ElasticFace-Arc (teacher)|99.40|95.29|95.38|94.42|90.40|
|Masked vs Masked|
|ElasticFace-Arc (teacher)|98.90|92.01|93.47|93.73|87.55|
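Table V: MFRC-21, masked vs non-masked.
|Model|FMR1000|FMR100|FDR|
|Neto et al.|-|0.28252|-|

Table VI: MFRC-21, masked vs masked.
|Model|FMR1000|FMR100|FDR|
|Neto et al.|-|0.23507|-|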
In this subsection, we compare our proposed approach with previously published results on both the MFRC-21 and MFR2 datasets. Since the ablation study showed the benefit of our proposed MaskInv-HG solution, especially in terms of generalization, we compare the results of this solution to SOTA results. The comparison to SOTA is limited to these two benchmarks (out of the seven used), as the remaining benchmarks do not have comparable results in the literature. For the MFRC-21 dataset, we limit the evaluation to verification performance and, unlike the challenge organizers, do not consider model compactness because we do not propose a novel architecture in this work. In addition, we consider the top-ranked academic submissions to the challenge, as they provide a detailed description of their proposed approaches. We compare our solution to the ten top-ranked academic approaches in the MFRC-21 challenge in the masked vs non-masked (Table V) and the masked vs masked (Table VI) scenarios. A more detailed description of the different solutions in the MFRC-21 competition can be found in the competition paper.
In both scenarios, our proposed MaskInv-HG outperformed the academic solutions submitted to the MFRC-21 challenge. In the masked vs non-masked scenario (Table V), our model achieved a verification performance FMR1000 of 0.05849 and beats the best-performing academic solution, MTArcFace, which achieved an FMR1000 of 0.05860. As mentioned earlier, we focus our evaluation on the FMR1000, as it has been proposed by Frontex as a best practice operation point for automatic border control. On the FMR100, only the SMT-MFR-2 model achieved a higher verification performance than our proposed MaskInv-HG model. In addition, our model achieves the highest FDR, which indicates a better separability of the imposter and genuine distributions than the other approaches, and thus indicates higher generalizability. In the masked vs masked scenario (Table VI), our MaskInv-HG model outperforms the solutions submitted to the challenge on all evaluation metrics. In contrast to the other top-3 solutions in the masked vs non-masked scenario, the performance of our model on masked vs masked was only slightly worse (FMR1000 = 0.05886) than on masked vs non-masked (FMR1000 = 0.05849). This shows the flexibility of our proposed MaskInv-HG model to handle both scenarios well by creating similar face representations for masked and non-masked faces.
Table VII: MFR2, comparison with the solution of the dataset authors.
|Model|FAR2000|ACC2000|ACC|
|MTF Retrained|82.78%|91.18%|95.99%|
In Table VII, we compare our results on the MFR2 dataset with the "MTF Retrained" solution proposed by the authors of the dataset. "MTF Retrained" is based on the Inception-ResNet v1 and is trained on a subset of the VGGFace2 dataset with the FaceNet triplet loss. Different synthetic masks were added to the images of this subset. In all three evaluation metrics, our MaskInv-HG model outperformed the MTF Retrained model. Our model increased the FAR2000 by 9.43 percentage points (92.21% compared to 82.78%) and also exceeded the MTF Retrained results for the corresponding accuracy at FAR2000 (ACC2000, 91.18%) and for the maximum accuracy (95.99%).
In summary, our proposed approach outperformed the previous SOTA in both the MFRC-21 challenge and the MFR2 evaluation protocol. This demonstrates the ability of the proposed MaskInv solution to adapt to masked faces and provides a step forward towards accurate masked face recognition in both scenarios, masked vs masked and masked vs non-masked faces.
In this paper, we proposed a mask-invariant face recognition solution that aims not only at building discriminant face embeddings, but extends this target to building embeddings that maintain their intra-identity similarity whether or not a mask is worn. This novel approach, namely MaskInv, is based on jointly learning to separate identities despite the wearing of masks through identity classification learning, and learning to produce similar embeddings for masked and non-masked faces of the same identity through embedding-level KD from a teacher network. In a detailed ablation study on two real masked datasets as well as on five mainstream face verification benchmarks, the different stages of our proposed approach have been shown to enhance the performance of masked face recognition. The proposed solution outperformed current SOTA approaches when compared to the academic solutions submitted to the recent masked face recognition competition MFRC-21. Moreover, our proposed solution maintains a high level of accuracy when verifying non-masked faces, as we demonstrated on five widely used benchmarks addressing different variations, including cross-pose and cross-age verification. As the foreseeable effect of this and future pandemics on the world is still unknown, masked face recognition, which has been brought into focus by the global COVID-19 pandemic, will be of further interest for our society and will require novel solutions. Additionally, the potential of the presented mask-invariant face representation learning may not be limited to its use for masked face recognition, but could also prove useful for the problem of occluded face recognition in general, which is still to be studied.
Distilling the knowledge in a neural network. CoRR abs/1503.02531.
Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pp. 4278–4284.