The inspection of passengers’ baggage, packages, and containers with X-ray scanners is nowadays part of the standard checking measures at airports and other public places where safety and security are of significant concern. This screening process is cumbersome and requires the constant attention of a human expert. Furthermore, it is vulnerable to human errors caused by exhausting work schedules, lack of experience, and the concealed nature of contraband items. Although object detection in color images has been rigorously researched, its applicability to X-ray-based threat detection is somewhat limited. The primary reason is the remarkably different characteristics of X-ray imagery, where texture and appearance details are scarce compared to regular color images. An adequate system for such a critical application is expected to detect objects under high occlusion, in cluttered scenes, with large viewpoint variations, and with limited amounts of contraband data.
Many researchers have developed supervised and unsupervised screening systems for detecting contraband items in X-ray images in response to these challenges. The most recent wave of these efforts employed deep learning models, particularly one-staged and two-staged object detectors such as RetinaNet [retinanet], YOLO [yolo], and Faster R-CNN [fasterrcnn]. While these systems showed remarkable capacity for detecting isolated objects, their performance degrades when recognizing extremely cluttered, occluded, and overlapping items [akcay2018using, gaus2019evaluation]. Semantic segmentation models, due to their pixel-level recognition ability, can extract extremely occluded contraband items from X-ray baggage scans [Hassan2020ACCV]. With the integration of object context in the pixel classification, they have more potential to improve the threat detection accuracy [an2019]. By leveraging this capacity, some of the initial attempts employed the encoder-decoder-encoder topology for detecting suspicious items as anomalies [akcay2018ganomaly]. However, semantic segmentation networks have an inherent limitation in detecting the individual instances of overlapping items. For example, in Figure 1-B, we can see how a semantic segmentation network cannot recognize the overlapping kitchen knife and chopper individually. In such scenarios, these networks output only a single blob in which the information about individual item instances is lost. Detecting individual instances of the same threat category is, in fact, desirable in cases where we need to identify and locate each instance precisely (see the example in Figure 1-C, where the kitchen knife and the chopper instances have been extracted separately). Also, identifying individual item instances is vital in aviation baggage screening, as some instances of the items are legal to carry within the baggage, whereas others are prohibited.
For example, passengers can carry certain drugs and bottles in their luggage, but addictive drugs and alcoholic drinks are banned at airports [EU]. Towards this end, Gaus et al. [gaus2019evaluation, gaus2019evaluating] introduced an instance segmentation approach in their baggage threat detection system using Mask R-CNN [maskrcnn]. However, the authors realized that conventional instance segmentation networks require extensive ground truth labeling and exhaustive training efforts, especially for large-scale datasets. There is, therefore, a need to develop a framework that can effectively perform instance-aware segmentation to recognize the cluttered contraband items in baggage X-ray imagery via incremental few-shot training.
II Related Work
Existing solutions for contraband item detection based on X-ray imagery can be classified as traditional machine learning and deep learning methods. In this section, we shed light on the main approaches, and we refer the reader to [Mery2017TMSC] and [ackay2020] for a detailed survey. In addition, this section explores the recent advances in incremental learning for classification and segmentation tasks.
A. Conventional Machine Learning Methods: The initial methods developed for screening contraband items employ conventional machine learning. These solutions are based on classification [bastan2013BMVC], detection [bastan2015], or segmentation approaches [heitz2010]. Bastan et al. [bastan2011] used SURF features with Bag of Words (BoW) to identify suspicious objects. Instead of SURF, Kundegorski et al. [kundegorski2016] utilized FAST-SURF with BoW to classify prohibited baggage items. Other works involve Adaptive Sparse Representation [mery2016] and the Adapted Implicit Shape Model [riffo2015automated] to detect contraband items. Apart from this, Mery et al. [mery2016] developed a framework that computes 3D feature points through structure from motion and uses these features to classify contraband items from the X-ray imagery.
B. Deep Learning Methods: The most recent deep learning methods can be categorized either as supervised detection and segmentation approaches or as unsupervised adversarial learning schemes.
1. Supervised Detection Strategies: The majority of deep contraband item detection frameworks utilize one-staged or two-staged object detectors such as YOLOv2 [yolov2], RetinaNet [retinanet] and Faster R-CNN [fasterrcnn]. Moreover, researchers have also utilized pre-trained models for object classification within baggage X-ray scans [jaccard2017, gaus2019evaluation]. Zou et al. [_41] utilized YOLOv2 [yolov2] to detect scissors, knives and bottles from their local set of 1,104 synthetic X-ray images. Miao et al. [miao2019sixray] released the largest security inspection X-ray dataset (SIXray), which contains highly occluded and overlapping instances of contraband items such as guns, knives, wrenches, pliers, scissors and hammers. Furthermore, they presented a framework dubbed class-balanced hierarchical refinement (CHR) to recognize contraband items from the SIXray [miao2019sixray] dataset. More recently, Hassan et al. [hassan2019] presented the Cascaded Structure Tensor (CST) framework, which generates contour-driven bounding boxes of potentially prohibited items that are then classified using ResNet50 [he2016deep].
2. Supervised Segmentation Approaches:
Apart from solving the baggage threat recognition problem via deep object detection methods, many researchers utilized semantic and instance segmentation as a tool to effectively recognize suspicious baggage content [an2019, gaus2019evaluation]. It is essential to note here that although we can fine-tune standard encoder-decoder networks for a large variety of semantic segmentation tasks, specific applications would best be approached with customized models [liang_eccv_2018].
For example, to cope with object size variation and camera view changes in traffic and surveillance applications, Akilan et al. [akilan2020] proposed integrating residual feature fusions at early, middle and late stages in the encoder-decoder architecture (dubbed MvRF-CNN [akilan2020]). Similarly, to achieve the optimal trade-off between segmentation accuracy and computational model complexity, Wang et al. [wang_cvpr2020] coupled an encoder-decoder model with a super-resolution construction scheme. A multi-task attention network is proposed in [wang2020MAN] that couples a handcrafted features pipeline with an attention network to segment the object of interest. Also, an adversarial domain adaptation scheme is proposed in [wang2019ADA] that employs a detection and segmentation (DS) model along with domain classifiers to learn target domain labels from the source domain synthetic data in a weakly supervised manner. In addition to this, Hassan et al. [Hassan2020ACCV] proposed a contour instance segmentation strategy that segments the suspicious baggage content by analyzing the strength of the variation within their contours.
3. Unsupervised Adversarial Learning:
Apart from supervised learning frameworks for detecting contraband items, Akcay et al. proposed GANomaly [akcay2018ganomaly] and Skip-GANomaly [akccay2019skip] to derive the latent space representation of the contraband items in an adversarial manner and recognize them as anomalies within the baggage X-ray scans.
C. Incremental Learning Strategies: Incremental learning schemes have gained immense popularity in the context of deep learning because they avoid the excessive computational burden of re-training models with large-scale data, which might be difficult to obtain and prepare. However, developing an incremental learning scheme that overcomes catastrophic forgetting (the tendency of a deep learning model to drastically forget prior knowledge while learning new information) is also challenging. To address this, many researchers have proposed schemes involving knowledge distillation [kd], gating [gate], and incremental classifier and representation learning (iCaRL) [icarl]. Furthermore, Tian et al. [crd] exploited the fact that knowledge representations exhibit complex relationships that cannot be learned through objective functions that assume independence of events. Cho et al. [cho] argued that well-performing teachers do not necessarily produce good students due to the student network’s limited capacity to cope with the teacher’s growing knowledge. Lopez-Paz et al. [lopez] proposed the Gradient Episodic Memory (GEM) scheme, which uses episodic memories to hold a small set of examples from the previously learned tasks to avoid catastrophic forgetting. Apart from this, researchers have also proposed distillation-driven incremental learning strategies for semantic segmentation tasks [kdils].
D. Limitations of Existing Work: The main limitations of the existing approaches are their validation on only a single dataset or their application to simplistic scenarios within a very constrained environment. For instance, robustly detecting cluttered, occluded, and overlapping contraband items in highly imbalanced datasets is still an open problem. The approaches proposed in [miao2019sixray], [hassan2019] and [Hassan2020ACCV] handle such cases. However, they either produce low detection performance [miao2019sixray] or are subject to parameter tuning [hassan2019]. Apart from this, researchers have also utilized semantic segmentation networks to recognize suspicious baggage content via X-ray imagery [an2019]. Such models have improved the performance of the threat detection frameworks. However, they cannot distinguish between cluttered and overlapping instances of the same items (e.g., a knife overlaid on another knife as shown in Figure 1-B), which is often desirable in aviation screening; for such cases, the semantic segmentation networks output a single blob of pixels representing only a single class label. To address this, Gaus et al. [gaus2019evaluation] introduced the usage of Mask R-CNN [maskrcnn] for baggage threat detection. However, the Mask R-CNN-based threat detection system is limited in extracting cluttered contraband items because it relies on region-based proposals that fail to detect cluttered objects correctly [gaus2019evaluation]. This limitation of Mask R-CNN [maskrcnn] and other instance-aware segmentation networks is further evidenced when they are employed on complex datasets such as SIXray [miao2019sixray], as described in Section V. Moreover, other approaches utilized encoder-decoder architectures and fully convolutional networks coupled with classification sub-networks or region of interest (ROI) voting to recognize multiple object instances individually [an2019, multipathnet].
However, these frameworks also produce a poor trade-off between detection accuracy and efficiency. On the other hand, instance segmentation frameworks require extensive bounding box and mask-level annotations [Hassan2020ACCV], which are laborious and resource-demanding to procure, especially for large-scale datasets such as SIXray [miao2019sixray]. Also, training such networks requires an excessive amount of memory and computational resources. To alleviate these problems, we propose an incremental learning-driven instance-aware segmentation approach, as discussed below.
E. Contributions: This paper proposes a novel scheme that utilizes incremental learning to make conventional semantic segmentation models instance-aware. The proposed method is simple and requires only modest training effort, as only a small batch of training samples is needed to add more instances of a given suspicious item class. This strategy bypasses the laborious annotation workflows that are necessary for training traditional instance segmentation frameworks while overcoming the excessive memory and computational requirements. The proposed framework also avoids catastrophic forgetting through an instance segmentation objective function that minimizes the network loss to retain knowledge about the previously learned classes while understanding new class representations and resolving their complex inter-dependencies. The unique characteristics of the proposed system are:
A novel approach that extends conventional encoder-decoder networks to recognize individual instances of the contraband items from the X-ray scans.
No requirement for an additional object detector, classification sub-network, or ROI voting to perform instance-aware segmentation.
An incremental learning-driven instance segmentation framework that discriminates the overlapping and isolated suspicious item instances with only a few training examples.
Robustness to catastrophic forgetting due to its ability to resolve complex inter-dependencies between already learned and newly added suspicious item categories.
The rest of the paper is organized as follows: Section III discusses the proposed system. Section IV describes the experimental setup. Section V presents the experimental results. Section VI contains a detailed discussion on the performance of the proposed system, and Section VII presents concluding remarks.
III Proposed Framework
Figure 2 depicts the block diagram of the proposed framework. This framework incrementally trains an encoder-decoder model, in iterations, to recognize up to K isolated and overlapped instances of a given class. The first iteration reflects ordinary semantic segmentation to extract different contraband items from the baggage X-ray scans. For this, we train the first instance of the encoder-decoder on a relatively large set of training images. Afterward, we make the encoder-decoder model instance-aware in each iteration by exposing it to small training batches. For example, in the k-th iteration, we make the encoder-decoder recognize up to k instances of the same item by providing a different set of corresponding images. The final instance-aware segmentation model is obtained at the K-th iteration. In this process, the model is immunized to catastrophic forgetting by analyzing the complex relationships between previously learned and newly added suspicious item categories through the proposed loss function (see Eq. 2). Before presenting the details of our approach, we provide a brief overview of the incremental learning paradigm in the next section for completeness.
A. Incremental Learning: In a conventional incremental learning paradigm, the model is trained iteratively. At each iteration, it performs a multi-class segmentation (or classification) task over the classes available in that iteration. To learn this task, the model is given a set of training samples comprising the samples of the old classes, learned in the preceding iterations, and the samples of the newly added classes, to be learned in the current iteration. The cumulative list of classes is the union of the old and the newly added classes. The network is also fed with the ground truth of these training samples, covering the samples of the old classes and of the new classes, respectively [oneHotEncoding]. These training samples are passed as input to the network, which generates the output logits in its last layer from the feature vector, the layer weights, and the bias term. The logits are the concatenation of the old logits and the new logits, generated from training the old classes and the newly added classes, respectively. These logits are then passed through the activation function (usually softmax) in the final layer of the CNN model to generate the final class probabilities, where each probability indicates how likely the training sample is to belong to the corresponding class. The probability obtained in this way is known as the hard class probability of the logit. Hard probabilities are generally recommended in traditional classification or segmentation tasks because they clearly discriminate the most expected class from the rest. In incremental learning, however, the logits are scaled using a temperature constant to generate the soft target probabilities. The temperature increases the degree of relaxation of the soft label by reducing the disparities between class probabilities. In practice, it is a hyper-parameter that is tuned to obtain a better performing model [ILSurvey].
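The difference between hard and soft class probabilities can be illustrated with a minimal sketch. It assumes the common convention of dividing each logit by the temperature before the softmax; the function name, the example logits, and the temperature value of 3.0 are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn one sample's logits into class probabilities.

    temperature=1.0 yields the ordinary ("hard") probabilities; a larger
    temperature flattens the distribution into "soft" targets.
    ASSUMPTION: the logits are divided by the temperature before softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits: two previously learned classes and one new class.
old_logits = [2.0, 0.5]
new_logits = [4.0]
logits = old_logits + new_logits  # concatenation of old and new logits

hard = softmax(logits)                   # peaked distribution
soft = softmax(logits, temperature=3.0)  # relaxed distribution
```

With the higher temperature, the most likely class stays the same, but the gap between class probabilities shrinks, which is the relaxation effect described above.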
B. Semantic Segmentation:
The first iteration of the proposed framework relates to semantic segmentation, where we train the proposed contraband items extraction network (CIE-Net) to extract different contraband items from the baggage X-ray images.
The prime objective of designing the proposed CIE-Net is to accurately extract the contraband items and their instances, even in overly cluttered scenarios. We utilize convolutional blocks (with ReLU activations and batch normalizations) to preserve coarser feature representations of the contraband items while simultaneously retaining their geometrical shapes through finer edge information. The blocks follow a hierarchical design to yield multi-scale representations of threat objects for superior mask-level extraction. Furthermore, we implant novel identity blocks within the encoder topology of the CIE-Net that further aid in preserving the object’s geometrical characteristics regardless of the amount of clutter. The optimal values for the number of filters and kernel sizes are determined empirically after analyzing similarly designed frameworks such as PSPNet [zhao2017pyramid] and ResNet [he2016deep] to craft the design of the CIE-Net.
The detailed architecture of CIE-Net is illustrated in Figure 3. Here, we can observe that the CIE-Net consists of an asymmetric encoder-decoder topology. The desired objects’ contextual and geometrical features are preserved through the contextual preservation blocks (CPB), composed of cascaded convolution and batch normalization operations. The CPB ensures that the network learns to discriminate similarly textured contraband items (even the cluttered ones) by tuning the network weights based upon the categorical cross-entropy loss function in the first iteration, and the proposed instance segmentation loss function in the rest of the iterations. Moreover, to ensure that the network retains the finer shape representations of the contraband items, dedicated identity blocks (inspired by the ResNet [he2016deep] scheme) have been added in the encoder part, where the finer representations (of the suspicious items) are fused with the decoder end via residual-triggered skip-connections. Inspired by PSPNet [zhao2017pyramid], we also employ a custom hierarchical block (HB) to further improve the performance of the CIE-Net. The HB uses variable pooling factors to generate multi-scale feature representations from the latent vector space to recognize the cluttered contraband items and their instances. The hierarchical decomposition and pooling factors are determined empirically to obtain the optimal contraband item extraction performance on grayscale and colored baggage X-ray scans.
Like the proposed framework, the MvRF-CNN [akilan2020] also preserves the desired objects’ geometrical information by fusing feature representations obtained across various network depths in a residual manner. Similarly, to achieve better geometrical characteristics of the desired objects, the framework proposed in [wang_cvpr2020] couples a segmentation encoder-decoder model with a super-resolution construction scheme, where the fine-grained structural features are derived through affinity maps. To get a precise idea of how these two models perform in detecting baggage threats from security X-ray scans, we evaluated both of them on the GDXray [mery2015gdxray], SIXray [miao2019sixray], and the combined datasets. We also compared these schemes’ performance with the proposed incremental instance segmentation framework (please see Table 4 for more details).
In the first incremental training iteration, CIE-Net optimizes the categorical cross-entropy loss function to discriminate between normal and suspicious items (in a semantic segmentation fashion). This loss is accumulated over the total number of classes for the current iteration and the total number of samples in the training batch: for each sample and class, a binary ground-truth value tells whether the sample represents the class or not, and it is paired with the probability that the network derives from the corresponding logit.
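The loss described above can be sketched in a few lines. This is a minimal illustration that assumes the standard batch-averaged form of the categorical cross-entropy; the function name, the clipping constant, and the toy inputs are hypothetical, and the normalization used in the paper may differ.

```python
import math


def categorical_cross_entropy(targets, probs, eps=1e-12):
    """Batch-averaged categorical cross-entropy.

    targets[i][j] is 1 if sample i represents class j, otherwise 0.
    probs[i][j] is the predicted probability of class j for sample i.
    ASSUMPTION: the sum over samples is divided by the batch size.
    """
    if len(targets) != len(probs) or not targets:
        raise ValueError("targets and probs must be non-empty and aligned")
    total = 0.0
    for target_row, prob_row in zip(targets, probs):
        for target, prob in zip(target_row, prob_row):
            # Clip the probability so that log(0) never occurs.
            total -= target * math.log(max(prob, eps))
    return total / len(targets)


# Toy batch: two samples, two classes (values are illustrative only).
toy_targets = [[1, 0], [0, 1]]
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_loss = categorical_cross_entropy(toy_targets, toy_probs)
```

Only the probability assigned to the true class of each sample contributes to the sum, so the loss reaches zero when every sample is assigned to its true class with probability one.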
Here, we also want to highlight that the semantic segmentation network extracts isolated and merged suspicious items from the baggage X-ray scans in the first iteration. However, the network cannot differentiate between multiple instances of the same item (e.g., two or more knives or guns in a single scan, whether they are isolated or merged).
C. Incremental Instance Segmentation: We propose a novel instance segmentation framework that utilizes incremental learning to make conventional semantic segmentation networks instance-aware. Most instance-aware segmentation models employ object detectors, ROI voting, or separate classification sub-networks. However, such additions incur overheads for preparing large-scale training data and impose excessive memory requirements. In contrast, our proposed scheme makes conventional encoder-decoder models instance-aware without needing any additional resources. Thanks to the incremental adaptation strategy, only a small-scale training batch is required in each iteration to learn about multiple item instances in each scan, which drastically reduces the memory and computational requirements compared to fine-tuning approaches. Furthermore, our framework has an in-built capacity to resist catastrophic forgetting through the proposed mutual information loss function, which analyzes the complex inter-dependencies between prior knowledge and newly learned information through Bayesian inference.
1. Instance Segmentation Loss Function: For instance-aware segmentation, we propose a loss function (Eq. 2) that combines three loss terms through loss weights (determined empirically to be 0.2, 0.3, and 0.5). One term minimizes the network loss for learning new instance categories, and another minimizes the distillation loss for retaining the prior learned knowledge (about segmenting the suspicious baggage items). Both are widely used in continual learning frameworks to avoid catastrophic forgetting [ILSurvey]. In the proposed framework, the former is calculated through the categorical cross-entropy loss, while the latter is calculated through the KL divergence loss. The distillation term is computed over the old training samples and the old classes (added in the preceding iterations), whereas the new-class term is computed over the new training samples and the newly added categories (in the current iteration), using the ground truth for the training samples of the old and the new classes, respectively. The distillation term operates on the predicted distribution of the scaled logits generated through the training samples of the old classes. The new-class term compares the actual distribution generated from the true labels of the newly added classes with the predicted distribution of the scaled logits generated through the training samples of the new classes.
The mutual information loss in Eq. (2) is the newly proposed loss term, which we introduce to account for the inter-dependencies between old knowledge and newly learned information in our problem. The rationale and description of this loss term are given in the next sub-section.
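A minimal sketch of how the three terms could be combined is given below. Several points are assumptions made purely for illustration: the assignment of the weights 0.2, 0.3, and 0.5 to the new-class, distillation, and mutual information terms (in that order), the direction of the KL divergence, and the treatment of the mutual information term as a precomputed number. All function names and example values are hypothetical.

```python
import math


def cross_entropy(target, probs, eps=1e-12):
    """Categorical cross-entropy for one sample (new-class term)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


def kl_divergence(reference, predicted, eps=1e-12):
    """KL(reference || predicted) for one sample (distillation term).

    ASSUMPTION: the reference is the retained old-class distribution and
    the prediction is the current model's soft output on old classes.
    """
    return sum(
        r * math.log(max(r, eps) / max(p, eps))
        for r, p in zip(reference, predicted)
        if r > 0
    )


def instance_segmentation_loss(new_term, distill_term, mi_term,
                               weights=(0.2, 0.3, 0.5)):
    """Weighted sum of the three loss terms.

    ASSUMPTION: the weight order (new-class, distillation, mutual
    information) is illustrative; the source only lists the three values.
    """
    w_new, w_distill, w_mi = weights
    return w_new * new_term + w_distill * distill_term + w_mi * mi_term


# Illustrative values only.
new_term = cross_entropy([0, 1], [0.2, 0.8])
distill_term = kl_divergence([0.7, 0.3], [0.6, 0.4])
mi_term = 0.1  # placeholder: the mutual information term is not derived here
total_loss = instance_segmentation_loss(new_term, distill_term, mi_term)
```

Keeping the three terms as separate functions makes it straightforward to swap in a different distillation loss, which is the kind of comparison an ablation over distillation losses would require.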
2. Mutual Information Loss Function: The mutual information loss function is based on Bayesian inference and exploits the complex inter-dependencies between the previously learned class representations (from the preceding iterations), through their respective training examples, and the examples related to the newly stacked classes (in the current iteration). It is computed over the total number of training examples and the total number of old classes, using the ground truth for the training samples representing the previously added classes (in the preceding iterations) together with a posterior probability obtained through Bayesian inference.
The rationale for including the mutual information loss stems from the fact that the older class representations (learned across the preceding iterations) and the newly learned categories (in the current iteration) are not mutually exclusive. For example, a network trained to extract knives (particularly kitchen knives) in the first iteration should be aware of the contextual similarity between kitchen knives and choppers (which it learns in the second iteration), since both of them are different types of knives.
To the best of our knowledge, all the knowledge distillation and incremental learning solutions handle catastrophic forgetting by separately minimizing the network loss involved in learning the new tasks and maintaining the prior learned knowledge inferred from the previous model (or teacher) instance. However, the frameworks trained using these loss functions assume that the older and the newly added class representations are independent of each other, which compromises performance, especially in scenarios where the incrementally learned information is highly correlated. In our approach, the additional mutual information loss function integrates the relationship between the prior learned and the recently stacked classes through their training examples and exploits it via Bayesian inference to maximize the capacity of the incremental learning process to differentiate contraband item instances.
D. Bounding Box Generation: After extracting the suspicious items from the candidate scan, the bounding box for each extracted item is generated through a simple yet very effective scheme. We iterate over the mask of each extracted contraband item within the candidate scan, and for each mask, we find its minimum and maximum row values. The minimum row value is the smallest row index within the candidate scan where the mask value is one. Similarly, the maximum row value is the largest row index (within the candidate scan) where the mask value is one. Afterward, we take the image transpose and repeat the same process to get the minimum and maximum column indices required to generate (and fit) the bounding box. The resulting row and column extents determine the position, width, and height of the bounding box of the candidate contraband item (generated via its extracted mask).
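The row-scan, transpose, and repeat procedure described above can be sketched as follows. The function names, the (x, y, width, height) output layout, and the inclusive width and height convention are assumptions made for this illustration.

```python
def row_extent(mask):
    """Return (min_row, max_row) of the rows that contain a mask value of 1."""
    rows = [index for index, row in enumerate(mask) if any(v == 1 for v in row)]
    if not rows:
        return None
    return rows[0], rows[-1]


def transpose(mask):
    """Swap rows and columns of a 2D list."""
    return [list(column) for column in zip(*mask)]


def bounding_box(mask):
    """Fit a box around a binary mask as (x, y, width, height).

    ASSUMPTION: width and height count both boundary pixels (inclusive).
    """
    rows = row_extent(mask)
    if rows is None:
        return None  # empty mask: nothing to fit
    # Repeating the row scan on the transposed mask yields the columns.
    cols = row_extent(transpose(mask))
    min_row, max_row = rows
    min_col, max_col = cols
    return (min_col, min_row, max_col - min_col + 1, max_row - min_row + 1)


# Illustrative mask of one extracted item inside a 4 x 5 scan.
item_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
box = bounding_box(item_mask)
```

Because every instance is predicted as its own mask, calling `bounding_box` once per mask yields one box per instance without a separate detector.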
IV Experimental Setup
This section reports the datasets, the training details, and the evaluation metrics (used in the evaluation and also in the comparative study).
A. Datasets: We evaluated the proposed framework on the publicly available GDXray [mery2015gdxray] and SIXray [miao2019sixray] datasets, and on the combined dataset (containing the scans from both GDXray [mery2015gdxray] and SIXray [miao2019sixray]). Due to space constraints, we report the detailed description of these datasets in the supplementary material (and in the source code repository). The complete source code and its documentation are available at: https://github.com/taimurhassan/inc-inst-seg.
B. Incremental Training Details: To incrementally train the proposed framework on the GDXray [mery2015gdxray] dataset, we used a total of 788 scans (400 scans for extracting originally identified suspicious items and 388 scans for the locally identified items). For the SIXray [miao2019sixray] dataset, we used 80% of the scans for training and 20% for evaluation as per the dataset standard [miao2019sixray]. Note that the number of incremental training iterations depends on the number of cluttered item instances within each dataset. In the combined dataset, we have a total of 1,067,381 scans, of which 27,750 scans (13,663 positives and 14,087 negatives) were used for training, and the remaining 1,039,631 scans were used in the evaluations. Such a training split also allows us to assess the resistance of the proposed framework against class imbalance.
Moreover, in the first training iteration, we constrain the network with the categorical cross-entropy loss function to recognize different contraband items. Here, the proposed model performs conventional semantic segmentation to extract, for example, a gun and a knife contained within the candidate scan. However, it should be noted that the semantic segmentation model cannot recognize overlapping instances of the same item, e.g., a gun overlaid on another gun. In such scenarios, the semantic segmentation models will output a single blob of gun-labeled pixels.
To accurately recognize the individual overlapped instances of contraband items (e.g., two overlapping guns), we further train our model iteratively. In each incremental iteration, we stack new classes, representing individual instances of the contraband items. Through their respective training examples, we re-tune the proposed model to make it instance-aware. For example, in the second iteration, we train the proposed model to recognize at most two overlapped instances of any suspicious item (e.g., two instances of guns, two instances of knives, etc.) by stacking two additional classes representing the second gun and knife instances. We, therefore, feed the network with a small batch of training examples (containing at most two overlapping instances), where the two overlapping suspicious items (e.g., two overlapping guns) are marked with two different class labels in the ground truth. The same process is repeated across all the iterations until we obtain a K-instance-aware segmentation model, where K denotes the maximum number of overlapping instances of the same item within the dataset. In addition to passing training examples representing the newly stacked classes, we also pass a few examples representing the previous classes (added in the preceding iterations). The set of samples used to train the proposed model at each iteration is significantly smaller than the amount of data required by its competitors [Hassan2020ACCV, hassan2019, akcay2018using, miao2019sixray], i.e., it uses only around 20% of the total training data (defined as per the dataset standard), where, in each increment, about 10% of the examples are added to retain the knowledge of the previously learned categories.
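The class-stacking schedule described above can be pictured with a short sketch. It assumes that each iteration adds exactly one new class per item category; the naming scheme `<item>_<k>` and the function name are hypothetical and only serve the illustration.

```python
def stacked_classes(base_items, max_instances):
    """List the cumulative class labels known after each iteration.

    Iteration 1 holds one class per item (plain semantic segmentation);
    iteration k adds one class per item for its k-th overlapping instance.
    ASSUMPTION: the label naming "<item>_<k>" is illustrative only.
    """
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    classes = []
    schedule = []
    for k in range(1, max_instances + 1):
        classes = classes + [f"{item}_{k}" for item in base_items]
        schedule.append(classes)
    return schedule


# Two item categories and up to three overlapping instances (K = 3).
schedule = stacked_classes(["gun", "knife"], max_instances=3)
for iteration, labels in enumerate(schedule, start=1):
    print(iteration, labels)
```

In this sketch, the labels present before an iteration play the role of the old classes, and the labels appended in that iteration play the role of the newly stacked classes.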
The training is conducted on a machine with an Intel Core i7-9750H @ 2.6 GHz processor and 32 GB RAM with a single NVIDIA RTX 2080 Max-Q GPU, cuDNN v7.5, and CUDA Toolkit v11.0.221. The CIE-Net is implemented using TensorFlow 2.1.0 with Keras 2.3.0 on the Anaconda platform using Python 3.7.9. In the first iteration, the training consisted of 20 epochs, whereas the subsequent iterations took ten epochs each, with the ADADELTA [Zeiler2012ADADELTA] optimizer. The exact number of learnable and non-learnable parameters in CIE-Net varies across iterations; on average, they are around 31.4M and 61.3K, respectively. The detailed model architecture is available in the codebase repository.
We also tested the proposed framework’s applicability to RGB data by evaluating it on the Microsoft COCO dataset [coco]. Since these experiments fall outside the main focus of this study, we report them in the supplementary material of this paper.
C. Evaluation Metrics: The proposed framework has been evaluated using the pixel-level recall, precision, intersection-over-union (IoU), dice coefficient (DC), ROC curves, box-level and mask-level mean average precision () computed using IoU 0.5 ( and ), IoU 0.75 ( and ), and IoU = ( and ), respectively.
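For reference, the pixel-level metrics listed above follow the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN). The sketch below computes them for one binary mask pair; the function name and the convention of returning 0 for an empty denominator are assumptions made for this example.

```python
import numpy as np


def pixel_metrics(pred, truth):
    """Pixel-level precision, recall, IoU, and Dice for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)
    fp = np.count_nonzero(pred & ~truth)
    fn = np.count_nonzero(~pred & truth)

    def ratio(num, den):
        # Assumed convention: an undefined ratio (empty masks) is reported as 0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


truth = [[1, 1, 0], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]
print(pixel_metrics(pred, truth))
```

In this toy example TP = 2, FP = 1, and FN = 1, so the IoU is 2/4 and the Dice coefficient is 4/6, which illustrates that Dice is never lower than IoU for the same prediction.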
This section reports a thorough evaluation of the proposed framework for extracting and recognizing the contraband items. The purpose of these experiments is two-fold: 1) comparing the performance of our instance segmentation model (CIE-Net) with other state-of-the-art models [msrcnn, maskrcnn, htc, yolact], and 2) comparing the overall performance of our framework for baggage threat detection with other competitive systems [miao2019sixray, hassan2019, Hassan2020ACCV, hassan2020Sensors]. First, we conduct an ablative analysis to assess the performance of different state-of-the-art encoder-decoder and fully convolutional models in our framework. We also conduct empirical experiments to study the effect of the temperature constant and the effect of utilizing different knowledge distillation loss functions for incremental instance segmentation. Then, we present the detailed evaluation results of the proposed framework on the GDXray and SIXray datasets in Subsections B and C, respectively. Afterward, we report, in Subsection D, the experiments conducted on the combined dataset.
A. Ablation Study: We conducted an ablation study to investigate: 1) The optimal choice of the segmentation model; 2) The effect of the temperature constant (); 3) The effects of employing different knowledge distillation loss functions in the incremental instance segmentation. Apart from this, we also conducted rigorous ablation experiments to evaluate the parametric effects of the CIE-Net and its custom CPB, IB, and HB blocks. Due to space constraints, these parametric evaluations are reported within the supplementary material of the article.
1. Choice of Segmentation Model: In this study, we compared the performance of several state-of-the-art semantic segmentation models, including PSPNet [zhao2017pyramid], SegNet [segnet], U-Net [ronneberger2015unet], FCN-8 and FCN-32 [fcn8], with our proposed CIE-Net model for the extraction of isolated and overlapping contraband items and their instances depicted within the grayscale and colored baggage X-ray scans. To compare all the models fairly, we trained them incrementally using the proposed loss function, and each model, including the CIE-Net model, was implemented using ResNet101 [he2016deep]. We dub this CIE-Net variant CIE-R-Net to differentiate it from the CIE-Net built with our custom backbone.
The comparison results are reported in Table 1, where we can see that the proposed CIE-Net produced the best performance in terms of both IoU and DC metrics for the SIXray [miao2019sixray], GDXray [mery2015gdxray], and the combined datasets.
Moreover, Figure 4 depicts a qualitative comparison showing segmentation results on samples from the SIXray and GDXray datasets. We can observe here that the CIE-Net produces better extraction results, especially for the examples in Figure 4 (A), (AJ), (AQ), and (AX). This better performance stems from integrating the CPB, IB, and HB blocks in our model, as showcased through the rigorous parametric evaluations discussed in the supplementary material. This synergy allows better extraction of contraband items by retaining global contextual information about them, even at the sparsest level of decomposition, while integrating finer features from the consecutive encoder part through the skip-connections.
2. Effects of the Temperature Parameter: In this experiment, we varied from 0.1 to 3 and measured its effects on the segmentation performance for GDXray, SIXray, and the combined datasets. The results, depicted in Table 2, indicate and as the best values for the GDXray and the SIXray datasets, respectively. also yields the highest performance on the combined dataset. These results suggested framing the optimal values of within the range .
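To make the role of the temperature concrete, the sketch below shows the generic temperature-scaled softmax used in knowledge distillation and a cross-entropy between the softened predictions of an old and a new model. It is a textbook formulation written in NumPy under assumed names (`softmax_t`, `soft_cross_entropy`); it is not claimed to match the exact loss of the proposed framework.

```python
import numpy as np


def softmax_t(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by `temperature`."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_cross_entropy(old_logits, new_logits, temperature=1.0):
    """Cross-entropy between softened old-model and new-model predictions."""
    p_old = softmax_t(old_logits, temperature)
    p_new = softmax_t(new_logits, temperature)
    return float(-(p_old * np.log(p_new + 1e-12)).sum(axis=-1).mean())


logits = [2.0, 1.0, 0.1]
for t in (0.1, 1.0, 3.0):
    print(t, np.round(softmax_t(logits, t), 3))
```

Low temperatures push the distribution toward a one-hot vector, while higher temperatures flatten it and expose the relative scores of the non-maximal classes, which is the information a distillation term can transfer from the old model.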
3. Knowledge Distillation Loss Function: The objective of this ablation study is to compare the proposed function with other state-of-the-art knowledge distillation loss functions, such as Output Distillation Loss [kdil], Modified Deep Model Consolidation [dmc] Loss (proposed in [kdils]), Similarity-Preserving Knowledge Distillation Loss [spkd], and Joint Classification and Distillation Loss [icarl], in our framework. The comparison was made by switching the term in Eq. 2 with these distillation loss functions.
In what comes next, we denote by and , the models trained in the previous iteration (from 1 to ), and in the current iteration , respectively, denotes the total number of training examples belonging to the previously learned classes, , , denotes an old training sample, denotes the total number of old classes, and represents the Frobenius norm. Moreover, minimizes the cross-entropy loss between the prediction of and and is expressed below:
minimizes the disparities between the latent space feature representation of and and defined as:
where and are the latent space vectors related to and , respectively.
minimizes the disparities between the activation similarity matrices () [spkd], and expressed as:
The joint classification and distillation loss , proposed in iCaRL [icarl], is expressed as follows:
where is the standard cross-entropy loss function and the other terms are as previously defined in Eq. (3) and (4). Note that unlike the previous knowledge distillation loss functions, which are plugged as a replacement to , is used as a replacement of in Eq. 2. This is because minimizes both the loss for learning new class representations and the distillation loss for retaining the previously learned classes.
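As an illustration of two of the competing distillation terms, the sketch below implements a similarity-preserving loss in the form commonly given for [spkd] (row-normalized batch similarity matrices compared under the squared Frobenius norm, divided by the squared batch size) and a simple latent-space disparity. Both functions are hedged reconstructions: the names are hypothetical, and the squared-distance form of `latent_feature_loss` is an assumption rather than the exact expression used in this paper.

```python
import numpy as np


def similarity_matrix(features):
    """Row-normalized batch similarity matrix G, with G~ = Q Q^T and Q of shape (b, -1)."""
    q = np.asarray(features, dtype=float)
    q = q.reshape(q.shape[0], -1)
    g = q @ q.T
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / np.maximum(norms, 1e-12)


def similarity_preserving_loss(old_features, new_features):
    """Squared Frobenius distance between similarity matrices, scaled by 1 / b^2."""
    g_old = similarity_matrix(old_features)
    g_new = similarity_matrix(new_features)
    b = g_old.shape[0]
    return float(np.linalg.norm(g_old - g_new, ord="fro") ** 2 / b**2)


def latent_feature_loss(old_latent, new_latent):
    """Assumed form: mean squared L2 distance between latent vectors of both models."""
    diff = np.asarray(old_latent, dtype=float) - np.asarray(new_latent, dtype=float)
    diff = diff.reshape(diff.shape[0], -1)
    return float(np.sum(diff**2) / diff.shape[0])


rng = np.random.default_rng(0)
old = rng.normal(size=(4, 8))  # features of 4 old samples from the previous model
new = rng.normal(size=(4, 8))  # features of the same samples from the current model
print(similarity_preserving_loss(old, new), latent_feature_loss(old, new))
```

The similarity-preserving term is invariant to a global rescaling of the features because of the row normalization, whereas the latent-space term penalizes any change in the feature values themselves.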
The comparison of the loss functions is reported in Table 3 in terms of the IoU score, where we can see that the proposed framework achieves 2.83%, 2.16%, and 6.50% performance improvements over the second-best [spkd] on GDXray [mery2015gdxray], SIXray [miao2019sixray], and the combined dataset, respectively. These improvements stem from the synergy between the components of the proposed loss function, which not only retains the prior knowledge while learning new classes but also enables the network to analyze the mutual relationships between the knowledge representations of the old and the new instances via Bayesian inference, unlike its competitors, which mostly rely on the spatial [kdil] and contextual [spkd] differences between knowledge representations.
|Loss Functions||GDXray [mery2015gdxray]||SIXray [miao2019sixray]||Combined|
|MS RCNN [msrcnn]||0.7201||0.6484||0.5482|
|Mask RCNN [maskrcnn]||0.7098||0.6381||0.5243|
|MS RCNN [msrcnn]||0.8372||0.7867||0.7081|
|Mask RCNN [maskrcnn]||0.8302||0.7790||0.6879|
|MS RCNN [msrcnn]||0.8238||0.7613||0.6846|
|Mask RCNN [maskrcnn]||0.8183||0.7542||0.6653|
|MS RCNN [msrcnn]||0.8564||0.8153||0.7269|
|Mask RCNN [maskrcnn]||0.8439||0.8072||0.7154|
B. Evaluations on GDXray Dataset: The CIE-Net was trained for two iterations on GDXray as this dataset contains at most two overlapping instances of the same contraband item. Table 5 shows the performance comparison against the state-of-the-art schemes. We can observe that our framework achieves 4.08% and 28.39% better performance than the second-best HTC [htc] and the YOLACT [yolact], respectively, in terms of . Furthermore, it outperforms the second-best performing HTC [htc] by 2.13% in terms of . However, for , the best performance is achieved by the original TST [Hassan2020ACCV] (dubbed TST) from which the proposed framework lags by 11.53%. However, this is an unfair comparison since TST [Hassan2020ACCV] is trained conventionally using the large-scale well-annotated training data. In contrast, the proposed framework is trained incrementally on small-scale training batches. Moreover, under fair comparison with the incremental TST [Hassan2020ACCV] scheme, dubbed TST-, the proposed CIE-Net is leading by 3.63%.
Abbreviations: D: Dataset, G: GDXray [mery2015gdxray], S: SIXray [miao2019sixray], C: Combined Dataset, M: Methods, PF: Proposed Framework, MSR: Mask Scoring R-CNN [msrcnn], MR: Mask R-CNN [maskrcnn], and YT: YOLACT [yolact]. Moreover, ’*’ indicates unfair comparison.
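The box-level average precision at a fixed IoU threshold can be sketched as below. This is a simplified single-class, single-image version with greedy matching and all-point interpolation; benchmark toolkits such as the COCO evaluator aggregate over images and classes and use their own interpolation, so the numbers are not directly comparable. The function names and the `(x1, y1, x2, y2)` box format are assumptions. A mask-level variant would replace `box_iou` with a mask IoU.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, truths, iou_thr=0.5):
    """AP for one class; `detections` is a list of (score, box), `truths` a list of boxes."""
    if not detections or not truths:
        return 0.0
    matched, hits = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Greedily match each detection to the best still-unmatched ground truth.
        best_iou, best_j = 0.0, None
        for j, t in enumerate(truths):
            iou = box_iou(box, t)
            if j not in matched and iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1)
        else:
            hits.append(0)
    hits = np.array(hits)
    tp, fp = np.cumsum(hits), np.cumsum(1 - hits)
    recall = tp / len(truths)
    precision = tp / (tp + fp)
    # All-point interpolation: make precision non-increasing in recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 29))]
print(average_precision(detections, truths, 0.5), average_precision(detections, truths, 0.95))
```

Raising the threshold from 0.5 to 0.95 turns the third detection (IoU of 0.9 with its ground truth) into a false positive, which is why stricter thresholds reward tighter localization.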
Apart from this, the CIE-Net performance is further evaluated through the ROC curves, as shown in Figure 6 (a). These curves are generated at the pixel level, i.e., the pixels of each item (along with its instances) are treated as one and the rest of the pixels as zero (a typical binary classification). We can observe that the lowest AUC score achieved by the instance-aware CIE-Net is 0.9818, obtained for extracting razors. Due to space constraints, we report the detailed AUC score for each item (for all the datasets) within the source code repository.
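A pixel-level ROC curve of this kind can be computed from per-pixel scores and binary labels as sketched below. The helper names are assumptions, the input is assumed to be non-empty, and the AUC is obtained with the trapezoidal rule over the ROC points.

```python
import numpy as np


def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the pixel scores."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    order = np.argsort(-scores, kind="stable")
    scores, labels = scores[order], labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    # Keep the last index of every distinct score so tied pixels form one point.
    last = np.r_[np.nonzero(np.diff(scores))[0], scores.size - 1]
    tpr = np.r_[0.0, tp[last] / max(tp[-1], 1)]
    fpr = np.r_[0.0, fp[last] / max(fp[-1], 1)]
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Item pixels are labeled 1, all remaining pixels 0.
fpr, tpr = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc(fpr, tpr))
```

In the toy example, three of the four (item pixel, background pixel) pairs are ranked correctly, which corresponds to an AUC of 0.75.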
We also compared the performance of the proposed CIE-Net against the state-of-the-art semantic and instance segmentation frameworks. The results are reported in Table 4, where we can see that, on GDXray, CIE-Net achieves a 3.91% improvement in IoU over the DSRL [wang_cvpr2020] framework. Similarly, it outperforms HTC [htc] by 4.64%.
In addition to this, we fairly compared the proposed framework with TST [Hassan2020ACCV] by incrementally training the latter using the same experimental protocols and the proposed function. Here, the proposed framework achieves 11.29% superior results in terms of IoU, as evident from Table 4. The degradation in the TST’s performance stems from the fact that, during incremental training, it is more susceptible to forgetting the previously learned categories while adapting to new class representations since it employs a contour-driven strategy towards recognizing contraband items [Hassan2020ACCV].
Moreover, the performance of CIE-Net on the GDXray dataset is further analyzed through visual examples, as shown in Figure 5. The GDXray contains at most two overlapping instances of the same items, e.g., see Figure 5 (N, L, P, R, V, and X). Here, we can appreciate the extraction performance of CIE-Net by observing two extracted occluded knives in (L) and occluded shuriken in (N, P). We can also observe how accurately the low-intensity razors have been segmented by the CIE-Net in Figure 5 (N, P).
C. Evaluations on SIXray Dataset: For the SIXray dataset, the training was conducted for six iterations since there are at most six instances of the same item in this dataset. Table 5 shows the model’s comparison against the state-of-the-art instance segmentation algorithms. CIE-Net achieves 5.63% improvements in terms of against the second-best HTC [htc] and is 30.03% higher than the lowest-performing YOLACT [yolact]. It also achieves 5.31% superior results than the existing solutions in terms of . For , the CIE-Net comes third after the original CST [hassan2019] (dubbed CST) and the original TST [Hassan2020ACCV] (dubbed TST) scheme. However, this comparison is unfair, as the increased performance of CST [hassan2019] and TST [Hassan2020ACCV] here stems from the conventional fine-tuning strategy, which utilizes the whole training dataset. Under fair comparison with the incremental versions of TST [Hassan2020ACCV] and CST [hassan2019], the CIE-Net leads by 5.29% and 3.94%, respectively. Apart from this, the CIE-Net performance on SIXray is further evaluated through the ROC curves shown in Figure 6 (b). Here, we can observe that the proposed framework achieves the best AUC score for extracting the handguns. In addition to this, the segmentation performance of our framework can be analyzed through the mean IoU score in Table 4, showing the best score of 0.6883, leading the second-best HTC [htc] by 4.70%.
In Figure 7, we report the qualitative evaluation showcasing examples of successfully extracted overlapping items, e.g., two items (B, D, F, H) and three items (N, P, R) and up to six items (V, X). In these examples, we can appreciate the potential of the instance-aware CIE-Net in accurately recognizing the extremely merged items, e.g., an instance of guns in Figure 7 (J, V, and X).
D. Evaluations on Combined Dataset: We also evaluated the proposed framework on the combined dataset. The results are reported in Tables 4 and 5. From Table 5, we can observe that CIE-Net achieved the best performance of 0.7249, outperforming the second-best framework by 0.6345%. Furthermore, we can also notice the performance gain of 23.67% over YOLACT [yolact] in terms of . Moreover, in terms of recall and precision, the CIE-Net outperforms the second-best framework by 1.44% and 1.12%, respectively (see Table 4).
In addition to this, Figure 6 (c) further depicts the ROC performance of instance-aware CIE-Net for extracting contraband items. Here, we can see that the minimum score is obtained for knives and handguns (i.e., AUC of 0.9133 and 0.9212, respectively).
Figure 8 showcases some qualitative examples derived from the combined dataset, which illustrates the capacity of CIE-Net for extracting instances of overlapped items despite the large differences of the scan properties in GDXray and SIXray datasets. In Figure 8 (F), we can observe how effectively the razor is extracted in such a cluttered scenario. Figure 8 (N, R) depicts examples whereby our framework robustly differentiated between merged gun and chip instances. Figure 8 (T) depicts a reasonable extraction of the occluded knife. The performance of the CIE-Net can also be appreciated on more highly challenging scans such as (V), where a gun has been extracted from an extremely cluttered environment, (AB) in which two overlapping wrenches, two overlapping knives and a barely visible gun have been recognized, (AF) and (AJ) from which six extremely overlapping guns are accurately extracted. In Figure 8 (AF, AJ), in particular, we can appreciate the capacity of CIE-Net in accurately recognizing six instances of guns under extreme occlusion.
E. Comparison of Run-time Performance: Apart from evaluating the proposed scheme’s detection performance, we also analyzed its run-time performance and compared it with state-of-the-art methods. The comparison is reported in Table 6. Here, we can see that the proposed CIE-Net lags behind the state-of-the-art frameworks in terms of efficiency. This is due to the design choice of CIE-Net to focus more on accurately extracting the contraband items rather than achieving efficiency.
Consequently, the CIE-Net is slower than lightweight models like YOLOv3 [yolov3] and CST [hassan2019]. However, we also want to highlight that the proposed framework is an instance segmentation scheme (unlike the region-based YOLOv3 [yolov3] and contour-based CST [hassan2019] detectors), and it gives the best trade-off between contraband item extraction (see Table 5) and run-time performance (see Table 6).
|Method||Time Performance (sec)|
|Mask R-CNN [maskrcnn]||0.141|
|MS R-CNN [msrcnn]||0.156|
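Per-scan timings such as those in Table 6 are typically obtained by averaging wall-clock inference time over a set of scans after a few warm-up calls. The helper below is a minimal sketch of such a measurement; `predict` stands for any callable model wrapper and is hypothetical, and the measurement protocol actually used for Table 6 is not specified in the text.

```python
import statistics
import time


def mean_inference_time(predict, inputs, warmup=2):
    """Average wall-clock seconds per call of `predict` over `inputs`."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are excluded from the timing
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-in workload; a real measurement would call the segmentation model here.
print(mean_inference_time(lambda x: sum(range(10_000)), list(range(20))))
```

When the model runs on a GPU with asynchronous execution, the device has to be synchronized before the timer is stopped; otherwise the measured time only covers the kernel launch.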
F. Failure Cases: Although the proposed framework achieves remarkable performance in extracting overlapping contraband items (and their instances), as evident from Tables 4 and 5, there are some cases where the CIE-Net turns out to be limited, especially on the negative SIXray scans (see pairs (A, B), (C, D), (E, F), (K, L), and (M, N) in Figure 9). In these cases, it produces pixel-level false positives and false negatives due to the spatial and contextual similarity between the normal and suspicious baggage content within the X-ray scans. False positives are produced when background regions (within the candidate scan) are misclassified as threatening items by the proposed framework, as shown in Figure 9 (B, D, F, L, and N). False negatives are generated when the region of a contraband item is misclassified as background; for example, see the missed portion of the shuriken in Figure 9 (X). Apart from this, in some cluttered cases, the proposed CIE-Net produced over-segmentation results by confusing different instances of the suspicious items (as shown in Figure 9 (H, P, R, T, V, and Z)). All these types of failures were seen rarely during the experimentation, and they can be remedied through post-processing schemes such as blob filtering, region opening, and region filling.
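A possible post-processing chain of the kind mentioned above can be sketched with `scipy.ndimage`. The function name, the order of the three steps, and the `min_area` and `opening_size` defaults are assumptions chosen for illustration; the text does not specify how such a remedy would be configured.

```python
import numpy as np
from scipy import ndimage


def clean_mask(mask, min_area=20, opening_size=3):
    """Apply region opening, region filling, and blob filtering to a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    # Region opening: removes specks thinner than the structuring element.
    structure = np.ones((opening_size, opening_size), dtype=bool)
    opened = ndimage.binary_opening(mask, structure=structure)
    # Region filling: closes holes fully enclosed by a detected item.
    filled = ndimage.binary_fill_holes(opened)
    # Blob filtering: drops connected components smaller than `min_area` pixels.
    labeled, count = ndimage.label(filled)
    if count == 0:
        return filled
    areas = np.bincount(labeled.ravel())
    keep = areas >= min_area
    keep[0] = False  # label 0 is the background
    return keep[labeled]


mask = np.zeros((20, 20), dtype=bool)
mask[5:15, 5:15] = True    # a detected item ...
mask[10, 10] = False       # ... with a one-pixel false-negative hole
mask[0, 0] = True          # an isolated false-positive pixel
mask[16:19, 0:3] = True    # a small false-positive blob
print(clean_mask(mask).sum())
```

Such filters address isolated false positives and small holes; they would not by themselves separate instances that were confused during over-segmentation.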
An overview of the results in Tables 4 and 5 conveys that the proposed CIE-Net, employed within the incremental instance segmentation framework, shows a clear performance improvement over standard models such as Mask Scoring R-CNN [msrcnn], Mask R-CNN [maskrcnn], Hybrid Task Cascade [htc], and YOLACT [yolact]. It also exhibits competitive performance with models specifically designed for extracting threatening items from X-ray scans (such as CST [hassan2019], TST [Hassan2020ACCV], and TSD [hassan2020Sensors]).
The CIE-Net lags behind the fine-tuning-based contour instance segmentation framework TST [Hassan2020ACCV] in terms of . However, over the incremental version of TST [Hassan2020ACCV], it achieves IoU improvements of 11.29% on the GDXray dataset, 14.65% on the SIXray dataset, and 26.88% on the combined dataset (see Table 4). The TST [Hassan2020ACCV] possesses the capacity to eliminate unwanted baggage contours due to extensive fine-tuning on large-scale training datasets, resulting in better extraction of the threatening items. In return, the TST [Hassan2020ACCV] requires large-scale, well-annotated training data to achieve optimal performance. Indeed, when we trained the TST [Hassan2020ACCV] framework incrementally on small-scale training batches using the proposed loss function to compare it fairly with the CIE-Net, it produced degraded performance, as evidenced by the results mentioned above.
Compared to the meta-transfer learning-based baggage threat detector (TSD) [hassan2020Sensors], our framework achieves 15.62% higher performance in terms of on the SIXray dataset (Table 5). However, on GDXray [mery2015gdxray], it lags behind the TSD by 6.61%. The superiority of TSD [hassan2020Sensors] here stems from its capacity to generate the dual-energy tensors [hassan2020Sensors] that can effectively highlight the transitions of the contraband items from the grayscale X-ray scans. However, TSD is still sensitive to extremely cluttered baggage threats, as evident through its performance on the SIXray [miao2019sixray] dataset.
In Table 5, the CIE-Net lags behind the original CST framework [hassan2019] in terms of . However, this comparison is unfair, as the original CST [hassan2019] framework is non-incremental and uses more training data and computational resources to produce these results. Under fair comparison, the CIE-Net outperforms CST [hassan2019] by 4.52% and 3.94% on GDXray [mery2015gdxray] and SIXray [miao2019sixray], respectively, in (see Table 5). Also, the CST framework is highly parameter-dependent (i.e., it has to be tuned for each dataset independently). Therefore, it does not generalize well to scans and datasets with drastically varying properties. Furthermore, it lacks the inherent ability to generate item masks and falls under conventional object detectors.
With regard to run-time performance, the CIE-Net is about two times faster than several instance segmentation models like MS R-CNN [msrcnn], Mask R-CNN [maskrcnn], and HTC [htc]. It also showed a modest performance compared to YOLOv3 [yolov3], CST [hassan2019], RetinaNet [retinanet], and YOLACT [yolact]. Nonetheless, looking at both the accuracy figures in Tables 4 and 5 and the efficiency figures in Table 6, we can assert that the CIE-Net realizes the best trade-off between time and performance. It is also important to point out that the CIE-Net model’s current design is mainly driven by accurately recognizing the cluttered and overlapping contraband items rather than achieving efficiency. However, we envisage different measures to enhance this aspect in the future. A first remedy can be replacing the conventional convolutional blocks with residual-driven atrous convolutions (with variable dilation factors) [ragnetv2, Wang2018WACV], which should significantly reduce the number of trainable parameters and thus improve the overall run-time performance. Furthermore, we can generate a lightweight version of the CIE-Net by employing a switching mechanism [chen2019Access] to process only positive regions showcasing contraband items and their instances while ignoring the negative regions. In addition to this, we also envisage employing multi-task attention networks [wang2020MAN] and adversarial domain adaptation [wang2019ADA] schemes as future work to further improve the threat detection performance of the proposed framework.
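The parameter argument behind atrous convolutions can be illustrated with simple arithmetic: a k x k kernel with dilation d spans k + (k - 1)(d - 1) pixels per side while keeping only k x k weights per channel pair. The sketch below compares a dilated 3 x 3 kernel against a dense kernel of the same spatial extent; the 64-channel layer is a hypothetical example, not a CIE-Net layer, and bias terms are ignored.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a dense 2-D convolution layer (bias excluded)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical layer width, chosen only for illustration
for dilation in (1, 2, 4):
    extent = effective_kernel_size(3, dilation)
    atrous = conv_weights(3, channels, channels)
    dense = conv_weights(extent, channels, channels)
    print(f"dilation={dilation}: extent={extent}x{extent}, "
          f"atrous weights={atrous}, dense weights of same extent={dense}")
```

The saving applies when dilation replaces larger or stacked dense kernels used to reach the same receptive field; how much of it carries over to the full network depends on design details that are left to future work here.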
This paper presents a novel instance segmentation framework that utilizes incremental learning and a conventional encoder-decoder architecture to extract and recognize heavily cluttered, occluded, and overlapping contraband items from multi-vendor baggage X-ray scans. Since the proposed framework is powered by incremental learning, it benefits from small-scale training data and bypasses laborious ground-truth generation mechanisms to make semantic segmentation networks instance-aware. The proposed framework has a built-in capacity to resist catastrophic forgetting through the proposed instance segmentation loss function, which incorporates a mutual information loss embedding the complex inter-dependencies between old knowledge and newly learned information through Bayesian inference. The proposed scheme is unique in that it modifies conventional semantic segmentation networks to perform instance-aware segmentation via incremental learning. Trained on two different datasets and their combination, the proposed framework produces the best results compared to existing state-of-the-art solutions in multiple metrics, evidencing its ability to effectively recognize cluttered and overlapping objects through instance segmentation rather than through object detectors. To the best of our knowledge, it is the only framework to date that can accurately extract overlapping baggage items from multi-vendor grayscale and colored X-ray images (in an incremental fashion) despite the significant variations in the scan features of both datasets. In addition to the tasks envisaged in the Discussion section to optimize the model design, future work will investigate the challenging problem of detecting 3D-printed items (e.g., guns). These items, made from organic matter, have low visibility in X-ray scans, and devising proper models for this category of objects is a direction for our future work.