
Unsupervised Anomaly Instance Segmentation for Baggage Threat Recognition

Identifying potential threats concealed within the baggage is of prime concern for the security staff. Many researchers have developed frameworks that can detect baggage threats from X-ray scans. However, to the best of our knowledge, all of these frameworks require extensive training on large-scale and well-annotated datasets, which are hard to procure in the real world. This paper presents a novel unsupervised anomaly instance segmentation framework that recognizes baggage threats, in X-ray scans, as anomalies without requiring any ground truth labels. Furthermore, thanks to its stylization capacity, the framework is trained only once, and at the inference stage, it detects and extracts contraband items regardless of their scanner specifications. Our one-staged approach initially learns to reconstruct normal baggage content via an encoder-decoder network utilizing a proposed stylization loss function. The model subsequently identifies the abnormal regions by analyzing the disparities within the original and the reconstructed scans. The anomalous regions are then clustered and post-processed to fit a bounding box for their localization. In addition, an optional classifier can also be appended with the proposed framework to recognize the categories of these extracted anomalies. A thorough evaluation of the proposed system on four public baggage X-ray datasets, without any re-training, demonstrates that it achieves competitive performance as compared to the conventional fully supervised methods (i.e., the mean average precision score of 0.7941 on SIXray, 0.8591 on GDXray, 0.7483 on OPIXray, and 0.5439 on COMPASS-XP dataset) while outperforming state-of-the-art semi-supervised and unsupervised baggage threat detection frameworks by 67.37 32.32 and COMPASS-XP datasets, respectively.





1 Introduction

Recognizing contraband items concealed within baggage is a prime security concern, as such items endanger public safety. According to a recent report, approximately 1.5 million passengers in the United States are searched every day for weapons and other dangerous items (Council, 1996). Manual detection of these items is a tiring task and is also subject to human errors caused by fatiguing work schedules, the amount of baggage clutter, aviation traffic load, or simply a lack of experience in screening contraband data. To overcome this, many researchers have developed automated frameworks (Gaus et al., 2019a; Akçay et al., 2018a; Turcsany et al., 2013) to screen baggage at airports, malls, and cargoes. However, the majority of these frameworks are developed using conventional RGB detectors, which have limited performance in localizing occluded suspicious objects due to their region-based proposal generation mechanisms (Hassan et al., 2020a) and the inherent differences between RGB and X-ray imagery (Akçay et al., 2018b). To handle this, researchers have recently proposed clutter-aware solutions that can recognize concealed and occluded baggage threats regardless of the scanner specifications (Hassan and Werghi, 2020; Hassan et al., 2020a, b), acquisition noise (Tao et al., 2021), and the imbalanced nature of the contraband data (Miao et al., 2019). In addition, the majority of these methods have been quantitatively evaluated for detecting threatening items under different levels of occlusion on publicly available datasets (Wei et al., 2020; Hassan et al., 2020b; Hassan and Werghi, 2020). Researchers have also utilized 3D detectors to overcome occlusion while screening baggage threats from volumetric computed tomography (CT) imagery (Wang and Breckon, 2020; Wang et al., 2020a). Despite these recent advancements, many state-of-the-art baggage threat detection frameworks are still based on conventional supervised learning schemes, which require extensive ground truth labels to ensure robust detection performance. Some researchers have also presented semi-supervised (Akçay et al., 2018a) and unsupervised anomaly detection (Akçay et al., 2019) via adversarial learning. However, such schemes require an explicit re-training process for recognizing baggage threats from different datasets. Also, they are driven by scan-level analysis for recognizing the anomalous baggage threats and cannot extract and localize the threatening items within the baggage X-ray scans (Akçay et al., 2018a, 2019).

To address the above limitations, we present in this work a novel unsupervised anomaly instance segmentation framework. The proposed approach requires only one-time training, without any ground truth labels, to recognize baggage threats.

2 Related Work

Early baggage threat detection frameworks were based on conventional machine learning employing hand-engineered features (Bastan et al., 2011; Riffo and Mery, 2015). Then deep learning methods took over, proposing supervised and unsupervised strategies for recognizing the suspicious baggage content. Here, the recent approaches have also addressed the imbalanced (Miao et al., 2019) and cluttered (Hassan and Werghi, 2020; Wei et al., 2020) nature of the threatening items in X-ray scans, which are often observed in the real world at airports, malls, and transmission cargoes. This section first gives a brief overview of some of the conventional baggage threat detection schemes, and then it sheds light on the state-of-the-art deep learning-based approaches. For an exhaustive survey, we refer the reader to the work of (Akçay and Breckon, 2020; Mery et al., 2017, 2020).

2.1 Conventional Machine Learning Methods

Earlier methods for screening baggage threats are based on handcrafted features (Megherbi et al., 2012; Wang et al., 2020b) and descriptors such as SIFT (Mery et al., 2016; Zhang et al., 2014), SURF (Bastan et al., 2011), and FAST-SURF (Kundegorski et al., 2016), employed with Support Vector Machines (SVM) (Turcsany et al., 2013; Kundegorski et al., 2016), Bag of Words (BoW), K-Nearest Neighbors (Riffo and Mery, 2015), and Random Forest (Jaccard et al., 2014) classifiers. Many researchers have also proposed supervised segmentation (Heitz and Chechik, 2010) and detection (Bastan, 2015) schemes for recognizing prohibited items via high, low, and multi-view X-ray imagery (Bastan, 2015). Similarly, Riffo et al. (Riffo and Mery, 2015) proposed an Adapted Implicit Shape Model (AISM) for recognizing different contraband items from the publicly available GDXray (Mery et al., 2015) dataset. In another approach, they developed structure-from-motion-based 3D feature descriptors for recognizing the threatening items (Mery et al., 2016).

Although traditional machine learning methods can mass-screen the baggage content using security X-ray scans, they are only applicable to limited experimental settings and do not generalize well to multiple scanner specifications.

2.2 Deep Learning Methods

Deep learning has greatly enhanced the recognition capabilities of threat screening frameworks, such that they can now identify suspicious objects within grayscale or colored baggage X-ray scans regardless of the scanner properties. Here, we categorize the deep learning-based threat detection frameworks into supervised and unsupervised approaches.

2.2.1 Supervised Approaches

Supervised approaches for recognizing baggage threats employed classification (Akçay et al., 2016; Jaccard et al., 2017; Zhao et al., 2018; Miao et al., 2019), detection (Liu et al., 2018; Xu et al., 2018; Hassan et al., 2020a, b), and segmentation (Hassan and Werghi, 2020; An et al., 2019; Gaus et al., 2019b) strategies. Akçay et al. (Akçay et al., 2016) introduced GoogleNet (Szegedy et al., 2014), in a transfer learning mode, to detect threatening objects within baggage X-ray imagery. Jaccard et al. (Jaccard et al., 2017) used VGG-19 (Simonyan and Zisserman, 2015) on log-transformed scans to detect suspicious objects. Zhao et al. (Zhao et al., 2018) initiated the use of GANs to enhance the classification performance of the customized networks towards baggage threat detection. Apart from this, researchers have also used two-staged (Liu et al., 2018) and one-staged (Gaus et al., 2019b) detectors along with attention mechanisms (Xu et al., 2018) to recognize and localize threatening objects. Moreover, Gaus et al. (Gaus et al., 2019a) measured the transferability of Faster R-CNN (Ren et al., 2016), Mask R-CNN (He et al., 2017), and RetinaNet (Lin et al., 2017) between various X-ray scanners to detect the contraband data. Motivated by the class imbalance between normal and suspicious objects, Miao et al. (Miao et al., 2019) presented the class-balanced hierarchical refinement (CHR) model, proposing architecture-oriented mitigation of the class imbalance problem. Other approaches proposed contour-driven object detectors such as Cascaded Structure Tensors (CST) (Hassan et al., 2020a) and Dual-Tensor Shot Detector (DTSD) (Hassan et al., 2020b). Similarly, Wei et al. (Wei et al., 2020) developed the De-occlusion Attention Module (DOAM), a plug-and-play module that can be integrated with conventional object detectors to increase their capacity in screening occluded baggage threats. For the segmentation approaches, An et al. (An et al., 2019) employed encoder-decoder models leveraging dual attention mechanisms, while (Hassan and Werghi, 2020) proposed the first contour instance segmentation framework exclusively designed to extract cluttered contraband data from the security X-ray scans.

2.2.2 Unsupervised Approaches

Researchers have also developed semi-supervised and unsupervised methods for recognizing suspicious items. Akçay et al. pioneered this by developing GANomaly (Akçay et al., 2018a), an encoder-decoder-encoder-driven adversarial framework trained on the normal security X-ray scans. After training, GANomaly (Akçay et al., 2018a) recognizes the baggage threats, as anomalies, from the abnormal test scans through its in-built discriminator. In another approach, Skip-GANomaly is proposed (Akçay et al., 2019) as an improved version of GANomaly utilizing encoder-decoders with skip-connections and adversarial learning to detect anomalous baggage threats with a significantly lesser amount of computational resources.

To the best of our knowledge, the majority of the existing frameworks are based on supervised learning, requiring an extensive amount of well-annotated training data to perform well at the inference stage (Hassan and Werghi, 2020; Miao et al., 2019; Wei et al., 2020; Gaus et al., 2019b). However, procuring a large-scale and well-annotated dataset is often impractical, especially for items that are rarely observed during aviation screening. Furthermore, re-training or even fine-tuning the deployed framework to identify a new type of threat is an inefficient process and could lead to compromised performance (Gaus et al., 2019a; Hassan et al., 2020b). Despite recent efforts leveraging meta-transfer-learning (Sun et al., 2019) to alleviate the scanner differences and increase the generalizability of baggage threat detectors (Hassan et al., 2020b), these frameworks still require fine-tuning on different datasets to achieve good performance. Although researchers have utilized semi-supervised and unsupervised adversarial learning to recognize suspicious baggage items (Akçay et al., 2018a, 2019), these frameworks still require explicit re-training on the normal data of each dataset to identify baggage threats. Also, these methods can recognize suspicious items but are unable to localize them through bounding boxes or masks. Hence, to the best of our knowledge, this paper presents the first unsupervised anomaly instance segmentation framework exclusively designed to recognize and localize illegal baggage items from security X-ray scans.

3 Contributions

This paper presents a novel unsupervised anomaly instance segmentation framework to detect and extract baggage threats as anomalies. To the best of our knowledge, this is the first approach towards unsupervised anomaly instance segmentation in the baggage threat detection domain, exhibiting the following distinctive features:

  • The proposed framework is the first of its kind that is trained only once on the normal baggage X-ray scans. Afterward, it does not require re-training to eliminate the scanner differences to extract anomalous regions (across different datasets).

  • The proposed framework is built upon a novel Gaussian-Weighted Fourier Stylization (GW-FS) scheme that drastically removes the scanner variations to achieve high generalizability towards extracting the suspicious baggage items as anomalies from the baggage X-ray scans.

  • A thorough validation on four public X-ray datasets showcases that the proposed framework outperforms its unsupervised and semi-supervised competitors while achieving a competitive performance with the other fully supervised frameworks.

The rest of the paper is organized as follows: Section 4 presents the proposed framework, Section 5 showcases the experimental setup, Section 6 presents the detailed evaluation results, and Section 7 discusses the prospects of the proposed framework and concludes the paper.

Figure 1: Block diagram of the proposed framework. In the one-time training stage, the X-ray scans (containing the normal baggage data) are decomposed into fixed-size non-overlapping patches, which are passed to the proposed encoder-decoder network that learns to reconstruct them. Afterward, during the inference stage, the trained model is fed with the abnormal scans. After reconstructing them, the proposed framework exploits the original and reconstructed scans’ disparities to recognize anomalous regions. Furthermore, the proposed GW-FS scheme removes the scanner-specific appearance enabling the proposed framework to identify baggage threats irrespective of the scanner specifications or the dataset.

4 Proposed Approach

The block diagram of the proposed framework is shown in Figure 1. We can see here that the encoder-decoder network is trained only once on the X-ray scans (containing the normal baggage data). During this one-time training, the network is constrained via a custom stylization loss function ($\mathcal{L}_s$) to reconstruct the baggage X-ray scans accurately. The reconstruction is performed patch-wise, and to ensure that the network maintains the spatial characteristics of the original input scan, we perturb it, in each patch, with randomized zero-mean Gaussian noise. We empirically found that adding Gaussian noise within each patch makes $\mathcal{L}_s$ penalize the encoder-decoder network more strongly, pushing it towards accurate reconstruction.

Moreover, $\mathcal{L}_s$ minimizes not only the $\ell_2$ loss but also the differences in the feature representations obtained from the fixed backbone model. These two mechanisms enhance the encoder-decoder model’s capacity for reconstructing the normal scans, leading to the generation of distinct disparity maps (for the abnormal scans) at the inference stage. The disparity maps are then clustered and post-processed to detect and locate the anomalous regions. Moreover, an optional lightweight classifier can be appended to the proposed framework to recognize the categories of the localized anomalous items. It should be noted here that before feeding the test scan into the proposed model, we first stylize it using the proposed Gaussian-Weighted Fourier Stylization (GW-FS). This stylization removes the scanner-specific appearances (even the drastic ones), enabling the encoder-decoder network to reconstruct the test scan patches accurately regardless of the scanner model. The detailed description of each module is presented below:
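The patch-wise preparation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size, the noise standard deviation, the cropping of ragged borders, and the clipping to [0, 1] are all assumptions made for illustration, and the function names `to_patches` and `add_gaussian_noise` are hypothetical.

```python
import numpy as np

def to_patches(scan, patch):
    """Split an (H, W, C) scan into non-overlapping (patch, patch, C) tiles."""
    h, w, c = scan.shape
    # Assumption: crop the borders so both dimensions divide evenly.
    h, w = h - h % patch, w - w % patch
    tiles = scan[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (rows, cols, patch, patch, C), then flatten the tile grid.
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

def add_gaussian_noise(patches, std, rng):
    """Perturb every patch with zero-mean Gaussian noise, clipped to [0, 1]."""
    noisy = patches + rng.normal(0.0, std, size=patches.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
scan = rng.random((64, 96, 3))        # stand-in for a normal X-ray scan
patches = to_patches(scan, patch=32)  # 2 x 3 grid of tiles
noisy = add_gaussian_noise(patches, std=0.05, rng=rng)
```

In a training loop of this kind, `noisy` would be the network input and `patches` the reconstruction target.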

4.1 Gaussian-Weighted Fourier Stylization

To perform stylization, we propose a Gaussian-Weighted Fourier Stylization (GW-FS) scheme. GW-FS is inspired by Fourier Domain Adaptation (FDA) (Yang and Soatto, 2020), which computes the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965) of the reference and target scans and copies the frequency samples within the magnitude spectrum of the reference scan to the target scan spectrum (inside a rectangular window $w_\beta$) without altering the phase spectrum (Yang and Soatto, 2020). However, in our approach, we perform stylization by mixing the low-frequency components of the candidate scan’s magnitude response with the reference scan’s magnitude spectrum by fitting a Gaussian window $g_\sigma$ (parameterized by the variance $\sigma$). Let $x \in \mathbb{R}^{M \times N}$ be the input scan (such that $M$ denotes its height and $N$ denotes its width), and $r \in \mathbb{R}^{M \times N}$ be the reference image. Taking the FFT yields:

$$X(u,v) = \mathcal{F}\{x(m,n)\,(-1)^{m+n}\} \tag{1}$$

where $X$ represents the complex frequency spectrum of $x$, $\mathcal{F}$ denotes the FFT operator, and the factor $(-1)^{m+n}$ shifts the image spectrum by $M/2$ and $N/2$ to center-align it. We apply the same transformation to the reference scan $r$, yielding $R$. Afterward, the magnitude spectrum of $R$, i.e., $|R|$, is multiplied by the Gaussian window $g_\sigma$ to extract the low-frequency spectral components for stylization:

$$S(u,v) = |R(u,v)|\, g_\sigma(u,v) \tag{2}$$

$S$ denotes the obtained stylization mask that is added to $|X|$ to produce $|X_s|$. Afterward, we apply the inverse FFT ($\mathcal{F}^{-1}$) of $X_s$, which retains the phase of $X$, to obtain the stylized scan $x_s$, as expressed below:

$$x_s(m,n) = \mathcal{F}^{-1}\{|X_s(u,v)|\, e^{j\angle X(u,v)}\}\,(-1)^{m+n} \tag{3}$$

We can observe here that using a Gaussian window instead of the rectangular window (employed in FDA) results in fewer ripples within the stop-band, thus ensuring a much better stylization, as evidenced in Figure 2. Also, $w_\beta$ (in FDA) does not blend the frequency spectrum between $X$ and $R$. It just replaces samples within $|X|$ with those of $|R|$ (constrained by $w_\beta$), where the length of $w_\beta$ is governed by the factor $\beta$ (Yang and Soatto, 2020). Therefore, stylization through FDA (Yang and Soatto, 2020) produces additional noisy artifacts (as shown in Figure 2), and tuning them away for each training-testing combination is a tedious task. The GW-FS scheme addresses this by first weighting the frequency samples of $|R|$ by $g_\sigma$ (through Eq. 2) before merging them with the target spectrum $|X|$.
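The following NumPy sketch shows one way the Gaussian-weighted mixing of magnitude spectra could be realized for a single-channel scan. It is an illustration under stated assumptions rather than the released implementation: the unit-peak Gaussian form, the plain addition of the weighted reference magnitude to the input magnitude, and the absence of any intensity renormalization are assumptions, and `gaussian_window` and `gw_fs` are hypothetical names. `np.fft.fftshift` is used to center the spectrum, which plays the same role as multiplying the image by an alternating sign pattern before the FFT.

```python
import numpy as np

def gaussian_window(shape, sigma):
    """Centered 2-D Gaussian window with peak value 1 (assumed form)."""
    m, n = shape
    u = np.arange(m) - m // 2
    v = np.arange(n) - n // 2
    return np.exp(-(u[:, None] ** 2 + v[None, :] ** 2) / (2.0 * sigma ** 2))

def gw_fs(x, r, sigma):
    """Stylize grayscale scan x towards reference r in the Fourier domain."""
    X = np.fft.fftshift(np.fft.fft2(x))  # centered spectrum of the input
    R = np.fft.fftshift(np.fft.fft2(r))  # centered spectrum of the reference
    # Low-frequency part of the reference magnitude (the stylization mask).
    S = np.abs(R) * gaussian_window(x.shape, sigma)
    # Merge the mask with the input magnitude and keep the input phase.
    Xs = (np.abs(X) + S) * np.exp(1j * np.angle(X))
    return np.real(np.fft.ifft2(np.fft.ifftshift(Xs)))

rng = np.random.default_rng(0)
x = rng.random((64, 64))  # stand-in for the input scan
r = rng.random((64, 64))  # stand-in for the reference scan
x_s = gw_fs(x, r, sigma=5.0)
```

Because the window equals 1 at the zero frequency, this additive variant raises the mean intensity of the output by the mean of the reference; a practical pipeline would likely rescale the result to the valid intensity range.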

Figure 2: (A) Original grayscale scan, (B) stylization through GW-FS scheme with , (C) stylization through FDA (Yang and Soatto, 2020) with . For fairness, the value of scaling factor ( and ) in both schemes are chosen the same.

4.2 Scans Reconstruction

After stylization, the test scan is passed to the asymmetric, one-time-trained encoder-decoder model, which generates the reconstructed image ($\hat{x}_s$) patch-wise. Afterward, $\hat{x}_s$ is utilized in developing the disparity maps for anomaly instance segmentation. The proposed encoder-decoder network is a lightweight model containing one input layer, seven convolution layers with ReLU activations, three max-pooling layers, and three up-sampling layers. Furthermore, it has around 4,923 trainable parameters. For more architectural details about the proposed encoder-decoder architecture, we refer the reader to the source code repository (footnote 1).

Footnote 1: The source code of the proposed framework, along with the complete documentation, is available at

Moreover, to train the proposed encoder-decoder network, we used the proposed stylization loss function ($\mathcal{L}_s$) to constrain it, at training time, to recognize shape, context, and edge feature appearances from the latent space vectors. The $\mathcal{L}_s$ loss function is discussed in the following section.

4.2.1 The Loss Function

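To train the proposed encoder-decoder model, we propose a novel stylization loss ($\mathcal{L}_s$), which is a linear combination of the feature reconstruction loss function ($\mathcal{L}_f$) (Johnson et al., 2016) and the conventional $\ell_2$ loss function ($\mathcal{L}_2$):

$$\mathcal{L}_s = \lambda_1 \mathcal{L}_f + \lambda_2 \mathcal{L}_2 \tag{4}$$

where $\lambda_1$ and $\lambda_2$ represent the loss weights. $\mathcal{L}_f$ is generated from the feature representations obtained from the frozen pre-trained backbones, and $\mathcal{L}_2$ is obtained using the pixel-wise difference between the training scan ($x$) and the reconstructed version ($\hat{x}$), as expressed in Eq. 5 and 6:

$$\mathcal{L}_f = \frac{1}{b_s}\sum_{i=1}^{b_s} \left\| \phi(x_i) - \phi(\hat{x}_i) \right\|_2^2 \tag{5}$$

$$\mathcal{L}_2 = \frac{1}{b_s}\sum_{i=1}^{b_s} \left\| x_i - \hat{x}_i \right\|_2^2 \tag{6}$$

where $b_s$ is the batch size, and $\phi(\cdot)$ denotes the feature representations obtained from the frozen backbone model. The loss weights $\lambda_1$ and $\lambda_2$ are empirically determined to be 0.7 and 0.3, respectively.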
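A small NumPy sketch can make the weighted combination concrete. It rests on several assumptions: squared Euclidean distances are used for both terms, the weight 0.7 is assigned to the feature term and 0.3 to the pixel term (following the order in which the two losses are introduced), and `toy_features` is a hypothetical stand-in for the frozen pre-trained backbone, which in practice would be a network such as VGG-16.

```python
import numpy as np

def toy_features(batch):
    """Hypothetical stand-in for a frozen backbone: 2x2 average pooling."""
    b, h, w, c = batch.shape
    return batch.reshape(b, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def stylization_loss(x, x_hat, features, w_feat=0.7, w_pix=0.3):
    """Weighted sum of a feature reconstruction term and a pixel-wise term."""
    b = x.shape[0]
    # Squared distance between feature maps, averaged over the batch.
    feat = np.sum((features(x) - features(x_hat)) ** 2) / b
    # Squared pixel-wise distance, averaged over the batch.
    pix = np.sum((x - x_hat) ** 2) / b
    return w_feat * feat + w_pix * pix

x = np.zeros((2, 4, 4, 1))     # toy batch of "training scans"
x_hat = np.ones((2, 4, 4, 1))  # toy batch of "reconstructions"
loss = stylization_loss(x, x_hat, toy_features)
```

The feature extractor is passed in as an argument so that the frozen backbone can be swapped without touching the loss itself.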

4.2.2 Unsupervised Anomaly Instance Segmentation

The proposed unsupervised anomaly instance segmentation scheme is shown in Figure 3. At the inference stage, the input scan is first stylized and then reconstructed patch-wise through the trained encoder-decoder model. Then, the reconstructed scan ($\hat{x}_s$) is subtracted from the stylized scan ($x_s$) to produce the disparity map. The disparity map is utilized in extracting the suspicious items’ instances by clustering the color distribution between anomalous and normal baggage content. The detailed description of disparity maps and the color distribution-based clustering scheme is presented in the subsequent sections.

Figure 3: Unsupervised anomaly instance segmentation framework. Here, we train an encoder-decoder model only once to reconstruct normal baggage X-ray scans. During the inference stage, the trained model recognizes the anomalous regions by exploiting the actual and reconstructed scans’ disparities. To eliminate the scanner variations, we propose a GW-FS scheme that mixes the frequency representations within the reference scan and the input scans.

Disparity Maps: The disparity maps reveal the deviation of the anomalous items with respect to the normal baggage content by subtracting $\hat{x}_s$ from $x_s$. It should be noted here that a meaningful interpretation of these disparity maps depends on how accurately the encoder-decoder model reconstructs the normal areas within the abnormal scans. For example, in Figure 3, we can see how effectively the encoder-decoder has reconstructed the abnormal scan (containing knives). However, there are still some noticeable intensity variations between $x_s$ and $\hat{x}_s$, which result in the blue, pink, and cyan noisy regions within the disparity maps. Having a three-channeled representation here allows better discrimination between anomalous regions (corresponding to suspicious items) and the rest of the baggage content as compared to single-channeled representations. This aspect is further evidenced in Figure 4. Here, for each dataset, red points indicate anomalies, whereas blue points indicate normal pixels. We can observe that the distributions of anomalous and normal baggage content are well-separated. Therefore, using an adequate clustering scheme, the anomalous regions representing the suspicious items can be extracted.

(A) (B) (C) (D)

Figure 4: Color distributions between anomalous and non-anomalous regions in (A) SIXray (Miao et al., 2019) dataset, (B) GDXray (Mery et al., 2015) dataset, (C) OPIXray (Wei et al., 2020) dataset, and (D) COMPASS-XP (Griffin et al., 2019) dataset.

Color Clustering: In order to extract the instances of suspicious (anomalous) items, the disparity maps are clustered through K-means clustering (parameterized by the number of clusters $k$). Here, $k$ varies for each dataset depending upon the normal and anomalous items contained within the respective scans. Through empirical analysis, we determined that the optimal choice of $k$ for the SIXray (Miao et al., 2019) and OPIXray (Wei et al., 2020) datasets is four (i.e., $k = 4$). The optimal $k$ for the GDXray (Mery et al., 2015) and COMPASS-XP (Griffin et al., 2019) datasets was determined in the same manner. Moreover, the noisy clusters (obtained after K-means clustering) are automatically removed through morphological post-processing. Afterward, each isolated instance of the anomalous region is identified through connected-component analysis, and then each item instance is localized by fitting a bounding box generated using the minimum and maximum of its mask in both image dimensions (see Figure 3).
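The pipeline from disparity map to bounding boxes can be sketched end to end in NumPy. This is a simplified illustration, not the paper's code: the absolute difference used as the disparity, the brightness-based initialization of K-means, the rule that the cluster with the largest center is the anomalous one, and the minimum-area filter standing in for morphological post-processing are all assumptions, and every function name below is hypothetical.

```python
import numpy as np
from collections import deque

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centers start at evenly spaced brightness ranks."""
    order = np.argsort(points.sum(axis=1))
    centers = points[order[np.linspace(0, len(points) - 1, k).astype(int)]]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def connected_boxes(mask, min_area=1):
    """4-connected components of a boolean mask as (r0, c0, r1, c1) boxes."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r, c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:  # crude stand-in for morphological cleanup
                ys, xs = zip(*pixels)
                boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes

def segment_anomalies(stylized, reconstructed, k, min_area=4):
    """Cluster the disparity colors and box each anomalous instance."""
    disparity = np.abs(stylized - reconstructed)  # (H, W, 3)
    labels, centers = kmeans(disparity.reshape(-1, 3), k)
    anomalous = np.linalg.norm(centers, axis=1).argmax()  # assumed selection rule
    mask = (labels == anomalous).reshape(disparity.shape[:2])
    return connected_boxes(mask, min_area)

reconstructed = np.zeros((20, 20, 3))
stylized = reconstructed.copy()
stylized[2:6, 3:8] = 1.0       # first synthetic "anomaly"
stylized[12:16, 10:15] = 1.0   # second synthetic "anomaly"
boxes = segment_anomalies(stylized, reconstructed, k=2)
```

On real disparity maps the choice of which cluster is anomalous is less clear-cut than in this toy example, which is why the selection rule is exposed as a single line that can be replaced.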

Figure 5: Recognition of anomalous items’ categories using the proposed classification model. The model is trained only once on the suspicious items patches obtained from all four datasets.

4.3 Recognition of Anomalous Items

After extracting the anomalous items, we identify their categories (such as gun, knife, razor, shuriken, wrench, pliers, scissors, hammer, and axe) using a proposed lightweight classification model (see Figure 5). We want to highlight here that the recognition of anomalous items is just an optional module. It neither relates to our actual unsupervised anomaly instance segmentation approach, nor is it mandatory in the proposed framework.

The patches of the anomalous items are cropped from the scan using their bounding boxes, and the classifier is trained only once to recognize their categories. The data used to train this model consists of patches (representing each suspicious item category) taken from all four datasets.

Architecturally, the classification model contains three convolution layers, three ReLUs, three max-pooling layers, one fully connected layer, and one softmax layer, as depicted in Figure 5. The total number of parameters within the proposed model is 3.2M. Note that any pre-trained model can be used here for classifying the suspicious object categories. However, our proposed model exhibits two advantages: 1) it is lightweight compared to the popular pre-trained models, and 2) it achieves a good trade-off between accuracy and the number of hyper-parameters (as evidenced in Table 3). The classification model is optimized via the cross-entropy loss function ($\mathcal{L}_{ce}$), as expressed below:

$$\mathcal{L}_{ce} = -\frac{1}{b_s}\sum_{i=1}^{b_s}\sum_{j=1}^{c} t_{i,j} \log\big(p(t_{i,j})\big) \tag{7}$$

where $b_s$ is the batch size, $c$ denotes the number of suspicious item categories, $t_{i,j}$ denotes the $i$-th training example for the $j$-th class, and $p(t_{i,j})$ denotes the softmax probability of the $i$-th training sample belonging to the $j$-th class. Moreover, the training details of this optional classification model are presented in Section 5.2.
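As a brief illustration of the softmax cross-entropy objective used for the optional classifier, the NumPy sketch below evaluates it on a toy batch. The one-hot encoding of labels, the batch averaging, and the small `eps` added for numerical safety are assumptions of this sketch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot, eps=1e-12):
    """Mean cross-entropy between one-hot labels and softmax probabilities."""
    p = softmax(logits)
    return float(-np.mean(np.sum(one_hot * np.log(p + eps), axis=1)))

logits = np.zeros((2, 4))               # two samples, four item categories
labels = np.eye(4)[[0, 3]]              # one-hot labels for classes 0 and 3
loss = cross_entropy(logits, labels)    # uniform prediction gives log(4)
```

With uninformative (all-zero) logits every class receives probability 1/4, so the loss equals log(4), which is a convenient sanity check before training.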

5 Experimental Setup

This section presents the details about the datasets, the training protocol, as well as the evaluation metrics:

5.1 Datasets

The proposed framework is thoroughly evaluated on all the four public X-ray datasets used in baggage threat recognition benchmarking. The detailed description of these datasets is as follows:

5.1.1 SIXray

SIXray (Miao et al., 2019) is the largest and most challenging baggage X-ray dataset to date. It contains 1,050,302 negative X-ray scans containing only the normal baggage content, and 8,929 positive scans having one or more suspicious items in it such as guns, knives, wrenches, pliers, scissors, and hammers. Furthermore, the dataset also contains detailed box-level annotations to train and evaluate the baggage threat detection frameworks. Moreover, SIXray (Miao et al., 2019) is primarily designed to test the capacity of autonomous frameworks to screen contraband items in a highly imbalanced scenario.

5.1.2 GDXray

The GDXray (Mery et al., 2015) dataset contains 19,407 high-resolution grayscale X-ray scans divided into five groups, namely, welds, casting, baggage, nature, and settings. The only relevant category for the proposed study is the baggage category, which contains 8,150 X-ray scans along with detailed markings. Moreover, the scans within GDXray (Mery et al., 2015) contain suspicious items such as razors, handguns, knives, and shuriken (Mery et al., 2015).

5.1.3 OPIXray

OPIXray (Wei et al., 2020) is the most recent publicly released baggage X-ray dataset. It contains a total of 8,885 colored X-ray scans containing suspicious items such as folding knives, straight knives, utility knives, multi-tool knives, and scissors (Wei et al., 2020). Furthermore, OPIXray (Wei et al., 2020) also contains detailed box-level annotations for these items, which can be used to evaluate autonomous baggage threat detection frameworks.

5.1.4 COMPASS-XP

Different from the previous datasets, COMPASS-XP (Griffin et al., 2019) is mainly designed to assess classification (rather than detection) frameworks, i.e., it contains scan-level markings to recognize baggage threats without ground truth masks for localization. However, the novel aspect of the COMPASS-XP dataset (Griffin et al., 2019) is that it contains different scanner images for each case. For example, for a single scene in which a bag contains a gun, COMPASS-XP gives six different scanner representations (i.e., the high-energy X-ray, low-energy X-ray, high density, colored X-ray, grayscale X-ray, and the normal RGB scan). So, in total, the complete dataset contains 6 × 1,928 = 11,568 X-ray scans (Griffin et al., 2019).

5.2 Training Details

The proposed framework has been implemented using Python 3.7.8 with TensorFlow 2.3.0 and MATLAB R2020a on a machine with an Intel Core i9-10940@3.3 GHz processor and 132 GB RAM with a single NVIDIA Quadro RTX 6600, cuDNN v7.5, and CUDA Toolkit v11.0.221. The training was conducted for 200 epochs using 80% of the normal baggage X-ray scans (i.e., 840,241 normal scans) from the SIXray dataset. The choice of the SIXray dataset for one-time training is motivated by an extensive ablation study (presented in Section 6.1.4). Moreover, the total number of test scans from all four datasets used for evaluating the proposed framework is 238,664 (8,150 scans from GDXray, 210,061 scans from SIXray, 8,885 scans from OPIXray, and 11,568 scans from the COMPASS-XP dataset). Apart from this, we used 8,929 scans from the SIXray dataset for validation purposes. Furthermore, the optimizer used during the training was ADAM (Kingma and Ba, 2015) with default learning and decay rates. For recognition of anomalous items, we trained the modular classification model for 100 epochs with ADAM (Kingma and Ba, 2015) with an initial learning rate of 0.0001. Around 5,000 training patches, obtained from all four X-ray datasets for each suspicious item category, were used to train this classifier. The source code of the proposed framework is also released publicly for the research community (see footnote 1).
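The scan counts quoted above can be cross-checked with a few lines of Python. The figures are taken directly from the text; only the variable names are introduced here for illustration.

```python
# Counts quoted in the dataset and training descriptions.
normal_sixray = 1_050_302   # negative (normal) SIXray scans
train_scans = 840_241       # normal scans used for one-time training

# Normal SIXray scans left over after the training split.
held_out = normal_sixray - train_scans
train_fraction = train_scans / normal_sixray

# Test scans per dataset, as listed in the text.
test_counts = {
    "GDXray": 8_150,
    "SIXray": 210_061,
    "OPIXray": 8_885,
    "COMPASS-XP": 11_568,
}
total_test = sum(test_counts.values())

# COMPASS-XP: six scanner representations for each of 1,928 scenes.
compass_total = 6 * 1_928
```

The held-out normal SIXray scans match the SIXray test count, the training share rounds to 80%, and the per-dataset test counts add up to the reported total.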

5.3 Evaluation Metrics

We used standard object detection, instance segmentation, and classification metrics such as mAP, MSE, accuracy, recall, precision, F1 to test the proposed framework’s performance and compare it with the state-of-the-art solutions.

6 Results

6.1 Ablation Study

The ablation study for the proposed framework includes 1) The choice of for GW-FS stylization; 2) The optimal backbone network for computing ; 3) The optimal classification backbone model, and 4) the choice of training dataset which is used for performing one-time training.

6.1.1 Choice of :

The is a hyper-parameter within the GW-FS scheme to determine the Gaussian window’s cut-off frequency. Increasing the value of increases the pass-band region and allows more frequencies to pass through, whereas decreasing the value of only allows the lowest frequencies (among all) to remain while the rest are clipped. Table 1 reports the performance of GW-FS for reconstructing three-channeled scans in terms of MSE scores. Here, we can observe that for , the proposed framework achieves the minimum reconstruction error for all datasets. However, when increases, the reconstruction performance starts to deteriorate because higher frequencies are being allowed to pass through the window (defined by ), which generates more noisy edges.

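Cut-off SIXray GDXray OPIXray COMP
5 72.92 16.29 21.12 21.94
10 89.16 24.83 41.96 43.62
25 159.53 73.92 119.65 136.42
50 235.98 134.76 216.94 218.16
Table 1: Effects of varying the GW-FS cut-off hyper-parameter on scan reconstruction, in terms of MSE scores. Bold indicates the best MSE scores. The abbreviation ‘COMP’ represents the COMPASS-XP dataset (Griffin et al., 2019).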
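To make the role of the cut-off concrete, the following is a minimal sketch of a generic Gaussian low-pass window applied in the Fourier domain. It is not the GW-FS implementation: the function names, the toy image, and the interpretation of the cut-off as the standard deviation of the window (in frequency bins) are all assumptions made for illustration.

```python
import numpy as np

def gaussian_window(shape, cutoff):
    """Centered 2-D Gaussian window. Assumption: `cutoff` acts as the
    standard deviation of the window, measured in frequency bins."""
    rows, cols = shape
    y = np.arange(rows) - rows // 2
    x = np.arange(cols) - cols // 2
    yy, xx = np.meshgrid(y, x, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * cutoff ** 2))

def gaussian_lowpass(image, cutoff):
    """Attenuate high frequencies of a single-channel image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))      # zero frequency at the center
    filtered = spectrum * gaussian_window(image.shape, cutoff)
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))

def retained_energy(image, cutoff):
    """Fraction of the signal energy that survives the window."""
    smooth = gaussian_lowpass(image, cutoff)
    return float(np.sum(smooth ** 2) / np.sum(image ** 2))

# Hypothetical toy image; the cut-off values mirror those listed in Table 1.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
cutoffs = (5, 10, 25, 50)
energies = [retained_energy(image, c) for c in cutoffs]

for c, e in zip(cutoffs, energies):
    print(f"cut-off {c:>2}: retained energy fraction {e:.4f}")
```

In this sketch, a larger cut-off lets a larger share of the spectrum through, which matches the qualitative behavior described above: a small value keeps only the lowest frequencies, while a large value also passes the high-frequency content associated with noisy edges.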

6.1.2 Backbone Network for Computing the Feature Reconstruction Loss:

This ablation study reports the choice of backbone for computing the feature reconstruction loss function. Here, we compared the performance of the pre-trained VGG-16 (Simonyan and Zisserman, 2015) and ResNet-50 (He et al., 2016) models, which produce this loss to penalize the encoder-decoder model towards reconstructing the abnormal scans robustly. It should be noted here that these pre-trained models were fixed, i.e., the weights of these networks were not trained explicitly for minimizing the loss. The results for this experiment are reported in Table 2. We can see here that the loss computed with VGG-16 (Simonyan and Zisserman, 2015) achieves 9.03% better reconstruction performance (i.e., lower MSE) on the SIXray (Miao et al., 2019) dataset. Similarly, on the GDXray (Mery et al., 2015), OPIXray (Wei et al., 2020), and COMPASS-XP (Griffin et al., 2019) datasets, the VGG-16 (Simonyan and Zisserman, 2015) driven loss achieved 20.69%, 24.08%, and 31.67% better reconstruction performance, respectively. If we increase the training epochs for ResNet-50 (He et al., 2016), we can achieve similar performance. However, since VGG-16 (Simonyan and Zisserman, 2015) produced better results with less training, we opted for it in the proposed framework to compute the feature reconstruction loss during one-time training.

Model SIXray GDXray OPIXray COMP
VGG-16 72.92 16.29 21.12 21.94
ResNet-50 80.16 20.54 27.82 32.11
Table 2: Performance of different pre-trained models for computing the feature reconstruction loss during one-time training, in terms of MSE scores. Bold indicates the best MSE scores. Moreover, the abbreviation ‘COMP’ represents the COMPASS-XP dataset (Griffin et al., 2019).
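The percentages quoted for this experiment can be reproduced from Table 2 as relative MSE reductions with respect to the ResNet-50 score. The short sketch below shows that arithmetic; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the released code.

```python
# MSE scores copied from Table 2 (lower is better).
mse = {
    "VGG-16":    {"SIXray": 72.92, "GDXray": 16.29, "OPIXray": 21.12, "COMP": 21.94},
    "ResNet-50": {"SIXray": 80.16, "GDXray": 20.54, "OPIXray": 27.82, "COMP": 32.11},
}

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

improvements = {
    dataset: relative_reduction(mse["ResNet-50"][dataset], mse["VGG-16"][dataset])
    for dataset in mse["VGG-16"]
}

for dataset, value in improvements.items():
    print(f"{dataset}: {value:.2f}% lower MSE with VGG-16")
```

Normalizing by the ResNet-50 score is the convention that matches the reported 9.03%, 20.69%, 24.08%, and 31.67% figures.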

6.1.3 The Optimal Classification Backbone:

Recognition of anomalous items (after unsupervised anomaly instance segmentation) is an optional step required to identify the type of anomaly contained within the localized anomalous region. To perform this, we exploited different pre-trained models (fine-tuned on suspicious item patches). We also compared the performance of the proposed classification model with these pre-trained models to see how well it recognizes the suspicious items (contained within the patches). The comparison is reported in Table 3. Here, we can observe that with few training examples (i.e., the suspicious item patches) from all the datasets, the proposed model achieves competitive classification performance compared to the other pre-trained networks. Furthermore, considering that it has 54.48% fewer parameters than the DenseNet-121 (Huang et al., 2017) model, we believe that it provides a good trade-off between performance and computational requirements, especially compared to MobileNetv2 (Howard et al., 2017).

Model ACC TPR PPV F1 NP
PB 0.9669 0.8683 0.5083 0.6412 3.2M
V-16 0.9630 0.8538 0.4759 0.6111 14.7M
R-50 0.9740 0.9172 0.5745 0.7064 23.5M
R-101 0.9786 0.9341 0.6242 0.7483 42.6M
R-152 0.9877 0.9446 0.7568 0.8403 58.3M
D-121 0.9754 0.9053 0.5909 0.7150 7.03M
MNv2 0.9527 0.8247 0.4049 0.5431 2.2M
Table 3: Comparison of classification performance for recognizing anomalous item patches. The good trade-off between classification performance and the computational requirements is highlighted in bold. Moreover, the abbreviations are ACC: Accuracy, TPR: True Positive Rate, PPV: Positive Predictive Value, F1: F1 Score, NP: Number of Parameters, PB: Proposed Backbone, V-16: VGG-16 (Simonyan and Zisserman, 2015), R-50: ResNet-50 (He et al., 2016), R-101: ResNet-101 (He et al., 2016), R-152: ResNet-152 (He et al., 2016), D-121: DenseNet-121 (Huang et al., 2017), MNv2: MobileNetv2 (Howard et al., 2017).
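As a consistency check on Table 3, the F1 score is the harmonic mean of the true positive rate (recall) and the positive predictive value (precision), and the parameter saving is a simple relative difference. The sketch below assumes that the numeric columns of Table 3 follow the order of the abbreviations in its caption (ACC, TPR, PPV, F1, NP); the variable and function names are illustrative.

```python
# Rows copied from Table 3 as (TPR, PPV, reported F1).
rows = {
    "PB":    (0.8683, 0.5083, 0.6412),
    "V-16":  (0.8538, 0.4759, 0.6111),
    "R-50":  (0.9172, 0.5745, 0.7064),
    "R-101": (0.9341, 0.6242, 0.7483),
    "R-152": (0.9446, 0.7568, 0.8403),
    "D-121": (0.9053, 0.5909, 0.7150),
    "MNv2":  (0.8247, 0.4049, 0.5431),
}

def f1_score(tpr, ppv):
    """Harmonic mean of recall (TPR) and precision (PPV)."""
    return 2.0 * tpr * ppv / (tpr + ppv)

computed_f1 = {name: f1_score(tpr, ppv) for name, (tpr, ppv, _) in rows.items()}

# Parameter counts in millions, copied from the NP column.
params_proposed, params_densenet = 3.2, 7.03
param_saving = 100.0 * (params_densenet - params_proposed) / params_densenet

for name, (_, _, reported) in rows.items():
    print(f"{name}: computed F1 {computed_f1[name]:.4f}, reported {reported:.4f}")
print(f"Proposed backbone vs. DenseNet-121: {param_saving:.2f}% fewer parameters")
```

Under this column assignment, the recomputed F1 scores agree with the reported ones to within rounding, and the parameter saving matches the 54.48% quoted in the text.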

6.1.4 Choice of One-Time Training Dataset

The encoder-decoder model within the proposed framework is trained only once, and in this one-time training, it learns to robustly reconstruct the normal baggage content at run-time. As the model does not learn to recognize suspicious objects during training, it struggles to reconstruct them at the inference stage, and this failure is highlighted within the disparity maps.

In this experiment, we determine the optimal choice of the dataset for training the encoder-decoder model. The reconstruction performance of the proposed model (in terms of MSE scores) is reported in Table 4. Here, we can see that training on the SIXray (Miao et al., 2019) dataset produces the best overall reconstruction performance at the inference stage: it yields the lowest MSE on SIXray and OPIXray, whereas the models trained on GDXray and COMPASS-XP achieve a lower MSE only on their own dataset and score worse on all the others. This is because SIXray (Miao et al., 2019), to the best of our knowledge, contains the largest number of scans depicting diverse normal baggage content, allowing the encoder-decoder model to robustly learn the unique feature representations within the baggage X-ray scans. Training on the GDXray (Mery et al., 2015) dataset does not produce very good performance for two reasons: 1) it contains significantly fewer normal baggage scans (around 1,130 of them) for training purposes; 2) GDXray is a grayscale dataset, so styling the colored baggage X-ray scans as grayscale would create more ambiguities between normal and abnormal baggage content, resulting in noisier disparity maps. We did not use the OPIXray (Wei et al., 2020) dataset for training purposes because it does not contain any normal baggage X-ray scans (Wei et al., 2020). Also, the COMPASS-XP (Griffin et al., 2019) dataset does not have complex X-ray scans (like the other datasets), i.e., it contains single-item scans, and a model trained on these scans does not get much exposure to the diversified feature representations contained within the other datasets.

Training SIXray GDXray OPIXray COMP
SIXray 72.928 16.291 21.125 21.941
GDXray 96.532 13.639 45.923 51.649
COMP 81.692 21.638 30.113 10.582
Table 4: Reconstruction performance of the proposed encoder-decoder model in terms of MSE scores when trained on different datasets. The abbreviation ‘COMP’ represents the COMPASS-XP dataset (Griffin et al., 2019).
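The MSE scores and the disparity maps discussed in this section both derive from the difference between an original scan and its reconstruction. The sketch below illustrates the two quantities on synthetic data; the per-pixel absolute difference averaged over channels is an assumed, simplified stand-in for the framework's disparity computation, and all array names are hypothetical.

```python
import numpy as np

def mse(original, reconstructed):
    """Mean squared error over all pixels and channels."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

def disparity_map(original, reconstructed):
    """Per-pixel absolute difference, averaged over the channel axis.
    Assumption: a simplified proxy for the framework's disparity maps."""
    diff = np.abs(original.astype(np.float64) - reconstructed.astype(np.float64))
    return diff.mean(axis=-1)

# Synthetic example: the reconstruction contains only "normal" content,
# while the original scan additionally contains a bright anomalous patch.
rng = np.random.default_rng(0)
reconstructed = rng.integers(0, 200, size=(32, 32, 3)).astype(np.uint8)
original = reconstructed.copy()
original[8:16, 10:20] = 255  # hypothetical anomaly the model fails to reconstruct

error = mse(original, reconstructed)
disparity = disparity_map(original, reconstructed)

print(f"MSE: {error:.2f}")
print(f"Mean disparity inside the patch: {disparity[8:16, 10:20].mean():.2f}")
print(f"Maximum disparity outside the patch: {np.delete(disparity, np.s_[8:16], axis=0).max():.2f}")
```

Casting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.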

6.2 Comparison with Supervised Frameworks

In this series of experiments, we compared the proposed framework’s detection and recognition performance with the state-of-the-art fully supervised baggage threat detection frameworks. Here, our approach is semi-supervised (since we mounted the optional classification module on the proposed framework to recognize the suspicious items’ categories). The comparison is reported in Table 5, where we can see that although the proposed framework lags behind some state-of-the-art methods, its performance is still quite appreciable, especially considering that it is a semi-supervised approach, unlike its conventionally trained, fully supervised competitors. In terms of mAP scores, it lags behind the best performing framework by 17.23%, 11.17%, 0.65%, and 6.89% on the SIXray, GDXray, OPIXray, and COMPASS-XP datasets, respectively. Furthermore, compared to the recently introduced GBAD (Dumagpi et al., 2020) and DTSD (Hassan et al., 2020b) approaches on the SIXray dataset, and the DOAM-O (Tao et al., 2021) approach on the OPIXray dataset, the proposed framework achieves 5.76%, 18.68%, and 0.347% improvements, respectively.

Model SIXray GDXray OPIXray COMP
Proposed 0.7941 0.8591 0.7483 0.5439
TST 0.9516 0.9672 0.7532 0.5842
CST 0.9595 0.9343 - -
DTSD 0.6457 0.9162 - -
DOAM - - 0.7401 -
GBAD 0.7483 - - -
DOAM-O - - 0.7457 -
Table 5: Comparison of proposed approach (semi-supervised version) with existing fully supervised baggage threat detection frameworks in terms of mAP. Bold indicates the best score while the second-best scores are underlined. ’-’ indicates that the metric is not computed. Moreover, the abbreviations are COMP: COMPASS-XP (Griffin et al., 2019), TST: Trainable Structure Tensors (Hassan and Werghi, 2020), CST: Cascaded Structure Tensors (Hassan et al., 2020a), DTSD: Dual-Tensor Shot Detector (Hassan et al., 2020b), DOAM: De-occlusion Attention Module (Wei et al., 2020) with Single-Shot Detector (Liu et al., 2016), GBAD: GAN Based Anomaly Detection (Dumagpi et al., 2020) with ResNet-101 (He et al., 2016), and DOAM-O: Oversampling De-occlusion Attention Module (Tao et al., 2021) with Single-Shot Detector (Liu et al., 2016).

6.3 Comparison with Unsupervised Frameworks

We also compared the performance of the proposed unsupervised framework with state-of-the-art methods such as GANomaly (Akçay et al., 2018a) and Skip-GANomaly (Akçay et al., 2019). Here, the experimental protocol is to classify abnormal vs. normal baggage X-ray scans (except for OPIXray), where abnormal scans are those that contain one or more anomalous regions, and normal scans only have normal baggage content. For the OPIXray dataset, we followed the strategy of classifying the scans as having folding knives, utility knives, straight knives, multi-tool knives, or scissors, since this dataset does not contain any normal baggage X-ray scans (Wei et al., 2020). Moreover, for fairness, all the frameworks were trained only on the SIXray dataset as per the training protocol defined in Section 5.2, and were then applied to the other datasets in a zero-shot manner (without any re-training or fine-tuning). The comparison is reported in Table 6 in terms of F1 scores, where we can see that the proposed framework leads Skip-GANomaly (Akçay et al., 2019) by 67.37% on the SIXray dataset, 32.32% on the GDXray dataset, and 45.81% on the COMPASS-XP dataset. Furthermore, on the OPIXray dataset, the proposed framework leads the second-best Skip-GANomaly (Akçay et al., 2019) by 47.19%.

Model SIXray GDXray COMP OPIXray
PF 0.4831 0.7839 0.4119 0.6560
GA 0.1227 0.4994 0.2405 0.3074
SG 0.1576 0.5305 0.2232 0.3464
Table 6: Comparison of proposed framework with state-of-the-art unsupervised baggage threat detection frameworks in terms of F1 score. Bold indicates the best scores while the second-best are underlined. Moreover, the abbreviations are: PF: Proposed Framework, GA: GANomaly (Akçay et al., 2018a), SG: Skip-GANomaly (Akçay et al., 2019), and COMP: COMPASS-XP (Griffin et al., 2019).
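The leads quoted for this comparison are expressed relative to the proposed framework's own F1 score, not relative to the baseline's score. The sketch below makes that normalization explicit using the values in Table 6; the helper name `relative_lead` and the data layout are illustrative.

```python
# F1 scores copied from Table 6 (higher is better).
f1 = {
    "PF": {"SIXray": 0.4831, "GDXray": 0.7839, "COMP": 0.4119, "OPIXray": 0.6560},
    "GA": {"SIXray": 0.1227, "GDXray": 0.4994, "COMP": 0.2405, "OPIXray": 0.3074},
    "SG": {"SIXray": 0.1576, "GDXray": 0.5305, "COMP": 0.2232, "OPIXray": 0.3464},
}

def relative_lead(ours, theirs):
    """Lead of `ours` over `theirs`, as a percentage of `ours`."""
    return 100.0 * (ours - theirs) / ours

# Lead of the proposed framework (PF) over Skip-GANomaly (SG) per dataset.
lead_over_sg = {
    dataset: relative_lead(f1["PF"][dataset], f1["SG"][dataset])
    for dataset in f1["PF"]
}

for dataset, value in lead_over_sg.items():
    print(f"{dataset}: PF leads SG by {value:.2f}% of its own F1 score")
```

Dividing by the baseline's score instead would give much larger percentages, so the choice of denominator matters when comparing these figures with results reported elsewhere.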

6.4 Qualitative Evaluations

Figure 6 reports the proposed framework’s qualitative evaluations on all four X-ray datasets. We can see here that the proposed framework effectively recognizes the contraband items irrespective of the scanner specifications. However, for cluttered cases, the quality of the extracted masks is somewhat limited (for example, see the mask of the gun in (H)). This is because the framework recognizes the anomalous regions from the disparity maps in an unsupervised manner (by clustering their color distributions with respect to the pixels of the normal baggage content). Therefore, it cannot differentiate the anomalous items’ pixels well if their intensities are very highly correlated with those of the normal pixels.

Figure 6: Qualitative evaluation of proposed framework on four public X-ray datasets. (A-H) show scans from the GDXray (Mery et al., 2015) dataset, (I-P) show scans from the OPIXray (Wei et al., 2020) dataset, (Q-X) show scans from the COMPASS-XP (Griffin et al., 2019) dataset, and (Y-AF) show scans from the SIXray (Miao et al., 2019) dataset.
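The limitation described above can be illustrated with a deliberately simplified stand-in for the clustering and box-fitting steps. The sketch below splits the values of a single-channel disparity map into two intensity clusters with a one-dimensional two-means procedure and fits a bounding box around the high-disparity cluster. The framework's actual clustering and post-processing are not specified in this section, so every function name, the two-cluster choice, and the toy data are assumptions.

```python
import numpy as np

def two_means_threshold(values, iterations=20):
    """1-D two-means clustering; returns the midpoint between the two
    cluster centers. Assumption: a simplified proxy for the clustering step."""
    lo, hi = float(values.min()), float(values.max())
    for _ in range(iterations):
        threshold = (lo + hi) / 2.0
        low_group = values[values <= threshold]
        high_group = values[values > threshold]
        if low_group.size == 0 or high_group.size == 0:
            break  # all values fall into one cluster
        lo, hi = float(low_group.mean()), float(high_group.mean())
    return (lo + hi) / 2.0

def bounding_box(mask):
    """Tightest (top, left, bottom, right) box around True pixels, or None."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Hypothetical disparity map: low values for normal content, high values
# where the reconstruction failed.
disparity = np.full((32, 32), 2.0)
disparity[8:16, 10:20] = 90.0

mask = disparity > two_means_threshold(disparity.ravel())
box = bounding_box(mask)
print("anomalous pixels:", int(mask.sum()), "bounding box:", box)
```

When the anomalous and normal pixels have similar disparity values, the two cluster centers move close together and the resulting mask becomes unreliable, which is consistent with the weaker masks observed for cluttered scans.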

7 Discussion and Conclusion

This paper presents a novel unsupervised anomaly instance segmentation framework to detect baggage threats from the X-ray scans without any supervision and extensive training procedures. The proposed framework is ideal for screening baggage threats in the real world, easing the security officers’ load by avoiding the tedious re-training and ground truth marking process as required in the conventional baggage threat detection frameworks. The proposed framework recognizes the baggage threats as anomalies by exploiting the original and the reconstructed scans’ disparities. For cluttered cases, the disparity maps are a bit limited in generating the anomalous items’ masks accurately due to their lesser intensity differences with the normal objects. In the future, we plan to improve this aspect by deploying a more adaptive attention mechanism that will highlight only the desired anomalous region within the candidate scan while suppressing the rest of the content. Also, we plan to test the proposed framework to detect 3D-printed and organic baggage threats from the security X-ray scans.


This work is supported by a research fund from Khalifa University (Ref: CIRA-2019-047) and by the ADEK Award for Research Excellence (Ref: AARE19-156).

Authors Contribution Statement

T.H. devised the idea, wrote the manuscript, and performed the experiments. S.A. also contributed to manuscript writing. M.B. co-supervised the research and reviewed the experiments. S.K. also reviewed the manuscript. N.W. supervised the complete research, contributed to manuscript writing, and reviewed the experiments.

Data Availability

The proposed framework has been thoroughly evaluated on four baggage X-ray datasets, and all of these four datasets are publicly available.

Competing Interests Statement

All the authors declare that there are no competing interests related to this article.


  • Akçay et al. (2019) Akçay S, Atapour-Abarghouei A, Breckon TP (2019) Skip-GANomaly: Skip Connected and Adversarially Trained Encoder-Decoder Anomaly Detection. arXiv:1901.08954
  • Akçay et al. (2016) Akçay S, et al. (2016) Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery. In: IEEE ICIP, pp 1057–1061
  • Akçay and Breckon (2020) Akçay S, Breckon T (2020) Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging. arXiv:2001.01293
  • Akçay et al. (2018a) Akçay S, Atapour-Abarghouei A, Breckon TP (2018a) GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training. In: Asian Conference on Computer Vision, Springer, pp 622–637
  • Akçay et al. (2018b) Akçay S, Kundegorski ME, Willcocks CG, Breckon TP (2018b) Using deep convolutional neural network architectures for object classification and detection within x-ray baggage security imagery. IEEE Transactions on Information Forensics and Security 13(9):2203–2215

  • An et al. (2019) An J, et al. (2019) Semantic segmentation for prohibited items in baggage inspection. In: Int. Conf. Intelligence Science and Big Data Engineering. Visual Data Engineering, p 495–505
  • Bastan (2015) Bastan M (2015) Multi-view Object Detection In Dual-energy X-ray Images. Machine Vision and Applications p 1045–1060
  • Bastan et al. (2011) Bastan M, et al. (2011) Visual words on baggage x-ray images. In: Int. Conference on Computer Analysis of Images and Patterns, p 360–368
  • Cooley and Tukey (1965) Cooley JW, Tukey JW (1965) An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation
  • Council (1996) National Research Council (1996) Airline Passenger Security Screening: New Technologies and Implementation Issues. The National Academies Press
  • Dumagpi et al. (2020) Dumagpi JK, Jung WY, Jeong YJ (2020) A New GAN-Based Anomaly Detection (GBAD) Approach for Multi-Threat Object Classification on Large-Scale X-Ray Security Images. IEICE Transactions on Information and Systems
  • Gaus et al. (2019a) Gaus YFA, Bhowmik N, Akçay S, Breckon T (2019a) Evaluating the transferability and adversarial discrimination of convolutional neural networks for threat object detection and classification within X-ray security imagery. 18th IEEE International Conference On Machine Learning And Applications (ICMLA)
  • Gaus et al. (2019b) Gaus YFA, Bhowmik N, Akçay S, Guillen-Garcia PM, Barker JW, Breckon TP (2019b) Evaluation of a Dual Convolutional Neural Network Architecture for Object-wise Anomaly Detection in Cluttered X-ray Security Imagery. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp 1–8
  • Griffin et al. (2019) Griffin LD, Caldwell M, Andrews JTA (2019) COMPASS-XP Dataset. Computational Security Science Group, UCL
  • Hassan and Werghi (2020) Hassan T, Werghi N (2020) Trainable Structure Tensors for Autonomous Baggage Threat Detection Under Extreme Occlusion. Asian Conference on Computer Vision (ACCV)
  • Hassan et al. (2020a) Hassan T, Bettayeb M, Akçay S, Khan S, Bennamoun M, Werghi N (2020a) Detecting Prohibited Items in X-ray Images: A Contour Proposal Learning Approach. 27th IEEE International Conference on Image Processing (ICIP)
  • Hassan et al. (2020b) Hassan T, Shafay M, Akçay S, Khan S, Bennamoun M, Damiani E, Werghi N (2020b) Meta-Transfer Learning Driven Tensor-Shot Detector for the Autonomous Localization and Recognition of Concealed Baggage Threats. MDPI Sensors
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778

  • He et al. (2017) He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp 2961–2969
  • Heitz and Chechik (2010) Heitz G, Chechik G (2010) Object Separation in X-ray Image Sets. In: Int. Conf Computer Vision and Pattern Recognition, p 2093–2100
  • Howard et al. (2017) Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
  • Huang et al. (2017) Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely Connected Convolutional Networks. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)
  • Jaccard et al. (2014) Jaccard N, Rogers TW, Griffin LD (2014) Automated detection of cars in transmission X-ray images of freight containers. In: AVSS, pp 387–392
  • Jaccard et al. (2017) Jaccard N, et al. (2017) Detection Of Concealed Cars In Complex Cargo X-ray Imagery Using Deep Learning. Journal of X-Ray Science and Technology p 323–339
  • Johnson et al. (2016) Johnson J, Alahi A, Fei-Fei L (2016) Perceptual Losses for Real-Time Style Transfer and Super-Resolution. European Conference on Computer Vision (ECCV)

  • Kingma and Ba (2015) Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR)
  • Kundegorski et al. (2016) Kundegorski M, et al. (2016) On using feature descriptors as visual words for object detection within x-ray baggage security screening. In: Int. Conf. on Imaging for Crime Detection and Prevention (ICDP)
  • Lin et al. (2017) Lin TY, et al. (2017) Focal Loss for Dense Object Detection. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)
  • Liu et al. (2016) Liu W, et al. (2016) SSD: Single Shot MultiBox Detector. In: European Conference on Computer Vision (ECCV)
  • Liu et al. (2018) Liu Z, Li J, Shu Y, Zhang D (2018) Detection and Recognition of Security Detection Object Based on Yolo9000. In: 2018 5th International Conference on Systems and Informatics (ICSAI), IEEE, pp 278–282
  • Megherbi et al. (2012) Megherbi N, Breckon TP, Flitton GT, Mouton A (2012) Fully Automatic 3D Threat Image Projection: Application to Densely Cluttered 3D Computed Tomography Baggage Images. International Conference on Image Processing Theory, Tools and Applications
  • Mery et al. (2017) Mery D, Svec E, Arias M, Riffo V, Saavedra JM, Banerjee S (2017) Modern Computer Vision Techniques for X-Ray Testing in Baggage Inspection
  • Mery et al. (2020) Mery D, Saavedra D, Prasad M (2020) X-Ray Baggage Inspection With Computer Vision: A Survey. IEEE Access, pp 145620-145633
  • Mery et al. (2015) Mery D, et al. (2015) GDXray: The database of X-ray images for nondestructive testing. Journal of Nondestructive Evaluation 34(4):42
  • Mery et al. (2016) Mery D, et al. (2016) Object Recognition in Baggage Inspection Using Adaptive Sparse Representations of X-ray Images. In: Pacific-Rim Symposium on Image and Video Technology, p 709–720
  • Miao et al. (2019) Miao C, et al. (2019) SIXray: A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images. In: IEEE CVPR, pp 2119–2128
  • Ren et al. (2016) Ren S, et al. (2016) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497v3
  • Riffo and Mery (2015) Riffo V, Mery D (2015) Automated Detection of Threat Objects Using Adapted Implicit Shape Model. IEEE Transactions on Systems, Man, and Cybernetics: Systems 46(4):472–482
  • Simonyan and Zisserman (2015) Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition
  • Sun et al. (2019) Sun Q, Liu Y, Chua TS, Schiele B (2019) Meta-Transfer Learning for Few-Shot Learning. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)
  • Szegedy et al. (2014) Szegedy C, et al. (2014) Going Deeper with Convolutions. arXiv:1409.4842v1
  • Tao et al. (2021) Tao R, Wei Y, Li H, Liu A, Ding Y, Qin H, Liu X (2021) Over-sampling De-occlusion Attention Network for Prohibited Items Detection in Noisy X-ray Images. Under Review in IEEE Transactions on Multimedia
  • Turcsany et al. (2013) Turcsany D, Mouton A, Breckon TP (2013) Improving feature-based object recognition for X-ray baggage security screening using primed visual words. In: 2013 IEEE International Conference on Industrial Technology (ICIT), IEEE, pp 1140–1145
  • Wang and Breckon (2020) Wang Q, Breckon TP (2020) Contraband Materials Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery. arXiv:2012.11753
  • Wang et al. (2020a) Wang Q, Megherbi N, Breckon TP (2020a) Multi-Class 3D Object Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery. arXiv:2008.01218
  • Wang et al. (2020b) Wang Q, Megherbi N, Breckon TP (2020b) A Reference Architecture for Plausible Threat Image Projection (TIP) Within 3D X-ray Computed Tomography Volumes. Journal of X-Ray Science and Technology, vol 28, no 3, pp 507-526
  • Wei et al. (2020) Wei Y, Tao R, Wu Z, Ma Y, Zhang L, Liu X (2020) Occluded Prohibited Items Detection: An X-ray Security Inspection Benchmark and De-occlusion Attention Module. arXiv:2004.08656
  • Xu et al. (2018) Xu M, et al. (2018) Prohibited Item Detection in Airport X-Ray Security Images via Attention Mechanism Based CNN. In: Chinese Conference on Pattern Recognition and Computer Vision
  • Yang and Soatto (2020) Yang Y, Soatto S (2020) FDA: Fourier Domain Adaptation for Semantic Segmentation. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)
  • Zhang et al. (2014) Zhang J, et al. (2014) Joint Shape and Texture Based X-Ray Cargo Image Classification. International Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
  • Zhao et al. (2018) Zhao Z, et al. (2018) A GAN-Based Image Generation Method for X-Ray Security Prohibited Items. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)