
Tensor Pooling Driven Instance Segmentation Framework for Baggage Threat Recognition

Automated systems designed for screening contraband items from X-ray imagery still face difficulties with high clutter, concealment, and extreme occlusion. In this paper, we address this challenge using a novel multi-scale contour instance segmentation framework that effectively identifies cluttered contraband data within baggage X-ray scans. Unlike standard models that employ region-based or keypoint-based techniques to generate multiple boxes around objects, we propose to derive proposals according to the hierarchy of the regions defined by the contours. The proposed framework is rigorously validated on three public datasets, dubbed GDXray, SIXray, and OPIXray, where it outperforms the state-of-the-art methods by achieving mean average precision scores of 0.9779, 0.9614, and 0.8396, respectively. Furthermore, to the best of our knowledge, this is the first contour instance segmentation framework that leverages multi-scale information to recognize cluttered and concealed contraband data from colored and grayscale security X-ray imagery.





1 Introduction

X-ray imagery is a widely used modality for non-destructive testing Tang2020TII; hassan2021tim, especially for screening illegal and smuggled items at airports, cargoes, and malls. Manual baggage inspection is a tiring task and susceptible to errors caused by exhausting work routines and inexperienced personnel. Initial systems proposed to address these problems employed conventional machine learning bastan2013BMVC. Driven by hand-engineered features, these methods are only applicable to limited data and confined environmental settings turcsany2013improving.

Recently, attention has turned to deep learning methods, which yielded a substantial boost in accuracy and generalization capacity towards screening prohibited baggage items Hu2020ACCV; akccay2016transfer; Hassan2021SMC; Hassan2021JAIHC. However, deep learning methods are also prone to clutter and occlusion akcay2018using. This limitation emanates from the proposal generation strategies, which have been designed for color images gaus2019evaluation. Unlike RGB scans, X-ray imagery lacks texture and exhibits low-intensity variations between cluttered objects. This intrinsic difference makes region-based or anchor-based proposal generation methods such as Mask R-CNN maskrcnn, Faster R-CNN fasterrcnn, RetinaNet retinanet, and YOLO yolov3 less robust for detecting cluttered contraband data akcay2018using. Moreover, the problem is further accentuated by the class-imbalanced nature of contraband items in the real world gaus2019evaluation. Despite the considerable strategies proposed to alleviate occlusion and class imbalance opixray; miao2019sixray; hassan2015Review; hassan2016AEIA; hassan2016AO, recognizing threatening objects in highly cluttered and concealed scenarios is still an open problem ackay2020.

1.1 Contributions

In this paper, we propose a novel multi-scale contour instance segmentation framework for identifying suspicious items using X-ray scans. Unlike standard models that employ region-based or keypoint-based techniques to generate multiple boxes around objects akccay2016transfer; hassan2016CMPB; hassan2016EAGE; hassan2016JOSAA; hassan2017AVRDB; akcay2018using; gaus2019evaluating, we propose to derive proposals according to the hierarchy of the regions defined by the contours. The insight driving this approach is that contours are the most reliable cue in X-ray scans due to the lack of surface texture. For example, occluded items exhibit different transitional patterns based upon their orientation, contrast, and intensity. We amplify and exploit this information through multi-scale scan decomposition, which boosts the proposed framework's capacity for detecting the underlying contraband data in the presence of clutter. Furthermore, we are also motivated by the fact that suspicious items made of organic materials show only their outlines in the X-ray scans Hassan2020ACCV; hassan2017BioMed; hassan2017BOE; hassan2017Cancer; hassan2017CSCI; hassan2017ICGIP; hassan2017SigSys; hassan2019. To summarize, the main contributions of this paper are:


  • Detection of overlapping suspicious items by analyzing their predominant orientations across multiple scales within the candidate scan. Unlike hassan2019; Hassan2020ACCV; hassan2020Sensors, we propose a novel tensor pooling strategy that decomposes the scan across various scales and fuses them into a single multi-scale tensor. This scheme results in more salient contour maps (see Figure 1), boosting our framework's capacity for handling dulled, concealed, and overlapping items.

  • A thorough validation on three publicly available large-scale baggage X-ray datasets, including the OPIXray opixray, which is the only dataset allowing a quantitative measure of the level of occlusion.

  • Unlike state-of-the-art methods such as CST hassan2019, TST Hassan2020ACCV, and DTS hassan2020Sensors, the performance of the proposed framework in detecting occluded items has been quantitatively evaluated on the OPIXray opixray dataset. Please see Table 4 for more details.

2 Related Work

Many researchers have developed computer-aided screening systems to identify potential baggage threats mery2016; hassan2020SoCPaR. While a majority of these frameworks are based on conventional machine learning bastan2013object; hassan2018Access; hassan2018Future5V; hassan2018Access2, recent works also employ supervised akccay2016transfer; hassan2018Healthcom; hassan2018ICIAR; hassan2018ICIAR2 and unsupervised akccay2019skip deep learning, and these methods outperform conventional approaches both in terms of performance and efficiency akcay2018using; hassan2018JDI; hassan2018JOMS. In this section, we discuss some of the major baggage threat detection works. We refer the readers to Mery2017TMSC; ackay2020 for an exhaustive survey.

2.1 Traditional Methods

The early baggage screening systems were driven by classification turcsany2013improving; hassan2019CBM; hassan2019CCODE; hassan2019ICCAR, segmentation heitz2010; hassan2020JBHI; hassan2019DIB; hassan2019Sensors, and detection bastan2015; hassan2019Person; hassan2020BIBE; hassan2020DIR; hassan2020Survey approaches to identify potential threats and smuggled items. Here, the work of Bastan et al. bastan2013BMVC is notable; it identifies suspicious and illegal items within multi-view X-ray imagery through an SVM model driven by fused SIFT and SPIN features. Similarly, SURF heitz2010 and FAST-SURF kundegorski2016 have also been used with the Bag of Words model bastan2011 to identify threatening items from security X-ray imagery. Moreover, approaches like the adapted implicit shape model riffo2015automated and adaptive sparse representation mery2016 were also commendable for screening suspicious objects from X-ray scans.

2.2 Deep Learning Frameworks

The deep learning-based baggage screening frameworks have been broadly categorized into supervised and unsupervised learning schemes.

2.2.1 Supervised Methods

The initial deep learning approaches involved scan-level classification to identify the suspicious baggage content akccay2016transfer. However, with the recent advancements in object detection, researchers also employed sophisticated detectors like RetinaNet retinanet, YOLO yolo; yolov2, and Faster R-CNN fasterrcnn to not only recognize the contraband items from the baggage X-ray scans but also to localize them via bounding boxes akcay2018using. Moreover, researchers also proposed semantic segmentation an2019; hassan2020TBME; hassan2021BSPC; Hassan2021SAS; hassan2021cbm and instance segmentation Hassan2020ACCV models to recognize threatening and smuggled items from the grayscale and colored X-ray imagery. Apart from this, Xiao et al. _45 presented an efficient implementation of Faster R-CNN fasterrcnn to detect suspicious data from the TeraHertz imagery. Dhiraj et al. _42 used Faster R-CNN fasterrcnn, YOLOv2 yolov2, and Tiny YOLO yolov2 to screen baggage threats contained within the scans of a publicly available GDXray dataset mery2015gdxray. Gaus et al. gaus2019evaluation utilized RetinaNet retinanet, Faster R-CNN fasterrcnn, Mask R-CNN maskrcnn (driven through ResNets he2016deep, VGG-16 vgg16, and SqueezeNet i2016squeezenet) to detect prohibited baggage items. In another approach gaus2019evaluating, they analyzed the transferability of these models on a similarly styled X-ray imagery contained within their local dataset as well as the SIXray10 subset of the publicly available SIXray dataset miao2019sixray. Similarly, Akçay et al. akcay2018using compared Faster R-CNN fasterrcnn, YOLOv2 yolov2, R-FCN rfcn, and sliding-window CNN with the AlexNet alexnet driven SVM model to recognize occluded contraband items from the X-ray imagery. Miao et al. miao2019sixray explored the imbalanced nature of the contraband items in the real-world by developing a class-balanced hierarchical refinement (CHR) framework. 
Furthermore, they extensively tested their framework (backboned through different classification models) on their publicly released SIXray miao2019sixray dataset. Wei et al. opixray presented a plug-and-play module dubbed De-occlusion Attention Module (DOAM) that can be coupled with any object detector to enhance its capacity towards screening occluded contraband items. DOAM was validated on the publicly available OPIXray opixray dataset, which is the first of its kind in providing quantitative assessments of baggage screening frameworks under low, partial, and full occlusion opixray. Apart from this, Hassan et al. hassan2019 also addressed the imbalanced nature of contraband data by developing the cascaded structure tensors (CST) based baggage threat detector. CST hassan2019 generates a balanced set of contour-based proposals, which are then utilized in training the backbone model to screen the normal and abnormal baggage items within the candidate scan hassan2019. Similarly, to overcome the need to train threat detection systems on large-scale and well-annotated data, Hassan et al. hassan2020Sensors introduced a meta-transfer learning-based dual tensor-shot (DTS) detector. DTS hassan2020Sensors analyzes the scan's saliency to produce low and high-density contour maps from which the suspicious contraband items are identified effectively with few-shot training hassan2020Sensors. In another approach, Hassan et al. Hassan2020ACCV developed an instance segmentation-based threat detection framework that filters the contours of the suspicious items from the regular content via trainable structure tensors (TST) Hassan2020ACCV to identify them accurately within security X-ray imagery.

2.2.2 Unsupervised Methods

While most baggage screening frameworks involve supervised learning, researchers have also explored adversarial learning to screen contraband data as anomalies. Akçay et al. akcay2018ganomaly, among others, laid the foundation of unsupervised baggage threat detection by proposing GANomaly akcay2018ganomaly, an encoder-decoder-encoder network trained in an adversarial manner to recognize prohibited items within baggage X-ray scans. In another work, they proposed Skip-GANomaly akccay2019skip, which employs skip-connections in an encoder-decoder topology that not only gives better latent representations for detecting baggage threats but also reduces the overall computational complexity of GANomaly akcay2018ganomaly.

Figure 1: (A) An exemplar X-ray scan from the GDXray dataset mery2015gdxray, (B) contour map obtained through the modified structure tensors in hassan2019 and Hassan2020ACCV, (C) contour map obtained through proposed tensor pooling strategy.

The rest of the paper is organized as follows: Section 3 presents the proposed framework. Section 4 describes the experimental setup. Section 5 discusses the results obtained with three public baggage X-ray datasets. Section 6 concludes the paper and enlists future directions.

Figure 2:

Block diagram of the proposed framework. The input scan is passed to the tensor pooling module to extract the tensor representations encoding the baggage items’ contours at different orientations. These representations are fused into a single multi-scale tensor and passed afterward to an asymmetric encoder-decoder backbone that segments and recognizes the contraband item’s contours while suppressing the rest of the baggage content. For each detected contour, the corresponding bounding box and mask are generated to localize the detected contraband items. The abbreviations are CV: Convolution, BN: Batch Normalization, SPB: Shape Preserving Block, IB: Identity Block, MP: Max Pooling, AP: Average Pooling, ZP: Zero Padding, SM: Softmax.

3 Proposed Approach

The block diagram of the proposed framework is depicted in Figure 2. The input scan is fed to the tensor pooling module (block A) to generate a multi-scale tensor representation, revealing the baggage content’s transitional patterns at multiple predominant orientations and across various scales. Afterward, the multi-scale tensor is passed to the encoder-decoder backbone (block B), implementing the newly proposed contour maps-based instance segmentation. This block extracts the contours of the prohibited data while eliminating the irrelevant scan content. In the third stage (block C), each extracted contour, reflecting the contraband item instance, is utilized in generating the respective mask and the bounding box for localization. In the subsequent sections, we present a detailed description of each module within the proposed framework.

3.1 Tensor Pooling Module

The tensor pooling module decomposes the input scan into l levels of a pyramid. From each level of the pyramid, the baggage content's transitional patterns are generated by analyzing the distribution of orientations within the associated image gradients. In the proposed tensor pooling scheme, we highlight the transitional patterns in image gradients (corresponding to o directions) by computing the following block-structured symmetric matrix hassan2019; Hassan2020ACCV:

    J = | T_{0,0}    T_{0,1}    ...  T_{0,o-1}   |
        | T_{1,0}    T_{1,1}    ...  T_{1,o-1}   |
        | ...        ...        ...  ...         |
        | T_{o-1,0}  T_{o-1,1}  ...  T_{o-1,o-1} |    (1)

Each tensor T_{i,j} in the above block-matrix is an outer product of two image gradients, smoothed by a filter K, i.e., T_{i,j} = K * (∇_i X ⊗ ∇_j X). Moreover, the orientation θ_i of the image gradient ∇_i X is computed through θ_i = iπ/o, where i ranges from 0 to o-1. Since the block-structured matrix in Eq. 1 is symmetric, we obtain o(o+1)/2 unique tensors. From this group, we derive the coherent tensor, reflecting the baggage items' predominant orientations. The coherent tensor is a single tensor representation generated by adding the most useful tensors out of the unique tensor set. Here, it should be noted that these useful tensors are selected by ranking all the unique tensors according to their norm.

Moreover, the coherent tensor also reveals the variations in the intensity of the cluttered baggage items, aiding in generating individual contours for each item. However, this scheme analyzes the intensity variations of the baggage items at only a single scale, limiting the extraction of objects having lower transitions with the background hassan2019; Hassan2020ACCV. To address this limitation, we propose a multi-scale tensor fusing the X-ray scan transitions from the coarsest to the finest levels so that each item, even one having a low-intensity difference with the background, can be adequately highlighted. For example, see the boundaries of the razor in a multi-scale tensor representation in Figure 1 (C), the straight knife in Figure 3 (G), the two knives and a gun in Figure 3 (H), and the two guns and a knife in Figure 3 (I).
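To make the tensor computation concrete, the following NumPy sketch builds the o(o+1)/2 unique orientation tensors and sums the highest-norm ones into a coherent tensor. The helper names (`compute_tensors`, `coherent_tensor`) and the box smoothing filter standing in for K are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def smooth(t, k=3):
    # naive box filter standing in for the smoothing kernel K (assumption)
    pad = k // 2
    tp = np.pad(t, pad, mode="edge")
    out = np.zeros_like(t)
    for dy in range(k):
        for dx in range(k):
            out += tp[dy:dy + t.shape[0], dx:dx + t.shape[1]]
    return out / (k * k)

def compute_tensors(img, o=4):
    """Build the o(o+1)/2 unique tensors T_ij = K * (grad_i . grad_j)."""
    gy, gx = np.gradient(img.astype(float))
    # directional derivatives at o orientations theta_i = i*pi/o
    grads = [np.cos(i * np.pi / o) * gx + np.sin(i * np.pi / o) * gy
             for i in range(o)]
    tensors = []
    for i in range(o):
        for j in range(i, o):          # symmetric matrix: keep upper triangle
            tensors.append(smooth(grads[i] * grads[j]))
    return tensors

def coherent_tensor(tensors, keep=3):
    # rank the unique tensors by their norm and add the most salient ones
    ranked = sorted(tensors, key=lambda t: np.linalg.norm(t), reverse=True)
    return sum(ranked[:keep])
```

For o = 4 orientations, `compute_tensors` yields the expected 4·5/2 = 10 unique tensors, from which the top-ranked ones are fused into a single coherent map.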

Input: X-ray scan (X), Scaling Factor (s), Number of Orientations (o)
Output: Multi-scale Tensor (T_ms)
 1: [m, n] = size(X)
 2: Initialize T_ms (of size m x n) with zeros
 3: Set p = s  // pyramid pooling factor
 4: for i = 0 to l - 1 do
 5:     if i is 0 then
 6:         T = ComputeTensors(X, o)  // T: unique tensors
 7:         T_c = GetCoherentTensor(T)
 8:         T_ms = T_ms + T_c
 9:     else
10:         [m, n] = size(X)
11:         if min(m, n) is not divisible by p or min(m, n) / p is too small then
12:             break
13:         end if
14:         X = Pool(X, p)
15:         T_p = Pool(T_ms, p)
16:         X = X ⊙ T_p  // pixel-wise product with previous transitions
17:         T = ComputeTensors(X, o)
18:         T_c = GetCoherentTensor(T)
19:         T_ms = T_ms + Unpool(T_c, p)
20:     end if
21: end for
Algorithm 1 Tensor Pooling Module
Figure 3: Difference between conventional structure tensors (used in hassan2019; Hassan2020ACCV) and the proposed multi-scale tensor approach. The first row shows the original scans from the OPIXray opixray, GDXray mery2015gdxray, and SIXray miao2019sixray datasets. The second row shows the output of the conventional structure tensors hassan2019; Hassan2020ACCV. The third row shows the output of the proposed tensor pooling module.

As mentioned earlier, the multi-scale tensors are computed through pyramid pooling (up to level l). At any level i (such that 0 < i < l), we multiply, pixel-wise, the decomposed image with the transitions obtained at the previous levels. In so doing, we ensure that the edges of the contraband items (procured earlier) are retained across each scale. The full procedure of the proposed tensor pooling module is depicted in Algorithm 1 and also shown in Figure 2.
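The pyramid loop above can be sketched as follows. This is a minimal NumPy illustration of the pooling, pixel-wise fusion, and accumulation steps; the `coherent` helper here is a gradient-magnitude stand-in for the coherent tensor computation, and all function names and default parameters are our assumptions:

```python
import numpy as np

def coherent(x):
    # stand-in for ComputeTensors + GetCoherentTensor: gradient magnitude
    gy, gx = np.gradient(x)
    return np.hypot(gx, gy)

def pool(x, p):
    # average-pool by factor p (assumes dimensions divisible by p)
    h, w = x.shape
    return x.reshape(h // p, p, w // p, p).mean(axis=(1, 3))

def unpool(x, f):
    # nearest-neighbour upsampling back by factor f
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def tensor_pooling(scan, levels=3, p=2, min_size=4):
    """Sketch of Algorithm 1: accumulate coherent tensors over a pyramid,
    multiplying each pooled scan pixel-wise by the transitions gathered
    at the previous levels so earlier edges are retained."""
    t_ms = coherent(scan.astype(float))            # level 0
    x = scan.astype(float)
    for i in range(1, levels):
        h, w = x.shape
        if min(h, w) % p or min(h, w) // p < min_size:
            break                                   # scan too small to pool
        x = pool(x, p)                              # decompose the scan
        x = x * pool(t_ms, p ** i)                  # fuse previous transitions
        t_ms = t_ms + unpool(coherent(x), p ** i)   # back to full resolution
    return t_ms
```

The returned `t_ms` has the input scan's resolution, with each level's coherent response fused in after upsampling.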

Figure 4: Contour instance segmentation from multi-scale tensors. The first column shows the original scans, the second column shows the multi-scale tensor representations, the third column shows the ground truths, and the fourth column shows the extracted contours of the contraband items.

The multi-scale tensor is then passed to the proposed encoder-decoder model to extract the contours of the individual suspicious items. A detailed discussion about contour instance segmentation is presented in the subsequent section.

3.2 Contour Instance Segmentation

The contour instance segmentation is performed through the proposed asymmetric encoder-decoder network, which assigns each pixel in the multi-scale tensor to one of n + 1 categories, where n denotes the number of prohibited items' instances, to which we add a background class comprising background and irrelevant pixels (i.e., pixels belonging to non-suspicious baggage content).

Furthermore, to differentiate between the contours of normal and suspicious items, custom shape-preserving blocks (SPB) and identity blocks (IB) have been added within the encoder topology. The SPB, as depicted in Figures 2 and 5 (A), integrates the multi-scale tensor map (after scaling) into the feature map extraction to further enforce attention on the prohibited items' outlines. The IB (Figure 5-B), inspired by the ResNet architecture he2016deep, acts as a residual block to emphasize the feature maps of the previous layer.
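The behavior of the two custom blocks can be sketched in plain NumPy. This is a single-channel illustration under our own assumptions (the real blocks are multi-channel Keras layers whose exact structure is in the paper's repository): the IB adds its input back onto the convolved features, and the SPB fuses the multi-scale tensor map into the features via pixel-wise multiplication before convolving:

```python
import numpy as np

def conv3x3(x, w):
    # 'same' 3x3 convolution on a single-channel map (zero padding)
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += w[dy, dx] * xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out

def identity_block(x, w1, w2):
    """IB sketch (hypothetical weights w1, w2): two convolutions whose output
    is added back onto the input, emphasizing previous-layer features."""
    y = np.maximum(conv3x3(x, w1), 0.0)    # conv + ReLU
    y = conv3x3(y, w2)
    return np.maximum(y + x, 0.0)          # residual addition, then ReLU

def shape_preserving_block(feat, tensor_map, w):
    """SPB sketch: multiply the (rescaled) multi-scale tensor map into the
    feature map before convolution, enforcing attention on item outlines."""
    return np.maximum(conv3x3(feat * tensor_map, w), 0.0)
```

With zero convolution weights, `identity_block` reduces to the identity on non-negative inputs, which is exactly the residual shortcut behavior the block is designed around.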

Apart from this, the whole network encompasses one input, one zero-padding, 22 convolution, 20 batch normalization, 12 activation, four pooling, two multiply, six addition, three lambda (implementing the custom functions), and one reshape layer. Moreover, we use skip-connections (via addition) within the encoder-decoder to refine the extracted items' boundaries. The number of parameters within the network is 1,308,160, of which around 6,912 are non-trainable. The detailed summary of the proposed model (including the architectural details of the SPB and IB blocks) is available in the source code repository [1].

[1] The source code of the proposed framework along with its complete documentation is available at

Figure 5: (A) Shape Preserving Block (SPB), (B) Identity Block (IB).

3.3 Bounding Box and Mask Generation

After segmenting the contours, we perform morphological post-processing to remove tiny and isolated fragments. The obtained outlines contain both open and closed contours of the underlying suspicious items. The closed contours can directly lead towards generating the corresponding item’s mask. For open contours, we join their endpoints and then derive their masks through morphological reconstruction. Afterward, we generate the items’ bounding boxes from the masks as shown in Figure 2 (C).
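The mask and bounding box generation step can be illustrated with a small NumPy sketch. The flood-fill below is a stand-in for the morphological reconstruction used in the paper (function names and the 4-connectivity choice are our assumptions): the background is filled from the image border and inverted, so a closed contour and its interior become the item's mask, from which the bounding box follows directly:

```python
import numpy as np
from collections import deque

def fill_closed_contour(contour):
    """Turn a closed-contour map into a filled mask by flood-filling the
    background from the image border and inverting (a stand-in for the
    morphological reconstruction used in the paper)."""
    h, w = contour.shape
    outside = np.zeros((h, w), bool)
    # seed the fill with every non-contour border pixel
    q = deque((y, x) for y in range(h) for x in range(w)
              if (y in (0, h - 1) or x in (0, w - 1)) and not contour[y, x])
    for y, x in q:
        outside[y, x] = True
    while q:  # 4-connected BFS over the background
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not outside[ny, nx] \
                    and not contour[ny, nx]:
                outside[ny, nx] = True
                q.append((ny, nx))
    return ~outside  # contour pixels plus the enclosed interior

def bounding_box(mask):
    # tight axis-aligned box (x1, y1, x2, y2) around the mask
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

For open contours, the paper first joins the endpoints so the same filling step applies; in this sketch that pre-processing is assumed to have already happened.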

4 Experimental Setup

This section presents the details about the experimental protocols, datasets, and evaluation metrics which were used to assess the proposed system's performance and compare it with state-of-the-art methods.

4.1 Datasets

We validated the proposed framework on three different publicly available baggage X-ray datasets, namely, GDXray mery2015gdxray, SIXray miao2019sixray, and OPIXray opixray. The detailed descriptions of these datasets are presented below.

4.1.1 GDXray

GDXray mery2015gdxray was first introduced in 2015, and it contains 19,407 high-resolution grayscale X-ray scans. The dataset is primarily designed for non-destructive testing purposes, and the scans within GDXray mery2015gdxray are arranged into five categories, i.e., welds, baggage, casting, settings, and nature. Baggage is the only relevant group for this study, and it contains 8,150 grayscale X-ray scans. Moreover, the dataset also provides detailed annotations for prohibited items such as shuriken, knives, guns, and razors. As per the dataset standard, 400 scans from GDXray mery2015gdxray were used for training purposes, while the remaining scans were used for testing purposes.

4.1.2 SIXray

SIXray miao2019sixray is a recently introduced large-scale security inspection X-ray dataset. It contains a total of 1,059,231 colored X-ray scans, of which 8,929 scans are positive (containing prohibited items such as knives, wrenches, guns, pliers, hammers, and scissors, along with their ground truths), and 1,050,302 are negative (containing only normal items). To validate performance against class imbalance, the authors of the dataset presented three subsets, namely, SIXray10, SIXray100, and SIXray1000 miao2019sixray. Moreover, SIXray miao2019sixray is also the largest and most challenging dataset (to date) designed to assess threat detection frameworks' performance towards screening extremely cluttered and highly imbalanced contraband data miao2019sixray; hassan2020Sensors. As per the SIXray miao2019sixray dataset standard, we used 80% of the scans for training and the remaining 20% for testing.

4.1.3 OPIXray

OPIXray opixray is the most recent baggage X-ray dataset (released publicly for the research community in 2020). It contains 8,885 colored X-ray scans. As per the dataset standard, out of these 8,885 scans, 7,109 are to be utilized for the training purposes, while the remaining 1,776 are to be used for testing purposes, to detect scissor, straight knife, multi-tool knife, folding knife, and utility knife. Moreover, the dataset authors also quantified occlusion within the test scans into three levels, i.e., OP1, OP2, and OP3. OP1 indicates that the contraband items within the candidate scan contain no or slight occlusion, OP2 depicts a partial occlusion, while OP3 represents severe or full occlusion cases.

We also want to highlight here that the resolution of the scans within each dataset varies significantly (except for OPIXray opixray). For example, the scan resolution varies considerably across GDXray mery2015gdxray, and likewise across SIXray miao2019sixray, whereas all the scans in OPIXray opixray share a resolution of 1225x954x3. In order to process all the scans with the proposed framework, we have re-sized them to a common resolution, as is extensively done in the recently published frameworks hassan2019; Hassan2020ACCV; hassan2020Sensors.

4.2 Training and Implementation Details

The proposed framework was developed using Python 3.7.4 with TensorFlow 2.2.0 and Keras APIs on a machine having an Intel Core i9-10940X@3.30 GHz CPU, 128 GB RAM, and an NVIDIA Quadro RTX 6000 GPU with cuDNN v7.5 and CUDA Toolkit 10.1.243. Some utility functions are also implemented using MATLAB R2021a. Apart from this, the training on each dataset was conducted for a maximum of 50 epochs using ADADELTA as the optimizer (with the default learning and decay rate configurations) and a batch size of 4. Moreover, 10% of the training samples from each dataset were used for validation (after each epoch). For the loss function, we used the focal loss retinanet expressed below:

    L_f = -(1/B) Σ_{b=1}^{B} Σ_{c=1}^{C} α (1 - p_{b,c})^γ y_{b,c} log(p_{b,c}),    (2)

where C represents the total number of classes and B denotes the batch size. p_{b,c} denotes the predicted probability of the logit generated from training sample b for the c-th class, y_{b,c} tells if the training sample b actually belongs to the c-th class or not, and the term (1 - p_{b,c})^γ represents the scaling factor that gives more weight to the imbalanced classes (in other words, it penalizes the network to emphasize the classes for which it obtains low prediction scores). Through rigorous experiments, we empirically selected the optimal values of α and γ as 0.25 and 2, respectively, as they result in faster learning for each dataset while simultaneously showing good resistance to imbalanced data. Apart from this, architecturally, the kernel sizes within the proposed encoder-decoder backbone vary as 3x3 and 7x7, whereas the number of kernels varies as 64, 128, 256, 512, 1024, and 2048. Moreover, the pooling size within the network remained 2x2 across the various network depths to perform the feature decomposition (at each depth) by a factor of 2. For more architectural and implementation details of the proposed framework, we refer the reader to the source code, which we have released publicly for the research community on GitHub1.
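The focal loss retinanet can be written out in a few lines. The NumPy sketch below (assuming one-hot labels and already-softmaxed probabilities, which is our simplification of the batched Keras loss) shows how the (1 - p)^γ factor down-weights well-classified examples:

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss sketch: y_true is one-hot (B x C), y_pred holds predicted
    class probabilities (B x C). The (1 - p)^gamma term shrinks the loss of
    confidently correct predictions, emphasizing hard, imbalanced classes."""
    p = np.clip(y_pred, eps, 1.0 - eps)              # numerical safety
    per_entry = -alpha * (1.0 - p) ** gamma * y_true * np.log(p)
    return per_entry.sum() / y_true.shape[0]         # average over the batch
```

With α = 0.25 and γ = 2 (the values selected in the paper), a confident correct prediction contributes almost nothing, while a low-confidence correct class dominates the gradient.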

4.3 Evaluation Metrics

In order to assess the proposed approach and compare it with the existing works, we used the following evaluation metrics:

4.3.1 Intersection-over-Union

Intersection-over-Union (IoU) tells how accurately the suspicious items have been extracted, and it is measured by checking the pixel-level overlap between the predictions and the ground truths. Mathematically, IoU is defined as:

    IoU = TP / (TP + FP + FN),    (3)

where TP are true positives (indicating that the pixels of the contraband items are correctly predicted w.r.t the ground truth), FP represents false positives (indicating that the background pixels are incorrectly classified as positives), and FN represents false negatives (meaning that the pixels of the contraband items are misclassified as background). Furthermore, we also calculated the mean IoU by taking an average of the IoU scores for each contraband item class.

4.3.2 Dice Coefficient

Apart from IoU scores, we also computed the dice coefficient (DC) scores to assess the proposed system's performance for extracting the contraband items. DC is calculated through:

    DC = 2TP / (2TP + FP + FN),    (4)

Compared to IoU, DC gives more weightage to the true positives (as evident from Eq. 4). Moreover, the mean DC is calculated by averaging the DC scores for each category.
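Both pixel-level metrics follow directly from the TP/FP/FN counts, as the short sketch below shows for a single class (boolean masks assumed):

```python
import numpy as np

def iou_and_dice(pred, gt):
    """Pixel-level IoU and Dice coefficient for one class.
    pred and gt are boolean arrays of the same shape."""
    tp = np.logical_and(pred, gt).sum()    # item pixels predicted correctly
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as item
    fn = np.logical_and(~pred, gt).sum()   # item pixels missed
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice
```

Note that for the same prediction, Dice is always at least as large as IoU, since the true positives are counted twice in both the numerator and the denominator.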

4.3.3 Mean Average Precision

The mean average precision (mAP) (in the proposed study) is computed by taking the mean of the average precision (AP) scores calculated for each contraband item class at the IoU threshold of 0.5. Mathematically, mAP is expressed below:

    mAP = (1/N) Σ_{i=1}^{N} AP_i,    (5)

where N denotes the number of contraband item classes in each dataset. Here, we want to highlight that, to achieve a fair comparison with the state-of-the-art, we have used the original bounding box ground truths of each dataset for measuring the proposed framework's performance towards extracting the suspicious and illegal items.
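A per-class AP at the 0.5 IoU threshold can be sketched as follows. This is a minimal, greedy-matching version under our own assumptions (score-descending matching, all-point interpolation of the precision-recall curve); mAP is then simply the mean of the per-class APs:

```python
import numpy as np

def box_iou(a, b):
    # IoU of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(preds, gts, thr=0.5):
    """AP for one class: preds are (score, box) pairs, gts are boxes.
    Detections are matched greedily, score-descending, at IoU >= thr."""
    preds = sorted(preds, key=lambda p: -p[0])
    matched = [False] * len(gts)
    tp = np.zeros(len(preds)); fp = np.zeros(len(preds))
    for k, (_, box) in enumerate(preds):
        ious = [box_iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thr and not matched[j]:
            tp[k], matched[j] = 1, True
        else:
            fp[k] = 1                      # duplicate or unmatched detection
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    ap, prev_r = 0.0, 0.0                  # all-point PR-curve integration
    for r, p in zip(rec, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

Averaging `average_precision` over the N contraband item classes gives the mAP of Eq. 5.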

5 Results

In this section, we present the detailed results obtained with GDXray mery2015gdxray, SIXray miao2019sixray, and OPIXray opixray datasets. Before going into the experimental results, we present detailed ablation studies to determine the proposed framework’s hyper-parameters. We also report a detailed comparison of the proposed encoder-decoder network with the popular segmentation models.

5.1 Ablation Studies

The ablation studies in this paper aim to determine the optimal values for 1) the number of orientations and scaling levels within the tensor pooling module and 2) the choice of the backbone model for performing the contour instance segmentation.


Figure 6: Detection performance of the proposed system in terms of mAP (A, B, C), and computational time in terms of seconds (D, E, F) obtained for GDXray mery2015gdxray, SIXray miao2019sixray, and OPIXray opixray datasets, respectively.

5.1.1 Number of Orientations and the Scaling Levels

The tensor pooling module highlights the baggage content transitions in the image gradients oriented in o directions and up to l scaling levels. Increasing these parameters helps generate the best contour representation, leading towards a more robust detection, but also incurs additional computational cost. As depicted in Figure 6 (A), we can see that for the GDXray dataset mery2015gdxray, the smallest tested combination of o and l yields an mAP score of 0.82. With the largest tested combination, we get a 16.54% improvement in the detection performance, but at the expense of a 97.71% increase in computational time (see Figure 6-D). Similarly, on the SIXray dataset miao2019sixray, we obtain an 18.36% improvement in the detection performance (by increasing o and l) at the expense of a 95.88% increase in computational time (see Figure 6 (B, E)). The same behavior is also noticed for the OPIXray dataset opixray in Figure 6 (C, F). Considering all the combinations depicted in Figure 6, we found the values of o and l that provide the best trade-off between detection and run-time performance across all three datasets.

5.1.2 Choice of a Backbone Model

The proposed backbone model has been specifically designed to segment the suspicious items' contours while discarding the normal baggage content. In this series of experiments, we compared the proposed asymmetric encoder-decoder model's performance with popular encoder-decoder, scene parsing, and fully convolutional networks. We report the performance results in terms of DC and IoU in Table 1. We can observe that the proposed framework achieves the best extraction performance on the OPIXray opixray and SIXray miao2019sixray datasets, leading the second-best UNet ronneberger2015unet by 2.34% and 3.72%, respectively. On GDXray mery2015gdxray, however, it lags behind FCN-8 fcn8 and PSPNet zhao2017pyramid by 6.54% and 5.91%, respectively. But as our model outperforms all the other architectures on the large-scale SIXray miao2019sixray and OPIXray opixray datasets, we chose it as the backbone for the rest of the experimentation.

Met Data Proposed PSPNet UNet FCN-8
IoU GDX 0.4994 0.5585 0.4921 0.5648
SIX 0.7072 0.5659 0.6700 0.6613
OPI 0.7393 0.5645 0.7159 0.5543
DC GDX 0.6661 0.7167 0.6596 0.7219
SIX 0.8285 0.7227 0.8024 0.7961
OPI 0.8501 0.7217 0.8344 0.7132
Table 1: Performance comparison of the proposed backbone network with PSPNet zhao2017pyramid, UNet ronneberger2015unet and FCN-8 fcn8 for recognizing the boundaries of the contraband items. The best and second-best performances are in bold and underline, respectively. Moreover, the abbreviations are: Met: Metric, Data: Dataset, GDX: GDXray mery2015gdxray, SIX: SIXray miao2019sixray, OPI: OPIXray opixray.
Data  Items     PF      CST     TST     FD
GDX   Gun       0.9872  0.9101  0.9761  -
GDX   Razor     0.9691  0.8826  0.9453  -
GDX   Shuriken  0.9735  0.9917  0.9847  -
GDX   Knife     0.9820  0.9945  0.9632  -
GDX   mAP       0.9779  0.9281  0.9672  -
SIX   Gun       0.9863  0.9911  0.9734  -
SIX   Knife     0.9811  0.9347  0.9681  -
SIX   Wrench    0.9882  0.9915  0.9421  -
SIX   Scissor   0.9341  0.9938  0.9348  -
SIX   Pliers    0.9619  0.9267  0.9573  -
SIX   Hammer    0.9172  0.9189  0.9342  -
SIX   mAP       0.9614  0.9595  0.9516  -
OPI   Folding   0.8528  -       0.8024  0.8671
OPI   Straight  0.7649  -       0.5613  0.6858
OPI   Scissor   0.8803  -       0.8934  0.9023
OPI   Multi     0.8941  -       0.7802  0.8767
OPI   Utility   0.8062  -       0.7289  0.7884
OPI   mAP       0.8396  -       0.7532  0.8241
Table 2: Performance comparison between state-of-the-art baggage threat detection frameworks on GDXray (GDX), SIXray (SIX), and OPIXray (OPI) dataset in terms of mAP scores. ’-’ indicates that the respective score is not computed. Moreover, the abbreviations are: Data: Dataset, GDX: GDXray mery2015gdxray, SIX: SIXray miao2019sixray, OPI: OPIXray opixray, PF: Proposed Framework, and FD: FCOS fcos + DOAM opixray.

5.2 Evaluation on GDXray Dataset

The performance of the proposed framework and of the state-of-the-art methods on the GDXray mery2015gdxray dataset is reported in Table 2. We can observe that the proposed framework outperforms the CST hassan2019 and TST Hassan2020ACCV frameworks by 4.98% and 1.07%, respectively. Furthermore, we highlight that CST hassan2019 is only an object detection scheme, i.e., it can only localize the detected items but cannot generate their masks. Masks are important for human observers when cross-verifying the baggage screening results (and identifying the false positives), especially in cluttered and challenging grayscale scans. In Figure 7, we report some cluttered and challenging cases showcasing the effectiveness of the proposed framework in extracting overlapping contraband items. For example, see the extraction of the merged knife instances in (H) and the cluttered shuriken in (J, L). We can also appreciate how accurately the razors have been extracted in (J, L). Extracting such low-contrast objects with the competing CST framework first requires iteratively suppressing all the sharp transitions hassan2019.

Figure 7: Qualitative evaluations of the proposed framework on the GDXray mery2015gdxray dataset. Please zoom in for the best visualization.

5.3 Evaluations on SIXray Dataset

The proposed framework has been evaluated on the whole SIXray dataset miao2019sixray (containing 1,050,302 negative scans and 8,929 positive scans) and also on each of its subsets miao2019sixray. In Table 2, we can observe that the proposed framework achieves an overall performance gain of 0.19% and 0.98% over the CST hassan2019 and TST Hassan2020ACCV frameworks, respectively. In Table 3, we report the results obtained on each subset of the SIXray dataset miao2019sixray, reflecting different imbalance ratios between normal and prohibited item categories. The results further confirm the superiority of the proposed framework over other state-of-the-art solutions, especially CHR miao2019sixray and gaus2019evaluating. In addition, on the extremely challenging SIXray1000 subset, the proposed framework leads the second-best TST framework Hassan2020ACCV by 3.22%, and CHR miao2019sixray by 44.36%.

Apart from this, Figure 8 depicts the qualitative evaluations of the proposed framework on the SIXray miao2019sixray dataset. In this figure, the first row shows examples containing one instance of a suspicious item, whereas the second and third rows show scans containing two or more instances of suspicious items. Here, we can appreciate how accurately the proposed scheme has picked the cluttered knife in (B). Moreover, we can also observe the extracted chopper (knife) in (D), despite its similar contrast with the background. More examples, such as (F, H, and J), demonstrate the proposed framework’s capacity to pick cluttered items from the SIXray dataset miao2019sixray.

Figure 8: Qualitative evaluations of the proposed framework on the SIXray miao2019sixray dataset. Please zoom in for the best visualization.
Subset PF DTS CHR gaus2019evaluating TST
SIX-10 0.9793 0.8053 0.7794 0.8600 0.9601
SIX-100 0.8951 0.6791 0.5787 - 0.8749
SIX-1k 0.8136 0.4527 0.3700 - 0.7814
Table 3: Performance comparison of the proposed framework with state-of-the-art solutions on the SIXray subsets. For a fair comparison, all models are evaluated using ResNet-50 he2016deep as a backbone. Moreover, the abbreviations are: SIX-10: SIXray10 miao2019sixray, SIX-100: SIXray100 miao2019sixray, SIX-1k: SIXray1000 miao2019sixray, and PF: Proposed Framework.

5.4 Evaluations on OPIXray Dataset

The performance evaluation of the proposed framework on the OPIXray dataset opixray is reported in Table 2. We can observe that the proposed system achieves an overall mAP score of 0.8396, outperforming the second-best DOAM framework opixray (driven via FCOS fcos) by 1.55%. Although the overall performance of the two frameworks is comparable, we still achieve a significant lead of 7.91% over DOAM opixray for extracting straight knives.

Concerning the level of occlusion (as mentioned earlier, OPIXray opixray splits its test data into three subsets, OP1, OP2, and OP3, according to the level of occlusion), we can see in Table 4 that the proposed framework achieves the best performance at each occlusion level compared to the second-best DOAM opixray framework driven by the single-shot detector (SSD) Liu2016SSD.

Figure 9 reports some qualitative evaluations, where we can appreciate the recognition of the cluttered scissors (e.g., see B and F) and the overlapping straight knife (in H). We can also notice the detection of the partially occluded folding and straight knives in (D) and (J).

Figure 9: Qualitative evaluations of the proposed framework on the OPIXray opixray dataset. Please zoom in for the best visualization.

5.5 Failure Cases

In Figure 10, we report examples of failure cases encountered during testing. In cases (B, H, N, and P), the proposed framework could not pick up the whole region of the contraband items, even though the items were detected correctly. Such cases occur in highly occluded scans such as (A and G), where it is difficult, even for a human observer, to properly delineate the items’ regions. The second type of failure corresponds to pixel misclassification, as shown in (D), where some of the gun’s pixels have been misclassified as a knife. These scenarios can be addressed through post-processing steps such as blob removal and region filling. The third failure case relates to the proposed framework’s inability to generate a single bounding box for the same item. Such a case is depicted in (F), where two bounding boxes were generated for the single orange knife. One possible remedy is to generate the bounding box from the minimum and maximum mask extents in both image dimensions for each label. Another type of failure is shown in (J) and (L). Here, the scans contain only normal baggage content, but some pixels occupying tiny regions have been misclassified as contraband (i.e., knife), producing false positives. This kind of failure can also be addressed through a blob removal scheme.

Examining the statistical distribution of the failure cases, we found that a majority of 86.09% belong to the curable categories (i.e., the second, third, and fourth), meaning that the proposed framework’s performance can be further improved using the post-processing techniques mentioned above.
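The two curative post-processing steps mentioned above, blob removal and bounding-box generation from mask extents, can be sketched as follows. The area threshold is an illustrative value, not one taken from the paper:

```python
import numpy as np
from scipy.ndimage import label

def remove_small_blobs(mask, min_area=50):
    """Suppress tiny misclassified regions (the false-positive failure
    type). `min_area` is an illustrative threshold, not a value from
    the paper; in practice it would be tuned per dataset."""
    labeled, n_components = label(mask)
    cleaned = np.zeros_like(mask, dtype=bool)
    for i in range(1, n_components + 1):
        component = labeled == i
        if component.sum() >= min_area:
            cleaned |= component
    return cleaned

def bbox_from_mask(mask):
    """Single bounding box (r_min, c_min, r_max, c_max) from the
    minimum/maximum mask extents in both image dimensions, the remedy
    suggested for the duplicate-box failure case."""
    rows, cols = np.where(mask)
    if rows.size == 0:
        return None
    return rows.min(), cols.min(), rows.max(), cols.max()
```

Taking the extents over all pixels of one label guarantees a single box per item, regardless of how many fragments the detector originally proposed.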

Method OP1 OP2 OP3
Proposed 0.7946 0.7382 0.7291
DOAM + SSD opixray 0.7787 0.7245 0.7078
SSD Liu2016SSD 0.7545 0.6954 0.6630
Table 4: Performance comparison of the proposed framework with DOAM opixray (backboned by SSD Liu2016SSD) at different occlusion levels of the OPIXray opixray dataset.

6 Conclusion

In this work, we proposed a novel contour-driven approach for detecting cluttered and occluded contraband items (and their instances) within baggage X-ray scans, hypothesizing that contours are the most robust cues given the lack of texture in X-ray imagery. We concretized this approach through a tensor pooling module, which produces multi-scale tensor maps highlighting the items’ outlines within the X-ray scans, and an instance segmentation model acting on this representation. We validated our approach on three publicly available datasets encompassing gray-level and colored scans and showcased its overall superiority over competitive frameworks in various aspects. For instance, the proposed framework outperforms the state-of-the-art methods opixray; hassan2019; Hassan2020ACCV; hassan2020Sensors by 1.07%, 0.19%, and 1.55% on the GDXray mery2015gdxray, SIXray miao2019sixray, and OPIXray opixray datasets, respectively. Furthermore, on each SIXray subset (i.e., SIXray10, SIXray100, SIXray1000) miao2019sixray, the proposed framework leads the state-of-the-art by 1.92%, 2.02%, and 3.22%, respectively.

In the future, we aim to apply the proposed framework to recognizing 3D-printed contraband items in X-ray scans. Such items exhibit poor visibility in X-ray scans because of their organic material, making them an enticing and challenging case to investigate and address.

Figure 10: Failure cases from GDXray mery2015gdxray, SIXray miao2019sixray, and OPIXray opixray dataset.

Conflicts of Interest Declarations

Funding: This work is supported by a research fund from ADEK (Grant Number: AARE19-156) and Khalifa University (Grant Number: CIRA-2019-047).

Conflict of Interest: The authors have no conflicts of interest to declare that are relevant to this article.

Financial and Non-Financial interests: All the authors declare that they have no financial or non-financial interests to disclose for this article.

Employment: The authors conducted this research during their employment in the following institutes:

  • T. Hassan (Khalifa University, UAE),

  • S. Akçay (Durham University, UK),

  • M. Bennamoun (The University of Western Australia, Australia),

  • S. Khan (Mohamed bin Zayed University of Artificial Intelligence, UAE), and

  • N. Werghi (Khalifa University, UAE).

Ethics Approval: All the authors declare that no prior ethical approval was required from their institutes to conduct this research.

Consent for Participate and Publication: All the authors declare that no prior consent was needed to disseminate this article as there were no human (or animal) participants involved in this research.

Availability of Data and Material: All the datasets that have been used in this article are publicly available.

Code Availability: The source code of the proposed framework is released publicly on GitHub.

Authors’ Contributions: T. Hassan formulated the idea, wrote the manuscript, and performed the experiments. S. Akçay improved the initial design of the framework and contributed to manuscript writing. M. Bennamoun co-supervised the whole research, reviewed the manuscript and experiments. S. Khan reviewed the manuscript, experiments and improved the manuscript writing. N. Werghi supervised the whole research, contributed to manuscript writing, and reviewed the experimentation.