X-ray imaging is a widely adopted tool for baggage inspection at airports, malls and cargo transit checkpoints. Baggage threats have become a prime concern all over the world. According to a recent report, approximately 1.5 million passengers are screened every day in the United States for weapons and other dangerous items. The manual screening process is resource-intensive and requires the constant attention of human experts during monitoring. This introduces an additional risk of human error caused by fatiguing work schedules, the difficulty of spotting contraband items, the need for quick decision-making, or simply a less experienced operator.
Therefore, aviation authorities all over the world are actively looking for automated and reliable baggage screening systems to increase inspection speed and support operator alertness. Automated inspection is also desirable for privacy reasons. A number of large-scale natural image datasets for object detection are publicly available, enabling the development of popular object detectors like R-CNN , SPP-Net , YOLO  and RetinaNet . In contrast, only a few X-ray image datasets are currently available for researchers to develop robust computer-aided screening systems. Moreover, the nature of radiographs is quite different from that of natural photographs. Although they can reveal information invisible in normal photographs (thanks to the penetrating radiation), they lack texture (especially grayscale X-ray scans), due to which conventional detection methods do not work well on them . In general, screening objects and anomalies from baggage X-ray (grayscale or colored) scans is a challenging task, especially when the objects are closely packed together, leading to heavy occlusion. In addition, baggage screening systems face a severe class imbalance problem due to the low ratio of suspicious to normal items. It is therefore highly challenging to develop an unbiased decision support system that can effectively screen baggage items, given the high contribution of normal items within the training images. Figure 1 shows some X-ray baggage scans in which suspicious items such as guns and knives are highlighted in a heavily occluded and cluttered environment.
II Related Work
Several methods for detecting suspicious objects in X-ray imagery have been proposed in the literature. These can be categorized as conventional machine learning methods and deep learning methods. We provide a representative list of the main approaches and refer the reader to  for an exhaustive survey.
II-A Traditional Approaches
Many researchers have used traditional machine learning (ML) methods to recognize baggage items from X-ray scans. Bastan et al.  proposed a structured learning framework that exploits dual-energy levels for the computation of low-texture key-points to detect laptops, handguns and glass bottles from multi-view X-ray imaging. They also presented a framework that utilizes Harris, SIFT, SURF and FAST descriptors to extract key-points, which are then passed to a bag of words (BoW) model for the classification and retrieval of X-ray images . They concluded that although BoW produces promising results on regular images, it does not perform well on low-texture X-ray images . Jaccard et al.  proposed an automated method for detecting cars in X-ray cargo imagery, while the authors of  proposed a framework that uses the SURF descriptor to extract distinct features and passes them to a BoW model for object recognition in baggage X-ray images. Riffo et al.  proposed an adapted implicit shape model (AISM) that first generates a category-specific codebook for different parts of the targeted objects along with their spatial occurrences (from training images) using the SIFT descriptor and agglomerative clustering. Then, based upon the voting space and matched codebook entries, the framework detects the target object in the candidate test image.
II-B Deep Learning Based Suspicious Items Detection
Recently, many researchers have presented work employing deep learning architectures for the detection and classification of suspicious baggage items from X-ray imagery. These studies are either focused on the usage of supervised classification models or unsupervised adversarial learning:
II-B1 Unsupervised Anomaly Detection
Adversarial learning has also been explored for this task: GANomaly  and a skip-connected variant use encoder-decoder architectures to detect different anomalies from X-ray scans, deriving a latent space representation that a discriminator network uses to classify anomalies. Both architectures are trained only on normal distributions, while they are tested on normal and abnormal distributions from CIFAR-10, Full Firearm vs Operational Benign (FFOB) and local in-house datasets (GANomaly is also verified on the MNIST dataset).
II-B2 Supervised Approaches
Akçay et al.  proposed using a pre-trained GoogLeNet model  for object classification from X-ray baggage scans. They prepared their own in-house dataset and tested their proposed framework on the detection of cameras, laptops, guns, gun components and knives (mainly ceramic knives). A subsequent work  compared different frameworks for object classification from X-ray imagery and concluded that AlexNet  as a feature extractor with support vector machines (SVM) performs better than other ML methods. For occluded data, they compared the performance of sliding window-based CNN (SW-CNN), Faster R-CNN, region-based fully convolutional networks (R-FCN)  and YOLOv2  for object recognition. They used their local datasets as well as the non-publicly available FFOB and Full Parts Operation Benign (FPOB) datasets in their experiments. Likewise, Dhiraj et al.  used YOLOv2 , Tiny YOLO  and Faster R-CNN  to extract guns, shuriken, razor blades and knives from baggage X-ray scans of the GRIMA X-ray database (GDXray ), where they achieved an accuracy of up to 98.4%. Furthermore, they reported a computational time of 0.16 seconds to process a single image. Gaus et al.  proposed a dual-CNN framework in which the first CNN model detects the object of interest while the second classifies it as benign or malicious. For object detection, the authors evaluated Faster R-CNN , Mask R-CNN  and RetinaNet  using ResNet18 , ResNet50 , ResNet101 , SqueezeNet  and VGG-16  as backbones. Recently, Miao et al.  provided one of the most extensive and challenging X-ray imagery datasets (named SIXray) for detecting suspicious items. This dataset contains 1,059,231 scans with heavily occluded and cluttered objects, out of which 8,929 scans contain suspicious items. Furthermore, the dataset emulates the class imbalance of real-world scenarios by providing different subsets in which the ratios of positive to negative samples are varied. The authors also developed a deep class-balanced hierarchical refinement (CHR) framework that iteratively infers the image content through reverse connections and uses a custom class-balanced loss function to accurately recognize the contraband items in the highly imbalanced SIXray dataset. They also evaluated the CHR framework on the ILSVRC 2012 large-scale image classification dataset. After the release of the SIXray dataset, Gaus et al.  evaluated Faster R-CNN , Mask R-CNN  and RetinaNet  on it, as well as on other non-publicly available datasets.
To the best of our knowledge, all the methods proposed so far have been tested either on a single dataset or on datasets containing a similar type of X-ray imagery. Furthermore, few frameworks have been applied to complex radiographs for the detection of heavily occluded and cluttered baggage items. Many recent frameworks that can detect multiple objects and potential anomalies use CNN models (and object detectors) as a black box, where the raw images are passed to the network for object detection. Considering real-world scenarios, in which most baggage items are so heavily occluded that even human experts can miss them, it is not straightforward for these frameworks to produce optimal results. As reported in
, for a deep network to estimate the correct class, the feature kernels should be distinct. This condition is hard to fulfil for occluded objects obtained from raw images (without any initial processing), thus making the prediction of the correct class quite a challenge. In , different CNN-based object detectors were evaluated on the SIXray10 subset to detect only guns and knives. Despite the progress accomplished by the above works, the challenge of correctly recognizing heavily occluded and cluttered items in the SIXray dataset is still to be addressed.
In this paper, we present a cascaded structure tensor (CST) framework for the detection and classification of suspicious items. The proposed framework is unique in that it uses only a single feed-forward CNN model for object recognition and, instead of passing raw images or removing unwanted regions, it intelligently extracts each object proposal by iteratively picking the contour-based transitional information from different orientations within the input scan. The proposed framework is robust to occlusion and can easily detect heavily cluttered objects, as will be evidenced in the results section. The main contributions of this paper and the features of the proposed framework are summarized below:
We present a novel object recognition framework that can extract suspicious baggage items from X-ray scans and recognize them using just a single feed-forward CNN model.
The proposed framework is immune to the class imbalance problem since it is trained directly on a balanced set of normal and suspicious item proposals rather than on the set of scans containing an imbalanced ratio of normal and suspicious items.
The extraction of item proposals is performed through a novel CST framework, which analyzes the object transitions and coherency within a series of tensors generated from the candidate X-ray scan.
The proposed CST framework exhibits high robustness to occlusion, scan type, noisy artefacts and highly cluttered scenarios.
The proposed framework achieves mean intersection-over-union (mIoU) scores of 0.9644 and 0.9689, area under the curve (AUC) scores of 0.9878 and 0.9950, and mean average precision (mAP) scores of 0.9343 and 0.9595 for detecting normal and suspicious items from the GDXray and SIXray datasets, respectively (see Section VI).
IV Proposed Method
The block diagram of the proposed framework is depicted in Figure 2. In the first step, we enhance the contrast of the input image via an adaptive histogram equalization . Afterwards, we generate a series of tensors where each tensor embodies information about the targeted objects from different orientations. We employ these tensors for the automatic extraction of baggage items proposals from the scans. We subsequently pass these proposals to a pre-trained network for object recognition. Each module within the proposed framework is described next:
The primary objective of the preprocessing stage is to enhance the low-contrast input. We perform contrast stretching through adaptive histogram equalization . The X-ray scan is divided into a grid of rectangular patches. The histogram of each patch is computed and locally normalized using the following relation: h(v) = round( (cdf(v) − cdf_min) / (n_p − cdf_min) × L ), where cdf is the cumulative distribution function of the patch histogram, cdf_min is its minimum non-zero value, n_p is the number of pixels in the patch, L represents the maximum grayscale level, round(·) is the rounding function and h is the enhanced histogram mapping for the patch. This process is repeated for all patches to obtain the contrast-stretched scan, as shown in Figure 3. We can observe that the occluded gun is visible in the enhanced scan.
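A minimal sketch of this per-patch equalization, assuming a uniform grid and 8-bit scans (the grid size is an illustrative choice; practical adaptive histogram equalization implementations additionally limit contrast and interpolate between neighbouring tiles, which is omitted here):

```python
import numpy as np

def adaptive_hist_eq(img, grid=(4, 4), levels=256):
    """Per-patch histogram equalization sketch for an 8-bit grayscale scan."""
    out = img.copy()
    h, w = img.shape
    ph, pw = h // grid[0], w // grid[1]
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            patch = img[gy * ph:(gy + 1) * ph, gx * pw:(gx + 1) * pw]
            hist = np.bincount(patch.ravel(), minlength=levels)
            cdf = hist.cumsum()
            cdf_min = cdf[cdf > 0].min()
            n_p = patch.size
            # h(v) = round((cdf(v) - cdf_min) / (n_p - cdf_min) * L), L = levels - 1
            lut = np.round((cdf - cdf_min) / max(n_p - cdf_min, 1)
                           * (levels - 1)).clip(0, levels - 1)
            out[gy * ph:(gy + 1) * ph, gx * pw:(gx + 1) * pw] = lut[patch]
    return out.astype(np.uint8)
```

Each patch's look-up table stretches its local intensity range to the full grayscale span, which is what makes locally low-contrast objects (such as the occluded gun in Figure 3) visible.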
IV-B Cascaded Structure Tensor (CST) Framework
CST is a framework for object extraction from a candidate scan based on a novel contoured structure tensor approach. We first review the conventional structure tensor and then describe our more generalized approach, proposed in the context of suspicious baggage item detection.
IV-B1 Conventional 2D Discrete Structure Tensor
A 2D discrete structure tensor is defined as a second-moment matrix derived from the image gradients. It reflects the predominant orientations of the changes (contained in the gradients) within a specified neighbourhood of any pixel in the input scan [34, 35]. Furthermore, the structure tensor indicates the degree to which those orientations are coherent. For a pixel of the input scan, the structure tensor is defined as:
where w is a smoothing filter, typically chosen as a Gaussian function, and Ix and Iy denote the image gradients w.r.t. the x and y directions, respectively. From Eq. (2), we can observe that for each pixel in the input scan we obtain a 2 × 2 matrix representing the distribution of orientations within the image gradients w.r.t. its neighbourhood. In order to measure how strongly this matrix is biased towards a particular direction, the degree of coherency or anisotropy is computed through the two eigenvalues as follows:
where the coherence measure c = ((λ1 − λ2) / (λ1 + λ2))² satisfies 0 ≤ c ≤ 1, and λ1 and λ2 are the two eigenvalues. In the proposed framework, we are not primarily interested in finding the strong orientations within the specified neighbourhood of a pixel. Rather, our purpose is to extract coherence maps that represent the maximum transitions of baggage items from different orientations. Furthermore, in the conventional structure tensor, the gradients are computed in two orthogonal directions only, whereas the baggage items within X-ray scans can be oriented in other directions as well.
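The conventional tensor and its coherence measure can be sketched as follows; a simple box window stands in here for the Gaussian smoothing filter, and the eigenvalues of the 2 × 2 matrix are obtained in closed form:

```python
import numpy as np

def structure_tensor_coherence(img, half_win=3):
    """Coherence map of the 2D discrete structure tensor (sketch)."""
    Iy, Ix = np.gradient(img.astype(float))  # gradients along rows (y), cols (x)

    def smooth(a):
        # Box smoothing over a (2*half_win+1)^2 neighbourhood; a Gaussian
        # is the typical choice, a box keeps the sketch dependency-free.
        k = half_win
        pad = np.pad(a, k, mode='edge')
        out = np.zeros_like(a)
        for dy in range(-k, k + 1):
            for dx in range(-k, k + 1):
                out += pad[k + dy:k + dy + a.shape[0], k + dx:k + dx + a.shape[1]]
        return out / (2 * k + 1) ** 2

    Jxx, Jxy, Jyy = smooth(Ix * Ix), smooth(Ix * Iy), smooth(Iy * Iy)
    # Closed-form eigenvalues of the 2x2 tensor [[Jxx, Jxy], [Jxy, Jyy]].
    disc = np.sqrt((Jxx - Jyy) ** 2 + 4 * Jxy ** 2)
    lam1, lam2 = (Jxx + Jyy + disc) / 2, (Jxx + Jyy - disc) / 2
    # Coherence c in [0, 1]: ~1 for a strongly oriented edge, 0 for isotropy.
    return ((lam1 - lam2) / (lam1 + lam2 + 1e-12)) ** 2
```

A strongly oriented structure such as a straight edge yields a coherence close to 1, while flat or isotropic regions yield values near 0.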
IV-B2 Modified Structure Tensor
The modified version of the structure tensor can reveal objects oriented in any direction within the candidate image. It is defined over image gradients associated with N different orientations, yielding a structure tensor matrix of order N × N, as shown below:
Each coherence map or tensor in Eq. (4) is the product of a smoothing filter with a pair of image gradients taken along two of the N orientations. Also, rather than using a Gaussian filter for smoothing, we employ an anisotropic diffusion filter , as it is extremely robust in removing noisy outliers while preserving the edge information. Moreover, it can be noted that the resulting matrix is symmetric, which means that it contains N(N + 1)/2 unique tensors. The gradient orientations in the modified structure tensor are obtained by uniformly sampling the angular range as the orientation index is varied from 1 to N. For example, for N = 1 we have only one tensor for a single image gradient; for N = 2 we have four tensors for two orthogonally oriented image gradients. Figure 4 shows the orientation of the image gradients for different values of N. Figure 5 shows the six unique tensors obtained for randomly selected scans when N = 3.
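Assuming uniformly spaced angles θ_k = kπ/N (the exact angular sampling is fixed by Eq. (4); this spacing is an illustrative assumption), the multi-orientation tensors can be sketched as pairwise products of directional derivatives:

```python
import numpy as np

def multi_orientation_tensors(img, n_orient=3):
    """Sketch of the multi-orientation structure tensor entries:
    directional derivatives at n_orient uniformly spaced angles,
    combined pairwise (smoothing/diffusion filtering omitted)."""
    Iy, Ix = np.gradient(img.astype(float))
    # Directional derivative along angle theta: cos(theta)*Ix + sin(theta)*Iy.
    thetas = [k * np.pi / n_orient for k in range(n_orient)]
    grads = [np.cos(t) * Ix + np.sin(t) * Iy for t in thetas]
    # Pairwise products form an n_orient x n_orient symmetric matrix of maps;
    # only the upper triangle, n(n+1)/2 maps, is unique.
    tensors = {}
    for i in range(n_orient):
        for j in range(i, n_orient):
            tensors[(i, j)] = grads[i] * grads[j]
    return tensors
```

For n_orient = 3 this yields the six unique maps mentioned above (Figure 5); the lower triangle is redundant by symmetry.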
IV-B3 Coherent Tensors
As mentioned above, with N orientations we obtain N(N + 1)/2 unique tensors. Each tensor highlights objects in the candidate image w.r.t. the predominant orientations of its respective image gradients. However, considering all these tensors for the generation of object contours (and proposals) is redundant and time-consuming, and might make the proposed framework vulnerable to noise. We instead select a reduced number M of predominant tensors, which we dub the coherent tensors. The optimal values of N and M are determined empirically in the ablation analysis (see Section VI-A).
IV-B4 Extraction of Object Proposals
After obtaining the coherent tensors, we add them together to obtain a single coherent representation of the candidate image, which is then binarized and morphologically enhanced to remove unwanted blobs and noisy artefacts. After generating the baggage item contours, they are labelled through connected component analysis and, for each labelled contour, a bounding box is generated based upon the minimum bounding rectangle technique (see Figure 6). This bounding box is then used to extract the respective object proposal from the candidate image. Proposals generated by the CST framework contain either a single object or overlapping objects (see examples in Figure 7).
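The proposal-extraction step can be sketched as follows; the binarization threshold and minimum blob area are illustrative values, and a breadth-first search stands in for the connected component analysis (morphological enhancement is omitted):

```python
import numpy as np
from collections import deque

def extract_proposals(coherent_map, thresh=0.5, min_area=20):
    """Binarize the summed coherent representation, label connected
    components, and return one axis-aligned bounding box per blob."""
    binary = coherent_map > thresh
    visited = np.zeros_like(binary, dtype=bool)
    boxes = []
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not visited[sy, sx]:
                # BFS over the 4-connected component starting at (sy, sx).
                q = deque([(sy, sx)])
                visited[sy, sx] = True
                ys, xs = [], []
                while q:
                    y, x = q.popleft()
                    ys.append(y)
                    xs.append(x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] \
                                and not visited[ny, nx]:
                            visited[ny, nx] = True
                            q.append((ny, nx))
                if len(ys) >= min_area:  # drop tiny noisy blobs
                    boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes  # each box as (x0, y0, x1, y1)
```

Each returned box is then used to crop the corresponding proposal from the candidate scan.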
Object borders in an X-ray scan exhibit different levels of intensity. This disparity makes objects with high transition-level borders (e.g. the shuriken in Figure 8A) more likely to be detected than those showing weaker transitions (the razor blades in Figure 8A). In order to accommodate these different levels of boundary intensity, we employ the CST in a multi-pass scheme in which the regions corresponding to the proposals detected in the previous pass (most likely having a high boundary transition level) are discarded from the image before the subsequent pass, by computing the discrete Laplacian and then solving the Dirichlet boundary value problem . In this way, the pixels within the proposal bounding boxes of the previous pass are replaced with intensity values derived from the proposal boundaries (see Figure 8 E and I). This iterative process is repeated until there are no more object transitions left to be picked from the candidate scan, as shown in Figure 8 (K). The detailed pseudo-code of the proposed CST framework is presented below:
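The region-removal step between passes can be illustrated with a simple Jacobi relaxation of the Laplace equation inside a detected bounding box. This is an illustrative sketch, not the framework's pseudo-code; a direct Dirichlet solver over the discrete Laplacian would be used in practice:

```python
import numpy as np

def fill_region_dirichlet(img, box, iters=500):
    """Replace the interior of `box` (x0, y0, x1, y1) with the harmonic
    (Laplace) interpolation of the box-boundary intensities, so the
    previously detected object no longer produces transitions."""
    x0, y0, x1, y1 = box
    out = img.astype(float).copy()
    inner = out[y0:y1 + 1, x0:x1 + 1]  # view into `out`
    # Initialize the interior with the mean boundary intensity, then relax.
    boundary_mean = (inner[0].mean() + inner[-1].mean()
                     + inner[:, 0].mean() + inner[:, -1].mean()) / 4
    inner[1:-1, 1:-1] = boundary_mean
    for _ in range(iters):
        # Jacobi update: each interior pixel becomes the mean of its
        # 4 neighbours; boundary rows/columns stay fixed (Dirichlet data).
        inner[1:-1, 1:-1] = 0.25 * (inner[:-2, 1:-1] + inner[2:, 1:-1]
                                    + inner[1:-1, :-2] + inner[1:-1, 2:])
    return out
```

After the fill, the next CST pass sees a smooth region in place of the removed object, letting weaker-transition objects dominate the new tensors.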
IV-B5 Object Recognition
After extracting the object proposals, these are passed to the pre-trained ResNet50 model  for recognition. ResNet50 performs well in mitigating the vanishing gradient problem through its residual blocks. We employ ResNet50 in fine-tuning mode, replacing the final classification layer with a custom layer for the recognition of the proposals within this application. We do not freeze the remaining layers, so that they are also updated during the training phase to recognize the object proposals effectively; however, we use the original weights for initialization to speed up convergence. The training set is composed of object proposals obtained with the CST framework, which generates around 167 proposals on average per scan. This amplification of the number of training samples allows deriving balanced sets of normal and suspicious items, as will be described further in the experimental setup in Section V. In the training set, proposals are labelled as follows: 1) proposals containing a single suspicious item, or a portion of a single item, are labelled with that item's category; 2) proposals containing overlapping suspicious items are labelled with the item category occupying the largest area of the bounding box; 3) proposals that do not contain suspicious items are labelled as normal.
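The three labelling rules above can be sketched as a largest-overlap assignment; the box format and helper function are illustrative:

```python
def label_proposal(prop_box, item_boxes):
    """Assign a training label to a proposal: the category of the
    suspicious item covering the largest area of the proposal's
    bounding box, or 'normal' if no item overlaps it.
    Boxes are (x0, y0, x1, y1); item_boxes maps category -> box."""
    def inter_area(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    best_label, best_area = 'normal', 0
    for label, box in item_boxes.items():
        area = inter_area(prop_box, box)
        if area > best_area:  # rule 2: largest-area item wins
            best_label, best_area = label, area
    return best_label  # rule 3: 'normal' if nothing overlapped
```

A proposal partially covering a single item takes that item's label (rule 1), since any non-zero overlap beats the 'normal' default.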
V Experimental Setup
The proposed framework is evaluated against state-of-the-art methods on different publicly available datasets using a variety of evaluation metrics. In this section, we provide a detailed description of the datasets and the evaluation metrics, as well as the training details of the pre-trained CNN model.
V-A1 GRIMA X-ray Database
The GRIMA X-ray Database (GDXray)  was acquired with an X-ray detector (Canon CXDI-50G) and an X-ray emitter tube (Poksom PXM-20BT). It contains 19,407 X-ray scans arranged into welds, casting, baggage, nature and settings categories. In this paper, we only use the baggage scans to test the proposed framework, as this is the only category relevant to suspicious item detection. The baggage group has 8,150 X-ray scans containing both occluded and non-occluded items, along with marked ground truths for handguns, razor blades and shuriken. For a more in-depth evaluation, we have refined this categorization by splitting the original handgun category into two classes, namely pistol and revolver. We have also identified and annotated three new classes, i.e. the knife class, the mobile phone class and the chip class, which represents all the electronic gadgets, including laptops. We adopted a training set in accordance with the standard defined in , i.e. 400 scans from the B0049, B0050 and B0051 series containing proposals for revolver (handgun), shuriken and razor blades, to which we added 388 more scans for the new categories (chip, pistol, mobile and knife). From the 788 training scans, we obtained a total of 140,264 proposals, which were divided into suspicious and normal categories. The latter is an auxiliary class that includes proposals of miscellaneous, unimportant items such as keys and bag zippers, which are generated by the CST framework. However, we discarded 84,162 normal proposals to keep the numbers of suspicious and unsuspicious items balanced. A detailed summary of the GDXray dataset is given in Table I.
|Total Scans||Dataset Split||Training Proposals||Items|
|8,150||Training: 788#||Total: 140,264||Pistol**|
|||Testing: 7,362||Normal: 28,049||Revolver**|
|||||Average: 178 / scan||Razor Blades|
* These items have been identified locally for a more in-depth validation of the proposed framework. The chip class represents all the electronic gadgets, including laptops (except mobiles).
These are the normal proposals that have been discarded for the balanced training of the CNN model for proposal recognition.
** The original handgun category is further broken down into pistol and revolver because both items are found in abundance within this dataset.
# 400 scans from the B0049, B0050 and B0051 series are used for extracting revolvers, shuriken and razor blades as per the criteria defined in ; using the same training/testing split percentage, we used 388 more scans to train the model for the extraction of the other items.
V-A2 Security Inspection X-ray Dataset
Security Inspection X-ray (SIXray)  is one of the largest datasets for the detection of heavily occluded and cluttered suspicious items. It contains 1,059,231 color X-ray scans, of which 8,929 contain suspicious items classified into six groups, i.e. gun, knife, wrench, pliers, scissors and hammer. All the images are stored in JPEG format, and a detailed description of the dataset is presented in Table II. To validate the performance of the proposed framework against the class imbalance problem, the same subsets have been utilized as described in , in which the ratio of suspicious items to normal objects matches real-world scenarios. Also, the 4-to-1 training/testing ratio has been maintained in accordance with . For the SIXray dataset, we also added a separate normal class to filter the proposals of miscellaneous, unimportant items generated by the CST framework. Moreover, it should be noted from Tables I and II that, for each dataset, we trained the classification model on a balanced set of normal and suspicious item proposals, where the excess normal item proposals were discarded to avoid classifier bias.
|Subsets*||Scans||Dataset Split||Training Proposals|
|SIXray10||98,219||Training: 78,575||Total: 12,179,125|
|||||Testing: 19,644||Normal: 2,435,819|
|SIXray100||901,829||Training: 721,463||Total: 111,826,765|
|||||Testing: 180,366||Normal: 22,365,348|
|SIXray1k||1,051,302||Training: 841,042||Total: 130,361,510|
|||||Testing: 210,260||Normal: 26,072,296|
|||||||Average: 155 / scan|
* The suspicious items within the SIXray dataset are: gun, knife, wrench, pliers, scissors and hammer.
These are the normal proposals that have been discarded for the balanced training of the CNN model for proposal recognition.
The total number of scans within the SIXray dataset is 1,059,231, of which 8,929 scans are positive (containing one or more suspicious items) and the remaining 1,050,302 scans are negative.
V-B Training Details
The classification of the baggage items is performed through pre-trained ResNet50 model after fine-tuning it on the object proposals extracted from the scans of GDXray and SIXray datasets.
The training process was conducted for 13 epochs with a mini-batch size of 64 using MATLAB R2019a with the Deep Learning Toolbox, on a machine with an Intel Core processor, 16 GB RAM and an NVIDIA RTX 2080 GPU with cuDNN 7.5. Optimization during the training phase was performed through ADAM with a base learning rate of 0.001, and 20% of the training data was used for validation, performed after every 100 iterations. Furthermore, the categorical cross-entropy loss function is employed during training, computed through Eq. (5): L = − Σ_i Σ_c y_{i,c} log(p_{i,c}), where the outer sum runs over the total number of samples, the inner sum over the total number of classes, y_{i,c} is a binary indicator stating whether sample i belongs to class c, and p_{i,c} is the predicted probability of the i-th sample for class c. Figure 9 shows the training performance of the proposed framework.
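A minimal sketch of the categorical cross-entropy in Eq. (5); averaging the loss over the samples is an assumption here (Eq. (5) may sum instead):

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Mean categorical cross-entropy over S samples and C classes:
    L = -(1/S) * sum_i sum_c y[i][c] * log(p[i][c]).
    y_true holds one-hot rows; y_pred holds predicted probabilities."""
    s = len(y_true)
    total = 0.0
    for yi, pi in zip(y_true, y_pred):
        # Only the true class contributes, since y is a binary indicator.
        total -= sum(y * math.log(p) for y, p in zip(yi, pi) if y)
    return total / s
```

For a single sample predicted at probability 0.5 for its true class, the loss is log 2 ≈ 0.693, shrinking towards 0 as the predicted probability approaches 1.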
V-C Evaluation Criteria
The performance is evaluated based on the following metrics:
V-C1 Intersection over Union
Intersection over union (IoU) describes the overlap between the extracted object bounding box and the corresponding ground truth. It is also known as the Jaccard similarity index, and it is computed through Eq. (6):
IoU = area(B ∩ G) / area(B ∪ G), where B is the extracted bounding box for the baggage item, G is the ground truth and area(·) computes the area of the passed region. Although IoU measures the ability of the proposed CST framework to extract the suspicious items, it does not measure the detection capabilities of the proposed framework.
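For axis-aligned boxes, the IoU computation can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over union (Jaccard index) of two axis-aligned
    boxes given as (x0, y0, x1, y1)."""
    # Width/height of the overlap region (0 if the boxes are disjoint).
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Identical boxes give an IoU of 1, and disjoint boxes give 0.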
V-C2 Accuracy
Accuracy describes the ability of the framework to correctly identify both object and no-object regions. It is computed as (TP + TN) / (TP + TN + FP + FN), where TP denotes the true positive samples, TN the true negative samples, FP the false positive samples and FN the false negative samples.
V-C3 Recall
Recall, or sensitivity, is the true positive rate (TPR = TP / (TP + FN)), which indicates the completeness of the proposed framework in correctly classifying the object regions.
V-C4 Precision
Precision (TP / (TP + FP)) describes the purity of the proposed framework in correctly identifying the object regions against the ground truth.
V-C5 Average Precision
Average precision (AP) represents the area under the precision-recall (PR) curve. It is a measure that indicates the ability of the proposed framework to correctly identify positive samples (object proposals in our case) of each class or group. The AP for each class is computed by sorting the images based on their confidence scores and marking the bounding box predictions as positive or negative: a prediction is marked positive if its overlap with the ground truth exceeds a predefined threshold, and negative otherwise. Afterwards, the recall and precision, computed using Eq. (8) and (9), are used to generate the AP as follows:
where the estimated (interpolated) precision is accumulated over the changes (differences) in consecutive recall values, with the interpolation interval starting at 0 and ending at 1, approximating the area under the actual precision curve. After computing the AP for each class, the mAP score is computed using Eq. (12) as the average of the per-class AP values, where the mean is taken over the number of classes in each dataset.
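The AP computation described above can be sketched as follows; this uses the standard every-point accumulation over recall, whereas the interpolation variant of Eq. (10)-(11) may differ in detail:

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve for one class.
    `scores` are detection confidences; `labels` are 1 for a positive
    (correct) prediction and 0 for a negative one."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    n_pos = sum(labels)
    ap, prev_recall = 0.0, 0.0
    for i in order:  # sweep detections from highest to lowest confidence
        if labels[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        # Accumulate precision over the change in recall (rectangle rule).
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

The mAP is then simply the mean of the per-class AP values, as in Eq. (12).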
V-C6 F1 Score
The F1 score measures the capacity of correctly classifying samples in a highly imbalanced scenario. It is computed as the harmonic mean of the precision and recall scores.
V-C7 Receiver Operating Characteristic (ROC) Curve
ROC curves indicate the degree to which the proposed framework confuses different classes. They plot the true positive rate (TPR) against the false positive rate (FPR = 1 − specificity), where specificity is the true negative rate. After computing the ROC curves for each class within the GDXray and SIXray datasets, the AUC is computed by numerically integrating the ROC curve.
V-C8 Qualitative Evaluations
Apart from the quantitative evaluations, the proposed framework is thoroughly validated through qualitative evaluations as will be shown in the results section.
VI Results
In this section, we report the results obtained through a comprehensive series of experiments conducted on the GDXray and SIXray datasets. In Section VI-A, we report the ablation analysis conducted to study the effects of the number of orientations, the number of coherent tensors, and the choice of classification network on the overall system performance. In Section VI-B, we present the performance of the proposed system evaluated with the metrics explained in Section V-C. Then, in Section VI-C, we report a comparative study with state-of-the-art methods. Finally, in Section VI-D, we discuss the limitations of the proposed framework.
VI-A Ablation Study
The number of orientations N and the number of selected coherent tensors M are the main hyper-parameters of the CST framework. N determines the capacity of the CST to accommodate objects oriented in different directions, while M controls how many of the most relevant tensors are retained.
In the first experiment, we varied the number of orientations from 2 to 4 and computed the mAP for a number of selected tensors varying from 1 to the total number of unique tensors. In the results, reported in Table III, we notice an overall enhancement of the performance as M increases. We also notice that the performance reaches a peak of 0.934 and 0.959 for GDXray and SIXray, respectively, and then starts to decrease. This decay can be explained by the fact that including more tensors adds spikes and noisy transitions, leading to the generation of noisy edges and negative miscellaneous proposals (see some instances in Figure 10).
To gain more insight into the impact of the number of orientations on performance and efficiency, in the second experiment we computed the mAP and the average processing time per image for different numbers of orientations, ranging from 2 to 8, while keeping the number of selected tensors M fixed to 2. In the results, reported in Table IV, we can notice that the detection performance slightly increases with the number of orientations in both datasets. However, this small enhancement comes at the detriment of the computational time, as reflected in the observed exponential increase in both datasets: the computation time grows by factors of two and four for the GDXray and SIXray datasets, respectively, as the number of orientations increases. These observations indicate that the volume of generated proposals becomes excessively redundant once the number of orientations goes beyond a certain threshold. Considering the figures in Table III and Table IV, we choose the values of N and M that present the best trade-off between detection performance and computational time.
To analyze the performance of the CST framework across different pre-trained models, we conducted another series of experiments in which we computed the mAP for several standard classification networks. The results, depicted in Table V, show that the best detection performance is achieved by DenseNet201, with ResNet101 second and ResNet50 third. However, DenseNet201 outperforms ResNet50 by only 0.722% on the GDXray dataset and 0.08% on the SIXray dataset. Also, the ResNet models have better trade-offs between memory consumption and time performance than DenseNets , for which reason we prefer the ResNets (particularly ResNet50, owing to its lower memory consumption) in the proposed study.
|VGG16 + CST||0.8936||0.9126|
|ResNet50 + CST||0.9343||0.9595|
|ResNet101 + CST||0.9401||0.9564|
|GoogleNet + CST||0.9296||0.9419|
|DenseNet201 + CST||0.9411||0.9603|
|Mean ± STD||0.9644 ± 0.0165||0.9689 ± 0.0249|
* The score of the 'Gun' category in GDXray is the average of the pistol and revolver scores.
VI-B System Validation
In Figure 11, we report some qualitative results for the two datasets, showcasing how our framework can effectively extract and localize suspicious items from the grayscale and color X-ray scans w.r.t their ground truths.
In Figure 12, we report further qualitative results for items exhibiting high occlusion and clutter. Through these examples, we can appreciate the capacity of the framework to accurately extract and recognize items in such challenging conditions. For example, in (A, D) and (B, E), we can observe that the chip and revolver have been detected despite being severely occluded by the bundle of wires. Similar examples can also be noticed for the occluded pistol, mobile, shuriken and razor blades in (C, F), (G, J), (H, K), (I, L), (M, P), (N, Q) and (O, R). In Q and R, in particular, we can observe how effectively the proposed framework has identified the razor blade despite its intensity being close to the background.
For the SIXray dataset, we can also observe how our framework effectively recognizes partially and heavily occluded suspicious items (see the pairs (S, V), (T, W), (U, X), (Y, AB), (Z, AC), and (AA, AD)). Notice, in particular, the detected overlapped knife in (S, V) and (T, W) and the overlapped wrench and knife in (Y, AB) and (Z, AC). These examples also reflect the effectiveness of the coherent tensor framework in discriminating overlapped items even when there are many background edges and sharp transitions within the scan, as can be seen in (S, V), (T, W), (U, X), (Z, AC), and (AA, AD). The ability of the proposed framework to discriminate between a revolver and a pistol can be noticed in (B, E) for the revolver and in (C, F) and (I, L) for the pistol. Note that all the existing state-of-the-art solutions have considered these two items as part of a single handgun category in their methods.
Figure 13 reports the validation of our system through ROC curves. Figure 13 (left) depicts the recognition rate for the classification of normal versus suspicious items, for the GDXray and SIXray datasets, where we achieved scores of 0.9863 and 0.9901, respectively. Figure 13 (middle) and (right) show the ROC curves for item-wise recognition on GDXray and SIXray, respectively. Here, the true positives and true negatives represent the correctly identified pixels of the items and of the background, respectively. We note that the reported scores are computed based on both the correct extraction and the correct recognition of the suspicious items: if an item has been correctly extracted by the CST framework but misclassified by ResNet50, we counted it as a false negative in the scoring. In Figure 13 (middle), we notice that the razor blades score relatively lower than the other items (this is also reflected in the related score of 0.9582 in Table VIII). This is explained by the fact that the intensity difference between the razor blades and the background is minimal within the scans of the GDXray dataset, causing this item to be missed by the CST in some instances.
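The pixel-level ROC computation described above can be sketched, library-free, as a threshold sweep over per-pixel scores. The scores and labels below are a toy example, not the paper's data.

```python
def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping a threshold over the scores.
    Label 1 marks a suspicious-item pixel, 0 marks background."""
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy example: four pixel scores with ground-truth labels.
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

For a perfectly separable toy case like this one the curve passes through (0, 1) and the area is 1.0; real pixel scores trace the intermediate curves shown in Figure 13.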
Figure 14 shows the curves computed on the GDXray and SIXray datasets, respectively; they further confirm the robustness of our framework.
In Table VII, we report the performance for each item in the two datasets. On average, our system achieves a score of 0.9343 and 0.9595 on GDXray and SIXray datasets, respectively. Note that the score of 0.9101 for the handgun category in the GDXray dataset is the average of pistol and revolver scores.
Table VIII depicts the scores for the different items, with mean scores of 0.9878 and 0.9950 obtained for the GDXray and the SIXray dataset, respectively. As in Table VII, the score of the Gun class is the average of the pistol and the revolver scores.
| Mean ± STD | 0.9343 ± 0.0442 | 0.9595 ± 0.0362 |

* the score of the 'Gun' category in GDXray is an average of pistols and revolvers.
| Mean ± STD | 0.9878 ± 0.0120 | 0.9950 ± 0.0031 |

* the score of the 'Gun' category in GDXray is an average of pistols and revolvers.
VI-C Comparative Study
For the GDXray dataset, we compared our framework with the methods , ,  and , as shown in Table IX. The comparison is nevertheless indirect, as the experimental protocol differs in each study. We (as well as the authors in ) followed the standards laid out in  by considering 400 images for training (100 for razor blades, 100 for shuriken and 200 for handguns). However,  used 600 images for testing purposes (200 for each item) and considered only 3 items, whereas we considered 7 items and used 7,362 scans for testing. The authors in  considered a total of 3,669 selected images in their work, comprising 1,329 razor blades, 822 guns, 540 knives, and 978 shuriken. To train the Faster R-CNN, YOLOv2 and Tiny YOLO models, they picked 1,223 images from the dataset and augmented them to generate 2,446 more images. The work reported in  involved only 18 images, while  reports a non-ML study in which the authors conducted 130 experiments to detect razor blades within the X-ray scans.
Contrary to the aforementioned methods, we assessed our framework against all the performance criteria described in Section V (C). It should also be noted that the proposed framework has been evaluated under the most restrictive conditions compared to its competitors: true positive samples (of the extracted items) were counted towards the scoring only when they were also correctly classified by the ResNet50 model. So, if an item was correctly extracted by the CST framework but not correctly recognized by the ResNet50 model, we counted it as a misclassification in the evaluation. Despite these strict requirements, we achieved a 2.37% improvement in precision, as evidenced in Table IX.
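The strict counting rule described above can be made concrete with a small helper. Here `extracted` means the CST proposal stage produced a proposal covering the item, and the labels come from the classifier stage; the function is an illustration of the rule, not the paper's evaluation code.

```python
def count_outcome(extracted, predicted_label, true_label):
    """Strict item-level scoring: a true positive requires BOTH correct
    extraction by the proposal stage AND correct classification.
    An extracted-but-misclassified item counts as a false negative,
    exactly like an item the proposal stage missed entirely."""
    if extracted and predicted_label == true_label:
        return "TP"
    return "FN"

outcomes = [
    count_outcome(True,  "pistol", "pistol"),  # extracted and correct
    count_outcome(True,  "knife",  "pistol"),  # extracted but misclassified
    count_outcome(False, None,     "pistol"),  # missed by the proposal stage
]
```

This is what makes the evaluation "most restrictive": a competitor that scores extraction alone would count the second case as a hit, while this rule does not.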
For the SIXray dataset, we compared our system with the methods proposed in  and  (the only two frameworks which have been applied to the SIXray dataset to date). Results are reported in Table X. The SIXray dataset is divided into three subsets, named SIXray10, SIXray100 and SIXray1000, to address the problem of class imbalance. SIXray10 contains all 8,929 positive scans (having suspicious items) and 10 times as many negative scans (which do not contain any suspicious item). Similarly, SIXray100 has all the positive scans and 100 times as many negative scans. SIXray1000 contains only 1,000 positive scans and all the negative scans (1,050,302 in total), making it the most challenging subset with respect to class imbalance. Note that the works  and  employed different pre-trained models, which we also report in Table X for completeness. Moreover, for a direct and fair comparison with  and , we trained the proposed framework on each subset of the SIXray dataset individually and evaluated it using the same metrics as described in . Furthermore, we excluded the hammer class in these experiments as it was not considered in  and . The scores depicted in Table X evidence the superiority of our framework over its competitors in terms of object classification and localization.

In the last experiment, we compared the computational performance of our system with standard one-stage detectors (such as YOLOv3 , YOLOv2  and RetinaNet ) and two-stage detectors (such as Faster R-CNN ), as they have been widely used for suspicious-item detection in the literature. The results, depicted in Table XI, show that our system achieves the best average time performance in both training and testing, outperforming the existing object detectors. Note also that although YOLOv2  offers significantly better computational performance than two-stage architectures, it has limited capacity for detecting smaller objects like diapers, sunglasses, and rubber erasers .
YOLOv3 , though, seems to improve in this regard.
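The subset construction described above can be paraphrased as a small helper; this is a sketch of the text's description, not the official SIXray tooling.

```python
def sixray_subset(name, total_pos=8929, total_neg=1050302):
    """Return (positives, negatives) for each SIXray subset as described
    in the text: SIXray10 and SIXray100 keep all positive scans with 10x
    and 100x as many negatives; SIXray1000 keeps 1,000 positives and ALL
    negative scans."""
    if name == "SIXray10":
        return total_pos, 10 * total_pos
    if name == "SIXray100":
        return total_pos, 100 * total_pos
    if name == "SIXray1000":
        return 1000, total_neg
    raise ValueError(f"unknown subset: {name}")

pos, neg = sixray_subset("SIXray100")
```

The imbalance ratio grows from roughly 10:1 to over 1000:1 across the three subsets, which is why SIXray1000 is the hardest setting for an unbiased classifier.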
| Metric | Proposed | Faster R-CNN  | YOLOv2  | Tiny YOLO  | AISM1* | AISM2* | SURF  | SIFT  | ISM  |
Proposed: Classes: 7, Split: 5% for training and 95% for testing, Training Images: 400 (and 388 more for extra items), Testing Images: 7,362.
: Classes: 3, Split: 40% for training, 60% for testing, Training Images: 400, Testing Images: 600 (200 for each category).
: Classes: 4, Split: 80% for training and 20% for validation, Total Images: 3,669.
: Classes: 1, (non-ML approach).
: Classes: 3, Total Images: 18.
* the ratings of AISM1 and AISM2 are obtained from the ROC curve of AISM for different values.
| Criteria | Subset | ResNet50 + CST | ResNet50  | ResNet50 + CHR  | DenseNet  | DenseNet + CHR  | Inceptionv3  | Inceptionv3 + CHR  |  |
| Mean Average Precision | SIXray10 | 0.9634 | 0.7685 | 0.7794 | 0.7736 | 0.7956 | 0.7956 | 0.7949 | 0.86 |
VI-D Failure Cases
We observed two scenarios in which our framework is somewhat limited in extracting the suspicious items correctly. The first is when the CST framework cannot highlight the transitional differences in extremely occluded objects. One such case is shown in Figure 15 (A, E), where our framework could not identify the pistol due to the complex grayscale nature of the scan; identifying a pistol in such a challenging scan is, however, extremely difficult even for a human expert. Similarly, a gun was not detected in (B, F) because of the extreme clutter in that scan, and in (C, G), two instances of a knife could not be detected. While these limitations were rarely observed in the experiments, they can be addressed by considering more orientations to further reveal the transitional patterns; doing so, however, would increase the overall computational time of the proposed framework.
The second scenario concerns cases where a correct proposal is generated but its content is misclassified. One such instance is depicted in Figure 15 (H) (last column), in which the pistol has been misclassified as a knife. However, we noticed that such instances are detected with a relatively low confidence score (e.g., 0.537 for that knife item). These cases can therefore be handled in a second screening stage based on their confidence score. Note that detecting a suspicious item, even if misclassified, is the safer approach in this context.
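The confidence-based second screening stage suggested above amounts to routing low-confidence detections for a recheck. A minimal sketch, assuming a hypothetical threshold of 0.6 (the paper does not specify one):

```python
def route_for_second_screening(detections, threshold=0.6):
    """Split detections into confident ones and low-confidence ones that
    should be forwarded to a second screening stage (e.g. a human
    operator). The 0.6 threshold is an assumed example value."""
    confident = [d for d in detections if d["score"] >= threshold]
    recheck = [d for d in detections if d["score"] < threshold]
    return confident, recheck

detections = [
    {"label": "knife",  "score": 0.537},  # e.g. the misclassified pistol
    {"label": "pistol", "score": 0.94},
]
confident, recheck = route_for_second_screening(detections)
```

Because a misclassified-but-detected item still reaches a screener, this routing preserves the safety property noted in the text.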
In some instances, we also observed that our framework does not generate tight bounding boxes for the recognized items, such as the chip in Figure 12 (Y, AB) and (Z, AC), and the razor blade in (N, Q). This limitation emanates from the contour-based proposal identification in the CST framework, in which the bounding boxes are not necessarily tight around the object. While bounding box regression could address this limitation, it would incur a significant additional computational burden in return for a marginal impact on the accuracy.
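A lightweight alternative to bounding box regression, sketched below under the assumption that a binary mask of the proposal's contour region is available, is to simply shrink the box to the extremes of the foreground pixels. This is an illustration, not the paper's method.

```python
def tight_bbox(mask):
    """Shrink a bounding box to the extremes of the foreground pixels in
    a binary mask (a list of rows of 0/1 values).
    Returns (x_min, y_min, x_max, y_max), or None for an empty mask."""
    coords = [(x, y) for y, row in enumerate(mask)
                     for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)

# A 4x4 toy mask whose foreground occupies a small L-shaped region.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = tight_bbox(mask)
```

This only helps when the proposal mask itself is accurate; when the contour leaks into the background, the box stays loose, which is consistent with the limitation described above.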
This paper presents a deep learning system for the identification of heavily cluttered and occluded suspicious items in X-ray images. In this system, we proposed an original contour-based proposal generation approach using a Cascaded Structure Tensor (CST) framework. The proposed framework is highly sensitive in picking up merged, cluttered and overlapping items at different intensity levels through an original iterative proposal generation scheme. The system recognizes the proposals using a single feed-forward CNN architecture and does not require any exhaustive searches or regression networks, which makes it more time-efficient than popular CNN object detectors. The proposed system is rigorously tested on different publicly available datasets and thoroughly compared with existing state-of-the-art solutions using different metrics. We achieve mean scores of 0.9644 and 0.9689, scores of 0.9878 and 0.9950, and scores of 0.9343 and 0.9595 on the GDXray and SIXray datasets, respectively. The proposed framework outperforms the state of the art in terms of both quantitative and computational performance. The proposed system can be extended to normal photographs and to popular large-scale publicly available datasets for automatic object detection; this will be the object of our future research.
This work is supported with a research fund from Khalifa University: Ref: CIRA-2019-047.
-  “Cargo Screening: technological options,” Aviation Security International, Retrieved: December 4th, 2019.
-  N. R. Council, “Airline Passenger Security Screening: New Technologies and Implementation Issues,” The National Academies Press, 1996.
-  C. Miao, L. Xie, F. Wan, C. Su, H. Liu, J. Jiao, and Q. Ye, “SIXray: A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images,” IEEE Conference on Computer Vision and Pattern Recognition, 2019.
-  D. Mery, V. Riffo, U. Zscherpel, G. Mondragon, I. Lillo, I. Zuccar, H. Lobel, and M. Carrasco, “GDXray: The Database of X-ray Images for Nondestructive Testing,” Journal of Nondestructive Evaluation, November 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” European Conference on Computer Vision, 2014.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal Loss for Dense Object Detection,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  M. Bastan, W. Byeon, and T. Breuel, “Object Recognition in Multi-View Dual Energy X-ray Images,” British Machine Vision Conference, 2013.
-  S. Akçay and T. Breckon, “Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging,” preprint arXiv:2001.01293, 2020.
-  M. Bastan, “Multi-view object detection in dual-energy X-ray images,” Machine Vision and Applications, November 2015.
-  M. Bastan, M. R. Yousefi, and T. M. Breuel, “Visual Words on Baggage X-ray Images,” 14th International Conference on Computer Analysis of Images and Patterns, August 2011.
-  N. Jaccard, T. W. Rogers, and L. D. Griffin, “Automated detection of cars in transmission X-ray images of freight containers,” 11th IEEE International Conference on Advanced Video and Signal Based Surveillance, August 26th-29th, 2014.
-  D. Turcsany, A. Mouton, and T. P. Breckon, “Improving feature-based object recognition for X-ray baggage security screening using primed visual words,” IEEE International Conference on Industrial Technology, February 25th-28th, 2013.
-  V. Riffo and D. Mery, “Automated Detection of Threat Objects Using Adapted Implicit Shape Model,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, June 2015.
-  S. Akçay, A. A. Abarghouei, and T. P. Breckon, “GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training,” Asian Conference on Computer Vision, 2018.
-  S. Akçay, A. A. Abarghouei, and T. P. Breckon, “Skip-GANomaly: Skip Connected and Adversarially Trained Encoder-Decoder Anomaly Detection,” International Joint Conference on Neural Networks, July 14th-19th, 2019.
-  S. Akçay, M. E. Kundegorski, M. Devereux, and T. P. Breckon, “Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery,” International Conference on Image Processing, September 2016.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  S. Akçay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon, “Using Deep Convolutional Neural Network Architectures for Object Classification and Detection within X-ray Baggage Security Imagery,” IEEE Transactions on Information Forensics and Security, March 2018.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems, 2012.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” Neural Information Processing Systems, 2015.
-  J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object Detection via Region-based Fully Convolutional Networks,” arXiv preprint arXiv:1605.06409, 2016.
-  J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Dhiraj and D. K. Jain, “An evaluation of deep learning based object detection strategies for threat object detection in baggage security imagery,” Pattern Recognition Letters, January 2019.
-  Y. F. A. Gaus, N. Bhowmik, S. Akçay, P. M. G. Garcia, J. W. Barker, and T. P. Breckon, “Evaluation of a Dual Convolutional Neural Network Architecture for Object-wise Anomaly Detection in Cluttered X-ray Security Imagery,” The International Joint Conference on Neural Networks, July 14th-19th, 2019.
-  K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” IEEE International Conference on Computer Vision (ICCV), 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Y. F. A. Gaus, N. Bhowmik, S. Akçay, and T. P. Breckon, “Evaluating the Transferability and Adversarial Discrimination of Convolutional Neural Networks for Threat Object Detection and Classification within X-ray Security Imagery,” arXiv preprint arXiv:1911.08966v1, November 20th, 2019.
-  T. Morris, T. Chien, and E. Goodman, “Convolutional Neural Networks for Automatic Threat Detection in Security X-ray Images,” IEEE International Conference on Machine Learning and Applications, December 2018.
-  K. Zuiderveld, “Contrast Limited Adaptive Histogram Equalization,” Graphics Gems IV, San Diego: Academic Press Professional, pp. 474–485, 1994.
-  J. Bigun and G. Granlund, “Optimal Orientation Detection of Linear Symmetry,” First International Conference on Computer Vision (ICCV), 1987.
-  J. Bigun, G. Granlund, and J. Wiklund, “Multidimensional Orientation Estimation with Applications to Texture Analysis and Optical Flow,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991.
-  P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1990.
-  H. Samet and M. Tamminen, “Efficient component labeling of images of arbitrary dimension represented by linear bintrees,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988.
-  M. Kozdron, “The Discrete Dirichlet Problem,” April 2000.
-  S. Khan, H. Rahmani, S. A. A. Shah, M. Bennamoun, G. Medioni, and S. Dickinson, “A Guide to Convolutional Neural Networks for Computer Vision,” Morgan & Claypool, 2018.
-  D. P. Kingma and J. Ba, “ADAM: A Method for Stochastic Optimization,” International Conference for Learning Representations, 2015.
-  G. Pleiss, D. Chen, G. Huang, T. Li, L. van der Maaten, and K. Q. Weinberger, “Memory-Efficient Implementation of DenseNets,” arXiv preprint arXiv:1707.06990, 2017.
-  V. Riffo and D. Mery, “Active X-ray Testing of Complex Objects,” 15th World Conference on NonDestructive Testing, 38(5), 335–343, 2005.
-  D. Mery, “Automated Detection in Complex Objects using a Tracking Algorithm in Multiple X-ray Views,” IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2011.
-  J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv, 2018.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” IEEE Conference on Computer Vision and Pattern Recognition, 2016.