Towards Automatic Threat Detection: A Survey of Advances of Deep Learning within X-ray Security Imaging

01/05/2020 · by Samet Akcay, et al.

X-ray security screening is widely used to maintain aviation/transport security, and its significance poses a particular interest in automated screening systems. This paper aims to review computerised X-ray security imaging algorithms by taxonomising the field into conventional machine learning and contemporary deep learning applications. The first part briefly discusses the classical machine learning approaches utilised within X-ray security imaging, while the latter part thoroughly investigates the use of modern deep learning algorithms. The proposed taxonomy sub-categorises the use of deep learning approaches into supervised, semi-supervised and unsupervised learning, with a particular focus on object classification, detection, segmentation and anomaly detection tasks. The paper further explores well-established X-ray datasets and provides a performance benchmark. Based on the current and future trends in deep learning, the paper finally presents a discussion and future directions for X-ray security imagery.







1 Introduction

X-ray security screening is one of the most widely used security measures for maintaining airport and transport security, whereby manual screening by human operators plays a vital role. Although experience and knowledge are the key factors for confident detection, external variables such as emotional exhaustion and job satisfaction adversely impact manual screening performance [Michel2007, Halbherr2013, Swann2014, Baeriswyl2016, Chavaillaz2019].

The cluttered nature of X-ray bags is another issue that negatively affects the decision time and detection performance of human operators [Schwaninger2008, Bolfing2008, Wales2009]. For instance, the threat detection performance of human screeners reduces significantly when laptops are left inside bags: the compact structure of a laptop conceals potential threats, limiting the detection capability of the screeners [Mendes2012, Mendes2013]. These issues necessitate the use of automated object detection algorithms within X-ray security imaging, which would maintain the alertness and improve the detection and response time of human operators, yielding higher operator trust [Chavaillaz2018].

Figure 1: Statistics for the recent papers published in X-ray security imaging. Conventional Machine Learning (CML) approaches were dominant in the field before 2016, while deep learning approaches have recently become the standard approach.

Despite the surge of interest in X-ray screening [Murray1995, Zentai2008, Wells2012, Caygill2012, Singh2003], automated computer-aided screening is understudied, particularly due to the lack of data and the need for advanced learning algorithms. State-of-the-art studies within the literature have focused on image enhancement [Abidi2005, Abidi2006, Lu2006], classification [Caldwell2017, Akcay2018a, Miao2019, Yang2019], detection [Liang2018, Akcay2018a, Steitz2018, Liu2018], segmentation [Singh2004, Heitz2010, Gaus2019b], and unsupervised anomaly detection [Andrews2016, Akcay2018b, Akcay2019, Griffin2019] for automated security screening. Notable surveys within the field [Mouton2015, Rogers2016a] categorize the existing literature into two main categories: (i) image processing [Abidi2005] and (ii) image understanding [Heitz2010, Mery2016, Mery2017c]. Early work within the field focuses more on image processing approaches such as image enhancement [Abidi2005], threat image projection (TIP) [Mitckes2003], and material discrimination and segmentation [Lu2006]. Recent work, on the other hand, has a particular interest in image understanding, focusing more on automated threat detection and automated content verification via machine/deep learning algorithms [Miao2019, Steitz2018, Gaus2019b, Akcay2019, Griffin2019].

In a traditional setting, a machine learning pipeline contains pre-processing, enhancement, segmentation, feature extraction, and classification stages [Mouton2015, Rogers2016a]. The pre-processing and enhancement stages reduce the noise in the input data and improve its overall quality. The segmentation step crops the regions of interest from the full cluttered image. The feature extraction stage computes hand-crafted features of the object, such as edges and shapes. The final stage classifies the objects based on the features derived from the preceding step.
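The staged pipeline above can be sketched end-to-end. The following is a minimal, illustrative Python example using NumPy and scikit-learn on synthetic "scans"; the mean-filter denoising, thresholded bounding-box segmentation, and four hand-crafted features are simplified stand-ins for the enhancement, RoI extraction, and descriptor stages discussed in the surveyed work, not any specific published system.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def preprocess(img):
    # Denoise with a 3x3 mean filter (stand-in for real enhancement).
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return np.mean([pad[i:i+h, j:j+w] for i in range(3) for j in range(3)], axis=0)

def segment(img):
    # Crop the bounding box of dark (high-attenuation) pixels as the RoI.
    ys, xs = np.where(img < img.mean())
    return img[ys.min():ys.max()+1, xs.min():xs.max()+1]

def extract_features(roi):
    # Hand-crafted features: intensity statistics, edge density, aspect ratio.
    gy, gx = np.gradient(roi)
    edge_density = float(np.mean(np.hypot(gx, gy) > 0.05))
    h, w = roi.shape
    return np.array([roi.mean(), roi.std(), edge_density, h / w])

def make_scan(threat):
    # Synthetic "scan": the threat class contains a dense (dark) square blob.
    img = rng.uniform(0.7, 1.0, (32, 32))
    if threat:
        img[8:20, 8:20] = rng.uniform(0.0, 0.2, (12, 12))
    return img

X, y = [], []
for label in (0, 1):
    for _ in range(40):
        X.append(extract_features(segment(preprocess(make_scan(label)))))
        y.append(label)

clf = SVC(kernel="linear").fit(X, y)
```

Each stage is a pure function, so individual steps (e.g. the enhancement filter or the descriptor set) can be swapped out independently, which mirrors how the surveyed works vary one stage at a time.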

The main drawback of these machine learning approaches is their dependency on hand-crafted features requiring manual engineering. Deep convolutional neural networks overcome this issue by learning task-specific features, which overall yields a significant improvement. A convolutional neural network contains one or more layers, each of which comprises a set of neuron activations and a non-linear transformation. The earlier layers learn low-level features such as edges and shapes, while deeper layers learn higher-level features that are more specific to the image fed into the network. Despite being initially proposed more than two decades ago, the use of convolutional neural networks within the field of computer vision became prevalent especially after achieving state-of-the-art performance on the ImageNet object classification challenge [Russakovsky2014] by a large margin.

Within X-ray security imaging, on the other hand, the transition from classical machine learning to modern deep learning approaches was not instant. This is due to the data-hungry nature of deep learning approaches, which initially limited their use within a field where the availability of such large datasets is somewhat limited. With the utilisation of the transfer learning paradigm [Yosinski2014] and synthetic data generation [Mitckes2003], the use of deep learning approaches has become the gold standard within the field [Akcay2016, Mery2017c, Jaccard2016b].

Figure 2: A Taxonomy of the X-ray security imaging papers.

This literature survey reviews the published work within various computer vision tasks (Figure 1b) in X-ray security screening, with a particular focus on the deep learning applications. The main contributions of this paper are as follows:

  • taxonomy — an extensive overview of classical machine learning and contemporary deep learning within X-ray security imaging.

  • datasets — an overview of the large datasets used to train deep learning approaches within the field.

  • open problems — discussion of the open problems, current challenges, and future directions based on the current trends within computer vision.

The rest of the paper is organised as follows: Sections 2 and 3 explore conventional image analysis and machine learning algorithms, with a specific focus on image enhancement, threat image projection, image segmentation, object classification, and object detection. Section 4 reviews the applications of deep learning algorithms within X-ray security imaging. Section 5 discusses the open problems and current challenges, and Section 6 concludes the paper.

2 Conventional Image Analysis

A conventional image understanding pipeline consists of the following stages: (i) a pre-processing stage that enhances the quality of the input image, (ii) a segmentation stage that crops the regions of interest (RoI) from the full image, (iii) a feature extraction stage that computes fundamental attributes of the object such as edges, texture and shape, and (iv) a classification stage that predicts the corresponding class label based on the extracted features. This section explores the conventional image analysis techniques that perform image enhancement and threat image projection.

2.1 Image Enhancement

Pre-processing the input data plays a substantial role in yielding higher-quality images that improve readability for both human screeners and computers.

Initial attempts [Abidi2005] fuse low- and high-energy X-ray images and apply background subtraction for noise reduction. To improve low-density images, the Radon transform is used for threshold selection to declutter RoI from complex X-ray scans [Abidi2004]. For adaptive image enhancement, a multi-layer perceptron is used, where the model predicts the best enhancement technique based on the input and enhanced output images.

Following work [Abidi2005, Abidi2005a] explores pseudocolouring of greyscale X-ray images, which improves the detection performance and alertness level of the operators. Threat detection performance is further improved via a new colour coding scheme that calibrates the estimation of the effective atomic number (Zeff) and density information [Chan2010].

2.2 Threat Image Projection (TIP)

The detection performance of human screeners is heavily dependent on experience and knowledge acquired with computer-based training [Michel2007, Schwaninger2007, Cutler2009]. Due to the limited availability of X-ray scans with prohibited items, the training is achieved using images onto which threat items are synthetically projected [Mitckes2003].

More recently, TIP has also been used for synthetic data generation to address the data requirements of machine learning models. By projecting a large number of threat objects onto benign X-ray images, it is possible to gather large datasets with which to train and evaluate machine learning algorithms [Rogers2016, Mery2017b, Bhowmik2019].

A common approach to TIP is to project a binary threat mask onto a benign input X-ray image via multiplication, yielding an output X-ray image containing the threat item. Applying affine transformations improves the robustness of the algorithm [Rogers2016]. A similar approach first employs a logarithmic transformation to separate foreground objects from the background, which are subsequently multiplied with the input [Mery2017b]. Another use of the algorithm is for the task of object detection, where a sparse representation algorithm extracts dictionaries of both foreground (threat) and background (benign) objects and performs classification, evaluated in terms of precision, recall, and accuracy.
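The multiplicative projection step can be sketched in a few lines of NumPy. This assumes normalised transmission images (1.0 = fully transparent, 0.0 = fully opaque), under which superimposing a threat amounts to a pixel-wise product; the image sizes and attenuation values below are illustrative, and the affine-transform augmentation of [Rogers2016] is omitted.

```python
import numpy as np

def project_threat(benign, threat):
    """Threat Image Projection: pixel-wise product of two normalised X-ray
    transmission images, so the threat's attenuation combines with whatever
    it overlaps in the benign bag."""
    assert benign.shape == threat.shape
    return benign * threat

rng = np.random.default_rng(1)
benign = rng.uniform(0.8, 1.0, (64, 64))  # mostly transparent benign bag
threat = np.ones((64, 64))                # transparent everywhere ...
threat[20:40, 20:40] = 0.3                # ... except a dense threat item

tip = project_threat(benign, threat)
```

Because the product only darkens pixels, the projected item plausibly occludes and is occluded by the bag contents, which is what makes TIP imagery useful for screener training and for synthetic dataset generation.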

3 Machine Learning Approaches in X-ray Security Imaging

This section explores the applications of conventional machine learning approaches in X-ray security imaging. The literature is reviewed based on three tasks: (i) classification, (ii) detection, and (iii) segmentation. For an alternative perspective on this material, the reader may refer to the related reviews of Mery [Mery2015Computer] and Rogers et al. [Rogers2016a].

3.1 Object Classification

Prior to the dominance of deep learning within the field, the bag of visual words (BoVW) approach was prevalent. In one of the initial attempts utilizing BoVW, Baştan et al. [Bastan2011] perform classification of X-ray objects on a relatively limited dataset. Scale Invariant Feature Transform (SIFT) [Lowe2004], Speeded Up Robust Features (SURF) [Bay2008] and Binary Robust Independent Elementary Features (BRIEF) descriptors are computed around points detected using standard Difference of Gaussians (DoG), Hessian-Laplace, Harris, Features from Accelerated Segment Test (FAST) and STAR feature detectors. k-means [Hartigan1979] clusters the descriptors into a visual vocabulary, on which an SVM [Hearst1998] is trained. The DoG detector and SIFT descriptor are shown to perform the best among the combinations (mAP: 0.65 on 200 X-ray images).
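The BoVW recipe (local descriptors, a k-means vocabulary, histogram encoding, then an SVM) can be sketched as follows. For self-containedness, raw pixel patches stand in for SIFT/SURF descriptors and the images are synthetic; this illustrates the shape of the pipeline rather than any specific surveyed system.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(2)
K = 8  # visual vocabulary size

def sample_patches(img, n=30, size=6):
    # Raw patch vectors as a stand-in for SIFT/SURF descriptors.
    ps = []
    for _ in range(n):
        yy = rng.integers(0, img.shape[0] - size)
        xx = rng.integers(0, img.shape[1] - size)
        ps.append(img[yy:yy+size, xx:xx+size].ravel())
    return np.array(ps)

def make_img(textured):
    img = rng.uniform(0.8, 1.0, (32, 32))
    if textured:             # stand-in for a threat with strong local structure
        img[::2, :] *= 0.3
    return img

imgs = [make_img(i >= 30) for i in range(60)]
labels = np.array([0] * 30 + [1] * 30)

# Build the visual vocabulary by clustering patch descriptors with k-means.
vocab_patches = np.vstack([sample_patches(im) for im in imgs[:20] + imgs[40:50]])
vocab = KMeans(n_clusters=K, n_init=5, random_state=0).fit(vocab_patches)

def bovw_histogram(img):
    # Encode an image as a normalised histogram of visual-word occurrences.
    words = vocab.predict(sample_patches(img))
    return np.bincount(words, minlength=K) / len(words)

X = np.array([bovw_histogram(im) for im in imgs])
clf = SVC(kernel="linear").fit(X, labels)
```

The same skeleton accommodates the detector/descriptor variations evaluated in the surveyed works: only `sample_patches` changes, while the vocabulary, encoding, and classifier stages stay fixed.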

Inspired by [Bastan2011], Turcsany et al. [Turcsany2013] present a unique BoVW approach for X-ray firearm classification via class-specific feature extraction. The SURF [Bay2008] feature detector and descriptor, used within a BoVW approach trained with an SVM [Hearst1998] classifier, is evaluated in terms of true positive and false positive rates.

A multi-staged approach [Jaccard2014] performs car detection from X-ray images of freight containers. The method first creates car vs non-car image patches from stream-of-commerce X-ray images. The next step extracts features via image intensity and log intensity, together with basic and oriented image features [Griffin2009]. The final stage utilizes a Random Forest (RF) classifier [Breiman2001], with performance reported as detection rate versus false alarm rate. A follow-up work [Rogers2015] detects loads in cargo containers with an RF classifier trained on local image moments and oriented basic image features (oBIF) [Griffin2009], again reported in terms of detection accuracy and false positive rate.

The BoVW approach is further employed in [Mery2016]. A dictionary is formed for each class, consisting of SIFT [Lowe2004] feature descriptors of randomly cropped image patches. Fitting a sparse representation classifier to the feature descriptors of random test patches yields per-class accuracy figures, including under occlusion. In another BoVW approach [Zhang2015, Zhang2015a], an SVM is trained with local latent low-level image features extracted from a dataset with 15 different classes, each of which comprises 100 images, with performance reported via AUC.

Building on the various research outcomes drawn from BoVW, Kundegorski et al. [Kundegorski2016] exhaustively evaluate various feature point descriptors within a BoVW-based image classification task. The combination of a FAST detector and SURF descriptor trained with an SVM classifier [Hearst1998] is the best performing combination for a firearm detection task on a large dataset, yielding a statistical accuracy of 0.94 (true positive: 83%, false positive: 3.3%).

Despite the dominance of BoVW, other computer vision and machine learning techniques have also been studied for the X-ray object classification task. A study [Zheng2013] aims at detecting threat items in vehicles using X-ray cargo imagery. The proposed multi-staged approach (i) initially improves the image quality via normalization, denoising, and enhancement, (ii) subsequently performs multi-view alignment and pseudocolouring, and (iii) finally classifies the threats by correlating the similarities between temporally aligned images. Another study by Zhang et al. [Zhang2014] investigates the use of joint shape and texture features extracted from superpixel regions of the input, training an SVM [Hearst1998] on the extracted feature map and reporting classification accuracy.

Mery et al. [Mery2012a] utilize structure estimation and segmentation together with a general tracking algorithm to detect X-ray objects. Another classification pipeline by Mery et al. [Mery2012a] (i) extracts features with SIFT [Lowe2004], (ii) removes redundancy via RANSAC [fischler1981], (iii) sorts features based on the difference between two consecutive frames, and (iv) uses a Mahalanobis distance classifier to predict class labels, with precision and recall reported on 64 X-ray images.

Similar works [Mery2016a, Riffo2016, SvecP.2016, Mery2017c, Xu2019] exhaustively evaluate various computer vision techniques, with a specific focus on k-NN based sparse representation. A k-means algorithm [Hartigan1979] clusters features that are segmented from the input via an adaptive k-means [Dixit2014] and extracted via SIFT [Lowe2004]. During testing, the score for a patch is calculated based on the closest distance to a neighbour found via a k-NN classifier [Cover1967], achieving accuracy comparable to deep models on GDXray.

3.2 Object Detection

This section reviews the conventional X-ray object detection models presented in the literature. Since detection is a challenging task in which bounding box coordinates and class labels are predicted simultaneously, the conventional object detection literature within the field is relatively limited.

Schmidt-Hackenberg et al. [Schmidt-Hackenberg2012] compare the use of visual cortex inspired features, such as SLF-HMAX and V1-like features, to standard features such as SIFT [Lowe2004]. Compared to SIFT, HMAX features are shown to provide superior feature encoding for a BoVW approach trained with an SVM [Hearst1998].

The evaluation works of [Bastan2013, Bastan2015] exhaustively investigate the use of BoVW for X-ray object detection. Evaluating various feature descriptors within single- and multiple-view imagery, using a branch and bound algorithm with a structural SVM classifier [Hearst1998], shows that (i) the combination of SIFT and SPIN achieves the best detection performance, and (ii) utilizing multiple views further improves detection (measured via mAP).

Multi-view X-ray imaging improves the performance when rotation and superimposition hinder the viewability of objects from a single view [Michel2009]. Despite its computational complexity, multi-view imaging helps both human operators and machines improve detection performance [Bastian2008, Bastan2015].

A general multi-staged approach proposed in the works of [Mery2017, Mery2013b, Mery2013a, Mery2011] (i) initially performs feature extraction via feature descriptors and a k-NN classifier [Cover1967], (ii) matches the key-points of consecutive images from different views, and (iii) analyses the multiple views, where the key-points of two successive images are matched and their 3D points are formed with structure from motion. After being clustered, the 3D points are re-projected back to 2D key-points, which are classified by the k-NN classifier [Cover1967]. The best performing approach reports precision and recall on 120 X-ray images.

Franzel et al. [Franzel2012] propose a sliding window detection approach using a linear SVM classifier [Hearst1998] and histograms of oriented gradients (HOG) [Dalal2005]. As HOG is not fully rotationally invariant, they supplement their approach with detection at varying orientations. A multi-view integration step fuses the multiple viewpoints to find the intersection of the true detections, which achieves superior performance compared to single-view detection (mAP: 64.5).
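The sliding-window scheme can be sketched as below. A heavily simplified single-cell gradient-orientation histogram stands in for HOG (real HOG uses a grid of cells with block normalisation), and a hand-set weight vector stands in for the trained linear SVM, so this is a conceptual toy rather than the method of [Franzel2012].

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_orientation_hist(win, bins=8):
    # One magnitude-weighted orientation histogram over the whole window.
    gy, gx = np.gradient(win.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def sliding_window_detect(img, score_fn, win=16, stride=4, thresh=0.5):
    detections = []
    for y in range(0, img.shape[0] - win + 1, stride):
        for x in range(0, img.shape[1] - win + 1, stride):
            s = score_fn(img[y:y+win, x:x+win])
            if s > thresh:
                detections.append((y, x, s))
    return detections

# Toy scene: a vertically striped "object" on a smooth noisy background.
img = rng.uniform(0.9, 1.0, (48, 48))
img[16:32, 16:32] = np.tile([0.2, 0.2, 1.0, 1.0], 64).reshape(16, 16)

# Hand-set linear scorer favouring horizontal gradients (vertical stripes);
# in practice these weights would come from a trained linear SVM.
w = np.zeros(8)
w[0] = 1.0
score = lambda window: grad_orientation_hist(window) @ w

dets = sliding_window_detect(img, score)
best = max(dets, key=lambda d: d[2])
```

Scoring every window at every stride is what makes sliding-window detection expensive, which is why the orientation-augmented variant of [Franzel2012] multiplies the cost further and why later region-proposal detectors (Section 4.4) replaced this scheme.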

3.3 Object Segmentation

One of the crucial steps for accurate object classification in conventional image understanding is precise object segmentation. The rest of this section explores various segmentation techniques presented in the literature.

Early work in the field [Paranjape1998, Sluser1999] investigates simplistic pixel-based segmentation with a fixed absolute threshold and region grouping. Subsequent work, on the other hand, focuses more on pre-segmentation via nearest neighbours, overlapping background removal and final classification [Singh2004, Ding2006, Lu2006].

Instead of using shape information, some approaches utilise chemical (attenuation) properties [Heitz2010] and high atomic numbers [Kechagias-Stamatis2017]. A Nearest Neighbour Distance Ratio [Mikolajczyk2005] applied to SURF [Bay2008] features, computed on regions disconnected via morphological operations, achieves promising results (RMS error reported on 23 X-ray images).

4 Deep Learning in X-ray Security Imaging

This section reviews the X-ray security applications utilising deep learning algorithms. By initially introducing the well-established datasets in the field, we explore the applications for various computer vision tasks such as object classification, detection, segmentation and unsupervised anomaly detection.

4.1 Datasets

This section explores X-ray security imaging datasets that are widely used in the literature.

4.1.1 Durham Baggage Patch/Full Image Dataset

This dataset comprises X-ray samples with associated false-colour materials mapping from dual-energy scans, covering the following classes: camera, ceramic knife, knife, firearm, firearm parts, laptop and benign images. Several variants of this dataset are constructed for classification (DBP2 and DBP6) [Akcay2016, Kundegorski2016, Akcay2018a] and detection (DBF2 and DBF6) [Akcay2017, Akcay2018a].

4.1.2 GDXray

The Grima X-ray Dataset (GDXray) [Mery2015] comprises X-ray samples from five subsets: castings, welds, baggage, natural images, and settings. The baggage subset, the one mainly used for security applications, comprises images from multiple views. The limitation of this dataset is its non-complex content, which makes it less than ideal for training models intended for real-world deployment.

4.1.3 UCL TIP

This dataset comprises benign images, each of which is 16-bit grayscale, with varying sizes. The training images are patches randomly sub-sampled from the full images, while the test set comprises both benign and threat images. The threat images are synthetically generated via the TIP algorithm proposed in [Rogers2016], where, depending on the application, small metallic threats (SMT) or car images are projected into the benign samples. With several variants, this dataset is used in studies such as [Andrews2016a, Andrews2017, Jaccard2016, Jaccard2016a, Jaccard2016b, Jaccard2017, Rogers2017a].

4.1.4 SIXray

Collected and released by [Miao2019], the SIXray dataset comprises 1,059,231 X-ray images, a subset of which is manually annotated for the following classes: gun, knife, wrench, pliers, scissors, hammer, and background. The dataset consists of objects with a wide variety in scale, viewpoint and, especially, overlap, and is first studied in [Miao2019] for classification and localization problems.

4.1.5 Durham Baggage Anomaly Dataset –DBA

This in-house dataset comprises dual-energy X-ray security image patches extracted via an overlapping sliding window approach. The dataset contains three abnormal sub-classes: knife, gun and gun component. The normal class comprises benign X-ray patches, split into train and test sets. The DBA dataset is used in [Akcay2018b] and [Akcay2019] for unsupervised anomaly detection.

4.1.6 Full firearm vs Operational Benign –FFOB

As presented in [Akcay2018a, Akcay2018b, Akcay2019], this dataset contains samples from the UK government evaluation dataset [CAST2016], comprising both expertly concealed firearm (threat) items and operational benign (non-threat) imagery from commercial X-ray security screening operations (baggage/parcels). Denoted FFOB, this dataset comprises full firearm weapons as the abnormal class and operational benign images as the normal class.

4.1.7 Compass - XP Dataset

This dataset [Caldwell2019] is collected using objects from classes that are a subset of the ImageNet classes. The dataset comprises image pairs, where each pair consists of an X-ray image scanned with a Gilardoni FEP ME 536 and its photographic version taken with a Sony DSC-W800 digital camera. In addition, each X-ray image has low-energy, high-energy, material density, grey-scale (a combination of low and high energy) and pseudo-coloured RGB versions.

Dataset     | Domain  | Task                               | # Samples | Classes                                                                                    | Performance | Reference
DBP2        | Baggage | Classification                     | 19,938    | firearm, background                                                                        | ACC: 0.994  | [Akcay2016, Akcay2018a]
DBP6        | Baggage | Classification                     | 10,137    | firearm, firearm parts, camera, knife, ceramic knife, laptop                               | ACC: 0.937  | [Akcay2016, Akcay2018a]
UCL TIP     | Cargo   | Classification, Anomaly Detection  | 120,000   | small metallic threat (SMT), car                                                           | ACC: 0.970  | [Jaccard2016a, Andrews2016a, Caldwell2017, Rogers2017a, Andrews2017, Jaccard2017]
GDXray      | Baggage | Classification                     | 19,407    | gun, shuriken, razor blade                                                                 | ACC: 0.963  | [Mery2017c, Mery2017a, Xu2018, Sangwan2019]
DBF2        | Baggage | Detection                          | 15,449    | firearm, background                                                                        | mAP: 0.974  | [Akcay2017, Akcay2018a]
DBF6        | Baggage | Detection                          | 15,449    | firearm, firearm parts, camera, knife, ceramic knife, laptop                               | mAP: 0.885  | [Akcay2017, Akcay2018a]
PBOD        | Baggage | Classification                     | 9,520     | explosives                                                                                 | AUC: 0.950  | [Morris2018]
MV-Xray     | Baggage | Detection                          | 16,724    | glass bottle, TIP weapon, real weapon                                                      | mAP: 0.956  | [Steitz2018]
SASC        | Baggage | Detection                          | 3,250     | scissors, aerosols                                                                         | mAP: 0.945  | [Liu2018]
Zhao et al. | Baggage | Classification                     | 1,600     | wrench, pliers, blade, lighter, knife, screwdriver, hammer                                 | ACC: 0.992  | [Zhao2018]
Smiths-Duke | Baggage | Detection                          | 16,312    | gun, pocket knife, mixed sharp                                                             | mAP: 0.938  | [Liang2018]
SIXray      | Baggage | Detection                          | 1,059,231 | gun, knife, wrench, pliers, scissors, hammer, background                                   | mAP: 0.439  | [Miao2019]
UBA         | Baggage | Anomaly Detection                  | 230,275   | gun, gun part, knife                                                                       | AUC: 0.940  | [Akcay2018b, Akcay2019]
FFOB        | Baggage | Anomaly Detection                  | 72,352    | full-weapon, benign                                                                        | ACC: 0.998  | [Akcay2018b, Akcay2019]
Yang et al. | Baggage | Classification                     | 2,000     | wrench, fork, handgun, power bank, lighter, pliers, knife, liquid, umbrella, screwdriver   | ACC: 0.991  | [Yang2019]
Table 1: Datasets used in deep learning applications within X-ray security imaging

4.2 Evaluation Criteria

Before listing the performance results of the reviewed papers, it is important to introduce the various performance metrics used in the field.

Accuracy (ACC)

Accuracy is defined as the number of correctly predicted samples over the total number of predictions: ACC = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.

Mean Average Precision (mAP)

mAP is defined as the mean of the average precision, a metric evaluated by the area under the precision-recall curve, where precision is TP / (TP + FP) and recall is TP / (TP + FN).


Area Under the Curve (AUC)

AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.
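For concreteness, these metrics can be computed on a toy binary threat/benign example with scikit-learn; the ground-truth labels and scores below are illustrative only.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, average_precision_score)

# Toy ground truth and model outputs for a binary threat/benign task.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4])
y_pred = (y_scores >= 0.5).astype(int)       # threshold the scores

acc = accuracy_score(y_true, y_pred)         # (TP+TN)/(TP+TN+FP+FN)
prec = precision_score(y_true, y_pred)       # TP/(TP+FP)
rec = recall_score(y_true, y_pred)           # TP/(TP+FN)
auc = roc_auc_score(y_true, y_scores)        # area under the ROC curve
ap = average_precision_score(y_true, y_scores)  # AP; mAP averages AP over classes
```

Note that accuracy, precision and recall depend on the chosen score threshold, whereas AUC and AP summarise performance across all thresholds, which is why detection papers favour mAP and anomaly detection papers favour AUC.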

Figure 3: An input X-ray image and the outputs depending on the deep learning task: (a) classification via ResNet-50 [He2015], (b) detection with YOLOv3 [Redmon2018] and (c) segmentation via Mask R-CNN [He2017].

4.3 Classification

The study of [Akcay2016] is one of the first works applying CNNs to X-ray security imagery. The authors examine the use of CNNs via transfer learning to evaluate to what extent transfer learning helps classify X-ray objects within a problem domain where the availability of datasets is somewhat limited. Freezing AlexNet weights layer by layer on a two-class (gun vs. no-gun) X-ray classification problem shows that the CNN significantly outperforms the BoVW approach (SIFT+SURF) trained with an SVM or RF, even when all the layers of the network are frozen. Another set of experiments analyses the use of CNNs within a challenging 6-class classification problem, whose results show great promise for the use of CNNs in the field.
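The layer-freezing idea can be illustrated without a deep learning framework: below, a fixed random projection stands in for frozen pre-trained convolutional layers, and only a linear classifier head is trained on a small target-domain set. This is a conceptual sketch of transfer learning by freezing, with synthetic data, not the AlexNet setup used in [Akcay2016].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# "Frozen" feature extractor: a fixed random projection plus ReLU standing in
# for pre-trained layers whose weights are never updated during training.
W_frozen = rng.normal(size=(64, 32))

def extract(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen layer + ReLU

# Small target-domain dataset: limited X-ray data is precisely the motivation
# for transfer learning in the surveyed works.
X = rng.normal(size=(100, 64))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Only the classifier "head" is trained; the feature extractor stays fixed.
head = LogisticRegression(max_iter=1000).fit(extract(X), y)
```

Freezing shrinks the number of trainable parameters dramatically (here, 32 weights plus a bias instead of the full 64x32 projection), which is what makes training feasible on small datasets.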

A similar work [Jaccard2016b] compares the use of deep learning against conventional machine learning to classify non-empty cargo containers containing cars or SMTs. A multi-stage approach first classifies cargo containers as empty vs non-empty. The second stage then classifies cars within the containers labelled non-empty, achieved via oBIF + RF. Using the UCL TIP dataset, the authors evaluate 9- and 19-layer networks [Jaccard2016] similar to [Krizhevsky2012] and [Simonyan2014], and show that even the worst performing CNN outperforms the conventional machine learning approach (oBIF + RF).

A follow-up work [Jaccard2017] further investigates the detection of cars in X-ray cargo images. A sliding window splits UCL TIP images into patches. The authors then explore various features including intensity, oBIF [Griffin2009], Pyramid Histogram of Visual Words (PHOW) [Bosch2007] and CNN features. Training these features with an SVM [Hearst1998], RF [Breiman2001], and soft-max (CNN) shows that an RF classifier trained on VGG [Simonyan2014] features extracted from log-transformed images achieves the highest performance (lowest FPR).

Additional work by Jaccard et al. [Jaccard2016a] evaluates the impact of input types on CNN performance by training VGG [Simonyan2014] variants on single-channel raw images and on dual-channel data containing the raw image and its log-transformed version. The quantitative analysis demonstrates that a VGG-19 model trained from scratch on the dual-channel raw and log-transformed images outperforms the other variants in terms of AUC and FPR.

Rogers et al. [Rogers2017a] explore the use of dual-energy X-ray images for automated threat detection. The authors investigate varying transformations applied to the high-energy and low-energy X-ray images captured by a dual-energy X-ray machine. Using the UCL TIP dataset, 640,000 image patches are generated via a sliding window. Training this dataset on a fixed VGG-19 network [Simonyan2014] with varying input channels, including single-channel, dual-channel and four-channel combinations of the energy bands, shows that the dual- and four-channel inputs always achieve superior detection performance compared to their single-channel variants.
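Assembling the channel variants amounts to stacking the energy bands (and transforms of them) along a trailing channel axis before feeding the network. In the sketch below, the log-transformed bands used for the four-channel case are an assumed example combination, since the exact combinations are elided in the text above; the image sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
high = rng.uniform(0.0, 1.0, (128, 128))  # high-energy transmission image
low = rng.uniform(0.0, 1.0, (128, 128))   # low-energy transmission image

# Channel variants stacked along a trailing channel axis.
single = high[..., None]                                        # 1 channel
dual = np.stack([high, low], axis=-1)                           # 2 channels
four = np.stack([high, low,                                     # 4 channels,
                 np.log1p(high), np.log1p(low)], axis=-1)       # assumed combo
```

Only the first convolutional layer of the network needs to change to accept the extra channels, which keeps the comparison between the variants fair.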

Inspired by the limited availability of X-ray datasets, a three-stage algorithm by Zhao et al. [Zhao2018] first classifies and labels the input X-ray dataset via KNN Matting [Chen2013], which uses the angle information of the foreground objects extracted from the input image. The second stage generates new X-ray objects via an adversarial network similar to [Arjovsky2017]; the additional use of [Isola2016] improves the quality of the generated images. Finally, a small classification network confirms whether the generated image belongs to the correct class. In a follow-up study, Yang et al. [Yang2019] further investigate ways to improve GAN training to produce better X-ray images. Experiments and evaluation based on the Fréchet Inception Distance (FID) score [Heusel2017] show that the proposed GAN approach generates visually superior prohibited items.

Miao et al. [Miao2019] introduce a model (CHR) to classify/localize X-ray images from SIXray. The model copes with the class imbalance and clutter issues by extracting image features from three consecutive layers, where subsequent layers are upsampled and concatenated with the previous layers. A refinement function removes redundant information from the concatenated feature map. The objective of the work is to minimize the weighted sum of the classification losses of the refined mid-level features from the three consecutive layers. Training the model with the proposed loss yields an mAP improvement when used with ResNet-101 on SIXray.

An evaluation work [Morris2018] investigates the use of CNNs for the task of explosive detection. An initial stage processes the input data by fixing the image size, cropping irrelevant background objects and applying data augmentation transformations. Evaluating random initialization vs. pre-training on VGG-19 [Simonyan2014], Xception [Chollet2017], and InceptionV3 [Szegedy2016] networks shows that randomly initialized models achieve superior accuracy for the binary classification task. To study the impact of intensity and Z-effective values on performance, the authors train three VGG-19 networks: on both intensity and Z-effective, on intensity only, and on Z-effective only. Training the model with only Z-effective is shown to yield the highest accuracy. The final set of experiments investigates localization via heatmaps and shows that pre-trained networks achieve superior performance, since randomly initialized networks tend to overfit on small datasets.

Caldwell et al. [Caldwell2017] study the generalization capability of models trained on different datasets. To investigate this problem, the authors first train a network on a cargo dataset and evaluate its performance on a test set that also contains parcel dataset samples. Quantitative analysis reveals that the performance of the CNN model is weak when tested on the combined dataset. A second stage combines the two datasets within the training stage, yielding a considerable improvement in model performance. Based on these experiments, the authors conclude that transferring information between different modalities is challenging, since CNNs cannot sufficiently generalize to an unseen target dataset.

4.4 Detection

Following the success of CNNs for classification, the work of [Akcay2017] trains sliding-window-based CNN, Faster RCNN [Ren2015] and R-FCN [Dai2016b] models on the DBF2/6 datasets for firearm and multi-class detection problems. Experiments demonstrate that Faster RCNN [Ren2015] with VGG16 [Simonyan2014] yields mAP on the 6-class DBF6 dataset, while R-FCN with ResNet-101 achieves the highest performance ( mAP) on the 2-class (gun vs. no-gun) DBF2 dataset.

Similar to [Akcay2017], another evaluation work [Liang2018] explores the performance of F-RCNN, R-FCN [Dai2016b] and SSD [Liu2016] within single/multi-view X-ray imagery. Utilising OR-gate detection, which merges the object detection outputs from the individual views, shows that multi-view outperforms single-view ( vs. when trained with R-FCN and ResNet-101). A two-stage approach by Liu et al. [Jinyi2019] first extracts foreground objects and subsequently utilises F-RCNN to detect objects in subway X-ray images, with an mAP of for 6 object classes.
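OR-gate fusion across views amounts to flagging a class for the whole bag if it is detected in at least one view. A minimal sketch, where the set-based representation of per-view detections and the class names are illustrative assumptions:

```python
def or_gate_merge(per_view_detections):
    """OR-gate fusion: a class is flagged for the bag if it is
    detected in at least one of the views."""
    merged = set()
    for view in per_view_detections:
        merged |= set(view)
    return merged

# two views of the same bag; the firearm is occluded in the second view
view1 = {"firearm", "laptop"}
view2 = {"laptop", "bottle"}
merged = or_gate_merge([view1, view2])
print(sorted(merged))  # ['bottle', 'firearm', 'laptop']
```

In practice the merge also has to associate bounding boxes across views; the class-level OR shown here is the simplest variant.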

A similar study [Liang2019] explores SSD and F-RCNN by training on a dataset containing 4 threat classes, each of which comprises approximately images. F-RCNN with an Inception ResNet v2 backbone yields the highest mAP ( and on single- and multi-view images, respectively).

Another work [Steitz2018] utilises multi-view imagery by modifying F-RCNN. A multi-view pooling layer constructs 3D features from the 2D features extracted by the convolutional layers, a 3D region proposal network generates the RoIs, and classification and bounding box prediction are performed after a 3D RoI pooling layer. Experiments show that multi-view yields an improvement over single-view imagery ( vs. ).

Liu et al. [Liu2018] also perform object detection, using YOLOv2 [Redmon2018] to detect scissors and aerosols in the SASC dataset. Training YOLOv2 for 6000 iterations yields average precision and recall rates with FPS run-time speed.

Cui and Oztan [Cui2019a] argue that RetinaNet [Lin2018] achieves comparable detection performance, while being considerably faster than traditional sliding-window classification, when trained with 30,000 images synthetically generated via TIP from X-ray cargo container and firearm imagery.
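A minimal sketch of the TIP compositing idea, under the assumption of idealised transmission-space images where overlapping attenuations combine multiplicatively (by Beer-Lambert, attenuations add in the log domain); real TIP pipelines additionally handle threat placement geometry, scaling, noise and plausibility checks.

```python
import numpy as np

def project_threat(bag, threat, mask):
    """Composite a threat signature into a benign bag image by
    multiplying transmittances wherever the mask is set."""
    out = bag.astype(float).copy()
    out[mask] *= threat[mask]
    return out

bag = np.full((4, 4), 0.9)       # mostly transparent benign bag
threat = np.full((4, 4), 0.5)    # threat transmittance signature
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True            # threat placement region
tip = project_threat(bag, threat, mask)
print(tip[2, 2], tip[0, 0])  # 0.45 0.9
```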

Hassan et al. [Hassan2019] propose an object detection algorithm whereby the RoIs are generated via cascaded multiscale structure tensors that extract candidate regions based on variations in the orientations of the object. The extracted RoIs are then passed to a CNN, which quantitatively and computationally outperforms RetinaNet, YOLOv2 and F-RCNN on the GDXray and SIXray datasets.

Motivated by the lack of annotated X-ray datasets, Xu et al. [Xu2018] make use of attention mechanisms for the localisation of threat materials. The first stage forward-passes an input and finds the corresponding class probability. A back-propagation stage then identifies which neurons within the network decide the output class. Mapping the neurons of the first convolutional layer onto the input image localises the threat. A final stage refines the activation map by normalising each layer with the activations of the previous layer. Comparison against the traditional deconvolution method (mAP: ) shows that the proposed method achieves superior detection () without needing bounding box information.

Similar to [Caldwell2017], the generalisation capability of CNNs is studied by Gaus et al. [Gaus2019a] by training and validating CNNs on different datasets (DBF3 ( mAP) and SIXray ( mAP)).

4.5 Anomaly Detection

Human operators tend to perform better detection when focusing on benign objects rather than threat items. In addition, knowledge of everyday benign objects leads to much better detection performance [Sterchi2017]. The same concept is applied in anomaly detection, where the model is trained only on normal samples and tested on both normal and abnormal examples.

An anomaly detection approach [Andrews2016] employs sparse feed-forward autoencoders in an unsupervised manner to learn the feature encoding of normal and abnormal data. An SVM [Hearst1998] then classifies the images as either anomalous or benign. Validation on MNIST [Lecun1998] and a freight container dataset (empty vs. non-empty) shows that the hidden-layer representation extracted from the autoencoder is, in fact, rather significant for the detection of abnormalities in the images. When fused with the raw input and the residual error, the feature encodings from the hidden layers yield even better detection performance.
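The fused representation can be sketched as follows; the random linear encoder/decoder stand-ins are illustrative assumptions replacing the trained sparse autoencoder of the original work, and the downstream SVM is omitted.

```python
import numpy as np

def fused_features(x, encoder, decoder):
    """Concatenate the raw input, the hidden-layer encoding and the
    reconstruction residual into one feature vector for a downstream
    classifier (an SVM in the original work)."""
    z = encoder(x)             # hidden-layer encoding
    x_hat = decoder(z)         # reconstruction
    residual = x - x_hat       # reconstruction error
    return np.concatenate([x.ravel(), z.ravel(), residual.ravel()])

# stand-in encoder/decoder: a random linear map and its transpose
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))
x = rng.normal(size=64)
feat = fused_features(x, lambda v: W @ v, lambda z: W.T @ z)
print(feat.shape)  # (144,)
```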

A follow-up work utilises intensity, log-intensity and VGG-19 [Simonyan2014] features extracted from patches of the UCL TIP dataset and trains on normal images via a forest of random split trees anomaly detector [Liu2012]. Testing the model on normal + abnormal data yields % AUC.

A similar study [Akcay2018b], in which image and latent vector spaces are optimised for anomaly detection, utilises an adversarial network in which the generator comprises encoder-decoder-encoder sub-networks. The objective of the model is to jointly minimise the distance between the real/generated images and their latent representations, which overall outperforms the previous state-of-the-art both statistically and computationally (UBA: , FFOB: – AUC). A follow-up work [Akcay2019] improves the performance of [Akcay2018b] further by (i) utilising skip connections in the generator network to cope with higher-resolution images, and (ii) learning the latent representations within the discriminator network (UBA: , FFOB: – AUC).
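The latent-distance scoring mechanics behind the encoder-decoder-encoder design can be sketched as below; the identity stand-ins are assumptions used only to show the scoring interface, not trained networks.

```python
import numpy as np

def anomaly_score(encode, decode, re_encode, x):
    """Encode the input, reconstruct it, re-encode the reconstruction,
    and score by the distance between the two latent vectors. A model
    trained on benign images only reconstructs (and re-encodes) threat
    items poorly, inflating the score."""
    z = encode(x)              # latent code of the input
    x_hat = decode(z)          # reconstruction
    z_hat = re_encode(x_hat)   # latent code of the reconstruction
    return float(np.linalg.norm(z - z_hat))

# with perfectly matched identity stand-ins the score collapses to zero
score = anomaly_score(lambda v: v, lambda z: z, lambda v: v, np.ones(8))
print(score)  # 0.0
```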

Another anomaly detection algorithm [Griffin2019] (i) first extracts the features of normal images from an Inception v3-like network [Szegedy2017], and (ii) subsequently trains a multivariate Gaussian model to capture the normal distribution of the CAST dataset. The anomaly score of a test sample is based on its likelihood relative to the model, which overall yields AUC.
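Fitting a Gaussian to normal-image features and scoring test samples by squared Mahalanobis distance (a monotone proxy for the negative log-likelihood) can be sketched with numpy; the 4-dimensional random features below are stand-ins for CNN features, and the regularised covariance inverse is an implementation assumption.

```python
import numpy as np

def fit_gaussian(normal_features):
    """Fit a multivariate Gaussian to features of benign images."""
    mu = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False)
    # small ridge term keeps the covariance invertible
    return mu, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def mahalanobis_score(x, mu, cov_inv):
    """Squared Mahalanobis distance: high values mean low likelihood
    under the normality model, i.e. a likely anomaly."""
    d = x - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 4))   # stand-in CNN features
mu, cov_inv = fit_gaussian(normal)

in_dist = mahalanobis_score(rng.normal(0.0, 1.0, size=4), mu, cov_inv)
out_dist = mahalanobis_score(np.full(4, 8.0), mu, cov_inv)
print(in_dist < out_dist)  # True
```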

4.6 Segmentation

Due to the scarcity of datasets with pixel-level annotations, the task of segmentation is understudied within the field. One published work [Gaus2019b] addresses the segmentation and anomaly detection tasks together, whereby a dual-CNN pipeline initially segments RoIs via Mask RCNN [He2017] and classifies the regions as benign/abnormal via ResNet-18 [He2015], achieving segmentation mAP and anomaly detection accuracy. Another work [Bhowmik2019a] proposes a three-stage approach, whereby (i) object-level segmentation is achieved via Mask RCNN [He2017], (ii) sub-component regions are segmented via super-pixel segmentation and (iii) final object classification is performed via fine-grained CNN classification, which overall yields anomaly detection accuracy on electronic items. An et al. [An2019] propose a segmentation model that utilises a dual attention mechanism within an encoder-decoder segmentation network. The former attention module classifies the RoI, while the latter localises the object. Experiments on a PASCAL-like structured X-ray dataset containing augmented images from 7 classes yield accuracy and mean intersection over union (mIoU).

Reference Domain Problem Method
Akçay et al. [Akcay2016] Baggage Object Classification CNN with transfer learning
Svec [SvecP.2016] Baggage Object Classification CNN with transfer learning
Andrews et al. [Andrews2017] Cargo Anomaly Detection Train CNN features with Random Split Trees
Jaccard et al. [Jaccard2016b] Cargo Object Classification oBIF+RF for non-empty cargo detection, followed by CNN for car detection
Jaccard et al. [Jaccard2016] Cargo Object Classification CNN from scratch outperforms RF
Rogers et al. [Rogers2017a] Cargo Object Classification Evaluation of high and low energy x-ray imagery
Caldwell et al. [Caldwell2017] Cargo, Baggage Object Classification Transferability between domains
Yuan and Gui [Yuan2018] Tera Hertz Object Classification Two-stage. Classify from RGB, then Tera-Hertz images.
Zhao et al. [Zhao2018] Baggage Image Generation, Object Classification Generate X-ray objects via GAN, and classify with CNN
Yang et al. [Yang2019] Baggage Image Generation, Object Classification Generate X-ray objects via GAN, and classify with CNN
Miao et al. [Miao2019] Baggage Object Classification with class-balanced hierarchical refinement
Morris et al. [Morris2018] Baggage Object Classification Region-based detection with Z-effective
Akçay and Breckon [Akcay2017] Baggage Object Detection Evaluation of detection algorithms; Faster-RCNN performs best.
Liang et al. [Liang2018] Baggage Object Detection R-FCN performs best; multi-view outperforms single view.
Liang et al. [Liang2019] Baggage Object Detection Explores various detection algorithms, F-RCNN with Inception ResNet v2 achieves the highest performance
Steitz et al. [Steitz2018] Baggage Object Detection F-RCNN with multi view pooling is superior to single view only.
Liu et al. [Liu2018] Baggage Object Detection YOLOv2 achieves real time performance.
Xu et al. [Xu2018] Baggage Object Detection Localizes the threat material from the X-ray images via attention mechanisms
Islam et al. [Islam2018] Baggage Object Detection Tracks passengers and their belongings passing through airport X-ray security checkpoints
Liu et al. [Jinyi2019] Baggage Object Detection Foreground object segmentation via material info, followed by a F-RCNN
Gaus et al. [Gaus2019a] Baggage Object Detection F-RCNN to investigate the transferability between various X-ray scanners.
Cui and Oztan [Cui2019a] Baggage Object Detection RetinaNet trained on a TIP dataset achieves considerably faster detection than sliding-window CNN.
Hassan et al. [Hassan2019] Baggage Object Detection RoI are extracted via cascaded multiscale structure tensors, which are then classified via a CNN
Bhowmik et al. [Bhowmik2019] Baggage Object Detection Explores the generalisation capability of the models trained on TIP datasets.
Andrews et al. [Andrews2016] Cargo Anomaly Detection Fusion of the raw-input and residual error with feature encoding from the hidden layers.
Akçay et al. [Akcay2018b] Baggage Anomaly Detection Encoder-decoder-encoder sub-networks; jointly minimises image and latent-space distances.
Akçay et al. [Akcay2019] Baggage Anomaly Detection Use of skip connections; learns the latent vector within the discriminator network.
Griffin et al. [Griffin2019] Baggage Anomaly Detection Feature Extraction with CNN, then train with Gaussian model.
Gaus et al. [Gaus2019b] Baggage Object Segmentation Mask-RCNN to segment RoIs, and CNN classification for anomaly detection
Bhowmik et al. [Bhowmik2019a] Baggage Object Segmentation Mask-RCNN to segment RoI, superpixel for sub-component level analysis, fine-grained CNN for classification
An et al. [An2019a] Baggage Object Segmentation Dual attention mechanism within an encoder-decoder segmentation network.
Table 2: Overview of deep learning approaches applied within X-ray security imaging.

5 Discussion and Future Directions

This section evaluates the current trends presented in Section 4, and discusses the open challenges and future directions of the field.


Although the use of transfer learning improves performance on small X-ray datasets, the lack of large datasets limits the training of contemporary deep models. The relatively large datasets in the field, such as SIXray and FFOB, are highly biased towards certain classes, limiting the training of reliable supervised methods. Hence, it is essential to build large, homogeneous, realistic and publicly available datasets, collected either by (i) manually scanning numerous bags with different objects and orientations in a lab environment or (ii) generating synthetic datasets via contemporary algorithms.

Both methods have advantages and disadvantages. Although manual data collection enables the gathering of realistic samples with the flexibility to produce any combination, it is rather expensive, requiring tremendous human effort and time.

Synthetic dataset generation, on the other hand, is currently achieved via TIP [Rogers2016, Mery2017b] or GANs [Zhao2018, Yang2019]. A recent study [Bhowmik2019] empirically demonstrates that training on a TIP dataset adversely impacts detection performance on real examples. In future work, therefore, more advanced algorithms such as image translation or domain adaptation [Isola2016, Zhu2017] could be considered, such that the model would learn to translate between the benign and threat domains, which overall would yield superior projection/translation compared to TIP.

The literature has also seen another type of synthetic dataset, generated by GAN algorithms. The limitation of the current GAN datasets [Zhao2018, Yang2019], however, is that the models are capable of producing only isolated objects rather than full X-ray images. Moreover, the quality of the generated images is far from realistic. Further studies, taking these issues into account, will need to be undertaken. It might be feasible to create more realistic X-ray images by using contemporary GAN algorithms [Karras2019a].

Exploiting Multiple-View Information

Existing research recognizes the critical role played by multiple-view imagery, especially when the detection of an object from a particular viewpoint is challenging [Michel2009, Steitz2018, Liang2018].

Two key studies, [Liang2018] and [Steitz2018], investigate multiple-view integration outside and inside a CNN, respectively. Despite the incremental performance improvements reported, further work is required to investigate other ways to utilise multiple-view imagery better.

Transferring Between Domains and X-ray Scanners

As pointed out in [Caldwell2017, Gaus2019a], transferring models between different scanners can be challenging due to the unknown intrinsics of the scanners. Future work could utilise domain adaptation [Zhu2017], where the source domain contains images from one scanner and the target domain images from another. Training even with unbalanced datasets could learn the scanner intrinsics and map from one domain to the other.

Improving Unsupervised Anomaly Detection Approaches

The performance of the current anomaly detection algorithms presented in Section 4.5 is too limited for deployment in real-world scenarios. Therefore, more research on this topic needs to be undertaken to design better reconstruction techniques that thoroughly learn the characteristics of normality, from which abnormality can be detected.

Use of the Material Information

In dual-energy X-ray systems, the attenuation between the high and low energies yields a unique value for different materials, which could be utilised further for more accurate object classification/detection [Chen2007, Fu2010]. Even though recent research [Morris2018, Rogers2017a] has examined the use of material information, the outcomes present inconsistent results. Hence, a further study thoroughly investigating the use of material information is suggested.
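A minimal sketch of why dual-energy attenuation yields a material signature, assuming idealised Beer-Lambert transmissions: since -ln(I/I0) is proportional to the attenuation coefficient times thickness, the ratio of the two log-attenuations cancels thickness and characterises the material. Real systems map this ratio to effective atomic number via calibrated lookup tables rather than using the raw value.

```python
import numpy as np

def attenuation_ratio(i_low, i_high, i0_low, i0_high):
    """Per-pixel ratio of low- to high-energy log-attenuation;
    roughly thickness-independent and material-characteristic."""
    mu_low = -np.log(i_low / i0_low)     # low-energy attenuation
    mu_high = -np.log(i_high / i0_high)  # high-energy attenuation
    return mu_low / mu_high

# a pixel transmitting 25% at low energy and 50% at high energy
r = attenuation_ratio(0.25, 0.5, 1.0, 1.0)
print(round(float(r), 2))  # 2.0
```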

6 Conclusion

This paper taxonomises the conventional machine learning and modern deep learning algorithms utilised within X-ray security imaging. Traditional approaches are sub-categorised based on computer vision tasks such as image enhancement, threat image projection, object segmentation, feature extraction, object classification and detection. The review of deep learning approaches covers classification, detection, segmentation and unsupervised anomaly detection algorithms applied within the field. The discussion finally presents the strengths and weaknesses of current techniques, open challenges, and envisioned future directions for the field.