X-ray baggage security screening is widely used to maintain aviation and transport security, itself posing a significant image-based screening task for human operators reviewing compact, cluttered and highly varying baggage contents within limited time-scales. With both increased passenger throughput in the global travel network and an increasing focus on wider aspects of extended border security (e.g. freight, shipping, postal), this poses both a challenging and timely automated image classification task.
Prior work in the field has notably concentrated on the shape-based detection of both threat and contraband (undeclared) items within X-ray imagery, achieving high detection performance with low false positive reporting [30, 15, 3, 4]. However, such approaches are insufficient when dealing with the detection of unknown anomalous items or materials potentially concealed within complex items such as consumer electronic devices.
Whilst existing security scanners use dual-energy X-ray for materials discrimination, and highlight specific image regions matching existing threat material profiles [25, 29], the detection of generalized anomalies within complex items remains challenging (e.g. Figure 1).
Within machine learning, anomaly detection involves learning a pattern or distribution of normality for a given data source and thus detecting significant deviations from this norm. Anomaly detection is an area of significant interest within computer vision, spanning biomedical imaging to video surveillance. In our consideration of X-ray security imagery, we are looking for abnormalities that indicate concealment or subterfuge whilst working against a real-world adversary who may evolve their strategy to avoid detection. Such anomalies may present (or conceal) themselves within appearance space in the form of an unusual shape, texture or material density (i.e. dual-energy X-ray colour). Alternatively, they may present themselves in a semantic form, through the appearance of unfamiliar objects either globally or locally within the X-ray image.
Prior work on appearance and semantic anomaly detection has considered unique feature representation as a critical component for detection within cluttered X-ray imagery. Early work on anomaly detection in X-ray security imagery implements block-wise correlation analysis between two temporally aligned scanned X-ray images. More recently, anomalous X-ray items within freight containers have been detected using auto-encoder networks, and additionally via the use of convolutional neural network (CNN) extracted features as a learned representation of normality across stream-of-commerce parcel X-ray images. In a similar vein, other recent work uses a novel adversarial training architecture to detect anomalies as high reconstruction errors produced by a generator network adversarially trained on non-anomalous (benign) stream-of-commerce X-ray imagery only.
However, the majority of this prior anomaly detection work is focused at the image or object level, where anomaly presence is clear in appearance or semantic space, by asking the global question - is the image anomalous?
These approaches [5, 10, 2] fail to address the fact that anomaly presence may be subtle and concealed within a semantically benign object itself (e.g. Figure 1 A/B). In this case, we wish to ask a highly localised question - is this part of this complex object within the image anomalous?
In order to address this issue, we consider the task of image segmentation: if we first segment a class of object from the image, and then potentially segment that object into its sub-components, how well can this issue of subtle and concealed anomaly detection be addressed?
To these ends, we introduce a side-by-side comparison of both object and sub-component level segmentation strategies for this case of intra-object anomaly detection. While anomaly detection at the object level is more common, detection at the sub-component level is still in its infancy. The key concept is that whilst subtle localised anomalies may be difficult to detect via an image level anomaly detection approach, we can instead target object level or sub-component level anomaly detection in isolation. Hence a more general learning-driven approach can be developed at the object or sub-component level, instead of tackling global signatures across all possible objects, and is thus able to tell whether they are anomalous or benign in appearance / semantic space.
Following the work of Zhang et al., who leverage superpixels within X-ray cargo image classification, we complement such an approach with prior object segmentation as an enabler of sub-component level anomaly detection within X-ray security imagery. Using contemporary object segmentation via Mask Region-based CNN (R-CNN) and Simple Linear Iterative Clustering (SLIC) superpixels, we evaluate alternate strategies for the detection of subtle intra-object anomalies using either a generalised object level or a sub-component level segmentation strategy, thus facilitating effective anomaly detection independent of resolute object classification (Section II). Our work is evaluated over a range of large consumer electronics items with and without intra-object anomaly presence (Section III).
II Proposed Approach
Our approach considers two automatic segmentation strategies for intra-object anomaly detection in X-ray security imagery (Sec. II-A), as illustrated in Figure 2A: first, object level segmentation is performed (Figures 2B and 2C) and secondly, sub-component level segmentation is performed (Figures 2C and 2D). This is followed by secondary scale-specific variants of contemporary deep CNN architectures for final anomaly detection as a binary classification task (Sec. II-B).
II-A Segmentation Strategies
Object Level Segmentation: Our first segmentation strategy builds upon prior Faster R-CNN work specific to X-ray security imagery, augmenting this model with two additional convolutional layers to construct an object boundary segmentation mask, following the Mask R-CNN concept. This is performed by adding a branch to Faster R-CNN that outputs an image mask indicating pixel membership of a given detected object. Mask R-CNN also addresses the feature map misalignment found in Faster R-CNN for higher resolution feature map boundaries, via bilinear boundary interpolation. Our Mask R-CNN is applied to an input X-ray image (Figure 2A), with the segmented object (Figure 2B) then isolated from the image for subsequent object level anomaly detection (Figure 2C).
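The isolation step described above (keeping only the pixels that the segmentation mask assigns to the detected object) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the nested-list image representation, the function name `isolate_object`, the bounding-box crop and the background value of 0 are all assumptions made for the example.

```python
def isolate_object(image, mask, background=0):
    """Keep pixels where `mask` is truthy, blank the rest, and crop
    to the bounding box of the mask (hypothetical helper)."""
    # Rows and columns that contain at least one mask pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return []  # empty mask: nothing was segmented
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else background
         for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# Toy 3x4 single-channel "image" and a binary object mask.
image = [[5, 6, 7, 8],
         [1, 2, 3, 4],
         [9, 8, 7, 6]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0]]
crop = isolate_object(image, mask)
print(crop)
```

In practice the mask would come from the segmentation network and the image would be a multi-channel array; the same masking and cropping logic applies per channel.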
Sub-component Level Segmentation: Our second segmentation strategy uses image over-segmentation via Simple Linear Iterative Clustering (SLIC) superpixels. It performs iterative clustering in a similar manner to $k$-means, where the image is segmented into approximately equally-sized superpixels, whose total number $K$ is user-defined. SLIC represents each pixel in $\mathbb{R}^5$, defined by the $l, a, b$ values of the CIELAB colour space and the $x, y$ pixel coordinates. Instead of using a plain Euclidean distance in this space, SLIC introduces a distance measure that considers superpixel size. SLIC takes as input a desired number $K$ of approximately equally-sized superpixels; for an image with $N$ pixels, the approximate size of each superpixel is $N/K$ pixels, and there is a superpixel centre at every grid interval $S = \sqrt{N/K}$. Let $[l_i, a_i, b_i, x_i, y_i]^T$ be the five-dimensional point of a pixel; a cluster centre $C_k = [l_k, a_k, b_k, x_k, y_k]^T$ takes the same form. The distance measure $D_s$ is defined as:

$$d_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2}$$

$$d_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2}$$

$$D_s = d_{lab} + \frac{m}{S}\, d_{xy}$$

where $D_s$ is the sum of the $lab$ distance $d_{lab}$ and the $xy$ plane distance $d_{xy}$ normalised by the grid interval $S$. The variable $m$ is introduced to control the compactness of the superpixels, with the local convexity or concavity of each superpixel shape dependent on $m$ (a low $m$ reduces the influence of coordinate information, while for a high $m$ each superpixel will approximate a square shape). Our choice of $m$ (made by taking into consideration the size of the object present in an image) results in a set of superpixel regions conforming to convex and concave image shape boundaries, as illustrated in Figure 3.
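As a worked illustration of the SLIC distance measure, the sketch below writes each pixel and cluster centre as a five-tuple (l, a, b, x, y), computes the grid interval as the square root of N/K, and combines the CIELAB colour distance with the spatial distance scaled by m/S. The function names and the example values of N, K and m are assumptions chosen only to make the arithmetic easy to follow; they are not the settings used in this work.

```python
import math


def grid_interval(num_pixels, num_superpixels):
    """Grid interval S = sqrt(N / K) between superpixel centres."""
    return math.sqrt(num_pixels / num_superpixels)


def slic_distance(pixel, centre, m, S):
    """Distance between a pixel and a cluster centre, both given as
    (l, a, b, x, y): colour distance plus (m / S) * spatial distance."""
    l_i, a_i, b_i, x_i, y_i = pixel
    l_k, a_k, b_k, x_k, y_k = centre
    d_lab = math.sqrt((l_k - l_i) ** 2 + (a_k - a_i) ** 2 + (b_k - b_i) ** 2)
    d_xy = math.sqrt((x_k - x_i) ** 2 + (y_k - y_i) ** 2)
    return d_lab + (m / S) * d_xy


# Illustrative values only: a 100x100 image split into 100 superpixels.
S = grid_interval(num_pixels=10_000, num_superpixels=100)
pixel = (50.0, 0.0, 0.0, 3.0, 4.0)
centre = (50.0, 0.0, 0.0, 0.0, 0.0)
print(S, slic_distance(pixel, centre, m=10, S=S))
```

With identical colour values the result depends only on the spatial term, which makes the effect of m visible: doubling m doubles the penalty for spatial separation and therefore yields more compact superpixels.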
II-B Secondary Classification
Each segmented image region, from object level or sub-component level segmentation, is subsequently classified using a deep CNN architecture model formulated as a binary classification task. Three contemporary generalised CNN architectures plus leading fine-grain CNN classification approaches [18, 32, 27], specifically targeting the sub-categorization of pre-determined object types, are considered to form the basis of our anomaly detection study.
VGG-16 is a seminal network architecture that consists of 16 weight layers (13 convolutional and 3 fully connected), with a fixed convolution kernel size of 3 × 3, stacked on top of each other in increasing depth.
SqueezeNet is a small network architecture that uses many 1 × 1 filters to aggressively reduce the number of weights. It offers accuracy equivalent to AlexNet while operating with fewer parameters.
ResNet-50 addresses the vanishing gradient issue encountered during back-propagation in previous deep CNN architectures by introducing skip connections in parallel to the regular convolutional layers, within a network 50 layers in depth.
Fine-grain Classification [18, 32, 27] is put into effect as we can consider the task of anomaly detection in our case as a fine-grained image classification (FGIC) problem. The X-ray screening imagery used has very subtle differentiating factors in the sub-components of a given object (e.g. laptop, bottle), and as such a fine-grained approach should be used to detect finer class-specific discriminatory patches within objects [13, 28].
Bilinear Convolutional Neural Network (BCNN) utilises a dual VGG-16 architecture in parallel, with each stream implementing convolution and max pooling elements, thus allowing focus on two separate distinct parts of the object. The two streams are combined into a bilinear vector using sum pooling over the outer product of the outputs of both streams. This is then used in the final classification by feeding into the linear layers of the network, and finally a softmax layer to gain a probabilistic output of the most likely classifications for the image.
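To make the bilinear combination concrete, the sketch below follows the commonly used formulation of bilinear pooling, in which the outer product of the two streams' feature vectors is computed at every spatial location and then sum-pooled into a single vector. It is a toy pure-Python illustration under that assumption, not the network used here: the function name `bilinear_pool` and the tiny feature maps are hypothetical, and any subsequent normalisation steps are omitted.

```python
def bilinear_pool(stream_a, stream_b):
    """Sum-pool the outer product of two feature streams.

    stream_a, stream_b: lists with one feature vector per spatial
    location (same number of locations in both streams).
    Returns a flattened vector of length Ca * Cb.
    """
    ca, cb = len(stream_a[0]), len(stream_b[0])
    pooled = [[0.0] * cb for _ in range(ca)]
    for feat_a, feat_b in zip(stream_a, stream_b):
        # Outer product at this location, accumulated over locations.
        for i, u in enumerate(feat_a):
            for j, v in enumerate(feat_b):
                pooled[i][j] += u * v
    return [value for row in pooled for value in row]


# Two locations, two channels per stream (illustrative numbers).
stream_a = [[1, 2], [0, 1]]
stream_b = [[3, 4], [5, 6]]
bilinear_vector = bilinear_pool(stream_a, stream_b)
print(bilinear_vector)
```

The resulting vector captures pairwise channel interactions between the two streams, which is what allows the classifier to respond to co-occurring local patterns.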
Multi-Attention (MA)  optimises part attentions of four distinct regions of an image using the feature channels in a VGG-16 architecture. This allows the network to focus on discriminative factors present in object parts, and use this in the final classification. Each of the four layers produces a classification at the end of the network linear layers which is then grouped by channel grouping loss in order to generate a final classification for a given object.
Discriminative Filter Bank (DFL) approach , heightens the mid-level network in a VGG-16 based architecture, by learning a collection of convolution filters known as a filter bank (FB), and a
with stride 8 to preserve global shape and appearance dependency in the image data. These filters, when properly initialised and successfully learned, can respond to discriminative regions when convolved over the image.
When applied to the challenge of X-ray security screening as a binary classification problem, the models should be able to recognise much subtler visual differences, and the locations of such parts within object sub-components, which will ultimately lead to more reliable classification.
For both object level and sub-component level segmentation our resulting image segments are padded and re-scaled to a common reference dimension (objects:; sub-components (superpixels): ). Dataset imbalance, a common problem in anomaly detection where anomalous examples can be scarce and challenging to obtain, is addressed by up-sampling the anomalous class, which has the lesser volume of samples. In total, training is performed over a dataset of 14,964 X-ray images ( data split) and testing is reported over a dataset of 7,878 X-ray images (50% anomalous and 50% benign) containing consumer electronics items.
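One simple way to up-sample a minority class is random over-sampling with replacement until every class matches the size of the largest one. The sketch below shows that idea only; the source does not state which up-sampling procedure was used, so the function name `upsample_minority`, the fixed seed and the toy sample identifiers are assumptions.

```python
import random


def upsample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes (with replacement)
    until all classes have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical identifiers: four benign segments, one anomalous segment.
samples = ["b1", "b2", "b3", "b4", "a1"]
labels = ["benign"] * 4 + ["anomalous"]
balanced_samples, balanced_labels = upsample_minority(samples, labels)
print(balanced_labels.count("benign"), balanced_labels.count("anomalous"))
```

Over-sampling should be applied to the training split only, so that duplicated anomalous examples do not leak into the test data.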
Training is performed via transfer learning using stochastic gradient descent with a momentum of 0.9, a learning rate of 0.001, a batch size of 64 and categorical cross-entropy loss. All networks are trained on an NVIDIA 1080 Ti GPU using PyTorch.
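For reference, the parameter update performed by stochastic gradient descent with momentum can be sketched in a few lines. The version below uses the velocity form v = momentum * v + g followed by p = p - lr * v, which is the form used by PyTorch's SGD optimiser when dampening and Nesterov momentum are disabled; the helper name and the toy parameter and gradient values are assumptions for illustration, with the learning rate and momentum taken from the text.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update over flat lists of floats.

    Returns the updated parameters and the updated velocity.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Two consecutive steps with a constant gradient (illustrative values).
params, velocity = [1.0], [0.0]
params, velocity = sgd_momentum_step(params, [2.0], velocity)
first_step = params[0]
params, velocity = sgd_momentum_step(params, [2.0], velocity)
second_step = params[0]
print(first_step, second_step)
```

The second step moves further than the first because the velocity accumulates the previous gradient, which is the intended effect of the momentum term.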
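| Segmentation | Classification approach | Model | A | P | F1 | TP | FP |
|---|---|---|---|---|---|---|---|
| Sub-component level segmentation | Binary Classification via CNN | ResNet-18 | 97.10 | 95.40 | 97.00 | 98.89 | 4.69 |
| Sub-component level segmentation | Fine-Grain Classification | BCNN | 97.54 | 95.53 | 97.49 | 95.49 | 4.30 |
| Object level segmentation | Binary Classification via CNN | ResNet-18 | 86.20 | 80.60 | 76.90 | 95.42 | 21.13 |
| Object level segmentation | Fine-Grain Classification | DFL | 89.77 | 83.70 | 89.33 | 83.70 | 3.88 |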
Our evaluation considers the comparative performance of: (a) object level segmentation followed by anomaly detection via CNN classification (i.e., anomaly present in the object as a whole) and (b) sub-component level segmentation followed by anomaly detection via CNN classification (i.e., anomaly present in image sub-component patches, i.e., superpixels). We consider statistical Accuracy (A), Precision (P), F-score (F1), True Positive (TP) and False Positive (FP) as presented in Tables I and II.
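The statistical measures named above can all be derived from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions and treats TP and FP as rates (true positive rate and false positive rate), which is an assumption based on their being reported as percentages; the function name and the example counts are hypothetical and are not results from this work.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion counts,
    with 'anomalous' treated as the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (recall)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {"A": accuracy, "P": precision, "F1": f1, "TP": tpr, "FP": fpr}


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Reporting the false positive rate alongside accuracy matters in a screening setting, since every false alarm triggers a manual inspection.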
The X-ray security imagery dataset used for evaluation is obtained using a conventional 2D X-ray scanner with associated false colour materials mapping from dual-energy X-ray materials information . It comprises large consumer electronics items (e.g., laptops) with and without intra-object anomaly concealment present. Anomaly concealments consist of marzipan, metal screws, metal plates, knife blades and similar inside the electronic items as illustrated in the examples of Figure 1 and Figure 2A/B.
Performance evaluation of the object level segmentation and sub-component level segmentation approaches is performed over a set of images annotated with ground truth anomaly location, gathered using local access to a dual-view X-ray cabin baggage security scanner.
From the results presented in Tables I and II, we can observe that a sub-component level segmentation strategy, supported by the secondary fine-grain CNN classification of DFL model , offers significantly superior anomaly detection performance (A: , TP: , FP: - Table I) than an object level segmentation strategy overall (Table II). Furthermore, fine-grain CNN classification similarly offers the highest overall accuracy and lowest false positive rate (A: , FP: - Table II) for object level segmentation. By contrast, the use of binary classification via a CNN offers superior performance for object level segmentation (Table II) in terms of higher accuracy supported primarily by higher true positive detection at the expense of false positive reporting. Second stage binary classification via CNN performed less well overall with the sub-component segmentation strategy (lower accuracy (A) caused by significantly higher false positive (FP) - Table II).
The fine-grain classification model (DFL) offers the lowest false positive rate and maximal accuracy for both segmentation strategies (Table I). We can deduce that increased levels of isolation via segmentation to the sub-component level improve the performance of the discriminative feature space learnt by the fine-grain techniques [27, 32, 18], whilst more classical object classification CNN architectures perform only marginally better on objects than on sub-components (Table II).
Figure 4 visualises the attention of the DFL model whilst trained on object-level and sub-component level segmentation data respectively. The visualisations are generated from the Rectified Linear Unit (ReLU) activations of the final layer before the fully connected layer in the VGG-16 architecture of DFL. It is evident when inspecting these that the sub-component level results (Figure 4, second column) show attention over the anomalous parts of the laptops in each image, while the object-level analysis (Figure 4, first column) shows relatively sporadic, sparse attention over the images. This provides a qualitative visualization supporting the finding that the sub-component level segmentation strategy outperforms object level segmentation.
By enhancing the intermediate layer within the VGG-16 via a filter bank, we can hypothesise that this allows it to learn edge, corner and texture detail on specific sub-components at a finer level of the candidate region presented. As a result, the learned feature representation of anomalous against benign is highly discriminative, leading to a significantly lower false positive rate than any other technique, at both the object and sub-component level (Tables I and II).
Binary classification via CNN using same VGG-16 architecture can achieve high true positive detection at the object level but at the expense of the highest false positive (TP: , FP: - Table II). This can also be observed for the ResNet and SqueezeNet architectures. For example, ResNet-50 achieves true positive of , however suffering from high FP of for object level segment classification (Table II).
Overall we observe that a sub-component level segmentation strategy, enabled by object segmentation via Mask R-CNN and subsequent superpixel over-segmentation via SLIC, consistently outperforms an object level segmentation strategy (via Mask R-CNN alone) when secondary region classification is performed using a specific fine-grain CNN variant. The mean runtime for the end-to-end classification strategy (object segmentation, followed by sub-component level segmentation and fine-grain classification) is milliseconds, which is within the belt speed (0.2 metres/second) of a standard X-ray scanner.
We primarily focus on supervised anomaly detection strategies and compare the performance amongst them. Hence we do not include unsupervised or semi-supervised anomaly detection strategies in our experiments, as we believe a comparison between supervised and semi-supervised approaches would not be equitable. To the best of our knowledge, the proposed work on classification within large consumer electronics items, using a sub-component level segmentation strategy, is the first of its kind. As no prior related work is available in the literature on X-ray security imagery (e.g. sub-component level segmentation classification), we are unable to compare our strategies with any existing algorithm and present our results as the benchmark.
Figure 5 shows exemplar qualitative results of sub-component level segmentation with per superpixel classification using the fine-grained DFL  approach where we can see the colour coded set of anomalous (red) as well as benign (green) sub-component regions within the pre-isolated object-level image region.
We assess the performance impact of varying segmentation strategies, such as object level and sub-component level segmentation, for intra-object anomaly detection within the context of X-ray security imagery. Our experimental comparison demonstrates the superiority of a sub-component level segmentation approach in combination with a specific fine-grain CNN architecture achieving a performance accuracy of with a notable false positive rate for realistic anomaly concealment within representative consumer electronic items.
Future work will consider the conglomerate use of the multiple sub-component anomaly detection results in the robust determination of image-level anomaly vs. benign decision making for a broader range of object types.
Acknowledgements: Funding support - UK Department of Transport, Future Aviation Security Solutions (FASS) programme, (2018/2019).
[1] (2012-11) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11), pp. 2274–2282.
[2] (2018) GANomaly: semi-supervised anomaly detection via adversarial training. In Asian Conference on Computer Vision – ACCV.
[3] (2017-Sep.) An evaluation of region based object detection strategies within x-ray baggage security imagery. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 1337–1341.
[4] (2018-Sep.) Using deep convolutional neural network architectures for object classification and detection within x-ray baggage security imagery. IEEE Transactions on Information Forensics and Security 13 (9), pp. 2203–2215.
[5] (2016) Detecting Anomalous Data Using Auto-Encoders. International Journal of Machine Learning and Computing 6 (1), pp. 21–26.
[6] (2018-05) Brothers plead not guilty to meat grinder bomb plot. Australian Broadcasting Corporation.
[7] Imagenet: a large-scale hierarchical image database. Proc. Computer Vision and Pattern Recognition, pp. 248–255.
[8] FEP ME 640 AMX. https://www.gilardoni.it/en/security/x-ray-solutions/automatic-detection-of-explosives/fep-me-640-amx/ (accessed 2019-10-14).
[9] (2010-11) Exposing the weakest link: as airline passenger security tightens, bombers target cargo holds.
[10] (2018) ‘Unexpected item in the bagging area’. IEEE Transactions on Information Forensics and Security, pp. 1–1.
[11] (2017) Mask r-cnn. In Proc. of the IEEE Int. Conf. on Computer Vision, pp. 2961–2969.
[12] (2016) Deep residual learning for image recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 770–778.
[13] (2015-06) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 595–604.
[14] (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360.
[15] (2014-08) Automated detection of cars in transmission x-ray images of freight containers. In 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 387–392.
[16] An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging 4 (2).
[17] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger (Eds.), pp. 1097–1105.
[18] (2017-09) Improved bilinear pooling with cnns. In Proceedings of the British Machine Vision Conference (BMVC), G. B. Tae-Kyun Kim and K. Mikolajczyk (Eds.), pp. 117.1–117.12.
[19] (2015) A review of automated image understanding within 3d baggage computed tomography security screening. Journal of X-ray science and technology 23 (5), pp. 531–555.
[20] (2017) Automatic differentiation in pytorch.
[21] (2007) An overview of anomaly detection techniques: existing solutions and latest technological trends. Computer Networks 51 (12), pp. 3448–3470.
[22] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[23] (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Information Processing in Medical Imaging, Cham, pp. 146–157.
[24] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
[25] (2003) Explosives detection systems (eds) for aviation security. Signal Processing 83 (1), pp. 31–55.
[26] (2018-01) Evaluation of the effectiveness of an airport passenger and baggage security screening system. Journal of Air Transport Management 66, pp. 53–64.
[27] (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4148–4157.
[28] (2016) Cataloging public objects using aerial and street-level images - urban trees. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6014–6023.
[29] (2012) A review of x-ray explosives detection techniques for checked baggage. Applied Radiation and Isotopes 70 (8), pp. 1729–1746.
[30] (2013) A vehicle threat detection system using correlation analysis and synthesized x-ray images. In Proc. SPIE, Vol. 8709, pp. 8709 – 8709 – 10.
[31] (2014-06) Joint shape and texture based x-ray cargo image classification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 266–273.
[32] (2017) Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition. In ICCV.