SIXray : A Large-scale Security Inspection X-ray Benchmark for Prohibited Item Discovery in Overlapping Images

01/02/2019 · by Caijing Miao, et al.

In this paper, we present a large-scale dataset and establish a baseline for prohibited item discovery in Security Inspection X-ray images. Our dataset, named SIXray, consists of 1,059,231 X-ray images, in which 6 classes of 8,929 prohibited items are manually annotated. It raises a brand-new challenge of overlapping image data, while sharing the same properties as existing datasets, including complex yet meaningless contexts and class imbalance. We propose an approach named class-balanced hierarchical refinement (CHR) to deal with these difficulties. CHR assumes that each input image is sampled from a mixture distribution, and that deep networks require an iterative process to infer image contents accurately. To accelerate, we insert reversed connections into different network backbones, delivering high-level visual cues to assist mid-level features. In addition, a class-balanced loss function is designed to maximally alleviate the noise introduced by easy negative samples. We evaluate CHR on SIXray with different ratios of positive/negative samples. Compared to the baselines, CHR enjoys a better ability to discriminate objects, especially using mid-level features, which offers the possibility of using a weakly-supervised approach towards accurate object localization. In particular, the advantage of CHR is more significant in scenarios with fewer positive training samples, which demonstrates its potential application in real-world security inspection.


1 Introduction

Security inspection has been playing a critical role in protecting public space from safety threats such as terrorism. With the growth of population in large cities and of crowd density in public transportation hubs, it becomes more and more important to recognize prohibited items in X-ray scanned images quickly, automatically and accurately. In recent years, the rapid development of deep learning [19], in particular convolutional neural networks, has brought an evolution to image processing and visual understanding, including discovering and recognizing objects in X-ray images [23][27][24]. Different from natural images and other X-ray scans [35], security inspection often deals with baggage or suitcases in which objects are randomly stacked and heavily overlapped with each other. Therefore, in the scanned images, the objects of interest may be mixed with arbitrary and meaningless clutter and can thus be missed even by human inspectors (Figure 1).

Dataset: https://github.com/MeioJane/SIXray

Figure 1: Example images in the presented SIXray dataset with six categories of prohibited items. Challenges include large variety in object scale and viewpoint, object overlapping and complex backgrounds (please zoom in for details).

To provide a public benchmark for research in this field, in this paper, we present a dataset named Security Inspection X-ray (SIXray), which is more than 100 times larger than the existing largest image collection for prohibited item discovery, i.e., the baggage group in the GDXray dataset [25]. SIXray contains more than one million X-ray images, of which fewer than 1% have positive labels (i.e., prohibited items are annotated). It thus mimics the real-world testing environment, in which inspectors aim at recognizing prohibited items that appear at a very low frequency. Unlike GDXray, which only contains grayscale images with simple backgrounds, our dataset is much more challenging. Although a color X-ray scanner assigns various colors to different materials, objects in the containers suffer considerable variety in scale, viewpoint and style, and a prohibited item may be mixed and overlapped with arbitrary numbers and types of safe items, as shown in Figure 1.

Figure 2: An X-ray image is composed of a set of overlapping images, each of which is transparent. (Best viewed in color).

We formulate this problem as an optimization task which, provided a dataset $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n^\star)\}_{n=1}^{N}$, aims at minimizing the expected loss $\mathcal{L}(\mathbf{y}_n, \mathbf{y}_n^\star)$ between the ground-truth $\mathbf{y}_n^\star$ and the prediction $\mathbf{y}_n$. Here $\mathbf{x}_n$ denotes image data and $\mathbf{y}_n$ is a $C$-dimensional vector with each index indicating whether a specific class is present in $\mathbf{x}_n$. Based on this framework, we point out a clear difference between natural images and X-ray images. A natural image often contains only one class $c$ and thus can be sampled from a distribution $p(\mathbf{x} \mid c)$. However, an X-ray image is often composed of a set of overlapping sub-images which, provided a multi-class label $\mathbf{y}$ ($C$ dimensions), can be formulated using a mixture distribution in which each sub-image $\mathbf{x}_c$ is sampled from a hidden distribution $p(\mathbf{x} \mid c)$, as shown in Figure 2.

We present an approach in the context of deep neural networks to deal with this complex scenario. The key idea is to combine two sources of information, namely, using mid-level features $\mathbf{f}$ (most often sampled from a mixture distribution) to determine the high-level semantics $\mathbf{y}$, and reversely filtering irrelevant information out of $\mathbf{f}$ by referring to the information contained in $\mathbf{y}$. To this end, we formulate the high-level supervision signals into reversed network connections. To alleviate data imbalance, we introduce a loss-balancing term based on this hierarchy. This leads to the complete pipeline named class-balanced hierarchical refinement (CHR). With $\mathbf{y}$ being partly unobserved, an iterative process is required for optimization, which is computationally expensive in practice. To accelerate, we switch off iteration so that more training data are processed in a unit time period. In testing, CHR fuses visual information from different stages towards higher recognition accuracy, yet remains efficient in computation.

We evaluate CHR on SIXray with different ratios of positive/negative samples. CHR reports significantly higher classification performance than various baselines, i.e., different network backbones, demonstrating the effectiveness of using high-level cues to assist mid-level features. In addition, we verify the necessity of the class-balanced loss term, as we observe more significant improvements on less balanced training data. Last but not least, we provide annotations of prohibited items at the bounding-box level in the testing set, and apply the class activation mapping (CAM) algorithm [38] as a baseline for weakly-supervised object localization.

The major contributions of this work are two-fold. (1) We provide a benchmark for future research in this challenging vision task. (2) We present an approach named CHR, which integrates multi-level visual cues and achieves class balance in the hierarchical structure.

2 Related Work

2.1 X-ray Images and Benchmarks

X-ray images are captured by irradiating objects with X-rays and rendering them with pseudo colors according to their spectral absorption rates. Therefore, in X-ray images, objects made of the same material are assigned very similar colors; e.g., metals are often shown in blue while impenetrable objects are often shown in red. Besides, the most significant difference between X-ray and natural images lies in object overlapping, because X-ray is often applied in scenarios where some objects heavily occlude others; e.g., in a piece of baggage, personal items are often stacked randomly. This property brings a new challenge to computer vision algorithms, while the traditional difficulties persist, e.g., scale and viewpoint variance, intra-class variance and inter-class similarity, etc., as widely observed in other object localization benchmarks such as PascalVOC [9] and MS-COCO [21].

Much work has been devoted to dealing with these difficulties, also motivated by the promising commercial value behind them [1][10][26][30][34]. Unfortunately, very few X-ray datasets have been published for research purposes. A recently released benchmark, GDXray [25], contains three major categories of prohibited items: gun, shuriken and razor blade. However, images in GDXray come with few background clutters or overlaps, so recognizing these images and/or detecting the objects within them is considerably easy. In addition, the relatively small number of negative samples (images not containing prohibited items) eases the task in both training and testing stages. ChestXray8 [35] is a large-scale chest X-ray corpus for medical imaging analysis; different from our scenario, objects in these images rarely overlap with each other.

2.2 Object Recognition and Localization

The research field of object recognition has been dominated by deep learning approaches. With the availability of large-scale datasets [18] and powerful computational resources, researchers are able to design and optimize very deep neural networks [18][31][16][4][13][14] that learn visual patterns in a hierarchical manner. In the scenario where each image may contain more than one object, there are typically two types of localization methods. The first works at the image level and produces a score for each class indicating its presence or absence [38]. The second works at the object level and produces a bounding box as well as a class label for each object individually [12][11][29][22][28]. The former type often encounters the issues of multi-object classification and training data imbalance [35], for which the binary cross-entropy (BCE) loss [5] as well as class-balancing techniques [35][15] have been explored. The latter type, on the other hand, is often based on a pipeline that first extracts a number of proposals from the image [12][11][29] and then determines the class of each proposal.

This paper studies image-level recognition, as per-object annotations are missing for the training data, while our approach retains the ability of object-level localization. This is related to research in weakly-supervised object localization [3][6][33] and to a series of works localizing objects using top-down class activation [8][7][39]. There have also been efforts to formulate object localization within multiple-instance learning frameworks, in which convolutional filters behave as detectors that activate regions of interest on the feature maps [3][36][33].

In the context of object recognition in X-ray images, researchers realized that these images often contain less texture information, while shape information stands out as more discriminative. Therefore, in the era of bag-of-visual-words models [34][2], designing effective and efficient handcrafted features was explored in depth [30][26]. As deep learning became a standard tool for optimizing complex functions, researchers started to apply it either to extracting compact visual features for X-ray image representation [1] or to fine-tuning models pre-trained on natural images so that the knowledge learned there can be borrowed. This paper mainly focuses on the second approach.

3 The SIXray Benchmark

3.1 Data Acquisition

We collected a dataset named Security Inspection X-ray (SIXray), which contains a total of 1,059,231 X-ray images and is more than 100 times larger than the only existing public dataset for the same purpose, i.e., the baggage group of the GDXray dataset [25]. These images were collected from several subway stations, with the original meta-data indicating the presence or absence of prohibited items. There are six common categories of prohibited items, namely gun, knife, wrench, pliers, scissors, and hammer. The hammer class, which has very few samples, is not used in our experiments.

The distribution of these objects aligns with the real-world scenario, in which there are far fewer positive samples than negative samples. Statistics of this dataset are shown in Table 1. Each image was scanned by a security inspection machine, which assigns different colors to objects made of different materials. All images were stored in JPEG format.

To study the impact of training data imbalance, we construct three subsets of this dataset, named SIXray10, SIXray100 and SIXray1000, with the number indicating the ratio of negative samples over positive samples. In SIXray10 and SIXray100, all positive images are included, together with 10 and 100 times as many negative images, respectively. SIXray100 has a distribution very close to the real-world scenario. To maximally explore the ability of our algorithm to deal with data imbalance, we construct SIXray1000 by randomly choosing a small subset of the positive images (so that the 1:1,000 ratio holds) and mixing them with all the negative images. Each subset is further partitioned into a training set and a testing set.
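As an illustration of how such subsets can be constructed, the sketch below samples negatives at a fixed negative-over-positive ratio; the file-list variables and the seed are hypothetical placeholders, not the authors' release code.

```python
# Illustrative sketch: building SIXray10/100/1000-style splits.
# `positive_ids` and `negative_ids` are hypothetical lists of image IDs.
import random

def build_subset(positive_ids, negative_ids, neg_pos_ratio, seed=0):
    """All positives plus `neg_pos_ratio` times as many sampled negatives."""
    rng = random.Random(seed)
    n_neg = min(len(negative_ids), neg_pos_ratio * len(positive_ids))
    return list(positive_ids) + rng.sample(list(negative_ids), n_neg)

def build_sixray1000_style(positive_ids, negative_ids, seed=0):
    """SIXray1000 keeps ALL negatives and subsamples the positives instead,
    so that the 1:1,000 negative-over-positive ratio holds."""
    rng = random.Random(seed)
    n_pos = max(1, len(negative_ids) // 1000)
    return rng.sample(list(positive_ids), n_pos) + list(negative_ids)
```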

On the entire dataset, we use the image-level annotations provided by human security inspectors, i.e., whether each type of prohibited item is present. In addition, on the testing sets, we manually add a bounding box for each prohibited item to evaluate the performance of object localization.

Table 1: The class distribution of the SIXray dataset, listing the number of positive images for each of the five classes used (gun, knife, wrench, pliers, scissors) together with the number of negative images. There is another hammer class, but it is not used due to its small number of samples.

3.2 Dataset Properties

The SIXray dataset has several properties that bring difficulties to visual recognition. First, these images were mostly obtained from X-ray scans of personal luggage, e.g., bags or suitcases, in which objects are often randomly stacked. When these items pass through an X-ray scanner, penetration makes even the occluded items visible in the image. This leads to the most important property of this dataset, which we call overlapping. Note that GDXray [25] does not have such a challenge, as there is often only one item in each image. Second, prohibited items can appear in many different scales, viewpoints, styles and even subtypes, all of which cause considerable intra-class variation and increase the difficulty of recognition. Third, the images can be heavily cluttered, yet it is almost impossible to assign a clear class label to every object, especially the non-prohibited ones. Thus, there is noise coming from an open set of objects, which makes it difficult to anticipate what appears in the background regions. Fourth and last, as mentioned above, the positive images (with at least one prohibited item) occupy only a small fraction of this dataset. Without special treatment, the training stage easily biases towards the negative class, as simply guessing a negative label already yields sufficiently high accuracy. This raises a challenge to training stability.

In the following section, we present our approach which takes these properties into consideration, especially the first and fourth properties which are specific to this dataset.

4 Our Approach

4.1 Motivation and Formulation

As observed in the previous section, a significant characteristic of X-ray images is that objects overlap with each other. Note that overlapping is different from occlusion, in which the rear object is invisible. Instead, as X-rays penetrate objects, both front and rear objects are visible in the image. We name this the penetration assumption, based on which we use a mixture model to formulate these data.

Let there be $C$ classes of possible items appearing in the dataset, with an index set $\{1, 2, \ldots, C\}$. Among them, $c$ classes are considered prohibited; in the SIXray dataset, $c = 5$. Without loss of generality, we assign them the class indices $1, 2, \ldots, c$. Let the dataset contain $N$ images. For each input image $\mathbf{x}_n$, our goal is to obtain a $C$-dimensional vector $\mathbf{y}_n$, each dimension of which, $y_{n,m}$, is either $1$ or $0$, with $1$ indicating that the specified prohibited item is present in this image and vice versa. Note that the ground-truth of $\mathbf{y}_n$ only exists for the first $c$ dimensions, while the others remain unobserved.

To obtain a mathematical formulation of $\mathbf{x}_n$, we assume that it is composed of sub-images $\mathbf{x}_{n,m}$, each of which corresponds to a specified class $m$ and is sampled from a conditional distribution $p(\mathbf{x} \mid m)$. Then, based on the penetration assumption, each image can be written as:

$$\mathbf{x}_n = \sum_{m=1}^{C} y_{n,m} \cdot \mathbf{x}_{n,m}. \tag{1}$$

This formulation is of course not exact, as we ignore the overlapping relationship between objects as well as the order in which objects are stacked, but it serves as an approximate description of how overlapping impacts image data.
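To make the penetration assumption concrete, here is a toy numpy illustration of Eqn (1); all shapes and values are made up, and the Beer-Lambert variant at the end is our own aside on what a physically closer model might look like, not something the paper specifies.

```python
# Toy illustration of Eqn (1): an X-ray image approximated as an additive
# composition of transparent per-class sub-images.
import numpy as np

C, H, W = 5, 64, 64
y = np.array([1, 0, 0, 1, 0])            # which classes are present
subimages = np.random.rand(C, H, W)      # stand-ins for x_{n,m} ~ p(x|m)

# Eqn (1): x_n = sum_m y_{n,m} * x_{n,m}
x = np.tensordot(y, subimages, axes=1)   # shape (H, W)

# A physically closer model would be multiplicative in transmittance
# (Beer-Lambert); the additive form above is the paper's approximation.
attenuation = -np.log(subimages + 1e-6)
x_physical = np.exp(-np.tensordot(y, attenuation, axes=1))
```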

Our goal is to learn a discriminative function to predict the image label. Since the object of interest may appear at various scales, a popular choice for recognizing and further detecting it [17][20] is to combine multi-stage visual information. Here we simply consider feature vectors extracted from $L$ different layers, the $l$-th of which is denoted as $\mathbf{f}_n^{(l)}$. A regular solution is to train a classifier on top of each layer, $\mathbf{y}_n^{(l)} = h^{(l)}\big(\mathbf{f}_n^{(l)}\big)$, using the ground-truth signal as supervision. In the testing stage, we fuse all $\mathbf{y}_n^{(l)}$ as the final output, i.e., $\mathbf{y}_n = \frac{1}{L}\sum_{l=1}^{L}\mathbf{y}_n^{(l)}$.

However, we note a significant weakness of this model, which comes from the penetration assumption, i.e., Eqn (1), applied to mid-level features. (Eqn (1) fits mid-level features best: low-level features, e.g., raw image pixels, are often heavily impacted by small noise, so learning the class-conditional distribution there is more difficult; similarly, the very last layers, e.g., those containing class-specific logits, are less likely to be additive as in Eqn (1).) That is, each $\mathbf{f}_n^{(l)}$ is the composition of features of sub-images sampled from different classes, including items of no interest, and can thus be distracted. A reasonable idea is to refine $\mathbf{f}_n^{(l)}$ to get rid of this irrelevant information. This is achieved by a function $\tilde{\mathbf{f}}_n^{(l)} = r^{(l)}\big(\mathbf{f}_n^{(l)}, \mathbf{y}_n\big)$, where $\tilde{\mathbf{f}}_n^{(l)}$ shares the same dimensionality as $\mathbf{f}_n^{(l)}$. Summarizing these contents yields the following optimization problem:

$$\min_{\{h^{(l)},\, r^{(l)}\}} \; \sum_{n=1}^{N} \mathcal{L}\big(\mathbf{y}_n, \mathbf{y}_n^\star\big), \tag{2}$$
$$\mathcal{L}\big(\mathbf{y}_n, \mathbf{y}_n^\star\big) = \sum_{l=1}^{L} \mathbf{1}^\top \boldsymbol{\ell}\big(\mathbf{y}_n^{(l)}, \mathbf{y}_n^\star\big), \tag{3}$$
$$\mathbf{y}_n^{(l)} = h^{(l)}\big(\tilde{\mathbf{f}}_n^{(l)}\big), \qquad \mathbf{y}_n = \frac{1}{L}\sum_{l=1}^{L} \mathbf{y}_n^{(l)}, \tag{4}$$
$$\tilde{\mathbf{f}}_n^{(l)} = r^{(l)}\big(\mathbf{f}_n^{(l)}, \mathbf{y}_n\big). \tag{5}$$

Here $\boldsymbol{\ell}(\cdot, \cdot)$ is a per-class loss function which is discussed in detail later. The above formulae define a recurrent model, in which $\mathbf{y}_n$ cannot be fully observed even in the training stage. The standard way of optimization involves iteration, in which we start with an $\mathbf{x}_n$ sampled from $\mathcal{D}$ and an initial $\mathbf{y}_n$ (in the training process, the first $c$ dimensions are provided by the ground-truth and the other dimensions can be randomly initialized). We first compute $\tilde{\mathbf{f}}_n^{(l)}$ for each $l$ accordingly, and use it to compute the first version of $\mathbf{y}_n$. In each round, we compute $\tilde{\mathbf{f}}_n^{(l)}$ and use it to compute $\mathbf{y}_n^{(l)}$, so that $\mathbf{y}_n$ is updated. Within this process, the parameters of $h^{(l)}$ and $r^{(l)}$ are updated accordingly with the ground-truth and gradient back-propagation. This iteration continues until convergence or until a maximal number of rounds is reached. (As a side note, it has been widely believed that a deep network is able to fit training data sampled from one-class distributions, e.g., each sample contains only one object of class $c$, so that $\mathbf{x}$ is sampled from $p(\mathbf{x} \mid c)$. In such scenarios, $\mathbf{y}$, being a one-hot vector, is relatively easy to estimate, and thus iteration is not required. This is the reason that deep networks produced satisfying performance on the GDXray dataset [25], in which most images contain only one object.)
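The iterative procedure just described can be summarized in a short sketch (our schematic reading of the text, not the authors' code); `extract`, `refine` and `classify` stand for the per-layer feature extractor, the refinement function r, and the classifier h, and are assumed names.

```python
# Schematic of the iterative inference described above.
def infer(x, y_init, extract, refine, classify, num_layers, max_rounds=5):
    y = y_init                              # first c dims from ground truth
    for _ in range(max_rounds):             # iterate until convergence/limit
        preds = []
        for l in range(num_layers):
            f_l = extract(x, l)             # mid-level feature f^(l)
            f_l = refine(f_l, y, l)         # refined feature = r(f^(l), y)
            preds.append(classify(f_l, l))  # y^(l) = h(refined feature)
        y = sum(preds) / num_layers         # fuse stages and update y
    return y

# CHR later "switches off" this loop (max_rounds = 1): each sample is
# forward- and backward-propagated once, which behaves like SGD on the
# expected loss (Section 4.2).
```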

Figure 3: The overall architecture of the proposed class-balanced hierarchical refinement (CHR) approach (best viewed in color). The network backbone is shown in the leftmost column, from which $L$ layers are chosen as feature extractors. For simplicity, we show an example with $L = 3$. The refined feature from each higher stage is up-sampled, concatenated with the feature at the previous stage, and fed into a refinement function that simulates $r^{(l)}$; the result is sent into $h^{(l)}$ for classification. GAP denotes global average pooling. A class-balancing loss is built upon the same hierarchy, in which mid-level negative samples are filtered out using high-level cues.

4.2 Approximation with Hierarchical Refinement

In practice, however, the above formulation has two major drawbacks. The first lies in the inaccuracy of generative models. We expect a model to eliminate the components in $\mathbf{f}^{(l)}$ that correspond to the non-targeted classes in $\mathbf{y}$, which becomes increasingly difficult when layer $l$ is far from the supervision. So, we assume that $\mathbf{f}^{(l)}$ only receives supervision signals from the refined feature of the next stage, $\tilde{\mathbf{f}}^{(l+1)}$, which is much closer than $\mathbf{y}$, while $\mathbf{f}^{(l+1)}$ continues to receive information from $\tilde{\mathbf{f}}^{(l+2)}$, and this process continues until the top layer is reached. In implementation, this implies that reversed connections only emerge between neighboring feature layers. An exception happens at the last feature layer, $\mathbf{f}^{(L)}$, which is connected to $\mathbf{y}^{(L)}$ via a classifier $h^{(L)}$. Since direct supervision has already been provided by this classifier, we ignore the connection between $\mathbf{y}$ and $\mathbf{f}^{(L)}$, leaving a total of $L - 1$ connections between neighboring stages, i.e., from $\tilde{\mathbf{f}}^{(l+1)}$ to $\mathbf{f}^{(l)}$ for $l = 1, \ldots, L - 1$. That is, $r^{(l)}\big(\mathbf{f}^{(l)}, \mathbf{y}\big)$ is replaced by $r^{(l)}\big(\mathbf{f}^{(l)}, \tilde{\mathbf{f}}^{(l+1)}\big)$. Nevertheless, $\mathbf{f}^{(l)}$ can still obtain supervision signals from $\mathbf{y}$ in an indirect manner, i.e., via a few intermediate steps. This is named the hierarchical refinement strategy.

Implementation details are illustrated in Figure 3. We start with $\mathbf{f}^{(L)}$, the feature extracted from the top layer. It is concatenated with the feature at the previous stage, $\mathbf{f}^{(L-1)}$, before which it is up-sampled if necessary. The concatenated feature is then fed into $r^{(L-1)}$ to produce $\tilde{\mathbf{f}}^{(L-1)}$. This process continues until $\tilde{\mathbf{f}}^{(1)}$ is obtained. Each $\tilde{\mathbf{f}}^{(l)}$, $l = 1, \ldots, L$, is sent into the corresponding classifier $h^{(l)}$ to obtain $\mathbf{y}^{(l)}$. All $\mathbf{y}^{(l)}$ are averaged into the final output and supervised by $\mathbf{y}^\star$.
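A minimal PyTorch sketch of these reversed connections is given below, assuming $L = 3$ feature stages; the channel counts, the 1x1-convolution refinement function, and all names are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CHRHead(nn.Module):
    """Hierarchical refinement head: top-down reversed connections,
    one classifier per stage, averaged predictions."""
    def __init__(self, chans=(512, 1024, 2048), num_classes=5):
        super().__init__()
        # r^(l): fuse f^(l) with the up-sampled refined feature of stage l+1
        self.refine = nn.ModuleList([
            nn.Conv2d(chans[l] + chans[l + 1], chans[l], kernel_size=1)
            for l in range(len(chans) - 1)
        ])
        # h^(l): per-stage classifier after global average pooling (GAP)
        self.cls = nn.ModuleList([nn.Linear(c, num_classes) for c in chans])

    def forward(self, feats):
        # feats: [f^(1), f^(2), f^(3)] from fine (large) to coarse (small)
        refined = [None] * len(feats)
        refined[-1] = feats[-1]                  # top stage kept as-is
        for l in range(len(feats) - 2, -1, -1):  # top-down refinement
            up = F.interpolate(refined[l + 1], size=feats[l].shape[-2:],
                               mode='bilinear', align_corners=False)
            refined[l] = self.refine[l](torch.cat([feats[l], up], dim=1))
        # GAP + per-stage classification; average as the final output
        logits = [self.cls[l](refined[l].mean(dim=(2, 3)))
                  for l in range(len(feats))]
        return logits, torch.stack(logits).mean(dim=0)
```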

The second drawback is the slowness of iterative optimization. To accelerate, we switch off iteration so that each sample is forward-propagated and back-propagated only once, and the updated parameters of $h^{(l)}$ and $r^{(l)}$ are directly applied to the next sample drawn from $\mathcal{D}$. This can be understood as stochastic gradient descent on the expected loss. In practice, this allows us to sample more data in the same period of time, and thus improves training efficiency.

4.3 Class-Balanced Loss

Here we study the impact of the loss function, i.e., Eqn (3), on the training process. In this specific problem, prohibited item discovery, there are far fewer positive training samples (with at least one prohibited item labeled) than negative ones. This makes regular loss functions such as the Euclidean loss and the binary cross-entropy (BCE) loss less effective, because the network can heavily bias towards negative examples (simply predicting all training samples to be negative already leads to a very low loss) and, consequently, the recall becomes considerably low. A reasonable solution is to slightly change the loss function so as to equivalently reduce the number of negative training data [35]. Here we adapt this approach to the context of hierarchical refinement, which once again takes advantage of high-level supervision to guide mid-level features.

The proposed loss function works on a mini-batch $\mathcal{B}$. For each sample $(\mathbf{x}_n, \mathbf{y}_n^\star)$ with $n \in \mathcal{B}$, we have the $L$ stages defined previously, each of which produces a feature $\tilde{\mathbf{f}}_n^{(l)}$ followed by a prediction $\mathbf{y}_n^{(l)}$. We add a binary weight vector, denoted by $\mathbf{w}_n^{(l)}$, measuring whether each class in $\mathbf{y}_n^{(l)}$ contributes to the loss function. Thus, Eqn (3) becomes:

$$\mathcal{L} = \frac{1}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \sum_{l=1}^{L} \mathbf{1}^\top \big( \mathbf{w}_n^{(l)} \odot \boldsymbol{\ell}_n^{(l)} \big), \tag{6}$$

where $\boldsymbol{\ell}_n^{(l)} = \boldsymbol{\ell}\big(\mathbf{y}_n^{(l)}, \mathbf{y}_n^\star\big)$ is the loss vector and $\odot$ denotes element-wise multiplication.

It remains to define $\mathbf{w}_n^{(l)}$ for each stage. At the highest ($L$-th) level, $\mathbf{w}_n^{(L)}$ directly measures whether each class, i.e., each dimension in $\mathbf{y}_n^{(L)}$, has to be considered. This conditional variable is always true for each class with a positive label, while for a class with a negative label it is true only if the prediction is larger than a fixed threshold $\tau$. At each of the lower levels, a class is considered if the above judgment returns true and all higher levels support it as well; in other words, once a class is switched off at some level, it is never considered at any lower level. This is based on the assumption that high-level features are more reliable in determining which classes are present and which are absent, while low-level features may produce false positives for various reasons.
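Under our reading of this rule, a sketch of the class-balanced loss in Eqn (6) looks as follows; the threshold value and all names are assumptions, and the per-class loss is taken to be BCE as discussed above.

```python
import torch
import torch.nn.functional as F

def chr_loss(stage_logits, targets, tau=0.5):
    """stage_logits: list of (B, C) tensors, from stage 1 (low) to L (high).
    targets: (B, C) binary ground-truth labels."""
    L = len(stage_logits)
    # top stage: positives always count; a negative class counts only if its
    # prediction is still above the threshold (i.e., a hard negative)
    w = targets.bool() | (torch.sigmoid(stage_logits[-1]) > tau)
    weights = [None] * L
    weights[-1] = w.float()
    for l in range(L - 2, -1, -1):  # propagate the switch-off downwards
        w = w & (targets.bool() | (torch.sigmoid(stage_logits[l]) > tau))
        weights[l] = w.float()
    # Eqn (6): weighted per-class BCE, summed over stages,
    # averaged over batch and classes
    total = 0.0
    for l in range(L):
        bce = F.binary_cross_entropy_with_logits(
            stage_logits[l], targets.float(), reduction='none')
        total = total + (weights[l] * bce).mean()
    return total
```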

Replacing Eqn (3) with Eqn (6) gives the complete class-balanced hierarchical refinement (CHR) approach. In the training process, each per-stage loss is computed individually and averaged for gradient back-propagation. In the testing stage, we directly average all $\mathbf{y}^{(l)}$ for the final prediction. Please refer to Figure 4 for details.

5 Experiments

5.1 Setting and Baselines

We use all three subsets, namely SIXray10, SIXray100 and SIXray1000, to evaluate different approaches. In each subset, all models are optimized on the training data and evaluated on the held-out testing data. These data splits are random but consistent across all competitors.

We evaluate both image-level classification (mean Average Precision) and object-level localization accuracy; for the latter, we manually labeled every prohibited item in the testing images with a bounding box. For image classification, we apply the evaluation metric of the PascalVOC image classification task [9], which works on each class individually: all testing images are ranked by the confidence of containing the specified object, and the mean average precision (mAP) is computed. For object localization, we follow [37] and compute the accuracy of pointing localization. A hit is counted if the pixel of maximum response falls within one of the ground-truth bounding boxes of the specified object; otherwise a miss is counted. Thus, each class has a localization accuracy computed as $\frac{\#\text{hits}}{\#\text{hits} + \#\text{misses}}$. For both tasks, we also report the overall performance, averaged over all five classes.
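The pointing-localization metric can be implemented in a few lines; the sketch below follows the description above, with hypothetical array and box formats.

```python
import numpy as np

def pointing_hit(heatmap, gt_boxes):
    """heatmap: (H, W) response map already rescaled to the image size.
    gt_boxes: list of (x1, y1, x2, y2) boxes of the queried class."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return any(x1 <= x <= x2 and y1 <= y <= y2
               for x1, y1, x2, y2 in gt_boxes)

def pointing_accuracy(heatmaps, boxes_per_image):
    """Accuracy = #hits / (#hits + #misses) over a set of test images."""
    hits = sum(pointing_hit(h, b) for h, b in zip(heatmaps, boxes_per_image))
    return hits / len(heatmaps)
```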

We investigate five popular backbones: ResNets [13] with 34, 50 and 101 layers, Inception-v3 [32], and DenseNet with 121 layers [14]. We follow the conventions to set up all these networks, and CHR is applied to each of them using $L = 3$: three pooling layers with different spatial resolutions are used as features. It is of course possible to increase $L$ by adding more features, yet in practice we find $L = 3$ sufficient to provide complementary information.
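One way to expose three stages of a backbone as the features CHR consumes is sketched below for a torchvision ResNet-50 (our assumption of the tapping points, not the authors' exact choice); the stage channel counts match the CHRHead sketch in Section 4.2.

```python
import torch
import torchvision

class ResNetFeatures(torch.nn.Module):
    """Returns three feature stages of a ResNet-50, fine to coarse."""
    def __init__(self):
        super().__init__()
        m = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool)
        self.layer1, self.layer2 = m.layer1, m.layer2
        self.layer3, self.layer4 = m.layer3, m.layer4

    def forward(self, x):
        x = self.layer1(self.stem(x))
        f1 = self.layer2(x)     # stride-8 feature,  512 channels
        f2 = self.layer3(f1)    # stride-16 feature, 1024 channels
        f3 = self.layer4(f2)    # stride-32 feature, 2048 channels
        return [f1, f2, f3]     # fed into a head such as CHRHead above
```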

Figure 4: Discriminative prohibited item localization with hierarchical features. (Best viewed in color).
Table 2: Classification mean Average Precision (%) on the subsets of SIXray (each cell, left to right: SIXray10, SIXray100, SIXray1000). Rows compare each backbone with and without CHR: ResNet34 [13], ResNet50 [13], ResNet101 [13], Inception-v3 [32] and DenseNet [14]; columns report per-class results for gun, knife, wrench, pliers and scissors, plus the mean.
Table 3: Localization accuracy (%) on the subsets of SIXray (each cell, left to right: SIXray10, SIXray100, SIXray1000). Rows compare the same backbones with and without CHR (ResNet34/50/101 [13], Inception-v3 [32], DenseNet [14]); columns report per-class results for gun, knife, wrench, pliers and scissors, plus the mean.

5.2 Classification: Quantitative Results

We first investigate the overall image classification results (averaged over five classes), summarized in Table 2. CHR achieves consistent mean Average Precision gains over all network backbones and on all subsets, i.e., SIXray10, SIXray100 and SIXray1000.

We also observe that CHR works better on deeper networks; e.g., on top of Inception-v3 and DenseNet, both of which enjoy clear absolute improvements on SIXray1000.

We next observe the five types of objects individually. The benefit brought by CHR differs from class to class. Take DenseNet as an example: when aiming at finding guns, classification performance is not boosted on every subset, while we observe significant gains on all the other classes, especially scissors, whose accuracy is improved by an impressive margin. As Table 1 shows, scissors has the fewest training samples among the five prohibited items, for which reason the baseline suffers significant bias in the training stage. CHR, by introducing hierarchical signals for supervision, largely alleviates this bias.

Finally, we study the issue of data imbalance across the subsets. Recall that the ratios of negative over positive images are 10, 100 and 1,000, respectively. From Figure 5, we can see that the performance gain grows with data imbalance, which, as analyzed in Section 5.4, comes from our special treatment of class balancing.

Figure 5: The overall accuracy gain of CHR becomes more significant in the subsets with larger negative-positive ratios.

5.3 Localization: Quantitative Results

To verify that CHR is not over-tuned to image classification, we attach class activation mapping (CAM) [38], a weakly-supervised approach for object localization, on top of the features extracted at different stages. CAM produces one heatmap for each class individually. We first rescale these maps to the original image size; if the maximal response across scales falls within one of the ground-truth bounding boxes of the specified object, the predicted location is considered a valid localization.
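For reference, a per-stage CAM heatmap can be computed as the classifier-weighted sum of feature channels; the sketch below follows the CAM formulation [38], with names matching our earlier CHRHead sketch (an assumption, not the paper's code).

```python
import torch
import torch.nn.functional as F

def cam(feature, fc_weight, class_idx, out_size):
    """feature: (B, C, h, w) stage feature; fc_weight: (num_classes, C)
    weight matrix of that stage's linear classifier."""
    heat = torch.einsum('c,bchw->bhw', fc_weight[class_idx], feature)
    heat = F.interpolate(heat.unsqueeze(1), size=out_size,
                         mode='bilinear', align_corners=False).squeeze(1)
    return heat  # rescaled map; take its argmax for pointing localization
```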

Table 3 summarizes the localization results. CHR based on DenseNet outperforms the plain DenseNet on both SIXray100 and SIXray1000, and for wrench on SIXray1000, Inception-v3+CHR outperforms Inception-v3 by a particularly large margin. Again, we observe more significant accuracy gains on deeper networks (which produce more powerful features) and on larger negative-over-positive ratios. More localization results are shown in Figure 6.

Figure 6: Examples of object localization based on DenseNet, showing that CHR is effective on complex backgrounds and overlapping images (best viewed in color).

5.4 Ablation Studies

In this part we provide diagnostic experiments, performed on all three subsets of SIXray, which have different negative-over-positive ratios.

First, we study the performance of hierarchical refinement, i.e., the reversed connections (Table 4). The top-down refinement (ResNet34+HR) improves both classification and localization accuracies on SIXray100 and SIXray1000, and also outperforms the direct hierarchical fusion (ResNet34+H). The reason lies in that direct fusion provides information that largely overlaps with what the regular network already extracts, whereas the reversed connections deliver additional high-level cues to low-level features.

Second, we study the impact of different loss functions (Table 4). With the class-balanced loss (ResNet34+CH), both classification and localization accuracies are improved on SIXray100 and SIXray1000. Combining hierarchical refinement with the class-balanced loss (ResNet34+CHR) brings the largest improvements over the ResNet34 baseline, which shows the significance of CHR on large-scale datasets with class imbalance.

Table 4: Classification mean Average Precision and localization accuracy (%) on the SIXray subsets (columns: SIXray10, SIXray100, SIXray1000) using different options of CHR (rows: ResNet34, ResNet34+H, ResNet34+CH, ResNet34+HR, ResNet34+CHR). The backbone is ResNet34. For the explanation of the different options, see the main text in Section 5.4.

Note that the accuracy gain is achieved with a relatively small amount of extra computation: the per-image testing time of ResNet34+CHR is only marginally longer than that of the plain ResNet34, both measured on a Tesla V100 GPU.

5.5 ILSVRC2012 Classification

Last but not least, we evaluate CHR on ILSVRC2012, a large-scale image classification dataset, to observe how CHR generalizes to natural image data, given that it achieves significant accuracy gains on overlapping image data. ILSVRC2012 is a popular subset of the ImageNet database; it has 1,000 classes, each corresponding to a well-defined concept in WordNet. About 1.3 million training images and 50,000 validation images are provided, both roughly uniformly distributed over all classes.

We follow the standard training and testing pipelines, including the policies for model initialization, data augmentation, learning rate decay, etc. Since ILSVRC2012 is not an imbalanced dataset, we switch off the weight terms in the loss function, which were designed for that purpose.

CHR based on ResNet18 [13] achieves a slightly lower top-1 error than the baseline, and CHR based on ResNet50 [13] reduces both the top-1 and top-5 errors. This slight but consistent accuracy gain delivers a two-fold message: the reversed connections in our approach, which carry high-level supervision to mid-level features, do not conflict with natural images, although they align much better with overlapping image data. Given that the additional computational cost is almost negligible, it is worth investigating extensions of the approach in the natural image domain.

6 Conclusions

In this paper, we investigate prohibited item discovery in X-ray scanned images, a promising industrial application that remains less studied in computer vision. To facilitate research in this field, we present SIXray, a large-scale dataset consisting of more than one million X-ray images, all captured in real-world scenarios and therefore covering complicated scenes. We manually annotated 8,929 prohibited items in six categories, substantially more than all existing datasets. In methodology, we formulate X-ray images as overlaps of several sub-images sampled from a mixture distribution. Motivated by filtering out irrelevant information, we present an algorithm that refines mid-level features in a hierarchical and iterative manner; in practice, we switch off iteration to optimize the network weights in an approximate but efficient way. A novel loss function is also built upon the hierarchical architecture to deal with the heavy data imbalance between the positive and negative classes. On top of several popular network backbones, our approach produces consistent gains in both classification and localization accuracy, establishing a strong baseline for the proposed task.

Future research mainly lies in two directions. First, the formulation of overlapping images under the penetration assumption is inaccurate in many aspects; we look forward to more effective approaches based on a better physical model. Second, the connection between overlapping images and natural images, e.g., object occlusion, remains unclear; studying this topic may suggest ways of extending these approaches to a wider range of applications.

References

  • [1] S. Akçay, M. E. Kundegorski, M. Devereux, and T. P. Breckon. Transfer learning using convolutional neural networks for object classification within x-ray baggage security imagery. In ICIP, pages 1057–1061, 2016.
  • [2] M. Baştan, M. R. Yousefi, and T. M. Breuel. Visual words on baggage x-ray images. In CAIP, pages 360–368, 2011.
  • [3] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, pages 2846–2854, 2016.
  • [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [5] A. Creswell, K. Arulkumaran, and A. A. Bharath. On denoising autoencoders trained to minimise binary cross-entropy. CoRR, abs/1708.08487, 2017.
  • [6] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In CVPR, 2017.
  • [7] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, pages 5957–5966, 2017.
  • [8] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In CVPR, pages 4743–4752, 2016.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
  • [10] T. Franzel, U. Schmidt, and S. Roth. Object detection in multi-view x-ray images. PR, pages 144–154, 2012.
  • [11] R. Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [15] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002.
  • [16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [17] W. Ke, J. Chen, J. Jiao, G. Zhao, and Q. Ye. Srn: Side-output residual network for object symmetry detection in the wild. In CVPR, pages 1068–1076, 2017.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • [19] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [20] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37, 2016.
  • [23] D. Mery. X-ray testing by computer vision. In CVPRW, pages 360–367, 2013.
  • [24] D. Mery and C. Arteta. Automatic defect recognition in x-ray testing using computer vision. In WCCV, pages 1026–1035, 2017.
  • [25] D. Mery, V. Riffo, U. Zscherpel, G. Mondragón, I. Lillo, I. Zuccar, H. Lobel, and M. Carrasco. Gdxray: The database of x-ray images for nondestructive testing. Journal of Nondestructive Evaluation, 34(4):42, 2015.
  • [26] D. Mery, E. Svec, and M. Arias. Object recognition in baggage inspection using adaptive sparse representations of x-ray images. In PSIVT, pages 709–720, 2015.
  • [27] D. Mery, E. Svec, M. Arias, V. Riffo, J. M. Saavedra, and S. Banerjee. Modern computer vision techniques for x-ray testing in baggage inspection. Systems, Man, and Cybernetics: Systems, 47(4):682–692, 2017.
  • [28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [30] M. Roomi and R. Rajashankarii. Detection of concealed weapons in x-ray images using fuzzy k-nn. International Journal of Computer Science, Engineering and Information Technology, 2(2), 2012.
  • [31] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
  • [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  • [33] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In CVPR, pages 3059–3067, 2017.
  • [34] D. Turcsany, A. Mouton, and T. P. Breckon. Improving feature-based object recognition for x-ray baggage security screening using primed visual words. In ICIT, pages 1140–1145, 2013.
  • [35] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, pages 3462–3471, 2017.
  • [36] W. Ren, K. Huang, D. Tao, and T. Tan. Weakly supervised large scale object localization with multiple instance learning and bag splitting. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):405–416, 2016.
  • [37] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. IJCV, 126(10):1084–1102, 2018.
  • [38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.
  • [39] Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Soft proposal networks for weakly supervised object localization. In ICCV, 2017.