Foreign object detection in an industrial high-throughput setting is essential for guaranteeing quality and safety of objects processed in factory lines. Foreign objects may, for example, appear in products such as meat, fish or vegetables as small pieces of glass, bones, plastic, wood or stone that could harm consumers [1, 2, 3]. Conventional nondestructive methods for detecting foreign objects include ultrasound imaging, X-ray imaging, magnetic resonance imaging, fluorescence imaging, (hyperspectral) spectroscopic imaging and thermal imaging [4, 5, 6, 7, 8, 9]. X-ray imaging provides the unique opportunity to visualize the interior structure of an object in a fast, low-cost, and non-invasive manner. This enables X-ray based foreign object detection, in which the goal is to detect unwanted smaller objects inside base objects based on their distinct attenuation or attenuation patterns, as observed in generated radiographs (i.e. standard 2D X-ray images). The possibility to reveal hidden foreign objects on radiographs has led to the extensive use of X-ray imaging in various industrial applications [7, 10, 11, 12, 13, 14, 15], for which low-cost, adaptive and efficient image processing methods are essential [9, 13]. One way to achieve better discrimination of foreign objects in radiographs is to use multispectral X-ray imaging detectors, simultaneously capturing radiographs at two or more energy levels [16, 17]. As the attenuation properties of each material have their own characteristic dependence on the X-ray energy, these multispectral images can be analyzed to extract material composition information.
However, superposition of materials gives rise to similar levels of intensities for different objects in 2D radiographs. This problem limits the application of commonly used segmentation methods, such as threshold-based, clustering-based, and boundary-based or edge-based segmentation [18, 19], to extract different components of the object. Additionally, high-throughput acquisition may lead to high noise levels in radiographs, and this increases the difficulty of successful foreign object detection even further [9, 13]. Commonly used segmentation methods can be unsuitable in case of poor image quality caused by conditions such as noise, low contrast and homogeneity in regions close to foreign objects. Most conventional unsupervised methods can therefore not achieve high accuracies without extensive manual parameter tuning for each specific problem [20, 21].
Machine learning is a powerful tool for recognizing patterns in images and can potentially detect foreign objects in radiographs. Recent machine learning methods address a wide variety of segmentation problems [19, 23], and provide a remarkable improvement over more classical segmentation methods in many practical applications. A key obstacle in the application of machine learning is the need for large datasets [25, 26, 27], which is particularly prominent in machine learning for foreign object detection as each new combination of sample, foreign object, and imaging settings requires additional data. On top of that, supervised learning uses labeled datasets for training. However, manual annotation (as in e.g. [19, 20]) requires tremendous effort, is time-consuming and tedious, is subjective, and can be prone to errors.
The key contribution of this paper is to propose a workflow based on 3D Computed Tomography (CT) for efficiently creating large training datasets, overcoming the aforementioned obstacle. CT scans of a relatively small number of objects are carried out with low exposure time – as in a high-throughput setting – yielding a large number of radiographs that are used as input for the supervised machine learning method. The same set of radiographs is also used offline for generating multiple high-quality tomographic 3D reconstructions, from which foreign objects can easily be segmented in 3D and projected back onto a virtual 2D detector to give the corresponding ground truth locations of the foreign objects on the radiographs. Without the effort of extensive manual labelling, this results in a large dataset with which deep learning can be carried out to detect foreign objects from fast-acquisition radiographs at a high rate. The example in Figure 1 illustrates the difference in ease of segmentation for a CT reconstructed 3D volume versus a 2D radiograph. Whereas segmenting the foreign object in a radiograph is a challenging task, simple global thresholding can be applied to the CT volume to separate the foreign object from the base object. Additionally, more sophisticated and accurate segmentation and denoising rules can be imposed on 3D volumes [23, 30, 31] than on 2D radiographs.
The structure of the paper is as follows. Section 2 provides the background of applying machine learning for foreign object detection, and explains the proposed method of data generation to apply machine learning. In Section 3, the workflow is demonstrated in a laboratory experiment, and we show how the number of imaged objects affects the detection accuracies. Additionally, the robustness of the workflow is analyzed. Section 4 discusses various aspects of the results and the flexibility and modularity of the workflow. Section 5 presents the conclusions from this work.
In this section, we introduce machine learning for foreign object detection and explain the methodology of our CT-based workflow for creating training data.
2.1 Foreign object detection with X-ray imaging
We consider the problem of foreign object detection in an industrial high-throughput conveyor belt setting. The problem and the usage of X-ray imaging to solve this are schematically shown in Figure 2. In foreign object detection, the aim is to correctly determine for each object whether a foreign object is contained in it or not, for instance a piece of bone within a meat sample.
For this problem, we focus on finding an accurate segmentation for each radiograph. A segmentation partitions an image into sets of pixels with the same label. In our case, the formed segmented image is binary and indicates on which detector pixels a foreign object is projected. The segmentation depends on the type of objects that are considered to be foreign (by for instance a manufacturer). Any further classification (based on the minimum size of a foreign object for example) can be carried out after the segmented image is produced.
Throughout this paper, we use the term radiograph for radiographs corrected using flatfield radiographs (without an object) and darkfield images (without the X-ray beam) that serve as input to the segmentation method. The quality of a radiograph depends on a number of properties of the scan, including exposure times, tube intensities, photon energy windows and the geometric setup. In a high-throughput setting, the steps in Figure 2 should be fast to carry out, typically resulting in high noise levels and a challenging segmentation task.
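The flatfield and darkfield correction mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly used normalization (raw minus dark, divided by flat minus dark); the text does not state the exact formula used, and the function name `flatfield_correct` and the `eps` guard are hypothetical.

```python
import numpy as np

def flatfield_correct(raw, flat, dark, eps=1e-6):
    """Normalize a raw projection with flatfield and darkfield images.

    Assumed formula: (raw - dark) / (flat - dark). Stacks of calibration
    frames with shape (n, rows, cols) are averaged first.
    """
    raw = np.asarray(raw, dtype=np.float64)
    flat = np.asarray(flat, dtype=np.float64)
    dark = np.asarray(dark, dtype=np.float64)
    if flat.ndim == 3:
        flat = flat.mean(axis=0)
    if dark.ndim == 3:
        dark = dark.mean(axis=0)
    # Guard against division by zero on dead or unlit detector pixels.
    denom = np.maximum(flat - dark, eps)
    # Keep values strictly positive so a later -log transform stays finite.
    return np.clip((raw - dark) / denom, eps, None)
```

With this normalization, a pixel that receives the unattenuated beam maps to a value near 1, and stronger attenuation gives values closer to 0.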
2.2 Supervised learning
Machine learning is a widely used approach for difficult imaging tasks, as it can extract complicated patterns from complex images. In the foreign object detection problem, supervised machine learning can be used to learn the segmentation task such that it generalizes well for all possible fast-acquisition radiographs of similar objects with similar acquisition settings. To do so, a set of examples $(x_i, y_i)$, $i = 1, \ldots, N$, is used, where $x_i$ are acquired radiographs and $y_i$ are their corresponding foreign object segmentations. The aim is to find the unknown segmentation function $f$ that maps each radiograph $x_i$ to its segmentation $y_i$. To find an approximate solution $\hat{f}$ that generalizes well, the set of images is partitioned into a training set, a validation set and a test set. The training set is used to learn the function $\hat{f}$ that minimizes the loss on the training set, which is the sum of errors $\ell$ between the segmented images produced by the segmentation function and the true segmented images:
$$\sum_{i \in \text{training set}} \ell\left(\hat{f}(x_i), y_i\right).$$
To find a suitable segmentation function, a (convolutional) neural network is often used as a model and parametrized using weights and biases that are optimized during the training process. While carrying out the training with a chosen loss function and optimization algorithm, the performance of the model is evaluated on the validation set. Several stopping criteria can be used for this, for example stopping the training when the error on the validation set increases, or training for a fixed time (and recording the network that gives the best results on the validation set). To avoid any bias towards the training and validation data, the accuracy of the trained model is finally assessed using the test set.
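The two stopping criteria described above (stop when the validation error increases, or train for a fixed budget and keep the best network) can be sketched as a selection over recorded validation errors. This is an illustrative sketch only; the function name `select_best_checkpoint` and the `patience` parameter are hypothetical and not taken from the paper.

```python
def select_best_checkpoint(val_errors, patience=None):
    """Return (epoch, error) of the best validation result.

    patience=None : scan the whole fixed-length run and keep the best epoch.
    patience=k    : stop once the error has not improved for k epochs.
    """
    best_epoch, best_err = -1, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New best network: this is the checkpoint that would be stored.
            best_epoch, best_err = epoch, err
        elif patience is not None and epoch - best_epoch >= patience:
            # Validation error stopped improving: stop early.
            break
    return best_epoch, best_err
```

For instance, with validation errors `[0.5, 0.4, 0.45, 0.3]`, a fixed-budget run selects epoch 3, whereas `patience=1` stops at the first increase and keeps epoch 1.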
Since the introduction of Fully Convolutional Networks (FCNs), in which successive contracting convolutional layers are utilized for pixel-wise semantic segmentation, many convolutional neural network (CNN) architectures have been proposed that can be used for the object segmentation task. U-Net changes the FCN architecture by introducing, along with downsampling operators and skip connections, upsampling operators instead of pooling operators, giving it a U-shaped appearance. Similarly, Deconvnet also introduces an auto-encoder structure with deconvolution and unpooling operations (without skip connections). The success of these methods on medical image segmentation and object detection spawned other commonly used CNN architectures for segmentation such as SegNet, RefineNet, PSPNet, and Mask R-CNN for instance segmentation. Although some of the listed architectures need relatively few training examples for successful segmentation, the annotation of these examples still requires considerable effort.
2.3 Proposed workflow for training data acquisition
Our proposed workflow for using CT to obtain annotated training images is schematically displayed in Figure 3. First, we select a set of representative objects as training objects (Fig. 3a). For each object, a set of fast-acquisition radiographs is collected from a set of predefined angles (Fig. 3b). These fast-acquisition radiographs will form the input set of the intended training dataset (Fig. 3c). The total number of examples in the resulting dataset is the number of training objects multiplied by the number of selected angles.
The same set of radiographs is used to carry out a tomographic reconstruction of the object and acquire high-quality CT volumetric data (Fig. 3d and e). The next step is to segment the reconstructed volume such that a possible foreign object is separated from the base object (Fig. 3f). This segmentation step can be automated and many methods are available to implement this. Here, we consider volumetric segmentation methods that consist of a global thresholding step. Binary segmentation by global thresholding is defined by the following function that acts on every voxel $v$ in reconstruction volume $V$:
$$S_{\tau}(v) = \begin{cases} 1 & \text{if } v \geq \tau, \\ 0 & \text{otherwise,} \end{cases}$$
where $\tau$ is the segmentation threshold. The more angles and other high-quality settings are used to obtain projection data, the easier it is to accurately segment the foreign object. Easier segmentation can also be accomplished by carrying out a separate high-quality scan of the same object and making a reconstruction with these high-quality radiographs. Additionally, for segmentation, prior information about the objects can be used, such as bounding boxes on the foreign object location. Also, 3D denoising [42, 43] can be used to remove non-foreign object voxels captured by the thresholding operation.
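As a minimal sketch of the global thresholding step, optionally restricted by a bounding box as prior information, one could write the following. The function name, the choice of `>=` at the threshold, and the representation of the bounding box as a tuple of slices are assumptions made for illustration.

```python
import numpy as np

def threshold_volume(volume, threshold, bbox=None):
    """Binary segmentation of a reconstructed volume by global thresholding.

    volume    : 3D array of reconstructed attenuation values
    threshold : segmentation threshold (voxels >= threshold are labeled 1)
    bbox      : optional tuple of three slices bounding the foreign object
    """
    volume = np.asarray(volume)
    mask = volume >= threshold
    if bbox is not None:
        # Discard thresholded voxels outside the prior bounding box.
        inside = np.zeros_like(mask)
        inside[bbox] = True
        mask &= inside
    return mask.astype(np.uint8)
```

A call such as `threshold_volume(vol, t, bbox=(slice(0, 64),) * 3)` would keep only the thresholded voxels in the first 64 voxels along each axis.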
From the constructed foreign object segmentation, virtual ground truth projections are generated by simulating projections of the foreign objects onto a virtual detector (Fig. 3g). This results in the set of ground truth images, which will serve as target images in the machine learning procedure (Fig. 3h). These virtual projections need to be taken under the same angles as in the fast-acquisition scan (Fig. 3b). When this procedure is repeated for all objects, this results in a large dataset with annotated training examples with which supervised machine learning can be carried out (Fig. 3c and f). The trained model can then be applied to similar new objects scanned in the same fast-acquisition setting, without the need for acquisition of high-quality radiographs or CT scans.
3 Experiments and Results
In this section, we demonstrate the proposed workflow using the in-house FleX-ray CT system at CWI (Fig. 4), and investigate the relation between machine learning performance and the number of training objects used.
3.1 Base objects and foreign objects
As test objects, we use base objects that are created from a fixed amount of modeling clay (Play-Doh, Hasbro, RI, USA). Play-Doh is primarily made of a mixture of water, salt and flour, and we therefore consider it to be a representative example of products in the food industry, where foreign objects may be pieces of stone, plastic, or metal. A basic shape is deformed and remolded for every object instance (Fig. 4(a)) in such a way that the objects are similar to each other, but still exhibit some natural variation. For the foreign objects, we use gravel (Fig. 4(b)), with the stones having an average diameter of ca. 7 mm (ranging from 3 mm to 11 mm). These stones have slight variations in shape and material. We create objects with three inserted stones, with two stones, with one stone (Fig. 4(c)) and without a stone.
3.2 CT scanning and data preparation
A fast CT-scan is made for each of the objects, which yields both a series of radiographs (i.e. the X-ray projections) and a reconstructed 3D volume of the object. The objects are scanned in the FleX-ray laboratory (Fig. 4). The FleX-ray CT-scanner has a cone-beam microfocus X-ray point source with a focal spot size of 17 µm, and a Dexela1512NDT detector. The source, object and detector positions can be configured flexibly, and are arranged such that the distance between the source and detector is 69.80 cm, and the distance between the source and the object is 44.14 cm. For the radiographs, a voltage of 90 kV with a power of 20 W is used, while the exposure time is kept low at 20 ms, with the intention to emulate the imaging conditions of in-line industrial systems and produce sufficiently noisy radiographs. To achieve high-quality reconstructions, 1800 projections of each object are obtained over a full rotation. Before and after each scan, 10 darkfield images and 10 flatfield projections are obtained. Each object is positioned in a random manner, and the cylinders may therefore be standing upright or be lying down on the long edge. Example radiographs are shown in Figure 6. Separating the projected foreign objects from the base object in these radiographs is not a trivial task, illustrating the problem of obtaining annotated training data for automated segmentation using machine learning directly from these images.
The Simultaneous Iterative Reconstruction Technique (SIRT) [45, 46] algorithm ( iterations) as implemented in the ASTRA toolbox [47, 48] is used to compute the reconstructed 3D CT volume of the object. A visualization of the reconstruction from the third object in Figure 6 and its foreign object is shown in Figure 7. The CT reconstruction allows the object to be sliced along different axes. As the CT voxel intensity is directly related to the attenuation coefficient of the material in a voxel, the segmentation task for the 3D CT volume is, in this case, much more straightforward and can be carried out by global thresholding (see Appendix for additional details on intensity value distributions). Therefore, a simple global threshold based on Otsu’s method is sufficient to segment the foreign objects.
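To illustrate how a global threshold can be derived from the voxel intensity distribution, the sketch below implements the standard histogram form of Otsu's method, which picks the split that maximizes the between-class variance. It is a generic illustration, not the authors' code; the function name and the number of bins are assumptions.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the histogram-based Otsu threshold of an array of intensities."""
    values = np.asarray(values, dtype=np.float64).ravel()
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()

    w0 = np.cumsum(weights)                  # probability of the lower class
    w1 = 1.0 - w0                            # probability of the upper class
    cum_mean = np.cumsum(weights * centers)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate split; splits that leave
    # one class empty produce 0/0 and are set to zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return centers[np.argmax(between)]
```

Voxels with intensity above the returned value would then be labeled as foreign object, for example `mask = volume > otsu_threshold(volume)`.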
From the 3D segmented objects obtained from the CT-scans, 2D segmentations for the individual radiographs are computed. This is done by computing the projections of the segmented parts with the ASTRA toolbox using the same geometric properties as used when acquiring the radiographs of the actual CT-scan. Every non-zero pixel on the detector is marked as a projected foreign object location. The result is a dataset containing radiographs and corresponding segmented images for each object.
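The step of marking every non-zero detector pixel can be illustrated with a deliberately simplified projector. The sketch below uses an axis-aligned parallel-beam sum over rotations by multiples of 90 degrees as a stand-in; it is not the cone-beam ASTRA projection used in the paper, and the function name and axis convention (z, y, x) are assumptions.

```python
import numpy as np

def project_mask(segmentation, quarter_turns=0):
    """Project a binary 3D segmentation onto a virtual 2D detector.

    Simplification: rotate about the z axis by quarter_turns * 90 degrees
    and sum along y (parallel beam), instead of a cone-beam projection at
    an arbitrary angle.
    """
    seg = np.asarray(segmentation).astype(bool)
    rotated = np.rot90(seg, k=quarter_turns, axes=(1, 2))
    # Number of foreign object voxels along each ray.
    path_length = rotated.sum(axis=1)
    # Every detector pixel with a non-zero value is a foreign object location.
    return (path_length > 0).astype(np.uint8)
```

In the actual workflow, the same binarization would be applied to projections computed with the acquisition geometry of the fast scan, so that each ground truth image aligns with its radiograph.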
3.3 Machine learning
We use the U-Net architecture to train the task of image segmentation. For our experiments with U-Net, we have slightly changed the architecture, as we observed this improved performance in the experiments compared to the standard version. We downsample twice, with a stride of . The initial number of feature maps is set to , and the number of feature maps doubles for each downsampling layer. For upsampling, bilinear interpolation is used. A spatial convolution operation with zero padding and a ReLU activation function are carried out before and after all downsampling and upsampling operations. The biases and convolution weights are initialized by sampling from a distribution whose range is determined by the number of input channels and the kernel size. ADAM optimization on the average of the binary cross entropy loss and the dice loss [51, 52] between the data and the predictions is used for training. The network is implemented with PyTorch [53, 54]. For comparison between architectures, we also use the MSD network for training. MSD is a compact network architecture that has been demonstrated to be suitable for real-time segmentation of X-ray and CT images using relatively few training examples compared to larger networks, including the U-Net architecture. We use a depth of 100 intermediate layers and a width of 1 channel per intermediate layer and increase the dilation parameter repeatedly from to dilations in each layer, which are common settings for the MSD network [50, 55, 56]. Xavier initialization is used for the convolution weights. ADAM optimization is used during training on the cross-entropy loss between the ground truth and the segmented images, and the batch size of training examples is set to . We use the GPU implementations in Python that are available [50, 58]. For both architectures, the learning rate is set to and all networks are trained on a GeForce GTX TITAN X GPU with CUDA version 10.1.243. Data augmentation is applied by rotation and flipping of the input examples. All networks are trained for 9 hours, and the network with parameters resulting in the lowest error on the validation set is used for testing.
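The U-Net training objective, the average of the binary cross entropy loss and the dice loss, can be written out as in the sketch below. It uses NumPy rather than PyTorch to stay self-contained, and the soft dice formulation and the smoothing constant `eps` are assumptions; the cited references may define the dice loss slightly differently.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-7):
    """Average of binary cross entropy and (soft) dice loss.

    pred   : predicted foreground probabilities in [0, 1]
    target : binary ground truth of the same shape
    """
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)

    # Pixel-wise binary cross entropy, averaged over the image.
    bce = -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

    # Soft dice loss: 1 - 2|P ∩ T| / (|P| + |T|).
    intersection = np.sum(pred * target)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

    return 0.5 * (bce + dice)
```

Combining the two terms is a common choice when the foreground class is small, since the dice term is less sensitive to the large number of background pixels.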
With these networks, we carry out an image-to-image training from radiographs (Fig. 3c) to their corresponding foreign object segmentations (Fig. 3f). For training, randomly chosen base objects containing a foreign object are used. The remaining objects are used for testing. All images are resized using cubic interpolation to to speed up the training process (global thresholding with parameter is applied to the resized ground truth images to make these binary again). We test the performance of the trained networks for different numbers of objects included in the training scheme. To compare the workflow with labour-intensive 2D data annotation, we compare the following training strategies:
Workflow approach: For each network, we fix the total number of training examples to . A random but fixed order of the training objects is created and the first objects among these are used for the training set. The training examples are selected from the set of radiographs and ground truths created by the workflow from these training objects in equal amounts. Every th example is used for validation during training.
Manual annotation approach: For each network, only one randomly chosen training radiograph with the corresponding ground truth is provided for each of the first included training objects. The resulting set of training examples is separated such that part is used for training (rounded down to the nearest integer) and part is used for validation (rounded up).
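The two training strategies above can be sketched as index selection over (object, angle) pairs. All names and numeric values in this sketch are hypothetical placeholders; `object_ids` is assumed to already be in the random but fixed training order.

```python
import random

def workflow_examples(object_ids, angles_per_object, n_objects, n_examples, seed=0):
    """Workflow approach: a fixed total number of examples, drawn in equal
    amounts from the first n_objects training objects."""
    rng = random.Random(seed)
    per_object = n_examples // n_objects  # remainder is dropped in this sketch
    examples = []
    for obj in object_ids[:n_objects]:
        for angle in rng.sample(range(angles_per_object), per_object):
            examples.append((obj, angle))
    return examples

def manual_examples(object_ids, angles_per_object, n_objects, seed=0):
    """Manual annotation approach: one random radiograph per training object."""
    rng = random.Random(seed)
    return [(obj, rng.randrange(angles_per_object)) for obj in object_ids[:n_objects]]

def split_every_kth(examples, k):
    """Use every k-th example for validation and the rest for training."""
    validation = examples[k - 1::k]
    training = [e for i, e in enumerate(examples) if (i + 1) % k != 0]
    return training, validation
```

The sketch makes the key difference explicit: the workflow approach keeps the number of training examples constant while varying the number of scanned objects, whereas the manual approach ties the number of examples to the number of objects.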
3.4 Quality measures
To evaluate the accuracy of the trained networks on the test set, we compute three different measures on the segmented images and the corresponding target images. Together, these measures assess both the image segmentation accuracy and the object detection accuracy. An image segmentation accuracy is based on the classification of each pixel in the segmented image, and there are standardized ways to measure this that do not depend on any parameters. An object detection accuracy compares connected components (groups of pixels connected by their edges) in the segmented image with the ground truth images. Although these accuracy measures require additional parameters to define the notion of detection, they are more relevant to the foreign object detection application.
The first measure is an image-based average class accuracy (also called balanced accuracy) to assess the accuracy of a produced segmentation. The average class accuracy of a segmented image relative to the target image is given by the recall of each class (the true positives divided by the sum of the true positives and false negatives), averaged over the number of classes. In the binary case this becomes
$$\text{ACA} = \frac{1}{2}\left(\frac{TP_1}{TP_1 + FN_1} + \frac{TP_0}{TP_0 + FN_0}\right). \tag{1}$$
Here, $TP_i$ and $FN_i$ are the true positive and false negative rates of the foreign object ($i = 1$) and the combined base object and background ($i = 0$) pixel classifications respectively over the entire segmented image relative to the target image. The average class accuracy as given in (1) is averaged over all target images.
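A minimal sketch of the average class accuracy for a single binary image pair is given below. How images without any foreign object pixels are handled is not stated in the text; skipping an absent class is an assumption of this sketch, as is the function name.

```python
import numpy as np

def average_class_accuracy(segmented, target):
    """Balanced accuracy of a binary segmentation relative to its target."""
    segmented = np.asarray(segmented).astype(bool)
    target = np.asarray(target).astype(bool)
    recalls = []
    for cls in (True, False):  # foreign object class, then base object + background
        in_class = target == cls
        if in_class.sum() == 0:
            continue  # assumption: a class absent from the target is skipped
        true_positives = np.sum((segmented == cls) & in_class)
        recalls.append(true_positives / in_class.sum())
    return float(np.mean(recalls))
```

For a target `[1, 1, 0, 0]` and a segmentation `[1, 0, 0, 0]`, the foreground recall is 0.5 and the background recall is 1, giving an average class accuracy of 0.75.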
The second measure is an object-based detection rate. A connected component is a maximal set of nonzero-valued pixels such that each pixel is reachable from another pixel in the set via a sequence of neighboring pixels in the set. Each connected component in the target image with a minimum size of pixels ( of the image size) is considered as an object that should be detected. We define such an object as detected if its pixel-wise recall relative to the segmented image is higher than a certain threshold $\alpha$:
$$\frac{TP}{TP + FN} > \alpha. \tag{2}$$
Here, $TP$ and $FN$ are the true positive and false negative pixels in the target object relative to the segmented image. In our experiments, we set . We define the detection rate as the percentage of components in all target images for which condition (2) holds.
The third measure is an object-based false positive detection rate. Each connected component in the segmented image with a minimum size of pixels is considered as a potentially detected object. We define such a potentially detected object as a false positive if its pixel-wise recall relative to the target image is lower than a certain threshold $\beta$:
$$\frac{TP}{TP + FN} < \beta. \tag{3}$$
Here, $TP$ and $FN$ are the true positive and false negative pixels in the segmented object relative to the target image. In our experiments, we set . We define the false positive detection rate as the percentage of potential objects in all segmented images for which condition (3) holds.
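The two object-based measures can be sketched with a small connected component routine. The sketch uses edge connectivity (4-neighbors), as stated in the definition of connected components; the function names are hypothetical, and the minimum size and thresholds are left as parameters because their values are not given here. Each function returns counts for one image pair so that percentages can be accumulated over all images.

```python
from collections import deque
import numpy as np

def connected_components(mask, min_size=1):
    """Return boolean masks of edge-connected components of at least min_size pixels."""
    mask = np.asarray(mask).astype(bool)
    seen = np.zeros_like(mask)
    rows, cols = mask.shape
    components = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        comp = np.zeros_like(mask)
        queue = deque([(r0, c0)])
        seen[r0, c0] = True
        while queue:  # breadth-first flood fill
            r, c = queue.popleft()
            comp[r, c] = True
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < rows and 0 <= cc < cols and mask[rr, cc] and not seen[rr, cc]:
                    seen[rr, cc] = True
                    queue.append((rr, cc))
        if comp.sum() >= min_size:
            components.append(comp)
    return components

def overlap_fraction(component, other):
    """Fraction of the component's pixels that are also set in the other image."""
    other = np.asarray(other).astype(bool)
    return float(np.sum(component & other) / component.sum())

def detection_counts(segmented, target, min_size, threshold):
    """(detected objects, objects to detect) for one image pair, as in condition (2)."""
    objects = connected_components(target, min_size)
    hits = sum(1 for o in objects if overlap_fraction(o, segmented) > threshold)
    return hits, len(objects)

def false_positive_counts(segmented, target, min_size, threshold):
    """(false positives, potential objects) for one image pair, as in condition (3)."""
    objects = connected_components(segmented, min_size)
    false_pos = sum(1 for o in objects if overlap_fraction(o, target) < threshold)
    return false_pos, len(objects)
```

Summing the first and second entries over all test images and dividing gives the detection rate and the false positive detection rate, respectively.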
[Figure caption, partially recovered: results for the trained networks, with a different training object order for each run. The shaded regions indicate the respective standard deviations.]
For the test set, we select a random angle and an orthogonal one for each test object, making the total number of testing radiographs . We measure the average class accuracy, the object-based detection rate and the object-based false positive detection rate of segmentations created by the network on the projections from the test set. The results are given in Figure 8.
For all measures, the quality of the foreign object segmentations on the radiographs using networks trained with the workflow data is low when only a few training objects are used. The quality initially improves with the addition of relatively few training objects, but this improvement stagnates beyond 20 objects. The detection accuracy still shows slight improvements beyond this point, but almost completely stabilizes from objects onwards. Given a target accuracy, a certain number of objects needs to be scanned and used for training to achieve that accuracy. The false positive rate decreases strongly and remains at a low level from including objects in the training onwards. Note that the results for the U-Net and MSD architectures agree well with each other.
When we compare the usage of a fixed number of training radiographs among all training objects with the approach of using only one radiograph per object, we see that the latter leads to inferior results in all aspects. The average class accuracies and the object-based detection rates are lower for all numbers of included training objects, while the false positive rates are higher. The difference between architectures only shows for the false positive detection rate, which is generally higher with the U-Net architecture.
3.6 Laboratory experiments with many foreign objects
A natural way to reduce the number of objects that need to be scanned for training to obtain accurate segmentations may be to include more foreign objects in the imaged objects. To test this, we repeat the experiments of the previous section, but we insert to foreign objects instead of to . The foreign objects are placed within the base object such that overlapping of foreign objects in the radiographs is minimized. We have scanned an additional set of objects with these characteristics. An example of a radiograph of an object with many foreign objects is shown in Figure 8(a). We compare the following training strategies in which the workflow data comes from the following sets of training objects:
Few foreign objects: Base objects with to foreign objects
Many foreign objects: Base objects with to foreign objects
Mixed: mix of base objects with to foreign objects and base objects with to foreign objects.
All networks are evaluated on the testing set from the previous section (with test objects containing few foreign objects). The average class accuracies, detection accuracies and the false positive rates of the trained neural networks with these schemes on the test set are shown in Figure 9. From the graphs in Figures 8(b) and 8(c) we see that the average class accuracies and detection accuracies are higher for the many foreign object training scheme, but Figure 8(d) indicates that the false positive rate is also roughly times higher. The mixed approach appears to find middle ground between the two other approaches for all measures. We see that from objects onwards the mixed approach is as good as the approach with a few foreign objects in terms of the false positive rate, while being superior in terms of average class accuracy and detection accuracy for up to training objects. This shows that including many foreign objects in the training set for detecting few to no foreign objects in the test set has limited additional value, but mixing these with examples of objects containing a few foreign objects may result in higher detection quality while maintaining a similar false positive detection rate.
3.7 Robustness of the workflow
In the previous experiments, the trained networks are tested on a set of projections that are generated using the same 3D segmentation threshold parameter in the workflow as in the generation of the data for the training and validation sets. To assess the robustness of the workflow to different segmentation parameters, we generate the training datasets with different values of the segmentation parameter (see Figure 9(a)). For each of these values, networks are trained and assessed on the test set from the previous sections. The number of training objects that are included in the workflow is fixed to (which has led to equivalent results in the previous experiments as with objects in the manual annotation approach).
In Figure 10, the average class accuracies, detection accuracies and the false positive rates of the trained neural networks are shown for the different thresholds. The results for U-Net and MSD are very similar. As the threshold value increases, the average class accuracy decreases, with significantly lower values for and . The same holds for the detection rate, but it reaches a plateau between and where this accuracy measure gives similar values. For low values of the threshold parameter, the false positive values are high, and from and higher these are low and similar to each other. Taken together, threshold parameters between and lead to very similar results. We conclude that for the class of objects considered in these experiments, the workflow is robust against moderate variation of the segmentation parameter and that suboptimal segmentation methods can also be used in the workflow.
3.8 Simulation experiments
In this section, we will demonstrate the workflow in a controlled simulated setting. In this way, we can verify the results with larger training and test sets when more objects are available. Furthermore, the test set previously consisted of data generated with the workflow, but in a simulated setting ‘absolute’ ground truth can be created for the test set by directly projecting the simulated foreign objects (see Figure 11). We verify that the proposed workflow (with CT scanning, reconstruction and segmentation) results in segmented foreign objects of which the projections are similar to absolute ground truth projections, which further supports the confidence we can have in the experimental test results.
We have generated a set of objects, each in an object space of voxels. Each object is a cube of size voxels, which is placed in the center of the volume. To create sufficient variety among the objects, the cube is cut off by eight planes. For each corner of the cube, a plane is created by selecting points on each of the three outgoing edges of the corner, randomly between the corner point and the midpoint of that edge. The voxels located on the side of the plane opposite to the center of the cube are cut off. See Figure 12 for a visualization. Additionally, we rotate the resulting object with random angles around all axes. After that, we include a foreign object as an ellipsoid with a radius randomly chosen between 3 and 7 voxels at a random location within or on the edge of the base object. These ellipsoids have a random orientation as well. As a result, the foreign objects vary in shape, size, orientation and location. With probability, we include two of these foreign objects instead of one in the base object.
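The construction of the base object, a cube whose eight corners are cut off by random planes, can be sketched as follows. The volume and cube sizes are parameters because their values are not given here; the function name is hypothetical, and the random rotation and ellipsoid insertion steps are omitted from this sketch.

```python
import itertools
import numpy as np

def cut_cube(volume_size, cube_size, rng):
    """Boolean volume of a centered cube with its eight corners cut off."""
    lo = (volume_size - cube_size) / 2.0
    hi = lo + cube_size
    coords = np.arange(volume_size) + 0.5  # voxel centers
    grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
    inside = np.all((grid >= lo) & (grid <= hi), axis=-1)
    center = np.full(3, volume_size / 2.0)

    for signs in itertools.product((0, 1), repeat=3):
        corner = np.where(np.array(signs) == 1, hi, lo).astype(float)
        points = []
        for axis in range(3):
            # Random point between the corner and the midpoint of this edge.
            step = rng.uniform(0.0, 0.5) * cube_size
            p = corner.copy()
            p[axis] += step if signs[axis] == 0 else -step
            points.append(p)
        # Plane through the three edge points, oriented towards the center.
        normal = np.cross(points[1] - points[0], points[2] - points[0])
        if np.dot(normal, center - points[0]) < 0:
            normal = -normal
        # Keep only voxels on the same side of the plane as the center.
        inside &= (grid - points[0]) @ normal >= 0
    return inside
```

A call such as `cut_cube(32, 16, np.random.default_rng(0))` yields one random base object; varying the seed gives the variety among objects described above.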
Based on the spectral properties of the assigned materials, we create simulated radiographs (Fig. 11b). Details of the computation can be found in the Appendix of . First, we make projections of each material separately by computing cone beam forward projections using the ASTRA toolbox [47, 48]. From this, the simulated radiographs are computed by taking the spectral properties of each material into account (taken from the National Institute for Standards and Technology (NIST)). We model the foreign objects as bone and the base object as tissue for each object. We take the spectral material characteristics between keV and keV into account, and use an exposure time of seconds for each radiograph, for which the Poisson noise that is applied is relatively high. These settings are chosen such that there is sufficient contrast in the radiographs, but not so much that the foreign objects can be easily identified with simple segmentation methods. The simulated detector size, and therefore the projection image size, is pixels. Examples of radiographs from five objects are given in Figure 13.
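The general idea of combining per-material projections into a noisy polychromatic radiograph can be sketched with the Beer-Lambert law and Poisson noise. This is a generic illustration, not the computation from the referenced appendix: the photon-counting detector model, the function name, and all attenuation and flux values passed in are assumptions, and real attenuation curves would come from tabulated data.

```python
import numpy as np

def simulate_radiograph(path_lengths, attenuation, source_counts, rng):
    """Simulate a flatfield-normalized polychromatic radiograph.

    path_lengths  : dict material -> 2D array of path lengths per pixel (cm)
    attenuation   : dict material -> 1D array of attenuation per energy bin (1/cm)
    source_counts : 1D array, expected photons per pixel and energy bin
                    without an object (scales with exposure time)
    """
    source_counts = np.asarray(source_counts, dtype=np.float64)
    shape = next(iter(path_lengths.values())).shape
    expected = np.zeros(shape)
    for e, n0 in enumerate(source_counts):
        # Beer-Lambert law: exponents of different materials add up along a ray.
        exponent = sum(attenuation[m][e] * path_lengths[m] for m in path_lengths)
        expected += n0 * np.exp(-exponent)
    # Photon counting noise; lower source counts give relatively higher noise.
    noisy = rng.poisson(expected)
    return noisy / source_counts.sum()
```

Lowering `source_counts` mimics a shorter exposure time and therefore a noisier radiograph, which is the regime targeted by the simulation.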
A total of objects are reserved as training objects, while the other objects are reserved for testing. For each training object, the ground truth corresponding to each radiograph is generated with the workflow, with the same strategy and parameters as in Section 3.2. Global thresholding with parameter value is used for the reconstructions. For each test object, the ‘absolute’ ground truth corresponding to each radiograph is generated by directly projecting the virtual foreign objects (Fig. 11a and f), thereby skipping the reconstruction and segmentation steps. The projections are segmented such that every non-zero pixel on the detector is a projected foreign object location.
To verify that the direct use of the generated 3D volumes results in ground truth projections that are very similar to those obtained when the workflow is followed, the resulting ground truth projections are compared for the training set. The Jaccard index between the resulting ground truth pairs, averaged over all projection angles for all training objects, is for SIRT with iterations. This result indicates that the ground truth projections resulting from both approaches are very similar.
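For reference, the Jaccard index between two binary ground truth images is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name is hypothetical, and returning 1 for two empty images is a convention assumed here.

```python
import numpy as np

def jaccard_index(a, b):
    """Jaccard index (intersection over union) of two binary images."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.sum(a | b)
    if union == 0:
        return 1.0  # assumption: two empty images are considered identical
    return float(np.sum(a & b) / union)
```

Averaging this value over all projection angles and training objects gives a single similarity score between the workflow-generated and the directly projected ground truth.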
To further confirm this, the training of networks as described in Section 3.3 is repeated with the simulated projections, with the trained networks this time being evaluated on the test set with ‘absolute’ ground truth. The results for the three measures are given in Figure 14, and are in accordance with the experiments with the laboratory data. A notable difference is that the average class accuracy and detection accuracy reach their maximum values for a lower number of training objects (and the same goes for the minimum value of the false positive rate). This is most likely because the simulated objects are less complex, resulting in radiographs with less complicated structures. Nevertheless, the approach where one radiograph per training object is used again performs worse, since objects are needed to reach similar quality measure values as for the workflow with only objects.
Overall, the graphs presenting the foreign object detection accuracies in Section 3 indicate that segmentation and detection accuracy increase with the number of objects from which the training data is created. The accuracies initially increase strongly with the number of training objects, but this increase levels off when the number of training objects is increased further. The maximum detection accuracy that can be achieved depends on the nature of the foreign object detection problem. For instance, if the X-ray flux is low and the noise is high, foreign objects are more difficult to detect from the radiographs. In the case of the laboratory experiments, foreign objects were difficult to detect when the cylindrical shape was located with the long edge on the ground and oriented orthogonal to the detector. The radiographs should contain sufficient discriminatory information such that foreign object detection with deep learning is possible. Additionally, for the dataset to be suitable for supervised machine learning, the ground truth should also be of sufficient quality, although this seemed to be less of an issue in our experiments, as we observed no negative effects from occasional noise in the ground truth on the training and detection accuracy.
With the above considerations in mind, the workflow is designed to be modular. Every stage of the proposed workflow can be designed according to the available data-acquisition equipment, the intended detection accuracy, the type of base objects and foreign objects, and the available computer memory, among other things. We highlight some possible considerations for every stage:
Objects (Fig. 3a): The set of objects can be enlarged or diversified when the accuracy of the trained neural network is not satisfactory. More objects can also be added when a more diverse array of objects or orientations is expected to be imaged in the industrial application, such as on a conveyor belt. When a completely new type of object is considered, such objects should be added to the workflow as well.
Scanning routine (Fig. 3b): In our experimental setting we have used data resulting from low exposure times as input for both the neural networks and the reconstruction algorithm. If the foreign objects turn out to be too difficult to separate in the reconstructions, more scanning angles may be considered. Additionally, if the factory settings are allowed to be altered, higher fluxes, different tube voltages or longer exposure times can be used to obtain radiographs of higher quality, as long as the processing times remain acceptable. Also, more discrimination can be achieved by applying spectral imaging (dual-energy or multi-energy imaging [63, 16, 17]) such that the neural network can distinguish the foreign objects from the base objects. If changing the quality of the radiographs is not possible, a separate high-quality scan of the same object can be made under the same angles, to achieve more contrast of the foreign object in the reconstructions. The scanning routine can be carried out in any lab, as long as it is done under similar conditions as in the intended industrial X-ray imaging setting.
Reconstruction algorithm (Fig. 3e): Depending on the type of data, different reconstruction algorithms may be considered [64, 65]. In this work, we have used the SIRT algorithm to account for the noise in the data, but other reconstruction algorithms such as the Feldkamp-Davis-Kress (FDK) algorithm or the Conjugate Gradient method for Least Squares (CGLS) can be considered as well. Also, when dealing with spectral or generic multi-channel data, multi-channel reconstruction methods [68, 69, 70, 71, 72] can be used to increase the reconstruction accuracy even further. When dealing with objects that may change in time, dynamic reconstruction methods can be considered [73, 74, 75].
Segmentation algorithm (Fig. 3f): In this work we have used a simple global thresholding scheme, but many more segmentation methods are available, as well as approaches to reduce possible noise, or bounding boxes when the location of the foreign object is known. In case of multi-channel data, a multi-dimensional thresholding scheme can be used, as well as clustering methods. Discrete reconstruction algorithms that combine reconstruction and segmentation are also available [76, 77].
Virtual projection (Fig. 3g): When creating the virtual projection, post-processing on the generated ground truth projections can be applied to increase the training target quality, for instance by denoising the obtained ground truth projections.
Supervised learning (Fig. 3c and h): To validate the workflow, we have used the U-Net architecture with ADAM optimization on the cross-entropy loss and the Dice loss, as well as the MSD network with ADAM optimization on the cross-entropy loss. Other neural network architectures (see Section 2.2) can also be considered, as well as different optimization strategies and loss functions. Note that the foreign object detection problem considered in this work may be ambiguous, since for a base object containing a foreign object, another base object without a foreign object can theoretically be constructed that results in the same radiograph. This constructed base object may have an unnatural shape when compared with other base objects, but if it occurs, it may lead to inconsistent training data for the network. This possible problem is independent of the workflow and can be resolved by multi-spectral imaging or multi-angle imaging, and by training the networks with multiple images of the same object resulting from these imaging methods. Creating reconstructions with data from these advanced imaging methods would not be necessary.
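To make the reconstruction stage listed above more concrete, the SIRT update can be written as x ← x + C Aᵀ R (b − A x), where R and C hold the inverse row sums and inverse column sums of the projection matrix A. The following is a minimal dense-matrix sketch under that standard formulation. In practice the projection operator would be provided by a tomography toolbox rather than an explicit matrix, and the tiny system used here is purely hypothetical.

```python
import numpy as np


def sirt(A, b, n_iter):
    """Basic SIRT iteration for the linear system A x = b, starting from zero."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    row_sums = np.abs(A).sum(axis=1)
    col_sums = np.abs(A).sum(axis=0)
    # Inverse row and column sums; empty rows or columns get weight zero.
    R = np.divide(1.0, row_sums, out=np.zeros_like(row_sums), where=row_sums > 0)
    C = np.divide(1.0, col_sums, out=np.zeros_like(col_sums), where=col_sums > 0)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        residual = b - A @ x
        x = x + C * (A.T @ (R * residual))
    return x


# Hypothetical toy system: four "rays" through three "voxels".
A_toy = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])
x_true = np.array([1.0, 2.0, 3.0])
x_rec = sirt(A_toy, A_toy @ x_true, n_iter=200)
```

Stopping the iteration early acts as a form of regularization for noisy data, which is one reason the number of iterations is treated as a parameter of the workflow.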
The training data acquisition workflow proposed in this paper holds a possible advantage over annotation of 2D radiographs, even when perfect annotations are created. According to the results in Figures 8 and 14, segmentation and detection accuracy can be improved by using multiple annotated radiographs for each training object. As opposed to manual annotation, with the proposed workflow many additional radiographs are obtained for each training object.
In this research, a new workflow is proposed for generating training data for supervised deep learning for foreign object detection in an industrial setting. In this workflow, a number of representative objects are scanned using X-ray imaging, reconstructed using computed tomography, segmented and virtually projected in an objective and reproducible manner to obtain the true foreign object locations in a large set of radiographs. Supervised machine learning can then be applied to detect foreign objects with high accuracy, depending on the number of representative objects included. We demonstrate this workflow on both laboratory and simulated data using neural networks for the deep learning task. Through laboratory experiments, we have verified that the workflow produces adequate target images. The introduced measures assess the quality of foreign object detection with networks trained using datasets generated with this workflow. All experiments show a consistent result in which the accuracy increases significantly with the first few training objects, and less significantly for every additional training object. In the laboratory experiment, we consistently obtain high accuracies for detecting gravel in modeling clay with low exposure times using this workflow, demonstrating its application potential in an industrial setting.
Mathé T. Zeegers: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization. Tristan van Leeuwen: Conceptualization, Writing - Review & Editing, Supervision, Project administration. Daniël M. Pelt: Conceptualization, Methodology, Software, Writing - Review & Editing, Project Administration. Sophia Bethany Coban: Conceptualization, Writing - Review & Editing. Robert van Liere: Conceptualization, Writing - Review & Editing, Funding Acquisition. Kees Joost Batenburg: Conceptualization, Methodology, Writing - Review & Editing, Supervision, Project Administration, Funding Acquisition.
Conflict of interest
The authors declare no conflict of interest.
The authors acknowledge financial support from the Netherlands Organisation for Scientific Research (NWO), project number 639.073.506. D. M. Pelt is supported by The Netherlands Organisation for Scientific Research (NWO), project number 016.Veni.192.235. The authors also acknowledge TESCAN-XRE NV for their collaboration and support of the FleX-ray laboratory.
-  V. Andriiashen, R. van Liere, T. van Leeuwen, and K. J. Batenburg. Unsupervised foreign object detection based on dual-energy absorptiometry in the food industry. Journal of Imaging, 7(7):10, 2021.
-  K. H. Wilm. Foreign object detection: Integration in food production. Food Safety Magazine, 18:14–17, 2012.
-  L. Zhu, P. Spachos, E. Pensini, and K. N. Plataniotis. Deep learning and machine vision for food processing: A survey. Current Research in Food Science, 4:233–249, 2021.
-  Y. He, Q. Xiao, X. Bai, L. Zhou, F. Liu, and C. Zhang. Recent progress of nondestructive techniques for fruits damage inspection: a review. Critical Reviews in Food Science and Nutrition, pages 1–19, 2021.
-  S. Li, H. Luo, M. Hu, M. Zhang, J. Feng, Y. Liu, Q. Dong, and B. Liu. Optical non-destructive techniques for small berry fruits: A review. Artificial Intelligence in Agriculture, 2:85–98, 2019.
-  M. T. Mohd Khairi, S. Ibrahim, M. A. Md Yunus, and M. Faramarzi. Noninvasive techniques for detection of foreign bodies in food: A review. Journal of Food Process Engineering, 41(6):e12808, 2018.
-  K. Narsaiah, A. K. Biswas, and P. K. Mandal. Nondestructive methods for carcass and meat quality evaluation. In A. K. Biswas and P. K. Mandal, editors, Meat Quality Analysis, pages 37–49. Academic Press, 2020.
-  B. M. Nicolaï, T. Defraeye, B. De Ketelaere, E. Herremans, M. L. A. T. M. Hertog, W. Saeys, A. Torricelli, T. Vandendriessche, and P. Verboven. Nondestructive measurement of fruit and vegetable quality. Annual Review of Food Science and Technology, 5:285–312, 2014.
-  Z. Xiong, D. Sun, H. Pu, W. Gao, and Q. Dai. Applications of emerging imaging techniques for meat quality and safety detection and evaluation: A review. Critical Reviews in Food Science and Nutrition, 57(4):755–768, 2017.
-  H. Einarsdóttir, M. J. Emerson, L. H. Clemmensen, K. Scherer, K. Willer, M. Bech, R. Larsen, B. K. Ersbøll, and F. Pfeiffer. Novelty detection of foreign objects in food using multi-modal X-ray imaging. Food Control, 67:39–47, 2016.
-  R. P. Haff and N. Toyofuku. X-ray detection of defects and contaminants in the food industry. Sensing and Instrumentation for Food Quality and Safety, 2(4):262–273, 2008.
-  J. Kwon, J. Lee, and W. Kim. Real-time detection of foreign objects using X-ray imaging for dry food manufacturing line. In 2008 IEEE International Symposium on Consumer Electronics, Vilamoura, Portugal, pages 1–4. IEEE, 2008.
-  S. K. Mathanker, P. R. Weckler, and T. J. Bowser. X-ray applications in food and agriculture: A review. Transactions of the ASABE, 56(3):1227–1239, 2013.
-  D. Mery, I. Lillo, H. Loebel, V. Riffo, A. Soto, A. Cipriano, and J. M. Aguilera. Automated fish bone detection using X-ray imaging. Journal of Food Engineering, 105(3):485–492, 2011.
-  J. Zhong, F. Zhang, Z. Lu, Y. Liu, and X. Wang. High-speed display-delayed planar x-ray inspection system for the fast detection of small fishbones. Journal of Food Process Engineering, 42(3):e13010, 2019.
-  S. Si-Mohamed, D. Bar-Ness, M. Sigovan, D. P. Cormode, P. Coulon, E. Coche, A. Vlassenbroek, G. Normand, L. Boussel, and P. Douek. Review of an initial experience with an experimental spectral photon-counting computed tomography system. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 873:27–35, 2017.
-  K. Taguchi, I. Blevis, and K. Iniewski. Spectral, photon counting computed tomography: technology and applications. (1st ed.). CRC Press, 2020.
-  M. Sezgin and B. Sankur. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1):146–165, 2004.
-  G. Silva, L. Oliveira, and M. Pithon. Automatic segmenting teeth in X-ray images: Trends, a novel data set, benchmarking and future perspectives. Expert Systems with Applications, 107:15–31, 2018.
-  M. Al-Sarayreh, M. M. Reis, W. Q. Yan, and R. Klette. A sequential CNN approach for foreign object detection in hyperspectral images. In M. Vento and G. Percannella, editors, International Conference on Computer Analysis of Images and Patterns, pages 271–283. Springer, 2019.
-  D. Rong, L. Xie, and Y. Ying. Computer vision detection of foreign objects in walnuts using deep learning. Computers and Electronics in Agriculture, 162:1001–1010, 2019.
-  Z. Zhao, P. Zheng, S. Xu, and X. Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
-  A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
-  Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2):87–93, 2018.
-  G. Chartrand, P. M. Cheng, E. Vorontsov, M. Drozdzal, S. Turcotte, C. J. Pal, S. Kadoury, and A. Tang. Deep learning: a primer for radiologists. Radiographics, 37(7):2113–2131, 2017.
-  H. Wu, Q. Liu, and X. Liu. A review on deep learning approaches to image classification and object segmentation. Computers, Materials & Continua, 60(2):575–597, 2019.
-  A. M. Deshpande, A. A. Minai, and M. Kumar. One-shot recognition of manufacturing defects in steel surfaces. Procedia Manufacturing, 48:1064–1071, 2020.
-  S. Akcay and T. Breckon. Towards automatic threat detection: A survey of advances of deep learning within X-ray security imaging. Pattern Recognition, 122:108245, 2022.
-  N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Medical Image Analysis, page 101693, 2020.
-  H. Pan, C. Zhou, Q. Zhu, and D. Zheng. A fast registration from 3D CT images to 2D X-ray images. In 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), Shanghai, China, pages 351–355. IEEE, 2018.
-  T. Van De Looverbosch, E. Raeymaekers, P. Verboven, J. Sijbers, and B. Nicolaï. Non-destructive internal disorder detection of conference pears by semantic segmentation of X-ray CT scans using deep learning. Expert Systems with Applications, 176:114925, 2021.
-  P. Russo. Handbook of X-ray imaging: physics and technology. (1st ed.). CRC Press, 2017.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, pages 3431–3440. IEEE, 2015.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, editors, International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, Santiago, Chile, pages 1520–1528. IEEE, 2015.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
-  G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, pages 1925–1934. IEEE, 2017.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pages 2881–2890. IEEE, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, Venice, Italy, pages 2961–2969. IEEE, 2017.
-  L. Lenchik, L. Heacock, A. A. Weaver, R. D. Boutin, T. S. Cook, J. Itri, C. G. Filippi, R. P. Gullapalli, J. Lee, M. Zagurovskaya, et al. Automated segmentation of tissues using CT and MRI: a systematic review. Academic Radiology, 26(12):1695–1706, 2019.
-  D. Kern and A. Mastmeyer. 3D bounding box detection in volumetric medical image data: A systematic literature review. In 2021 IEEE 8th International Conference on Industrial Engineering and Applications (ICIEA), Chengdu, China, pages 509–516. IEEE, 2021.
-  M. Diwakar and M. Kumar. A review on CT image noise and its denoising. Biomedical Signal Processing and Control, 42:73–88, 2018.
-  A. A. Hendriksen, D. M. Pelt, and K. J. Batenburg. Noise2inverse: Self-supervised deep convolutional denoising for tomography. IEEE Transactions on Computational Imaging, 6:1320–1335, 2020.
-  S. B. Coban, F. Lucka, W. J. Palenstijn, D. Van Loo, and K. J. Batenburg. Explorative imaging and its implementation at the FleX-ray laboratory. Journal of Imaging, 6(4):18, 2020.
-  A. C. Kak, M. Slaney, and G. Wang. Principles of computerized tomographic imaging, 2002.
-  A. Van der Sluis and H. A. Van der Vorst. SIRT-and CG-type methods for the iterative solution of sparse linear least-squares problems. Linear Algebra and its Applications, 130:257–303, 1990.
-  W. Van Aarle, W. J. Palenstijn, J. Cant, E. Janssens, F. Bleichrodt, A. Dabravolski, J. De Beenhouwer, K. J. Batenburg, and J. Sijbers. Fast and flexible X-ray tomography using the ASTRA toolbox. Optics Express, 24(22):25129–25147, 2016.
-  W. Van Aarle, W. J. Palenstijn, J. De Beenhouwer, T. Altantzis, S. Bals, K. J. Batenburg, and J. Sijbers. The ASTRA toolbox: A platform for advanced algorithm development in electron tomography. Ultramicroscopy, 157:35–47, 2015.
-  N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
-  D. M. Pelt and J. A. Sethian. A mixed-scale dense convolutional neural network for image analysis. Proceedings of the National Academy of Sciences, 115(2):254–259, 2018.
-  C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In M. J. Cardoso, T. Arbel, G. Carneiro, T. Syeda-Mahmood, J. M. R. S. Tavares, M. Moradi, A. Bradley, H. Greenspan, J. P. Papa, A. Madabhushi, J. C. Nascimento, J. S. Cardoso, V. Belagiannis, and Z. Lu, editors, Deep learning in medical image analysis and multimodal learning for clinical decision support, pages 240–248. Springer, 2017.
-  S. Jadon. A survey of loss functions for semantic segmentation. arXiv preprint arXiv:2006.14822, 2020.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
-  A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in neural information processing systems, pages 8026–8037, 2019.
-  M. J. Lagerwerf, D. M. Pelt, W. J. Palenstijn, and K. J. Batenburg. A computationally efficient reconstruction algorithm for circular cone-beam computed tomography using shallow neural networks. Journal of Imaging, 6(12):135, 2020.
-  D. M. Pelt, K. J. Batenburg, and J. A. Sethian. Improving tomographic reconstruction from limited data using mixed-scale dense convolutional neural networks. Journal of Imaging, 4(11):128, 2018.
-  D. P. Kingma and J. Ba. ADAM: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7-9 May, 2015.
-  D. M. Pelt. Github - dmpelt/msdnet: Python implementation of the mixed-scale dense convolutional neural network. https://github.com/dmpelt/msdnet, 2019. Accessed on 24 November 2020.
-  M. Grandini, E. Bagli, and G. Visani. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756, 2020.
-  M. T. Zeegers, D. M. Pelt, T. van Leeuwen, R. van Liere, and K. J. Batenburg. Task-driven learned hyperspectral data reduction using end-to-end supervised deep learning. Journal of Imaging, 6(12):132, 2020.
-  J. H. Hubbell and S. M. Seltzer. Tables of X-ray mass attenuation coefficients and mass energy-absorption coefficients 1 keV to 20 MeV for elements Z=1 to 92 and 48 additional substances of dosimetric interest. Technical report, National Institute of Standards and Technology-PL, Gaithersburg, MD, USA. Ionizing Radiation Div., 1995.
-  V. Rebuffel and J. Dinten. Dual-energy X-ray imaging: benefits and limits. Insight - Non-Destructive Testing and Condition Monitoring, 49(10):589–594, 2007.
-  G. Einarsson, J. N. Jensen, R. R. Paulsen, H. Einarsdottir, B. K. Ersbøll, A. B. Dahl, and L. B. Christensen. Foreign object detection in multispectral X-ray images of food items using sparse discriminant analysis. In P. Sharma and F.M. Bianchi, editors, Scandinavian Conference on Image Analysis, pages 350–361. Springer, 2017.
-  T. M. Buzug. Computed Tomography: From Photon Statistics to Modern Cone-Beam CT. (1st ed.). Springer, 2008.
-  P. C. Hansen, J. S. Jørgensen, and W. R. B. Lionheart. Computed Tomography: Algorithms, Insight, and Just Enough Theory. (1st ed.). SIAM, 2021.
-  L. A. Feldkamp, L. C. Davis, and J. W. Kress. Practical cone-beam algorithm. Journal of the Optical Society of America A, 1(6):612–619, 1984.
-  M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409–436, 1952.
-  D. Kazantsev, J. S. Jørgensen, M. S. Andersen, W. R. B. Lionheart, P. D. Lee, and P. J. Withers. Joint image reconstruction method with correlative multi-channel prior for X-ray spectral computed tomography. Inverse Problems, 34(6):064001, 2018.
-  D. S. Rigie and P. J. La Rivière. Joint reconstruction of multi-channel, spectral CT data via constrained total nuclear variation minimization. Physics in Medicine & Biology, 60(5):1741, 2015.
-  A. Sawatzky, Q. Xu, C. O. Schirra, and M. A. Anastasio. Proximal ADMM for multi-channel image reconstruction in spectral X-ray CT. IEEE Transactions on Medical Imaging, 33(8):1657–1668, 2014.
-  O. Semerci, N. Hao, M. E. Kilmer, and E. L. Miller. Tensor-based formulation and nuclear norm regularization for multienergy computed tomography. IEEE Transactions on Image Processing, 23(4):1678–1693, 2014.
-  M. T. Zeegers, F. Lucka, and K. J. Batenburg. A multi-channel DART algorithm. In R. P. Barneva, V. E. Brimkov, and J. M. R. S. Tavares, editors, International Workshop on Combinatorial Image Analysis, pages 164–178. Springer, 2018.
-  N. Djurabekova, A. Goldberg, A. Hauptmann, D. Hawkes, G. Long, F. Lucka, and M. Betcke. Application of proximal alternating linearized minimization (PALM) and inertial PALM to dynamic 3D CT. In S. Matej and S. D. Metzler, editors, 15th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, volume 11072, pages 30–34. International Society for Optics and Photonics, SPIE, 2019.
-  V. V. Nikitin, M. Carlsson, F. Andersson, and R. Mokso. Four-dimensional tomographic reconstruction by time domain decomposition. IEEE Transactions on Computational Imaging, 5(3):409–419, 2019.
-  A. Hauptmann, O. Öktem, and C. Schönlieb. Image reconstruction in dynamic inverse problems with temporal models. arXiv preprint arXiv:2007.10238, 2020.
-  K. J. Batenburg and J. Sijbers. DART: A practical reconstruction algorithm for discrete tomography. IEEE Transactions on Image Processing, 20(9):2542–2553, 2011.
-  G. T. Herman and A. Kuba. Discrete tomography: Foundations, algorithms, and applications. (1st ed.). Springer, 1999.
-  M. T. Zeegers. A collection of X-ray projections of 131 pieces of modeling clay containing stones for machine learning-driven object detection. Zenodo, 2022.
-  M. T. Zeegers. A collection of 131 CT datasets of pieces of modeling clay containing stones. Zenodo, 2022.
Appendix A Intensity value histograms
We compare the intensity distributions for radiographs and for a CT scan of an object in Figure A1, which shows a number of statistics about the pixel and voxel intensities for object 3 (Fig. 6). For both approaches, the intensity value distributions are plotted and separated according to whether the pixels or voxels have been marked as foreign object by the thresholding method. In the 3D case, the foreign object and the base object are clearly separated based on attenuation, such that a simple global threshold based on Otsu’s method is sufficient to segment the foreign object. On the other hand, in the 2D radiograph case, the intensity values corresponding to the foreign object locations are similar to values of the base object.
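Otsu's method selects the threshold that maximizes the between-class variance of the intensity histogram. The following minimal sketch illustrates how such a global threshold could be computed for a reconstructed volume; the function name, the number of histogram bins and the toy intensities are assumptions chosen for illustration only.

```python
import numpy as np


def otsu_threshold(values, n_bins=256):
    """Return the threshold maximizing the between-class variance."""
    values = np.asarray(values, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()
    w0 = np.cumsum(weights)             # probability of the lower class
    w1 = 1.0 - w0                       # probability of the upper class
    cum_mean = np.cumsum(weights * centers)
    total_mean = cum_mean[-1]
    # Between-class variance for a split after each bin; only defined
    # where both classes are non-empty.
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros_like(w0)
    sigma_b[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (w0[valid] * w1[valid])
    k = int(np.argmax(sigma_b))
    return edges[k + 1]                 # upper edge of the last lower-class bin


# Hypothetical voxel intensities: many base object voxels, few foreign object voxels.
voxels = np.concatenate([np.full(100, 0.2), np.full(20, 0.8)])
threshold = otsu_threshold(voxels)
foreign_mask = voxels > threshold
```

When the two intensity populations overlap, as in the 2D radiograph case described above, no single threshold separates them and the resulting mask mixes base object and foreign object pixels.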