Quality management is a fundamental component of a manufacturing process. To meet growth targets, manufacturers must increase their production rate while maintaining stringent quality control limits. In a recent report, the development of better quality management systems was described as the most important technology advancement for manufacturing business performance. To meet the growing demand for high-quality products, intelligent visual inspection systems are becoming essential in production lines.
Processes such as casting and welding can introduce defects that are detrimental to the quality of the final product. Common casting defects include air holes, foreign-particle inclusions, shrinkage cavities, cracks, wrinkles, and casting fins. If undetected, these casting defects can lead to catastrophic failure of critical mechanical components, such as turbine blades, brake calipers, or vehicle driveshafts. Detecting these defects early in the manufacturing process allows faulty products to be identified sooner, leading to time and cost savings. Automated quality control can be used to facilitate consistent and cost-effective inspection. The primary drivers for automated inspection systems include faster inspection rates, higher quality demands, and the need for more quantitative product evaluation that is not hampered by the effects of human fatigue.
Nondestructive evaluation techniques allow a product to be tested during the manufacturing process without jeopardizing the quality of the product. A number of nondestructive evaluation techniques are available for producing two-dimensional and three-dimensional images of an object. Real-time X-ray imaging technology is widely used in industrial defect detection systems, such as on-line weld defect inspection. Ultrasonic inspection and magnetic particle inspection can also be used to measure the size and position of casting defects in cast components [6, 7]. X-ray computed tomography (CT) can be used to visualize the internal structure of materials. Recent developments in high-resolution X-ray computed tomography have made it possible to gain a three-dimensional characterization of porosity. However, automatically identifying casting defects in X-ray images remains a challenging task in the automated inspection and computer vision domains.
The defect detection process can be framed as either an object detection task or an instance segmentation task. In the object detection approach, the goal is to place a tight-fitting bounding box around each defect in the image. In the image segmentation approach, the problem is essentially one of pixel classification, where the goal is to classify each image pixel as defect or non-defect. Instance segmentation is a more difficult variant of image segmentation, where each segmented pixel must be assigned to a particular casting defect. A comparison of these computer vision tasks is provided in Figure 1. In general, object detection and instance segmentation are difficult tasks, as each object can cast an infinite number of different 2-D images onto the retina. Additionally, the number of instances in a particular image is unknown and often unbounded. Variations in the object's position, pose, lighting, and background present additional challenges.
Many state-of-the-art object detection systems have been developed using the region-based convolutional neural network (R-CNN) architecture. R-CNN creates bounding boxes, or region proposals, using a process called selective search. At a high level, selective search looks at the image through windows of different sizes and, for each size, tries to group together adjacent pixels by texture, color, or intensity to identify objects. Once the proposals are created, R-CNN warps each region to a standard square size and passes it through a feature extractor. A support vector machine (SVM) classifier is then used to predict what object is present in the region, if any. In more recent object detection architectures, such as region-based fully convolutional networks (R-FCN), each component of the object detection network is replaced by a deep neural network.
In this work, a fast and accurate defect detection system is developed by leveraging recent advances in computer vision. The proposed defect detection system is based on the mask region-based CNN (Mask R-CNN) architecture. This architecture simultaneously performs object detection and instance segmentation, making it useful for a range of automated inspection tasks. The proposed system is trained and evaluated on the GRIMA database of X-ray images (GDXray), published by Grupo de Inteligencia de Máquina (GRIMA). Some examples from the GDXray dataset are shown in Figure 2.
The remainder of this article is organized as follows: The first section provides an overview of related works, and the second section provides a brief introduction to CNNs. A detailed description of the proposed defect detection system is provided in the “Defect Detection System” section. The “Implementation Details and Experimental Results” section explains how the system is trained to detect casting defects, and provides the main experimental results, as well as a comparison with similar systems in the literature. The article is concluded with a number of in-depth studies, a thorough discussion of the results, and a brief conclusion.
II Related Works
The detection and segmentation of casting defects using traditional computer vision techniques has been relatively well studied. One popular method is background subtraction, where an estimated background image (which does not contain the defects) is subtracted from the preprocessed image to leave a residual image containing the defects and random noise [14, 15]. Background subtraction has also been applied to the welding defect detection task, with varying levels of success [16, 17, 18]. However, background subtraction tends to be very sensitive to the positioning of the image, as well as to random image noise. A range of matched filtering techniques have also been proposed, with modified median (MODAN) filtering being a popular choice. The MODAN filter is a median filter with adapted filter masks, designed to differentiate structural contours of the casting piece from casting defects. A number of researchers have proposed wavelet-based techniques with varying levels of success [4, 21]. In wavelet-based and frequency-based approaches, defects are commonly identified as high-frequency regions of the image, in contrast to the comparatively lower-frequency background. Many of these approaches fail to combine local and global information from the image when classifying defects, making them unable to separate design features like holes and edges from casting defects.
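The background subtraction idea described above can be sketched in a few lines. This is a minimal illustration on a single row of pixel intensities, assuming the background is estimated with a sliding median; that estimator, the function names, the window size, and the threshold are all hypothetical choices and are not taken from the cited works.

```python
from statistics import median

def median_background(row, window=5):
    """Estimate a smooth background for a 1-D row of intensities with a sliding median."""
    half = window // 2
    background = []
    for i in range(len(row)):
        lo, hi = max(0, i - half), min(len(row), i + half + 1)
        background.append(median(row[lo:hi]))
    return background

def residual_defects(row, window=5, threshold=20):
    """Subtract the estimated background and flag pixels with a large residual."""
    background = median_background(row, window)
    residual = [p - b for p, b in zip(row, background)]
    return [abs(r) > threshold for r in residual]

# A flat region with one dark pixel (index 4) standing in for a defect.
row = [100, 101, 99, 100, 40, 100, 102, 101, 100]
flags = residual_defects(row)
```

Only the dark pixel survives the subtraction here. Shifting the part in the image or adding noise changes the residual directly, which is consistent with the sensitivity noted above.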
In many traditional computer vision approaches, it is common to manually identify a number of features which can be used to classify individual pixels. Each image pixel is classified as a defect or treated as not being a defect, depending on the features that are computed from a local neighborhood around the pixel. Common features include statistical descriptors (mean, standard deviation, skewness, kurtosis) and localized wavelet decomposition. Several fuzzy logic approaches have also been proposed, but these techniques have been largely superseded by modern CNN-based computer vision techniques .
The related task of automated surface inspection (ASI) is also well documented in the literature. In ASI, surface defects are generally described as local anomalies in homogeneous textures. Depending on the properties of the surface texture, ASI methods can be divided into four approaches. The first approach comprises structural methods that model the texture primitives and displacements. Popular structural approaches include primitive measurement, edge features, and morphological operations. The second approach comprises statistical methods, which measure the distribution of pixel values. The statistical approach is efficient for stochastic textures, such as ceramic tiles, castings, and wood. Popular statistical methods include histogram-based methods, local binary patterns (LBP), and co-occurrence matrices. The third approach comprises filter-based methods that apply filter banks to texture images. The filter-based methods can be divided into spatial-domain, frequency-domain, and spatial-frequency domain methods. Finally, model-based approaches construct representations of images by modeling multiple properties of defects.
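As an illustration of one statistical texture descriptor mentioned above, the basic local binary pattern compares each neighbor of a pixel with the center value and packs the results into an 8-bit code. The sketch below assumes a 3 x 3 neighborhood and a clockwise bit order starting at the top-left pixel; the bit order is an arbitrary convention chosen for this example.

```python
def lbp_code(patch):
    """Compute the 8-bit local binary pattern code for the center of a 3x3 patch."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:   # neighbor at least as bright as the center sets the bit
            code |= 1 << bit
    return code

patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 3]]
code = lbp_code(patch)
```

A histogram of such codes over an image patch gives a feature vector that can be passed to a classifier such as a linear SVM.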
The research community, including this work, benefits greatly from well-archived experimental datasets, such as the GRIMA database of X-ray images (GDXray). The performance of several simple methods for defect segmentation has been compared using the GDXray Welds series, but each method was only evaluated qualitatively. A comprehensive study of casting defect detection using various computer vision techniques has also been reported, in which patches of size $32 \times 32$ pixels are cropped from the GDXray Castings series and used to train and test a number of different classifiers. The best performance is achieved by a simple LBP descriptor with a linear SVM classifier. Several deep learning approaches are also evaluated, obtaining up to 86.4% patch classification accuracy. When applying the deep learning techniques, the authors resize the $32 \times 32 \times 3$ pixel patches to a size of $244 \times 244 \times 3$ pixels so that they can be fed into pretrained neural networks [36, 37]. A deep CNN has also been used for weld defect segmentation, obtaining 90.5% accuracy on the binary classification of $25 \times 25$ pixel patches.
In recent times, a number of machine learning techniques have been successfully applied to the object detection task. Two notable neural network approaches are Faster Region-Based CNN (Faster R-CNN) and Single Shot Multibox Detector (SSD) . These approaches share many similarities, but the latter is designed to prioritize evaluation speed over accuracy. A comparison of different object detection networks is provided in . Mask R-CNN is an extension of Faster R-CNN that simultaneously performs object detection and instance segmentation . In previous research, it has been demonstrated that Faster R-CNN can be used as the basis for a fast and accurate defect detection system . This work builds on that progress by developing a defect detection system that simultaneously performs object detection and instance segmentation.
III Convolutional Neural Networks
There has been significant progress in the field of computer vision, particularly in image classification, object detection and image segmentation. The development of deep CNNs has led to vast improvements in many image processing tasks. This section provides a brief overview of CNNs. For a more comprehensive description, the reader is referred to .
In a CNN, pixels from each image are converted to a featurized representation through a series of mathematical operations. Images can be represented as an order 3 tensor $I \in \mathbb{R}^{H \times W \times D}$ with height $H$, width $W$, and $D$ color channels. The input sequentially goes through a number of processing steps, commonly referred to as layers. Each layer $i$ can be viewed as an arbitrary transformation $x_{i+1} = f_i(x_i; \theta_i)$ with inputs $x_i$, outputs $x_{i+1}$, and parameters $\theta_i$. The outputs of a layer are often referred to as a feature map. By combining multiple layers it is possible to develop a complex nonlinear function which can map high-dimensional data (such as images) to useful outputs (such as classification labels). More formally, a CNN can be thought of as the composition of a number of functions:

$$f(x_1) = f_N\big(f_{N-1}(\ldots f_1(x_1; \theta_1) \ldots; \theta_{N-1}); \theta_N\big) \tag{1}$$

where $x_1$ is the input to the CNN and $f(x_1)$ is the output. There are several layer types which are common to most modern CNNs, including convolution layers, pooling layers, and batch normalization layers. A convolution layer is a function $f_i(x_i; \theta_i)$ that convolves one or more parameterized kernels with the input tensor, $x_i$. Suppose the input $x_i$ is an order 3 tensor with size $H_i \times W_i \times D_i$. A convolution kernel is also an order 3 tensor, with size $h_k \times w_k \times D_i$. The kernel is convolved with the input by taking the dot product of the kernel with the input at each spatial location in the input. The convolution of a kernel with an image is shown diagrammatically in Figure 3. By convolving certain types of kernels with the input image, it is possible to obtain meaningful outputs, such as the image gradients. In most modern CNN architectures, the first few convolutional layers extract features like edges and textures. Convolutional layers deeper in the network can extract features that span a greater spatial area of the image, such as object shapes.
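The dot-product view of convolution can be made concrete with a short sketch. It assumes a single-channel image, no padding, and a stride of one, and it uses a Sobel kernel as an example of a kernel that yields a horizontal image gradient; the function name is illustrative.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take a dot
    product at each spatial location. As in most CNN libraries, the kernel is
    not flipped, so this is strictly a cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        output.append(row)
    return output

sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
# Four identical rows containing a vertical edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
gradient = convolve2d(image, sobel_x)
```

In a trained CNN the kernel values are learned parameters rather than hand-designed filters such as this one.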
Deep neural networks are, by design, parameterized nonlinear functions. An activation function is applied to the output of a neural network layer to introduce this nonlinearity. Traditionally, the sigmoid function was used as the nonlinear activation function in neural networks. In modern architectures, the Rectified Linear Unit (ReLU) is more commonly used as the neuron activation function, as it performs best with respect to runtime and generalization error. The nonlinear ReLU function follows the formulation $f(z) = \max(0, z)$ for each value, $z$, in the input tensor. Unless otherwise specified, ReLU is used as the activation function in the defect detection system described in this article.
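For reference, both activation functions mentioned above are one-liners. The sketch below applies ReLU elementwise to a small, made-up feature map.

```python
import math

def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, the traditional activation function."""
    return 1.0 / (1.0 + math.exp(-x))

feature_map = [[-2.0, 0.5],
               [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in feature_map]
```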
Pooling layers are also common in most modern CNN architectures [43, 46]. The primary function of pooling layers is to progressively reduce the spatial size of the representation, which reduces the number of parameters in the network and hence helps to control overfitting. Pooling layers typically apply a max or average operation over the spatial dimensions of the input tensor. The pooling operation is typically performed over a small local area of the input tensor. By stacking pooling and convolutional layers, it is possible to build a network that allows a hierarchical and evolutionary development of raw pixel data towards powerful feature representations.
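A minimal sketch of max pooling follows. The window size of 2 and the non-overlapping stride are assumptions made for the example, not values stated in this article.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows.
    Trailing rows or columns that do not fill a whole window are dropped."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        pooled.append(row)
    return pooled

fm = [[1, 3, 2, 0],
      [4, 2, 1, 5],
      [0, 1, 7, 2],
      [3, 2, 1, 1]]
pooled = max_pool2d(fm)
```

Each spatial dimension is halved here, so the layers that follow operate on a quarter of the original number of values.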
Training a neural network is performed by minimizing a loss function. The loss function is normally a measure of the difference between the current output of the neural network and the ground truth. As long as each layer of the neural network is differentiable, it is possible to calculate the gradient of the loss function with respect to the parameters. The backpropagation algorithm allows these gradients to be calculated efficiently. A gradient-based optimization algorithm such as stochastic gradient descent (SGD) can then be used to find the parameters that minimize the loss function.
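The training loop can be illustrated on a deliberately tiny model. The sketch below fits a line $y = wx + b$ with SGD on a squared-error loss; the model, learning rate, epoch count, and data are hypothetical and stand in for a real network, where backpropagation would supply the gradients.

```python
import random

def sgd_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent, one sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                 # visit samples in a random order each epoch
        for x, y in samples:
            error = (w * x + b) - y          # gradient of 0.5 * error**2 w.r.t. the prediction
            w -= lr * error * x              # chain rule: d(loss)/dw = error * x
            b -= lr * error                  # chain rule: d(loss)/db = error
    return w, b

# Noise-free samples from the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
w, b = sgd_fit(data)
```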
III-A Residual Networks
The properties of a neural network are characterized by the choice and arrangement of the layers, often referred to as the architecture. Deeper networks generally allow more complex features to be computed from the input image. However, increasing the depth of a neural network often makes it more difficult to train, due to the vanishing gradient problem. The residual network (ResNet) architecture was designed to avoid many of the issues that plagued very deep neural networks. Most notably, the use of residual connections helps to overcome the vanishing gradient problem. A cell from the ResNet architecture is shown in Figure 4. There are a number of standard variants of the ResNet architecture, containing between 18 and 152 layers. In this work, the relatively large ResNet-101 variant with 101 trainable layers is used as the neural network backbone.
While ResNet was designed primarily to solve the image classification problem, it can also be used for a wider range of image processing tasks. More specifically, the outputs from the intermediate layers can be used as high-level representations of the image. When used this way, ResNet is referred to as a feature extractor, rather than a classification network.
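The residual connection itself is simple: a cell outputs its input plus a learned correction, $y = F(x) + x$. The sketch below is a simplified illustration in which `small_transform` is a hypothetical stand-in for the stacked convolution layers of a real ResNet cell.

```python
def residual_block(x, transform):
    """Return transform(x) + x, the identity shortcut of a residual cell."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]

def small_transform(x, weight=0.0):
    """Hypothetical stand-in for the learned layers F(x): a scaling followed by ReLU."""
    return [max(0.0, weight * v) for v in x]

x = [1.0, -2.0, 3.0]
y = residual_block(x, small_transform)   # with weight 0, F(x) = 0 and the input passes through
```

Because the shortcut passes the input through unchanged, the gradient also has a direct path back to earlier layers, which is the property that eases the training of very deep networks.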
IV Defect Detection System
In this section, a defect detection system is proposed to identify casting defects in X-ray images. The proposed system simultaneously performs defect detection and defect segmentation, making it useful for a range of automated inspection applications. The design of the defect detection system is based on the Mask R-CNN architecture. As depicted in Figure 5, the defect detection system is composed of four modules. The first module is a feature extraction module that generates a high-level featurized representation of the input image. The second module is a CNN that proposes regions of interest (RoIs) in the image, based on the featurized image. The third module is a CNN that attempts to classify the objects in each RoI. The fourth module performs image segmentation, with the goal of generating a binary mask for each region. Each module is described in detail throughout the remainder of this section.
IV-A Feature Extraction
The first module in the proposed defect detection system transforms the image pixels into a high-level featurized representation. Many CNN-based object detection systems use the VGG-16 architecture to extract features from the input image [10, 39, 49]. However, recent work has demonstrated that better results can be obtained with more modern feature extractors . In a related work, we have shown that an object detection network with the ResNet-101 feature extractor results in a higher bounding-box prediction accuracy on the GDXray Castings dataset, than the same object detection network with a VGG-16 feature extractor . Therefore, the ResNet-101 architecture is chosen as the backbone for the feature extraction module. The neural-network architecture of the feature extraction module is shown in Table 1. Some feature maps from the feature extraction module are shown in Figure 6.
The ResNet-101 feature extractor is a very deep convolutional neural network with 101 trainable layers and approximately 27 million parameters. Hence, it is unlikely that the network can be trained to extract meaningful features from input images, using the relatively small GDXray dataset. One interesting property of CNN-based feature extractors is that the features they generate often transfer well across different image processing tasks. This property is leveraged when training the proposed casting defect detection system, by first training the feature extractor on the large ImageNet dataset. Throughout the training process the feature extractor learns to extract many different types of features, only some of which are useful on the comparatively simpler casting defect detection task. When training the object detection network on the GDXray Castings dataset, the system learns which features correlate well with casting defects and discards unneeded features. This process tends to work well, as it is much easier to discard unneeded features than it is to learn entirely new features.
Table 1: Architecture of the feature extraction module. Columns: Layer Name, Filter Size, Output Size.
IV-B Region Proposal Network
The second module in the proposed defect detection system is the region proposal network (RPN). The RPN takes a feature map of any size as input and outputs a set of rectangular object proposals, each with a score describing the likelihood that the region contains an object. To generate region proposals, a small CNN is convolved with the output of the ResNet-101 feature extractor. The input to this small CNN is an $n \times n$ spatial window of the ResNet-101 feature map. At a high level, the output of the RPN is a vector describing the bounding box coordinates and likelihood of objects at the current sliding position. An example output containing 50 region proposals is shown in Figure 7.
Anchor Boxes: Casting defects come in a range of different scales and aspect ratios. To accurately identify casting defects, it is necessary to evaluate boxes with a range of shapes at every location in the image. These boxes are commonly referred to as anchor boxes. Anchors vary in aspect ratio and scale, so as to contain any potential object in the image. At each sliding location, the RPN estimates the likelihood that each anchor box contains an object. The anchor boxes for one position in the feature map are shown in Figure 8. In this work, anchor boxes with 5 scales and 3 aspect ratios are used, yielding $k = 15$ anchors at each sliding position. The total number of anchors in each image depends on the size of the image. For a convolutional feature map of size $W \times H$ (typically 42,400), there are $WHk$ anchors in total.
The size and scale of the anchor boxes are chosen to match the size and scale of objects in the dataset. It is common to use anchor boxes with areas of $128^2$, $256^2$, and $512^2$ pixels and aspect ratios of 1:1, 1:2, and 2:1, for detection of common objects like people and cars. However, many of the casting defects in the GDXray dataset are on the scale of pixels. Therefore, the smallest anchor box is chosen to be pixels. Aspect ratios 1:1, 1:2, and 2:1 are used. Scale factors of 1, 2, 4, 8, and 16 are used. Most defects in the dataset are smaller than pixels, so using scales 1, 2, and 4 could be considered sufficient for the defect detection task. However, the object detection network is pretrained on a dataset with many large objects, so the larger scales are included to avoid restricting the system during the pretraining phase.
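The anchor set for one sliding position can be enumerated as follows. This is a sketch under stated assumptions: `BASE_SIZE` is a hypothetical edge length for the smallest anchor, each aspect ratio is read as width divided by height, and every anchor at a given scale keeps the same area. The scale factors and aspect ratios are the ones listed above.

```python
import math

def make_anchors(base_size, scales, aspect_ratios):
    """Return (width, height) pairs for every scale and aspect-ratio combination.
    Each anchor has area (base_size * scale)**2; ratio is width / height."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            height = math.sqrt(area / ratio)
            width = ratio * height
            anchors.append((width, height))
    return anchors

def total_anchors(feature_w, feature_h, anchors_per_position):
    """Total anchor count for a feature map with one anchor set per location."""
    return feature_w * feature_h * anchors_per_position

BASE_SIZE = 16  # hypothetical smallest anchor edge, in pixels
anchors = make_anchors(BASE_SIZE, scales=[1, 2, 4, 8, 16], aspect_ratios=[1.0, 0.5, 2.0])
count = total_anchors(48, 48, len(anchors))  # 48 x 48 is an illustrative feature-map size
```

Five scales and three aspect ratios give 15 anchors per position, so the total grows linearly with the number of feature-map locations.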
Architecture: The RPN predicts the bounding box coordinates and the probability that the box contains an object, for all $k$ anchor boxes at each sliding position ($k = 15$ in this work). The $n \times n$ input from the feature extractor is first mapped to a lower-dimensional feature vector (512-d) using a fully connected neural network layer. This feature vector is fed into two sibling fully connected layers: a box-regression layer (loc) and a box-classification layer (class). The class layer outputs $2k$ scores that estimate the probability of object and not object for each anchor box. The loc layer has $4k$ outputs, which encode the coordinate adjustments for each of the $k$ boxes. The reader is referred to the literature for a detailed description of the neural network architecture. The probability that an anchor box contains an object is referred to as the objectness score of the anchor box. This objectness score can be thought of as a way to distinguish objects in the image from the background. At the end of the region proposal stage, the top n anchor boxes are selected by objectness score as the region proposals.
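The final selection step amounts to ranking anchors by objectness score. The sketch below shows only that ranking; the function name and box format are illustrative, and practical systems usually also suppress heavily overlapping boxes, which is omitted here.

```python
def select_proposals(anchors, objectness, top_n):
    """Return the top_n anchors ranked by descending objectness score."""
    order = sorted(range(len(anchors)), key=lambda i: objectness[i], reverse=True)
    return [anchors[i] for i in order[:top_n]]

# Boxes as (x_min, y_min, x_max, y_max) with made-up scores.
anchors = [(0, 0, 16, 16), (8, 8, 40, 40), (4, 4, 20, 36)]
objectness = [0.10, 0.85, 0.60]
proposals = select_proposals(anchors, objectness, top_n=2)
```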
Training: Training the RPN involves minimizing a combined classification and regression loss, which is now described. For each anchor, $a$, the best matching defect bounding box $b$ is selected using the intersection over union (IoU) metric. If such a match is found, then $a$ is assumed to contain a defect and it is assigned a ground-truth class label $y_a = 1$. In this case, a vector encoding of box $b$ with respect to anchor $a$ is created, and denoted $\phi(b; a)$. If no match is found, then $a$ does not contain a defect and the class label is set to $y_a = 0$. At training time, the location loss function captures the distance between the true location of a bounding box and the location of the region proposal. The location-based loss for $a$ is expressed as a function of the predicted box encoding $f_{loc}(I; a, \theta)$ and ground truth $\phi(b; a)$:

$$L_{loc} = \ell_1^{smooth}\big(\phi(b; a) - f_{loc}(I; a, \theta)\big) \tag{2}$$

where $I$ is the image, $\theta$ is the model parameters, and $\ell_1^{smooth}$ is the smooth L1 loss function, as defined in the literature. The box encoding of box $b$ with respect to $a$ is a vector:

$$\phi(b; a) = \left[\frac{x_c}{w_a}, \frac{y_c}{h_a}, \log w, \log h\right]^T \tag{3}$$

where $x_c$ and $y_c$ are the center coordinates of the box, $w$ is the box width, and $h$ is the box height. $w_a$ and $h_a$ are the width and height of the anchor $a$. The geometry of an anchor, predicted bounding box, and a ground truth box is shown diagrammatically in Figure 9. The classification loss is expressed as a function of the predicted class $f_{cls}(I; a, \theta)$ and $y_a$:

$$L_{cls} = L_{CE}\big(y_a, f_{cls}(I; a, \theta)\big) \tag{4}$$

where $L_{CE}$ is the cross-entropy loss function. The total loss for $a$ is expressed as the weighted sum of the location-based loss and the classification loss:

$$L(a, I; \theta) = \alpha \cdot \mathbb{1}[y_a = 1] \cdot L_{loc} + \beta \cdot L_{cls} \tag{5}$$

where $\alpha$, $\beta$ are weights chosen to balance localization and classification losses, and the indicator $\mathbb{1}[y_a = 1]$ applies the location loss only to anchors matched to a defect. To train the object detection model, (5) is averaged over the set of anchors and minimized with respect to parameters $\theta$.
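The per-anchor loss can be sketched as below. Two points are assumptions rather than details confirmed by this article: the box encoding follows the parameterization popularized by Faster R-CNN (center offsets normalized by the anchor size, log-scaled width and height ratios), and the smooth L1 penalty uses the common form that is quadratic below 1 and linear above. The function names and the default weights are illustrative.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty for one residual: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def encode_box(box, anchor):
    """Encode a box (x_c, y_c, w, h) relative to an anchor (x_a, y_a, w_a, h_a).
    Assumed Faster R-CNN style parameterization."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return [(x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha)]

def anchor_loss(pred_encoding, pred_object_prob, box, anchor, alpha=1.0, beta=1.0):
    """Weighted sum of location and classification loss for one anchor.
    box is None for a negative anchor (no matching defect)."""
    label = 0 if box is None else 1
    p = pred_object_prob if label == 1 else 1.0 - pred_object_prob
    cls_loss = -math.log(p)                      # binary cross-entropy for the true label
    loc_loss = 0.0
    if label == 1:                               # location loss only for positive anchors
        target = encode_box(box, anchor)
        loc_loss = sum(smooth_l1(t - q) for t, q in zip(target, pred_encoding))
    return alpha * loc_loss + beta * cls_loss

anchor = (50.0, 50.0, 20.0, 20.0)
box = (55.0, 50.0, 20.0, 40.0)
target = encode_box(box, anchor)
perfect = anchor_loss(target, 1.0, box, anchor)       # exact box, fully confident
negative = anchor_loss([0.0] * 4, 0.5, None, anchor)  # background anchor, undecided score
```

Averaging this quantity over all anchors in a mini-batch gives the objective that the optimizer minimizes.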
Transfer Learning: The RPN is an ideal candidate for the application of transfer learning, as it identifies regions of interest (RoIs) in images, rather than identifying particular types of objects. Transfer learning is a machine learning technique where information that is learned in one setting is exploited to improve generalization in another setting. It has been shown that transfer learning is particularly applicable for domain-specific tasks with limited training data [51, 52]. When training an object detection network on a large dataset with many classes, the RPN learns to identify subsections of the image that likely contain an object, without discriminating by object class. This property is leveraged by first pretraining the object detection system on a large dataset with many classes of objects, namely the Microsoft Common Objects in Context (COCO) dataset . Interestingly, when the RPN from the trained object detection system is applied to an X-ray image, it immediately identifies casting defects amongst other interesting regions of the image. The output of the RPN after training solely on the COCO dataset is shown in Figure 10.
IV-C Region-Based Detector
Thus far the defect detection system is able to select a fixed number of region proposals from the original image. This section describes how a region-based detector (RBD) is used to classify the casting defects in each region, and fine-tune the bounding box coordinates. The RBD is based on the Faster R-CNN object detection network .
The input to the RBD is cropped from the output of the ResNet-101 feature extractor, according to the shape of the regressed bounding box. Unfortunately, the size of the input is dependent on the size of the bounding box. To address this issue, an RoIAlign layer is used to convert the input to a fixed-length feature vector. RoIAlign works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of size approximately $h/H \times w/W$. Bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each sub-window. The reader is referred to the literature for a more detailed description of the RoIAlign layer. The resulting feature vector has spatial dimensions $H \times W$, regardless of the input size.
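The sampling scheme can be sketched for a single-channel feature map. The sketch assumes the four sample points sit at the quarter positions of each sub-window and that their values are averaged; the function names and the RoI coordinate convention are illustrative, and this is not the implementation used in the experiments.

```python
def bilinear(feature_map, y, x):
    """Bilinearly interpolate a 2-D feature map at a fractional (y, x) location."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)            # clamp the sample point to the map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, out_h, out_w):
    """Pool an RoI (y_min, x_min, y_max, x_max, in feature-map coordinates)
    to a fixed out_h x out_w grid. Each output cell averages four regularly
    spaced bilinear samples taken inside its sub-window."""
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_h
    bin_w = (x_max - x_min) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            samples = [bilinear(feature_map,
                                y_min + (i + sy) * bin_h,
                                x_min + (j + sx) * bin_w)
                       for sy in (0.25, 0.75) for sx in (0.25, 0.75)]
            row.append(sum(samples) / 4.0)
        output.append(row)
    return output

# A horizontal ramp: the value at each location equals its column index.
ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
aligned = roi_align(ramp, roi=(0.0, 0.0, 2.0, 2.0), out_h=2, out_w=2)
```

Because the sample locations are never rounded to integer pixel positions, the output varies smoothly with the RoI coordinates, and its size depends only on `out_h` and `out_w`.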
Each feature vector from the RoIAlign layer is fed into a sequence of convolutional and fully connected layers. In the proposed defect detection system, the RBD contains two convolutional layers and two fully connected layers. The last fully connected layer produces two output vectors: The first vector contains probability estimates for each of the object classes plus a catch-all “background” class. The second vector encodes refined bounding-box positions for one of the classes. The RBD is trained by minimizing a joint regression and classification loss function, similar to the one used for the RPN. The reader is referred to  for a detailed description of the loss function and training process. The output of the RBD for a single image is shown in Figure 10.
Defect Segmentation: Instance segmentation is performed by predicting a segmentation mask for each RoI. The prediction of segmentation masks is performed using another CNN, referred to as the instance segmentation network. The input to the segmentation network is a block of features cropped from the output of the feature extractor. The instance segmentation network has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. The instance segmentation network is shown alongside the RBD in Figure 11.
During training, a per-pixel sigmoid function is applied to the output of the instance segmentation network. The loss function $L_{mask}$ is defined as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask (other mask outputs do not contribute to the loss). This definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes. It follows that the instance segmentation network can be trained by minimizing the joint RBD and mask loss. At test time, one mask is predicted for each class ($K$ masks in total). However, only the $k$-th mask is used, where $k$ is the class predicted by the classification branch of the RBD. The $m \times m$ floating-number mask output is then resized to the RoI size, and binarized at a fixed threshold. Some example masks are shown in Figure 12.
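The per-class mask loss and the test-time mask selection can be sketched as follows. The default threshold of 0.5 is an assumed value for this example, the resize to the RoI size is omitted, and the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_loss(mask_logits, true_class, true_mask):
    """Average binary cross-entropy on the mask of the ground-truth class only.
    mask_logits[c] is the m x m grid of raw network outputs for class c."""
    logits = mask_logits[true_class]          # masks of other classes do not contribute
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, true_mask):
        for logit, target in zip(logit_row, target_row):
            p = sigmoid(logit)
            total += -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
            count += 1
    return total / count

def predict_mask(mask_logits, predicted_class, threshold=0.5):
    """Select the mask of the predicted class and binarize its per-pixel probabilities."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row]
            for row in mask_logits[predicted_class]]

# Two classes with 2 x 2 masks; class 0 is undecided everywhere, class 1 is confident.
mask_logits = [[[0.0, 0.0], [0.0, 0.0]],
               [[4.0, -4.0], [-4.0, 4.0]]]
binary = predict_mask(mask_logits, predicted_class=1)
loss = mask_loss(mask_logits, true_class=0, true_mask=[[1, 0], [0, 1]])
```

Since each class has its own sigmoid output, the mask for one class is never penalized for what another class predicts at the same pixel.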
V Implementation Details and Experimental Results
This section describes the implementation of the casting defect detection system described in the previous section. The model is primarily trained and evaluated using images from the GDXray dataset . The Castings series of this dataset contains 2727 X-ray images mainly from automotive parts, including aluminum wheels and knuckles. The casting defects in each image are labelled with tight fitting bounding-boxes. The size of the images in the dataset ranges from pixels to pixels. To ensure the results are consistent with previous work, the training and testing data is divided in the same way as described in .
The model is trained in a manner similar to many other modern object detection networks, such as Faster R-CNN and Mask R-CNN [39, 12]. However, several adjustments are made to account for the small size of casting defects, and the limited number of images in the GDXray dataset. Images are scaled so that the longest edge is no larger than 768 pixels. Images are then padded with black pixels to a size of pixels. Additionally, the images are randomly flipped horizontally and vertically at training time. No other form of preprocessing is applied to the images at training or testing time.
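The scaling and flipping steps can be sketched as below. The sketch assumes that images already within the 768-pixel limit are left at their original size, which the text does not state explicitly, and it leaves out the padding step. Function names are illustrative.

```python
def scaled_size(width, height, max_edge=768):
    """Shrink (width, height) so the longest edge is at most max_edge,
    keeping the aspect ratio. Images that already fit are left unchanged."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    factor = max_edge / longest
    return max(1, round(width * factor)), max(1, round(height * factor))

def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror an image (a list of pixel rows) top to bottom."""
    return image[::-1]

new_size = scaled_size(1536, 768)
flipped = flip_vertical(flip_horizontal([[1, 2], [3, 4]]))
```

When flips are used for augmentation, the bounding boxes and masks attached to the image need to be flipped in the same way.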
Transfer learning is used to reduce the total training time and improve the accuracy of the trained models, as depicted in Figure 13. The ResNet-101 feature extractor is initialized using weights from a ResNet-101 network that was trained on the ImageNet dataset. The defect detection system is then trained on the COCO dataset . When pretraining the model, the learning rates are adjusted to the schedule outlined in . Training on the relatively large COCO dataset ensures that each model is initialized to localize common objects before it is trained to localize defects. Training on the COCO dataset is conducted using 8 NVIDIA K80 GPUs. Each mini-batch has 2 images per GPU and each image has 100 sampled RoIs, with a ratio of 1:3 of positive to negatives. As in Faster R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise.
The defect detection system is then fine-tuned on the GDXray dataset as follows: The output layers of the RBD and instance segmentation layers are resized, as they return predictions for the 80 object classes in the COCO dataset. More specifically, the output shape of these layers is resized to accommodate for two output classes, namely “Casting Defect” and “Background”. The weights of the resized layers are initialized randomly using a Gaussian distribution with zero mean and a 0.01 standard deviation. The defect detection system is trained on the GDXray dataset for 80 epochs, holding all parameters fixed except those of the output layers. The defect detection system is then trained further for an additional 80 epochs, without holding any weights fixed.
The defect detection system is evaluated on a desktop computer with a 3.6 GHz Intel Xeon E5 processor (8 CPU cores), 32 GB RAM, and a single NVIDIA GTX 1080 Ti Graphics Processing Unit (GPU). The models are evaluated with the GPU enabled and with the GPU disabled. For each image, the top 600 region proposals are selected by objectness score from the RPN and evaluated using the RBD. Masks are only predicted for the top 100 bounding boxes from the RBD. The proposed defect detection system is trained with and without the instance segmentation module, to investigate whether the inclusion of the instance segmentation module changes bounding box prediction accuracy. The accuracy of the system is evaluated using the GDXray Castings dataset. Every image in the testing dataset is processed individually (no batching). The accuracy of each model is evaluated using the mean average precision (mAP) as a metric . The IoU metric is used to determine whether a bounding box prediction is to be considered correct. To be considered a correct detection, the overlap between the predicted bounding box and the ground-truth bounding box must exceed a threshold, where the overlap is measured according to the formula:
$$\mathrm{IoU} = \frac{\mathrm{area}(B_{p} \cap B_{gt})}{\mathrm{area}(B_{p} \cup B_{gt})}$$
where $B_{p} \cap B_{gt}$ denotes the intersection of the predicted and ground-truth bounding boxes and $B_{p} \cup B_{gt}$ denotes their union. The average precision is reported for both the bounding box prediction ($\mathrm{mAP}^{\mathrm{bbox}}$) and the segmentation mask prediction ($\mathrm{mAP}^{\mathrm{mask}}$).
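The IoU test for a single prediction can be written in a few lines. This is a minimal sketch: the `(x1, y1, x2, y2)` box format and the function names are assumptions, and the default threshold of 0.5 is the commonly used value rather than one confirmed by the text here.

```python
def box_iou(pred, truth):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred, truth, threshold=0.5):
    """True if the predicted box overlaps the ground truth strongly enough.

    threshold=0.5 is an assumed default, not a value taken from the text.
    """
    return box_iou(pred, truth) > threshold
```

For example, two 10 x 10 boxes offset horizontally by 5 pixels share an area of 50 out of a union of 150, giving an IoU of 1/3, which would not count as a correct detection at a threshold of 0.5.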
V-C Main Results
As shown in Table 2, the speed and accuracy of the defect detection system are compared to those of similar systems from previous research . The proposed defect detection system exceeds the previous state-of-the-art performance on casting defect detection, reaching an $\mathrm{mAP}^{\mathrm{bbox}}$ of 0.957. Some example outputs from the trained defect detection system are shown in Figure 14. The proposed defect detection system exceeds the Faster R-CNN model from  in terms of accuracy and evaluation time. The improvement in accuracy is thought to be largely due to benefits arising from joint prediction of bounding boxes and segmentation masks. Both systems take a similar amount of time to evaluate on the CPU, but the proposed system is faster than the Faster R-CNN system when evaluated on a GPU. This difference probably arises because our implementation of Mask R-CNN is more efficient at leveraging the parallel processing capabilities of the GPU than the Faster R-CNN implementation used in . It should be noted that single-stage detection systems, such as the SSD ResNet-101 system proposed in , have a significantly faster evaluation time than the defect detection system proposed in this article.
When the proposed defect detection system is trained without the segmentation module, the system only reaches an $\mathrm{mAP}^{\mathrm{bbox}}$ of 0.931. That is, the bounding-box prediction accuracy of the proposed defect detection system is higher when the system is trained simultaneously on the casting defect detection and casting defect instance segmentation tasks. This is a common benefit of multi-task learning which is well-documented in the literature [12, 39, 49]. The accuracy is improved when both tasks are learned in parallel, as the bounding box and segmentation modules use a shared representation of the input image (from the feature extractor) . However, it should be noted that the proposed system is approximately 12 % slower when simultaneously performing object detection and image segmentation. The memory requirements at training and testing time are also higher when object detection and instance segmentation are performed simultaneously than for pure object detection. For inference, the GPU memory requirement for simultaneous object detection and instance segmentation is 9.72 gigabytes, which is 9 % higher than that for object detection alone.
|Method||Evaluation time per image using CPU [s]||Evaluation time per image using GPU [s]|
|Defect detection system (Object detection only)|
|Defect detection system (Detection & segmentation)|
V-D Error Analysis
The proposed system makes very few misclassifications on the GDXray Castings test dataset. In this section, two example misclassifications are presented and discussed. Figure 15 provides an example where the defect detection system produces a false positive detection. In this case, the proposed defect detection system identifies a region of the X-ray image which appears to be a defect in the X-ray machine itself. This defect is not included in the GDXray Castings dataset, and hence the detection is counted as a misclassification. Similar errors could be avoided in future systems by removing bounding box predictions which lie outside the object being imaged. Figure 16 provides an example where the bounding box coordinates are incorrectly predicted, resulting in a misclassification according to the IoU metric. However, it should be noted that the label in this case is particularly subjective; the ground truth could alternatively be labelled as two small defects rather than one large one.
During the development of the proposed casting defect detection system, a number of experiments were conducted to better understand the system. This section presents the results of these experiments, and discusses the properties of the proposed system.
VI-A Speed / Accuracy Tradeoff
There is an inherent tradeoff between speed and accuracy in most modern object detection systems . The number of region proposals selected for the RBD is known to affect the speed and accuracy of object detection networks based on the Faster R-CNN framework [12, 39, 49]. Increasing the number of region proposals decreases the chance that an object will be missed, but it increases the computational demand when evaluating the network. Researchers typically achieve good results on complex object detection tasks using 3000 region proposals. A number of tests were conducted to find a suitable number of region proposals for the defect detection task. Figure 17 shows the relationship between accuracy, evaluation time and the number of region proposals. Based on these results, the use of 600 region proposals is considered to provide a good balance between speed and accuracy.
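The proposal budget acts as a simple cut-off on the ranked output of the RPN. The sketch below shows that selection step in isolation; the function name and the plain-list inputs are assumptions for illustration, and the default of 600 is the value chosen in this study.

```python
def select_top_proposals(proposals, scores, max_proposals=600):
    """Keep the proposals with the highest objectness scores.

    proposals and scores are parallel lists; the result is ordered from the
    highest score to the lowest. A smaller max_proposals lowers the cost of
    the second stage but raises the chance of missing a defect.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [proposals[i] for i in order[:max_proposals]]
```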
VI-B Data Requirements
As with many deep learning tasks, a large amount of labelled data is needed to train an accurate classifier. To evaluate how the size of the training dataset influences the model accuracy, the defect detection system is trained several times, each time with a different amount of training data. The $\mathrm{mAP}^{\mathrm{bbox}}$ and $\mathrm{mAP}^{\mathrm{mask}}$ performance of each trained system is observed. Figure 18 shows how the amount of training data affects the accuracy of the trained defect detection system. The object detection accuracy ($\mathrm{mAP}^{\mathrm{bbox}}$) and segmentation accuracy ($\mathrm{mAP}^{\mathrm{mask}}$) improve significantly when the size of the training dataset is increased from 1100 to 2308 images. It also appears that a larger amount of training data is required to obtain satisfactory instance segmentation performance than to obtain satisfactory defect detection performance. Extrapolating from Figure 18 suggests that a higher mAP could be achieved with a larger training dataset.
VI-C Training Set Augmentation
|Horizontal Flip||Vertical Flip||Gaussian Blur||Gaussian Noise||Random
It is well-documented that training data augmentation can be used to artificially increase the size of training datasets and, in some cases, lead to increased prediction accuracy [12, 49]. The effect of several common image augmentation techniques on testing accuracy is evaluated in this section. Random horizontal flipping is a technique where images are randomly flipped horizontally at training time. This technique tends to be beneficial when training CNNs, as the label of an object is agnostic to horizontal flipping. On the other hand, vertical flipping is less common, as many objects, such as cars and trains, seldom appear upside-down. Gaussian blur is a common technique in image processing, as it helps to reduce random noise that may have been introduced by the camera or image compression algorithm . In this study, the Gaussian blur augmentation technique involves convolving each training image with a Gaussian kernel with a standard deviation of 1.0 pixels. Adding Gaussian noise to the training images is also a common technique for improving the robustness of the trained model to noise in the input images . In this study, zero-mean Gaussian noise with a standard deviation equal to 0.05 of the image dynamic range is added to each image. In this context, the dynamic range of the image is defined as the range between the darkest pixel and the lightest pixel in the image. The augmentation techniques are applied during the training phase only, with the original images being used at test time.
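The augmentation operations described above can be sketched with NumPy for a single-channel image. This is an illustrative sketch, not the authors' pipeline: the function names, the edge-replicating padding, the kernel truncation at four standard deviations, and the `(x1, y1, x2, y2)` box format are all assumptions. Note that flipping an image for a detection task also requires mirroring its bounding boxes (and masks).

```python
import numpy as np


def hflip_with_boxes(image, boxes):
    """Flip a 2-D image left-right and mirror (x1, y1, x2, y2) boxes to match."""
    width = image.shape[1]
    flipped = image[:, ::-1]
    return flipped, [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]


def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel covering +/- truncate * sigma pixels."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur of a 2-D image (sigma given in pixels)."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    # Replicate edge pixels so the output keeps the input shape (assumption).
    padded = np.pad(image.astype(np.float64), radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def add_gaussian_noise(image, relative_std=0.05, rng=None):
    """Add zero-mean noise whose std is relative_std times the dynamic range."""
    if rng is None:
        rng = np.random.default_rng(0)
    image = image.astype(np.float64)
    dynamic_range = image.max() - image.min()  # lightest minus darkest pixel
    noise = rng.normal(0.0, relative_std * dynamic_range, size=image.shape)
    # The result is not clipped; whether to clip is left open here.
    return image + noise
```

These functions would be applied to training images only, leaving the test images unchanged.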
VI-D Transfer Learning
This study hypothesizes that transfer learning is largely responsible for the high prediction accuracy obtained by the proposed defect detection system. The system is able to generate meaningful image features and good region proposals for GDXray casting images before it is trained on the GDXray Castings dataset. This is made possible by initializing the ResNet feature extractor using weights pretrained on the ImageNet dataset and subsequently training the defect detection system on the COCO dataset. To test the influence of transfer learning, three training schemes are tested. In training scheme (a), the proposed defect detection system is trained on the GDXray Castings dataset without pretraining on the ImageNet or COCO datasets. Xavier initialization  is used to randomly assign the initial weights to the feature extraction layers. In training scheme (b), the same training process is repeated, but the feature extractor weights are initialized using weights pretrained on the ImageNet dataset. Training scheme (c) uses pretrained ImageNet weights and COCO pretraining, as described in the “Defect Detection System” section.
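For reference, the random initialization used in training scheme (a) can be sketched as below. This shows the widely used uniform variant of Xavier (Glorot) initialization for a fully connected layer; whether the study used the uniform or the normal variant is not stated, so the choice here is an assumption, as are the function and argument names.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, seed=0):
    """Weights drawn from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out)).

    The resulting weight variance is 2 / (fan_in + fan_out). For a
    convolutional layer, fan_in and fan_out are commonly taken as the kernel
    area times the number of input and output channels, respectively.
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```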
In Table 4, each trained system is evaluated on the GDXray Castings test dataset. Training scheme (a) does not leverage transfer learning, and hence the resulting system obtains a low $\mathrm{mAP}^{\mathrm{bbox}}$ of 0.651 on the GDXray Castings test dataset. In training scheme (b), the feature extractor is initialized using pretrained ImageNet weights, and hence the system obtains a higher $\mathrm{mAP}^{\mathrm{bbox}}$ of 0.874 on the same dataset. By fully leveraging transfer learning, training scheme (c) leads to a system that obtains an $\mathrm{mAP}^{\mathrm{bbox}}$ of 0.957, as described earlier. In Table 4, the mAP of the trained systems is also reported on the GDXray Castings training dataset. In all cases, the model fits the training data closely, demonstrating that transfer learning affects the system’s ability to generalize predictions to unseen images rather than its ability to fit the training dataset.
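|Training Scheme||Feature Extractor Initialization||Pretraining on MS COCO Dataset||GDXray Castings Training Set $\mathrm{mAP}^{\mathrm{bbox}}$||GDXray Castings Training Set $\mathrm{mAP}^{\mathrm{mask}}$||GDXray Castings Test Set $\mathrm{mAP}^{\mathrm{bbox}}$||GDXray Castings Test Set $\mathrm{mAP}^{\mathrm{mask}}$|
|a||Xavier Initialization  (Random)||No||0.970||0.960||0.651||0.420|
|b||Pretrained ImageNet Weights||No||1.00||0.981||0.874||0.721|
|c||Pretrained ImageNet Weights||Yes||1.00||0.991||0.957||0.930|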
VI-E Weld Defect Segmentation with Multi-Class Learning
The ability to generalize a model to multiple tasks is highly beneficial in a number of applications. The proposed defect detection system was retrained on both the GDXray Castings dataset and the GDXray Welds dataset. The GDXray Welds dataset contains 88 annotated high-resolution X-ray images of welds, ranging from 3176 to 4998 pixels wide. Each high-resolution image is divided horizontally into 8 smaller images for testing and training, yielding a total of 704 images. Of these images, 80 % are randomly assigned to the training set, with the remaining 20 % assigned to the testing set. Unlike the GDXray Castings dataset, the GDXray Welds dataset is only annotated with segmentation masks. Bounding boxes are fitted to the segmentation masks by identifying closed shapes in the mask using a binary border following algorithm , and wrapping each shape in a tightly fitting bounding box. The defect detection system is simultaneously trained on images from the Castings and Welds training sets. The defect detection system is able to simultaneously identify casting defects and welding defects, reaching a segmentation accuracy of 0.850 on the GDXray Welds test dataset. Some example predictions are shown in Figure 19. The detection and segmentation of welding defects can be considered very accurate, especially given the small size of the GDXray Welds dataset with only 88 high-resolution images. Unfortunately, there is no measurable improvement in the accuracy of casting defect detection when jointly training on both datasets.
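Fitting bounding boxes to a segmentation mask can be illustrated with a small, dependency-free sketch. Note that it swaps in a flood fill over 8-connected regions in place of the border following algorithm used in the study; for the purpose of wrapping each shape in a tight box, both approaches identify the same pixel extents. The function name, the nested-list mask format, and the inclusive pixel coordinates are assumptions.

```python
from collections import deque


def mask_to_boxes(mask):
    """Return one tight (x_min, y_min, x_max, y_max) box per connected shape.

    mask is a list of rows of 0/1 values; coordinates are inclusive pixel
    indices. Shapes are 8-connected regions of non-zero pixels.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```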
VI-F Defect Detection on Other Datasets Using Zero-Shot Learning
A good defect detection system should be able to classify defects for a wide range of different objects. The defect detection system can be said to generalize well if it is able to detect defects in objects that do not appear in the training dataset. In the field of machine learning, zero-shot transfer is the process of taking a trained model, and using it, without retraining, to make predictions on an entirely different dataset. To test the generalization properties of the proposed defect detection system, the trained system is tested on a range of X-ray images from other sources. The system correctly identifies a number of defects in a previously unseen X-ray image of a jet turbine blade, as shown in Figure 20. The jet turbine blade contains five casting defects, of which four are identified correctly. It is unsurprising that the system fails to identify one of the casting defects in the image, as there are no jet engine turbine blades in the GDXray dataset. Nonetheless, the fact that the system can identify defects in images from different datasets demonstrates its potential for generalizability and robustness.
VII Summary and Conclusion
This work presents a defect detection system for simultaneous detection and segmentation of defects in metal castings. This ability to simultaneously perform defect detection and segmentation makes the proposed system suitable for a range of automated quality control applications. The proposed defect detection system exceeds state-of-the-art performance for defect detection on the GDXray Castings dataset, obtaining a mean average precision ($\mathrm{mAP}^{\mathrm{bbox}}$) of 0.957, and establishes a new benchmark for instance segmentation on the same dataset. This high-accuracy system is developed by leveraging a number of powerful paradigms in machine learning, including transfer learning, dataset augmentation, and multi-task learning. The benefit of each of these paradigms was evaluated quantitatively through extensive ablation testing.
The defect detection system described in this work is able to detect casting and welding defects with very high accuracy. Future work could involve training the same network to detect defects in other materials such as wood or glass. The proposed defect detection system was designed for multi-class detection, so the system could naturally be extended to detect a range of different defect types in multiple materials. The defect detection system described in this work could also be trained to detect defects in additive manufacturing applications.
The proposed defect detection system is accurate and performant enough to be useful in a real manufacturing setting. However, the training process for the system is complex and computationally expensive. Future work could focus on developing a standardized method of representing these models, making it easier to distribute the trained models.
The authors acknowledge the support by the Smart Manufacturing Systems Design and Analysis Program at the National Institute of Standards and Technology (NIST), US Department of Commerce. This work was performed under the financial assistance award (NIST Cooperative Agreement 70NANB17H031) to Stanford University. Certain commercial systems are identified in this article. Such identification does not imply recommendation or endorsement by NIST; nor does it imply that the products identified are necessarily the best available for the purpose. Further, any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NIST or any other supporting U.S. government or corporate organizations.
-  T. R. Rao, Metal casting: Principles and practice. New Age International.
-  K. I. , “The future of manufacturing: 2020 and beyond,” p. 12.
-  R. Rajkolhe and J. Khan, “Defects, causes and their remedies in casting process: A review,” International Journal of Research in Advent Technology, vol. 2, no. 3, pp. 375–383, 2014.
-  X. Li, S. K. Tso, X.-P. Guan, and Q. Huang, “Improving automatic detection of defects in castings by applying wavelet technique,” IEEE Transactions on Industrial Electronics, vol. 53, no. 6, pp. 1927–1934, 2006.
-  S. Ghorai, A. Mukherjee, M. Gangadaran, and P. K. Dutta, “Automatic defect detection on hot-rolled flat steel products,” IEEE Transactions on Instrumentation and Measurement, vol. 62, no. 3, pp. 612–621, 2013.
-  I. Baillie, P. Griffith, X. Jian, and S. Dixon, “Implementing an ultrasonic inspection system to find surface and internal defects in hot, moving steel using EMATs,” vol. 49, no. 2, pp. 87–92.
-  M. Lovejoy, Magnetic particle inspection: a practical guide. Springer Science & Business Media.
-  E. Masad, V. Jandhyala, N. Dasgupta, N. Somadevan, and N. Shashidhar, “Characterization of air void distribution in asphalt mixes using x-ray computed tomography,” vol. 14, no. 2, pp. 122–129.
-  N. Pinto, D. D. Cox, and J. J. DiCarlo, “Why is real-world visual object recognition hard?,” vol. 4, no. 1, p. e27.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE conference on computer vision and pattern recognition, pp. 580–587.
-  J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems (NIPS 2016), pp. 379–387.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-CNN,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, IEEE.
-  D. Mery, V. Riffo, U. Zscherpel, G. Mondragón, I. Lillo, I. Zuccar, H. Lobel, and M. Carrasco, “GDXray: The database of x-ray images for nondestructive testing,” vol. 34, no. 4, p. 42.
-  M. Piccardi, “Background subtraction techniques: a review,” in 2004 IEEE international conference on Systems, man and cybernetics (SMC), vol. 4, pp. 3099–3104, IEEE.
-  V. Rebuffel, S. Sood, and B. Blakeley, “Defect detection method in digital radiography for porosity in magnesium castings,”
-  N. Nacereddine, M. Zelmat, S. S. Belaifa, and M. Tridi, “Weld defect detection in industrial radiography based digital image processing,” vol. 2, pp. 145–148.
-  G. Wang and T. W. Liao, “Automatic identification of different types of welding defects in radiographic images,” vol. 35, no. 8, pp. 519–528.
-  V. Kaftandjian, A. Joly, T. Odievre, C. Courbiere, and C. Hantrais, “Automatic detection and characterization of aluminium weld defects: comparison between radiography, radioscopy and human interpretation,” pp. 1179–1186, Society of Manufacturing Engineers.
-  D. Mery, T. Jaeger, and D. Filbert, “A review of methods for automated recognition of casting defects,” vol. 44, no. 7, pp. 428–436.
-  D. S. MacKenzie and G. E. Totten, Analytical characterization of aluminum, steel, and superalloys. CRC press.
-  Y. Tang, X. Zhang, X. Li, and X. Guan, “Application of a new image segmentation method to detection of defects in castings,” vol. 43, no. 5, pp. 431–439.
-  X.-W. Zang, Y.-Q. Ding, Y.-Y. Lv, A.-Y. Shi, and R.-Y. Liang, “A vision inspection system for the surface defects of strongly reflected metal based on multi-class SVM,” vol. 38, no. 5, pp. 5930–5939.
-  V. Lashkia, “Defect detection in x-ray images using fuzzy reasoning,” vol. 19, no. 5, pp. 261–269.
-  X. Xie, “A review of recent advances in surface defect detection using texture analysis techniques,” vol. 7, no. 3.
-  J. Kittler, R. Marik, M. Mirmehdi, M. Petrou, and J. Song, “Detection of defects in colour texture surfaces.,” in IAPR Workshop on Machine Vision Applications (MVA), pp. 558–567.
-  W. Wen and A. Xia, “Verifying edges for visual inspection purposes,” vol. 20, no. 3, pp. 315–328.
-  B. Mallik-Goswami and A. K. Datta, “Detecting defects in fabric with laser-based morphological image processing,” vol. 70, no. 9, pp. 758–762.
-  C.-W. Kim and A. J. Koivo, “Hierarchical classification of surface defects on dusty wood boards,” vol. 15, no. 7, pp. 713–721.
-  M. Niskanen, O. Silvén, and H. Kauppinen, “Color and texture based wood inspection with non-supervised clustering,” in 2001 Scandinavian Conference on image analysis (SCIA 2001), pp. 336–342.
-  R. W. Conners, C. W. Mcmillin, K. Lin, and R. E. Vasquez-Espinosa, “Identifying and locating surface defects in wood: Part of an automated lumber processing system,” no. 6, pp. 573–583.
-  F. Ade, N. Lins, and M. Unser, “Comparison of various filter sets for defect detection in textiles,” in 14th International Conference on Pattern Recognition (ICPR), vol. 1, pp. 428–431.
-  S. A. Hosseini Ravandi and K. Toriumi, “Fourier transform analysis of plain weave fabric appearance,” vol. 65, no. 11, pp. 676–683.
-  J. Hu, H. Tang, K. C. Tan, and H. Li, “How the brain formulates memory: A spatio-temporal model research frontier,” vol. 11, no. 2, pp. 56–68.
-  A. Conci and C. B. Proença, “A fractal image analysis system for fabric inspection based on a box-counting method,” vol. 30, no. 20, pp. 1887–1895.
-  F. Mirzaei, M. Faridafshin, A. Movafeghi, and R. Faghihi, “Automated defect detection of weldments and castings using canny, sobel and gaussian filter edge detectors: A comparison study,”
-  D. Mery and C. Arteta, “Automatic defect recognition in x-ray testing using computer vision,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1026–1035, IEEE.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015).
-  W.-X. Ren, T. Zhao, and I. E. Harik, “Experimental and analytical modal analysis of steel arch bridge,” vol. 130, no. 7, pp. 1022–1031.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems (NIPS 2015), pp. 91–99.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in 14th European Conference on Computer Vision (ECCV 2016), pp. 21–37.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and others, “Speed/accuracy trade-offs for modern convolutional object detectors,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017).
-  M. Ferguson, R. Ak, Y.-T. T. Lee, and K. H. Law, “Automatic localization of casting defects with convolutional neural networks,” in 2017 IEEE International Conference on Big Data (Big Data 2017), pp. 1726–1735, IEEE.
-  J. Wu, Convolutional neural networks. Published online at https://cs.nju.edu.cn/wujx/teaching/15_CNN.pdf.
-  I. Sobel, “An isotropic 3x3 image gradient operator,” pp. 376–379.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in 27th international conference on Machine Learning (ICML), pp. 807–814.
-  J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” pp. 2371–2378.
-  P. J. Werbos, “Backpropagation through time: what it does and how to do it,” vol. 78, pp. 1550–1560.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE conference on computer vision and pattern recognition (CVPR 2016), pp. 770–778.
-  R. Girshick, “Fast r-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, IEEE.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and others, “Imagenet large scale visual recognition challenge,” vol. 115, no. 3, pp. 211–252.
-  Z. Kolar, H. Chen, and X. Luo, “Transfer learning and deep convolutional neural networks for safety guardrail detection in 2d images,” Automation in Construction, vol. 89, pp. 58–70, 2018.
-  Y. Gao and K. M. Mosalam, “Deep Transfer Learning for Image-Based Structural Damage Recognition,” Computer-Aided Civil and Infrastructure Engineering, vol. 33, pp. 748–768.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European conference on computer vision (ECCV 2014), pp. 740–755, Springer.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and others, “Spatial transformer networks,” in Advances in neural information processing systems, pp. 2017–2025.
-  C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 39. Cambridge University Press.
-  R. Caruana, “Multitask learning,” vol. 28, no. 1, pp. 41–75.
-  H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing and reconstruction,” vol. 16, no. 2, pp. 349–366.
-  S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the robustness of deep neural networks via stability training,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), (Las Vegas, United States), pp. 4480–4488, 2016.
-  X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
-  S. Suzuki and K. Abe, “Topological structural analysis of digitized binary images by border following,” vol. 30, no. 1, pp. 32–46.