Surface defects significantly affect the quality of industrial products. Small defects need to be detected carefully and reliably during monitoring. Catching defective products at an early stage prevents damage to a company’s reputation and additional financial loss. Surface defect detection has been increasingly studied in recent research and has improved quality control in industry. However, surface defect detection remains challenging because 1) collecting defective samples and manually labeling them for training is time-consuming; 2) defect characteristics are difficult to define, as new types of defects can appear at any time; and 3) real-world product images contain a lot of background noise. These factors make defect detection results less reliable.
In current industrial practice, defect types vary widely and defect characteristics are difficult to define. Most existing defect datasets lack defect richness and data scale: they are limited to a few product categories and a small number of samples. To ensure that our experiments are realistic and applicable, we introduce a new dataset collected from a real-world production line monitoring system. This dataset includes 21 video clips and 1634 images covering multiple types of bottles, with both good and defective samples. The bottle videos are taken from assembly-line footage provided by ZeroBox Inc.
In this paper, we propose a two-stage defect detection model based on object detection and normalizing flow-based defect detection. Given a product video as input, YOLO performs object detection to draw bounding boxes around the products in each frame. Each product image is then cropped by our model and fed into the normalizing flow-based model for training and prediction. Since the cropped product images contain a lot of background noise, our model also applies multi-scale image transformations such as image cropping and rotation to ensure high robustness and performance. We also introduce a visualization model that plots the predicted bounding box for each bottle and the predicted anomaly result on every video frame. In summary, the main contributions of this paper are:
We create a new dataset that includes various types of bottles collected from a real-world production line monitoring system.
We propose a two-stage defect detection model based on object detection and normalizing flow-based defect detection, together with a visualization model that plots the predicted bounding box for each bottle and the predicted anomaly result, using a quality control inspection video as input.
We propose an image transformation model that rotates each bottle image and crops its edges to remove the background noise surrounding the bottle.
Extensive experiments on the new dataset demonstrate the proposed model’s effectiveness and our dataset’s practicability.
II Related Work
II.a Normalizing Flows
Normalizing flows (NF) are networks that can generate complex distributions by transforming a probability density from a latent space through a series of invertible affine transformations, or ‘flows’. Based on the change-of-variables rule, the bijective mapping between feature space and latent space can be evaluated in both directions. For $x = G(z)$, the formula is:
$$p_X(x) = \pi(z) \left| \det \frac{\partial G(z)}{\partial z} \right|^{-1}$$
Here $p_X(x)$ refers to the complex distribution generated from the learned normal distribution $\pi(z)$. The magnitude of the Jacobian determinant of the function $G$ indicates how much the transformation locally stretches or squishes the area, and dividing by it is necessary to ensure that the transformed density remains a valid density.
To ensure that the mapping is invertible and tractable and that its Jacobian is easy to compute, the coupling layer was introduced by L. Dinh et al. in 2017. Given a D-dimensional input $x$ and $d < D$, the output $y$ of an affine coupling layer follows the equations:
$$y_{1:d} = x_{1:d}$$
$$y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d})$$
In the coupling layer, the input is split into two halves. The first half $x_{1:d}$ is copied directly to the output $y_{1:d}$. The second half $x_{d+1:D}$ goes through an affine transformation to generate the output $y_{d+1:D}$, where $s$ and $t$ stand for scale and translation.
After a certain number of affine coupling transformations, the complex distribution can be transformed into a simple normal distribution. Conversely, the inverse of this flow can generate the complex distribution from the learned normal distribution.
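As a minimal sketch of the affine coupling layer described above, the following Python code implements the forward and inverse passes on plain lists. The functions `toy_s` and `toy_t` are hypothetical stand-ins for the learned scale and translation networks, and the split index `d = 2` is chosen only for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Net = Callable[[Vector], Vector]


def coupling_forward(x: Vector, d: int, s: Net, t: Net) -> Tuple[Vector, float]:
    """Copy x[:d]; scale and shift x[d:] using functions of x[:d].

    Returns the output y and log|det J|, which equals the sum of the scales.
    """
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    y2 = [v * math.exp(a) + b for v, a, b in zip(x2, scale, shift)]
    return x1 + y2, sum(scale)


def coupling_inverse(y: Vector, d: int, s: Net, t: Net) -> Vector:
    """Invert the layer: y[:d] is unchanged, so s and t can be recomputed."""
    y1, y2 = y[:d], y[d:]
    scale, shift = s(y1), t(y1)
    x2 = [(v - b) * math.exp(-a) for v, a, b in zip(y2, scale, shift)]
    return y1 + x2


# Hypothetical toy networks; a real flow would learn s and t.
def toy_s(x1: Vector) -> Vector:
    return [0.5 * sum(x1)] * 2


def toy_t(x1: Vector) -> Vector:
    return [x1[0] - x1[1]] * 2


x = [1.0, 2.0, 3.0, 4.0]
y, log_det = coupling_forward(x, 2, toy_s, toy_t)
x_rec = coupling_inverse(y, 2, toy_s, toy_t)
```

Because the first part of the input passes through unchanged, the inverse can recompute the same scale and translation values, which is what makes the layer cheap to invert. The Jacobian is triangular, so its log-determinant is simply the sum of the scale outputs.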
II.b Semi-Supervised Defect Detection with Normalizing Flows
Our work is based on a normalizing flow model called DifferNet, proposed in 2020 by M. Rudolph et al.
DifferNet uses the latent space of a normalizing flow to represent the feature distribution of normal samples. Unlike other generative models such as variational autoencoders (VAEs) and GANs, the flow-based generator provides a bijective mapping between feature space and latent space, so each sample is assigned a likelihood. A scoring function can thus be derived to decide whether an image contains an anomaly. Most common samples will have a high likelihood, while uncommon images will have a lower likelihood. DifferNet requires only good product images for training, so defects are not present during training. Defective products will therefore be assigned a lower likelihood, which the scoring function can easily detect.
The coupling layer splits the input data into $y_{in,1}$ and $y_{in,2}$ and then applies a series of affine transformations that regress multiplicative (scaling function $s$) and additive (translation function $t$) components. The scale and translation operations are written as:
$$y_{out,2} = y_{in,2} \odot e^{s_1(y_{in,1})} + t_1(y_{in,1})$$
$$y_{out,1} = y_{in,1} \odot e^{s_2(y_{out,2})} + t_2(y_{out,2})$$
The exponential function is applied to the output of $s$ to ensure non-zero coefficients. The symbol $\odot$ denotes element-wise multiplication. The functions $s$ and $t$ can be any differentiable functions. In DifferNet, a fully connected network is implemented to compute $s$ and $t$ from their input.
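This model aims to find the best probability distribution $p_Z(z)$ in the latent space $Z$ to maximize the likelihood of the extracted features $y$. According to the change-of-variables formula,
$$p_Y(y) = p_Z(z) \left| \det \frac{\partial z}{\partial y} \right|$$
Taking the logarithm of both sides and using a standard normal distribution for $p_Z(z)$, the loss function can be defined as:
$$\mathcal{L}(y) = \frac{\lVert z \rVert_2^2}{2} - \log \left| \det \frac{\partial z}{\partial y} \right|$$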
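A scoring function $\tau(x)$ is used to calculate likelihoods to classify a sample as defective or normal. Rotations and manipulations of brightness and contrast are applied to the input image, and the average of the negative log-likelihoods over these transformations $T_i(x)$ gives the anomaly score:
$$\tau(x) = \mathbb{E}_{T_i \in \mathcal{T}} \left[ -\log p_Z\big(f_{NF}(f_{ex}(T_i(x)))\big) \right]$$
where $f_{ex}$ denotes the feature extractor and $f_{NF}$ the normalizing flow. The anomaly score is then compared with a threshold value $\theta$, which is learned from the training process:
$$\mathcal{A}(x) = \begin{cases} 1 & \text{if } \tau(x) \geq \theta \\ 0 & \text{otherwise} \end{cases}$$
A sample is classified as anomalous where $\mathcal{A}(x)$ equals 1 and as a good product where $\mathcal{A}(x)$ equals 0.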
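The scoring and thresholding step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DifferNet's implementation: the latent distribution is taken to be a standard normal, and `identity`, `negate`, and `to_latent` are hypothetical stand-ins for the image transformations and for the feature extractor followed by the flow.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def neg_log_likelihood(z: Vector) -> float:
    """-log p_Z(z) for a standard normal latent distribution."""
    return 0.5 * sum(v * v for v in z) + 0.5 * len(z) * math.log(2.0 * math.pi)


def anomaly_score(x: Vector,
                  transforms: Sequence[Callable[[Vector], Vector]],
                  to_latent: Callable[[Vector], Vector]) -> float:
    """Average negative log-likelihood over all transformed copies of x."""
    scores = [neg_log_likelihood(to_latent(t(x))) for t in transforms]
    return sum(scores) / len(scores)


def classify(score: float, threshold: float) -> int:
    """Return 1 (anomalous) if the score reaches the threshold, else 0."""
    return 1 if score >= threshold else 0


# Hypothetical stand-ins for the transformations and the trained model.
def identity(v: Vector) -> Vector:
    return list(v)


def negate(v: Vector) -> Vector:
    return [-a for a in v]


def to_latent(v: Vector) -> Vector:
    return list(v)


good_score = anomaly_score([0.1, -0.2], [identity, negate], to_latent)
bad_score = anomaly_score([3.0, 4.0], [identity, negate], to_latent)
labels = [classify(good_score, 5.0), classify(bad_score, 5.0)]
```

Samples whose latent codes lie far from the origin receive a low likelihood and therefore a high score; the threshold value 5.0 here is arbitrary and only separates the two toy inputs.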
II.c You Only Look Once: Unified, Real-Time Object Detection
In 2016, J. Redmon et al. introduced YOLO, a unified model for object detection. It reframes object detection as a regression problem over spatially separated bounding boxes and their associated class probabilities. YOLO uses a single convolutional neural network to predict bounding boxes and class probabilities. Given an image as input, the system first divides the image into an S x S grid. Each cell predicts B bounding boxes and their corresponding confidence scores. The confidence score is defined as $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$, where IOU is the intersection over union between the ground truth and the predicted bounding box. The conditional class probabilities are then multiplied by the confidence score of each bounding box to obtain class-specific confidence scores:
$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$
YOLO is extremely fast, reasons globally about the image, and learns a more generalized representation of objects, which allows it to outperform other detection methods. It performs efficiently both in fetching images from the camera and in displaying the detections. However, because of its strong spatial constraints, YOLO struggles with small objects that appear in groups. It also struggles to identify objects in new or unusual configurations that it has not seen during training.
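To make the confidence computation concrete, the sketch below computes the intersection over union of two boxes and the class-specific confidence as the product of the class probability, the objectness probability, and the IOU. The corner-based box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def class_confidence(p_class_given_object: float,
                     p_object: float,
                     iou_value: float) -> float:
    """Class-specific confidence: Pr(Class | Object) * Pr(Object) * IOU."""
    return p_class_given_object * p_object * iou_value


overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
score = class_confidence(0.9, 0.8, overlap)
```

In the example, the two 2 x 2 boxes share a 1 x 1 region, so the IOU is 1/7 and the class-specific confidence is scaled down accordingly.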
II.d Improving Unsupervised Defect Segmentation by Applying Structural Similarity To Autoencoders
Convolutional autoencoders have become a popular approach for unsupervised defect segmentation of images. The cited work proposes a model that uses the structural similarity (SSIM) metric with an autoencoder to capture the inter-dependencies between local regions of an image. The model is trained exclusively on defect-free images and can segment defective regions in an image after training.
The autoencoder in the proposed model attempts to reconstruct an input image precisely after passing it through a bottleneck, effectively projecting the input image into a latent space. To prevent the model from simply copying the input image, the latent space dimension is much smaller than the input image’s dimension. For a given input image $x$, the overall process is summarized as:
$$\hat{x} = D(E(x)) = D(z)$$
Here $D$ is the decoder function, $E$ is the encoder function, $z$ denotes the latent representation, and $\hat{x}$ is the reconstruction. If the autoencoder encounters images that were not seen in training, i.e., samples with defects, it will fail to reconstruct such images.
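SSIM is a measure designed to capture the similarity between two images. It is less sensitive to edge alignment and considers luminance, contrast, and structural information at the same time. Given patches $p$ and $q$ from two images, the SSIM index compares the patches using three statistical features and is summarized as:
$$\text{SSIM}(p, q) = l(p, q)^{\alpha} \, c(p, q)^{\beta} \, s(p, q)^{\gamma}$$
where $l$, $c$, and $s$ are the luminance, contrast, and structure comparison terms, and $\alpha$, $\beta$, $\gamma$ are their weights.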
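As an illustration of how the SSIM index combines luminance, contrast, and structure, the sketch below evaluates it for two flattened patches. It assumes equal weights for the three terms and the commonly used stabilizing constants (0.01 and 0.03 times the dynamic range, squared), in which case the product collapses into a single fraction; these choices and the use of population statistics are assumptions of the example, not details taken from the cited work.

```python
from typing import Sequence


def ssim(p: Sequence[float], q: Sequence[float], dynamic_range: float = 1.0) -> float:
    """SSIM index of two equally sized patches given as flat sequences."""
    n = len(p)
    if n == 0 or n != len(q):
        raise ValueError("patches must be non-empty and of equal size")
    # Small constants that stabilize the division for flat patches.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_p = sum(p) / n
    mu_q = sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    numerator = (2.0 * mu_p * mu_q + c1) * (2.0 * cov + c2)
    denominator = (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2)
    return numerator / denominator


patch = [0.2, 0.4, 0.6, 0.8]
same = ssim(patch, patch)
different = ssim([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
```

Identical patches give an index of 1, while patches that differ strongly in luminance give a value close to 0. A per-pixel residual map for segmentation would apply this computation to a sliding window over the input and its reconstruction.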
II.e Segmentation-Based Deep-Learning Approach for Surface-Defect Detection
Building on the success that deep-learning methods have achieved in quality control, the cited work proposes a segmentation-based deep-learning architecture for the specific domain of surface-crack detection. The model is trained with only a small number of samples, approximately 25-30 of which are defective.
In that work, a deep convolutional network is constructed with a two-stage architecture. The first stage is a segmentation network that performs pixel-wise localization of defects. It focuses on detecting small defects in a large image, which requires a large receptive field. The second stage is a decision network built on top of the segmentation network, where binary image classification is performed. It ensures that the model captures not only local shapes but also global ones. The performance of the model has been demonstrated on the specific task of crack detection. Moreover, the network architecture can be applied to new domains with multiple complex surfaces and to other defect types.
III Proposed Method
In this paper, object detection is performed by YOLO to detect and draw bounding boxes around the products in each frame. After comparing recent defect detection methods, we focus on DifferNet, a state-of-the-art normalizing flow-based model, to perform defect detection. However, because the product images detected and cropped by YOLO usually contain a lot of background noise, we devise an improved normalizing flow-based model with an additional image transformation layer to remove the background noise. A visualization model is also introduced to plot the predicted bounding box for each bottle and the predicted anomaly result on every video frame. Fig. 2 shows an overview of our proposed model:
III.a Proposed model
1) Our model takes video clips of bottle products as input and utilizes YOLO to detect and draw bounding boxes on each bottle in each frame.
2) A data extraction model is created to crop the bottle images based on the bounding boxes drawn by YOLO. Both the cropped bottle images and the original video frames are saved into separate folders.
3) An image transformation model is further introduced to rotate each bottle image and crop its edges to remove the background noise surrounding the bottle.
4) The processed bottle images are then passed into the normalizing flow-based defect detection model to generate a normal distribution by maximum likelihood training.
5) After training the model, a scoring function is used to calculate likelihoods to classify the input sample as defective or normal. We also created a visualization model to plot both the bounding box and anomaly prediction onto the original input video frames.
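The data extraction step in the list above can be sketched as follows. The sketch assumes that each detection is given in a normalized center format `(cx, cy, w, h)` relative to the frame size and that a frame is a nested list of pixel rows; both are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]


def yolo_box_to_pixels(cx: float, cy: float, w: float, h: float,
                       frame_w: int, frame_h: int) -> Tuple[int, int, int, int]:
    """Convert a normalized center-format box to pixel corners, clamped to the frame."""
    left = max(0, int(round((cx - w / 2.0) * frame_w)))
    top = max(0, int(round((cy - h / 2.0) * frame_h)))
    right = min(frame_w, int(round((cx + w / 2.0) * frame_w)))
    bottom = min(frame_h, int(round((cy + h / 2.0) * frame_h)))
    return left, top, right, bottom


def crop_frame(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Cut the region (left, top, right, bottom) out of the frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Toy 8 x 8 frame whose pixel value encodes its position (row * 8 + column).
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
box = yolo_box_to_pixels(0.5, 0.5, 0.5, 0.25, frame_w=8, frame_h=8)
bottle = crop_frame(frame, box)
```

Clamping the corners keeps detections near the frame border from producing out-of-range slices. In a full pipeline, each crop would then be saved and passed on to the image transformation step.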
III.b Novel Combination of YOLO and improved DifferNet
YOLO is a state-of-the-art model that is extremely fast, reasons globally, and learns a more generalized representation of objects, which allows it to outperform other detection methods. The model consists of twenty-four convolutional layers followed by two fully connected layers. Given an image as input, the system first divides the image into an S x S grid. Each cell predicts B bounding boxes and their corresponding confidence scores. Each bounding box predicted by YOLO carries the object class, the center coordinates, and the height and width of the box. We use YOLO to perform object detection on video clips collected from a real-world production line monitoring system. The bottle images cropped according to the bounding boxes are then passed into our improved DifferNet for training and prediction.
DifferNet is a state-of-the-art model that uses the latent space of a normalizing flow to represent the feature distribution of normal samples. Unlike other generative models such as variational autoencoders (VAEs) and GANs, the flow-based generator provides a bijective mapping between feature space and latent space, so each sample is assigned a likelihood.
To improve the performance of DifferNet on the output images from YOLO, we propose an image transformation model that rotates and crops each bottle image. In training, cropping at various scales is applied to the bottle images to reduce background noise interference. Moreover, the range of rotation for input images is reduced from 360 degrees to 10 or 20 degrees for better computing performance.
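A minimal sketch of the two transformations follows. It assumes that cropping removes a fixed fraction of the image from each edge and that a rotation range of 10 degrees means angles drawn uniformly between -5 and 5 degrees; the function names are hypothetical, and the rotation itself would be delegated to an image library, so only the angle sampling is shown.

```python
import random
from typing import List, Optional

Image = List[List[float]]


def crop_edges(image: Image, top: float, bottom: float,
               left: float, right: float) -> Image:
    """Remove a fraction of the image from each edge (fractions in [0, 0.5))."""
    height, width = len(image), len(image[0])
    y0 = int(round(height * top))
    y1 = height - int(round(height * bottom))
    x0 = int(round(width * left))
    x1 = width - int(round(width * right))
    return [row[x0:x1] for row in image[y0:y1]]


def sample_rotation_angle(rotation_range_deg: float,
                          rng: Optional[random.Random] = None) -> float:
    """Draw an angle uniformly from [-range / 2, +range / 2] degrees."""
    rng = rng or random.Random()
    half = rotation_range_deg / 2.0
    return rng.uniform(-half, half)


# Toy 20 x 20 image whose pixel value encodes its position (row * 20 + column).
image = [[float(r * 20 + c) for c in range(20)] for r in range(20)]
cropped = crop_edges(image, top=0.10, bottom=0.10, left=0.05, right=0.05)
angle = sample_rotation_angle(10.0, random.Random(0))
```

With a 10% crop at the top and bottom and a 5% crop at the left and right, the 20 x 20 toy image shrinks to 16 rows by 18 columns. Keeping the crop fractions as parameters makes it easy to compare several cropping scales.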
The transformed images are then fed into a pre-trained AlexNet to extract features. The extracted feature maps are passed into the normalizing flow-based coupling layers, which map them to a normal distribution through maximum likelihood training. DifferNet uses the negative log-likelihood loss $\mathcal{L}(y)$ to obtain a minimization problem:
$$\mathcal{L}(y) = \frac{\lVert z \rVert_2^2}{2} - \log \left| \det \frac{\partial z}{\partial y} \right|$$
To classify whether an input image is anomalous, DifferNet uses a scoring function that calculates the average of the negative log-likelihoods over multiple transformations $T_i(x)$ of an image $x$:
$$\tau(x) = \mathbb{E}_{T_i \in \mathcal{T}} \left[ -\log p_Z\big(f_{NF}(f_{ex}(T_i(x)))\big) \right]$$
where $f_{ex}$ is the feature extractor and $f_{NF}$ is the normalizing flow. The result is compared with the threshold value $\theta$ to determine whether the image contains an anomaly.
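The negative log-likelihood loss used for training above can be sketched numerically as follows. The sketch assumes a standard normal latent distribution and drops the additive constant, so the loss for one sample is half the squared norm of its latent code minus the log-determinant of the Jacobian; the function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def nll_loss(z: Vector, log_det_jacobian: float) -> float:
    """Per-sample loss: ||z||^2 / 2 - log|det(dz/dy)|, up to a constant."""
    return 0.5 * sum(v * v for v in z) - log_det_jacobian


def batch_nll_loss(latents: Sequence[Vector], log_dets: Sequence[float]) -> float:
    """Mean loss over a batch of latent codes and their log-determinants."""
    if len(latents) == 0 or len(latents) != len(log_dets):
        raise ValueError("latents and log_dets must be non-empty and aligned")
    losses = [nll_loss(z, ld) for z, ld in zip(latents, log_dets)]
    return sum(losses) / len(losses)


single = nll_loss([3.0, 4.0], 2.0)
batch = batch_nll_loss([[0.0, 0.0], [3.0, 4.0]], [0.0, 2.0])
```

Minimizing this quantity pulls the latent codes of good samples toward the origin while the log-determinant term discourages the flow from collapsing the feature space.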
IV Experiments and Results
In this section, we evaluate the proposed model on real-world videos obtained from the factory. First, we briefly introduce the dataset used in the experiments. Then, the results of the experiments are analyzed with visual statistics. Since the complexity of the experiments primarily stems from the noisy background in the video clips, our experiments concentrate on logo-free products, grouped into single-product and multiple-product categories.
IV.a Dataset
In this paper, we evaluate our model on real-world defect detection problems. We created a new dataset collected from a real-world production line monitoring system. This dataset includes 21 video clips covering 20 types of bottles, with both good and defective samples. The bottle videos are taken from assembly-line footage provided by ZeroBox Inc. In total, 1381 good bottle images and 253 defective bottle images are generated by YOLO detection and cropping. Examples of defective and defect-free samples can be seen in Fig. 3.
Since our normalizing flow-based model is semi-supervised, it requires only about 200 good sample images to learn how to represent the complex distribution of good samples with a simple normal distribution. In our experiments, we use only 200 good sample images for training, and all remaining sample images are used as the test set.
IV.b Implementation Details
The Area Under the Receiver Operating Characteristic curve (AUROC) is computed for performance evaluation. We adopt this metric because it reveals the model’s ability to discriminate between positive and negative samples. It is the area under the ROC curve, which plots the true positive rate against the false positive rate at different classification thresholds. AUROC is not sensitive to the percentage of defective samples and is therefore chosen as the metric for performance evaluation. The experimental results are presented and analyzed both qualitatively and quantitatively.
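For reference, AUROC can be computed without tracing the curve explicitly, because it equals the probability that a randomly chosen positive (defective) sample receives a higher anomaly score than a randomly chosen negative (good) sample, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the toy scores are hypothetical, and this is not necessarily how the reported numbers were computed.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via pairwise comparison of positive and negative scores.

    labels use 1 for defective (positive) and 0 for good (negative) samples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
```

In the toy example, three of the four positive-negative pairs are ranked correctly, giving an AUROC of 0.75. Because only the ranking of scores matters, the metric does not depend on a particular threshold or on the ratio of defective to good samples.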
Detection on One Product Type with Image Processing Techniques
Experiment Result of Image Cropping: Table I and Fig. 4 present a detailed AUROC comparison for detection on one product with different scales of cropping. The proposed model obtains better defect detection results after cropping the background of the bottle images. Depending on the image background, adjusting the cropping scales adequately while retaining the defective regions yields higher AUROC scores.
TABLE I: AUROC (%) in detection on one product with image cropping
Scale of Image Cropping | AUROC [%]
Original Image | 76.9
Top 10% Cropping, Bottom, Left and Right 5% Cropping | 92.2
Top and Bottom 10% Cropping, Left and Right 5% Cropping | 99.2
Experiment Result of Image Rotation: Table II and Fig.7 display the AUROC comparison of the detection results for one product with random image rotation within a specific range. Since the bottles are motionless on the production line, we reduce the range of random rotation from the initial 360 degrees in DifferNet to smaller ranges. As a result, the proposed model obtains better defect detection results with rotation angles between -5 and 5 degrees or between -10 and 10 degrees.
TABLE II: AUROC (%) in detection on one product with image rotation
Range of Image Rotation | AUROC [%]
Original Image | 76.9
10 Degrees of Image Rotation | 97.8
20 Degrees of Image Rotation | 99.6
360 Degrees of Image Rotation | 93.3
Detection on Multiple Product Types with Image Processing Techniques
TABLE III: AUROC (%) in detection on multiple product types with image cropping
Scales of Image Cropping | AUROC [%]
Original Image | 73.2
Top and Bottom 10% Cropping | 86.1
Top and Bottom 15% Cropping | 93.4
Detection on All Product Types with Image Processing Techniques
Table IV and Fig.6 show the AUROC result of the model on the detection of all products. Similar to the results of detection on single and multiple product types, image cropping to eliminate background noise near edges of images enables the model to achieve better performance.
TABLE IV: AUROC (%) in detection on all product types with image cropping
Scales of Image Cropping | AUROC [%]
Original Image | 88.1
Top and Bottom 10% Cropping | 90.1
Top and Bottom 15% Cropping | 93.5
V Conclusion
In this paper, we introduce a new dataset for bottle surface defect detection. This dataset poses several challenges regarding defect types, background noise, and dataset size. We also propose a two-stage defect detection network based on object detection and normalizing flow-based defect detection. To overcome the significant effect of background noise on both positive and negative samples, we apply multi-scale image transformations. Finally, extensive experiments show that the proposed approach is robust for the detection of surface defects on bottle products. In the future, we will work on background and foreground segmentation with an end-to-end trained mask to eliminate the background noise in images cropped by YOLO. We will also collect more data samples for training and for improving our proposed method.
References
- (2019) Improving unsupervised defect segmentation by applying structural similarity to autoencoders. SCITEPRESS - Science and Technology Publications.
- Density estimation using Real NVP.
- (2016) You only look once: unified, real-time object detection.
- (2015) Variational inference with normalizing flows.
- (2020) Same same but DifferNet: semi-supervised defect detection with normalizing flows.
- (2019-05) Segmentation-based deep-learning approach for surface-defect detection. Journal of Intelligent Manufacturing 31 (3), pp. 759–776.