Automatic defect detection from steel surface images is a challenging task because of the diversity and complexity of defects in real industrial settings, especially on high-speed production lines, where real-time detection is essential but difficult for human inspectors. The characteristics of different environments affect both the appearance of the steel and the type of damage it sustains. Existing methods can be coarsely divided into three main steps: image preprocessing, feature extraction, and classification. However, building a high-quality feature extractor usually requires both hand-crafted engineering and substantial expert knowledge.
Because defect detection is based on images of the steel surface, we cast the problem as an image segmentation task and applied the DeepLabV3+ model to solve it. DeepLabV3+ can be combined with different pre-trained models as backbones. To balance the accuracy and efficiency of the model, we experimented with several backbones and analyzed their performance in detail.
2 Related Works
Among the architectures and tools applied to image segmentation, spatial pyramid pooling and the encoder-decoder structure are two of the most commonly used.
The encoder-decoder structure uses an encoder to gradually shrink the feature maps and capture higher-level semantic information, and a decoder to gradually recover the spatial information. Well-studied models can be used as the encoder module. In late 2014, Long et al. proposed the Fully Convolutional Network, which used well-studied image classification networks (e.g., AlexNet) as the encoder and appended a decoder module that upsamples the coarse feature maps to the same size as the input image. In 2015, U-Net was proposed, which consists of a contracting path that captures context information and a symmetric expanding path that enables precise localization.
Spatial pyramid pooling works by repeatedly partitioning the image into smaller sub-regions and taking a weighted sum of the number of matches at each sub-region level. DeepLabV3+, which refines the DeepLab architecture, applies Atrous Spatial Pyramid Pooling (ASPP). With ASPP, the architecture can learn a larger field-of-view without increasing the number of parameters or the amount of computation, and the output features retain a larger spatial size, which benefits the segmentation task.
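A minimal PyTorch sketch of an ASPP module may clarify the idea: parallel atrous (dilated) convolutions enlarge the receptive field while the kernel stays 3×3. The dilation rates here follow the experimented model described later (1, 12, 24, 36); the channel widths and image-level pooling branch are our assumptions.

```python
import torch
import torch.nn as nn


class ASPP(nn.Module):
    """Sketch of Atrous Spatial Pyramid Pooling (DeepLabV3+-style).

    Parallel convolutions with different dilation rates see different
    fields-of-view over the same feature map; their outputs are
    concatenated and projected back to a single feature map.
    """

    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 12, 24, 36)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            # rate 1 degenerates to a plain 1x1 conv, as in the paper
            k, p = (1, 0) if r == 1 else (3, r)
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=p, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
        # image-level pooling branch (adaptive average pool + 1x1 conv)
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.ReLU(inplace=True))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = nn.functional.interpolate(self.pool(x), size=(h, w),
                                           mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```

Because padding equals the dilation rate for each 3×3 branch, every branch preserves the spatial size, so the five outputs can be concatenated directly.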
The data come from the Kaggle competition 'Severstal: Steel Defect Detection', a collection of steel surface images. The objective is to automatically detect the defect regions on the steel. Each image may contain anywhere from zero to all four kinds of defects.
The training set includes 12,568 images, of which 6,666 contain at least one defect region. The ground truth for these images was produced by human labelers.
The training set includes 7,095 defect regions in total, but the classes are quite unbalanced both in the number of defects and in defect area. To be more specific, around 73% of the defects belong to class 3. Class 4 accounts for only 11.3% of the defects but around 17% of the total defect area, which means its defects tend to be larger. On the contrary, the defects in classes 2 and 1 are quite small: although they account for around 12.6% and 3.48% of the defects respectively, they make up only 2.39% and 0.51% of the total mask area. Because of their small share of samples and, especially, of area, our network may have a hard time finding classes 1 and 2.
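Proportions like these can be computed directly from integer-labeled masks (0 = background, 1-4 = defect classes); a minimal sketch, with the function name and counting protocol being our assumptions:

```python
import numpy as np


def defect_statistics(masks):
    """Per-class defect frequency and pixel share from labeled masks.

    `masks` is an iterable of 2D integer arrays where 0 marks background
    and 1-4 mark the four defect classes.
    Returns (images containing each class, share of total defect pixels).
    """
    counts = np.zeros(5, dtype=np.int64)   # images containing each class
    pixels = np.zeros(5, dtype=np.int64)   # total pixels per class
    for m in masks:
        for c in np.unique(m):
            counts[c] += 1
        pixels += np.bincount(m.ravel(), minlength=5)
    defect_pixels = pixels[1:].sum()
    share = pixels[1:] / max(defect_pixels, 1)   # ignore background
    return counts[1:], share
```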
3.2 Data Preprocessing
Resize: reduce the image size from 256×1700 to 256×256 [Figure 1 (left)] for training.
Encode the segmentation and classification information as a 2D labeled mask matrix with the same size as the input: each element of the mask takes a value from 0 to 4, where 0 marks the background area and 1 to 4 correspond to the four defect types. A visualization is shown in Figure 1 (right).
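The competition distributes ground truth as run-length-encoded strings, one per image and defect class; a sketch of decoding one such string into the labeled-mask format above, assuming Kaggle's column-major, 1-indexed RLE convention:

```python
import numpy as np


def rle_to_mask(rle, class_id, shape=(256, 1700)):
    """Decode one 'EncodedPixels' run-length string into a labeled mask.

    The RLE lists (start, length) pairs over a column-major, 1-indexed
    flattening of the image. Pixels covered by a run are set to
    `class_id` (1-4); everything else stays 0 (background).
    """
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    nums = [int(x) for x in rle.split()]
    for start, length in zip(nums[0::2], nums[1::2]):
        mask[start - 1:start - 1 + length] = class_id
    # Fortran (column-major) order matches the competition's encoding
    return mask.reshape(shape, order="F")
```

Masks for multiple classes can be merged by decoding each class's string into the same array, since defect regions of different classes do not overlap.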
3.3 Data Augmentation
We applied random weighted augmentation to the training data to balance the defect classes. The logic is to first calculate the proportion of each defect class and then use it as the probability of performing augmentation for each defect type. The augmentation is chosen randomly from the following types:
Random Crop: Randomly crop the image with the target input size from the raw image.
Vertical Flip: Flip the image vertically.
Random Rotation: Rotate the image with a random angle.
Each augmentation is applied to both the original image and the corresponding ground-truth mask, so the augmented pairs stay aligned.
We assign a weight (probability) to each type of augmentation when randomly selecting the augmentation action.
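A minimal sketch of this weighted augmentation scheme, with hypothetical per-class probabilities and 90-degree rotations standing in for arbitrary-angle rotation:

```python
import random
import numpy as np

# Hypothetical per-class augmentation probabilities: the rarer classes
# (by defect area) are augmented more aggressively.
AUG_PROB = {1: 0.9, 2: 0.9, 3: 0.2, 4: 0.5}


def augment(image, mask, rng=random):
    """Apply one randomly chosen augmentation to an image/mask pair.

    The same geometric transform must hit both the image and its
    ground-truth mask so they stay aligned.
    """
    classes = [c for c in np.unique(mask) if c != 0]
    if not classes or rng.random() > max(AUG_PROB[c] for c in classes):
        return image, mask
    choice = rng.choice(["crop", "vflip", "rotate"])
    if choice == "vflip":
        return np.flipud(image).copy(), np.flipud(mask).copy()
    if choice == "rotate":
        k = rng.choice([1, 2, 3])          # multiples of 90 degrees
        return np.rot90(image, k).copy(), np.rot90(mask, k).copy()
    # crop: random 256-wide window from the raw 256x1700 image
    x = rng.randrange(0, image.shape[1] - 256 + 1)
    return image[:, x:x + 256], mask[:, x:x + 256]
```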
4 Evaluation Metrics
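We report mean intersection-over-union (mIoU) over the labeled masks. A minimal per-image sketch of this metric (the exact averaging protocol over classes and images is our assumption):

```python
import numpy as np


def mean_iou(pred, target, num_classes=5):
    """Mean intersection-over-union across classes present in the masks.

    `pred` and `target` are integer-labeled masks (0 = background,
    1-4 = defect classes). Classes absent from both masks are skipped
    so they do not inflate the score.
    """
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```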
The baseline model [Figure 2] is a bare-bones DeepLabV3+ with the encoder module removed. To be more specific, the input image is fed into a pretrained ResNet101 model and then into the DeepLabV3+ decoder module. The decoder module consists of 2 convolution layers and an upsampling layer; batch normalization and a ReLU activation follow each of the 2 convolution layers. The first convolution layer in the decoder module has 2048 input channels, 256 output channels, and kernel size 3. The second convolution layer has 256 input channels, 256 output channels, and kernel size 3. We kept the upsampling layer to stay consistent with the DeepLabV3+ decoder structure.
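The baseline decoder described above can be sketched as follows; the final 1×1 classifier to 5 output classes and the ×8 upsampling factor are our assumptions:

```python
import torch
import torch.nn as nn


class BaselineDecoder(nn.Module):
    """Baseline decoder: two 3x3 conv blocks, then bilinear upsampling.

    Channel sizes follow the text (ResNet101 emits 2048 channels);
    the 1x1 classifier head and upsampling factor are assumed.
    """

    def __init__(self, in_ch=2048, mid_ch=256, num_classes=5, scale=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_classes, 1))
        self.scale = scale

    def forward(self, x):
        x = self.body(x)
        return nn.functional.interpolate(x, scale_factor=self.scale,
                                         mode="bilinear", align_corners=False)
```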
5.2 Experimented Model
The experimented model is based on DeepLabV3+; the overall architecture is shown below.
As shown in the graph, the model uses the encoder-decoder structure. The encoder starts with 5 normal convolution modules, whose output is passed in parallel to 4 atrous convolution modules and 1 average-pooling module. The first 5 normal convolution modules come from the pre-trained backbone model described in the subsections below; their convolution layers have 7×7, 1×1, 1×1, 1×1, and 1×1 kernels, strides of 2, 1, 2, 2, and 1, and dilations of 0, 1, 1, 1, and 2 respectively. The output of this stage is, on one hand, copied and saved as an input to the decoder to retain more lower-level information; on the other hand, it is passed in parallel to the atrous convolution modules as shown in the plot. The 4 atrous convolution modules have different convolution layers: their kernels are 1×1, 3×3, 3×3, and 3×3 with dilation values of 1, 12, 24, and 36, and a stride of 1 throughout. The average-pooling module combines an adaptive average pooling layer with a normal convolution module to adjust the output size to match the atrous convolution modules.
The outputs of these 5 modules are concatenated and passed to a depthwise convolution module that merges information across channels. Before this output is passed to the decoder, it is upsampled by a factor of 4 with bilinear interpolation.
The decoder has two inputs: the output of the first 5 normal convolution modules and the output of the encoder. The former is passed through a normal convolution module and then concatenated with the encoder output. Finally, the concatenated features are passed through 3 more convolution modules, whose kernel sizes are 3×3, 3×3, and 1×1. As a last step, the output is upsampled by a factor of 4 with bilinear interpolation so that it has the same size as the input.
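A sketch of this decoder under assumed channel widths (the 48-channel low-level reduction follows the DeepLabV3+ paper rather than a detail stated here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepLabDecoder(nn.Module):
    """Sketch of the DeepLabV3+ decoder described in the text.

    Low-level backbone features are reduced by a 1x1 conv, concatenated
    with the upsampled encoder output, refined by 3x3/1x1 convs, and
    upsampled x4 to the input resolution.
    """

    def __init__(self, low_ch=256, enc_ch=256, num_classes=5):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(
            nn.Conv2d(enc_ch + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, low, enc):
        low = self.reduce(low)
        # bring the encoder output up to the low-level feature size
        enc = F.interpolate(enc, size=low.shape[2:],
                            mode="bilinear", align_corners=False)
        out = self.refine(torch.cat([low, enc], dim=1))
        return F.interpolate(out, scale_factor=4,
                             mode="bilinear", align_corners=False)
```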
We implemented the model in PyTorch and the project code in Python, using Google Colab as the coding environment. However, Colab limits the available GPU hours per day, and log data may be lost when training is resumed; due to this limitation, we only show part of the loss and evaluation trends here.
We trained the baseline model on Google Colab with an NVIDIA Tesla P100 GPU, learning rate 0.01, batch size 16, weight decay 1e-4, and a train/evaluation/test split of 80/10/10 (%). It took about 3 hours to reach mIoU = 0.30 after 30 epochs.
We trained DeepLabV3+ with all the backbones explored in 5.3 on Google Colab with an NVIDIA Tesla P100 GPU, learning rate 0.01, batch size 16, weight decay 1e-4, and a train/evaluation/test split of 80/10/10 (%). Partial smoothed loss and evaluation trends are shown in the figure.
6.3 Train/Evaluation Speed
Total training time is measured in training iterations per second with batch size 16, so the input for each iteration is a Batch × Channel × H × W tensor of 16 resized 256×256 images.
Evaluation time is measured in image inferences per second, with the batch size set to 1.
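These throughput numbers can be measured with a simple timing loop; a sketch (function name is ours, and on GPU the `torch.cuda.synchronize()` calls are needed because kernel launches are asynchronous):

```python
import time
import torch


def throughput(model, batch, n_iter=20, warmup=5):
    """Measure forward passes per second for a given batch.

    Runs a few warmup passes first so one-time setup costs are not
    counted, then times `n_iter` forward passes.
    """
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()        # flush pending GPU work
        start = time.perf_counter()
        for _ in range(n_iter):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return n_iter / (time.perf_counter() - start)
```

For the training column, the same loop would wrap a full optimizer step (forward, loss, backward, update) instead of a bare forward pass.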
Model backbone | Train (iter/s) | Eval, resized image (image/s) | Eval, original image (image/s)
7.1 Mean IoU on Test Data
Model | Mean IoU
DenseNet201 + DeepLabV3+ | 0.325
ResNet101 + DeepLabV3+ | 0.57
EfficientNet B1 + DeepLabV3+ | 0.57
7.2 IoU Distribution on Test Data
IoU distribution statistics for the 4 experimented models are shown in [Figure 4].
7.3 Inference Sample on Test Data
We perform model inference on test data at the original image size (256×1700) and compare with the resized images (256×256).
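One way to run inference at the reduced resolution and still report a full-size mask is to predict on the resized image and map the labels back up; a sketch, where nearest-neighbor upsampling is our choice because it preserves the integer class labels:

```python
import torch
import torch.nn.functional as F


def predict_mask(model, image, infer_size=(256, 256)):
    """Predict a labeled mask at reduced resolution, then restore size.

    `image` is an (N, C, H, W) tensor; the model is assumed to output
    (N, 5, h, w) class logits. Nearest-neighbor interpolation keeps the
    argmax labels intact when scaling back to (H, W).
    """
    model.eval()
    h, w = image.shape[-2:]
    with torch.no_grad():
        small = F.interpolate(image, size=infer_size, mode="bilinear",
                              align_corners=False)
        logits = model(small)
        labels = logits.argmax(dim=1, keepdim=True).float()
        labels = F.interpolate(labels, size=(h, w), mode="nearest")
    return labels.squeeze(1).long()     # (N, H, W) integer mask
```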
7.3.1 Single Defect Type per Image
[Figure 5] shows an output comparison between the different models on original-size images. Each of these test images contains only one defect type.
7.3.2 Multiple Defect Type per Image
[Figure 6] shows an output comparison between the different models on original-size images. Each of these test images contains multiple defect types.
7.3.3 Resized Output
Comparing the outputs of all four models with the ground-truth labels [Figures 5, 6, 7], we can see that the encoder module helps the network learn details such as segmentation edges and makes the detected regions more precise; as a result, the encoder-decoder architecture improves the IoU score. With DeepLabV3+, the ResNet and EfficientNet backbones achieve similar IoU scores, but EfficientNet costs less to train and evaluate. On the contrary, DenseNet does not perform well.
Some images have a relatively low IoU score [Figure 4], probably caused by the imbalanced data set and the limitations of the model. A deeper analysis of the underlying data set shows the following:
Class 1 tends to have small defects split into many fragments. It has the highest percentage of images with more than 5 segments while occupying only a small total defect area, and it is the only class with a considerable share of 10+ segment counts.
Class 2 also tends to have small defects, but not in many fragments: it never has 5+ segments per defect and mostly comes in one or two segments.
Class 3 has a fair amount of variation in terms of the number of segments and the area.
Class 4 has more than 3 segments per defect less frequently than class 3.
Based on these analyses, we believe the unbalanced data leads to a low mIoU score.
By analyzing predictions with low IoU scores, we also found that some of the data were labeled improperly: the model predicted extra defects that were not marked in the ground-truth mask [Figure 8]. Some defect areas are detected by the model but not by human labelers because of their extremely small size, which makes them nearly invisible to the human eye, yet the model predicts them precisely [Figure 9]. This could explain why some low IoU scores drag down the overall mean IoU.
DeepLabV3+ with different backbones can detect and classify steel defects automatically with fair accuracy and high efficiency. Among the 4 experimented backbones, ResNet101 and EfficientNet perform similarly well, with IoU around 0.57, while EfficientNet is more efficient to train. The unbalanced data remains a major obstacle to further improving model performance. Nonetheless, the model can predict some extra regions that are too small for humans to recognize but are still potentially defective. We believe more exploration in this area will help automate label-related tasks, which is part of the core idea of the Industry 4.0 manufacturing standard.
10 Future Work
-  Jiangyun Li, Zhenfeng Su, Jiahui Geng, and Yixin Yin. Real-time detection of steel strip surface defects based on improved yolo detection network. IFAC-PapersOnLine, 51:76–81, 01 2018.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
-  Svetlana Lazebnik, Cordelia Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178, 2006.
-  Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space, 2019.
-  Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR, abs/1802.02611, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
-  Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019.
-  Andreja Rojko. Industry 4.0 Concept: Background and Overview. https://online-journals.org/index.php/i-jim/article/viewFile/7072/4532, 2017.
-  Fereshte Khani, Aditi Raghunathan, and Percy Liang. Maximum weighted loss discrepancy. CoRR, abs/1906.03518, 2019.
-  Hongyi Zhang, Moustapha Cissé, Yann Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ArXiv, abs/1710.09412, 2017.