Analysis on DeepLabV3+ Performance for Automatic Steel Defects Detection

04/09/2020 ∙ by Zheng Nie, et al. ∙ Columbia University ∙ Stanford University

We experimented with DeepLabV3+ using different backbones on a large set of steel surface images, aiming to automatically detect different types of steel defects. We applied random weighted augmentation to balance the defect types in the training set, and then applied the DeepLabV3+ model with three different backbones, ResNet, DenseNet and EfficientNet, to segment defect regions in the steel images. In our experiments, using ResNet101 or EfficientNet as the backbone reached the best IoU score on the test set, around 0.57, compared with 0.325 for DenseNet. The DeepLabV3+ model with ResNet101 as the backbone also required the least training time.




1 Introduction

Automatic defect detection based on steel surface images is a challenging task because of the diversity and complexity of defects in real industrial settings, especially on high-speed production lines, where real-time detection is essential but difficult for humans [1]. The characteristics of different environments can affect the appearance of the steel and the type of damage. Existing methods can be coarsely divided into three main steps: image preprocessing, feature extraction, and classification. However, in most cases high-quality feature extractors depend on both hand-crafted work and substantial expert knowledge.

Because defect detection is based on images of the steel surface, we converted the problem into an image segmentation problem and applied the DeepLabV3+ model to solve it. DeepLabV3+ can be combined with different pre-trained models as backbones. To balance the accuracy and efficiency of the model, we experimented with different backbones and analyzed their performance in detail.

2 Related Works

Among the architectures and tools applied to image segmentation, spatial pyramid pooling and the encoder-decoder structure are two of the most common. An encoder-decoder structure uses an encoder to gradually reduce the feature maps and capture higher-level semantic information, and a decoder to gradually recover the spatial information. Well-studied models can be used as the encoder module. In late 2014, Long et al. proposed the fully convolutional network [2], which used well-studied image classification networks (e.g. AlexNet) as the encoder and appended a decoder module to upsample the coarse feature maps to the same size as the input image. In 2015, U-Net [3] was proposed, which consists of a contracting path that captures context information from the image and a symmetric expanding path for precise localization.
Spatial pyramid pooling works by repeatedly partitioning the image into smaller sub-regions and then taking a weighted sum of the number of matches at each sub-region level [4]. DeepLabV3+, which rebuilt the DeepLab architecture, applies Atrous Spatial Pyramid Pooling (ASPP). With ASPP, the architecture can learn a larger field of view without increasing the number of parameters or the amount of computation. The output feature map can also be larger, which benefits the segmentation task.
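
As a rough illustration of why atrous convolution widens the field of view at no extra parameter cost, the sketch below computes the spatial extent covered by a dilated kernel next to its weight count. The helper names are hypothetical and the dilation values are only examples; this is a worked calculation under those assumptions, not code from the project.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (along one axis) covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Weights per input/output channel pair; independent of dilation."""
    return kernel_size * kernel_size


if __name__ == "__main__":
    # Example dilation values chosen for illustration only.
    for dilation in (1, 12, 24):
        extent = effective_kernel_size(3, dilation)
        print(f"dilation={dilation}: covers {extent}x{extent} pixels, "
              f"{weights_per_filter(3)} weights per channel pair")
```

The weight count stays at 9 per channel pair for a 3 × 3 kernel while the covered extent grows with the dilation, which is the property ASPP relies on.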

3 Data

3.1 Overview

The data comes from the Kaggle competition 'Severstal: Steel Defect Detection' and consists of a collection of steel surface images. The objective is to automatically detect the defect regions on the steel. In each image, the steel can contain 0, 1, 2, 3 or all 4 kinds of defects.
The training set includes 12568 images, 6666 of which contain at least one defect region. The ground truth for these images was labeled by humans.
The training set includes 7095 defect regions in total, but the defect classes are quite unbalanced in both class frequency and defect area. Specifically, around 73% of defects belong to class 3. Class 4 accounts for only 11.3% of all defects but around 17% of the total defect area, which means that defects of this kind tend to be larger. In contrast, the defects in class 2 and class 1 are quite small. Although around 12.6% and 3.48% of defects come from these two classes, they make up only 2.39% and 0.51% of the total mask area respectively. In terms of sample count, and especially in terms of area, our network may have a hard time finding classes 1 and 2 because of their small size.

3.2 Data Preprocessing

  1. Resize: reduced the image size from 256 × 1700 to 256 × 256 [Figure 1 (left)] for training.

  2. Mask encoding: encoded the segmentation and classification information as a 2D numeric label mask with the same size as the input image. Each element of the mask has a value from 0 to 4: background is marked as 0, and 1 to 4 correspond to the 4 types of defects. See Figure 1 (right).

Figure 1: Resized image(left) and ground truth mask(right)
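
The mask encoding in step 2 can be sketched as follows, assuming the annotations are available as one binary mask per defect class. The function name, the nested-list representation and the overlap rule are illustrative assumptions; the original preprocessing code is not shown in this paper.

```python
from typing import Dict, List

Mask = List[List[int]]


def encode_label_mask(class_masks: Dict[int, Mask], height: int, width: int) -> Mask:
    """Merge per-class binary masks into one label mask with values 0..4.

    0 marks background and 1..4 mark the defect class. If two classes
    overlap at a pixel, the higher class id wins (an arbitrary choice
    made for this sketch).
    """
    label = [[0] * width for _ in range(height)]
    for class_id in sorted(class_masks):
        if not 1 <= class_id <= 4:
            raise ValueError(f"unexpected class id: {class_id}")
        mask = class_masks[class_id]
        for row in range(height):
            for col in range(width):
                if mask[row][col]:
                    label[row][col] = class_id
    return label
```

For example, a class 1 pixel in the top-left corner and a class 3 pixel in the bottom-right corner of a 2 × 2 image yield the label mask `[[1, 0], [0, 3]]`.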

3.3 Data Augmentation

We applied random weighted augmentation [5] to the training data to balance the defect classes. The logic is to first calculate the proportion of each defect class and then use it as the probability of performing augmentation for each defect type. The augmentation is chosen randomly from the following types:

  • Random Crop: Randomly crop the image with the target input size from the raw image.

  • Vertical Flip: Flip the image vertically.

  • Random Rotation: Rotate the image with a random angle.

The same augmentation is applied to both the original image and its ground truth mask, so that each augmented image stays paired with its mask.

We add a weight (probability) for each type of augmentation during the random selection of augmentation actions.
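
A minimal sketch of the weighted selection, assuming images and masks are nested lists and that the weights are supplied by the caller. All function names and the example weights are hypothetical, and the rotation is simplified to multiples of 90 degrees so that the example stays dependency-free. The point it illustrates is that one randomly chosen operation, with one set of random parameters, is applied to the image and its mask together.

```python
import random
from typing import Callable, List, Sequence, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def vertical_flip(image: Grid, mask: Grid, rng: random.Random) -> Pair:
    """Flip image and mask top-to-bottom."""
    return image[::-1], mask[::-1]


def random_crop(image: Grid, mask: Grid, rng: random.Random, size: int = 2) -> Pair:
    """Cut the same size x size window out of image and mask."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)

    def crop(grid: Grid) -> Grid:
        return [row[left:left + size] for row in grid[top:top + size]]

    return crop(image), crop(mask)


def random_rotation(image: Grid, mask: Grid, rng: random.Random) -> Pair:
    """Rotate both by the same multiple of 90 degrees (a simplification)."""
    turns = rng.randint(1, 3)

    def rotate(grid: Grid) -> Grid:
        for _ in range(turns):
            grid = [list(row) for row in zip(*grid[::-1])]
        return grid

    return rotate(image), rotate(mask)


AUGMENTATIONS: List[Callable[..., Pair]] = [random_crop, vertical_flip, random_rotation]


def augment(image: Grid, mask: Grid, weights: Sequence[float], rng: random.Random) -> Pair:
    """Pick one augmentation according to `weights` and apply it to the pair."""
    operation = rng.choices(AUGMENTATIONS, weights=weights, k=1)[0]
    return operation(image, mask, rng)
```

Calling `augment(image, mask, [0.5, 0.2, 0.3], random.Random(0))` would pick a crop half of the time; the actual weights used in the experiments are not stated here.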

4 Evaluation Metrics

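In this project, we calculated the IoU [6] for each sample in the test data set and then took the mean value as the evaluation metric. If $X_i$ denotes the $i$-th input image, $Y_i$ denotes its mask matrix, $\hat{Y}_i$ denotes the mask predicted for $X_i$, and $N$ denotes the total test data set size, the mean IoU can be calculated as:

$$\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{Y}_i \cap Y_i|}{|\hat{Y}_i \cup Y_i|}$$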
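
The metric can be sketched in plain Python as below. The text does not spell out how IoU is computed on multi-class masks, so this sketch makes two assumptions: a pixel counts toward the intersection only when the predicted and true labels agree on a defect class, and a sample with no defect in either mask scores 1.0. The function names are hypothetical.

```python
from typing import List, Sequence

Mask = List[List[int]]


def sample_iou(pred: Mask, truth: Mask) -> float:
    """IoU of the defect pixels of one predicted mask against its ground truth."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p != 0 or t != 0:
                union += 1
                if p == t:
                    intersection += 1
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-sample IoU over the whole test set."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and of equal length")
    return sum(sample_iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```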

5 Models

5.1 Baseline

The baseline model [Figure 2] is a bare-bones DeepLabV3+ obtained by removing the encoder module. More specifically, the input image is fed into a pretrained ResNet101 model and then into the DeepLabV3+ decoder module. The decoder module consists of 2 convolution layers and an upsample layer. Batch normalization and a ReLU activation function follow each of these 2 convolution layers. The first convolution layer in the decoder module has 2048 input channels and 256 output channels, with kernel size 3. The second convolution layer has 256 input channels and 256 output channels, with kernel size 3. We kept the upsample layer to be consistent with the DeepLabV3+ decoder structure.

Figure 2: Bare Bone DeepLabV3+
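
To give a sense of the size of the baseline decoder, the following worked calculation counts its learnable parameters from the channel and kernel sizes given above. It assumes the convolutions carry no bias term (a common choice when batch normalization follows) and that each batch normalization layer has one scale and one shift per channel; both are assumptions, since the text does not say, and the names are hypothetical.

```python
from typing import List, Tuple

# (name, in_channels, out_channels, kernel_size) as described for the baseline decoder.
DECODER_CONVS: List[Tuple[str, int, int, int]] = [
    ("conv1", 2048, 256, 3),
    ("conv2", 256, 256, 3),
]


def conv_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights of a 2D convolution, assuming no bias term."""
    return in_channels * out_channels * kernel_size * kernel_size


def batch_norm_params(channels: int) -> int:
    """One learnable scale and one shift per channel."""
    return 2 * channels


def decoder_param_count() -> int:
    """Total learnable parameters of the two conv + batch norm stages."""
    total = 0
    for _name, in_channels, out_channels, kernel_size in DECODER_CONVS:
        total += conv_weights(in_channels, out_channels, kernel_size)
        total += batch_norm_params(out_channels)
    return total
```

Under these assumptions the first convolution dominates with 4,718,592 weights, against 589,824 for the second.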

5.2 Experimented Model

The experimented model is based on DeepLabV3+ [7]. The overall architecture is shown below.

As the figure shows, the model uses the encoder-decoder structure. The encoder starts with 5 normal convolution modules, whose output is passed to 4 atrous convolution modules and 1 average max-pooling module in parallel. The first 5 normal convolution modules are based on the pre-trained backbone model, which is described in Section 5.3. Their convolution layers have 7 × 7, 1 × 1, 1 × 1, 1 × 1 and 1 × 1 kernels, strides of 2, 1, 2, 2 and 1, and dilations of 0, 1, 1, 1 and 2 respectively. The output of this module is, on one hand, copied and saved as an input to the decoder to retain more low-level information. On the other hand, it is passed to the atrous convolution modules in parallel, as the figure shows. The 4 atrous convolution modules have different convolution layers: their kernels are 1 × 1, 3 × 3, 3 × 3 and 3 × 3, and their dilation values are 1, 12, 24 and 3. The stride is always 1. The average max-pooling module is a combination of an adaptive average pooling layer and a normal convolution module that adjusts the output size to match that of the atrous convolution modules.
The outputs of these 5 modules are concatenated and passed to another depthwise convolution module that merges information from different channels. Before the output is passed to the decoder, it is upsampled by a factor of 4 with bilinear interpolation.
The decoder has two inputs: the output of the first 5 normal convolution modules and the output of the encoder. The output of the normal convolution modules is passed through a normal convolution module and then concatenated with the output of the encoder. Finally, the concatenated output is passed to 3 further convolution modules, whose convolution layers have kernel sizes 3 × 3, 3 × 3 and 1 × 1. The last step before the final output is to upsample by a factor of 4 with bilinear interpolation so that the output has the same size as the input.

5.3 Backbone

As mentioned previously, we explored different pretrained models as the backbone for DeepLabV3+. The backbones we experimented with are:

  • ResNet[8]: a pre-trained 101-layer ResNet

  • DenseNet[9]: a pre-trained 201-layer DenseNet

  • EfficientNet[10]: a pre-trained EfficientNet B1 model

6 Experiments

We implemented the model in PyTorch and the project code in Python, using Google Colab as the coding environment. However, the GPU hours available per day are limited, and log data may be lost when training is resumed. Because of this limitation, we show only part of the loss and evaluation trends here.

6.1 Baseline

We trained the baseline model on Google Colab with an NVIDIA Tesla P100 GPU, with learning rate = 0.01, batch size = 16 and weight decay = 1e-4. The data split is train/evaluation/test = 80/10/10 (%). It took about 3 hours to reach mIoU = 0.30 after 30 epochs.
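
The 80/10/10 split can be sketched as a shuffled index partition. The function name, the fixed seed and the rounding rule (truncate the train and evaluation sizes, give the remainder to the test set) are assumptions for illustration; the text does not say how the split was drawn.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int,
    train_frac: float = 0.8,
    eval_frac: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and cut them into train/evaluation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_frac)
    n_eval = int(n_samples * eval_frac)
    train = indices[:n_train]
    evaluation = indices[n_train:n_train + n_eval]
    test = indices[n_train + n_eval:]  # remainder goes to the test set
    return train, evaluation, test
```

With the 12568 training images of this data set, this rounding rule gives 10054, 1256 and 1258 samples respectively.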

6.2 DeepLabV3+

We trained DeepLabV3+ with each of the backbones described in Section 5.3 on Google Colab with an NVIDIA Tesla P100 GPU, with learning rate = 0.01, batch size = 16 and weight decay = 1e-4. The data split is train/evaluation/test = 80/10/10 (%). Partial smoothed loss and evaluation trends are shown in Figure 3.

Figure 3: Loss(Left) and Mean IoU(right)

6.3 Train/Evaluation Speed

Total training time is derived from the number of training iterations per second. The batch size is 16, so in each iteration the input is a Batch x Channel x H x W tensor holding 16 resized 256 x 256 images.
Evaluation time is derived from the number of image inferences per second, with the batch size set to 1.
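
A rate of this kind could be measured with a wall-clock timer around the training or inference step, as in the sketch below. The `step` callable and the function name are hypothetical stand-ins; with batches of 16 the result reads as iterations per second, and with batches of 1 as images per second.

```python
import time
from typing import Callable, Iterable


def measure_throughput(step: Callable[[object], object], batches: Iterable[object]) -> float:
    """Return how many batches `step` processes per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for batch in batches:
        step(batch)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

On a GPU, work is queued asynchronously, so a real measurement would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before reading the clock.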

Model backbone     Train (iter/s)   Eval, resized image (image/s)   Eval, original image (image/s)
Base               24               26                              11
DenseNet201        25               10                              7
ResNet101          14               16                              8
EfficientNet B1    40               20                              11
Table 1: Speed Comparison

7 Result

7.1 Mean IoU on Test Data

Model                              mIoU
Baseline                           0.29
DenseNet201 + DeeplabV3Plus        0.325
ResNet101 + DeeplabV3Plus          0.57
EfficientNet B1 + DeeplabV3Plus    0.57
Table 2: Mean IoU for different models

7.2 IoU Distribution on Test Data

The IoU distribution statistics for the 4 models we experimented with are shown in Figure 4.

Figure 4: IoU Distribution: Baseline(First Row Left), DenseNet Backbone(First Row Right), EfficientNet B1 Backbone(Second Row Right) VS ResNet Backbone(Second Row Left)

7.3 Inference Sample on Test Data

We performed model inference on test data at the original image size (256 x 1700) and compared the results with those on resized images (256 x 256).

7.3.1 Single Defect Type per Image

Figure 5 shows the output comparison between different models on original-size images. Each raw test image here contains only one defect type.

Figure 5: Output comparison (resized). Panels: ground truth; base model; ResNet101 + DeeplabV3Plus; EfficientNetB1 + DeeplabV3Plus; DenseNet201 + DeeplabV3Plus.

7.3.2 Multiple Defect Type per Image

Figure 6 shows the output comparison between different models on original-size images. Each raw test image here contains multiple defect types.

Figure 6: Output comparison (resized). Panels: ground truth; base model; ResNet101 + DeeplabV3Plus; EfficientNetB1 + DeeplabV3Plus; DenseNet201 + DeeplabV3Plus.

7.3.3 Resized Output

Figure 7: Output comparison (resized). Panels: input image; ground truth label mask; baseline output mask; ResNet101 + DeeplabV3Plus output mask.

8 Analysis

Comparing the outputs of all four models with the ground truth labels [Figures 5, 6, 7], we can see that the encoder module helps in learning details such as segmentation edges, and makes the detected region more precise. As a result, the IoU score improves with the encoder-decoder architecture. With DeepLabV3+, the IoU scores for the ResNet and EfficientNet backbones are very similar, but the cost of training and evaluation is lower with EfficientNet. In contrast, DenseNet does not perform well.

Some images have a relatively low IoU score [Figure 4], which is probably caused by the imbalanced data set and the limitations of the model. A deeper analysis of the underlying data set shows that the defects in:

  • Class 1 tend to be small and split into many fragments. This class has the highest percentage of images with more than 5 segments, yet occupies only a small share of the total defect area. It is also the only class with a considerable percentage of 10+ segment counts.

  • Class 2 tend to be small but not highly fragmented. No defect has 5+ segments; most come in one or two segments.

  • Class 3 show a fair amount of variation in both the number of segments and the area.

  • Class 4 have more than 3 segments per defect less often than class 3.

Based on this analysis, we believe that the unbalanced data could lead to a low mIoU score.
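
The segment counts in the analysis above can be obtained by counting connected regions of each class in a label mask. The sketch below uses a breadth-first flood fill with 4-connectivity; the function name and the choice of connectivity are assumptions, since the text does not say how segments were counted.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def count_segments(mask: Mask, class_id: int) -> int:
    """Count 4-connected regions labeled `class_id` in a label mask."""
    if not mask or not mask[0]:
        return 0
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    segments = 0
    for row in range(height):
        for col in range(width):
            if mask[row][col] != class_id or seen[row][col]:
                continue
            # Found an unvisited pixel of this class: flood-fill its region.
            segments += 1
            seen[row][col] = True
            queue = deque([(row, col)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == class_id):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return segments
```

Applied to every training mask, such a count gives the per-class distribution of segments per image that the bullet points summarize.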

By analyzing predictions with low IoU scores, we found that some of the data was labeled improperly. In these cases the model predicted extra defects that were not marked in the ground truth mask [Figure 8]. Some defect areas were detected by the model but not by humans because of their extremely small size, which makes them invisible to the human eye [Figure 9]. We also found defects that were too small to be labeled manually but were predicted precisely by the model. This could explain why some low IoU scores pulled down the overall mean IoU.

Figure 8: Missing partial ground truth mask. Panels: raw image; ground truth; what the model sees.
Figure 9: Model limitation. Panels: raw image; ground truth; what the model sees.

9 Conclusion

DeepLabV3+ with different backbones can be applied to detect and classify steel defects automatically, with fair accuracy and high efficiency. Among the 4 models we experimented with, those using the ResNet101 and EfficientNet backbones perform similarly well, with IoU around 0.57. ResNet101 is also more efficient in training. The unbalanced data is a major obstacle to further improving model performance. However, the model can predict some extra regions that are too small to be recognized by humans but are still potential defects. We believe more and more exploration will take place in this area to automate more labeling-related tasks, which could be part of the core idea of the Industry 4.0 [11] manufacturing standard.

10 Future Work

We will continue improving the model's mIoU and try to achieve a state-of-the-art result. We plan to try a weighted loss [12] and other techniques, e.g. mixup [13], to continue our experiments. We will also try modifying DeepLabV3+ to explore further improvements in model performance.