
Rail-5k: a Real-World Dataset for Rail Surface Defects Detection

by   Zihao Zhang, et al.
NetEase, Inc

This paper presents the Rail-5k dataset for benchmarking the performance of visual algorithms in a real-world application scenario, namely the rail surface defects detection task. We collected over 5k high-quality images from railways across China and, with the help of railway experts, annotated 1100 images to identify the 13 most common types of rail defects. The dataset supports two settings, each with unique challenges. The first is the fully-supervised setting, which uses the 1k+ labeled images for training; the fine-grained nature and long-tailed distribution of the defect classes make it hard for visual algorithms to tackle. The second is the semi-supervised learning setting facilitated by the 4k unlabeled images. These 4k images are uncurated and may contain image corruptions and domain shift relative to the labeled images, which cannot be easily tackled by previous semi-supervised learning methods. We believe our dataset could be a valuable benchmark for evaluating the robustness and reliability of visual algorithms.





1 Introduction

The introduction of large-scale annotated datasets such as ImageNet Deng et al. (2009) has greatly sped up the development of deep-learning-based vision algorithms He et al. (2016). Deep learning algorithms pre-trained on ImageNet Deng et al. (2009) have also been shown to transfer effectively between domains and tasks such as object detection Ren et al. (2015), agriculture Yang et al. (2020); Chiu et al. (2020), or medical image analysis Irvin et al. (2019).

As basic infrastructure for everyday life, railways require maintenance and status analysis that carry real economic and safety value. However, current datasets in the railway domain are limited in size Gan et al. (2017), image quality Faghih-Roohi et al. (2016); Feng et al. (2020); Gan et al. (2017), or annotation types Feng et al. (2020); Gan et al. (2017). Given their limited size and quality, currently available datasets are not yet ready to support the training of deep learning methods.

Our dataset has enough high-quality images captured from real-world railways to enable the training of deep learning models. Besides the labeled set of 1.1k images, we also provide an unlabeled set of 4k images to enable a semi-supervised setting. Several unique characteristics of our dataset also pose new challenges to vision algorithms. The first challenge is the long-tailed distribution of classes in our dataset: the imbalance ratio between the most frequent class and the least frequent class is up to 40.98, and it has been shown that a long-tailed distribution greatly hurts the performance of the learned model Liu et al. (2019); Gupta et al. (2019). Besides the long-tailed class distribution in the labeled set, the unlabeled set also poses a difficult scenario for semi-supervised defect detection. Semi-supervised object detection is a relatively new task with few recent works Gao et al. (2019); Sohn et al. (2020), and previous methods often assume that the unlabeled set is also curated. In our case, however, the unlabeled set is uncurated, with multiple unknown image corruptions and objects unseen in the labeled set. Given these unique properties, we believe that our proposed dataset could facilitate not only the development of algorithms for rail surface defects detection, but also the development of more robust vision models that handle long-tailed distributions and possible corruptions in the unlabeled set.

2 Related work

Traditional inspection methods, such as subjective manual observation and sampling checks, are all qualitative or compensating methods and cannot provide a digital, automatic decision-making basis for intelligent maintenance of the whole line. Our dataset mainly focuses on the task of defect detection, and we summarize the relevant literature in this section.

2.1 Natural Image Dataset

Surface defect detection is most closely related to the object detection task in computer vision. Common benchmarks for visual object detection are constructed from natural images, such as Pascal VOC Everingham et al. (2010) and MS-COCO Lin et al. (2014). These datasets are mostly balanced in terms of class distribution. The LVIS dataset Gupta et al. (2019) proposed a larger collection of images with a long-tailed distribution of classes. Our proposed dataset also has a long-tailed class distribution. Unlike general natural image datasets, our dataset also presents a fine-grained class definition due to the nature of railway images.

| Domain | Dataset | Task | # classes | # images | # boxes per image | Resolution / color | Annotation quality |
|---|---|---|---|---|---|---|---|
| Rail Defects | Delft Faghih-Roohi et al. (2016) | cls | 6 | 3240 | 1 | gray-scale | image-level |
| Rail Defects | RSDDs Gan et al. (2017) | seg | 2 | 195 | 5 | gray-scale | image-level |
| Rail Defects | CRRC Feng et al. (2020) | det | 3 | >1000 | 1 | gray-scale | band-level |
| Rail Defects | Rail-5k (labeled) | det | 13 | 1100 | 22.9 | RGB | instance-level |
| Natural Image | VOC-2007 | det | 20 | 12974 | 3.1 | - | instance-level |
| Natural Image | VOC-2012 Everingham et al. (2010) | det | 20 | 34071 | 2.7 | 469 x 387 RGB | instance-level |
| Natural Image | ILSVRC-2014 Deng et al. (2009) | det | 200 | 516840 | 1.1 | 482 x 415 RGB | instance-level |
| Natural Image | MS COCO 2018 Lin et al. (2014) | det | 80 | 163957 | 7.3 | - | instance-level |
| Natural Image | OID V6 Kuznetsova et al. (2020) | det | 600 | 1910098 | 8.4 | - | instance-level |

Table 1: Comparison of datasets.

2.2 Synthetic Corruption Dataset

There are also many datasets that focus on testing the robustness of deep-learning models under domain shift and image corruptions, such as ImageNet-C Hendrycks and Dietterich (2019), CityScapes-C Michaelis et al. (2019), and COCO-C Michaelis et al. (2019). However, the corruptions in these datasets are synthetic, generated using image processing techniques. They are also mainly used as test sets for measuring robustness rather than as training sets. In our dataset, the labeled set is well-curated, but the unlabeled set may contain various real-world corruptions and thus poses a new challenge for semi-supervised learning methods.

2.3 Rail Defects Dataset

In the rail engineering domain, images are mostly collected in the form of atlases for manual reference. There are classification and detection datasets of railway scenes Zendel et al. (2019), as well as ultrasonic inspection datasets IEM-RM (2003), but real-world datasets for rail surface defects are still lacking. Faghih-Roohi et al. (2016) collect and label 100 x 50 resolution images in 6 defect classes. The RSDDs dataset Gan et al. (2017) contains 195 gray-scale images of 2 kinds of railway with segmentation masks. Feng et al. (2020) collect thousands of images and annotate corrugation, fatigue and spalling in the band region. The datasets above are all collected by high-speed linear scan cameras with low resolution and coarse-grained annotation. As a consequence, none of them can drive the training of robust deep learning algorithms for real-world use.

| Class | # Boxes | # Images | # Large | # Medium | # Small |
|---|---|---|---|---|---|
| Running surface | 1082 | 1080 | 1082 | 0 | 0 |
| Contact band | 1093 | 1087 | 1092 | 0 | 1 |
| Dark Contact Band | 773 | 769 | 773 | 0 | 0 |
| Spalling | 12582 | 1005 | 1277 | 5147 | 6148 |
| Crack | 3785 | 375 | 2965 | 784 | 36 |
| Corrugation | 3349 | 445 | 3329 | 17 | 3 |
| Grinding | 337 | 179 | 336 | 1 | 0 |
| Fastening | 757 | 582 | 750 | 7 | 0 |
| Spike Screw | 502 | 424 | 475 | 27 | 0 |
| Set Screw | 414 | 360 | 400 | 14 | 0 |
| Indentation | 307 | 216 | 4 | 237 | 66 |
| Burning | 41 | 10 | 41 | 0 | 0 |
| Welded Joint | 14 | 8 | 14 | 0 | 0 |

Table 2: Category statistics.
Figure 1: Typical image capture and annotations.

3 The Rail-5k dataset

3.1 Rail Image Acquisition

Rail surface defects are mostly caused by metal fatigue under the constant load from the wheels in the high-speed sections of a railway system. Rail images in the Rail-5k dataset were captured by specialized cameras mounted on inspection cars riding along the railway, with the lens 200 mm vertically above the rail surface and pointing straight down. We excluded images with shadows or overexposure on the rail surface before handing them to railway experts for labeling. We collected annotations for 1100 RGB images with pixels in resolution, covering scenarios such as tunnels, elevated bridges, straight and curved lines, inner and outer rails, and rails before and after grinding or milling. fig. 2 shows the map of a typical rail section where we collected images. Each dot represents an image.

We also collected 3k uncurated images of rail surfaces. These images contain unknown corruptions and objects unseen in the labeled set. fig. 3 shows some typical images in the unlabeled set.

In summary, our dataset contains two parts: the labeled subset with 1k labeled images, and the unlabeled subset with 3k images. Thus our dataset can support both supervised and semi-supervised learning settings.

Figure 2: Map of typical sample points.
Figure 3: The first row shows corrupted images and the second row shows the prediction results. Many false positives can be observed.

3.2 Fine-grained class definition and instance-level annotation

The annotations in our dataset were labeled by ten railway experts, and each labeled image was checked by at least three experts. Based on expert knowledge and railway standards, we use a fine-grained class definition and an instance-level annotation paradigm for railway defect detection. The labeling principles are listed in table 3.

Note that crack areas are sharp, thin objects with no clear edge boundary, so we annotate them with a segmentation mask.

| Size | Boundary | Typical class | Annotation paradigm |
|---|---|---|---|
| Large | clear | Rail surface, Fastener, Screw | external rectangular box (same as common detection) |
| Large | obscure lump | Corrugation | wave valley of corrugation |
| Small | clear | Spalling, Indentation | stripped dent |
| Diffuse | sharp | Crack | union regions of small and dense boxes that envelop diffuse cracking regions |

Table 3: Annotation paradigm.

3.3 Dataset Splitting

We randomly split 20% of the 1,100 labeled images into the test set; the remaining images are used as the training set in the supervised setting. For the semi-supervised setting, we use the same test set to evaluate performance, so that results are comparable.
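The random 80/20 split described above can be sketched as follows. This is only an illustration: the function name, the integer image ids, and the fixed seed are assumptions, and the split actually used for the benchmark may differ.

```python
import random

def split_labeled_images(image_ids, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the labeled images as the test set."""
    ids = sorted(image_ids)        # sort first so the result is reproducible
    rng = random.Random(seed)      # local RNG; the seed value is an assumption
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train ids, test ids)

# 1,100 labeled images, identified here by hypothetical integer ids.
train_ids, test_ids = split_labeled_images(range(1100))
print(len(train_ids), len(test_ids))
```

With 1,100 labeled images this holds out 220 images for testing and keeps 880 for training; the same test ids would then be reused in the semi-supervised setting.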

4 Annotation Statistics

In this section, we present the statistics of our dataset from three aspects: the distribution of images and bounding boxes across classes, the sizes and aspect ratios of bounding boxes, and the center points of the annotated bounding boxes.

4.1 Class distribution

fig. 5 shows the number of images and annotations for each class. The Burning and Welded Joint classes are ignored in our experiments and benchmark because they appear so rarely. The imbalance ratio between the most frequent and the least frequent class is 40.98 with respect to the number of bounding boxes and 6.07 with respect to the number of images.
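As a worked check, both imbalance ratios can be recomputed from the per-class counts in table 2, leaving out Burning and Welded Joint as in the benchmark. The dictionary layout and the helper name below are illustrative choices.

```python
# Per-class counts copied from Table 2 (Burning and Welded Joint excluded).
box_counts = {
    "Running surface": 1082, "Contact band": 1093, "Dark Contact Band": 773,
    "Spalling": 12582, "Crack": 3785, "Corrugation": 3349, "Grinding": 337,
    "Fastening": 757, "Spike Screw": 502, "Set Screw": 414, "Indentation": 307,
}
image_counts = {
    "Running surface": 1080, "Contact band": 1087, "Dark Contact Band": 769,
    "Spalling": 1005, "Crack": 375, "Corrugation": 445, "Grinding": 179,
    "Fastening": 582, "Spike Screw": 424, "Set Screw": 360, "Indentation": 216,
}

def imbalance_ratio(counts):
    """Ratio between the most frequent and the least frequent class."""
    return max(counts.values()) / min(counts.values())

box_ratio = imbalance_ratio(box_counts)      # Spalling vs. Indentation
image_ratio = imbalance_ratio(image_counts)  # Contact band vs. Grinding
print(f"{box_ratio:.2f} {image_ratio:.2f}")  # prints "40.98 6.07"
```

The box ratio comes from Spalling (12582) against Indentation (307), and the image ratio from Contact band (1087) against Grinding (179).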

4.2 Sizes and aspect ratios of bounding boxes

Bounding box annotations in our dataset vary dramatically in size and aspect ratio. There are tall, narrow objects and short, wide objects, such as the rail surface and contact band, as well as roughly square objects (fastener and screw). Besides, as shown in fig. 5, there are large numbers of densely distributed small objects such as spalling.
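To make the size and aspect-ratio statistics concrete, the sketch below buckets a box by area and reports its width-to-height ratio. The thresholds are the MS-COCO convention (small below 32 x 32 pixels, large from 96 x 96 pixels); this is an assumption, since the text does not state which thresholds define Large, Medium and Small in table 2.

```python
SMALL_AREA = 32 ** 2    # COCO convention, assumed here
LARGE_AREA = 96 ** 2    # COCO convention, assumed here

def describe_box(width, height):
    """Return (size bucket, width/height aspect ratio) for one bounding box."""
    area = width * height
    if area < SMALL_AREA:
        size = "small"
    elif area < LARGE_AREA:
        size = "medium"
    else:
        size = "large"
    return size, width / height

# Hypothetical boxes: a tiny square defect and a long, wide band.
print(describe_box(20, 20))    # ('small', 1.0)
print(describe_box(400, 50))   # ('large', 8.0)
```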

Figure 4: PR curve
Figure 5: Concrete and Constructions

4.3 Object positions

fig. 7 shows the distribution of objects’ center positions in our dataset. Because of the special shooting paradigm, rail surfaces usually lie horizontally or vertically in images. As a consequence, defects usually spread at the cross-zone in images.

Figure 6: Width-height ratio of all annotations.
Figure 7: Center positions of all annotations.

5 Pilot Study on the Rail-5k Dataset

In this section, we conduct experiments on several aspects to investigate the challenges and potential of the Rail-5k dataset. We train an object detection model and a semantic segmentation model on Rail-5k as our baselines and show the challenging attributes of our dataset.

Additionally, we propose a semi-supervised benchmark for object detection.

Figure 8: Typical prediction results on testset.

5.1 Benchmark for Detection

There are many popular detectors Redmon et al. (2016); Ren et al. (2015); Lin et al. (2017) for general object detection datasets. Recently, many new methods have been proposed and have achieved state-of-the-art results on the MS-COCO benchmark Lin et al. (2014). For example, YOLOv5 Jocher et al. (2021) is a light-weight model with mosaic augmentation and a Generalized Intersection over Union (GIoU) loss. In our experiments, we fine-tuned YOLOv5-s with MS-COCO pretraining as the baseline on our dataset. Detailed training settings follow data/hyp.finetune.yaml; we implemented our experiments with Release v4.0 of the YOLOv5 code base Jocher et al. (2021).

Figure 9: PR curve.
Figure 10: Ablation experiments.
| Class | Precision | Recall | AP@0.5 | mAP@0.5:0.95 | AP |
|---|---|---|---|---|---|
| Rail Surface | 77.5 | 99.1 | 98.9 | 90.6 | 98.6 |
| Contact Band | 60.2 | 97.7 | 94.5 | 71.9 | 96.3 |
| Spalling | 33.2 | 74 | 60 | 24.8 | 58.9 |
| Corrugation | 60.3 | 91.2 | 89.3 | 48.2 | 87.6 |
| Grinding | 21.4 | 38.8 | 24 | 7.4 | 24.1 |
| Dark Contact Band | 64.4 | 81.4 | 76.7 | 36.7 | 83.4 |
| Fastener | 47.5 | 91.3 | 83.8 | 62.9 | 86.1 |
| Spike Screw | 37.8 | 92.5 | 86.8 | 48.6 | 91.8 |
| Set Screw | 58.6 | 88.2 | 87.3 | 52.2 | 88.5 |
| Indentation | 0 | 0 | 0.7 | 0.1 | 16.2 |
| Crack | - | - | - | - | - |

Table 4: Metrics of baseline model for detection.

The detector's performance on crack is extremely low. This is because a crack is more of a texture than an object, with no clear definition of separate instances. Thus, we chose to tackle this problem with a different approach, which is discussed in section 5.2.

5.2 Benchmark for Crack

A cracking region is more of a texture or pattern than an object. Thus, we use segmentation to identify this class, because detection cannot recognize it well. We use the DeepLabv3 Chen et al. (2017) architecture with a ResNet50 He et al. (2016) backbone as the segmentation model. The model is trained for 9000 iterations with a batch size of 16. We use SGD with momentum as the optimizer. Momentum and weight decay are set to 0.01 and 1e-4, respectively. To evaluate our benchmark, we choose the most common segmentation metric, Intersection over Union (IoU). The model achieves 98.9% IoU on background and 67.8% IoU on crack, which is much better than the detection performance. DeepLabv3 can learn the main, obvious cracks but ignores the tiny ones.
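For reference, per-class IoU on a pair of label maps can be computed as in the minimal sketch below. It uses plain nested lists and a hypothetical encoding (0 for background, 1 for crack); it illustrates the metric and is not the evaluation code used for the benchmark.

```python
def class_iou(pred, target, cls):
    """IoU of one class between a predicted and a ground-truth label map."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_hit, t_hit = (p == cls), (t == cls)
            inter += p_hit and t_hit    # pixel is this class in both maps
            union += p_hit or t_hit     # pixel is this class in either map
    return inter / union if union else float("nan")

# Toy 2 x 4 label maps; 0 = background, 1 = crack (assumed encoding).
pred   = [[0, 0, 1, 1],
          [0, 1, 1, 0]]
target = [[0, 0, 1, 0],
          [0, 1, 1, 1]]
print(class_iou(pred, target, cls=1))   # 3 shared crack pixels / 5 in the union = 0.6
```

On a dataset, the intersection and union counts are usually accumulated over all test images before dividing, rather than averaging per-image values.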

| Class | | | | |
|---|---|---|---|---|
| Rail Surface | 98.1 | 98.0 | 98.1 | 97.7 |
| Contact Band | 78.4 | 77.9 | 77.1 | 77.0 |
| Spalling | 60.1 | 58.9 | 57.9 | 58.2 |
| Corrugation | 89.6 | 89.2 | 89.5 | 88.6 |
| Grinding | 23.0 | 23.6 | 23.5 | 22.1 |
| Dark Contact Band | 92.7 | 92.9 | 93.1 | 92.4 |
| Fastener | 86.5 | 86.1 | 85.8 | 83.2 |
| Spike Screw | 93.2 | 94.6 | 91.3 | 87.4 |
| Set Screw | 88 | 88.5 | 87.2 | 85.4 |
| Indentation | 15.9 | 16.4 | 13.4 | 15.3 |
| Crack | - | - | - | - |
| mAP@0.5 | 63.29 | 63.27 | 62.43 | 61.55 |

Table 5: Metrics of baseline model for semi-supervised detection.
Figure 11: The first row shows segmentation prediction results and the second row shows the labels.

5.3 Benchmark for Semi-supervised Learning

With the additional 3k unlabeled images, we propose a semi-supervised object detection benchmark. We present the results in table 5.

These results are generated with a simple pseudo-label technique. We ran inference on the unlabeled images with the YOLOv5-s model trained following the strategy described in section 5.1. We then apply a confidence score threshold on all predictions and use the remaining predictions as pseudo labels. Finally, we fine-tuned this model jointly on the labeled images and the unlabeled images with pseudo labels for 1 epoch with a base learning rate of 4e-4. The other training settings are the same as in section 5.1.
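The filtering step of this pseudo-label procedure can be sketched as follows. The prediction format, the function name, and the example threshold of 0.5 are all assumptions made for illustration; the threshold value used in the experiments is not given here. Dropping images with no surviving detection is likewise a design choice of this sketch.

```python
def select_pseudo_labels(predictions, score_threshold):
    """Keep detections whose confidence reaches the threshold.

    predictions maps an image id to a list of (class_id, score, box) tuples,
    where box is (x1, y1, x2, y2). This layout is hypothetical.
    """
    pseudo_labels = {}
    for image_id, detections in predictions.items():
        kept = [(class_id, box)
                for class_id, score, box in detections
                if score >= score_threshold]
        if kept:    # images with no confident detection are dropped
            pseudo_labels[image_id] = kept
    return pseudo_labels

# Toy detector output on two unlabeled images.
predictions = {
    "img_a": [(3, 0.91, (10, 20, 40, 60)), (3, 0.30, (50, 50, 70, 80))],
    "img_b": [(5, 0.12, (0, 0, 15, 15))],
}
pseudo = select_pseudo_labels(predictions, score_threshold=0.5)  # 0.5 is illustrative
print(pseudo)   # {'img_a': [(3, (10, 20, 40, 60))]}
```

A higher threshold keeps fewer but cleaner pseudo labels, which matters when the unlabeled images contain corruptions and unseen objects that produce false positives.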

As shown in table 5, detectors usually perform worse after being fine-tuned under semi-supervision. This could be caused by corruption and noise in the unlabeled images.

6 Conclusion

We introduce Rail-5k, a real-world dataset for rail surface defects detection. We capture rail images across China and provide fine-grained instance-level annotations. This dataset poses new challenges both in rail maintenance and computer vision. As a baseline, we provide a pilot study on Rail-5k using off-the-shelf detection models. In later versions, Rail-5k will include more images and patterns, as well as more defects categories and image modalities, such as 3D-scan or eddy current data. This would make Rail-5k an even more standardized and inclusive real-world dataset. We hope this dataset will encourage more work on improving visual recognition methods for rail maintenance, particularly on object detection and semantic segmentation for real-world, fine-grained, small, and dense defects.


  • [1] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CVPR. Cited by: §5.2.
  • [2] M. T. Chiu, X. Xu, K. Wang, J. Hobbs, N. Hovakimyan, T. S. Huang, and H. Shi (2020) The 1st agriculture-vision challenge: methods and results. In CVPR Workshop, Cited by: §1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §1, Table 1.
  • [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV. Cited by: §2.1, Table 1.
  • [5] S. Faghih-Roohi, S. Hajizadeh, A. Núñez, R. Babuska, and B. De Schutter (2016) Deep convolutional neural networks for detection of rail surface defects. In IJCNN, Cited by: §1, §2.3, Table 1.
  • [6] J. H. Feng, H. Yuan, Y. Q. Hu, J. Lin, S. W. Liu, and X. Luo (2020) Research on deep learning method for rail surface defect detection. IET Electrical Systems in Transportation. Cited by: §1, §2.3, Table 1.
  • [7] J. Gan, Q. Li, J. Wang, and H. Yu (2017) A hierarchical extractor-based visual rail surface inspection system. IEEE Sensors Journal. Cited by: §1, §2.3, Table 1.
  • [8] M. Gao, Z. Zhang, G. Yu, S. Ö. Arik, L. S. Davis, and T. Pfister (2019) Consistency-based semi-supervised active learning: towards minimizing labeling cost. In NeurIPS, Cited by: §1.
  • [9] A. Gupta, P. Dollár, and R. B. Girshick (2019) LVIS: A dataset for large vocabulary instance segmentation. In CVPR, Cited by: §1, §2.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §5.2.
  • [11] D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. ICLR. Cited by: §2.2.
  • [12] S. E. IEM-RM (2003) B-scan ultrasonic image analysis for internal rail defect detection. Cited by: §2.3.
  • [13] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In AAAI, Cited by: §1.
  • [14] G. Jocher, A. Stoken, J. Borovec, NanoCode012, ChristopherSTAN, L. Changyu, Laughing, tkianai, yxNONG, A. Hogan, lorenzomammana, AlexWang1900, A. Chaurasia, L. Diaconu, Marc, wanghaoyang0106, ml5ah, Doug, Durgesh, F. Ingham, Frederik, Guilhen, A. Colmagro, H. Ye, Jacobsolawetz, J. Poznanski, J. Fang, J. Kim, K. Doan, and L. Yu (2021) ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration. Cited by: §5.1.
  • [15] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. (2020) The open images dataset v4. IJCV. Cited by: Table 1.
  • [16] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §5.1.
  • [17] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §2.1, Table 1, §5.1.
  • [18] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In CVPR, Cited by: §1.
  • [19] C. Michaelis, B. Mitzkus, R. Geirhos, E. Rusak, O. Bringmann, A. S. Ecker, M. Bethge, and W. Brendel (2019) Benchmarking robustness in object detection: autonomous driving when winter is coming. NeurIPS Workshop. Cited by: §2.2.
  • [20] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §5.1.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1, §5.1.
  • [22] K. Sohn, Z. Zhang, C. Li, H. Zhang, C. Lee, and T. Pfister (2020) A simple semi-supervised learning framework for object detection. arXiv:2005.04757. Cited by: §1.
  • [23] S. Yang, S. Yu, B. Zhao, and Y. Wang (2020) Reducing the feature divergence of rgb and near-infrared images using switchable normalization. In CVPR Workshop, Cited by: §1.
  • [24] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, S. Abbasi, and C. Beleznai (2019) RailSem19: a dataset for semantic rail scene understanding. In CVPR Workshop, Cited by: §2.3.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? This work also fits within the scope of the NeurIPS 2021 Datasets and Benchmarks Track.

    2. Did you describe the limitations of your work? This work only focuses on rail surface defects, and the dataset exhibits extreme label imbalance across categories along with real-world corrupted images.

    3. Did you discuss any potential negative societal impacts of your work? This work helps to detect rail defects and save costs for maintenance. It will never do harm to society.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them? Our paper conforms to all of the ethics rules.

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments (e.g. for benchmarks)…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Code has been uploaded to GitHub; see the URL in the appendix.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Data splits, baseline model hyperparameters, data augmentation, loss function… are all mentioned in paper and are reproducible.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? No obvious bias was found across multiple baseline models and a series of ablation experiments.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See  section 5.1

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets? This dataset is licensed under CC BY-NC-ND 4.0 license.

    3. Did you include any new assets either in the supplemental material or as a URL? More data and annotations will be added in the URL

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? We have reached an agreement with the authority that we may collect, process, and mine the data for academic purposes.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? No personal identifiable information is contained and all geographic details have been erased.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?