1 Introduction
The introduction of large-scale annotated datasets such as ImageNet Deng et al. (2009) has greatly accelerated the development of deep-learning-based vision algorithms He et al. (2016). Deep learning models pre-trained on ImageNet Deng et al. (2009) have also been shown to transfer effectively across domains and tasks such as object detection Ren et al. (2015), agriculture Yang et al. (2020); Chiu et al. (2020), and medical image analysis Irvin et al. (2019). As railways are an important piece of basic infrastructure, their maintenance and status analysis have real-world economic and safety value. However, current datasets in the railway domain are limited in size Gan et al. (2017), image quality Faghih-Roohi et al. (2016); Feng et al. (2020); Gan et al. (2017), or annotation types Feng et al. (2020); Gan et al. (2017), and are therefore not yet able to support the training of deep learning methods.
Our dataset contains enough high-quality images captured from real-world railways to enable the training of deep learning models. Besides the labeled set of 1.1k images, we also provide an unlabeled set of 4k images to enable a semi-supervised setting. Several unique characteristics of our dataset also pose new challenges to vision algorithms. The first challenge is the long-tailed class distribution: the imbalance ratio between the most frequent and the least frequent class is up to 40.98, and it has been shown that long-tailed distributions can greatly hurt the performance of learned models Liu et al. (2019); Gupta et al. (2019). Beyond the long-tailed class distribution in the labeled set, the unlabeled images pose a difficult scenario for semi-supervised defect detection. Semi-supervised object detection is a relatively new task with few recent works Gao et al. (2019); Sohn et al. (2020), and previous methods often assume that the unlabeled set is also curated. In our case, however, the unlabeled set is uncurated, with multiple unknown image corruptions and objects unseen in the labeled set. Given these unique properties, we believe our dataset can facilitate not only the development of algorithms for rail surface defect detection, but also the development of more robust vision models that handle long-tailed distributions and possible corruptions in the unlabeled set.
2 Related work
Traditional inspection methods such as subjective manual observation and sampling checks are qualitative or compensating methods; they cannot provide a digital, automatic decision-making basis for intelligent maintenance of a whole line. Our dataset mainly focuses on the task of defect detection, and we summarize the relevant literature in this section.
2.1 Natural Image Datasets
Surface defect detection is most closely related to the task of visual object detection. Common benchmarks for object detection are constructed from natural images, such as Pascal VOC Everingham et al. (2010) and MS-COCO Lin et al. (2014). These datasets are mostly balanced in terms of class distribution. The LVIS dataset Gupta et al. (2019) provides a larger collection of images with a long-tailed class distribution. Our proposed dataset also has a long-tailed class distribution; unlike general natural image datasets, it additionally features fine-grained class definitions due to the nature of railway images.

| Domain | Dataset | Task | # class | # image | # box per image | Resolution | Annotation quality |
|---|---|---|---|---|---|---|---|
| Rail Defects | Delft Faghih-Roohi et al. (2016) | cls | 6 | 3240 | 1 | gray-scale | image-level |
| | RSDDs Gan et al. (2017) | seg | 2 | 195 | 5 | gray-scale | image-level |
| | CRRC Feng et al. (2020) | det | 3 | >1000 | 1 | gray-scale | band-level |
| | Rail-5k (labeled) | det | 13 | 1100 | 22.9 | RGB | instance-level |
| Natural Image | VOC-2007 | det | 20 | 12974 | 3.1 | - | instance-level |
| | VOC-2012 Everingham et al. (2010) | det | 20 | 34071 | 2.7 | 469 x 387 RGB | instance-level |
| | ILSVRC-2014 Deng et al. (2009) | det | 200 | 516840 | 1.1 | 482 x 415 RGB | instance-level |
| | MS COCO 2018 Lin et al. (2014) | det | 80 | 163957 | 7.3 | - | instance-level |
| | OID V6 Kuznetsova et al. (2020) | det | 600 | 1910098 | 8.4 | - | instance-level |
2.2 Synthetic Corruption Datasets
There are also many datasets focusing on testing the robustness of deep learning models under domain shift and image corruptions, such as ImageNet-C Hendrycks and Dietterich (2019), CityScapes-C Michaelis et al. (2019), and COCO-C Michaelis et al. (2019). However, the corruptions in these datasets are synthetic, generated with image processing techniques, and the datasets are mainly used as test sets to measure robustness rather than as training sets. In our dataset, the labeled subset is well curated, but the unlabeled set may contain various real-world corruptions, thus posing a new challenge for semi-supervised learning methods.
2.3 Rail Defects Datasets
In the rail engineering domain, there are datasets focusing on the classification and detection of railway defects Zendel et al. (2019). In rail engineering, images mostly take the form of atlases for manual reference. There are classification and detection datasets of railway scenes Zendel et al. (2019), as well as ultrasonic inspection datasets IEM-RM (2003), but real-world datasets for rail surface defects are still lacking. Faghih-Roohi et al. Faghih-Roohi et al. (2016) collect and label 100 x 50 resolution images in 6 defect classes. The RSDDs dataset Gan et al. (2017) contains 195 gray-scale images of 2 kinds of railway with segmentation masks. Feng et al. Feng et al. (2020) collect thousands of images and annotate corrugation, fatigue, and spalling in band regions. The datasets above were all collected by high-speed linear scan cameras with low resolution and coarse-grained annotation; as a consequence, none of them can support the training of robust, real-world deep learning algorithms.
| Class | Running Surface | Contact Band | Dark Contact Band | Spalling | Crack | Corrugation | Grinding |
|---|---|---|---|---|---|---|---|
| # Boxes | 1082 | 1093 | 773 | 12582 | 3785 | 3349 | 337 |
| # Images | 1080 | 1087 | 769 | 1005 | 375 | 445 | 179 |
| # Large | 1082 | 1092 | 773 | 1277 | 2965 | 3329 | 336 |
| # Medium | 0 | 0 | 0 | 5147 | 784 | 17 | 1 |
| # Small | 0 | 1 | 0 | 6148 | 36 | 3 | 0 |

| Class | Fastening | Spike Screw | Set Screw | Indentation | Burning | Welded Joint |
|---|---|---|---|---|---|---|
| # Boxes | 757 | 502 | 414 | 307 | 41 | 14 |
| # Images | 582 | 424 | 360 | 216 | 10 | 8 |
| # Large | 750 | 475 | 400 | 4 | 41 | 14 |
| # Medium | 7 | 27 | 14 | 237 | 0 | 0 |
| # Small | 0 | 0 | 0 | 66 | 0 | 0 |

3 The Rail-5k dataset
3.1 Rail Image Acquisition
Rail surface defects are mostly caused by metal fatigue under constant wheel load in high-speed sections of a railway system. Rail images in the Rail-5k dataset were captured by specialized cameras mounted on inspection cars riding along the railway, with the lens positioned 200 mm above the rail surface and facing vertically downward. Images with shadows or overexposure on the rail surface were excluded before being given to railway experts for labeling. We collected annotations for 1100 high-resolution RGB images, covering scenarios such as tunnels, elevated bridges, straight and curved lines, inner and outer rails, and track before and after grinding or milling. fig. 1 shows the map of a typical rail section where we collected images; each dot represents an image.
We also collected 4k uncurated images of rail surfaces. These images contain unknown corruptions and objects unseen in the labeled set. fig. 3 shows some typical images in the unlabeled set.
In summary, our dataset contains two parts: a labeled subset of 1.1k annotated images and an unlabeled subset of 4k images. Our dataset can thus support both supervised and semi-supervised learning settings.


3.2 Fine-grained class definition and instance-level annotation
The annotations in our dataset were produced by ten railway experts, and each labeled image was checked by at least three experts. Based on expert knowledge and railway standards, we use a fine-grained class definition and an instance-level annotation paradigm for railway defect detection. The labeling principles are listed in table 3.
Note that crack areas are sharp, thin structures with no clear edge boundary, so we annotate them with a segmentation mask formed as the union of small, dense boxes (see the sketch after the table).
| Size | Boundary | Typical class | Annotation paradigm |
|---|---|---|---|
| Large | clear | Rail surface, Fastener, Screw | external rectangular box (same as common detection) |
| Large | obscure lump | Corrugation | wave valley of corrugation |
| Small | clear | Spalling, Indentation | stripped dent |
| Diffuse | sharp | Crack | union of small, dense boxes enveloping the diffuse cracking region |
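Because cracks are annotated as unions of small, dense boxes, those boxes can be rasterized into the binary mask used for segmentation-style training. The following is a minimal sketch of such a conversion; the function name and the (x1, y1, x2, y2) pixel-coordinate box format are illustrative assumptions, not part of a released annotation toolkit.

```python
import numpy as np

def boxes_to_union_mask(boxes, height, width):
    """Rasterize the union of small, dense crack boxes into a binary mask.

    boxes: iterable of (x1, y1, x2, y2) pixel coordinates (assumed format).
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the image bounds, then fill its region.
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        mask[y1:y2, x1:x2] = 1
    return mask
```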
3.3 Dataset Splitting
We randomly split 20% of the 1,100 labeled images into the test set; the remaining images are used as the training set in the supervised setting. For the semi-supervised setting, we use the same test set to evaluate performance for comparison.
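For illustration, the 80/20 split described above could be reproduced with a sketch like the following; the seed and helper name are hypothetical, not the released split script.

```python
import random

def split_rail5k(image_ids, test_ratio=0.2, seed=0):
    """Randomly hold out 20% of the labeled images as the test set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)       # fixed seed for reproducibility
    n_test = int(len(ids) * test_ratio)    # 220 of the 1,100 labeled images
    return ids[n_test:], ids[:n_test]      # (train, test)

train_ids, test_ids = split_rail5k(range(1100))
```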
4 Annotation Statistics
In this section, we present statistics of our dataset in three aspects: the distribution of images and bounding boxes across classes, the sizes and aspect ratios of bounding boxes, and the center points of annotated bounding boxes.
4.1 Class distribution
fig. 5 shows the number of images and bounding-box annotations for each class. Burning and Welded Joint are excluded from our experiments and benchmark because of their rare appearance. The imbalance ratio between the most frequent and least frequent class is 40.98 in terms of bounding boxes and 6.07 in terms of images.
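For concreteness, both ratios can be recovered from the per-class counts in the statistics table above:

```python
# Per-class counts taken from the statistics table (Burning and
# Welded Joint are excluded from the benchmark).
most_boxes, fewest_boxes = 12582, 307    # Spalling vs. Indentation
most_images, fewest_images = 1087, 179   # Contact Band vs. Grinding

print(f"box-level imbalance ratio:   {most_boxes / fewest_boxes:.2f}")    # 40.98
print(f"image-level imbalance ratio: {most_images / fewest_images:.2f}")  # 6.07
```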
4.2 Sizes and aspect ratios of bounding boxes
(figure: distribution of bounding box sizes and aspect ratios) Bounding box annotations in our dataset vary dramatically in size and aspect ratio. There are tall, narrow objects as well as short, wide objects (such as rail surfaces and contact bands), and roughly square objects (fasteners and screws). In addition, as shown in fig. 5, there are tremendous numbers of densely distributed small objects such as spalling.
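The # Large / # Medium / # Small rows of the statistics table suggest an area-based bucketing of boxes. A minimal sketch follows; we assume COCO's standard 32² and 96² pixel-area thresholds, which the text does not state explicitly.

```python
def size_bucket(w, h):
    """Bucket a box by pixel area, assuming COCO's 32^2 / 96^2 thresholds."""
    area = w * h
    if area < 32 ** 2:
        return "small"    # e.g. a 30 x 20 spalling dent (area 600)
    if area < 96 ** 2:
        return "medium"   # e.g. a 50 x 100 box (area 5,000)
    return "large"        # e.g. a full rail surface box
```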


4.3 Object positions
fig. 7 shows the distribution of object center positions in our dataset. Because of the fixed shooting setup, rail surfaces usually lie horizontally or vertically in images; as a consequence, defects are usually spread along the cross-shaped zone of the image.


5 Pilot Study on the Rail-5k Dataset
In this section, we conducted comprehensive experiments in several aspects to investigate the challenges and potential of the Rail-5k dataset. We trained an object detection model and a semantic segmentation model on Rail-5k as baselines and showed the challenging attributes of our dataset.
Additionally, we proposed a semi-supervised benchmark for object detection.

5.1 Benchmark for Detection
There are many popular detectors Redmon et al. (2016); Ren et al. (2015); Lin et al. (2017) for general object detection. Recently, many new methods have been proposed, achieving state-of-the-art results on the MS-COCO benchmark Lin et al. (2014). For example, YOLOv5 Jocher et al. (2021) is a light-weight model with mosaic augmentation and a Generalized Intersection over Union (GIoU) loss. In our experiments, we fine-tuned YOLOv5-s with MS-COCO pre-training as the baseline on our dataset. Detailed training settings follow data/hyp.finetune.yaml¹.

¹ We implemented our experiments with Release v4.0 from https://github.com/ultralytics/yolov5/blob/develop/data/hyp.finetune.yaml
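For reference, the GIoU loss mentioned above can be sketched in a few lines of PyTorch. This is a minimal sketch of the standard formulation for axis-aligned (x1, y1, x2, y2) boxes; YOLOv5's internal implementation differs in detail.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    # Intersection area.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union area.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box; GIoU penalizes its empty area.
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return (1 - giou).mean()
```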


Class | Precision | Recall | AP@0.5 | mAP@0.5:0.95 | AP |
---|---|---|---|---|---|
Rail Surface | 77.5 | 99.1 | 98.9 | 90.6 | 98.6 |
Contact Band | 60.2 | 97.7 | 94.5 | 71.9 | 96.3 |
Spalling | 33.2 | 74 | 60 | 24.8 | 58.9 |
Corrugation | 60.3 | 91.2 | 89.3 | 48.2 | 87.6 |
Grinding | 21.4 | 38.8 | 24 | 7.4 | 24.1 |
Dark Contact Band | 64.4 | 81.4 | 76.7 | 36.7 | 83.4 |
Fastener | 47.5 | 91.3 | 83.8 | 62.9 | 86.1 |
Spike Screw | 37.8 | 92.5 | 86.8 | 48.6 | 91.8 |
Set Screw | 58.6 | 88.2 | 87.3 | 52.2 | 88.5 |
Indentation | 0 | 0 | 0.7 | 0.1 | 16.2 |
Crack | - | - | - | - | - |
It can be noticed that the detector's performance on Crack is extremely low. This is because a crack is more a texture than an object, with no clear definition of separate instances. We therefore tackle this problem with a different approach, discussed further in section 5.2.
5.2 Benchmark for Crack
The cracking region is more of a texture and pattern than an object, so we use segmentation to identify this class, which detection cannot recognize well. We use the DeepLabv3 Chen et al. (2017) architecture with a ResNet-50 He et al. (2016) backbone as the segmentation model. The model is trained for 9000 iterations with a batch size of 16, using SGD with momentum as the optimizer; momentum and weight decay are set to 0.01 and 1e-4, respectively. For evaluation we use the most common segmentation metric, Intersection over Union (IoU). The model achieves 98.9% IoU on background and 67.8% IoU on crack, which is much better than the detection performance. DeepLabv3 can learn the main, obvious cracks but tends to miss tiny ones.
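A minimal sketch of this setup, using torchvision's DeepLabv3-ResNet50, might look as follows. The momentum, weight decay, batch size, and iteration count follow the text; the learning rate, input resolution, and placeholder data are assumptions.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Two classes: background and crack.
model = deeplabv3_resnet50(num_classes=2)
model.train()

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,         # assumed; not stated in the text
    momentum=0.01,   # as stated in the text (0.9 is more typical)
    weight_decay=1e-4,
)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on a placeholder batch of 16; a real
# data loader over rail images and crack masks would replace this and
# run for 9,000 iterations.
images = torch.randn(16, 3, 512, 512)
masks = torch.randint(0, 2, (16, 512, 512))

logits = model(images)["out"]   # (16, 2, 512, 512) per-pixel scores
loss = criterion(logits, masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```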
Class | ||||
---|---|---|---|---|
Rail Surface | 98.1 | 98.0 | 98.1 | 97.7 |
Contact Band | 78.4 | 77.9 | 77.1 | 77.0 |
Spalling | 60.1 | 58.9 | 57.9 | 58.2 |
Corrugation | 89.6 | 89.2 | 89.5 | 88.6 |
Grinding | 23.0 | 23.6 | 23.5 | 22.1 |
Dark Contact Band | 92.7 | 92.9 | 93.1 | 92.4 |
Fastener | 86.5 | 86.1 | 85.8 | 83.2 |
Spike Screw | 93.2 | 94.6 | 91.3 | 87.4 |
Set Screw | 88 | 88.5 | 87.2 | 85.4 |
Indentation | 15.9 | 16.4 | 13.4 | 15.3 |
Crack | - | - | - | - |
mAP@0.5 | 63.29 | 63.27 | 62.43 | 61.55 |

5.3 Benchmark for Semi-supervised Learning
With the additional 4k unlabeled images, we propose a semi-supervised object detection benchmark and present results in table 5.
These results are generated with a simple pseudo-labeling technique. We first ran inference on the unlabeled images with the YOLOv5-s model trained following the strategy described in section 5.1. We then applied a confidence score threshold to all predictions and used the remaining predictions as pseudo labels. Finally, we fine-tuned the model jointly on the labeled images and the pseudo-labeled unlabeled images for 1 epoch with a base learning rate of 4e-4; other training settings are the same as in section 5.1. As shown in table 5, detectors usually perform worse after semi-supervised fine-tuning, which could be caused by corruption and noise in the unlabeled images.
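A minimal sketch of this pseudo-labeling step is shown below; the threshold value `tau` is not specified in the text, and the `detector.predict` interface is hypothetical rather than the actual YOLOv5 API.

```python
def make_pseudo_labels(detector, image_paths, tau):
    """Keep only predictions whose confidence is at least tau as pseudo labels."""
    pseudo_labels = {}
    for path in image_paths:
        # Hypothetical interface returning [(box, class_id, score), ...].
        detections = detector.predict(path)
        kept = [(box, cls) for box, cls, score in detections if score >= tau]
        if kept:  # images with no confident prediction contribute no labels
            pseudo_labels[path] = kept
    return pseudo_labels
```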
6 Conclusion
We introduce Rail-5k, a real-world dataset for rail surface defect detection. We captured rail images across China and provide fine-grained, instance-level annotations. The dataset poses new challenges in both rail maintenance and computer vision. As a baseline, we provide a pilot study on Rail-5k using off-the-shelf detection models. In later versions, Rail-5k will include more images and patterns, as well as more defect categories and image modalities such as 3D scans or eddy-current data, making it an even more standardized and inclusive real-world dataset. We hope this dataset will encourage more work on visual recognition methods for rail maintenance, particularly object detection and semantic segmentation for real-world, fine-grained, small, and dense defects.
References
- [1] Chen et al. (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587.
- [2] Chiu et al. (2020) The 1st Agriculture-Vision challenge: methods and results. In CVPR Workshop.
- [3] Deng et al. (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- [4] Everingham et al. (2010) The PASCAL Visual Object Classes (VOC) challenge. IJCV.
- [5] Faghih-Roohi et al. (2016) Deep convolutional neural networks for detection of rail surface defects. In IJCNN.
- [6] Feng et al. (2020) Research on deep learning method for rail surface defect detection. IET Electrical Systems in Transportation.
- [7] Gan et al. (2017) A hierarchical extractor-based visual rail surface inspection system. IEEE Sensors Journal.
- [8] Gao et al. (2019) Consistency-based semi-supervised active learning: towards minimizing labeling cost. In NeurIPS.
- [9] Gupta et al. (2019) LVIS: a dataset for large vocabulary instance segmentation. In CVPR.
- [10] He et al. (2016) Deep residual learning for image recognition. In CVPR.
- [11] Hendrycks and Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In ICLR.
- [12] IEM-RM (2003) B-scan ultrasonic image analysis for internal rail defect detection.
- [13] Irvin et al. (2019) CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In AAAI.
- [14] Jocher et al. (2021) ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration.
- [15] Kuznetsova et al. (2020) The Open Images Dataset V4. IJCV.
- [16] Lin et al. (2017) Focal loss for dense object detection. In ICCV.
- [17] Lin et al. (2014) Microsoft COCO: common objects in context. In ECCV.
- [18] Liu et al. (2019) Large-scale long-tailed recognition in an open world. In CVPR.
- [19] Michaelis et al. (2019) Benchmarking robustness in object detection: autonomous driving when winter is coming. In NeurIPS Workshop.
- [20] Redmon et al. (2016) You Only Look Once: unified, real-time object detection. In CVPR.
- [21] Ren et al. (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS.
- [22] Sohn et al. (2020) A simple semi-supervised learning framework for object detection. arXiv:2005.04757.
- [23] Yang et al. (2020) Reducing the feature divergence of RGB and near-infrared images using Switchable Normalization. In CVPR Workshop.
- [24] Zendel et al. (2019) RailSem19: a dataset for semantic rail scene understanding. In CVPR Workshop.
Checklist
- For all authors…
  - Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Yes; this work also fits the scope of the NeurIPS 2021 Datasets and Benchmarks Track.
  - Did you describe the limitations of your work? Yes; this work focuses only on rail surface defects, and the dataset exhibits extreme label imbalance across categories along with real-world corrupted images.
  - Did you discuss any potential negative societal impacts of your work? This work helps detect rail defects and reduce maintenance costs; we do not foresee harm to society.
  - Have you read the ethics review guidelines and ensured that your paper conforms to them? Yes; our paper conforms to all of the ethics rules.
- If you are including theoretical results…
  - Did you state the full set of assumptions of all theoretical results?
  - Did you include complete proofs of all theoretical results?
- If you ran experiments (e.g., for benchmarks)…
  - Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Yes; the code has been uploaded to GitHub (see the URL in the appendix).
  - Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Yes; data splits, baseline model hyperparameters, data augmentation, and loss functions are all specified in the paper and are reproducible.
  - Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? No obvious bias was found across multiple baseline models and a series of ablation experiments.
  - Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See section 5.1.
- If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  - If your work uses existing assets, did you cite the creators?
  - Did you mention the license of the assets? This dataset is licensed under the CC BY-NC-ND 4.0 license.
  - Did you include any new assets either in the supplemental material or as a URL? More data and annotations will be added at the URL.
  - Did you discuss whether and how consent was obtained from people whose data you're using/curating? We have reached an agreement with the authority that we may collect, process, and mine the data for academic purposes.
  - Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? No personally identifiable information is contained, and all geographic details have been erased.
- If you used crowdsourcing or conducted research with human subjects…
  - Did you include the full text of instructions given to participants and screenshots, if applicable?
  - Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
  - Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?