This work is funded by the German Federal Ministry of Education and Research under grant number 13N13891.
The inspection of sewer pipes is a crucial task to ensure the functionality of sewage systems. Many sewer pipes in big cities are several decades old, and some are more than one hundred years old. Therefore, regular risk assessment and sanitation planning are needed to keep the sewer system working correctly. At present, mobile robot systems equipped with cameras or other sensors are steered manually through the pipes. They produce large amounts of data in which defects have to be annotated manually by technical staff specially trained for this task. Because this work is repetitive and tiresome, the obtained results are error-prone. To assist workers, reliable computer systems are needed that automatically detect certain defects in sewer pipes. Such a system would typically consist of two modules: first, a pre-processing module that converts raw input images into a form that can be processed automatically, and second, a detection/classification module that performs automated annotation of the provided data.
One early work on the estimation of camera poses in sewer pipes was proposed by Cooper et al. [Cooper.1998]. The authors exploited the longitudinal mortar lines for camera pose recovery, limiting the system to stone-walled pipes. In [Kannala.2008], the profile of sewer pipes is reconstructed solely from fisheye video sequences. The approach is based on the tracking of feature points over more than three views, which is not feasible in our application due to the distance of 5 cm and the resulting large changes between consecutive images. Furthermore, the system was only tested for concrete pipes, which have a relatively well-structured texture for feature detection and matching. Esquivel et al. [Esquivel.2005, Esquivel.2010] proposed a system for the reconstruction of sewer shafts that exploits the fact that the camera always faces downward due to the force of gravity. Therefore, the system is restricted to vertical pipes.
With our work, we present a system that assists the employee with the automatic detection and classification of damages in sewer pipes. We solely use unrolled and stitched images (like the one in Figure 6) as input for the detection and classification algorithm. To obtain an unrolled fisheye image, the 3D motion of the camera is tracked and a cylindrical image is generated through back-projection onto an ideal pipe. This image can then easily be cut open and unrolled into a planar image.
The second part of this work is aimed at the automatic detection and classification of defects and structural elements in the pipe. We treat this as a semantic labeling problem, which is tackled using deep convolutional neural networks. To our knowledge, previous methods for this task mostly relied on image processing algorithms and heuristics to detect various types of defects and structural elements. In [Su2015], for example, the authors use edge detection and morphological operations on CCTV images for crack and open joint detection. The authors of [Huynh2015] propose a novel edge detection algorithm for thin crack detection that can overcome some difficulties encountered in noisy environments like sewer pipes.
Besides those algorithms, a few works also use machine learning to implement diagnostic systems based on a range of image processing techniques. Yang and Su [Yang2008], for example, use SVMs and simple neural networks on wavelet transform and co-occurrence matrix features to detect open joints, cracks and broken pipes. Another example is presented by Wu in [Wu2015], where ensemble methods on contourlet transforms and statistical features are used to detect cracks, roots and collapsed pipes.
2 Creation of Unrolled Pipe Images
The images of the sewer pipe are taken with a mobile robot carrying a fisheye camera with a field of view of 185 degrees. Due to the lens characteristics and the viewing direction along the cylindrical pipe, the resolution of the depicted pipe surface decreases dramatically with increasing distance to the camera. Consequently, only the outer part of the circular image is used to produce the stitched unroll of the sewer pipe. To guarantee an overlap large enough for registration, images are taken at five-centimeter intervals.
The original images we obtain from the commercial robot system show strong artifacts from severe lossy compression and some image areas are overexposed because of the strong flashlights. In conjunction with the lack of texture information, classical approaches used by systems like Bundler [Snavely.2005] or VisualSfM [Wu.2011, Wu.2013] fail to track our robot camera. We therefore simplify the problem by assuming a cylindrical shape with a known diameter (for the absolute scale). Instead of having one unknown depth parameter for each image feature, the 3d position of all features can now be related by one unknown rigid body transform with 6 unknowns for the entire frame. The resulting equation system and its solution are explained in the following sections.
The imaging characteristics of the circular fisheye lenses can be described by
$$r = \frac{\theta}{\alpha}\, d$$
with $\theta$ representing the angle of incidence, $r$ the distance of the resulting image point from the image center in pixels, $\alpha$ the field of view of the fisheye lens and $d$ the diameter of the projected circular part of the image in pixels. For the unwrapping, we regularly sample the cylindrical surface of the inner pipe at $N$ 3D positions which correspond to the pixels of the cylindrical image (please refer to Figure 2 for an illustration). One can choose a reasonable number of points lying on the pipe perimeter, depending on the resolution of the source images. The spatial resolution of the pipe perimeter defines the resolution along the pipe axis as well if square pixels are assumed. Every point can then be projected into the corresponding fisheye image to acquire the color information at the resulting image location. Usually, the projection will hit the image at subpixel positions, therefore image interpolation is needed.
Subsequently, the cylindrical unwrap for every fisheye image can be calculated for a given camera pose. In the existing commercial system, the camera is assumed to lie on and move along the pipe axis. This assumption is not valid due to the movement between the shots – major stitching artifacts are the consequence, making it nearly impossible to use these images for training of neural networks for automatic damage detection.
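The sampling and projection described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes an equidistant fisheye model, a camera whose optical axis is parallel to the pipe axis (no rotation), and grey-value images stored as nested lists of at least 2 x 2 pixels. All function and parameter names are hypothetical.

```python
import math

def project_equidistant(point, fov_rad, circle_diameter_px, center):
    """Project a 3D point (camera coordinates, optical axis = +z) to pixel coordinates.

    Assumes an equidistant fisheye: r = theta * d / fov.
    """
    x, y, z = point
    theta = math.atan2(math.hypot(x, y), z)   # angle of incidence
    r = theta * circle_diameter_px / fov_rad  # radial distance in pixels
    phi = math.atan2(y, x)                    # azimuth around the optical axis
    return center[0] + r * math.cos(phi), center[1] + r * math.sin(phi)

def bilinear(image, u, v):
    """Sample a grey-value image (list of rows) at subpixel position (u, v)."""
    h, w = len(image), len(image[0])
    u = min(max(u, 0.0), w - 1.0)             # clamp to the image area
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = min(int(u), w - 2), min(int(v), h - 2)
    a, b = u - u0, v - v0
    top = (1 - a) * image[v0][u0] + a * image[v0][u0 + 1]
    bottom = (1 - a) * image[v0 + 1][u0] + a * image[v0 + 1][u0 + 1]
    return (1 - b) * top + b * bottom

def unwrap(image, pipe_radius, cam_pos, z_range, n_perimeter, fov_rad, circle_diameter_px):
    """Return the unrolled image: rows follow the pipe axis, columns the perimeter."""
    h, w = len(image), len(image[0])
    center = ((w - 1) / 2.0, (h - 1) / 2.0)
    step = 2 * math.pi * pipe_radius / n_perimeter  # square pixels on the pipe surface
    z_start, z_end = z_range
    n_axis = int((z_end - z_start) / step)
    out = []
    for i in range(n_axis):
        z = z_start + i * step
        row = []
        for j in range(n_perimeter):
            ang = 2 * math.pi * j / n_perimeter
            world = (pipe_radius * math.cos(ang), pipe_radius * math.sin(ang), z)
            cam = tuple(p - c for p, c in zip(world, cam_pos))  # translation only
            u, v = project_equidistant(cam, fov_rad, circle_diameter_px, center)
            row.append(bilinear(image, u, v))
        out.append(row)
    return out
```

With an estimated camera rotation, the point would additionally be rotated into the camera frame before projection; this step is omitted here for brevity.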
2.2 Camera Pose Estimation
We chose a feature-based approach for the estimation of the camera pose. In [Zhang.2011], features are detected and filtered based on the images generated by back-projection. Due to the unknown camera pose, fairly strong artifacts may occur, causing fewer features or false matches. Due to the relatively homogeneous pipe surface and the limited image quality, we use an iterative feature matching scheme [Furch.2013] that considers neighborhood constraints and yields more feature correspondences that are also more robust.
Local Pose Optimization
For every feature point, a vector $\mathbf{v}$ can be constructed to describe the direction of the corresponding line of sight. With respect to a global coordinate system placed on the symmetry axis of the pipe, the position and orientation of the robot camera can be specified by a translation vector $\mathbf{t}$ and a rotation matrix $\mathbf{R}$, respectively. For this pose, the straight-line equation
$$\mathbf{x}(\lambda) = \mathbf{t} + \lambda\,\mathbf{R}\,\mathbf{v}$$
represents the line of sight originating from the camera center, depending on the camera position. Based on this line equation, the intersections of the lines with the cylinder surface of the pipe can be calculated. For matched feature points, the intersections must be consistent on the 3D surface.
Our objective is to combine the equations in a linear equation system, which can be solved efficiently for the translational and rotational updates. We linearize the line equation with respect to the rotation angles around an operating point given by the current estimates of $\mathbf{t}$ and $\mathbf{R}$, resulting in three equations, one for each component of $\mathbf{x}$. With the circle equation $x^2 + y^2 = \rho^2$ (with $\rho$ being the radius of the pipe) and the x- and y-component of the linearized straight-line equation, the function for specifying the intersection point can be created. The location of a point on the pipe surface is then obtained by inserting the resulting line parameter $\lambda$ into the line equation. The expressions for the three dimensions have the same structure and differ only in the components of $\mathbf{t}$ and $\mathbf{R}\mathbf{v}$ that enter them.
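The geometric core of this step, intersecting a line of sight with the pipe cylinder, can be illustrated with a short sketch. It is not the linearized formulation used for the optimization: it solves the quadratic exactly, assumes the cylinder axis is the z-axis, and takes the viewing direction already rotated into the global frame. The function names and the residual helper are hypothetical.

```python
import math

def intersect_cylinder(t, d, radius):
    """Intersect the ray x(lam) = t + lam * d (lam > 0) with x^2 + y^2 = radius^2.

    t is the camera center, d the viewing direction in global coordinates.
    Returns the 3D surface point, or None if the ray never reaches the wall.
    """
    a = d[0] ** 2 + d[1] ** 2
    if a == 0.0:                      # ray parallel to the pipe axis
        return None
    b = 2.0 * (t[0] * d[0] + t[1] * d[1])
    c = t[0] ** 2 + t[1] ** 2 - radius ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    lam = (-b + math.sqrt(disc)) / (2.0 * a)  # forward hit for a camera inside the pipe
    if lam <= 0.0:
        return None
    return tuple(ti + lam * di for ti, di in zip(t, d))

def match_residual(t_a, d_a, t_b, d_b, radius):
    """Difference of the surface points seen by two cameras for one matched feature."""
    p_a = intersect_cylinder(t_a, d_a, radius)
    p_b = intersect_cylinder(t_b, d_b, radius)
    if p_a is None or p_b is None:
        return None
    return tuple(a - b for a, b in zip(p_a, p_b))
```

For a correct pair of poses, `match_residual` is close to zero for every inlier match; the pose estimation drives these residuals towards zero.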
In our local pose estimation scheme, camera location and orientation are determined from point correspondences between two successive camera frames with unknown pose. Therefore, we always estimate pairs of 3D camera poses corresponding to the two frames. To avoid the pose ambiguity of the cylindrical shape, the first camera of each pair has to be fixed in its position along the z-axis and in its rotation around it, resulting in 10 unknown pose parameters (4 for the first camera and 6 for the second). Based on the partial derivatives for the parameters around an initial operating point, a linear equation system is created and solved for every camera pair. This procedure is applied iteratively to remove errors due to linearization.
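After each iteration, the parameter vector must be updated, which results in a simple addition for the translation parameters. The updated rotation matrix is calculated by composing the rotation built from the estimated angle updates with the rotation matrix at the operating point.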
During the iterations, we utilize the RANSAC algorithm to reject outliers. This makes the pose estimation more robust against e.g. connecting pipes or dangling roots, which introduce many features violating the assumption of a cylindrically shaped pipe with a known diameter. Otherwise, the pose estimation would likely fail at these points.
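The outlier rejection principle can be shown with a generic RANSAC loop. The sketch uses a toy line-fitting model instead of the camera pose model, so it only illustrates the sampling and consensus logic; the helper names, the threshold and the iteration count are assumptions.

```python
import random

def fit_line(points):
    """Least-squares line through (x, y) points; None for a degenerate sample."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:
        return None
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def line_residual(model, point):
    slope, intercept = model
    return point[1] - (slope * point[0] + intercept)

def ransac(data, fit, residual, sample_size, threshold, iterations=200, seed=0):
    """Return the model with the largest consensus set, refitted on its inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit(rng.sample(data, sample_size))   # minimal random sample
        if model is None:
            continue
        inliers = [x for x in data if abs(residual(model, x)) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    if best_model is not None:
        best_model = fit(best_inliers)               # final fit on all inliers
    return best_model, best_inliers

# Toy data: ten points on y = 2x + 1 plus three gross outliers.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(1.0, 50.0), (5.0, -30.0), (8.0, 100.0)]
model, inliers = ransac(data, fit_line, line_residual, sample_size=2, threshold=0.1)
```

In the pose estimation setting, `fit` would correspond to solving the linearized system for a sampled subset of matches and `residual` to the disagreement of the matched surface points.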
Global Pose Optimization
In the local optimization used for initialization, the camera pose is estimated independently for each pair of frames. To get a smooth camera path, a global optimization of the camera poses is performed, since the pose of the current frame directly depends on the pose of the previous frame. Every camera but the first therefore has six degrees of freedom, and its pose is determined by the location of the matched features for both pairs the image is involved in. The first frame only has four degrees of freedom due to the ambiguity mentioned above. The matrices for $n$ frames are combined into one big sparse matrix with $6n-2$ unknowns, for which the linear equation system is solved iteratively. Therefore, all camera parameters are connected throughout the whole camera path and can influence each other during the optimization. Due to the initialization with the results of the local estimation, this equation system can be solved quickly by exploiting its sparse nature. With the estimated camera poses in place, the back-projection of the unwrapped image can be calculated without artifacts caused by unconsidered camera movement.
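The sparsity of the global system follows from the fact that the equations of a frame pair only involve the parameters of those two frames. The following sketch lists the resulting column layout; the parameter ordering and the function names are assumptions made for illustration.

```python
def parameter_slices(n_frames):
    """Column range of each frame's pose parameters in the global system.

    Frame 0 keeps 4 free parameters (z-translation and z-rotation are fixed),
    every later frame has 6, giving 6 * n_frames - 2 unknowns in total.
    """
    slices, start = [], 0
    for k in range(n_frames):
        size = 4 if k == 0 else 6
        slices.append((start, start + size))
        start += size
    return slices

def pair_columns(n_frames):
    """For each consecutive frame pair (k, k + 1), the column range its equations touch."""
    s = parameter_slices(n_frames)
    return [(s[k][0], s[k + 1][1]) for k in range(n_frames - 1)]
```

Each pair contributes nonzero entries only inside its own column range, so the system matrix is block-banded and its number of nonzeros grows linearly with the number of frames.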
2.3 Image Enhancement and Stitching
Despite the removal of geometric artifacts caused by the camera motion, major artifacts remain due to the uneven lighting of the images. The radial light falloff is easily noticeable in Figure 1 and Figure 3, causing a jump in lighting from dark to bright at the seam between two images. To get a smooth, imperceptible transition between consecutive images, we apply these three steps:
1. Estimation and elimination of uneven lighting.
2. Identification of the optimal seam using Dynamic Programming.
3. Application of Poisson Blending in the transition zone.
Elimination of uneven lighting
In contrast to [Zhang.2011], the illumination is modeled separately for each image. The authors of [Zhang.2011] assume that the average grey level of a distinct pixel can be regarded as the illumination intensity. This assumption is only valid if the reflection properties and color remain the same over the entire length of the pipe. In our use case, however, we have to deal with changing materials and with deposits on the surface that affect the reflection of light. Therefore, we model the lighting falloff for each image separately as a linear function of the distance to the camera.
The lighting estimation is done on the grey-scale images in several steps. To separate the low frequencies, a fairly strong Gaussian filter is applied. This helps to reduce the influence of locally strong reflections or dark spots, such as pipe connections. After that, a linear function is fitted to every image column. We use a robust regression method that fits a linear function iteratively while trimming the image areas with the biggest residuals. In every image column, we also calculate the median of all grey values to get an estimate for the offset. We then adjust the image with the fitted linear function and the median offset, obtaining the corrected image. This step is applied to all three color channels of the image.
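A trimmed linear fit of the kind described above might look as follows. The sketch works on a single image column that has already been smoothed, and the subtractive form of the correction is an assumption, since the exact correction formula is not given in the text. The trimming fraction, the number of rounds and all names are hypothetical.

```python
from statistics import median

def fit_line_xy(xs, ys):
    """Ordinary least-squares line y = slope * x + offset."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx

def trimmed_line_fit(values, keep=0.8, rounds=3):
    """Refit repeatedly, each time keeping only the samples with the smallest residuals."""
    idx = list(range(len(values)))
    slope, offset = fit_line_xy(idx, values)
    for _ in range(rounds):
        ranked = sorted(idx, key=lambda i: abs(values[i] - (slope * i + offset)))
        kept = sorted(ranked[:max(2, int(keep * len(values)))])
        slope, offset = fit_line_xy(kept, [values[i] for i in kept])
    return slope, offset

def correct_column(column, smoothed_column, keep=0.8):
    """Remove the fitted linear falloff and restore the median level (assumed form)."""
    slope, offset = trimmed_line_fit(smoothed_column, keep)
    level = median(smoothed_column)
    return [v - (slope * i + offset) + level for i, v in enumerate(column)]
```

A bright reflection in the column is trimmed away during the fit, so it does not bend the estimated falloff, and it is left untouched by the correction.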
For a smooth transition between consecutive images, it is advantageous to place the image seam where the image difference is small. For that purpose, an optimal path is computed by dynamic programming. With two consecutive images $I_1$ and $I_2$ overlapping along the pipe direction within the regions $\Omega_1$ and $\Omega_2$, we use the normalized absolute difference $D$ of the grey values in the overlap as error criterion.
To avoid a frayed seam, we add an additional cost that penalizes a transition from one possible seam element to one in the next column by their vertical distance, favoring horizontal cuts through the images. For every element in the current column $j$ with row index $i$, the cost for getting there is calculated by
$$C(i, j) = w_d\, D(i, j) + \min_{k} \left( C(k, j-1) + w_v\, \frac{|i - k|}{H} \right)$$
with $H$ being the height of the difference image. The weights $w_d$ and $w_v$ control the influence of the difference image and the vertical distance, respectively. With the optimal seam at hand, the transition between two consecutive images takes place where their difference is minimal, while areas with tiny registration errors get excised.
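The seam search can be sketched as a standard dynamic program over the difference image. The cost below combines the local difference with a penalty on the vertical jump between neighboring columns, normalized by the image height; this is one plausible reading of the description, and the weight names and defaults are assumptions.

```python
def optimal_seam(diff, w_diff=1.0, w_dist=1.0):
    """Return one seam row per column that minimizes the accumulated cost.

    diff[i][j] is the normalized absolute difference at row i, column j.
    Moving from row k in column j - 1 to row i in column j costs w_dist * |i - k| / H.
    """
    h, w = len(diff), len(diff[0])
    cost = [[0.0] * w for _ in range(h)]
    back = [[0] * w for _ in range(h)]
    for i in range(h):
        cost[i][0] = w_diff * diff[i][0]
    for j in range(1, w):
        for i in range(h):
            # cheapest predecessor in the previous column, including the jump penalty
            k = min(range(h), key=lambda r: cost[r][j - 1] + w_dist * abs(i - r) / h)
            back[i][j] = k
            cost[i][j] = w_diff * diff[i][j] + cost[k][j - 1] + w_dist * abs(i - k) / h
    i = min(range(h), key=lambda r: cost[r][w - 1])
    seam = [i]
    for j in range(w - 1, 0, -1):   # trace the path back to the first column
        i = back[i][j]
        seam.append(i)
    seam.reverse()
    return seam
```

A small distance weight lets the seam follow low-difference pixels closely, while a large one produces an almost straight cut. This direct version costs O(H^2 W) operations, which is acceptable for the narrow overlap regions considered here.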
As the final step of the image refinement, we apply Poisson Blending [Perez.2003] along the image seams. The blending is formulated as a least-squares problem, constrained by the previous image on one side and by the current image on the other. Thereby, the image gradients can be preserved while abrupt jumps are avoided. We limit the range of the blending to only a few pixels to save computation time.
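The idea of gradient-domain blending can be illustrated in one dimension, where the least-squares problem has a closed-form solution: the gradients of the two sources are kept, and the remaining mismatch between the fixed end values is spread evenly over the strip. This is a simplified 1D analogue chosen for illustration, not the 2D Poisson solver used for the images, and the function name is hypothetical.

```python
def blend_1d(prev_strip, curr_strip, seam):
    """Blend two equally long, overlapping strips across a seam position.

    Gradients left of `seam` come from prev_strip, the others from curr_strip.
    The first value is fixed to prev_strip and the last to curr_strip. Minimizing
    the squared gradient error under these constraints distributes the leftover
    mismatch equally over all steps.
    """
    n = len(prev_strip) - 1
    grads = []
    for i in range(n):
        src = prev_strip if i < seam else curr_strip
        grads.append(src[i + 1] - src[i])
    mismatch = curr_strip[n] - prev_strip[0] - sum(grads)
    out = [prev_strip[0]]
    for g in grads:
        out.append(out[-1] + g + mismatch / n)
    return out
```

For two constant strips with different brightness, the result is a linear ramp between the two levels instead of a step at the seam.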
3 Automatic Annotation
Automatic annotation and damage classification of the enhanced images is treated as a semantic labeling problem, which is tackled using deep convolutional neural networks. As mentioned in section 1, many algorithms rely on image processing approaches and heuristics to detect defects in sewer pipes. Although those algorithms are able to detect some defects reliably, a common problem is the relatively low amount and quality of the available data. On the one hand, this limits the number of detectable defect classes due to the lack of examples per class. On the other hand, the varying visual appearance of those classes makes it nearly impossible to find a single general detection method from so few examples.
We think that given a sufficient amount of high quality labeled data, a deep convolutional neural network can learn to detect and differentiate between a variety of different classes with almost expert-like accuracy.
3.1 Data Acquisition
In order to successfully train deep neural networks, large amounts of data are needed. Additionally in the case of semantic labeling, this data must also be labeled in a pixel-precise way, meaning that each pixel must be assigned to one specific class.
Given the enhanced, unrolled image produced by the method in section 2 and the expertise of specially trained experts, we were able to produce such pixel-precise labeling for sewer pipes covering almost kilometers. We selected the pipes to represent a variety of materials ( stoneware and concrete) and diameters ( to millimeters). The number of pixels representing the circumference was set to and the resulting image resolution was computed accordingly. Overall, we manually annotated x pixels of unwrapped pipe images. Since each single image is too large to be used for training as a whole, we decided to split them into equally sized, overlapping chunks of size x.
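Splitting a large unrolled image into equally sized, overlapping chunks can be sketched as follows. The chunk size and overlap are free parameters here, and the handling of the image border (shifting the last chunk back so that it ends at the border) is an assumption.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of equally sized tiles covering [0, length)."""
    if tile > length:
        raise ValueError("tile is larger than the image")
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be smaller than the tile size")
    stride = tile - overlap
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:   # shift a final tile back to cover the border
        starts.append(length - tile)
    return starts

def tile_boxes(height, width, tile_h, tile_w, overlap):
    """Boxes (top, left, bottom, right) of all chunks of an image."""
    return [(y, x, y + tile_h, x + tile_w)
            for y in tile_starts(height, tile_h, overlap)
            for x in tile_starts(width, tile_w, overlap)]
```

The overlap ensures that objects cut by one chunk border appear complete in a neighboring chunk.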
For the annotation of the images, we selected some of the most common defects as well as structural elements. Overall, classes were used. Regarding defects the classes are residue, crack, root, obstacle and erosion/spalling, whereas for the structural elements we used joint, connection and shaft. A labeled example image can be seen in Figure 4.
|Class||No. objects||No. pixels||%|
3.2 Network Topology
The network structure we used to perform semantic labeling is based on the Full-Resolution Residual Networks (FRRN) by Pohlen et al. [Pohlen2017]. In their work, the authors develop a novel topology of deep convolutional neural networks aimed at semantic labeling tasks.
The principal idea of this structure is to have two data streams through the whole network. One stream is responsible for object recognition and undergoes a classical pipeline of feature extraction and down-sampling to learn robust features, whereas the other stream is kept at full input resolution to learn features for precise object segmentation. Successively, features learned by the recognition (pooling) stream are up-sampled and fused into the segmentation (residual) stream. This way, even more complex features are generated for the final pixel-wise labeling. The general structure of an FRRN can be seen in Figure 5. We adapted the original FRRN structure to better fit our problem and to reduce complexity, computation time and model size. Compared to the FRRN A structure in [Pohlen2017], we changed the number of full-resolution residual units (FRRU) per resolution level to 3 and adjusted the number of filters for the pooling stream and for the residual stream.
For training, the data generated in section 3.1 was split into a training and a test set using a ratio of :. Due to the large image size, training was performed on downscaled versions of the data with a size of x. This enabled us to decrease training time at no cost in terms of quality compared to the original resolution. Furthermore, the images were converted to YCbCr space and contrast normalized in a windowed fashion to compensate for the varying materials and color (e.g. of residues) and bright reflections due to wet surfaces.
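The pre-processing can be illustrated with a small sketch. The text does not state which YCbCr variant or which normalization window was used, so the full-range BT.601 conversion and the mean/standard-deviation normalization in a square window are assumptions, as are the function names and the window radius.

```python
import math

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG-style) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def local_contrast_normalize(channel, radius=1, eps=1e-6):
    """Subtract the local mean and divide by the local standard deviation."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [channel[y][x]
                      for y in range(max(0, i - radius), min(h, i + radius + 1))
                      for x in range(max(0, j - radius), min(w, j + radius + 1))]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            out[i][j] = (channel[i][j] - mean) / (math.sqrt(var) + eps)
    return out
```

Normalizing within a window removes slow brightness changes, so that the same defect looks similar on dark and bright pipe materials.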
To further increase the amount of training data available, minor data augmentation was applied. Since sewer pipe inspection is direction independent and the pipes are symmetrical, we randomly flipped the input images either vertically or horizontally with a probability of.
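The flipping can be sketched in a few lines. The important detail for semantic labeling is that the label map must be flipped together with the image. The probability value and the equal split between vertical and horizontal flips are assumptions.

```python
import random

def random_flip(image, labels, p=0.5, rng=random):
    """Flip image and label map together, vertically or horizontally, with probability p."""
    if rng.random() >= p:
        return image, labels
    if rng.random() < 0.5:
        return image[::-1], labels[::-1]            # vertical flip: reverse the row order
    return ([row[::-1] for row in image],           # horizontal flip: reverse each row
            [row[::-1] for row in labels])
```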
Training was performed on a single Nvidia Titan X for hours using TensorFlow. We optimized a bootstrapped cross entropy as introduced in [Wu2016]. The idea is to only take a certain percentage of pixels into account, namely those which are misclassified or correctly classified with a low class probability:
$$l = -\frac{1}{K} \sum_{i=1}^{N} \mathbb{1}\left[ p_{i,y_i} < t_K \right] \log p_{i,y_i}$$
Here, $p_{i,y_i}$ is the posterior probability for image pixel $i$ and its corresponding target class $y_i$, and $t_K$ is a threshold chosen so that $K$ elements fall below it. We selected $K$ relative to $N$, where $N$ is the total number of image pixels.
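A bootstrapped cross entropy can be computed by sorting the target-class probabilities and averaging the loss over the K lowest ones, which is equivalent to thresholding when there are no ties. The sketch below works on plain lists; the function name and the clipping constant are assumptions.

```python
import math

def bootstrapped_cross_entropy(probs, targets, k):
    """Mean cross entropy over the k pixels with the lowest target-class probability.

    probs[i][c] is the posterior of class c at pixel i, targets[i] the target class index.
    """
    target_probs = sorted(p[t] for p, t in zip(probs, targets))
    k = max(1, min(k, len(target_probs)))
    hardest = target_probs[:k]                      # the k hardest pixels
    return -sum(math.log(max(p, 1e-12)) for p in hardest) / k
```

With k equal to the number of pixels, the loss reduces to the ordinary mean cross entropy; smaller values of k focus the training on pixels that are misclassified or only weakly correct.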
Training was performed using the Adam optimizer [Kingma2014] with a constant learning rate of . In Figure 7, the evolution of accuracy and mean-IoU can be seen. As expected, the accuracy increases rapidly because most of the images contain a large portion of background.
In this section, we first show results of our system for the generation of images depicting unwrapped pipe surfaces. We compare the results to images produced by a commercial sewer pipe inspection software. In the second part, we present the results and benchmarks for our damage detection algorithm.
4.1 Results – Unrolled Pipes
The images of Figure 6 illustrate the influence of the motion path estimation. A pipe section is shown on the left side, generated with the commercial sewer pipe inspection software in Figure 6a and with our system in Figure 6c. The images on the right show the camera motion paths used for the calculation of the back-projection. As depicted in Figure 6b, a camera motion along the pipe axis was assumed for the generation of Figure 6a. Therefore, the actual camera movement causes several artifacts in the final image. Foremost, it creates the illusion of a bent pipe. An off-center camera is also easy to notice from the wavy pipe couplings. In Figure 8, details of the pipe couplings are shown for comparison.
The images in Figure 9 show close up views of a pipe surface region with many cracks. In the left image, some cracks appear twice showing ghosting effects due to the lack of camera pose information. With our estimation of the camera positions, these artifacts disappear (right). Furthermore, small registration errors get excised by the estimation of the optimal path. In addition, our algorithm led to a noticeable improvement of the image resolution and contrast.
These changes can also be noticed in the right image of Figure 10, where deposits at the pipe bottom can be inspected in much more detail. In the left image of Figure 10, produced by the commercial system, these details are lost due to the low resolution.
4.2 Results – Automatic Annotation
In Figure 11, Figure 12 and Figure 13, some exemplary labelings produced by the presented system are shown. It can be seen that structural elements are detected and classified reliably, regardless of the pipe's material. For defects, however, there are some differences among the pipe types. Depending on the material of the pipe, some defects are more likely to be missed. In Figure 11 and Figure 12, cracks and roots are detected reasonably well and classified as such, whereas in Figure 13 the crack is missed completely. Also, cracks and roots, although detected, are sometimes mistaken for each other due to their similar color. All in all, our system produces visually satisfying results.
Table 2 shows the confusion matrix and mean-IoU on the test set and gives a more detailed overview of the results. It can be seen that the most problematic class is obstacle (seventh line), which most often gets confused with the background. This makes sense in that usually an obstacle that is viewed from the center of the pipe like on the unwrapped images is virtually invisible and cannot be distinguished from the background. Only in cases where there are also color changes, the obstacle can be detected. The second most problematic class is crack (fifth line) which also gets missed often, due to the complex texture of the concrete pipes. An example can be seen in Figure 13.
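The relation between a confusion matrix, the pixel accuracy and the mean-IoU can be made concrete with a small sketch. It assumes that rows hold the true classes and columns the predicted classes; the function names are hypothetical and the example numbers are invented for illustration only.

```python
def iou_per_class(conf):
    """conf[i][j]: number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp     # pixels wrongly assigned to c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

def mean_iou(conf):
    vals = [v for v in iou_per_class(conf) if v == v]   # skip classes without pixels
    return sum(vals) / len(vals)

def pixel_accuracy(conf):
    total = sum(map(sum, conf))
    return sum(conf[i][i] for i in range(len(conf))) / total

# Invented two-class example: background dominates, most defect pixels are missed.
example = [[90, 0],
           [8, 2]]
```

For this example the pixel accuracy is 0.92 although the IoU of the defect class is only 0.2, which shows why the accuracy alone says little when most pixels are background.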
In general, it seems that when a defect or structure is not labeled correctly, it is usually not confused with another class but missed entirely and mistaken for background. This could be a serious problem in terms of risk assessment, which we plan to examine more closely in future work (see section 5).
Until now, we have only used a state-of-the-art network topology that shows generally good performance in semantic segmentation tasks. We plan to enhance the structure to incorporate the shape of a pipe, in the sense that the images wrap around and that the first and last row are correlated. This way, we hope to reduce the number of parameters needed while keeping the quality of the results. This in turn would reduce computation time and model size. Furthermore, we aim at increasing the resolution used for training and prediction to overcome the problem of missing thin cracks due to down-sampling.
Although mean-IoU is a good measure to get an intuition for the quality of the system, it is in a sense very academic. In terms of risk assessment and sanitation planning, detection rate and false positive rate are perhaps the most interesting measures. We plan to also evaluate our results based on those measures and to give a qualitative evaluation (e.g. in terms of risk) with the help of experts.
Regarding the image enhancement side of this work, there are two possible ways to go. On the one hand, we could use greatly improved images, which would lead to trackable features. On the other hand, one could use stereo optical techniques to retrieve a 3D model of the pipe. In addition to the defects mentioned in section 3.1, this would enable us to also detect open joints, which can be seen as high-risk defects due to their influence on the pipe's structural integrity. Furthermore, it would likely improve the detection of obstacles for the reasons mentioned in section 4.2.
In this paper, we present a method for enhancing low-quality fisheye images of sewer pipes and producing high-quality unwraps, which are then used for the automatic detection and classification of defects and structural elements. We show that, given a sufficient amount of data, a single system can be capable of detecting a wide range of defects despite the large visual variations. Although there are still some problems to tackle, the system achieves good results in terms of accuracy and IoU.