ImVisible
ImVisible: Pedestrian Traffic Light Dataset, Neural Network, and Mobile Application for the Visually Impaired (CAIP '19, ACVR'19)
Currently, the visually impaired rely on either a sighted human, a guide dog, or a white cane to navigate safely. However, training guide dogs is extremely expensive, and canes cannot provide essential information such as the color of traffic lights or the direction of crosswalks. In this paper, we propose a deep-learning-based solution that provides both the mode of the pedestrian traffic light and the position of the zebra crossing. Previous machine-learning solutions provide only one piece of information and are mostly binary, detecting only red or green lights. The proposed convolutional neural network, LYTNet, is designed for comprehensiveness, accuracy, and computational efficiency, and delivers the two most important pieces of information the visually impaired need to cross the road. We provide five classes of pedestrian traffic lights rather than the commonly seen three or four, and a direction vector representing the midline of the zebra crossing that is converted from the 2D image plane to real-world positions. We created our own dataset of pedestrian traffic lights containing over 5000 photos taken at hundreds of intersections in Shanghai. Our experiments achieve a classification accuracy of 94.18% and a frame rate of 20 frames per second when testing the network, with additional post-processing steps, on an iPhone 7.
The primary issue that the visually impaired face is not with obstacles, which can be detected by their cane, but with information that requires the ability to see. When we interviewed numerous visually impaired people, there was a shared concern regarding safely crossing the road when traveling alone. The reason for this concern is that the visually impaired cannot be informed of the color of pedestrian traffic lights and the direction in which they should cross the road to stay on the pedestrian zebra crossing. When interviewed, they reached a consensus that the information stated above is the most essential for crossing roads.
To solve this problem, some hardware products have been developed [1]. However, they are financially burdensome due to both the cost of the product itself and the possible reliance on external servers to run the algorithm. The financial concern is especially important for the visually impaired community in developing countries, such as the people we interviewed who live in China. Accordingly, this paper addresses the issue with LYTNet, a network that can later be deployed on a mobile phone, both iOS and Android, and run locally. This provides a cheap, comprehensive, and easily accessible alternative that supplements the white cane for the visually impaired community.
We propose LYTNet, an image classifier, to classify whether or not there is a traffic light in the image, and if so, what color/mode it is in. We also implement a zebra crossing detector in LYTNet that outputs coordinates for the midline of the zebra crossing.
The main contributions of our work are as follows:
To the best of our knowledge, we are the first to create a convolutional neural network (LYTNet) that outputs both the mode of the pedestrian traffic light and midline of the zebra crossing
We create and publish the largest pedestrian traffic light dataset, consisting of 5059 photos with labels of both the mode of traffic lights and the direction vector of the zebra crossing [2]
We design a lightweight deep learning model (LYTNet) that can be deployed efficiently on a mobile phone application and is able to run at 20 frames per second (FPS)
We train a unique deep learning model (LYTNet) that uses one-step image classification instead of multiple steps, and matches the performance of previous attempts that focus only on traffic light detection
The rest of the paper is organized in the following manner: Section II discusses previous work and contributions made to the development and advancements in the detection of pedestrian traffic light detectors and zebra crossings; Section III describes the proposed method of pedestrian traffic light and zebra crossing classifier; Section IV provides experiment results and comparisons against a published method; Section V concludes the paper and explores possible future work.
Some industrialized countries have developed acoustic pedestrian traffic lights that produce a sound when the light is green, which serves as a signal for the visually impaired to know when to cross the street [3, 4, 5]. However, in less economically developed countries, crossing streets is still a problem for the blind, and acoustic pedestrian traffic lights are not ubiquitous even in developed nations [3].
The task of detecting traffic lights for autonomous driving has been explored by many and has developed over the years [6, 7, 8, 9]. Behrendt et al. [10] created a model that is able to detect traffic lights only a few pixels in size with relatively high accuracy. Though most models for vehicle traffic lights have precision and recall rates of nearly 100% and show practical usage, the same cannot be said for pedestrian traffic lights. Pedestrian traffic lights differ because they have more complex shapes and usually vary with the region in which they are installed, whereas vehicle traffic lights are simple circles in nearly all countries.
Shioyama et al. [11] were among the first to develop an algorithm to detect pedestrian traffic lights and the length of the zebra crossing. Others, such as Mascetti et al. and Charette et al. [3, 15], developed analytic image-processing algorithms that perform candidate extraction, candidate recognition, and candidate classification. Cheng et al. [5] proposed a more robust real-time pedestrian traffic light detection algorithm, which replaces the analytic image-processing method with candidate extraction and a concise machine learning scheme.
A limitation that many attempts faced was the speed of hardware. Thus, Ivanchenko et al. [12] created an algorithm specifically for mobile devices with an accelerator to detect pedestrian traffic lights in real time. Angin et al. [13] incorporated external servers to remove the hardware limitation and provide more accurate information. Though external servers are able to run deeper models than phones, they require a fast and stable internet connection at all times. Moreover, the advancement of efficient neural networks such as MobileNet v2 enables a deep-learning approach to be implemented on a mobile device [14].
Direction is another factor to consider when helping the visually impaired cross the street. Though the visually impaired can have a good sense of the general direction to cross the road in familiar environments, relying on one's memory has its limitations [16]. Therefore, solutions that provide specific directions have also been devised. In addition to detecting the color of pedestrian traffic lights, Ivanchenko et al. [16] created an algorithm for detecting zebra crossings. Their system determines how much of the zebra crossing is visible to help the visually impaired know whether or not they are generally facing in the correct direction, but it does not provide the specific location of the zebra crossing. Poggi et al., Lausser et al., and Banich [17, 18, 19] also use deep learning within computer vision to detect zebra crossings to help the visually impaired cross streets. However, no deep learning method is able to output both traffic light and zebra crossing information simultaneously.
Our method is evaluated on our labeled test set; the training, validation, and test sets do not overlap.
Our data consists of images of street intersection scenes in Shanghai, China, under varying weather and lighting conditions. Images were captured with two different cameras, an iPhone 7 and an iPhone 6s [2]. The camera was positioned at varying heights and at varying angles around the vertical and transverse axes, but the angle around the longitudinal axis was kept relatively constant under the assumption that the visually impaired are able to keep the phone in a horizontal orientation. At each intersection, images were captured at varying positions relative to the center of the crosswalk and at different positions along it. Images may contain multiple pedestrian traffic lights, or other traffic lights such as vehicle and bicycle traffic lights.
The final dataset consists of 5059 images [2]. Each image was labelled with a ground truth class for traffic lights: red, green, countdown_green, countdown_blank, and none. Sample images are shown in Figure 1. Each image was also labelled with two image coordinates representing the endpoints of the zebra crossing as pictured in the image; these coordinates define the midline of the zebra crossing. In a significant number of the images, the midline of the zebra crossing was obstructed by pedestrians, cars, bicycles, or motorcycles. Statistics regarding the labelled images are shown in Table 1.
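For illustration, a single annotation in this labelling scheme can be represented as follows. This is a minimal Python sketch of the information attached to each image; it is not the exact file format of the published dataset [2], and the example values are made up.

```python
from dataclasses import dataclass
from typing import Tuple

# The five traffic-light classes described above.
CLASSES = ["red", "green", "countdown_green", "countdown_blank", "none"]

@dataclass
class Annotation:
    """One labelled image: a traffic-light class plus the zebra-crossing midline."""
    filename: str
    light_class: str                 # one of CLASSES
    midline_start: Tuple[int, int]   # image coordinates of one midline endpoint
    midline_end: Tuple[int, int]     # image coordinates of the other endpoint

# Example record (illustrative values only).
example = Annotation("img_0001.jpg", "red", (1850, 1500), (2100, 2900))
```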
Prior to training, each image was re-sized to a fixed resolution. During each epoch, a random crop and a random horizontal flip were applied to each image to prevent over-fitting. The training dataset was partitioned into 5 equal groups and 5-fold cross validation was performed. Images used in the validation dataset were directly re-sized to the network input size without any transformations applied.
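A minimal sketch of this preprocessing, assuming a PyTorch/torchvision pipeline (the paper does not state its framework); the resize and crop resolutions below are placeholders, not the paper's exact values.

```python
from torchvision import transforms
from sklearn.model_selection import KFold

# Placeholder sizes (assumptions): resize to a fixed resolution, then take a
# slightly smaller random crop that matches the network input size.
RESIZE_HW = (657, 876)   # (height, width) after the initial resize
CROP_HW = (576, 768)     # (height, width) of the random training crop

# Training: random crop + random horizontal flip to reduce over-fitting.
# Note: flipping the image also requires mirroring the midline coordinates.
train_transform = transforms.Compose([
    transforms.Resize(RESIZE_HW),
    transforms.RandomCrop(CROP_HW),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Validation: direct resize to the network input size, no augmentation.
val_transform = transforms.Compose([
    transforms.Resize(CROP_HW),
    transforms.ToTensor(),
])

# 5-fold cross validation over the training pool (3456 + 864 = 4320 images).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
indices = list(range(4320))
for fold, (train_idx, val_idx) in enumerate(kfold.split(indices)):
    pass  # build the DataLoaders for this fold here
```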
| | Red | Green | CD Green | CD Blank | None | Total |
|---|---|---|---|---|---|---|
| Number of Images | 1477 | 1303 | 963 | 904 | 412 | 5059 |
| Percentage of Dataset | 29.2% | 25.8% | 19.0% | 17.9% | 8.1% | 100.0% |
Our neural network, LYTNet, follows the framework of MobileNet v2, a lightweight neural network designed to operate on mobile phones. MobileNet v2 primarily uses depthwise separable convolutions. In a depthwise separable convolution, a depthwise convolution is first performed: the channels of the input are separated and a different filter is used for the convolution over each channel. Then, a pointwise convolution (a regular convolution with a $1 \times 1$ kernel) is used to combine the outputs across channels. For an input of dimensions $h \times w \times d_i$ convolved with stride 1 with a kernel of size $k \times k$ and $d_j$ output channels, the cost of a standard convolution is $h \cdot w \cdot d_i \cdot d_j \cdot k^2$, while the cost of a depthwise separable convolution is $h \cdot w \cdot d_i (k^2 + d_j)$ [14]. Thus, the total cost of a depthwise separable convolution is approximately $k^2$ times less than that of a standard convolution while having similar performance [14].

Each bottleneck block consists of a $1 \times 1$ convolution that expands the number of channels by a factor $t$, followed by a depthwise separable convolution with stride $s$ and $c$ output channels. Multiple fully connected layers were used to achieve the two desired outputs of the network: the classification and the endpoints of the zebra crossing. Compared to MobileNet v2, LYTNet was adapted for a larger input size so that the pedestrian traffic lights retain a certain degree of clarity. We used a max-pool layer after the first convolution to decrease the size of the output and thus increase the speed of the network. LYTNet also features significantly fewer bottleneck layers (10 vs. 17) compared to MobileNet v2 [14]. Table 2 shows the detailed structure of our network.
| Input | Operator | t | c | n | s |
|---|---|---|---|---|---|
| | conv2d | - | 32 | 1 | 2 |
| | maxpool | - | - | 1 | - |
| | Bottleneck | 1 | 16 | 1 | 1 |
| | Bottleneck | 6 | 24 | 1 | 2 |
| | Bottleneck | 6 | 24 | 2 | 1 |
| | Bottleneck | 6 | 32 | 1 | 2 |
| | Bottleneck | 6 | 64 | 1 | 2 |
| | Bottleneck | 6 | 64 | 2 | 1 |
| | Bottleneck | 6 | 96 | 1 | 1 |
| | Bottleneck | 6 | 160 | 2 | 1 |
| | Bottleneck | 6 | 320 | 1 | 1 |
| | conv2d | - | 1280 | 1 | 1 |
| | avgpool | - | 1280 | 1 | - |
| 1280 | FC | - | 160 | 1 | - |
| 160 | FC | - | 5 | 1 | - |
| 1280 | FC | - | 80 | 1 | - |
| 80 | FC | - | 4 | 1 | - |

Here, $t$ is the channel expansion factor, $c$ the number of output channels, $n$ the number of times the layer is repeated, and $s$ the stride.
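To make the cost argument above concrete, with a $3 \times 3$ kernel the reduction factor $k^2 d_j / (k^2 + d_j)$ approaches $k^2 = 9$ for wide layers. Below is a minimal PyTorch sketch of a depthwise separable convolution and a MobileNet-v2-style bottleneck block of the kind listed in Table 2; it illustrates the building blocks under common conventions and is not the authors' released implementation.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one 3x3 filter per channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
        )
        # Linear (no activation) 1x1 projection across channels, as in MobileNet v2.
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class Bottleneck(nn.Module):
    """1x1 expansion by factor t, then a depthwise separable conv with stride s and c output channels."""
    def __init__(self, in_ch, out_ch, expansion, stride):
        super().__init__()
        hidden = in_ch * expansion
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
        )
        self.dwsep = DepthwiseSeparableConv(hidden, out_ch, stride=stride)
        # Residual connection only when the input and output shapes match.
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.dwsep(self.expand(x))
        return x + out if self.use_residual else out

# Example: the first two bottleneck rows of Table 2 (t=1, c=16, s=1, then t=6, c=24, s=2).
blocks = nn.Sequential(Bottleneck(32, 16, expansion=1, stride=1),
                       Bottleneck(16, 24, expansion=6, stride=2))
```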
During training, we used the Adam optimizer. The learning rate was decreased by a factor of 10 at 150, 400, and 650 epochs, with the network converging at around 800 epochs. We used a combination of cross-entropy loss to calculate the classification loss and mean-squared-error loss to calculate the direction loss. The overall loss function is defined as:
$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{MSE} + \lambda \sum_{w} w^{2}$   (1)
in which $\lambda$ is the L-2 regularization coefficient used during training, $\mathcal{L}_{CE}$ is the cross-entropy classification loss, and $\mathcal{L}_{MSE}$ is the mean-squared-error loss on the predicted endpoints.
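A sketch of how such a combined objective can be set up in PyTorch, with L-2 regularization expressed through the optimizer's weight decay; the relative weighting of the two terms and the hyper-parameter values below are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()   # classification term (traffic-light mode)
mse_loss = nn.MSELoss()           # regression term (zebra-crossing endpoints)

def combined_loss(class_logits, class_targets, coord_preds, coord_targets, beta=1.0):
    # beta is an assumed weighting between the two terms.
    return ce_loss(class_logits, class_targets) + beta * mse_loss(coord_preds, coord_targets)

# L-2 regularization (lambda) as weight decay in Adam; the learning-rate value
# is a placeholder, while the decay milestones follow the schedule in the text.
model = nn.Linear(10, 9)  # stand-in for LYTNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 400, 650], gamma=0.1)  # drop the LR by 10x at these epochs
```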
The predicted endpoints output by the network are assumed to be accurate with respect to the 2D image. However, the appearance of objects, including the zebra crossing, in the image plane is a distorted representation of their positions in the 3D world. Since the desired object, the zebra crossing, lies on the ground, it sits at a fixed height (z-value), enabling the conversion of the 2D image to a 2D birds-eye-view image that recovers the desired real-world position of the zebra crossing.
On our base image in Figure 2, we define four points: (1671, 1440), (2361, 1440), (4032, 2171), (0, 2171), and four corresponding points in the real world: (1671, 212), (2361, 212), (2361, 2812), (1671, 2812), with all points defined on the xy-plane. The resulting perspective-transform matrix maps each point on the image to its corresponding point in the real world. Assuming a fixed height and fixed angles around the transverse and longitudinal axes, the matrix will perfectly map each point on the image to the correct birds-eye-view point. Though this is not strictly the case, because heights and angles around the transverse axis vary, the matrix provides the rough position of the zebra crossing in the real world, which is sufficient for guiding the visually impaired to a correct orientation.
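Using the four point correspondences given above, the perspective-transform matrix can be estimated and applied to the predicted endpoints. A minimal sketch assuming OpenCV and NumPy (the paper does not state which library it uses); the example endpoint values are illustrative.

```python
import numpy as np
import cv2

# Four points on the base image and their corresponding birds-eye (real-world)
# points, taken from the correspondences listed above.
image_pts = np.float32([[1671, 1440], [2361, 1440], [4032, 2171], [0, 2171]])
world_pts = np.float32([[1671, 212], [2361, 212], [2361, 2812], [1671, 2812]])

# 3x3 perspective-transform matrix mapping image coordinates to the birds-eye plane.
M = cv2.getPerspectiveTransform(image_pts, world_pts)

def to_birds_eye(points_xy):
    """Map (N, 2) image points to birds-eye coordinates."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, M).reshape(-1, 2)

# Example: transform predicted zebra-crossing endpoints (illustrative values).
start, end = to_birds_eye([[1900, 1800], [2100, 2100]])
direction = end - start  # direction vector of the midline in the birds-eye view
```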
As a proof of concept, an application was created using Swift, with LYTNet deployed inside it. Additional post-processing steps are implemented in the application to increase safety and to convert the zebra crossing data into information useful for the visually impaired. The softmax probabilities of each class are stored in phone memory and averaged over five consecutive frames. Since countdown_blank and countdown_green represent the same mode of traffic light - a green light with numbers counting down - their probabilities are added together. A probability threshold of 0.8 must be reached before the application outputs a decision. This prevents a decision from being made just before or after the pedestrian traffic light changes color: if one frame of the five-frame average differs, the probability threshold is not reached. Users are alerted by a choice of beeps or vibrations whenever the five-frame average changes to a different traffic light mode. The endpoint coordinates are also averaged over five consecutive frames to provide more stable instructions for the user. The direction is retrieved from the angle of the direction vector in the birds-eye perspective.
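The temporal smoothing described above can be sketched as follows. The function and variable names are illustrative, not those of the Swift application; Python is used here for consistency with the other sketches.

```python
from collections import deque
import numpy as np

CLASSES = ["red", "green", "countdown_green", "countdown_blank", "none"]
THRESHOLD = 0.8           # decision threshold from the post-processing step
recent = deque(maxlen=5)  # softmax outputs of the last five frames

def update(softmax_probs):
    """Add one frame's softmax vector and return the smoothed decision, or None."""
    recent.append(np.asarray(softmax_probs, dtype=float))
    if len(recent) < 5:
        return None
    avg = np.mean(recent, axis=0)
    # countdown_green and countdown_blank both indicate a counting-down green
    # light, so their averaged probabilities are merged before thresholding.
    merged = {
        "red": avg[0],
        "green": avg[1],
        "countdown": avg[2] + avg[3],
        "none": avg[4],
    }
    label, prob = max(merged.items(), key=lambda kv: kv[1])
    return label if prob >= THRESHOLD else None
```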
A threshold was set on this angle before instructions are output to the user: if the angle exceeds the threshold in one direction, an instruction to rotate left is output, and if it exceeds the threshold in the other direction, an instruction to rotate right is output. The x-intercept of the line through the start-point $(x_1, y_1)$ and end-point $(x_2, y_2)$ is calculated as:
$x_{int} = x_1 - y_1 \cdot \dfrac{x_2 - x_1}{y_2 - y_1}$   (2)
For an image of width $w$ with its midline at $w/2$, if $x_{int}$ deviates beyond a set distance from the midline, instructions are given to move left or right accordingly. In our defined area of the zebra crossing in the transformed base image, the edges of the zebra crossing lie within a fixed horizontal distance of the midline. Since the zebra crossing has a constant width, if $x_{int}$ falls outside of this range, the user will be outside of the zebra crossing. Refer to Figure 3 for a flow chart of the demo application and Figure 4 for a screenshot of our demo application.
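Combining Eq. (2) with the angle and offset checks described above gives logic along these lines. The angle and offset thresholds, and the sign conventions mapping deviations to left/right instructions, are placeholders; the application's exact values are not reproduced here.

```python
import math

ANGLE_THRESHOLD_DEG = 10.0   # assumed threshold on the midline angle
OFFSET_THRESHOLD = 300.0     # assumed allowed horizontal distance from the image midline

def guidance(start, end, image_width):
    """Return rotation/translation instructions from birds-eye midline endpoints.
    Assumes the midline is not horizontal in the birds-eye view (y2 != y1)."""
    (x1, y1), (x2, y2) = start, end
    # Angle of the midline relative to straight ahead (the vertical axis).
    angle = math.degrees(math.atan2(x2 - x1, y2 - y1))
    # x-intercept of the line through the two endpoints (Eq. 2).
    x_int = x1 - y1 * (x2 - x1) / (y2 - y1)
    instructions = []
    if angle > ANGLE_THRESHOLD_DEG:
        instructions.append("rotate right")
    elif angle < -ANGLE_THRESHOLD_DEG:
        instructions.append("rotate left")
    offset = x_int - image_width / 2
    if offset > OFFSET_THRESHOLD:
        instructions.append("move right")
    elif offset < -OFFSET_THRESHOLD:
        instructions.append("move left")
    return instructions or ["aligned with the zebra crossing"]
```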
We trained our network using 3456 images from our dataset, with 864 images for validation [2]. Our testing dataset consists of 739 images. The width multiplier changes the number of output channels at each layer: a smaller width multiplier decreases the number of channels, making the network less computationally expensive but sacrificing accuracy. As seen in Table 3, networks using a higher width multiplier also have lower accuracy, likely due to overfitting. We performed further testing using the network with width multiplier 1.0, as it achieves the highest accuracy while maintaining near real-time speed on an iPhone 7.
| Width Multiplier | Accuracy (%) | Angle Error (degrees) | Start-point Error | Frame Rate (fps) |
|---|---|---|---|---|
| 1.4 | 93.50 | 6.80 | 0.0805 | 15.69 |
| 1.25 | 92.96 | 6.73 | 0.0810 | 17.19 |
| 1.0 | 94.18 | 6.27 | 0.0763 | 20.32 |
| 0.9375 | 93.50 | 6.44 | 0.0768 | 21.69 |
| 0.875 | 93.23 | 7.08 | 0.0854 | 23.41 |
| 0.75 | 92.96 | 7.16 | 0.0825 | 24.33 |
| 0.5 | 89.99 | 7.19 | 0.0853 | 28.30 |
| | Red | Green | Countdown Green | Countdown Blank | None |
|---|---|---|---|---|---|
| Precision | 0.97 | 0.94 | 0.99 | 0.86 | 0.92 |
| Recall | 0.96 | 0.94 | 0.96 | 0.92 | 0.87 |
| F1 Score | 0.96 | 0.94 | 0.97 | 0.89 | 0.89 |
The precisions and recalls of countdown_blank and none are the lowest of all classes, which may be due to the limited number of training samples for those two classes (Table 4). However, the precision and recall of red traffic lights, the most important class, are both at least 96%.
When the zebra crossing is clear/unblocked, the angle, start-point, and end-point errors are significantly lower than when it is obstructed (Table 5). For an obstructed zebra crossing, the image provides insufficient information for the network to output precise endpoints.
Figure 5 shows various outputs of our network. In (A), the network correctly predicts no traffic light despite two green car traffic lights taking a prominent place in the background, and is able to somewhat accurately predict the coordinates despite the zebra crossing appearing faint. In (B), the model correctly predicted the class despite the symbol being underexposed by the camera. (C) and (D) show examples of the model correctly predicting the traffic light despite rainy and snowy weather. (B), (C), and (D) all show the network predicting coordinates close to the ground truth.
To prove the effectiveness of LYTNet, we retrained it using only red, green, and none class pictures from our own dataset and tested it on the PTLR dataset [5]. Due to the small size of the PTLR training dataset, we were unable to perform further training or fine-tuning using the dataset without significant overfitting. Using the China portion of the PTLR dataset, we compared our algorithm with Cheng et al.’s algorithm, which is the most recent attempt for pedestrian traffic light detection to our knowledge.
LYTNet was able to outperform their algorithm in terms of F1 score, despite the disadvantage of insufficient training data from the PTLR dataset to train our network (Table 6). Furthermore, LYTNet provides additional information about the direction of the zebra crossing, giving the visually impaired a more comprehensive set of information for crossing the street, and outputs information regarding four different modes of traffic lights rather than only two. We also achieve a frame rate similar to that of Cheng et al.'s algorithm, which reached 21 FPS, albeit on a different mobile device.
| | Number of Images | Angle Error (degrees) | Startpoint Error | Endpoint Error |
|---|---|---|---|---|
| Clear | 594 | 5.86 | 0.0725 | 0.0476 |
| Obstructed | 154 | 7.97 | 0.0918 | 0.0649 |
| All | 739 | 6.27 | 0.0763 | 0.0510 |
In this paper, we proposed LYTNet, a convolutional neural network that uses image classification to detect the mode of pedestrian traffic lights and to provide the direction and position of the zebra crossing, assisting the visually impaired in crossing the street. LYTNet uses techniques from MobileNet v2 and was trained on our dataset, which is one of the largest pedestrian traffic light datasets in the world [2]. Images were captured at hundreds of traffic intersections within Shanghai at a variety of heights, angles, and positions relative to the zebra crossing.
Unlike previous methods that use multiple steps like detecting candidate areas, LYTNet uses image classification, a one-step approach. Since the network can learn features from an entire image rather than only detecting the pedestrian traffic light symbol, it has the advantage of being more robust in cases such as images with multiple pedestrian traffic lights. With sufficient training data, the network can draw clues from the context of an image along with the traffic light color to reach the correct prediction.
Additionally, LYTNet is more comprehensive than previous methods, as it classifies the traffic light into five classes compared to the three or four of previous attempts. Furthermore, our network is also capable of outputting zebra crossing information, which other methods do not provide.
| | | Our Network | Cheng et al.'s Algorithm |
|---|---|---|---|
| Red | Recall | 92.23 | 86.43 |
| | Precision | 96.24 | 96.67 |
| | F1 Score | 94.19 | 91.26 |
| Green | Recall | 92.15 | 91.30 |
| | Precision | 98.83 | 98.03 |
| | F1 Score | 95.37 | 94.55 |
Thus, LYTNet elegantly combines the two most needed pieces of information without requiring two separate algorithms. Furthermore, our network matches the performance of the algorithm proposed by Cheng et al.
In the future, we will improve the robustness of our deep learning model by expanding our dataset for further generalization. For the two classes with the least data, none and countdown_blank, additional data can greatly improve the precisions and recalls. Data from other regions of the world can also be collected to train the network separately so that it performs optimally in regions where pedestrian traffic lights have differently shaped symbols. Our demonstration mobile application will be further developed into a working application that converts the output into auditory and sensory information for the visually impaired.
We would like to express our sincerest gratitude to Professor Chunhua Shen, Dr. Facheng Li, and Dr. Rongyi Lan for their insight and expertise when helping us in our research.