LYTNet: A Convolutional Neural Network for Real-Time Pedestrian Traffic Lights and Zebra Crossing Recognition for the Visually Impaired

by Samuel Yu, et al.
Shanghai American School

Currently, the visually impaired rely on either a sighted human, a guide dog, or a white cane to navigate safely. However, the training of guide dogs is extremely expensive, and canes cannot provide essential information regarding the color of traffic lights and the direction of crosswalks. In this paper, we propose a deep-learning-based solution that provides information regarding the traffic light mode and the position of the zebra crossing. Previous solutions that utilize machine learning provide only one piece of information and are mostly binary, detecting only red or green lights. The proposed convolutional neural network, LYTNet, is designed for comprehensiveness, accuracy, and computational efficiency. LYTNet delivers both of the two most important pieces of information for the visually impaired to cross the road. We provide five classes of pedestrian traffic lights rather than the commonly seen three or four, and a direction vector representing the midline of the zebra crossing that is converted from the 2D image plane to real-world positions. We created our own dataset of pedestrian traffic lights containing over 5000 photos taken at hundreds of intersections in Shanghai. The experiments carried out achieve a classification accuracy of 94% and a frame rate of 20 frames per second when testing the network on an iPhone 7 with additional post-processing steps.




Code Repositories


ImVisible: Pedestrian Traffic Light Dataset, Neural Network, and Mobile Application for the Visually Impaired (CAIP '19, ACVR'19)


1 Introduction

The primary issue that the visually impaired face is not with obstacles, which can be detected by their cane, but with information that requires the ability to see. When we interviewed numerous visually impaired people, there was a shared concern regarding safely crossing the road when traveling alone. The reason for this concern is that the visually impaired cannot be informed of the color of pedestrian traffic lights and the direction in which they should cross the road to stay on the pedestrian zebra crossing. When interviewed, they reached a consensus that the information stated above is the most essential for crossing roads.

To solve this problem, some hardware products have been developed [1]. However, they are too costly, due to both the price of the product itself and a possible reliance on external servers to run the algorithm. The financial concern is especially important for the visually impaired community in developing countries, such as the people we interviewed who live in China. Accordingly, our paper addresses this issue by presenting LYTNet, which can later be deployed on a mobile phone, on both iOS and Android, and run locally. This method would be a cheap, comprehensive, and easily accessible alternative that supplements white canes for the visually impaired community.

We propose LYTNet, an image classifier, to classify whether or not there is a traffic light in the image, and if so, what color/mode it is in. We also implement a zebra crossing detector in LYTNet that outputs coordinates for the midline of the zebra crossing.

The main contributions of our work are as follows:

  • To the best of our knowledge, we are the first to create a convolutional neural network (LYTNet) that outputs both the mode of the pedestrian traffic light and midline of the zebra crossing

  • We create and publish the largest pedestrian traffic light dataset, consisting of 5059 photos with labels of both the mode of traffic lights and the direction vector of the zebra crossing [2]

  • We design a lightweight deep learning model (LYTNet) that can be deployed efficiently on a mobile phone application and is able to run at 20 frames per second (FPS)

  • We train a unique deep learning model (LYTNet) that uses one-step image classification instead of multiple steps, and matches previous attempts that only focus on traffic light detection

The rest of the paper is organized as follows: Section 2 discusses previous work on the detection of pedestrian traffic lights and zebra crossings; Section 3 describes the proposed pedestrian traffic light and zebra crossing classifier; Section 4 provides experimental results and comparisons against a published method; Section 5 concludes the paper and explores possible future work.

2 Related Works

Some industrialized countries have developed acoustic pedestrian traffic lights that produce sound when the light is green, which serves as a signal for the visually impaired to know when to cross the street [3, 4, 5]. However, in less economically developed countries, crossing streets is still a problem for the blind, and acoustic pedestrian traffic lights are not ubiquitous even in developed nations [3].

The task of detecting traffic lights for autonomous driving has been explored by many and has developed over the years [6, 7, 8, 9]. Behrendt et al. [10] created a model that is able to detect very small traffic lights with relatively high accuracy. Though most models for vehicle traffic lights have precision and recall rates of nearly 100% and show practical usage, the same cannot be said for pedestrian traffic lights. Pedestrian traffic lights differ because their symbols have complex shapes that usually vary with the region in which the light is placed. Vehicle traffic lights, on the other hand, are simple circles in nearly all countries.

Shioyama et al. [11] were among the first to develop an algorithm to detect pedestrian traffic lights and the length of the zebra crossing. Others, such as Mascetti et al. and Charette et al. [3, 15], developed analytic image processing algorithms, which perform candidate extraction, candidate recognition, and candidate classification. Cheng et al. [5] proposed a more robust real-time pedestrian traffic light detection algorithm, which replaces the analytic image processing method with candidate extraction and a concise machine learning scheme.

A limitation that many attempts faced was the speed of hardware. Thus, Ivanchenko et al. [12] created an algorithm specifically for mobile devices with an accelerator to detect pedestrian traffic lights in real time. Angin et al. [13] incorporated external servers to remove the limitation of hardware and provide more accurate information. Though external servers are able to run deeper models than phones, this approach requires a fast and stable internet connection at all times. Moreover, the advancement of efficient neural networks such as MobileNet v2 enables a deep-learning approach to be implemented on a mobile device [14].

Direction is another factor to consider when helping the visually impaired cross the street. Though the visually impaired can have a good sense of the general direction to cross the road in familiar environments, relying on one's memory has its limitations [16]. Therefore, solutions to provide specific direction have also been devised. Other than detecting the color of pedestrian traffic lights, Ivanchenko et al. [16] also created an algorithm for detecting zebra crossings. The system obtains information on how much of the zebra crossing is visible to help the visually impaired know whether or not they are generally facing in the correct direction, but it does not provide the specific location of the zebra crossing. Poggi et al., Lausser et al., and Banich [17, 18, 19] also use deep learning neural networks within computer vision to detect zebra crossings to help the visually impaired cross streets. However, no deep learning method is able to output both traffic light and zebra crossing information simultaneously.

3 Proposed Method

Our method is performed on our labeled test-set. The training, test, and validation sets do not overlap.

Figure 1: Sample images taken in different weather and lighting conditions. Other pedestrian traffic lights or vehicle/bicycle traffic lights can be seen in the images. The two endpoints of the zebra crossing are labelled as seen on the images.

3.1 Dataset Collection and Pre-Processing

Our data consists of images of street intersection scenes in Shanghai, China, in varying weather and lighting conditions. Images were captured with two different cameras, an iPhone 7 and an iPhone 6s, at a resolution of $4032 \times 3024$ [2]. The camera was positioned at varying heights and angles around the vertical and transverse axes, but the angle around the longitudinal axis was kept relatively constant under the assumption that the visually impaired are able to keep the phone in a horizontal orientation. At an intersection, images were captured at varying positions relative to the center of the crosswalk, and at different positions on the crosswalk. Images may contain multiple pedestrian traffic lights, or other traffic lights such as vehicle and bicycle traffic lights.

The final dataset consists of 5059 images [2]. Each image was labelled with a ground truth class for traffic lights: red, green, countdown_green, countdown_blank, and none. Sample images are shown in Figure 1. Images were also labelled with 2 image coordinates representing the endpoints of the zebra crossing as pictured on the image. The image coordinates define the midline of the zebra crossing. In a significant number of the images, the mid-line of the zebra crossing was obstructed by pedestrians, cars, bicycles, or motorcycles. Statistics regarding the labelled images are shown in Table 1.

Prior to training, each image was re-sized to a lower resolution. During each epoch, a random crop and a random horizontal flip were applied to each image to prevent over-fitting. The training dataset was partitioned into 5 equal groups and 5-fold cross validation was performed. Images used in the validation dataset were re-sized directly to the network input size without any transformations applied.

                        Red     Green   CD Green   CD Blank   None   Total
Number of Images        1477    1303    963        904        412    5059
Percentage of Dataset   29.2%   25.8%   19.0%      17.9%      8.1%   100.0%
Table 1: Composition of Dataset

3.2 Classification and Regression Algorithm

Our neural network, LYTNet, follows the framework of MobileNet v2, a lightweight neural network designed to operate on mobile phones. MobileNet v2 primarily uses depthwise separable convolutions. In a depthwise separable convolution, a "depthwise" convolution is first performed: the channels of the input are separated and a different filter is convolved over each channel. Then, a pointwise convolution (a regular convolution of kernel size $1 \times 1$) is used to collapse the channels to a depth of 1. For an input of dimensions $h_i \times w_i \times d_i$ convolved with stride 1 with a kernel of size $k \times k$ and $d_j$ output channels, the cost of a standard convolution is $h_i \cdot w_i \cdot d_i \cdot d_j \cdot k^2$, while the cost of a depthwise separable convolution is $h_i \cdot w_i \cdot d_i \cdot (k^2 + d_j)$ [14]. Thus, the total cost of a depthwise separable convolution is $k^2 d_j / (k^2 + d_j)$ times less than that of a standard convolution while having similar performance [14]. Each "bottleneck" block consists of a $1 \times 1$ convolution to expand the number of channels by a factor of $t$, and a depthwise separable convolution of stride $s$ and output channels $c$. Multiple fully connected layers were used to achieve the two desired outputs of the network: the classification and the endpoints of the zebra crossing. Compared to MobileNet v2, LYTNet was adapted for a larger input in order for the pedestrian traffic lights to retain a certain degree of clarity. We used a max-pool layer after the first convolution to decrease the size of the output and thus increase the speed of the network. LYTNet also features significantly fewer bottleneck layers (10 vs 17) compared to MobileNet v2 [14]. Table 2 shows the detailed structure of our network.
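As a rough illustration of the cost comparison for depthwise separable convolutions, the sketch below counts multiply-adds for a standard convolution and for a depthwise-plus-pointwise pair, using the cost expressions given for MobileNet v2 [14]. The layer dimensions are hypothetical and are not taken from LYTNet.

```python
def standard_conv_cost(h, w, d_in, d_out, k):
    """Multiply-adds of a standard k x k convolution at stride 1."""
    return h * w * d_in * d_out * k * k


def depthwise_separable_cost(h, w, d_in, d_out, k):
    """Multiply-adds of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * d_in * k * k   # one k x k filter per input channel
    pointwise = h * w * d_in * d_out   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


# Hypothetical layer (not from the paper): 96 x 72 feature map, 32 -> 64 channels, 3 x 3 kernel.
h, w, d_in, d_out, k = 96, 72, 32, 64, 3
standard = standard_conv_cost(h, w, d_in, d_out, k)
separable = depthwise_separable_cost(h, w, d_in, d_out, k)
ratio = standard / separable  # equals k^2 * d_out / (k^2 + d_out)
print(f"standard: {standard:,}  separable: {separable:,}  ratio: {ratio:.2f}")
```

For a 3 x 3 kernel the saving approaches a factor of 9 as the number of output channels grows, which is why the pointwise term dominates the cost of wide layers.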

Input    Operator      t    c       n    s
         conv2d        -    32      1    2
         maxpool       -    -       1    -
         Bottleneck    1    16      1    1
         Bottleneck    6    24      1    2
         Bottleneck    6    24      2    1
         Bottleneck    6    32      1    2
         Bottleneck    6    64      1    2
         Bottleneck    6    64      2    1
         Bottleneck    6    96      1    1
         Bottleneck    6    160     2    1
         Bottleneck    6    320     1    1
         conv2d        -    1280    1    1
         avgpool       -    1280    1    -
         FC            -    160     1    -
160      FC            -    5       1    -
1280     FC            -    80      1    -
         FC            -    4       1    -
Table 2: Structure of Our Network

Figure 2: The image on the left is the base image that was taken perpendicular to the zebra crossing and positioned in the center of the crossing, at a camera height of 1.4 m. Using our matrix, each point in the base image is mapped to a new point, creating the birds-eye image on the right. We can see that the zebra crossing is bounded by a rectangle with a midline centered and perpendicular to the x-axis.

During training, we used the Adam optimizer. The learning rate was decreased by a factor of 10 at 150, 400, and 650 epochs, with the network converging at around 800 epochs. The loss function combines cross-entropy loss (for classification) and mean-squared-error loss (for the direction regression), together with an L-2 regularization term.
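A minimal sketch of a training loss that mixes cross-entropy, mean-squared error, and L-2 regularization is given below. The exact form and weighting of the loss are not recoverable from this text, so the convex weighting `omega`, the regularization strength `lam`, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw class scores (numerically stable)."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])


def mse(pred, target):
    """Mean-squared error between predicted and true endpoint coordinates."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))


def combined_loss(logits, label, pred_pts, true_pts, weights, omega=0.5, lam=1e-4):
    """Weighted sum of regression and classification losses plus an L-2 penalty.

    omega and lam are hypothetical values chosen for this sketch only.
    """
    l2 = float(sum(np.sum(np.square(w)) for w in weights))
    return (omega * mse(pred_pts, true_pts)
            + (1.0 - omega) * cross_entropy(logits, label)
            + lam * l2)


# Example: 5 class scores and 4 endpoint values (x1, y1, x2, y2), normalized to [0, 1].
loss = combined_loss(
    logits=[2.0, 0.1, 0.1, 0.1, 0.1], label=0,
    pred_pts=[0.48, 0.95, 0.52, 0.40], true_pts=[0.50, 1.00, 0.50, 0.38],
    weights=[np.ones((2, 2))],
)
print(f"combined loss: {loss:.4f}")
```

Keeping a single scalar loss lets both output heads be trained in one backward pass over the shared convolutional features.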

3.3 Conversion of 2D Image Coordinates to 3D World Coordinates

The predicted endpoints output from the network are assumed to be accurate with regard to the 2D image. However, the appearance of objects and the zebra crossing in the image plane is a distorted representation of their positions in the 3D world. Since the desired object, the zebra crossing, lies on the ground, it has a fixed z-value, which enables the conversion of the 2D image to a 2D birds-eye perspective image to obtain the desired 3D real-world information about the zebra crossing.

On our base image in Figure 2, we define four points: (1671,1440), (2361,1440), (4032,2171), (0,2171), and four corresponding points in the real world: (1671,212), (2361,212), (2361,2812), (1671,2812), with the points defined on the xy-plane. A transformation matrix maps each point on the image to its corresponding point in the real world. Assuming a fixed height and fixed angles around the transverse and longitudinal axes, the matrix will perfectly map each point on the image to the correct birds-eye-view point. Though this is not the case in practice due to varying heights and angles around the transverse axis, the matrix provides the rough position of the zebra crossing in the real world, which is sufficient for the purposes of guiding the visually impaired to a correct orientation.
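One standard way to obtain such a matrix from four point pairs is to solve for a 3 x 3 perspective transform (homography). The sketch below does this with NumPy for the eight coordinates listed in this section. The text does not state how the matrix was actually computed, so the solver and the function names `homography_from_points` and `apply_homography` are assumptions.

```python
import numpy as np


def homography_from_points(src, dst):
    """Solve for the 3 x 3 matrix H (with H[2, 2] = 1) mapping four src points to dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H, point):
    """Map one (x, y) image point to birds-eye coordinates."""
    x, y = point
    u, v, s = H @ np.array([x, y, 1.0])
    return u / s, v / s


# Point pairs quoted in the text for the base image.
image_pts = [(1671, 1440), (2361, 1440), (4032, 2171), (0, 2171)]
world_pts = [(1671, 212), (2361, 212), (2361, 2812), (1671, 2812)]

H = homography_from_points(image_pts, world_pts)
for p in image_pts:
    u, v = apply_homography(H, p)
    print(p, "->", (round(u, 1), round(v, 1)))
```

The same `apply_homography` call can then be used on the predicted start- and end-points before the direction angle is computed.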

3.4 Mobile Application

As a proof of concept, an application was created using Swift, with LYTNet deployed in it. Additional post-processing steps are implemented in the application to increase safety and to convert zebra crossing data into information for the visually impaired. The softmax probabilities of each class are stored in phone memory and averaged over five consecutive frames. Since countdown_blank and countdown_green represent the same mode of traffic light (a green light that has numbers counting down), the probabilities of the two classes are added together. A probability threshold of 0.8 is set for the application to output a decision. This prevents a decision from being made just before or after the pedestrian traffic light changes color: if one frame of the five-frame average is different, the probability threshold will not be reached. Users are alerted by a choice of beeps or vibrations whenever the five-frame average changes to a different traffic light mode. The average of the endpoint coordinates is also taken over five consecutive frames to provide more stable instructions for the user. The direction is retrieved from the angle of the direction vector in the birds-eye perspective.
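The five-frame averaging and thresholding step can be sketched as follows. The demo application itself is written in Swift; this Python version is only an illustration, and the class name `FrameSmoother`, the choice to wait for a full window before deciding, and the use of `>=` for the threshold are assumptions.

```python
from collections import deque

import numpy as np

# Class order assumed for this sketch.
CLASSES = ["red", "green", "countdown_green", "countdown_blank", "none"]


class FrameSmoother:
    """Averages softmax outputs over a sliding window and applies a threshold."""

    def __init__(self, window=5, threshold=0.8):
        self.window = window
        self.threshold = threshold
        self.history = deque(maxlen=window)  # oldest frame is dropped automatically

    def update(self, probs):
        """Add one frame of class probabilities; return a mode name or None."""
        self.history.append(np.asarray(probs, dtype=float))
        if len(self.history) < self.window:
            return None  # assumption: no decision until the window is full
        mean = np.mean(self.history, axis=0)
        merged = {
            "red": mean[0],
            "green": mean[1],
            "countdown": mean[2] + mean[3],  # both countdown classes are one mode
            "none": mean[4],
        }
        best = max(merged, key=merged.get)
        return best if merged[best] >= self.threshold else None


smoother = FrameSmoother()
red = [0.95, 0.02, 0.01, 0.01, 0.01]
green = [0.02, 0.95, 0.01, 0.01, 0.01]
for frame in [red] * 5 + [green]:
    print(smoother.update(frame))
```

With these example values, a single differing frame pulls the averaged probability of the leading class below 0.8, so no decision is emitted for that window.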

Figure 3: Our application continuously iterates through this flow chart at 20 fps.
Figure 4: Sample screenshots from our demo application. From top to bottom: position instruction, orientation instruction, 5-frame average class, delay, frame rate, and current detected class. The blue line is the direction vector for the specific frame, and the red line is the five-frame average direction vector.

A threshold is set on the angle of the direction vector before instructions are output to the user. If the angle falls beyond the threshold on one side, an instruction for the user to rotate left is output, and if it falls beyond the threshold on the other side, an instruction to rotate right is output. The x-intercept of the line through the start- and end-points is also calculated and compared with the midline of the image: if the intercept is too far to one side of the midline, instructions are given to move left, and if it is too far to the other side, instructions are given to move right. In our defined area of the zebra crossing in the transformed base image, the edges of the zebra crossing lie within a fixed distance of the midline. With a constant width for the zebra crossing, if the intercept is outside of this range, the user will be outside of the zebra crossing. Refer to Figure 3 for a flow chart of the demo application and Figure 4 for a screenshot of our demo application.
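The orientation and position checks can be sketched as below. The thresholds, the sign conventions (which side maps to "left" or "right"), and the row at which the midline is intersected are not recoverable from this text, so `angle_tol_deg`, `offset_tol`, `y_user`, and the function name `guidance` are hypothetical choices for illustration only.

```python
import math


def guidance(start, end, image_width, y_user, angle_tol_deg=10.0, offset_tol=100.0):
    """Return (rotation, movement) instructions from birds-eye endpoints.

    Assumed conventions: x grows to the right, y grows downward, `start` is the
    near end of the crossing and `end` is the far end.
    """
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0

    # Angle between the direction vector and "straight ahead" (the -y direction).
    angle = math.degrees(math.atan2(dx, -dy))
    if angle > angle_tol_deg:
        rotation = "rotate right"
    elif angle < -angle_tol_deg:
        rotation = "rotate left"
    else:
        rotation = None

    # x-coordinate where the midline crosses the horizontal line y = y_user.
    movement = None
    if dy != 0:
        x_at_user = x0 + dx * (y_user - y0) / dy
        offset = x_at_user - image_width / 2
        if offset > offset_tol:
            movement = "move right"
        elif offset < -offset_tol:
            movement = "move left"
    return rotation, movement


print(guidance((2016, 2812), (2016, 212), image_width=4032, y_user=2812))
print(guidance((2016, 2812), (3016, 212), image_width=4032, y_user=2812))
print(guidance((2500, 2812), (2500, 212), image_width=4032, y_user=2812))
```

Separating the rotation check from the position check mirrors the two instructions shown in the demo screenshots, where orientation and position are reported independently.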

4 Experiments

We trained our network using 3456 images from our dataset and 864 images for validation [2]. Our testing dataset consists of 739 images. The width multiplier changes the number of output channels at each layer. A smaller width multiplier decreases the number of channels and makes the network less computationally expensive, but sacrifices accuracy. As seen in Table 3, networks using a higher width multiplier also have a lower accuracy, which we attribute to overfitting. We performed further testing using the network with width multiplier 1.0, as it achieves the highest accuracy while maintaining near real-time speed when tested on an iPhone 7. The precision and recall of countdown_blank and none are the lowest out of all classes, which may be due to the limited number of training samples for those two classes (Table 4). However, the precision and recall of red traffic lights, the most important class, are both 96% or higher.

Width   Accuracy (%)   Angle Error (degrees)   Start-point Error   Frame Rate (fps)
1.4     93.50          6.80                    0.0805              15.69
        92.96          6.73                    0.0810              17.19
1.0     94.18          6.27                    0.0763              20.32
        93.50          6.44                    0.0768              21.69
        93.23          7.08                    0.0854              23.41
        92.96          7.16                    0.0825              24.33
        89.99          7.19                    0.0853              28.30
Table 3: Comparison of Different Network Widths

            Red    Green   Countdown Green   Countdown Blank   None
Precision   0.97   0.94    0.99              0.86              0.92
Recall      0.96   0.94    0.96              0.92              0.87
F1 Score    0.96   0.94    0.97              0.89              0.89
Table 4: Precision and Recall by Class

When the zebra crossing is clear/unblocked, the angle error, startpoint, and endpoint errors are significantly better than when it is obstructed (Table 5). For an obstructed zebra crossing, insufficient information is provided in the image for the network to output precise endpoints.

Figure 5 shows various outputs of our network. In (A), the network correctly predicts no traffic light despite two green car traffic lights taking a prominent place in the background, and is able to somewhat accurately predict the coordinates despite the zebra crossing appearing faint. In (B), the model correctly predicted the class despite the symbol being underexposed by the camera. (C) and (D) show examples of the model correctly predicting the traffic light despite rainy and snowy weather. (B), (C), and (D) all show the network predicting coordinates close to the ground truth.

To prove the effectiveness of LYTNet, we retrained it using only red, green, and none class pictures from our own dataset and tested it on the PTLR dataset [5]. Due to the small size of the PTLR training dataset, we were unable to perform further training or fine-tuning using the dataset without significant overfitting. Using the China portion of the PTLR dataset, we compared our algorithm with Cheng et al.’s algorithm, which is the most recent attempt for pedestrian traffic light detection to our knowledge.

LYTNet was able to outperform their algorithm with regard to the F1 score, despite the disadvantage of insufficient training data from the PTLR dataset to train our network (Table 6). Furthermore, LYTNet provides additional information about the direction of the zebra crossing, giving the visually impaired a more comprehensive set of information for crossing the street, and outputs information regarding 4 different modes of traffic lights rather than only 2. We also achieve a similar frame rate to Cheng et al.'s algorithm, which achieved a frame rate of 21, albeit on a different mobile device.
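The F1 scores in the comparison follow directly from the reported precision and recall. The short sketch below recomputes them from the values in Table 6. The function name `f1_score` and the dictionary layout are illustrative, and the label "green" for the second group of rows is inferred from the retraining description rather than stated in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from Table 6.
table6 = {
    ("red", "LYTNet"): (96.24, 92.23),
    ("red", "Cheng et al."): (96.67, 86.43),
    ("green", "LYTNet"): (98.83, 92.15),        # second row group, assumed to be green
    ("green", "Cheng et al."): (98.03, 91.30),  # second row group, assumed to be green
}

for (cls, method), (p, r) in table6.items():
    print(f"{cls:5s} {method:12s} F1 = {f1_score(p, r):.2f}")
```

Because F1 is a harmonic mean, the lower recall of the baseline on red lights reduces its F1 score more than its slightly higher precision raises it.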

             Number of Images   Angle Error   Startpoint Error   Endpoint Error
Clear        594                5.86          0.0725             0.0476
Obstructed   145                7.97          0.0918             0.0649
All          739                6.27          0.0763             0.0510
Table 5: Comparison of Network Performance on Clear and Obstructed Zebra Crossings
Figure 5: Example correct outputs from our neural network. The class is labelled on top of each image. Blue dots are ground truth coordinates and red dots are predicted coordinates.

5 Conclusion

In this paper, we proposed LYTNet, a convolutional neural network that uses image classification to detect the color of pedestrian traffic lights and to provide the direction and position of the zebra crossing to assist the visually impaired in crossing the street. LYTNet uses techniques taken from MobileNet v2, and was trained on our dataset, which is one of the largest pedestrian traffic light datasets in the world [2]. Images were captured at hundreds of traffic intersections within Shanghai at a variety of different heights, angles, and positions relative to the zebra crossing.

Unlike previous methods that use multiple steps like detecting candidate areas, LYTNet uses image classification, a one-step approach. Since the network can learn features from an entire image rather than only detecting the pedestrian traffic light symbol, it has the advantage of being more robust in cases such as images with multiple pedestrian traffic lights. With sufficient training data, the network can draw clues from the context of an image along with the traffic light color to reach the correct prediction.

Additionally, LYTNet provides the advantage of being more comprehensive than previous methods, as it classifies the traffic light among five total classes compared to 3 or 4 in previous attempts. Furthermore, our network is also capable of outputting zebra crossing information, which other methods do not provide. Thus, LYTNet elegantly combines the two most needed pieces of information without requiring two separate algorithms. Furthermore, our network is able to match the performance of the algorithm proposed by Cheng et al.

                    Our Network   Cheng et al.'s Algorithm
Red     Recall      92.23         86.43
        Precision   96.24         96.67
        F1 Score    94.19         91.26
Green   Recall      92.15         91.30
        Precision   98.83         98.03
        F1 Score    95.37         94.55
Table 6: Precision and Recall of Our Network and Cheng et al.'s Algorithm

In the future, we will improve the robustness of our deep learning model through the expansion of our dataset, for further generalization. For the two classes with the least data, none and countdown_blank, additional data can greatly improve the precisions and recalls. Data from other areas around the world can also be collected to separately train the network to perform optimally in another region with pedestrian traffic lights with differently shaped symbols. Our demonstration mobile application will be further developed into a working application that converts the output into auditory and sensory information for the visually impaired.

6 Acknowledgements

We would like to express our sincerest gratitude to Professor Chunhua Shen, Dr. Facheng Li, and Dr. Rongyi Lan for their insight and expertise when helping us in our research.