When Vehicles See Pedestrians with Phones: A Multi-Cue Framework for Recognizing Phone-based Activities of Pedestrians

The intelligent vehicle community has devoted considerable efforts to model driver behavior, and in particular to detect and overcome driver distraction in an effort to reduce accidents caused by driver negligence. However, as the domain increasingly shifts towards autonomous and semi-autonomous solutions, the driver is no longer integral to the decision making process, indicating a need to refocus efforts elsewhere. To this end, we propose to study pedestrian distraction instead. In particular, we focus on detecting pedestrians who are engaged in secondary activities involving their cellphones and similar handheld multimedia devices from a purely vision-based standpoint. To achieve this objective, we propose a pipeline incorporating articulated human pose estimation, followed by a soft object label transfer from an ensemble of exemplar SVMs trained on the nearest neighbors in pose feature space. We additionally incorporate head gaze features and prior pose information to carry out cellphone related pedestrian activity recognition. Finally, we offer a method to reliably track the articulated pose of a pedestrian through a sequence of images using a particle filter with a Gaussian Process Dynamical Model (GPDM), which can then be used to estimate sequentially varying activity scores at a very low computational cost. The entire framework is fast (especially for sequential data) and accurate, and easily extensible to include other secondary activities and sources of distraction.




I Introduction

With the explosion of hand-held device usage globally, smart phones have made their way into most hands. This trend is expected to continue as devices get cheaper and find more utility in our day to day lives. As of 2011, there were more phones than people in the USA, and internationally, the number of mobile phone subscriptions is an estimated 5.9 billion. Though such devices are extremely useful and even indispensable for many, it is this very dependence that is a major cause of pedestrian distraction, and possible injury. From here onwards, we shall use the term cellphone as a placeholder for any hand-held multimedia device that a pedestrian may interact with.

Distracted walking, like distracted driving, is likely to increase in parallel with the penetration of electronic devices into the consumer market. Although driver distraction has received abundant attention since the turn of the century, distraction among pedestrians is a relatively nascent area of research. This is surprising given that pedestrians are in fact prone to acting less cautiously when distracted. Furthermore, a recent report by the Governors Highway Safety Association (GHSA) reveals a disturbing trend: between the mid-1970s and early 2000s, pedestrian deaths steadily declined, eventually dipping to around 11 percent of all motor vehicle fatalities. But since 2009, pedestrian fatalities have actually increased by 15 percent, climbing to 4,735 in 2013. Meanwhile, the percentage of pedestrians killed while using cell phones has risen from less than 1 percent in 2004 to more than 3.5 percent in 2010, according to [1]. The same study shows that the number of pedestrians injured while on their cellphones has more than doubled since 2005.

Fig. 1: Odds of failing to display optimal crossing behavior for different activities [2], along with their 95% confidence intervals.

The severity of this phenomenon is further reflected by the number of studies conducted over the last few years, each of which arrives at similar conclusions. A recent study by Thompson et al. [2] concludes that nearly one-third (29.8%) of all pedestrians performed a distracting activity while crossing, with text messaging associated with the highest risk among different technological and social factors (Figure 1). Meanwhile, Nasar et al. [1] found that mobile-phone related injuries among pedestrians increased relative to total pedestrian injuries, paralleled the increase in injuries for drivers, and in 2010 exceeded those for drivers. The study by Byington et al. [3] confirms this using a virtual street simulation, stating that while distracted, participants waited longer to cross the street, missed more safe opportunities to cross, took longer to initiate crossing when a safe gap was available, looked left and right less often, spent more time looking away from the road, and were more likely to be hit or almost hit by an oncoming vehicle. Moreover, it is noted that individuals between the ages of 18 and 29 are more likely to exhibit such behavior. For a detailed report on the global nature of the pedestrian safety problem and the inadequacy of current systems in ensuring it, we refer the reader to [4].

It is also interesting to note that as the emphasis of automobile manufacturers gradually shifts towards more automated vehicles, so must the emphasis placed on preventing pedestrian distraction related injuries. In such scenarios, the intelligent vehicle must be able to gauge the risk associated with each pedestrian, and demonstrate more caution in avoiding those with larger risks.

In this study, we focus only on distraction due to technological factors, particularly the use of cellphones for different tasks, and ignore social impacts such as talking or walking in a group. To summarize, we propose to classify each of many pedestrians in an image into one of 3 activity classes: none, texting and handheld phone call. We additionally extend this approach to work on sequences of images, where knowledge about temporal dynamics can enable faster and more efficient operation.

The rest of the paper is organized as follows. Section II briefly outlines some related work in the field. Section III describes the data we are working with, and the semantic annotations that are available for use. Section IV details the proposed methodology to estimate confidence scores of cellphone based activities for a pedestrian using a single image, and Section V extends this model to predict a score at every instance for a sequence of pedestrian images. Section VI lists the experiments carried out, and tabulates each of their results. Finally, Section VII concludes this work.

II Related Work

There is an abundance of work related to human activity recognition and classification from the last decade. However, these studies pertain to generic human activities and are not of much use in studying pedestrian distraction. Even though there have been quite a few studies that deal with driver distraction and activity modeling [5, 6, 7, 8], these models are not directly applicable to pedestrians because the forms of distraction and the activities of interest are considerably different. Nonetheless, there have been ample efforts devoted to studying pedestrians in the context of path prediction, intent analysis and action/activity recognition. We briefly go over these tasks, highlighting how they differ from the goal of this work. This study may appear similar to our previous work [9] in terms of the end goal; however, the proposed methodology is entirely different. We also use a significantly larger dataset and provide a more exhaustive evaluation in comparison to [9]. For a more detailed list of studies conducted on humans around vehicles, we refer the reader to [10].

Path prediction and gait analysis: There have been numerous studies on predicting the trajectories of pedestrians to prevent collisions and improve surround vehicle safety. These methods generally ignore high-level semantics (such as pedestrian intent) and predict the paths based on low level cues alone [11, 12, 13, 14, 15].

Intent analysis: The aim of such studies is to make an estimate of the pedestrians’ intention in the near future, so as to take appropriate measures to reduce risk of collision. These studies are commonly carried out in conjunction with path prediction, in a manner that benefits both tasks. Recent examples in this domain are [16, 17, 18, 19, 20, 21].

Action/Activity recognition: The terms action and activity have been used quite loosely in the context of pedestrians. In most cases, these terms allude to the different stages in the trajectory of a pedestrian [22, 23], e.g. walking, waiting, crossing etc. This notion of activity has also been extended to groups of people, where portions of a crowd are assigned a common activity based on context and collective behavior [24]. In this study, we use the term activity to refer to the secondary activity of a pedestrian being performed in addition to walking/crossing.

Although the tasks listed above are focused on modeling pedestrians and their behavior, none of them consider pedestrian distraction due to secondary activities like cellphone usage (see Table I for reference). Moreover, this study could be complementary to existing studies on pedestrian intention and path prediction, and could result in a more holistic understanding of pedestrian behavior.

(Output Classes)
Moeslund et al.[11] -
Gandhi et al.[12] - -
Goldhammer et al.[13] -
Köhler et al.[16] - -
Madrigal et al.[17] -
Schulz et al.[18] - cross, turn into road, stop
Bandyopadhyay et al.[19] - -
Keller et al.[20] - walking, stopping
Kooij et al.[21] walking, stopping
Kataoka et al.[22] - crossing, walking, standing, riding a bicycle
Quintero et al.[23] - walking, starting, standing, stopping
Choi et al.[24] - - collective activities of pedestrians like crossing, waiting, queuing, walking, talking
Rangesh et al.[9] - - using phone, none
This work - - texting, handheld phone call, none
TABLE I: Related work in image based pedestrian safety.
(a) Histogram of pedestrian bounding box heights
(b) Pedestrian activities
(c) Objects in pedestrian hands
(d) Pedestrian samples from the dataset. The joints obtained after articulated pose estimation have been overlaid for reference.
Fig. 2: Details pertaining to the proposed dataset. The dataset is demonstrably diverse in viewpoints, pedestrian size, activity and object interactions.

III Dataset Description & Semantic Annotations

Since pedestrian distraction due to cellphone usage is more common among a young demographic, we mounted 4 GoPro cameras, each facing a different direction, on an intelligent vehicle testbed parked at an intersection in the UC San Diego campus. By capturing different viewpoints on each camera, we ensure that pedestrians are not predisposed to appear in a particular location or facing a certain direction. Most of the data is captured on afternoons and evenings, on both sunny and overcast days to ensure diverse illumination conditions and reasonable foot traffic. Since the proposed methodology carries out fine-grained analysis of pedestrians, we avoid night time situations where it is hard to identify small objects and features even for humans. Furthermore, pedestrians are captured holding a variety of objects in addition to cellphones, such as bags, drinks, food and other miscellaneous items. To facilitate the finer analysis of each pedestrian, videos were captured at 2.7k resolution, resulting in pedestrians as large as 1000 pixels in height in a few cases. Figure 2 visualizes certain key statistics of our dataset, and shows a few sample pedestrians chosen at random.

The dataset comprises a total of 1586 cropped pedestrians, each with annotated activities and objects. These pedestrians are then divided into train and test sets using a 75-25 split, while making sure that the fraction of occurrences of each activity is retained in both sets.
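The class-preserving 75-25 split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset: the function name `stratified_split`, the random seed, and the toy class counts are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices) that keep per-label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Cut each class separately so its share is the same in both sets.
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy labels; these class counts are made up for illustration only.
labels = ["none"] * 80 + ["texting"] * 12 + ["phone call"] * 8
train_idx, test_idx = stratified_split(labels)
```

Splitting each class independently is what keeps rare classes such as phone calls represented in the test set, which a plain random split would not guarantee.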

Additionally, we annotate 7 sequences of pedestrians (3 for training, 4 for testing), each approximately 10 seconds in duration ( frames). In this case, the pedestrian is assigned an activity for each frame to account for temporal dynamics. In addition to this, the upper body joints (listed in section IV) are annotated for each frame, to enable evaluation of the proposed articulated pose tracker.

IV Single Frame Activity Classification

Figure 3 depicts the flow diagram of the proposed activity classification framework. The pipeline takes in an image patch corresponding to a pedestrian, and outputs the corresponding activity. We detail each processing block in the subsections that follow.

Fig. 3: Flow diagram of proposed methodology for single frame activity classification.

IV-A Articulated Pose Estimation

The articulated pose of a pedestrian can be an invaluable cue in estimating the activity he/she is involved in. Recent advances in pose estimation using deep convolutional neural networks (ConvNets) have led to state of the art results on challenging benchmarks. We make use of one such architecture, the Convolutional Pose Machines [25] proposed by Wei et al. This is a multi-stage ConvNet, where each subsequent stage operates both on image evidence as well as belief maps from preceding stages, gradually refining the pose estimate. This setup offers us great flexibility while choosing the number of stages, with the trade-off being speed versus accuracy. The network has been trained on the MPII dataset comprising 25K images containing over 40K people involved in 410 different activities, and outputs the locations of 16 joints corresponding to the articulated pose of a human body. We use this pre-trained network and fine-tune it on our own dataset. This gives us marginal improvements in performance compared to an out-of-the-box implementation (see Table II). Additionally, we only make use of the upper body joints for any further processing, as these are the most informative in our application. The framework can easily accommodate the full body pose instead, if necessary. The final set of keypoint locations retained is: head, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow and right wrist. See Figure 2 for some visual results of the pose estimation module on the proposed dataset.

Before fine-tuning   1.00 1.00 0.99 1.00 0.96 1.00 1.00 0.94
After fine-tuning    1.00 1.00 0.99 1.00 0.97 1.00 1.00 0.96
TABLE II: PCK scores [32] of the pose estimation module [25] on the test set before and after fine-tuning on the train set, with one column per retained upper body keypoint. Values close to 1 indicate near-perfect keypoint localization.
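For readers unfamiliar with the metric, the sketch below shows one common way of computing per-joint PCK (percentage of correct keypoints): a predicted joint counts as correct when it lies within a fraction of the person's bounding box size from the annotated location. The tolerance `alpha=0.2` and the use of the larger box side as the reference length are assumptions; the exact protocol of [32] is not restated in this section.

```python
import math

def pck_per_joint(predicted, ground_truth, box_sizes, alpha=0.2):
    """Per-joint fraction of predictions within alpha * max(w, h) of the truth.

    predicted, ground_truth: one list of (x, y) joints per person.
    box_sizes: one (width, height) pair per person.
    """
    num_joints = len(predicted[0])
    hits = [0] * num_joints
    for pred, truth, (width, height) in zip(predicted, ground_truth, box_sizes):
        tolerance = alpha * max(width, height)
        for j in range(num_joints):
            if math.dist(pred[j], truth[j]) <= tolerance:
                hits[j] += 1
    return [count / len(predicted) for count in hits]

# Two toy pedestrians with two joints each (coordinates are made up).
predicted = [[(10, 10), (50, 50)], [(0, 0), (5, 5)]]
ground_truth = [[(10, 20), (50, 100)], [(3, 4), (5, 10)]]
box_sizes = [(100, 200), (50, 100)]
scores = pck_per_joint(predicted, ground_truth, box_sizes)
```

In this toy case the second joint of the first pedestrian is 50 pixels off against a 40 pixel tolerance, so the second joint scores 0.5 while the first scores 1.0.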

Most human pose estimation algorithms require the rough location and scale of the human in the image plane. In this study, we assume that such information is available beforehand, and focus our attention on analyzing each pedestrian in finer detail. However, if desired, the location and scale of pedestrians may be obtained easily from any generic pedestrian detector. We would also like to point out that many recent studies like [26, 27] demonstrate state-of-the-art multi-person pose estimation in real time, without prior information on pedestrian locations and scales. This makes our approach viable for time critical applications like pedestrian safety and path planning.

The pose estimation module is used in our pipeline for three specific purposes. First, it allows us to localize the head and hands of each pedestrian for further examination. Second, it is used to identify similar training exemplars in the pose space. Third, the pose alone may be used as an informative prior over all activities. Each of the following subsections makes use of the articulated pose in one of these ways.

IV-B Hand Analysis using Exemplar SVMs

An important cue for predicting the activity of a pedestrian is the set of objects they interact with. To identify the objects held in the hands of a pedestrian, we look at local image patches around the location of each hand. To do so, we first regress to the approximate location of the hands of a pedestrian, assuming that it is collinear with the joints corresponding to the elbow and wrist. Let and denote the image plane coordinates of the elbow and wrist respectively. Using the assumption above, the approximate location of the hand is obtained as follows:


where is a parameter that depends on the ratio of distances of the elbow from the wrist and hand respectively. In our experiments, seemed to generate the best results.
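The collinearity assumption above can be illustrated with a short sketch that extrapolates past the wrist along the elbow-to-wrist direction. The function name `estimate_hand`, the parametrization (an offset measured from the wrist), and the value `eta=0.5` are assumptions chosen for illustration; the parameter value used in the experiments is not reproduced here.

```python
def estimate_hand(elbow, wrist, eta):
    """Extrapolate the hand position beyond the wrist on the elbow-wrist line.

    eta is a hypothetical ratio: hand offset from the wrist divided by the
    elbow-to-wrist distance.
    """
    ex, ey = elbow
    wx, wy = wrist
    return (wx + eta * (wx - ex), wy + eta * (wy - ey))

# Toy coordinates in pixels; eta=0.5 is an arbitrary illustrative value.
hand = estimate_hand(elbow=(100.0, 200.0), wrist=(120.0, 260.0), eta=0.5)
```

With `eta = 0` the estimate collapses to the wrist itself, which corresponds to the wrist-centered windows that the hand-centered windows are compared against.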

Once we have the rough locations of both hands in the image plane, we crop out a local image window around these locations. The window size is chosen to be for a pedestrian parametrized by . Here is a hyper-parameter that ensures that the local window scales with the size of the pedestrian. In our experiments, is set to to extract training patches. is chosen to ensure that the hand almost always falls into the window, and also that the window is small enough to capture only the object of interest and nothing more. As demonstrated in Table III, offers the best results, beyond which increasing does not improve hand localization by much. Examples of such local patches for windows centered around both the wrist and the hand can be found in Figure 4. It is obvious that inferring the hand location, even if approximate, helps in centering the object of interest with respect to the window.
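A minimal sketch of the window extraction step is given below. It assumes a square window whose side is a scale factor times the pedestrian bounding box height, clipped to the image; both that choice of reference dimension and the default `scale=0.15` (the point at which Table III saturates) are assumptions, as are the function and argument names.

```python
def hand_window(hand, box_height, image_size, scale=0.15):
    """Return (left, top, right, bottom) of a square window centred on the hand.

    The side length grows with the pedestrian's height so that near and far
    pedestrians yield patches covering a comparable physical extent.
    """
    half = scale * box_height / 2.0
    image_width, image_height = image_size
    cx, cy = hand
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(image_width, int(round(cx + half)))
    bottom = min(image_height, int(round(cy + half)))
    return left, top, right, bottom

# Toy values: a 400 pixel tall pedestrian in a 2704x1520 frame.
window = hand_window((130.0, 290.0), box_height=400, image_size=(2704, 1520))
clipped = hand_window((10.0, 10.0), box_height=400, image_size=(2704, 1520))
```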

With a collection of such training patches centered around the hand, we proceed to build an object classifier. Our experiments demonstrated that traditional one-versus-all classifiers severely overfit the training data and failed to generalize well to new object instances. Moreover, the need to train a separate classifier for each object class, together with the intra-class variance (cellphones come in a variety of shapes and sizes), makes the classification task an especially hard one, considering the limited availability of training data.

We bypass all these limitations by training an ensemble of exemplar-SVMs (ESVMs) [28]. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each ESVM is thus defined by a single positive instance and millions of negatives, obtained by hard negative mining. In our case, an ESVM is trained to represent a rigid HOG template from an image patch around each hand of every pedestrian in the training set. At test time, the ESVM that results in the highest score is considered to provide the best match, and the object label associated with the exemplar is transferred to the new test instance. Figure 5 shows a few examples of matched hand-object instances.
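The label transfer step can be sketched as follows: every exemplar carries its own linear classifier and object label, and the label of the best-scoring exemplar is copied to the test patch. This is only an illustration under simplifying assumptions. The dictionary layout and the two-dimensional toy features are hypothetical, and the sketch omits HOG extraction, sliding-window matching and hard negative mining.

```python
def linear_score(weights, bias, feature):
    """Score of one exemplar's linear SVM on a feature vector."""
    return sum(w * x for w, x in zip(weights, feature)) + bias

def transfer_label(exemplars, feature):
    """Return (label, score) of the highest-scoring exemplar."""
    best = max(
        exemplars,
        key=lambda e: linear_score(e["weights"], e["bias"], feature),
    )
    return best["label"], linear_score(best["weights"], best["bias"], feature)

# Two toy exemplars with made-up weights; real templates would be HOG-sized.
exemplars = [
    {"weights": [1.0, 0.0], "bias": -0.5, "label": "cellphone"},
    {"weights": [0.0, 1.0], "bias": -0.5, "label": "drink"},
]
label, score = transfer_label(exemplars, [0.9, 0.2])
```

Because each classifier is tied to exactly one training instance, the match also identifies which training pedestrian the test patch resembles, which is what Figure 5 visualizes.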

Scale factor   Wrist-centered window   Hand-centered window
0.05           0.0825                  0.7250
0.07           0.3400                  0.9275
0.10           0.9125                  0.9900
0.12           0.9850                  0.9950
0.15           0.9975                  0.9975
0.20           0.9975                  0.9975
TABLE III: Fraction of pedestrian hands falling within predicted wrist and hand centered windows for different values of the window scale factor. These evaluations were carried out on a separate validation set.

IV-C Gaze Analysis

The rough gaze direction of a pedestrian can be very effective in separating out instances where pedestrians are just holding a phone, versus when they are actually engaged in its use.

In this study, we use the gaze pathway from the GazeFollow ConvNet proposed in [29]. The gaze pathway takes in an image patch of the head along with its normalized location in the image plane (obtained from the articulated pose), and returns a heat-map (Figure 5) that encodes the rough gaze direction of the pedestrian. This sub-network has five convolutional layers followed by three fully-connected layers, the final output of which is a single channel heat-map. Finally, we reshape this output to produce a 169-length feature vector that encodes the gaze.

Fig. 4: Image patches obtained when the local window is centered around the (a) wrist versus the (b) hand.

IV-D Querying Nearest Neighbor Pose Exemplars

The main intuition behind our approach is that pedestrians with similar body poses tend to interact with objects in a similar form, and are likely to be involved in analogous activities. To have such a notion of similarity, it is necessary to construct a suitable feature representation of the articulated pose, and to enforce a reasonable distance metric that ensures that similar poses are close by.

We make use of a combination of the normalized joint locations and the normalized joint angles as the feature descriptor. Consider a pedestrian bounding box parametrized as $(x, y, w, h)$. Here, $x$ and $y$ correspond to image coordinates of the top left corner of the bounding box, and $w$ and $h$ describe the dimensions of the box. For the pedestrian under consideration, the pose estimation network outputs a set of image locations $\{(x_i, y_i)\}_{i=1}^{8}$, one for each joint in the upper body. The set of normalized joint locations is then found as follows:

$$(\hat{x}_i, \hat{y}_i) = \left(\frac{x_i - x}{w},\; \frac{y_i - y}{h}\right), \quad i = 1, \dots, 8$$

Next, consider the set of joint triplets that are connected consecutively in the articulated pose tree. For each such triplet , let the angle subtended (in radians) at by the line segment joining points and be denoted by . We have 7 such joint angles in the upper body pose. The normalized joint angle at is then obtained as follows:


The final feature vector is obtained as a simple concatenation of the set of normalized joint locations and angles. Our experiments indicated this to be much more stable in terms of closest neighbors in comparison to using either just the joint locations, or just the joint angles. With a set of pose features gathered from the pedestrians in the training set, we train a simple $K$-nearest neighbor classifier using a $k$-d tree structure for fast neighbor retrieval.
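The pose descriptor and neighbor query can be sketched as below. Several details are assumptions made for the example rather than the authors' exact choices: the joint ordering, the seven triplets listed in `TRIPLETS`, dividing angles by pi to normalize them, and the Euclidean distance. For brevity the sketch replaces the k-d tree with a brute-force search, which returns the same neighbors but scales linearly with the training set size.

```python
import math

# Assumed joint order: head, neck, l-shoulder, l-elbow, l-wrist,
# r-shoulder, r-elbow, r-wrist. Triplets are (end, vertex, end).
TRIPLETS = [(0, 1, 2), (0, 1, 5), (2, 1, 5), (1, 2, 3),
            (2, 3, 4), (1, 5, 6), (5, 6, 7)]

def normalized_locations(joints, box):
    """Express joints relative to the bounding box (x, y, w, h)."""
    x, y, w, h = box
    return [((jx - x) / w, (jy - y) / h) for jx, jy in joints]

def joint_angle(a, b, c):
    """Angle at b, in [0, pi] radians, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))

def pose_feature(joints, box, triplets=TRIPLETS):
    """Concatenate normalized locations and normalized angles."""
    feature = []
    for nx, ny in normalized_locations(joints, box):
        feature.extend([nx, ny])
    for i, j, k in triplets:
        feature.append(joint_angle(joints[i], joints[j], joints[k]) / math.pi)
    return feature

def nearest_neighbors(query, features, k):
    """Indices of the k training features closest to the query."""
    order = sorted(range(len(features)),
                   key=lambda n: math.dist(query, features[n]))
    return order[:k]

# Toy upper-body pose inside a 100x200 box (coordinates are made up).
joints = [(50, 10), (50, 40), (30, 45), (25, 90),
          (35, 120), (70, 45), (75, 90), (65, 120)]
feature = pose_feature(joints, box=(0, 0, 100, 200))
```

Normalizing by the bounding box makes the descriptor insensitive to where the pedestrian appears in the frame and how large they are, so distances reflect pose rather than position or scale.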

Fig. 5: Illustration of head and hand related cues described in sections IV-B and IV-C. In each of the three examples above, for a pedestrian in the test set (left), the gaze heatmap obtained from the gaze ConvNet is shown on top and the best hand (object) exemplar match with a pedestrian from the train set is shown below. The matched pedestrian (right) and exemplar weights are shown in addition to the matched hand patch. Best viewed in color.

IV-E Pedestrian Activity Classification

Having set up the individual parts, we now focus on integrating the cues from the different modalities to predict a final class probability score. For this study, the possible output classes for activity classification are none, texting and handheld phone call, which we encode as and respectively.

Consider a new pedestrian with pose features calculated in the manner described above. The aim now is to predict a class label , and estimate the probability associated with this prediction. Let denote the set of nearest neighbor pose exemplars obtained from the trained classifier in IV-D. We denote this set as follows -


where and denote the trained ESVMs on the left and right hands (from IV-B), and denote the object labels associated with the left and right hands, represents the gaze features obtained as mentioned in IV-C, and denotes the activity label associated with the nearest neighbor exemplar.

Let denote the image evidence available for the pedestrian whose activity is to be predicted. The desired predictive distribution may then be expressed as -


Decomposing the image evidence into individual head and hand based evidences and , and making use of conditional independence yields -


Each term in the equation above is described below -


where and are the gaze descriptors, and is the indicator function for set . This is simply the cosine similarity between the gaze features within the same class.
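As a rough illustration of the head evidence term, the sketch below averages the cosine similarity between the test pedestrian's gaze descriptor and the gaze descriptors of those neighbors that carry a given activity label. Averaging over the matching neighbors (rather than some other normalization) is an assumption, as are the function names and the two-dimensional toy descriptors standing in for the 169-length vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def gaze_score(test_gaze, neighbors, activity):
    """Mean similarity to neighbours whose activity label matches.

    neighbors: list of (gaze_descriptor, activity_label) pairs.
    """
    sims = [cosine_similarity(test_gaze, gaze)
            for gaze, label in neighbors if label == activity]
    return sum(sims) / len(sims) if sims else 0.0

neighbors = [([1.0, 0.0], "texting"),
             ([0.0, 1.0], "texting"),
             ([1.0, 0.0], "none")]
texting_score = gaze_score([1.0, 0.0], neighbors, "texting")
```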

Next, let us denote the maximum match score obtained for ESVM on the left hand image patch as , and that for on the right hand image patch as . This probabilistic score is obtained by testing each ESVM on the corresponding image patch, and then re-scaling the match score using the parameters determined by carrying out the Platt calibration for each ESVM offline. Further, only matches with an overlap score greater than 0.4 with the test patch are retained as done in [30]. We can now define the hand evidence likelihood as follows -


where equals if the object associated with the left hand is a cellphone, else it equals 0. The same is true for and the right hand.
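Platt calibration, mentioned above, fits a sigmoid per classifier so that raw SVM scores from different exemplars become comparable probabilities. The sketch below applies already-fitted parameters and picks the best calibrated match; the parameter values are made up, and the fitting procedure itself is not shown.

```python
import math

def platt_probability(score, a, b):
    """Platt scaling: map a raw SVM score to a probability with a sigmoid.

    a and b are per-classifier parameters fitted offline; a is typically
    negative so that larger scores give larger probabilities.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))

def best_calibrated_match(raw_scores, platt_params):
    """Index and probability of the exemplar with the highest calibrated score."""
    probs = [platt_probability(s, a, b)
             for s, (a, b) in zip(raw_scores, platt_params)]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# The second exemplar has the larger raw score but a much flatter sigmoid.
index, probability = best_calibrated_match(
    raw_scores=[1.0, 2.0],
    platt_params=[(-2.0, 0.0), (-0.1, 0.0)],
)
```

The toy case shows why calibration matters for an ensemble: the exemplar with the highest raw score is not necessarily the most confident match once each score is rescaled.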

Finally, the term acts as a prior over the activities, given just the articulated pose of a pedestrian. This is defined to be -


Using equations 6-9, the final predicted activity for the pedestrian is then chosen to be the MAP estimate -


Since the probability terms on the right hand side of equation 6 are not calibrated to provide compatible scores, we propose a second method based on late fusion of these scores. To do so, we create 9-length score vectors made up of the terms , and for . These vectors are created by performing a 5 fold cross validation split on the training set. Multi-class classification is carried out in a one versus all manner to predict the final activity of the pedestrian.
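The two decision schemes can be contrasted in a short sketch: the MAP estimate multiplies the three per-class terms and takes the arg max, whereas late fusion stacks the same nine numbers into a vector and leaves the weighting to a downstream classifier. The dictionary-based interface, the ordering of entries in the vector, and the toy values are assumptions; training the fusion SVM is not shown.

```python
ACTIVITIES = ("none", "texting", "handheld phone call")

def map_estimate(head, hand, prior):
    """Multiply the three per-class terms and return the best class."""
    posterior = {a: head[a] * hand[a] * prior[a] for a in ACTIVITIES}
    total = sum(posterior.values())
    if total > 0:
        posterior = {a: p / total for a, p in posterior.items()}
    best = max(ACTIVITIES, key=posterior.get)
    return best, posterior[best]

def fusion_vector(head, hand, prior):
    """9-length score vector: (head, hand, prior) for each of the 3 classes."""
    return [term[a] for a in ACTIVITIES for term in (head, hand, prior)]

# Made-up per-class scores for one pedestrian.
head = {"none": 0.2, "texting": 0.7, "handheld phone call": 0.1}
hand = {"none": 0.3, "texting": 0.6, "handheld phone call": 0.6}
prior = {"none": 0.5, "texting": 0.3, "handheld phone call": 0.2}

predicted, confidence = map_estimate(head, hand, prior)
scores = fusion_vector(head, hand, prior)
```

In the product, a single term with a wide dynamic range can dominate the others, which is the calibration issue that motivates learning the fusion weights instead.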

V Activity Classification for Sequential Data

In our proposed framework, the major bottleneck in terms of speed is the pose estimation network described in IV-A. Even though it is possible to reliably estimate the pose for a few pedestrians in real time using a GPU (for a reasonable number of stages in the network), the network can no longer operate at the desired frequency when the number of pedestrians in the scene is considerably large. This issue can be alleviated by tracking the articulated pose of pedestrians for the duration between successive outputs from the pose estimation network. This also ensures that the pose estimated by the network makes reasonable transitions between successive instances, thereby reducing single frame errors.

V-A GPDM-based Particle Filter for Articulated Pose Tracking

In this sub-section, we briefly describe the proposed particle filter based tracking framework with a Gaussian Process Dynamical Model (GPDM) [31]. Let $\mathbf{y}_t^{(i)}$ be the state of particle $i$ at time $t$, which represents the normalized pose features of a pedestrian as described in IV-D. Let $\mathbf{x}_t^{(i)}$ denote the latent space projection of $\mathbf{y}_t^{(i)}$ using a Gaussian Process Latent Variable Model (GPLVM) as described below -

$$\mathbf{y}_t^{(i)} = g\left(\mathbf{x}_t^{(i)}; \mathbf{B}\right) + \mathbf{n}_{y,t} \tag{11}$$

In addition to this, a GPDM enforces an auto-regressive dynamical model in the latent space -

$$\mathbf{x}_t^{(i)} = f\left(\mathbf{x}_{t-1}^{(i)}; \mathbf{A}\right) + \mathbf{n}_{x,t} \tag{12}$$

Here, $\mathbf{n}_{x,t}$ and $\mathbf{n}_{y,t}$ are zero-mean, white Gaussian processes, and $f$ and $g$ are nonlinear mappings parametrized by $\mathbf{A}$ and $\mathbf{B}$ respectively. Using small training sequences $\mathbf{Y}$, we can solve for both the corresponding latent space projections $\mathbf{X}$, and the necessary hyperparameters in closed form [31]. Despite the use of small data sets, the GPDM learns an effective representation of the highly non-linear dynamics associated with articulated pose tracking.

At any instant , the particle filter functions by propagating a set of particles in the latent space , by sampling (with noise) from the dynamical model in equation 12. This results in an updated set of particles . To determine the likelihood of each particle, it is necessary to project the latent particles back into the observation space using the learned GPLVM mapping (equation 11), where they may be evaluated against available measurements. This results in a corresponding set of particles in the observation space. To evaluate the likelihood of each particle, we note that most pose estimation networks output a heatmap for each joint location, which can be interpreted as a probabilistic score for its location in the image. With this in mind, let denote the heatmap for the joint; the function takes in the and coordinates of any location on the image plane, and returns the score associated with the location, encoded in the heatmap. The likelihood of a particle is then considered to be -


where denotes the co-ordinates of the joint obtained from the pose features .
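One iteration of such a filter can be sketched as below. The learned GPDM mappings are replaced by plain callables (`dynamics` and `decode`) that the caller supplies, and the particle likelihood is taken to be the product of heatmap scores at the decoded joint locations. Both are assumptions made to keep the example self-contained; the toy one-dimensional latent space is purely illustrative.

```python
import math
import random

def particle_filter_step(latents, dynamics, decode, heatmaps, noise_std, rng):
    """Propagate, weight and resample latent particles for one time step."""
    # 1. Propagate each particle through the latent dynamics, adding noise.
    proposed = [[v + rng.gauss(0.0, noise_std) for v in dynamics(z)]
                for z in latents]

    # 2. Decode to joint coordinates and score them against the heatmaps.
    weights = []
    for z in proposed:
        weight = 1.0
        for heatmap, (x, y) in zip(heatmaps, decode(z)):
            weight *= heatmap(x, y)  # assumed product-of-scores likelihood
        weights.append(weight)
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]
    else:
        weights = [1.0 / len(proposed)] * len(proposed)

    # 3. Weighted-mean state estimate, then multinomial resampling.
    dims = len(proposed[0])
    estimate = [sum(w * z[d] for w, z in zip(weights, proposed))
                for d in range(dims)]
    resampled = rng.choices(proposed, weights=weights, k=len(proposed))
    return resampled, estimate

# Toy stand-ins: latent moves +1 per step, one joint, heatmap peaked at x=5.
def toy_dynamics(z):
    return [z[0] + 1.0]

def toy_decode(z):
    return [(z[0], 0.0)]

def toy_heatmap(x, y):
    return math.exp(-0.5 * (x - 5.0) ** 2)

rng = random.Random(0)
particles, estimate = particle_filter_step(
    [[0.0], [2.0], [4.0]], toy_dynamics, toy_decode, [toy_heatmap], 0.0, rng)
```

When no fresh network output is available for a pedestrian, only step 1 needs to run, which is what makes tracking between network evaluations cheap.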

In practice, we train a set of GPDMs for different activities and viewpoints (e.g. walking towards the camera, walking away from the camera, walking sideways etc.). During test time, particles are initialized in latent space by locating the latent point (across all GPDMs) whose mapping in observation space yields the best match with the current measurement. Figure 6 shows the learnt latent space mappings for 4 different viewpoints.

Fig. 6: Latent space projections of articulated pose trajectories for four different viewpoints.

V-B Avenues for Speedup

The tracking framework considerably reduces the burden on the pose estimation network in ensuring near real time operation. When a large number of pedestrians are present in the scene, we can simply run the network on a subset of pedestrians, while the poses of the rest are updated based on state updates alone. When the network is finally run on a given pedestrian, the heatmaps are used as measurements to update the state of each particle in the filter. Alternatively, one can make use of more recent real-time algorithms for multi-person pose estimation [26, 27], which enable faster operation due to reduced overhead from having separate detection and pose estimation modules.

Additionally, by only running the ESVMs associated with the K nearest neighbor exemplars, we bypass the computational drawbacks associated with ESVMs. We can further reduce the computational burden required to predict class labels for a pedestrian at every instant in case of sequential data. Once the hand evidence term in equation 8 is reliably estimated for all output classes, we need only update the head evidence and prior terms in equation 6 at every instant. This removes the need to run the ensemble of ESVMs at every frame.
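The per-frame saving described above amounts to caching the hand evidence term. The sketch below is one hypothetical way to organize this: the class name `SequentialScorer` and its interface are not from the paper, and the hand term is simply cached after its first evaluation, whereas the text leaves open how "reliably estimated" is decided.

```python
class SequentialScorer:
    """Re-use the expensive hand term across frames of one pedestrian track."""

    def __init__(self, activities):
        self.activities = activities
        self.hand_term = None  # filled in on the first frame, then re-used

    def update(self, head_term, prior_term, compute_hand_term):
        if self.hand_term is None:
            # The only call that would run the ESVM ensemble.
            self.hand_term = compute_hand_term()
        scores = {a: head_term[a] * self.hand_term[a] * prior_term[a]
                  for a in self.activities}
        return max(self.activities, key=scores.get)

calls = []

def expensive_hand_term():
    calls.append(1)  # count how often the ensemble would be evaluated
    return {"none": 0.2, "texting": 0.7, "handheld phone call": 0.1}

scorer = SequentialScorer(("none", "texting", "handheld phone call"))
uniform = {a: 1.0 for a in scorer.activities}
first = scorer.update(uniform, uniform, expensive_hand_term)
second = scorer.update(
    {"none": 0.9, "texting": 0.05, "handheld phone call": 0.05},
    uniform, expensive_hand_term)
```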

Fig. 7: Confusion matrices for MAP estimation (top row), and SVM based late fusion (bottom row). with SVM based late fusion results in the best overall accuracy.

VI Experimental Analysis

VI-A Single Frame Activity Classification

The critical hyperparameter that needs to be set for the proposed pipeline is the neighborhood size . We experiment with a set of different values - . Figure 7 shows confusion matrices for different values of , for both the MAP estimation scheme and the SVM based late fusion. The MAP estimation scheme is seen to perform relatively poorly. This can be attributed to the fact that the pose prior is far too dominant in comparison to the other two probability terms. This leads to predictions that are overly influenced by the pose term, and hence the considerable false positives for all values of . In comparison, the SVM weights each individual cue accordingly and predicts a more balanced output. This leads to a much better overall accuracy in comparison to the MAP estimate. The best performance is seen for , which results in % overall accuracy. We notice that most misclassifications in the output correspond to pedestrians who are considerably small in size ( pixels in height), which makes it relatively harder to infer object labels and gaze information. Some other error modes are observed in cases where the pedestrian is holding objects that are considerably different from those observed in the training set. This issue, however, may be alleviated by collecting more data for training. Finally, there are cases where the correct label is ambiguous even to human annotators. This occurs when it is hard to infer the exact direction of gaze or the correct object label. For exemplar results on the test dataset, we refer the reader to Figure 8.

Cues               Accuracy for None   Accuracy for Texting   Accuracy for Phone Call   Overall accuracy
hand only          0.94                0.58                   0.20                      0.810
pose only          0.90                0.65                   0.67                      0.858
pose+hands         0.93                0.71                   0.81                      0.916
pose+hands+gaze    0.97                0.88                   0.89                      0.946
TABLE IV: Per class and overall accuracies for four sets of cues - hands alone, pose alone, pose and hands, and pose, hands and gaze.

To understand the contribution and utility of each individual cue while making a prediction, we fix K to the value corresponding to the best performing method. We train four separate SVM based fusion models for four different sets of cues: hands only, pose only, pose and hands, and pose, hands and gaze. For each set of cues, only features based on those cues are used for training the fusion SVM. Table IV shows the per class and overall accuracies for each of these configurations. Pose alone performs reasonably well, indicating that it is the strongest of the three cues. However, it tends to be too harsh in its assignment, depending too heavily on the nearest neighbors. Adding hand related cues from the ensemble of ESVMs considerably improves the classification accuracy, especially for the texting and phone call classes, as these rely more on recognizing hand-object interactions. Finally, adding gaze information further improves performance, most noticeably for the texting class, since texting requires the pedestrian to look at the phone directly.
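For readers who wish to reproduce this kind of comparison, the following sketch shows one way to compute per class and overall accuracies from predicted and ground truth labels. The function name and the toy labels are hypothetical; the paper does not specify its evaluation code.

```python
from collections import Counter


def per_class_and_overall_accuracy(y_true, y_pred, classes):
    """Return ({class: accuracy}, overall accuracy) for paired label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = Counter()
    correct = Counter()
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Per class accuracy: fraction of samples of that class predicted correctly.
    per_class = {c: correct[c] / total[c] for c in classes if total[c] > 0}
    # Overall accuracy: fraction of all samples predicted correctly.
    overall = sum(correct.values()) / len(y_true)
    return per_class, overall


# Toy, imbalanced example (made-up labels).
truth = ["none"] * 8 + ["texting", "phone_call"]
preds = ["none"] * 8 + ["texting", "none"]
print(per_class_and_overall_accuracy(truth, preds, ["none", "texting", "phone_call"]))
```

Because the overall accuracy is weighted by class frequency, a configuration can score poorly on a rare class and still report a comparatively high overall accuracy when the classes are imbalanced.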

As far as processing time is concerned, our algorithm with nearest neighbors takes about 4 ms on average for each pedestrian on a 6th generation i7 CPU. This does not include the time for running the articulated pose estimation module, which we run independently on a Titan X GPU. As mentioned in Section V-B, state-of-the-art pose estimation for multiple persons is nearing real-time operation, and the processing times for the other operations in our framework are minimal in comparison. This indicates feasibility for real world applications.

Fig. 8: Examples of pedestrians from the test set along with their predicted activity classes (). Correctly predicted classes are enclosed in green boxes, and incorrectly predicted classes are enclosed in red boxes along with the corresponding ground truth class.

VI-B Pose Tracking

As tracking the pose enables us to achieve speedups during runtime, it is important to validate its reliability on pedestrian sequences. We do this by training the proposed GPDM based particle filter on 3 pedestrian sequences. The tracker is then evaluated on 4 separate sequences from the test set using the PCK metric [32]. The results are shown in Table V. When tracking pedestrians in videos captured at 30 Hz, providing pose measurements only once every 6 frames (5 Hz) is sufficient to produce reliable tracks. Furthermore, tracking with a measurement at every frame improves on the poses obtained by running the pose ConvNet alone. This illustrates that tracking gives us robust estimates of the pose in addition to making our algorithm run faster.

Measurement rate
untracked    1.000   1.000   0.990   1.000   0.960   1.000   1.000   0.940
30 Hz        1.000   1.000   1.000   1.000   0.980   1.000   1.000   0.940
15 Hz        0.980   0.990   0.981   0.990   0.955   0.975   0.930   0.910
10 Hz        0.980   0.990   0.981   0.908   0.940   0.975   0.925   0.905
5 Hz         0.980   0.990   0.981   0.970   0.940   0.975   0.920   0.895
TABLE V: PCK scores of proposed articulated pose tracking for different measurement rates, evaluated on 4 different pedestrian sequences. The tracking stays consistent and reliable even for relatively infrequent measurements obtained from the pose ConvNet.
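As a reminder of how the PCK metric [32] is typically computed, the sketch below counts a predicted keypoint as correct if it lies within a fraction alpha of the larger bounding box dimension from the ground truth keypoint. The value alpha = 0.2 and the function signature are assumptions made for illustration; the threshold used in the experiments above is not stated in this section.

```python
import math


def pck(predicted, ground_truth, bbox_size, alpha=0.2):
    """Fraction of keypoints within alpha * max(width, height) of the ground truth.

    predicted, ground_truth: equal-length lists of (x, y) keypoints in pixels.
    bbox_size: (width, height) of the person bounding box in pixels.
    alpha: tolerance fraction (assumed value, not taken from the paper).
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    width, height = bbox_size
    threshold = alpha * max(width, height)
    correct = sum(
        1 for p, g in zip(predicted, ground_truth) if math.dist(p, g) <= threshold
    )
    return correct / len(predicted)


# Two keypoints on a 40 x 100 pixel pedestrian: the threshold is 20 pixels.
# The first keypoint is 2 pixels off (correct), the second 30 pixels off (incorrect).
print(pck([(10, 10), (50, 50)], [(12, 10), (80, 50)], (40, 100)))  # 0.5
```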

VI-C Activity Classification for Sequential Data

Next, we evaluate our proposed framework on 4 test sequences, each captured at 30 Hz and approximately 10 seconds in duration (280 - 310 frames per sequence). Each frame in a test sequence is annotated with the correct activity class. Our activity classification framework (with ) is run with the proposed articulated pose tracker, and predicts an output class for each frame. Additionally, we run the ensemble of ESVMs only once every 50 frames, as described in Section V-B, to reduce the computational burden. We plot the predicted and ground truth classes as a function of the frame number for each of the 4 test sequences in Figure 9. The activity classification framework, together with the pose tracker, produces class labels that are largely consistent with the ground truth, even under frequent changes in the activity dynamics.
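The amount of work saved on such a sequence can be estimated with a short scheduling sketch. The 50 frame ESVM interval and the 30 Hz frame rate come from the text. The 5 Hz pose measurement rate is borrowed from Table V as an assumption, since the rate used in this experiment is not stated, and the helper name is hypothetical.

```python
def frame_schedule(num_frames, video_hz=30, pose_hz=5, esvm_every=50):
    """Return the frame indices at which the expensive modules would run.

    pose_hz is an assumed measurement rate; esvm_every follows the text.
    """
    if video_hz % pose_hz != 0:
        raise ValueError("video_hz must be an integer multiple of pose_hz")
    stride = video_hz // pose_hz  # e.g. 30 Hz video, 5 Hz measurements: every 6th frame
    pose_frames = list(range(0, num_frames, stride))
    esvm_frames = list(range(0, num_frames, esvm_every))
    return pose_frames, esvm_frames


pose_frames, esvm_frames = frame_schedule(300)
print(len(pose_frames), len(esvm_frames))  # 50 6
```

Under these assumptions, a 300 frame sequence would need 50 pose measurements and 6 ESVM evaluations, while the inexpensive head and prior updates run at all 300 frames.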

Fig. 9: Plot of ground truth and predicted output class as a function of frame number for 4 test sequences: (a) Sequence 1, (b) Sequence 2, (c) Sequence 3, (d) Sequence 4.

VII Concluding Remarks

In this paper, we studied pedestrian distraction caused by cellphone usage in an effort to reduce the growing number of pedestrian fatalities. To this end, a multi-cue pipeline to recognize pedestrian activity is proposed. A pedestrian is classified as either texting, in a phone call, or not involved in any secondary activity, based on cues from the articulated pose, hands and gaze. ESVMs trained offline are used to encode hand-object labels, whereas gaze features are obtained from a pre-trained ConvNet. Each cue is then used to propose scores based on the neighboring pedestrians from the training set. Finally, these scores are combined using an SVM based late fusion scheme. In addition, we propose a GPDM based particle filter that operates on measurements obtained from a pose estimation ConvNet in order to improve pose estimation and speed up operation. Both the proposed methodology and the tracking framework are trained and evaluated on a unique pedestrian distraction dataset, which provides rich semantic annotations to facilitate a more detailed study of pedestrians.

Although the results are promising, there are still many issues to be addressed. Pedestrian activities are rich in variety, and so are the objects they interact with. However, it must be noted that our proposal is highly scalable. Since it relies on similarity based metrics obtained from pedestrians in the training set, performance is expected to improve as more diverse pedestrians are added to the training process, without any evident drop in computational speed. Future work includes going beyond phone based distraction to study other sources of pedestrian distraction (e.g. talking, walking in a group, listening to music), and integrating all such factors to predict a combined distraction score for each pedestrian.

VIII Acknowledgments

We would like to thank all our colleagues at the LISA lab, UCSD for their assistance in collecting and annotating the dataset. We would also like to express our gratitude to the reviewers and the editor for their valuable comments and suggestions.


  • [1] J. L. Nasar and D. Troyer, “Pedestrian injuries due to mobile phone use in public places,” Accident Analysis & Prevention, vol. 57, pp. 91–95, 2013.
  • [2] L. L. Thompson, F. P. Rivara, R. C. Ayyagari, and B. E. Ebel, “Impact of social and technological distraction on pedestrian crossing behaviour: an observational study,” Injury prevention, vol. 19, no. 4, pp. 232–237, 2013.
  • [3] K. W. Byington and D. C. Schwebel, “Effects of mobile internet use on college student pedestrian injury risk,” Accident Analysis & Prevention, vol. 51, pp. 78–83, 2013.
  • [4] T. Gandhi and M. M. Trivedi, “Pedestrian protection systems: Issues, survey, and challenges,” Transactions on Intelligent Transportation Systems, vol. 8, no. 3, pp. 413–430, 2007.
  • [5] M. Roth, F. Flohr, and D. M. Gavrila, “Driver and pedestrian awareness-based collision risk analysis,” in Intelligent Vehicles Symposium (IV), 2016 IEEE.   IEEE, 2016, pp. 454–459.
  • [6] T. Hoang Ngan Le, Y. Zheng, C. Zhu, K. Luu, and M. Savvides, “Multiple scale faster-rcnn approach to driver’s cell-phone usage and hands on steering wheel detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 46–53.
  • [7] E. Ohn-Bar, S. Martin, A. Tawari, and M. M. Trivedi, “Head, eye, and hand patterns for driver activity recognition,” in Pattern Recognition (ICPR), 2014 22nd International Conference on.   IEEE, 2014, pp. 660–665.
  • [8] A. Tawari, S. Sivaraman, M. M. Trivedi, T. Shannon, and M. Tippelhofer, “Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE.   IEEE, 2014, pp. 115–120.
  • [9] A. Rangesh, E. Ohn-Bar, K. Yuen, and M. M. Trivedi, “Pedestrians and their phones-detecting phone-based activities of pedestrians for autonomous vehicles,” in Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on.   IEEE, 2016, pp. 1882–1887.
  • [10] E. Ohn-Bar and M. M. Trivedi, “Looking at humans in the age of self-driving and highly automated vehicles,” Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 90–104, 2016.
  • [11] A. Mogelmose, M. M. Trivedi, and T. Moeslund, “Trajectory analysis and prediction for improved pedestrian safety: Integrated framework and evaluations,” in Intelligent Vehicles Symposium (IV).   IEEE, 2015, pp. 330–335.
  • [12] T. Gandhi and M. M. Trivedi, “Image based estimation of pedestrian orientation for improving path prediction,” in Intelligent Vehicles Symposium.   IEEE, 2008, pp. 506–511.
  • [13] M. Goldhammer, M. Gerhard, S. Zernetsch, K. Doll, and U. Brunsmann, “Early prediction of a pedestrian’s trajectory at intersections,” in 16th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2013, pp. 237–242.
  • [14] A. Prioletti, A. Møgelmose, P. Grisleri, M. M. Trivedi, A. Broggi, and T. B. Moeslund, “Part-based pedestrian detection and feature-based tracking for driver assistance: real-time, robust algorithms, and evaluation,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1346–1359, 2013.
  • [15] A. Bera, S. Kim, T. Randhavane, S. Pratapa, and D. Manocha, “Glmp-realtime pedestrian path prediction using global and local movement patterns,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on.   IEEE, 2016, pp. 5528–5535.
  • [16] S. Köhler, M. Goldhammer, S. Bauer, K. Doll, U. Brunsmann, and K. Dietmayer, “Early detection of the pedestrian’s intention to cross the street,” in 15th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2012, pp. 1759–1764.
  • [17] F. Madrigal, J.-B. Hayet, and F. Lerasle, “Intention-aware multiple pedestrian tracking,” in 22nd International Conference on Pattern Recognition (ICPR).   IEEE, 2014, pp. 4122–4127.
  • [18] A. T. Schulz and R. Stiefelhagen, “Pedestrian intention recognition using latent-dynamic conditional random fields,” in Intelligent Vehicles Symposium (IV).   IEEE, 2015, pp. 622–627.
  • [19] T. Bandyopadhyay, C. Z. Jie, D. Hsu, M. H. Ang Jr, D. Rus, and E. Frazzoli, “Intention-aware pedestrian avoidance,” in Experimental Robotics.   Springer, 2013, pp. 963–977.
  • [20] C. G. Keller and D. M. Gavrila, “Will the pedestrian cross? a study on pedestrian path prediction,” Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 494–506, 2014.
  • [21] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, “Context-based pedestrian path prediction,” in European Conference on Computer Vision.   Springer, 2014, pp. 618–633.
  • [22] H. Kataoka, Y. Aoki, Y. Satoh, S. Oikawa, and Y. Matsui, “Fine-grained walking activity recognition via driving recorder dataset,” in 18th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2015, pp. 620–625.
  • [23] R. Quintero, I. Parra, D. Llorca, and M. Sotelo, “Pedestrian path prediction based on body language and action classification,” in 17th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 679–684.
  • [24] W. Choi, K. Shahid, and S. Savarese, “What are they doing?: Collective activity classification using spatio-temporal relationship among people,” in 12th International Conference on Computer Vision Workshops (ICCV Workshops).   IEEE, 2009, pp. 1282–1289.
  • [25] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
  • [26] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” arXiv preprint arXiv:1611.08050, 2016.
  • [27] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, “Towards accurate multi-person pose estimation in the wild,” arXiv preprint arXiv:1701.01779, 2017.
  • [28] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-svms for object detection and beyond,” in Computer Vision (ICCV), 2011 IEEE International Conference on.   IEEE, 2011, pp. 89–96.
  • [29] A. Recasens, A. Khosla, C. Vondrick, and A. Torralba, “Where are they looking?” in Advances in Neural Information Processing Systems, 2015, pp. 199–207.
  • [30] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, “Data-driven visual similarity for cross-domain image matching,” in ACM Transactions on Graphics (TOG), vol. 30, no. 6.   ACM, 2011, p. 154.
  • [31] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in NIPS, vol. 18, 2005, p. 3.
  • [32] Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2878–2890, 2013.