Log In Sign Up

Joint Attention in Driver-Pedestrian Interaction: from Theory to Practice

Today, one of the major challenges that autonomous vehicles are facing is the ability to drive in urban environments. Such a task requires communication between autonomous vehicles and other road users in order to resolve various traffic ambiguities. The interaction between road users is a form of negotiation in which the parties involved have to share their attention regarding a common objective or a goal (e.g. crossing an intersection), and coordinate their actions in order to accomplish it. In this literature review we aim to address the interaction problem between pedestrians and drivers (or vehicles) from joint attention point of view. More specifically, we will discuss the theoretical background behind joint attention, its application to traffic interaction and practical approaches to implementing joint attention for autonomous vehicles.


page 2

page 4

page 6

page 7

page 8

page 13

page 35

page 38


Autonomous Vehicles that Interact with Pedestrians: A Survey of Theory and Practice

One of the major challenges that autonomous cars are facing today is dri...

An Intelligent Intersection

Intersections are hazardous places. Threats arise from interactions amon...

Towards Reducing Energy Waste through Usage of External Communication of Autonomous Vehicles

Automated vehicles can implement strategies to drive with optimized fuel...

Autonomous Driving: Framework for Pedestrian Intention Estimationin a Real World Scenario

Rapid advancements in driver-assistance technology will lead to the inte...

Response of Vulnerable Road Users to Visual Information from Autonomous Vehicles in Shared Spaces

Completely unmanned autonomous vehicles have been anticipated for a whil...

Autonomous Last-mile Delivery Vehicles in Complex Traffic Environments

E-commerce has evolved with the digital technology revolution over the y...

Pedestrian Models for Autonomous Driving Part II: high level models of human behaviour

Autonomous vehicles (AVs) must share space with human pedestrians, both ...

1 Introduction

Ever since the introduction of early commercial automobiles, engineers and scientists have been striving to achieve autonomy, that is removing the need for the human involvement from controlling the vehicles. The fascination with the autonomous driving technology is not new and goes back to the 1950s. In that era articles were appearing in the press featuring the autonomous vehicles in Utopian cities of the future (Figure 1) where drivers, instead of spending time controlling the vehicles, could interact with their family members or undertake other activities while enjoying the ride to their destinations [1].

Figure 1: A view of a futuristic autonomous vehicle in which a family of four are playing a board game while enjoying a ride to their destination, 1956. Source: [1].

Apart from the increased level of comfort for drivers, autonomous vehicles can positively impact society both at the micro and macro levels. One important aspect of autonomous driving is the elimination of driver involvement, which reduces the human errors (e.g. fatigue, misperception or inattention), and consequently, lowers the number of accidents (up to 93.5%) [2]. The reduction in human error can improve both the safety of the driver or the passengers of the vehicle and other traffic participants such as pedestrians.

At the macro level, fleets of autonomous vehicles can improve the efficiency of driving, better the flow of traffic and reduce car ownership (by up to 43%) through car sharing, all of which can minimize the energy consumption, and as a result, lower the environmental impacts such as air pollution and road degradation [3].

Throughout the past century, the automotive industry has witnessed many significant breakthroughs in the field of autonomous driving, ranging from simple lane following [4] to complex maneuvers and interaction with traffic in complex urban environments [5]. Today, autonomous driving has become one of the major topics of interest in technology. This field not only has attracted the attention of the major automotive manufacturers, such as BMW, Toyota, and Tesla, but also enticed a number of technology giants such as Google, Apple and Intel.

Despite the significant amount of interest in the field, there is still much to be done to achieve fully autonomous driving behavior in a sense of designing a vehicle capable of handling all dynamic driving tasks without any human involvement. One of the major challenges, besides developing efficient and robust algorithms for tasks such as visual perception and control, is communication with other road users in chaotic traffic scenes. Communication is a vital component in traffic interactions and is often relied upon by humans to resolve various ambiguities such as yielding to others or asking for the right of way. In order for communication to be effective, the parties require to understand each others’ intentions as well as the context in which the communication is taking place.

The aim of this paper is to address the aforementioned problems from autonomous driving perspective. Particularly, the focus is on understanding pedestrian behavior at crosswalks. For this purpose, we organize the rest of this paper into three main chapters. In Chapter one, we present a brief introduction to autonomous driving and discuss some of the major milestones, and unresolved challenges in the field. Chapter two focuses on the theories behind the problems, starting with joint attention and its role in human interaction, and continues by discussing some of the studies on nonverbal communication and behavior understanding, with a particular focus on pedestrian crossing behavior. In addition, in this chapter we elaborate on human general reasoning techniques to highlight how we, as humans, make decisions in traffic interactions. Chapter three, which comprises more than half of this report, addresses the practical challenges in pedestrian behavior understanding. To this end, this chapter reviews the state-of-the-art algorithms and systems for solving different aspects of the problem from two different perspectives, hardware and software. The hardware section describes various physical sensors used for these purposes, and the software section deals with processing the raw data from the sensors to perform tasks such as object detection, pose estimation and activity recognition, and decision-making tasks such as action planning and reasoning.

2 Autonomous Driving: From the Past to the Future

Before reviewing the development of autonomous driving technologies, it is necessary to define what we mean by autonomy in the context of driving. Traditionally, there are four levels of autonomy including no autonomy (the driver is in the control of all driving aspects), advisory autonomy (such as warning systems in the vehicle which partially aid the driver), partial control (such as auto braking or lane adjustment) and full control (all aspects of the dynamic driving tasks are handled autonomously)[6].

Figure 2: Six levels of driving autonmation. Today we have achieved level 3 autonomy. Source: [7].

Today, the automotive industry further breaks down the levels of autonomy into six categories: (see Figure 2)[8]:

Level 0: No Automation, where the human driver controls all aspects of the dynamic driving tasks. This level may include enhanced warning system but no automatic control is taking place.
Level 1: Driver Assistance, where only one function of driving such as steering or acceleration/deceleration, using information about the driving environment, is handled autonomously. The driver is expected to control all other aspects of driving.
Level 2: Partial Automation. In this mode, at least two functionalities of the dynamic driving tasks, in both steering and acceleration/deceleration, are controlled autonomously.
Level 3: Conditional Automation, where the autonomous system can handle all aspects of the dynamic driving tasks in a specific environment, however, it may require human intervention in the cases of failure.
Level 4: High Automation. This mode is similar to level 3 with the exception that no human intervention is required at any time during the environment specific driving task.
Level 5: Full Automation. As the name implies, in this mode all aspects of the dynamic driving tasks under any environmental conditions are fully handled by an automated system.

The current level of autonomy available in the market, such as the one in Tesla, is level 2. Some manufacturers such as Audi are also promising autonomy level 3 capability on their newest models such as A8 [9].

Figure 3: A century of developments in driving automation. This timeline highlights some major milestones in autonomous driving technologies from the first attempts in the 1920s to today’s modern autonomous vehicles. Source (in chronological order): [10, 1, 11, 12, 13, 14, 15, 16, 17]

The following subsections will review the developments in the field of autonomous driving during the past century. A summary of some of the major milestones are illustrated in Figure 3.

2.1 The Beginning

Figure 4: a) W. G. Walter and his Tortoises [18], and b) Hans Moravec and Stanford Cart [19].

Much of today’s autonomous driving technology is owing to the pioneering works of roboticists such as Sir William Grey Walter, a British neurophysiologist who invented the robots Elsie and Elmer (also known as Tortoises)(Figure (a)a), in the year 1948 [20]. These simple robotic agents are equipped with light and pressure sensors and are capable of phototaxis by which they can navigate their way through the environment to their charging station. The robots are also sensitive to touch which allows them to detect simple obstacles on their path.

A more modern robotic platform capable of autonomous behaviors is Stanford Cart (Figure (b)b) [21, 22]. This mobile platform is equipped with an active stereo camera and could perceive the environment, build an occupancy map and navigate its way around obstacles. In terms of performance, the robot successfully navigated a 20 m course in a room filled with chairs in just under 5 hours.

Autonomous vehicles also rely on similar techniques as in robotics to perform various perception and control tasks. However, since vehicles are used on roads, they generally require different and often stricter performance evaluations, in terms of robustness, safety, and real time reactions. In the remainder of this report we will particularly focus on robotic applications that are used in the context of autonomous driving.

Figure 5: a) The first remote-controlled vehicle, 1921 [10], and b) a more modern version of a commercial vehicle at Safety Parade 1936 [1].

Early attempts at developing autonomous driving technology go as far back as the first commercial vehicles. In this era, autonomous driving was realized in the form of remote-controlled vehicles removing the need for the driver to be physically present in the car.

In 1921, the first driverless car (Figure (a)a) was developed by the McCook air force test base in Ohio [1]. This 2.5 meter-long cart was controlled via radio signals transmitted from the distance of up to 30 m. In the 1930s, this technology was implemented on actual vehicles some of which were exhibited in various parades (Figure (b)b) to promote the future of driveless cars and to show how they can increase driving safety [1].

2.2 Hitting the Road

The first instance of driving without human involvement was introduced in 1958 by General Motors (GM). The autonomous vehicle called “automatically guided automobile” was capable of autonomous driving on a test track with electric wires laid on the surface which were used to automatically guide the vehicle steering mechanism [1].

In the late 1980s, one of the pioneers of modern autonomous driving, E. D. Dickmanns [4, 23]

, alongside his team of researchers at Daimler, developed the first visual algorithm for road detection in real time. They employed two active cameras to scan the road, detect its boundaries, and then measure its curvature. To reduce computation time, a Kalman filter was used to estimate the changes in the curvature as the car was traversing the road.

In the early 1990s, the team at Daimler enhanced the algorithm by adding obstacle detection capability. This algorithm identifies the parts of the road as obstacles if their height is more than a certain elevation threshold above the 2D road surface [24]. In the same year, the visual perception algorithm was tested on an actual Mercedes van, VaMoRs (Figure 6). Using the algorithm in conjunction with an automatic steering mechanism, VaMoRs was able to drive up to the speed of 100 km/h on highways, and up to 50 km/h on regional roads. The vehicle could also perform basic lane changing maneuvers and safely stop before obstacles when driving up to 40 km/h.

Figure 6: VaMoRs and a view of its interior [25].

Throughout the same decade, we witnessed the emergence of learning algorithms such as neural nets which were designed to handle various driving tasks. ALVINN is an example of such systems that was developed as part of the NAVLAB autonomous car project by Carnegie Melon University (CMU). The system uses a neural net algorithm to learn and detect different types of roads (e.g. dirt or asphalt) and obstacles [26, 27, 28, 29]. The algorithm, besides passive camera sensors, relies on laser range finders and laser reflectance sensors (for surface material assessment) to achieve a more robust detection.

To guide the vehicle, a similar learning technique is used by the NAVLAB team to learn driving controls from recordings collected from an expert driver [30]

. An extension of this project uses an online supervised learning method to deal with illumination changes, and a neural net to identify more complex road structures such as intersections

[31]. The NAVLAB project is implemented on a U.S. Army HMMWV and is capable of obstacle avoidance and autonomous driving up to 28 km/h on rugged terrains and 88 km/h on regular roads.

Despite the fact that learning algorithms achieved promising performance in various visual perception tasks, in the late 90s, the traditional vision algorithms still remained popular. Methods such as color thresholding [12] and various edge detection filters such as Sobel filters [32] or model-based algorithms for road boundary estimation and prediction [33] were widely used.

In the mid-90s, autonomous assistive technologies have become standard features in a number of commercial vehicles. For instance, an extension of the lane detection and following algorithms developed by Dickmanns’ team [34, 35] was used in the new lines of Mercedes-Benz vehicles [36]. This new extension, in addition to road detection, can detect cars by identifying symmetric patterns of their rear views. Using the knowledge of the road, the automatic system adjusts the position of the vehicle within the lanes and performs emergency braking if the vehicle gets too close to an obstacle. An interesting feature of this system is the ability to track objects, allowing the vehicle to autonomously follow a car in the front, i.e. the ability to platoon.

The new millennium was the time in which autonomous vehicles started to enjoy the technological advancements in both the design of sensors and increase in computation power. At this time, we observe an increase in the use of high power sensors such as GPS, LIDAR, high-resolution stereo cameras [37] and IMU [38]. The information from various sources of sensors was commonly used by autonomous vehicles, thanks to the availability of high computation power, which allowed them to achieve a better performance in tasks such as assessment of the environment, localization, and navigation of the vehicle and mapping. The emergence of such features brought the automotive industry one step closer to achieving full autonomy.

2.3 Achieving Autonomy

In 2004 the Defense Advanced Research Projects Agency (DARPA) organized one of the first autonomous driving challenges in which the vehicles were tasked to traverse a distance of 240 km between Las Vegas and Los Angeles [38]. In this competition, none of the 15 finalists were able to complete the course, and the longest distance traversed was only 11.78 km by the team Red from CMU.

The following year a similar challenge was held over the course of 212 km on a desert terrain between California and Nevada [39]. In this year, however, 5 cars finished the entire course (one of them over the 10 hours limit), out of 23 teams that participated in the final event. Stanley (Figure (a)a), the winning car from Stanford, finished the race under 6 hours and 53 minutes while maintaining an average speed of 30 km/h throughout the race [40].

Figure 7: a) Stanley from Stanford in DARPA 2005 [13], and b) BOSS from CMU in DARPA 2007 [14].

Stanley benefited from various sources of sensory input including a mono color camera for road detection and assessment, GPS for global positioning and localization, and RADAR and laser sensors for long and short-range detection of the road respectively. The Stanley project produced a number of state-of-the-art algorithms for autonomous driving such as the probabilistic traversable terrain assessment method [41], a supervised learning algorithm for driving on different surfaces [42] and a dynamic path planning technique to deal with challenging rugged roads [43].

In the year 2007, DARPA hosted another challenge, and this time it took place in an urban environment. The goal of this competition was to test vehicles’ ability to drive a course of 96 kilometers under 6 hours on urban streets while obeying traffic laws. The cars had to be able to negotiate with other traffic participants (vehicles), avoid obstacles, merge into traffic and park in a dedicated spot. In addition to robot cars, some professional drivers were also hired to drive on the course.

Among the 11 finalists, BOSS (Figure (b)b) from CMU [44] won the race. Similar to Stanley, BOSS benefited from a wide range of sensors and was able to demonstrate safe driving in traffic at the speed of up to 48 km/h.

Ever since the DARPA challenges, continuous improvements have been made in various tasks that contribute to achieving full autonomy, such as high-resolution and accurate mapping [45, 46], and complex control algorithms capable of estimating traffic behavior and responding to it [47, 48, 49].

Autonomous vehicles have also been put to the test on larger scales. In the year 2010, VisLab held an intercontinental challenge by setting the goal of driving the distance of over 13000 km from Parma in Italy to Shanghai in China [50]. Four autonomous vans each with 5 engineers on board participated in the challenge over the course of three months. One unique feature of this challenge was that the autonomous vehicles, for most of the course, performed platooning in which one vehicle led the way and assessed the road while the others followed it.

Furthermore, autonomous cars have found their way into racing. Shelley from Stanford [51] is one of the first autonomous vehicles that autonomously drove the 20 km world-famous Pikes Peak International Hill Climb in only 27 minutes while reaching a maximum speed of 193 km/h.

2.4 Today’s Autonomous Vehicles

Figure 8: Modern autonomous cars: a) Uber [17], b) Waymo (from Google) [52], c) Baidu [53], and d) Toyota [54]

Today, more than 40 companies are actively working on autonomous vehicles [55] including Tesla [56], BMW [57], Mercedes [58], and Google [52]. Although most of the projects run by these companies are at the research stage, some are currently being tested on actual roads (see Figure (d)d). Few companies such as Tesla already sell their newest models with level 2 autonomy capability and claim that these vehicles have all the hardware needed for fully autonomous driving.

Autonomous driving research is not limited to passenger vehicles. Recently, Uber has successfully tested its autonomous truck system, Otto, to deliver 50,000 cans of beer by driving the distance of over 190 km [59]. The Reno lab, at the University of Nevada, also announced that they are working on an autonomous bus technology and are planning to put it to test on the road by 2019 [60]. Autonomous driving technology is even coming to ships. In a recent news release, Rolls-Royce has disclosed its plans on starting a joint industry project in Finland, called Advanced Autonomous Waterborne Applications (AAWA), to develop a fully autonomous ship technology by no later than 2020 [61].

2.5 What’s Next?

How far we are from achieving fully autonomous driving technology, is a subject of controversy. Companies such as Tesla [62] and BMW [57] are optimistic and claim that they will have their first fully autonomous vehicles entering the market by 2020 and 2022 respectively. Other companies such as Toyota are more skeptical and believe that we are nowhere close to achieving level 5 autonomy [63].

So the question is, what are the challenges that we need to overcome in order to achieve autonomy? Besides challenges associated with developing suitable infrastructures [64] and regulating autonomous cars [65], technologies currently used in autonomous vehicles are not robust enough to handle all traffic scenarios such as different weather or lighting conditions (e.g. snowy weather), road types (e.g. driving on roads without clear marking or bridges) or environments (e.g. cities with large buildings) [66]. Relying on active sensors for navigation significantly constraints these vehicles, especially in crowded areas. For instance, LIDAR, which is commonly used as a range finder, has a high chance of interference if similar sensors are present in the environment [67].

Some of the consequences associated with technological limitations are evident in recent accidents reports involving autonomous vehicles. Cases that have been reported include minor rear-end collisions [68, 69], car flipping over [70], and even fatal accidents [71, 72].

Moreover, autonomous cars are facing another major challenge, namely interaction with other road users in traffic scenes [73]. The interaction involves understanding the intention of other traffic participants, communicating with them and predicting what they are going to do next.

But why is the interaction between road users so important? The answer to this question is threefold:

  1. It ensures the flow of traffic. We as humans, in addition to official traffic laws, often rely on some forms of informal laws (or social norms) to interact with other road users. Such norms influence the way we perceive others and how we interpret their actions [74]. In addition, as part of interaction, we often communicate with each other to disambiguate certain situations. For example, if a car wants to turn at a non-signalized intersection on a heavily trafficked street, it might wait for another driver’s signal indicating the right of way. Failure to understand the intention of others, in an autonomous driving context, may result in accidents some of which were reported in the last year involving some of Google’s autonomous vehicles [75, 76].

  2. It improves safety. Interaction can guarantee the safety of road users, in particular, pedestrians as the most vulnerable traffic participants. For instance, at the point of crossing, pedestrians often establish eye contact with drivers or wait for an explicit signal from them. This assures the pedestrians that they have been seen, therefore if they commence crossing, the drivers will slow down or stop before them [77]. How the crossing takes place or the way pedestrians will behave, however, may vary significantly depending on various factors such as demographics and social factors (e.g. presence of other pedestrians) [78] as well as environmental (e.g. weather conditions) and dynamic (e.g. the speed of the vehicles, traffic congestion) factors [79].

  3. It prevents from being subject to malicious behaviors. Given that autonomous cars may potentially commute without any passengers on board, they are subject to being disrupted or bullied [74]. For example, people may step in front of the car to force it to stop or change its route. Such instances of bullying have been reported involving some autonomous robots currently being used in malls. Some of these robots were defaced, kicked or pushed over by drunk pedestrians [80].

3 Joint Attention in Human Interaction

The precursor to any form of social interaction between humans (or primates [81], see Figure 9) is the ability to coordinate attention [82], which means the interacting parties at very least should be able to pay attention to one another, discern the relevant objects and events of each other’s attentional focus, and implement their own lines of action by taking into account where and toward what others may be attending.

Figure 9: The monkey is imitating the human experimenter’s gestures. Source: [81]

In developmental psychology, the ability to share attention and to coordinate behavior is defined under the joint attention framework, or as some scholars term it, shared attention [82] or visual co-orientation [83]. Traditionally, joint attention has been studied as a visual perception mechanism in which two or more interacting parties establish eye contact and/or follow one another’s gaze towards an object or an event of interest [84, 85]. More recently, joint attention has also been investigated in different sensory modalities such as touch [86] or even remotely via the web [87]. Since the objective of this report is visual perception in autonomous driving, in the following chapters we only focus on the problem of joint visual attention and simply address it as joint attention.

What does joint attention really mean? Joint attention is often defined as a triadic relationship between two interacting parties and a shared object or event [88, 89, 90, 82]. Simply speaking, joint attention means the simultaneous engagement of two or more individuals with the same external thing [91].

In the traditional definitions, an important part of joint attention is the ability to reorient and follow the gaze of another subject to an object or an event [88, 90]. However, in more recent interpretations of joint attention, the gaze following requirement is relaxed and replaced by terms such as “mental focus” [91] or “shared intentionality” [92]. This means joint attention constitutes the ability of a person to engage with another for the sake of a common goal or task, which may not involve explicit gaze following action.

3.1 Joint Attention in Early Childhood Development

In 1975, Scaife and Bruner [84] were the first to discover the joint visual attention mechanism and its role in early developmental processes in infants. They observed that infants below the age of 4 months were able to respond to the gaze changes of the adults in interactive sessions about one-third of the times. In comparison, the older infants, above the age of 11 months, almost always responded to the changes and could follow the gaze of the adults while interacting with them. In addition, at this age, infants can follow the eye movements of the adults as well as their head movements.

Butterworth and Cochran [83] further investigate the joint attentional behavior and reveal that infants between the age of 6 to 18 months adjust their line of gaze with those of the adult’s focus of attention, however, they act only if the adult is referring to loci within the infant’s visual space. As a result, if the adult looked behind the infant, the infant only scans the space in front of them. The authors add that at early stages infants do not follow the gaze to the intended object, instead they turn their head to the corresponding side but focus their own gaze on the first object that comes in their field of view. The authors conclude that it is only in the second year when infants are able to focus on the same object that is intended by the adult.

In a subsequent study by Moore et al. [88] it is shown that while sharing attention, the actual movement is critical in gaze following. Through experimental evaluations, the authors illustrate that if only the final focus of the adult is presented to the infant they would not necessarily focus on the same object.

3.2 Joint Attention in Social Cognition Development

Joint attention has been linked to the development of social cognitive abilities such as learning of artifacts and environments [93, 94]. More specifically, joint attention is a fundamental component in language development through which infants learn to describe their surroundings [83, 85, 95]. In a study by Tomasello and Todd [85], it is argued that the lexical development of children depends on the way the joint attention activity is administered between the adult and the infant. It is shown that when mothers initiated interaction by directing their child’s attention, rather than following it, their child learned fewer object labels and more personal-social words, i.e. they were more expressive (and vice versa).

In short, Tomasello and Carpenter [92] summarize the social cognitive skills that are acquired through joint attention into four groups:

  1. Gaze following and joint attention

  2. Social manipulation and cooperative communication

  3. Group activity and collaboration

  4. Social learning and instructed learning

The importance of joint attention is not limited to early childhood and is believed to be vital in social competence at all ages. Adolescents and adults who cannot follow, initiate, or share attention in social interactions may be impaired in their capacity for relationship [82]. There are a large number of studies on the effects of joint attention incapabilities and social disorders, in particular in people with autism [89, 96, 90, 97]. For instance, autistic children are found to have minor deficits in responding to joint attention while struggle to initiate joint attention. The effect of aging on joint attention has also been investigated [97]. It is shown that adults tend to get slower in gaze-cuing as they age.

3.3 Moving from Imitation to Coordination

The traditional view of joint attention, as discussed earlier, focuses on the role of joint attention as a means whereby infants interact with adults and imitate their behavior to learn about their surroundings.

As adults, however, we often engage in more complex interactions, which can take many forms such as competition, conflict, coercion, accommodation and cooperation [98]. Cooperation, as in the context of traffic interaction, refers to a social process in which two or more individuals or groups intentionally combine their activities towards a common goal [99], e.g. crossing an intersection.

What makes a complex coordination possible? Of course, the immediate answer is that a form of joint attention has to take place so the parties involved focus on a common objective. However, joint attention in its classical definition does not fully satisfy the requirements for cooperation. First, although certain cooperative tasks can be resolved by imitation (e.g. make the same movements to balance a table while carrying it), in some scenarios complementary actions are required to accomplish the task (e.g. the person at the front watches for obstacles while the one behind carries the table) [100], as shown in Figure 10. Second, even though involved parties focus on a common object or event, this does not mean that they also share the same intention. In this regard, in the context of cooperation, some scholars use the term intentional joint attention [101] indicating that the agents are not only mutually attending to the same entity, but they also intend to do so.

Figure 10: Coordination been humans for carrying a table. Rather than imitating each other’s actions, (left image), people must sometimes perform complementary actions to reach a common goal (right image). Source: [100]

This type of cooperation that involves a form of attention sharing is often referred to as joint action [101, 100]. In some literature, joint attention is considered as the simplest form of joint action [101]. However, for simplicity, throughout the rest of this report, we address the whole phenomenon as joint attention.

3.4 How Humans Coordinate?

As described earlier, joint attention provides a mechanism for sharing the same perceptual input and directing the attention to the same event or object [100]. Here, perhaps, the most crucial components to trigger this attentional shift are eyes, because they naturally attract the observer’s attention even if they are irrelevant to the task. Of course, other means of communication such as hand gesture or body posture changes can be used for seeking attention [102]. This is particularly true in the case of traffic interaction where nonverbal communication between road users is the main means of establishing joint attention.

Next, the interacting parties have to understand each other’s intention in order to cooperate [101]. In some scenarios establishing joint attention might convey a message indicating the intention of the parties. For instance, at crosswalks, pedestrians often establish eye contact with the drivers indicating their intention of crossing [103]. However, joint attention on its own is not sufficient for understanding the intention of others (see Figure 11 for an example) or what they are going to do next. A more direct mechanism is action observation [100].

4 Observation or Prediction?

4.1 A Biological Perspective

When it comes to understanding observed actions of others, humans do not rely entirely on vision [104, 105]. In a study by Umilta et al. [105]

, the authors found that there is a set of neurons (referred to as mirror neurons) in the ventral premotor cortex (part of the motor cortex involved in the execution of voluntary movements) of macaque monkeys that fire both during the execution of the hand action and observing the same action in others. The authors show that a subset of these neurons become active during action presentation, even though the part of the action that is crucial in triggering the response is hidden and can therefore only be inferred. This implies that the neurons in the observer action system are the basis of action recognition.

Such anticipatory behaviors are also observed in humans. Humans in general have limited visual processing capability (even more so due to the loss of information during saccadic eye movements), especially when it comes to observing moving objects. Therefore, they actively anticipate the future pose of the object to interpret a perceived activity [106]. In an experiment by Flanagan and Johansson [107], the authors showed a video of a person stacking blocks to a number of human subjects and measured their eye movements. They noticed that the gaze of the observers constantly preceded the action of the person stacking blocks and predicted a forthcoming grip in the same way they would perform the same task themselves. The authors then concluded that when observing an action, the human behavior is predictive rather than reactive.

Figure 11: From joint attention to crossing. Are these pedestrians going to cross?222a) no, b) no, c) no, d) yes, e) yes, and f) no.

It is necessary to note that such predictive behaviors in action observation has biological advantages for humans (and also for machines). In addition to dealing with visual processing limitations, anticipatory behaviors can help to deal with visual interruptions due to occlusion in the scene.

4.2 A Philosophical Perspective

From a philosophical perspective, it can also be shown that behavior (or action) prediction is the only way to engage in social interaction. For this purpose we refer to the arguments presented by Dennet [108] and the comments on the topic by Baron-Cohen [109].

The ability to find “an explanation of the complex system’s behavior and predicting what it will do next”, which may include beliefs, thoughts, and intentions [108], or as Baron-Cohen terms it “mindreading” is crucial in both making sense of one’s behavior and communication. Dennet argues that mindreading, or as he calls it adopting “intentional stance”, is the only way to engage in social interaction. He further elaborates that the two alternatives to intentional stance, namely physical stance and design stance, are not sufficient for interpretation of one’s intentions or actions.

According to Dennet, physical stance refers to our understanding of systems whose physical stance we know about, for instance, we know cutting skin results in bleeding. In terms of understanding complex behavior, however, in order to rely on physical properties, we need to know millions of different physiological (brain) states that give rise to different behaviors. As a result, mindreading is an infinitely simpler and more powerful solution.

Design stance, on the other hand, tries to understand the system in terms of the functions of its observable parts. For example, one does not need to know anything about the microprocessor’s internal design to understand the function of pressing the Delete key on the keyboard. Similarly, design stance can explain some aspects of human behavior, such as blinking reflex in response to blowing on eye surface, but it does not suffice to make sense of complex behaviors. This is primarily due to the fact that people have very few external operational parts for which one could work out a functional or design description.

In addition to behavioral understanding, Baron-Cohen [109] argues that mindreading is a key element in communication. Apart from decoding communication cues or words, we try to understand the underlying communicative intention. In this sense, we try to find the “relevance” of the communication by asking questions such as what that person “means” or “intends me to understand”. For instance, if someone gestures towards a doorway with an outstretched arm and an open palm, we immediately assume that they mean (i.e., intend us to understand) that we should go through the door.

4.3 The Role of Context in Behavior Prediction

Now that humans are constantly relying on predicting forthcoming actions of one another in social contexts, the question is, what does enable us to predict behavior? we answer this question in two parts: making sense of one’s action and interpreting communication cues. Although these two components are inherently similar, not necessarily in all scenarios actions follow a form of communication, i.e. one might simply observe another person without interaction.

According to Humphrey [110], when observing someone’s action, we first need to perceive the current state of being by relying on our sensory inputs. Next, we need to understand the meaning of the action by relating it to the knowledge of the task (e.g. crossing the street). This knowledge is either biologically encoded in our brain, for instance, people have a very accurate knowledge of human body and how it moves [106], or, in more complex scenarios, it requires knowing the stimulus conditions (context) under which an individual performs an action. The context may include various physical or behavioral attributes present in the scene. Humphrey also emphasizes that in order to understand others, we need to predict the consequences of our actions and realize how they can influence their behavior [110].

The role of context is also highlighted in communication and how it can influence the way we convey communication cues. Sperber and Wilson [111], in their theory of relevance, argue that communication is achieved either by encoding and decoding messages via a code system which pairs internal messages with external signals, or by using the evidence from the context to infer the communicator’s intention. Although this theory is originally developed for verbal communication, it has implications which can certainly be relevant to nonverbal communication as well.

The scholars behind the theory of relevance claim that code model does not explain the transmission of semantic representations and thoughts that are actually communicated. They believe that there is a need for an alternative model of communication, what they call inferential model. In an inferential process, there is a set of premises as input and a set of conclusions as output which follow logically from the premises. When engaging in inferential communication, a communicator intentionally modifies the environment of his audience by providing a stimulus that takes two forms: the informative intention that informs the audience of something, and the communicative intention that informs the audience of the communicator’s informative intention. On the other hand, the communicatee makes an inference using his background knowledge that he is sharing with the communicator, i.e. their knowledge of context in which the communication is taking place [111].

To characterize the shared knowledge involved in communication, the authors use the term cognitive environment, which refers to a set of facts that are manifested to an individual. Intuitively speaking, the total cognitive environment of an individual consists of all the facts that he is aware of as well as all the facts that he is capable of becoming aware of at that time and place [111].

Sperber and Wilson use the term “relevance” to connect context to communication. They argue that any assumption or phenomena (as part of cognitive environment) are relevant in communication if and only if it has some effect in that context. They add that the word “relevance” signifies that the contextual effect has to be large in the given context and at the same time requires small effort to be processed [111]. The amount of processing required to understand the context, however, is a subject of controversy.

In traffic interactions context can be quite broad involving various elements such as dynamic factors, e.g. speed of the cars, the distance of the pedestrians; social factors, e.g. demographics, social norms; and environment configuration, e.g. street structure, traffic signals. Traffic context and its impact on pedestrian and driver behavior will be discussed in more detail in Section 6.2.

5 Nonverbal Communication: How the Human Body Speaks to Us

Nonverbal communication cues such as focusing on gaze direction, pointing gestures and postural movements play an important role in establishing joint attention and interacting with others [102].

In general, nonverbal communication refers to communication styles that do not include the literal verbal content of communication [112], i.e. it is affected by means other than words [113]. Buck and Vanlear [114] argue that nonverbal communication comes in three types (see Figure 12):

  1. Spontaneous: This form is based on biologically shared signal system and nonvoluntary movements. Spontaneous communication may include facial expressions, micro gestural movements and postures.

  2. Symbolic communication: This type of communication is deliberate and has arbitrary relationship with its referent and knowledge of what should be shared by sender and receiver. For instance, symbolic communication may include system of sign language, body movements or facial expressions associated with language.

  3. Pseudo-spontaneous: This form involves the intentional and propositional manipulation by the sender of the expressions that are virtually identical to spontaneous displays from the point of view of the receiver. This may include acting or performing.

Figure 12: A simplified view of nonverbal communication forms. Source: [114]

In traffic context all three types of nonverbal communication are observable. It is intuitive to imagine the occurrence of the first two types of communication. For example, pedestrians may perform various spontaneous movements including yawning, scratching their head, stretching their muscles, etc. As for symbolic communication, humans use various forms of nonverbal signals to transmit their intentions such as waving hands, nodding, or any other form of bodily movements. Symbolic communication in traffic interactions will be discussed in more detail in Section 5.5.

Distinguishing between pseudo-spontaneous and spontaneous nonverbal communication in traffic scenes is not easy, even for humans. It requires the knowledge of the context in which the behavior is observed, and, to some extent, the personality of the person who is communicating the signal. Although rare, the occurrence of pseudo-spontaneous movements is still a possibility in traffic context. People, for example, may perform various bodily movements to distract the driver (or the autonomous vehicle) as a joke or a prank.

5.1 Studies of Nonverbal Communication: An Overview

The modern study of nonverbal communication is dated back to the late 19 century. Darwin, in his book Expression of the Emotions in Man and Animals [115], was the first to focus on the possible modifying effects of body and facial expressions in communication. Darwin argues that nonverbal expressions and bodily movements have specific evolutionary functions, for instance, wrinkling the nose reduces the inhalation of bad odor.

In more recent studies, behavioral ethologists point that in humans, throughout their evolutionary history, these nonverbal bodily movements had gained communicative values [116]. In fact, it is estimated that 55% of communication between humans is trough facial expressions [117]. According to Birdwhistell [118], humans are capable of making and recognizing about 250,000 different facial expressions.

Scientists in behavioral psychology measured the importance of bodily movements in various interaction scenarios. For example, Dimatteo et al. [119] show that the ability to understand and express emotions through nonverbal communication significantly improves the level of patient satisfaction in a physician visit.

Comprehension or expression of nonverbal communication is linked to various factors. For instance, in a work by Nowicki and Duke [120], it is shown that the accuracy of emotional comprehension increases with age and academic achievement. Gender also plays a role in nonverbal communication. In general, women are found to engage in eye contact more often than men [121] and are also better at sending and receiving nonverbal signals [112]. Another important factor is culture which determines how people engage in nonverbal communication. For example, in Western culture, eye contact is much less of a taboo compared to Middle Eastern culture [121].

5.2 Studying Nonverbal Communication

Measuring behavioral responses is usually administered by showing a sequence of images or videos containing human faces to subjects. Then the subjects are either asked about how comfortable they feel making eye contact with the human in the picture [121] or their emotions are directly observed [120]. In another method known as Profile of Nonverbal Sensitivity (PONS), in addition to assessment of emotions, participants are asked to express certain emotions. The expressions are then shown to independent observers who are asked to identify the emotions they represent, for example, whether they imply sadness, anger or happiness. The final score is the combination of both assessment and expression of emotions by the participants [119]. In some studies, fMRI is used to measure brain activities of participants during nonverbal communication [122].

In the context of autonomous driving, however, communication is mainly studied through naturalistic observations [103, 123]. The observation is sometimes combined with other methods to minimize subjectivity. For instance, pedestrians are instructed to perform a certain behavior, e.g. engage in eye contact, and then the behavior of the drivers (who are unaware of the scenario) are observed naturalistically [124]. The observees sometimes are interviewed to find out how they felt regarding the communication that took place between them and the other road users [125].

5.3 Eye Contact: Establishing Connection

Eye contact, perhaps, is the most important part and the foundation of communication and social interaction. In fact, scientists argue that eye contact creates a phenomenon in the observer called “eye contact effect” which modulates the concurrent and/or immediately following cognitive processing and/or behavioral response [122]. Putting it differently, direct eye contact increases physiological arousal in humans, triggering the sense of trying to understand the other party’s intention by asking questions such as “why are they looking at me?” [109].

Depending on the context, in the course of social interaction, eye contact may serve different functions, which according to Argyle and Dean [121] can be one of the following:

  1. Information-seeking: It is possible to obtain a great deal of feedback by careful inspection of other’s face, especially in the region of the eyes. Various mental states such as noticing one, desire, trust, caring, etc. can be interpreted from the eyes [109].

  2. Signaling that the channel is open: Through eye contact a person knows that the other is attending to him, therefore further interaction is possible.

  3. Concealment and exhibitionism: Some people like to be seen, and eye contact is an evidence of them being seen. In contrast some people don’t like to be seen, and eye contact is an evidence of they are being depersonalized.

  4. Establishment and recognition of social relationship: Eye contact may establish a social relationship. For example, if person A wants to dominate person B, he stares at B with the appropriate expression. Person B may accept person A’s dominance by a submissive expression or deny it by looking away.

  5. The affiliative conflict theory: People may engage in eye contact for both approaching or avoiding contact with others.

Since the communication between road users is a form of social interaction, eye contacts in traffic scenes might serve any of the functions mentioned above. However, in the context of traffic interaction, the first two functions are particularly important. In most cases, prior to crossing, pedestrians assess their surroundings to check the state of approaching vehicles, traffic signals or road conditions. Likewise, drivers continuously observe the road for any potential hazards. It is also common that pedestrians engage in eye contact with drivers to transmit, for example, their intention of crossing. The role of eye contact in pedestrian crossing will be elaborated in Section 5.5.

5.4 Understanding Motives Through Bodily Movements

Besides eye contact, humans often rely on other forms of bodily movements for further message passing. For instance, hand gestures are commonly used during both nonverbal and verbal communication. Although all hand gestures are hand movements, all hand movements are not necessarily hand gestures. This depends on the movement and how the movement is done. Krauss et al. [116] group hand gestures into three categories (see Figure 13):

  1. Adapters: Aka body-focused movements or self-manipulation. These are the types of gestures that do not convey any particular meaning and have pure manipulative purposes, e.g. scratching, rubbing or tapping.

  2. Symbolic gestures: Purposeful motions to transfer a conversational meaning. Such motions are often presented in the absence of speech. Symbolic gestures are highly influenced by cultural background.

  3. Conversational gestures: Are hand movements that often accompany speech.

Figure 13: The function of hand gesture depending on its lexical meaning.

In addition to hand gestures, body posture and positioning may also convey a great deal of information regarding one’s intention. Scheflen [126] lists three functionalities of postural configuration in different aspects of communication:

  1. Distinguishes the contribution of individual behavior in the group activity.

  2. Indicates how the contributions are related to one another.

  3. Defines steps and order in interaction.

5.5 Nonverbal Communication in Traffic Scenes

The role of nonverbal communication in resolving traffic ambiguities is emphasized by a number of scholars [127, 128, 103, 129]. In this context, any kind of signals between road users constitutes communication. In traffic scenes, communication is particularly precarious because first, there exists no official set of signals and most of them are ambiguous, and second, the type of communication may change depending on the atmosphere of the traffic situation, e.g. city or country [125].

The lack of communication or miscommunication can greatly contribute to traffic conflicts. It is shown that more than a quarter of traffic conflicts is due to the absence of effective communication between road users. In a recent study it was found that out of conflicts caused by miscommunication, 47% of the cases occurred with no communication, 11% was due to the lack of necessary communication and 42% happened during communication [125].

Traffic participants use different methods to communicate with each other. For example, pedestrians use eye contact (gazing/staring), a subtle movement in the direction of the road, handwave, smile or head wag. Drivers, on the other hand, flash lights, wave hands or make eye contact [103]. Some researchers also point out that the speed changes of the vehicle can be an indicator of the driver’s intention. For example, in a case study by Varhelyi [130] it is shown that drivers use high speed as a signal to communicate to pedestrians that they do not intend to yield.

Among different forms of nonverbal communication, eye contact is particularly important. Pedestrians often establish eye contact with drivers to make sure they are seen [131]. Drivers also often rely on eye contact and gazing at the face of other road users to assess their intentions [132]. In addition, a number of studies show that establishing eye contact between road users increases compliance with instructions and rules [133, 131]. For instance, drivers who make eye contact with pedestrians will more likely yield the right of way at crosswalks [124].

Figure 14: Examples of pedestrian hand gestures in a traffic scene without context. What are these pedestrians trying to say?333a) Yielding, b) asking for right of way, c) showing gratitude, and d) greeting a person on the other side of the street.


In order to understand the meaning of nonverbal signals, care should be taken while interpreting them. For example, handwave by a pedestrian may be a sign of request for the right of way or showing gratitude. An illustrative example can be seen in Figure 14.

6 Context and Understanding Pedestrian Crossing Behavior

It is important to note that context highly depends on the task, which means a particular element that has relevance in one task, may be irrelevant in the other. Hence, in this section we only focus on traffic context and present a review of some of the past studies in the field of traffic behavioral analysis with a particular focus on factors that influence pedestrian crossing behavior.

6.1 Studying Pedestrian Behavior

The methods of studying human behavior (in traffic scenes) have transformed throughout the history as new technological advancements have emerged. Traditionally, questionnaires in written forms [127, 134] or direct interviews [135] are widely used to collect information from traffic participants or authorities monitoring the traffic. These forms of studies, however, have been criticized for a number of reasons such as the bias people have in answering questions, the honesty of participants in responding or even how well the interviewees are able to recall a particular traffic situation.

Traffic reports have also been widely used in a number of studies. These reports are mainly generated by professionals such as police force after accidents [136]. The advantage of traffic reports is that they provide good details regarding the elements involved in a traffic accident, albeit not being able to substantiate the underlying reasons.

In addition, behavior can be analyzed via direct observation by someone who is either present in the vehicle [125] or stands outside [137] while recording the behavior of the road users. The drawback of this method is the strong observer bias, which can be caused by both the observer’s misperception of the traffic scene or his subjective judgments.

New technological developments in the design of sensors and cameras gave rise to different modalities of recording traffic events. Eye tracking devices are one of such systems that can be placed on the participants’ heads to record their eye movements (see Figure (a)a) during the course of driving [128]. Computer simulations [138] and video recordings [134] are also widely used to study the behavior of drivers in laboratory environments. These methods, however, have been criticized for not providing the real driving conditions, therefore the observed behaviors may not necessarily reflect the ones exhibited by road users in a real traffic scenario.

Figure 15: Examples of using technology for behavioral studies: a) eyetracker for driver behavior analysis [139], and b) the placement of a concealed camera for naturalistic traffic recoding.

Naturalistic studies, perhaps, are one of the most effective methods used in traffic behavior understanding. Although the first instances of such studies are dated back to almost half a century ago [140], they have gained tremendous popularity in the recent years. In this method of study, a camera (or a network of cameras) are placed in either the vehicles [141, 142] (see Figure (b)b) or outdoor on road sides [78, 143]. Since the objective is to record the natural behavior of the road users, the cameras are located in inconspicuous places not visible to the observees. In the context of recording driving habits, although the presence of the camera might be known to the driver, it does not alter the driver’s behavior in a long run. In fact, studies show that the presence of camera may only influence the first 10-15 minutes of the driving, hence the beginning of each recording is usually discarded at the time of analysis [125].

Despite being very effective, naturalistic studies have some disadvantages. For example, researcher bias might affect the analysis. Moreover, in some cases it is hard to recognize certain behaviors, e.g. whether a pedestrian notices the presence of the car or looks at the traffic signal in the scene. To remedy this issue, it is a common practice to use multiple observers to analyze the data and use an average score for the final analysis [140]. In some studies, a hybrid approach is employed by combining naturalistic recordings with on-site interviews [103]. Using this method, after recording a behavior, the researcher approaches the corresponding road user and asks whether, for example, they looked at the signal prior to crossing. Overall, the hybrid approach can help lowering the ambiguities observed in certain behaviors.

6.2 Pedestrian Behavior and Context

Figure 16: Factors involved in pedestrian decision-making process at the time of crossing. The size of each circle indicates how many times the factors are found relevant in the literature. The branches with solid lines indicate the sub-factors of each category and the dashed lines show the interconnection between different factors and and arrows show the direction of influence.

The factors that influence pedestrian behavior can be divided into two main groups, the ones that directly relate to the pedestrian (e.g. demographics) and the ones that are environmental (e.g. traffic conditions). A summary of these factors and how they relate to one another can be found in Figure 16.

6.2.1 Pedestrian Factors

Social Factors. Among the social factors, perhaps, group size is one of the most influential ones. Heimstra et al. [140] conducted a naturalistic study to examine the crossing behavior of children and found that they commonly (more than 80%) tend to cross as a group rather than individually. Group size also changes both the behavior of the drivers with respect to the pedestrians and the way the pedestrians act at crosswalks. For instance, it is shown that drivers more likely yield to a large group of pedestrians (3 or more) than individuals [127, 78].

Moreover, crossing as a group, pedestrians tend to be more careless at crosswalks and often accept shorter gaps between the vehicles to cross [143] or do not look for the upcoming traffic [103]. Group size also impacts the way pedestrians comply with the traffic laws, i.e. group size exerts some form of social control over individual pedestrians [144]. It is observed that individuals in a group are less likely to follow a person who is breaking the law, e.g. crossing on the red light [127].

In addition, group size influences pedestrian flow which determines how fast pedestrians would cross the street. Ishaque and Noland [145] indicate that if there is no interaction between the pedestrians, there is a linear relationship between pedestrian flow and pedestrian speed. This means, in general, pedestrians walk slower in denser groups.

Social norms or, as some experts call it “informal rules” [74], play a significant role in how traffic participants behave and how they predict each other’s intention [127]. The difference between social norms and legal norms (or formal rules) can be illustrated using the following example: formal rules define the speed limit of a street, however, if the majority of drivers exceed this limit, the social norm is then quite different [127].

The influence of social norms is so significant that merely relying on formal rules does not guarantee safe interaction between traffic participants. This fact is highlighted in a study by Johnston [146] in which he describes the case of a 34-year old married woman who was extremely cautious (and often hesitant) when facing yield and stop signs. In a period of four years, this driver was involved in 4 accidents, none of which she was legally at fault. In three out of four cases the driver was hit from behind, once by a police car. This example clearly depicts how disobeying social norms, even if it is legal, can interrupt the flow of traffic.

Social norms even influence the way people interpret the law. For example, the concept of “psychological right of way” or “natural right of way” has been widely studied [127]. This concept describes the situation in which drivers want to cross a non-signalized intersection. The law requires the drivers to yield to the traffic from the right. However, in practice drivers may do quite the opposite depending on the social status (or configuration) of the crossing street. It is found that factors such as street width, lighting conditions or the presence of shops may determine how the drivers would behave [147].

Imitation is another social factor that defines the way pedestrians (as well as drivers [148]) would behave. A study by Yagil [149] shows that the presence of a law-adhering (or law-violating) pedestrian increases the likelihood of other pedestrians to obey (or disobey) the law. This study shows that the impact of law violation is often more significant.

The probability of

imitation occurrence may vary depending on the social status of the person who is being imitated. In the study by Leftkowitz et al. [150] a confederate was assigned by the experimenter to cross or stand on the sidewalk. The authors observed that when the confederate was wearing a fancy outfit, there was a higher chance of other pedestrians imitate his actions (either breaking the law or complying).

Demographics. Arguably, gender is one of the most influential factors that define the way pedestrians behave [140, 127, 151, 152]. In a study of children behavior at crosswalks, Heimstra et al. [140] show that girls in general are more cautious than boys and look for traffic more when crossing. Similar pattern of behavior is also observed in adults [149] and in some sense it defines the way men and women obey the law. In general, men tend to break the law (e.g. red-light crossing) more frequently than women [145, 137].

Furthermore, gender differences affect the motives of pedestrians when complying with the law. Yagil [149] argues that crossing behavior in men is mainly predicted by normative motives (the sense of obligation to the law) whereas in women it is better predicted by instrumental motives (the perceived danger or risk). He adds that women are more influenced by social values, e.g. how other people think about them, while men are mainly concerned with physical conditions, e.g. the structure of the street.

Men and women also differ in the way they assess the environment before or during crossing. For instance, Tom and Granie [137] show that prior to and during a crossing event, men more frequently look at vehicles whereas women look at traffic lights and other pedestrians. In addition, compared with women, male pedestrians tend to cross with a higher speed [145].

Age impacts pedestrian behavior in obvious ways. Generally, elderly pedestrians are physically less capable compared to adults, as a result, they walk slower [145], have more variation in walking pattern (e.g. do not have steady velocity) [153] and are more cautious in terms of accepting gap in traffic to cross [78, 154, 155]. Furthermore, the elderly and children are found to have a lesser ability to assess the speed of vehicles, hence are more vulnerable [128]. At the same time, this group of pedestrians has a higher law compliance rate than adults [145].

State. The speed of pedestrians is thought to influence their visual perception of dynamic objects. Oudejans et al. [156] argue that while walking, pedestrians have better optical flow information and have a better sense of speed and distance estimation. As a result, walking pedestrians are less conservative to cross compared to when they are standing.

(a) Zebra crossing
(b) Crossing with pedestrian refuge island
(c) Pelican crossing
Figure 17: Different types of crosswalks. Source: [157]

Pedestrian speed may vary depending on the situation. For instance, pedestrians tend to walk faster during crossing compared to when walk on side walk [158]. Crossing speed also varies in different types of intersections. Crompton [159] reports pedestrian mean speed at different crosswalks as follows: 1.49 m/s at zebra crossings, 1.71 m/s as crossing with pedestrian refuge island and 1.74 m/s at pelican crossings (see Figure 17).

The effect of attention on traffic safety has been extensively studied in the context of driving [160, 161, 162, 163]. Inattention of drivers is believed to be one of the leading causes of traffic accidents (up to 22% of the cases) [164]. In the context of pedestrian crossing, it is shown that inattention significantly increases the chance that the pedestrian is hit by a car [165, 166].

Hymann et al. [167] investigate the effect of attention on pedestrian walking trajectory. They show that the pedestrians who are distracted by the use of electronics, such as mobile phones, are 75% more likely to display inattentional blindness (not noticing the elements in the scene). The authors also point out that while using electronic devices pedestrians often change their walking direction and, on average, tend to walk slower than undistracted pedestrians.

Trajectory or pedestrians walking direction is another factor that plays a role in the way pedestrians make crossing decision. Schmidt and Farber [168] argue that when pedestrians are walking in the same direction as the vehicles, they tend to make riskier decisions regarding whether to cross. According to the authors, walking direction can alter the ability of pedestrians to estimate speed. In fact, pedestrians have a more accurate speed estimation when the approaching cars are coming from the opposite direction.

Characteristics. Without doubt one of the most influential factors altering pedestrian behavior is culture. It defines the way people think and behave, and forms a common set of social norms they obey [169]. Variations in traffic culture not only exist between different countries, but also within the same country, e.g. between town and countryside or even between different cities [170].

Attempts have been made to highlight the link between culture and the types of behavior that road users exhibit. Lindgren et al. [169] compare the behaviors of Swedish and Chinese drivers and show that they assign different levels of importance to various traffic problems such as speeding or jaywalking. Schmidt and Farber [168] point out the differences in gap acceptance of Indians (2-8s) versus Germans (2-7s). Clay [128] indicates the way people from different culture perceive and analyze a situation. She notes that when judging people during interaction, Americans focus more on pedestrian characteristics whereas Indians rely on contextual factors.

Some researchers go beyond culture and study the effect of faith or religious beliefs on pedestrian behavior. Rosenbloom et al. [171] gather that ultra-orthodox pedestrians in an ultra-orthodox setting are three times more likely to violate traffic laws than secular pedestrians.

Generally speaking, pedestrian level of law compliance defines how likely they would break the law (e.g. crossing at red light). In addition to demographics, law compliance can be influenced by physical factors which will be discussed later in this report.

Abilities. Pedestrians’ abilities, namely speed estimation and distance estimation, can influence the way they perceive the environment and consequently the way they react to it. In general, pedestrians are better at judging vehicle distance than vehicle speed [172]. They can correctly estimate vehicle speed when the vehicle is moving below the speed of 45 km/h, whereas vehicle distance can be correctly estimated when the vehicle is moving up to a speed of 65 km/h.

6.2.2 Environmental Factors

Physical context. The presence of street delineations, including traffic signals or zebra crossings, has a major effect on the way traffic participants behave [127] or on their degree of law compliance. Some scholars distinguish between the way traffic signals and zebra crossings influence yielding behavior. For example, traffic signals (e.g. traffic light) prohibit vehicles to go further and force them to yield to crossing pedestrians. At non-signalized zebra crossings, however, drivers usually yield if there is a pedestrian present at the curb who either clearly communicates their intention of crossing (often by eye contact) or starts crossing (by stepping on the road) [103].

Signals also alter pedestrians level of cautiousness. In a study by Tom and Granie [137], it is shown that pedestrians look at vehicles 69.5% of the time at signalized and 86% at unsignalized intersections. In addition, the authors point out that pedestrians’ trajectory differs at unsignalized crossing. They tend to cross diagonally when no signal is present. Tian et al. [158] also adds that when vehicles have the right of way, pedestrians tend to cross faster.

Road geometry (e.g. presence of pedestrian refuge in the middle of the road) and street width impact the level of crossing risk (or affordance), and as a result, pedestrian behavior [156]. In particular, these elements alter pedestrian gap acceptance for crossing. The narrower the street is, the smaller gap is required to cross [168].

Weather or lighting conditions affect pedestrian behavior in many ways [173]. For instance, in bad weather conditions (e.g. rainy weather) pedestrians’ speed estimation is poor, and they tend to be more conservative while crossing [172]. Moreover, lower illumination level (e.g. nighttime) reduces pedestrians’ major visual functions (e.g. resolution acuity, contrast sensitivity and depth perception), thus they tend to make riskier decisions. Another direct effect of weather would be on road conditions, such as slippery roads due to rain, that can impact movements of both drivers and pedestrians [174].

Dynamic factors. One of the key dynamic factors is gap acceptance or how much, generally in terms of time, gap in traffic pedestrians consider safe to cross. Gap acceptance depends on two dynamic factors, vehicle speed and vehicle distance from the pedestrian. The combination of these two factors defines Time To Collision (or Contact) (TTC), or how far the approaching vehicle is from the point of impact [175]. The average pedestrian gap acceptance is between 3-7s, i.e. usually pedestrians do not cross when TTC is below 3s and very likely cross when it is higher than 7s [168]. As mentioned earlier, gap acceptance may highly vary depending on social factors (e.g. demographics [143], group size [127], culture [168]), level of law compliance [145], and the street width.

The effects of vehicle speed and vehicle distance are also studied in isolation. In general, it is shown that increase in vehicle speed deteriorates pedestrians’ ability to estimate speed [128] and distance [172]. In addition, Schmidth and Farber [168] show that pedestrians tend to rely more on distance when crossing, i.e. within the same TTC, they tend to cross more often when the speed is higher.

Some scholars look at the relationship between pedestrian waiting time prior to crossing and gap acceptance. Sun et al. [78] argue that the longer pedestrians wait, the more frustrated they become, and as a result, their gap acceptance lowers. The impact of waiting time on crossing behavior, however, is controversial. Wang et al. [143] dispute the role of waiting time and mention that in isolation, waiting time does not explain the changes in gap acceptance. They add that to be considered effective, it should be studied in conjunction with other factors such as personal characteristics.

Although traffic flow is a byproduct of vehicle speed and distance, on its own it can be a predictor of pedestrian crossing behavior [168]. By seeing the overall pattern of vehicles movements, pedestrians might form an expectation about what other vehicles approaching the crosswalk might do next.

Traffic characteristics. Traffic volume or density is shown to affect pedestrian [148] and driver behavior [168] significantly. Essentially, the higher the density of traffic, the lower the chance of the pedestrian to cross [145]. This is particularly true when it comes to law compliance, i.e. pedestrians less likely cross against the signal (e.g. red light) if the traffic volume is high. The effect of traffic volume, however, is stronger on male pedestrians than women [149].

The effects of vehicle characteristics such as vehicle size and vehicle color on pedestrian behavior have also been investigated. Although vehicle color is shown not to have a significant effect, vehicle size can influence crossing behavior in two ways. First, pedestrians tend to be more cautious when facing a larger vehicle [151]. Second, the size of the vehicle impacts pedestrian speed and distance estimation abilities. In an experiment involving 48 men and women, Caird and Hancock [176] reveal that as the size of the vehicle increases, there is a higher chance that people will underestimate its arrival time.

7 Reasoning and Pedestrian Crossing

To better understand the pedestrian behavior, it is important to know the underlying reasoning and decision-making processes during interactions. In particular, we need to identify how pedestrians process sensory input, reason about their surroundings and infer what to do next. In the following subsections we start by reviewing the classical views of decision-making and reasoning and talk about various types of reasoning involved in traffic interaction.

7.1 The Economic Theory of Decision-Making

The early works in the domain of logic and reasoning define the problem of decision-making in terms of selecting an action based on its utility value, i.e. what positive or negative returns are obtained from performing the action. In the context of economic reasoning, the utility of actions is calculated based on the monetary cost they incur versus the amount of return they promise [177].

Some scientists consider decision-making process and reasoning to inherently be alike because they both depend on the construction of mental models, and so they should both give rise to similar phenomena [178]. Conversely, a number of scholars believe the similarity is only metaphorical as there are different rules to establish the validity of premises [179].

For the sake of this paper, we do not intend to go any further into differences between decision-making and reasoning. Rather, given that reasoning is thought to be a fundamental component of intelligence [180], we focus the rest of our discussion only on reasoning and try to identify its variations.

7.2 Classical Views of Reasoning

There are numerous attempts in the literature to identify the types of reasoning, especially from human cognitive perspective [181, 182, 180], and the way they are programmed in our subconscious (e.g. rule-based, model-based or probabilistic)[178, 183, 184].

In the early works of psychology, there are two dominant types of reasoning identified: Deductive reasoning in which there is a deductively certain solution and inductive reasoning where no logically valid solution exists but rather there is an inductively probable one [181]. In his definition of reasoning, Sternberg [181] divides deduction and induction into subcategories including syllogistic and transitive inference (in deduction), and causal inference and either of analogical, classificational, serial, topological or metaphorical (in induction) (see Figure 18).

Figure 18: The classical view of human reasoning according to Sternberg [181].

Syllogism, in particular, is well studied in the literature and refers to a form of deduction that involves two premises and a conclusion [183]. The components can be either categorical (exactly three categorical propositions) or conditional (if … then) [181]. Transitive inference is also a form of linear syllogism that contains two premises and a question. Each premise describes a relationship between two items out of which one is overlapping between two premises.

Causal inference, as the name implies, presents a series of causations based on which a conclusion is made about what causes what. As for the other components of inductive reasoning, they depend on the way the reasoning task is presented. For instance, consider the task of selecting analogically which option (a) Louis XIV, (b) Robespierre completes the sequence, Truman : Eisenhower :: Louis XIII : ?. Performing the same task in the classification format would look like this: Out of groups (a) and (b) in (a) Eisenhower, Truman (b) Louis XIII, Louis XIV, which group Robespierre belongs to? For more examples of different types of reasoning please refer to [181].

In other categorizations of reasoning, in addition to deduction and induction, researchers consider a third group, abductive reasoning. Generally speaking, abduction refers to a process invoked to explain a puzzling observation. For instance, when a doctor observes a symptom in a patient, he hypothesizes about its possible causes based on his knowledge of the causal relations between diseases and symptoms [185]. According to Fischer [186], there are three stages in abduction: 1) a phenomenon to be explained or understood is presented, 2) an available or newly constructed hypothesis is introduced, 3) by means of which the case is abducted.

Whether abduction should be considered as a separate class of reasoning or a variation of induction, or induction be a subsidiary of abduction, is a subject of controversy among scientists [185]. Quoting from Peirce [185], who distinguishes between three different types of reasoning: “Deduction proves that something must be; induction shows that something actually is operative; abduction merely suggests that something may be”. More precisely, if one wants to separate abduction from induction the following characteristics should be considered: abduction reasons from a single observation whereas induction from enumerative samples to general statements. Consequently, induction explains a set of observations while abduction explains only a single one. In addition, induction predicts further observations but abduction does not directly account for later observations. Finally, induction needs no background theory necessarily, however, abduction heavily relies on a background theory to construct and test its abductive explanations.

In the psychology literature, some researchers, in addition to deduction and induction, identify the third group of reasoning as quantitative reasoning [180]. This type of reasoning involves the identification of quantitative relationships between phenomena that can be described mathematically.

Reasoning abilities, in particular, deduction and induction, are widely used in various daily tasks such as crossing. For example, deductive abilities help to make sense of various pictorial or verbal premises and may help to solve various cognitive tasks such as general inferences or syllogisms that help one to reason about selecting a proper route (this task is commonly referred to as ship destination in the literature). Inductive reasoning may also be used in judging people or situations based on the past experiences, e.g. understanding social norms, nonverbal cues, etc.

7.3 Variations in Reasoning

Listing all types of reasoning (subcategories of main reasoning groups) is a daunting task because the terminology, and reasoning tasks and definitions may vary significantly in different communities (e.g. AI and psychology). Some sources attempt to enumerate types of reasoning by distinguishing between the tasks they are applied to. For instance, in [187] over 20 categories of reasoning are enumerated such as backward reasoning (start from what we want and work back), butterfly logic (one induces a way of thinking in others), exemplar reasoning (using examples), criteria reasoning (comparing to established criteria), and more.

In this subsection, however, we list and discuss the types of reasoning based on two criteria: there exist a significant support for that type of reasoning and relevance to the task of traffic interaction.

Probabilistic reasoning

Probabilistic reasoning aka approximate reasoning is a branch of reasoning that challenges the traditional logic reasoning [184, 183, 188]. It refers to a process of finding an approximate solution to the relational assignment of equations. What distinguishes probabilistic reasoning from the traditional reasoning is its fuzziness and non-uniqueness of consequents. A common example to illustrate probabilistic reasoning is as follows: most men are vain; Socrates is a man; therefore, it is very likely that Socrates is vain [188].

To explain the advantage of probabilistic reasoning over traditional logic reasoning, probabilistic logicians argue that people are irrational beings and systematically make large errors (as evident in various logical tasks such as Wason selection task [189]). However, at the same time, they seem very successful in everyday rationality in guiding thoughts and actions [184].

In addition, probabilistic reasoning argues against both mental-logic and mental-model approaches444Mental-logic argues that deduction depends on an unconscious system of formal rules of inference akin to those in proof-theoretic logic, i.e. we use a series of rules to deduce a conclusion.
Mental-model postulates that people understand discourse and construct mental models of the possibilities consistent with the discourse. Then a conclusion is reached by selecting a possibility over the others.
by stating that working memory is limited in terms of storing a length of a formal derivation or the number of mental models. In this sense, errors in so-called logical tasks can be explained by the fact that people use probabilistic reasoning in everyday tasks and generalize them to apply to logic.

In the traffic context, the decision process of road users is highly probabilistic, for instance, they reason in approximate terms to decide whether to cross (e.g. by estimating the behavior of drivers) or which route to take.

Spatial reasoning

Spatial reasoning refers to the ability to plan a route, to localize entities and visualize objects from descriptions of their arrangement. An example task can be as follows: B is on the right of A, C is on the left of B, then A is on the right of C [190]. Applications that benefit from spatial reasoning include (but not limited to) various navigational tasks, path finding, geographical localization, unknown object registration, etc.[191].

It is intuitive to link spatial reasoning to various traffic tasks. We often use such reasoning abilities to plan our trip, to choose the direction of crossing or to estimate risk by measuring our position with respect to the other road users.

Temporal reasoning

As the name implies, temporal reasoning deals with time and change in state of premises. Formally speaking, temporal reasoning consists of formalizing the notion of time and providing means to represent and reason about the temporal aspect of knowledge [192]. Temporal reasoning accounts for change and action [193], i.e. it allows us to describe change and the characteristics of its occurrence (e.g. during interaction with others). Temporal reasoning typically involves three elements, explanation, to produce a description of the world at some past time, prediction, to determine the state of the world at a given future time and learning about physics, to be able to describe the world at different times and to produce a set of rules governing change which account for the regularities in the world [192].

Temporal reasoning can be used in various tasks such as medical diagnosis, planning, industrial processes, etc. [192]. In the context of traffic interaction, as discussed earlier, people deal with the prediction of each others’ intentions, which, in addition to static context (e.g. physical characteristics), highly depends on the past observed actions of others and changes in their behavior (e.g. nonverbal communication). Therefore, a form of temporal reasoning is necessary to make a decision in such situations.

Qualitative (physical) reasoning

There is a vast amount of literature focusing on qualitative (or as some call it physical) reasoning [194, 195, 196, 197, 191]. The main objective of qualitative reasoning is to make sense of the physical world [194], i.e. reason about various types of physical phenomena such as support or collision phenomena [195].

In physical reasoning, factors such as space, time or spacetime are divided into physically significant intervals (or histories or regions) bounded by significant events (or boundaries) [197]. All values and relations between numerical quantities and functions are discretized and represented symbolically based on their relevance to the behavior that is being modeled [196]. Although such qualitative discretization may result in the loss of precision, it significantly simplifies the reasoning process and allows deduction when precise information is not available [191].

Qualitative reasoning can be applied to a wide range of applications such as Geographical Information Systems (GIS), robotic navigation, high-level vision, the semantics of spatial prepositions in natural languages, commonsense reasoning about physical situations, and specifying visual language syntax and semantics [196]. It is also easy to see the role of qualitative reasoning role in traffic scenarios. For example, when observing a vehicle, the exact distance or speed of the vehicle is not known to the pedestrian. Rather, the pedestrian reasons about the state of the vehicle using terms such as the vehicle is very near, near, far or very far, or in the case of speed, the vehicle is moving very slow, slow, fast or very fast.

Deep neural reasoning

Recent developments in machine-learning techniques gave rise to deep neural reasoning

[198, 199, 200, 201]

. In this method, the reasoning task is performed via a neural network which is trained using many solved examples


. As for the working memory, which is a necessity in extensive reasoning tasks such as language comprehension, question answering or activity understanding, deep neural reasoning models either implicitly keep a representation of the previous events, e.g. long short term memory (LSTM) networks

[202], or have a dedicated memory that interacts with the neural net (i.e. network only acts as a “controller”) [203, 201]. These reasoning techniques show promising performance in tasks such as question answering, finding the shortest path between two points or moving blocks puzzle [201].

Social reasoning

In psychology literature, the term social reasoning [204, 205] is widely used to refer to the abilities that involve reasoning about others. More specifically, social reasoning refers to a mechanism that uses information about others in order to infer some conclusions. It is thought to be essential in any intelligent agent in order to react properly for various interactive situations [204].

To highlight the importance of social reasoning (also known as social intelligence or social cognition) in the human interactions, we will refer to Humphrey [110] who believes that social intelligence is the foundation of holding society together. In Humphrey’s opinion, the essential part of social intelligence is abstract reasoning, which never was needed before in performing other intelligent tasks. He adds that in order to interact social primates should be calculating beings, which means they must be able to calculate the consequences of their own behavior, the likely behavior of others, and the balance of advantage and loss.

An important implication of Humphrey’s discourse is the necessity of prediction and forward planning in social interaction (which also was discussed earlier in Section 3). Humphrey refers to social interaction as “social chess” and argues that like chess, in addition to the cognitive skills, which are required merely to perceive the current state of play, the interacting parties must be capable of a special sort of forward planning.

Sensory reasoning

Reasoning may also be involved in the way we use or control our sensory input. In humans, two types of visual reasoning are commonly used to assess surroundings [180]. These are dynamic spatial reasoning which helps to infer where a moving object is going and when it will arrive at its predicted destination, and ecological spatial reasoning, which refers to an individual’s ability to orient himself in the real world. These reasoning abilities can help a pedestrian to estimate the movement of the traffic and calculate a safe gap for crossing.

8 Attention and its Role in Human Visual System

The prerequisite to understanding pedestrian behavior is to identify relevant contextual elements and visually attend to them. If we consider visual perception as the process of constructing an internal representation of the local environment [206], visual attention can be seen as perceptual operations responsible for selecting important or task-relevant objects for further detailed visual processing such as identification [207]. Generally speaking, perception helps analyzing lower level physical properties of visual stimuli (e.g. color, size, motion), while attention facilitates such processing by directing perception towards a particular object [206]. Attention is thought to be the foundation of visual perception without which we simply do not have the capacity to make sense of the environment.

Whether visual attention is stimulus-based (activated by distinctive characteristics of objects) or task-oriented has been a subject of controversy for many years. Despite such disagreement on the source of attention, the majority of recent studies strongly support the dominant role of the task in triggering attention [208]. In fact, experimental evidence shows that the nature of the task, i.e. the objective of the observer, influences the way attention is taking place. This may include the priming of visual receptors at early stages of perception to the properties of the object or higher level processing of the features and eye movements to fixate on where the object may be located [206].

8.1 Attention and Driving

In behavioral psychology, a large body of research has been dedicated to studying the role of attention in driving safety. Some of these studies were introduced earlier, for example, studies on how frequently pedestrians look prior to crossing or how they engage in eye contact [209]. In addition, a number of scholars have studied the attention patterns of drivers (e.g. eye movements) with the purpose of designing assistive systems for vehicles.

On the importance of visual perception in driving, studies show that over 90% of the information during driving is received through the visual system [79]. In this context, a vital component of visual perception is attention, lack of which accounts for 78% of all crashes and 65% of all near-crashes. Of interest to autonomous driving systems, non-specific eye glances to random objects in the scene caused 7% of all crashes and 13% of all near-crashes [210].

Moreover, some findings suggest that the way drivers allocate their attention greatly contributes to the safety of driving [160, 162]. For instance, in [160] the authors measured and compared the driving habits of novice and experts drivers and the way they pay attention to the environment by recording their eye movements while driving. They revealed that novice drivers frequently had more fixation transitions from objects in the environment to the road ahead. Expert drivers, on the other hand, performed fewer fixations suggesting that they typically relied on their peripheral vision to monitor the road. Expert drivers had also a better road monitoring strategy by performing horizontal scanning on mid range road (1 to 2 seconds away from the vehicle) whereas novice drivers were mainly focusing on the road far ahead. In addition, expert drivers are thought to be more aware of their surroundings by keeping track of various objects around the vehicle.

More recent works focus on the effects of technology on driver’s attention. Llaneras et al.[211] show the effect of Limited Ability Autonomous Driving Systems (LAADS) such as adaptive cruise control and lane keeping assist on drivers’ attention. They found that when the drivers had the opportunity to relinquish vehicle speed maintenance and lane control functions to a reliable autonomous system, they were frequently looking away from the forward roadway (about 33% of the time) while driving. Takeda et al.[212]

also evaluate driver attention in level 3 self-driving vehicles compared to when they are in control of driving tasks. They show that while not driving, drivers tend to have longer blinks and fewer saccadic eye movements to monitor the environment. The authors add that such changes in behavioral attention can potentially be hazardous as level 3 autonomous vehicles may require the intervention of the driver at certain critical moments.

9 Joint Attention in Practical Systems

The majority of practical joint attention systems are done in social and assistive robotics, in particular, for tasks involving close encounters with humans. Although most of these works do not have a direct application to traffic interaction, some of their subcomponents and ideas can potentially be used in autonomous driving. Hence, in the following subsections we briefly introduce some of the main practical systems that take advantage of joint attention mechanisms.

9.1 Mindreading through Joint Attention

The joint attention model introduced by Baron-Cohen

[109] is the backbone of many practical systems capable of establishing joint attention such as Cog and Kismet [213, 214].

Baron-Cohen argues that a shared attention model, or as he terms it mindreading, requires at least four modules, which can be organized into three tiers depending on the complexity of the task they are handling. The overall structure of his system is illustrated in Figure 19 and is briefly described below:

Figure 19: Baron-Cohen’s joint attention (mindreading) system. The term M-Representations refers to propositional attitudes expressed as e.g. believes, thinks. Source: [109]

The Intentional Detector (ID). According to Baron-Cohen, ID is “a perceptual device that interprets motion stimuli in terms of the primitive volitional mental states of goal and desire”. This module is activated whenever any perceptual input (vision, touch, or audition) identifies something as an agent. This may include any agent-like entity (defined as anything with self-propelled motion) such as a person, a butterfly or even a billiard ball, which is initially considered as a query agent with goals and desires. It should be noted that after discovering that an object is not an agent (i.e. its motion is not self-caused), the initial reading is revised.

The Eye-Direction Detector (EDD). Baron-Cohen lists three functionalities for EDD including detecting the presence of eyes or eye-like stimuli, computing whether eyes are directed towards it or towards something else, and inferring that if another organism’s eyes are directed at something then that organism sees that thing. The main difference of EDD and ID is that ID interprets stimuli in terms of the volitional mental states of desire and goal, whereas EDD does so in terms of what an agent sees. Therefore, EDD is a specialized part of the human visual system.

The Shared-Attention Mechanism (SAM). This is the module responsible for building triadic representation (between the agent, self and an object). This representation can be generated by using EDD, through which the agent can follow the gaze direction of the agent and identify the object of interest. However, SAM is limited in terms of its power to infer more complex relationship when the object is not within reach of both the agent and self, for instance, when communicating with someone who is not present in the scene.

The Theory-of-Mind Mechanism (ToMM). This system is needed for inferring the full range of mental states from behavior, thus employing a “theory of mind”. ToMM is much more than simply confirming the intention of others or observing their gaze changes. It deals with two major tasks: representing epistemic mental states (e.g. pretending, thinking and deceiving) and combining them with other mental states (the volitional and the perceptual) to help to understand how mental states and actions are related.

9.2 Practical Systems

The majority of the socially interactive systems that employ a form of joint attention use this mechanism for the purpose of learning, such as gaze control, similar to infants. In addition, most of these applications are designed for close interactions which usually involve a physical object. This means these models have little (if any) implication for joint attention and intention estimation in the context of traffic interaction. As a result, we only briefly discuss some of the most known systems developed during the past couple of decades. A more comprehensive list of socially interactive systems that employ the joint attention mechanism can be found in [215, 216].

(a) Cog
(b) Kismet
Figure 20: Examples of robotic platforms with joint attention capability for learning. Source: a) [217] and b) [218].
9.2.1 Socially Interactive Systems

Cog (Figure (a)a) [213, 217, 214]

is one of the early social robots that has the mechanism of joint attention. The robot has 21 degrees of freedom (DoF) and a variety of sensory modalities such as vision and audition. Its physical structure consists of a movable torso, arms, neck, and eyes which give it a human-like motion. The robot is designed to perform gaze monitoring and to engage in shared attention as well as to perform certain actions (such as pointing) to request shared attention to an object.

A descendant of Cog is Kismet (Figure (b)b) [217, 219, 220], a robot head that uses a similar active vision approach as Cog for eye movements and engages in various social activities. Kismet has 3 DoF in eyes, one in its neck and 11 DoF in its face including eyebrows (each with two DoF to lift and arch), ears (each with two DoF to lift and rotate), eyelids (each with one DoF to open or close), and a mouth (with one degree of freedom to open or close). Such flexibility allows Kismet to communicate, via facial expressions, a range of emotions such as anger, fatigue, fear, disgust, excitement, happiness, interest, sadness, and surprise in response to perceptual stimuli.

A particular feature of Kismet that enables the detection of joint attention is its two-part perceptual system containing an attentional system and gaze detection model. The attentional system is designed to detect salient objects (objects of interest in joint attention). The model contains a bottom-up component that finds interesting regions using face and motion detection algorithms as well as low-level color information (using opponent axis space to find saturated main colors). There is also a top-down component in the attentional system controlled by the emotions of the robot. For instance, if the robot is bored the focus of attention shifts more towards the face of the instructor than the object [219].

The gaze finding algorithm in Kismet has a hierarchical structure. First, through detection of motion, regions that may contain the instructor’s face are selected, then a face detection algorithm is run to locate the face of the instructor. Once identified, the robot fixates on the face and a second camera with zoom capability captures a close-up image of the face. The close range image allows the robot to find the instructor’s eyes and as a result determine her gaze direction


(a) Robovie
(b) Geminoid
(c) Leonardo “Leo”
(d) PeopleBot
Figure 21: Examples of robotic platforms with joint attention capability used in social interactions. Source: a) [221] and b) [222], c) [223], and d) [224].

Joint attention mechanisms are typically used in social robots as a means of enhancing the quality of human interaction with the robot. Robovie (Figure (a)a) [222] is one of the early robots designed for this purpose. This humanoid robot is 120 cm tall, has two arms (8 DoF), a head (3 DoF), two eyes (4 DoF) and a mobile platform. The robot’s eyes have a pan-tilt capability to simulate human eye movements. In a number of experiments conducted by the creators of Robovie, it was found that human-like behaviors, such as gaze sharing, engage humans more in interaction.

The role of shared attention and gaze control has also been investigated in more specific types of interactions. Mutlu et al. [225] show how gaze sharing can improve the ability of a human to play a guessing game with a robotic counterpart. The authors used two robotic platforms Robovie R-2 and Geminoid (Figure (b)b), a human look-alike humanoid robot. The game rules are as follows: there is a table with a number of objects placed on it. The goal is for the human player to guess which object is selected by the robot by asking questions about the nature of the object. The authors observed that if the robot used very subtle eye movements towards the region where the object was placed, it would significantly improve the human player ability to correctly guess the object. Interestingly, the human participants had a better response to the Geminoid robot owing to its human-like facial features.

Zheng et al. [226] examined how gaze control can contribute to a handover activity. In this study, a PR2 robot was instructed to give a bottle of water to the human participants in three different ways: 1) while keeping its gaze constantly downward, 2) while keeping its gaze on the bottle, and 3) while maintaining eye-contact with the human participants. The authors found that in the second scenario the human participants reached for the bottle significantly earlier than in the first scenario, and in the third scenario even earlier than in the second.

Moreover, joint attention is shown to improve the efficiency of verbal communication between humans and robots. Breazeal et al. [221] examine such influence using Leo (Figure (c)c), a 65 DoF social robot. Through a number of interactive experiments between the robot and human participants, they show that joint attention cues such as hand gesture and eye contact help people to better understand the mental process of the robot. In this manner, the authors realized that the most effective technique is when the robot exhibits both implicit communication cues and explicit social cues during the interaction. For instance, eye contact shows that the robot is waiting for the user to say a word, and shrug indicates that it did not understand an utterance of the user.

Imai et al. [227] investigate the role of joint attention in a conversational context where the robot attempts to attract the attention of a human to an object. In their experiments, the robot approached the human participants and asked them to look at the poster on the wall. They noticed that when the robot moved its gaze towards the wall, the participants were more likely to follow the robot’s instructions to look at the poster. A similar scenario was examined by Staudte et al. [224, 228] in which the authors instructed the robot called PeopleBot (Figure (d)d) to name the objects placed in front of it. In the first scenario, the robot was programmed to change its gaze towards the named objects, and in the second one it did not do so. A video of the robot’s performance was recorded and displayed to a number of human participants. The authors observed the eye movements of the participants and found that the gaze changes of the robot restrict the spatial domain of the potential referents allowing the participants to disambiguate spoken references and identify the intended object faster. Similarly, Mutlu et al. [229] show that in joint action, shared attention mechanisms not only help the agents to make sense of each other’s behavior, but also help them to repair conversational breakdowns as a result of failure to exchange information.

Human-like gaze following and shared attention capabilities in a robot can also make interaction more interesting and engaging [230]. Shiomi and his colleagues [231] tested this hypothesis over a course of two-month trial at a science exhibition. They used various humanoid robotic platforms such as Robovie, Robovie-m (22 DoF robot with legged locomotion and bowing ability), and Robovie-ms (a stationary humanoid robot). The exhibition visitors were provided with RFID tags allowing the robots to become aware of their presence and also to retain some personal information about them to use throughout interactions. During their observations, the authors noticed that joint attention capabilities gave some sort of personality to the robots. A human-like personality makes the robots more approachable by humans and more interesting in terms of interacting with them and listening to their instructions. In other sets of experiments where the joint attention mechanisms were not present, the authors witnessed that people tended to get distracted easily by the appearance of the robots and treated them just as interesting objects, not social beings.

In addition, there are other social interactive applications that benefit from joint attention mechanism. Miyauchi et al. [232] exploit eye contact cues as a means of making sense of the user’s hand gestures. Intuitively speaking, the robot can distinguish between an irrelevant and an instructive hand gesture by detecting whether the user is gazing at the robot. Kim et al. [233]

use joint attention mechanism in conjunction with an object detector and an action classifier to recognize the intention of the user. Joint attention, which is realized by estimating the skeleton pose of the user, helps to identify the object of interest and then associate it with the human user’s activity. Monajjemi

et al. [234] employ a shared attention mechanism to interface with an unmanned aerial vehicle (UAV) to attract the robot to the location of the human that needs assistance (e.g. in a search and rescue context). In this work, shared attention is realized in the form of fast hand-wave movements, which attracts the attention of the UAV from distance.

9.2.2 Learning

In recent years, sophisticated algorithms have been developed in which gaze direction guides the identification of objects of interest [235, 236, 237, 238, 239]. For instance, in the context of robot learning, Yucel et al. [237] use a joint attention algorithm that works as follows: the instructor’s face is first detected using Haar-like features. Next, the 3D pose of the face is estimated using optical flow information based on which a series of hypotheses are generated to estimate the gaze direction of the instructor. The final gaze is estimated using a tolerable error threshold which forms a cone-shaped region in front of the instructor covering a portion of the scene. In parallel with this process, a saliency map of the environment is generated using low-level features such as color and edge information to identify objects in the scene. Any of the salient objects that fall within the gaze region of the instructor can be selected as the object of interest (the object that instructor is looking at). The authors tested the proposed algorithm on a Nao humanoid robot placed on a moving platform.

The use of learning algorithms such as neural nets has also been investigated [240, 241, 242, 243]. In one such work [243], the authors implicitly teach joint attention behavior to the robot. This means, given an image of the scene, the robot has to learn the motor movements that connect the gaze of the instructor to the object of interest. To do so, the robot produces a saliency map of the scene (using low-level features such as color, edges, and motion) to identify the objects. It then performs a face detection to identify the instructor’s face. The detection results are then fed into a 3-layer neural net to learn the correct motor motions to follow the gaze of the instructor to the object of interest.

Besides head and eye movements, some scientists developed algorithms capable of identifying the object of interest using auditory input and pointing gestures [244, 245]. For instance, Haasch et al. [244] use skin color to identify the instructor’s hand, and then, by estimating its trajectory of movement, determine what the instructor is pointing at. This helps to narrow down the attention to a small region of the scene. In addition, using speech input, the instructor specifies the characteristics of the object of interest. Using this information the system then filters out unwanted regions. For example, within the scene, if the object is blue, all regions that are not blue are filtered out. The filtered image in conjunction with the direction of pointing gesture helps the system to figure out what the intended object is.

Moreover, some scientists experimented with other means of communication and establishing joint attention [246, 247]. For instance, in [247] the authors show that how a mobile robot that is wirelessly communicating with another robot can learn a proto-language by simply following and imitating the instructing robot’s movements. In addition, in the same work, a doll-like robot is taught to imitate the hand movements of the instructor using a series of connected infrared emission sensors placed on the instructor’s arm and glasses.

9.2.3 Rehabilitation
(a) Infanoid
(b) Keepon
Figure 22: Examples of robotic platforms with joint attention capability used for rehabilitation purposes. Source: a) [248, 249], b) [250]

Rehabilitation robots are specifically designed for children with social disorders such as autism. Autism is a complex behavioral and cognitive disorder that causes learning problems as well as communication and interaction deficiencies [251]. Through interaction with autistic children, these robotic platforms can help to improve their communicative and social abilities.

Infanoid (Figure (a)a) [252, 253, 248] is one example of rehabilitation social robots. This humanoid robot has an anthropomorphic head with a pair of eyes and servo motors that enable various eye movements. In addition, the robot has a moving neck, arms, lips, and eyebrows enabling the system to perform human-like gestures and facial expressions. The visual processing of the robot for human gaze detection is hierarchical. The robot first detects the face of the user, and then saccades to the face by centering it in the image. The zoom cameras then focus on the eye regions of the face and an algorithm detects the gaze direction. Following the direction of the gaze, the robot identifies where the object of interest is.

Through experimental evaluations, which were performed with autistics kids in the presence of their mothers, the creators of Infanoid found that although the autistic kids were reluctant at first, proper intentional human-like eye movements could eventually engage them in close interaction and help to maintain it for a long period of time [254]. The authors observe that during interaction between the robot and the kids both dyadic and triadic (the kid, the robot, and the mother) behaviors emerged [255]. The authors later introduced a new robot Keepon (Figure (b)b) [249, 250] for interaction with infants. This small soft creature-like robot has two main functions: orienting its face and showing the emotional state, such as pleasure or excitement, by rocking its body. The use of joint attention mechanism in this new robot shows similar effect in engaging autistic children. In fact, the authors reveal that children go through three phases in order to interact with the robot: 1) neophobia (don’t know how to deal with the robot), 2) exploration (through parents explore how the robot behaves), and 3) interaction. Similar studies have also been conducted on autistic kids using different robotic platforms such as a simple mobile robot [251] or LEGO robots, the result of which agree with the findings of works on Infanoid and Keepon.

10 Hardware Means of Traffic Interaction

Although the focus of this report is on visual scene understanding, for the sake of completeness, we will briefly discuss hardware approaches that are being (or can potentially be) used in the context of traffic interaction.

10.1 Communication in Traffic Scenes

Establishing communication between vehicles, and vehicles and infrastructures have been the topic of interest for the past decades. Technologies such as Vehicle to Vehicle (V2V) and Vehicle to Infrastructure (V2I), which are collectively known as V2X (or Car2X in Europe), are examples of recent developments in the field of traffic communication [256, 257]. These technologies are essentially a real-time short-range wireless data exchange between the entities allowing them to share information regarding their pose, speed, and location [258].

Sharing this information between the vehicles and infrastructures enables the road users to detect hazards, calculate risks, issue warning signals to the drivers and take necessary measures to avoid collisions. In fact, it is estimated that V2V alone could prevent up to 76% of the roadway crashes and that V2X technologies could lower unimpaired crashes by up to 80% [258].

Recent developments extend the idea of V2X communication to connect Vehicles to Pedestrians (V2P). For instance, Honda proposes to use pedestrians’ smart phones to broadcast their whereabouts as well as to receive information regarding the vehicles in their proximity. In this way, both smart vehicles and pedestrians are aware of each other’s movements, and if necessary, receive warning signals when an accident is imminent [259].

In spite of their effectiveness in preventing accidents, V2X technologies are being criticized on two grounds. First, efficient communication between the road users is highly dependent on the functioning of all involved parties. If one device malfunctions or transmits malicious signals, it can interrupt the entire communication network. The second issue is privacy concerns. A recent study shows that one of the major problems of the road users with employing V2X technologies is sharing their personal information in the network [260].

Since recently, car manufacturers are turning towards less intrusive forms of communication that do not require the corresponding parties to share their personal information. Techniques such as using blinking LED lights [261], color displays [262] or small surface projectors [263] to visualize the vehicle’s intention have been investigated. Some vehicles also use a combination of these methods to communicate with traffic. For instance, Mercedes Benz, in their most recent concept autonomous vehicle (as illustrated in Figure 23), uses a series of LED lights at the rear end of the car to ask other vehicles to stop/slow or inform them if a pedestrian is crossing, a set of LED fields at the front to indicate whether the vehicle is in autonomous or manual mode and a projector that can project zebra crossing on the ground for pedestrians [264].

Figure 23: The concept autonomous vehicle by Mercedes Benz with communication capability: a) grill LED lights, blue indicates autonomous mode and white manual mode, b) the car is projecting a zebra crossing for the pedestrian to cross, c) rear LED lights are requesting the vehicles behind to stop, and d) rear LED lights are showing that a pedestrian is crossing in front of the vehicle. Source [264].

To make the communication with pedestrians more human-like, some researchers use moving-eye approach. In this method, the vehicle is able to detect the gaze of the pedestrians, and using rotatable front lights, it establishes (the feeling of) eye contact with the pedestrians and follow their gaze [265]. Some researchers also go as far as suggesting to use a humanoid robot in the driver seat so it performs human-like gestures or body movements during communication [266].

Aside from establishing direct communication between traffic participants, roadways can be used to transmit the intentions and whereabouts of the road users. During the recent years, the concept of smart roads has been gaining popularity in the field of intelligent driving. Smart roads are equipped with sensors and lighting equipment, which can sense various conditions such as vehicle or pedestrian crossing, changes in weather conditions or other forms of hazards that can potentially result in accidents. Through the use of visual effects, the roads then inform the road users about the potential threats [267].

In addition to the transmission of warning signals, smart roads can potentially improve the visibility of roads and attract the attention of traffic participants (especially pedestrians) from electronic devices or other distractors to the road [267]. Today, in some countries such as Netherlands smart roads are in use for recreational and safety purposes (see Figure 24).

(a) Van Gogh Path
(b) Glowing Lights
Figure 24: Examples of smart roads used in Netherlands. Source [268].

10.2 Pedestrian Intention Estimation Using Sensors

The use of various sensors for activity and event recognition has been extensively investigated in the past decades. Although the majority of the research in this field focuses on general activity recognition [269, 270] or health monitoring systems such as fall detection in indoor environments [271, 272], some works have potential application for outdoor use in the context of traffic interaction. Below, we briefly list some of these systems.

Sensors used in activity recognition can be divided into three categories: 1) sensors mounted on the human body, 2) sensors that are in devices carried by humans or 3) external sensors that monitor the environment. The first group are referred to as wearable sensors and can be in the form of bi-axial accelerometer [273, 274, 275], digital compass [273], bi-axial gyroscope sensor [36], inertial measurement sensors [276, 277] and Surface Electromyography (SEMG) sensors [278]. The second group include sensors such as accelerometer [270] or orientation sensors [272] in smartphone devices [279, 280]. The third group, external sensors, ranging from RADAR [281, 282], WiFi [283] and motion detectors [284] to infrared light detectors [273], microphones [285], and state-change sensors (e.g. reed switches) [284, 269]. The interested reader is referred to these survey papers [286, 287] for more information.

11 Visual Perception: The Key to Scene Understanding

Visual perception in autonomous driving can either be realized through the use of passive sensors such as optical cameras and infrared sensors or active sensors such as LIDAR and RADAR. Although autonomous vehicles take advantage of active sensors for range detection and depth estimation (in particular under bad weather or lighting conditions), optical cameras are still the dominant means of scene analysis and understanding. For autonomous driving optical cameras are better for a number of reasons: they provide richer information, are non-invasive, more robust to interference and less dependent on the type of objects (whereas for active sensors reflection and interference are major issues) [288]. As a result, optical cameras are expected to play the main role in traffic scene analysis for years to come.

For the reasons mentioned above, we put our main focus on reviewing algorithms that use optical cameras and only briefly review some algorithms based on active sensors. Even though active sensors are fundamentally different from optical cameras, the algorithms for processing the data generated by these sensors share some commonalities.

we subdivide the materials on visual perception into three categories. Since the prerequisite to scene understanding is identifying the relevant components in the scene, we begin the discussion with algorithms for detection and classification. Then, we review pose estimation and activity recognition algorithms, which help to understand the current state of the road users in the scene. In the end, we conclude the revision by discussing the methods that predict the upcoming events in the scene, i.e. intention estimation algorithms.

11.1 Realizing the Context: Detection and Classification

In computer vision literature, detection and classification algorithms are either optimized for a particular class of objects or are multi-purpose and can be used for different classes of objects (generic algorithms). Here, we will review the algorithms in both categories. In the case of the object specific algorithms we put our focus on reviewing the algorithms designed for recognizing traffic scene elements, namely pedestrians, vehicles, traffic signs and roads.

Object recognition algorithms are commonly learning-based and rely on different types of datasets to train. Knowing the nature of the datasets sheds some light on the ability of the proposed algorithms in identifying various classes of objects. Hence, we begin by discussing some of the most popular datasets in the field.

11.1.1 Datasets
Generic Datasets

Generic datasets, as the name implies, contain a broad range of object classes. They may include live objects such as people, animals or artificial objects such as cars, tools or furniture. A comprehensive list of the most known datasets and their properties can be found in Table 1.

Among the available datasets, many are being widely used today. The Visual Object Classes (VOC) dataset is one of the most used and cited datasets since 2005 and contains images for object detection and classification (with bounding box annotations) and pixel-wise image segmentation. The object recognition part of the dataset contains more than 17K color images comprising 20 object classes. ImageNet is similar to VOC but in a much larger scale. Today, ImageNet contains more than 14 million images out of which over 1 million have bounding boxes for 1000 object categories. In addition, a portion of the samples in ImageNet comes with object attributes such as the type of materials, color of the objects, etc. Similar to VOC, ImageNet is used in a yearly challenge known as ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

(a) ImageNet
(c) Places-205
Figure 25: Sample images from 3 major object recognition datasets.

MS COCO is another large-scale object recognition and segmentation dataset that consists of more than 300K color images with pixel-wise ground truth labels. For scene recognition, researchers at MIT have introduced the Places-205 and Places-365 datasets each with more than 2.5 and 10 million color images of various scenes respectively. The extension numbers of the Places datasets refer to the number of scene categories each dataset has (for sample images from major datasets see Figure 25).

Dataset Year Obj. Type/No. Categories Data Type No. Frames Ground Truth
COIL-20 [289] 1996 O/20 Gr/Multiviews 2K+ CL
MNIST [290] 1998 D/10 Gr 70K PW
MSRC-21 [291] 2000 O/32 Col 800 PW
Caltech4 [292] 2003 O/10 Col 1.4K+ BB/CL
Caltech 101 [293] 2004 O/101 Col 32K+ OB/CL
VOC2006 [294] 2006 O/10 Col 5200 BB/CL
VOC2007[295] 2007 O/20 Col 5K+ BB/CL
VOC2008[296] 2008 O/20 Col 4.3K+ BB/CL
SUN09 [297] 2009 O/200 Col 12K BB/CL
AwA [298] 2009 A/50, AT/85 Col 180K+ CL/AL
NUS-WIDE [299] 2009 O/81 Col 269K+ CL
CIFAR-100 [300] 2009 O/100 Col 60K CL
ImageNet [301] 2010 O/1000, AT/21K Col 1.2M+ OB/CL/AL
VOC [302] 2010 O/20 Col 15K+ BB/CL
SUN297 [303] 2010 O/367 Col 108K+ BB/CL
RGB-D [304] 2011 O/300 Col and R 41K+ P/CL
Willow Garage [305] 2011 O/110 Col and R 353 CL/ PW/P
2D3D [306] 2011 O/156 Col and R 11K+ -
Sun Attribute [307] 2011 AT/102 Col 14K + CL
SVHN [308] 2011 D/10 Col 600K+ OB/CL
NYUD2 [309] 2012 O/1000+ Col and R 407K+ (1449 labeled) PW/CL
VOC2012[310] 2012 O/20 Col 17K+ OB/CL
MS COCO [311] 2014 O/80 Col 300K+ PW/CL
Places-205 [312] 2014 S/205 Col 2.5M CL
ModelNet [313] 2015 O660 CAD 151K+ CL
Places-365 [314] 2017 S/365 Col 2.5M CL
Table 1: Generic object recognition datasets. Abbreviations: object type: O = Objects, D = Digits, A = Animals, AT = Attributes, S = Scenes, data type: Gr = Greyscale, Col = Color, R = Range, CAD = 3D meshes, ground truth: CL = Class Label, PW = Pixel-Wise, BB = Bounding Box, OB = Object Boundaries, AL = Attribute Label, P = Pose (3D).
Traffic Scene Datasets
(b) Cityscapes
Figure 26: Sample images from two major traffic scene datasets and their associated ground truth annotations.

A number of datasets are particularly catered to traffic scene understanding applications. KITTI [315] is one of the best known and widely used datasets with over 30K color stereo image samples as well as associated laser point cloud data and 3D GPS information. For object recognition purposes, the dataset contains bounding box annotations for 7 classes of objects including vehicles (car, van, and truck), people (person and pedestrian), cyclists and trams. A part of the dataset is annotated with pixel-wise ground truth for road surfaces. Another major dataset, Cityscapes [316], has pixel-wise ground truth annotations for other types of objects in traffic scenes including sky, signs, and trees. For sample images from KITTI and Cityscapes see Figure 26.

Table 2 lists 7 most common datasets for traffic scene understanding. Note that these datasets are mentioned here because they contain images that are collected from street-level view via a camera placed in the car.

Dataset Year No. Categories Data Type No. Frames Ground Truth
CBCL [317] 2007 9 Col 3.4K BB/PW/CL
CamVid [318] 2008 32 Col 701 PW/CL
Karlsruhe [319] 2011 2 Gr 1.7K BB/PW/CL
KITTI [315] 2013 7 Col, R, St, GPS 30K+ P/CL
Daimler-Urban [320] 2014 3 Col, St 5K PW/CL
Cityscapes [316] 2016 30 Col 25K+ PW/CL
Inner-city [321] 2016 4 Col 24K+ BB/CL
Table 2: Datasets for traffic scene understanding. Abbreviations: data type: Gr = Greyscale, Col = Color, R = Range, St = Stereo, ground truth: CL = Class Label, PW = Pixel-Wise, BB = Bounding Box, P = Pose (3D).
Pedestrian Detection Datasets

Since identifying pedestrians in traffic scenes is one of the most challenging tasks, there are a number datasets that are specifically designed for training pedestrian detection algorithms. Some of the most common ones that are being used today are Caltech [322], INRIA [323], and Daimler [324]. The Caltech dataset, in particular, is one of the best known datasets for two reasons: it contains a very large number of pedestrian samples (350K) and also occlusion information in the form of a bounding box that covers only the visible portion of occluded pedestrians. Table 3 enumerates some widely used pedestrian detection datasets. One should note that since the focus of this report is on autonomous driving applications, the datasets listed in the table are the ones that contain street-level view of the pedestrians. There are, however, few exceptions that also have images from other viewpoints such as CUHK, PETA and INRIA.

Dataset Year No. Ped. Samples Data Type No. Frames Ground Truth
MIT [325] 1997 924 Col 924 CL
INRIA [323] 2005 1.7K+ Col 2.5K BB
Daimler-Class [326] 2006 4K Gr 9K CL
Penn-Fudan [327] 2007 345 Col 170 BB/PW
ETHZ [328] 2008 14.4K Col 2.2K BB
Caltech [322] 2009 350K Col 250K+ BB,V
Daimler-Mono [324] 2009 15.5K+ Gr 21K+ BB
Daimler-Occ [329] 2010 90K Gr,St 150K+ CL
TUD Brussels [330] 2010 3K+ Col 1.6K BB
Daimler-Stereo [331] 2011 56K+ Gr,St 28K+ BB
Berkley [332] 2012 2.5K Col,St 18 Videos PW
CUHK [333] 2012 - Col 1063 BB
PETA [334] 2014 8.7K Col,St 19K AL
Kaist [335] 2015 103K+ Col,T 95K BB,V
CityPerson [336] 2017 35K Col,St 5K BB/PW
Table 3: Pedestrian detection datasets. Abbreviations: data type: Gr = Greyscale, Col = Color, St = Stereo, ground truth: CL = Class Label, PW = Pixel-Wise, BB = Bounding Box, V = Visibility, P = Pose (3D).
Vehicle Recognition Datasets

There exists a number of vehicle detection datasets that contain mainly images of passenger cars for detection and classification purposes. Some of these datasets such as Standford [337] have cropped out images of vehicles for classification purposes while the others include vehicle images in street scenes (e.g. NYC3D [338]), making them suitable for detection tasks. A list of vehicle datasets can be found in Table 4. It should be noted that only three of these datasets (e.g. LISA [339], GTI [340] and TME [341]) are recorded strictly from the vehicle’s point of view. The rest, contain either views from different perspectives (e.g. CVLAB [342]) or only show side-views of the vehicles (e.g. UIUC [343]).

Dataset Year No. Vehicle Samples Data Type No. Frames Ground Truth
CMU/VASC [344] 2000 213 Gr 104 BB
MIT-CBCL [345] 2000 516 Col 516 CL
UIUC [343] 2002 550 Gr 1050 BB
CVLAB [342] 2009 20 Col 2K+ BB/RA
LISA [339] 2010 - Col 3 Videos BB
GTI [340] 2012 3.4K+ Col 7K+ BB
TME [341] 2012 - Col,R,EM 30k+ BB
NYC3D [338] 2013 3.7K+ Col 5K+ 3D/2D BB, CL, GL,TD
Stanford [337] 2013 16K Col 16K+ CL
Table 4: Vehicle detection and classification datasets. Abbreviations: data type: Gr = Greyscale, Col = Color, R = Range, EM = Ego-Motion, ground truth: CL = Class Label, BB = Bounding Box, RA = Rotation Angle, GL = Geographical Location, TD = Time of Day.
Traffic Sign Recognition Datasets

Among the observable elements in traffic scenes, traffic signs have one of the highest variabilities in appearances. Depending on the shape, color or descriptions on the signs, each can convey a different meaning for controlling traffic flow. Some datasets try to capture such variability by putting together cropped samples from traffic scenes (e.g. GTSRB [346]) or showing the signs in the streets (e.g. BTSD [347]) accompanied by bounding box information and class labels. A number of traffic sign datasets are shown in Table 5.

Dataset Year No. Categories Data Type No. Frames Ground Truth
UAH [348] 2007 - Col 200+ -
MASTIF [349] 2009-2011 28 Col 10k BB, S,CL
STSD [350] 2011 - Col 20k+ BB,V,CL
GTSRB [346] 2011 43 Col 50k+ BB, S,CL
Lisa [351] 2012 47 Col 6.6k BB,S,V,CL
GTSD [352] 2013 42 Col 900 BB,S,V,CL
BTSD [347] 2013 62 Col 9k+ BB,S,CL
Table 5: Sign recognition datasets. Abbreviations: data type: Col = Color, ground truth: CL = Class Label, BB = Bounding Box, V = Visibility, S = Size.
Road Recognition Datasets

Road detection algorithms often use traffic scene datasets such as KITTI or Cityscape. KITTI dataset, for instance, has pixel-wise annotations for different categories of road surfaces such urban unmarked (UU), and urban marked (UM), urban multiple marked lanes (UMM). There are very few datasets particularly designed for road recognition purposes. Of interest for autonomous driving applications is the Caltech-Lanes dataset [353] which contains 1225 color images of streets with pixel-wise ground truth annotations of lane markings.

11.1.2 Generic Object Recognition Algorithms

Early works in object recognition are mainly concerned with identifying 3D shapes. These algorithms often rely on a pre-existing 3D model of an object that defines the relationship between its various aspects such as edges [354], vertices [355] and/or surfaces [356]. For instance, Chakravarty and Freeman [355] characterize a 3D object in terms of its lines (edges) and junctions (vertices), the combination of which can take one of the 5 possible junction forms (e.g. T shape). In their representation, each object, depending on what form of junctions are observable (e.g. 2 T-junctions and 1 U-junction), can have a number of unique views or as they term them characteristic-views (CVs). The task of recognition is then limited to identifying junction forms in the image and match them to those model representations in the database. For a review of similar 3D object recognition algorithms refer to [357].

The aforementioned 3D object recognition algorithms, deal with identification of rather simplistic 3D shapes such as cubes or cylinders. Later works, however, attempt to recognize more sophisticated objects using model-based techniques [358, 359]. In a well-known algorithm by Lowe [358], the author reduces the problem of 3D object recognition to estimating the projection parameters of the given model points into the corresponding image points by identifying correspondences between a known 3D model of the object and points observed in the scene. He argues that features used for such a purpose should have two characteristics: be viewpoint invariant and not occur accidentally in the image (due to viewpoint selection or background clutter). Based on these criteria, to form features, Lowe’s algorithm perceptually groups the edges of a model based on their proximity, parallelism and collinearity. Having identifying such relationships in the image, the author then tries to match them with the object 3D model using a least square technique. Once the match occurred, the projection parameters (rotation and translation) can be calculated. Lowe points that at least three matches are required to accurately estimate the parameters.

To solve the problem of 3D object recognition, some researchers use an active approach in which the reasoning is based on the changes in the appearance of the object in the scene [360, 361]. Wilkes and Tsotsos [360] perform such active recognition by using a camera mounted on a robotic arm attached to a moving platform. Here, the objective is to first find a prominent line (the longest line) in the scene, and then move the camera so the line is placed in the center. Next, the camera moves parallel to the line in 16 consecutive poses to find the view that maximizes the length of the line. In the same manner, a second prominent line is selected (that is not parallel to the first one) but this time by moving the camera perpendicularly with respect to the first line. The second viewpoint then would be the standard viewpoint of the object in which two prominent lines have the maximum length. The standard viewpoint is used in a tree search algorithm to match against the available models in the database to recognize the object.

Figure 27: Generation of a spin image from a 3D surface.

One particular drawback of the earlier 3D recognition algorithms is their dependency on identifying the edges of the objects. Therefore, they perform best when the background clutter is minimal, for instance, the objects are placed in front of a uniform black or white background. In addition, to lower the complexity of search and reasoning, assumptions are often made to simplify the problem. These may include constraining the distance of the camera to the object, limiting the possible poses of the object and controlling the illumination of the scene. To avoid constraints, some authors use more complex 3D features to define the shape of an object. Johnson and Hebert [362] introduce the concept of spin images to characterize 3D shapes. Intuitively, spin images are 2D projections of 3D parts of an object that are generated by passing a 2D plane through the 3D shape (see Figure 27). These images are generated uniformly from the 3D model of the object, hence, the relationship between different spin images is also preserved. The combination of spin images characterizes a 3D object and can be used to distinguish the object from the distractors.

Advancements in the quality of imaging sensors gave rise to the use of pixel intensity and color values for object recognition. The methods using these features either naively learn the pixel value distributions of an object [363, 364], average pixel value distributions over multiple viewpoints [365], or generate more powerful features by finding the relationship between pixel values in local regions of the scene [366, 367]. One of the early attempts to use color for recognition is by Swain and Ballard [363]. The authors propose to learn the characteristics of an object by generating a 3D color histogram of the object’s appearance. Then they use the distribution to localize and classify the scene in the following way. For localization, they perform histogram backprojection which replaces the color pixel values in the scene by their distribution values from the object’s 3D histogram. To decide whether the observed scene contains the object of interest, histogram intersection operation is used. Overall, although color and pixel intensities can be used to characterize an object, they are very sensitive to noise, illumination changes, and background clutter.

In the late 90’s, increase in computation power allowed the researchers to develop more powerful feature descriptors for object recognition. One such method was proposed by Lowe [368, 369]

the Scale Invariant Feature Transform, better known as SIFT. The purpose of this method is to find key points in an image that are invariant to scale, rotation and translation. SIFT features are generated as follows: at each octave, images are successively blurred by applying a Gaussian filter and subtracted from the image in the previous scale (e.g. at the beginning first blurred image is subtracted from the original image) to generate difference of Gaussian (DOG) images. Next, in each series of images at the given octave, maxima and minima pixels are localized by comparing the pixel value with its 8 neighboring pixels in all scales. Once the key points are detected, their local gradient orientation and magnitude are calculated. A threshold is applied to pick the most dominant key points. For classification purposes, a local descriptor (for a 2x2 or 4x4 patch) around the key point is generated by calculating the gradient characteristics of the neighboring pixels. The descriptors of the object are learned by using a classification method such as decision tree

[368] or SVM [370].

Other types of features commonly used in object recognition are Haar (a local intensity pattern) [371, 372], steerable filters [373], and histogram of oriented gradients (HOG) [374, 375]. Depending on the application, these features achieve a different level of performance. For instance, Haar features are shown to have a discriminative representation for face detection [371] whereas HOG features are quite effective in person detection [333]. In more recent years, due to the availability of range sensors, some researchers also take advantage of depth information for object recognition [376, 377] to segment an object from its surroundings or to retrieve the 3D shape of an object. To achieve a better recognition performance, some scientists also experiment with combining various features for more robust representation of objects. For instance, SIFT has been used with color features [378], color and HOG features [379] or with HOG and local binary pattern (LBP) features [380].

Today, the field of object recognition is dominated by Convolutional Neural Networks (CNNs) both for object recognition in 2D

[381, 382, 383] and 3D with the aid of range data [384, 385]. AlexNet, introduced by Krizhevsky et al. [382], is one of the early CNN models that achieved state-of-the-art performance in object classification (15.3% top-5 error rate in the ImageNet dataset). AlexNet has 5 convolutional layers for learning object features and a 3-layer fully connected network to classify the learned object features into one of the 1000 classes in the ImageNet dataset. To improve the results even further, more recent CNN models propose even deeper architectures. VGG-16 [386] is one of the widely used architectures that has 16 convolutional layers (or 19 in VGG-19 architecture) for object representation. By increasing the depth of the network, VGG-19 could achieve the top-5 error rate of 6.8% on the ImageNet dataset.

Another widely known CNN model is GoogleNet with a 22-layer architecture [387]

. To increase the depth of the network while minimizing network parameters, the creators of GoogleNet introduce the notion of inception layers. Inception layer is based on the idea that in the earlier convolutional layers (the ones closer to the input), activations tend to be concentrated in local regions. Given that, these regions can be covered by simply applying 1x1 convolutions, which also greatly reduces the depth of the filter banks. Furthermore, to capture more spatially sparse clusters, two successive 3x3 and 5x5 convolutions are also applied. The concatenation of these three convolutions in conjunction with a max pooling operation forms the input to the next layer of the network.

Going deeper in convolutional layers, however, comes with the cost of accuracy degradation due to the saturation of the network. This is addressed by residual networks (ResNets) [383]. The basic structure of the network is a series of 3x3 convolutional layers. The input to each block of 2 convolutional layers, or 3 in deeper architectures, is added directly to the output of the block. Such direct link connections are repeated throughout the layers. Using this methodology, ResNet offers an architecture as deep as 152 layers which can achieve the error rate of 5.71% (top-5 estimation) on the ImageNet dataset.

The CNN architectures discussed above are widely used in scene recognition applications as well. Zhou et al. [312] train weakly supervised CNN models for scene recognition using Places-205 dataset. The networks not only perform well in scene recognition but are also surprisingly good at identifying the key attributes in a given scene. This is achieved by analyzing the activations of the last pool layer before the fully connected layers. For example, the authors show that in an indoor scene of an art gallery, the activations are concentrated on the regions with paintings. Using a similar weakly supervised method, authors in [388] explicitly attempt to learn scene attributes from a partially labeled data. Here, in the initial stage, the network generates a series of pseudo-labels by averaging the responses of early convolutional layers. These labels are then used to train the network to recognize scene attributes.

Moreover, weakly supervised training can be used in object classification. In [389], Oquab et al.

exhibit the ability of neural nets to learn object classes from images without bounding box annotations. The authors propose a multi-scale training procedure in which for each epoch of training the scale of the input images is randomized (between 0.7 to 1.4 times the original dimensions). To tackle the problem of dimension mismatch between the input size and the fully connected layers, the authors transformed the network to a fully convolutional architecture by replacing the fully connected layers with equivalent convolutional layers. The classification is then achieved by a global pooling over the final score map. At the test time, the network is run on the input image at all scales and then the results are averaged.

Figure 28: Selective search for region proposal generation. Source: [378].

A more challenging task in recognition is the detection and localization of objects in the scene. In the literature, there are numerous works attempting to solve this problem in a minimum amount of time [390, 391, 392, 393, 394, 395]. One of the well-known detection algorithms is Regions with CNN features (R-CNN) [391] which unlike the methods that generate object specific region proposals (e.g. [390]) produces proposals based on the structure of the scene. R-CNN relies on the method called selective search [378] (see Figure 28

) to identify coherent regions of the scene in different spatial scales. Selective search starts by a small-scale superpixel segmentation to find regions with coherent properties. Then, in each subsequent step, it measures the similarity between the segmented regions and merges the ones that are similar. This process is repeated until a single region covers the entire image. Along with this process, at each step, proposals are generated around the identified coherent regions. Once the proposals are generated, R-CNN warps them to a fixed dimension and feeds them to a CNN (in this case AlexNet) model. The output of the CNN is a fixed feature vector that is used to identify the class of the objects in the region using a linear SVM model.

The successors of R-CNN, fast R-CNN [393] and faster R-CNN [396], combine a part of the region proposal module with the CNN network to speed up the process. In these methods, instead of feeding each proposal to the CNN, the entire image is first processed with the convolutional layers, and then the regions of interest (ROIs) are extracted from the convolutional feature map by a ROI pooling operation. The output of the ROI pooling layer is fed to a fully connected network with two branches: one for identifying the class of the object and the other for optimizing the positions of the bounding box. The faster R-CNN model goes one step further and generates the object proposals based on the output of the convolutional feature map using a what so called a Region Proposal Network (RPN). RPN uses a 3x3 sliding window technique to generate region proposals in the form of 256D feature vectors. These features are then fed to two fully connected networks, namely a box-regression layer and a box-classification layer. The former layer learns the position of the bounding boxes by calculating the intersection of the proposals with the ground truth information. The latter layer learns the objectness of the proposals, that is how likely the bounding box contains an object or a part of the background. In the end, the output of RPN is fed to ROI pooling layer of Fast R-CNN for final classification.

To deal with objects of different scales, He et al. [397] introduce spatial pooling technique in which the pooling operation is performed at different scales (after convolutional feature map is generated), and then resulting features are concatenated to form a fixed feature vector that goes into the fully connected layers. To further speed up the proposal generation process, some algorithms such as You Only Look Once (YOLO) [395] treat the process as a regression problem. YOLO, in essence, is a FCN network. It takes as input the full resolution image, and divides it into a SxS grid. Each cell predicts 5 bounding boxes with a confidence score indicating how likely it contains an object. Cells in turn are classified to identify which class of objects they belong to. The combination of the bounding box predictions and the predicted class of each cell forms the final detection and classification of objects in the scene. Despite of the fact that YOLO achieves an efficient processing time of up to 45 fps (and up to 150 fps in fast version), it suffers from lower detection accuracy, inability to localize small objects and lack of generalization to unusual aspect ratios or configurations.

In more recent years, improved results have been achieved by exploiting contextual information in both detection and classification tasks. For instance, for detection purposes, in [398] for each generated ROI candidate, an outer region, representing the local context, is taken into account. To learn the proposals, a regression operation is performed on both addition and subtraction of ROIs and their corresponding contexts. The authors claim that this method helps generating a more discriminative representation to separate the object from its background and at the same time inferring the potential locations for the object.

Furthermore, context is shown to improve object classification. For example, Hu et al. [399]

introduce a technique for scene classification in which the relationships between the multiple objects in the scene are exploited to improve the overall classification results. In their formulation, there are 4 different levels of object annotations: coarse scene category (e.g. indoor, outdoor), scene attributes (e.g. sports field), fine-grained scene category (e.g. playground) and object-level annotation (e.g. bat, people). To capture inter-connections between different elements and their labels in the scene, a recurrent neural network (RNN) is trained. The output of this network is then used to optimize the activations of each of 4 categories prior to final classification.

A summary of algorithms discussed earlier can be found in Table 6.

Model Year Features Classification Sensor Type Data Type Object Type Recognition Loc. Output
CV3D [355] 1982 Lines - Cam Gr + UB 3D Objects C -
3DI [356] 1983
3D shapes
- R D + UB 3D Objects L,C 3D
2D3D [358] 1987
- Cam Gr + UB Razors L,C 3D
AffIM [359] 1988
2D models, lines
- Cam Gr + UB
L,C 3D
CI [363] 1991 Color - Cam Col + UB Random Objects L,C 2D
AOR [360] 1992
- Cam Gr + UB Origami C -
IAVC [361] 1994
- Cam Gr + UB
3D Objects
L,C 3D
SVM3D [365] 1998 Intensitty SVM Cam Col + UB Random Objects C -
CBOR [364] 1999 Color - Cam Col + UB Random Objects C -
Spin-3D [362] 1999 3D meshes - R D Toys L,C 3D
SIFT [368] 1999 SIFT Kd-tree Cam Gr Random Objects L,C 3D
Shape-C [354] 2001 Shape Context K-NN Cam Gr + UB MNIST, COIL-20 C -
Boosted-C [371] 2001 Haar Cascade Cam Gr + UB Faces L,C 2D
Haar-ROD [372] 2002 Haar Cascade Cam Gr + UB Faces L,C 2D
Context-Pl [373] 2003
HMM Cam Gr
L,C 2D
MOD [367] 2004 Paches Boosting Cam Gr Random Objects L,C 2D
STF [366] 2008 Color, STF SVM Cam Col
VOC 2007
L,C 2D
Part-Base [374] 2010 HOG SVM Cam Col VOC 2006-08 L,C 2D
WSL [379] 2012
SURF, Color
CRF, SVM Cam Col
VOC 2006-07,
Caltech 4
L,C 2D
AlexNet [382] 2012 Conv Neural Net Cam Col ImageNet C -
origMSRC [375] 2012 HOG SVM Cam Col
MSRC 21,
VOC 2010
L,C 2D
SSHG [378] 2013 Color, SIFT SVM Cam Col VOC 2007 L,C 2D
Deep-Scene [312] 2014 Conv Neural Net Cam Col
C -
Adapt-R [370] 2014 SIFT SVM Cam Col VOC 2007 L,C 2D
HMP-D [376] 2014
SPC, Color
SVM K Col + D
L,C 2D
R-CNN [391] 2014 Conv Neural Net Cam Col VOC 2007,10-12 L,C 2D
MultiBox [390] 2014 Conv Neural Net Cam Col ILSVRC 2012 L,C 2D
LRF-D [377] 2014
SPP-net [397] 2014 Conv Neural Net Cam Col
ILSVRC 2012,
VOC 2007,
C -
VGG [386] 2014 Conv Neural Net Cam Col
ILSVRC 2014,
VOC 2012
L,C 2D
WSO-CNN [389] 2014 Conv Neural Net Cam Col VOC 2012 C -
HCP [392] 2014 Conv Neural Net Cam Col VOC 2007/12 L,C -
AAR [394] 2015 Conv Neural Net Cam Col VOC 2007 L,C 2D
CtxSVM-AMM [400] 2015
SVM Cam Col
VOC 2007/10,
L,C 2D
Fast R-CNN [393] 2015 Conv Neural Net Cam Col VOC 2007/10/12 L,C 2D
PReLU [381] 2015 Conv Neural Net Cam Col
VOC 2007,
L,C 2D
RCNN [401] 2015 Conv Neural Net Cam Col
CIFAR-10, CIFAR-100,
C 2D
VoxNet [384] 2015 Conv Neural Net L D NYUD2, ModelNet 40 C 2D
Faster R-CNN [396] 2015 Conv Neural Net Cam Col VOC 2007/12 L,C 2D
DEEP-CARVING [388] 2015 Conv Neural Net Cam Col
SUN attributes
C -
GoogleNet [387] 2015 Conv Neural Net Cam Col ImageNet 2014 L,C 2D
3DP-CNN [385] 2015 Conv K-NN K Col + D
L,C 3D
ResNet [383] 2016 Conv Neural Net Cam Col
ILSVRC 2014,
L,C 2D
SINN [399] 2016 Conv Neural Net Cam Col
C -
ContextLocNet [398] 2016 Conv Neural Net Cam Col VOC 2007/12 L,C 2D
YOLO [395] 2016 Conv Neural Net Cam Col VOC 2007/12 L,C 2D
Table 6: A summary of generic object recognition algorithms. Abbreviations: sensor type: Cam = Camera, R = Range camera, L = LIDAR, K = Kinect, data type: Gr = Greyscale, Col = Color image, D = Depth image, UB = Uniform Background, recognition: L = Localization, C = Classification, loc. type: 2D = 2D location, 3D = 3D Pose.
11.1.3 Algorithms for Traffic Scene Understanding

The nature of algorithms for traffic scene understanding, both in terms of detection and classification, is very similar to the generic object recognition algorithms discussed earlier. As a result, here, we will only focus on the parts of the algorithms that deal with unique characteristics of objects in traffic scenes.

A number of algorithms try to make sense of traffic scenes by identifying and localizing multiple objects or structures either in 2D image plane [402, 403, 404] or 3D world coordinates [405, 406] (see Table 7 for a short summary). For instance, in one of the early works [402] a hierarchical representation of 2D models is used to detect humans and recognize signs. The hierarchical representation allows one to efficiently search through a large number of available models to find the best match with the candidate object in the scene. Using this approach, for example, sign models form a tree structure with branches corresponding to different color, shape, or type of the sign.

Recent approaches use more powerful techniques for traffic object detection such as CNNs. Chen et al. [406] introduce the Mono-3D algorithm in which the network estimates the 3D pose of the objects in the scene using only the 2D image plane information. To achieve this, the authors first use the prior knowledge of the road with respect to the image plane (the distance of the camera from the ground is known and the road is assumed to be orthogonal to the image plane). This knowledge is used to generate initial 3D region proposals for objects. Using these proposals in conjunction with pixel-wise semantic information allows the algorithm to select the regions that contain an object. Note that the dimension and orientation of the 3D proposal boxes are learned based on the data and are not recovered from the scene, i.e. for all three classes of objects (pedestrians, cyclists, and cars), the authors estimated three possible sizes and two possible orientations ([0,90] degrees) for 3D boxes.

Model Year Features Classification Sensor Type Data Type Object Type Recognition Loc. Output
RT-Smart [402] 1999 2D models - Cam Gr Pedestrians, Signs L,C 2D
TSOD [403] 2000 Haar SVM Cam Gr
L,C 2D
SUTSU [404] 2009 Color Adaboost Cam Col
13 Class
e.g. Signs,Cars
C Pix
Tracklet [405] 2013 HOG SVM Cam + IMU Col Cars, Pedestrians L,C 3D
Mono-3D [406] 2016 Conv Neural Net Cam Col
Cars, Pedestrians,
Cyclists, Roads
L,C 3D
Table 7: Examples of traffic scene understanding algorithms. Abbreviations: sensor type: Cam = Camera, IMU = Inertial Measurement Unit, data type: Col = Color, Gr = Greyscale, recognition: L = Localization, C = Classification, loc. type: 2D = 2D location, 3D = 3D Pose, Pix = Pixel-wise.
11.1.4 Pedestrian Detection

Pedestrian detection, perhaps, is one of the most challenging tasks in traffic scene understanding. This is due to the fact that human body shape is often confused with various elements in the scene such as trees, poles, mailboxes, etc. Given the challenging nature of pedestrian detection, a large number of attempts have been made in the past decades trying to develop robust representations of human body that would be discriminative enough to separate them from other objects in the scene.

To deal with these ambiguities, intelligent transportation systems often rely on different types of sensors such as LIDAR [407, 408, 409] to segment pedestrians from the background, Ultrasonic or RADAR sensors [410] to detect pedestrians by analyzing motion information or infrared [411, 412, 413] to recognize humans by detecting body heat. The models that only detect motion are generally susceptible to noise and can easily be distracted by any kind of moving objects. That is why they are being used for monitoring places that are specifically dedicated to pedestrians (e.g. zebra crossings) [410]. The depth-based approaches that rely only on point clouds or depth information are vulnerable to occlusion or background clutter. These methods use stacks of planar laser scans by which they extract the shape of pedestrians [408]. So if the pedestrian is standing next to another object, the extracted shape can be distorted.

Among all non-optical sensors, infrared is proven to be very useful when lighting conditions are not favorable, e.g. at night. Infrared sensors identify the heat emitted by the human body and generate a dense greyscale representation of the environment. Naive algorithms simply threshold the captured heat map by a value close to expected human body temperature to identify pedestrians [412]. Since heat source alone can be sensitive to noise, especially in hot weather, some algorithms try to identify heat patterns similar to human body shape. For instance, Nanda and Davis [411] use a probabilistic template of the human body to match against the detected heat patterns. Bertozzi et al. [413] use a stereo infrared camera to generate a depth map whereby the human is segmented from other heat sources in the scene.

Some early camera-based pedestrian detection models use techniques similar to those used by infrared-based models. For instance, in the cases where the camera position is fixed, motion is often used to detect human subjects. The detected motion pattern is then either identified as a pedestrian [414] or is compared with pre-learned 2D human body models to confirm the presence of the human in the scene [415]. Stereo techniques, on the other hand, generate a depth map to identify the potential regions where human might be [416, 417]. For instance, Broggi et al. [417] use a similar technique but further refine the identified regions by discarding the ones with low symmetry. Then, the authors use a human head template and compare it to the final regions to identify the ones that correspond to a pedestrian. Of course, one major drawback of this technique is that it does not detect pedestrians from side-view.

A better detection rate can be achieved using more descriptive features including but not limited to Haar [418], color [419], and HOG [420, 333, 421]. Among these features, HOG, originally designed for human detection [323] have been one of the most common techniques until recently. For detection purposes, these features are used in various forms. For instance, some algorithms use them in isolation [421] whereas others combine them with other types of features such as color [419] or optical flow [420] to build more powerful image descriptors.

The most recent state-of-the-art techniques are based on convolutional neural networks [422, 423, 424, 425]

. The feature generation and classification of pedestrian detection algorithms are similar to those of the generic object recognition algorithms described earlier. However, CNNs designed for pedestrian detection pay special attention to challenges related to distinguishing pedestrians from outliers. For instance, to deal with the problem of occlusion in the scene, Tian

et al. [422] detect pedestrians based on their body parts. The authors construct a pool of pedestrian parts comprising 45 different types of human body parts under various forms of occlusion. These parts are used to train a series of CNNs which are used at the test time to detect humans. To reduce the number of false positives and to increase true positives, in [423], the authors explicitly train a classifier on objects that are most often confused with pedestrians. For instance, they identify hard negative samples (e.g. trees, poles) and explicitly learn them as separate classes in the network. In addition, they improve the detection of the pedestrians by identifying relevant attributes such as gender, their angle with respect to the camera or even what they carry (e.g. handbag). A more recent approach also achieves a similar objective by combining pedestrian detection with semantic segmentation of the scene to explicitly identify the element that may be confused with pedestrians [426]. A summary of pedestrian detection algorithms discussed in this section can be found in Table 8.

In the literature, there are a number of benchmark studies [427, 428, 429] that give a good overview of recent algorithms and their performance on various datasets. In addition, there are a number of survey papers that provide a comprehensive overview of pedestrian detection algorithms, for example [324], a survey of monocular camera-based pedestrian detection algorithms, and [430], which summarizes models that use different types of sensors.

Model Year Features Classification Sensor Type Sensor Pose Data Type Loc. Output
IRP [415] 1993
2D model
Matching Cam FiP +BeV +SiV Vid+Gr 3D
VBI [414] 1995 Motion - Cam FiP +BeV Vid+Gr 2D
TSPD [418] 1997 Haar SVM Cam StL Col 2D
PPD [410] 1998 Motion - R + U + I FiP+BeV Ht -
UTA [416] 1998 Depth Neural Net StCam StL Vid+Gr 2D
ARGO [417] 2000
Head model
- StCam StL Gr 2D
PRUT [407] 2002 Depth Parameter-based L StL PC 2D
PTB [411] 2002
- I StL Ht 2D
PDII [412] 2003 Intensity - I StL Ht 2D
SVB [413] 2005
Intensity, Edges,
- SI StL Ht 2D
HOG [420] 2006
Optical flow,
SVM Cam MuV M 2D
ACF [419] 2009 Color, HOG AdaBoost Cam MuV Caltech, INRIA 2D
PDAB [409] 2009 Haar AdaBoost Cam + L StL Gr 2D
PR-L [408] 2011 Depth SVM L StL PC 2D
DPM [333] 2013 HOG SVM Cam StL
MT-DPM [421] 2013 HOG SVM Cam StL Caltech 2D
CCF [431] 2015 Conv Decision forest Cam StL Caltech 2D
Deep-Parts [422] 2015 Conv Neural Net, SVM Cam StL Caltech 2D
TA-CNN [423] 2015 Conv Neural Net Cam StL Caltech, ETHZ 2D
LIDAR-CNN [424] 2016 Conv Neural Net Cam + L StL KITTI 2D
RPN-BF [432] 2016 Conv Neural Net Cam StL
Caltech, KITTI,
Fast−CFM [425] 2017 Conv Decision forest Cam StL
Caltech, KITTI,
SDS-RCNN [426] 2017 Conv Neural Net Cam StL Caltech, KITTI 2D
Table 8: A summary of pedestrian detection algorithms. Abbreviations: sensor type: Cam = Camera, R = RADAR, U = Ultrasonic, I = Infrared, St = Stereo, L = LIDAR, sensor pose: FiP = Fixed Position, SiV = Side-View, StL = Street-Level, BeV = Bird’s eye View, MuV = Multiple-Views, data type: Col = Color, Gr = Greyscale, Vid = Video, Ht = Heat map, PC = Point Cloud, D = Depth map, M = Movies, loc. type: 2D = 2D location, 3D = 3D pose.
11.1.5 Vehicle Recognition

Vehicles are relatively easier to detect in traffic scenes. Given their rigid body property, early detection methods often rely on simple shape detectors, e.g. circular shapes for wheels [433] or rectangular shapes for vehicle bodies [434, 435]. Vehicles also have symmetric appearance when they are seen from the front or the back, which is often the case with detecting vehicles on the road. Some authors take advantage of this property and identify symmetric patterns in the scene to detect vehicles [436, 437]. For instance, in [436], a stereo camera is used to segment the road edges. Then, a vertical line moves across the image, and at each step the regions on the right and left sides of the line are compared to measure their symmetry. The areas that yield the highest symmetry measure (above a certain threshold) are identified as vehicles.

Similar to pedestrian detection, if the position and view of the camera are fixed, motion can be used to detect potential regions that may correspond to vehicles [438, 439]. If the type of the vehicle (e.g. truck or sedan) is of interest, the nominated regions can be further processed, for example by measuring the aspect ratio of the bounding boxes to classify the vehicles [439].

Given the lack of robustness of simple features such as edges, corners or motion, more complex 2D features are widely used in vehicle detection. Some popular examples are Haar features [440], Gabor filters [441], HOG features [442, 443] and color [444]. Compared to generic detection algorithms, vehicle detection methods use these features differently. For instance, Sivaraman and Trivedi [443], instead of generating HOG descriptors for vehicles as a whole, learn the back and front parts of the vehicles separately. Then, at the test time, once each individual component is found, they fit a bounding box that covers both parts and consequently localize the entire vehicle. The authors argue that using this technique helps to identify vehicles despite occlusion. In another work [442] a part-based learning approach using HOG features is employed to detect vehicles at night. To improve the detection results, the authors modify the contributing weight of each part differently. For example, they increase the weight of back lights which are easily observable at night. Lopez et al. [444] use the intensity and the color of vehicle lights at nighttime to determine the distance to the vehicle and its direction of motion

Today, the field of vehicle detection algorithms is dominated by CNNs in such applications as satellite view traffic control [445] or autonomous driving [446]. The techniques used here are fairly similar to those used in generic algorithms. For instance, in [321] a multi-scale ROI pooling algorithm is used to detect vehicles in various scales. The authors use a VGG-16 architecture and connect three ROI pooling layers after convolutional layers 3, 4 and 5. The output of each ROI pooling layer is connected to a separate fully connected network to perform classification at different scales. The motivation for this architecture is that different scale object features emerge in different levels of the neural network, i.e. earlier layers are generally better at detecting smaller objects (due to their larger field of view) and the later ones are better at finding the large objects.

In autonomous driving, different sensor modalities can be used to improve the performance of detection. For instance, Lange et al. [447], to save processing time, use LIDAR readings to discard the ROI candidates that belong to cars far away from the vehicle (and focus only on up to 10 closest vehicles). Chen et al. [448] combine LIDAR readings from two different points of view (bird’s eye view and street level) with RGB images to improve the detection results. Here, the final classification takes place using the output of 3 ROI pooling sources from each input sensor reading. In this algorithm, the variations in viewpoints can disambiguate challenging situations caused by occlusion or illumination conditions.

The algorithms discussed in this section are summarized in Table 9. In addition, there are a number of survey papers that give a good overview of early [449] and more recent [443] vehicle detection algorithms.

Model Year Features Classification Sensor Type Sensor Pose Data Type Loc. Output
OWS-HT [433] 1989 Lines - Cam SiV Gr 2D
WADS [438] 1993 Intensity Neural Net Cam FiP + BeV Gr P
MVD [434] 1996
Lines, Corners,
- Cam StL Gr 2D
PAPRICA [435] 1997 Lines, Corners - Cam StL Gr 2D
Haar-CD [440] 1999 Haar SVM Cam StL Col -
SVB [436] 2000 Lines, Corners - StCam StL Gr 2D
DCV [439] 2002 Motion, Lines - Cam FiP + BeV Gr 2D
EGFO [441] 2005 Gabor SVM Cam StL Gr 2D
VGRD [437] 2007 Lines - Cam + R StL Gr 2D
NVD-IHC [444] 2008 Color AdaBoost Cam StL Col 2D
NUAD [442] 2011 HOG AdaBoost, SVM I StL Ht 2D
VDIP [450] 2013 HOG AdaBoost, SVM Cam StL Gr 2D
On-DNN [445] 2016 Conv SVM Cam Sat Col 2D
LID-CNN [447] 2016 Conv Neural Net Cam + L + R StL Gr + PC 3D
SDP [321] 2016 Conv
Neural Net,
Cam StL
VOC 2007,
Deep-MANTA [446] 2017 Conv Neural Net Cam StL KITTI 3D
Multi-3D [448] 2017 Conv Neural Net Cam + L StL KITTI 3D
Table 9: A summary of vehicle recognition algorithms. Abbreviations: sensor type: Cam = Camera, R = RADAR, I = Infrared, St = Stereo, L = LIDAR, sensor pose: FiP = Fixed Position, SiV = Side-View, StL = Street-Level, BeV = Bird’s eye View, data type: Col = Color, Gr = Greyscale, Ht = Heat map, PC = Point Cloud, loc. type: 2D = 2D location, 3D = 3D Pose, P = Presence only.
11.1.6 Traffic Sign Recognition

Color and shape are perhaps the two most distinctive features that separate signs from the background and at the same time, define their meaning (e.g. regulatory, informative or warning). Relying on these features alone, signs can be segmented from the background [451, 452, 453, 454] and classified [455, 456]. The algorithms that use color for sign detection, rely on a similar approach in which the image color values are thresholded by color values of the sought sign. For this purpose, different color spaces are investigated such as YIQ [452], HSI [453, 457] and HSV [454]. The classification methods are similar to those widely used in object recognition and may include neural nets [452, 453, 457], Parzen window [458] and SVM [456].

Besides shape and color, the position of traffic signs can be used to improve localization. For instance, Zhu et al. [459] use AlexNet architecture to detect and classify traffic signs in the scene. Here, instead of generating object proposals on the entire input image, the position of the signs, which are often located on the both sides of the roads, are used as a prior knowledge to generate proposals for classification.

A summary of papers reviewed in this section can be found in Table 10. In addition, a comprehensive survey of sign recognition algorithms can be found in [351].

Model Year Features Classification Sensor Type Sensor Pose Data Type Class Type Loc. Output
ART2 [451] 1994 Color, Shape - Cam StL Col Cir, Tri 2D
NOS [452] 1995 Color, Edges Neural Net Cam MuV Col All -
RCE [453] 1997 Color Neural Net Cam StL Col Cir, Tri 2D
HSFM [458] 2000 Intensity Parzen Window Cam StL Gr Cir -
TSRA [457] 2003 Color, Shape Neural Net Cam StL Col Cir, Tri 2D
CRSE [454] 2009 Color, Shape - Cam MuV Col Cir 2D
RTSC [455] 2012 Color SVM Cam MuV Col Work 2D
GSV [456] 2015 Color, HOG SVM Cam StL Col All 2D
FCN [459] 2016 Conv Neural Net Cam StL Col STSD 2D
Table 10: A summary of traffic sign recognition algorithms. Abbreviations: sensor type: Cam = Camera sensor pose: StL = Street-Level, MuV = Multi View, data type: Col = Color, Gr = Greyscale, class type: All = All types of signs, Cir = Circular signs, Tri = Triangular signs, Work = Work zone signs, loc. type: 2D = 2D location, 3D = 3D Pose, P = Presence only.
11.1.7 Road Detection

Road detection algorithms are useful in two ways: they help us to understand the structure of the streets, and at the same time can improve the localization of objects such as cars, pedestrians or signs (e.g. as in [406]). The early works on road detection mainly use edge features to identify the boundaries of the road. The detected boundaries are often compared with a pre-learned model to estimate the structure of the road [460, 461, 462, 463]. Depending on the type of the road, edge segmentation can be achieved by lane markings (in structured roads) [461, 462] or the color of the surface (in unstructured roads) [460, 463]. Some algorithms such as the one used in Stanley at DARPA 2005 [463] use LIDAR readings to localize the road surface prior to road boundary detection.

In contrast to traditional learning techniques, neural networks learn an explicit model of the road surfaces through successive generation and classification of convolutional features [464, 465, 466, 467]. However, generating enough annotated sample data for training is a daunting task because it requires pixel-wise ground truth binary masks. To deal with this problem, Laddha et al. [465] propose a technique that automatically generates ground truth annotations based on the information of the vehicle position, detailed map information (including the position of rigid objects) and the camera’s calibration parameters. Based on this knowledge, a 3D scene around the vehicle is constructed and is used for identifying a rough estimate of the road surface. The result is further refined using the color information in the scene by forcing the road segments to contain only colors that are coherent with its characteristic.

Another important factor in road detection, especially in autonomous driving applications, is the speed of processing. A number of approaches are proposed to reduce the computational load, such as classifying patches instead of pixels [466] or using smaller convolutional filter sizes and employing fully convolutional architectures [467].

Table 11 gives an overview of algorithms discussed in this section. A more in depth review of road detection algorithms can be found in [468].

Model Year Road Type Features Classification Sensor Type Data Type Loc. Output
UNSCARF [460] 1991 Unstructured roads
Color, Edge,
Road model
- Cam Col BD
VBRD [461] 1995 Streets Edge - Cam Gr BD
CBRD [462] 2004 Streets Color, Edge - Cam Col BD + SF
Stan [463] 2006 Unstructured roads Color, Edge K-means Cam + L Col + PC BD
DNN-SP [464] 2014 Streets Conv Neural Net Cam
Stanford Background,
SIFT Flow,
MSRD [465] 2016 Streets Conv Neural Net Cam KITTI SF
NiN [466] 2016 Streets Conv Neural Net Cam KITTI SF
Deep-MRS [467] 2016 Streets Conv Neural Net Cam KITTI SF
Table 11: A summary of road detection algorithms. Abbreviations: sensor type: Cam = Camera, L = LIDAR, data type: Col = Color, Gr = Greyscale, PC = Point Cloud, loc. type: BD = Boundaries, SF = Surface.

11.2 What the Pedestrian is Doing: Pose Estimation and Activity Recognition

In this section, we review two topics: pose estimation and activity recognition. In the context of pedestrian behavior understanding, pose plays an important role. On its own pose may imply the state of the pedestrian. For example, head pose or body posture may indicate that the pedestrian is intending to cross. In some applications, changes in pose are also used to understand one’s activity, i.e. it serves as a prerequisite to activity recognition. Activity recognition is also important for obvious reasons. For instance, it helps identifying someone’s walking direction towards the street or handwave to send a signal.

We start the discussion by listing some datasets publicly available for studying pose estimation and activity recognition. Then, for each topic, we discuss some of the known algorithms. As before, we try to cover a broad range of methods used in different applications, with a particular focus on the ones that can be potentially used in the context of pedestrian behavior understanding.

11.2.1 Datasets
Pose Datasets

Pose estimation datasets are collected in different ways. Some are gathered for sports scene analysis [469, 470] while the others are catered to a wider range of applications [471, 472]. The data type and the availability of ground truth annotations varies from one dataset to another. For instance, datasets such as Buffy [473] and Human Pose Evaluator Dataset (HPED) [474] are collected from TV shows and movies and only contain upper body pose information. The ones for sport or general scene understanding often come from various sport broadcasts videos [470], videos collected from the web [475] or generated by the researchers for special purposes [476].

The ground truth annotations that come with pose datasets are often in the form of joint locations and their connections [473, 477]. A few datasets include depth information and 3D joint positions [470] or body parts tags [478]. Table 12 shows some of the most common datasets for pose estimation.

Dataset Year Cat. Act. type Parts No. Class Cam. Pose Data Type No. Frames GT
Buffy [473] 2008 P M UT - F Img+Col 748 J
Buffy-Pose [479] 2009 P M UT 3 F Img + Col 245 BB,PL
LSP [469] 2010 P Sp Full - F Img+Col 2K J
VideoPose [480] 2011 P M UT - Mul Vid+Col 44 J
CAD-60 [481] 2012 P,A B Full 12 F D+Vid+Col 60 J,AL
KTHI [482] 2012 P Sp Full - Mul Img+Col 771 J
HPED [474] 2012 P M UT - F Vid+Col 6.5K SM
CAD-120 [483] 2013 P,A B Full 20 F D+Vid+Col 120 J,AL
FashionPose [484] 2013 P B Full - F Img+Col 7.5K J
KTHII [470] 2013 P Sp Full - Mul Vid+Img+Col 800+5.9K 3DJ
VGG [475] 2013 P Mix UT - F D+Vid+Col 172 J
FLIC [477] 2013 P M UT - F Img+Col 5K J
APE [485] 2013 P,A B Full 7 F D+Vid+Col 245 3DJ,AL
ChaLearn [486] 2013 P,A B Full 20 F D+Vid+Col 23Hr J,AL
PennAction [487] 2013 P,A Mix Full 15 F Vid+Col 2.3K J,AL
Human3.6M [488] 2014 P,A B Full 17 Mul D+Vid+Col 3.6M 3DJ,AL
PARSE [489] 2014 P,A Sp Full 14 F Syn+Img+Col 1.5K+305 J,AL
TST-Fallv1 [476] 2014 P B Full 2 Sky D+Vid 20 PL
FLD [472] 2014 P Mix Face - CU Img+Col 33K FP
MHPE [490] 2014 p B Full - Mul Vid+Col 2 J
PiW [491] 2014 P M UT - F Vid+Col 30 J
MPII [471] 2014 P,A Mix Full 491 F Img+Col 40K J,AL
TST-TUG [492] 2015 P B Full - F D+Vid+Col 20 J
HandNet [493] 2015 P B Hand - F D+Vid 100K J
SHPED [494] 2015 P Mix UT - F St+Img+Col 630K SM
VI-3DHP [495] 2016 P,A B Full 15 Mul D+Img 100K J,AL
UBC3V [478] 2016 P B Full - F D 210K BP
TST-Fallv2 [496] 2016 P,A B Full 264 F D+Vid 264 J,AL
Table 12: Pose estimation datasets. Abbreviations: categories: P = Pose, A = Activity, action Type: M = Movies, Sp = Sport, B = Basics (e.g. walking, sitting), parts: UT = Upper Torso, Full = Full body, camera Pose: F = Front view, Mul = Multi view, Sky = Sky view, CU = Close Up, data Type: Img = Image, Col = Color, D = Depth, Vid = Video, Syn = Synthetic, St = Stereo, ground Truth: J = Joints, AL = Activity Label, SM = Stickmen, PL = Pose Label, FP = Facial Points, BP = Body Pose
Activity Datasets

Activity recognition datasets often comprise temporal sequences and, similarly to pose datasets, are extracted from different sources including sport videos [497], movies [498] or are made for a particular application [499]. The ground truth annotations of these datasets are often in the form of activity labels with [495] or without temporal correspondence [500]. In addition, some datasets have explicit pose information as joint positions [481, 485], which makes them suitable for both pose estimation and activity recognition. Table 13 lists some common datasets for activity recognition. For a comprehensive list of activity recognition datasets including non-vision-based ones refer to [501].

Dataset Year Cat. Act. type Parts No. Class Cam. Pose Data Type No. Frames GT
CASIA Giat [502] 2001 A G Full 3 F Vid+Col 12 Dir
KTH [499] 2004 A B Full 6 F Vid+Gr 2.3K+ AL
ASTS [503] 2005 A B Full 10 F Vid+Col 90 AL
IXMAS [504] 2006 A B Mix 13 Mul Vid+Col 36 AL
Cam-HGD [505] 2007 A B H 9 F Vid+Col 900 PL
CASIA-Act [506] 2007 A B,Int Full 8,7 Mul Vid+Col 1446 BB,AL
UCF [497] 2008 A Sp Full 10 F ViD+Col 150 AL,BB
UCF-Crowd [507] 2008 A Grp Full 2 BeV Vid+Col 3 BB,Dir
LHA [508] 2008 A M Mix 8 Mul Vid+Col 32 AL
MAR-ML [509] 2008 A Sp Full 14 BeV Vid+Col 541 BB,AL
UCF-Aerial [510] 2009 A B Full 9 Sky Vid+Col 7 BB,AL
MSR [511] 2009 A B Full 20 F D+Vid+Col 20 AL
Col-AD [512] 2009 A Grp Full 5 F Vid+Col 44 BB,AL,Dir
PETS-Flow [513] 2009 A Grp Full 6 BeV Vid+Col 9 AL
Buffy-Pose [479] 2009 P M UT 3 F Img+Col 245 BB,PL
Holly2 [498] 2009 A M Mix 12 Mul Vid+Col 3.6K+ AL
UCF11 [514] 2009 A Sp Mix 11 Mul Vid+Col 1.1K+ AL
OSD [515] 2010 A Sp Mix 16 Mul Vid+Col 800 AL
MCFA [516] 2010 A Fall Full 24 Mul Vid+Col 24 0
BEHAVE [517] 2010 A Int Full 10 BeV Vid+Col 4 BB,AL
TV-HID [518] 2010 A M,Int Mix 4 Mul Vid+Col 300 AL
SDHA [519] 2010 A Int Full 6 BeV Vid+Col 20 AL
SDHA-Air [520] 2010 A B Full 9 BeV Vid+Col 108 BB,AL
Videoweb [521] 2010 A Int Mix 9 Mul Vid+Col 2.5Hr AL
Willow-Act [522] 2010 A B Mix 7 Mul Img+Col 968 AL
MUHAVI [523] 2010 A B Full 17 Mul Vid+Col 17 AL,PB
VIRAT [524] 2011 A B Full 12 BeV Vid+Col 8,5Hr BB,AL
UCF-ARG [525] 2011 A B Mix 10 Mul Vid+Col 480 BB,AL
HMDB [526] 2011 A Mix Mix 51 Mul Vid+Col 6.8K+ AL
UCF50 [527] 2012 A Mix Mix 50 Mul Vid+Col 6.6K+ AL
CAD-60 [481] 2012 P,A B Full 12 F D+Vid+Col 60 J,AL
ChaLearn [486] 2013 P,A B Full 20 F D+Vid+Col 23Hr J,AL
CAD-120 [483] 2013 P,A B Full 20 F D+Vid+Col 120 J,AL
JPL-FPI [528] 2013 A Int Full 7 FP Vid+Col 57 AL
UCF101 [529] 2013 A Mix Mix 101 Mul Vid+Col 13K+ AL
APE [485] 2013 P,A B Full 7 F D+Vid+Col 245 3DJ,AL
ChaLearn [486] 2013 P,A B Full 20 F D+Vid+Col 23Hr J,AL
MPII [471] 2014 P,A Mix Full 491 F Img+Col 40K J,AL
VAD [530] 2014 A Sp Full 7 F Vid+Col 6 AL
Sports-1M [500] 2014 A Mix Mix 487 Mul Vid+Col 1.1M AL
PARSE [489] 2014 P,A Sp Full 14 F Syn+Img+Col 1.5K+,305 J,AL
Human3.6M [488] 2014 P,A B Full 17 Mul D+Vid+Col 3.6M 3DJ,AL
Crepe [531] 2015 A C Full 9 F Vid+Col 6.5K AL,BB
ActivityNet [532] 2015 A B Mix 203 Mul Vid+Col 27K Al,TL
TST-Fallv2 [496] 2016 P,A B Full 264 F D+Vid 264 J,AL
VI-3DHP [495] 2016 P,A B Full 15 Mul D+Img 100K J,AL
Table 13: Activity recognition datasets. Abbreviations: categories: P = Pose, A = Activity, action Type: M = Movies, Sp = Sport, B = Basics (e.g. walking, sitting), C = Cooking, Int = interaction, G = Gait, Fall = Fall detection, Grp = Group parts: H = Hand, UT = Upper Torso, Full = Full body, camera Pose: F = Front view, Mul = Multi view, Sky = Sky view, CU = Close Up, FPer = First Person, BeV = Bird’s eye View, data Type: Img = Image, Col = Color, D = Depth, Vid = Video, Syn = Synthetic, St = Stereo, Gr = Grey, ground Truth: J = Joints, AL = Activity Label, PL = Pose Label, FP = Facial Points, BP = Body Pose, BB = Bounding Box, Dir = Direction, PB = Pixel-wise Binary, TL = Temporal Localization.
11.2.2 Pose Estimation

Pose estimation algorithms can be divided into two main groups: exemplar-based and part-based models [533]. The former approaches try to identify the pose as a whole whereas part-based models use local body part appearances and the connections between them to estimate the overall pose.

Early pose estimation methods are predominantly exemplar-based and rely on body templates to estimate the pose. Templates can be created using different techniques such as simple silhouette models of human body [534], real image samples pre-processed by applying some form of filters (e.g Gabor filters) [535] or 3D models of human body [536].

More recent algorithms are learning-based and estimate body poses by learning from examples. For instance, Ng and Gong [537] use a large dataset of greyscale head pose images, each corresponding to a specific head orientation. These images are then normalized and used to train an SVM model to learn the correspondences between the 2D images and 3D head poses. Likewise, Agrawal and Triggs [538] use a regression technique to learn the full 3D human body poses from a series of image silhouettes. The input to regression model is a 100D feature vector generated from the image silhouettes and the output is a 55D vector that estimates 3 angles for each of 18 body joints.

The exemplar-based models, however, suffer from a major drawback, that is they require a good match between the templates and proposals. This is problematic because it is often the case that a complete model of the body (or its component) is not retrievable from the image due to occlusion or background clutter. More importantly, even when a complete model of the body is available, if the ratios or the poses of the body components do not match entirely to those of the training images (e.g. legs are similar to some training images, but the arms are similar to the others), the algorithm will fail to estimate the pose.

Part-based models, as explained earlier, consist of two stages: learning human body part appearances, and determining the relationship between them. The body parts can be represented using simple edge maps and color distributions [539] or more complex descriptors such as HOG [540].

There are two ways to learn the relationships between body parts- explicit or implicit. The explicit techniques generally use tree-structured graphical models, such as CRF [539, 533], to learn the interdependencies between the connected parts. Here, the learning is in the form of a spatial prior that describes the relative arrangement of the parts both in terms of the location and orientation of the parts. Implicit techniques, on the other hand, learn the relationships between the parts directly from the data. Regression techniques such as SVM [540] or neural nets [541] are examples of such methods.

Part-based models also have a disadvantage that they only learn the relationships between parts locally and lack a global knowledge of how the overall human body, in a given pose, should look like. There are a number of solutions proposed to overcome this problem. Wang et al. [533] introduce a hierarchical part representation, which, in addition to 10 common body parts (e.g hands, torso, head), uses 10 intermediate part representations formed by combining basic human parts together, such as torso with right arm and hand, left arm with the left hand, all the way up to full human body. The appearance of every part representation is learned and used at the test time to improve the result. The authors claim that this form of representation is a bridge between the pure exemplar- and part-based methods.

To improve pose estimation even further, other types of body part representations are also investigated. For instance, Dantone et al. [484], in addition to human limbs, learn the joint models that connect the limbs together. The tree structure that describes the overall pose is based on the position of the joints starting from the nose position all the way down to the ankles. Cherian et al. [542], in addition to modeling the relationships between body joints, learn the temporal correspondence of the same joint across a number of frames. The potential position of a joint from one frame to another is measured via calculating the optical flow.

Pishchulin et al. [543] combine the methods of [533] and [484] and enhance the robustness of body part representations by learning both their absolute rotation appearance and their relative rotation appearance with respect to the overall body pose. To achieve the former, body part training data is divided into 16 bins (each corresponding to 22.5 degrees of rotation) and a detector is trained for each bin. The second component also performs a similar training procedure with the difference of normalizing the body part rotation with respect to the overall body pose.

Not surprisingly, the majority of the recent pose estimation models are based on CNNs in which the body part representations, as well as underlying relations between them, are all learned together under one framework. These models achieve state-of-the-art performance in both 2D [544] and 3D [545] pose estimation. In one of the early CNN-based techniques, DeepPose [546], the authors first train a network to localize the positions of the joints on the full body image. Then, in the next stage, a series of sub-images are cropped around the predicted joints from the first stage and are fed to another regressor for each joint. Using this method, the subsequent regressors see higher resolution images, and as a result, learn features at finer scales and ultimately achieve a higher precision. Employing a similar technique, in [547], instead of a two-stage learning, the authors use a two-stream network. In one stream the close-up views of body parts are learned and another stream captures a holistic view of the human body. The features learned from each stream are concatenated before the fully connected layers.

To better capture the spatial context around each joint, Wei et al. [544] use a multi-stage training scheme. In the first stage, the network only predicts part beliefs from local image evidence. The next stage accepts as input the image features and the features computed for each of the parts in the previous stage and applies convolutional filters with larger receptive fields. The increase in the receptive field of filters allows the learning of potentially complex and long-range correlations between parts.

Figure 29: The pose estimation architecture proposed in [548].

Pfister et al. [548] (see Figure 29) take advantage of temporal correspondences in videos to improve pose estimation. At each step, the poses in N frames before and after the given frame are estimated. Then, the spatial relationships between the joints are learned using a second network that takes as input the responses from the 3rd and 7th convolutional layers from the first network. Once the joint locations are optimized on all frames, the results are warped into the reference frame by calculating their optical flow. Next, the warped predictions (from multiple frames) are pooled with another convolutional layer (a temporal pooler) that learns how to weigh the warped predictions from nearby frames. The final joint locations are selected as the maximum of the pooled predictions.

In one of the most recent works, a deep learning method is introduced that is capable of robust estimation of the poses of multiple people in the scene

[549]. In this work, the authors extend the method in [544] to explicitly learn the associations between different parts. They use a descriptor known as Part Affinity Fields (PAFs), which is a set of 2D vector fields that encode the location and orientation of limbs over the image domain. These fields help to disambiguate the relationships between various detected parts which might belong to different people.

A summary of described models can be found in Table 14. For a more in depth review of pose estimation algorithms please refer to [550, 551].

Model Year Features Classif. Parts Data Type
HPE-GA [534] 1994 Templates - UT Img+Syn
RA-FPE [535] 1998 Gabor - Hd Img+Gr
SVM-FACE [537] 2002 Intensity SVM Hd Img+Gr
Hand-PE [536] 2003 3D Model - H Mul+Img+Gr
3D-HPE [538] 2004 Silhouette RVM Full Img+Col
Parse-AB [539] 2007 Color, Edge CRF Full Weizmann
Cluster-P [540] 2010 HOG SVM Full LSP
Poslet [533] 2011 HOG SVM Full UIUC,Sport
Body-Parts [484] 2013 HOG,Color Dec. Forest Full FashionPose
ESM [543] 2013 HOG SVM Full Parse
MBP [542] 2014 HOG,Optical Flow SVM UT VideoPose,MPII,PiW
MDL [541] 2014 HOG SVM, Neural Net Full LSP, UIUC,PARSE
DeepPose [546] 2014 Conv Neural Net Full FLIC, LSP
DS-CNN [547] 2015 Conv Neural Net Full FLIC, LSP
Flow-CNN [548] 2015 Conv Neural Net Full FLIC, ChaLearn, PiW, BBC
CPM [544] 2016 Conv Neural Net Full MPII, LSP, FLIC
SMD [545] 2016 Conv Neural Net Full Human3.6M, PennAction
PFA [549] 2017 Conv Neural Net Full MPII,COCO
Table 14: A summary of pose estimation algorithms. Abbreviations: Parts: H = Hand, UT = Upper Torso, Full = Full body, Hd = Head, Data Type: Img = Image, Col = Color, Syn = Synthetic, Gr = Grey.
11.2.3 Action Recognition

Activity recognition is perhaps one of the most studied fields in computer vision. The amount of work done in this field is apparent from a long list of survey papers published throughout the past decades, such as surveys on hand gesture recognition [552], general vision-based action recognition [553, 554], video surveillance [555], action recognition using 3D data [556], action recognition in still images [557], semantic action recognition [558], action recognition using deep networks [559] or hand-crafted feature learning [560]. Some works also present extensive evaluation of the state-of-the-art on popular datasets in the field [532].

Depending on the difficulty of the task, activity recognition algorithms come in 4 types (in the order of difficulty from easiest to hardest): gesture recognition, action recognition, interaction recognition and group activity.

Gesture recognition is commonly used in close encounters for applications such as Human Computer Interaction (HCI). The process often involves the recognition of the hand, and identifying its transformation in terms of pose and velocity in temporal domain [561]. Given that gesture recognition is very application specific and the fact that activity recognition algorithms capture the gesture changes to some degree, we will focus our discussion mainly on other three categories of activity recognition.

Action recognition algorithms, in their simplest form, often rely on some form of template matching to identify a particular action pattern. Template matching can be done in spatial domain by identifying certain body forms (e.g. pedestrian legs) to infer a type of action (e.g. walking) [562]. More sophisticated algorithms use templates generated in spatiotemporal domain [563, 564, 565]. For instance, Niyogi and Adelson [563] detect a pedestrian in the image sequence using background subtraction. They then cut through the temporal domain and generate a 2D image, which reflects the temporal changes (at a given height of the pedestrian). This 2D image is compared with a pre-learned template to realize the gait of the pedestrian.

Bobick and Davis [564] introduce two types of features, motion-energy image (MEI) and motion-history image (MHI), to characterize human activities through time. MEI is a binary image that shows instantaneous motion pattern in the sequence. This image is generated by simply aggregating the motion patterns through time in a single 2D image. MHI, on the other hand, is a scalar-valued image in which the value of each pixel is a function of time where pixels corresponding to more recent movements have higher intensity values compared to the rest. The combination of these two images forms a feature vector, which in turn can be compared with pre-learned features to identify the action. An extension of this approach for more realistic scenes is employed in [565], where, in addition to temporal changes, color features are used to separate the human body from the background. To deal with occlusion, the templates are divided into parts (e.g. upper body and legs) and are identified separately at test time.

The major drawback of the template-based approaches is that they often make simplistic assumptions about the environment setting. For example, in some works the height or velocity of the pedestrian is assumed to be fixed [563], or the camera position is considered fixed with minimal background motion [564]. In practice, such assumptions can significantly constrain the applicability of these approaches to complex and cluttered scenes.

Learning algorithms are also very popular tools for action recognition both in still images [566, 567] and videos [568, 569]. For instance, in still images, human pose is often used to recognize actions [566, 567, 570]. Ikizler et al. [566] use the pose detection algorithm in [539] to identify body parts. The orientation of body parts is measured and quantized in a histogram to form a descriptor of the image. The descriptors for each activity is then learned using a linear SVM algorithm. In addition to the pose, in [567], the authors learn the relationship between the pose and certain objects (e.g. tennis racquet) via a graphical model. Besides human pose, Delaitre et al. [522] investigate the role of 2D features, e.g. SIFT, in conjunction with popular regression techniques such as SVM. The authors show that for simple action recognition tasks, using 2D features can result in state-of-the-art performance.

Similar to action recognition in still images, video-based models take advantage of various learning techniques such graphical models (e.g. Bayesian networks) and regression algorithms (e.g. SVM). For instance, Yamato

et al. [571]

use a Hidden Markov Model (HMM) for inference of action types in videos. The authors train a separate HMM for each action category (6 common actions in tennis). The regression based models often use features that are generated from optical flow information

[572] or extracted from spatiotemporal domain [499, 573]. For example, Schuldt et al. [499] use local space-time features to characterize the critical moments in an observed action. These features are generated by computing second-moment matrix using spatiotemporal image gradients within a Gaussian neighborhood around each point and selecting positions with local maxima. Blank et al. [573] employ a similar approach to [564] by characterizing the entire space-time volume resulting from the action. Here, every internal space-time point is assigned a value that reflects its relative position within the space-time shape (the boundaries of the volume).

Today, action recognition algorithms are moving towards neural networks. In its simplest form, a CNN, using 3D convolutional filters, can infer actions from a stacked image sequence [574]. This method can be further improved by performing the convolutions at multiple resolution levels to extract both fine and coarse features. The authors of [575] investigate such a technique by using a two-stream network. One stream receives as input the image in its original resolution (context-stream) and the other a zoomed cropped portion of the image from the center (fovea-stream). The output of these two networks is then fused for final classification.

To achieve a better representation, more recent algorithms learn spatial and temporal features separately [576, 577, 569, 578]. These models typically use a two-stream architecture, in which one stream receives a color image at time and another stream takes a stack of temporal representations, often in the form of optical flow measurements for frames before and after the image at time . The outputs of the streams are then fused prior to final classification. The fusion process might be at the early (e.g. after 3rd convolutional layer) or late (e.g. right before fully connected layers) stages.

Some CNN-based algorithms implement graphical reasoning using Recurrent Neural Nets (RNNs) or its variants such as popular Long Short Term Memory (LSTM) models [569, 579]. LSTM networks are equipped with a form of internal memory, which allows them to learn the dependencies between the consecutive frames. The internal control of LSTMs is through the use of gates, whereby the contribution of new information and past learning for final inference is determined. For instance, Yue et al. [569] propose a method for learning spatial and temporal features using two common CNN models (e.g. GoogleNet). The outputs of the networks prior to the loss layers are fused and fed to a 5-stack LSTM network for final inference.

Figure 30: The transformation framework proposed in [578].

Other methods of inference using neural nets have also been investigated [578, 580]. For example, Wang et al. [578] (see Figure 30

) treat the action recognition problem as learning the transformation between the precondition frames (typically from the beginning of the video) and the effects (a portion of the frames illustrating the consequence of the action). The authors use a Siamese network architecture which receives as input two streams of frames (i.e. the precondition and effect frames), learns their corresponding characteristics by applying a series of convolutional and fully connected layers, applies a linear transformation to the output of precondition stream, and finally measures the similarity in terms of distance between the transformed output of the precondition stream with the effect one. Here, the objective of learning is to find a transformation matrix that minimizes the distance between the two streams of the network. Using this approach, the authors argue that the overfitting problem that exists in other approaches due to scene context learning can be avoided because the networks are explicitly forced to encode the change in the environment, i.e. the inherent reason that convinced the agent to perform the action.

Interaction recognition algorithms [581, 582] often infer the type of activity by directly estimating the relationship between the subjects’ body parts. For instance, Park et al. [581] first identify the body parts of the humans in the scene, and then, using a graphical model, infer their interaction using the distance between each person’s body parts and the corresponding one in another person as well as the overall orientation of their entire bodies.

Similar direct approaches have been used in group activity recognition. Choi et al. [512], for instance, use so called spatiotemporal local (STL) features to characterize individuals in the scene. This descriptor, in essence, is a histogram that records the following factors with respect to the reference person: the number of people surrounding the person, their relative pose and distance to that person. These descriptors are generated for each frame and concatenated for the entire video. At the end, using a linear classifier such as SVM, the descriptors are classified into different group activities.

A number of algorithms use two-stage inference where, first, the actions of individuals are recognized, and then their spatial formation and dynamic changes are used to recognize the overall activity [583, 584]. For example, in [584] a neural net architecture is proposed. This network, by applying a series of ConvNets, first estimates the pose and action of each individual as well as the class of the scene (e.g. fall event). Next, the output of these networks is passed to a message passing network that learns the semantic dependencies between the inputs. At the end, the output of the message passing network facilitates the refinement of estimations for individual activities. For instance, a person standing in a queue might be identified as standing, but after taking into account the entire scene, it would be relabeled as queuing. For a summary of algorithms reviewed in the section please refer to Table 15.

Model Year Features Classif. Act. Type Data Type No. Class
Mesh-HMM [571] 1992 Mesh features HMM Act Vid+Gr+FiP 6
XYT [563] 1994 Motion,Template - Act Vid+Gr+FiP 1
3D-Gesture [561] 1996 Position, Velocity HMM Gest St+Vid+Gr+FiP 19
Walk-Ped [562] 2000 Template, Edges - Act Vid+Gr 1
Aerobics [564] 2001 Template, Motion - Act Vid+Gr+FiP 18
Act-Dist [572] 2003 Optical flow KNN Act Vid+Col 30
Two-Int [499] 2003 Pose BN, HMM Int Vid+Col+FiP 9
Local-SVM [499] 2004 ST SVM Act RHA 6
Space-time [573] 2005 ST KNN Act ASTS 9
EDCV [565] 2007 Template, Optical flow - Act Vid+Col 6
RA-Still [566] 2008 Pose SVM Act Img+Col 6
CAC-STP [512] 2009 Pose, ST SVM Grp Vid+Col 5
STRM [582] 2009 ST K-means Int Vid+Col+FiP 6
MMC-OHP [567] 2010 Shape context Graphical Act Img+Col 6
Bag-AR [522] 2010 SIFT SVM Act Willow-Act 7
RHA-SI [570] 2010 Pose SVM Act Img+Col 5
Social [583] 2012 HOG SVM Grp Vid+Col 3
3D-Conv [574] 2013 Conv Neural Net Act KTH 6
LSVC [575] 2014 Conv Neural Net Act Sports-1M,UCF101 588
Two-Stream [576] 2014 Conv Neural Net Act UCF101, HMDB 152
Deep-Struct [584] 2015 Conv Neural Net Grp Col-AD 5
Deep-Snippets [569] 2015 Conv Neural Net Act Sports-1M,UCF101 487
SIM [585] 2016 Conv Neural Net Grp Col-AD 5
Early-Det [579] 2016 Conv Neural Net Act ActivityNet 203
Act-Transform [578] 2016 Conv Neural Net Act ACT 43
Two-Stream-3D [577] 2016 Conv Neural Net Act UCF101, HMDB 152
AdaScan [580] 2017 Conv Neural Net Act HMDB,UCF101 152
Table 15: A summary of action recognition algorithms. Abbreviations: Features: ST - Spatiotemporal features, Act. Type: Gest = Gesture, Act = Action, Int = Interaction, Grp = Group, Data Type: Img = Image, Col = Color, Vid = Video, Gr = Grey, FiP = Fixed Position.

11.3 What the Pedestrian is Going to Do: Intention Estimation

11.3.1 Datasets

In the past few decades, a number of large-scale datasets have been collected in different geographical locations to study pedestrian and driver behavior in traffic scenes. Some of the most well-known datasets are 100-car Naturalistic Study [141], UDRIVE [142], SHRP2 [586], and many more [587, 588]. Although these datasets provide invaluable statistics about various factors that influence the behavior of road users, the raw videos or any information that can be used for visual modeling is not publicly available.

A number of existing vision datasets can be potentially used for intention estimation such as activity recognition and object tracking datasets [589, 513] or pedestrian detection datasets that have temporal correspondences [322, 315]. The drawback of these datasets is that besides bounding box information, they only contain a limited (if any) number of annotations for contextual elements such as street structure, activities, group size, signals, etc. Therefore they are mainly suitable for predicting behavior based on pedestrians’ dynamics.

The number of datasets that are specifically tailored for intention estimation applications is very small. One of the main datasets suitable for behavioral studies with visual information is Joint Attention in Autonomous Driving (JAAD) [590]. This dataset comprises 346 high resolution videos with bounding box information for pedestrians. A subset of the pedestrians (over 600) are annotated with behavioral data, such as whether the pedestrians are looking, walking, crossing or waving hands, as well as those pedestrians’ demographic information. In addition each frame in the video sequences is annotated with contextual information such as street delineation, street width, weather, etc. Daimler-Path [591] is a dataset designed for pedestrian path prediction. It contains 68 greyscale video samples of pedestrians from a moving car perspective. The data is collected using a stereo camera and is annotated with vehicle speed, bounding box information and 4 pedestrian motion types including crossing, stopping, starting to walk and bending in.

Another dataset for pedestrian intention estimation is Daimler-Intend [592] that comes with 58 greyscale stereo video sequences annotated with bounding box information, the degree of head orientation, pedestrian distance to the curb and the vehicle as well as the vehicle’s speed.

11.3.2 Pedestrian Intention Estimation

As discussed earlier, intention estimation is a broad topic and can be applied in various applications. In intelligent driving systems alone, intention estimation techniques have been widely used for predicting the behavior of the drivers [593, 594], other drivers [595, 596], pedestrians [597, 598] or combinations of any of these three [599, 600] (for a more detailed list of these techniques see [601]). In this section, however, we particularly discuss pedestrian intention estimation methods in the context of intelligent transportation systems with a few mentions of some of the techniques used in mobile robotics.

Essentially, intention estimation algorithms are very similar to object tracking systems. One’s intention can be estimated by looking at their past and current behavior including their dynamics, current activity and context. In autonomous driving, for example, we want to estimate the next location where the pedestrian might be appearing, realize whether they attempt to cross, or predict the changes in their dynamics.

There are a number of works that are purely data-driven, meaning that they attempt to model pedestrian walking direction with the assumption that all relevant information is known to the system. These models either base their estimation merely on dynamic information such as the position and velocity of pedestrians [602], or in addition, take into account the contextual information of the scene such as pedestrian signal state, whether the pedestrian is walking alone or in a group, and their distance to the curb [603]. In a work by Brouwer et al. [604], the authors investigate the role of different types of information in collision estimation. More specifically, they consider the following four factors: dynamics (directions pedestrian can potentially move and time to collision), physiological elements (pedestrian’s moving direction and distance to the car and velocity), awareness (in terms of head orientation towards the vehicle), and obstacles (the position of obstacles in the scene and sidewalks). The authors show that, in isolation, second and third factors are the best predictors of collision. They also show that by fusing the information from all four factors, the best prediction results can be achieved.

The vision-based intention estimation algorithms often treat the problem as tracking a dynamic object by simply taking into account the changes in the position, velocity and orientation of the pedestrian [153, 605] or by considering the changes in their 3D pose [606]. For instance, Kooij et al. [597] employ a dynamical Bayesian model, which takes as input the current position of the pedestrian and, based on their motion history, infers in which direction the pedestrian might move next. In addition to pedestrian position, Volz et al. [607] use information regarding the pedestrian’s distance to the curb and the car as well as the pedestrian’s velocity at the time. This information is fed to a LSTM network to infer whether the pedestrian is going to cross the street or not.

In robotics, intention prediction algorithms are used as a means of improving trajectory selection and navigation. Besides dynamic information, these techniques assume a potential goal for pedestrians based on which their trajectories are predicted [608, 609]. The drawback of these algorithms is that they assign to pedestrians a predefined goal location, which is based on the current orientation of the pedestrian towards one of the potential goal locations.

Figure 31: Pedestrian’s head orientation as the sign of awareness. Source: [592].

In some recent works, social context is exploited to estimate intention. For instance, pedestrian awareness is measured by the head orientation relative to the vehicle (see Figure 31) [592, 610, 611]. Kooij et al. [592] employ a graphical model that takes into account factors such as pedestrian trajectory, distance to the curb and awareness. Here, they argue that the pedestrian looking towards the car is a sign that he noticed the car and is less likely to cross the street. This model, however, is based on a scripted data which means that the participants were instructed to perform certain actions, and the videos were only recorded in a narrow non-signalized street.

For intention estimation, some scholars also consider social forces, which refer to people’s tendency to maintain a certain level of distance from one another. In their simplest form, social forces can be treated as a dynamic navigation problem in which pedestrians choose the path that minimizes the likelihood of colliding with others [612]. Social forces also reflect the relationship between pedestrians, which in turn can be used to predict their future behavior. For instance, Madrigal et al. [613] define two types of social forces: repulsion and attraction. In this interpretation, for example, if two pedestrians are walking close to one another for a period of time, it is more likely that they are interacting, therefore the tracker estimates their future states close together.

Apart from the explicit tracking of pedestrian behavior, a number of works try to solve the intention estimation problem using various classification approaches. Kohler et al. [598], via using a SVM algorithm, classify pedestrian posture as about to cross or not crossing. The postures are extracted in the form of silhouette body models from motion images which are generated by background subtraction. In the extensions of this work [614, 615], the authors use a HOG-based detection algorithm to first localize the pedestrian, and then using stereo information, they extract the body silhouette from the scene. To account for the previous action, they perform the same process for N consecutive frames and superimpose all silhouettes into a single image. The final image is used to classify whether the pedestrian is going to cross.

Rangesh et al.[616] estimates the pose of pedestrians in the scene, and identifies whether they are holding cellphones. The combination of the pedestrians’ pose and the presence of a cellphone is used to estimate the level of pedestrians engagement in their devices. In [617], the authors use various contextual information such as characteristics of the road, the presence of traffic signals and zebra crossing lines, in conjunction with pedestrians’ state to estimate whether they are going to cross. In this method, two neural network architectures are used. One network is responsible for detecting contextual elements in the scene and the other identifying whether the pedestrian is walking/standing and looking/not-looking. The scores from both networks are then fed to a linear SVM to classify the intention of the pedestrians. The authors report that by taking into account the context, intention estimation accuracy can be improved by upto 23%.

In addition to appearance-based models, Volz et al. [618] use pedestrian velocity, and distance to the curb, crosswalk and the car. Schneemann et al. [619] consider the structure of the street as a factor influencing crossing behavior. The authors generate an image descriptor in the form of a grid which contains the following information: street-zones in the scene including ego-zone (the vehicle’s lane), non-ego lanes (other street lanes), sidewalks, and mixed-zones (places where cars may park), crosswalk occupancy (the position of scene elements with respect to the current position of the pedestrians), and waiting area occupancy (occupancy of waiting areas such as bus stops with respect to the pedestrian’s orientation and position). Such descriptors are generated for a number of consecutive frames and concatenated to form the final descriptor. At the end, an SVM algorithm is used to decide how likely the pedestrian is to cross. Despite its sophistication for exploiting various contextual elements, this algorithm does not perform any perceptual tasks to identify the aforementioned elements and simply assumes they are all known in advance.

In the context of robotic navigation, Park et al. [620] classify observed trajectories to measure the imminence of collisions. The authors recorded over 2.5 hours of videos of the pedestrians who were instructed to engage in various activities with the robot (e.g. approaching the robot for interaction or simply blocking its way). Using a Gaussian process, the trajectories were then classified into blocking and non-blocking groups.

Table 16 summarizes the papers discussed in this section.

Model Year Factors Inference Pred. Type Data Type Cam Pose
LTA [612] 2009 PP, PV,G, SC GD Traj Vid+Col BeV+FiP
Early-et [598] 2012 PPs SVM Cross Img+Col F+FiP
IAPA [608] 2013 PP,PV,G MDP Traj,Cross Vid+Col F
Evasive [614] 2013 PPs,MH SVM Cross St+Vid+Gr F+FiP
Early-Pred [621] 2014 PP,PV SVM Traj Vid+Gr Mult+FiP
Veh-Pers [597] 2014 PP,PV,VD BN Traj Vid+Gr F
Context-Based [592] 2014 SS,HO,PP,VD BN Cross Daimler-Intend F
Intent-Aware [613] 2014 PP,PV,SC BN Traj PETS F
Path-Predict [606] 2014 PP,PV,Pose BN Traj, Pose Vid+Col F
MMF [602] 2015 PP,PV,SC CRF Traj Daimler-Path F
Intend-MDP [609] 2015 PP,PV,G,VD MDP Traj Vid+Col+L F
SVB [615] 2015 MH,PPs SVM Cross St+Vid+Gr F+FiP
PE-PC [603] 2015 GS,PP,PV,Si BN Traj, Cross Vid+Gr F
Traj-Pred [605] 2015 PP,PV PF Traj Vid+Col F
FRE [618] 2015 PV,DC,DCr,DV,VD SVM Cross Vid+L F
Eval-PMM [604] 2016 PP,PV,HO BN Traj Vid+Col F
ECR [610] 2016 PP,PV,HO BN Collision Caltech, Daimler-Mono F
HI-Robot [620] 2016 MH GP Collision Vid+L F
CBD [619] 2016 PP,SS,MH SVM Cross Vid+Col F
DDA [607] 2016 DC,DCr,DV,VD Neural Net Cross Vid+L F
DFA [611] 2017 PP,PV,MH DFA Cross Vid+I F
Cross-Intent [617] 2017 PPs,SS,HO
Neural Net,
Cross JAAD F
Ped-Phones [616] 2018 PPs SVM,BN Pose Vid+Col F
Table 16: A summary of intention estimation algorithms. Abbreviations: Factors: PP = Pedestrian Position, PV = Pedestrian Velocity, SC = Social Context, PPs = Pedestrian Posture, SS = Street Structure, MH = Motion History, HO = Head Orientation, G = Goal, GS = Group Size, Si = Signal, DC = Distance to curb, DCr = Distance to Crosswalk, DV = Distance Velocity, Vehicle Dynamics = VD, Inference: GD = Gradient Decent, PF = Particle Filter, GP = Gaussian Process, Pred. Type: Traj = Trajectory, Cross = Crossing, Data Type: Img = Image, Col = Color, Vid = Video, Gr = Greyn, St = Stereo, I = Infrared Cam Pose: F = Front view, BeV = Bird’s Eye View, Mult = Multiple view, FiP = Fixed Position.

12 Reasoning and Decision Making: How Pedestrians Analyze the Information

Discussing reasoning as a separate topic is difficult because it is involved in every aspect of an intelligent system. For instance, as was shown in the previous sections, visual perception algorithms employ various types of reasoning whether for data acquisition (e.g. active recognition) or processing the sensory input (e.g. the use of time series for activity recognition). As a result, in this section, we will discuss reasoning at a macro level and review a number of control architectures used by intelligent robots. These architectures include all the subtasks required for accomplishing a task, ranging from perception to actuation of the plans.

Finding a common definition for all types of reasoning used in practical systems is a daunting task. Primarily, this is due to the fact that the terminology and grouping of reasoning approaches vary significantly in different fields of computer science such as AI, robotics or cognitive systems. Therefore, we leave the detailed definitions of different AI systems to AI experts and instead will briefly discuss a number of control and planning architectures designed for actual autonomous robots or vehicles.

12.1 Decision Making in Autonomous Vehicles

Figure 32: A general control architecture for autonomous vehicles. Source: [622]

At the very least a practical intelligent robot must have three sub-modules including perception, decision-making (planning)555The term may vary in different literature and terms such as decision-making, planning or action selection are used interchangeably. and control [623]. Note that some architectures such as Subsumption [624], that merely rely on bottom-up sensory input, have achieved intelligent behavior, such as moving or navigating in an environment, to some extent. Here, our focus is rather on intelligent vehicles and the submodules necessary for their proper functioning.

In addition to the three main modules mentioned above, some systems include a monitoring component that oversees the performance of other modules and intervenes (e.g. by changing operating constraints) if a failure is detected [625]. Given the inherent complexity of the autonomous driving task, more recent architectures further divide these components into sub-modules [622] (see Figure 32). For instance, data acquisition can be decoupled from perception and actuation can be separated from control.

The heart of a robotic control architecture is the decision-making module, which is responsible for reasoning about what control commands to produce given the perceptual input and the mission’s objectives. Decision-making has several sub-components including global planning, behavioral planning, and local planning. Global planning can be seen as strategic planning in which the abstract mission goals are translated into geographical goals (mission planning) and then, by taking into account the path constraints, a route is specified (route planning) [625]. The behavioral planning module (sometimes called behavioral execution module) is responsible for guaranteeing the system adherence to various rules of the road, especially those concerning structured interactions with other traffic and road blockages. Examples include selecting an appropriate lane when driving in an urban setting or handling intersections by observing other vehicles and determining who has the right of way [626]. Last but not least is the local planning module which is in charge of executing the current goal. This module generates trajectories and deals with dynamic factors in the scene (obstacles avoidance) [622].

12.2 Knowledge-based Systems

Now the question is, how can reasoning in each level of planning be performed? Classical models use knowledge-based systems (KBS) to deal with reasoning tasks. KBS typically have two major components, a knowledge-base and an inference engine [627]. Here, knowledge is represented in the form of rules and facts about the world. The inference engine task is then to use the knowledge base to solve a given problem often by repeating selection and application of the knowledge to the problem. There are many instances of successful applications of KBS to practical systems such as traffic monitoring and control [628] and autonomous driving [626] (e.g. for traffic rules).

Knowledge-based systems, however, have a number of limitations. One of the major ones is in the knowledge elicitation process in which expert knowledge is transformed into rules. This process can be challenging because it requires special skills and often takes many man-years to accomplish. In addition, KBS require an explicit model of the world, which in practice makes them less flexible when dealing with new problems [629]

. Such limitations gave rise to case-based systems (CBS) that learn new knowledge as they experience new phenomena. CBS, on their own or in conjunction with rule-based systems

[630], show more practically feasible performance in complex systems.

The core of CBS is a library of cases which is acquired by the system through experience. In this context, a case comprises three components [629]: the problem that describes the state of the world, the solution to that problem, and/or the outcome which describes the state of the world after the case has occurred. CBS involve four processes: retrieve the most similar case, reuse the case to attempt to solve the problem, revise the proposed solution (typically if outcome is seen), and retain the new solution as a part of a new case.

Figure 33: a) The hierarchical structure of the case base. Solid arrows denote specialization between the cases, and dashed lines link the cases at the same level of specialization. b) the temporal linkage between the cases.

CBS have been used in robotic applications such as navigation and obstacle avoidance [631, 632] and autonomous driving [633]. For instance, Vacek et al. [633] propose a hierarchical structure where each case is built by producing a description of the scene. To do so, first, the perception module identifies the features (e.g. signs, road users, etc.) in the scene. Then, depending on the criticality of each feature, a value between 0 and 1 is associated with it. For example, a person standing on the sidewalk is not as critical as an approaching car from the opposite direction. The combination of all features forms the final description of the situation which in turn characterizes the case. The resulting cases are then stored hierarchically in the memory. Depending on their specialization level, cases are connected either vertically or horizontally (see Figure (a)a). In addition, there are temporal linkages between cases that highlight the evolution of a case throughout the time (Figure (b)b).

12.3 Probabilistic Reasoning Systems

One particular challenge for reasoning in autonomous robotic applications is dealing with uncertainties. This problem arises when the belief of the system about the world is incomplete which can be caused by numerous factors such as noise in sensory input. To tackle this problem, probabilistic approaches such as Bayesian inference


and Markov decision processes (MDPs)

[608, 609] are widely used. In robotic control applications, MDPs and their variants such as partially observable MDPs (POMDP), are widely used. An MDP represents a domain by a finite set of states, a finite set of actions, and an action model, which specifies the probability of performing each action in the given state that results in a transition to the next state [635]. For instance, in [609], a POMDP is used to control the speed of the vehicle along the planned path while taking into account the behavior of pedestrians. In this formulation, each state contains the position, orientation and instantaneous speed of the vehicle as well as the pedestrian’s position, goal and instantaneous speed. The actions to choose from are discretized into ACCELERATE, MAINTAIN and DECELERATE.

13 Visual Attention: Focusing on Relevance

The role of attention in optimizing visual processing tasks such as visual search is undeniable [636, 207]. Practical systems are no exception and are shown to benefit from attention mechanisms which reduce the processing complexity of perceptual tasks, and as a result save computational resources. This, in particular, is important for practical systems such as robots or autonomous cars which are dealing with complex real-world scenarios and have limited resources [637].

Given the importance of visual attention in machine vision, during the past decades a great deal of works have been done on developing computational attention models for applications such as object detection and recognition [638], image captioning [639] and robotics [640].

There are a large number of computational attention models (mainly on visual saliency), the majority of which are generic in a sense that they can be used in different applications simply as they are or with some minor tuning for particular tasks. Going into details and discussing the body of work on computational attention models is beyond the scope of this report. As a result, we limit our discussion to first, describing the categories of visual saliency models and some of the common techniques to develop them, and second, reviewing some of the techniques designed for intelligent driving systems. At the end we briefly argue whether the current approaches to visual attention suffice for autonomous driving applications.

13.1 Computational Models of Visual Attention

In computer vision community, the majority of the work on attention have been realized in the form of generating visual saliency maps. Depending on the objective of saliency and the way the image is processed, visual saliency algorithms can be divided into two categories, bottom-up and top-down algorithms [641].

As the name implies, bottom-up algorithms are data driven and measure saliency based on detecting the regions of the image that stand out in comparison to the rest of the scene. The saliency generated by these maps can be either object-based (identifying the dominant objects) or fixation-based (predicting human eye fixations).

To generate bottom-up saliency maps, input images are commonly decomposed into features, and then the distribution of these features is measured, either locally or globally, to identify the uniqueness of a given region in the image. The types of features used for this purpose include superpixels [642, 643], low-level features such as color and intensity [644]

, higher level learned features such as Independent Component Analysis (ICA)

[645], or in more recent works convolutional features using deep learning techniques [646, 647].

In contrast to bottom-up algorithms, top-down saliency models identify saliency as the regions with similar properties to a specific object or task [648]. These algorithms often use features such as SIFT in conjunction with learning techniques such as SVM [649], conditional random fields (CRF) [650] or neural networks [651] to determine the presence of the object of interest based on a combination of pre-learned features.

Some top-down algorithms take into account the nature of the task to generate saliency maps. Such works define the task as a search problem in which the objective is to find regions with similar properties to the object of interest. To achieve this some techniques use neural network architectures that are trained on different object classes. At the run time, the task (i.e. object class) is provided to the network, for instance in the form of object exemplars [651] or labels [652], and in return the network highlights areas that have similar properties to the object.

Besides static images, a number of models also estimate saliency in videos. Similar to the previous techniques, these models are either data driven in which, for example, optical flow generated based on image sequences is used to characterize the scene [653] or are task driven in a sense that they attempt to learn the human gaze patterns in various scenarios [654].

For further details on general saliency algorithms please refer to [655, 208, 656] and for 3D visual saliency in robotics refer to [637].

13.2 Attention in Intelligent Driving Systems

The use of attention mechanisms in autonomous driving goes as far back as the early 90s in the work of Ernst Dickmanns and his colleagues [24]. Their autonomous van, VaMoRs, was equipped with a dual-focal length camera mounted on a pan-tilt unit. Using this system, the wide angle image was used to analyze global features such as the road boundaries, whereas, the enlarged image was used for focusing on objects and obstacles ahead. The pan-tilt unit allowed the vehicle to maintain its focus of attention on the center of the road in the cases when it was turning or entering a tight curve. The primary purpose of this attention mechanism was to deal with the limitations of camera sensors (e.g. narrow viewing angle) at the time. Today, the new technological advancements in designing wide view angle cameras make the use of such mechanisms obsolete.

In more recent years, computational attention models have found their way into intelligent assistive driving systems. These works commonly rely on some form of inside looking cameras to monitor the changes in the drivers attention, and if necessary, alert the driver or control the vehicle in the case of inattention. The types of sensors used in such systems may vary ranging from a single looking camera [657] to a distributed network of cameras for high resolution 3D face reconstruction and tracking [658, 659].

Some assistive driving systems go one step further and match the focus of driver’s attention to objects surrounding the vehicle to make better sense of the situation [660, 139]. For example, Tawari et al.[139] use a head mounted camera and a Google Glass (which records eye movements) to simultaneously detect the object of interest (in this case a pedestrian) and determine whether it is in the center of the driver’s attention.

More recent works attempt to simulate human attention patterns in autonomous driving systems. The focus of these works is on developing saliency models to imitate human fixation patterns, and consequently identify regions of interest for further processing [661, 662, 663].

In [661], the authors recorded the fixation patterns of 40 subjects with various driving background by showing them 100 still traffic images. Based on their observation, they concluded that the majority of the subjects fixated on the vanishing point of the road. Using this knowledge, they propose a saliency algorithm consisting of a bottom-up algorithm with a top-down bias on the vanishing point of the road.

Since still images are not proper representatives of driving task, more recent algorithms rely on video datasets for estimating saliency. One such dataset is Dr(Eye)ve [664], which is a collection of 74 video sequences of 5 minute long, each recorded from 8 drivers during actual driving experience. The dataset contains 550K frames and is accompanied with fixations of the drivers as well as GPS information, car speed and car course. In a subsequent work [662], the scholars behind Dr(Eye)ve show how using an off-the-shelf CNN algorithm trained on their data can predict human fixation patterns in traffic scenes. Tawari and Kang [663] further improve fixation predictions on Dr(Eye)ve by taking into account the task of driving. In their algorithm, the authors used yaw rate, which is an indicative of events such as turning, as a prior to narrow the predictions to more relevant areas.

13.3 Are the Vehicles Attentive Enough?

Since attention plays a key role in optimizing visual perception in the machine, in this section we briefly discuss the limitations of computational attention models in autonomous driving and point out possible future directions.

Data collection limitations

One problem with collecting fixation data is the reproducibility of driving conditions for the drivers, given the dynamic nature of the environment. As a result, driving fixation datasets such as Dr(Eye)ve [664] provide only one instance of driving on a given road.

Having only one fixation sample is problematic and is not representative of general human fixation patterns. As discussed earlier in Section 4.3 the allocation of attention is highly subjective, meaning that it depends on the characteristics of the driver (e.g. novice vs expert [160] or culture [665]).

Another potential issue is with the way fixations are recorded These datasets often record the eye movements of drivers which is a form of overt visual attention. However, going back to psychological studies on human attention, we see that attentional fixations can be covert meaning that humans often internally focus on a particular region of a scene without any explicit eye movement [206, 208]. This fact was also evident in the studies of visual attention in driving where it was argued that expert drivers tend to use their peripheral vision to focus on objects, reducing the need for eye movements [160]. As one would expect, measuring covert fixation patterns is extremely difficult and not recorded in current datasets.

Last but not least, the content of the datasets is not representative of complex driving scenarios. Dr(Eye)ve [664], which to the best of our knowledge is the only publicly available driving fixation dataset, mainly contains empty rural roads and lacks the presence of road users or other environmental factors. Given the nature of the dataset, the fixation patterns are highly biased towards vanishing point of roads.

The limitations of computational models
Figure 34: Examples of fixation predictions of the algorithm in [663] (left) and human ground truth data (right).

As we discussed earlier (see Section 8), there is strong evidence that attention is mainly biased by the task. The nature of the task can influence where the human looks, the way his visual receptors are optimized, and how the features are represented and processed internally.

Unfortunately, in the context of autonomous driving, the role of task in attention has been largely ignored. Some works such as [663] attempted to capture the influence of task on the focus of attention, but they failed in more complex scenarios. Figure 34 shows two examples from [663] indicating why the task is important in fixation prediction. As can be seen, in both of these instances the vehicle is approaching a branching road. In these cases the car is maintaining its path on the main road, hence the driver’s attention is focused on the road ahead. The algorithm, on the other hand, falsely predicts fixations on the vanishing points of both branches. This means the algorithm does not take into account the driver’s task.

Another implication of Figure 34 is that the data (as mentioned earlier) is highly biased towards vanishing points of the road. For example, in Figure (b)b the driver is clearly focusing on the yellow bus as it is turning in the path of the vehicle whereas the algorithm focuses on the vanishing points instead.

Figure 35: Examples of fixation predictions of the algorithm in [663] (left) and human ground truth data (right).

In the psychology literature, it is also argued that context can be a good predictor of human focus of attention [208]. In driving, as discussed before, context comprises other road users, road structure or traffic signs. Referring again to [663], we can see how the lack of considerations for context can result in misprediction of human fixations. For instance, in Figure (a)a we can see that a car is approaching the road, hence, the driver is focusing on that car. Similarly, in Figure (b)b the driver is paying attention to the pedestrian crosswalk below the traffic sign. In both cases, however, the algorithm falsely picks the vanishing points as the focus of attention.

14 Joint Attention, Traffic Interaction, and More…: Case Studies

Thus far, we have reviewed various theoretical and technical aspects of pedestrian behavior understanding in terms of what we have to look for in the scene (e.g. contextual elements), where to look for them (e.g. attention algorithms) and how to identify them (e.g. visual detection algorithms).

In this section, with the aid of two case studies, we will relate the previous findings to the problem of driver-pedestrian interaction. More specifically, we will analyze the problem to identify what has to be done, and how (un)suitable the current state of the art algorithms are for solving different aspects of the problem.

14.1 Driver-pedestrian interaction in a complex traffic scene

Before discussing the case studies, it is worth formulating the problem of driver-pedestrian interaction in a typical traffic scenario. Overall, we can divide the interaction problem into three phases:

  1. Detection: It is important to reliably identify various traffic scene elements that are potentially relevant to the task of interaction. The detection process can be generic or directed (attention). In a generic approach, we identify all the traffic elements in the scene whereas in a directed approach we perform detection on the parts of the scene that are relevant to the task.

  2. Selection: The detected elements have to be evaluated to determine which ones are relevant, even if a directed detection approach is used. For instance, we might use a directed approach to detect pedestrians that are in close proximity to the vehicle or its path. Next, we need to determine which of these pedestrians are potentially going to interact with the driver based on, for example, their direction of motion, the state of the traffic signal or head orientation.

  3. Inference and negotiation: Once we selected the potential candidates and relevant contextual elements, we need to determine what the candidates are going to do next. This is where joint attention plays a significant role. It helps to identify the center of focus of the pedestrian (e.g. by following their head/eye movements) and interpret their actions according to their objectives and bodily movements. It is also involved in negotiating the right of way with the pedestrian, by using explicit (e.g. hand gesture) or implicit (e.g. change of pace) signals. The end result is an agreement (or a contract) between the driver and the pedestrian, which is demonstrated by changing or maintaining the current behavior of the involved parties.

The following case studies will focus on different aspects of the problem as described above. Case 1 mainly addresses the problem of detection and selection and case 2 focuses on the problem of joint attention.

14.2 Case 1: Making a right turn at a four-way intersection

(a) A view of the scene
(b) Potentially relevant regions
(c) Actual relevant regions
(d) Priority of attention
(e) Priority of attention after one second
(f) Priority of attention after two seconds
Figure 36: Attention allocation and behavioral understanding in a traffic scene. In sub-figures d-f colors indicate the priority of attention, from the highest to the lowest: green, blue, yellow and red

We begin with a typical interaction scenario where the vehicle is driving towards a four-way intersection (see Figure (a)a) and a pedestrian is standing near the curb waiting to cross the street. We can assume that the primary objective of the vehicle is safety, that is avoiding collisions with other road users or the infrastructure.

14.2.1 Perceiving the scene
(a) Rainy
(b) Snowy
(c) Sunny
Figure 37: Traffic scenes with various weather conditions.

Figure (b)b highlights the potentially relevant contextual elements in the scene. The current state-of-the-art visual perception algorithms are capable of detecting [393, 395, 406] and recognizing [382, 383] traffic scene elements such as pedestrians, vehicles and signs with a reasonable margin of error.

However, there are still challenges that need to be addressed in traffic scene understanding research. The object detection algorithms for traffic scenes are commonly trained and evaluated using datasets, such as KITTI [315] or Cityscapes [316] or, in the case of pedestrian detection, Caltech [322], all of which are collected in daylight under clear sky conditions. Such lack of variability in weather and lighting conditions raises the concern of how reliable these methods would be in real traffic scenarios. As illustrated in Figure 37, we can see, for example, the strong reflection of light on the road surface in a rainy weather (Figure (a)a) or appearance changes of pedestrians in a snowy day (Figure (b)b), none of which exists in a typical sunny day (Figure (c)c). Therefore, it is important to train object detection algorithms using datasets that capture the variability of traffic scenes.

14.2.2 Where to attend

It is easy to see that identifying all the contextual elements can be computationally expensive. Such a holistic view of the scene can increase the chance of error in the detection task and complicates reasoning processes about the behavior of the agents present in the scene.

To limit the focus of attention, we need to identify the contextual elements that are relevant [111]. The relevance is defined based on the nature of the task, which, in our case, is the vehicle making a right turn. One can see that, simply by taking into account the task, a big portion of initially identified elements such as pedestrians walking on the other side of the street, the road straight ahead or to the left, and the parked vehicles are discarded as they are deemed irrelevant (see Figure (c)c).

The role of task on attention allocation has been widely studied in behavioral psychology, but unfortunately very few computational models of attention take task into consideration. For instance, in [459] the authors use a spatial prior for sign detection to limit the detection to regions on the borders of the road. As for top-down models, they focus on highlighting regions with properties coherent with the object of interest but fail to prioritize the instances of the same class according to the task in hand [666]. In the case of models of attention for driving, we also saw that the models which simply learn the driver’s gaze behavior [662] fail to predict regions of interest in more complex scenarios, where the vehicle is not driving on a straight road. In addition, the same issue was apparent in the case of the model that attempts to consider the task [663] in the form of yaw rate and the structure of the street. We saw in the examples earlier (see Figures 34-35) that using such simple cues is certainly not enough to identify regions of importance.

14.2.3 Prioritizing the attention

Although discarding all irrelevant elements speeds up the analysis of the scene, it does not suffice for a fast dynamic task such as driving. For instance, as illustrated in Figure (c)c, the relevant contexts include three pedestrians, 2 vehicles and the traffic lights. Tracking and analyzing all these elements can be expensive, thus might not be possible in a timely manner to avoid collision.

A solution to the above problem is to prioritize the objects from the highest to the lowest importance. Consider the elements illustrated in Figure (d)d. The vehicle is approaching the intersection to turn right. So the immediate relevant contextual elements are traffic signs and lights indicating whether the vehicle is allowed to pass or not. Of course the signs on the other side of the intersection are irrelevant as they concern drivers who are driving straight ahead.

In our example the road users possess different levels of priority. The first participant important to the driver is the pedestrian wearing a red coat and standing near the traffic light pole. She is intending to cross in front of the vehicle, therefore the primary focus should be on what she might do next. Other road users are less important. For instance, the cars stopped the red light at the intersection are driving in a different direction and might not pose any threat unless they perform a reckless behavior. The pedestrians already crossing the intersection on the road ahead have the lowest priority. They are moving to the right side of the street (Figure (d)d and can potentially cross the street again, however, their speed is much slower than the vehicle and there is a very low chance that they’ll be crossing the street by the time the vehicle is turning.

Last but not least is the street. Generally speaking, the road on which the vehicle is driving or intending to drive has a high priority. However, in the given context the road’s state mainly depends on dynamic objects that are in the driver’s center of attention, hence compared to the traffic light or the pedestrian closest to the vehicle, it has a lower priority.

The problem of attending to the objects according to their degree of priority has extensively been studied in human behavioral studies [206, 208]. Bottom-up computational models define priority in terms of how distinctive various features in the scene are [645, 646]. Top-down algorithms [650, 651, 652], on the other hand, define priority in terms of similarity of the image regions to the object of interest. These models, however, define the task very narrowly as visual search based on a template or a label. These algorithms lack the the capability of understanding the relationship between various objects in the scene and the influence of temporal changes on the task (e.g. changes in traffic light color or the state of pedestrians).

14.2.4 Evolving the attention

In a dynamic task such as driving, the focus of attention has to evolve according to the changes in the objective of the task. Going back to the example in Figure (f)f, we can see that the pedestrian wearing a red coat is more likely to cross the street when the car is far away (Figure (d)d) than when the car is in a close proximity (Figure (e)e). Therefore, a lower priority is assigned to the pedestrian in the second time step. The same is true about the pedestrians at the back in Figure (d)d who are not relevant anymore in Figure (f)f as they are clearly not going to a cross the vehicle’s path. At the same time we see that a new pedestrian is appearing in the scene (Figure (e)e) who is crossing the path of the car,and thus has the highest priority at this stage.

Video saliency algorithms [653, 654], to some extent, deal with temporal evolution of attention by taking into account optical flow or fixation data from human subjects. The bottom-up algorithms, of course, are not suitable for such a task as they are only capable of identifying certain patterns in the image, e.g. moving objects or activities, without any connection to the context within which they occurred.

As we saw earlier, the algorithms learning human fixation patterns [663, 662] also fail to highlight what is important in a given driving context due to a number of reasons. The datasets used in these techniques are unbalanced and the occurrence of traffic interactions such as crossing intersections are rarer compared to driving on regular road in which the driver looks straight ahead. Furthermore, there is a high variability in driving tasks, for example, different types of intersections, different numbers of pedestrians or cars present in the scene or even weather conditions. Collecting data for all of these scenarios is a daunting task. Finally, these algorithms do not account for the nature of the task (some only primitively in the form of route [663]) and context when predicting the focus of attention.

14.2.5 Foreseeing the future

The dynamic nature of driving makes the problem of attention and scene understanding quite complex. Given that driving is a temporal task the relevant context goes beyond the static configuration of the elements observed in the scene. As we discussed earlier in Section 4.3, when identifying the context in a dynamic interactive scenario, one should consider the intentions and consequences of the actions performed by the parties involved, in this case the other road users. In some cases actions have to be predicted even in the absence of traffic participants. To explain this further, lets use the example in Figure (f)f.

We begin with the pedestrian wearing a red coat. She is standing at the traffic light and looking towards the ego-vehicle. At the current situation (Figure (d)d) she is the first immediate traffic participant that is relevant to the task. Although the traffic light is green (red for the pedestrian) there is still a chance that she will start crossing (similar to the people at the back who are crossing while the light is green for the cars). The pedestrian is looking towards the vehicle, the gap between the pedestrian and the vehicle is sufficiently big and no other vehicle is driving in the opposite direction. In addition, the pedestrian’s posture indicates that she is not in a fully static position. Taking all these factors into consideration, she should be the main focus of attention.

After one second, the car is much closer to the pedestrian, she is in a fully static condition and the traffic light remains green for the ego-vehicle. This significantly lowers the chance of crossing, therefore the priority of attending to the pedestrian wearing a red coat is reduced.

Earlier, we saw how intention estimation algorithms [603, 597, 592] predict pedestrian behavior, albeit in a limited context. These algorithms mainly rely on the dynamics of the vehicle and the pedestrian [603, 597] and some also take into account pedestrians’ awareness by observing their head orientation [592]. Although these algorithms can be effective in a number of cases, they have some drawbacks. Mainly relying on the trajectory of the pedestrian means that if there is a discontinuity in the pedestrian’s motion or the pedestrian is motionless (which is the case in our example), these algorithms fail to predict upcoming movements.

Figure 38: Examples of pedestrians looking at the traffic.

Measuring head orientation as a sign of awareness may also be tricky in some scenarios. The variability of head orientation and looking actions by pedestrians is very high (see Figure 38) and depends on their positioning or the structure of the scene. In some cases, e.g. Figure (a)a, pedestrians may not explicitly turn their heads towards the vehicle, and instead notice the vehicle with their peripheral vision. Pedestrians’ head in some cases may not even be visible enough for performing recognition (see Figure (d)d).

Figure 39: An example of a pedestrian subtly looking at the vehicle and changing her path. Arrows show the direction of motion and colors indicate whether the pedestrian is moving without noticing the car blue, the pedestrian is looking at the car while moving yellow, and the pedestrian is clearing the path in response to the car green

To remedy the problem of measuring pedestrian awareness, besides head orientation, one should rely on more implicit clues. For instance, the activity of the pedestrian, e.g. stopping or clearing path, can be an indicator that they noticed or are aware of the vehicles presence (see Figure (h)h). In cases where no obvious changes in the pedestrian’s state are observable, we can rely on other clues to measure their awareness. For example, in Figure (d)d the posture of the pedestrian wearing a red coat is a clear indicator that she is looking towards the vehicle, even though her head or face is not fully visible.

Another drawback of practical intention estimation algorithms is their lack of consideration for contextual information beyond pedestrians’ or vehicles’ states. Some simulation-based works exploit higher level contextual information such as social forces [603] or street structure [604] but they do not provide any algorithmic solution for solving the perception problem, assuming everything is detected and also do not address the complex interactions between the traffic participants.

As we mentioned earlier, in some cases, behavior prediction should be performed in the absence of other road users. In road traffic, it is often the case that various elements in the scene are not immediately visible to drivers. For instance, the view of a pedestrian or a car might be blocked by another vehicle, or as in our example (Figure (e)e) by a traffic pole. In such situations, an experienced driver takes note of obstructing elements and adjusts his driving style accordingly.

In assisstive driving context, some scholars address the issue of obstruction in traffic scenes by identifying, for example, the gap between vehicles on the road side and directing the system’s attention to those regions as the potential areas where pedestrians might appear [409]. For autonomous driving, however, an extension of such approaches is needed to identify any forms of potential obstruction in the scene that might affect the view of the scene.

14.3 Case 2: Interaction with pedestrians in a parking lot

(a) A view of the scene
(b) Potential relevant elements
(c) Actual relevant elements
(d) Interacting pedestrians
(e) Pedestrians 2 and 3 are engaged in eye contact
(f) Pedestrian 1 engages in eye contact with pedestrian 3
(g) Pedestrian 2 is looking at the vehicle
(h) Pedestrians 1 and 3 start turning towards the car
Figure 40: Joint attention and nonverbal communication in a traffic scene. In sub-figure c colors indicate the priority of attention, from highest to lowest: green, blue, yellow and red.

The second example is a traffic scenario occurring in a parking lot (see Figure 40) where two pedestrians are intending to cross while interacting with another pedestrian on the other side of the road. Similar to case 1, we can first identify all potentially relevant elements in the scene (Figure (b)b) and then specify which ones are actually relevant to the current task (Figure (c)c). Given that we extensively discussed the problem of attention in the previous case, here we mainly focus on the interaction between the pedestrians and the driver.

14.3.1 Joint attention and nonverbal communication

Pedestrians 1 (p1) and 2 (p2) (as illustrated in Figure (d)d) are clearly intending to cross as they are approaching the road whereas the intention of pedestrian 3 (p3) is unknown as he is looking away from the vehicle and standing next to a parked car. In order to make sense of their actions, we need to identify and analyze instances of joint attention and interaction. In this case, we can observe five instances of joint attention:

  1. P1 and the driver: In Figure (e)e, p1 is making eye contact with the driver indicating her intention of crossing and assessing the position and velocity of the vehicle. She then switches her gaze at p3 (Figure (f)f) and looks back at the driver again. This is a classical instance of joint attention in which we can simply by following the gaze of the pedestrian, similar to the works presented earlier [233, 228, 239], identify her focus of attention. However, this case is different from the social robotic applications. Instead of one task, the pedestrian is involved in two tasks, crossing the road and communicating with another pedestrian. This fact makes the reasoning particularly tricky because we have to identify which task is the main focus of the pedestrian’s attention, which in turn will influence the way she is going to behave. Such multi-tasking behavior in the context of joint attention has not previously been addressed.

  2. P1 and P3: P1 makes eye contact with p3 (Figure (f)f) and turns her head towards the ego-vehicle (Figure (h)h) influencing p3 to also follow her gaze towards the same point.

  3. P2 and P3: P2 is initially looking at p3 (Figure (e)e). While engaged in communication, he initiates a hand gesture at p3 (Figure (f)f) and at the same time begins turning his head towards the vehicle. In Figure (g)g p2’s hand gesture is still in effect while he is looking towards the vehicle. Once again state-of-the-art action recognition [569, 585, 579] can detect these behaviors.

    Besides perception issues, it is important to make sense of the communication cue exhibited by p2 since it can potentially influence the behavior of p3. By the time the hand gesture is observed (Figure (g)g) p3 has not yet looked towards the vehicle, and as a result, we do not know whether he is aware of the approaching vehicle and what he is going to do next. For instance, the hand gesture might induce p3 to cross or signal him to stop.

    Understanding nonverbal communication cues have been extensively studied in HCI applications [233, 245, 232, 561] and to some extent in general scene understanding tasks [234]. However, these algorithms do not usually consider the semantics of nonverbal signals and specifically do not link them to possible actions (or intentions) of the recipient of the signal.

  4. P2 and the driver: The instance of joint attention in this case is particularly interesting for a number of reasons:

    1. P2 looks at the driver right after he engages in communication with p3 (Figure (g)g). This means in order to realize p2’s primary focus of attention, we have to follow his gaze prior to being engaged in eye contact with him. This is in contrast to widely used approaches in joint attention applications where the gaze following and identifying focus of attention takes place after eye contact.

    2. P2 makes a hand gesture which is clearly intended for p3 while looking at the driver (Figure (g)g). In joint attention applications, two techniques are common to identify the intention of a nonverbal cue. The communication is performed after an eye contact [226, 229], so the recipient would be the last person who made eye contact with the communicator (e.g. the robot). Another technique associates the cue (e.g. hand gesture) with the eye contact. As a result, whoever the communicator (e.g. the person interacting with the machine) is looking towards at the time of communicating is the intended recipient [232]. In our case, however, both these techniques fail because the action performed for someone else is co-occurring with looking at the driver. To solve this issue one has to track the behavior of the agents and be aware of what has happen prior to the eye contact.

    3. The communication between p2 and p3 certainly can shed light on understanding the intention of p2 when he makes eye contact with the driver. The current intention estimation algorithms are not addressing the role of communication on pedestrian behavior understanding. Some components of this problem, however, can be solved using various visual perception algorithms such as interaction [581, 582] and group activity recognition [512, 584] algorithms.

  5. P3 and the driver: This instance also has some similarities with the previous one in which the primary focus of p3 precedes his eye contact with the driver. Here, once the eye contact is observed, we have more certainty about p3’s upcoming crossing decision.

In this section, we introduced two cases of attention and interaction in driving to highlight various challenges in pedestrian behavior understanding. Although parts of these examples might be seen as very rare and exceptional, the components involved in understanding them are still relevant in typical situations. For instance, joint attention is a very common phenomenon and is involved in almost every interactive traffic scenarios. Nonverbal communication is very common as well, not only between pedestrians and drivers, but also between different drivers or sometimes even authorities and road users, e.g. police force and vehicles.

15 Summary and Future Work

Autonomous driving is a complex task and requires the cooperation between many branches of artificial intelligence to become a reality. In this report some of the open problems in the field, in particular, the interaction between autonomous vehicles and other road users, were discussed. It was shown that the interaction between traffic participants can guarantee the flow of traffic, prevent accidents and reduce the likelihood of malicious actions that might happen against the vehicles.

We showed that at the center of social interactions is joint attention, or the ability to share intention regarding a common object or an event. Joint attention often signals the intention of two or more parties to coordinate in order to accomplish a task. During the coordination, humans observe each other’s behavior and try to predict what comes next.

To make the prediction accurate, humans often communicate their intentions and analyze the context in which the behavior is observed. In traffic scenarios, communication is commonly nonverbal, i.e. through the use of cues such as eye contact, hand gestures or head movements to transmit the intention of crossing, showing gratitude or asking for the right of way. As for the context, there is a broad set of factors that potentially impact the behavior of pedestrians and drivers. Some of these factors are: dynamics of the scene (e.g. vehicles’ speed and distance), social norms, demographics, social forces (e.g. group size, social status), and physical context (e.g. signals, road geometry).

Realizing the ability of social interaction in practice is challenging and involves a vast number of research areas ranging from detection and identification of elements (e.g. pedestrians, signs, roads) in the scene to reasoning about locations, poses and the types of activities observed.

Today’s state-of-the-art computer vision algorithms attempt to resolve the perceptual tasks involved in traffic scene understanding. Despite their advancements and successes in a number of contexts, these algorithms are still far from ideal to be used in practice. For instance, pedestrian detection algorithms, which are only tested on datasets with favorable conditions, such as scenes in broad daylight or clear weather conditions, have not yet achieved a performance level close to humans. Activity recognition algorithms are mainly focusing on either video classification problems where the entire length of the activity is known at the detection time or applications where the outliers or background clutter is minimal , e.g. sport scenes. In addition, given that the majority of vision algorithms are designed for off-line applications, their processing times usually far exceed the real-time requirement for driving tasks.

In recent years, a number of attempts have been made to design algorithms for estimating the behavior of pedestrians at crosswalks. These works combine various contextual cues such as the dynamics of the vehicle and the pedestrian, pedestrian head orientation or the structure of the street to estimate whether the pedestrian will cross or not. However, these algorithms either lack visual perception methods necessary for analyzing the scenes, or are only applied to very limited contexts, e.g. narrow streets with no signal.

Given the broadness of the social interaction task in autonomous driving, there are many open problems in this field.

A great deal of studies on pedestrian behavior are dated back to a few decades ago. During this time we have witnessed major technological and socioeconomic changes, which means the behavioral studies have to be revisited in order to account for modern times. In particular, there is a major shortage of studies addressing the interaction issues between pedestrians and autonomous vehicles.

The majority of the practical joint attention systems are designed for tasks such as learning, rehabilitation or imitation. There is a very few attempts to build a system capable of joint attention in complex cooperative tasks, and none in the context of autonomous driving.

As it was presented earlier, the state of the art visual attention algorithms are heavily data-driven and are unsuitable for driving tasks. There is a need for a task-driven visual attention model that can dynamically focus on regions of interest during the course of driving.

Algorithms capable of understanding non-verbal communication are mainly designed for human-computer interaction where the proximity of the user to the machine is small, the background clutter is minimal, and the task often involves direct understanding of various bodily movements without considering the context. There are only few attempts on under- standing communication in traffic scenes, none of which infer the meaning of the observed cues, but rather identify the type of the cues.

The current intention estimation algorithms are very limited in context, and often are not accompanied with necessary visual perception algorithms to analyze the scenes. In addition, the data used in these algorithms is either not naturalistically obtained (e.g. participants are scripted), or not sufficiently diverse to include various traffic scenarios. This points to the need for implementing a system that can, first, identify the relevant elements in the scene, second, reason about the interconnections between these elements, and third, infer the upcoming actions of the road users. The algorithm should also be universal in a sense that it can be used in various traffic scenarios with different street structures, traffic signals, crosswalk configurations, etc.


  • [1] F. Kröger, “Automated driving in its social, historical and cultural contexts,” in Autonomous Driving.   Springer, 2016, pp. 41–68.
  • [2] T. Winkle, “Safety benefits of automated vehicles: Extended findings from accident research for development, validation and testing,” in Autonomous Driving.   Springer, 2016, pp. 335–364.
  • [3] T. Litman, “Autonomous vehicle implementation predictions,” Victoria Transport Policy Institute, vol. 28, 2014.
  • [4] E. D. Dickmanns and A. Zapp, “A curvature-based scheme for improving road vehicle guidance by computer vision,” in Cambridge Symposium_Intelligent Robotics Systems.   International Society for Optics and Photonics, 1987, pp. 161–168.
  • [5] M. Darms, P. E. Rybski, and C. Urmson, “A multisensor multiobject tracking system for an autonomous vehicle driving in an urban environment,” in 9th International Symposium on Advanced Vehicle Control (AVEC), 2008.
  • [6] R. Bishop, “Intelligent vehicle applications worldwide,” IEEE Intelligent Systems and Their Applications, vol. 15, no. 1, pp. 78–81, 2000.
  • [7] D. Radovanovic and D. Muoio, “This is what the evolution of self-driving cars looks like,” Online, 2017-05-28. [Online]. Available:
  • [8] “Automated driving levels of driving automation are defined in new SAE international standard j3016,” Online, 2017-05-28. [Online]. Available:
  • [9] V. Nguyen, “2019 audi a8 level 3 autonomy first-drive: Chasing the perfect ‘jam’,” Online, 2017-11-10. [Online]. Available:
  • [10] “Unmanned ground vehicle,” Online, 2017-05-28. [Online]. Available:
  • [11] A. Oagana, “A short history of mercedes-benz autonomous driving technology,” Online, 2017-05-28. [Online]. Available:
  • [12] A. Broggi, M. Bertozzi, A. Fascioli, C. G. L. Bianco, and A. Piazzi, “The argo autonomous vehicle’s vision and control systems,” International Journal of Intelligent Control and Systems, vol. 3, no. 4, pp. 409–441, 1999.
  • [13] “Self driving car,” Online, 2017-05-28. [Online]. Available:
  • [14] “De 1977 à nos jours, beaucoup de progrès !” Online, 2017-05-28. [Online]. Available:
  • [15] “Vislab intercontinental autonomous challenge: Inaugural ceremony – milan, italy,” Online, 2017-05-28. [Online]. Available:
  • [16] “Watch Stanford’s self-driving vehicle hit 120mph: Autonomous Audi proves to be just as good as a race car driver,” Online, 2017-05-28. [Online]. Available:
  • [17] A. Davieg, “We take a ride in the self-driving Uber now roaming Pittsburgh,” Online, 2017-05-28. [Online]. Available:
  • [18] “W. Grey Walter’s Tortoises – Self-recognition and narcissism,” Online, 2017-05-26. [Online]. Available:
  • [19] “1960 – Stanford Cart – (American),” Online, 2017-05-26. [Online]. Available:
  • [20] “Elsie (electro-mechanical robot, light sensitive with internal and external stability,” Online, 2017-05-26. [Online]. Available:
  • [21] H. Moravec, “Obstacle avoidance and navigation in the real world by a seeing robot rover.” DTIC Document, Tech. Rep., 1980.
  • [22] H. P. Moravec, “The Stanford Cart and the CMU rover,” Proceedings of the IEEE, vol. 71, no. 7, pp. 872–884, 1983.
  • [23] B. D. Mysliwetz and E. Dickmanns, “Distributed scene analysis for autonomous road vehicle guidance,” in Robotics and IECON’87 Conferences.   International Society for Optics and Photonics, 1987, pp. 72–79.
  • [24] E. D. Dickmanns, B. Mysliwetz, and T. Christians, “An integrated spatio-temporal approach to automatic visual guidance of autonomous vehicles,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 20, no. 6, pp. 1273–1284, 1990.
  • [25] M. Mueller-Freitag, “Germany asleep at the wheel?” Online, 2017-05-28. [Online]. Available:
  • [26] D. A. Pomerleau, J. Gowdy, and C. E. Thorpe, “Combining artificial neural networks and symbolic processing for autonomous robot guidance,” Engineering Applications of Artificial Intelligence, vol. 4, no. 4, pp. 279–285, 1991.
  • [27] D. Pomerleau, “Progress in neural network-based vision for autonomous robot driving,” in Intelligent VehiclesSymposium (IV).   IEEE, 1992, pp. 391–396.
  • [28] D. A. Pomerleau, “Neural network vision for robot driving,” in The Handbook of Brain Theory and Neural Networks.   Citeseer, 1996.
  • [29] S. Baluja, “Evolution of an artificial neural network based autonomous land vehicle controller,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 26, no. 3, pp. 450–463, 1996.
  • [30] C. Thorpe, M. Herbert, T. Kanade, and S. Shafer, “Toward autonomous driving: the CMU Navlab. i. perception,” IEEE expert, vol. 6, no. 4, pp. 31–42, 1991.
  • [31] T. M. Jochem, D. A. Pomerleau, and C. E. Thorpe, “Vision-based neural network road and intersection detection and traversal,” in IROS, vol. 3.   IEEE, 1995, pp. 344–349.
  • [32] Y. U. Yim and S.-Y. Oh, “Three-feature based automatic lane detection algorithm (TFALDA) for autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 4, no. 4, pp. 219–225, 2003.
  • [33] K. Kluge, “Extracting road curvature and orientation from image edge points without perceptual grouping into features,” in Intelligent Vehicles Symposium (IV).   IEEE, 1994, pp. 109–114.
  • [34] E. D. Dickmanns, R. Behringer, D. Dickmanns, T. Hildebrandt, M. Maurer, F. Thomanek, and J. Schiehlen, “The seeing passenger car ’VaMoRs-P’,” in Intelligent Vehicles Symposium (IV).   IEEE, 1994, pp. 68–73.
  • [35] U. Franke, S. Mehring, A. Suissa, and S. Hahn, “The Daimler-Benz steering assistant: a spin-off from autonomous driving,” in Intelligent Vehicles Symposium (IV).   IEEE, 1994, pp. 120–124.
  • [36] S. Bohrer, T. Zielke, and V. Freiburg, “An integrated obstacle detection framework for intelligent cruise control on motorways,” in Intelligent Vehicles Symposium (IV).   IEEE, 1995, pp. 276–281.
  • [37] T. Hong, M. Abrams, T. Chang, and M. Shneier, “An intelligent world model for autonomous off-road driving,” CVIU, 2000.
  • [38] R. Behringer, S. Sundareswaran, B. Gregory, R. Elsley, B. Addison, W. Guthmiller, R. Daily, and D. Bevly, “The DARPA grand challenge-development of an autonomous vehicle,” in Intelligent Vehicles Symposium (IV).   IEEE, 2004, pp. 226–231.
  • [39] T. Dang, S. Kammel, C. Duchow, B. Hummel, and C. Stiller, “Path planning for autonomous driving based on stereoscopic and monoscopic vision cues,” in IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems.   IEEE, 2006, pp. 191–196.
  • [40] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: The robot that won the DARPA grand challenge,” Journal of field Robotics, vol. 23, no. 9, pp. 661–692, 2006.
  • [41] S. Thrun, M. Montemerlo, and A. Aron, “Probabilistic terrain analysis for high-speed desert driving.” in Robotics: Science and Systems, 2006, pp. 16–19.
  • [42] G. M. Hoffmann, C. J. Tomlin, M. Montemerlo, and S. Thrun, “Autonomous automobile trajectory tracking for off-road driving: Controller design, experimental validation and racing,” in American Control Conference (ACC).   IEEE, 2007, pp. 2296–2301.
  • [43] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel, “Practical search techniques in path planning for autonomous driving,” Ann Arbor, vol. 1001, p. 48105, 2008.
  • [44] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
  • [45] D. Dolgov and S. Thrun, “Autonomous driving in semi-structured environments: Mapping and planning,” in ICRA.   IEEE, 2009, pp. 3407–3414.
  • [46] R. Kummerle, D. Hahnel, D. Dolgov, S. Thrun, and W. Burgard, “Autonomous driving in a multi-level parking structure,” in ICRA.   IEEE, 2009, pp. 3395–3400.
  • [47] J. Wei and J. M. Dolan, “A robust autonomous freeway driving algorithm,” in Intelligent Vehicles Symposium (IV).   IEEE, 2009, pp. 1015–1020.
  • [48] J. Wei, J. M. Snider, J. Kim, J. M. Dolan, R. Rajkumar, and B. Litkouhi, “Towards a viable autonomous driving research platform,” in Intelligent Vehicles Symposium (IV).   IEEE, 2013, pp. 763–770.
  • [49] S. Brechtel, T. Gindele, and R. Dillmann, “Probabilistic decision-making under uncertainty for autonomous driving using continuous pomdps,” in Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 392–399.
  • [50] M. Bertozzi, L. Bombini, A. Broggi, M. Buzzoni, E. Cardarelli, S. Cattani, P. Cerri, S. Debattisti, R. Fedriga, M. Felisa et al., “The VISLAB intercontinental autonomous challenge: 13,000 km, 3 months, no driver,” in 17th World Congress on ITS, 2010.
  • [51] C. SQUATRIGLIA, “Audi’s robotic car climbs pikes peak,” Online, 2017-05-28. [Online]. Available:
  • [52] “Waymo,” Online, 2017-05-30. [Online]. Available:
  • [53] S. Millward, “Baidu’s driverless cars on china’s roads by 2020,” Online, 2017-05-30. [Online]. Available:
  • [54] a. English, “Toyota’s driveless car,” Online, 2017-05-30. [Online]. Available:
  • [55] “44 corporations working on autonomous vehicles,” Online, 2017-05-30. [Online]. Available:
  • [56] “Full self-driving hardware on all cars,” Online, 2017-05-30. [Online]. Available:
  • [57] G. Nica, “BMW CEO wants autonomous driving cars within five years,” Online, 2017-05-28. [Online]. Available:
  • [58] J. Ziegler, P. Bender, M. Schreiber, H. Lategahn, T. Strauss, C. Stiller, T. Dang, U. Franke, N. Appenrodt, C. G. Keller et al., “Making Bertha drive-An autonomous journey on a historic route,” IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 2, pp. 8–20, 2014.
  • [59] A. Davieg, “Uber’s self-driving truck makes its first delivery: 50,000 beers,” Online, 2017-05-30. [Online]. Available:
  • [60] A. Marshal, “Don’t look now, but even buses are going autonomous,” Online, 2017-05-30. [Online]. Available:
  • [61] O. Levander, “Forget autonomous cars-autonomous ships are almost here,” Online, 2017-05-30. [Online]. Available:
  • [62] F. Lambert, “Elon Musk clarifies tesla’s plan for level 5 fully autonomous driving: 2 years away from sleeping in the car,” Online, 2017-05-30. [Online]. Available:
  • [63] E. Ackerman, “Toyota’s Gill Pratt on self-driving cars and the reality of full autonomy,” Online, 2017-05-30. [Online]. Available:
  • [64] B. Friedrich, “The effect of autonomous vehicles on traffic,” Autonomous Driving, pp. 317–334, 2016.
  • [65] T. M. Gasser, “Fundamental and special legal questions for autonomous vehicles,” Autonomous Driving, pp. 523–551, 2016.
  • [66] D. Muoio, “6 scenarios self-driving cars still can’t handle,” Online, 2017-05-30. [Online]. Available:
  • [67] R. Tussy, “The challenges facing autonomous vehicles,” Online, 2017-05-30. [Online]. Available:
  • [68] F. Lambert, “Tesla Model S driver crashes into a van while on autopilot [video],” Online, 2017-05-30. [Online]. Available:
  • [69] “Tesla on autopilot hits police motorcycle,” Online, 2017-05-30. [Online]. Available:
  • [70] “Uber suspends self-driving fleet after Ariz. crash,” Online, 2017-05-30. [Online]. Available:
  • [71] “Tesla driver dies in first fatal crash while using autopilot mode,” Online, 2017-05-30. [Online]. Available:
  • [72] “Another fatal tesla crash reportedly on autopilot emerges, Model S hits a streetsweeper truck – caught on dashcam,” Online, 2017-05-30. [Online]. Available:
  • [73] I. Wolf, “The interaction between humans and autonomous agents,” in Autonomous Driving.   Springer, 2016, pp. 103–124.
  • [74] B. Färber, “Communication and communication problems between autonomous vehicles and human drivers,” in Autonomous Driving.   Springer, 2016, pp. 125–144.
  • [75] M. Richtel, “Google’s driverless cars run into problem: Cars with drivers,” Online, 2017-05-30. [Online]. Available:
  • [76] S. E. Anthony, “The trollable self-driving car,” Online, 2017-05-30. [Online]. Available:
  • [77] M. Gough, “Machine smarts: how will pedestrians negotiate with driverless cars?” Online, 2017-05-30. [Online]. Available:
  • [78] D. Sun, S. Ukkusuri, R. F. Benekohal, and S. T. Waller, “Modeling of motorist-pedestrian interaction at uncontrolled mid-block crosswalks,” Urbana, vol. 51, p. 61801, 2002.
  • [79] A. M. Parkes, N. J. Ward, and L. Bossi, “The potential of vision enhancement systems to improve driver safety,” Le Travail Humain, vol. 58, no. 2, p. 151, 1995.
  • [80] M. McFarland, “Robots hit the streets – and the streets hit back,” Online, 2017-05-30. [Online]. Available:
  • [81] M. Kumashiro, H. Ishibashi, Y. Uchiyama, S. Itakura, A. Murata, and A. Iriki, “Natural imitation induced by joint attention in Japanese monkeys,” International Journal of Psychophysiology, vol. 50, no. 1, pp. 81–99, 2003.
  • [82] M. Kidwell and D. H. Zimmerman, “Joint attention as action,” Journal of Pragmatics, vol. 39, no. 3, pp. 592–611, 2007.
  • [83] G. Butterworth and E. Cochran, “Towards a mechanism of joint visual attention in human infancy,” International Journal of Behavioral Development, vol. 3, no. 3, pp. 253–272, 1980.
  • [84] M. Scaife and J. S. Bruner, “The capacity for joint visual attention in the infant.” Nature, 1975.
  • [85] M. Tomasello and J. Todd, “Joint attention and lexical acquisition style,” First language, vol. 4, no. 12, pp. 197–211, 1983.
  • [86] M. Botero, “Tactless scientists: Ignoring touch in the study of joint attention,” Philosophical Psychology, vol. 29, no. 8, pp. 1200–1214, 2016.
  • [87] R. Sheldrake and A. Beeharee, “Is joint attention detectable at a distance? Three automated, internet-based tests,” Explore: The Journal of Science and Healing, vol. 12, no. 1, pp. 34–41, 2016.
  • [88] C. Moore, M. Angelopoulos, and P. Bennett, “The role of movement in the development of joint visual attention,” Infant Behavior and Development, vol. 20, no. 1, pp. 83–92, 1997.
  • [89] P. Mundy and M. Crowson, “Joint attention and early social communication: Implications for research on intervention with autism,” Journal of Autism and Developmental disorders, vol. 27, no. 6, pp. 653–676, 1997.
  • [90] W. V. Dube, R. P. MacDonald, R. C. Mansfield, W. L. Holcomb, and W. H. Ahearn, “Toward a behavioral analysis of joint attention,” The Behavior Analyst, vol. 27, no. 2, p. 197, 2004.
  • [91] P. Holth, “An operant analysis of joint attention skills.” Journal of Early and Intensive Behavior Intervention, vol. 2, no. 3, p. 160, 2005.
  • [92] M. Tomasello and M. Carpenter, “Shared intentionality,” Developmental science, vol. 10, no. 1, pp. 121–125, 2007.
  • [93] T. Charman, S. Baron-Cohen, J. Swettenham, G. Baird, A. Cox, and A. Drew, “Testing joint attention, imitation, and play as infancy precursors to language and theory of mind,” Cognitive development, vol. 15, no. 4, pp. 481–498, 2000.
  • [94] M. Carpenter, K. Nagell, M. Tomasello, G. Butterworth, and C. Moore, “Social cognition, joint attention, and communicative competence from 9 to 15 months of age,” Monographs of the society for research in child development, pp. i–174, 1998.
  • [95] P. Mundy and A. Gomes, “Individual differences in joint attention skill development in the second year,” Infant behavior and development, vol. 21, no. 3, pp. 469–482, 1998.
  • [96] R. MacDonald, J. Anderson, W. V. Dube, A. Geckeler, G. Green, W. Holcomb, R. Mansfield, and J. Sanchez, “Behavioral assessment of joint attention: A methodological report,” Research in Developmental Disabilities, vol. 27, no. 2, pp. 138–150, 2006.
  • [97] T. Deroche, C. Castanier, A. Perrot, and A. Hartley, “Joint attention is slowed in older adults,” Experimental aging research, vol. 42, no. 2, pp. 144–150, 2016.
  • [98] E. Goffman et al., The presentation of self in everyday life.   Harmondsworth, 1978.
  • [99] P. D. Bardis, “Social interaction and social processes,” Social Science, vol. 54, no. 3, pp. 147–167, 1979.
  • [100] N. Sebanz, H. Bekkering, and G. Knoblich, “Joint action: Bodies and minds moving together,” Trends in cognitive sciences, vol. 10, no. 2, pp. 70–76, 2006.
  • [101] A. Fiebich and S. Gallagher, “Joint attention in joint action,” Philosophical Psychology, vol. 26, no. 4, pp. 571–587, 2013.
  • [102] P. Nuku and H. Bekkering, “Joint attention: Inferring what others perceive (and don’t perceive),” Consciousness and Cognition, vol. 17, no. 1, pp. 339–349, 2008.
  • [103] M. Sucha, D. Dostal, and R. Risser, “Pedestrian-driver communication and decision strategies at marked crossings,” Accident Analysis & Prevention, vol. 102, pp. 41–50, 2017.
  • [104] L. Fogassi, P. F. Ferrari, B. Gesierich, S. Rozzi, F. Chersi, and G. Rizzolatti, “Parietal lobe: from action organization to intention understanding,” Science, vol. 308, no. 5722, pp. 662–667, 2005.
  • [105] M. A. Umilta, E. Kohler, V. Gallese, L. Fogassi, L. Fadiga, C. Keysers, and G. Rizzolatti, “I know what you are doing: A neurophysiological study,” Neuron, vol. 31, no. 1, pp. 155–165, 2001.
  • [106] K. Verfaillie and A. Daems, “Representing and anticipating human actions in vision,” Visual Cognition, vol. 9, no. 1-2, pp. 217–232, 2002.
  • [107] J. R. Flanagan and R. S. Johansson, “Action plans used in action observation,” Nature, vol. 424, no. 6950, p. 769, 2003.
  • [108] D. C. Dennett, Brainstorms: Philosophical essays on mind and psychology.   MIT press, 1981.
  • [109] S. Baron-Cohen, “Mindblindness: An essay on autism and theory of mind. cambridge, ma: Bradford,” 1995.
  • [110] N. Humphrey, Consciousness regained: Chapters in the development of mind.   Nicholas Humphrey, 1984.
  • [111] D. Sperber and D. Wilson, “Precis of relevance: Communication and cognition,” Behavioral and brain sciences, vol. 10, no. 4, pp. 697–710, 1987.
  • [112] N. J. Briton and J. A. Hall, “Beliefs about female and male nonverbal communication,” Sex Roles, vol. 32, no. 1, pp. 79–90, 1995.
  • [113] M. A. Hecht and N. Ambady, “Nonverbal communication and psychology: Past and future,” Atlantic Journal of Communication, vol. 7, no. 2, pp. 156–170, 1999.
  • [114] R. Buck and C. A. VanLear, “Verbal and nonverbal communication: Distinguishing symbolic, spontaneous, and pseudo-spontaneous nonverbal behavior,” Journal of Communication, vol. 52, no. 3, pp. 522–541, 2002.
  • [115] C. Darwin, The expression of the emotions in man and animals.   Oxford University Press, USA, 1998.
  • [116] R. M. Krauss, Y. Chen, and P. Chawla, “Nonverbal behavior and nonverbal communication: What do conversational hand gestures tell us?” Advances in experimental social psychology, vol. 28, pp. 389–450, 1996.
  • [117] A. Mehrabian, Public places and private spaces: the psychology of work, play, and living environments.   Basic Books New York, 1976.
  • [118] R. L. Birdwhistell, Kinesics and context: Essays on body motion communication.   University of Pennsylvania press, 2010.
  • [119] M. R. DiMatteo, A. Taranta, H. S. Friedman, and L. M. Prince, “Predicting patient satisfaction from physicians’ nonverbal communication skills,” Medical care, pp. 376–387, 1980.
  • [120] S. Nowicki and M. P. Duke, “Individual differences in the nonverbal communication of affect: The diagnostic analysis of nonverbal accuracy scale,” Journal of Nonverbal behavior, vol. 18, no. 1, pp. 9–35, 1994.
  • [121] M. Argyle and J. Dean, “Eye-contact, distance and affiliation,” Sociometry, pp. 289–304, 1965.
  • [122] A. Senju and M. H. Johnson, “The eye contact effect: mechanisms and development,” Trends in cognitive sciences, vol. 13, no. 3, pp. 127–134, 2009.
  • [123] D. Rothenbücher, J. Li, D. Sirkin, B. Mok, and W. Ju, “Ghost driver: A field study investigating the interaction between pedestrians and driverless vehicles,” in International Symposium on Robot and Human Interactive Communication (RO-MAN).   IEEE, 2016, pp. 795–802.
  • [124] N. Guéguen, S. Meineri, and C. Eyssartier, “A pedestrian’s stare and drivers’ stopping behavior: A field experiment at the pedestrian crossing,” Safety science, vol. 75, pp. 87–89, 2015.
  • [125] R. Risser, “Behavior in traffic conflict situations,” Accident Analysis & Prevention, vol. 17, no. 2, pp. 179–197, 1985.
  • [126] A. E. Scheflen, “The significance of posture in communication systems,” Psychiatry, vol. 27, no. 4, pp. 316–331, 1964.
  • [127] G. Wilde, “Immediate and delayed social interaction in road user behaviour,” Applied Psychology, vol. 29, no. 4, pp. 439–460, 1980.
  • [128] D. Clay, “Driver attitude and attribution: implications for accident prevention,” Ph.D. dissertation, Cranfield University, 1995.
  • [129] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Understanding pedestrian behavior in complex traffic scenes,” IEEE Transactions on Intelligent Vehicles, vol. PP, no. 99, pp. 1–1, 2017.
  • [130] A. Varhelyi, “Drivers’ speed behaviour at a zebra crossing: a case study,” Accident Analysis & Prevention, vol. 30, no. 6, pp. 731–743, 1998.
  • [131] C. Klienke, “Compliance to requests made by gazing and touching,” J. exp. soc. Psychol., vol. 13, pp. 218–223, 1977.
  • [132] I. Walker and M. Brosnan, “Drivers’ gaze fixations during judgements about a bicyclist’s intentions,” Transportation research part F: traffic psychology and behaviour, vol. 10, no. 2, pp. 90–98, 2007.
  • [133] C. C. Hamlet, S. Axelrod, and S. Kuerschner, “Eye contact as an antecedent to compliant behavior,” Journal of Applied Behavior Analysis, vol. 17, no. 4, pp. 553–557, 1984.
  • [134] J. M. Price and S. J. Glynn, “The relationship between crash rates and drivers’ hazard assessments using the connecticut photolog,” in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 44, no. 20.   SAGE Publications, 2000, pp. 3–263.
  • [135] D. Crundall, “Driving experience and the acquisition of visual information,” Ph.D. dissertation, University of Nottingham, 1999.
  • [136] J. M. Sullivan and M. J. Flannagan, “Differences in geometry of pedestrian crashes in daylight and darkness,” Journal of safety research, vol. 42, no. 1, pp. 33–37, 2011.
  • [137] A. Tom and M.-A. Granié, “Gender differences in pedestrian rule compliance and visual search at signalized and unsignalized crossroads,” Accident Analysis & Prevention, vol. 43, no. 5, pp. 1794–1801, 2011.
  • [138] M. Reed, “Intersection kinematics: a pilot study of driver turning behavior with application to pedestrian obscuration by a-pillars,” University of Michigan, Tech. Rep., 2008.
  • [139] A. Tawari, A. Møgelmose, S. Martin, T. B. Moeslund, and M. M. Trivedi, “Attention estimation by simultaneous analysis of viewer and view,” in Intelligent Transportation Systems Conference (ITSC).   IEEE, 2014, pp. 1381–1387.
  • [140] N. W. Heimstra, J. Nichols, and G. Martin, “An experimental methodology for analysis of child pedestrian behavior,” Pediatrics, vol. 44, no. 5, pp. 832–838, 1969.
  • [141] V. L. Neale, T. A. Dingus, S. G. Klauer, J. Sudweeks, and M. Goodman, “An overview of the 100-car naturalistic study and findings,” National Highway Traffic Safety Administration, Paper, no. 05-0400, 2005.
  • [142] R. Eenink, Y. Barnard, M. Baumann, X. Augros, and F. Utesch, “UDRIVE: the european naturalistic driving study,” in Proceedings of Transport Research Arena.   IFSTTAR, 2014.
  • [143] T. Wang, J. Wu, P. Zheng, and M. McDonald, “Study of pedestrians’ gap acceptance behavior when they jaywalk outside crossing facilities,” in Intelligent Transportation Systems (ITSC).   IEEE, 2010, pp. 1295–1300.
  • [144] T. Rosenbloom, “Crossing at a red light: Behaviour of individuals and groups,” Transportation research part F: traffic psychology and behaviour, vol. 12, no. 5, pp. 389–394, 2009.
  • [145] M. M. Ishaque and R. B. Noland, “Behavioural issues in pedestrian speed choice and street crossing behaviour: a review,” Transport Reviews, vol. 28, no. 1, pp. 61–85, 2008.
  • [146] D. Johnston, “Road accident casuality: A critique of the literature and an illustrative case,” Ontario: Grand Rounds. Department of Psy chiatry, Hotel Dieu Hospital, 1973.
  • [147] M. Gheri, “Über das blickverhalten von kraftfahrern an kreuzungen,” Kuratorium für Verkehrssicherheit, Kleine Fachbuchreihe Bd, vol. 5, 1963.
  • [148] M. Šucha, “Road users’ strategies and communication: driver-pedestrian interaction,” Transport Research Arena (TRA), 2014.
  • [149] D. Yagil, “Beliefs, motives and situational factors related to pedestrians’ self-reported behavior at signal-controlled crossings,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 3, no. 1, pp. 1–13, 2000.
  • [150] M. Lefkowitz, R. R. Blake, and J. S. Mouton, “Status factors in pedestrian violation of traffic signals.” The Journal of Abnormal and Social Psychology, vol. 51, no. 3, p. 704, 1955.
  • [151] E. Papadimitriou, G. Yannis, and J. Golias, “A critical assessment of pedestrian behaviour models,” Transportation research part F: traffic psychology and behaviour, vol. 12, no. 3, pp. 242–255, 2009.
  • [152] D. Twisk, N. Van Nes, and J. Haupt, “Understanding safety critical interactions between bicycles and motor vehicles in europe by means of naturalistic driving techniques,” in Proceedings of the first international cycling safety conference, 2012.
  • [153] M. Goldhammer, A. Hubert, S. Koehler, K. Zindler, U. Brunsmann, K. Doll, and B. Sick, “Analysis on termination of pedestrians’ gait at urban intersections,” in Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 1758–1763.
  • [154] G. Waizman, S. Shoval, and I. Benenson, “Micro-simulation model for assessing the risk of vehicle–pedestrian road accidents,” Journal of Intelligent Transportation Systems, vol. 19, no. 1, pp. 63–77, 2015.
  • [155] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Agreeing to cross: How drivers and pedestrians communicate,” in Intelligent Vehicles Symposium (IV), June 2017, pp. 264–269.
  • [156] R. R. Oudejans, C. F. Michaels, B. van Dort, and E. J. Frissen, “To cross or not to cross: The effect of locomotion on street-crossing behavior,” Ecological psychology, vol. 8, no. 3, pp. 259–267, 1996.
  • [157] D. Cottingham, “Pedestrian crossings and islands,” Online, 2017-06-3. [Online]. Available:
  • [158] R. Tian, E. Y. Du, K. Yang, P. Jiang, F. Jiang, Y. Chen, R. Sherony, and H. Takahashi, “Pilot study on pedestrian step frequency in naturalistic driving environment,” in Intelligent Vehicles Symposium (IV).   IEEE, 2013, pp. 1215–1220.
  • [159] D. Crompton, “Pedestrian delay, annoyance and risk: preliminary results from a 2 years study,” in Proceedings of PTRC Summer Annual Meeting, 1979, pp. 275–299.
  • [160] G. Underwood, P. Chapman, N. Brocklehurst, J. Underwood, and D. Crundall, “Visual attention while driving: Sequences of eye fixations made by experienced and novice drivers,” Ergonomics, vol. 46, no. 6, pp. 629–646, 2003.
  • [161] S. G. Klauer, V. L. Neale, T. A. Dingus, D. Ramsey, and J. Sudweeks, “Driver inattention: A contributing factor to crashes and near-crashes,” in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 49, no. 22.   SAGE Publications Sage CA: Los Angeles, CA, 2005, pp. 1922–1926.
  • [162] G. Underwood, “Visual attention and the transition from novice to advanced driver,” Ergonomics, vol. 50, no. 8, pp. 1235–1249, 2007.
  • [163] Y. Barnard, F. Utesch, N. Nes, R. Eenink, and M. Baumann, “The study design of UDRIVE: the naturalistic driving study across europe for cars, trucks and scooters,” European Transport Research Review, vol. 8, no. 2, pp. 1–10, 2016.
  • [164] B. E. Sabey and G. Staughton, “Interacting roles of road environment vehicle and road user in accidents,” Ceste I Mostovi, 1975.
  • [165] J. Nasar, P. Hecht, and R. Wener, “Mobile telephones, distracted attention, and pedestrian safety,” Accident analysis & prevention, vol. 40, no. 1, pp. 69–75, 2008.
  • [166] D. C. Schwebel, D. Stavrinos, K. W. Byington, T. Davis, E. E. O’Neal, and D. De Jong, “Distraction and pedestrian safety: How talking on the phone, texting, and listening to music impact crossing the street,” Accident Analysis & Prevention, vol. 45, pp. 266–271, 2012.
  • [167] I. E. Hyman, S. M. Boss, B. M. Wise, K. E. McKenzie, and J. M. Caggiano, “Did you see the unicycling clown? Inattentional blindness while walking and talking on a cell phone,” Applied Cognitive Psychology, vol. 24, no. 5, pp. 597–607, 2010.
  • [168] S. Schmidt and B. Färber, “Pedestrians at the kerb–recognising the action intentions of humans,” Transportation research part F: traffic psychology and behaviour, vol. 12, no. 4, pp. 300–310, 2009.
  • [169] A. Lindgren, F. Chen, P. W. Jordan, and H. Zhang, “Requirements for the design of advanced driver assistance systems-the differences between Swedish and Chinese drivers,” International Journal of Design, vol. 2, no. 2, 2008.
  • [170] G. M. Björklund and L. Åberg, “Driver behaviour in intersections: Formal and informal traffic rules,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 8, no. 3, pp. 239–253, 2005.
  • [171] T. Rosenbloom, H. Barkan, and D. Nemrodov, “For heaven’s sake keep the rules: Pedestrians’ behavior at intersections in ultra-orthodox and secular cities,” Transportation Research Part F, vol. 7, pp. 395–404, 2004.
  • [172] R. Sun, X. Zhuang, C. Wu, G. Zhao, and K. Zhang, “The estimation of vehicle speed and stopping distance by pedestrians crossing streets in a naturalistic traffic environment,” Transportation research part F: traffic psychology and behaviour, vol. 30, pp. 97–106, 2015.
  • [173] W. A. Harrell, “Factors influencing pedestrian cautiousness in crossing streets,” The Journal of Social Psychology, vol. 131, no. 3, pp. 367–372, 1991.
  • [174] P.-S. Lin, Z. Wang, and R. Guo, “Impact of connected vehicles and autonomous vehicles on future transportation,” Bridging the East and West, p. 46, 2016.
  • [175] E. CYingzi Du, K. Yang, F. Jiang, P. Jiang, R. Tian, M. Luzetski, Y. Chen, R. Sherony, and H. Takahashi, “Pedestrian behavior analysis using 110-car naturalistic driving data in USA,” Online, 2017-06-3. [Online]. Available:
  • [176] J. Caird and P. Hancock, “The perception of arrival time for different oncoming vehicles at an intersection,” Ecological Psychology, vol. 6, no. 2, pp. 83–109, 1994.
  • [177] W. Edwards, “The theory of decision making.” Psychological bulletin, vol. 51, no. 4, p. 380, 1954.
  • [178] P. Legrenzi, V. Girotto, and P. N. Johnson-Laird, “Focussing in reasoning and decision making,” Cognition, vol. 49, no. 1, pp. 37–66, 1993.
  • [179] H. A. Simon, “Theories of decision-making in economics and behavioral science,” The American economic review, vol. 49, no. 3, pp. 253–283, 1959.
  • [180] J. B. Carroll, Human cognitive abilities: A survey of factor-analytic studies.   Cambridge University Press, 1993.
  • [181] R. J. Sternberg, “Toward a unified componential theory of human reasoning.” DTIC Document, Tech. Rep., 1978.
  • [182] H. Barlow, “Inductive inference, coding, perception, and language,” Perception, vol. 3, no. 2, pp. 123–134, 1974.
  • [183] P. Johnson-Laird, S. S. Khemlani, and G. P. Goodwin, “Logic, probability, and human reasoning,” Trends in cognitive sciences, vol. 19, no. 4, pp. 201–214, 2015.
  • [184] M. Oaksford and N. Chater, “The probabilistic approach to human reasoning,” Trends in cognitive sciences, vol. 5, no. 8, pp. 349–357, 2001.
  • [185] A. Aliseda, Abductive reasoning.   Springer, 2006, vol. 330.
  • [186] H. R. Fischer, “Abductive reasoning as a way of worldmaking,” Foundations of Science, vol. 6, no. 4, pp. 361–383, 2001.
  • [187] “Types of reasoning,” Online, 2017-06-25. [Online]. Available:
  • [188] L. A. Zadeh, “Fuzzy logic and approximate reasoning,” Synthese, vol. 30, no. 3, pp. 407–428, 1975.
  • [189] L. Cosmides, “The logic of social exchange: Has natural selection shaped how humans reason? studies with the Wason selection task,” Cognition, vol. 31, no. 3, pp. 187–276, 1989.
  • [190] R. M. Byrne and P. N. Johnson-Laird, “Spatial reasoning,” Journal of memory and language, vol. 28, no. 5, pp. 564–575, 1989.
  • [191] A. U. Frank, “Qualitative spatial reasoning about distances and directions in geographic space,” Journal of Visual Languages & Computing, vol. 3, no. 4, pp. 343–371, 1992.
  • [192] L. Vila, “A survey on temporal reasoning in artificial intelligence,” AI Communications, vol. 7, no. 1, pp. 4–28, 1994.
  • [193] A. Pani and G. Bhattacharjee, “Temporal representation and reasoning in artificial intelligence: A review,” Mathematical and Computer Modelling, vol. 34, no. 1-2, pp. 55–80, 2001.
  • [194] M. Klenk, K. D. Forbus, E. Tomai, H. Kim, and B. Kyckelhahn, “Solving everyday physical reasoning problems by analogy using sketches,” in National Conference on Artificial Intelligence, vol. 20, no. 1, 2005, p. 209.
  • [195] R. Baillargeon, “Physical reasoning in infancy,” The cognitive neurosciences, pp. 181–204, 1995.
  • [196] A. G. Cohn, “Qualitative spatial representation and reasoning techniques,” in Annual Conference on Artificial Intelligence.   Springer, 1997, pp. 1–30.
  • [197] E. Davis, “Physical reasoning,” Handbook of knowledge representation, vol. 1, pp. 597–620, 2008.
  • [198] H. Jaeger, “Artificial intelligence: Deep neural reasoning,” Nature, vol. 538, no. 7626, pp. 467–468, 2016.
  • [199] B. Peng, Z. Lu, H. Li, and K.-F. Wong, “Towards neural network-based reasoning,” arXiv preprint arXiv:1508.05508, 2015.
  • [200]

    R. Socher, D. Chen, C. D. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in

    Advances in neural information processing systems, 2013, pp. 926–934.
  • [201] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, no. 7626, pp. 471–476, 2016.
  • [202] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [203] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” arXiv preprint arXiv:1410.5401, 2014.
  • [204] J. S. Sichman, R. Conte, Y. Demazeau, and C. Castelfranchi, “A social reasoning mechanism based on dependence networks,” in Proceedings of 11th European Conference on Artificial Intelligence, 1998, pp. 416–420.
  • [205] C. B. De Soto, M. London, and S. Handel, “Social reasoning and spatial paralogic.” Journal of Personality and Social Psychology, vol. 2, no. 4, p. 513, 1965.
  • [206] R. D. Wright, Visual attention.   Oxford University Press, 1998.
  • [207] J. K. Tsotsos, “Toward a computational model of visual attention,” in Early vision and beyond.   MIT Press, Cambridge, MA, 1995, pp. 207–218.
  • [208] N. D. Bruce, C. Wloka, N. Frosst, S. Rahman, and J. K. Tsotsos, “On computational modeling of visual saliency: Examining what’s right, and what’s left,” Vision research, vol. 116, pp. 95–112, 2015.
  • [209] Y.-C. Lee, J. D. Lee, and L. Ng Boyle, “The interaction of cognitive load and attention-directing cues in driving,” Human factors, vol. 51, no. 3, pp. 271–280, 2009.
  • [210] S. G. Klauer, T. A. Dingus, V. L. Neale, J. D. Sudweeks, D. J. Ramsey et al., “The impact of driver inattention on near-crash/crash risk: An analysis using the 100-car naturalistic driving study data,” 2006.
  • [211] R. E. Llaneras, J. Salinger, and C. A. Green, “Human factors issues associated with limited ability autonomous driving systems: Drivers’ allocation of visual attention to the forward roadway,” in Proceedings of the 7th International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design.   Public Policy Center, University of Iowa Iowa City, 2013, pp. 92–98.
  • [212] Y. Takeda, T. Sato, K. Kimura, H. Komine, M. Akamatsu, and J. Sato, “Electrophysiological evaluation of attention in drivers and passengers: Toward an understanding of drivers’ attentional state in autonomous vehicles,” Transportation research part F: traffic psychology and behaviour, vol. 42, pp. 140–150, 2016.
  • [213] B. Scassellati, “Mechanisms of shared attention for a humanoid robot,” in Embodied Cognition and Action: Papers from the 1996 AAAI Fall Symposium, vol. 4, no. 9, 1996, p. 21.
  • [214] B. Scassellati, “Imitation and mechanisms of joint attention: A developmental structure for building social skills on a humanoid robot,” pp. 176–195, 1999.
  • [215] C. Breazeal, A. Takanishi, and T. Kobayashi, “Social robots that interact with people,” in Springer handbook of robotics.   Springer, 2008, pp. 1349–1369.
  • [216] T. Fong, I. Nourbakhsh, and K. Dautenhahn, “A survey of socially interactive robots,” Robotics and autonomous systems, vol. 42, no. 3, pp. 143–166, 2003.
  • [217] B. Scassellati, “Knowing what to imitate and knowing when you succeed,” in Proceedings of the AISB’99 Symposium on Imitation in Animals and Artifacts, 1999, pp. 105–113.
  • [218] C. Breazeal and R. Brooks, “Robot emotion: A functional perspective,” Who needs emotions, pp. 271–310, 2005.
  • [219] C. Breazeal and J. Velasquez, “Robot in society: Friend or appliance,” in Proceedings of the 1999 Autonomous Agents Workshop on Emotion-Based Agent Architectures, 1999, pp. 18–26.
  • [220] R. A. Brooks, C. Breazeal, M. Marjanović, B. Scassellati, and M. M. Williamson, “The Cog project: Building a humanoid robot,” in Computation for metaphors, analogy, and agents.   Springer, 1999, pp. 52–87.
  • [221] C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Hoffman, and M. Berlin, “Effects of nonverbal communication on efficiency and robustness in human-robot teamwork,” in IROS.   IEEE, 2005, pp. 708–713.
  • [222] H. Ishiguro, T. Ono, M. Imai, T. Maeda, T. Kanda, and R. Nakatsu, “Robovie: An interactive humanoid robot,” Industrial robot: An international journal, vol. 28, no. 6, pp. 498–504, 2001.
  • [223] C. Becker-Asano, K. Ogawa, S. Nishio, and H. Ishiguro, “Exploring the uncanny valley with Geminoid HI-1 in a real-world application,” in Proceedings of IADIS International conference interfaces and human computer interaction, 2010, pp. 121–128.
  • [224] M. Staudte and M. W. Crocker, “Visual attention in spoken human-robot interaction,” in the 4th ACM/IEEE international conference on Human robot interaction.   ACM, 2009, pp. 77–84.
  • [225] B. Mutlu, F. Yamaoka, T. Kanda, H. Ishiguro, and N. Hagita, “Nonverbal leakage in robots: Communication of intentions through seemingly unintentional behavior,” in the 4th ACM/IEEE international conference on Human robot interaction (HRI).   ACM, 2009, pp. 69–76.
  • [226] M. Zheng, A. Moon, E. A. Croft, and M. Q.-H. Meng, “Impacts of robot head gaze on robot-to-human handovers,” International Journal of Social Robotics, vol. 7, no. 5, pp. 783–798, 2015.
  • [227] M. Imai, T. Ono, and H. Ishiguro, “Physical relation and expression: Joint attention for human-robot interaction,” IEEE Transactions on Industrial Electronics, vol. 50, no. 4, pp. 636–643, 2003.
  • [228] M. Staudte and M. W. Crocker, “Investigating joint attention mechanisms through spoken human–robot interaction,” Cognition, vol. 120, no. 2, pp. 268–291, 2011.
  • [229] B. Mutlu, A. Terrell, and C.-M. Huang, “Coordination mechanisms in human-robot collaboration,” in Proceedings of the Workshop on Collaborative Manipulation, 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013.
  • [230] T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe, “Gaze-communicative behavior of stuffed-toy robot with joint attention and eye contact based on ambient gaze-tracking,” in Proceedings of the 9th international conference on Multimodal interfaces.   ACM, 2007, pp. 140–145.
  • [231] M. Shiomi, T. Kanda, H. Ishiguro, and N. Hagita, “Interactive humanoid robots for a science museum,” in Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction (HRI).   ACM, 2006, pp. 305–312.
  • [232] D. Miyauchi, A. Sakurai, A. Nakamura, and Y. Kuno, “Active eye contact for human-robot communication,” in CHI’04 Extended Abstracts on Human Factors in Computing Systems.   ACM, 2004, pp. 1099–1102.
  • [233] S. Kim, Z. Yu, J. Kim, A. Ojha, and M. Lee, “Human-robot interaction using intention recognition,” in Proceedings of the 3rd International Conference on Human-Agent Interaction.   ACM, 2015, pp. 299–302.
  • [234] M. Monajjemi, J. Bruce, S. A. Sadat, J. Wawerla, and R. Vaughan, “UAV, do you see me? Establishing mutual attention between an uninstrumented human and an outdoor UAV in flight,” in IROS.   IEEE, 2015, pp. 3614–3620.
  • [235] Z. Yücel and A. A. Salah, “Head pose and neural network based gaze direction estimation for joint attention modeling in embodied agents,” in Annual Meeting of Cognitive Science Society, 2009.
  • [236] Z. Yucel, A. A. Salah, C. Meriçli, and T. Meriçli, “Joint visual attention modeling for naturally interacting robotic agents,” in International Symposium on Computer and Information Sciences (ISCIS).   IEEE, 2009, pp. 242–247.
  • [237]

    Z. Yücel, A. A. Salah, Ç. Meriçli, T. Meriçli, R. Valenti, and T. Gevers, “Joint attention by gaze interpolation and saliency,”

    IEEE Transactions on cybernetics, vol. 43, no. 3, pp. 829–842, 2013.
  • [238] S. Gorji and J. J. Clark, “Attentional push: Augmenting salience with shared attention modeling,” arXiv preprint arXiv:1609.00072, 2016.
  • [239]

    H. Kera, R. Yonetani, K. Higuchi, and Y. Sato, “Discovering objects of joint attention via first-person sensing,” in

    Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    , 2016, pp. 7–15.
  • [240] Y. Nagai, M. Asada, and K. Hosoda, “Developmental learning model for joint attention,” in IROS, vol. 1.   IEEE, 2002, pp. 932–937.
  • [241] Y. Nagai, K. Hosoda, and M. Asada, “How does an infant acquire the ability of joint attention?: A constructive approach,” 2003.
  • [242] M. Ito and J. Tani, “Joint attention between a humanoid robot and users in imitation game,” in the International Conference on Development and Learning (ICDL), 2004.
  • [243] Y. Nagai, “The role of motion information in learning human-robot joint attention,” in ICRA.   IEEE, 2005, pp. 2069–2074.
  • [244] A. Haasch, N. Hofemann, J. Fritsch, and G. Sagerer, “A multi-modal object attention system for a mobile robot,” in IROS.   IEEE, 2005, pp. 2712–2717.
  • [245] B. Schauerte and R. Stiefelhagen, ““Look at this!” learning to guide visual saliency in human-robot interaction,” in IROS.   IEEE, 2014, pp. 995–1002.
  • [246] A. Billard and K. Dautenhahn, “Experiments in learning by imitation-grounding and use of communication in robotic agents,” Adaptive behavior, vol. 7, no. 3-4, pp. 415–438, 1999.
  • [247] K. Dautenhahn and A. Billard, “Studying robot social cognition within a developmental psychology framework,” in Advanced Mobile Robots, 1999.(Eurobot’99) 1999 Third European Workshop on.   IEEE, 1999, pp. 187–194.
  • [248] H. Kozima and H. Yano, “A robot that learns to communicate with human caregivers,” in Proceedings of the First International Workshop on Epigenetic Robotics, 2001, pp. 47–52.
  • [249] H. Kozima, C. Nakagawa, and H. Yano, “Can a robot empathize with people?” Artificial life and robotics, vol. 8, no. 1, pp. 83–88, 2004.
  • [250] H. Kozima, M. P. Michalowski, and C. Nakagawa, “Keepon,” International Journal of Social Robotics, vol. 1, no. 1, pp. 3–18, 2009.
  • [251] K. Dautenhahn and I. Werry, “Issues of robot-human interaction dynamics in the rehabilitation of children with autism,” Proceedings From animals to animats, vol. 6, pp. 519–528, 2000.
  • [252] H. Kozima and A. Ito, “An attention-based approach to symbol acquisition,” in Intelligent Control (ISIC) Held jointly with IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Intelligent Systems and Semiotics (ISAS).   IEEE, 1998, pp. 852–856.
  • [253] H. Kozima and J. Zlatev, “An epigenetic approach to human-robot communication,” in 9th IEEE International Workshop on Robot and Human Interactive Communication.   IEEE, 2000, pp. 346–351.
  • [254] A. Ito, S. Hayakawa, and T. Terada, “Why robots need body for mind communication-an attempt of eye-contact between human and robot,” in IEEE International Workshop on Robot and Human Interactive Communication.   IEEE, 2004, pp. 473–478.
  • [255] H. Kozima, C. Nakagawa, and Y. Yasuda, “Interactive robots for communication-care: A case-study in autism therapy,” in IEEE International Workshop on Robot and Human Interactive Communication.   IEEE, 2005, pp. 341–346.
  • [256] L. Hobert, A. Festag, I. Llatser, L. Altomare, F. Visintainer, and A. Kovacs, “Enhancements of v2x communication in support of cooperative autonomous driving,” IEEE Communications Magazine, vol. 53, no. 12, pp. 64–70, 2015.
  • [257] X. Cheng, M. Wen, L. Yang, and Y. Li, “Index modulated OFDM with interleaved grouping for V2X communications,” in Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 1097–1104.
  • [258] S. R. Narla, “The evolution of connected vehicle technology: From smart drivers to smart cars to… self-driving cars,” Institute of Transportation Engineers. ITE Journal, vol. 83, no. 7, p. 22, 2013.
  • [259] W. Cunningham, “Honda tech warns drivers of pedestrian presence,” Online, 2017-06-30. [Online]. Available:
  • [260] T. Schmidt, R. Philipsen, and M. Ziefle, “From v2x to control2trust,” in Proceedings of the Third International Conference on Human Aspects of Information Security, Privacy, and Trust.   Springer-Verlag New York, Inc., 2015, pp. 570–581.
  • [261] T. Lagstrom and V. M. Lundgren, “AVIP-autonomous vehicles interaction with pedestrians,” Master’s thesis, Chalmers University of Technology, Gothenborg, Sweden, 2015.
  • [262] C. P. Urmson, I. J. Mahon, D. A. Dolgov, and J. Zhu, “Pedestrian notifications,” US Patent US 9 196 164B1, 11 24, 2015.
  • [263] “Mitsubishi electric introduces road-illuminating directional indicators,” Online, 2017-06-30. [Online]. Available:
  • [264] “Overview: Mercedes-Benz F 015 luxury in motion,” Online, 2017-06-30. [Online]. Available:
  • [265] N. Pennycooke, “AEVITA: Designing biomimetic vehicle-to-pedestrian communication protocols for autonomously operating & parking on-road electric vehicles,” Master’s thesis, Massachusetts Institute of Technology, 2012.
  • [266] N. Mirnig, N. Perterer, G. Stollnberger, and M. Tscheligi, “Three strategies for autonomous car-to-pedestrian communication: A survival guide,” in ACM/IEEE International Conference on Human-Robot Interaction.   ACM, 2017, pp. 209–210.
  • [267] A. Sieß, K. Hübel, D. Hepperle, A. Dronov, C. Hufnagel, J. Aktun, and M. Wölfel, “Hybrid city lighting-improving pedestrians’ safety through proactive street lighting,” in International Conference on Cyberworlds (CW).   IEEE, 2015, pp. 46–49.
  • [268] “Smart highway,” Online, 2017-06-30. [Online]. Available:
  • [269] E. M. Tapia, S. S. Intille, and K. Larson, “Activity recognition in the home using simple and ubiquitous sensors,” in Pervasive, vol. 4.   Springer, 2004, pp. 158–175.
  • [270] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recognition using cell phone accelerometers,” ACM SigKDD Explorations Newsletter, vol. 12, no. 2, pp. 74–82, 2011.
  • [271] A. K. Bourke and G. M. Lyons, “A threshold-based fall-detection algorithm using a bi-axial gyroscope sensor,” Medical engineering & physics, vol. 30, no. 1, pp. 84–90, 2008.
  • [272] Q. V. Vo, G. Lee, and D. Choi, “Fall detection based on movement and smart phone technology,” in International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF).   IEEE, 2012, pp. 1–4.
  • [273] S.-W. Lee and K. Mase, “Recognition of walking behaviors for pedestrian navigation,” in International Conference on Control Applications.   IEEE, 2001, pp. 1152–1155.
  • [274] L. Bao and S. Intille, “Activity recognition from user-annotated acceleration data,” Pervasive computing, pp. 1–17, 2004.
  • [275] S. W. Lee and K. Mase, “Activity and location recognition using wearable sensors,” IEEE pervasive computing, vol. 1, no. 3, pp. 24–32, 2002.
  • [276] M. Kourogi, T. Ishikawa, and T. Kurata, “A method of pedestrian dead reckoning using action recognition,” in Position Location and Navigation Symposium (PLANS).   IEEE, 2010, pp. 85–89.
  • [277] L. Tong, W. Chen, Q. Song, and Y. Ge, “A research on automatic human fall detection method based on wearable inertial force information acquisition system,” in International Conference on Robotics and Biomimetics (ROBIO).   IEEE, 2009, pp. 949–953.
  • [278] N. Gao and L. Zhao, “A pedestrian dead reckoning system using SEMG based on activities recognition,” in Chinese Guidance, Navigation and Control Conference (CGNCC).   IEEE, 2016, pp. 2361–2365.
  • [279] X. Su, H. Tong, and P. Ji, “Activity recognition with smartphone sensors,” Tsinghua Science and Technology, vol. 19, no. 3, pp. 235–249, 2014.
  • [280] S. Mehner, R. Klauck, and H. Koenig, “Location-independent fall detection with smartphone,” in Proceedings of the 6th International Conference on PErvasive Technologies Related to Assistive Environments.   ACM, 2013, p. 11.
  • [281] K. Stormo, “Human fall detection using distributed monostatic UWB radars,” Master’s thesis, Institutt for teknisk kybernetikk, 2014.
  • [282] M. G. Amin, Y. D. Zhang, F. Ahmad, and K. D. Ho, “Radar signal processing for elderly fall detection: The future for in-home monitoring,” IEEE Signal Processing Magazine, vol. 33, no. 2, pp. 71–80, 2016.
  • [283] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding and modeling of wifi signal based human activity recognition,” in Proceedings of the 21st annual international conference on mobile computing and networking.   ACM, 2015, pp. 65–76.
  • [284] T. Van Kasteren, A. Noulas, G. Englebienne, and B. Kröse, “Accurate activity recognition in a home setting,” in Proceedings of the 10th international conference on Ubiquitous computing.   ACM, 2008, pp. 1–9.
  • [285] B. U. Toreyin, Y. Dedeoglu, and A. E. Çetin, “HMM based falling person detection using both audio and video,” Lecture Notes in Computer Science, vol. 3766, p. 211, 2005.
  • [286] O. D. Lara and M. A. Labrador, “A survey on human activity recognition using wearable sensors.” IEEE Communications Surveys and Tutorials, vol. 15, no. 3, pp. 1192–1209, 2013.
  • [287] P. Rashidi and A. Mihailidis, “A survey on ambient-assisted living tools for older adults,” IEEE journal of biomedical and health informatics, vol. 17, no. 3, pp. 579–590, 2013.
  • [288] M. Bertozzi, A. Broggi, and A. Fascioli, “Vision-based intelligent vehicles: State of the art and perspectives,” Robotics and Autonomous systems, vol. 32, no. 1, pp. 1–16, 2000.
  • [289] S. A. Nene, S. K. Nayar, H. Murase et al., “Columbia object image library (COIL-20),” 1996.
  • [290] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [291] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in ICCV, vol. 2.   IEEE, 2005, pp. 1800–1807.
  • [292] R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning,” in CVPR, vol. 2.   IEEE, 2003, pp. II–II.
  • [293] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” PAMI, vol. 28, no. 4, pp. 594–611, 2006.
  • [294] E. M., A. Zisserman, C. K. I. Williams, and L. Van Gool, “The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results,”
  • [295] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,”
  • [296] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results,”
  • [297] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky, “Exploiting hierarchical context on a large database of object categories,” in CVPR, 2010.
  • [298] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR.   IEEE, 2009, pp. 951–958.
  • [299] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in ACM Conference on Image and Video Retrieval (CIVR), Santorini, Greece., 2009.
  • [300] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
  • [301] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [302] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results,”
  • [303] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in CVPR.   IEEE, 2010, pp. 3485–3492.
  • [304] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view RGB-D object dataset,” in ICRA.   IEEE, 2011, pp. 1817–1824.
  • [305] A. Aldoma, T. Fäulhammer, and M. Vincze, “Automation of “ground truth” annotation for multi-view RGB-D object instance recognition datasets,” in IROS.   IEEE, 2014, pp. 5016–5023.
  • [306] B. Browatzki, J. Fischer, B. Graf, H. H. Bülthoff, and C. Wallraven, “Going into depth: Evaluating 2D and 3D cues for object classification on a new, large-scale object dataset,” in ICCVW.   IEEE, 2011, pp. 1189–1195.
  • [307] G. Patterson and J. Hays, “SUN attribute database: Discovering, annotating, and recognizing scene attributes,” in CVPR, 2012.
  • [308] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPSW, vol. 2011, no. 2, 2011, p. 5.
  • [309] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in ECCV.   Springer, 2012, pp. 746–760.
  • [310] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,”
  • [311] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV.   Springer, 2014, pp. 740–755.
  • [312]

    B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in

    Advances in neural information processing systems, 2014, pp. 487–495.
  • [313] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D shapenets: A deep representation for volumetric shapes,” in CVPR, 2015, pp. 1912–1920.
  • [314] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” PAMI, 2017.
  • [315] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research (IJRR), 2013.
  • [316] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in CVPR, June 2016.
  • [317] S. M. Bileschi, “Streetscenes: Towards scene understanding in still images,” MASSACHUSETTS INST OF TECH CAMBRIDGE, Tech. Rep., 2006.
  • [318] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV, 2008, pp. 44–57.
  • [319] A. Geiger, C. Wojek, and R. Urtasun, “Joint 3D estimation of objects and scene layout,” in NIPS, 2011.
  • [320] T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth, “Stixmantics: A medium-level model for real-time semantic scene understanding,” in ECCV.   Springer, 2014, pp. 533–548.
  • [321] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers,” in CVPR, 2016, pp. 2129–2137.
  • [322] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in CVPR, June 2009.
  • [323] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1.   IEEE, 2005, pp. 886–893.
  • [324] M. Enzweiler and D. M. Gavrila, “Monocular pedestrian detection: Survey and experiments,” PAMI, vol. 31, no. 12, pp. 2179–2195, 2009.
  • [325] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, “Pedestrian detection using wavelet templates,” in CVPR, 1997, pp. 193–99.
  • [326] S. Munder and D. M. Gavrila, “An experimental study on pedestrian classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 11, pp. 1863–1868, 2006.
  • [327] L. Wang, J. Shi, G. Song, and I.-F. Shen, “Object detection combining recognition and segmentation,” in ACCV.   Springer, 2007, pp. 189–199.
  • [328] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “A mobile vision system for robust multi-person tracking,” in CVPR.   IEEE, 2008, pp. 1–8.
  • [329] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, “Multi-cue pedestrian classification with partial occlusion handling,” in CVPR.   IEEE, 2010, pp. 990–997.
  • [330] C. Wojek, S. Walk, and B. Schiele, “Multi-cue onboard pedestrian detection,” in CVPR.   IEEE, 2009, pp. 794–801.
  • [331] C. G. Keller, M. Enzweiler, and D. M. Gavrila, “A new benchmark for stereo-based pedestrian detection,” in Intelligent Vehicles Symposium (IV).   IEEE, 2011, pp. 691–696.
  • [332] K. Fragkiadaki, W. Zhang, G. Zhang, and J. Shi, “Two-granularity tracking: Mediating trajectory and detection graphs for tracking under occlusions,” in ECCV.   Springer, 2012, pp. 552–565.
  • [333] W. Ouyang and X. Wang, “A discriminative deep model for pedestrian detection with occlusion handling,” in CVPR.   IEEE, 2012, pp. 3258–3265.
  • [334] Y. Deng, P. Luo, C. C. Loy, and X. Tang, “Pedestrian attribute recognition at far distance,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 789–792.
  • [335] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon, “Multispectral pedestrian detection: Benchmark dataset and baseline,” in CVPR, 2015, pp. 1037–1045.
  • [336] S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A diverse dataset for pedestrian detection,” arXiv preprint arXiv:1702.05693, 2017.
  • [337] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object representations for fine-grained categorization,” in ICCVW, 2013, pp. 554–561.
  • [338] K. Matzen and N. Snavely, “NYC3DCars: A dataset of 3D vehicles in geographic context,” in ICCV, 2013.
  • [339]

    S. Sivaraman and M. M. Trivedi, “A general active-learning framework for on-road vehicle recognition and tracking,”

    IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 2, pp. 267–276, 2010.
  • [340] J. Arróspide, L. Salgado, and M. Nieto, “Video analysis-based vehicle detection and tracking using an mcmc sampling framework,” EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, p. 2, 2012.
  • [341] C. Caraffi, T. Vojir, J. Trefny, J. Sochman, and J. Matas, “A System for Real-time Detection and Tracking of Vehicles from a Single Car-mounted Camera,” in ITSC, Sep. 2012, pp. 975–982.
  • [342] M. Ozuysal, V. Lepetit, and P.Fua, “Pose estimation for category specific multiview object localization,” in CVPR, Miami, FL, June 2009.
  • [343] S. Agarwal and D. Roth, “Learning a sparse representation for object detection,” in ECCV.   Springer, 2002, pp. 113–127.
  • [344] H. Schneiderman and T. Kanade, “A statistical method for 3d object detection applied to faces and cars,” in CVPR, vol. 1.   IEEE, 2000, pp. 746–751.
  • [345] C. Papageorgiou and T. Poggio, “A trainable object detection system: Car detection in static images,” mitai, Tech. Rep. 1673, Oct. 1999.
  • [346] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “The German Traffic Sign Recognition Benchmark: A multi-class classification competition,” in IEEE International Joint Conference on Neural Networks, 2011, pp. 1453–1460.
  • [347] M. Mathias, R. Timofte, R. Benenson, and L. Van Gool, “Traffic sign recognition-how far are we from the solution?” in The International Joint Conference on Neural Networks (IJCNN).   IEEE, 2013, pp. 1–8.
  • [348]

    S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil-Jimenez, H. Gomez-Moreno, and F. López-Ferreras, “Road-sign detection and recognition based on support vector machines,”

    IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, pp. 264–278, 2007.
  • [349] S. Šegvic, K. Brkić, Z. Kalafatić, V. Stanisavljević, M. Ševrović, D. Budimir, and I. Dadić, “A computer vision assisted geoinformation inventory for traffic infrastructure,” in Intelligent Transportation Systems (ITSC).   IEEE, 2010, pp. 66–73.
  • [350] F. Larsson, M. Felsberg, and P.-E. Forssen, “Correlating Fourier descriptors of local patches for road sign recognition,” IET Computer Vision, vol. 5, no. 4, pp. 244–254, 2011.
  • [351] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, “Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1484–1497, 2012.
  • [352] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark,” in International Joint Conference on Neural Networks, no. 1288, 2013.
  • [353] M. Aly, “Real time detection of lane markers in urban streets,” in Intelligent Vehicles Symposium (IV).   IEEE, 2008, pp. 7–12.
  • [354] S. Belongie, J. Malik, and J. Puzicha, “Shape context: A new descriptor for shape matching and object recognition,” in Advances in neural information processing systems, 2001, pp. 831–837.
  • [355] I. Chakravarty and H. Freeman, “Characteristic views as a basis for three-dimensional object recognition,” representations, vol. 9, p. 10, 1982.
  • [356] M. Oshima and Y. Shirai, “Object recognition using three-dimensional information,” PAMI, no. 4, pp. 353–361, 1983.
  • [357] P. J. Besl and R. C. Jain, “Three-dimensional object recognition,” ACM Computing Surveys (CSUR), vol. 17, no. 1, pp. 75–145, 1985.
  • [358] D. G. Lowe, “Three-dimensional object recognition from single two-dimensional images,” Artificial intelligence, vol. 31, no. 3, pp. 355–395, 1987.
  • [359] Y. Lamdan, J. T. Schwartz, and H. J. Wolfson, “Object recognition by affine invariant matching,” in CVPR.   IEEE, 1988, pp. 335–344.
  • [360] D. Wilkes and J. K. Tsotsos, “Active object recognition,” in CVPR.   IEEE, 1992, pp. 136–141.
  • [361] S. J. Dickinson, H. I. Christensen, J. Tsotsos, and G. Olofsson, “Active object recognition integrating attention and viewpoint control,” in ECCV.   Springer, 1994, pp. 2–14.
  • [362] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” PAMI, vol. 21, no. 5, pp. 433–449, 1999.
  • [363] M. J. Swain and D. H. Ballard, “Color indexing,” IJCV, vol. 7, no. 1, pp. 11–32, 1991.
  • [364] T. Gevers and A. W. Smeulders, “Color-based object recognition,” Pattern recognition, vol. 32, no. 3, pp. 453–464, 1999.
  • [365] M. Pontil and A. Verri, “Support vector machines for 3D object recognition,” PAMI, vol. 20, no. 6, pp. 637–646, 1998.
  • [366] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in CVPR.   IEEE, 2008, pp. 1–8.
  • [367] A. Torralba, K. P. Murphy, and W. T. Freeman, “Sharing features: efficient boosting procedures for multiclass object detection,” in CVPR, vol. 2.   IEEE, 2004, pp. II–II.
  • [368] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, vol. 2.   Ieee, 1999, pp. 1150–1157.
  • [369] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
  • [370] H. Bilen, M. Pedersoli, V. P. Namboodiri, T. Tuytelaars, and L. Van Gool, “Object classification with adaptable regions,” in CVPR, 2014, pp. 3662–3669.
  • [371] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1.   IEEE, 2001, pp. I–I.
  • [372] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in International Conference on Image Processing, vol. 1.   IEEE, 2002, pp. I–I.
  • [373] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, “Context-based vision system for place and object recognition,” in ICCV.   IEEE, 2003, p. 273.
  • [374] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [375] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in CVPR.   IEEE, 2012, pp. 702–709.
  • [376] L. Bo, X. Ren, and D. Fox, “Learning hierarchical sparse features for RGB-(D) object recognition,” The International Journal of Robotics Research, vol. 33, no. 4, pp. 581–599, 2014.
  • [377] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV.   Springer, 2014, pp. 345–360.
  • [378] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” IJCV, vol. 104, no. 2, pp. 154–171, 2013.
  • [379] T. Deselaers, B. Alexe, and V. Ferrari, “Weakly supervised localization and learning with generic knowledge,” IJCV, vol. 100, no. 3, pp. 275–293, 2012.
  • [380] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3D object proposals for accurate object class detection,” in Advances in Neural Information Processing Systems, 2015, pp. 424–432.
  • [381] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in ICCV, 2015, pp. 1026–1034.
  • [382] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [383] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [384] D. Maturana and S. Scherer, “Voxnet: A 3D convolutional neural network for real-time object recognition,” in IROS.   IEEE, 2015, pp. 922–928.
  • [385] P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3D pose estimation,” in CVPR, 2015, pp. 3109–3118.
  • [386] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [387] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.
  • [388] S. Shankar, V. K. Garg, and R. Cipolla, “Deep-carving: Discovering visual attributes by carving deep neural nets,” in CVPR, 2015, pp. 3403–3412.
  • [389] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Weakly supervised object recognition with convolutional neural networks,” in NIPS, 2014.
  • [390] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in CVPR, 2014, pp. 2147–2154.
  • [391] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014, pp. 580–587.
  • [392] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “CNN: Single-label to multi-label,” arXiv preprint arXiv:1406.5726, 2014.
  • [393] R. Girshick, “Fast R-CNN,” in ICCV, 2015, pp. 1440–1448.
  • [394]

    J. C. Caicedo and S. Lazebnik, “Active object localization with deep reinforcement learning,” in

    ICCV, 2015, pp. 2488–2496.
  • [395] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, 2016, pp. 779–788.
  • [396] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [397] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV.   Springer, 2014, pp. 346–361.
  • [398] V. Kantorov, M. Oquab, M. Cho, and I. Laptev, “ContextLocNet: Context-aware deep network models for weakly supervised localization,” in ECCV.   Springer, 2016, pp. 350–365.
  • [399] H. Hu, G.-T. Zhou, Z. Deng, Z. Liao, and G. Mori, “Learning structured inference neural networks with label relations,” in CVPR, 2016, pp. 2960–2968.
  • [400] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” PAMI, vol. 37, no. 1, pp. 13–27, 2015.
  • [401] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in CVPR, 2015, pp. 3367–3375.
  • [402] D. M. Gavrila and V. Philomin, “Real-time object detection for“ smart” vehicles,” in ICCV, vol. 1.   IEEE, 1999, pp. 87–93.
  • [403] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” International Journal of Computer Vision, vol. 38, no. 1, pp. 15–33, 2000.
  • [404] A. Ess, T. Müller, H. Grabner, and L. J. Van Gool, “Segmentation-based urban traffic scene understanding.” in BMVC, vol. 1, 2009, p. 2.
  • [405] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele, “Monocular visual scene understanding: Understanding multi-object traffic scenes,” PAMI, vol. 35, no. 4, pp. 882–897, 2013.
  • [406] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, “Monocular 3d object detection for autonomous driving,” in CVPR, 2016, pp. 2147–2156.
  • [407] K. C. Fuerstenberg, K. C. Dietmayer, and V. Willhoeft, “Pedestrian recognition in urban traffic using a vehicle based multilayer laserscanner,” in Intelligent Vehicle Symposium (IV), vol. 1.   IEEE, 2002, pp. 31–35.
  • [408] K. Kidono, T. Miyasaka, A. Watanabe, T. Naito, and J. Miura, “Pedestrian recognition using high-definition LIDAR,” in Intelligent Vehicles Symposium (IV).   IEEE, 2011, pp. 405–410.
  • [409] A. Broggi, P. Cerri, S. Ghidoni, P. Grisleri, and H. G. Jung, “A new approach to urban pedestrian detection for automatic braking,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 4, pp. 594–605, 2009.
  • [410] D. Beckwith and K. Hunter-Zaworski, “Passive pedestrian detection at unsignalized crossings,” Transportation Research Record: Journal of the Transportation Research Board, no. 1636, pp. 96–103, 1998.
  • [411] H. Nanda and L. Davis, “Probabilistic template based pedestrian detection in infrared videos,” in Intelligent Vehicle Symposium (IV), vol. 1.   IEEE, 2002, pp. 15–20.
  • [412] M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, and A. Tibaldi, “Shape-based pedestrian detection and localization,” in Intelligent Transportation Systems (ITSC), vol. 1.   IEEE, 2003, pp. 328–333.
  • [413] M. Bertozzi, E. Binelli, A. Broggi, and M. Rose, “Stereo vision-based approaches for pedestrian detection,” in CVPR.   IEEE, 2005, pp. 16–16.
  • [414] C. A. Richards, C. E. Smith, and N. P. Papanikolopoulos, “Vision-based intelligent control of transportation systems,” in the IEEE International Symposium on Intelligent Control.   IEEE, 1995, pp. 519–524.
  • [415] K. Rohr, “Incremental recognition of pedestrians from image sequences,” in CVPR.   IEEE, 1993, pp. 8–13.
  • [416] C. Wöhler, J. K. Anlauf, T. Pörtner, and U. Franke, “A time delay neural network algorithm for real-time pedestrian recognition,” in International Conference on Intelligent Vehicle.   Citeseer, 1998.
  • [417] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, “Shape-based pedestrian detection,” in Intelligent Vehicles Symposium (IV).   IEEE, 2000, pp. 215–220.
  • [418] M. Oren, C. Papageorgiou, P. Shinha, E. Osuna, and T. Poggio, “A trainable system for people detection,” in Proceedings of Image Understanding Workshop, vol. 24, 1997.
  • [419] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” in BMVC.   British Machine Vision Association, 2009.
  • [420] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in ECCV.   Springer, 2006, pp. 428–441.
  • [421] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li, “Robust multi-resolution pedestrian detection in traffic scenes,” in CVPR, 2013, pp. 3033–3040.
  • [422] Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning strong parts for pedestrian detection,” in ICCV, 2015, pp. 1904–1912.
  • [423] Y. Tian, P. Luo, X. g. Wang, and X. Tang, “Pedestrian detection aided by deep learning semantic tasks,” in CVPR, 2015, pp. 5079–5087.
  • [424] J. Schlosser, C. K. Chow, and Z. Kira, “Fusing LIDAR and images for pedestrian detection using convolutional neural networks,” in ICRA.   IEEE, 2016, pp. 2198–2205.
  • [425] Q. Hu, P. Wang, C. Shen, A. van den Hengel, and F. Porikli, “Pushing the limits of deep cnns for pedestrian detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
  • [426] G. Brazil, X. Yin, and X. Liu, “Illuminating pedestrians via simultaneous detection & segmentation,” in ICCV, 2017, pp. 4950–4959.
  • [427] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in CVPR.   IEEE, 2009, pp. 304–311.
  • [428] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” PAMI, vol. 34, no. 4, pp. 743–761, 2012.
  • [429] R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten years of pedestrian detection, what have we learned?” arXiv preprint arXiv:1411.4304, 2014.
  • [430] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian detection for advanced driver assistance systems,” PAMI, vol. 32, no. 7, pp. 1239–1258, 2010.
  • [431] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features,” in ICCV, 2015, pp. 82–90.
  • [432] L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-cnn doing well for pedestrian detection?” in ECCV.   Springer, 2016, pp. 443–457.
  • [433] C. Radford and D. Houghton, “Vehicle detection in open-world scenes using a Hough transform technique,” in Third International Conference on Image Processing and its Applications.   IET, 1989, pp. 78–82.
  • [434] M. Betke, E. Haritaoglu, and L. S. Davis, “Multiple vehicle detection and tracking in hard real-time,” in Intelligent Vehicles Symposium (IV).   IEEE, 1996, pp. 351–356.
  • [435] M. Bertozzi, A. Broggi, and S. Castelluccio, “A real-time oriented system for vehicle detection,” Journal of Systems Architecture, vol. 43, no. 1-5, pp. 317–325, 1997.
  • [436] M. Bertozzi, A. Broggi, A. Fascioli, and S. Nichele, “Stereo vision-based vehicle detection,” in Intelligent Vehicles Symposium (IV).   IEEE, 2000, pp. 39–44.
  • [437] G. Alessandretti, A. Broggi, and P. Cerri, “Vehicle and guard rail detection using Radar and vision data fusion,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 1, pp. 95–105, 2007.
  • [438] D. Bullock, J. Garrett, and C. Hendrickson, “A neural network for image-based vehicle detection,” Transportation Research Part C: Emerging Technologies, vol. 1, no. 3, pp. 235–247, 1993.
  • [439] S. Gupte, O. Masoud, R. F. Martin, and N. P. Papanikolopoulos, “Detection and classification of vehicles,” IEEE Transactions on intelligent transportation systems, vol. 3, no. 1, pp. 37–47, 2002.
  • [440] C. P. Papageorgiou and T. Poggio, “A trainable object detection system: Car detection in static images,” 1999.
  • [441] Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection using evolutionary Gabor filter optimization,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 2, pp. 125–137, 2005.
  • [442] H. T. Niknejad, K. Takahashi, S. Mita, and D. McAllester, “Vehicle detection and tracking at nighttime for urban autonomous driving,” in IROS.   IEEE, 2011, pp. 4442–4447.
  • [443] S. Sivaraman and M. M. Trivedi, “Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 4, pp. 1773–1795, 2013.
  • [444] A. López, J. Hilgenstock, A. Busse, R. Baldrich, F. Lumbreras, and J. Serrat, “Nighttime vehicle detection for intelligent headlight control,” in Advanced Concepts for Intelligent Vision Systems.   Springer, 2008, pp. 113–124.
  • [445] L. Cao, Q. Jiang, M. Cheng, and C. Wang, “Robust vehicle detection by combining deep features with exemplar classification,” Neurocomputing, vol. 215, pp. 225–231, 2016.
  • [446] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image,” in CVPR, 2017.
  • [447] S. Lange, F. Ulbrich, and D. Goehring, “Online vehicle detection using deep neural networks and lidar based preselected image patches,” in Intelligent Vehicles Symposium (IV).   IEEE, 2016, pp. 954–959.
  • [448] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in CVPR, 2017.
  • [449] Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection: A review,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 5, pp. 694–711, 2006.
  • [450] S. Sivaraman and M. M. Trivedi, “Vehicle detection by independent parts for urban driver assistance,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 4, pp. 1597–1608, 2013.
  • [451] L. Priese, V. Rehrmann, R. Schian, and R. Lakmann, “Traffic sign recognition based on color image evaluation,” in Proceedings IEEE Intelligent Vehicles Symposium’93.   Citeseer, 1993.
  • [452] N. Kehtarnavaz and A. Ahmad, “Traffic sign recognition in noisy outdoor scenes,” in Intelligent Vehicles Symposium (IV).   IEEE, 1995, pp. 460–465.
  • [453] A. De La Escalera, L. E. Moreno, M. A. Salichs, and J. M. Armingol, “Road traffic sign detection and classification,” IEEE transactions on industrial electronics, vol. 44, no. 6, pp. 848–859, 1997.
  • [454] A. Arlicot, B. Soheilian, and N. Paparoditis, “Circular road sign extraction from street level images using colour, shape and texture databases maps.” in Workshop Laserscanning, 2009, pp. 205–210.
  • [455] Y.-W. Seo, D. Wettergreen, and W. Zhang, “Recognizing temporary changes on highways for reliable autonomous driving,” in International Conference on Systems, Man, and Cybernetics (SMC).   IEEE, 2012, pp. 3027–3032.
  • [456] V. Balali, E. Depwe, and M. Golparvar-Fard, “Multi-class traffic sign detection and classification using google street view images,” in Transportation Research Board 94th Annual Meeting, 2015.
  • [457] A. De la Escalera, J. M. Armingol, and M. Mata, “Traffic sign recognition and analysis for intelligent vehicles,” Image and vision computing, vol. 21, no. 3, pp. 247–258, 2003.
  • [458] P. Paclık and J. Novovicova, “Road sign classification without color information,” in the 6th Annual Conference of the Advanced School for Computing and Imaging, 2000.
  • [459] Y. Zhu, C. Zhang, D. Zhou, X. Wang, X. Bai, and W. Liu, “Traffic sign detection and recognition using fully convolutional network guided proposals,” Neurocomputing, vol. 214, pp. 758–766, 2016.
  • [460] J. D. Crisman and C. E. Thorpe, “UNSCARF-A color vision system for the detection of unstructured roads,” in ICRA.   IEEE, 1991, pp. 2496–2501.
  • [461] A. Broggi and S. Berte, “Vision-based road detection in automotive systems: A real-time expectation-driven approach,” Journal of Artificial Intelligence Research, vol. 3, pp. 325–348, 1995.
  • [462] Y. He, H. Wang, and B. Zhang, “Color-based road detection in urban traffic scenes,” IEEE Transactions on Intelligent Transportation Systems, vol. 5, no. 4, pp. 309–318, 2004.
  • [463] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, “Self-supervised monocular road detection in desert terrain.” in Robotics: science and systems, vol. 38.   Philadelphia, 2006.
  • [464] R. Mohan, “Deep deconvolutional networks for scene parsing,” arXiv preprint arXiv:1411.4101, 2014.
  • [465] A. Laddha, M. K. Kocamaz, L. E. Navarro-Serment, and M. Hebert, “Map-supervised road detection,” in Intelligent Vehicles Symposium (IV), 2016 IEEE.   IEEE, 2016, pp. 118–123.
  • [466] C. C. T. Mendes, V. Frémont, and D. F. Wolf, “Exploiting fully convolutional neural networks for fast road detection,” in ICRA.   IEEE, 2016, pp. 3174–3179.
  • [467] G. L. Oliveira, W. Burgard, and T. Brox, “Efficient deep methods for monocular road segmentation,” in IROS, 2016.
  • [468] A. B. Hillel, R. Lerner, D. Levi, and G. Raz, “Recent progress in road and lane detection: a survey,” Machine vision and applications, vol. 25, no. 3, pp. 727–745, 2014.
  • [469] S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose estimation,” in BMVC, 2010.
  • [470]

    V. Kazemi, M. Burenius, H. Azizpour, and J. Sullivan, “Multi-view body part recognition with random forests,” in

    BMVC.   British Machine Vision Association, 2013.
  • [471] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014, pp. 3686–3693.
  • [472] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in ECCV.   Springer, 2014, pp. 94–108.
  • [473] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, “Progressive search space reduction for human pose estimation,” in CVPR, jun 2008.
  • [474] N. Jammalamadaka, A. Zisserman, M. Eichner, V. Ferrari, and C. V. Jawahar, “Has my algorithm succeeded? an evaluator for human pose estimators,” in ECCV, 2012.
  • [475] J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman, “Domain adaptation for upper body pose tracking in signed TV broadcasts,” in BMVC, 2013.
  • [476] S. Gasparrini, E. Cippitelli, S. Spinsante, and E. Gambi, “A depth-based fall detection system using a kinect sensor,” Sensors, vol. 14, no. 2, pp. 2756–2775, 2014.
  • [477] B. Sapp and B. Taskar, “Modec: Multimodal decomposable models for human pose estimation,” in CVPR, 2013.
  • [478] A. Shafaei and J. J. Little, “Real-time human motion capture with multiple depth cameras,” in CRV.   Canadian Image Processing and Pattern Recognition Society (CIPPRS), 2016.
  • [479] V. Ferrari, M. Marín-Jiménez, and A. Zisserman, “2D human pose estimation in TV shows,” Statistical and Geometrical Approaches to Visual Motion Analysis, pp. 128–147, 2009.
  • [480] B. Sapp, D. Weiss, and B. Taskar, “Parsing human motion with stretchable models,” in CVPR.   IEEE, 2011, pp. 1281–1288.
  • [481] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Unstructured human activity detection from RGBD images,” in ICRA.   IEEE, 2012, pp. 842–849.
  • [482] V. Kazemi and J. Sullivan, “Using richer models for articulated pose estimation of footballers,” in BMVC, 2012.
  • [483] H. S. Koppula, R. Gupta, and A. Saxena, “Learning human activities and object affordances from RGB-D videos,” The International Journal of Robotics Research, vol. 32, no. 8, pp. 951–970, 2013.
  • [484] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, “Human pose estimation using body parts dependent joint regressors,” in CVPR, 2013, pp. 3041–3048.
  • [485] T.-H. Yu, T.-K. Kim, and R. Cipolla, “Unconstrained monocular 3D human pose estimation by action detection and cross-modality regression forest,” in CVPR, 2013, pp. 3642–3649.
  • [486] S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, “Multi-modal Gesture Recognition Challenge 2013: Dataset and results,” in ICMI.   ACM, 2013, pp. 445–452.
  • [487] W. Zhang, M. Zhu, and K. G. Derpanis, “From actemes to action: A strongly-supervised representation for detailed action understanding,” in CVPR, 2013, pp. 2248–2255.
  • [488] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” PAMI, vol. 36, no. 7, pp. 1325–1339, jul 2014.
  • [489] S. Antol, C. L. Zitnick, and D. Parikh, “Zero-shot learning via visual abstraction,” in ECCV, 2014.
  • [490] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, “3D pictorial structures for multiple human pose estimation,” in CVPR, 2014, pp. 1669–1676.
  • [491] A. Cherian, J. Mairal, K. Alahari, and C. Schmid, “Mixing body-part sequences for human pose estimation,” in CVPR, 2014.
  • [492] E. Cippitelli, S. Gasparrini, E. Gambi, S. Spinsante, J. Wåhslény, I. Orhany, and T. Lindhy, “Time synchronization and data fusion for RGB-depth cameras and inertial sensors in AAL applications,” in IEEE International Conference on Communication Workshop (ICCW).   IEEE, 2015, pp. 265–270.
  • [493] A. Wetzler, R. Slossberg, and R. Kimmel, “Rule of thumb: Deep derotation for improved fingertip detection,” in BMVC.   BMVA Press, September 2015, pp. 33.1–33.12.
  • [494] M. I. López-Quintero, M. J. Marín-Jiménez, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer, “Stereo Pictorial Structure for 2D articulated human pose estimation,” Machine Vision and Applications, vol. 27, no. 2, pp. 157–174, 2015.
  • [495] A. Haque, B. Peng, Z. Luo, A. Alahi, S. Yeung, and L. Fei-Fei, “Towards viewpoint invariant 3d human pose estimation,” in ECCV, October 2016.
  • [496] S. Gasparrini, E. Cippitelli, E. Gambi, S. Spinsante, J. Wåhslén, I. Orhan, and T. Lindh, “Proposal and experimental evaluation of fall detection solution based on wearable and depth data fusion,” in ICT innovations.   Springer, 2016, pp. 99–108.
  • [497] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action match a spatio-temporal maximum average correlation height filter for action recognition,” in CVPR.   IEEE, 2008, pp. 1–8.
  • [498] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in CVPR, 2009.
  • [499] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in ICPR, vol. 3.   IEEE, 2004, pp. 32–36.
  • [500] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
  • [501] J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero, “A survey of video datasets for human action and activity recognition,” CVIU, vol. 117, no. 6, pp. 633–659, 2013.
  • [502] L. Wang, T. Tan, H. Ning, and W. Hu, “Silhouette analysis-based gait recognition for human identification,” PAMI, vol. 25, no. 12, pp. 1505–1518, 2003.
  • [503] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in ICCV, 2005, pp. 1395–1402.
  • [504] D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using motion history volumes,” CVIU, vol. 104, no. 2, pp. 249–257, 2006.
  • [505] T.-K. Kim, S.-F. Wong, and R. Cipolla, “Tensor canonical correlation analysis for action classification,” in CVPR.   IEEE, 2007, pp. 1–8.
  • [506] Y. Wang, K. Huang, and T. Tan, “Human activity recognition based on r transform,” in CVPR.   IEEE, 2007, pp. 1–8.
  • [507] S. Ali and M. Shah, “Floor fields for tracking in high density crowd scenes,” in ECCV.   Springer, 2008, pp. 1–14.
  • [508] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in CVPR, 2008.
  • [509] D. Tran and A. Sorokin, “Human activity recognition with metric learning,” in ECCV.   Springer, 2008, pp. 548–561.
  • [510] UCF, “UCF aerial action data set,” Online, 2009. [Online]. Available:
  • [511] Microsoft, “MSR Action dataset,”, 2009.
  • [512] W. Choi, K. Shahid, and S. Savarese, “What are they doing?: Collective activity classification using spatio-temporal relationship among people,” in ICCVW.   IEEE, 2009, pp. 1282–1289.
  • [513] RUCV Group, Online, 2017-07-30. [Online]. Available:
  • [514] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos “in the wild”,” in CVPR.   IEEE, 2009, pp. 1996–2003.
  • [515] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal structure of decomposable motion segments for activity classification,” in ECCV.   Springer, 2010, pp. 392–405.
  • [516] E. Auvinet, C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, “Multiple cameras fall dataset,” DIRO-Université de Montréal, Tech. Rep, vol. 1350, 2010.
  • [517] S. Blunsden and R. Fisher, “The BEHAVE video dataset: ground truthed video for multi-person behavior classification,” Annals of the BMVA, vol. 4, no. 1-12, p. 4, 2010.
  • [518] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. D. Reid, “High five: Recognising human interactions in TV shows.” in BMVC, vol. 1.   Citeseer, 2010, p. 2.
  • [519] M. S. Ryoo and J. K. Aggarwal, “UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA),” Online, 2010. [Online]. Available:
  • [520] C.-C. Chen, M. S. Ryoo, and J. K. Aggarwal, “UT-Tower Dataset: Aerial View Activity Classification Challenge,” Online, 2010. [Online]. Available:
  • [521] G. Denina, B. Bhanu, H. T. Nguyen, C. Ding, A. Kamal, C. Ravishankar, A. Roy-Chowdhury, A. Ivers, and B. Varda, “Videoweb dataset for multi-camera activities and non-verbal communication,” in Distributed Video Sensor Networks.   Springer, 2011, pp. 335–347.
  • [522] V. Delaitre, I. Laptev, and J. Sivic, “Recognizing human actions in still images: a study of bag-of-features and part-based representations,” in BMVC, 2010.
  • [523] S. Singh, S. A. Velastin, and H. Ragheb, “Muhavi: A multicamera human action video dataset for the evaluation of action recognition methods,” in Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2010, pp. 48–55.
  • [524] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis et al., “A large-scale benchmark dataset for event recognition in surveillance video,” in CVPR.   IEEE, 2011, pp. 3153–3160.
  • [525] UCF, “UCF-ARG data set,” Online, 2011. [Online]. Available:
  • [526] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in ICCV, 2011.
  • [527] K. K. Reddy and M. Shah, “Recognizing 50 human action categories of web videos,” Machine Vision and Applications, vol. 24, no. 5, pp. 971–981, 2013.
  • [528] M. S. Ryoo and L. Matthies, “First-person activity recognition: What are they doing to me?” in CVPR, June 2013.
  • [529] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [530] G. Waltner, T. Mauthner, and H. Bischof, “Improved Sport Activity Recognition using Spatio-temporal Context,” in Proc. DVS-Conference on Computer Science in Sport (DVS/GSSS), 2014.
  • [531] K. Lee, D. Ognibene, H. J. Chang, T.-K. Kim, and Y. Demiris, “STARE: Spatio-temporal attention relocation for multiple structured activities detection,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5916–5927, 2015.
  • [532] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in CVPR, 2015, pp. 961–970.
  • [533] Y. Wang, D. Tran, and Z. Liao, “Learning hierarchical poselets for human parsing,” in CVPR.   IEEE, 2011, pp. 1705–1712.
  • [534]

    J. Ohya and F. Kishino, “Human posture estimation from multiple images using genetic algorithm,” in

    ICPR, vol. 1.   IEEE, 1994, pp. 750–753.
  • [535] S. J. McKenna and S. Gong, “Real-time face pose estimation,” Real-Time Imaging, vol. 4, no. 5, pp. 333–347, 1998.
  • [536] E. Ueda, Y. Matsumoto, M. Imai, and T. Ogasawara, “A hand-pose estimation for vision-based human interfaces,” IEEE Transactions on Industrial Electronics, vol. 50, no. 4, pp. 676–684, 2003.
  • [537] J. Ng and S. Gong, “Composite support vector machines for detection of faces across views and pose estimation,” Image and Vision Computing, vol. 20, no. 5, pp. 359–368, 2002.
  • [538] A. Agarwal and B. Triggs, “3D human pose from silhouettes by relevance vector regression,” in CVPR, vol. 2.   IEEE, 2004, pp. II–II.
  • [539] D. Ramanan, “Learning to parse images of articulated bodies,” in Advances in neural information processing systems, 2007, pp. 1129–1136.
  • [540] S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose estimation,” in BMVC, 2010, pp. 12.1–12.11.
  • [541] W. Ouyang, X. Chu, and X. Wang, “Multi-source deep learning for human pose estimation,” in CVPR, 2014, pp. 2329–2336.
  • [542] A. Cherian, J. Mairal, K. Alahari, and C. Schmid, “Mixing body-part sequences for human pose estimation,” in CVPR, 2014, pp. 2353–2360.
  • [543] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, “Strong appearance and expressive spatial models for human pose estimation,” in ICCV, 2013, pp. 3487–3494.
  • [544] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
  • [545] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, “Sparseness meets deepness: 3D human pose estimation from monocular video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4966–4975.
  • [546] A. Toshev and C. Szegedy, “DeepPose: Human pose estimation via deep neural networks,” in CVPR, 2014, pp. 1653–1660.
  • [547] X. Fan, K. Zheng, Y. Lin, and S. Wang, “Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation,” in CVPR, 2015, pp. 1347–1355.
  • [548] T. Pfister, J. Charles, and A. Zisserman, “Flowing ConvNets for human pose estimation in videos,” in ICCV, 2015, pp. 1913–1921.
  • [549] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in CVPR, 2017.
  • [550] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2D human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014, pp. 3686–3693.
  • [551] Z. Liu, J. Zhu, J. Bu, and C. Chen, “A survey of human pose estimation: the body parts parsing based methods,” Journal of Visual Communication and Image Representation, vol. 32, pp. 10–19, 2015.
  • [552] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: a survey,” Artificial Intelligence Review, vol. 43, no. 1, pp. 1–54, 2015.
  • [553] R. Poppe, “A survey on vision-based human action recognition,” Image and vision computing, vol. 28, no. 6, pp. 976–990, 2010.
  • [554] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” CVIU, vol. 115, no. 2, pp. 224–241, 2011.
  • [555] S. Vishwakarma and A. Agrawal, “A survey on activity recognition and behavior understanding in video surveillance,” The Visual Computer, vol. 29, no. 10, pp. 983–1009, 2013.
  • [556] J. K. Aggarwal and L. Xia, “Human activity recognition from 3D data: A review,” Pattern Recognition Letters, vol. 48, pp. 70–80, 2014.
  • [557] G. Guo and A. Lai, “A survey on still image based human action recognition,” Pattern Recognition, vol. 47, no. 10, pp. 3343–3361, 2014.
  • [558] M. Ziaeefard and R. Bergevin, “Semantic human activity recognition: a literature review,” Pattern Recognition, vol. 48, no. 8, pp. 2329–2345, 2015.
  • [559] S. Herath, M. Harandi, and F. Porikli, “Going deeper into action recognition: A survey,” Image and Vision Computing, vol. 60, pp. 4–21, 2017.
  • [560] A. B. Sargano, P. Angelov, and Z. Habib, “A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition,” Applied Sciences, vol. 7, no. 1, p. 110, 2017.
  • [561] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland, “Invariant features for 3-d gesture recognition,” in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.   IEEE, 1996, pp. 157–162.
  • [562] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. Von Seelen, “Walking pedestrian recognition,” IEEE Transactions on intelligent transportation systems (ITS), vol. 1, no. 3, pp. 155–163, 2000.
  • [563] S. A. Niyogi, E. H. Adelson et al., “Analyzing and recognizing walking figures in XYT,” in CVPR, vol. 94, 1994, pp. 469–474.
  • [564] A. F. Bobick and J. W. Davis, “The recognition of human movement using temporal templates,” PAMI, vol. 23, no. 3, pp. 257–267, 2001.
  • [565] Y. Ke, R. Sukthankar, and M. Hebert, “Event detection in crowded videos,” in ICCV.   IEEE, 2007, pp. 1–8.
  • [566] N. Ikizler, R. G. Cinbis, S. Pehlivan, and P. Duygulu, “Recognizing actions from still images,” in ICPR.   IEEE, 2008, pp. 1–4.
  • [567] B. Yao and L. Fei-Fei, “Modeling mutual context of object and human pose in human-object interaction activities,” in CVPR.   IEEE, 2010, pp. 17–24.
  • [568] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Dynamically encoded actions based on spacetime saliency,” in CVPR, 2015, pp. 2755–2764.
  • [569] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015, pp. 4694–4702.
  • [570] W. Yang, Y. Wang, and G. Mori, “Recognizing human actions from still images with latent poses,” in CVPR.   IEEE, 2010, pp. 2030–2037.
  • [571] J. Yamato, J. Ohya, and K. Ishii, “Recognizing human action in time-sequential images using hidden markov model,” in CVPR.   IEEE, 1992, pp. 379–385.
  • [572] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,” in ICCV.   IEEE, 2003, p. 726.
  • [573] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in ICCV, vol. 2.   IEEE, 2005, pp. 1395–1402.
  • [574] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” PAMI, vol. 35, no. 1, pp. 221–231, 2013.
  • [575] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014, pp. 1725–1732.
  • [576] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
  • [577] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016, pp. 1933–1941.
  • [578] X. Wang, A. Farhadi, and A. Gupta, “Actions~transformations,” in CVPR, 2016, pp. 2658–2667.
  • [579] S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in lstms for activity detection and early detection,” in CVPR, 2016, pp. 1942–1950.
  • [580] A. Kar, N. Rai, K. Sikka, and G. Sharma, “AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos,” in CVPR, 2017.
  • [581] S. Park and J. Aggarwal, “Recognition of two-person interactions using a hierarchical bayesian network,” in First ACM SIGMM international workshop on Video surveillance.   ACM, 2003, pp. 65–76.
  • [582] M. S. Ryoo and J. K. Aggarwal, “Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities,” in ICCV.   IEEE, 2009, pp. 1593–1600.
  • [583] T. Lan, L. Sigal, and G. Mori, “Social roles in hierarchical models for human activity recognition,” in CVPR.   IEEE, 2012, pp. 1354–1361.
  • [584] Z. Deng, M. Zhai, L. Chen, Y. Liu, S. Muralidharan, M. J. Roshtkhari, and G. Mori, “Deep structured models for group activity recognition,” in BMVC, 2015.
  • [585] Z. Deng, A. Vahdat, H. Hu, and G. Mori, “Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition,” in CVPR, 2016, pp. 4772–4781.
  • [586] C. M. Richard, J. L. Brown, J. L. Campbell, J. S. Graving, J. Milton, and I. van Schalkwyk, “SHRP 2 implementation assistance program concept to countermeasure-research to deployment using the SHRP2 safety data (NDS) influence of roadway design features on episodic,” 2015.
  • [587] A. Habibovic, E. Tivesten, N. Uchida, J. Bärgman, and M. L. Aust, “Driver behavior in car-to-pedestrian incidents: An application of the driving reliability and error analysis method (DREAM),” Accident Analysis & Prevention, vol. 50, pp. 554–565, 2013.
  • [588] M. Dozza and J. Werneke, “Introducing naturalistic cycling data: What factors influence bicyclists’ safety in the real world?” Transportation research part F: traffic psychology and behaviour, vol. 24, pp. 83–91, 2014.
  • [589] EC Funded CAVIAR project, “Caviar: Context aware vision using image-based active recognition,”, 2002.
  • [590] I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Joint attention in autonomous driving (jaad),” arXiv preprint arXiv:1609.04741, 2016.
  • [591] N. Schneider and D. M. Gavrila, “Pedestrian path prediction with recursive bayesian filters: A comparative study,” in German Conference on Pattern Recognition.   Springer, 2013, pp. 174–183.
  • [592] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, “Context-based pedestrian path prediction,” in ECCV.   Springer, 2014, pp. 618–633.
  • [593] E. Ohn-Bar, S. Martin, A. Tawari, and M. M. Trivedi, “Head, eye, and hand patterns for driver activity recognition,” in ICPR.   IEEE, 2014, pp. 660–665.
  • [594] P. Molchanov, S. Gupta, K. Kim, and K. Pulli, “Multi-sensor system for driver’s hand-gesture recognition,” in IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1.   IEEE, 2015, pp. 1–8.
  • [595] C. Laugier, I. E. Paromtchik, M. Perrollaz, M. Yong, J.-D. Yoder, C. Tay, K. Mekhnacha, and A. Nègre, “Probabilistic analysis of dynamic scenes and collision risks assessment to improve driving safety,” IEEE Intelligent Transportation Systems Magazine, vol. 3, no. 4, pp. 4–19, 2011.
  • [596] B. Li, T. Wu, C. Xiong, and S.-C. Zhu, “Recognizing car fluents from video,” in CVPR, 2016, pp. 3803–3812.
  • [597] J. F. Kooij, N. Schneider, and D. M. Gavrila, “Analysis of pedestrian dynamics from a vehicle perspective,” in Intelligent Vehicles Symposium (IV).   IEEE, 2014, pp. 1445–1450.
  • [598] S. Köhler, M. Goldhammer, S. Bauer, K. Doll, U. Brunsmann, and K. Dietmayer, “Early detection of the pedestrian’s intention to cross the street,” in Intelligent Transportation Systems (ITSC).   IEEE, 2012, pp. 1759–1764.
  • [599] M. T. Phan, I. Thouvenin, V. Fremont, and V. Cherfaoui, “Estimating driver unawareness of pedestrian based on visual behaviors and driving behaviors,” 2013.
  • [600] M. Bahram, C. Hubmann, A. Lawitzky, M. Aeberhard, and D. Wollherr, “A combined model-and learning-based framework for interaction-aware maneuver prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 6, pp. 1538–1550, 2016.
  • [601] E. Ohn-Bar and M. M. Trivedi, “Looking at humans in the age of self-driving and highly automated vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 90–104, 2016.
  • [602] A. T. Schulz and R. Stiefelhagen, “A controlled interactive multiple model filter for combined pedestrian intention recognition and path prediction,” in Intelligent Transportation Systems (ITSC).   IEEE, 2015, pp. 173–178.
  • [603] Y. Hashimoto, Y. Gu, L.-T. Hsu, and S. Kamijo, “Probability estimation for pedestrian crossing intention at signalized crosswalks,” in International Conference on Vehicular Electronics and Safety (ICVES).   IEEE, 2015, pp. 114–119.
  • [604] N. Brouwer, H. Kloeden, and C. Stiller, “Comparison and evaluation of pedestrian motion models for vehicle safety systems,” in Intelligent Transportation Systems (ITSC).   IEEE, 2016, pp. 2207–2212.
  • [605] M. M. Trivedi, T. B. Moeslund et al., “Trajectory analysis and prediction for improved pedestrian safety: Integrated framework and evaluations,” in Intelligent Vehicles Symposium (IV).   IEEE, 2015, pp. 330–335.
  • [606] R. Quintero, I. Parra, D. F. Llorca, and M. Sotelo, “Pedestrian path prediction based on body language and action classification,” in Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 679–684.
  • [607] B. Völz, K. Behrendt, H. Mielenz, I. Gilitschenski, R. Siegwart, and J. Nieto, “A data-driven approach for pedestrian intention estimation,” in Intelligent Transportation Systems (ITSC).   IEEE, 2016, pp. 2607–2612.
  • [608] T. Bandyopadhyay, C. Z. Jie, D. Hsu, M. H. Ang Jr, D. Rus, and E. Frazzoli, “Intention-aware pedestrian avoidance,” in Experimental Robotics.   Springer, 2013, pp. 963–977.
  • [609] H. Bai, S. Cai, N. Ye, D. Hsu, and W. S. Lee, “Intention-aware online pomdp planning for autonomous driving in a crowd,” in ICRA.   IEEE, 2015, pp. 454–460.
  • [610] J. Hariyono, A. Shahbaz, L. Kurnianggoro, and K.-H. Jo, “Estimation of collision risk for improving driver’s safety,” in Annual Conference of Industrial Electronics Society (IECON).   IEEE, 2016, pp. 901–906.
  • [611] J.-Y. Kwak, B. C. Ko, and J.-Y. Nam, “Pedestrian intention prediction based on dynamic fuzzy automata for vehicle driving at nighttime,” Infrared Physics & Technology, vol. 81, pp. 41–51, 2017.
  • [612] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in ICCV.   IEEE, 2009, pp. 261–268.
  • [613] F. Madrigal, J.-B. Hayet, and F. Lerasle, “Intention-aware multiple pedestrian tracking,” in ICPR.   IEEE, 2014, pp. 4122–4127.
  • [614] S. Köhler, B. Schreiner, S. Ronalter, K. Doll, U. Brunsmann, and K. Zindler, “Autonomous evasive maneuvers triggered by infrastructure-based detection of pedestrian intentions,” in Intelligent Vehicles Symposium (IV).   IEEE, 2013, pp. 519–526.
  • [615] S. Köhler, M. Goldhammer, K. Zindler, K. Doll, and K. Dietmeyer, “Stereo-vision-based pedestrian’s intention detection in a moving vehicle,” in Intelligent Transportation Systems (ITSC).   IEEE, 2015, pp. 2317–2322.
  • [616] A. Rangesh and M. M. Trivedi, “When vehicles see pedestrians with phones: A multi-cue framework for recognizing phone-based activities of pedestrians,” arXiv preprint arXiv:1801.08234, 2018.
  • [617] A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,” in ICCVW, 2017, pp. 206–213.
  • [618] B. Völz, H. Mielenz, G. Agamennoni, and R. Siegwart, “Feature relevance estimation for learning pedestrian behavior at crosswalks,” in Intelligent Transportation Systems (ITSC).   IEEE, 2015, pp. 854–860.
  • [619] F. Schneemann and P. Heinemann, “Context-based detection of pedestrian crossing intention for autonomous driving in urban environments,” in IROS.   IEEE, 2016, pp. 2243–2248.
  • [620] C. Park, J. Ondřej, M. Gilbert, K. Freeman, and C. O’Sullivan, “HI Robot: Human intention-aware robot planning for safe and efficient navigation in crowds,” in IROS.   IEEE, 2016, pp. 3320–3326.
  • [621] M. Goldhammer, M. Gerhard, S. Zernetsch, K. Doll, and U. Brunsmann, “Early prediction of a pedestrian’s trajectory at intersections,” in Intelligent Transportation Systems (ITSC).   IEEE, 2013, pp. 237–242.
  • [622] D. González, J. Pérez, V. Milanés, and F. Nashashibi, “A review of motion planning techniques for automated vehicles,” ITSC, vol. 17, no. 4, pp. 1135–1145, 2016.
  • [623] T. Fraichard and P. Garnier, “Fuzzy control to drive car-like vehicles,” Robotics and autonomous systems, vol. 34, no. 1, pp. 1–22, 2001.
  • [624] R. A. Brooks, “Elephants don’t play chess,” Robotics and autonomous systems, vol. 6, no. 1-2, pp. 3–15, 1990.
  • [625] J. S. Mitchell, D. W. Payton, and D. M. Keirsey, “Planning and reasoning for autonomous vehicle control,” International Journal of Intelligent Systems, vol. 2, no. 2, pp. 129–198, 1987.
  • [626] D. Ferguson, C. Baker, M. Likhachev, and J. Dolan, “A reasoning framework for autonomous urban driving,” in Intelligent Vehicles Symposium, 2008 IEEE.   IEEE, 2008, pp. 775–780.
  • [627] R. Davis, “Meta-rules: Reasoning about control,” Artificial intelligence, vol. 15, no. 3, pp. 179–222, 1980.
  • [628] R. Cucchiara, M. Piccardi, and P. Mello, “Image analysis and rule-based reasoning for a traffic monitoring system,” ITSC, vol. 1, no. 2, pp. 119–130, 2000.
  • [629] I. Watson and F. Marir, “Case-based reasoning: A review,”

    The knowledge engineering review

    , vol. 9, no. 4, pp. 327–354, 1994.
  • [630] A. R. Golding and P. S. Rosenbloom, “Improving rule-based systems through case-based reasoning.” in AAAI, vol. 1, 1991, pp. 22–27.
  • [631] K. Moorman and A. Ram, “A case-based approach to reactive control for autonomous robots,” in Proceedings of the AAAI Fall symposium on AI for real-world autonomous mobile robots, 1992, pp. 1–11.
  • [632] C. Urdiales, E. J. Perez, J. Vázquez-Salceda, M. Sànchez-Marrè, and F. Sandoval, “A purely reactive navigation scheme for dynamic environments using case-based reasoning,” Autonomous Robots, vol. 21, no. 1, pp. 65–78, 2006.
  • [633] S. Vacek, T. Gindele, J. M. Zollner, and R. Dillmann, “Using case-based reasoning for autonomous vehicle guidance,” in IROS.   IEEE, 2007, pp. 4271–4276.
  • [634] A. Rasouli and J. K. Tsotsos, “Visual saliency improves autonomous visual search,” in Canadian Conference on Computer and Robot Vision (CRV).   IEEE, 2014, pp. 111–118.
  • [635] J. Hertzberg and R. Chatila, “AI reasoning methods for robotics,” in Springer Handbook of Robotics.   Springer, 2008, pp. 207–223.
  • [636] J. K. Tsotsos, “Analyzing vision at the complexity level,” Behavioral and brain sciences, vol. 13, no. 3, pp. 423–445, 1990.
  • [637] E. Potapova, M. Zillich, and M. Vincze, “Survey of recent advances in 3d visual attention for robotics,” The International Journal of Robotics Research, vol. 36, no. 11, pp. 1159–1176, 2017.
  • [638] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
  • [639] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
  • [640] A. Rasouli and J. K. Tsotsos, “Integrating three mechanisms of visual attention for active visual search,” arXiv preprint arXiv:1702.04292, 2017.
  • [641] Z. Bylinskii, E. DeGennaro, R. Rajalingham, H. Ruda, J. Zhang, and J. Tsotsos, “Towards the quantitative evaluation of visual attention models,” Vision research, vol. 116, pp. 258–268, 2015.
  • [642] K. Chang, T. Liu, H. Chen, and S. Lai, “Fusing generic objectness and visual saliency for salient object detection,” in ICCV, Barcelona, October 2011.
  • [643] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in The British Machine Vision Conference (BMVC), September 2011.
  • [644] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visua attention for rapid scene analysis,” Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
  • [645] N. Bruce and J. K. Tsotsos, “Attention based on information maximization,” Journal of Vision, vol. 7, no. 9, p. 950, 2007.
  • [646] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in CVPR, Boston, June 2015.
  • [647] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in CVPR, Boston, June 2015.
  • [648] R. K. Cave, “The featuregate model of visual selection,” Psychological research, vol. 62, no. 2-3, pp. 182–194, 1999.
  • [649] J. Zhu, Y. Qiu, R. Zhang, and J. Huang, “Top-down saliency detection via contextual pooling,” Journal of Signal Processing Systems, vol. 74, no. 1, pp. 33–46, 2014.
  • [650] J. Yang and M. Yangm, “Top-down visual saliency via joint crf and dictionary learning,” in CVPR, June 2012.
  • [651] S. He, R. Lau, and Q. Yang, “Exemplar-driven top-down saliency detection via deep association,” in CVPR, Las Vegas, June 2016.
  • [652] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, “Top-down neural attention by excitation backprop,” in European Conference on Computer Vision.   Springer, 2016, pp. 543–559.
  • [653] W. Wang, J. Shen, and L. Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2018.
  • [654] G. Leifman, D. Rudoy, T. Swedish, E. Bayro-Corrochano, and R. Raskar, “Learning gaze transitions from depth to improve video saliency estimation,” International Conference on Computer Vision (ICCV), 2017.
  • [655] A. Borji, D. N. Sihite, and L. Itti, “Quantitative analysis of humanmodel agreement in visual saliency modeling: A comparative study,” IEEE Transactions on Image Processing, vol. 22, no. 1, 2013.
  • [656] S. Filipe and L. A. Alexandre, “From the human visual system to the computational models of visual attention: a survey,” Artificial Intelligence Review, vol. 39, no. 1, pp. 1–47, 2013.
  • [657] A. Doshi and M. M. Trivedi, “Attention estimation by simultaneous observation of viewer and view,” in CVPRW.   IEEE, 2010, pp. 21–27.
  • [658] A. Tawari and M. M. Trivedi, “Robust and continuous estimation of driver gaze zone by dynamic analysis of multiple face videos,” in Intelligent Vehicles Symposium (IV).   IEEE, 2014, pp. 344–349.
  • [659] A. Tawari, S. Sivaraman, M. M. Trivedi, T. Shannon, and M. Tippelhofer, “Looking-in and looking-out vision for urban intelligent assistance: Estimation of driver attentive state and dynamic surround for safe merging and braking,” in Intelligent Vehicles Symposium (IV).   IEEE, 2014, pp. 115–120.
  • [660] R. Tanishige, D. Deguchi, K. Doman, Y. Mekada, I. Ide, and H. Murase, “Prediction of driver’s pedestrian detectability by image processing adaptive to visual fields of view,” in Intelligent Transportation Systems (ITSC).   IEEE, 2014, pp. 1388–1393.
  • [661] T. Deng, K. Yang, Y. Li, and H. Yan, “Where does the driver look? top-down-based saliency detection in a traffic driving environment,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 2051–2062, 2016.
  • [662] A. Palazzi, F. Solera, S. Calderara, S. Alletto, and R. Cucchiara, “Learning where to attend like a human driver,” in Intelligent Vehicles Symposium (IV), 2017 IEEE.   IEEE, 2017, pp. 920–925.
  • [663] A. Tawari and B. Kang, “A computational framework for driver’s visual attention using a fully convolutional architecture,” in Intelligent Vehicles Symposium (IV), 2017 IEEE.   IEEE, 2017, pp. 887–894.
  • [664] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara, “Dr (eye) ve: A dataset for attention-based tasks with applications to autonomous and assisted driving,” in Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016, pp. 54–60.