A Restricted Visual Turing Test for Deep Scene and Event Understanding

12/06/2015 · by Hang Qi, et al. · Intelligent Automation, Inc.

This paper presents a restricted visual Turing test (VTT) for story-line based deep understanding in long-term and multi-camera captured videos. Given a set of videos of a scene (such as a multi-room office, a garden, and a parking lot) and a sequence of story-line based queries, the task is to provide answers either simply in binary form "true/false" (to a polar query) or in an accurate natural language description (to a non-polar query). Queries, polar or non-polar, consist of view-based queries, which can be answered from a particular camera view, and scene-centered queries, which involve joint inference across different cameras. The story lines are collected to cover spatial, temporal, and causal understanding of input videos. The data and queries distinguish our VTT from recently proposed visual question answering in images and video captioning. A vision system is proposed to perform joint video and query parsing which integrates different vision modules, a knowledge base, and a query engine. The system provides unified interfaces for the modules so that individual modules can be reconfigured to test a new method. We provide a benchmark dataset and a toolkit for ontology-guided story-line query generation, which consists of about 93.5 hours of video captured in four different locations and 3,426 queries split into 127 story lines. We also provide a baseline implementation and result analyses.







1 Introduction

1.1 Motivation and Objective

Figure 1:

Illustration of the depth and complexity of the proposed VTT in deep scene and event understanding, which focuses on a largely unexplored task in computer vision – joint spatial, temporal, and causal understanding of scenes and events in multi-camera videos. See text for details.

During the past decades, we have seen tremendous progress in individual vision modules such as image classification [7, 11, 19, 44] and object detection [8, 35, 45, 10, 31], especially after competitions like PASCAL VOC [5] and ImageNet ILSVRC, and after convolutional neural networks [21, 17, 12] trained on the ImageNet dataset [4] were proposed. Those tasks are evaluated based on either classification or detection accuracy, focusing on a coarse-level understanding of data. In the area of natural language and text processing, there has been well-studied text-based question answering (QA). For example, a chatterbot named Eugene Goostman (https://en.wikipedia.org/wiki/Eugene_Goostman) was reported as the first computer program to have passed the famed Turing test [36] in an event organized at the University of Reading. The success of text-based QA and the recent achievements of individual vision modules have inspired visual Turing tests (VTT) [9, 25] where image-based questions (so-called visual question answering, VQA) or story-line queries are used to test a computer vision system. VTT has been suggested as a more suitable evaluation framework going beyond measuring the accuracy of labels and bounding boxes. Most existing work on VTT focuses on images and emphasizes free-form and open-ended Q/A’s [2, 1].

In this paper, we are interested in a restricted visual Turing test (VTT) – story-line based visual query answering in long-term and multi-camera captured videos. Our VTT emphasizes a joint spatial, temporal, and causal understanding of scenes and events, which is largely unexplored in computer vision. By “restricted”, we mean the queries are designed based on a selected ontology. Figure 1 shows two examples in our VTT dataset. Consider how we should test whether a computer vision system understands, for example, a conference room. In VQA [1], the input is an image and a “bag-of-questions” (e.g., is this a conference room?) and the task is to provide a natural language answer (either in a multiple-choice manner or with free-form responses). In our VTT, to understand a conference room, the input consists of multi-camera captured videos and story-line queries covering basic questions (for a coarse-level understanding) and difficult ones involving spatial, temporal, and causal inference for a deeper understanding. More specifically, to answer correctly, a computer vision system would need to build a scene-centered representation for the conference room (i.e., put chairs and tables in 3D), to detect, track, re-identify, and parse people coming into the room across cameras, and to understand the concept of sitting in a chair (i.e., the pose of a person and the scene-centered spatial relation between a person and a chair), etc. If a computer vision system can further unfold its intermediate representation to explicitly show how it derives an answer, it enhances our “trust” that the system has gained a correct understanding of the scene.

Web-scale images vs. long-term and multi-camera captured videos. Web-scale images emphasize the breadth that a computer vision system can learn and handle in different applications. Those images are often of album-photo style, collected from different image search engines such as Flickr, Google, Bing, and Facebook. This paper focuses on long-term and multi-camera captured videos usually produced by video surveillance, which are also an important data source in the visual big data era and have important security and law enforcement applications. Furthermore, as the example in Figure 1 shows, multi-camera videos can facilitate a much deeper understanding of scenes and events. The two types of datasets are complementary, but the latter has not been explored in a QA setting.

Free-form and open-ended Q/A’s vs. restricted story-line based queries. Free-form and open-ended Q/A’s are usually collected through crowd-sourcing platforms like Amazon Mechanical Turk (MTurk) to achieve diversity. However, it is hard to obtain well-posed query/answer pairs from a massive number of untrained workers on the Internet. This is challenging even for simple tasks like image labeling, as investigated in the ImageNet dataset [4] and the Label-Me dataset [16]. For the video datasets in this paper, it is impractical to use MTurk to collect story-line based queries covering long temporal ranges and multiple cameras. Instead, we adopt a selected yet sufficiently expressive ontology (shown in Figure 3) in generating queries. Following the statistical principles stated in Geman et al.’s Turing test framework [9], we design an easy-to-use toolkit by which several people with certain expertise can create a large number of story lines covering different interesting and important spatial, temporal, and causal aspects of videos, with the quality of queries and answers controlled.

Quest for an integrated vision system. Almost all the recent methods proposed for image captioning and VQA are based on the combination of convolutional neural networks [21, 17] and recurrent neural networks like long short-term memory [14]. On the one hand, it is exciting to see much progress has been made in terms of performance. On the other hand, it reflects the restricted setting of the tasks in image captioning and VQA. The proposed VTT entails an integrated vision system which cannot be handled by training convolutional and recurrent neural networks directly, to the best of our knowledge. We present a prototype vision system as our baseline implementation which integrates different vision modules (where state-of-the-art CNN-based components can be applied), a knowledge base, and a query engine.

Figure 2: A systematic overview of the proposed VTT. See text for details.
Figure 3: The ontology used in the VTT.
Figure 4: Illustration of our prototype vision system for VTT. Top-left: input videos with people playing baseball games. Middle-Left: Illustration of the offline parsing pipeline which performs spatial-temporal parsing in the input videos. Bottom-Left: Visualization of the parsed results. Bottom-Right: The knowledge base constructed based on the parsing results in the form of a relation graph. Top-Right: Example story line and queries. Graph segments used for answering two of the queries are highlighted.

1.2 Overview

Figure 2 illustrates a systematic overview of the proposed VTT which consists of four components:

i) Multi-camera video dataset collection: Existing datasets focus either on single individual images or on short video sequences with clear action or event boundaries. Our multi-camera video dataset includes a rich set of activities in both indoor and outdoor scenes. Videos are collected by multiple cameras with overlapping fields of view during the same time window. A variety of sensor types are used: stationary HD video cameras located on the ground and rooftop, moving cameras mounted on bicycles and automobiles, and infrared cameras. The camera parameters are provided as metadata. The videos capture daily activities of a group of people and different events in a scene, including routine ones (e.g., an ordinary group lunch, playing a four-square soccer game) and abnormal ones (e.g., evacuating from a building during a fire alarm), with large appearance and structural variations exhibited.

ii) Ontology guided story-line based query/answer collection: We are interested in a selected ontology as listed in Figure 3. The ontology is sufficiently expressive to represent different aspects of spatial, temporal, and causal understanding in videos, from the basic level (e.g., identifying objects and parts) to the fine-grained level (e.g., does person A have a clear line of sight to person B?). Based on the ontology, we build a toolkit for story-line query generation following the statistical principles stated in [9]. Queries organized in multiple story lines are designed to evaluate a computer vision system from basic object detection queries to more complex relationship queries, and further probe the system’s ability to reason from physical and social perspectives, which entails human-like commonsense reasoning. Cross-camera referencing queries require the ability to integrate visual signals from multiple overlapping sensors.
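To make the ontology-guided generation concrete, the sketch below samples a toy story line: object definition queries first, then attribute and relation queries over the defined objects. The predicate names and the query tuple format are illustrative placeholders, not the actual ontology of Figure 3 or the toolkit's real output.

```python
import random

# A toy ontology fragment; these predicate names are illustrative only.
ONTOLOGY = {
    "object": ["person", "car", "chair"],
    "attribute": ["male", "female", "wearing-hat"],
    "relation": ["sitting-on", "clear-line-of-sight", "inside"],
}

def generate_story_line(rng, num_queries=3):
    """Generate one story line: object definition queries first, then
    attribute/relation queries that reference the defined objects."""
    obj_a = rng.choice(ONTOLOGY["object"])
    obj_b = rng.choice(ONTOLOGY["object"])
    # Definition queries establish the conversation context.
    queries = [("define", obj_a, "t0", "bbox0"),
               ("define", obj_b, "t0", "bbox1")]
    for _ in range(num_queries):
        if rng.random() < 0.5:
            queries.append(("attribute", rng.choice(ONTOLOGY["attribute"]), obj_a))
        else:
            queries.append(("relation", rng.choice(ONTOLOGY["relation"]), obj_a, obj_b))
    return queries

story = generate_story_line(random.Random(0))
```

A real toolkit would additionally control the statistical properties of the sampled predicates, as discussed in [9].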

iii) Integrated vision system: We build a computer vision system that can be used to study the organization of modules designed for different tasks and the interactions between them to improve overall performance. It is designed with two principles in mind: first, well-established computer vision tasks shall be incorporated so that we can build upon existing achievements; second, the modules shall be loosely coupled so that users can replace one or more modules with alternatives to study performance in an integrated environment. We define a set of APIs for each individual task and connect all modules into a pipeline. After the system has processed the input videos and saved the results in its knowledge base, it fetches queries from the evaluation server one after another at testing time.

iv) Q/A evaluation server: We provide a web service API through which a computer vision system can interact with the evaluation server over HTTP connections. The evaluation server iterates through a stream of queries grouped by scenes. In each scene, queries are further grouped into story lines. A query is not available to the system until the previous story lines and all previous queries in the same story line have finished. The correct answer is provided to the system after each query. This information can be used by the system to adapt and learn from the provided answers: the answer can be used to update previous understanding, so that conflicts are resolved and wrong interpretations discarded.

Figure 4 shows an example of the full workflow of our system. We have spent more than 30 person-years in total to collect the data and build the whole system. Our prototype system has passed a detailed third-party evaluation involving more than 1,000 queries. We plan to release the whole system to the computer vision community and organize competitions and regular workshops in the near future.

2 Related Work and Our Contributions

Question answering is a natural way of effective communication between human beings. Integrating computer vision and natural language processing, as well as knowledge from other modalities, has been a hot topic in the recent development of deeper image and scene understanding.

Visual Turing Test. Inspired by the generic Turing test principle in AI [36], Geman et al. proposed a visual Turing test [9] for object detection tasks in images which organizes queries into story lines, within which queries are connected and complexity increases gradually – similar to conversations between human beings. In a similar spirit, Malinowski and Fritz [24, 25] proposed a multi-world method to address factual queries about scene images. In the dataset and evaluation framework proposed in this paper, we adopt an evaluation structure similar to [9], but focus on a more complex scenario which features videos and overlapping cameras to facilitate a broader scope of vision tasks.

Image Description and Visual Question Answering. To go beyond labels and bounding boxes, image tagging [3], image captioning [6, 18, 26], and video captioning [32] have been proposed recently. The state-of-the-art methods have shown, however, that a coarse-level understanding of an image (i.e., labels and bounding boxes of the objects that appear) together with natural language n-gram statistics suffices to generate reasonable captions. Microsoft COCO [22] provides descriptions or captions for images. Question answering focuses on specific contents of the image and evaluates the system’s abilities using human-generated questions. Unlike the image description task, where a generated sentence is considered correct as long as it describes the dominant objects and activities in the image, human-generated questions can ask about all details and even hidden knowledge that requires deduction. In such a scenario, a pre-trained end-to-end system may not necessarily perform well, as the question space is too large to be covered by training data. IQA [30] converts image descriptions into Q/A pairs. VQA [1] evaluates free-form and open-ended questions about images, where the question-answer pairs are given by human annotators. Although it encourages participants to pursue a deep and specific understanding of the image, it only focuses on the content of the image and does not address many other fundamental aspects of computer vision like 3D scene parsing, camera registration, etc. Moreover, actions are not static concepts, and temporal information is largely missing in images.

Our Contributions: This paper makes two main contributions to deep scene and event understanding:

  • It presents a new visual Turing test benchmark consisting of a long-term and multi-camera captured video dataset and a large number of ontology-guided story-line based queries.

  • It presents a prototype integrated vision system consisting of a well-designed architecture, various vision modules, a knowledge base, and a query engine.

3 Dataset

In this section, we introduce the video dataset we collected for the VTT. In our dataset, we organize data by multiple independent scenes. Each scene consists of video footage from eight to twelve cameras with overlapping fields of view during the same time period. To date, we have a total of 14 collections captured at 4 different locations: two indoor (an office and an auditorium) and two outdoor (a parking lot and a garden). Table 1 gives a summary of the data collections.

Collection | Type | Cameras (Moving) | Event duration | Length (hh:mm:ss)
Office 1 Indoor 9 56 min 8:27:23
Office 2 Indoor 12 90 min 17:35:36
Auditorium 1 Indoor 10 (1) 15 min 2:29:50
Auditorium 2 Indoor 11 (1) 48 min 8:53:24
Parking lot 1 Outdoor 9 (1) 15 min 2:41:24
Parking lot 2 Outdoor 11 (2) 44 min 8:15:44
Parking lot 3 Outdoor 9 12 min 2:22:00
Parking lot 4 Outdoor 11 (2) 47 min 8:14:42
Parking lot 5 Outdoor 11 (1) 68 min 13:15:06
Parking lot 6 Outdoor 11 (1) 23 min 4:27:44
Garden 1 Outdoor 7 (1) 15 min 1:57:01
Garden 2 Outdoor 10 (2) 41 min 6:54:38
Garden 3 Outdoor 8 (1) 27 min 3:27:00
Garden 4 Outdoor 8 (2) 34 min 4:15:56
Total 8.9 hours 93:27:28
Table 1: Summary of our VTT dataset.
(a) Objects
(b) Parts
(c) Attributes & Properties
(d) Relationships
Figure 5: Distribution of predicates

Our dataset reflects real-world video surveillance data and poses unique challenges to modern computer vision algorithms:

Varied number of entities. In our dataset, activities in the scene could involve individuals as well as multiple interacting entities.

Rich events and activities. The activities captured in the dataset involve different degrees of complexity: from the simplest single-person actions to group sports activities involving as many as dozens of people.

Unknown action boundaries. Unlike existing action or activity datasets, where each action data point is well segmented and each segment contains only one single action, our dataset consists of multiple video streams. Actions and activities are not pre-segmented, and multiple actions may happen at the same time. This characteristic preserves more information about the spatial context of an action and the correlations between multiple actions.

Multiple overlapping cameras. This requires the system to perform multi-object tracking across multiple cameras with re-identification and 3D geometry reasoning.

Varied scales and view points. Most of our data are collected at 1920x1080 resolution; however, because of differences in the cameras’ mounting points, a person who occupies only a couple of hundred pixels in a bird’s-eye view may occlude the entire frame when he or she stands very close to a ground camera.

Illumination variation. Areas covered by different cameras have different illumination conditions: some areas are covered by dark shadows whereas some other areas have heavy reflection.

Infrared cameras and moving cameras. Apart from regular RGB signals, our dataset provides infrared videos as a supplement. Moving cameras (i.e., cameras mounted on moving objects) also add challenges to the dataset and reveal more spatial structure of the scene.

The complexity of our VTT dataset. To demonstrate the difficulty of our dataset, we conduct a set of experiments on a typical subset of the data using state-of-the-art object detection models [31] and multiple-object tracking methods [29]. A summary of the data and results is shown in Table 2.

Dataset        | Fashion | Sport  | Evacuation | Jeep
Cameras        | 4       | 4      | 4          | 4
Length (mm:ss) | 4:30    | 1:35   | 3:00       | 3:35
Frames         | 32,962  | 11,798 | 21,830     | 25,907

Results (four values per dataset; the source lists four cameras per dataset):
              | Fashion                 | Sport
Detection AP  | 0.475 0.413 0.635 0.485 | 0.554 0.596 0.534 0.694
Tracking MOTP | 0.683 0.674 0.692 0.694 | 0.728 0.727 0.716 0.739
Tracking MOTA | 0.341 0.304 0.494 0.339 | 0.413 0.483 0.430 0.573
              | Evacuation              | Jeep
Detection AP  | 0.518 0.556 0.534 0.533 | 0.252 0.250 0.280 0.389
Tracking MOTP | 0.698 0.692 0.720 0.651 | 0.680 0.651 0.689 0.696
Tracking MOTA | 0.389 -0.241 0.346 0.399 | 0.172 0.170 0.203 0.270
Table 2: Top: Summary of the selected subset of data. Bottom: Results from detection and tracking. For Detection: AP is calculated as in PASCAL VOC 2012 [5] based on results by Faster-RCNN [31]. For Tracking: MOTA and MOTP are calculated as in Multiple Object Tracking Benchmark [20] based on results by [29].

4 Queries

A query is a first-order logic sentence (with modification) composed using variables, predicates (as shown in Figure 3), logical operators (∧, ∨, ¬), arithmetic operators, and quantifiers (∀ and ∃). The answer to a query is either true or false, indicating whether the fact stated by the sentence holds given the data and the system’s state of belief. The formal language representation eliminates the need for natural language processing and allows us to focus the computer vision problems on a constrained set of predicates.

We evaluate computer vision systems by asking a sequence of queries organized into multiple story lines. Each story line explores a natural event across a period of time in a way similar to conversations between humans. At the beginning of a story line, the major objects of interest are defined first. The vision system under evaluation shall indicate whether it detects these objects. A correct detection establishes a mutual conversation context for subsequent queries, which ensures the vision system and the queries are referring to the same objects in later interactions. When the system fails to detect an object, however, the evaluation server will skip the queries regarding that object, because answering those queries, correctly or not, would not reveal the system’s performance in interpreting the designated data.
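The skip behavior described above can be sketched as follows. The query tuple format, the `depends_on` field, and the `answer_fn` callback are hypothetical simplifications of the server's real protocol, used only to illustrate the control flow.

```python
def run_story_line(queries, answer_fn):
    """Iterate one story line, skipping queries about objects whose
    definition query the system failed (i.e., a missed detection).
    Each query is (query_id, depends_on, payload); depends_on is None
    for object definition queries. answer_fn returns the system's
    answer, or None when it fails to detect the defined object."""
    detected = set()
    transcript = []
    for qid, depends_on, payload in queries:
        # Skip queries about objects the system never detected.
        if depends_on is not None and depends_on not in detected:
            transcript.append((qid, "skipped"))
            continue
        answer = answer_fn(payload)
        if depends_on is None and answer is not None:
            detected.add(qid)  # definition answered: context established
        transcript.append((qid, answer))
    return transcript
```

With a system that misses the second object, the queries depending on it are skipped while the rest of the story line proceeds.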

Object definition queries. An object definition consists of three components: object type, time, and location. Object type is specified by object predicates in the ontology. A time is either a view-centric frame number in a particular video or a scene-centric wall clock time. A location is either a point or a bounding box represented by its two diagonal points, where a point can be specified either in view-centric coordinates (i.e., pixels) or in scene-centric coordinates (i.e., latitude-longitude, or coordinates in a customized reference coordinate system, if defined). For example, an object definition query regarding a person in the form of a first-order logic sentence would look like:

when the designated location is a bounding box. Note that the statements made by object definition queries are always true, as they aim to establish the conversation context.
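For illustration only, such a definition sentence might take a form like the following, where the predicate spellings, the variable $p$, the time $t_1$, and the corner coordinates are hypothetical placeholders rather than the dataset's actual syntax:

```latex
\exists p \;\; \mathrm{person}(p)
  \,\wedge\, \mathrm{inside}\big(p,\; \mathrm{bbox}((x_1, y_1), (x_2, y_2)),\; t_1\big)
```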

Non-definition queries. Non-definition queries in a story line explore a system’s spatial, temporal, and causal understanding of events in a scene regarding the detected objects. The query space consists of all possible combinations of predicates in the ontology with the detected objects (and/or objects interacting with the detected ones) as arguments. When expressing complex activities or relationships, multiple predicates are typically conjoined by ∧ to form a query. For example, suppose two detected people have been confirmed by object definition queries; then the following query states that one is a male, the other is a female, and there is a clear line of sight between them at a given time:

Note that the location is not specified: once the two people are identified and detected, we assume the vision system can track them over time.
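As an illustration, such a conjunctive query might be written as follows, where the variables $p_1$, $p_2$, $t$ and the predicate spellings are hypothetical placeholders:

```latex
\mathrm{male}(p_1) \,\wedge\, \mathrm{female}(p_2)
  \,\wedge\, \mathrm{clear\mbox{-}line\mbox{-}of\mbox{-}sight}(p_1, p_2, t)
```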

Moreover, story lines unfold fine-grained knowledge about the event in the scene as they proceed. In particular, given the detected objects and established context, querying about objects interacting with the detected ones becomes unambiguous. As in the example shown in Figure 4, even though the ball is not specified by any object definition query (and in fact the ball is hard to detect even when its position is given), once the two people interacting with the ball are identified, it becomes legitimate to ask if “the female catches a ball at time ”:

and if “the male and female are playing a ball game together over the period of to ”:

Times and locations are specified in the same way as in object definition queries, with the extension that a time period can be specified by a starting time and an ending time.

Correctly answering such queries is non-trivial, as it requires joint cognitive reasoning over spatial, temporal, and causal information across multiple cameras and over a time period.

Figure 6: An example XML segment of a query in the implementation. This segment is equivalent to the statement “ is a male, is a female, and there is a clear line of sight between them at time ”.

In non-polar cases, we support three types of questions: “what”, “when”, and “where”, to which the answers are object labels, time intervals, and location polygons, respectively.

Currently, we have created 3,426 queries in the dataset. Figure 5 shows the distribution of predicates in selected categories. Though we try to be unbiased in general, we do consider some predicates more common and important than others and thus make the distribution non-uniform. For example, among all occurrences of object predicates, “person” takes 55.9%, which is reasonable because human activities are our major point of interest. Meanwhile, we are also building a query generation toolkit on top of Vatic [37] for rapid query creation with respect to the statistical properties discussed by Geman et al. in [9]. In the implementation, queries are presented in the form of XML documents, as shown in Figure 6, for easy parsing.

5 System

We designed and implemented a computer vision system to perform the test, as shown in Figure 2. It consists of three major parts: an offline parsing pipeline which decomposes visual perception into multiple sub-tasks, a knowledge base which stores parsing results (including entities, properties, and relations between them), and a query engine which answers queries by searching the knowledge base. The system also features a flexible architecture and a visualization toolkit.

5.1 Offline parsing pipeline

The offline parsing pipeline processes the multi-view videos. Each view is first processed by a single-view parsing pipeline, where video sequences from multiple cameras are handled independently. Then multi-view fusion matches tracks from multiple views, reconciles results from single-view parsing, and generates scene-based results for answering questions.

To take advantage of achievements in various sub-areas in computer vision, we organize a pipeline of modules, each of which focuses on one particular group of predicates by generating corresponding labels for the input data. Every module gets access to the original video sequence and products from previous modules in the pipeline. The implemented modules are described as follows. Most components are derived from the state-of-the-art methods at the time we developed the system last year.

Scene parsing generates a homography matrix for each sensor by camera calibration, and also produces an estimated depth map and a segmentation label map for each camera view.
Object detection [34, 31] processes the video frames and generates bounding boxes for major objects of interest.

Multiple object tracking [29] generates tracks for all detected objects.

Human attributes [28] classifies appearance attributes of detected humans, including gender, color of clothes, type of clothes, and accessories (e.g., hat, backpack, glasses).

Action detection detects human actions and poses in the scene. The implementation is derived from [42, 43, 40].

Behavior detection parses human-human, human-scene, and human-object interactions.

Vehicle parsing [41, 15, 13] produces bounding boxes and fluent labels for specific parts of detected cars (e.g. fender, hood, trunk, windows, lights).

Multiple-view fusion merges the tracks and bounding boxes from multiple views based on appearance and geometry cues.

The middle-left part of Figure 4 shows the dependencies between these modules in the system.
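The dependency-ordered execution of such a pipeline, where each module receives the original video plus the products of the modules it depends on, can be sketched as below. The module names and data formats are illustrative, not the system's actual APIs.

```python
def run_pipeline(modules, video):
    """Run parsing modules in dependency order. `modules` maps a module
    name to (dependency names, function); each function receives the raw
    video plus a dict of products from its dependencies."""
    products, done = {}, set()
    pending = dict(modules)
    while pending:
        # Modules whose dependencies have all finished are ready to run.
        ready = [n for n, (deps, _) in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("cyclic module dependencies")
        for name in ready:
            deps, fn = pending.pop(name)
            products[name] = fn(video, {d: products[d] for d in deps})
            done.add(name)
    return products

# Toy modules standing in for detection -> tracking -> attributes.
modules = {
    "detect": ([], lambda v, p: ["box"]),
    "track": (["detect"], lambda v, p: [("track", b) for b in p["detect"]]),
    "attrs": (["track"], lambda v, p: len(p["track"])),
}
out = run_pipeline(modules, video="frames")
```

In the real system, each stage would be a remote service behind an RPC interface rather than an in-process function.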

Figure 7: Screenshot of the visualization tool. At the top, it shows videos from four different views with detected objects. At the bottom, detected objects are projected into the 3D scene. The videos and the 3D scene share the same playback timeline.

5.2 Knowledge base and query answering

We employ a generic graph-based data model to store knowledge. The detected objects, actions, and attribute labels are all modeled as nodes; the connections between them are modeled as edges. In our implementation, the parsing results are stored in Resource Description Framework (RDF) graphs [38], in the form of triple expressions, which can be queried with the standard query language SPARQL [39]. Given that queries are posed in a formal language, our query engine first parses each query and transforms it into a sequence of SPARQL statements. Apache Jena [27] is used to execute these statements and to return answers derived from the knowledge base. Figure 8 shows the architecture of the query engine.
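To illustrate the triple-based model, here is a minimal in-memory analogue: triples are stored as tuples, and a pattern with wildcards plays the role a variable plays in a SPARQL basic graph pattern. This is a toy sketch, not the actual Jena/SPARQL stack, and the entity and predicate names are hypothetical.

```python
class TripleStore:
    """A minimal in-memory analogue of an RDF triple store."""
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None is a wildcard,
        much like a variable in a SPARQL basic graph pattern."""
        return [(ts, tp, to) for ts, tp, to in self.triples
                if s in (None, ts) and p in (None, tp) and o in (None, to)]

kb = TripleStore()
kb.add("person1", "type", "person")
kb.add("person1", "gender", "male")
kb.add("person1", "sitting-on", "chair3")
```

A query such as "what is person1's gender?" then reduces to `kb.match("person1", "gender", None)`.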

Figure 8: Dependencies among single-view parsing tasks.

In practice, it is infeasible to pre-calculate all possible predicates and save each individual knowledge segment into the knowledge base. For example, pre-calculating all “clear-line-of-sight()” relationships would involve pair-wise combinations across all detected humans. This strategy is inefficient because only a sparse portion of the data is ever queried with this predicate. Instead, we designed an online computation module which evaluates binary and ternary relationships only at testing time, when such predicates appear in a query.
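An on-demand relationship predicate might be evaluated as in the sketch below, which reduces clear line of sight to a 2D ground-plane test: the segment between two people must not pass within a radius of any obstacle. The geometry, the disc obstacle model, and the radius are assumptions for illustration, not the system's actual computation.

```python
def point_segment_distance(p, a, b):
    """Distance from 2D point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def clear_line_of_sight(pos_a, pos_b, obstacles, radius=0.5):
    """Evaluate the predicate on demand: true iff no obstacle (modeled
    as a disc of `radius` on the ground plane) blocks the segment
    between the two people."""
    return all(point_segment_distance(o, pos_a, pos_b) > radius
               for o in obstacles)
```

Computing this lazily, only when the predicate occurs in a query, avoids the quadratic pre-computation over all person pairs.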

Evaluation protocols. The computer vision system talks to the evaluation server over HTTP connections. At the beginning of the evaluation, the system first acquires a session id from the evaluation server. Then the system repeatedly requests the next available scene, story line, and query in the session from the evaluation server. In this protocol, the evaluation server maintains the states of evaluation sessions internally and ensures the vision system cannot overwrite the submitted answer to any query.
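The client side of this protocol can be sketched as a loop over a transport function. Here `fetch` stands in for an HTTP round trip, and the endpoint names (`new_session`, `next_query`, `submit`) are hypothetical, not the real web service API.

```python
def evaluation_loop(fetch, answer_query):
    """Client-side sketch of the evaluation protocol. `fetch(endpoint)`
    stands in for an HTTP request to the evaluation server;
    `answer_query` is the vision system under test."""
    session_id = fetch("new_session")
    results = []
    while True:
        query = fetch(f"next_query?session={session_id}")
        if query is None:  # no more queries in the session
            break
        answer = answer_query(query)
        # The server returns the ground truth after each submission,
        # which an adaptive system could use for later queries.
        truth = fetch(f"submit?session={session_id}&answer={answer}")
        results.append((query, answer, truth))
    return results
```

Because the transport is injected, the loop can be exercised against a fake in-process server before being pointed at the real one.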

5.3 Design Decisions

The system is architected with two goals in mind: first, we want to incorporate existing tasks in computer vision; second, the architecture shall be flexible enough to replace a module with alternatives to pursue incremental improvements later. To this end, we defined a set of APIs for each vision task and connected all the modules using remote procedure calls (RPC). This lets the system focus on the logical connections between modules and provides implementation flexibility for individual components. In practice, we deploy the modules onto different dedicated machines. Under the RPC interfaces, computation-intensive algorithms usually utilize GPUs and MPI internally to pursue faster calculation and data parallelism. This design allows us to use the system as an experimental platform by switching between alternative models and implementations to study their effects on and contributions to query answering.

To make the system easy to use, we also developed a dashboard with visualization tools for rapid development and experimentation. Figure 7 shows a screenshot of the visualization.

Office Parking lot (winter) Parking lot (fall) Garden Auditorium
Video length 17:35:36 8:14:42 4:27:44 4:15:56 8:53:24
# of cameras 12 12 11 8 11
# moving cameras 0 2 1 1 2
# IR cameras 0 1 1 0 1
# of queries 108 247 236 215 254
Definition queries - 63 71 54 55
Non-definition queries 108 184 165 161 199
Response rate 0.522 0.600 0.795 0.683 0.731
Accuracy 0.785 0.615 0.626 0.586 0.684
Table 3: Performance by data collection.
Figure 9: Results breakdown. Left to right: (1) histogram of unique queries by length; (2) accuracy breakdown (object definition queries are included in the calculation); (3) histogram of queries by category; (4) accuracy breakdown.

6 Evaluation

Our prototype system has been evaluated by an independent third-party company which collected the datasets and created 1,160 polar queries on a subset of the data (see the upper parts of Table 3). The company was invited to administer the independent test under the same grant on which we worked. During the test, the testing data was available to our system two weeks before the story-line query evaluation. We performed the offline parsing within those two weeks by deploying our system on a small cluster of 10 workstations. During the evaluation, our system did not utilize the ground-truth answers received after each response for subsequent queries.

Among the 1,160 queries, 243 are object definitions, of which 197 (81%) were successfully detected. For non-definition queries, we either provided binary “true/false” answers or claimed “unable to respond” (when our implementation could not handle or recognize some of the predicates involved in a query). Table 3 reports accuracy as the ratio of correctly answered queries to the number of responded non-definition queries. Note that, for simplicity, object definition queries are not included in the accuracy calculation: they serve to establish mutual knowledge for subsequent queries in the story line, ensuring that the evaluation server and the system are discussing the same objects, so the ground-truth answers to these queries are always “true”. One could obtain 100% accuracy on object definition queries with a trivial method (always answering “true”), at the risk of not discussing the same objects in subsequent queries. We are now extending the benchmark by generating object definition queries whose answers can be “false”, so as to evaluate detection performance. Since such queries do not serve to establish conversation context, for story lines starting with an object definition query whose ground-truth answer is false, we randomly sample predicates and relations to generate the remaining queries.

Figure 9 further breaks down the accuracy by the number of unique predicates in a query and by the category of those predicates.

Breakdown by number of predicates. Most queries have one, two, or three predicates, a natural consequence of our choice to avoid overcomplicating the queries. As the number of predicates increases, the accuracy of our prototype system decreases, since a wrong prediction for any single predicate can make the answer to the whole query incorrect. Queries with one, two, or three predicates can mostly be characterized as follows:

i) One predicate: These are queries that deal only with the predicates for the various types of objects (people, car, etc.). Most of these queries (243) are object definition queries; the others (46) deal with counting objects (e.g., “how many people are in the scene?”).

ii) Two predicates: These queries are mostly queries involving unary predicates operating on an object. One predicate is used to define the object (usually person or automobile), and the unary predicate is the second predicate involved.

iii) Three predicates: These queries are mostly queries involving binary predicates operating on two objects. Two predicates are used to define the operands, and the binary predicate is the third predicate involved.
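The accuracy drop with query length is consistent with roughly independent per-predicate errors: if each predicate is answered correctly with probability p, a conjunctive query over k predicates is answered correctly with probability about p^k. A quick illustration (p = 0.8 is an assumed value for the sketch, not a measured one):

```python
p = 0.8  # assumed per-predicate accuracy (illustrative only)
for k in (1, 2, 3):
    # expected whole-query accuracy under independent predicate errors
    print(k, round(p ** k, 3))
# prints: 1 0.8 / 2 0.64 / 3 0.512
```

Under this simple model, even a strong 80% per-predicate accuracy falls to about 51% on three-predicate queries, which matches the qualitative trend in Figure 9.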

Breakdown by category. Looking at accuracy by category, our prototype system performs well on classic computer vision tasks (detection, part-of relations, actions, behaviors). However, queries involving spatial reasoning and interactions between humans and objects or the scene remain challenging and open to further research.

7 Discussion and Conclusion

This paper presented a restricted visual Turing test (VTT) for deeper scene and event understanding in long-term and multi-camera videos. Our VTT emphasizes joint spatial, temporal and causal understanding through a scene-centered representation and story-line based queries. The dataset and queries distinguish the proposed VTT from recently proposed visual question answering (VQA). We also presented a prototype integrated vision system that obtained reasonable results on our VTT.

In ongoing work, we are generating more story-line based queries and setting up a website to host a VTT competition. In the proposed competition, we will release the whole system as a playground. Our system architecture allows users to substitute one or more modules with their own methods and then run the VTT to measure the improvement. One of our next steps is to create a publicly available “vision module market” where researchers can evaluate individual components from the VTT perspective, in addition to traditional metrics.


Acknowledgements. This work is supported by DARPA MSEE FA 8650-11-1-7149 and DARPA SIMPLEX N66001-15-C-4035. We would like to thank Josh Walters and his colleagues at BAE Systems, the third-party collaborator in the project who administered the test, and Alexander Grushin and his colleagues at I-A-I for their effort in testing the system. We also thank the members of the VCLA Lab at UCLA who contributed perception algorithms from their published work to the baseline system.

