John K. Tsotsos

verfied profile


Tsotsos received an Hons. B.A.Sc. in Engineering Science in 1974, an M.Sc. in Computer Science in 1976 and a PhD. in Computer Science in 1980, all from the University of Toronto.

He then joined the University of Toronto on faculty in both Departments of Computer Science and of Medicine. He founded the Computer Vision Group at the University of Toronto in 1980, which he led until 1999.

He moved to York University in 2000 where he is now Distinguished Research Professor of Vision Science. He was Director of the Centre for Vision Research at York University from 2000-2006. Among his honors are: Canadian Heart Foundation Scholar at University of Toronto 1981-1984; an Honorable Mention Marr Prize at the 1st International Conference in Computer Vision in 1987; Fellow, Artificial Intelligence and Robotics Program at Canadian Institute for Advanced Research 1985-1995; Tier I Canada Research Chair of Computational Vision 2003-2024; Fellow of the Royal Society of Canada; Fellow IEEE; 2006 Canadian Image Processing and Pattern Recognition Society Award for Research Excellence and Service; the 1st President’s Research Excellence Award from York University on the occasion of the University’s 50th anniversary in 2009; the 2011 Geoffrey J. Burton Memorial Lectureship from the United Kingdom's Applied Vision Association for significant contribution to vision science; the 2015 Sir John William Dawson Medal from the Royal Society of Canada for sustained excellence in multidisciplinary research, the first computer scientist to be so honored; and several conference best paper or finalist awards.

His current research focuses on a comprehensive theory of visual attention in humans. A practical outlet for this theory forms a second focus, embodying elements of the theory into the vision systems of mobile robots.

  • A Possible Reason for why Data-Driven Beats Theory-Driven Computer Vision

    Why do some continue to wonder about the success and dominance of deep learning methods in computer vision and AI? Is it not enough that these methods provide practical solutions to many problems? Well no, it is not enough, at least for those who feel there should be a science that underpins all of this and that we should have a clear understanding of how this success was achieved. Here, this paper proposes that despite all the success and all the proclamations of so many about the superiority of these methods, the dominance we are witnessing would not have been possible by the methods of deep learning alone: the tacit change has been the evolution of empirical practice in computer vision and AI over the past decades. We demonstrate this by examining the distribution of sensor settings in vision datasets and performance of both classic and deep learning algorithms under various camera settings. This reveals a strong mismatch between optimal performance ranges of classical theory-driven algorithms and sensor setting distributions in the common vision datasets.

    08/28/2019 ∙ by John K. Tsotsos, et al. ∙ 81 share

    read it

  • High-Level Perceptual Similarity is Enabled by Learning Diverse Tasks

    Predicting human perceptual similarity is a challenging subject of ongoing research. The visual process underlying this aspect of human vision is thought to employ multiple different levels of visual analysis (shapes, objects, texture, layout, color, etc). In this paper, we postulate that the perception of image similarity is not an explicitly learned capability, but rather one that is a byproduct of learning others. This claim is supported by leveraging representations learned from a diverse set of visual tasks and using them jointly to predict perceptual similarity. This is done via simple feature concatenation, without any further learning. Nevertheless, experiments performed on the challenging Totally-Looks-Like (TLL) benchmark significantly surpass recent baselines, closing much of the reported gap towards prediction of human perceptual similarity. We provide an analysis of these results and discuss them in a broader context of emergent visual capabilities and their implications on the course of machine-vision research.

    03/26/2019 ∙ by Amir Rosenfeld, et al. ∙ 10 share

    read it

  • Early Salient Region Selection Does Not Drive Rapid Visual Categorization

    The current dominant visual processing paradigm in both human and machine research is the feedforward, layered hierarchy of neural-like processing elements. Within this paradigm, visual saliency is seen by many to have a specific role, namely that of early selection. Early selection is thought to enable very fast visual performance by limiting processing to only the most relevant candidate portions of an image. Though this strategy has indeed led to improved processing time efficiency in machine algorithms, at least one set of critical tests of this idea has never been performed with respect to the role of early selection in human vision. How would the best of the current saliency models perform on the stimuli used by experimentalists who first provided evidence for this visual processing paradigm? Would the algorithms really provide correct candidate sub-images to enable fast categorization on those same images? Here, we report on a new series of tests of these questions whose results suggest that it is quite unlikely that such an early selection process has any role in human rapid visual categorization.

    01/15/2019 ∙ by John K. Tsotsos, et al. ∙ 8 share

    read it

  • The Elephant in the Room

    We showcase a family of common failures of state-of-the art object detectors. These are obtained by replacing image sub-regions by another sub-image that contains a trained object. We call this "object transplanting". Modifying an image in this manner is shown to have a non-local impact on object detection. Slight changes in object position can affect its identity according to an object detector as well as that of other objects in the image. We provide some analysis and suggest possible reasons for the reported phenomena.

    08/09/2018 ∙ by Amir Rosenfeld, et al. ∙ 6 share

    read it

  • SMILER: Saliency Model Implementation Library for Experimental Research

    The Saliency Model Implementation Library for Experimental Research (SMILER) is a new software package which provides an open, standardized, and extensible framework for maintaining and executing computational saliency models. This work drastically reduces the human effort required to apply saliency algorithms to new tasks and datasets, while also ensuring consistency and procedural correctness for results and conclusions produced by different parties. At its launch SMILER already includes twenty three saliency models (fourteen models based in MATLAB and nine supported through containerization), and the open design of SMILER encourages this number to grow with future contributions from the community. The project may be downloaded and contributed to through its GitHub page:

    12/20/2018 ∙ by Calden Wloka, et al. ∙ 4 share

    read it

  • Early recurrence enables figure border ownership

    The face-vase illusion introduced by Rubin demonstrates how one can switch back and forth between two different interpretations depending on how the figure outlines are assigned [1]. This border ownership assignment is an important step in the perception of forms. Zhou et al. [2] found neurons in the visual cortex whose responses not only depend on the local features present in their classical receptive fields, but also on their contextual information. Various models proposed that feedback from higher ventral areas or lateral connections could provide the required contextual information. However, some studies [3, 4, 5] ruled out the plausibility of models exclusively based on lateral connections. In addition, further evidence [6] suggests that ventral feedback even from V4 is not fast enough to provide context to border ownership neurons in either V1 or V2. As a result, the border ownership assignment mechanism in the brain is a mystery yet to be solved. Here, we test with computational simulations the hypothesis that the dorsal stream provides the global information to border ownership cells in the ventral stream. Our proposed model incorporates early recurrence from the dorsal pathway as well as lateral modulations within the ventral stream. Our simulation experiments show that our model border ownership neurons, similar to their biological counterparts, exhibit different responses to figures on either side of the border.

    01/10/2019 ∙ by Paria Mehrani, et al. ∙ 4 share

    read it

  • Scene Classification in Indoor Environments for Robots using Context Based Word Embeddings

    Scene Classification has been addressed with numerous techniques in computer vision literature. However, with the increasing number of scene classes in datasets in the field, it has become difficult to achieve high accuracy in the context of robotics. In this paper, we implement an approach which combines traditional deep learning techniques with natural language processing methods to generate a word embedding based Scene Classification algorithm. We use the key idea that context (objects in the scene) of an image should be representative of the scene label meaning a group of objects could assist to predict the scene class. Objects present in the scene are represented by vectors and the images are re-classified based on the objects present in the scene to refine the initial classification by a Convolutional Neural Network (CNN). In our approach we address indoor Scene Classification task using a model trained with a reduced pre-processed version of the Places365 dataset and an empirical analysis is done on a real-world dataset that we built by capturing image sequences using a GoPro camera. We also report results obtained on a subset of the Places365 dataset using our approach and additionally show a deployment of our approach on a robot operating in a real-world environment.

    08/18/2019 ∙ by Bao Xin Chen, et al. ∙ 4 share

    read it

  • Fast Visual Object Tracking with Rotated Bounding Boxes

    In this paper, we demonstrate a novel algorithm that uses ellipse fitting to estimate the bounding box rotation angle and size with the segmentation(mask) on the target for online and real-time visual object tracking. Our method, SiamMask E, improves the bounding box fitting procedure of the state-of-the-art object tracking algorithm SiamMask and still retains a fast-tracking frame rate (80 fps) on a system equipped with GPU (GeForce GTX 1080 Ti or higher). We tested our approach on the visual object tracking datasets (VOT2016, VOT2018, and VOT2019) that were labeled with rotated bounding boxes. By comparing with the original SiamMask, we achieved an improved Accuracy of 0.645 and 0.303 EAO on VOT2019, which is 0.049 and 0.02 higher than the original SiamMask.

    07/08/2019 ∙ by Bao Xin Chen, et al. ∙ 3 share

    read it

  • Autonomous Vehicles that Interact with Pedestrians: A Survey of Theory and Practice

    One of the major challenges that autonomous cars are facing today is driving in urban environments. To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions. Such interactions are essential between the vehicles and pedestrians as the most vulnerable road users. Understanding pedestrian behavior, however, is not intuitive and depends on various factors such as demographics of the pedestrians, traffic dynamics, environmental conditions, etc. In this paper, we identify these factors by surveying pedestrian behavior studies, both the classical works on pedestrian-driver interaction and the modern ones that involve autonomous vehicles. To this end, we will discuss various methods of studying pedestrian behavior, and analyze how the factors identified in the literature are interrelated. We will also review the practical applications aimed at solving the interaction problem including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention. Based on our findings, we will discuss the open problems and propose future research directions.

    05/30/2018 ∙ by Amir Rasouli, et al. ∙ 2 share

    read it

  • Saccade Sequence Prediction: Beyond Static Saliency Maps

    Visual attention is a field with a considerable history, with eye movement control and prediction forming an important subfield. Fixation modeling in the past decades has been largely dominated computationally by a number of highly influential bottom-up saliency models, such as the Itti-Koch-Niebur model. The accuracy of such models has dramatically increased recently due to deep learning. However, on static images the emphasis of these models has largely been based on non-ordered prediction of fixations through a saliency map. Very few implemented models can generate temporally ordered human-like sequences of saccades beyond an initial fixation point. Towards addressing these shortcomings we present STAR-FC, a novel multi-saccade generator based on a central/peripheral integration of deep learning-based saliency and lower-level feature-based saliency. We have evaluated our model using the CAT2000 database, successfully predicting human patterns of fixation with equivalent accuracy and quality compared to what can be achieved by using one human sequence to predict another. This is a significant improvement over fixation sequences predicted by state-of-the-art saliency algorithms.

    11/29/2017 ∙ by Calden Wloka, et al. ∙ 0 share

    read it

  • STAR-RT: Visual attention for real-time video game playing

    In this paper we present STAR-RT - the first working prototype of Selective Tuning Attention Reference (STAR) model and Cognitive Programs (CPs). The Selective Tuning (ST) model received substantial support through psychological and neurophysiological experiments. The STAR framework expands ST and applies it to practical visual tasks. In order to do so, similarly to many cognitive architectures, STAR combines the visual hierarchy (based on ST) with the executive controller, working and short-term memory components and fixation controller. CPs in turn enable the communication among all these elements for visual task execution. To test the relevance of the system in a realistic context, we implemented the necessary components of STAR and designed CPs for playing two closed-source video games - Canabaltand Robot Unicorn Attack. Since both games run in a browser window, our algorithm has the same amount of information and the same amount of time to react to the events on the screen as a human player would. STAR-RT plays both games in real time using only visual input and achieves scores comparable to human expert players. It thus provides an existence proof for the utility of the particular CP structure and primitives used and the potential for continued experimentation and verification of their utility in broader scenarios.

    11/26/2017 ∙ by Iuliia Kotseruba, et al. ∙ 0 share

    read it