Tracking Without Re-recognition in Humans and Machines

05/27/2021 ∙ by Drew Linsley, et al. ∙ Brown University, Northeastern University

Imagine trying to track one particular fruitfly in a swarm of hundreds. Higher biological visual systems have evolved to track moving objects by relying on both appearance and motion features. We investigate if state-of-the-art deep neural networks for visual tracking are capable of the same. For this, we introduce PathTracker, a synthetic visual challenge that asks human observers and machines to track a target object in the midst of identical-looking "distractor" objects. While humans effortlessly learn PathTracker and generalize to systematic variations in task design, state-of-the-art deep networks struggle. To address this limitation, we identify and model circuit mechanisms in biological brains that are implicated in tracking objects based on motion cues. When instantiated as a recurrent network, our circuit model learns to solve PathTracker with a robust visual strategy that rivals human performance and explains a significant proportion of their decision-making on the challenge. We also show that the success of this circuit model extends to object tracking in natural videos. Adding it to a transformer-based architecture for object tracking builds tolerance to visual nuisances that affect object appearance, resulting in a new state-of-the-art performance on the large-scale TrackingNet object tracking challenge. Our work highlights the importance of building artificial vision models that can help us better understand human vision and improve computer vision.




1 Introduction

Footnotes: These authors contributed equally to this work. Affiliations: Carney Institute for Brain Science, Brown University, Providence, RI; Northeastern University, Boston, MA; DeepMind, London, UK.

Lettvin and colleagues Lettvin1959-ha presciently noted, “The frog does not seem to see or, at any rate, is not concerned with the detail of stationary parts of the world around him. He will starve to death surrounded by food if it is not moving.” Object tracking is fundamental to survival, and higher biological visual systems have evolved the capacity for two distinct and complementary strategies to do it. Consider Figure 1: can you track the object labeled by the yellow arrow from left-to-right? The task is trivial when there are “bottom-up” cues for object appearance, like color, which make it possible to “re-recognize” the target in each frame (Fig. 1a). On the other hand, the task is more challenging when objects cannot be discriminated by their appearance. In this case integration of object motion over time is necessary for tracking (Fig. 1b). Humans are capable of tracking objects by their motion when appearance is uninformative Pylyshyn1988-pi ; Blaser2000-xz , but it is unclear if the current generation of neural networks for video analysis and tracking can do the same. To address this question we introduce PathTracker, a synthetic challenge for object tracking without re-recognition (Fig. 1c).

Leading models for video analysis rely on object classification pre-training. This gives them access to rich semantic representations that have supported state-of-the-art performance on a host of tasks, from action recognition to object tracking carreira2017 ; Bertasius2021-dk ; Wang2021-no . As object classification models have improved, so too have the video analysis models that depend on them. This trend in model development has made it unclear if video analysis models are effective at learning tasks when appearance is uninformative. The importance of diverse visual strategies has been highlighted by synthetic challenges like Pathfinder,

Figure 1: The appearance of objects makes them (a) easy or (b) hard to track. We introduce the PathTracker Challenge (c), which asks observers to track a particular green dot as it travels from the red-to-blue markers, testing object tracking when re-recognition is impossible.

a visual reasoning task that asks observers to trace long paths embedded in a static cluttered display Linsley2018-ls ; Kim2020-yw . Pathfinder tests object segmentation when appearance cues like category or shape are missing. While humans can easily solve it  Kim2020-yw , feedforward neural networks struggle, including state-of-the-art vision transformers Tay2020-ni ; Kim2020-yw ; Linsley2018-ls . Importantly, models that learn an appropriate visual strategy for Pathfinder are also quicker learners and better at generalization for object segmentation in natural images Linsley2020-en ; Linsley2020-ua . Our PathTracker challenge extends this line of work into video by posing an object tracking problem where the target can be tracked by motion and spatiotemporal continuity, not category or appearance.


Humans effortlessly solve our novel PathTracker challenge. A variety of state-of-the-art models for object tracking and video analysis do not.


  • We find that neural architectures including R3D Tran2017-fg and state-of-the-art transformer-based TimeSformers Bertasius2021-dk are strained by long PathTracker videos, which humans solve far more effectively.

  • We describe a solution to PathTracker: a recurrent model inspired by primate neural circuitry involved in object tracking, whose decisions are strongly correlated with those of humans.

  • These same circuit mechanisms improve object tracking in natural videos through a motion-based strategy that builds tolerance to changes in target object appearance, resulting in the certified top score on TrackingNet Muller2018-qn at the time of this submission.

  • We release all PathTracker data, code, and human psychophysics to spur interest in the challenge of tracking without re-recognition.

2 Related Work

Shortcut learning and synthetic datasets

A byproduct of the great power of deep neural network architectures is their vulnerability to learning spurious correlations between inputs and labels. Perhaps because of this tendency, object classification models have trouble generalizing to novel contexts Barbu2019-zq ; Geirhos2020-nl , and render idiosyncratic decisions that are inconsistent with humans Ullman2016-ea ; Linsley2017-qe ; Linsley2019-bw . Synthetic datasets are effective at probing this vulnerability because they make it possible to control spurious image/label correlations and fairly test the computational abilities of models. For example, the Pathfinder challenge was designed to test if neural architectures can trace long curves despite gaps – a visual computation associated with the earliest stages of visual processing in primates. That challenge identified diverging visual strategies between humans and transformers that are otherwise state of the art in natural image object recognition Tay2020-ni ; Dosovitskiy2020-if . Other challenges like Bongard-LOGO Nie2020-lx , cABC Kim2020-yw , and PSVRT Kim2018-ib have highlighted limitations of leading neural network architectures that would have been difficult to identify using natural image benchmarks like ImageNet Deng2009-jk . These limitations have inspired algorithmic solutions based on neural circuits discussed in SI §A.

Models for video analysis

A major leap in the performance of models for video analysis came from using networks which are pre-trained for object recognition on large image datasets Carreira2017-ic . The recently introduced TimeSformer Bertasius2021-dk achieved state-of-the-art performance with weights initialized from an image categorization transformer (ViT; Dosovitskiy2020-if ) that was pre-trained on ImageNet-21K. The story is similar in object tracking, where successful models rely on “backbone” feature extraction networks trained on ImageNet or Microsoft COCO Lin2014-zk for object recognition or segmentation Bertasius2020-eo ; Wang2021-no .

Figure 2: PathTracker is a synthetic visual challenge that asks observers to watch a video clip and answer if a target dot starting in a red marker travels to a blue marker. The target dot is surrounded by identical “distractor” dots, each of which travels in a randomly generated and curved path. In positive examples, the target dot’s path ends in the blue square. In negative examples, a “distractor” dot ends in the blue square. The challenge of the task is due to the identical appearance of target and distractor dots, which makes appearance-based tracking strategies ineffective. Moreover, the target dot can momentarily occupy the same location as a distractor when they cross each other’s paths, making them impossible to individuate in that frame and compelling strategies like motion trajectory extrapolation or working memory to recover the target track. (b) A 3D visualization of the video in (a) depicts the trajectory of the target dot, traveling from red-to-blue markers. The target and distractor cross approximately half-way through the video. (c,d) We develop versions of PathTracker that test observer sensitivity to the number of distractors and length of videos (e,f). The number of distractors and video length interact to make it more likely for the target dot to cross a distractor in a video (compare the one X in b vs. two in d vs. three in f; see SI §B for details).

3 The PathTracker Challenge


PathTracker asks observers to decide whether or not a target dot reaches a goal location (Fig. 2). The target dot travels in the midst of a pre-specified number of distractors. All dots are identical, and the task is difficult because of this: (i) observers cannot rely on appearance to track the target, and (ii) the paths of the target and distractors can momentarily “cross” and occupy the same space, making them impossible to individuate in that frame and meaning that observers cannot only rely on target location to solve the task. This challenge is inspired by object tracking paradigms of cognitive psychology Pylyshyn1988-pi ; Blaser2000-xz , which suggest that humans might rely on mechanisms for motion perception, attention and working memory to solve a task like PathTracker.

Figure 3: Model accuracy on the PathTracker challenge. Video analysis models were trained to solve 32 (a) and 64 frame (b) versions of the challenge, which featured the target object and 14 identical distractors. Models were tested on PathTracker datasets with the same number of frames but 1, 14, or 25 distractors (left/middle/right). Grey hatched boxes denote 95% bootstrapped confidence intervals for humans. Only our InT Circuit rivaled humans on each dataset.

The trajectories of target and distractor dots are randomly generated, and the target occasionally crosses distractors (Fig. 2). These object trajectories are smooth by design, giving the appearance of objects meandering through a scene: the difference between the coordinates of any dot on successive frames is no more than 2 pixels, with only a small amount of angular displacement. In other words, dots never turn at acute angles. We develop different versions of PathTracker which we expect to be more or less difficult by adjusting the number of distractors and/or the length of videos. These variables change the expected number of times that distractors cross the target and the amount of time that observers must track the target (Fig. 2). To make the task as visually simple as possible and maximize contrast between dots and markers, the dots, start, and goal markers are placed on different channels in 32×32 pixel three-channel images. Markers are stationary throughout each video and placed at random locations. Example videos can be viewed online.
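To make the stimulus constraints above concrete (32×32 three-channel frames, smooth headings, per-frame displacement of at most 2 pixels), here is a minimal sketch of a PathTracker-like generator. The function name and every implementation detail are hypothetical illustrations, not the released PathTracker code:

```python
import numpy as np

def make_pathtracker_video(n_frames=32, n_distractors=14, size=32, max_step=2.0, seed=0):
    """Sketch of a PathTracker-like stimulus (hypothetical). Dots follow smooth
    random walks; channel 0 holds dots, channels 1 and 2 hold the stationary
    start and goal markers."""
    rng = np.random.default_rng(seed)
    n_dots = 1 + n_distractors
    pos = rng.uniform(2, size - 2, (n_dots, 2))      # current dot positions
    heading = rng.uniform(0, 2 * np.pi, n_dots)      # current travel directions
    video = np.zeros((n_frames, 3, size, size), dtype=np.float32)
    start = pos[0].copy()
    traj = []
    for t in range(n_frames):
        # small heading perturbations keep paths smooth (no acute turns)
        heading += rng.uniform(-0.3, 0.3, n_dots)
        step = np.stack([np.cos(heading), np.sin(heading)], 1) * max_step
        pos = np.clip(pos + step, 1, size - 2)
        for x, y in pos:                             # draw each dot
            video[t, 0, int(y), int(x)] = 1.0
        traj.append(pos[0].copy())
    goal = pos[0]                                    # positive example: goal at target's end
    video[:, 1, int(start[1]), int(start[0])] = 1.0  # start marker, all frames
    video[:, 2, int(goal[1]), int(goal[0])] = 1.0    # goal marker, all frames
    return video, np.array(traj)
```

A negative example would instead place the goal marker at a distractor's final position; that branch is omitted here for brevity.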

Human benchmark

We began by testing if humans can solve PathTracker. We recruited 180 individuals using Amazon Mechanical Turk to participate in this study. Participants viewed PathTracker videos and pressed a button on their keyboard to indicate if the target object or a distractor reached the goal. These videos were played in web browsers at 256×256 pixels using HTML5, which helped ensure consistent framerates Eberhardt2016-cw . The experiment began with an 8 trial “training” stage, which familiarized participants with the goal of PathTracker. Next, participants were tested on 72 videos. The experiment was not paced and lasted approximately 25 minutes, and participants were paid for their time. See SI §B for an example and more details.

Participants were randomly entered into one of two experiments. In the first experiment, they were trained on the 32 frame and 14 distractor PathTracker, and tested on 32 frame versions with 1, 14, or 25 distractors. In the second experiment, they were trained on the 64 frame and 14 distractor PathTracker, and tested on 64 frame versions with 1, 14, or 25 distractors. All participants viewed unique videos to maximize our sampling over the different versions of PathTracker. Participants were significantly above chance on all tested conditions of PathTracker (p < 0.001, test details in SI §B). They also exhibited a significant negative trend in performance on the 64 frame datasets as the number of distractors increased. There was no such trend on the 32 frame datasets, and average accuracy between the two datasets was not significantly different. These results validate our initial design assumptions: humans can solve PathTracker, and manipulating distractors and video length increases difficulty.
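The negative trend reported above is the slope of accuracy against distractor count. A minimal sketch of that statistic (the paper's exact test statistics were not recoverable from this copy, so this is purely illustrative):

```python
import numpy as np

def accuracy_trend(distractor_counts, accuracies):
    """Least-squares slope of accuracy vs. number of distractors.
    A negative slope indicates performance degrades with more distractors."""
    x = np.asarray(distractor_counts, float)
    y = np.asarray(accuracies, float)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)
```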

4 Solving the PathTracker challenge

Can state-of-the-art models for video analysis match humans on PathTracker? To test this question we surveyed a variety of architectures that are the basis for leading approaches to many video analysis tasks, from object tracking to action classification. We restricted our survey to models that could be trained end-to-end to solve PathTracker without any additional pre- or post-processing steps. The selected models fall into three groups: (i) deep convolutional networks (CNNs), (ii) transformers, and (iii) recurrent neural networks (RNNs). The deep convolutional networks include a 3D ResNet (R3D Tran2017-fg ), a space/time separated ResNet with “2D-spatial + 1D-temporal” convolutions (R(2+1)D Tran2017-fg ), and a ResNet with 3D convolutions in early residual blocks and 2D convolutions in later blocks (MC3 Tran2017-fg ). We trained versions of these models with random weight initializations and weights pretrained on ImageNet. We included an R3D trained from scratch without any downsampling, in case the small size of PathTracker videos caused learning problems (see SI §C for details). We also trained a version of the R3D on optic flow encodings of PathTracker (SI §C). For transformers, we turned to the TimeSformer Bertasius2021-hi . We test two of its instances: (i) attention is jointly computed for all locations across space and time in videos, and (ii) temporal attention is applied before spatial attention, which results in massive computational savings. We found similar PathTracker performance with both models. We report the latter version here as it was marginally better (see SI §C for performance of the other, joint space-time attention TimeSformer). We include a version of the TimeSformer trained from scratch, and a version pre-trained on ImageNet-21K. Note that state-of-the-art transformers for object tracking in natural videos feature similar deep and multi-headed designs Wang2021-no but use additional post-processing steps that are beyond the scope of PathTracker. Finally, we include a convolutional-gated recurrent unit (Conv-GRU) Bhat2020-hb .


The visual simplicity of PathTracker cuts two ways: it makes it possible to compare human and model strategies for tracking without re-recognition as long as the task is not too easy. Prior synthetic challenges like Pathfinder constrain sample sizes for training to probe specific computations Kim2020-yw ; Linsley2018-ls ; Tay2020-ni . We adopt the following strategy to select a training set size that would help us test tracking strategies that do not depend on re-recognition. We took Inception 3D (I3D) networks Carreira2017-ic , which have been a strong baseline architecture in video analysis over the past several years, and tested their ability to learn PathTracker as we adjusted the number of videos for training. As we discuss in SI §A, when this model was trained with 20K examples of the 32 frame and 14 distractor version of PathTracker it achieved good performance on the task without signs of overfitting to its simple visual statistics. We therefore train all models in subsequent experiments with 20K examples.

We measure the ability of models to learn PathTracker and systematically generalize to novel versions of the challenge when trained on 20K samples. We trained models using a similar approach as in our human psychophysics. Models were trained on one version of PathTracker, and tested on other versions with the same number of frames, and the same or different number of distractors. In the first experiment, models were trained on the 32 frame and 14 distractor PathTracker, then tested on the 32 frame PathTracker datasets with 1, 14, or 25 distractors (Fig. 3a). In the second experiment, models were trained on the 64 frame and 14 distractor PathTracker, then tested on the 64 frame PathTracker datasets with 1, 14, or 25 distractors (Fig. 3b). Models were trained to detect if the target dot reached the blue goal marker using binary crossentropy and the Adam optimizer Kingma2014-ct until performance on a test set of 20K videos with 14 distractors decreased for 200 straight epochs. In each experiment, we selected model weights that performed best on the 14 distractor dataset. Models were retrained three times at each of five learning rates to optimize performance. The best performing model was then tested on the remaining 1 and 25 distractor datasets in the experiment. We used four NVIDIA GTX GPUs and a batch size of 180 for training.
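The model-selection protocol just described — stop after 200 epochs without validation improvement, keep the best weights seen, and repeat across a grid of learning rates — can be sketched generically. The helper names and the learning-rate values below are placeholders (the paper's exact rates were garbled in this copy), not the authors' training code:

```python
import numpy as np

def train_with_early_stop(train_step, evaluate, max_epochs=10000, patience=200):
    """Train until validation accuracy fails to improve for `patience`
    consecutive epochs; return the best weights and their accuracy."""
    best_acc, best_weights, bad_epochs = -np.inf, None, 0
    for epoch in range(max_epochs):
        weights = train_step(epoch)          # one epoch of optimization
        acc = evaluate(weights)              # validation accuracy
        if acc > best_acc:
            best_acc, best_weights, bad_epochs = acc, weights, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_weights, best_acc

def sweep_learning_rates(fit, lrs=(1e-2, 1e-3, 1e-4, 1e-5, 1e-6)):
    """Retrain at each learning rate; keep the run with the best validation score.
    `fit(lr)` is assumed to return (weights, accuracy)."""
    results = [(lr, *fit(lr)) for lr in lrs]
    return max(results, key=lambda r: r[2])
```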


We treat human performance as the benchmark for models on PathTracker. Nearly all CNNs and the ImageNet-initialized TimeSformer performed well enough to reach the 95% human confidence interval on the 32 frame and 14 distractor PathTracker. However, all models performed worse when systematically generalizing to PathTracker datasets with a different number of distractors, even when that number decreased (Fig. 3a, 1 distractor). Model performance on the 32 frame PathTracker datasets was worst for 25 distractors. No CNN or transformer reached the 95% confidence interval of humans on this dataset (Fig. 3a). The optic flow R3D and the TimeSformer trained from scratch were even less successful but still above chance, while the Conv-GRU performed at chance. Model performance plummeted across the board on 64 frame PathTracker datasets. The drop in model performance from 32 to 64 frames reflects a combination of the following features of PathTracker. (i) The target becomes more likely to cross a distractor when length and the number of distractors increase (Fig. 2c). This makes the task difficult because the target is momentarily impossible to distinguish from a distractor. (ii) The target object must be tracked from start-to-end to solve the task, which can incur a memory cost that is monotonic w.r.t. video length. (iii) The prior two features interact to non-linearly increase task difficulty (Fig. 2c).

Figure 4: The Index-and-Track (InT) circuit model is inspired by neuroscience models of motion perception Berzhanskaya2007-lu and executive cognitive function Wong2006-xa . (a) The circuit receives input encodings from a video, which are processed by interacting recurrent inhibitory and excitatory units, and a mechanism for selective “attention” that tracks the target location. (b) InT units have spatiotemporal receptive fields. Spatial connections are formed by convolution with weight kernels. Temporal connections are controlled by gates. (c) Model parameters are fit with gradient descent, using softplus and sigmoid nonlinearities, convolutions, and elementwise products.

Neural circuits for tracking without re-recognition

PathTracker is inspired by object tracking paradigms from Psychology, which tested theories of working memory and attention in human observers Pylyshyn1988-pi ; Blaser2000-xz . PathTracker may draw upon similar mechanisms of visual cognition in humans. However, the video analysis models that we include in our benchmark (Fig. 3) do not have inductive biases for working memory, and while the TimeSformer uses a form of attention, it is insufficient for learning PathTracker and only reached human performance on one version of the challenge (Fig. 3).

Neural circuits for motion perception, working memory, and attention have been the subject of intense study in Neuroscience for decades. Knowledge synthesized from several computational, electrophysiological and imaging studies points to canonical features and computations that are carried out by these circuits. (i) Spatiotemporal feature selectivity emerges from non-linear and time-delayed interactions between neuronal subpopulations Takemura2013-ch ; Kim2014-bc . (ii) Recurrently connected neuronal clusters can maintain task information in working memory Elman1990-hd ; Wong2006-xa . (iii) Synaptic gating, inhibitory modulation, and disinhibitory circuits are neural substrates of working memory and attention Hochreiter1997-gc ; OReilly2006-je ; Badre2012-hv ; DArdenne2012-az ; Mitchell2007-fd . (iv) Mechanisms for gain control may aid motion-based object tracking by building tolerance to visual nuisances, such as illumination Berzhanskaya2007-hs ; Mely2018-bc . We draw from these principles to construct the “Index-and-Track” circuit (InT, Fig. 4).

Figure 5: Performance, decision correlations, and error consistency between models and humans on PathTracker. In a new set of psychophysics experiments, humans and models were trained on 64 frame PathTracker datasets with 14 distractors, and rendered decisions on a variety of challenging versions. Decision correlations are computed with Pearson’s r, and error consistency with Cohen’s κ Geirhos2020-uq . Only the Complete InT circuit rivals human performance and explains the majority of their decision and error variance on each test dataset (a,c,e). Visualizing InT attention reveals that it has learned to solve PathTracker by multi-object tracking (b,d,f; color denotes time). The consistency between InT and human decisions raises the possibility that humans rely on a similar strategy.

InT circuit description

The InT circuit takes as input an encoding of each video frame at every location and feature channel (Fig. 4a). This input is passed to an inhibitory unit, which interacts with an excitatory unit; both have persistent states that store memories with the help of gates. The inhibitory unit is also gated by another inhibitory unit, which is a non-linear function of the excitatory unit and can either decrease or increase (i.e., through disinhibition) the inhibitory drive. In principle, the sigmoidal nonlinearity of this unit means that it can selectively attend, and hence we refer to it as “attention”. Moreover, since attention is a function of the excitatory unit, which lags the input in time, its activity reflects the displacement (or motion) of an object in the current input versus the current memory. InT units have spatiotemporal receptive fields (Fig. 4b). Interactions between units at different locations are computed by convolution with weight kernels, and attention is computed the same way. Gate activities that control InT dynamics and temporal receptive fields are similarly calculated by kernels. Recurrent units in the InT support non-linear (gain) control: inhibitory units can perform divisive and subtractive computations, and excitatory units can perform multiplicative and additive computations. “SoftPlus” rectifications enforce inhibitory and excitatory function and competition (Fig. 4c). The final excitatory state is passed to a readout for PathTracker (SI §D).
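A toy reading of these dynamics can be written down as a recurrent cell. The sketch below substitutes 1×1 channel-mixing for the paper's spatial convolution kernels, and the parameter names and exact update order are our assumptions, not the published InT equations:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class InTCellSketch:
    """Toy, per-pixel sketch of the Index-and-Track recurrence.
    All weight names are illustrative; real InT uses spatial convolutions."""
    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        p = lambda: rng.normal(0, 0.1, (channels, channels))
        self.W_inh, self.W_exc, self.W_att = p(), p(), p()
        self.W_gi, self.W_ge = p(), p()
        self.I = self.E = None       # persistent inhibitory/excitatory states

    def step(self, x):               # x: (channels, H, W) frame encoding
        if self.I is None:
            self.I, self.E = np.zeros_like(x), np.zeros_like(x)
        mix = lambda W, z: np.einsum('oc,chw->ohw', W, z)
        # attention: sigmoidal function of the excitatory memory, which lags
        # the input in time, so it reflects displacement of the target
        A = sigmoid(mix(self.W_att, self.E))
        # inhibitory update: input minus attention-gated inhibitory drive,
        # rectified by softplus to keep the unit inhibitory
        I_new = softplus(x - A * mix(self.W_inh, self.E))
        g_i = sigmoid(mix(self.W_gi, I_new))          # temporal gate
        self.I = g_i * self.I + (1 - g_i) * I_new
        # excitatory update, also rectified and gated
        E_new = softplus(mix(self.W_exc, self.I))
        g_e = sigmoid(mix(self.W_ge, E_new))
        self.E = g_e * self.E + (1 - g_e) * E_new
        return self.E, A
```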

InT PathTracker performance

We trained the InT on PathTracker following the procedure in §4. It was the only model that rivaled humans on each version of PathTracker (Fig. 3). The gap in performance between InT and the field is greatest on the 64 frame version of the challenge.

How does the InT solve PathTracker? There are at least two strategies that it could choose from. One is to maintain a perfect track of the target throughout its trajectory, and extrapolate the momentum of its motion to resolve crossings with distractors. Another is to track all objects that cross the target and check if any of them reach the goal marker by the end of the video. To investigate the type of strategy learned by the InT for PathTracker and to compare this strategy to humans, we ran additional psychophysics with a new group of 90 participants using the same setup detailed in §3. Participants were trained on 8 videos from the 14 distractor and 64 frame PathTracker and tested on 72 videos from either the (i) 14 distractor and 64 frame dataset, (ii) 25 distractor and 64 frame dataset, or (iii) 14 distractor and 128 frame dataset. Unlike the psychophysics in §3, all participants viewing a given test set saw the same videos, which made it possible to compare their decision strategies with the InT.

InT performance reached the confidence intervals of humans on each test dataset. The InT also produced errors that were extremely consistent with humans and explained nearly all variance in Pearson’s r and Cohen’s κ on each dataset (Fig. 5, middle and right columns). This result means that humans and InT rely on similar strategies for solving PathTracker.
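For reference, the two agreement measures can be sketched as follows. The error-consistency formula follows the general form of Cohen's κ with chance agreement predicted from the two observers' accuracies, our paraphrase of the Geirhos2020-uq measure rather than the paper's analysis code:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two vectors of trial-wise decisions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def error_consistency(acc_a, acc_b, observed_agreement):
    """Cohen's kappa: agreement between two observers beyond the agreement
    their accuracies alone would predict under independence."""
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    return (observed_agreement - expected) / (1 - expected)
```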

What is the underlying strategy? We visualized activity of units in the InT as they processed PathTracker videos and found that they had learned a multi-object tracking strategy to solve the task (Fig. 5; see SI §F for methods and animations). The units track the target object until it crosses a distractor and ambiguity arises, at which point attention splits and it tracks both objects. This strategy indexes a limited number of objects at once, consistent with studies of object tracking in humans Pylyshyn1988-pi . Since the InT is not explicitly constrained for this tracking strategy, we next investigated the minimal circuit for learning it and explaining human behavior.

We developed versions of the InT with lesions applied to different combinations of its divisive/subtractive and multiplicative/additive computations, a version without attention units, and a version that does not make a distinction between inhibition and excitation (“complete + tanh”), in which rectifications were replaced with hyperbolic tangents that squash activities into (−1, 1). While some of these models marginally outperformed the Complete InT on the 14 distractor and 64 frame dataset, their performance dropped precipitously on the 25 distractor and 64 frame dataset, and especially the very long 14 distractor and 128 frame dataset (Fig. 5e). Attention units in the complete InT’s nearest rival (complete + tanh) were non-selective, potentially contributing to its struggles. InT performance also dropped when we forced it to attend to fewer objects (SI §F).

Figure 6: Circuit mechanisms for tracking without re-recognition build tolerance to visual nuisances that affect object appearance. (a) The TransT Chen2021-is is a transformer architecture for object tracking. We develop an extension, the InT+TransT, in which our InT circuit model recurrently modulates TransT activity. Unlike the TransT, the InT+TransT is trained on sequences to promote tracking strategies that do not rely on re-recognition. (b-d) The InT+TransT excels when the target object is visually similar to other depicted objects, undergoes changes in illumination, or is occluded.

5 Appearance-free mechanisms for object tracking in the wild

The InT solves PathTracker by learning to track multiple objects at once, without relying on the re-recognition strategy that has been central to progress in video analysis challenges in computer vision. However, it is not clear if tracking without re-recognition is useful in the natural world. We test this question by turning to object tracking in natural videos. At the time of writing, the state-of-the-art object tracker is the TransT Chen2021-is , a deep multihead transformer Vaswani2017-gh . The TransT finds pixels in a video frame that match the appearance of an image crop depicting a target object. During training, the TransT receives a tuple of inputs, consisting of this target object image and a random additional “search frame” from the same video. These images are encoded with a modified ResNet50 He2015-cb , passed to separate transformers, and finally combined by a “cross-feature attention” (CFA) module, which compares the two encodings via a transformer key/query/value computation. The target frame is used for key and value operations, and the search frame is used for the query operation. Through its pure appearance-based approach to tracking, the TransT has achieved top performance on TrackingNet Muller2018-qn , LaSOT Fan2018-sb , and GOT-10K Huang2018-yt .
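The key/query/value comparison described for the CFA module can be illustrated with a minimal single-head sketch. The shapes, weight names, and the absence of multiple heads and positional encodings are simplifications of ours, not the TransT implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_feature_attention(target_feats, search_feats, Wq, Wk, Wv):
    """Single-head sketch of the CFA pattern: keys and values come from the
    target-crop encoding, queries from the search-frame encoding.
    Features are assumed flattened to [tokens, dim]."""
    Q = search_feats @ Wq            # queries from the search frame
    K = target_feats @ Wk            # keys from the target crop
    V = target_feats @ Wv            # values from the target crop
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
    return softmax(scores, axis=-1) @ V       # search tokens attend to target
```

Each search-frame token thus retrieves a mixture of target-crop features weighted by appearance similarity, which is exactly the re-recognition behavior the InT modulation is meant to complement.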


We tested whether or not the InT circuit can improve TransT performance by learning a complementary tracking strategy that does not depend on appearance or re-recognition. We reasoned that this strategy might help TransT tracking in cases where objects are difficult to discern by their appearance, such as when they are subject to changing lighting, color, or occlusion. We thus developed the InT+TransT, which involves the following modifications of the original TransT (Fig. 6a). (i) We introduce two InT circuits to form a bottom-up and top-down feedback loop with the TransT Linsley2020-en ; Gilbert2013-hb , which in principle will help the model select the appropriate tracking strategy depending on the video – re-recognition or not. One InT receives ResNet50 search image encodings and modulates the TransT’s CFA encoding of this search image. The other receives the output of the TransT and uses this information to update memory in the first InT. (ii) The TransT is trained with pairs of target and search video frames, separated in time by up to 100 frames. We introduce the intervening frames to the InT circuits. See SI §F for extended methods.


InT+TransT training and evaluation hews close to the TransT procedure. This includes training on the latest object tracking challenges in computer vision: TrackingNet Muller2018-qn , LaSOT Fan2018-sb , and GOT-10K Huang2018-yt . All three challenges depict diverse classes of objects, moving in natural scenes that range from simplistic and barren to complex and cluttered. TrackingNet (30,132 train and 511 test videos) and GOT-10K (10,000 train and 180 test) evaluation is performed on official challenge servers, whereas LaSOT (1,120 train and 280 test) is evaluated with a MATLAB toolbox. While the TransT is also trained with static images from Microsoft COCO Lin2014-zk , in which the search image is an augmented version of the target, we do not include COCO in InT+TransT training since we expect object motion to be an essential feature for our model Chen2021-is . The InT+TransT is initialized with TransT weights and trained with AdamW Loshchilov2017-dy , with a learning rate of 1e−4 for InT parameters and 1e−6 for parameters in the TransT readout and CFA module. Other TransT parameters are frozen and not trained. The InT+TransT is trained with the same objective functions as the TransT for target object bounding box prediction in the search frame, and an additional objective function for bounding box prediction using InT circuit activity in intervening frames. The complete model was trained with batches of 24 videos on 8 NVIDIA GTX GPUs for 150 epochs (2 days). We selected the weights that performed best on GOT-10K validation. A hyperparameter controls the number of frames between the target and search that are introduced into the InT during training. We relied on coarse sampling (1 or 8 frames) due to memory issues associated with recurrent network training on long sequences Linsley2020-ua .


An InT+TransT trained on sequences of 8 frames performed inference at around 30 FPS on a single NVIDIA GTX GPU and beat the TransT on nearly all benchmarks. It is in first place on the TrackingNet leaderboard, better than the TransT on LaSOT, and rivals the TransT on the GOT-10K challenge (Table 1). The InT+TransT performed better when trained with longer sequences (compare the 8-frame and 1-frame versions in Table 1). Consistent with InT success on PathTracker, the InT+TransT was qualitatively better than the TransT on challenging videos where the target interacted with other similar-looking objects (Fig. 6).

We also found that the InT+TransT excelled in other challenging tracking conditions. The LaSOT challenge provides annotations for challenging video features, which reveal that the InT+TransT is especially effective for tracking objects with “deformable” parts, such as moving wings or tails (SI §F). We further tested whether introducing object appearance perturbations to the GOT-10K might distinguish performance between the TransT and InT+TransT. We evaluated these models on the GOT-10K test set with one of three perturbations: inverting the color of all search frames (Color), inverting the color of random search frames (rColor), or introducing random occlusions (Occl.). The InT+TransT outperformed the TransT on each of these tests (Table 1).
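A minimal sketch of these three perturbations, assuming 8-bit grayscale frames stored as nested lists; the inversion probability `p`, occluder size, and random seed are illustrative choices, not values from the paper:

```python
import random

def invert(frame):
    # Color perturbation: invert pixel intensities (assumes 8-bit values).
    return [[255 - px for px in row] for row in frame]

def perturb_video(frames, mode, p=0.5, occl_size=2, seed=0):
    """Apply one of the three GOT-10K perturbations described in the text.
    `p`, `occl_size`, and `seed` are illustrative, not from the paper."""
    rng = random.Random(seed)
    out = []
    for frame in frames:
        if mode == "Color":                          # invert every frame
            frame = invert(frame)
        elif mode == "rColor" and rng.random() < p:  # invert random frames
            frame = invert(frame)
        elif mode == "Occl":                         # occluder from scrambled pixels
            h, w = len(frame), len(frame[0])
            y, x = rng.randrange(h - occl_size), rng.randrange(w - occl_size)
            patch = [frame[y + i][x + j]
                     for i in range(occl_size) for j in range(occl_size)]
            rng.shuffle(patch)
            frame = [row[:] for row in frame]
            for i in range(occl_size):
                for j in range(occl_size):
                    frame[y + i][x + j] = patch[i * occl_size + j]
        out.append(frame)
    return out
```

Note that the occluder only scrambles pixels already in the frame, so low-level image statistics are largely preserved while object appearance is destroyed locally.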

Model               TrackingNet (P)   LaSOT (P)   GOT (AO)   Color   rColor   Occl.
InT+TransT (8fr)    87.5              74.0        72.2       43.1    62.5     56.9
InT+TransT (1fr)    87.3              73.6        70.0       36.2    37.8     25.4
TransT Chen2021-is  86.7              73.8        72.3       40.7    57.5     55.2
Table 1: Model performance on TrackingNet (P) Muller2018-qn , LaSOT (P) Fan2018-sb , GOT-10K (AO) Huang2018-yt , and perturbations applied to the GOT-10K (AO). Perturbations on the GOT-10K are color inversions on every frame (Color) or random frames (rColor), and random occluders created from scrambling image pixels (Occl.). InT+TransT (8fr) was trained on sequences of 8 frames, and InT+TransT (1fr) was trained on 1-frame sequences.

6 Discussion

A key inspiration for our study is the centrality of visual motion and tracking across a broad phylogenetic range, via three premises: (i) Object motion integration over time is essential for ecological vision and survival Lettvin1959-ha . (ii) Object motion perception cannot be completely reduced to recognizing similar appearance features at two different moments in time; in perceptual phenomena like phi motion, the tracked object is described as “formless”, with no distinct appearance Steinman2000-jt . (iii) Motion integration over space and time is a basic operation of neural circuits in biological brains, which can be independent of appearance Huk2005-bc . These three premises form the basis for our work.

We developed PathTracker to test whether state-of-the-art models for video analysis can solve a visual task when object appearance is ambiguous. Prior visual reasoning challenges, like Pathfinder Linsley2018-ls ; Kim2020-yw ; Tay2020-ni , indicate that this is a problem for object recognition models, which also serve as a backbone for many video analysis models. While no existing model was able to contend with humans on PathTracker, our InT circuit was. Through lesioning experiments, we discovered that the InT’s ability to explain human behavior depends on its full array of inductive biases, helping it learn a visual strategy that indexes and tracks a limited number of objects at once, echoing classic theories on the role of attention and working memory in object tracking Blaser2000-xz ; Pylyshyn1988-pi .

We further demonstrate that the capacity for video analysis without relying on re-recognition helps in natural scenes. Our InT+TransT model is more capable than the TransT at tracking objects when their appearance changes, and is the state of the art on the TrackingNet challenge. Together, our findings demonstrate that object appearance is a necessary element for video analysis, but it is not sufficient for modeling biological vision and rivaling human performance.

We are grateful to Daniel Bear for his suggestions to improve this work. We would also like to thank Rajan Girsa for initial discussions related to the Python Flask framework used in the MTurk portal. GM is also affiliated with Labrynthe Pvt. Ltd., New Delhi, India. Funding was provided by ONR grant #N00014-19-1-2029, the ANR-3IA Artificial and Natural Intelligence Toulouse Institute, and ANITI (ANR-19-PI3A-0004). Additional support came from the Brown University Carney Institute for Brain Science, the Center for Computation in Brain and Mind, and the Center for Computation and Visualization (CCV).


  • (1) Lettvin, J.Y., Maturana, H.R., McCulloch, W.S., Pitts, W.H.: What the frog’s eye tells the frog’s brain. Proceedings of the IRE 47(11) (November 1959) 1940–1951
  • (2) Pylyshyn, Z.W., Storm, R.W.: Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spat. Vis. 3(3) (1988) 179–197
  • (3) Blaser, E., Pylyshyn, Z.W., Holcombe, A.O.: Tracking an object through feature space. Nature 408(6809) (November 2000) 196–199
  • (4) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 6299–6308

  • (5) Bertasius, G., Wang, H., Torresani, L.: Is Space-Time attention all you need for video understanding? (February 2021)
  • (6) Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: Exploiting temporal context for robust visual tracking. (March 2021)
  • (7) Linsley, D., Kim, J., Veerabadran, V., Serre, T.: Learning long-range spatial dependencies with horizontal gated-recurrent units. (May 2018)
  • (8) Kim*, J., Linsley*, D., Thakkar, K., Serre, T.: Disentangling neural mechanisms for perceptual grouping. International Conference on Representation Learning (2020)
  • (9) Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., Metzler, D.: Long range arena: A benchmark for efficient transformers. (November 2020)
  • (10) Linsley, D., Kim, J., Ashok, A., Serre, T.: Recurrent neural circuits for contour detection. International Conference on Learning Representations (2020)
  • (11) Linsley, D., Ashok, A.K., Govindarajan, L.N., Liu, R., Serre, T.: Stable and expressive recurrent vision models. (May 2020)
  • (12) Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. (November 2017)
  • (13) Müller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: A Large-Scale dataset and benchmark for object tracking in the wild. In: Computer Vision – ECCV 2018, Springer International Publishing (2018) 310–327
  • (14) Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., Katz, B.: ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Wallach, H., Larochelle, H., Beygelzimer, A., d Alché-Buc, F., Fox, E., Garnett, R., eds.: Advances in Neural Information Processing Systems 32. Curran Associates, Inc. (2019) 9453–9463
  • (15) Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nature Machine Intelligence 2(11) (November 2020) 665–673
  • (16) Ullman, S., Assif, L., Fetaya, E., Harari, D.: Atoms of recognition in human and computer vision. Proc. Natl. Acad. Sci. U. S. A. 113(10) (March 2016) 2744–2749
  • (17) Linsley, D., Eberhardt, S., Sharma, T., Gupta, P., Serre, T.: What are the visual features underlying human versus machine vision? In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). (October 2017) 2706–2714
  • (18) Linsley, D., Shiebler, D., Eberhardt, S., Serre, T.: Learning what and where to attend. (2019)
  • (19) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. (October 2020)
  • (20) Nie, W., Yu, Z., Mao, L., Patel, A.B., Zhu, Y., Anandkumar, A.: Bongard-LOGO: A new benchmark for Human-Level concept learning and reasoning. (October 2020)
  • (21) Kim, J., Ricci, M., Serre, T.: Not-So-CLEVR: learning same-different relations strains feedforward neural networks. Interface Focus 8(4) (August 2018) 20180011
  • (22) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. (June 2009) 248–255
  • (23) Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. (May 2017)
  • (24) Fiaz, M., Mahmood, A., Jung, S.K.: Tracking noisy targets: A review of recent object tracking approaches. (February 2018)
  • (25) Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Lawrence Zitnick, C., Dollár, P.: Microsoft COCO: Common objects in context. (May 2014)
  • (26) Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (June 2020) 9736–9745
  • (27) Eberhardt, S., Cader, J.G., Serre, T.: How deep is the feature analysis underlying rapid visual categorization? In Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., eds.: Advances in Neural Information Processing Systems 29. Curran Associates, Inc. (2016) 1100–1108
  • (28) Bertasius, G., Wang, H., Torresani, L.: Is Space-Time attention all you need for video understanding? (February 2021)
  • (29) Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Know your surroundings: Exploiting scene information for object tracking. In: Computer Vision – ECCV 2020, Springer International Publishing (2020) 205–221
  • (30) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. (December 2014)
  • (31) Berzhanskaya, J., Grossberg, S., Mingolla, E.: Laminar cortical dynamics of visual form and motion interactions during coherent object motion perception. Spat. Vis. 20(4) (2007) 337–395
  • (32) Wong, K.F., Wang, X.J.: A recurrent network mechanism of time integration in perceptual decisions. J. Neurosci. 26(4) (January 2006) 1314–1328
  • (33) Takemura, S.Y., Bharioke, A., Lu, Z., Nern, A., Vitaladevuni, S., Rivlin, P.K., Katz, W.T., Olbris, D.J., Plaza, S.M., Winston, P., Zhao, T., Horne, J.A., Fetter, R.D., Takemura, S., Blazek, K., Chang, L.A., Ogundeyi, O., Saunders, M.A., Shapiro, V., Sigmund, C., Rubin, G.M., Scheffer, L.K., Meinertzhagen, I.A., Chklovskii, D.B.: A visual motion detection circuit suggested by drosophila connectomics. Nature 500(7461) (August 2013) 175–181
  • (34) Kim, J.S., Greene, M.J., Zlateski, A., Lee, K., Richardson, M., Turaga, S.C., Purcaro, M., Balkam, M., Robinson, A., Behabadi, B.F., Campos, M., Denk, W., Seung, H.S., the EyeWirers: Space–time wiring specificity supports direction selectivity in the retina. Nature 509 (May 2014) 331
  • (35) Elman, J.L.: Finding structure in time. Cogn. Sci. 14(2) (April 1990) 179–211
  • (36) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8) (November 1997) 1735–1780
  • (37) O’Reilly, R.C., Frank, M.J.: Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput. 18(2) (February 2006) 283–328
  • (38) Badre, D.: Opening the gate to working memory. Proc. Natl. Acad. Sci. U. S. A. 109(49) (December 2012) 19878–19879
  • (39) D’Ardenne, K., Eshel, N., Luka, J., Lenartowicz, A., Nystrom, L.E., Cohen, J.D.: Role of prefrontal cortex and the midbrain dopamine system in working memory updating. Proc. Natl. Acad. Sci. U. S. A. 109(49) (December 2012) 19900–19909
  • (40) Mitchell, J.F., Sundberg, K.A., Reynolds, J.H.: Differential attention-dependent response modulation across cell classes in macaque visual area V4. Neuron 55(1) (July 2007) 131–141
  • (41) Berzhanskaya, J., Grossberg, S., Mingolla, E.: Laminar cortical dynamics of visual form and motion interactions during coherent object motion perception. Spat. Vis. 20(4) (2007) 337–395
  • (42) Mély, D.A., Linsley, D., Serre, T.: Complementary surrounds explain diverse contextual phenomena across visual modalities. Psychol. Rev. 125(5) (October 2018) 769–784
  • (43) Geirhos, R., Meding, K., Wichmann, F.A.: Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. (June 2020)
  • (44) Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. (March 2021)
  • (45) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.U., Polosukhin, I.: Attention is all you need. In Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., eds.: Advances in Neural Information Processing Systems. Volume 30., Curran Associates, Inc. (2017)
  • (46) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. (December 2015)
  • (47) Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: LaSOT: A high-quality benchmark for large-scale single object tracking. (September 2018)
  • (48) Huang, L., Zhao, X., Huang, K.: GOT-10k: A large High-Diversity benchmark for generic object tracking in the wild. (October 2018)
  • (49) Gilbert, C.D., Li, W.: Top-down influences on visual processing. Nat. Rev. Neurosci. 14(5) (May 2013) 350–363
  • (50) Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. (November 2017)
  • (51) Steinman, R.M., Pizlo, Z., Pizlo, F.J.: Phi is not beta, and why wertheimer’s discovery launched the gestalt revolution. Vision Res. 40(17) (August 2000) 2257–2264
  • (52) Huk, A.C., Shadlen, M.N.: Neural activity in macaque parietal cortex reflects temporal integration of visual motion signals during perceptual decision making. J. Neurosci. 25(45) (November 2005) 10420–10436
  • (53) Shimamura, A.P.: Toward a cognitive neuroscience of metacognition. Conscious. Cogn. 9(2 Pt 1) (June 2000) 313–23; discussion 324–6
  • (54) Rousseeuw, P.J., Croux, C.: Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 88(424) (December 1993) 1273–1283
  • (55) Edgington, E.S.: Randomization tests. J. Psychol. 57 (April 1964) 445–449
  • (56) Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D.: An improved algorithm for TV-L1 optical flow. In: Statistical and Geometrical Approaches to Visual Motion Analysis, Springer Berlin Heidelberg (2009) 23–45
  • (57) Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, (July 2015) 448–456

  • (58) Grossberg, S., Mingolla, E.: Neural dynamics of perceptual grouping: textures, boundaries, and emergent segmentations. Percept. Psychophys. 38(2) (August 1985) 141–171
  • (59) Wu, Y., He, K.: Group normalization. (March 2018)

Appendix A Extended related Work

Translating circuits for biological vision into artificial neural networks

While the Pathfinder challenge of Linsley2018-ls presents immense challenges for transformers and deep convolutional networks Kim2020-yw , the authors found that it can be solved by a simple model of intrinsic connectivity in visual cortex, with orders-of-magnitude fewer parameters than standard models for image categorization. This model was developed by translating descriptive models of neural mechanisms from Neuroscience into an architecture that can be fit to data using gradient descent Linsley2018-ls ; Linsley2020-ua . We adopt a similar approach in the current work, identifying mechanisms for object tracking without re-recognition in Neuroscience, and developing those into differentiable operations with parameters that can be optimized by gradient descent. This approach has the dual purpose of introducing task-relevant inductive biases into computer vision models, and developing theory on their relative utility for biological vision.


In this work we tested a relatively small number of PathTracker versions. We mostly focused on small variations to the number of distractors and video length, but in future work we hope to incorporate other variations, like speed and velocity manipulations and generalization across temporal variations. Another limitation is that appearance-free strategies confer relatively modest gains over the state of the art. One open issue is determining when a visual system should rely on appearance-based vs. appearance-free features for tracking. Our solution is two-pronged and potentially insufficient. The first strategy is top-down feedback from the TransT into the InT, which aligns tracks between the two models. The second strategy is potentially naive, in that we gate the InT modulation to the TransT based on its agreement with the prior TransT query and the confidence of that query. Additional work is needed to identify better approaches. Meta-cognition work from Cognitive Neuroscience is one possible resource Shimamura2000-ii .

Societal impacts

The basic goal of our study is to understand how biological brains work. PathTracker helps us screen models against humans on a simple visual task that tests visual strategies for tracking without “re-recognition”, or appearance cues. The fact that we developed a circuit that explains human performance is primarily important because it makes predictions about the types of neural circuit mechanisms that we might ultimately find in the brain in future Neuroscience work. Our extension to natural videos achieves a new state of the art because it implements visual strategies that build tolerance to visual nuisances in a way that resembles humans. It must be recognized that further development of this model has potential for misuse. One possible nefarious application is surveillance. On the other hand, such a technology could be essential for ecology, sports, self-driving cars, robotics, and other real-world applications of machine vision. We open source our code and data to promote research toward such beneficial applications.

Appendix B Human benchmark

For our benchmark experiments we recruited 120 participants. Every participant was compensated with $8 through MTurk upon successful completion of all test trials, by pasting a unique code generated by our system into their MTurk account. This amount was determined by prorating the minimum wage. An additional overhead fee of 40% per participant was paid to MTurk. Collectively, we spent $960 on these benchmark experiments.

The experiment was not time-bound and participants could complete it at their own pace, taking around 25 minutes overall. Videos with 32, 64, and 128 frames lasted approximately 4, 8, and 14 seconds, respectively, and played at 10 frames per second. Participant reaction times were also recorded on every trial, and we include these in our data release. After every trial, participants were redirected to a screen confirming successful submission of their response. They could start the next trial by clicking the “Continue” button or by pressing the spacebar; otherwise, they were automatically redirected to the next trial after 3000 ms. Participants were also shown a “rest screen” with a progress bar after every 10 trials, where they could take additional and longer breaks if needed. The timer was turned off for the rest screen.
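The stated durations can be roughly reproduced from the frame counts, the 10 FPS playback rate, and the 10 repeated lead-in frames described in the next section (a back-of-the-envelope sketch; the exact rendering pipeline is not specified here):

```python
def display_seconds(n_frames, fps=10, lead_in_repeats=10):
    # Total on-screen time: the first frame is repeated `lead_in_repeats`
    # times before the rest of the video plays (see Appendix B).
    return (n_frames + lead_in_repeats) / fps

for n in (32, 64, 128):
    print(n, display_seconds(n))  # roughly the stated 4, 8, and 14 seconds
```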

Experiment design

At the beginning of the experiment, we collected participant consent using a consent form approved by a University Institutional Review Board (IRB). The experiment was completed on a computer via the Chrome browser. After providing consent, participants were shown a demonstration that clearly stated the instructions, along with an example video. We also provided an option to revisit the instructions, if needed, from the top right corner of the navigation bar at any point during the experiment.

Participants were asked to classify each video as “positive” (the dot leaving the red marker entered the blue marker) or “negative” (the dot leaving the red marker did not enter the blue marker) using the right and left arrow keys, respectively. The response keys and their meanings were listed below the video on every screen, along with a short instruction paragraph above the video (see Fig. S1). Participants were given feedback on their response (correct/incorrect) after every practice trial, but not after the test trials.

Figure S1: An experimental trial screen.


The experiment was written in Python Flask, including the server side script and logic. The frontend templates were written in HTML with Bootstrap CSS framework. We used javascript for form submission with keys and redirections, done on the end-user side. The server was run with nginx on 1 Intel(R) Xeon(R) CPU E5-2695 v3 at 2.30GHz, 4GB RAM, Red Hat Enterprise Linux Server.

Video frames for each experiment were generated at 32×32 resolution. Before writing them to the mp4 videos displayed to human participants in the experiment, the frames were resized to 256×256 through nearest-neighbor interpolation. In order to allow time for users to prepare for each trial, the first frame of each video was repeated 10 times before the rest of the video played.
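The nearest-neighbor upscaling step can be sketched as follows (a toy implementation on nested lists; the experiment presumably used standard image tooling):

```python
def upscale_nearest(frame, factor):
    # Nearest-neighbor interpolation: each source pixel becomes a
    # factor x factor block, preserving the crisp dot/marker edges.
    return [[frame[i // factor][j // factor]
             for j in range(len(frame[0]) * factor)]
            for i in range(len(frame) * factor)]
```

With `factor=8`, a 32×32 frame becomes the 256×256 frame shown to participants.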

Filtering criteria

Amazon Mechanical Turk data is notoriously noisy. Because of this, we adopted a simple and bias-free approach to filter participants who were inattentive or did not understand the task (these users were still paid for their time). For the main benchmark described in §3 in the main text, participants completed one of two experiments, in which they were trained and tested on videos with 32 or 64 frames. No participant viewed both lengths of PathTracker. Participants were trained with 14-distractor videos, then tested on videos with 1, 14, or 25 distractors. We filtered participants according to their performance on the training videos for a particular experiment, which were otherwise not used for any analysis in this study. We removed participants whose training accuracy fell more than 2 median absolute deviations (MAD Rousseeuw1993-wj ; a robust alternative to using the mean and standard deviation to find outliers) below the median. The resulting threshold corresponded to approximately the same training accuracy for each experiment (chance is 50%). This procedure filtered 74/180 participants in the benchmark.
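The filtering rule can be sketched as follows (a minimal implementation; `k=2` follows the 2-MAD criterion above):

```python
from statistics import median

def mad_filter(accuracies, k=2.0):
    """Keep participants whose training accuracy is no more than k median
    absolute deviations (MAD) below the median; k=2 matches the paper."""
    med = median(accuracies)
    mad = median(abs(a - med) for a in accuracies)
    threshold = med - k * mad
    return [a >= threshold for a in accuracies]
```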

Statistical testing

We assessed the difference between human performance and chance using randomization tests Edgington1964-zb . We computed human accuracy on each test dataset, then, over 10,000 steps, shuffled the video labels and recomputed and stored the resulting accuracy. We computed p-values as the proportion of shuffled accuracies that exceeded the real accuracy. We also used linear models for significance testing of trends in human accuracy as we increased the number of distractors. From these models we computed t-tests and p-values.
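The randomization test can be sketched as follows (10,000 shuffles, as above; the `seed` argument is an illustrative addition for reproducibility):

```python
import random

def randomization_pvalue(labels, predictions, n_iter=10000, seed=0):
    # Null distribution: shuffle the ground-truth labels and recompute
    # accuracy. The p-value is the proportion of shuffled accuracies
    # that meet or exceed the real accuracy.
    rng = random.Random(seed)
    real = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
    labels = list(labels)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(labels)
        acc = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
        if acc >= real:
            count += 1
    return count / n_iter
```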

Using an I3D Carreira2017-ic to select PathTracker training set sizes

As mentioned in the main text, we selected the PathTracker training set size for the models reported there by investigating the sample efficiency of the standard, but not state-of-the-art, I3D Carreira2017-ic . We were specifically interested in identifying a “pareto principle” in learning dynamics, where additional training samples began to yield smaller gains in accuracy, potentially signifying a point at which the I3D had learned a viable strategy (Fig. S2). At this point, we suspected that the task would remain challenging – but still solvable – across the variety of PathTracker conditions we discuss in the main text. We focused on the basic 32-frame, 14-distractor condition and found an inflection point at 20K examples. We plot I3D performance on this condition in Fig. S2a and performance slopes in Fig. S2b. The first and lowest slope corresponds to 20K samples, and hence may reflect an inflection in the model’s visual strategy. Our experiments in the main text demonstrate that this approach is a viable one for calibrating the difficulty of synthetic challenges.
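The selection rule, picking the training set size whose incoming accuracy-vs-size slope is first and lowest, can be sketched as (the accuracy values in the usage example are illustrative, not the paper's measurements):

```python
def pick_training_set_size(sizes, accuracies):
    """Select the training-set size at which accuracy gains flatten:
    the size reached by the first, smallest slope of accuracy w.r.t.
    training-set size (finite differences between adjacent sizes)."""
    slopes = [(accuracies[i + 1] - accuracies[i]) / (sizes[i + 1] - sizes[i])
              for i in range(len(sizes) - 1)]
    # list.index returns the FIRST occurrence of the minimum slope.
    return sizes[slopes.index(min(slopes)) + 1]
```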

Target-distractor crossings

We computed the average number of crossings between the target object and distractors in PathTracker. Increasing video length monotonically increases the number of crossings. Length further interacts with the number of distractors to yield more crossings (Fig. S2c).
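One way to operationalize a "crossing", assumed here to mean a distractor entering within a fixed radius of the target (the paper's exact criterion may differ), is:

```python
import math

def count_crossings(target_xy, distractor_xys, radius=1.0):
    """Count target-distractor crossing events: frames where a distractor
    enters within `radius` of the target, counted once per entry."""
    crossings = 0
    for d in distractor_xys:
        inside = False
        for (tx, ty), (dx, dy) in zip(target_xy, d):
            close = math.hypot(tx - dx, ty - dy) < radius
            if close and not inside:  # count only the moment of entry
                crossings += 1
            inside = close
    return crossings
```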

Figure S2: Our approach for selecting training set size on PathTracker, and a proxy for difficulty across versions of the challenge. (a) We plot I3D performance as a function of training set size. The dotted line denotes the point at which the derivative of accuracy w.r.t. training set size is smallest (b). We take this change in performance as a function of training set size as evidence that the I3D has learned a strategy that is sufficient for the task. We suspected this size would make PathTracker challenging but still solvable for the models we discuss in the main text. (c) The average number of crossings in PathTracker videos as a function of distractors and video length. Lines depict exponential fits for each number of distractors across lengths.

Appendix C Solving the Pathtracker challenge

State-of-the-art model details

We trained a variety of models on our benchmark. This included an R3D without any strides or downsampling. Because this manipulation caused an explosion in memory usage, we reduced the number of features per residual block of this “No Stride R3D” from 64/128/256/512 to 32/32/32/32. We also included two forms of TimeSformers Bertasius2021-hi : one with distinct applications of temporal and spatial attention, which we include in our main analyses, and another with joint temporal and spatial attention (Fig. S3).

Optic Flow

We followed the method of Carreira2017-ic to compute optic flow encodings of PathTracker datasets. We used OpenCV’s implementation of the TV-L1 algorithm Wedel2009-nq . We extracted two channels from the output given by the algorithm, and appended a channel-averaged version of the corresponding PathTracker image, similar to the approach of Carreira2017-ic .

Figure S3: An extended benchmark of state-of-the-art models on PathTracker with (a) 32 and (b) 64 frame versions of the task.

Appendix D InT circuit description

Our InT circuit has two recurrent neural populations, excitatory and inhibitory. These populations evolve over time and receive a dynamic “feedforward” drive, derived from a convolution between each frame of the PathTracker videos and a kernel, followed by a softplus pointwise rectification. InT hidden states are given a fixed initialization. The InT circuit also includes Batch Normalization Ioffe2015-zm applied to the outputs of its recurrent kernels, with scale and intercept parameters shared across timesteps of processing. We initialize the scale parameters following prior work Linsley2020-ua . We do not store Batch Normalization moments during training. InT gain control (i.e., its divisive normalization) is expected to emerge at steady state Mely2018-bc ; Grossberg1985-ui in similar dynamical systems formulations, although our formulation relaxes some of these constraints.

The final excitatory activity of the InT for a PathTracker video is passed to a readout that renders a binary decision for the task. This readout begins by convolving that activity with a kernel. The output is channel-wise concatenated with the channel of the first frame containing the location of the goal marker. This activity is then convolved with another kernel, which is designed to capture overlap between the goal marker and the putative target object/dot. The resulting activity is globally average pooled and entered into a binary cross-entropy loss for model optimization. On PathTracker, all versions of the InT and the ConvGRU used this input transformation. All versions of the InT, the ConvGRU, and the “No Stride R3D” used this readout.

Appendix E InT PathTracker

We visualize InT attention units on PathTracker by binarizing the logits: units above a threshold are set to 1, and units below it are set to 0. When applying the same strategy to versions of the InT other than the complete circuit, we found attention that was far more diffuse. For these lesioned InT circuits, adjusting this threshold to be more conservative, at two, three, or even four standard deviations above the mean, never yielded attention that looked like the complete model's. For instance, the closest competitor to the complete InT is one in which its softplus rectifications are changed to hyperbolic tangents, which removes the model's constraints for separate and competing forms of inhibition and excitation. This model's attention was correspondingly diffuse, and it also generalized worse than the complete circuit (Fig. S4).
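The binarization, including the more conservative mean-plus-k-standard-deviations thresholds used for lesioned circuits, can be sketched as follows (the exact threshold used for the complete circuit is treated as a free parameter here):

```python
from statistics import mean, stdev

def binarize_attention(logits, k=0.0):
    """Binarize attention logits: units above `mean + k * std` become 1,
    the rest 0. Larger k gives the more conservative thresholds tried for
    lesioned circuits (two, three, or even four standard deviations)."""
    thresh = mean(logits) + k * stdev(logits)
    return [1 if x > thresh else 0 for x in logits]
```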

We also developed a version of the InT with attention that was biased against multi-object tracking. In the normal formulation, InT attention is transformed with a sigmoid pointwise nonlinearity. This independently transforms every attention unit to lie in (0, 1), giving units the capacity to attend to multiple objects at once. In the version biased against multi-object tracking, we replaced the sigmoid with a spatial softmax, which normalized the sum of units in each attention channel to 1. This model performed worse than the CNNs or the TimeSformer on PathTracker (Fig. S3).
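The difference between the two attention formulations can be illustrated on a single channel (a toy 1-D example; real channels are 2-D spatial maps):

```python
import math

def sigmoid_attention(channel):
    # Independent sigmoids: several units can be near 1 simultaneously,
    # allowing attention to multiple objects at once.
    return [1 / (1 + math.exp(-x)) for x in channel]

def spatial_softmax(channel):
    # Softmax across spatial positions: the channel sums to 1, biasing
    # attention toward a single location (numerically stabilized by
    # subtracting the max before exponentiation).
    m = max(channel)
    exps = [math.exp(x - m) for x in channel]
    s = sum(exps)
    return [e / s for e in exps]
```

With two equally strong units, the sigmoid version attends to both, while the softmax version must split its unit mass between them.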

Figure S4: A comparison of attention between the complete InT and one where its softplus rectifications are replaced by tanh.

Appendix F InT+TransT

We introduced our InT circuit into a state-of-the-art TransT to promote alternative visual strategies for object tracking (Fig. S5). TransT tracking is completely appearance-based, and because of this commitment, it achieved a sizable boost in performance on the GOT-10K challenge over its closest competitor.

Figure S5: The (a) TransT and (b) the InT additions that create our InT+TransT. The InT additively modulates the TransT query (Q) in its CFA, which corresponds to its encoding of the search image that is compared with its encoding of the target. The InT activity is recurrent, and is itself modulated by a cost volume that captures the similarity of InT activity and the TransT query from the prior step, along with the TransT query entropy. This cost volume is designed to gate InT activity unless the TransT is low-confidence and the InT and TransT render different predictions, at which point the InT can adjust TransT queries. The InT is further supervised on each step of a video to predict target object bounding boxes.


We add two InTs to the TransT (Fig. S5). The key difference between these InTs and the ones used on PathTracker is that they use GroupNorm Wu2018-av instead of Batch Normalization. This was done because object tracking in natural images is memory intensive and forces smaller batch sizes than we used for PathTracker.

The first of the InT+TransT InTs (InT, Fig. S5b) has the same dimensionality as the one described for PathTracker in the main text. As input, it received ResNet50 features, like the TransT, which were convolved with a kernel to reduce their dimensionality for this InT. Its recurrent excitatory and inhibitory units were initialized by kernels convolved with a binary map indicating the location of the target object in the first search frame. We expected this initialization to promote a strategy for appearance-free tracking.

The InT excitatory units were then passed to a three-layer convnet with convolutional kernels designed to inflate the dimensionality of the InT to match the TransT and register their representations. This convnet had softplus activation functions applied to the outputs of its first and second layers. We refer to the result as the registered InT activity.
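The registration convnet described above can be sketched as follows; the channel widths and 1×1 kernel sizes are placeholders for the values given in the paper (we assume a 32-channel InT state inflated to a 256-channel TransT feature space):

```python
import torch
import torch.nn as nn

# Hypothetical widths standing in for the paper's exact dimensionalities.
int_channels, transt_channels = 32, 256

# Three-layer convnet inflating InT excitatory activity to the TransT
# dimensionality; softplus follows the first and second layers only.
register = nn.Sequential(
    nn.Conv2d(int_channels, 128, kernel_size=1),
    nn.Softplus(),
    nn.Conv2d(128, 128, kernel_size=1),
    nn.Softplus(),
    nn.Conv2d(128, transt_channels, kernel_size=1),  # no activation: output
)                                                    # will modulate the query

excitatory = torch.randn(1, int_channels, 32, 32)    # InT excitatory units
registered = register(excitatory)                    # registered activity
```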

We eventually used this registered activity to additively modify the “search frame” query Q in the TransT cross-feature attention (CFA) module. Note that Q is compared to key/value encodings of a cropped image of the target to predict bounding box coordinates. But before being sent to the TransT, the registered activity was subject to a gate designed to help the InT+TransT adjudicate between its InT and TransT for the most reliable source of information on any given video frame.

We developed a “cost volume” consisting of two activities concatenated along their channel axis. The first activity is an estimate of the reliability of the TransT query from the preceding step of processing at every spatial location. As we describe below, this query is processed into the dimensionality of the InT, at which point we compute its energy at every spatial location, yielding a 1-dimensional spatial volume. When this value is high at a particular location, we expect that the TransT is confident in its representation there; when it is low, we expect that the TransT is not confident. We concatenate this energy with the outer product of InT activity and the processed query, yielding a 1024-dimensional volume. The final cost volume was then used to compute a gate for the registered activity: it was convolved with a kernel and then passed through a sigmoid applied at every position, forcing these values into (0, 1). To summarize, the modulation sent to the TransT is the transformed output of the first InT, gated by a cost volume capturing TransT confidence and its consistency with the InT, before additively modulating the TransT CFA query Q. See Fig. S5 for a schematic.
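The gating computation can be sketched as below. This is an assumption-laden reading of the text: we take the per-location energy to be a sum of squares over channels, the outer product to be computed per spatial location between 32-dimensional InT and query vectors (32 × 32 = 1024 channels), and the gate to come from a 1×1 convolution.

```python
import torch
import torch.nn as nn

# Hypothetical widths; a sketch of the gate, not the exact implementation.
c = 32                                   # per-location feature dimensionality
b, h, w = 1, 16, 16

int_act = torch.randn(b, c, h, w)        # InT activity
q_prev = torch.randn(b, c, h, w)         # TransT query from the prior step,
                                         # projected to the InT dimensionality

# (1) Energy of the prior query at every location: high energy is read as
# high TransT confidence at that position.
energy = (q_prev ** 2).sum(dim=1, keepdim=True)          # (b, 1, h, w)

# (2) Per-location outer product of InT activity and the query, flattened
# along channels, capturing their (dis)agreement.
outer = torch.einsum("bihw,bjhw->bijhw", int_act, q_prev)
outer = outer.reshape(b, c * c, h, w)                    # (b, 1024, h, w)

cost_volume = torch.cat([energy, outer], dim=1)

# (3) A 1x1 convolution plus sigmoid squashes the cost volume into a gate in
# (0, 1) at every position; the gated InT activity then modulates the query.
to_gate = nn.Conv2d(c * c + 1, 1, kernel_size=1)
gate = torch.sigmoid(to_gate(cost_volume))               # (b, 1, h, w)
modulation = gate * int_act
```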

The final step of the InT+TransT is “top-down” feedback from the TransT back to the InT. This was done to encourage the two modules to align their object tracks and correct mistakes that emerged in one or the other resource Linsley2020-en . The query activity Q was convolved with a kernel, transformed with a softplus, and entered into a second InT (InT, Fig. S5b). This second InT used its own recurrent kernels, took the first InT’s recurrent excitatory unit activity as its input, and used the aforementioned transformation of Q as its excitatory units. Its inhibitory hidden states were initialized by a kernel convolved with the transformed Q. We evaluated our InT+TransT on TrackingNet (published under the Apache License 2.0), LaSOT (published under the Apache License 2.0), and GOT-10K (published under CC BY-NC-SA 4.0). See Table S1 for a full comparison between our InT+TransT and other state-of-the-art models.
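The feedback wiring can be sketched schematically as follows; dimensionalities are assumptions, and a single convolution stands in for the full recurrent InT update defined in the main text:

```python
import torch
import torch.nn as nn

c = 32                                    # hypothetical InT width

# Feedback path: the TransT query is projected down to the InT
# dimensionality and passed through a softplus...
q = torch.randn(1, 256, 16, 16)           # TransT query (assumed 256-dim)
project = nn.Conv2d(256, c, kernel_size=1)
feedback = nn.Softplus()(project(q))

# ...then serves as the excitatory state of a second InT, whose input is the
# first InT's excitatory activity. A conv stands in for the recurrent update.
first_int_excitatory = torch.randn(1, c, 16, 16)
recurrent_update = nn.Conv2d(c, c, kernel_size=3, padding=1)
excitatory = feedback + recurrent_update(first_int_excitatory)
```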

Object tracking training and evaluation

The InT+TransT is trained with the same procedure as the original TransT, except that its InTs are given the intervening frames between the target and search images, as described in the main text. Otherwise, we refer the reader to training details in the TransT paper Wang2021-no . Evaluation was identical to the TransT, including the use of temporal smoothing for postprocessing (“Online Tracking”). As was the case for TransT, this involved interpolating the TransT bounding box predictions with a Hanning window that penalized predictions on the current step which greatly diverged from previous steps. See Wang2021-no for details.
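The Hanning-window postprocessing can be sketched as follows; the score-map size and blending weight here are assumptions (see Wang2021-no for the actual values), but the structure of the penalty is as described: candidate scores are blended with a window so predictions far from the previous target location are suppressed.

```python
import numpy as np

# Hypothetical 32x32 score map flattened over spatial locations.
side = 32
num_candidates = side * side
window_influence = 0.49              # blending weight (an assumption here)

scores = np.random.rand(num_candidates)          # raw per-location confidences
hanning = np.outer(np.hanning(side), np.hanning(side)).ravel()

# Blend raw scores with the window centered on the previous prediction;
# the selected bounding box maximizes the penalized score.
penalized = scores * (1 - window_influence) + hanning * window_influence
best = int(np.argmax(penalized))
```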

Method Source LaSOT (AUC / P_norm / P) TrackingNet (AUC / P_norm / P) GOT-10K (AO / SR_0.5 / SR_0.75)
InT+TransT Ours 65.0 74.0 69.3 81.94 87.48 80.94 72.2 82.2 68.2
TransT CVPR2021 64.9 73.8 69.0 81.4 86.7 80.3 72.3 82.4 68.2
TransT-GOT CVPR2021 - - - - - - 67.1 76.8 60.9
SiamR-CNN CVPR2020 64.8 72.2 - 81.2 85.4 80.0 64.9 72.8 59.7
Ocean ECCV2020 56.0 65.1 56.6 - - - 61.1 72.1 47.3
KYS ECCV2020 55.4 63.3 - 74.0 80.0 68.8 63.6 75.1 51.5
DCFST ECCV2020 - - - 75.2 80.9 70.0 63.8 75.3 49.8
SiamFC++ AAAI2020 54.4 62.3 54.7 75.4 80.0 70.5 59.5 69.5 47.9
PrDiMP CVPR2020 59.8 68.8 60.8 75.8 81.6 70.4 63.4 73.8 54.3
CGACD CVPR2020 51.8 62.6 - 71.1 80.0 69.3 - - -
SiamAttn CVPR2020 56.0 64.8 - 75.2 81.7 - - - -
MAML CVPR2020 52.3 - - 75.7 82.2 72.5 - - -
D3S CVPR2020 - - - 72.8 76.8 66.4 59.7 67.6 46.2
SiamCAR CVPR2020 50.7 60.0 51.0 - - - 56.9 67.0 41.5
SiamBAN CVPR2020 51.4 59.8 52.1 - - - - - -
DiMP ICCV2019 56.9 65.0 56.7 74.0 80.1 68.7 61.1 71.7 49.2
SiamRPN++ CVPR2019 49.6 56.9 49.1 73.3 80.0 69.4 51.7 61.6 32.5
ATOM CVPR2019 51.5 57.6 50.5 70.3 77.1 64.8 55.6 63.4 40.2
ECO ICCV2017 32.4 33.8 30.1 55.4 61.8 49.2 31.6 30.9 11.1
MDNet CVPR2016 39.7 46.0 37.3 60.6 70.5 56.5 29.9 30.3 9.9
SiamFC ECCVW2016 33.6 42.0 33.9 57.1 66.3 53.3 34.8 35.3 9.8
Table S1: Object tracking results on the LaSOT Fan2018-sb , TrackingNet Muller2018-qn , and GOT-10K Huang2018-yt benchmarks. First place is in red and second place is in blue. Our InT+TransT model beats all others except on two GOT-10K metrics.