
Detection and Description of Change in Visual Streams

This paper presents a framework for the analysis of changes in visual streams: ordered sequences of images, possibly separated by significant time gaps. We propose a new approach to incorporating unlabeled data into training to generate natural language descriptions of change. We also develop a framework for estimating the time of change in a visual stream. We use learned representations for change evidence and consistency of perceived change, and combine these in a regularized graph-cut-based change detector. Experimental evaluation on visual stream datasets, which we release as part of our contribution, shows that representation learning driven by natural language descriptions significantly improves change detection accuracy, compared to methods that do not rely on language.





1 Introduction

Visual streams depict changes in a scene over time, and this paper describes a learning-based approach to describing those changes in a manner that (a) improves the detection of changes beyond what is possible based solely on image analysis without change descriptions, (b) incorporates unlabeled data to yield strong performance even with a limited number of labeled samples, and (c) is robust to nuisance changes such as lighting or perspective shifts.

In general, visual streams (illustrated in Figure 1) consist of time series of images of a scene, such as time lapse imagery or a sequence of satellite images. Given such visual streams, our goal is to detect and describe “relevant” changes while ignoring nuisance changes such as lighting, perspective, or seasonal changes. This change detection and description task is relevant to intelligence, insurance, urban planning, and natural disaster relief.

Figure 1: A subsequence from “Street Change,” a proposed dataset for change detection and captioning for visual streams. The dotted line denotes the location of a changepoint, where at least one non-distractor element of the scene has changed. Some human captions for the change across the changepoint are: “A road sign is gone.” “The orange construction sign was removed.” “The safety sign is not in the road.”

Our contributions to visual stream change detection and captioning are:

  1. We formulate the joint change detection and description problem and develop two new datasets, one synthetic and one real, of visual streams with known changepoints and change descriptions in natural language.

  2. We propose a semi-supervised framework for describing changes between images that leverages unlabeled image pairs to improve performance, an important feature given the cost of human labeling of visual streams.

  3. We propose and analyze an approach to change detection within a visual stream. It is based on a regularized graph cut cost that also incorporates the consistency of perceived change for image pairs straddling the change point. The representation used to assess such consistency emerges from learning to describe change with language. We show that this approach is superior to baselines that do not incorporate language in learning.

The efficacy of including our language model in the change detection process shows that access to language change descriptions at training time may provide a richer form of supervision than training using the times of changes alone. This is a new, previously unexplored avenue for leveraging the interaction between vision and language.

2 Related work

While automatic image captioning has received growing attention in the computer vision community in the past several years [xu2015show, johnson2016densecap, fang2015captions, ling2017teaching], such tools are inadequate for detection of changes in a visual stream. Describing changes in a scene is fundamentally different from recognizing objects in an image. Unlike in classical automatic image captioning, a visual stream may contain no relevant changes and it may contain multiple objects that do not change. Furthermore, video captioning methods [venugopalan2015sequence, pan2016hierarchical, wang2017video] are inapplicable because in many visual streams there are relatively large periods of time between subsequent images; for instance, a satellite may pass over a location only once per day. Finally, the images of interest may be corrupted by blur, shifts, and varying environmental conditions that make identifying changes challenging.

2.1 Automatic description of the visual world

Visual descriptions. The task of describing the content of an image or video in natural language, often referred to as image/video captioning, has emerged as one of the central tasks in computer vision in recent years [vinyals2015show, donahue2015long, johnson2016densecap, venugopalan2015sequence, shetty2016frame, zhou2018end, krishna2017dense]. Typically, captioning is approached as a translation task, mapping an input image to a sequence of words. In most existing work, learning is fully supervised, using existing data sets [young2014image, lin2014microsoft, krishna2017visual, xu2016msr] of “natural images/videos” with associated descriptions. However, when faced with drastically different visual content, e.g., remote sensing or surveillance videos, and a different type of description, in our case describing change, models trained on natural data lack both the visual “understanding” of the input and an appropriate vocabulary for the output. We address this limitation by leveraging novel semi-supervised learning methods, requiring only a small amount of labeled data.

Automatic description of change. Change is implicitly handled in methods for captioning video [pan2016hierarchical, wang2017video, pan2017video, baraldi2017hierarchical]. However, when describing change in the videos, these methods tend to focus on describing the “updated scene” rather than the nature of the change between the scenes.

The work most closely related to ours is [jhamtani2018learning, park2019robust, oluwasanmi2019fully], which also attempts to generate captions for non-trivial changes in images. [jhamtani2018learning] proposes the Spot-the-Diff dataset, where pairs of images are extracted from surveillance videos. The viewpoints are fixed and there is always some change. [park2019robust] proposes CLEVR-Change, a dataset where pairs of synthesized images are provided with their synthesized captions. Lighting and viewpoint can vary between image pairs. We extend [park2019robust] in two ways. First, we consider a series of images instead of two, and a change can either happen at any timestep, or not at all. Second, we consider a semi-supervised setting where only a subset of image series has associated captions. Moreover, we report results not only on CLEVR-Change and Spot-the-Diff, but also on a newly collected street-view dataset.

Semi/unsupervised captioning. The goal of unsupervised/unpaired captioning is to learn to caption from a set of images and a corpus, but with no mapping between the two. Usually, some pivoting is needed: [gu2018unpaired, gu2019unpaired] tackle unpaired captioning with a third modality, Chinese captions and scene graphs respectively. Similarly, [feng2019unsupervised, laina2019towards] use visual concept detectors trained on external data to bridge the unpaired images and captions.

In the semi-supervised setting, extra unlabeled data supplement paired image caption data. [liu2018show] uses a self-retrieval module to mine negative samples from unlabeled images to improve captioning performance. [kim2019image] uses a GAN to generate pseudo labels, making extra unpaired images and captions paired. [anne2016deep, venugopalan2017captioning] exploit additional unpaired text and image for novel object captioning.

Our work is most similar to [liu2018show] but with several important differences. First, we use a real-fake discriminator instead of a retrieval-based discriminator. Second, the datasets/tasks are different. Our datasets are more under-annotated and out-of-domain than COCO [lin2014microsoft], a large natural image dataset that can benefit easily from pretrained vision networks. Note that, despite the similar technique, our focus in this paper is not to propose a new semi-supervised technique, but a more general setting of change description in an annotation-constrained situation.

2.2 Change point detection and localization

Change point detection and localization is a classical problem in time series analysis. Typical approaches aim to determine whether and at what time(s) the underlying generative model associated with the signal being recorded has changed [padilla2019optimal, garreau2018consistent, matteson2014nonparametric]. Such techniques are also used in video indexing, to detect a scene cut, a shot boundary, etc. [priya2010video, sze2003scene, irani1998video, smoliar1994content, hu2011survey, yu2001efficient].

In this work, we focus on finding the point at which a nontrivial change in content happens, even when nuisance changes such as misalignments or shifts between images, illumination changes, and seasonal variations dominate the visual changes. Much current work on high-dimensional change point detection (e.g., [cho2015multiple, wang2018high]) assumes that the time series has a mean that is piecewise-constant over time; this is clearly violated by real-world data, such as that shown in Fig. 9. We show that learning to describe changes can help locate changepoints.

3 Semisupervised learning of change description

Suppose we have a set of image pairs labeled with change descriptions. Some of these may correspond to no change, with descriptions like “no change” or “nothing changed”. An additional set of unlabeled pairs is presumed to contain only pairs in which a change does occur between the first and second image, although no description of this change is available.

Figure 2: Description generator and discriminator. The generator can compute the likelihood of a given change caption for a given pair of images, as well as generate (sample; dashed arrow) a caption for that pair. The discriminator can estimate the probability that a given caption is a valid description of the change between the two images. During the final phase of training, the generator receives feedback, illustrated in red, from two sources: ground truth captions, which should have high likelihood under the generator, and the discriminator, which outputs a probability that the captions sampled from the generator are valid for the pair.

Learning to generate descriptions proceeds in three phases:

  • Phase 1 (Sec. 3.1): We use labeled data to pretrain a description generator, Fig. 2 (left). As a conditional generative model, the generator allows us to calculate the likelihood of a change description given an image pair, as well as to sample (estimate) a caption for a given image pair. The former capacity is used only during training. As part of its operation, the generator computes a hidden state.

  • Phase 2 (Sec. 3.2): We use both labeled and unlabeled data to train a discriminator, Fig. 2 (center). For two images and a caption, this discriminator computes a “compatibility score” indicating whether the caption is a valid description of the change between the two images.

  • Phase 3 (Sec. 3.3): We fine-tune the description generator using both labeled and unlabeled data, leveraging the discriminator, Fig. 2 (right).

Before describing these training phases in detail, we outline the architecture choices for the generator and the discriminator. The generator uses the DUDA architecture introduced in [park2019robust], which can be trained to be robust to lighting and pose changes. Briefly, it extracts visual features from the two images using a ResNet [he2016deep], and distills them via spatial attention maps into a triplet of visual feature vectors, associated with each of the two images and with the difference between the two. Finally, these and the caption are fed into an LSTM-based recurrent network with temporal attention states used to combine the visual features at each time step.

The discriminator shares the components of DUDA with the generator. However, instead of using the LSTM model to estimate the probability of the word sequence given the image pair, here the LSTM model just consumes the words in the caption, updating the hidden state. The final state is fed to a multi-layer fully connected network, and finally to a linear binary classifier. Its output is an estimate of the probability that the caption is a valid description of the change from the first image to the second.

3.1 Phase 1: Training description generator

As in [park2019robust], training of the description generator is done with a regularized cross-entropy loss. The cross-entropy term is evaluated on the ground truth change description for each labeled pair, and the regularization terms involve the spatial maps computed inside DUDA for that pair and the entropy of the sequence of temporal attention states computed inside DUDA. For further details, refer to [park2019robust].

3.2 Phase 2: Learning to recognize valid change captions

We train a discriminator using both labeled and unlabeled image pairs to minimize the loss

[positives from labeled pairs] (2)
[negatives from labeled pairs] (3)
[negatives across labeled/unlabeled] (4)

The sum in (2) is over labeled pairs with known change annotations; these are the positive examples for the discriminator. Note that these may include pairs with no change, annotated by captions like “no change”. The second sum in (3) is over examples obtained by matching one pair with the annotation for another pair. Here we assume that with high probability such a randomly matched annotation will not in fact be valid for the given pair, and so these are negative examples for the discriminator.

Finally, we turn to the unlabeled pairs. The third sum in (4) matches annotations from labeled data to unlabeled pairs, which we assume will not create valid descriptions. This procedure creates additional negative examples for the discriminator.

We include negative examples from labeled images for two reasons. The first is to ensure the network is trained using negative examples even in the fully-supervised case. The second is to prevent overfitting, where the discriminator learns simply to output 0 for all pairs in the unlabeled set, and 1 for all pairs in the labeled set.
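The construction of positive and negative examples described above can be sketched in code. This is a minimal illustration under stated assumptions, not our training code: the function names `build_examples` and `binary_cross_entropy` are hypothetical, image pairs are stood in for by opaque identifiers, mismatched captions are drawn uniformly at random, and binary cross-entropy is assumed as the classification loss because the discriminator outputs a probability.

```python
import math
import random

def build_examples(labeled, unlabeled, seed=0):
    """Return (pair, caption, target) triples for discriminator training.

    labeled:   list of (pair, caption) tuples; needs at least two entries
    unlabeled: list of pairs without captions (each assumed to contain a change)
    """
    rng = random.Random(seed)
    examples = []
    # Term (2): ground truth captions are positive examples.
    for pair, caption in labeled:
        examples.append((pair, caption, 1))
    # Term (3): a caption borrowed from a different labeled pair is a negative.
    for i, (pair, _) in enumerate(labeled):
        j = rng.choice([k for k in range(len(labeled)) if k != i])
        examples.append((pair, labeled[j][1], 0))
    # Term (4): a labeled caption matched to an unlabeled pair is a negative.
    for pair in unlabeled:
        _, caption = rng.choice(labeled)
        examples.append((pair, caption, 0))
    return examples

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy (an assumed choice of loss)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)
```

Every labeled pair contributes one positive and one negative, so the discriminator sees negatives even when no unlabeled data is available.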

3.3 Phase 3: Updating generator using discriminator

With the discriminator in hand, we can refine the generator using both labeled and unlabeled data. We train the generator to produce change captions that the discriminator labels as “valid,” while still staying close to the ground truth data. To ensure that the generator produces captions that resemble true labels, we simultaneously maximize the likelihood of the ground truth captions and the outputs of the discriminator for captions sampled from the generator. The loss combines these two terms.

Since sampling captions from the generator to pass to the discriminator is not differentiable, we treat training with the discriminator term as a reinforcement learning problem, which we solve using REINFORCE [williams1992simple]. This approach has been used elsewhere for learning in sequence generation [luo-discriminability-2018, hendricks2016generating].

In this case, the policy is given by the generator, parametrized by its network parameters, which predicts a distribution over captions. We wish to maximize a reward that is high for valid change captions paired with their associated images. The gradient of the expected reward is approximated by the reward of a caption sampled from the generator, multiplied by the gradient of that caption's log-likelihood with respect to the generator parameters. In our setting, the reward is given by the output of the discriminator. This permits the gradient to be approximated at training points, and allows unlabeled data to be used while training the generator, so that as much data as possible is incorporated into the training procedure.
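To illustrate the REINFORCE estimate, the following toy sketch replaces the caption generator with a softmax policy over a small, fixed set of candidate captions. Everything here is an assumption made for illustration: the names `softmax`, `reinforce_gradient`, and `estimate_gradient` are hypothetical, the "parameters" are simply the logits, and the reward function stands in for the discriminator output.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, sampled, reward):
    """Single-sample estimate: reward * d/d(logits) log p(sampled).

    For a softmax policy, d/d(logit_k) log p(c) = 1[k == c] - p_k.
    """
    probs = softmax(logits)
    return [reward * ((1.0 if k == sampled else 0.0) - p)
            for k, p in enumerate(probs)]

def estimate_gradient(logits, reward_fn, num_samples=1000, seed=0):
    """Average the single-sample estimates over captions drawn from the policy."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        c = rng.choices(range(len(logits)), weights=probs)[0]
        g = reinforce_gradient(logits, c, reward_fn(c))
        grad = [a + b / num_samples for a, b in zip(grad, g)]
    return grad
```

The averaged estimate pushes probability toward captions with high reward, without requiring gradients to flow through the sampling step.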

Figure 3: Mining of a visual stream from CLEVR-Sequence (left), with known change point and change annotation, to create labeled image pairs (right). In the example shown, the change annotation is “A yellow cylinder appeared by the purple block.” and the remaining mined pairs are labeled “no change”.

Although our use of feedback from a learned discriminator as a training signal for a generator resembles GAN training, GAN training typically optimizes a minimax objective, where the discriminator and generator compete over the course of training. We do not update the discriminator in an adversarial fashion, instead freezing it during the final phase of training.

3.4 Training with visual streams

The above relies on pairs of images. More generally, we may have access to a training set of visual streams, which consist of ordered image sequences, associated times at which a change occurs, and (possibly) natural language descriptions of the changes. Such sequences may be used to generate training data as follows. Consider an annotated stream. Every pair of frames that straddles the change point yields a pair annotated with the change description, while every pair of frames that both precede the change point yields a pair annotated with “no change”, and similarly for every pair of frames that both follow it. See Figure 3.

If the stream is not annotated with a change description, we can still get the pairs annotated with “no change” for the frame indices that do not straddle the change point. However, for the frame indices that do straddle it, we can only get unlabeled pairs.
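The mining procedure can be sketched as follows. This is an illustration under assumptions: the function name `mine_pairs` is hypothetical, frames are referred to by zero-based index, the change is taken to occur between frames `change_index` and `change_index + 1`, and all ordered pairs of frames are mined (the subset of pairs actually used may be smaller).

```python
from itertools import combinations

NO_CHANGE = "no change"

def mine_pairs(num_frames, change_index, caption=None):
    """Split all frame pairs (i, j), i < j, into labeled and unlabeled pairs.

    change_index: the relevant change occurs between frames
                  change_index and change_index + 1 (assumed convention)
    caption:      change description for the stream, or None if unannotated
    """
    labeled, unlabeled = [], []
    for i, j in combinations(range(num_frames), 2):
        straddles = i <= change_index < j
        if not straddles:
            # Both frames are on the same side of the change point.
            labeled.append((i, j, NO_CHANGE))
        elif caption is not None:
            labeled.append((i, j, caption))
        else:
            # A change is known to occur, but no description is available.
            unlabeled.append((i, j))
    return labeled, unlabeled
```

For example, a four-frame stream with the change between frames 1 and 2 yields two “no change” pairs and four pairs that straddle the change.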

4 Change Detection

The system described in Section 3 can be used not only to caption changes in visual streams, but also to detect and localize the occurrence of changes. In this section, we describe how to use the learned modules from Section 3, which operate in a pairwise manner, to identify changepoints in image sequences.

More formally, assume we are given a visual stream, represented as an ordered sequence of images. Nuisance changes, such as viewpoint, lighting, or weather variation, are observed between each sequential pair of frames of the stream. There is some unknown change point at which a relevant (non-nuisance) change occurs between two consecutive frames, after which only nuisance changes occur between frames. Our goal in change detection is to estimate this change point.

We begin by defining an abstract pairwise relation that, given two images, outputs a statistic which is high for pairs where a change has occurred between the two images, and is low otherwise.

We can now use these pairwise statistics to define different score functions for the candidate change points in a visual stream; in general, we will compute an estimate by maximizing these score functions. For any method which generates a score, the estimate is the candidate change point with the highest score.

Figure 4: Visual representation of the graph given by a time series. Selecting a changepoint (denoted by the dotted line) partitions the edges into the cut edges, in black, and the remaining edges, in green.

Step-wise scores.

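We define the step-wise score of a candidate change point as the pairwise statistic of the two consecutive frames that it separates, and the associated change point estimate as the candidate with the highest score. This approach finds the consecutive pair of frames in the stream which is most likely to contain a change between that pair. Unfortunately, if the pairwise statistic is noisy or imperfect, the step-wise approach may be prone to errors and spurious detections.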
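A minimal sketch of the step-wise estimator is given below. The names `stepwise_scores` and `stepwise_estimate` are hypothetical, and we assume a convention in which candidate `t` places the change between frames `t` and `t + 1` (zero-based), with the pairwise statistic supplied as a function of two frame indices.

```python
def stepwise_scores(stat, num_frames):
    """Score each candidate t by the statistic of the consecutive pair (t, t + 1)."""
    return [stat(t, t + 1) for t in range(num_frames - 1)]

def stepwise_estimate(stat, num_frames):
    """Return the candidate whose consecutive pair has the highest statistic."""
    scores = stepwise_scores(stat, num_frames)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because each candidate is judged by a single pair, one spuriously high statistic is enough to move the estimate, which motivates aggregating over many pairs.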

Graph cut scores.

To address the noise sensitivity of the step-wise approach, we propose a strategy based on graph cuts. Specifically, we form a graph with each node corresponding to a frame of the stream. We assign an edge weight to each pair of nodes (specifics below) and define a score function with the edge weights. Selecting a candidate change point is then equivalent to selecting a graph cut, where each edge that is cut connects a frame before the candidate change point to a frame after it. A visualization is provided in Figure 4, where a candidate results in a partitioning of the edges into the cut set, in black, and its complement, in green.

The graph-cut score we consider is computed from the pairwise statistics on the cut set and on its complement, the edges that are not cut.
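As an illustration of a graph-cut style score, the sketch below contrasts the pairwise statistic on cut edges with that on uncut edges. The specific form used here (mean statistic over cut edges minus mean statistic over the complement) is an assumption made for this example, as are the names `graph_cut_score` and `graph_cut_estimate`; the normalization and weighting of the actual score may differ. Candidate `t` again places the change between frames `t` and `t + 1`.

```python
from itertools import combinations

def _mean(values):
    return sum(values) / len(values) if values else 0.0

def graph_cut_score(stat, num_frames, t):
    """Assumed form: mean statistic on cut edges minus mean on uncut edges."""
    cut, uncut = [], []
    for i, j in combinations(range(num_frames), 2):
        # An edge is cut when its endpoints lie on opposite sides of t.
        (cut if i <= t < j else uncut).append(stat(i, j))
    return _mean(cut) - _mean(uncut)

def graph_cut_estimate(stat, num_frames):
    """Return the candidate with the highest graph-cut score."""
    return max(range(num_frames - 1),
               key=lambda t: graph_cut_score(stat, num_frames, t))
```

Since every candidate is scored using all pairs of frames, a single noisy consecutive pair has far less influence than in the step-wise score.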


Computing Change Statistics.

In order to compute the pairwise statistic, we explore ways to use the generator and discriminator introduced in Section 3, as well as a standalone, “image-only” network trained specially for the task. First, we note that by fixing the test caption to the “no change” token, we can define a language-informed statistic from the discriminator's output for that caption, which uses the discriminator's label as a measure of how poorly the “no change” description fits the pair of images.

We also define an “image-only” change detector. It learns to produce visual features using the same encoder architecture as the generator in Section 3.1. The visual features are passed through several fully-connected layers, and the output of the network is a single label, rather than a caption. The network is trained to classify whether pairs contain a change or not using a binary classification loss, where a label of 1 indicates no change. We can then define the image-only statistic as one minus the network output, so that it is an estimate of the probability that a change has occurred between the two images, obtained from a network trained without using language information. We define an “image-only” stepwise score in which the image-only statistic is used as the pairwise statistic. Similarly, we also define an “image-only” variant of the graph cut score.

Consistency-regularized graph cut scores.

The above step-wise and graph cut change detection scores depend only on the pairwise relation. However, we may have access to learned hidden representations that contain more information relevant to change detection, which may help regularize the above graph cut scores. Specifically, we regularize our score function to help ensure that all the pairs of frames that straddle the candidate change point have similar representations (since they should represent the same change). We measure the similarity between the representations of two such pairs using the cosine similarity.

This leads to a regularized “representation consistency” score, in which the consistency measure is weighted by a user-specified tuning parameter; the change point estimate is again the candidate with the highest score. The consistency measure in the final term of (10) sums over all pairs of black edges in Figure 4, finding the average similarity of the associated hidden representations.

To make the above concrete, we may choose the pairwise statistic to be the discriminator-based statistic from Equation 8, and let the representation of a pair be the final hidden state of the caption generator. In this formulation, a natural interpretation of the estimator arises: we would like to determine a changepoint such that, for all pairs of frames that straddle it, the discriminator's score for the “no change” caption is low. In addition, the representation consistency pushes us to choose a changepoint where the generator will output similar descriptions of the change that has occurred across it for all possible pairs. In this formulation, the language model becomes an integral part of the changepoint detection method.
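The consistency regularization can be sketched as follows. This is a hedged illustration: the base score reuses an assumed mean-difference form over cut and uncut edges, the consistency term is the average cosine similarity over all pairs of cut edges, and the two are combined additively with a weight `lam`. The names `cosine`, `consistency`, `rc_score`, and `rc_estimate` are hypothetical, and `rep(i, j)` stands in for the hidden representation of the frame pair.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def cut_edges(num_frames, t):
    """Edges (i, j) whose endpoints lie on opposite sides of candidate t."""
    return [(i, j) for i, j in combinations(range(num_frames), 2) if i <= t < j]

def consistency(rep, num_frames, t):
    """Average cosine similarity between representations of cut-edge pairs."""
    sims = [cosine(rep(*a), rep(*b))
            for a, b in combinations(cut_edges(num_frames, t), 2)]
    return sum(sims) / len(sims) if sims else 0.0

def rc_score(stat, rep, num_frames, t, lam):
    """Assumed form: mean-difference cut score plus lam times consistency."""
    cut = set(cut_edges(num_frames, t))
    all_edges = list(combinations(range(num_frames), 2))
    in_cut = [stat(i, j) for i, j in all_edges if (i, j) in cut]
    out_cut = [stat(i, j) for i, j in all_edges if (i, j) not in cut]
    mean = lambda v: sum(v) / len(v) if v else 0.0
    return mean(in_cut) - mean(out_cut) + lam * consistency(rep, num_frames, t)

def rc_estimate(stat, rep, num_frames, lam):
    return max(range(num_frames - 1),
               key=lambda t: rc_score(stat, rep, num_frames, t, lam))
```

In this sketch, even when the pairwise statistic is uninformative, the consistency term alone can favor the candidate at which all straddling pairs share the same representation.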

Absent a language model, we can also define an “image-only representation consistency”. To do this, in Equation 10 we utilize the image-only pairwise statistic, and for the internal representation of an image pair we use the penultimate layer of the image-only network.

5 Data

The experiments in Section 6 consider both (a) an assessment of our proposed system of change captioning and the role of unlabeled data in the performance, and (b) an evaluation of our approach when both annotated and unannotated visual streams are available for change annotation and detection. The datasets used for these experiments are described below.

5.1 Semi-supervised Change Captioning

CLEVR-Change is a simulated dataset, with a variety of distractor changes and a curated set of “relevant” changes, first presented in [park2019robust]. Each image consists of colored geometric primitives on a flat, infinite ground plane. The images are generated with the CLEVR engine [johnson2017clevr], a standard tool in a variety of visual reasoning tasks. CLEVR-Change includes, for each initial frame, a frame in which only nuisance changes have occurred, as well as a frame in which both nuisance and relevant changes have occurred. There were initially roughly 39000 change/distractor pairs in the CLEVR-Change dataset, which we have augmented with 10000 additional unlabeled pairs of images by mimicking their generation process. The additional unlabeled pairs always contain a change.

Spot-the-Diff [jhamtani2018learning] is a video surveillance change description dataset which consists of pairs of surveillance images and associated human-generated captions for the visual changes in each pair. Each pair is aligned and assumed to contain a change, with similar lighting conditions between the “before” and “after” frames, so this dataset does not contain the nuisance changes present in the CLEVR-Change dataset. Spot-the-Diff contains 13192 image pairs, which we split into 10000 training pairs and 3192 testing pairs.

5.2 Visual Streams

Due to the lack of standard datasets for our task, we propose two datasets for evaluation of captioning and changepoint detection on visual streams.

The first dataset we introduce, CLEVR-Sequence, is a modification of the CLEVR-Change synthetic dataset. By modifying the machinery in [park2019robust], we generate a visual stream instead of a pair of images. Over the course of each sequence, the camera follows a random walk from its initial location, while the light intensity also follows a random walk, to ensure that the proposed methodology must be robust to nuisance changes. At some time in the sequence, one of the non-nuisance changes used in the original CLEVR-Change dataset occurs. Each sequence contains 10 images, with the changepoints uniformly distributed over a range of frame indices (assuming zero indexing). Further details can be found in the supplementary materials. Figure 3 shows a subsequence drawn from one CLEVR-Sequence stream.

We also propose the “Street Change” dataset, which consists of 151 image sequences captured in an urban environment during different seasons and times of day, extracted from the VL-CMU dataset [badino2012real]. Each sequence contains multiple views of a particular scene that are captured at an initial date, and then a second subsequence that is captured at a later date. The original change detection dataset was curated by [alcantarilla2018street], and we have utilized human labelers on Amazon’s Mechanical Turk to provide change annotations for training. The annotation set includes at least three captions for both the original and time-reversed sequences, which were curated after data collection to ensure high-quality labels.

The “Street Change” dataset contains both minor and major nuisance changes across the dataset. First, from an annotation perspective, each frame is captured at a different viewpoint, and there are often lighting changes across the sequence. Moreover, since some sequences span seasons, environmental changes like plant growth, snowfall, and weather variation are present in the dataset, which any successful trained model must learn to ignore. An example sequence is shown in Figure 9, along with the associated human annotations. More examples are included in the supplementary materials. The average length of each stream is 9 images, the mean sentence length is 6.2 words with standard deviation 1.9, and the maximum sentence length is 16 words. There are 420 unique words in the captions.

6 Experiments

In this section we empirically demonstrate the performance of the proposed method on both caption generation and changepoint detection. We find that the proposed training regime improves on the state of the art for change captioning, as well as an adapted semi-supervised approach from standard image captioning.

We find that the regularized cut (RC) approach, which leverages both the change caption discriminator and the generative language model, outperforms our other proposed changepoint detection methods. While adding the representation consistency term also improves “Image-Only” approaches, we show that incorporating both the discriminator and the generator results in good performance on language metrics for change captioning, as well as superior change detection performance.

6.1 Implementation Details

The generator is trained as in Phase 1 for 40 epochs using Adam [adam], and the discriminator is also trained for 40 epochs using Adam. Phase 3 of training proceeds for 20 epochs. The learning rate begins at 0.001 and decays by a factor of 0.9 every 5 epochs. The loss weighting parameter is set to 0.2. The image-only discriminator is trained for 40 epochs using Adam with a learning rate of 0.001.

The generator is identical in architecture to the network used in [park2019robust] for all three training sets. The three architectures differ only in the dimension of the final output of the generator, which is 64, 2404, and 420 for CLEVR, Spot-the-Diff, and Street Change respectively. At test time, captions are sampled greedily.
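For concreteness, the learning rate schedule can be written as a short function. This sketch assumes a staircase interpretation, in which the rate is held constant within each block of 5 epochs and multiplied by 0.9 at each block boundary; the function name `learning_rate` and the zero-based epoch index are illustrative choices.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.9, step=5):
    """Staircase decay: multiply by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this interpretation, epochs 0 through 4 use a rate of 0.001, and epochs 5 through 9 use 0.0009.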

Adapting Show, Tell, and Discriminate to change captioning.

An alternative approach to recognizing valid descriptions is proposed in [liu2018show] for single-image captioning. Their approach, called Show, Tell, and Discriminate (ST&D), embeds the descriptions and images in a common space, and then measures the cosine similarity between the embeddings. The embedding module is trained on labeled data with a hard triplet loss that uses a margin parameter. Training in stage three proceeds similarly to our approach, with the embedding module providing a form of discriminative reward to the generator.

We have modified the ST&D approach to perform change captioning, so the embedding module operates on pairs of images rather than single images. In Section 6 we compare the performance of the modified ST&D to our approach. The learned joint embedding space is dimension 2048, and the margin used for the triplet loss is 0.1. We follow the authors’ lead and set the weighting of the self-retrieval module in stage three to be 1.

Labeled        1000   1000   1000   2000   2000   2000  10000  10000  10000  20000  20000  20000
Unlabeled        1k    10k    30k     1k    10k    30k     1k    10k    30k     1k    10k    30k
Ours   C       57.1   44.8   42.7   58.1   60.2   61.4   93.7   98.5  103.3  100.2  100.5  110.0
       B       30.8   28.5   26.9   34.2   34.6   34.7   49.3   51.5   51.6   54.5   54.8   56.4
       M       32.2   30.1   29.5   32.4   32.6   33.2   39.3   39.3   39.5   39.9   41.1   42.9
       R       69.2   65.6   65.1   70.2   70.3   70.9   78.8   79.8   80.8   81.2   82.4   83.5
ST&D   C       42.4   42.7   43.6   55.7   58.2   60.3   92.8   93.9  102.0   98.4  105.1  106.4
       B       27.0   26.3   27.1   32.4   46.9   47.2   48.7   51.8   52.0   55.3   55.3   57.3
       M       29.4   29.8   29.6   31.5   32.8   33.0   38.3   38.7   40.7   38.9   40.9   42.2
       R       65.4   65.8   65.6   65.8   69.8   70.0   78.9   79.1   81.5   79.3   82.0   83.3
Table 1: Semi-supervised scaling with respect to labeled and unlabeled training set size for CLEVR-Change using our method and Show, Tell, and Discriminate. We report CIDEr (C), BLEU-4 (B), METEOR (M), and ROUGE-L (R) (scaled).

Left: Spot-the-Diff
Labeled        2000   2000   2000   2000   5000   5000   5000   8000   8000   9000
Unlabeled        1k     2k     5k     8k     1k     2k     5k     1k     2k     1k
Ours   C       11.1   11.4   12.5   13.2   21.7   21.8   23.1   30.4   30.8   33.3
       B        5.2   14.4   13.3   14.6   10.5   11.7   15.2    9.9   12.8   16.1
       M       19.6   22.1   22.7   23.1   22.2   22.1   23.5   22.9   23.0   24.0
       R       39.1   47.7   48.6   50.3   46.2   47.5   50.4   48.5   49.3   51.4
ST&D   C       10.2    9.9   10.5   11.4   18.5   19.6   21.3   29.0   30.1   30.8
       B        3.8    8.4   10.1    9.6   11.2   12.9   13.7   10.6   11.5   13.0
       M       16.9   16.6   17.1   17.1   20.4   21.2   22.3   22.5   22.9   23.5
       R       38.9   38.7   40.6   41.1   42.4   43.1   47.7   47.0   47.1   47.7

Right: Street Change
Labeled (L)      30    100    130
Unlabeled (U)   100     30      0
Ours   C       86.6  116.0  119.8
       B       30.5   41.7   42.6
       M       33.8   40.5   39.9
       R       66.6   75.9   73.5
ST&D   C       64.4  105.2  112.7
       B       26.7   38.9   38.6
       M       32.0   39.3   38.8
       R       64.5   72.6   73.1
Table 2: Semi-supervised scaling with respect to labeled and unlabeled training set size for Spot-the-Diff (left) and Street Change (right) on a variety of language metrics (scaled) using our method and Show, Tell, and Discriminate. Note that training set size for Street Change is in terms of number of labeled sequences, instead of image pairs.


For language tasks, we report BLEU-4 [papineni2002bleu], CIDEr [vedantam2015cider], ROUGE-L [lin2004rouge], and METEOR [denkowski2014meteor]. To evaluate the quality of change detection methods, we generate precision-recall curves; we report full precision-recall curves in Figure 5, and summarize them with Average Precision in Table 3.

Semisupervised Change Captioning

We begin by comparing our semisupervised approach to Show, Tell, and Discriminate (ST&D), a method for semisupervised single-image captioning, which does not utilize unlabeled data to train the network that provides feedback to the generator. Table 1 explores a range of training set sizes, varying both the number of labeled image pairs and the number of unlabeled image pairs. Except in the case of extremely scarce labeled data, we see consistent improvements from adding unlabeled data; our approach appears to use additional unlabeled data more effectively, overtaking the adapted ST&D approach.

Table 2 compares captioning performance with respect to different labeled and unlabeled training set sizes for both Spot-the-Diff (left) and Street Change (right). Note the training set sizes for Street Change are in terms of sequences.

Change Detection.

In Figures 5(a) and 5(b), we illustrate the performance of all changepoint methods introduced on both CLEVR-Sequence and Street Change. For all methods with the “Image Only” suffix “-IO,” scores are computed by an image-only model, and the hidden representations are taken from that model's penultimate layer. For methods without the “-IO” suffix, scores are computed using the learned change caption discriminator, and the hidden state of the caption generator serves as the hidden representation.

Figure 5: Precision-recall curve for (a) CLEVR-Sequence, and (b) Street Change. Methods compared are Stepwise with Language (Step), Stepwise with Images Only (Step-IO), Regularized Cut with , Regularized Cut with Images Only with (RC-IO ), Graph Cut (GC), Graph Cut with Images Only (GC-IO), Regularized Cut (RC), and Regularized Cut with Images Only (RC-IO).
Figure 6: Relationship between amount of labeled and unlabeled data in the training set and change detection performance for CLEVR-Sequence and Street Change.

To choose in , a held-out set of 20 sequences from CLEVR-Sequence was used. We use in both CLEVR-Sequence and Street Change tests. We also explore setting to isolate the performance of the representation consistency loss as a standalone method.

Figure 6 illustrates the effect of adding both labeled and unlabeled data to the training set of and . We observe a consistent improvement from image-only methods to their counterparts that use a language-informed . Graph-cut and representation consistency terms both improve detection accuracy compared to stepwise approaches; combining the two leads to further improvements.

In Table 3, we explore the effect of loosening the requirements for a “correct” change detection. In these cases, we vary a window size parameter , so that a detection is called correct if the absolute error is smaller than . is the default for all other tables and figures.
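The windowed notion of correctness can be sketched in a few lines of Python. This sketch assumes, following the Table 3 caption, that the window is the maximum allowed distance from the true changepoint, so a window of 0 demands an exact match. The function names and toy predictions are hypothetical.

```python
from typing import Sequence


def is_correct(predicted: int, true_changepoint: int, window: int = 0) -> bool:
    """A detection counts as correct if it lies within `window` frames
    of the true changepoint (assumed inclusive)."""
    return abs(predicted - true_changepoint) <= window


def windowed_accuracy(predictions: Sequence[int], truths: Sequence[int],
                      window: int = 0) -> float:
    """Fraction of sequences whose predicted changepoint is within the window."""
    hits = sum(is_correct(p, t, window) for p, t in zip(predictions, truths))
    return hits / len(truths)


# Toy example (hypothetical values): three sequences, 0-indexed changepoints.
predictions = [4, 6, 2]
truths = [4, 5, 5]
for window in range(5):
    print(window, windowed_accuracy(predictions, truths, window))
```

Widening the window can only keep or increase the number of detections counted as correct, which matches the non-decreasing rows of Table 3.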

CLEVR-Sequence Street Change
Window Size 0 1 2 3 4 mAP 0 1 2 3 4 mAP
Step-IO 17.2 21.8 33.0 37.6 65.8 35.1 35.7 43.1 46.3 46.3 48.4 44.0
Step 20.8 24.9 37.1 40.5 62.3 37.1 23.9 35.6 37.7 38.9 39.0 35.0
GC-IO 22.0 27.3 36.4 42.4 49.2 35.5 27.1 34.0 35.0 35.1 35.5 33.3
GC 58.4 63.9 69.0 71.1 73.2 67.1 19.8 27.1 28.8 29.9 30.8 27.3
RC-IO 24.8 31.8 39.4 45.9 51.3 38.6 35.3 36.1 36.2 36.2 36.2 36.0
RC 68.8 72.2 73.1 74.4 76.6 73.1 52.2 52.8 52.9 52.9 52.9 52.7
RC-IO 33.1 38.5 42.9 48.9 59.5 44.6 47.1 48.3 48.3 48.3 48.4 48.1
RC 79.5 84.0 89.8 91.3 92.9 87.5 72.1 73.2 73.2 73.2 73.2 73.0
Table 3: Average Precision (AP) for CLEVR-Sequence and Street Change with differing window sizes. Window size represents the maximum distance from the true changepoint for a point to be counted as “correct.” mAP is the mean AP over all window sizes. To generate Average Precision (AP) values, we average the precision of a method for recall=.
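As a worked check of the mAP column, the short Python snippet below averages the five per-window AP values from the final RC row of Table 3 for CLEVR-Sequence. The helper name `mean_ap` is introduced only for this example.

```python
def mean_ap(ap_by_window):
    """Mean of the AP values obtained at each window size."""
    return sum(ap_by_window) / len(ap_by_window)


# AP at window sizes 0..4 for the final RC row of Table 3 (CLEVR-Sequence).
rc_clevr_sequence = [79.5, 84.0, 89.8, 91.3, 92.9]
print(round(mean_ap(rc_clevr_sequence), 1))  # prints 87.5
```

The result agrees with the mAP entry of 87.5 reported for that row.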

7 Conclusions

In this work, we explore natural language captioning and detection of change in visual streams, illustrating that learning to generate captions to describe change also enhances our ability to accurately detect visual changes over streams. While natural language labels have a strong positive impact on change detection performance, we recognize that labeled training data is often difficult and expensive to obtain. With this challenge in mind, we develop a semi-supervised training regimen that fully leverages unlabeled data. A broad array of experiments on newly developed datasets consisting of visual streams with natural language labels helps quantify the performance gains associated with different amounts of unlabeled data on both captioning and detection.


Supplementary Material: Detection and Description of Change in Visual Streams

A CLEVR-Sequence

CLEVR-Sequence is an extension of the CLEVR-Change dataset of [park2019robust]. The CLEVR-Change dataset consists of triplets of images along with annotations: one “original” image, a “change” image, and a “distractor” image.

CLEVR-Sequence, by contrast, consists of sequences of ten images. The changes between all but one pair in the sequence are “distractor” changes, in which the camera moves and the lighting changes slightly. For one pair, there is, in addition to the distractor change, a non-distractor change, in which one of the possible changes from [park2019robust] occurs with uniform probability over the change types. For clarity, these are: Color (an object changes color), Texture (the material of a single object changes), Add (a new object appears), Drop (a previously-present object disappears), and Move (a single object moves in the scene).

We generate 400 sequences for each candidate changepoint in the range , assuming 0 indexing. We also generate 400 “distractor” sequences in which no change occurs at any point.

We attempt to imitate the characteristics of distractor changes as outlined by the authors of [park2019robust] as closely as possible. Annotations are produced automatically for each sequence using the same methodology as the CLEVR-Change dataset.
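The dataset layout described above can be sketched as a list of per-sequence specifications. The Python example below is illustrative only: it produces metadata rather than rendered images, the `SequenceSpec` and `build_specs` names are hypothetical, and the candidate changepoint range is passed in as a parameter because this sketch does not assume particular values for it.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional

# The five non-distractor change types listed above.
CHANGE_TYPES = ["color", "texture", "add", "drop", "move"]
SEQUENCE_LENGTH = 10  # each CLEVR-Sequence stream has ten images


@dataclass
class SequenceSpec:
    changepoint: Optional[int]   # 0-indexed position of the change; None if no change
    change_type: Optional[str]   # one of CHANGE_TYPES; None if no change


def build_specs(candidate_changepoints: Iterable[int],
                per_changepoint: int = 400,
                num_distractor: int = 400,
                seed: int = 0) -> List[SequenceSpec]:
    """Enumerate sequence specifications: a fixed number per candidate
    changepoint, plus distractor-only sequences with no change."""
    rng = random.Random(seed)
    specs = []
    for cp in candidate_changepoints:
        for _ in range(per_changepoint):
            # Change type is drawn uniformly over the five types.
            specs.append(SequenceSpec(cp, rng.choice(CHANGE_TYPES)))
    specs.extend(SequenceSpec(None, None) for _ in range(num_distractor))
    return specs


# Hypothetical small run; the changepoint range here is a placeholder.
demo = build_specs(candidate_changepoints=range(2, 5),
                   per_changepoint=3, num_distractor=2)
print(len(demo), demo[0])
```

Every pair of consecutive frames would additionally receive a distractor change (camera and lighting shifts) at render time, which this metadata sketch does not model.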

B Street Change

The images in the Street Change dataset were curated by [alcantarilla2018street], and are grouped into 151 unique image sequences of variable lengths. The minimum sequence length is 2 (this length-2 sequence was omitted from training and testing), and the maximum is 42. As mentioned in the main body, the average sequence length is approximately 9.

In Street Change, the change point is always the center frame of the sequence. For methods which learn by leveraging the entire sequence simultaneously, this may be a problematic bias, but since we learn and infer using only pairwise comparisons, this bias is less concerning.

Captions were gathered using Amazon’s Mechanical Turk. Annotators were presented with a pair of images and asked to describe in a single brief sentence the major changes between them, ignoring nuisance changes. Annotators were given guidance on what to consider nuisance changes, like weather, lighting, time of day, and camera angle. Annotations for both the original sequences and time-reversed sequences were gathered, as a form of data augmentation.

The gathered annotations were curated by hand to remove descriptions of nuisance changes, ensure consistent spelling, ensure each annotation was a single sentence, and confirm accuracy of the annotations. We collected six annotations for each sequence (and six more for the time-reversed sequence), to ensure that there would be at least three acceptable annotations for each sequence.

Figure 7: A sample sequence from “Street Change.” Ground Truth Annotations: “Construction signs are gone from the street,” “the road sign is gone,” “the construction sign is gone,” “the construction signs are gone.” Generated Annotations: “the street sign is missing,” “the construction work is done,” “the signs were placed.” Note the visual distortions and occlusion in the initial half of the sequence, which are present in several other sequences in the dataset.

B.1 Example Street Change Sequences

Here we illustrate some sample full sequences from the Street Change Dataset, along with the associated ground-truth annotations, and the annotations produced by our method with 100 labeled training sequences and 30 unlabeled training sequences. The displayed images are drawn from the test sequences. To generate multiple captions, different pairs in the sequence were sampled; the language model is sampled greedily for all pairs.

All images best viewed digitally.

Figure 8: A sample sequence from “Street Change.” Ground Truth Annotations: “The trash is gone,” “the garbage can is gone,” “the bush is no longer there,” “the garbage can has been removed,” “the yard now has grass.” Generated Annotations: “The garbage can was removed,” “the trash can is gone,” “garbage is gone.” Shorter sequences tend to contain larger viewpoint changes between frames.
Figure 9: A sample sequence from “Street Change.” Ground Truth Annotations: “There is no more sign on sidewalk,” “the wooden barrier is gone,” “the barricade is gone,” “the wooden barricade on the sidewalk disappeared,” “the construction barrier is gone,” “there is no longer a wooden barrier on the sidewalk.” Generated Annotations: “the saw horse is gone,” “the construction barricade is gone,” “the construction barrier on the sidewalk is no longer there.” While visually finding the changepoint is straightforward in some sequences, there are many examples like this sequence, where the visual distinction between the first and second halves of the sequence is subtle.