Outdoor images often contain sufficient visual information to understand geographic information about the scene, such as where the image was captured. Developing effective algorithms for this task has received significant attention for many years [im2gps, weyand2016planet]. The appearance of an outdoor scene can also change rapidly. These changes are often due to fleeting, or transient, attributes such as lighting and weather conditions, that dramatically affect the visual perception of an environment. For instance, consider a scene that changes from sunny and pleasant to rainy and brooding in mere minutes. Several methods have been proposed for automatically understanding and extracting these subtle characteristics from imagery [patterson2012sun, lu2014two, laffont2014transient, baltenberger16transient]. Estimating these types of transient attributes has importance in a number of applications, including: environmental monitoring [wang2013observing, fedorov2014snow], as a pre-processing step for calibration [jacobs13cloudcalibration, workman2014rainbow], and enabling semantic browsing of large photo collections [jacobs07amos, laffont2014transient]. Our work fuses these two research areas by learning to estimate geo-temporal image features, which are related to when and where an image was captured.
Recently, a significant amount of work has explored how sources of supervision beyond manual annotation can be used to learn useful representations of images. In general, collecting manual annotations for millions, or perhaps billions, of images is prohibitively expensive. As Doersch summarizes [doersch2016supervision], “The idea is that, given the right task, the computer can learn on its own to represent useful semantic properties of the visual world.” Such learning tasks are often referred to as pretext tasks; they serve as an intermediary target for learning the intended representation. For example, Doersch et al. [doersch2015unsupervised] show how spatial context can be used as a supervisory signal in order to learn a visual representation for object discovery. Similarly, Pathak et al. [pathak2016context] use context-based pixel prediction for pre-training a representation for classification, detection, and segmentation tasks. We extend this line of work by using time and location context to learn useful features from a large corpus of imagery.
Our work makes the following visual assumptions about the world. First, that photographs provide a direct source of context regarding the conditions under which they were captured. For example, the time of day that an image is captured is directly related to the brightness of the image (i.e., light to dark), season can indicate the expected weather conditions or how people are dressed, and location can provide evidence about anticipated styles, such as architecture. Second, these context signals are hard to extract from an image, are potentially noisy (e.g., snow in early Summer), and can be indicated by multiple sources (e.g., snow on the ground, people wearing heavy coats). These assumptions motivate our method which integrates image appearance, time, and location, the latter of which are typically recorded automatically by the imaging device.
In our approach, we explicitly model the relationship between the image, its geographic location, and the time of capture. We propose a novel convolutional neural network architecture that implicitly learns how to extract geo-temporal features from the imagery by optimizing for a set of location and time estimation tasks. Specifically, we structure our network to jointly learn feature representations for three related spaces: images, time, and location. To accomplish this, each representation, or combination of representations, is used to predict held out information. For example, the image representation and location representation (or the combination of both) are used to learn to predict when an image was captured. In total, three representations are learned using four classification tasks. We optimize all representations and tasks simultaneously, in an end-to-end fashion.
The main contributions of this work are: 1) a novel approach for learning geo-temporal image features from a large corpus of imagery without requiring image-level manual annotations; 2) an evaluation of the learned features on the task of transient attribute estimation, where our features outperform those from a network pre-trained using the strongly supervised ImageNet dataset [ILSVRC15]; 3) an evaluation of the accuracy of our learned estimators, highlighting the value of additional context; and 4) a novel location estimation method that uses the task of time estimation to localize a static webcam.
2 Related Work
Image localization, or estimating where an image was captured, is an important problem in the vision community. Typically, the problem is formulated as image retrieval using a reference database of ground-level images [im2gps] or overhead images [lin2015learning, workman2015geocnn, workman2015localize] with known location. Other methods take advantage of photometric and geometric cues, such as sun position [lalonde2010sun, workman2014rainbow]. More recently, Weyand et al. [weyand2016planet] proposed to directly predict the geographic location of a single image using a deep convolutional neural network by classifying the query image into a set of spatial bins. For our localization task, we adopt this classification approach and extend it to include temporal context.
Other work has explored how to estimate the time that an image was captured. Salem et al. [salem2016dating] demonstrate that human appearance, including clothing and hairstyle, is a useful cue for dating images. Matzen and Snavely [matzen2014scene] predict timestamps for photos by matching against a time-varying reconstruction of a scene. Volokitin et al. [eth_biwi_01292] use representations extracted from CNNs to estimate ambient temperature and time of year for outdoor images. As with localization, we adopt a classification approach to estimating when an image was captured and show how these estimates improve when the image location is known.
Attribute-based representations have become popular in outdoor scene understanding to help describe how the appearance of a scene changes over time. Laffont et al. [laffont2014transient] introduced a taxonomy of 40 transient attributes that describe intra-scene variations, along with methods for identifying the presence of such attributes in an image. Using this dataset, Baltenberger et al. [baltenberger16transient] introduce methods for estimating the presence of transient attributes using convolutional neural networks. Jacobs et al. [jacobs07amos] demonstrate that principal component analysis, when applied to webcam imagery, results in a decomposition that is closely related to natural changes in the scene, including the time of day, local weather conditions, and human activity. Similarly, a body of work has sought to understand local weather conditions [islam13webcamweather, lu2014two]. Many studies have shown that these types of transient attributes can be useful for image and camera localization tasks [jacobs07geolocate, baltenberger16transient].
Recent work has explored self-supervision, in which tasks sometimes referred to as pretext tasks are used to train deep neural networks to capture useful visual representations [doersch2015unsupervised, pathak2016context]. For example, Zhang et al. [zhang2016colorful] show how image colorization (synthesizing colors for a grayscale image) is a powerful pretext task for learning visual representations. Pathak et al. [pathak2017learning] exploit low-level motion-based grouping cues for unsupervised feature learning. These methods typically exploit some known quantity of the data (e.g., pixel color values) to avoid expensive manual annotation. As a byproduct, a useful visual representation is learned. In our work, we consider two novel pretext tasks, time and location estimation.
3 Estimating Geo-Temporal Image Features
We propose a neural network architecture for learning geo-temporal features from images by optimizing for a set of location and time estimation tasks. An overview of the proposed architecture is shown in Figure 1. Our network takes three inputs: an image, the time the image was captured, and the location of capture. Each input is independently processed by a context network to extract mid-level features. Then, pairs of these features are used by estimator networks to predict distributions over time or location.
3.1 Context Networks
We use three context networks: a temporal context network, a location context network, and an image context network. The output of each context network is a 128-dimensional feature with a sigmoid activation function. For the temporal context network, we parameterize the input timestamp using a one-hot encoding of month and hour of day. This encoding is flattened and passed to the temporal context network, which consists of three fully-connected layers (with 256, 512, and 128 channels respectively), the first two with ReLU activations. For the location context network, we parameterize the geographic location using standard 3D ECEF coordinates, which we normalize by the Earth’s radius. Other than a different input and independent network weights, the location context network is identical to the temporal context network. For the image context network, we use the InceptionV2 architecture [szegedy2016rethinking], up to the global pooling layer, to extract features. We flatten the output feature map and append the same structure as the other context networks.
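The two non-image inputs described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the timestamp is encoded on a joint 12-by-24 month-by-hour grid that is then flattened (concatenating two separate one-hot vectors would be another reading of the text), and the Earth is treated as a sphere so that dividing ECEF coordinates by the radius yields a unit vector.

```python
import numpy as np

N_MONTHS, N_HOURS = 12, 24  # assumed joint month-by-hour grid


def encode_time(month, hour):
    """One-hot encode a timestamp on a flattened month-by-hour grid.

    month is in 1..12 and hour is in 0..23 (assumed conventions).
    """
    grid = np.zeros((N_MONTHS, N_HOURS), dtype=np.float32)
    grid[month - 1, hour] = 1.0
    return grid.ravel()


def encode_location(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to radius-normalized ECEF coordinates.

    Assumes a spherical Earth, so the result is a unit vector (x, y, z).
    """
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        dtype=np.float32,
    )
```

Both vectors would then be fed to their respective fully-connected context networks; an ellipsoidal (WGS 84) conversion could replace the spherical one without changing the interface.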
3.2 Estimator Networks
The outputs of the context networks are used as input to four different estimator networks:
Location Estimator, which predicts location using only image features;
Time Estimator, which predicts capture time using only image features;
Time-conditioned Location Estimator, which predicts location using features from the image and the known capture time;
Location-conditioned Time Estimator, which predicts capture time using features from the image and the known geographic location.
Aside from different output sizes, the estimator networks have the same structure as the context networks. We discretize the output space for location and time and represent the probability as a categorical distribution (i.e., using a softmax activation for each estimator). For location, we use equal-angle latitude/longitude bins. For time, we use month/hour bins.
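The discretization described above amounts to mapping each label to a bin index and applying a softmax over the bins. The following is a minimal sketch, not the authors' implementation: the 10-degree bin size and all function names are hypothetical, chosen only to make the example concrete.

```python
import numpy as np

BIN_DEG = 10.0  # hypothetical bin size in degrees; not taken from the source


def location_bin(lat_deg, lon_deg, bin_deg=BIN_DEG):
    """Return the flat index of the equal-angle bin containing a location."""
    n_lat = int(round(180.0 / bin_deg))
    n_lon = int(round(360.0 / bin_deg))
    # Clamp so that lat = 90 and lon = 180 fall in the last bin.
    row = min(int((lat_deg + 90.0) // bin_deg), n_lat - 1)
    col = min(int((lon_deg + 180.0) // bin_deg), n_lon - 1)
    return row * n_lon + col


def time_bin(month, hour):
    """Return the flat index of a (month, hour) bin; month 1..12, hour 0..23."""
    return (month - 1) * 24 + hour


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

With these indices as targets, each estimator becomes an ordinary multi-class classifier over its set of bins.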
3.3 Implementation Details
We randomly initialize the InceptionV2 network using the standard strategy [szegedy2016rethinking]. We initialize all other network weights randomly using the Xavier initializer [glorot2010understanding] and simultaneously optimize them during training. Each estimator network has a cross-entropy loss, and we minimize the sum of these losses using the Adam optimizer [kingma2014adam]. The learning rate starts at 0.001 and is halved at a fixed iteration interval. For regularization, we apply weight decay with a rate of 0.0001. We train the proposed network with batch size 32. We apply batch normalization [inception15] on every layer except the last (for both context and estimator networks). The input images are rescaled and then augmented by a random crop. We use Greenwich Mean Time (GMT) for all timestamps.
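The training objective and learning rate policy can be illustrated with a short sketch. The base rate of 0.001 and the halving behavior come from the text; the halving interval of 10,000 iterations is a hypothetical placeholder, and the function names are illustrative only.

```python
import numpy as np


def learning_rate(step, base_lr=0.001, halve_every=10000):
    """Step decay: start at base_lr and halve every `halve_every` iterations.

    halve_every is a hypothetical value, not taken from the source.
    """
    return base_lr * 0.5 ** (step // halve_every)


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target bin."""
    return -np.log(probs[target] + 1e-12)


def total_loss(predictions, targets):
    """Sum the cross-entropy losses of all estimator networks."""
    return sum(cross_entropy(p, t) for p, t in zip(predictions, targets))
```

Here `predictions` would hold one softmax output per estimator and `targets` the corresponding bin indices, so all tasks contribute equally to the summed objective.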
We evaluate the context networks and estimator networks on various datasets, visualize specific features in the image context networks, and show that the image context features have strong correlations with transient image attributes.
4.1 Training and Evaluation Datasets
We use four main datasets to evaluate our approach. The AMOS dataset refers to a subset of the AMOS database [jacobs07amos], which is a collection of over a billion images captured from public outdoor webcams around the world. For our experiments, we use only images from webcams with high-accuracy geolocation that were captured between 2002 and 2017. This resulted in images from 12,193 webcams, of which we held out 231 for testing. Each image has a timestamp recorded by the image collection process. The YFCC dataset refers to a subset of the Yahoo Flickr Creative Commons 100 Million dataset [yfcc100m], only including geotagged images from smartphones. We restricted the dataset to smartphone images since we found that non-phone images often had inaccurate timestamps. We filter out indoor images using the Places network [zhou2017places]. This results in a training set of 892,662 images and a test set of 170,994 images. The Hybrid dataset refers to a combination of the AMOS and YFCC training sets (sampling equally for each mini-batch). The TA dataset refers to the Transient Attributes Dataset [laffont2014transient], which contains 8,571 images, each manually annotated with transient attributes, such as sunny and cloudy.
4.2 Understanding the Image Context Representation
We conducted several experiments to relate image appearance to the representation learned by the image context network. To begin, we examined images that correspond with extremal activations. For this experiment, we used 10,000 images randomly sampled from the YFCC dataset and 7,732 images covering the year 2015 from one webcam (ID: 4308) in the AMOS dataset. For each neuron of the image context representation, we selected the images that result in the highest activations from the two different sets of images. Figure 2 shows a montage of images for three neurons. The neurons appear to capture semantically meaningful attributes, such as daylight, rainy, and winter. Similarly, we selected two neurons and visualized their signal over time for images from the webcam. Figure 3 shows how scene appearance changes are related to the image context features. For the example shown, it appears that these neurons are related to daylight and fogginess. These experiments provide evidence that the mid-level representation captured by the image context network is related to static and transient scene attributes.
4.3 Analyzing Feature Correlation with Transient Attributes
To quantify how much our model learns about transient attributes, we compute the cross correlations between a mid-level representation of the image context network and the corresponding transient attribute labels of all test images in the TA dataset. As a baseline, we compare to features of the same architecture trained for ImageNet [inception15] classification and to features sampled uniformly at random. We select the feature from the last pooling layer (AvgPool_1a_7x7), which is the deepest layer that this model and ours share in common. We compute the cross correlation scores between the feature and the transient attribute scores of each image, resulting in a cross correlation matrix in which each element is the cross correlation score between one feature channel and one transient attribute. Figure 4 shows, for each transient attribute, the maximum absolute correlation score over all feature channels. We observe that our proposed method learns features that are more correlated with the transient attributes than those of the ImageNet network or the random features.
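The correlation analysis can be sketched as follows. This assumes that the cross correlation score is the Pearson correlation coefficient between a feature channel and an attribute across images, which the text does not state explicitly; the function names are hypothetical.

```python
import numpy as np


def correlation_matrix(features, attributes):
    """Pearson correlation between every feature channel and every attribute.

    features has shape (n_images, n_channels) and attributes has shape
    (n_images, n_attributes); the result is (n_channels, n_attributes).
    """
    f = features - features.mean(axis=0)
    a = attributes - attributes.mean(axis=0)
    # Normalize each centered column to unit length (epsilon avoids 0/0).
    f = f / (np.linalg.norm(f, axis=0) + 1e-12)
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    return f.T @ a


def max_abs_correlation(features, attributes):
    """For each attribute, the largest |correlation| over all channels."""
    return np.abs(correlation_matrix(features, attributes)).max(axis=0)
```

Taking the maximum over channels asks whether at least one channel tracks a given attribute, which is the per-attribute quantity plotted in Figure 4.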
4.4 Comparing Mid-Level Features for Transient Attribute Estimation
Figure 5: Comparing mid-level features for transient attribute estimation. (left) Features extracted from the weights of our proposed approach versus a network trained for image classification and a randomly initialized baseline. (right) Features extracted from our method, trained on different datasets.
The previous experiment showed that the image context network is capturing mid-level features correlated with transient attributes. In this section, we explore the ability of this representation for directly estimating transient attributes. Similar to the previous experiment, we truncate our model at the last pooling layer (in order to compare versus alternative initialization strategies), and add a final two-layer MLP with 40 outputs corresponding to the 40 transient attributes in the TA dataset. We train this network, initializing from the weights of models trained for different tasks, including variants of our method trained on the AMOS, YFCC, and Hybrid datasets. During training, the MLP portions are randomly initialized while the earlier layers are frozen. We evaluate the average mean squared error (MSE) for the test set every 500 iterations (batch size 32). Figure 5 shows the performance comparison among different mid-level features, including ImageNet and randomly initialized InceptionV2. Our features are superior to all baselines and perform best when learned using the Hybrid dataset.
4.5 Application: Image Localization
There are two image localization formulations that our network architecture enables. The straightforward approach is to use the location estimator (or the time-conditioned variant) to generate a probability distribution over a discrete set of location bins. An alternative approach is to optimize for a continuous location estimate by minimizing the loss of the location-conditioned time estimator.
Given an input image, we evaluate the location estimator and the time-conditioned location estimator, the latter of which requires a timestamp. We trained our model on each dataset and performed quantitative evaluation using the test images from AMOS and YFCC, separately. We use the latitude/longitude center of the highest probability bin as our location estimate. The results of this experiment are presented in Figure 6. We observe that the time-conditioned location estimator is superior in both cases. We also conclude that our model performs better if trained on the same imagery source as the test set, and that training the network with the Hybrid dataset is competitive on both test sets.
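Turning a predicted distribution into a point estimate, and scoring it against ground truth, can be sketched as below. The 10-degree bin size is a hypothetical placeholder, and great-circle (haversine) distance is an assumed error metric; the text does not specify how localization error is measured.

```python
import numpy as np


def bin_center(index, bin_deg=10.0):
    """Latitude/longitude center of a flat equal-angle bin index."""
    n_lon = int(round(360.0 / bin_deg))
    row, col = divmod(index, n_lon)
    lat = -90.0 + (row + 0.5) * bin_deg
    lon = -180.0 + (col + 0.5) * bin_deg
    return lat, lon


def localize(probs, bin_deg=10.0):
    """Location estimate: center of the highest-probability bin."""
    return bin_center(int(np.argmax(probs)), bin_deg)


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in kilometers between two points on a sphere."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    h = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(h))
```

Because the estimate is a bin center, its error is bounded below by the bin resolution even when the correct bin is selected.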
In the alternative formulation, we use the location-conditioned time estimator to optimize for a continuous location estimate. Given the known image capture time, the idea is that the true location should result in a low value for the cross-entropy loss associated with the estimator. Therefore, we can produce a location estimate by optimizing the location with respect to this loss. Unfortunately, an individual image does not typically yield a unique, or accurate, location estimate using this method. However, if we sum the loss across images captured at different times, we find that the minimum of the function becomes more distinct. Figure 7 shows several qualitative examples of this localization strategy on static webcams, where darker colors correspond to more likely locations. We can see that as additional images are included in the loss, the uncertainty of the location prediction diminishes.
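The idea of accumulating the time-estimation loss over several images can be sketched with a simple search over candidate locations. Note the substitution: the text describes optimizing a continuous location, whereas this sketch evaluates a fixed list of candidates. The `loss_fn` callable is a hypothetical stand-in for the cross-entropy loss of the location-conditioned time estimator.

```python
import numpy as np


def localize_by_time_loss(loss_fn, images, times, candidates):
    """Pick the candidate location with the lowest summed loss.

    loss_fn(image, time, location) is a hypothetical stand-in for the loss
    of the location-conditioned time estimator at a candidate location.
    Returns the best candidate and the summed loss for every candidate.
    """
    totals = np.array([
        sum(loss_fn(img, t, loc) for img, t in zip(images, times))
        for loc in candidates
    ])
    return candidates[int(np.argmin(totals))], totals
```

The returned `totals` array is the kind of loss surface visualized in Figure 7; with a single image many candidates can tie, and the minimum sharpens as images from different times are added.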
[Figure 7 panels: AMOS: 5992, AMOS: 8260]
4.6 Application: Time Estimation
Using the time estimator and location-conditioned time estimator, our network is able to estimate the capture time of a query image. These estimators output a distribution in discrete 2D time space. To evaluate our estimates, we compare the ground-truth capture time and the marginal probabilities of our predictions on the YFCC test set, and present the cumulative error plots in Figure 8. We observe that including location is not useful for pinpointing the month. We suspect this is because most of our imagery is in the northern hemisphere, and changing the location within a hemisphere doesn’t change the season. However, this is not the case when estimating the hour. To visualize this, we show in Figure 9 the impact of changing the location on the hour estimate. We compute the marginal hour distribution at different latitudes and longitudes. When performing a sweep over latitude, we fix the longitude value to be the ground truth (and vice versa). We found that the longitude of the image corresponds more with the hour prediction than the latitude, which matches expectations.
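The evaluation above relies on marginalizing the joint month/hour distribution and measuring error on a cyclic scale. A minimal sketch follows, assuming a 12-by-24 joint distribution, that the prediction is the mode of each marginal, and that error wraps around (December is one month from January); none of these details are confirmed by the text, and the function names are hypothetical.

```python
import numpy as np


def marginals(joint):
    """Split a month-by-hour distribution into month and hour marginals."""
    joint = np.asarray(joint, dtype=float).reshape(12, 24)
    return joint.sum(axis=1), joint.sum(axis=0)


def circular_error(pred, truth, period):
    """Absolute difference on a cycle, e.g. period=12 for months."""
    d = abs(pred - truth) % period
    return min(d, period - d)


def cumulative_error(errors, max_error):
    """Fraction of samples with error <= k, for k = 0..max_error."""
    errors = np.asarray(errors)
    return np.array([(errors <= k).mean() for k in range(max_error + 1)])
```

Sweeping the location input while holding the image fixed, then recomputing the hour marginal, gives the kind of comparison shown in Figure 9.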
When learning about the world using images, the location and time an image was captured are useful pieces of metadata that are often available, but commonly overlooked. We presented a novel architecture for learning useful representations from images that takes advantage of this metadata. We found that for the task of transient attribute estimation, our method, despite being trained without manually obtained image-level annotations, learned image representations that outperform the representations learned using ImageNet. This is a rarely achieved feat in self-supervised representation learning against a frequently used baseline. One important area for future work is in investigating alternative architectures for the context networks. We did not conduct a thorough study in this regard and expect to see improvements in using newer image CNNs and higher capacity time and location networks. In addition, we expect that richer time and location input representations will result in improved geo-temporal image features.
We gratefully acknowledge the support of NSF CAREER award IIS-1553116 and ARPA-E Award DE-AR0000594. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.