We consider the problem of detecting frames of interest in videos stored on a remote server, from which we can request individual frames, at a fixed transmission cost per frame. We define a “frame of interest” as one containing at least one object belonging to any member of a set of task-specific object classes.
Our proposed solution is based on decomposing this problem into three constituent subproblems: (1) assigning a score to each observed frame corresponding to the probability, based on observing it in isolation, that it is “of interest”, (2) integrating this sequence of sparsely observed “interestingness scores” into a probability distribution over the full label sequence, and (3) deciding which unobserved frame, if any, to request next.
We solve (1) by learning a convolutional frame-scoring network from ground-truth (frame, label) pairs. We solve (2) using a hidden Markov model (HMM) derived from the transition and co-occurrence statistics of ground-truth frame labels and regressed frame scores. Our solution to (3) is the following greedy policy: if the application context provides enough time and bandwidth to request another frame, make the frame request that would most lower the mean (across frames) expected cross entropy [murphy2012probabilistic] between frame labels and marginal distributions over them. We evaluate this approach on a subset of the ImageNet-VID dataset [russakovsky2015imagenet].
2 Related work
The collection of large-scale labeled video datasets like ImageNet-VID [russakovsky2015imagenet] and YouTube-BB [real2017youtubeboundingboxes] has allowed systems for detecting objects in images to be adapted to work with videos ([bertasius2018object], [Wang2018FullyMN], [Zhu_2018]). Because processing video is computationally expensive, several approaches to reducing system resource demands have been proposed.
Zhu et al. [zhu2018highmob] train separate lightweight networks for (1) detecting objects in sparsely sampled keyframes and (2) estimating motion fields between keyframes, which they use to interpolate between object states inferred from those keyframes. Wang et al. [wang2018fast] exploit motion vectors computed during H.264 compression to efficiently propagate features across frames. In the AdaScale system proposed by Chin et al. [chin2019adascale], an adaptive agent is trained to select the appropriate scale for the input image, optimizing for speed and accuracy.
Canel et al. [canel2019scaling] propose an on-sensing-platform filtering system to reduce bandwidth consumption in remote video camera deployments. Their micro-classifier predicts the relevance of each frame, and the system transmits only frames whose relevance scores exceed a threshold. From this work, we borrow the task of frame-of-interest classification. We also take inspiration from their architecture for full-frame object detection, which applies a convolution to the feature map generated by a convolutional feature extractor and reduces the result over spatial locations.
The inference agent we propose maintains a probability distribution over the full label sequence of the video as well as a recommendation for which frame to request next, and updates both as new frames are received. The agent has three components: a convolutional frame-scoring network, a hidden Markov model for extrapolating inference across frames, and a request-recommendation rule (greedy expected-cross-entropy minimization).
3.1 Scoring retrieved frames
We use a convolutional frame-scoring network to regress the probability that a given retrieved frame is a frame of interest. We refer to this probability estimate as a “score” both because it is uncalibrated [guo2017calibration] and to disambiguate it from probabilities computed using the hidden Markov model.
The frame-scoring network, based on the network described in [canel2019scaling], is the composition of an ImageNet-pretrained ResNet-18 “backbone” with its global average pooling and fully connected layers removed [russakovsky2015imagenet, he2016resnet], a learned convolution with 1 output channel, and a final operation that reduces the resulting map to a single score.
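As a rough illustration of the scoring head alone, the following NumPy sketch maps a backbone feature map to a single frame score. The 1×1 kernel size, the max reduction over spatial locations, and the final sigmoid are assumptions made for illustration, since the text does not pin them down. `score_frame` is a hypothetical name, and the random features stand in for real ResNet-18 activations.

```python
import numpy as np

def score_frame(features, weights, bias=0.0):
    """Map a (C, H, W) feature map to a scalar score in (0, 1).

    Assumptions (not confirmed by the text): the learned convolution is
    1x1, the spatial reduction is a max, and a sigmoid squashes the result.
    """
    # A 1x1 convolution with one output channel is a per-location dot product.
    logits = np.einsum("chw,c->hw", features, weights) + bias
    # Reduce the (H, W) map to one number: the strongest location wins.
    top = logits.max()
    return 1.0 / (1.0 + np.exp(-top))

rng = np.random.default_rng(0)
# ResNet-18 yields a 512-channel, 7x7 final feature map for a 224x224 input.
features = rng.normal(size=(512, 7, 7))
weights = rng.normal(size=512) * 0.01
score = score_frame(features, weights)
```

Only `weights` and `bias` would be learned in this sketch, which matches a setup where the backbone stays frozen.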
3.2 Inferring dense labels from sparse scores
We construct a hidden Markov model (Fig. 2) to form beliefs over a video’s full sequence of frame labels from sparsely observed frame scores. The hidden states take one of two values: 1 if the frame is of interest, and 0 otherwise. The observed states take one of three values, each indicating that the corresponding frame score falls into one of three equiprobable quantiles. We use the forward-backward algorithm [rabiner1986introduction] to compute marginal label probabilities. In Fig. 3, we show how these marginal probabilities update according to this model after a new observation.
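The marginal computation can be sketched as a scaled forward-backward pass in which unobserved frames contribute a likelihood of 1. This is a minimal sketch, not the authors' implementation: the function name `marginals`, the uniform prior, and the numeric transition and emission values are all illustrative assumptions.

```python
import numpy as np

def marginals(n_frames, observations, transition, emission, prior):
    """Posterior P(label_t = s | observed scores) for every frame t.

    observations: dict {frame index: observation bin}; all other frames
                  are treated as unobserved.
    transition[i, j] = P(label_{t+1} = j | label_t = i)
    emission[s, o]   = P(bin o | label s)
    """
    n_states = transition.shape[0]
    # Evidence likelihood per frame; unobserved frames contribute 1.
    like = np.ones((n_frames, n_states))
    for t, o in observations.items():
        like[t] = emission[:, o]

    alpha = np.zeros((n_frames, n_states))
    beta = np.ones((n_frames, n_states))
    alpha[0] = prior * like[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n_frames):
        alpha[t] = (alpha[t - 1] @ transition) * like[t]
        alpha[t] /= alpha[t].sum()  # rescale to avoid underflow
    for t in range(n_frames - 2, -1, -1):
        beta[t] = transition @ (like[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Illustrative (made-up) parameters: sticky labels, three score bins.
transition = np.array([[0.95, 0.05], [0.05, 0.95]])
emission = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
prior = np.array([0.5, 0.5])

# Observe a high-score bin at frame 3 of a 10-frame video.
post = marginals(10, {3: 2}, transition, emission, prior)
```

In this example the belief that a frame is of interest peaks at the observed frame and decays toward the prior with distance from it.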
3.3 Deciding which frames to request
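While refining our beliefs about an $N$-frame video, we maintain a marginal label probability vector $p \in [0, 1]^N$. $p_i = 1$ indicates complete certainty that frame $i$ is of interest, and $p_i = 0$ indicates complete certainty that it is not.
Our agent’s goal is to produce a sequence of queries $q_1, \dots, q_K$ such that, after observing frames $q_1, \dots, q_K$ and updating $p$ accordingly, it minimizes the mean frame-wise cross entropy between the distributions specified by $p$ and the ground truth label vector $y \in \{0, 1\}^N$:
$$\frac{1}{N} \sum_{i=1}^{N} H(y_i, p_i), \qquad H(y_i, p_i) = -y_i \log p_i - (1 - y_i) \log (1 - p_i).$$
Since $y$ is unknown during inference, we take the expectation over possible outcomes according to $p$. We define the frame-wise expected cross entropy loss as
$$\ell(p_i) = \mathbb{E}_{y_i \sim p_i}\left[ H(y_i, p_i) \right] = -p_i \log p_i - (1 - p_i) \log (1 - p_i), \qquad L(p) = \frac{1}{N} \sum_{i=1}^{N} \ell(p_i).$$
Using this formula, we compute the expected loss of the updated marginal probability vector $p^{(q, o)}$ for every possible next query $q$:
$$\bar{L}(q) = \sum_{s \in \{0, 1\}} P(y_q = s) \sum_{o} e_s(o) \, L\big(p^{(q, o)}\big),$$
where $P(y_q = 1) = p_q$, $p^{(q, o)}$ is the marginal vector obtained after additionally observing $o$ at frame $q$, and $e_s$ is the hidden Markov model’s observation emission distribution for hidden state $s$. We select the query that minimizes this expected loss:
$$q^* = \arg\min_q \bar{L}(q).$$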
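The greedy selection rule can be sketched as follows: for each candidate frame, average the post-update mean entropy over the observation bins that frame might produce, weighted by how likely each bin is under the current beliefs. This is a brute-force sketch under assumed names (`marginals`, `mean_entropy`, `next_query`) and made-up HMM parameters. It reruns forward-backward for every candidate and outcome, which a practical system would likely avoid.

```python
import numpy as np

def marginals(n, obs, T, E, prior):
    """P(label_t = 1 | observations) for each of n frames (forward-backward)."""
    like = np.ones((n, 2))
    for t, o in obs.items():
        like[t] = E[:, o]
    alpha, beta = np.zeros((n, 2)), np.ones((n, 2))
    alpha[0] = prior * like[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * like[t]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (like[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post[:, 1] / post.sum(axis=1)

def mean_entropy(p):
    """Mean over frames of the expected cross entropy of Bernoulli(p_i)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def next_query(n, obs, T, E, prior):
    """Return the unobserved frame with the lowest expected post-update loss."""
    p = marginals(n, obs, T, E, prior)
    best_q, best_loss = None, np.inf
    for q in range(n):
        if q in obs:
            continue
        # P(o_q = o) = sum_s P(y_q = s) * E[s, o]
        obs_prob = np.array([1 - p[q], p[q]]) @ E
        loss = sum(w * mean_entropy(marginals(n, {**obs, q: o}, T, E, prior))
                   for o, w in enumerate(obs_prob))
        if loss < best_loss:
            best_q, best_loss = q, loss
    return best_q, best_loss

# Illustrative (made-up) parameters.
T = np.array([[0.95, 0.05], [0.05, 0.95]])
E = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
prior = np.array([0.5, 0.5])

first_q, first_loss = next_query(9, {}, T, E, prior)
second_q, _ = next_query(9, {first_q: 2}, T, E, prior)
```

With no observations and these symmetric parameters, the first query lands on the middle frame, since an observation there informs the largest number of neighbouring frames.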
4 Experiment details
We test our approach on the task of identifying frames containing road vehicles in the ImageNet-VID dataset, with a target class set $C$ of road-vehicle classes. We assign labels to every frame in ImageNet-VID using the provided annotations. If a frame contains an instance of a class in $C$, it is labeled 1. Otherwise, it is labeled 0.
4.1 Training the frame-scoring network
The only parameters of the frame-scoring network that we learn are those of the final convolution operation. Our training set consists of 10,000 images sampled as follows. We divide all frames in ImageNet-VID into two sets according to their label (0 or 1). From each set, we select 5,000 frames uniformly at random without replacement. This sampling procedure balances the two labels and ensures training set diversity. We train the network to predict frame labels using a binary cross-entropy loss. We train for 100 epochs, using the Adam optimizer [kingma2014adam] with a learning rate of and mini-batch size of 8.
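The balanced sampling step might look like the sketch below. It assumes frame labels are available as a flat 0/1 array; `sample_balanced` and the seed are hypothetical, and the toy array stands in for the full set of ImageNet-VID frame labels.

```python
import numpy as np

def sample_balanced(labels, per_class, seed=0):
    """Pick `per_class` frame indices for each label, without replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in (0, 1)
    ]
    return np.concatenate(picks)

# Toy stand-in for per-frame labels; the paper's setting would use per_class=5000.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
train_idx = sample_balanced(toy_labels, per_class=3)
```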
4.2 Computing transition matrices
The hidden Markov model is parameterized by two matrices: the hidden state transition matrix and the observation emission matrix. We compute these matrices based on the videos in the ImageNet-VID training set that contain at least one frame of interest. The transition matrix is composed of the occurrence rates of adjacent label-label pairs. To construct the emission matrix, we first score every frame with the frame-scoring network. We then convert the scores to discrete observations by binning them into three equiprobable quantiles. Finally, we assemble the emission matrix from label-observation co-occurrence rates.
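One way to carry out this counting procedure is sketched below. It assumes per-video label and score sequences are already in memory, pools all scores to find the quantile edges, and row-normalizes the counts into conditional probabilities. The row normalization and the name `estimate_hmm` are assumptions, and the toy sequences are made up.

```python
import numpy as np

def estimate_hmm(label_seqs, score_seqs, n_bins=3):
    """Estimate transition and emission matrices from labeled, scored videos.

    Assumes both labels occur in the data; otherwise a row sum would be zero.
    """
    all_scores = np.concatenate([np.asarray(s, dtype=float) for s in score_seqs])
    # Interior quantile edges splitting the pooled scores into equiprobable bins.
    edges = np.quantile(all_scores, np.arange(1, n_bins) / n_bins)

    trans = np.zeros((2, 2))
    emit = np.zeros((2, n_bins))
    for labels, scores in zip(label_seqs, score_seqs):
        labels = np.asarray(labels)
        bins = np.searchsorted(edges, scores, side="right")
        # Count adjacent label-label pairs within each video.
        np.add.at(trans, (labels[:-1], labels[1:]), 1)
        # Count label-observation co-occurrences.
        np.add.at(emit, (labels, bins), 1)

    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return trans, emit, edges

# Toy example with a single six-frame video.
label_seqs = [[0, 0, 1, 1, 1, 0]]
score_seqs = [[0.1, 0.2, 0.8, 0.9, 0.7, 0.3]]
trans, emit, edges = estimate_hmm(label_seqs, score_seqs)
```

The returned `edges` would also be needed at inference time to map a new frame score to its observation bin.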
The evaluation dataset consists of the videos in the ImageNet-VID validation set containing at least one frame of interest. To reduce the evaluation run time, videos in this set longer than 300 frames are split into clips of 300 frames or fewer. We define the bandwidth ratio $r$ as the fraction of frames in the video our agent is allowed to observe; that is, our agent can observe $rN$ frames of an $N$-frame video. After our agent has made this number of observations, we measure frame-of-interest classification accuracy. We measure this accuracy at each bandwidth ratio and average across videos in our evaluation dataset (Fig. 5).
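The evaluation bookkeeping can be sketched in a few lines. Several details here are assumptions rather than facts from the text: clips are taken as consecutive chunks, the query budget is rounded to the nearest integer with a minimum of one frame, and predictions are thresholded at 0.5. All function names are hypothetical.

```python
import numpy as np

def split_into_clips(n_frames, max_len=300):
    """Return (start, end) index pairs for consecutive clips of <= max_len frames."""
    return [(s, min(s + max_len, n_frames)) for s in range(0, n_frames, max_len)]

def query_budget(n_frames, ratio):
    """Number of frames the agent may observe (rounding rule is an assumption)."""
    return max(1, int(round(ratio * n_frames)))

def accuracy(p, labels, threshold=0.5):
    """Fraction of frames whose thresholded marginal matches the true label."""
    predicted = np.asarray(p) >= threshold
    return float(np.mean(predicted == np.asarray(labels).astype(bool)))

clips = split_into_clips(650)
budget = query_budget(300, 0.02)
acc = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```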
Fig. 5 displays the accuracy-versus-bandwidth curve measured in our experiment. After observing only 2% of the frames in a video, our agent is able to make frame-of-interest predictions with the same accuracy as an unconstrained detector. Qualitatively, the agent seems to query densely around ambiguous sections of the video while ignoring others (Fig. 4).
Using a convolutional network to score sparsely observed frames and a hidden Markov model to make dense predictions and generate queries, we can accurately determine which frames in a video contain objects of interest. On a subset of the ImageNet-VID dataset, our method achieves an observation reduction of 98%.
To improve upon our approach, it may be valuable to further investigate the temporal statistics of our task. Road vehicles are fast-moving objects that only stay in-frame for short periods of time. The statistics of other tasks, like animal- or person-detection, where objects of interest enter and exit less frequently, may be different. The effectiveness of our approach may vary significantly with the statistics of the training and evaluation video sets, and more work is needed to fully characterize this variation.
Furthermore, since we designed our frame-scoring network with simplicity in mind, our system may be improved by substituting in a more sophisticated network. One possibility is to use an architecture incorporating object localization rather than direct, frame-level classification. The downside, however, would be a longer run time and consequently a lower maximum query rate.