Many computer vision algorithms focus on the problem of understanding the most general distribution of real-world images (often modeled by “Internet”-scale datasets such as ImageNet or COCO). However, any single camera observes only a tiny fraction of this general distribution. This offers the possibility of achieving more efficient inference by specializing compact, low-cost models to the specific distribution of frames observed by a single camera. In other words, to gain efficiency a model can learn to cheat: segmenting people sitting on a park lawn might be as easy as looking for shades of green.
However, in practice model specialization is challenging because it can be difficult to predict the distribution of frames a camera will see. Specialization approaches rely on tedious configuration of models [28, 11] or careful selection of model training samples so as not to miss rare events. Even if per-camera curation were possible, many video streams simply cannot be captured by a low-capacity model due to distribution shift in the images observed over time. For example, stationary cameras observe scenes that constantly evolve with time of day, changing weather conditions, and as different subjects move through the scene. Moving cameras offer greater challenges: TV cameras pan and zoom, most smartphone videos are hand-held, and egocentric cameras on vehicles or robots move through dynamic scenes.
In this paper we describe a surprisingly simple strategy for realizing high-accuracy, low-cost semantic segmentation models that are specialized to a single video stream. Our approach is based on the widely used technique for model distillation [3, 18], whereby a lightweight “student” model is trained to output the predictions of a larger, high-capacity “teacher” model. However, rather than learning a specialized student model on offline data from the video stream that has been labeled with teacher predictions, we train the student in an online fashion on the live data stream, intermittently running the teacher to provide a target for learning. Intuitively, we find that simple models can be accurate, provided they are continuously adapted to the specific contents of a video stream as new frames arrive.
We show that online model distillation yields semantic segmentation models that closely approximate their Mask R-CNN teacher with 7× to 17× lower inference runtime cost (11× to 27× when comparing FLOPs), even when the target video’s distribution is non-stationary over time. Our method requires no offline pretraining on data from the target video stream, and delivers higher accuracy segmentation output, at lower cost, than efficiency-centric video semantic segmentation solutions based on flow. We also provide a new video dataset designed for evaluating the efficiency of inference over long running video streams.
2 Related Work
Distillation for specialization:
Training a small, efficient model to mimic the output of a more expensive teacher has been proposed as a form of model compression (also called knowledge distillation) [3, 18]. While early explorations of distillation focused on approximating the output of the large model over the entire original data distribution, our work, like other recent work from the systems community, leverages distillation to create highly compact, domain-specialized models that need only mimic the teacher for a desired subset of the data. However, rather than treating model distillation as an offline training preprocess for a stationary target distribution (and incurring the high up-front training cost and the challenges of curating a representative training set for each unique video stream), we perform distillation online to adapt the student model dynamically to the changing contents of a video stream.
Training a model online as new video frames arrive violates the independent and identically distributed (i.i.d.) assumptions of traditional stochastic gradient descent optimization. Although online learning from non-i.i.d. data streams has been explored [6, 40], in general there has been relatively little work on online optimization of “deep” non-convex predictors on correlated streaming data. The major exception is the body of work on deep reinforcement learning, where the focus is on learning policies from experience. Online distillation can be formulated as a reinforcement or a meta-learning problem. However, training methods [39, 32] employed in typical reinforcement settings are computationally expensive, require a large number of samples, and are largely for offline use. Our goal is to train a compact model which mimics the teacher in a small temporal window. In this context, we demonstrate that standard gradient descent is remarkably effective for training our compact architecture online.
and more recent methods built upon deep feature hierarchies [30, 49, 19, 34] can be viewed as a form of rapid online learning of appearance models from video. Tracking parameterizes objects with bounding boxes rather than segmentation masks, and its cost scales with the number of objects being tracked. Our approach for online distillation focuses on pixel-level semantic segmentation and poses a different set of performance challenges. It can be viewed as learning an appearance model for the entire scene as opposed to individual objects.
Fast-retraining of compact models:
A fundamental theme in our work is that low-cost models that do not generalize widely are useful, provided they can be quickly retrained to new distributions. Thus, our ideas bear similarity to recent work accelerating image classification in video via online adaptation to category skew and on-the-fly model training for image super-resolution.
Video object segmentation:
Solutions to video object segmentation (VOS) leverage online adaptation of high-capacity deep models to a provided reference segmentation in order to propagate instance masks to future frames [35, 51, 48, 5]. The goal of these algorithms is to learn a high-quality, video-specific segmentation model for use on subsequent frames of a short video clip, not to synthesize a low-cost approximation to a pre-trained general segmentation model like Mask R-CNN (MRCNN). Online VOS methods require seconds to minutes of training per short video clip (longer than directly evaluating a general segmentation model itself), precluding their use in a real-time setting. We believe our compact segmentation architecture and online distillation method could be used to significantly accelerate top-performing VOS solutions (see Section 5).
Temporal coherence in video:
Leveraging frame-to-frame coherence in video streams, such as background subtraction or difference detection, is a common way to reduce computation when processing video streams. More advanced methods seek to activate different network layers at different temporal frequencies according to expected rates of change [25, 41] or use frame-to-frame flow to warp inference results (or intermediate features) from prior frames to subsequent frames in a video [12, 52]. We show that for the task of semantic segmentation, exploiting frame-to-frame coherence in the form of model specialization (using a compact model trained on recent frames to perform inference on near future frames) is both more accurate and more efficient than flow-based methods on a wide range of videos.
3 Just-In-Time Model Distillation
Figure 1 provides a high-level overview of online model distillation for high quality, low-cost video semantic segmentation. On each video frame, a compact model is run, producing a pixel-level segmentation. This compact student model is periodically adapted using predictions from a high-quality teacher model (such as MRCNN). Since the student model is trained online (adapted just-in-time for future use), we refer to it as “JITNet”. To make online distillation efficient in practice, our approach must address the following challenges: 1) creating a student model supporting fast inference and training, 2) training the student online using imperfect teacher output, and 3) determining when and how to update the student as new frames arrive. Next, we describe each of these components.
3.1 JITNet Architecture
Efficient online distillation requires the JITNet architecture to support both efficient inference and training. Since real-world video streams contain objects at varying scales, JITNet must make predictions from high-resolution inputs. To allow for adaptation in few training iterations, JITNet must be stable when updated with high learning rates.
The JITNet architecture is a compact semantic segmentation architecture resembling an encoder-decoder. The three blocks in the encoder and decoder are modified ResNet blocks, in which the second 3×3 convolution of each block is replaced with a 1×3 convolution followed by a 3×1 convolution. This modification significantly reduces the parameters and computation required. The number of channels for each block is chosen to reduce computation on high resolution feature maps. The residual blocks allow for fast and stable training with high learning rates. In addition to the residual connections in each block, the output features from each encoder block are concatenated to the input of the corresponding decoder block. These connections assist with training speed and stability, while allowing JITNet to use features at various semantic levels.
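The parameter saving from this factorization can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the input, intermediate, and output channel counts are all equal and ignores biases, and the channel widths in the loop are hypothetical values, not JITNet's actual configuration.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Number of weights in a k_h x k_w convolution (bias ignored)."""
    return k_h * k_w * c_in * c_out


def full_3x3(channels):
    """Weights in a single 3x3 convolution with equal in/out channels."""
    return conv_params(3, 3, channels, channels)


def factored_1x3_3x1(channels):
    """Weights in a 1x3 convolution followed by a 3x1 convolution."""
    return conv_params(1, 3, channels, channels) + conv_params(3, 1, channels, channels)


# Hypothetical channel widths, chosen only to illustrate the ratio.
for channels in (64, 128, 256):
    full = full_3x3(channels)
    fact = factored_1x3_3x1(channels)
    print(channels, full, fact, round(fact / full, 3))
```

Under these assumptions the factored pair uses 6C² weights against 9C² for the full 3×3 convolution, a one-third reduction for that layer regardless of the channel count C. The multiply-accumulate count per output pixel shrinks by the same factor.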
Table 1 gives the parameter count, number of floating-point operations, and runtime of both JITNet and MRCNN on a frame of 720p video on an NVIDIA V100 GPU. (We provide both inference and training costs for JITNet.) Compact segmentation models, such as those based on MobileNet V2 [38, 46], are 3-4× slower than JITNet at high resolution and are also not designed for fast, stable online training. Analysis of JITNet variants on standard semantic segmentation datasets is provided in the appendix to ground it relative to other architectures designed for efficiency.
|Model||FLOPS (B)||Params (M)||Time (ms)|
3.2 Online Training with Gradient Descent
Online training presents many challenges: training samples (frames) from the video stream are highly correlated, there is continuous distribution shift in content (the past may not be representative of the future), and teacher predictions used as a proxy for “ground truth” at training can exhibit temporal instability or errors. The method for updating JITNet parameters must account for these challenges.
To generate target labels for training, we use the instance masks provided by MRCNN above a confidence threshold, and convert them to pixel-level semantic segmentation labels. All pixels where no instances are reported are labeled as background. On most video streams, this results in a significantly higher fraction of background compared to other classes. This imbalance reduces the ability of the student model to learn quickly, especially for small objects, due to most of the loss being weighted on background. We mitigate this issue by weighting the pixel loss in each predicted instance bounding box (dilated by 15%) five times higher than pixels outside boxes. This weighting focuses training on the more challenging regions near object boundaries and on small objects. With these weighted labels, we compute the gradients for updating the model parameters using weighted cross-entropy loss and gradient descent. Since training JITNet on a video from a random initialization would require significant training to adapt to the stream, we pretrain JITNet on the COCO dataset, then adapt the pretrained model to each stream.
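The box-based loss weighting described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes "dilated by 15%" means each box grows by 15% in width and height about its center, and that the weighted loss is normalized by the sum of weights. The function names and the box format `(x0, y0, x1, y1)` are hypothetical.

```python
import math


def box_weight_map(height, width, boxes, dilation=0.15, inside_weight=5.0):
    """Per-pixel loss weights: inside_weight within dilated boxes, 1.0 elsewhere.

    boxes holds (x0, y0, x1, y1) pixel coordinates with x1/y1 exclusive.
    """
    weights = [[1.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Assumption: grow the box by `dilation` of its size, split across both sides.
        pad_x = dilation * (x1 - x0) / 2.0
        pad_y = dilation * (y1 - y0) / 2.0
        left = max(0, math.floor(x0 - pad_x))
        top = max(0, math.floor(y0 - pad_y))
        right = min(width, math.ceil(x1 + pad_x))
        bottom = min(height, math.ceil(y1 + pad_y))
        for y in range(top, bottom):
            for x in range(left, right):
                weights[y][x] = inside_weight
    return weights


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over all pixels.

    probs[y][x] is a list of class probabilities; labels[y][x] is a class index.
    """
    total, weight_sum = 0.0, 0.0
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            p = max(probs[y][x][label], 1e-12)  # guard against log(0)
            total += weights[y][x] * -math.log(p)
            weight_sum += weights[y][x]
    return total / weight_sum
```

With this weighting, a misclassified pixel inside a dilated instance box contributes five times as much to the loss as a misclassified background pixel far from any object, which counteracts the dominance of background pixels.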
When fine-tuning models offline, it is common to only update a few layers or use small learning rates to avoid catastrophic forgetting. In contrast, for online adaptation, we want the JITNet model to adapt to current context even at the expense of forgetting. Therefore, we update all layers with high learning rates. Empirically, we find that gradient descent with high momentum (0.9) and learning rate (0.01) works remarkably well for updating JITNet parameters. We believe high momentum stabilizes training due to resilience to teacher prediction noise. We find that learning rates higher than 0.01 destabilize training. We use the same parameters for all online training experiments.
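For reference, one common formulation of gradient descent with momentum is sketched below using the hyperparameters quoted above (momentum 0.9, learning rate 0.01). The text does not state which momentum variant was used, so the update rule here is an assumption, and `sgd_momentum_step` is a hypothetical helper operating on flat lists of floats.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """Apply one in-place momentum update to params and velocity.

    velocity accumulates an exponentially decayed sum of past gradients, so a
    single noisy gradient only partially changes the direction of the step.
    """
    for i, grad in enumerate(grads):
        velocity[i] = momentum * velocity[i] + grad
        params[i] -= lr * velocity[i]


# Two steps on a single parameter with a constant gradient of 2.0.
params, velocity = [1.0], [0.0]
sgd_momentum_step(params, [2.0], velocity)
sgd_momentum_step(params, [2.0], velocity)
print(params, velocity)
```

With momentum 0.9, the velocity weights past gradients with a decay horizon of roughly 1 / (1 - 0.9) = 10 steps, which is one way to read the claim that high momentum gives resilience to noise in individual teacher predictions.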
3.3 Adaptive Online Distillation
The rate of distribution shift depends on the nature of the video stream. If the distribution is close to stationary, then little adaptation is needed. If the distribution of the video shifts rapidly, then JITNet must be frequently updated to keep pace. This suggests an opportunity to maximize efficiency by only adapting JITNet when necessary to preserve high accuracy. The algorithm for determining when to adapt JITNet (when to run the teacher) must itself be efficient; otherwise, the cost of determining when to adapt will diminish the performance benefits of infrequent training.
We develop a straightforward adaptive strategy that employs exponential back-off. The full adaptive, online distillation algorithm is shown in Figure 3. Inputs to the algorithm are the video stream, a desired accuracy threshold, and the JITNet and teacher (MRCNN) models. Configuration parameters to the adaptation are the maximum number of learning steps performed on a single frame, and the minimum/maximum frame strides between teacher invocations.
The algorithm operates in a streaming fashion and processes the frames in the video in temporal order. The teacher is only executed on frames which are multiples of the current stride. When the teacher is run, the algorithm computes the accuracy of the current JITNet model with respect to the teacher. If JITNet accuracy is less than the desired JITNet accuracy threshold (mean IoU), the model is updated using the teacher predictions as detailed in the previous section. The JITNet model is trained until it either reaches the set accuracy threshold or the upper limit on update iterations per frame. Once the training phase ends, if JITNet meets the accuracy threshold, the stride for running the teacher is doubled; otherwise, it is halved (bounded by minimum and maximum stride). The accuracy threshold is the only user-exposed knob in the algorithm. As demonstrated in our evaluation, modifying the threshold’s value allows for a range of accuracy vs. efficiency trade-offs.
Even when consecutive video frames contain significant motion, their overall appearance may not change significantly. Therefore, it is better to perform more learning iterations on the current frame than to incur the high cost of running the teacher on a new, but visually similar, frame. The maximum stride was chosen so that the system can respond to changes within seconds (64 frames is about 2.6 seconds on 25 fps video). The maximum updates per frame is roughly the ratio of JITNet training time to teacher inference cost. These parameters can be changed based on the scenario. We set the minimum and maximum stride to 8 and 64 respectively, and maximum updates per frame to 8 for all experiments. We include an ablation study of these parameters, choices in network design, and training method in the appendix.
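The adaptive procedure described in this section can be sketched as a short loop. This is an illustrative reading of the prose, not a transcription of Figure 3: the callables `student_predict`, `student_update`, `teacher_predict`, and `accuracy` are hypothetical stand-ins for JITNet inference, one JITNet training step, MRCNN inference, and mean IoU against the teacher. Emitting the student's prediction on teacher frames is also an assumption.

```python
def online_distillation(frames, student_predict, student_update, teacher_predict,
                        accuracy, threshold=0.8, max_updates=8,
                        min_stride=8, max_stride=64):
    """Run the student on every frame; run the teacher with exponential back-off.

    Returns the per-frame student outputs and the number of teacher invocations.
    """
    stride = min_stride
    outputs = []
    teacher_frames = 0
    for t, frame in enumerate(frames):
        if t % stride == 0:  # teacher runs only on multiples of the current stride
            target = teacher_predict(frame)
            teacher_frames += 1
            updates = 0
            acc = accuracy(student_predict(frame), target)
            # Train on this frame until accurate enough or out of update budget.
            while acc < threshold and updates < max_updates:
                student_update(frame, target)
                updates += 1
                acc = accuracy(student_predict(frame), target)
            if acc >= threshold:
                stride = min(max_stride, 2 * stride)   # back off: teacher runs less often
            else:
                stride = max(min_stride, stride // 2)  # fall behind: teacher runs more often
        outputs.append(student_predict(frame))
    return outputs, teacher_frames
```

In this sketch, a student that always meets the threshold sees the stride grow 8, 16, 32, 64 and then stay at 64, so on 25 fps video the teacher runs once every 64 / 25 = 2.56 seconds. A student that never meets the threshold keeps the stride at 8 and spends the full budget of 8 updates on every teacher frame.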
4 Long Video Streams (LVS) Dataset
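[Figure 4: sample frames from LVS videos; in each group, MRCNN predictions are on the left and JITNet 0.9 predictions are on the right.]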
Evaluating fast video inference requires a dataset of long-running video streams that is representative of real-world camera deployments, such as automatic retail checkout, player analysis in sports, traffic violation monitoring, and wearable device video analysis for augmented reality. Existing large-scale video datasets have been designed to support training high-quality models for various tasks, such as action detection [26, 44], object detection, tracking, and segmentation [36, 50], and consist of carefully curated, diverse sets of short video clips (seconds to a couple minutes).
We create a new dataset designed for evaluating techniques for efficient inference in real-world, long-running scenarios. Our dataset, named the Long Video Streams dataset (LVS), contains 30 HD videos, each 30 minutes in duration and at least 720p resolution. (900 minutes total; for comparison, YouTube-VOS is 345 minutes.) Unlike other datasets for efficient inference, which consist of streams from fixed-viewpoint cameras such as traffic cameras, we capture a diverse array of challenges: from fixed-viewpoint cameras, to constantly moving and zooming television cameras, and hand-held and egocentric video. Given the nature of these video streams, the most commonly occurring objects include people, cars, and animals.
It is impractical to obtain ground truth, human-labeled segmentations for all 900 minutes (1.6 million frames) of the dataset. Therefore, we carefully select videos for which MRCNN is observed to provide accurate and robust predictions. (We evaluated other segmentation models such as DeepLab V3 and Inplace ABN, and found MRCNN to be the most reliable.) We use the highest-quality MRCNN without test-time data augmentation, and provide its output for all dataset frames to aid evaluation of classification, detection, and segmentation (semantic and instance level) methods. Figure 4 shows a sampling of videos from the dataset, with their corresponding MRCNN segmentations (left image in each group). We refer readers to the appendix for additional dataset details, including visualization of MRCNN predictions for all videos.
The goal of our evaluation is to test the limits of online distillation as a strategy for efficient video segmentation. We compare with an alternative motion-based interpolation method and an online approach for video object segmentation. While our focus is evaluating accuracy and efficiency on long video streams (LVS), we also include results on the DAVIS video benchmark in the appendix.
|Video||Offline Oracle||Flow (Slow) (2.2×)||Flow (Fast) (3.2×)||JITNet 0.7||JITNet 0.8||JITNet 0.9|
|Overall||80.3||76.6||65.2||75.5 (17.4, 3.2%)||78.6 (13.5, 4.7%)||82.5 (7.5, 8.4%)|
|Sports (Fixed)||87.5||81.2||71.0||80.8 (24.4, 1.6%)||82.8 (21.8, 1.8%)||87.6 (10.4, 5.1%)|
|Sports (Moving)||82.2||72.6||59.8||76.0 (20.6, 2.1%)||79.3 (14.5, 3.6%)||84.1 (6.0, 9.1%)|
|Sports (Ego)||72.3||69.4||55.1||65.0 (13.6, 3.7%)||70.2 (9.1, 6.0%)||75.0 (4.9, 10.4%)|
|Animals||89.0||83.2||73.4||82.9 (21.7, 1.9%)||84.3 (19.6, 2.2%)||87.6 (14.3, 4.4%)|
|Traffic||82.3||82.6||74.0||79.1 (11.8, 4.6%)||82.1 (8.5, 7.1%)||84.3 (5.4, 10.1%)|
|Driving/Walking||50.6||69.3||55.9||59.6 (5.8, 8.6%)||63.9 (4.9, 10.5%)||66.6 (4.3, 11.9%)|
|Subset of Individual Video Streams|
|Table Tennis (P)||89.4||84.8||75.4||81.5 (24.7, 1.6%)||83.5 (24.1, 1.6%)||88.3 (12.9, 3.4%)|
|Kabaddi (P)||88.2||78.9||66.7||83.8 (24.8, 1.6%)||84.5 (23.5, 1.7%)||87.9 (7.8, 6.3%)|
|Figure Skating (P)||84.3||54.8||37.9||72.3 (15.9, 2.8%)||76.0 (11.4, 4.1%)||83.5 (5.4, 9.4%)|
|Drone (P)||74.5||70.5||58.5||70.8 (15.4, 2.8%)||76.6 (6.9, 7.2%)||79.9 (4.1, 12.5%)|
|Birds (Bi)||92.0||80.0||68.0||85.3 (24.5, 1.6%)||85.7 (24.2, 1.6%)||87.9 (21.7, 1.8%)|
|Dog (P,D,A)||86.1||80.4||71.1||78.4 (19.0, 2.2%)||81.2 (13.8, 3.2%)||86.5 (6.0, 8.4%)|
|Ego Dodgeball (P)||82.1||75.5||60.4||74.3 (17.4, 2.5%)||79.5 (13.2, 3.4%)||84.2 (6.1, 8.2%)|
|Biking (P,Bk)||70.7||71.6||61.3||68.2 (12.7, 3.5%)||72.3 (6.7, 7.3%)||75.3 (4.1, 12.4%)|
|Samui Street (P,A,Bk)||80.6||83.8||76.5||78.8 (8.8, 5.5%)||82.6 (5.3, 9.5%)||83.7 (4.2, 12.2%)|
|Driving (P,A,Bk)||51.1||72.2||59.7||63.8 (5.7, 8.8%)||68.2 (4.5, 11.5%)||66.7 (4.1, 12.4%)|
5.1 Experimental Setup
Our evaluation focuses on both the efficiency and accuracy of semantic segmentation methods relative to MRCNN. Although MRCNN trained on the COCO dataset can segment 80 classes, LVS video streams captured from a single camera over a span of 30 minutes typically encounter a small subset of these classes. For example, none of the indoor object classes such as appliances and cutlery appear in outdoor traffic intersection or sports streams. Therefore, we measure accuracy only on classes which are present in the stream and have reliable MRCNN predictions. We also limit evaluation to classes representing independently moving objects, since stationary objects can be handled efficiently using simpler methods. We observed that MRCNN often confuses the class of cars, trucks, and buses, so to improve temporal stability we combine these classes into a single class “auto” for both training and evaluation. Therefore, we only evaluate accuracy on the following classes: bird, bike, auto, dog, elephant, giraffe, horse, and person. Table 2 shows the classes that are evaluated in each individual stream as an abbreviated list following the stream name.
All evaluated methods generate pixel-level predictions for each class in the video. We use mean intersection over union (mean IoU) over the classes in each video as the accuracy metric. All results are reported on the first 30,000 frames of each video (16-20 minutes due to varying fps) unless otherwise specified. Timing measurements for JITNet, MRCNN (see Table 1), and other baseline methods are performed using TensorFlow 1.10.1 (CUDA 9.2/cuDNN 7.3) and PyTorch 0.4.1 for MRCNN on an NVIDIA V100 GPU. All speedup numbers are reported relative to wall-clock time of MRCNN. However, note that MRCNN performs instance segmentation whereas JITNet performs semantic segmentation on a subset of classes.
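As a concrete reading of the accuracy metric, the sketch below computes mean IoU between a predicted and a reference label map for a given list of evaluated classes. It is an illustration under stated assumptions: the text does not say whether intersections and unions are accumulated per frame or over the whole video, so this version handles a single frame, and it skips classes absent from both maps. The function name and the integer class encoding are hypothetical.

```python
def mean_iou(pred, target, classes):
    """Mean intersection over union across `classes` for one label map.

    pred and target are 2D lists of integer class labels of equal shape.
    Classes that appear in neither map are skipped (an assumption).
    """
    ious = []
    for c in classes:
        inter = 0
        union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Tiny example: class 1 overlaps on 1 of 2 pixels, class 2 matches exactly.
pred = [[1, 1], [0, 2]]
target = [[1, 0], [0, 2]]
print(mean_iou(pred, target, classes=[1, 2]))
```

Here the background label 0 is left out of `classes`, mirroring the later statement that reported mean IoUs exclude the background class.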
5.2 Accuracy vs. Efficiency of Online Distillation
Table 2 gives the accuracy and performance of online distillation using JITNet at three different accuracy thresholds: JITNet 0.7, 0.8, and 0.9. Performance is the average speedup relative to MRCNN runtime, including the cost of teacher evaluation and online JITNet training. To provide intuition on the speedups possible on different types of videos, we organize LVS into categories of similar videos and show averages for each category (e.g., Sports (Moving) displays average results for seven sports videos filmed with a moving camera), as well as provide per-video results for a selection of 10 videos. We also show the fraction of frames for which MRCNN predictions are used. For instance, on the Kabaddi video stream, JITNet 0.8 is 23.5 times faster than MRCNN, with a mean IoU of 84.5, and uses 510 frames out of 30,000 (1.7%) for supervision.
On average, across all sequences, JITNet 0.9 maintains 82.5 mean IoU with a 7.5× runtime speedup (11.6× in FLOPs). In the lower accuracy regime, JITNet 0.7 is 17.4× faster on average (26.6× in FLOPs) while maintaining a mean IoU of 75.5. Mean IoUs in the table exclude the background class, where all the methods have high accuracy. As expected, when the accuracy threshold is increased, JITNet improves in accuracy but uses a larger fraction of teacher frames for supervision. Average speedup on sports streams from fixed cameras is also higher than on streams from moving cameras. Even on challenging egocentric sports videos with significant motion blur, JITNet 0.9 provides a 4.9× speedup while maintaining 75.0 mean IoU.
Although JITNet accuracy on the Sports (Fixed), Sports (Moving), Animals, and Traffic categories suggests potential for improvement, we observe that for streams with large objects, it is often difficult to qualitatively discern if JITNet or MRCNN produces higher quality predictions. Figure 4 displays sample frames with both MRCNN (left) and JITNet (right) predictions (zoom in to view details). The boundaries produced by JITNet on large objects (1st row right, 2nd row left, 3rd row left) are smoother than those of MRCNN, since MRCNN generates low-resolution masks (28×28) that are upsampled to full resolution. However, for videos containing small objects, such as traffic camera (Figure 4, 3rd row, right) or aerial views (4th row, right), MRCNN produces sharper segmentations. JITNet’s architecture and operating resolution would need to be improved to match MRCNN segmentations on small objects.
Streams from the Sports (Ego) category exhibit significant motion blur due to fast motion. Teacher predictions on blurred frames can be unreliable and lead to disruptive model updates. The Driving/Walking streams traverse a busy downtown and a crowded beach, and are expected to be challenging for online distillation since object instances persist on screen for only short intervals in these videos. Handling these scenarios more accurately would require faster methods for online model adaptation.
5.3 Comparison with Offline Oracle Specialization
The prior section shows that a JITNet model pre-trained only on COCO can be continuously adapted to a new video stream with only modest online training cost. We also compare the accuracy of just-in-time adaptation to the results of specializing JITNet to the contents of each stream entirely offline, and performing no online training. To simulate the effects of near best-case offline pre-training, we train JITNet models on every 5th frame of the entire 20 minute test video sequence (6,000 training frames). We refer to these models as “offline oracle” models since they are constructed by pre-training on the test set, and serve as a strong baseline for the accuracy achievable via offline specialization. All offline oracle models were pre-trained on COCO, and undergo one hour of pre-training on 4 GPUs using traditional random-batch SGD. (See appendix for further details.) Recall that in contrast, online adaptation incurs no pre-training cost and trains in a streaming fashion.
As shown in Table 2, JITNet 0.9 is on average more accurate than the offline oracle. Note that JITNet 0.9 uses only 8.4% of frames on average for supervision, while the oracle is trained using 20%. This trend also holds for the subcategory averages. This suggests that the compact JITNet model does not have sufficient capacity to fully capture the diversity present in the 20 minute stream.
Figure 5 shows mean IoU of JITNet 0.8 and the offline oracle across time for three videos. The top plot displays mean IoU of both methods (data points are averages over 30 second time intervals). The bottom plot displays the number of JITNet model updates in each interval. Images above the plots are representative frames from time intervals requiring the most JITNet updates. In the Birds video (left), these intervals correspond to events when new birds appear. In comparison, the Elephant video (center) contains a single elephant from different viewpoints and camera angles. The offline oracle model incurs a significant accuracy drop when the elephant dips into water. (This rare event makes up only a small fraction of the offline training set.) JITNet 0.8 displays a smaller drop since it specializes immediately to the novel scene characteristics. The Driving video (right) is challenging for both the offline oracle and online JITNet since it features significant visual diversity and continuous change. However, while the mean IoU of both methods is lower, online adaptation consistently outperforms the offline oracle in this case as well.
5.4 Comparison with Motion-Based Interpolation
An alternative approach to improving segmentation efficiency on video is to compute teacher predictions on a sparse set of frames and interpolate the results using flow. Table 2 shows two baselines that propagate pixel segmentations using Dense Feature Flow, although we upgrade the flow estimation network from FlowNet2 to modern methods. (We propagate labels, not features, since this was shown to be as effective.) The expensive variant (Flow (Slow)) runs MRCNN every 8th frame and uses PWC-Net to estimate optical flow between frames. MRCNN labels are propagated to the next seven frames using the estimated flow. The fast variant (Flow (Fast)) uses the same propagation mechanism but runs MRCNN every 16th frame and uses a faster PWC-Net. Overall, JITNet 0.7 is 2.8× faster and more accurate than the fast flow variant, and JITNet 0.9 has significantly higher accuracy than the slow flow variant except in the Driving/Walking category.
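To make the label-propagation baseline concrete, the sketch below warps a key-frame label map to a later frame using a dense flow field. It is a simplified illustration, not the baseline's actual code: it assumes a backward flow convention (each pixel of the later frame stores the offset to its source pixel in the key frame), nearest-neighbor sampling, and border clamping. `warp_labels` is a hypothetical helper, and the flow values would come from an estimator such as PWC-Net.

```python
def warp_labels(labels, flow):
    """Propagate a label map using backward flow with nearest-neighbor sampling.

    labels[y][x] is the class at a key-frame pixel.
    flow[y][x] = (dx, dy) is the offset from pixel (x, y) in the later frame
    to its source location in the key frame (an assumed convention).
    """
    height, width = len(labels), len(labels[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the source coordinate to the image bounds.
            src_x = min(width - 1, max(0, int(round(x + dx))))
            src_y = min(height - 1, max(0, int(round(y + dy))))
            warped[y][x] = labels[src_y][src_x]
    return warped


# A single object pixel that moves one pixel to the right between frames.
key_labels = [[0, 1, 0]]
backward_flow = [[(-1, 0), (-1, 0), (-1, 0)]]
print(warp_labels(key_labels, backward_flow))
```

Because the warp only moves existing labels around, it cannot introduce pixels for content that was not visible in the key frame, which is one reason large deformations or newly appearing objects are hard for this kind of interpolation.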
Figure 6 illustrates the challenge of using flow to interpolate sparse predictions. Notice how the ice skaters in the video undergo significant deformation, making them hard to track via flow. In contrast, online distillation trains JITNet to learn the appearance of scene objects (it leverages temporal coherence by reusing the model over local time windows), allowing it to produce high-quality segmentations despite complex motion. The slower flow baseline performs well compared to online adaptation on rare classes in the Driving (Bike) and Walking (Auto) streams, since flow is agnostic to semantic classes. Given the orthogonal nature of flow and online adaptation, it is possible a combination of these approaches could be used to handle streams with rapid appearance shifts.
|Category||OSVOS A (3.3%)||OSVOS B (3.3%)||JITNet 0.8|
|Overall||59.9||60.0||77.4 (14.5, 4.6%)|
|Sports (Fixed)||75.7||75.7||82.3 (24.0, 1.6%)|
|Sports (Moving)||69.1||69.3||78.7 (16.3, 2.9%)|
|Sports (Ego)||67.6||68.1||74.8 (9.5, 5.9%)|
|Animals||79.3||79.8||86.0 (19.7, 2.1%)|
|Traffic||22.3||21.9||70.8 (8.4, 7.7%)|
|Driving/Walking||36.7||36.3||66.8 (4.3, 11.8%)|
5.5 Comparison with Video Object Segmentation
Although not motivated by efficiency, video object segmentation (VOS) solutions employ a form of online adaptation: they train a model to segment future video frames based on supervision provided in the first frame. We evaluate the accuracy of the OSVOS approach against JITNet on two-minute segments of each LVS video. (OSVOS was too expensive to run on longer segments.) For each 30-frame interval of the segment, we use MRCNN to generate a starting foreground mask, train the OSVOS model on the starting mask, and use the resulting model for segmenting the next 29 frames. We train OSVOS for 30 seconds on each starting frame, which requires approximately one hour to run OSVOS on each two-minute video segment. Since segmenting all classes in the LVS videos would require running OSVOS once per class, we run OSVOS on only one class per video (the first non-background (person or animal) class in each stream) and compare JITNet accuracy with OSVOS on the designated class. (Recall JITNet segments all classes.) Furthermore, we run two configurations of OSVOS: in mode (A) we use the OSVOS model from the previous 30-frame interval as the starting point for training in the next interval (a form of continuous adaptation). In mode (B) we reset to the pre-trained OSVOS model for each 30-frame interval.
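The one-hour figure quoted above follows from simple arithmetic, sketched below. The frame rate of 30 fps is an assumption made for illustration (LVS videos have varying frame rates), and `osvos_cost` is a hypothetical helper.

```python
def osvos_cost(segment_seconds=120, fps=30, interval_frames=30, train_seconds=30):
    """Estimate OSVOS training cost for one video segment.

    Returns (number of intervals, training hours, fraction of teacher frames).
    """
    frames = segment_seconds * fps                 # frames in the segment
    intervals = frames // interval_frames          # one teacher mask per interval
    train_hours = intervals * train_seconds / 3600.0
    teacher_fraction = 1.0 / interval_frames       # one supervised frame per interval
    return intervals, train_hours, teacher_fraction


intervals, hours, fraction = osvos_cost()
print(intervals, hours, round(100 * fraction, 1))
```

Under the 30 fps assumption, a two-minute segment has 3,600 frames and 120 intervals, and 120 × 30 seconds of training is one hour. One teacher frame per 30-frame interval also gives a supervision fraction of about 3.3%.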
Table 3 compares the accuracy of both OSVOS variants to online distillation with JITNet. The table also provides model accuracy, runtime speedup relative to MRCNN, and the fraction of frames used by JITNet 0.8 for supervision in the two-minute interval. Overall, JITNet 0.8 is more accurate than OSVOS and two orders of magnitude faster. On Traffic streams, which have small objects, and Driving/Walking streams with rapid appearance changes, OSVOS has significantly lower accuracy than JITNet 0.8. We also observe that the mode A variant of OSVOS (continuously adapted) performs worse than the variant which is re-initialized. This suggests that the VGG-based model architecture used in OSVOS is not amenable to continuous online adaptation. We believe the JITNet architecture could be employed as a means to significantly accelerate online VOS methods like OnAVOS or the more recent OSVOS-S (which uses MRCNN predictions every frame).
In this work we demonstrate that for common, real-world video streaming scenarios, it is possible to perform online distillation of compact (low cost) models to obtain semantic segmentation accuracy that is comparable with an expensive high capacity teacher. Going forward, we hope that our results encourage exploration of online distillation for efficient video inference on other tasks such as pose estimation. More generally, with continuous capture of high-resolution video streams becoming increasingly commonplace, we believe it is relevant for the broader community to think about the design and training of models that are not trained offline on carefully curated datasets, but instead continuously evolve each day with the data that they observe from specific video streams. We hope that the Long Video Streams dataset serves this line of research.
7.1 Online Distillation Parameter Study
All experiments in the work involving the online distillation algorithm use a fixed set of values for the parameters (maximum updates, minimum stride, learning rate), except the accuracy threshold. Table 5 compares the accuracy, speedup and fraction of frames used for supervision when these parameter values vary. We perform the ablation study on a subset of six streams (which are representative of the different scenarios) from the dataset. The baseline is JITNet 0.8, which is the online distillation algorithm run with an accuracy threshold of 0.8. For JITNet 0.8, the maximum updates, minimum stride, and learning rate were set to 8, 8 and 0.01 respectively. We vary one parameter at a time, and each column in the table corresponds to a variation from the JITNet 0.8 baseline.
High learning rates allow for faster adaptation. Therefore, we chose the highest learning rate at which online training is stable for all the experiments in the paper. As Table 5 shows, a lower learning rate of 0.001 reduces both accuracy and speedup. Increasing the learning rate to 0.1 destabilizes training and yields very poor results. The number of updates to perform on a single frame depends on how much the model can learn from the frame and how useful that information is in the immediate future. This is hard to predict without access to future frames or knowledge of what is inherently difficult for the model to learn. Increasing the number of updates leads to overfitting, which reduces accuracy while increasing speedup and reducing the overall number of teacher samples. This suggests some room for improvement over our simple accuracy-based heuristic for choosing how many updates to perform on a given frame. As one would expect, increasing and decreasing the minimum stride increase and decrease accuracy, respectively. Overall, the online distillation algorithm is reasonably robust to the input parameters.
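To make the role of each parameter concrete, the following is a minimal sketch of an online distillation loop, not the exact algorithm used in the experiments. The accuracy threshold, maximum updates, minimum stride, and learning rate come from the text above; the stride doubling and halving rule, the `max_stride` cap, and the scalar `ScalarStudent` and `toy_accuracy` stand-ins (for JITNet and for mean IoU) are assumptions made purely for illustration.

```python
def online_distill(frames, teacher, student, accuracy,
                   acc_threshold=0.8, max_updates=8,
                   min_stride=8, max_stride=64):
    """Run the student on every frame, consulting the teacher intermittently.

    Returns (predictions, teacher_frames). The stride rule below
    (double when accurate, halve otherwise) is an assumption.
    """
    stride = min_stride
    next_teacher = 0
    predictions, teacher_frames = [], []
    for t, frame in enumerate(frames):
        if t == next_teacher:
            target = teacher(frame)
            teacher_frames.append(t)
            acc = accuracy(student.predict(frame), target)
            updates = 0
            # Train on this frame until accurate enough or out of budget.
            while acc < acc_threshold and updates < max_updates:
                student.update(frame, target)
                acc = accuracy(student.predict(frame), target)
                updates += 1
            if acc >= acc_threshold:
                stride = min(max_stride, stride * 2)
            else:
                stride = max(min_stride, stride // 2)
            next_teacher = t + stride
        predictions.append(student.predict(frame))
    return predictions, teacher_frames


class ScalarStudent:
    """Hypothetical stand-in for JITNet: predicts one scalar per frame."""

    def __init__(self, learning_rate):
        self.value = 0.0
        self.learning_rate = learning_rate

    def predict(self, frame):
        return self.value

    def update(self, frame, target):
        # One gradient-like step toward the teacher's output.
        self.value += self.learning_rate * (target - self.value)


def toy_accuracy(pred, target):
    """Hypothetical stand-in for mean IoU, clipped to [0, 1]."""
    return max(0.0, 1.0 - abs(pred - target))


frames = [1.0] * 100  # a static toy "stream"
student = ScalarStudent(learning_rate=0.5)
_, teacher_frames = online_distill(frames, teacher=lambda f: f,
                                   student=student, accuracy=toy_accuracy)
print(teacher_frames, len(teacher_frames) / len(frames))
```

On this static toy stream the student reaches the threshold after a few updates on the first frame, so the stride grows and the teacher is consulted on only a small fraction of frames. A stream whose content keeps changing would keep the stride near its minimum.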
7.2 DAVIS Evaluation
[Table 5 column groups: JITNet 0.8 (baseline), Max Updates, Learning Rate, Min Stride]
Online distillation can be used to mimic an accurate teacher model with a compact model, improving runtime efficiency. The main focus of this work is to demonstrate the viability of the online distillation technique for semantic segmentation on streams captured from typical deployment settings. In this section, we show preliminary results on the viability of online distillation combined with the JITNet architecture for accelerating semi-supervised video object segmentation methods. Specifically, we evaluate how the JITNet architecture can be combined with state-of-the-art methods like OSVOS-S.
We evaluate three different configurations of JITNet at varying levels of supervision. In configuration A, we train JITNet on only the first ground truth frame of each sequence, and evaluate JITNet over the rest of the frames in the sequence without any additional supervision (the standard video object segmentation task). On many sequences in DAVIS, object appearance changes significantly, and segmenting the object requires prior knowledge of its shape. Note that JITNet is a very low capacity model designed for online training and cannot encode such priors. Configuration A is not an online distillation scenario, but even with its low capacity, the JITNet architecture trained on just the first frame yields reasonable results.
Recent methods like OSVOS-S leverage instance segmentation models such as Mask R-CNN to provide priors on object shape in every frame. We take a similar approach in configuration B, where the goal is to mimic the expensive OSVOS-S model. We train JITNet on the first ground truth frame, then adapt using segmentation predictions from OSVOS-S every 16 frames. Note that in configuration B, our combined approach does not use additional ground truth, since OSVOS-S predictions are made using only the first ground truth frame. Finally, in configuration C, we train on the first ground truth frame, and adapt on the ground truth mask every 16 frames. This gives an idea of how the quality of the teacher affects online distillation.
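The three configurations differ only in where supervision comes from at each frame. A minimal sketch of that schedule is given below; the function name, the string labels, and the choice to place the periodic supervision on frames 16, 32, and so on are illustrative assumptions.

```python
def supervision_source(frame_idx, config, period=16):
    """Return the signal that supervises JITNet at this frame, or None.

    config is "A", "B", or "C" as described in the text.
    """
    if config not in ("A", "B", "C"):
        raise ValueError("config must be 'A', 'B', or 'C'")
    if frame_idx == 0:
        return "ground_truth"  # every configuration trains on the first frame
    if config == "A" or frame_idx % period != 0:
        return None  # no supervision on this frame
    # B adapts on OSVOS-S predictions, C on ground truth masks.
    return "osvos_s" if config == "B" else "ground_truth"


for config in ("A", "B", "C"):
    print(config, [supervision_source(i, config) for i in (0, 1, 16, 32)])
```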
We use the validation set of the DAVIS 2016 dataset for our evaluation. The dataset contains 50 video sequences with 3455 frames in total, each labeled with pixel-accurate segmentation masks for a single foreground object. We evaluate using the main DAVIS metrics: region similarity J and contour accuracy F, with precision, recall, and decay over time for both. We present metrics over the entire DAVIS 2016 validation set for all three JITNet configurations, alongside a subset of state-of-the-art video object segmentation approaches. In all configurations, we start with JITNet pre-trained on YouTube-VOS, with max updates per frame set to 500 and accuracy threshold set to 0.95, and use standard data augmentations (flipping, random noise, blurring, rotation). JITNet A performs similarly to OFL, a flow-based approach for video object segmentation, while JITNet B, using OSVOS-S predictions, performs comparably to OSVOS, with significantly lower runtime cost. Finally, JITNet C, which uses ground truth masks for adaptation, performs comparably to JITNet B, which adapts using only OSVOS-S predictions. This suggests that even slightly noisy supervision suffices for online distillation. Overall, these results are encouraging for further work on architectures well suited to online training.
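For reference, region similarity J is the Jaccard index (intersection over union) between the predicted and ground truth foreground masks. The sketch below computes it for a single frame using plain nested lists; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, and the contour accuracy F is not covered.

```python
def region_similarity(pred, gt):
    """Jaccard index (intersection over union) of two binary masks.

    Masks are equally sized nested lists of 0/1 values.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(region_similarity(pred, gt))
```

Here two pixels are foreground in both masks and four are foreground in at least one, so J is 0.5 for this frame.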
7.3 Offline Training Details
JITNet COCO pre-training:
All JITNet models used in our experiments are pre-trained on the COCO dataset. We convert the COCO instance mask labels into semantic segmentation labels by combining all the instance masks of each class for each image. We train the model for semantic segmentation on all 80 classes. The model is trained on 4 GPUs with batch size 24 (6 per GPU) using an Adam optimizer with a starting learning rate of 0.1 and a step decay schedule (which reduces the learning rate to 1/10th of the current rate every 10 epochs) for 30 epochs.
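The label conversion described above can be sketched as follows. This is an illustration on plain nested lists rather than the actual COCO tooling; the function name, the background label of 0, and the tie-break for overlapping instances of different classes (later instances overwrite earlier ones) are assumptions.

```python
def instances_to_semantic(instances, height, width, background=0):
    """Merge per-instance binary masks into one class-label map.

    instances: list of (class_id, mask) pairs, where mask is a
    height x width nested list of 0/1 values.
    """
    labels = [[background] * width for _ in range(height)]
    for class_id, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # All instances of a class share the same label.
                    labels[y][x] = class_id
    return labels


# Two instances of class 1 and one instance of class 2 on a 2x3 image.
instances = [
    (1, [[1, 0, 0], [0, 0, 0]]),
    (1, [[0, 1, 0], [0, 0, 0]]),
    (2, [[0, 0, 0], [0, 0, 1]]),
]
print(instances_to_semantic(instances, height=2, width=3))
```

The two class-1 instances collapse into a single class-1 region, which is the information a semantic segmentation model is trained to predict.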
JITNet offline oracle training:
All offline oracle models are initialized using the COCO pre-trained model and trained on the specialized dataset for each video using the same training setup as COCO, i.e., same number of GPUs, batch size, optimizer, and learning rate schedule. However, each of the specialized datasets contains about 6000 images, roughly 20× fewer than the COCO dataset.
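The step decay schedule shared by COCO pre-training and oracle training can be written as a one-line function. The sketch below assumes zero-based epoch indices and uses an illustrative function name.

```python
def step_decay_lr(epoch, base_lr=0.1, drop=0.1, epochs_per_drop=10):
    """Learning rate at a zero-based epoch under step decay."""
    return base_lr * drop ** (epoch // epochs_per_drop)


# The 30-epoch schedule: three plateaus of ten epochs each.
schedule = [step_decay_lr(epoch) for epoch in range(30)]
print(schedule[0], schedule[10], schedule[20])
```

With a starting rate of 0.1, the rate stays at 0.1 for epochs 0 to 9, drops to about 0.01 for epochs 10 to 19, and to about 0.001 for epochs 20 to 29 (up to floating-point rounding).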
7.4 Standalone Semantic Segmentation
The JITNet architecture is specifically designed with low capacity so that it can support both fast training and inference. To understand the accuracy vs. efficiency trade-off relative to other architectures such as MobileNetV2 [38, 46], we trained a JITNet model with twice the number of filters in each convolution layer and twice the number of encoder/decoder blocks as the one used in the paper. This modified architecture is 1.5× faster than the semantic segmentation architecture based on MobileNetV2. The larger JITNet gives a mean IoU of 67.34 on the Cityscapes validation set, which compares favorably with the 70.71 mean IoU of the MobileNetV2-based model. We started with the larger JITNet architecture in the online distillation experiments, but lowered the capacity even further, halving the number of filters in each convolution layer and the number of encoder/decoder blocks, since this provided a better cost vs. accuracy trade-off for online distillation.
-  A. Shocher, N. Cohen, and M. Irani. “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
-  C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM, 2006.
-  S. Bulo, L. Porzi, and P. Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
-  N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
-  L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
-  FAIR. Detectron Mask R-CNN. https://github.com/facebookresearch/Detectron, 2018.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):267–282, 2008.
-  R. Gadde, V. Jampani, and P. V. Gehler. Semantic video CNNs through representation warping. CoRR, abs/1708.03088, 2017.
-  J. Goodman, A. G. Greenberg, N. Madras, and P. March. Stability of binary exponential backoff. Journal of the ACM (JACM), 35(3):579–602, 1988.
-  S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096–2109, 2016.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency maps with convolutional neural network. In International Conference on Machine Learning, pages 597–606, 2015.
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
-  J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, pages 253–266, New York, NY, USA, 2018. ACM.
-  Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
-  D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.
-  J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork RNN. In Proceedings of the International Conference on Machine Learning, pages 1863–1871, 2014.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
-  H. Liu, S. Chen, and N. Kubota. Intelligent video systems and analytics: A survey. IEEE Trans. Industrial Informatics, 9(3):1222–1233, 2013.
-  W.-L. Lu, J.-A. Ting, J. J. Little, and K. P. Murphy. Learning to track and identify players from broadcast sports videos. IEEE transactions on pattern analysis and machine intelligence, 35(7):1704–1716, 2013.
-  C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3074–3082, 2015.
-  K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. 2018.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4293–4302. IEEE, 2016.
-  F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
-  E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 852–868. Springer, 2016.
-  H. Shen, S. Han, M. Philipose, and A. Krishnamurthy. Fast video classification via adaptive cascading of deep models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  TensorFlow. TensorFlow DeepLab Model Zoo. https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md, 2018.
-  Y.-H. Tsai, M.-H. Yang, and M. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
-  L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3119–3127, 2015.
-  N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
-  L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. Proceedings of the International Conference on Robotics and Automation, 2018.
-  X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 7, 2017.