Dynamic Face Video Segmentation via Reinforcement Learning

Yujiang Wang et al.
Imperial College London

For real-time semantic video segmentation, most recent works utilise a dynamic framework with a key scheduler to make online key/non-key decisions. Some works use a fixed key scheduling policy, while others propose adaptive key scheduling methods based on heuristic strategies, both of which may lead to suboptimal global performance. To overcome this limitation, we propose to model the online key decision process in dynamic video segmentation as a deep reinforcement learning problem, and to learn an efficient and effective scheduling policy from expert information about decision history and from the process of maximising global return. Moreover, we study the application of dynamic video segmentation to face videos, a field that has not been investigated before. Evaluating on the 300VW dataset, we show that our reinforcement key scheduler outperforms various baseline approaches, and that our method achieves real-time processing speed. To the best of our knowledge, this is the first work to use reinforcement learning for online key-frame decisions in dynamic video segmentation, and also the first work on its application to face videos.








1 Introduction

In computer vision, semantic segmentation is a computationally intensive task which performs per-pixel classification on images. Following the pioneering work of Fully Convolutional Networks (FCN) long2015fully, tremendous progress has been made in recent years with the proposal of various deep segmentation methods chen2018deeplab; badrinarayanan2017segnet; wu2019wider; zhao2017pyramid; autodeeplab2019; deeplabv3plus2018; lin2017refinenet; paszke2016enet; chen2017rethinking; zhao2018icnet; nekrasov2018light. To achieve accurate results, these image segmentation models usually employ heavy-weight deep architectures and additional steps such as spatial pyramid pooling zhao2017pyramid; chen2017rethinking; chen2018deeplab and multi-scale paths of inputs/features chen2016attention; zhao2018icnet; lin2016efficient; chen2018deeplab; chen2014semantic; lin2017refinenet; pohlen2017full, which further increase the computational workload. For real-time applications such as autonomous driving, video surveillance, and facial analysis wang2018face, it is impractical to apply such methods on a per-frame basis, as this results in latency intolerable for those applications. Therefore, acceleration becomes a necessity for these models to be applied in real-time video segmentation.

Various methods shelhamer2016clockwork; zhu2017deep; xu2018dynamic; li2018low; jain2018inter; nekrasov2019architecture; jain2018accel; nilsson2018semantic; gadde2017semantic have been proposed to accelerate video segmentation. Because adjacent frames in a video often share a large proportion of similar pixels, most of these works utilise a dynamic framework which separates frames into key and non-key frames and produces their segmentation masks differently. As illustrated in Fig. 1 (left), a deep image segmentation model N is divided into a heavy feature-extraction part N_feat and a light task-related part N_task. To produce segmentation masks, key frames go through both N_feat and N_task, while a fast feature interpolation method is used to obtain features for the non-key frames by warping N_feat's output on the last key frame (LKF), thus avoiding the heavy cost of running N_feat on every frame. On top of that, a key scheduler is used to predict whether an incoming frame should be a key or non-key frame.

As an essential part of dynamic video segmentation, decisions made by the key scheduler could significantly affect the overall performance li2018low; xu2018dynamic; zhu2018towards of the video segmentation framework. However, this topic is somewhat underexplored by the community. Several recent works nekrasov2019architecture; zhu2017deep; jain2018accel; jain2018inter simply used a fixed key scheduler, which is usually suboptimal as it does not take the video content into account. Some other works have proposed adaptive schedulers xu2018dynamic; li2018low; zhu2018towards that make the key/non-key decision based on whether the deviation between two frames surpasses a certain threshold. Trained to heuristically predict similarity between image pairs, such schedulers lack awareness of the global video context, which may also lead to suboptimal performance in the long run.

To overcome this limitation, we propose to apply reinforcement learning techniques to expose the key scheduler to the global video context. Leveraging additional expert information about decision history, our scheduler is trained to learn key-decision policies that maximise the long-term returns in each episode, as shown in Fig. 1 (right).

We further study the application of dynamic video segmentation to real-time face videos. Compared with general semantic image/video segmentation, semantic segmentation for faces is a less investigated field gucclu2017end; zhou2015interlinked; kae2013augmenting; smith2013exemplar; nirkin2018face; warrell2009labelfaces; scheffler2011joint; yacoob2006detection; lee2008markov; ghiasi2015using, and there are even fewer works on face segmentation in videos wang2018face; saito2016real. These works either used engineered features kae2013augmenting; smith2013exemplar; warrell2009labelfaces; scheffler2011joint; yacoob2006detection; lee2008markov, or employed outdated image segmentation models like FCN long2015fully on a per-frame basis nirkin2018face; saito2016real; wang2018face without a dynamic acceleration mechanism. Therefore, we propose a real-time face segmentation system based on the dynamic segmentation framework, with our key scheduler trained using reinforcement learning.

In particular, we adopt the Deeplab-V3+ model deeplabv3plus2018 with the MobileNet-V2 mobilenetv22018 backbone as the image segmentation model N, and we use optical flows extracted by the FlowNet2-s architecture ilg2017flownet to interpolate key-frame features to non-key ones zhu2017deep. We conduct experiments on the 300 Videos in the Wild (300VW) shen2015first dataset, where the face segmentation annotations are obtained as described in wang2018face. By comparing with several baseline approaches, we show that our reinforcement key schedulers can make more effective key-frame decisions at little computational cost. We also show that our final system achieves real-time performance on the face segmentation task.

Figure 1: Left: The dynamic video segmentation framework in which a key scheduler is used to make online key/non-key predictions. Right: a comparison between previous key schedulers and ours. Previous works only consider deviation between current frame (C) and the last key frame (K), while our scheduler takes into account C, K and historical information from non-key frames (N), aiming to maximise the global return.

2 Related works

Semantic image segmentation   Fully Convolutional Networks (FCN) long2015fully is the first work to use fully convolutional layers and skip connections to obtain pixel-level predictions for image segmentation. Successive works have made various improvements, including the usage of dilated convolutions chen2014semantic; chen2018deeplab; chen2017rethinking; yu2015multi; yu2017dilated, encoder-decoder architectures badrinarayanan2017segnet; lin2017refinenet; deeplabv3plus2018, Conditional Random Fields (CRF) for post-processing zheng2015conditional; chen2014semantic; chen2018deeplab, spatial pyramid pooling to capture multi-scale features zhao2017pyramid; chen2018deeplab; chen2017rethinking, and Neural Architecture Search (NAS) zoph2016neural to search for the best-performing architectures dpc2018; autodeeplab2019. Nonetheless, such models usually require intensive computational resources, and applying them frame-by-frame in real-time video segmentation can lead to undesirably high latency. To address this, several light-weight architectures were proposed nekrasov2018light; zhao2018icnet. Unfortunately, due to the absence of effective interpolation from previous frames, these solutions cannot produce temporally consistent results.

Dynamic video segmentation   Clockwork ConvNet shelhamer2016clockwork promoted the idea of dynamic segmentation by fixing part of the network to avoid unnecessary computations. Deep Feature Flow (DFF) zhu2017deep proposed to accelerate video recognition by leveraging optical flow (extracted by FlowNet zhu2017flow; ilg2017flownet or SpyNet ranjan2017optical) to warp key-frame features. A similar idea can also be found in xu2018dynamic; jain2018accel; nilsson2018semantic; gadde2017semantic. Jain and Gonzalez jain2018inter proposed to use block motion vectors in compressed videos to achieve fast feature interpolation; however, it is difficult to adapt this method to online scenarios. Mahasseni et al. mahasseni2017budget employed a single convolution layer with uniform filters as the interpolation model. Nevertheless, Li et al. li2018low argued that such uniform convolution filters do not reflect the varying motion across frames, and thus proposed to use a spatially-variant convolution for propagation. On the other hand, NAS zoph2016neural was utilised by nekrasov2019architecture to explore the best architecture for the interpolation model.

Although various feature propagation techniques have been explored, research on key scheduling policy is limited. Most existing works adopted fixed key schedulers nekrasov2019architecture; zhu2017deep; jain2018accel; jain2018inter, which is inefficient for real-time segmentation. Mahasseni et al. mahasseni2017budget suggested a budget-aware, LSTM-based key selection strategy trained with reinforcement learning; however, it is only applicable to offline scenarios. Inspired by DFF zhu2017deep, DVSNet xu2018dynamic used an adaptive key decision network which takes as input the optical flow features between key-current image pairs, and computes the similarity score between the current interpolated segmentation mask (if non-key) and the prediction from the image segmentation model (if key). If this score is lower than a threshold, the frame is treated as a key frame, and vice versa. Li et al. li2018low introduced a dynamic key scheduler which was trained to predict the deviation degree between two frames from the deviations of their low-level features. Similarly, zhu2018towards proposed to adaptively determine key frames based on the number of positions where temporal features are inconsistent. All these schedulers were trained to learn the deviation degree between two frames and lacked an understanding of the global context. In contrast, our key scheduler leverages reinforcement learning to derive a temporally consistent policy that maximises overall performance.

Semantic face segmentation   The study of semantic face segmentation has received far less attention than that of image/video segmentation. Early works on this topic were mostly based on engineered features kae2013augmenting; smith2013exemplar; warrell2009labelfaces; scheffler2011joint; yacoob2006detection; lee2008markov and were designed for static images. Saito et al. saito2016real employed a graph-cut algorithm to refine the segmentation probability maps obtained from an FCN trained with augmented data. In nirkin2018face, a semi-supervised data collection approach was proposed to generate a large number of labelled facial images with random occlusions for FCN training. Recently, Wang et al. wang2018face integrated Conv-LSTM xingjian2015convolutional with FCN long2015fully to extract face masks from video sequences. Despite its improved accuracy over vanilla FCN, its run-time speed did not improve; moreover, simply replacing FCN with more recent segmentation models deeplabv3plus2018; autodeeplab2019; badrinarayanan2017segnet would already yield better performance. None of the aforementioned works has considered the video dynamics, and their performances are overshadowed by those of newer FCN variants badrinarayanan2017segnet; deeplabv3plus2018; chen2018deeplab; autodeeplab2019. To bridge this gap, we propose to combine the DFF framework zhu2017deep with the advanced Deeplab-V3+ segmentation approach deeplabv3plus2018 and the FlowNet2-s model ilg2017flownet, and integrate our proposed reinforcement-based key scheduler to devise an effective and efficient real-time face video segmentation framework.

Reinforcement learning   In model-free Reinforcement Learning (RL), an agent receives a state s_t at each time step t from the environment, and learns a policy pi_theta with parameters theta that guides the agent to take an action a_t so as to maximise the cumulative reward R = sum_t gamma^t r_t. RL has demonstrated impressive performance in various fields such as robotics and complicated strategy games lillicrap2015continuous; silver2016mastering; mnih2013playing; vinyals2017starcraft; silva2017moba. In this paper, we show that RL can be seamlessly applied to the online key decision problem in real-time video segmentation, which can be seen as a Markov Decision Process (MDP). Among various RL approaches, we chose the policy gradient method Reinforce williams1992simple to learn pi_theta, where gradient ascent is used to maximise the objective function J(theta).

3 Methodology

Figure 2: An overview of our system. I_k is the last key frame (key decision process not shown), with feature f_k extracted by N_feat. For an incoming frame I_i, its input state s_i includes two components: the deviation information D_i between I_i and I_k, and the expert information E_i about the decision history. D_i is fed into the Conv0 layer of the policy network P, while E_i is concatenated to the output of the FC2 layer. Based on s_i, P outputs probabilities for taking the key or non-key action. For a non-key action, the optical flow between I_k and I_i is used to warp f_k to the current frame's feature, while for a key action, I_i goes through N_feat to obtain a new key feature.

3.1 System Overview

Our target is to develop an efficient and effective key scheduling policy for the dynamic video segmentation system. To this end, a feature propagation framework is essential; we therefore adopted Deep Feature Flow (DFF) zhu2017deep, where the optical flow is calculated by a light-weight flow estimation model F such as FlowNet zhu2017flow; ilg2017flownet or SpyNet ranjan2017optical. Specifically, an image segmentation model N can be divided into a time-consuming feature extraction module N_feat and a task-specific module N_task. We denote the last key frame as I_k and its features extracted by N_feat as f_k, i.e., f_k = N_feat(I_k). For an incoming frame I_i, if it is a key frame, its feature is f_i = N_feat(I_i) and the segmentation mask is N_task(f_i); if not, instead of using the resource-intensive module N_feat for feature extraction, its feature is propagated by a feature interpolation function W, which involves the flow field M_{i->k} from I_i to I_k, the scale field S_{i->k} from I_i to I_k, and the key frame feature f_k; hence the predicted mask becomes N_task(W(f_k, M_{i->k}, S_{i->k})). Please refer to zhu2017deep for more details on the feature propagation process.
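The non-key branch above hinges on warping the key-frame feature f_k with the estimated flow. A minimal sketch of such feature propagation (nearest-neighbour sampling for brevity; DFF uses bilinear sampling and a learned scale field, and the function name here is illustrative, not the paper's actual API):

```python
import numpy as np

def warp_features(feat_k, flow, scale=None):
    """Propagate key-frame features to the current frame.

    feat_k: (H, W, C) key-frame feature map f_k.
    flow:   (H, W, 2) flow field from the current frame back to the key
            frame, in (dy, dx) order.
    scale:  optional (H, W, C) multiplicative scale field.
    """
    H, W, _ = feat_k.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source coordinates in the key frame (rounded; real DFF is bilinear).
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    warped = feat_k[src_y, src_x]          # gather: (H, W, C)
    if scale is not None:
        warped = warped * scale
    return warped
```

With zero flow and no scale field, the function simply returns the key-frame features unchanged, which matches the intuition that identical frames need no correction.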

On top of the DFF framework, we design a light-weight policy network P to make online key predictions. The state s_i at frame I_i consists of two parts: the deviation information D_i, which describes the differences between I_i and I_k, and the expert information E_i regarding the key decision history (see Section 3.2 for details), i.e., s_i = (D_i, E_i). Taking s_i as input, the policy network P outputs the action probabilities (p_0, p_1), where p_0 + p_1 = 1 (we define a = 0 for the non-key action and a = 1 for the key one). For an incoming frame I_i, if p_1 >= tau, where tau is a threshold, it is identified as a key frame, and vice versa. In general, the key action a = 1 leads to a segmentation mask of better quality than the one given by the non-key action a = 0.

In this work, we utilise the FlowNet2-s model ilg2017flownet as the optical flow estimation function F. DVSNet xu2018dynamic has shown that the high-level features of FlowNet models contain sufficient information about the deviations between two frames, and they can be fetched along with the optical flow at no additional cost. Therefore, we adopt the features of the FlowNet2-s model as D_i. It is worth noting that, by defining D_i appropriately, our key scheduler can be easily integrated into other dynamic segmentation frameworks jain2018accel; li2018low; nekrasov2019architecture; jain2018inter; zhu2018towards which do not use optical flow. Fig. 2 gives an overview of our system.
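Putting the pieces together, the online control flow of the system can be sketched as follows. The interfaces for F, N_feat, N_task, P and the warping step are placeholders for illustration, not the paper's actual API:

```python
def segment_stream(frames, flow_net, feat_net, task_net, policy_net,
                   warp, tau=0.5):
    """Online dynamic segmentation: the first frame is a key frame; each
    later frame is routed by the policy network's key posterior vs. tau.
    """
    masks, history = [], []
    key_frame, key_feat = None, None
    for frame in frames:
        if key_frame is None:                       # first frame: forced key
            is_key = True
        else:
            flow, flow_feat = flow_net(key_frame, frame)   # F
            p_key = policy_net(flow_feat, history)         # P(a=1 | s_i)
            is_key = p_key >= tau
        if is_key:
            key_frame, key_feat = frame, feat_net(frame)   # N_feat
            feat = key_feat
        else:
            feat = warp(key_feat, flow)             # propagate f_k
        history.append(int(is_key))
        masks.append(task_net(feat))                # N_task
    return masks
```

In the real system, `history` is what feeds the expert-information channels of the state, and the heavy N_feat call is skipped on every non-key frame, which is where the speed-up comes from.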

3.2 Training Policy Network

Network structure   Our policy network P comprises one convolution layer and four fully connected (FC) layers. The FlowNet2-s feature D_i is fed into the first convolution layer Conv0 with 96 channels, followed by FC layers (FC0, FC1 and FC2) with output sizes of 1024, 1024 and 128, respectively. Two additional channels containing expert information about the decision history are concatenated to the output of the FC2 layer. The first channel records the Key All Ratio (KAR), the ratio of key frames to all frames in the decision history, while the second channel contains the Last Key Distance (LKD), the interval between the current frame and the last key frame. KAR provides information on the frequency of historical key selection, and LKD gives awareness of the length of continuous non-key decisions. The insertion of KAR and LKD thus extends the output dimension of FC2 to 130, and the FC3 layer summarises all this information and gives the action probabilities (p_0, p_1), where p_0 and p_1 correspond to the non-key and key actions, respectively.
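The two expert channels can be computed from the decision history in a few lines. A sketch of the quantities described above; any normalisation applied before feeding them to the network is an assumption left out here:

```python
def expert_info(history):
    """Compute (KAR, LKD) from a key-decision history (1 = key, 0 = non-key).

    KAR: fraction of key decisions in the history.
    LKD: number of consecutive non-key decisions since the last key frame.
    """
    if not history:
        return 0.0, 0
    kar = sum(history) / len(history)   # Key All Ratio
    lkd = 0                             # Last Key Distance
    for a in reversed(history):
        if a == 1:
            break
        lkd += 1
    return kar, lkd
```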

Reward definition   We use the widely-used mean Intersection-over-Union (mIoU) as the metric to evaluate the segmentation masks. We denote the mIoU of the mask of frame I_i produced by a non-key action as m_i^n and that produced by a key action as m_i^k, and the reward r_i at frame I_i is defined in Eq. 1:

r_i = m_i^k - m_i^n, if a_i = 1;   r_i = 0, if a_i = 0.   (1)

Such a definition encourages the scheduler to choose the key action on the frames that would gain the largest improvement over the non-key action, which also reduces the variance of mIoUs across the video.
Constraining key selection frequency   Constraints on the key selection frequency are necessary in our task. Since a key action generally leads to a better reward than a non-key one, the policy network inclines towards all-key decisions if no constraint is imposed on the frequency of key selection. In this paper, we propose a strategy that stops the episode immediately once the key budget is exceeded. Particularly, for one episode consisting of frames I_s to I_e, the agent starts from I_s and explores continuously towards I_e. At each time step, if the KAR in the decision history has already surpassed a limit eta, the agent stops immediately and the episode ends; otherwise, it continues until reaching the last frame I_e. Under this strategy, a policy network should limit the use of key decisions to avoid over-early stopping, and also learn to allocate the limited key budget to the frames with higher rewards. By varying the KAR limit eta, we can train policies with different key decision frequencies.
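One trial under this stopping rule can be sketched as follows; the `policy` and `reward` callables are placeholders for the key posterior and the per-frame reward of Eq. 1, and the sampling details are illustrative:

```python
import random

def rollout(frames, policy, reward, kar_limit):
    """Run one trial with the stop-immediately-upon-exceeding-the-limit
    strategy: the episode ends as soon as the KAR in the decision
    history surpasses kar_limit (eta).
    """
    history, ret = [], 0.0
    for t, frame in enumerate(frames):
        a = 1 if random.random() < policy(t) else 0   # sample key/non-key
        history.append(a)
        if sum(history) / len(history) > kar_limit:
            break            # key budget exceeded: episode stops early
        ret += reward(frame, a)
    return ret, history
```

Note how an over-eager policy forfeits the remaining return: taking a key action too early pushes KAR above eta and terminates the episode, which is exactly the pressure that teaches the agent to spend its key budget on high-reward frames.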

Episode settings   Real-time videos usually contain an enormous number of high-dimensional frames, so it is impractical to include all of them in one episode, due to the high computational complexity and the possibly huge variations across frames. For simplicity, we limit the length of one episode to 270 frames (9 seconds in our dataset), which should cover most circumstances. We vary the starting frame during training to learn global policies across videos. For each episode, we let the agent run M times (with the aforementioned key constraint strategy) to obtain M trials, which reduces variance. The return of each episode can be expressed as R = (1/M) * sum_{j=1}^{M} sum_{t=s}^{s+T_j} r_t^(j), where s is the starting frame index of the episode, T_j denotes the total number of steps in the j-th trial (since the agent may stop before 270 steps), and r_t^(j) refers to the reward at frame t in the j-th trial. R is also the main objective function to optimise.

Auxiliary losses   In addition to optimising the cumulative reward R, we apply two auxiliary losses for regularisation. Following mnih2016asynchronous; pang2018meta, we employ an entropy loss H(pi_theta) to promote policies that retain high-entropy action posteriors, so as to avoid over-confident actions. We also add an L2-norm loss on the network weights for weight decay. Eq. 2 shows the final objective function, which is optimised using the policy gradient (Reinforce) method williams1992simple:

J(theta) = R + alpha * H(pi_theta) - beta * ||theta||_2^2.   (2)
Epsilon-greedy strategy   During training, the agent may still fall into over-deterministic dilemmas with action posteriors approaching 1, even though the auxiliary entropy loss has been added. To recover from such dilemmas, we implement a simple strategy similar to the epsilon-greedy algorithm for action sampling: in the cases where an action probability p_a exceeds a threshold epsilon (such as 0.98), instead of taking action a with probability p_a, we take action a with probability epsilon (and the other action with probability 1 - epsilon).
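This sampling rule can be sketched in a few lines; a minimal illustration of the strategy described above, with the exact tie-breaking at p = 0.5 being an assumption:

```python
import random

def sample_action(p_key, eps=0.98):
    """Sample key (1) / non-key (0) from the posterior p_key, clipping
    over-deterministic posteriors: if either action's probability
    exceeds eps, that action is taken with probability eps, leaving a
    (1 - eps) chance of exploring the other action.
    """
    greedy = 1 if p_key >= 0.5 else 0
    p_greedy = max(p_key, 1.0 - p_key)
    if p_greedy > eps:                      # over-deterministic posterior
        return greedy if random.random() < eps else 1 - greedy
    return 1 if random.random() < p_key else 0
```

Even with a posterior of exactly 1.0, the minority action is still sampled about 2% of the time, which is what lets the agent escape the over-deterministic dilemma.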

4 Experiments

4.1 Dataset

We conducted our experiments on the 300 Videos in the Wild (300VW) dataset shen2015first. This dataset contains 114 videos (captured at 30 FPS) with an average length of 64 seconds, all taken in unconstrained environments. Following wang2018face, we cropped faces out of the video frames and generated segmentation labels with facial skin, eyes, outer mouth and inner mouth classes for all 218,595 frames. For our experiments, we divided the videos into three subject-independent sets, A, B and C, with 93 / 9 / 12 videos, respectively. For training the preliminary networks, i.e., the image segmentation model N and the flow estimation model F, we used sets A, B and C for training, validation and testing, respectively, while for the training of the key scheduler, we used set B for training and validation and retained set C for testing. In detail, for training N, we randomly picked 18,570 / 1,740 / 2,400 frames from sets A / B / C for training / validation / testing. For the training of the flow estimation model F, we randomly generated 51,480 / 4,836 / 6,671 key-current image pairs with a varying gap of 1 to 30 frames from sets A / B / C for training / validation / testing. We intentionally excluded set A from policy network learning, since this set had already been used to train N and F; instead, we used the full set B (16,568 frames) for training and validating the RL model, and evaluated it on the full set C (22,580 frames).

4.2 Experimental Setup

Evaluation metric   We employed the commonly-used mean Intersection-over-Union (mIoU) as the evaluation metric. It is worth mentioning that we excluded the background class from the mIoU calculation for a fairer evaluation, i.e., the mIoU metric averages the IoUs of the facial skin, eyes, outer mouth and inner mouth classes only, which may lead to comparatively lower mIoU values than when the background is included.
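The metric can be computed as below; a straightforward numpy sketch of the protocol above (per-class IoU averaged over the non-background classes), with the handling of classes absent from both prediction and ground truth being an assumption:

```python
import numpy as np

def miou_no_background(pred, gt, num_classes, background=0):
    """mIoU over the facial classes only: the background class is
    excluded from the average, as in the evaluation protocol above.
    """
    ious = []
    for c in range(num_classes):
        if c == background:
            continue
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:            # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```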

Training preliminary networks   To enable a key scheduler to work in a Deep Feature Flow framework, preliminary networks, i.e., the image segmentation model N and the flow estimation function F, are required. We borrowed the state-of-the-art Deeplab-V3+ architecture deeplabv3plus2018 for model N, and we selected MobileNet-V2 mobilenetv22018 as its backbone, considering the balance between performance and running speed. For the flow estimation function F, we adopted the FlowNet2-s architecture ilg2017flownet. For training N, we initialised the weights using the pre-trained model provided in deeplabv3plus2018 and fine-tuned the model, setting the output stride and decoder output stride to 16 and 4, respectively. We divided N into N_feat and N_task, where the output of N_task is the class posterior for each image pixel. We then fine-tuned the FlowNet2-s model as suggested in zhu2017deep by freezing N_feat and N_task, using the pre-trained weights provided in ilg2017flownet as the starting point for training F. The input sizes for N and F are both set to 513*513.

RL settings   For the state s_i, following DVSNet xu2018dynamic, we leveraged the features from the Conv6 layer of the FlowNet2-s model as the deviation information D_i, and we obtained the expert information E_i from the last 90 decisions. During the training of the policy network P, N_feat, N_task and F were frozen to avoid unnecessary computations. We chose RMSProp tieleman2012lecture as the optimiser and set the initial learning rate to 0.001. The entropy and weight-decay coefficients in Eq. 2 were set to 0.14 and 0.001, respectively. We empirically set the discount factor to 1.0, as the per-frame performance is equally important in our task. The value of epsilon in the epsilon-greedy strategy was set to 0.98. During training, we set the threshold tau for determining the key action to 0.5.

The maximum length of each episode was set to 270 frames (9 seconds), and we ran a relatively large number of 32 trials per episode, with the returns of these trials normalised to stabilise gradients. A mini-batch size of 8 was used for back-propagation in P. Starting from random weights, we trained each model for 2,400 episodes and validated the performance of the checkpoints on the same set to find the best one, which was then evaluated on the test set. We also varied the KAR limit eta to obtain policy networks with different key decision tendencies.

Baseline comparison   For a fair comparison, we applied the same N_feat, N_task and F models to all approaches and only evaluated the key scheduling part. We used plots of average key interval versus mIoU to measure overall performance.

We compared our reinforcement key scheduler with three baseline approaches: (1) the adaptive key decision model in DVSNet xu2018dynamic; (2) the adaptive key scheduler using flow magnitude difference in xu2018dynamic; (3) Deep Feature Flow with a fixed key scheduler zhu2017deep. For the implementation of DVSNet, we used the code provided by the authors and trained it with 32,049 image pairs generated from the key scheduler's training set; by varying its key score threshold, we obtained its performance curve. For the computation of flow magnitude, we refer the readers to xu2018dynamic; note that we varied its difference threshold to draw the results. For DFF with a fixed scheduler, we simply changed the key interval to obtain different results. To obtain the final results of our method, we trained six models with eta set to 0.04, 0.06, 0.1, 0.12, 0.14 and 0.25, respectively, and then varied the threshold tau to obtain more data points for the performance curve.


Implementation details   We implemented our method in the Tensorflow tensorflow2015-whitepaper framework. Experiments were run on a cluster with eight NVidia 1080 Ti GPUs, and we evaluated the running speed on a desktop with one NVidia 1080 Ti GPU. It took approximately 2.5 days to train one model with RL.

4.3 Results

Preliminary networks   We evaluated four image segmentation models: FCN long2015fully with the VGG16 simonyan2014very architecture, Deeplab-V2 chen2018deeplab (VGG16 version), Deeplab-V3+ deeplabv3plus2018 with the Xception-65 chollet2017xception backbone, and Deeplab-V3+ with the MobileNet-V2 mobilenetv22018 backbone. As can be seen from Table 1, Deeplab-V3+ with the MobileNet-V2 backbone strikes the best balance between speed (58.8 FPS) and accuracy (60.1% mIoU); we therefore selected it as our segmentation model N. Its feature extraction part N_feat was used to extract the key-frame features of key-current image pairs during the training of the FlowNet2-s ilg2017flownet model F, whose performance was evaluated by the interpolation results on the current frames. From Table 1 we can see that feature interpolation with F is generally much faster than the per-frame segmentation models, at the cost of a slight drop in mIoU (from 60.1% to 56.56%). In a live-video scenario, this loss of accuracy can be effectively remedied by a good key scheduler.

Models Methods mIoU(%) FPS
FCN (VGG16) Per Frame 55.71 45.5
Deeplab-V2 (VGG16) Per Frame 58.66 3.44
Deeplab-V3+ (Xception-65) Per Frame 60.69 24.4
Deeplab-V3+ (MobileNet-V2) Per Frame 60.1 58.8
FlowNet2-s (N: Deeplab-V3+ with MobileNet-V2) Feature Propagation 56.56 153.8
Table 1: The performance of various image segmentation models and FlowNet2-s. mIoU does not include the background class. FPS is evaluated on an NVidia 1080 Ti GPU for one face image of 513*513 size.

RL training visualisation   In the upper row of Fig. 3, we show the average return during RL training with different KAR limits eta (0.04, 0.06 and 0.14). It can be seen that, even though we select the starting frame of each episode randomly, the return curves still exhibit a generally increasing trend despite several fluctuations. This validates the effectiveness of our measures for reducing variance and stabilising gradients, and it also verifies that the policy is improving towards more rewarding key actions. Besides, as the value of eta increases and allows for more key actions, the maximum return achieved by each curve intuitively becomes higher.

We also visualised the influence of the two expert information channels, KAR and LKD, by plotting their weights in P during RL training. As shown in the bottom row of Fig. 3, we plotted the weights of the two channels in the FC3 layer that receive KAR and LKD as input and contribute to the key posterior p_1. From these plots we observe that the weights of the LKD channel show a globally rising trend, while those of the KAR channel decrease continuously. These trends indicate that the KAR and LKD information becomes increasingly important for key decisions as training proceeds, since a large LKD value (or a small KAR) will encourage the scheduler to take the key action. This is an intuitive consequence of the key constraint strategy we have applied. Furthermore, the plots imply that the key scheduler tends to rely more on the LKD channel than on the KAR channel with a lower eta such as 0.04; conversely, KAR becomes more significant with a higher eta such as 0.14.

Figure 3: The upper row plots the average return curves during RL training with eta set to 0.04, 0.06 and 0.14. The bottom row illustrates the variations of the weights of the KAR and LKD channels contributing to the key posterior p_1. Plots in the same column are from the same training session.

Performance evaluation   The results of our RL key scheduler and the three baseline schedulers are shown in Fig. 4. We can see that the performances of all methods are similar for key intervals of less than 20. This is to be expected, as the performance degradation on non-key frames can be compensated by dense key selection. Our method starts to show superior performance when the key interval increases beyond 25, where our mIoUs are consistently higher than those of the other methods and decrease more slowly as the key interval increases. It should be noted that, in the case of face videos, selecting key frames at a small interval does not significantly affect the performance, unlike in autonomous driving scenarios zhu2017deep; xu2018dynamic; jain2018accel. This can be attributed to the fact that variations between consecutive frames in face videos are generally smaller than those in autonomous driving. As a result, we can gain more efficiency by using a key scheduling policy with a relatively large interval for the dynamic segmentation of face videos.

Figure 4: Comparison between ours and baseline key schedulers. No background class in mIoU.

Running speed   The average run-time of our key scheduler is 1.1 ms per frame. On average, segmentation takes 7.6 ms for a non-key frame and 23.6 ms for a key frame. With 10% / 20% / 30% key frames, the average frame rate of our system is 108 / 93 / 81 FPS, respectively, all faster than real-time and faster than running the original N on every frame (58.8 FPS).
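These frame rates follow from a weighted average of the per-frame timings. A back-of-the-envelope check (whether the 1.1 ms scheduler cost is already folded into the per-frame times is an assumption):

```python
def system_fps(key_ratio, t_key=23.6, t_nonkey=7.6):
    """Average throughput (FPS) given a key-frame ratio and per-frame
    processing times in milliseconds."""
    avg_ms = key_ratio * t_key + (1.0 - key_ratio) * t_nonkey
    return 1000.0 / avg_ms

# e.g. system_fps(0.1) = 1000 / 9.2 ≈ 108.7 FPS, well above real-time (30 FPS)
```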

5 Conclusions

In this paper, we proposed to learn an efficient and effective key scheduler via reinforcement learning for dynamic face video segmentation. Through the utilisation of expert information and appropriately designed training strategies, our key scheduler achieves more effective key decisions than baseline methods over most average key intervals. This is the first work to apply dynamic segmentation techniques with RL to real-time face videos, and we hope it can inform future work on real-time face segmentation and on dynamic video segmentation.