Approximating a 3DCNN with a 2DCNN
Spatiotemporal representations learnt using 3D convolutional neural networks (CNN's) are currently the state of the art for action related tasks. However, 3D-CNN's are notorious for being memory and compute intensive. 2D-CNN's, on the other hand, are much lighter on computing resources and are faster, but their performance on action related tasks is generally inferior to that of 3D-CNN's. Also, whereas 3D-CNN's simultaneously attend to appearance and salient motion patterns, 2D-CNN's are known to take shortcuts and recognize actions just by attending to the background, which is not very meaningful. Taking inspiration from the fact that we humans can intuit how actors will act and how objects will be manipulated, through years of experience and a general understanding of "how the world works," we suggest a way to combine the best attributes of 2D- and 3D-CNN's: we propose to hallucinate spatiotemporal representations, as computed by 3D-CNN's, using a 2D-CNN. We believe that requiring the 2D-CNN to "see" into the future would encourage it to gain a deeper understanding of actions and of how scenes evolve, by providing a stronger supervisory signal. The hallucination task is treated as an auxiliary task, while the main task is any other action related task, such as action recognition. Thorough experimental evaluation shows that the hallucination task indeed helps improve performance on action recognition, action quality assessment, and dynamic scene recognition. From a practical standpoint, being able to hallucinate spatiotemporal representations without an actual 3D-CNN would enable deployment in resource-constrained scenarios, such as lower-end phones and edge devices, and/or with lower bandwidth. This translates to wider adoption of Video Analytics Software as a Service (VA SaaS), e.g., automated physiotherapy options for financially challenged demographics.
Spatiotemporal representations are densely packed with information regarding the appearance and salient motion patterns occurring in video clips, as illustrated in Fig. 2. Due to this representational power, 3D-CNN's are currently the best performing models on action related tasks like action recognition [42, 21, 14, 6], action quality assessment [31, 30], skills assessment, and action detection. This representational power comes at the cost of increased computational complexity [60, 50, 58, 13]. Hadidi et al. recently conducted an exhaustive comparison of various CNN’s from the perspective of computational cost. We cite some of their findings in Table 1 to show how much costlier 3D-CNN’s are than 2D-CNN’s. For further analysis regarding deployment of CNN’s on edge devices, we refer readers to the extensive study reported in . Very high compute resource requirements make 3D-CNN’s unsuitable for deployment in resource-constrained scenarios.
2D-CNN’s are generally used for learning and extracting spatial features pertaining to a single frame/image. As such, typical 2D-CNN’s, by design, do not take any motion information into account. Some works [48, 49, 33, 10] have addressed this by using optical flow. Optical flow fires at all pixels that have moved or changed (refer to Fig. 2), which means that it also picks up cues from irrelevant activity happening in the background. A 3D-CNN, on the other hand, attends to salient motion patterns characteristic of an action class. As a matter of fact, 2D-CNN’s can also find shortcuts to recognize actions: instead of recognizing an action meaningfully from the foreground, a 2D-CNN may pick up enough cues from the background, as reported in [20, 15]. Such shortcuts might get the job done, but they are not very meaningful. However, 2D-CNN’s have the advantage of being computationally lightweight, which makes them suitable for deployment on edge devices.
|Time per inference (ms)||200||90|
In a nutshell, 2D-CNN’s have the advantage of being computationally less expensive, while 3D-CNN’s extract spatiotemporal features that have more representational power. In our work, we propose a way to combine the best of both worlds. Our inspiration comes from the observation that, given an image of a scene, humans can predict how the scene around them would evolve. They are able to do so because they have a good general understanding of how other people are expected to behave and how objects would move or be manipulated. Building machines/computer vision systems with such capabilities has been a long-standing goal. To this end, we propose to hallucinate spatiotemporal representations, as computed by a 3D-CNN, using a 2D-CNN and from a single still frame (see Fig. 1).
Conceptually, encouraging a 2D-CNN to predict spatiotemporal representations pertaining to 16 frames from a single frame provides a strong supervisory signal, which would help the 2D-CNN gain a deeper understanding of actions and of how a given scene evolves with time.
Practically, predicting spatiotemporal representations, instead of actually computing, comes in handy in the following situations:
Resource-constrained scenarios: many computer vision efforts in areas like automated physiotherapy, which are targeted at low-income groups, make use of 3D-CNN’s. Low-income demographics are more likely to have devices with low computational resources, which are not suitable for running 3D-CNN’s; in these cases, we can just hallucinate spatiotemporal representations.
Limited and/or expensive bandwidth: VA SaaS is increasingly being employed. Bandwidth used for communicating between clients and the cloud is usually limited and expensive. Using our method, we can hallucinate information pertaining to multiple frames (e.g., 16 frames) from just one frame, which reduces the number of frames to be transmitted from 16 to 1.
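As a rough illustration of the bandwidth argument, the sketch below counts frames sent per clip. It assumes that every frame costs the same to transmit (i.e., it ignores video compression), and the function name `frames_transmitted` is ours, introduced only for this example.

```python
def frames_transmitted(clip_len: int, frames_sent: int = 1) -> dict:
    """Compare sending `frames_sent` frames against sending the whole clip.

    Assumption: every frame has the same transmission cost.
    """
    saved = clip_len - frames_sent
    return {
        "sent": frames_sent,
        "saved": saved,
        "reduction_factor": clip_len / frames_sent,
        "fraction_saved": saved / clip_len,
    }


# One frame is sent in place of a 16-frame clip.
stats = frames_transmitted(16)
print(stats)
```

Under this simplification, 15 of every 16 frames never leave the client.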
We propose to use the hallucination task as an auxiliary task alongside an action related main task, such as action recognition or action quality assessment. Experimentally, we show that incorporating the hallucination loss during training helps in the following four cases: action recognition, fine-grained action recognition, action quality assessment, and dynamic scene recognition.
Our work sits close to predicting features, present or future, from same and different modalities, efficient/light-weight approaches, and knowledge distillation. We briefly visit works that are closest to ours, and compare and contrast our approach against those.
Wang et al.  treat actions as transformations from a precondition to the effect. Essentially, they propose to learn the CNN parameters and transformation matrices, such that the product of features of precondition (initial) frames and transformation matrix will produce the features corresponding to effect (future) frames.
Hoffman et al. proposed to hallucinate the depth modality using the RGB modality, and showed improvements over single modalities and simple fusion of modalities.
Vondrick et al. propose to learn to predict the future in feature space. Their approach also allows for multi-modal predictions. Given the current video frame, they propose to predict the representation of a future frame. However, their future frame representation is computed using a 2D-CNN pretrained on a dataset like ImageNet or Places, which belongs to a non-human-action domain.
While Vondrick et al. propose a framework to generate future frames by disentangling foreground and background, Vondrick and Torralba propose to disentangle low-level details and high-level semantics with the use of a transformer. Learning to generate future frames helps the network learn useful representations that transfer well to other tasks like action recognition. However, our goal is not to predict a pixel-perfect future, but rather to make predictions at the semantic level. Instead of generating future frames, a few works like [48, 49, 33, 10] focus on learning to predict optical flow (very short term motion information) from static images. Gao et al. propose a better optical flow encoding method than previous works [48, 49, 33]. Their approach, by design, requires an encoder and a decoder. Our approach, on the other hand, does not require a decoder, which helps in reducing the computational load on resource-constrained edge devices. Moreover, our approach learns to hallucinate spatiotemporal representations corresponding to a stack of 16 frames, as compared to motion information in two consecutive frames as in [48, 49, 33, 10]. As can be seen in Fig. 2, optical flow attends to all kinds of motion, even irrelevant background motion, while spatiotemporal representations attend only to action relevant salient motion patterns. Through experiments, we confirm the benefits of hallucinating and using spatiotemporal representations over optical flow prediction. Bilen et al. introduce a novel, compact representation of videos, called ‘dynamic image’. Dynamic images can be thought of as a summary of a video in a single image. Computing a dynamic image requires processing all the corresponding frames, whereas in our case, hallucinating requires processing just a single image.
Numerous works have developed approaches to make video processing more efficient, either through using less evidence [59, 1, 3, 39] or through more efficient processing [40, 43, 35, 60, 52, 50, 27, 28].
One of these works proposes to learn a student recurrent neural network that can classify a video using fewer frames. Our goal is to hallucinate 3D-CNN representations using a 2D-CNN from a single frame. We also discuss our intuition that stronger supervision can be artificially provided, without any manual annotation effort, using our hallucination loss. [3, 39] focus on predicting optical flow stream features using the raw RGB image stream. While their method can get rid of the optical flow stream, it still processes all the frames through the RGB stream. Our approach enjoys the benefits of processing fewer frames, reduced computational load, and stronger supervision.
3D convolutions can be factorized into 2D convolutions (spatial convolutions) followed by 1D convolutions (along the temporal dimension). This concept has been studied in numerous works [40, 43, 35, 60, 52], and better designs have been developed that take advantage of this factorization. 3D-CNN’s inherently have a larger number of trainable parameters than their 2D counterparts, because of which 3D-CNN’s might be prone to overfitting. To address this, the use of 2D convolutions along with 3D convolutions has been proposed. Tran et al. explore many 3D-CNN variants and observe that, by replacing 3D convolutions with (2+1)D convolutions, more non-linearities can be made available in the CNN, which may allow it to learn more complex functions. Xie et al. found that 3D convolutions in the bottom layers might be redundant, and may be replaced with 2D convolutions, followed by 3D convolutions in the top layers for better temporal reasoning. Following this design, they obtained better results with lower complexity.
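To make the factorization concrete, the following sketch counts the weights of a full 3D convolution and of its (2+1)D counterpart. The channel sizes and the intermediate width `c_mid` are arbitrary choices for illustration, bias terms are ignored, and the function names are ours rather than taken from any of the cited works.

```python
def conv3d_params(c_in: int, c_out: int, t: int, k: int) -> int:
    """Weights in a full t x k x k 3D convolution (no bias)."""
    return c_in * c_out * t * k * k


def conv2plus1d_params(c_in: int, c_out: int, t: int, k: int, c_mid: int) -> int:
    """Weights in a 1 x k x k spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * k * k   # 2D convolution applied per frame
    temporal = c_mid * c_out * t     # 1D convolution along time
    return spatial + temporal


# Illustrative layer: 64 -> 64 channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, t=3, k=3)
factorized = conv2plus1d_params(64, 64, t=3, k=3, c_mid=64)
print(full, factorized)
```

With these particular sizes, the factorized layer has fewer weights than the full 3D layer; a different `c_mid` changes that balance, so the comparison should be read as an example rather than a general rule.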
Lee et al. introduce MFNet, in which spatiotemporal information is extracted from the feature maps of consecutive appearance blocks and used along with appearance information. This reduces the computational cost in comparison to two-stream approaches. In a concurrent work, Lin et al. introduce a novel Temporal Shift Module (TSM), which allows information exchange among consecutive frames just by shifting channels, and gives strong temporal modeling ability at no additional computational cost.
While these works aim at either using less visual evidence or more efficient processing, our solution of hallucinating spatiotemporal representations using a 2D-CNN from a single image aims to solve both problems, and also provides stronger supervision.
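The idea of exchanging information between frames by shifting channels can be sketched without any deep learning library. In the toy version below, a clip is a list of frames and each frame is a list of per-channel values (a scalar stands in for a whole feature map). The share of shifted channels, set through `fold_div`, is an assumption made for this example, and the function is our own illustration, not the reference implementation of TSM.

```python
def temporal_shift(x, fold_div=8):
    """Shift a share of channels along time; x[t][c] is channel c at frame t.

    The first C // fold_div channels take their value from the next frame,
    the next C // fold_div channels take it from the previous frame, and the
    remaining channels are left in place. Positions without a source frame
    are zero-filled.
    """
    num_frames, num_channels = len(x), len(x[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # value comes from the future frame
            elif c < 2 * fold:
                src = t - 1      # value comes from the past frame
            else:
                src = t          # channel is not shifted
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]
    return out


# Toy clip: 3 frames, 8 channels, value = 10 * frame index + channel index.
clip = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip)
print(shifted[1])
```

The shift itself only moves values around, which is why it adds no multiply-accumulate operations.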
Let’s consider the visualization  of C3D model  shown in Fig. 2, particularly, the instance of gymnast on a balance beam, to understand what a 3D-CNN actually learns to capture. We notice that C3D fires at pixels belonging to the body of the gymnast, and captures the cartwheel done by the athlete over the span of 16 frames.
In order to complete the hallucination task, the 2D-CNN will have to, for example:
learn to identify that there is an actor in the scene and localize them
spatially segment the actors and objects
identify that the event going on is a balance beam gymnastics event, and that the actor is a gymnast
identify that the gymnast is about to attempt a cartwheel
predict how she would move while attempting the cartwheel
approximate the position the gymnast would be in after 16 frames, etc.
Here we have discussed just the case of the action class balance-beam, but readers can imagine the same for other classes. This is a lot of semantic detail to be predicted from a single frame. In a typical action recognition task, the network would have been provided with just the action class label, which may be considered a weak supervision signal. Incorporating the hallucination task during training is equivalent to artificially providing dense labels, a much stronger supervisory signal. Joint actor-action segmentation datasets aim to provide such detailed annotations; actor-action segmentation is an actively pursued research direction [18, 11, 55, 19, 53]. However, following our proposition, we can get detailed supervision of a similar flavor (not exactly the same) for free, which saves tremendous annotation effort. The hallucination loss will encourage the network to focus on actors and objects, and to develop a better general understanding of actions and of how objects are manipulated. 2D-CNN’s will now be less likely to take shortcuts (recognizing actions from the background while ignoring the actual actor and the action being performed [20, 15]), as spatiotemporal features cannot be hallucinated from the background. Moreover, the ability to just hallucinate spatiotemporal representations would allow us to replace 3D-CNN’s with 2D counterparts in resource-constrained scenarios.
To gain the benefits described above, we propose to use the hallucination task. Note that another way to do this would be to predict the future frames in pixel space. However, we are interested in predicting at the semantic level; perfect per-pixel reconstruction is not our goal. So, rather than predicting in pixel space, we propose to predict in feature space.
The hallucination task can also be seen as distilling knowledge from a teacher network (3D-CNN), $f_t$, to a student network (2D-CNN), $f_s$, where $f_t$ is pretrained and then kept frozen, while the parameters of $f_s$ are learnt. Let $\phi_t$ and $\phi_s$ represent mid-level representations from $f_t$ and $f_s$, respectively, and $F_j$ be the $j$th video frame.
The hallucination loss, $\mathcal{L}_{hallu}$ (Eq. 3), encourages $f_s$ to regress $\phi_s$ to $\phi_t$ by minimizing the Euclidean distance between $\sigma(\phi_s)$ and $\sigma(\phi_t)$, where $\sigma$ denotes the sigmoid function:
$$\mathcal{L}_{hallu} = \lVert \sigma(\phi_s) - \sigma(\phi_t) \rVert_2 \qquad (3)$$
The hallucination task is not the only goal. In addition to bringing down the computational cost, we would also like to improve the performance on action related tasks. To this end, we propose to incorporate the hallucination task as an auxiliary task, to be used with the actual action related main task, such as action recognition. So, the main task loss (e.g., classification loss), $\mathcal{L}_{mt}$, is used in conjunction with the hallucination loss, the idea being that the hallucination loss will help with the main task. The overall loss can be expressed as follows,
$$\mathcal{L} = \mathcal{L}_{mt} + \lambda \mathcal{L}_{hallu} \qquad (4)$$
where $\lambda$ is a loss balancing factor. Our approach is presented in Fig. 3. Realization of our approach is very straightforward.
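The combined objective can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions: features are flat lists of floats, the Euclidean distance is taken after a sigmoid (following the later remark that the loss is computed on sigmoid activations), and the names `hallucination_loss`, `total_loss` and `lam` are hypothetical.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def hallucination_loss(student_feat, teacher_feat) -> float:
    """Euclidean distance between sigmoid-squashed student and teacher features."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("student and teacher features must have the same length")
    squared = sum(
        (sigmoid(s) - sigmoid(t)) ** 2
        for s, t in zip(student_feat, teacher_feat)
    )
    return math.sqrt(squared)


def total_loss(main_task_loss: float, hallu_loss: float, lam: float = 50.0) -> float:
    """Main task loss plus the hallucination loss weighted by a balancing factor."""
    return main_task_loss + lam * hallu_loss


# Toy 4-dimensional features; the teacher features are treated as fixed targets.
teacher = [0.2, -1.0, 3.0, 0.5]
student = [0.0, -0.5, 2.0, 0.5]
loss = total_loss(main_task_loss=1.3, hallu_loss=hallucination_loss(student, teacher))
print(loss)
```

In an actual training loop the teacher features would be computed without gradient tracking, since only the student is updated.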
We hypothesized that incorporating the hallucination task would help by providing a deeper understanding of actions. We evaluate the effect of incorporating the hallucination task on the following action related tasks: action recognition, fine-grained action recognition, action quality assessment, and dynamic scene recognition.
In principle, any 2D- and 3D-CNN’s can be used as student and teacher networks, respectively. We choose ResNeXt-101 as our teacher network, and VGG11-bn as our student model. Unless mentioned otherwise, the teacher network is pretrained on the UCF-101 dataset and kept frozen. The student model is pretrained on the ImageNet dataset. We name the network trained with the hallucination loss HalluciNet, and the one trained without it just 2D-CNN or vanilla 2D-CNN.
We choose to hallucinate the activations of the last bottleneck group of ResNeXt-101, which are 2048-dimensional. Representations from shallower layers have higher dimensionality and are less semantically meaningful.
We use PyTorch to implement all the networks. Network parameters are optimized using the Adam optimizer with a starting learning rate of 0.0001. The loss balancing factor in Eq. 4 is set to 50, unless specified otherwise. Further experiment-specific details are given along with each experiment. We will make our code publicly available.
Our baseline to compare the performance is a 2D-CNN with same architecture, but which was trained without hallucination loss. In addition, we also compare the performance against other methods, which we specify in each experiment.
In the first experiment, we evaluate whether the hallucination task helps with general action recognition. We compare the performance with an approach that predicts dense optical flow from a static image, and with an approach that predicts motion from a static image.
The UCF-101 and HMDB-51 action recognition datasets are considered. In order to be consistent with the literature, we adopt its experimental protocol. Central frames from the train and test samples are used for reporting performance; these sets are named UCF-static and HMDB-static, as in the literature.
We report top-1 clip-level accuracy (in %).
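For reference, top-1 accuracy can be computed as in the short sketch below. The scores and labels are made-up toy values, and `top1_accuracy` is a helper name introduced here for illustration.

```python
def top1_accuracy(scores, labels) -> float:
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = 0
    for sample_scores, label in zip(scores, labels):
        predicted = max(range(len(sample_scores)), key=lambda c: sample_scores[c])
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Three toy samples with two classes each.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
accuracy = top1_accuracy(toy_scores, toy_labels)
print(round(accuracy, 2))
```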
We considered two cases as shown in Fig. 4. We found that fusing the hallucinated representations yielded better results. So we will consider that case in the remainder of the work.
First of all, we show the evolution of the hallucination loss in Fig. 5. From the gradual decrease in its value, we can clearly see that the 2D-CNN is learning to hallucinate the spatiotemporal representations. The starting value of the loss is small because we compute the loss after passing the activations through a sigmoid layer.
We summarize the performance on the action recognition task in Table 2. We find that, on both datasets, incorporating the hallucination task helps. Our HalluciNet outperforms prior approaches [49, 10] on UCF101. On HMDB51, our HalluciNet yields better results than , but  works better than ours. However, our method has the advantage of being computationally lighter than , as it does not use a flow image generator network.
|Method||UCF-static||HMDB-static|
|App stream ||63.60||35.10|
|App stream ensemble ||64.00||35.50|
|Motion stream ||24.10||13.90|
|Motion stream ||14.30||04.96|
|App + Motion ||65.50||37.10|
|App + Motion ||64.50||35.90|
|Nibali et al. ||3D||16||74.79||98.30||78.75||77.34||79.89|
We need to find suitable tasks to evaluate the utility of hallucinating the future. Evaluating performance on the ubiquitous task of recognizing actions in typically used datasets, like the UCF-101 action recognition dataset, might not be sufficient. We need to evaluate on a task where the student network is required to hallucinate the future in order to “fill the holes” in the input visual datastream. Fine-grained or detailed action recognition makes for a good candidate task.
In Olympic diving, athletes attempt many different types of dives. In a general action recognition dataset, like UCF101, all these dives would be grouped under a single action class, Diving. However, these dives differ from each other in subtle ways. Each dive has the following five components: a) Position (legs straight or bent?); b) starting from an Armstand or not?; c) Rotation direction (backwards, forwards, etc.); d) how many times the diver Somersaulted; e) how many times the diver Twisted. Each combination of these components produces a unique type of dive. The task is to predict all five components of a dive, using very few frames.
Unlike in general action recognition datasets like UCF-101 or Kinetics, the actions in the diving samples of this dataset vary very subtly. Furthermore, the cues needed to differentiate or recognize a dive are distributed across the entire action sequence. So, to make the dive classification task more suitable for our case, we ask the network to classify a dive correctly using only a few frames. In particular, only every 16th frame is shown to the student network. We truncate diving samples to 96 frames. So, out of 96 frames, the student network is shown only 6 frames, based on which it needs to classify the dive.
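The frame subsampling described above can be written as a one-line index computation. The sketch assumes that sampling starts at the first frame (`offset=0`); the text does not state which frame within each block of 16 is used, so the offset is exposed as a parameter, and the function name is ours.

```python
def sample_frame_indices(num_frames: int = 96, stride: int = 16, offset: int = 0):
    """Indices of the frames shown to the student network.

    Assumption: sampling starts at `offset` and takes every `stride`-th frame.
    """
    return list(range(offset, num_frames, stride))


indices = sample_frame_indices()
print(indices)  # [0, 16, 32, 48, 64, 80]
```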
For this task, we use a recently released Diving dataset, MTL-AQA , which has 1059 training and 353 test samples.
We take a teacher network pretrained on UCF-101 dataset, and a student network pretrained on ImageNet dataset.
Finally, the student network is trained to classify dives. Since we gather evidence over six frames, we make use of an LSTM to aggregate this evidence. The LSTM is single-layered, with a 256-dimensional hidden state. The LSTM’s hidden state from the last time step is passed through separate linear layers, one for each of the properties of a dive. The student network is trained end-to-end for 20 epochs using the Adam solver with a constant learning rate of 0.0001.
Results of our models are summarized in Table 3, where we also compare them with other state-of-the-art 3D-CNN based approaches [29, 30]. We observe that our HalluciNet outperforms its vanilla 2D-CNN counterpart on four out of five fields. The difference in performance is larger for rotation type (RT), number of somersaults (SS) and number of twists (TW) than for position (P), because position (legs straight or bent) may be equally identifiable from a single image or a clip, whereas RT, SS and TW are more difficult to predict for a plain 2D-CNN trained without hallucination. In comparison, our HalluciNet has been trained to forecast the short term future, and hence excels in situations that involve longer term dynamics. Our HalluciNet even outperforms 3D-CNN based approaches that use more frames (MSCADC and Nibali et al.). C3D-AVG outperforms HalluciNet, but is computationally very expensive and uses 16x more frames.
|Method||CNN Type||Frames||Sp. Corr.|
Action quality assessment (AQA) is another task which can help bring out the utility of hallucinating spatiotemporal representations from still images using 2D-CNN. In AQA, the task is to measure or quantify how well an action was performed. A good example of AQA would be that of judging Olympic events like diving, gymnastics, figure skating, etc.
Consistent with literature, we report Spearman’s rank correlation (in %).
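As a reminder of how the metric is computed, Spearman's rank correlation is the Pearson correlation of the ranks of the two score lists. The sketch below is a dependency-free illustration with average ranks for ties; the score values are invented for the example and the helper names are ours.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b) -> float:
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(predicted, ground_truth) -> float:
    return pearson(average_ranks(predicted), average_ranks(ground_truth))


# Invented predicted and judged scores for five performances.
rho = spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(round(100 * rho, 2))  # reported in %
```

Because only ranks matter, the metric rewards ordering performances correctly rather than matching the exact score values.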
We follow the same training procedure as in Sec. 4.2, except that for AQA task we use L2 loss to train, as it is a regression task. We train for 20 epochs with Adam as solver, and anneal the learning rate by a factor of 10 every 5 epochs.
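The step schedule described above can be sketched as a small function. It assumes a starting learning rate of 0.0001, as in the general setup, and 0-indexed epochs; both are our reading of the text rather than details stated for this experiment.

```python
def learning_rate(epoch: int, base_lr: float = 1e-4, step: int = 5, factor: float = 10.0) -> float:
    """Learning rate after dividing by `factor` once every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Learning rate at the start of each 5-epoch block over 20 epochs.
schedule = [learning_rate(e) for e in range(0, 20, 5)]
print(schedule)
```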
The results are presented in Table 4. Incorporating the hallucination task helps improve performance on the AQA task. Our HalluciNet outperforms C3D-SVR and is close to MSCADC. Although C3D-AVG performs best on the AQA task, this experiment still supports the advantage of using the hallucination task.
Feichtenhofer et al. introduced YUP++ dataset for the task of dynamic scene recognition in . It has a total of 20 scene classes. Use of this dataset to evaluate the utility of inferred motion was suggested in . In the work by Feichtenhofer, 10% of the samples are used for training, while the remaining 90% of the samples are used for testing purpose. Gao et al.  form their own split, called ‘static-YUP++’.
For training and testing purposes, we consider the central frame of each sample.
We conduct the following two experiments, and set the loss balancing factor in Eq. 4 to 1 in both.
In the first, we evaluate the utility of the hallucination task for dynamic scene recognition and compare our method with [41, 7, 36]. For a fair comparison, we use the split used in [41, 7, 36]. Results are summarized in Table 5. HalluciNet improves the performance of our vanilla 2D-CNN, and also outperforms the spatiotemporal energy based approach (BoSE), the slow feature analysis (SFA) approach, and the temporal CNN (T-CNN). T-CNN might be the closest for comparison because it uses a stack of 10 optical flow frames. Yet, our HalluciNet outperforms it by a large margin.
3D-CNN’s extract richer spatiotemporal features than 2D-CNN’s, but this comes at a considerably higher computational cost. 2D-CNN’s have the benefit of being computationally much lighter. Since neural networks are universal function approximators, we propose a simple solution to approximate (hallucinate) spatiotemporal representations (computed by a 3D-CNN) using a 2D-CNN. Hallucinating spatiotemporal representations, instead of actually computing them, brings down the computational cost and makes deployment on edge devices feasible, in addition to lowering the communication bandwidth requirement. Besides these practical benefits, the hallucination loss also provides a stronger supervisory signal.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 354–363.
An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851.