GTM: Gray Temporal Model for Video Recognition

10/20/2021 ∙ by Yanping Zhang, et al. ∙ Tianjin University

Data input modality plays an important role in video action recognition. Normally, there are three types of input: RGB, flow stream and compressed data. In this paper, we propose a new input modality: gray stream. Specifically, taking 3 stacked consecutive gray images as input, which has the same size as an RGB input, not only skips the conversion from decoded video data to RGB, but also improves spatio-temporal modeling ability at zero computation and zero parameters. Meanwhile, we propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which captures the temporal relationship at the channel-feature level within a controllable computation budget (set by parameters G and R). Finally, we confirm its effectiveness and efficiency on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB-51 and UCF-101, and achieve impressive results.




1 Introduction

In the real world, huge amounts of video data are generated every minute. As of May 2019, more than 500 hours of video were uploaded to YouTube every minute (Hale, 2019). Advances in edge computing and next-generation communication technology have made it possible to analyze these videos in real time. As a result, video-based tasks are receiving more attention and becoming more important.

For video action recognition, deep learning (Krizhevsky et al., 2012) has become the standard and we have witnessed great advancements. Most methods use three types of input: RGB, optical flow and compressed data. Karpathy et al. (2014) proposed to use a single 2D CNN model on each RGB frame independently and explored several fusion methods to learn spatio-temporal features. Simonyan and Zisserman (2014a) first proposed two-stream networks, which take an RGB input and an optical flow (Brox et al., 2004) input, respectively. Wu et al. (2018) proposed to apply deep learning directly in the compressed domain for action recognition. This raises a question: is there another input modality for action recognition?

Figure 1: Example of 3 consecutive RGB images vs. gray images. First row: RGB. Second row: gray.

The RGB format is widely used in image-based deep learning methods. It is straightforward and has a large number of ready-made models, such as VGG (Simonyan and Zisserman, 2014b), Inception (Szegedy et al., 2015) and ResNet (He et al., 2016). However, the RGB format may not be entirely suitable for video tasks. Restricted by storage and bandwidth, video files and streams are stored or transmitted in a compressed format, such as MPEG-4 or H.264 (Wiegand et al., 2003). After decompression, we get YUV data directly. Y denotes the luminance component and UV the two chrominance components. YUV420 is the most widely used format, and it includes a subsampling process, so the three signals (one luma and two chroma) do not carry equal amounts of data. Figure 2 shows a simple process of decoding a video and then converting it to RGB format. During the conversion from YUV420 to RGB, the data size is doubled, which requires extra computation and more storage.
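The doubling of data size can be checked with simple arithmetic. The sketch below assumes 8-bit samples and even frame dimensions; the function names are illustrative and do not come from the paper.

```python
def yuv420_bytes(width, height):
    """Bytes in one 8-bit YUV 4:2:0 frame.

    The Y plane is full resolution; the U and V planes are each
    subsampled by 2 horizontally and vertically.
    """
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


def rgb_bytes(width, height):
    """Bytes in one 8-bit RGB frame (three full-resolution planes)."""
    return 3 * width * height


w, h = 320, 240
print(yuv420_bytes(w, h))                    # 1.5 bytes per pixel
print(rgb_bytes(w, h))                       # 3 bytes per pixel
print(rgb_bytes(w, h) / yuv420_bytes(w, h))  # ratio between the two
```

A YUV420 frame stores 1.5 bytes per pixel and an RGB frame stores 3, which is where the factor of two comes from.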

The flow stream (Farnebäck, 2003; Zach et al., 2007) has proven to be a good representation of the short-term motion between adjacent frames. Zhao and Snoek (2019) show that the more accurate the optical flow, the more the model improves. However, computing optical flow is time-consuming and storage-demanding, which makes it impractical for real-world deployment.

Action recognition in the compressed domain has a long tradition (Tom and Babu, 2013; Ma and Song, 2019). The compressed data itself contains a significant number of clues that can help classification, including Motion Vector, Residual, Quantization Parameter, Macro Block Size, MB in bits and QP Gradient. Most deep learning methods (Wang et al., 2019; Battash et al., 2020) use only Motion Vector and Residual so far, and many of them are not trained and evaluated on general large-scale datasets such as Kinetics (Kay et al., 2017). So deep learning approaches in the compressed domain are promising but far from fully explored.

Figure 2: A simple process of decoding a video and then converting it to RGB.

To address this issue, we investigated several video-based inputs and found that taking 3 stacked consecutive gray images as input, which we call the gray stream, can improve the modeling ability at zero computation and zero parameters. In Figure 1, we visualize 3 consecutive RGB images and the corresponding gray images. The gray stream contains not only the local spatial appearance information represented by an individual frame but also the local temporal dependency among these successive frames.

Given the new input modality, we reconsider current models. 3D CNN models involve a huge amount of computation, while 2D models lack temporal modeling capability. Inspired by this observation, we propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which can be easily inserted into 2D CNNs in a plug-and-play manner to improve their temporal modeling ability. The 1D-ICSC learns to capture different temporal relationships for different channels within a controllable computation budget. To summarize, the main contributions of our method are three-fold:

  • We propose a new input modality (gray stream) for video action recognition and demonstrate its efficiency.

  • A simple yet effective 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC) is proposed, which can greatly improve the temporal modeling ability of 2D CNNs.

  • We evaluated the proposed method on several public benchmark datasets, including Something-Something (Goyal et al., 2017), Kinetics-400 (Kay et al., 2017), UCF-101 (Soomro et al., 2012) and HMDB-51 (Kuehne et al., 2011), and achieved impressive results.

2 Related Works

Figure 3: Illustration of our gray temporal model with a ResNet-50 backbone. One input video is divided into T segments. For each segment, we randomly sample 3 consecutive gray images. Each ResNet block is replaced by a Spatio-Temporal block which has a 1D-ICSC inside.

There are several trends in video action recognition. The first one concerns the evolution of networks, from 2D CNNs and LSTMs to 3D CNNs. Karpathy et al. (2014) applied a 2D CNN model to the large-scale Sports-1M dataset and marked the beginning of deep learning methods in this field. Yue-Hei Ng et al. (2015) feed the feature maps from CNNs to an LSTM network and aggregate frame-level CNN features to model the temporal relation. In these approaches, the feature extraction of each frame is isolated and only late fusion of high-level features is performed, so they bring no clear improvement. Tran et al. (2015) first proposed a deep 3D network, termed C3D, which performs 3D convolutions on adjacent frames to jointly model spatial and temporal features. However, with a tremendous number of parameters to optimize and a lack of high-quality large-scale datasets, the performance of C3D remained unsatisfactory. The situation changed when Carreira and Zisserman (2017) proposed I3D, which achieved very competitive performance with the help of the high-quality large-scale Kinetics dataset (Kay et al., 2017) and pushed video action recognition to the next level. In more recent work, Feichtenhofer (2020) introduced X3D, which progressively expands a tiny 2D image classification architecture along multiple network axes, such as temporal duration, spatial resolution and width. X3D learned from the history of image classification models and pushed 3D models to an extreme.

The second line mainly focuses on improving feature expression. Simonyan and Zisserman (2014a) proposed the two-stream approach and set a trend. Following this trend, many excellent works (Feichtenhofer et al., 2016; Wang et al., 2016) emerged and dominated the video recognition domain from 2014 to 2017. Because pre-computing optical flow is computationally expensive and storage-demanding, many works seek substitutes. Kantorov and Laptev (2014) proposed the use of sparse MPEG flow instead of dense optical flow, which improved the speed of feature extraction by two orders of magnitude with a minor reduction in accuracy.

The last one focuses on computational efficiency and real-world deployment. ECO (Zolfaghari et al., 2018), TSM (Lin et al., 2019), STM (Jiang et al., 2019) and TEA (Li et al., 2020) are excellent examples. Lin et al. proposed the temporal shift module (TSM), which shifts part of the channels along the temporal dimension and thus facilitates information exchange among neighboring frames. It builds temporal modeling inside 2D CNNs at zero computation and zero parameters.

Among these approaches, SlowFast (Feichtenhofer et al., 2019) attempted to replace the RGB input with gray-scale input in its fast pathway. They found that the gray-scale version is nearly as good as the RGB variant while reducing FLOPs by 5%. StNet (He et al., 2019) samples temporal segments, each of which consists of consecutive RGB frames. These frames are stacked in the channel dimension to form the network input tensor, which is called a super-image. A super-image contains both spatial information and local temporal dependency. TDN (Wang et al., 2021) generalizes the idea of RGB difference to devise an efficient temporal difference module for motion modeling. These works made remarkable progress in both input modalities and network architectures.

Different from previous works, our proposed approaches focus on video-based modalities and efficient 1D spatio-temporal modeling, which makes them more suitable for video-based tasks and more practical for real-world deployment.

3 Approach

In this section, we introduce the technical details of our approach. First, we discuss several video-based modalities, such as YCbCr. Afterward, we present 1D-ICSC, which can be embedded into a 2D CNN in a plug-and-play manner.

3.1 Gray Stream

H.264/AVC (Wiegand et al., 2003), as well as the previous standards MPEG-1 (ISO/IEC, 1993) and MPEG-2 (ITU-T, 1994), uses the YCbCr video color space. (In this paper we use the terms YCbCr and YUV interchangeably, although strictly speaking they are not exactly the same.) It separates a color representation into three components called Y, Cb, and Cr. Component Y is called luma and represents brightness. The two chroma components Cb and Cr represent the extent to which the color deviates from gray toward blue and red, respectively.

Figure 4: Example of one RGB image and its corresponding Gray image and YUV component. The UV components are resized to the same size of Y for better visual perception.

Because the human visual system is more sensitive to luma than to chroma, subsampling is performed in which all the luma (Y) information is preserved and the chroma information (CbCr) is reduced by a factor of 2 in both the horizontal and vertical directions. This is called 4:2:0 sampling, with 8 bits of precision per sample. The whole subsampling process is lossy but does not affect the perceived quality. In Figure 4 we visualize the three components of YCbCr. It can be seen that a single Y component is enough for a human to recognize what is going on. According to ITU-R Recommendation BT.601 (ITU-R, 2011), each RGB value is computed per pixel as a weighted combination of the Y, Cb and Cr values and then clamped to the 8-bit range of 0 to 255. We can see that an RGB frame is a transformation of YCbCr, and this requires extra computation. In the image domain, another widely used technique is to convert an RGB image to gray-scale. According to ITU-R (2011), the gray value is computed as a weighted sum of the R, G and B values.

We visualize the gray-scale image in Figure 4 (b). We can see that the Y component image (luma) and the gray-scale image are visually similar. An intuitive idea is to replace the network input with the gray-scale image, so we choose gray-scale as another candidate. So far we have 4 modalities in total: Y, U, V and Gray. Each of them has only one channel, while RGB has three channels (red, green, blue). So for each modality, there are two ways to construct the input. The first is to use only one frame, which requires modifying the network input. The other is to stack 3 consecutive frames to form 3 channels, which has the same size as RGB and does not require any modification of the network. For simplicity, we call this the gray stream (for all 4 modalities). In the first ablation study, we will show its superiority over RGB and Flow.
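The two color conversions discussed above can be sketched per pixel as follows. The gray-scale weights are the standard BT.601 luma coefficients. For YCbCr to RGB, the code assumes the common full-range form with BT.601 coefficients, which may differ from the exact equation used in the paper (a studio-range variant uses different offsets and scales). The function names `clamp8`, `ycbcr_to_rgb` and `rgb_to_gray` are illustrative.

```python
def clamp8(value):
    """Round and clamp a value to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit YCbCr pixel to RGB.

    Assumption: full-range form with BT.601 coefficients.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)


def rgb_to_gray(r, g, b):
    """Convert one RGB pixel to gray-scale with BT.601 luma weights."""
    return clamp8(0.299 * r + 0.587 * g + 0.114 * b)


# Neutral chroma (Cb = Cr = 128) leaves a gray pixel unchanged.
print(ycbcr_to_rgb(128, 128, 128))
# Pure red maps to a fairly dark gray value.
print(rgb_to_gray(255, 0, 0))
```

The gray stream avoids the first conversion entirely when Y is used, since Y is available directly after decoding.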

3.2 Spatio-Temporal Block

In order to keep the framework effective yet lightweight, we choose TSN (Wang et al., 2016) with a ResNet-50 (He et al., 2016) backbone. Since a raw 2D network cannot effectively capture temporal dynamics, as evidenced by previous works (Zhou et al., 2018; Lin et al., 2019), we designed a spatio-temporal module to tackle this problem. Figure 5 (b) shows our spatio-temporal block embedded with 1D-ICSC.


Channel-wise temporal modeling has been explored previously by Lin et al. (2019), Jiang et al. (2019) and Li et al. (2020); it is designed to model motion information at the channel level instead of the raw pixel level. Different from previous works, we propose 1D-ICSC to capture the temporal relationship, and introduce two factors (G & R) to control the computation budget.

Figure 5: (a): Original ResNet block. (b): Spatio-Temporal block. The 1D-ICSC could be easily inserted into the ResNet block to construct a Spatio-Temporal block.

As illustrated in Figure 5 (b), the input is a spatio-temporal feature, where N is the batch size, T and C denote the temporal dimension and the feature channels, respectively, and H and W correspond to the spatial shape. We first reshape the feature and then apply the channel-wise 1D convolution as in equation (3). The convolution is a 1D convolutional layer with kernel size 3. Next we reshape the result back to the original input shape and model local spatial information via the original ResNet block. Different from previous works, we specially initialize the parameters of the 1D convolution so that its output equals its input at the initial stage; hence the name 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC). We do not make assumptions about how the channels move and interact, but instead relax the kernel weights so that this is learned during training. Experiments show that the identity initialization strategy brings a performance improvement.
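The identity initialization can be illustrated with a small pure-Python sketch. It works on a single spatial location, so the feature is a list of C channels with T time steps each, and it assumes a standard grouped 1D convolution with zero padding. The helper names and the nested-list weight layout are illustrative, not the authors' implementation.

```python
def identity_weights(channels, groups, kernel_size=3):
    """Weights of shape [channels][channels // groups][kernel_size].

    Only the center tap that connects each output channel to the same
    input channel is set to 1, so the convolution starts as an identity.
    """
    per_group = channels // groups
    center = kernel_size // 2
    weights = [[[0.0] * kernel_size for _ in range(per_group)]
               for _ in range(channels)]
    for out_c in range(channels):
        weights[out_c][out_c % per_group][center] = 1.0
    return weights


def grouped_temporal_conv(x, weights, groups):
    """Grouped 1D convolution over time with zero padding.

    x: list of C channels, each a list of T values.
    """
    channels, steps = len(x), len(x[0])
    per_group = channels // groups
    kernel_size = len(weights[0][0])
    pad = kernel_size // 2
    out = [[0.0] * steps for _ in range(channels)]
    for out_c in range(channels):
        group_start = (out_c // per_group) * per_group
        for j in range(per_group):
            source = x[group_start + j]
            for t in range(steps):
                for d in range(kernel_size):
                    s = t + d - pad
                    if 0 <= s < steps:
                        out[out_c][t] += weights[out_c][j][d] * source[s]
    return out


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
            [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
for g in (1, 2, 4):
    out = grouped_temporal_conv(features, identity_weights(4, g), g)
    print(g, out == features)
```

Starting from the identity means the pretrained 2D backbone is left undisturbed at the first iteration, and training then moves the kernel weights away from it.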

Further, we introduce two factors, G and R, to control the computation cost. G is the number of groups of the 1D convolution, so its weight has C × (C/G) × 3 entries. R is used to control how many spatio-temporal blocks are added. Theoretically, more spatio-temporal blocks will bring higher performance, but will increase parameters and FLOPs. Without loss of generality, when in each layer, we add a spatio-temporal block. Table 1 shows the GFLOPs when G and R vary. We can see that when G and R increase, the FLOPs decrease.

R \ G | 1 | 2 | 4 | 8
1 | 107.26 | 70.11 | 51.54 | 42.25
2 | 72.73 | 52.85 | 42.90 | 37.93
3 | 57.93 | 45.45 | 39.20 | 36.08
4 | 53.00 | 42.98 | 37.97 | 35.47
Table 1: GFLOPs for different values of G (columns) and R (rows). The larger G and R, the smaller the FLOPs. The network input is 8×3×224×224.
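To see why a larger group number lowers the cost, the following sketch counts the weights of one temporal convolution under the assumption of a standard grouped 1D convolution with C input channels, C output channels and kernel size 3. The function name and the choice of 256 channels are illustrative only.

```python
def grouped_conv1d_params(channels, groups, kernel_size=3):
    """Number of weights in a bias-free grouped 1D convolution.

    Each of the `channels` output channels only sees the
    `channels // groups` input channels in its own group.
    """
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return channels * (channels // groups) * kernel_size


for g in (1, 2, 4, 8):
    print(g, grouped_conv1d_params(256, g))
```

Both the weight count and the multiply-accumulate count of the temporal convolution scale with 1/G, which matches the trend along each row of Table 1.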

3.3 GTM Network

After introducing the gray stream and 1D-ICSC, we are ready to describe how to integrate them into the existing network architecture and build the gray temporal model (GTM) network. As shown in Figure 3, a 2D ResNet-50 is used as the backbone. First we divide one video into T segments. For each segment, we randomly sample 3 consecutive gray images. So the input of the network is T × 3 × H × W, which is the same size as the RGB input. From layer2 to layer5, 1D-ICSC is inserted at the beginning of each ResNet block to build a Spatio-Temporal block, as controlled by the parameters G and R. Simple temporal pooling is applied to average the action predictions over the entire video. Note that the whole framework is simple and straightforward, and does not require any modification of the original blocks.

4 Experiments

4.1 Dataset & Implementations


We evaluate our approach on two large-scale action recognition datasets, Something-Something (Goyal et al., 2017) and Kinetics-400 (Kay et al., 2017), and on two small-scale datasets, HMDB-51 (Kuehne et al., 2011) and UCF-101 (Soomro et al., 2012). The Something-Something V2 dataset is a large collection of humans performing actions with everyday objects. Kinetics-400 is a large-scale YouTube video dataset; we download it from CVDF (CVDF, 2021), including 238,796 training videos, 19,877 validation videos and 38,671 test videos. We use FFmpeg (ffmpeg, 2021) to extract the YUV data and save it in HDF5 (Folk et al., 2011). For Kinetics, we resize the video height to 240 without changing the aspect ratio to speed up training.


We choose TSM (Lin et al., 2019) as our baseline. To have an apples-to-apples comparison with TSM, we use the same backbone (ResNet-50), and the models are pre-trained on ImageNet (Russakovsky et al., 2015) unless stated otherwise.


Most of the experimental settings are the same as TSM Lin et al. (2019) and STM Jiang et al. (2019). Given an input video, we first divide it into T segments, then we randomly sample one or 3 consecutive frames from each segment. During training, random scaling and corner cropping are utilized for data augmentation, and the cropped region is resized to 224 × 224 for each frame. Therefore, the input size of the network is N × T × C × 224 × 224, where N is the batch size, T is the number of segments, and C is the input channel number. Horizontal flipping is applied except for Something-Something dataset.
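The segment-based sampling can be sketched as below. The function returns, for each of the T segments, the indices of 3 consecutive frames. Passing a random generator gives a random start position inside each segment, while omitting it picks a centered start; this split and all names are assumptions for illustration, and the video is assumed to have at least 3 frames.

```python
import random


def sample_gray_stream_indices(num_frames, num_segments=8,
                               frames_per_segment=3, rng=None):
    """Pick `frames_per_segment` consecutive frame indices per segment."""
    seg_len = num_frames / num_segments
    last_valid = max(0, num_frames - frames_per_segment)
    indices = []
    for s in range(num_segments):
        seg_start = int(s * seg_len)
        seg_end = int((s + 1) * seg_len)  # exclusive
        # Latest start that keeps all frames inside the segment.
        last_start = max(seg_start, seg_end - frames_per_segment)
        if rng is None:
            start = (seg_start + last_start) // 2
        else:
            start = rng.randint(seg_start, last_start)
        start = min(start, last_valid)  # never run past the video end
        indices.append([start + k for k in range(frames_per_segment)])
    return indices


print(sample_gray_stream_indices(80))
print(sample_gray_stream_indices(80, rng=random.Random(0)))
```

Stacking the 3 gray frames of each segment along the channel axis then gives an input with the same shape as one RGB frame per segment.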

We train our model with 2 Tesla V100 (16G) GPUs. Limited by GPU memory, we set T = 8 and use a relatively small batch size of 32. For the Kinetics and Something-Something v1 & v2 datasets, the initial learning rate is 0.005. It is reduced by a factor of 10 at epochs 30, 40 and 45, and training stops at 50 epochs. The dropout rate is 0.5. Stochastic gradient descent (SGD) is used as the optimizer, with momentum 0.9 and weight decay 1e-4. All the batch normalization layers (Ioffe and Szegedy, 2015) are enabled during training.
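The step schedule described above can be written as a small function. Epochs are assumed to be counted from 0 and the decay is assumed to take effect at the start of each milestone epoch; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.005, milestones=(30, 40, 45), gamma=0.1):
    """Step schedule: multiply by `gamma` at every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 29, 30, 40, 45, 49):
    print(epoch, learning_rate(epoch))
```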


Two evaluation protocols are considered to trade off accuracy and speed: 1) Efficient Protocol: 1-clip and center-crop, where only a center crop of 224 × 224 from a single clip is used. 2) Accuracy Protocol: 10-clip and 3-crop, where three crops (left, middle, right) of 224 × 224 and 10 clips are used for testing. The final prediction is the average score over all clips and crops. By default we use the Efficient Protocol for all tests. We employ the Accuracy Protocol only for Kinetics.
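Averaging scores over views can be sketched as follows, where a view is one (clip, crop) pair, so the Accuracy Protocol has 10 × 3 = 30 views and the Efficient Protocol has one. The function names and the toy scores are illustrative.

```python
def average_scores(view_scores):
    """Average per-class scores over all views of one video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(view[c] for view in view_scores) / num_views
            for c in range(num_classes)]


def predict(view_scores):
    """Index of the class with the highest averaged score."""
    averaged = average_scores(view_scores)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Two views, two classes (toy numbers).
scores = [[1.0, 3.0], [3.0, 5.0]]
print(average_scores(scores))
print(predict(scores))
```

The same averaging can be applied across modalities, for example over the score vectors of an RGB model and a gray stream model.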

4.2 Ablation Study

Modality | Input | UCF-101 | HMDB-51 | STH-V1
RGB | 8*3*H*W | 83.1 | 49.9 | 18.2
Flow | 8*10*H*W | 86.6 | 57.3 | 36.9
1-Y | 8*1*H*W | 83.2 | 47.8 | -
3-Y | 8*3*H*W | 87.7 | 55.9 | -
1-U | 8*1*H/2*W/2 | 54.9 | 26.7 | -
3-U | 8*3*H/2*W/2 | 68.5 | 37.3 | -
1-V | 8*1*H/2*W/2 | 58.2 | 30.7 | -
3-V | 8*3*H/2*W/2 | 70.1 | 38.9 | -
1-Gray | 8*1*H*W | 83.0 | 48.0 | 17.3
3-Gray | 8*3*H*W | 87.8 | 55.5 | 38.9
Table 2: Comparison of modalities. Input (segments * channels * height * width) denotes the input shape of one video. 1-x means one single image. 3-x means 3 consecutive images.

In this section, we first conduct several ablation experiments to verify the effectiveness of the different components of our proposed method. The ablation experiments are performed on Something-Something v1, UCF-101 split 1 and HMDB-51 split 1. Top-1 accuracy is reported.


First we compare several modalities, including RGB, Grayscale, YUV and optical flow. We use denseflow Wang et al. (2020) to extract optical flow with Farnebäck algorithm Farnebäck (2003) because of its efficiency. Here we use TSN with ResNet-50 backbone. Videos are divided into 8 segments. The results are shown in Table 2.

First, on STH-V1, 3-Gray gains more than 20 points over RGB (38.9% vs. 18.2%) and is also 2 points above Flow (36.9%). On UCF-101, 3-Gray achieves the best top-1 accuracy (87.8%) and 3-Y obtains a comparable result (87.7%); both surpass Flow (86.6%) and RGB (83.1%). Second, across the three datasets, 1-Gray or 1-Y achieves performance similar to RGB. This is consistent with SlowFast (Feichtenhofer et al., 2019). It indicates that, for action recognition, a single Y or Gray channel contains information equivalent to RGB. Third, the U and V modalities give inferior results; we argue that UV contains less information and has lower resolution (1/2 of Y in each dimension).

In summary, compared with RGB, our gray stream can improve accuracy by a large margin without any extra parameters and FLOPs, or any optical flow pre-calculation. This indicates that video tasks are not exactly the same as image tasks.

Modality | Interval | UCF-101 | STH-V1
3-Gray | 1 | 87.8 | 38.9
3-Gray | 2 | 87.1 | 38.4
3-Gray | 3 | 87.3 | 38.3
3-Gray | 4 | 86.3 | 38.5
3-Gray | 5 | 86.6 | 38.0
Table 3: Comparison of different sampling intervals.

Sampling Intervals.

Our gray stream needs 3 video frames to form one segment input. An intuitive question arises: do different sampling intervals affect the results? Here we compare different sampling intervals. The results are shown in Table 3. Intervals 1 to 3 achieve similar results. From interval 4 onward, performance drops. We argue that for a 2D backbone network, spatial modeling plays an important role, and a large sampling interval hurts the spatial modeling ability. For simplicity, we use interval 1 as the default (which means 3 consecutive frames).

Modality | Backbone | UCF-101 | HMDB-51
3-Y | ResNet-18 | 84.6 | 49.0
3-Y | ResNet-34 | 86.5 | 53.5
3-Y | ResNet-50 | 87.7 | 55.9
3-U | ResNet-18 | 64.9 | 31.6
3-U | ResNet-34 | 67.7 | 36.1
3-U | ResNet-50 | 68.5 | 37.3
3-V | ResNet-18 | 66.5 | 34.9
3-V | ResNet-34 | 69.6 | 37.8
3-V | ResNet-50 | 70.1 | 38.9
3-Gray | ResNet-18 | 84.1 | 50.7
3-Gray | ResNet-34 | 86.3 | 53.7
3-Gray | ResNet-50 | 87.8 | 55.5
Table 4: Comparison of different backbones for 4 modalities.

Backbone Choice.

Because U and V images have 1/2 the resolution of Y images, it may not be necessary to use a relatively heavy backbone such as ResNet-50. Here we compare 3 backbones, ResNet-18, ResNet-34 and ResNet-50, for the 4 modalities. The results are shown in Table 4. For all modalities, ResNet-50 consistently achieves the best results. For 3-U and 3-V, ResNet-50 provides only a slight performance boost (around 1%) compared to ResNet-34. So for 3-Y and 3-Gray, we use ResNet-50 as the default backbone. For 3-U and 3-V, we use ResNet-34 as the default.

Modality | Temporal | UCF-101 | STH-V1
RGB | None | 83.1 | 18.2
RGB | Fixed | 83.2 | 45.6
RGB | 3D-Shift | 83.8 | 45.8
RGB | 3D-Identity | 84.6 | 45.9
RGB | 1D-Shift | 85.0 | 45.9
RGB | 1D-ICSC | 85.3 | 46.1
3-Gray | None | 87.8 | 38.9
3-Gray | Fixed | 87.5 | 48.8
3-Gray | 3D-Shift | 87.2 | 48.7
3-Gray | 3D-Identity | 87.9 | 48.9
3-Gray | 1D-Shift | 87.8 | 48.6
3-Gray | 1D-ICSC | 88.0 | 49.3
Table 5: Comparison of different temporal modeling methods. In the second column, “None” means no temporal modeling is used.


Here we compare different temporal modeling methods, including Fixed, 1D Convolution and 3D Convolution. Fixed means 1/8 of the channels are shifted forward and 1/8 backward, which is the same as TSM (Lin et al., 2019). For 1D and 3D convolution, there are two parameter initialization strategies: Identity and Shift. Identity convolutions are initialized as in equation (3) so that the input and output are equal. Shift convolutions are initialized to behave like Fixed (1/8 of the channels forward and 1/8 backward). The kernel size of 3D convolution is .
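The Fixed setting can be sketched on a single spatial location, with the feature given as a list of C channels of T time steps. Which direction counts as "forward" is a convention chosen here for illustration, vacated positions are zero-filled, and the function name is illustrative.

```python
def temporal_shift(x, fold_div=8):
    """Shift 1/fold_div of the channels forward in time and
    1/fold_div backward; leave the remaining channels unchanged."""
    fold = len(x) // fold_div
    out = []
    for c, channel in enumerate(x):
        if c < fold:
            # Forward: the value at time t comes from time t - 1.
            out.append([0.0] + list(channel[:-1]))
        elif c < 2 * fold:
            # Backward: the value at time t comes from time t + 1.
            out.append(list(channel[1:]) + [0.0])
        else:
            out.append(list(channel))
    return out


features = [[1.0, 2.0, 3.0] for _ in range(8)]
shifted = temporal_shift(features)
print(shifted[0], shifted[1], shifted[2])
```

A kernel-size-3 temporal convolution can reproduce either shift by placing its single non-zero tap on the first or last position, which is what the Shift initialization refers to.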

The results are shown in Table 5. First, we notice that Identity initialization achieves better results than Shift for both 1D and 3D convolution. Second, 1D-ICSC achieves the best result and even surpasses 3D convolution. This indicates that a proper temporal convolution is essential for temporal modeling, even though 3D convolution involves many more parameters. For RGB on STH-V1, 1D-ICSC significantly increases accuracy from 18.2% to 46.1%. For 3-Gray, it also brings an increase of about 10 points (from 38.9% to 49.3%).

G & R

We test different G and R parameters on the STH-V1 dataset. The results are shown in Table 6. Smaller G and R get better results. This is consistent with Table 1, as smaller G and R involve more parameters and FLOPs.

R \ G | 1 | 2 | 4 | 8
2 | 50.0 | 49.6 | 49.5 | 49.2
4 | 49.3 | 49.1 | 49.1 | 49.0
Table 6: Top-1 accuracy under different G (columns) and R (rows) parameters on the STH-V1 dataset. The 3-Gray modality is used.

Method | Backbone | Pre-train | Frames | Param. | GFLOPs | Top-1 | Top-5
I3D-RGB (Carreira et al. 2017) | 3D Inception V1 | ImageNet | 64×N/A | 12.7M | 108×N/A | 71.1 | 89.3
I3D-Flow (Carreira et al. 2017) | 3D Inception V1 | ImageNet | 64×N/A | 12.7M | 108×N/A | 63.4 | 84.9
2-Stream I3D (Carreira et al. 2017) | 3D Inception V1 | ImageNet | 128×N/A | 25M | 216×N/A | 74.2 | 91.3
ECO-RGB (Zolfaghari et al. 2018) | BNIncep+3D Res18 | Scratch | 92 | 47.5M | 267 | 70.0 | 89.4
NL I3D-RGB (Wang et al. 2018) | 3D ResNet50 | ImageNet | 128 | 35.3M | 282 | 67.3 | -
NL I3D-RGB (Wang et al. 2018) | 3D ResNet50 | ImageNet | 128×3×10 | 35.3M | 282×30 | 76.5 | 92.6
SlowFast 8×8 (Feichtenhofer et al. 2019) | 3D ResNet50 | Scratch | (8+64)×3×10 | - | 65.7×30 | 77.0 | 92.6
TSN-RGB (Wang et al. 2016) | BN-Inception | ImageNet | 25×10 | 10.7M | 53×10 | 69.1 | 88.7
TSN-RGB (Wang et al. 2016) | ResNet-50 | ImageNet | 8 | 24.3M | 33 | 66.8 | -
R(2+1)D-RGB (Tran et al. 2018) | ResNet-34 | Scratch | 32×10 | 63.8M | 152×10 | 72.0 | 90.0
R(2+1)D-Flow (Tran et al. 2018) | ResNet-34 | Scratch | 32×10 | 63.8M | 152×10 | 67.5 | 87.2
R(2+1)D 2-Stream (Tran et al. 2018) | ResNet-34 | Scratch | 64×10 | 127.6M | 304×10 | 73.9 | 90.9
TSM (Lin et al. 2019) | ResNet-50 | ImageNet | 8 | 24.3M | 33 | 70.6 | -
TSM (Lin et al. 2019) | ResNet-50 | ImageNet | 8×3×10 | 24.3M | 33×30 | 74.1 | 91.2
STM-RGB (Jiang et al. 2019) | ResNet-50 | ImageNet | 16×3×10 | 24M | 67×30 | 73.7 | 91.6
St-Net (He et al. 2019) | ResNet-50 | ImageNet | 25 | 33M | 189 | 69.9 | -
TEA (Li et al. 2020) | Res2Net-50 | ImageNet | 8 | 24.5M | 35×1 | 72.5 | 90.4
TEA (Li et al. 2020) | Res2Net-50 | ImageNet | 8×3×10 | 24.5M | 35×30 | 75.0 | 91.8
TDN (Wang et al. 2021) | ResNet-50 | ImageNet | 8×3×10 | - | 36×30 | 76.6 | 92.8
GTM (RGB) | ResNet-50 | ImageNet | 8 | 28M | 43 | 70.8 | 89.5
GTM (3-Y) | ResNet-50 | ImageNet | 8 | 28M | 43 | 70.4 | 89.5
GTM (3-Gray) | ResNet-50 | ImageNet | 8 | 28M | 43 | 70.4 | 89.6
GTM (RGB + 3-Gray) | ResNet-50 | ImageNet | 16 | 49M | 86 | 73.4 | 91.2
GTM (RGB + 3-Gray) | ResNet-50 | ImageNet | 16×3×10 | 49M | 86×30 | 75.2 | 92.1
Table 7: Comparison of our GTM network with other state-of-the-art methods on the Kinetics-400 validation set.

4.3 Comparisons with the State-of-the-arts

In this section, we compare our proposed GTM network with existing state-of-the-art action recognition methods. In these experiments, we set G=2 and R=4 to balance accuracy and FLOPs unless stated otherwise.

Results on Kinetics-400.

We evaluate the GTM network against recent state-of-the-art 2D/3D convolution-based solutions. The comprehensive statistics, including the classification results, inference protocols, parameters and the corresponding GFLOPs, are shown in Table 7. The first compartment contains the methods based on 3D CNNs or a mixture of 2D and 3D CNNs. The second compartment contains methods based on 2D CNNs. For a fair comparison, we mainly list architectures with a ResNet-50 backbone. We can see that under the Efficient Protocol, our RGB method reaches 70.8%, which surpasses StNet (He et al., 2019) and ECO (Zolfaghari et al., 2018). It is worth noting that on the Kinetics dataset, the flow modality usually gives inferior results compared with RGB. Our 3-Y and 3-Gray nevertheless achieve 70.4%, which is comparable with RGB (70.8%). This shows the robustness of our proposed gray stream modality. Both also surpass flow-modality methods such as I3D-Flow (63.4%) and R(2+1)D-Flow (67.5%) by a large margin. Further, a simple average of RGB and 3-Gray brings top-1 accuracy to 73.4%. This shows that our gray stream is complementary to RGB.

Results on STH V2.

The Something-Something V2 dataset is more temporal-related than Kinetics. The comparison results are listed in Table 8. Our 3-Y achieves 61.7% top-1 accuracy, which outperforms TSM (Lin et al., 2019) by 2.6%. It also improves on our RGB result by 2.8%. This indicates that on temporal-related datasets, the gray stream brings more improvement than on scene-related datasets. Averaging RGB and 3-Y increases the top-1 accuracy to 63.6%, and when combined with 3-V, it further increases to 64.5%. The superior performance demonstrates the effectiveness of our proposed approaches.

Method | GFLOPs | Top-1 | Top-5
TRN Multiscale (Zhou et al. 2018) | 33 | 48.8 | 77.6
TRN 2-Stream (Zhou et al. 2018) | - | 55.5 | 83.1
TSM (Lin et al. 2019) | 33×6 | 59.1 | 85.6
STM (Jiang et al. 2019) | 33×30 | 62.3 | 88.8
Dynamic (Wu et al. 2020) | 48 | 58.2 | 85.2
TEINet (Liu et al. 2020) | 33 | 61.3 | -
ACTION-Net (Wang et al. 2021) | 35 | 62.5 | 87.3
TDN (Wang et al. 2021) | 36 | 64.0 | 88.8
GTM (RGB) | 43 | 58.9 | 85.0
GTM (3-Y) | 43 | 61.7 | 87.4
GTM (3-V) | 30 | 51.6 | 80.1
GTM (RGB + 3-Y) | 86 | 63.6 | 88.6
GTM (RGB + 3-Y + 3-V) | 116 | 64.5 | 89.2
Table 8: Comparison with the state-of-the-art methods on the Something-Something V2 validation set.

Results on UCF-101 & HMDB-51.

UCF-101 and HMDB-51 are comparatively small-scale datasets with a long history, but they are still worth studying to trace the development of action recognition. We list some main results in Table 9. On HMDB-51 RGB, our method achieves 56.5%, compared with 49.8% for I3D. On UCF-101 RGB, our method achieves 87.4%. When we compare 3-Y with Flow on both datasets, 3-Y surpasses the Flow results of Two-Stream and 3D-Fused by a large margin (around 4%).

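Architecture | UCF-101 RGB | UCF-101 Flow | UCF-101 R+F | HMDB-51 RGB | HMDB-51 Flow | HMDB-51 R+F
LSTM* | 81.0 | - | - | 36.0 | - | -
3D-ConvNet* | 51.6 | - | - | 24.3 | - | -
Two-Stream* | 83.6 | 85.6 | 91.2 | 43.2 | 56.3 | 58.3
3D-Fused* | 83.2 | 85.8 | 89.3 | 49.2 | 55.5 | 56.8
I3D* | 84.5 | 90.6 | 93.4 | 49.8 | 61.9 | 66.4
Architecture | UCF-101 RGB | UCF-101 3-Y | UCF-101 R+3Y | HMDB-51 RGB | HMDB-51 3-Y | HMDB-51 R+3Y
GTM (Ours) | 87.4 | 89.2 | 91.4 | 56.5 | 60.2 | 62.0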
Table 9: Comparison on UCF-101 and HMDB-51 (split 1 of both). All models are pre-trained on ImageNet except 3D-ConvNet. * denotes results cited from I3D (Carreira and Zisserman, 2017). R+F means the average result of RGB and Flow. R+3Y means the average result of RGB and 3-Y. Here we set G=1 and R=2 for better accuracy.

5 Conclusion

In this paper, we proposed a new input modality, the gray stream, for action recognition. It skips the conversion from decoded video data to RGB, and improves the spatio-temporal modeling ability at zero computation and zero parameters. Experiments showed its superiority over RGB and Flow on various datasets, including Kinetics-400, Something-Something, UCF-101 and HMDB-51. Further, we proposed a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which is simple yet effective in improving spatio-temporal modeling ability. Future work may integrate the gray stream and RGB into a unified framework. We hope our analysis will provide insights into video-based approaches for action recognition.


  • B. Battash, H. Barad, H. Tang, and A. Bleiweiss (2020) Mimic the raw domain: accelerating action recognition in the compressed domain. In

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops

    pp. 684–685. Cited by: §1.
  • T. Brox, A. Bruhn, N. Papenberg, and J. Weickert (2004)

    High accuracy optical flow estimation based on a theory for warping

    In European conference on computer vision, pp. 25–36. Cited by: §1.
  • R. I. BT et al. (2011) Studio encoding parameters of digital television for standard 4: 3 and wide-screen 16: 9 aspect ratios. Cited by: §3.1, §3.1.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2, Table 9.
  • CVDF (2021) CVDF. Note: Cited by: §4.1.
  • G. Farnebäck (2003) Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis, pp. 363–370. Cited by: §1, §4.2.
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211. Cited by: §2, §4.2, Table 7.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1933–1941. Cited by: §2.
  • C. Feichtenhofer (2020) X3d: expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213. Cited by: §2.
  • ffmpeg (2021) Ffmpeg. Note: Cited by: §4.1.
  • M. Folk, G. Heber, Q. Koziol, E. Pourmal, and D. Robinson (2011) An overview of the hdf5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, pp. 36–47. Cited by: §4.1.
  • I. O. for Standardization/International Electrotechnical Commission et al. (1993) Coding of moving pictures and associated audio for digital storage media at up to about 1.5 mbit/s. ISO/IEC 11172. Cited by: §3.1.
  • R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850. Cited by: item , §4.1.
  • J. Hale (2019) More than 500 hours of content are now being uploaded to youtube every minute. Santa Monica, CA: Tubefilter. Cited by: §1.
  • D. He, Z. Zhou, C. Gan, F. Li, X. Liu, Y. Li, L. Wang, and S. Wen (2019) Stnet: local and global spatial-temporal modeling for action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8401–8408. Cited by: §2, §4.3, Table 7.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §3.2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §4.1.
  • B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019) Stm: spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2000–2009. Cited by: §2, §3.2, §4.1, Table 7, Table 8.
  • V. Kantorov and I. Laptev (2014) Efficient feature extraction, encoding and classification for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2593–2600. Cited by: §2.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1, §2.
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: item , §1, §2, §4.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §1.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556–2563. Cited by: item , §4.1.
  • Y. Li, B. Ji, X. Shi, J. Zhang, B. Kang, and L. Wang (2020) Tea: temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918. Cited by: §2, §3.2, Table 7.
  • J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093. Cited by: §2, §3.2, §3.2, §4.1, §4.1, §4.2, §4.3, Table 7, Table 8.
  • Z. Liu, D. Luo, Y. Wang, L. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and T. Lu (2020) Teinet: towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11669–11676. Cited by: Table 8.
  • M. Ma and H. Song (2019) Effective moving object detection in H.264/AVC compressed domain for video surveillance. Multimedia Tools and Applications 78 (24), pp. 35195–35209. Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.1.
  • K. Simonyan and A. Zisserman (2014a) Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199. Cited by: §1, §2.
  • K. Simonyan and A. Zisserman (2014b) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §1.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: item , §4.1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
  • M. Tom and R. V. Babu (2013) Fast moving-object detection in H.264/AVC compressed domain for video surveillance. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–4. Cited by: §1.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2.
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: Table 7.
  • I. T. Union-Telecommun (1994) Generic coding of moving pictures and associated audio information-part 2: video. Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) JTC 1, Rec. H.262 and ISO/IEC 13818-2 (MPEG-2 Video). Cited by: §3.1.
  • L. Wang, Z. Tong, B. Ji, and G. Wu (2021) TDN: temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1895–1904. Cited by: §2, Table 7, Table 8.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §2, §3.2, Table 7.
  • S. Wang, Z. Li, Y. Zhao, Y. Xiong, L. Wang, and D. Lin (2020) denseflow. Note: Cited by: §4.2.
  • S. Wang, H. Lu, and Z. Deng (2019) Fast object detection in compressed video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7104–7113. Cited by: §1.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: Table 7.
  • T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the H.264/AVC video coding standard. IEEE Transactions on circuits and systems for video technology 13 (7), pp. 560–576. Cited by: §1, §3.1.
  • C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2018) Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6026–6035. Cited by: §1.
  • W. Wu, D. He, X. Tan, S. Chen, Y. Yang, and S. Wen (2020) Dynamic inference: a new approach toward efficient video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 676–677. Cited by: Table 8.
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702. Cited by: §2.
  • C. Zach, T. Pock, and H. Bischof (2007) A duality based approach for realtime TV-L1 optical flow. In Joint pattern recognition symposium, pp. 214–223. Cited by: §1.
  • J. Zhao and C. G. Snoek (2019) Dance with flow: two-in-one stream action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9935–9944. Cited by: §1.
  • B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §3.2, Table 8.
  • M. Zolfaghari, K. Singh, and T. Brox (2018) Eco: efficient convolutional network for online video understanding. In Proceedings of the European conference on computer vision (ECCV), pp. 695–712. Cited by: §2, §4.3.