Analysis of Latent-Space Motion for Collaborative Intelligence

02/08/2021 ∙ by Mateen Ulhaq, et al.

When the input to a deep neural network (DNN) is a video signal, a sequence of feature tensors is produced at the intermediate layers of the model. If neighboring frames of the input video are related through motion, a natural question is, "what is the relationship between the corresponding feature tensors?" By analyzing the effect of common DNN operations on optical flow, we show that the motion present in each channel of a feature tensor is approximately equal to the scaled version of the input motion. The analysis is validated through experiments utilizing common motion models. These results will be useful in collaborative intelligence applications where sequences of feature tensors need to be compressed or further analyzed.




1 Introduction

Collaborative intelligence (CI) [1] has emerged as a promising strategy to bring AI “to the edge.” In a typical CI system (Fig. 1), a deep neural network (DNN) is split into two parts: the edge sub-model, deployed on the edge device near the sensor, and the cloud sub-model deployed in the cloud. Intermediate features produced by the edge sub-model are transferred from the edge device to the cloud. It has been shown that such a strategy may provide better energy efficiency [2, 3], lower latency [2, 3, 4], and lower bitrates over the communication channel [5, 6], compared to more traditional cloud-based analytics where the input signal is directly sent to the cloud. These potential benefits will find a number of applications in areas such as intelligent sensing [7] and video coding for machines [8, 9]. In particular, compression of intermediate features has become an important research problem, with a number of recent developments [10, 11, 12, 13, 14] for the case when the input to the edge sub-model is a still image.

When the input to the edge sub-model is video, its output is a sequence of feature tensors produced from successive frames in the input video. This sequence of feature tensors needs to be compressed prior to transmission and then decoded in the cloud for further processing. Since motion plays such an important role in video processing and compression, we are motivated to examine whether any similar relationship exists in the latent space among the feature tensors. Our theoretical and experimental results show that, indeed, motion from the input video is approximately preserved in the channels of the feature tensor. An illustration of this is presented in Fig. 2, where the estimated input-space motion field is shown on the left, and the estimated motion fields in several feature tensor channels are shown on the right. These findings suggest that methods for motion estimation, compensation, and analysis that have been developed for conventional video processing and compression may provide a solid starting point for equivalent operations in the latent space.

Figure 1: Basic collaborative intelligence system.
Figure 2: Motion estimates for input frames (left) and select channels from the output of ResNet-34’s add_3 layer (right).

The paper is organized as follows. In Section 2, we analyze the actions of typical operations found in deep convolutional neural networks on optical flow in the input signal, and show that these operations tend to preserve the optical flow, at least approximately, with an appropriate scale. In Section 3 we provide empirical support for the theoretical analysis from Section 2. Finally, Section 4 concludes the paper.

Figure 3: The problem studied in this paper: if input images are related via motion, what is the relationship between the corresponding intermediate feature tensors?

2 Latent space motion analysis

The basic problem studied in this paper is illustrated in Fig. 3. Consider two images (video frames) input to the edge sub-model of a CI system. It is common to represent their relationship via a motion model. The question we seek to answer here is, “what is the relationship between the corresponding feature tensors produced by the edge sub-model?” To answer this question, we will look at the processing pipeline between the input image and a given channel of a feature tensor. In most deep models for computer vision applications, this processing pipeline consists of a sequence of basic operations: convolutions, pointwise nonlinearities, and pooling. We will show that each of these operations tends to preserve motion, at least approximately, in a certain sense, and from this we will conclude that (approximate) input motion may be observed in individual channels of a feature tensor.

Motion model. Optical flow is a frequently used motion model in computer vision and video processing. In a “2D+t” model, $I(x, y, t)$ denotes pixel intensity at time $t$, at spatial position $(x, y)$. Under a constant-intensity assumption, optical flow satisfies the following partial differential equation [15]:

$$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0, \tag{1}$$

where $(v_x, v_y)$ represents the motion vector. For notational simplicity, in the analysis below we will use a “1D+t” model, which captures all the main ideas but keeps the equations shorter. In a “1D+t” model, $I(x, t)$ denotes intensity at position $x$ at time $t$, and the optical flow equation is

$$\frac{\partial I}{\partial x} v + \frac{\partial I}{\partial t} = 0, \tag{2}$$

with $v$ representing the motion. We will analyze the effect of basic operations (convolutions, pointwise nonlinearities, and pooling) on (2), to gain insight into the relationship between input space motion and latent space motion.
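As a quick numerical illustration of the 1D+t optical flow equation, $\frac{\partial I}{\partial x} v + \frac{\partial I}{\partial t} = 0$, the following sketch builds a synthetic translating signal and checks that the residual is close to zero for the true velocity but not for a wrong one. The Gaussian test signal, the grid sizes, and the helper name `flow_residual` are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def flow_residual(I, dx, dt, v):
    """Residual I_x * v + I_t of the 1D+t optical flow equation on a grid I[t, x]."""
    I_t, I_x = np.gradient(I, dt, dx)  # finite differences along t (axis 0) and x (axis 1)
    return I_x * v + I_t

# Assumed test signal: a Gaussian bump translating at velocity v_true.
v_true = 1.5
x = np.linspace(-10.0, 10.0, 801)
t = np.linspace(0.0, 1.0, 101)
dx, dt = x[1] - x[0], t[1] - t[0]
X, T = np.meshgrid(x, t)  # both of shape (len(t), len(x))
I = np.exp(-0.5 * (X - v_true * T) ** 2)

# Evaluate on interior samples only, where central differences are used.
interior = (slice(1, -1), slice(1, -1))
res_true = np.abs(flow_residual(I, dx, dt, v_true)[interior]).max()
res_wrong = np.abs(flow_residual(I, dx, dt, -v_true)[interior]).max()
print(f"max |residual| with true v:  {res_true:.2e}")
print(f"max |residual| with wrong v: {res_wrong:.2e}")
```

The residual for the true velocity is limited only by the finite-difference error, whereas the wrong velocity leaves a residual comparable to the signal gradient itself.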

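Convolution. Let $f$ be a (spatial) filter kernel, then the optical flow after convolution is a solution to the following equation

$$\frac{\partial}{\partial x}(f * I)\, v' + \frac{\partial}{\partial t}(f * I) = 0, \tag{3}$$

where $v'$ is the motion after the convolution. Since the convolution and differentiation are linear operations, and treating $v'$ as constant over the support of $f$, we have

$$f * \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \tag{4}$$

Hence, the solution $v$ from (2) is also a solution of (4), but (4) could also have other solutions, besides those that satisfy (2).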
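The convolution argument can be checked numerically with a small sketch: a translating 1D signal is filtered along $x$, and the velocity is re-estimated from the filtered signal by a least-squares fit of the optical flow equation. The kernel values, the test signal, and the helper `estimate_velocity` are hypothetical choices for illustration only.

```python
import numpy as np

def estimate_velocity(I, dx, dt):
    """Least-squares solution v of I_x * v + I_t = 0 over the interior of the grid I[t, x]."""
    I_t, I_x = np.gradient(I, dt, dx)
    I_t, I_x = I_t[1:-1, 1:-1], I_x[1:-1, 1:-1]
    return -np.sum(I_x * I_t) / np.sum(I_x ** 2)

# Assumed test signal: a Gaussian bump translating at velocity v_true.
v_true = 1.5
x = np.linspace(-10.0, 10.0, 801)
t = np.linspace(0.0, 1.0, 101)
dx, dt = x[1] - x[0], t[1] - t[0]
X, T = np.meshgrid(x, t)
I = np.exp(-0.5 * (X - v_true * T) ** 2)

# Hypothetical spatial filter kernel f, applied along x to every time slice.
f = np.array([1.0, 2.0, 0.0, -1.0, 1.0]) / 3.0
I_conv = np.stack([np.convolve(row, f, mode="same") for row in I])

v_before = estimate_velocity(I, dx, dt)
v_after = estimate_velocity(I_conv, dx, dt)
print(f"estimated v before convolution: {v_before:.4f}")
print(f"estimated v after convolution:  {v_after:.4f}")
```

Both estimates should be close to the true velocity, consistent with the pre-convolution motion also solving the post-convolution flow equation.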

Pointwise nonlinearity. Nonlinear activations such as sigmoid, ReLU, etc., are commonly applied in a pointwise fashion on the output of convolutions in deep models. Let $\sigma(\cdot)$ denote such a pointwise nonlinearity, then the optical flow after this nonlinearity is a solution to the following equation

$$\frac{\partial}{\partial x}\left[\sigma(I)\right] v' + \frac{\partial}{\partial t}\left[\sigma(I)\right] = 0, \tag{5}$$

where $v'$ is the motion after the pointwise nonlinearity. By using the chain rule of differentiation, the above equation can be rewritten as

$$\sigma'(I) \cdot \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \tag{6}$$

Hence, again, the solution $v$ from (2) is also a solution of (6). It should be noted that (6) may have solutions other than those from (2). For example, in the region where inputs to ReLU are negative, the corresponding outputs will be zero, so $\sigma'(I) = 0$. Hence, in those regions, (6) will be satisfied for arbitrary $v'$. Nonetheless, the solution from (2) is still one of those solutions.
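The two regimes described for ReLU can be illustrated with a short sketch: where the input is positive, the post-ReLU signal satisfies the flow equation with the input velocity; where the input is negative, both derivatives vanish and any velocity satisfies it. The sinusoidal test signal, the 0.1 margin used to stay away from the ReLU kink, and the value of `arbitrary_v` are assumptions for illustration.

```python
import numpy as np

# Assumed translating signal with both positive and negative regions.
v_true = 1.5
x = np.linspace(-10.0, 10.0, 801)
t = np.linspace(0.0, 1.0, 101)
dx, dt = x[1] - x[0], t[1] - t[0]
X, T = np.meshgrid(x, t)
I = np.sin(X - v_true * T)

J = np.maximum(I, 0.0)             # ReLU applied pointwise
J_t, J_x = np.gradient(J, dt, dx)  # derivatives along t (axis 0) and x (axis 1)

inner = (slice(1, -1), slice(1, -1))  # skip one-sided differences at the grid edges
active = (I > 0.1)[inner]    # ReLU passes the signal; margin keeps stencils off the kink
dead = (I < -0.1)[inner]     # ReLU output is identically zero around these samples

# In the active region the input velocity solves the post-ReLU flow equation.
res_active = np.abs(J_x * v_true + J_t)[inner][active].max()

# In the dead region both derivatives are zero, so any velocity is a solution.
arbitrary_v = 123.0
res_dead = np.abs(J_x * arbitrary_v + J_t)[inner][dead].max()

print(f"active region, true v:     max |residual| = {res_active:.2e}")
print(f"dead region, arbitrary v:  max |residual| = {res_dead:.2e}")
```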


Pooling. There are various forms of pooling, such as max-pooling, mean-pooling, learnt pooling (via strided convolutions), etc. All these can be decomposed into a sequence of two operations: a spatial operation (local maximum or convolution) followed by scale change (downsampling). Spatial convolution operations can be analyzed as above, and the conclusion is that motion before such an operation is also a solution to the optical flow equation after such an operation. Hence, we will focus here on the local maximum operation and the scale change.

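Local maximum. Consider the maximum of function $I(x, t)$ over a local spatial region $[x_0 - h, x_0 + h]$, at a given time $t$. We can approximate $I(x, t)$ as a locally-linear function, whose slope is the spatial derivative of $I$ at $x_0$, $\frac{\partial}{\partial x} I(x_0, t)$. If the derivative is positive, the maximum is $I(x_0 + h, t)$, and if it is negative, it is $I(x_0 - h, t)$. In the special case when the derivative is zero, any point in $[x_0 - h, x_0 + h]$, including the endpoints, is a maximum. From Taylor series expansion of $I(x, t)$ around $x_0$ up to and including the first-order term,

$$I(x_0 + \epsilon, t) \approx I(x_0, t) + \epsilon \cdot \frac{\partial}{\partial x} I(x_0, t), \tag{7}$$

for $|\epsilon| \le h$. With such linear approximation, the local maximum of $I(x, t)$ over $[x_0 - h, x_0 + h]$ occurs either at $x_0 - h$ or at $x_0 + h$, depending on the sign of $\frac{\partial}{\partial x} I(x_0, t)$; if the derivative is zero, every point in the interval is a local maximum. Hence, the local maximum of $I(x, t)$ can be approximated as

$$\max_{x \in [x_0 - h,\, x_0 + h]} I(x, t) \approx I(x_0, t) + h \cdot \left| \frac{\partial}{\partial x} I(x_0, t) \right|. \tag{8}$$

Let (8) be the definition of $M(x, t)$, the function that takes on local spatial maximum values of $I(x, t)$ over windows of size $2h$. The optical flow after such a local maximum operation is described by

$$\frac{\partial M}{\partial x} v' + \frac{\partial M}{\partial t} = 0, \tag{9}$$

where $v'$ represents the motion after the local spatial maximum operation. Using (8) in (9), after some manipulation we obtain the following equation

$$\frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} + h \cdot \operatorname{sign}\!\left( \frac{\partial I}{\partial x} \right) \cdot \frac{\partial}{\partial x} \left( \frac{\partial I}{\partial x} v' + \frac{\partial I}{\partial t} \right) = 0. \tag{10}$$

Note that if $v$ satisfies the original optical flow equation (2), it will also satisfy (10), hence pre-max motion $v$ is also one possible solution to post-max motion $v'$.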
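A minimal numerical sketch of the local-maximum analysis follows. It compares a sliding-window maximum against the first-order approximation $I(x_0, t) + h \cdot |\partial I / \partial x|$, and then re-estimates the velocity after the sliding maximum. The window half-width `r`, the Gaussian test signal, and the helper `estimate_velocity` are assumptions for illustration, not values from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def estimate_velocity(I, dx, dt):
    """Least-squares solution v of I_x * v + I_t = 0 over the interior of the grid I[t, x]."""
    I_t, I_x = np.gradient(I, dt, dx)
    I_t, I_x = I_t[1:-1, 1:-1], I_x[1:-1, 1:-1]
    return -np.sum(I_x * I_t) / np.sum(I_x ** 2)

# Assumed test signal: a Gaussian bump translating at velocity v_true.
v_true = 1.5
x = np.linspace(-10.0, 10.0, 801)
t = np.linspace(0.0, 1.0, 101)
dx, dt = x[1] - x[0], t[1] - t[0]
X, T = np.meshgrid(x, t)
I = np.exp(-0.5 * (X - v_true * T) ** 2)

r = 2        # window half-width in samples (assumed)
h = r * dx   # window half-width in units of x

# Sliding maximum over windows of 2r + 1 samples along x (no downsampling).
M = sliding_window_view(I, 2 * r + 1, axis=1).max(axis=-1)  # shape (len(t), len(x) - 2r)

# First-order approximation of the local maximum at the window centres.
I_x = np.gradient(I, dx, axis=1)
approx = I[:, r:-r] + h * np.abs(I_x[:, r:-r])
approx_err = np.abs(M - approx).max()

v_after_max = estimate_velocity(M, dx, dt)
print(f"max error of the first-order approximation: {approx_err:.2e}")
print(f"estimated v after the local maximum:        {v_after_max:.4f}")
```

The approximation error is second order in $h$, and the velocity estimated after the local maximum should stay close to the input velocity.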

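Scale change. Finally, consider the change of spatial scale by a factor $s$, such that the new signal is $I'(x, t) = I(s \cdot x, t)$. The optical flow equation is now

$$\frac{\partial I'}{\partial x} v' + \frac{\partial I'}{\partial t} = 0. \tag{11}$$

Since $\frac{\partial I'}{\partial x} = s \cdot \frac{\partial I}{\partial x}$ and $\frac{\partial I'}{\partial t} = \frac{\partial I}{\partial t}$, we conclude that $v' = v / s$, where $v$ is the solution to pre-scaling motion (2). Hence, as expected, down-scaling the signal spatially by a factor of $s$ ($s > 1$) would reduce the motion by a factor of $s$.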
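The effect of scale change can be sketched numerically by sampling the spatially rescaled signal $I'(x, t) = I(s \cdot x, t)$ and estimating its velocity, which should come out close to $v / s$. The scale factor `s = 2`, the profile `g`, and the helper `estimate_velocity` are illustrative assumptions.

```python
import numpy as np

def estimate_velocity(I, dx, dt):
    """Least-squares solution v of I_x * v + I_t = 0 over the interior of the grid I[t, x]."""
    I_t, I_x = np.gradient(I, dt, dx)
    I_t, I_x = I_t[1:-1, 1:-1], I_x[1:-1, 1:-1]
    return -np.sum(I_x * I_t) / np.sum(I_x ** 2)

def g(u):
    """Assumed intensity profile (Gaussian bump)."""
    return np.exp(-0.5 * u ** 2)

v_true = 1.5
s = 2.0  # assumed spatial down-scaling factor
x = np.linspace(-10.0, 10.0, 801)
t = np.linspace(0.0, 1.0, 101)
dx, dt = x[1] - x[0], t[1] - t[0]
X, T = np.meshgrid(x, t)

I = g(X - v_true * T)             # original signal I(x, t), moving at v_true
I_scaled = g(s * X - v_true * T)  # rescaled signal I'(x, t) = I(s * x, t)

v_before = estimate_velocity(I, dx, dt)
v_after = estimate_velocity(I_scaled, dx, dt)
print(f"v before scaling: {v_before:.4f}")
print(f"v after scaling:  {v_after:.4f} (v / s = {v_true / s:.4f})")
```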

Combining the results of the above analyses, we conclude that convolutions, pointwise nonlinearities, and local maximum operations tend to be motion-preserving operations, in the sense that pre-operation motion is also a solution to post-operation optical flow, at least approximately. The operation with the most obvious impact on motion is scale change. Hence, when looking at latent-space motion at some layer in a deep model, we should expect to find motion similar to the input motion, but scaled down by a factor of $n^k$, where $k$ is the number of pooling operations (over $n \times n$ windows) between the input and the layer of interest. Specifically, if $v$ is the motion vector at some position in the input frame, then at the corresponding spatial location in all the channels of the feature tensor we can expect to find the vector

$$v' \approx \frac{v}{n^k}. \tag{12}$$
In Section 3, we will verify these conclusions experimentally.
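The scaling relation $v' \approx v / n^k$ is simple enough to express directly. In the sketch below, the function names and the example values ($n = 2$, $k = 3$, i.e. an overall down-scaling factor of 8) are hypothetical and are not tied to any particular layer discussed in the paper.

```python
def latent_motion(v, n, k):
    """Scale an input-space motion vector v = (vx, vy) by 1 / n**k."""
    scale = float(n) ** k
    return tuple(c / scale for c in v)

def is_integer_latent_shift(v, n, k):
    """True if the input-space motion v maps to a whole number of latent-space pixels."""
    return all(float(c).is_integer() for c in latent_motion(v, n, k))

# Hypothetical example: three 2x2 pooling stages between the input and the layer.
n, k = 2, 3
v_aligned = (16, -8)    # maps to a whole-pixel latent shift
v_fractional = (12, 4)  # maps to a fractional latent shift

print(latent_motion(v_aligned, n, k), is_integer_latent_shift(v_aligned, n, k))
print(latent_motion(v_fractional, n, k), is_integer_latent_shift(v_fractional, n, k))
```

Input shifts that are multiples of $n^k$ map to whole-pixel latent shifts, which can be applied to a tensor without interpolation.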

3 Experiments

Figure 4: Examples of motion transformations applied to reference image. The output tensors of ResNet-34’s add_3 layer are reliably predicted from only the reference tensor and known input-space motion.
Figure 5: NRMSE across parameter ranges for translation (top-left), rotation (top-right), scaling (bottom-left), and shear (bottom-right). For translation, NRMSE local minima occur when the input-space shifts correspond to integer latent-space shifts in (12).
Figure 6: NRMSE histogram computed for an affine motion model [16] over the combination of the following seven independent uniformly distributed parameters: x- and y-translation (in px), x- and y-scaling (0.95–1.05), x- and y-shearing, and rotation.

An illustration of the correspondence between the input-space motion and latent-space motion was shown in Fig. 2. This example was produced using a pair of frames from a video of a moving car. The motion vectors were estimated using an exhaustive block-matching search at each pixel, which sought to minimize the sum of squared differences (SSD). The block size and search range were chosen separately for the input frames and for the corresponding feature tensor channels, which have a different resolution. Although the estimated motion vector fields are somewhat noisy, the similarity between the input-space motion and latent-space motion is evident.
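The following is a minimal sketch of exhaustive per-pixel block matching that minimizes the SSD, in the spirit of the procedure described above. It is not the authors' implementation: the function names, the block half-width `r`, the search range, and the circular (wrap-around) handling of image borders are all simplifying assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_sum(a, r):
    """Sum of `a` over (2r+1) x (2r+1) windows centred on each pixel (zero-padded borders)."""
    padded = np.pad(a, r)
    windows = sliding_window_view(padded, (2 * r + 1, 2 * r + 1))
    return windows.sum(axis=(-2, -1))

def block_match(prev, curr, r, search):
    """Per-pixel exhaustive block matching.

    For each pixel (y, x) of `prev`, returns the displacement (dy, dx) within
    [-search, search]^2 that minimizes the SSD between the block around (y, x)
    in `prev` and the block around (y + dy, x + dx) in `curr`.
    """
    H, W = prev.shape
    best = np.full((H, W), np.inf)
    flow = np.zeros((H, W, 2), dtype=int)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # shifted[y, x] == curr[y + dy, x + dx], with wrap-around at the borders
            shifted = np.roll(curr, shift=(-dy, -dx), axis=(0, 1))
            ssd = box_sum((prev - shifted) ** 2, r)
            better = ssd < best
            best[better] = ssd[better]
            flow[better] = (dy, dx)
    return flow

# Synthetic check: a random frame and a copy circularly shifted by (2, -1) pixels.
rng = np.random.default_rng(0)
prev = rng.random((32, 32))
curr = np.roll(prev, shift=(2, -1), axis=(0, 1))
flow = block_match(prev, curr, r=2, search=3)
print("recovered (dy, dx) at the centre pixel:", tuple(flow[16, 16]))
```

The same routine can be applied to a single feature tensor channel by passing two 2D channel slices instead of two frames.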

To examine the relationship between input-space and latent-space motion more closely, we performed several experiments with synthetic input-space motion. In this case, exact input-space motion is known, so relationship (12) can be tested more reliably. Fig. 4 shows examples of various transformations (translation, rotation, stretching, shearing) applied to an input image of a dog. The second column displays several channels from the actual tensor produced by the transformed image, and the third column shows the corresponding channels produced by motion compensating the tensor of the original image via (12). The last column shows the difference between the actual and predicted tensor channels. Note that regions that cannot be predicted, such as regions “entering the frame,” were excluded from difference computation. As seen in Fig. 4, the model (12) works reasonably well, and the differences between the actual and predicted tensors are low.
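To make the idea of motion compensating a reference tensor concrete, here is a toy sketch for pure translation. It replaces the real edge sub-model with a hypothetical single-channel pipeline (circular 3×3 convolution, ReLU, 2×2 max-pooling), so that $n = 2$ and $k = 1$ in the relation $v' \approx v / n^k$. The kernel, the random test image, and the rounding of fractional latent shifts are assumptions; this is not the experimental setup of the paper.

```python
import numpy as np

def toy_edge_model(img, kernel):
    """Hypothetical stand-in for an edge sub-model: circular 3x3 conv -> ReLU -> 2x2 max-pool."""
    out = np.zeros_like(img)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * np.roll(img, shift=(i - 1, j - 1), axis=(0, 1))
    out = np.maximum(out, 0.0)
    H, W = out.shape
    return out.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def predict_tensor(ref, v, n=2, k=1):
    """Shift the reference tensor by v / n**k, rounded to whole latent pixels."""
    dy, dx = (int(round(c / n ** k)) for c in v)
    return np.roll(ref, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
kernel = np.array([[0.0, -1.0, 0.0],
                   [-1.0, 4.0, -1.0],
                   [0.0, -1.0, 0.0]])  # assumed Laplacian-like kernel
ref = toy_edge_model(img, kernel)

def prediction_error(v):
    """Max absolute difference between the actual and the motion-compensated tensor."""
    actual = toy_edge_model(np.roll(img, shift=v, axis=(0, 1)), kernel)
    return np.abs(actual - predict_tensor(ref, v)).max()

err_even = prediction_error((2, 4))  # maps to an integer latent shift of (1, 2)
err_odd = prediction_error((1, 0))   # maps to a fractional latent shift of (0.5, 0)
print(f"error for input shift (2, 4): {err_even:.3e}")
print(f"error for input shift (1, 0): {err_odd:.3e}")
```

In this toy setting the prediction is exact when the input shift maps to a whole number of latent pixels, and degrades for fractional latent shifts, which is consistent with the local NRMSE minima reported for translation.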

For quantitative evaluation, experiments were conducted on several layers of ResNet-34 [17] and DenseNet-121 [18]. Normalized Root Mean Square Error (NRMSE) [19] was used for this purpose:

$$\text{NRMSE} = \frac{1}{R} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( T_i - \widehat{T}_i \right)^2}, \tag{13}$$

where $T_i$ is the actual tensor value produced from the transformed input, $\widehat{T}_i$ is the tensor value predicted using our motion model (12), $N$ is the number of elements in the feature tensor, and $R$ is the dynamic range. Again, regions that cannot be predicted were excluded from NRMSE computation. Fig. 5 shows NRMSE computed across a range of parameters for several transformations, at various layers of the two DNNs.
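A minimal sketch of the NRMSE computation with an exclusion mask is given below. The function signature, the `valid` mask argument, and the default choice of dynamic range (maximum minus minimum of the actual tensor) are assumptions; the text above does not specify how the dynamic range was obtained.

```python
import numpy as np

def nrmse(actual, predicted, valid=None, dynamic_range=None):
    """Root mean square error over the valid elements, normalized by the dynamic range."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if valid is None:
        valid = np.ones(actual.shape, dtype=bool)
    if dynamic_range is None:
        # Assumption: dynamic range taken as max - min of the actual tensor.
        dynamic_range = actual.max() - actual.min()
    err = (actual - predicted)[valid]
    return float(np.sqrt(np.mean(err ** 2)) / dynamic_range)

# Tiny hand-checkable example: constant error of 0.4 over a dynamic range of 4.
actual = np.array([[0.0, 1.0], [2.0, 4.0]])
predicted = actual + 0.4
valid = np.array([[True, True], [True, False]])  # exclude one "unpredictable" element
value = nrmse(actual, predicted, valid)
print(f"NRMSE = {value:.3f}")
```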

As seen in Fig. 5, NRMSE goes up to about 0.04 for reasonable ranges of transformation parameters. How good is this? To answer this question, we set out to find the typical values of NRMSE found in conventional motion-compensated frame prediction. In a recent study [20], the quality of frames predicted by conventional motion estimation and motion compensation (MEMC) in High Efficiency Video Coding (HEVC) [21] was compared against a DNN developed for frame prediction. From Table III in [20], the luminance Peak Signal to Noise Ratio (PSNR) of frames predicted uni-directionally by the DNN and conventional HEVC MEMC was in the range 27–41 dB over several HEVC test sequences. NRMSE can be computed from PSNR as

$$\text{NRMSE} = 10^{-\text{PSNR}/20}, \tag{14}$$

so the PSNR range of 27–41 dB corresponds to the NRMSE range of 0.009–0.044. These levels of NRMSE are indicative of how much motion models used in video coding deviate from the true motion. As seen in Fig. 5, the model (12) produces NRMSE in the same range, so the accuracy of (12) is comparable to the accuracy of common motion models used in video coding. Another illustration of this is presented in Fig. 6, which shows the histogram of NRMSE computed across a range of affine transformation parameters. Hence, (12) represents a good starting point for development of latent-space motion compensation.
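The PSNR-to-NRMSE conversion follows from the standard definition $\text{PSNR} = 20 \log_{10}(R / \text{RMSE})$, where $R$ is the dynamic range, which gives $\text{NRMSE} = \text{RMSE} / R = 10^{-\text{PSNR}/20}$. The helper names in the sketch below are hypothetical.

```python
import math

def nrmse_from_psnr(psnr_db):
    """NRMSE = RMSE / R implied by PSNR = 20 * log10(R / RMSE)."""
    return 10.0 ** (-psnr_db / 20.0)

def psnr_from_nrmse(nrmse):
    """Inverse mapping, returning PSNR in dB."""
    return -20.0 * math.log10(nrmse)

nrmse_at_41 = nrmse_from_psnr(41.0)
nrmse_at_27 = nrmse_from_psnr(27.0)
print(f"41 dB -> NRMSE {nrmse_at_41:.4f}")
print(f"27 dB -> NRMSE {nrmse_at_27:.4f}")
```

The two endpoints evaluate to roughly 0.0089 and 0.0447, which matches the quoted NRMSE range up to rounding.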

4 Conclusions

Using the concept of optical flow, in this paper we analyzed motion in the latent space of a deep model induced by the motion in the input space, and showed that motion tends to be approximately preserved in the channels of intermediate feature tensors. These findings suggest that motion estimation, compensation, and analysis methods developed for conventional video signals should be able to provide a good starting point for latent-space motion processing, such as motion-compensated prediction and compression, tracking, action recognition, and other applications.

References


  • [1] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, to appear.
  • [2] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in Proc. 22nd ACM Int. Conf. Arch. Support Programming Languages and Operating Syst., 2017, pp. 615–629.
  • [3] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Computing, 2019, Early Access.
  • [4] M. Ulhaq and I. V. Bajić, “Shared mobile-cloud inference for collaborative intelligence,” arXiv:2002.00157, 2019, NeurIPS’19 demonstration.
  • [5] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, Oct. 2018, pp. 3743–3747.
  • [6] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” in Proc. IEEE MMSP, Aug. 2018, pp. 1–6.
  • [7] Z. Chen, K. Fan, S. Wang, L. Duan, W. Lin, and A. C. Kot, “Toward intelligent sensing: Intermediate deep feature compression,” IEEE Trans. Image Processing, vol. 29, pp. 2230–2243, 2019.
  • [8] ISO/IEC, “Draft call for evidence for video coding for machines,” ISO/IEC JTC 1/SC 29/WG 11 W19508, Jul. 2020.
  • [9] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Transactions on Image Processing, vol. 29, pp. 8680–8695, 2020.
  • [10] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.
  • [11] H. Choi, R. A. Cohen, and I. V. Bajić, “Back-and-forth prediction for deep tensor compression,” in Proc. IEEE ICASSP, 2020, pp. 4467–4471.
  • [12] S. R. Alvar and I. V. Bajić, “Bit allocation for multi-task collaborative intelligence,” in Proc. IEEE ICASSP, May 2020, pp. 4342–4346.
  • [13] R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of neural network feature tensors for collaborative intelligence,” in Proc. IEEE ICME, Jul. 2020, pp. 1–6.
  • [14] S. R. Alvar and I. V. Bajić, “Pareto-optimal bit allocation for collaborative intelligence,” arXiv:2009.12430, Sep. 2020.
  • [15] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1, pp. 185–203, 1981.
  • [16] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, Prentice-Hall, 2002.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.
  • [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE CVPR, 2017, pp. 2261–2269.
  • [19] Wikipedia contributors, “Root-mean-square deviation — Wikipedia, the free encyclopedia,” 2020, [Online] Available:
  • [20] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1843–1855, Jul. 2020.
  • [21] ITU, “High efficiency video coding,” Recommendation ITU-T H.265, Nov. 2019.