Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance

07/08/2020 ∙ by Qi Zhang, et al. ∙ City University of Hong Kong

Multi-camera surveillance has been an active research topic for understanding and modeling scenes. Compared to a single camera, multiple cameras provide a larger field-of-view and more object cues, with related applications including multi-view counting, multi-view tracking, 3D pose estimation, and 3D reconstruction. It is usually assumed that the cameras are all temporally synchronized when designing models for these multi-camera based tasks. However, this assumption is not always valid, especially for multi-camera systems with network transmission delay and low frame rates due to limited network bandwidth, resulting in desynchronization of the captured frames across cameras. To handle the issue of unsynchronized multi-cameras, in this paper, we propose a synchronization model that works in conjunction with existing DNN-based multi-view models, thus avoiding the redesign of the whole model. Under the low-fps regime, we assume that only a single relevant frame is available from each view, and synchronization is achieved by matching image contents guided by epipolar geometry. We consider two variants of the model, based on where in the pipeline the synchronization occurs: scene-level synchronization and camera-level synchronization. The view synchronization step and the task-specific view fusion and prediction step are unified in the same framework and trained in an end-to-end fashion. Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting, including multi-view counting and 3D pose estimation, and achieve good performance compared to baselines.


I Introduction

Compared to single cameras, multi-camera networks allow better understanding and modeling of the 3D world through denser sampling of information in a 3D scene [1]. Multi-camera based vision tasks have been a popular research field, especially deep learning related tasks, such as 3D pose estimation from multiple 2D observations [20], 3D reconstruction [23, 15], multi-view tracking [5, 7] and multi-view crowd counting [43]. For these multi-view tasks, it is usually assumed that the cameras are temporally synchronized when designing DNN models, i.e., all cameras capture images at the same time point. However, the synchronization assumption for multi-camera systems may not always be valid in practical applications due to a variety of reasons, such as dropped camera frames due to limited network bandwidth or system resources, network transmission delays, etc. Other examples of situations where camera synchronization is not possible include: 1) using images captured from different camera systems; 2) using images from social media to reconstruct the crowd at an event; 3) performing 3D reconstruction of a dynamic scene using video from a drone.

Thus, handling unsynchronized multi-cameras is an important issue in the adoption and practical usage of multi-view computer vision.

Fig. 1: Two variants of the main pipeline for unsynchronized multi-view prediction tasks: (top) scene-level synchronization is performed after the projection on the scene-level feature representations; (bottom) camera-level synchronization is performed on the camera-view feature maps before projection.
Fig. 2: A general multi-view pipeline consists of several stages: camera-view feature extraction, feature projection, multi-view feature fusion to obtain a scene-level representation, and prediction. (a) Scene-level synchronization performs the synchronization after the projection. The unsynchronized projected features from the reference view and other views are concatenated to predict the motion flow, which is then used to warp the other views’ projected features to match those of the reference view. (b) Camera-level synchronization performs the synchronization before the projection. The unsynchronized camera-view features from the reference view and other views are matched together to predict the motion flow, which is used to warp features from other camera views to the reference view.

There are several possible methods to fix the problem of unsynchronized cameras. The first is a hardware-based solution that synchronizes the capture times, e.g., by improving the network bandwidth or by using a central clock to trigger capture across all cameras in the multi-camera network. However, this increases the cost and overhead of the system, and is not possible when the bandwidth is limited. The second method is to capture image sequences from each camera, and then synchronize the images afterwards by determining the frame offset between cameras. The granularity of this synchronization depends on the frame rate of the image sequences. However, acquiring high frame-rate image sequences is not always possible from multi-camera systems, especially when bandwidth and storage are limited (e.g., in practical applications under limited bandwidth, only one frame can be sent every 5 s). The final method is to modify the multi-view model to handle unsynchronized images, for example by introducing new assumptions or relaxing the original constraints under the unsynchronized setting. Existing approaches for handling unsynchronized multi-cameras are largely based on optimization frameworks [46, 45], but are not directly applicable to DNN-based multi-view methods, which have seen recent successes in tracking [5, 7], 3D pose estimation [20], and crowd counting [43, 44].

In this paper, we consider the regime of low-fps multi-view camera systems – we only assume that a single relevant image is captured from each camera, and thus the input is a set of unsynchronized multi-camera images. We propose a synchronization model that operates in conjunction with existing DNN-based multi-view models to allow them to work on unsynchronized data. Our proposed model first synchronizes the other views to a reference view using a differentiable module, and then the synchronized multi-view features are fused and decoded to obtain the task-oriented output. We consider two variants of our model that perform synchronization at different stages in the pipeline (see Fig. 2): 1) scene-level synchronization performs the synchronization after projecting the camera features to their 3D scene representation; 2) camera-level synchronization performs the synchronization between camera views first, and then projects the synchronized 2D feature maps to their 3D representations. In both cases, the motion flow between the cameras’ feature maps is estimated and then used to warp the feature maps to align with the reference view (either at the scene level or the camera level). With both variants, the view synchronization and the multi-view fusion are unified in the same framework and trained in an end-to-end fashion. In this way, the original DNN-based multi-view model can be adapted to work in the unsynchronized setting by adding the view synchronization module, thus avoiding the need to design a new model. Furthermore, the synchronization module only relies on content-based image matching and camera geometry, and thus is widely applicable to many DNN-based multi-view tasks, such as crowd counting, tracking, 3D pose estimation, and 3D reconstruction.

In summary, the contributions of this paper are 3-fold:

  • We propose an end-to-end trainable framework to handle the issue of unsynchronized multi-camera images in DNN-based multi-camera vision tasks. To the best of our knowledge, this is the first study on DNN-based single-frame synchronization of multi-view cameras.

  • We propose two synchronization modules, scene-level synchronization and camera-level synchronization, both based on image content matching guided by epipolar geometry. The synchronization modules can be applied to many different DNN-based multi-view tasks.

  • We conduct experiments on multi-view counting and 3D pose estimation from unsynchronized images, demonstrating the efficacy of our approach.

II Related Work

In this section, we review methods on synchronized multi-view images and unsynchronized multi-view videos, as well as traditional multi-view video synchronization methods. We then review DNN-based image matching and flow estimation.

II-A DNN-based synchronized multi-camera tasks

Multi-camera surveillance based on DNNs has been an active research area. By utilizing multi-view cues and the strong mapping power of DNNs, many DNN models have been proposed to solve multi-view surveillance tasks, such as multi-view tracking and detection [5, 7], crowd counting [43], 3D reconstruction [23, 15, 9, 39] and 3D human pose estimation [20, 24, 8, 22, 29]. The DNN pipelines used for these multi-camera tasks can generally be divided into three stages: single-view feature extraction, multi-view fusion to obtain a scene-level representation, and prediction. However, all these DNN-based methods assume that the input multi-view images are synchronized, which is not always possible in real multi-camera surveillance systems, or in multi-view data from disparate sources (e.g., crowd-sourced images). Therefore, relaxing the synchronization assumption allows more practical applications of multi-camera vision tasks in the real world.

II-B Tasks on unsynchronized multi-camera video

Only a few works have considered computer vision tasks on unsynchronized multi-camera video. [46] posed the estimation of the 3D structure observed by multiple unsynchronized video cameras as a dictionary learning problem. [45] proposed a multi-camera motion segmentation method for unsynchronized videos that combines shape and dynamical information. [37] proposed a method for estimating 3D human pose from multi-view videos captured by unsynchronized and uncalibrated cameras, utilizing the projections of joints as corresponding points. [2] presented a method for simultaneously estimating camera geometry and time shift from video sequences of multiple unsynchronized cameras using minimal correspondence sets. [25] addressed the problem of aligning unsynchronized camera views with low and/or variable frame rates by using the intersections of corresponding object trajectories to match views. Note that all these methods assume that videos or image sequences are available to perform the synchronization. In contrast, our framework, which is motivated by practical low-fps systems, solves a harder problem, where only a single image is available from each camera view, i.e., there is no temporal information available. Furthermore, these methods pose frame synchronization as optimization problems that are applicable only to their particular multi-view task, and cannot be directly applied to DNN-based multi-view models. In contrast, we propose a synchronization module that can be broadly applied to many DNN-based multi-camera models, enabling their use with unsynchronized inputs.

II-C Traditional methods for multi-view video synchronization

Traditional synchronization methods usually serve as a preprocessing step for multi-camera surveillance tasks. Apart from audio-based synchronization such as [14], most traditional camera synchronization methods rely on videos or image sequences and hand-crafted features for camera alignment/synchronization [10, 28, 41, 38, 40]. Typical approaches recover the temporal offset by matching features extracted from the videos, e.g., space-time feature trajectories [6, 33, 27], image features [18], low-level temporal signals based on fundamental matrices [31], silhouette motion [35], and relative object motion [13]. The accuracy of feature matching is improved using epipolar geometry [18, 35] and rank constraints [33]. [6] proposed to match space-time feature trajectories instead of feature points to reduce the search space. [18] utilized image feature correspondences and epipolar geometry to find the corresponding frame indices, and computed the relative frame rate and offset by fitting a 2D line to the index correspondences. [27] estimated the frame-accurate offset by analysing the trajectories and matching their characteristic time patterns. [31] presented an online synchronization method that relies on video sequences with a known fundamental matrix to compute low-level temporal signals for matching. [33] proposed a rank constraint on corresponding points in two views to measure the similarity between trajectories, avoiding the noise sensitivity of the fundamental matrix. [35] proposed a RANSAC-based algorithm that computes the epipolar geometry and synchronization of a pair of cameras from the motion of silhouettes in videos. [13] synchronized two independently moving cameras via the relative motion between objects and known camera intrinsics.

The main disadvantages of these traditional camera synchronization methods are: 1) videos or image sequences are required, which might not be available in practical multi-camera systems with limited network bandwidth and storage; 2) a fixed frame rate for the multi-camera system is usually assumed, which means random frame dropping cannot be handled (except in [31]); 3) feature matching is based on hand-crafted features, which lack representation ability, or on known image correspondences, which require extra manual annotations and may not always be available. Compared with these methods, we consider a more practical and difficult setting: only single frames and no videos (no temporal information) are available, which means that these traditional video-based methods are not suitable solutions. These traditional methods perform image content matching using hand-crafted features and traditional matching algorithms, while in contrast our method uses DNN-based image matching. Because we also assume that only single frames are available, our method also requires DNN-based motion estimation to estimate a frame’s features after synchronization. Finally, our synchronization module is end-to-end trainable with existing multi-view DNNs.

II-D DNN-based image matching and flow estimation

Image matching and optical flow estimation both involve estimating image-to-image correspondences, which is related to the frame synchronization of multiple views. We mainly review DNN-based image matching [34, 30, 3] and optical flow estimation methods [16, 17, 4], which inspire our DNN-based approach to the unsynchronized multi-camera problem. DNN flow [42] proposed an image matching method based on a DNN feature pyramid in a coarse-to-fine optimization manner. FlowNet [11] predicted the optical flow from DNNs with feature concatenation and correlation. SpyNet [32] combined a classical spatial-pyramid formulation with deep learning, estimating large motions in a coarse-to-fine manner by warping one image to the other at each pyramid level using the current flow estimate and computing a flow update. [34] addressed the image correspondence problem using a convolutional neural network architecture that mimics classic image matching algorithms. PWC-Net [36] used a feature pyramid instead of an image pyramid, warping one feature map to the other at each scale, guided by the upsampled optical flow estimated at the previous scale. [26] proposed a single network to jointly learn spatiotemporal correspondence for stereo matching and flow estimation.

Our method is related to DNN-based image matching and optical flow estimation, but the differences are significant: 1) typical image/geometric matching involves either a camera view-angle transformation (e.g., camera relative pose estimation, stereo matching) or a small time change in the same view (optical flow estimation), while both factors appear in our problem, making it harder; 2) image/geometric matching is directly supervised by the correspondence between two images, whereas in our problem the multi-view fusion ground-truth in the 3D world is used as the supervisory signal; 3) the 2D-to-3D projection introduces ambiguity for multi-view feature fusion, which also causes difficulties for view synchronization.

III Single-Frame DNNs Multi-Camera Synchronization

In this section we propose our single-frame synchronization model for DNN-based multi-view models. We assume the input is one set of unsynchronized multi-view images (one image from each camera). The temporal offset between cameras can be either a constant latency for each camera (the same offset over time) or a random latency (random offsets over time). Similar to most multi-view methods [7, 44, 39, 20], we assume that the cameras are static and that their intrinsic and extrinsic parameters are known. The main idea of our method is to choose one camera view as the reference view, and then use the view synchronization model to warp the other camera views so that they are synchronized with the reference view. The synchronization model should be general enough to handle both constant and random latencies between cameras, in order to work under the various conditions causing desynchronization.

DNN models for multi-camera surveillance tasks typically consist of three stages: single-view feature extraction, multi-view fusion of projected features, and prediction. Our view synchronization model can be embedded into one of the first two stages, without the need to redesign a new architecture. Thus, we propose two variants of the synchronization model: 1) scene-level synchronization, where the projected features from different camera views are synchronized during multi-camera feature fusion; and 2) camera-level synchronization, where the camera-view features are synchronized before projection and fusion. We present the details of the two synchronization models next. Note that we first consider the case when both synchronized and unsynchronized multi-view images are available for training (but not available in the testing stage). We then extend this to the case when only unsynchronized training images are available.

III-A Scene-level synchronization

Scene-level synchronization works by synchronizing the multi-camera features after the projection stage in the multi-view pipeline. The pipeline for scene-level synchronization is shown in Fig. 2(a).

Synchronization module: Without loss of generality, we choose one view (denoted as view 1) as the reference view, and the other views are synchronized to this reference view. We assume that synchronized frame pairs are available in the training stage. The frames are $I_1^{t}$ from the reference view captured at reference time $t$, and $I_i^{t}$ and $I_i^{t_i}$ from view $i$ ($i = 2, \dots, V$, where $V$ is the number of camera views) taken at times $t$ and $t_i$. Note that frames $\{I_1^{t}, I_i^{t}\}$ are synchronized, while $\{I_1^{t}, I_i^{t_i}\}$ are not.

The synchronization module consists of the following stages. First, camera-view feature maps are extracted from both the synced and unsynced frames and projected to the 3D world space, resulting in the projected feature maps $\mathcal{G}_1^{t}$, $\mathcal{G}_i^{t}$, and $\mathcal{G}_i^{t_i}$. Second, synchronization is performed between the reference view and each other view $i$. The projected feature map $\mathcal{G}_1^{t}$ from the reference view is concatenated with the projected feature map $\mathcal{G}_i^{t_i}$ from view $i$, and then fed into a motion flow estimation network to predict the scene-level motion flow $M_i$ between view $i$ at time $t_i$ and the reference view at time $t$. $\mathcal{G}_i^{t_i}$ from view $i$ is then synchronized with the reference view (at time $t$) using a warping transformation guided by $M_i$, $\hat{\mathcal{G}}_i^{t} = \mathcal{W}(\mathcal{G}_i^{t_i}, M_i)$. Finally, the reference-view features and all other views’ warped features are fused and decoded to obtain the final scene-level prediction $\hat{D}$. In the testing stage, only unsynchronized frames are available and the forward operations related to frame $I_i^{t}$ are removed from the network.

Training loss: Two losses are used in the training stage. The first loss is a task-specific loss $\ell_{task}(\hat{D}, D)$ between the scene-level prediction $\hat{D}$ and the ground-truth $D$. For example, for multi-view crowd counting $\ell_{task}$ is the mean-square error, and $\hat{D}$ and $D$ are the predicted and ground-truth scene-level density maps. The second loss is on the multi-view feature synchronization in the multi-view fusion stage. Since the synced frame pairs are available during training, the feature warping loss encourages the warped features to be similar to the features of the original synced frame of view $i$,

$\ell_{sync}^{i} = \ell_{mse}\big(\mathcal{W}(\mathcal{G}_i^{t_i}, M_i),\, \mathcal{G}_i^{t}\big)$,   (1)

where $\ell_{mse}$ is the mean-square error loss. Finally, the training loss combines the task loss and the warping loss summed over all non-reference views,

$\ell = \ell_{task}(\hat{D}, D) + \lambda \sum_{i=2}^{V} \ell_{sync}^{i}$,   (2)

where $\lambda$ is a hyperparameter.
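
To make the scene-level synchronization step and its training loss concrete, below is a minimal PyTorch-style sketch under the notation above. The helper names (`warp`, `flow_net`), the flow channel convention (x-displacement then y-displacement), and the bilinear resampling are illustrative assumptions rather than the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    # Warp a feature map by a 2-channel motion flow using bilinear resampling
    # (a differentiable image resampler, as in a spatial transformer).
    # feat: (B, C, H, W); flow: (B, 2, H, W) in pixels, channel 0 = x, channel 1 = y.
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                             # sampling positions
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                       # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def scene_level_sync_loss(G_ref, G_unsync, G_sync, flow_net, lam=1.0):
    # G_ref:    projected reference-view features at time t
    # G_unsync: projected features of view i at time t_i (to be synchronized)
    # G_sync:   projected features of view i at time t (supervision, training only)
    M_i = flow_net(torch.cat([G_ref, G_unsync], dim=1))   # scene-level motion flow
    G_warped = warp(G_unsync, M_i)                        # synchronized features for fusion
    l_sync = F.mse_loss(G_warped, G_sync)                 # Eq. (1)
    return G_warped, lam * l_sync   # added to the task loss over all non-reference views, Eq. (2)
```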

III-B Camera view-level synchronization

Due to the ambiguity of the image pixels’ heights in 3D space under the projection operation, the projected features from different views can be difficult to synchronize. Therefore, we also consider synchronization between camera-view features before the projection. The pipeline for camera-level synchronization is presented in Fig. 2(b).

Synchronization model: The view synchronization model is applied to each view separately. The camera-view features $F_1^{t}$ and $F_i^{t_i}$ from the unsynchronized reference view and view $i$ are first passed through a matching module (see below) and then fed into the motion flow estimation network to predict the camera-view motion flow $m_i$ for view $i$. The warping transformation guided by $m_i$ then warps the camera-view features from view $i$ to be synchronized with the reference view at time $t$, via $\hat{F}_i^{t} = \mathcal{W}(F_i^{t_i}, m_i)$. Finally, the reference and warped camera-view features are projected and decoded to obtain the scene-level prediction $\hat{D}$. In the testing stage, only unsynchronized frames are available and the forward operations related to frame $I_i^{t}$ are removed from the network.

Fig. 3: Epipolar-guided weights. (a) In the synchronized setting, given a point $p$ in view 1, the matched point in view $i$ must be on the epipolar line of $p$. (b) In the unsynchronized setting, we assume a Gaussian motion model of the matched feature location from time $t$ to $t_i$. (c) An epipolar-guided weight mask is used to bias the feature matching towards high-probability regions according to the motion model.

Matching module: We propose three methods to match features to predict the view-level motion flow. The first method concatenates the features $F_1^{t}$ and $F_i^{t_i}$ and then feeds them into the motion flow estimation network. The second method builds a correlation map $C_i$ between the features at each pair of spatial locations in $F_1^{t}$ and $F_i^{t_i}$, which is then fed into the motion flow estimation network. The third method incorporates camera geometry information into the correlation map to suppress false matches. If both cameras were synchronized at time $t$, then according to multi-view geometry, each spatial location $p$ in view 1 must match a location in view $i$ on its corresponding epipolar line (Fig. 3a). Thus, in the synchronized setting, detected matches that are not on the epipolar line can be rejected as false matches. For our unsynchronized setting, the matched location in view $i$ remains on the epipolar line only when its corresponding feature/object does not move between times $t$ and $t_i$. To handle the case where the feature moves, we assume that a matched feature in view $i$ moves according to a Gaussian motion model with standard deviation $\sigma$ (Fig. 3b). With the epipolar line and the motion model, we then build a weighting mask, with high weights on locations with high probability of containing the matched feature, and vice versa. Specifically, we set the mask to 1 at locations on the epipolar line induced by $p$, and to 0 otherwise, and then convolve it with a 2D Gaussian with standard deviation $\sigma$ (Fig. 3c). We then apply the weight mask to the correlation map $C_i$, which suppresses false matches that are not consistent with the scene geometry and the motion model.
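
As an illustration of the correlation-based matching and the epipolar-guided weights, below is a hedged sketch. For simplicity, the “line mask convolved with a 2D Gaussian” described above is implemented as a Gaussian falloff of the point-to-line distance, which has the same effect up to normalization; the fundamental-matrix convention (mapping reference-view points to epipolar lines in view $i$) and the correlation normalization are assumptions.

```python
import torch

def correlation_map(F_ref, F_other):
    # Correlation between every pair of spatial locations in two feature maps.
    # F_ref, F_other: (B, C, H, W)  ->  corr: (B, H*W, H, W), indexed by reference location.
    B, C, H, W = F_ref.shape
    ref = F_ref.reshape(B, C, H * W)
    oth = F_other.reshape(B, C, H * W)
    corr = torch.einsum("bcp,bcq->bpq", ref, oth) / C      # normalized dot products
    return corr.reshape(B, H * W, H, W)

def epipolar_weight_mask(F_mat, H, W, sigma):
    # For each reference location p, a weight map over the other view that is high
    # near the epipolar line of p and decays with standard deviation `sigma` (pixels).
    # F_mat: (3, 3) fundamental matrix with l = F_mat @ p (p in the reference view).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pts = torch.stack((xs, ys, torch.ones_like(xs)), dim=-1).float().reshape(-1, 3)
    lines = pts @ F_mat.T                                   # (H*W, 3) epipolar lines (a, b, c)
    a, b, c = lines[:, 0:1], lines[:, 1:2], lines[:, 2:3]
    qx = xs.float().reshape(1, -1)                          # other-view coordinates
    qy = ys.float().reshape(1, -1)
    dist = (a * qx + b * qy + c).abs() / (a ** 2 + b ** 2).sqrt().clamp(min=1e-8)
    weights = torch.exp(-dist ** 2 / (2 * sigma ** 2))      # Gaussian falloff from the line
    return weights.reshape(H * W, H, W)

# Usage: suppress geometrically implausible matches before flow estimation.
#   corr = correlation_map(F_ref, F_other)                        # (B, H*W, H, W)
#   corr = corr * epipolar_weight_mask(F_mat, H, W, sigma)[None]
```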

Multi-scale architecture: Multi-scale feature extractors are used in multi-camera tasks like crowd counting [43]. Therefore, we next show how to incorporate multi-scale feature extractors with our camera-level synchronization model. (No extra steps are needed to incorporate multi-scale features with scene-level synchronization because the synchronization occurs after the feature projection.) Instead of performing the view synchronization at each scale separately, the motion flow estimates of neighboring scales are fused to refine the current scale’s estimate (see Fig. 4). In particular, let there be $S$ scales in the multi-scale architecture and let $s \in \{1, \dots, S\}$ denote one scale, with $s = S$ the largest scale. When $s = 1$ (the smallest scale), the correlation map $C^{s}$ of scale $s$ is fed into the motion flow estimation net to predict the motion flow $m^{s}$ for scale $s$. For scales $s > 1$, first the difference between the correlation map $C^{s}$ and the upsampled correlation map $\mathrm{up}(C^{s-1})$ of the previous scale is fed into the motion flow estimation net to predict the residual of the motion flow between the two scales, denoted as $\Delta m^{s}$. The refined motion flow of scale $s$ is then $m^{s} = \mathrm{up}(m^{s-1}) + \Delta m^{s}$.
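
A minimal sketch of this coarse-to-fine refinement (Fig. 4) follows. It assumes each scale’s correlation map has the same number of channels (e.g., a local correlation volume with a bounded displacement range); the rescaling of the upsampled flow to the finer resolution is also an implementation assumption.

```python
import torch.nn.functional as F

def multiscale_flow(corr_maps, flow_nets):
    # corr_maps: correlation maps ordered from the smallest (coarsest) scale to the
    #            largest (finest) scale, each of shape (B, C, H_s, W_s).
    # flow_nets: one motion-flow estimation net per scale, same ordering.
    # Smallest scale: predict the motion flow directly from its correlation map.
    flow = flow_nets[0](corr_maps[0])
    for s in range(1, len(corr_maps)):
        H_s, W_s = corr_maps[s].shape[-2:]
        # Difference between this scale's correlation map and the upsampled coarser one.
        corr_up = F.interpolate(corr_maps[s - 1], size=(H_s, W_s),
                                mode="bilinear", align_corners=False)
        residual = flow_nets[s](corr_maps[s] - corr_up)     # residual motion flow at this scale
        # Refine: upsample the coarser flow (rescaled to this scale's pixel units)
        # and add the residual, i.e. m^s = up(m^{s-1}) + delta m^s.
        ratio = W_s / corr_maps[s - 1].shape[-1]
        flow_up = F.interpolate(flow, size=(H_s, W_s),
                                mode="bilinear", align_corners=False) * ratio
        flow = flow_up + residual
    return flow
```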

Fig. 4: Multi-scale estimation of motion flow.

Training loss: Similar to scene-level synchronization, a combination of two losses (scene-level prediction and feature synchronization) is used in the training stage. The scene-level prediction loss is the same as before. The feature synchronization loss encourages the warped camera-view features at each scale $s$ to match the features of the original synced frame,

$\ell_{sync}^{i,s} = \ell_{mse}\big(\mathcal{W}(F_i^{t_i,s}, m_i^{s}),\, F_i^{t,s}\big)$.   (3)

Finally, the training loss is the combination of the prediction loss and the synchronization loss summed over all non-reference views and scales,

$\ell = \ell_{task}(\hat{D}, D) + \lambda \sum_{i=2}^{V} \sum_{s=1}^{S} \ell_{sync}^{i,s}$,   (4)

where $\lambda$ is a hyperparameter.

III-C Training with only unsynchronized frames

In the previous models, we assume that both synchronized and unsynchronized multi-camera frames are available during training. For more practical applications, we also consider the case when only unsynchronized multi-view frames are available for training. In this case, for scene-level synchronization, the feature warping loss is replaced with a similarity loss on the projected features, to indirectly encourage synchronization of the projected multi-view features,

$\ell_{sim}^{i} = -\,\mathrm{mean}\big(\cos(\mathcal{W}(\mathcal{G}_i^{t_i}, M_i),\, \mathcal{G}_1^{t})\big)$,   (5)

where $\cos(\cdot,\cdot)$ is the cosine similarity between feature maps (along the channel dimension), and $\mathrm{mean}(\cdot)$ is the mean over all spatial locations. Similarly, for camera-level synchronization, the feature warping loss is replaced by the similarity loss on the projected features.
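
A minimal sketch of this similarity loss, assuming the projected feature maps are stored as (batch, channel, height, width) tensors:

```python
import torch.nn.functional as F

def projected_similarity_loss(G_ref, G_warped):
    # Negative mean cosine similarity (Eq. (5)) between the reference view's projected
    # features and another view's warped projected features; minimizing it pulls the
    # two projected feature maps into alignment when no synced frames are available.
    cos = F.cosine_similarity(G_ref, G_warped, dim=1)   # (B, H, W), along channels
    return -cos.mean()                                  # mean over all spatial locations
```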

IV Experiments

We validate the effectiveness of the proposed model on two unsynchronized multi-view tasks: multi-view crowd counting and multi-view 3D human pose estimation.

IV-A Implementation details

The synchronization model consists of two parts: the motion estimation network and the feature warping layer. The input to the motion estimation network is the unsynchronized multi-view features (the concatenation of the projected features) for scene-level synchronization, or the matching result of the 2D camera-view features for camera-level synchronization; the output is a 2-channel motion flow. The layer settings of the motion estimation network are shown in Table I. The feature warping layer warps the features from the other views to align with the reference view, guided by the estimated motion flow; it is based on the image resampler from the Spatial Transformer Networks of [21].

Layer Filter
conv 1
conv 2
conv 3
conv 4
conv 5
conv 6
TABLE I: The layer settings for the motion estimation net in the view synchronization module. The filter dimensions are output channels, input channels, and filter size.
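
For illustration, a sketch of the 6-layer convolutional motion estimation net is given below. Since the filter dimensions of Table I are not reproduced above, the channel widths, kernel sizes, and activations here are assumptions; only the overall structure (six conv layers producing a 2-channel motion flow) follows the description.

```python
import torch.nn as nn

def make_motion_flow_net(in_channels, mid_channels=64):
    # Hypothetical layer widths; Table I specifies output channels, input channels,
    # and filter size for each of the six conv layers.
    layers = []
    channels = [in_channels] + [mid_channels] * 5
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(mid_channels, 2, kernel_size=3, padding=1)]  # 2-channel motion flow
    return nn.Sequential(*layers)
```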

IV-B Experiment setup

We test four versions of our synchronization model: scene-level synchronization (denoted SLS), and camera-level synchronization using concatenation, correlation, or correlation with epipolar-guided weights as the matching function (denoted CLS-cat, CLS-cor, and CLS-epi, respectively). The synchronization models are trained with the multi-view DNNs (introduced with each application later). We consider two training scenarios: 1) both synchronized and unsynchronized training data are available; 2) only unsynchronized training data is available (the more difficult setting). For the first training scenario, we compare against two baseline methods: BaseS trains the DNN only on synchronized data; BaseSU fine-tunes the BaseS model using the unsynchronized training data (using the full training set). For the second training scenario, BaseU trains the DNN directly on the unsynchronized data. Note that traditional synchronization methods [10, 28, 41, 38, 40] are based on videos (temporal information) and assume high-fps cameras with fixed frame rates, which are unavailable in our problem setting. Thus, traditional and video-based synchronization methods are not suitable for comparison.

To test the proposed method, we first create an unsynchronized multi-view dataset from existing multi-view datasets (the specific datasets are introduced with each application later). In particular, suppose the frame sequence in the reference view is captured at times $\{jT\}_{j=1}^{N}$, where $T$ is the time offset between neighboring frames and $N$ is the number of frames. For view $i$, the unsynchronized frames are captured at times $\{jT + \delta_{i,j}\}_{j=1}^{N}$, where $\delta_{i,j}$ is the desynchronization time offset between view $i$ and the reference view. We consider two settings of the desynchronization offset. The first is a constant latency for each view, $\delta_{i,j} = c_i$, for some constant value $c_i$. The second is a random latency, where the offset $\delta_{i,j}$ for each frame and view is randomly sampled from a uniform distribution. Finally, since the synchronization is with the reference view, the ground-truth labels for the multi-view task correspond to the times of the reference view, $\{jT\}_{j=1}^{N}$.
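
To illustrate the two desynchronization settings, the snippet below generates per-frame offsets $\delta_{i,j}$; the uniform range and the function’s default parameter values are placeholders rather than the dataset-specific values used in the experiments.

```python
import numpy as np

def desync_offsets(num_frames, num_views, mode="constant", c=0.2, delta_max=1.0, seed=0):
    # Offsets delta[i, j] for each non-reference view i and frame j: that view's j-th
    # frame is taken at time j*T + delta[i, j], while reference-view frames are at j*T.
    # c and delta_max are placeholder values, not the ones used in the paper.
    rng = np.random.default_rng(seed)
    if mode == "constant":
        # Constant latency: the same offset c for every frame of a view.
        return np.full((num_views - 1, num_frames), c, dtype=float)
    # Random latency: an independent offset per frame and view from a uniform distribution.
    return rng.uniform(0.0, delta_max, size=(num_views - 1, num_frames))
```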

Fig. 5: Examples of unsynchronized multi-view crowd counting on PETS2009 (constant latency and training with synchronized and unsynchronized frames) and CityStreet (random latency and training with unsynchronized frames and similarity loss).
                PETS2009                        CityStreet
                constant          random          constant          random
Model           MAE     NAE       MAE     NAE     MAE     NAE       MAE     NAE
BaseS 7.21 0.200 4.58 0.139 9.07 0.108 8.86 0.107
BaseSU 4.36 0.137 4.30 0.140 9.02 0.106 8.82 0.108
SLS 4.49 0.145 4.91 0.154 8.23 0.102 8.02 0.101
CLS-cat 4.18 0.130 4.85 0.150 8.82 0.111 8.57 0.108
CLS-cor 4.13 0.135 4.03 0.128 8.03 0.099 7.99 0.098
CLS-epi 3.95 0.130 4.09 0.129 8.05 0.100 7.93 0.096
TABLE II: Unsynchronized multi-view counting: experiment results for training with both synchronized and unsynchronized frames. Two desynchronization settings are tested: constant latency and random latency. The evaluation metrics are MAE and NAE.
                PETS2009                        CityStreet
                constant          random          constant          random
Model           MAE     NAE       MAE     NAE     MAE     NAE       MAE     NAE
BaseU 6.18 0.187 6.22 0.192 10.22 0.134 9.35 0.121
SLS 5.37 0.178 4.82 0.150 8.50 0.105 8.33 0.100
CLS-cat 6.00 0.186 6.08 0.189 8.48 0.102 9.17 0.110
CLS-cor 4.18 0.136 4.34 0.136 8.02 0.098 7.77 0.093
CLS-epi 4.25 0.135 4.77 0.144 8.04 0.095 7.70 0.094
SLS 7.13 0.226 5.30 0.162 8.77 0.107 8.45 0.107
CLS-cat 6.30 0.194 5.98 0.184 8.28 0.098 9.15 0.108
CLS-cor 4.25 0.138 4.49 0.141 8.20 0.099 8.10 0.102
CLS-epi 4.27 0.135 4.53 0.143 8.16 0.097 7.86 0.096
TABLE III: Unsynchronized multi-view counting: experiment results for training with only unsynchronized frames, under constant and random latency. The upper SLS/CLS block is trained with the similarity loss and the lower SLS/CLS block without it.

IV-C Unsynchronized multi-view counting

We apply our synchronization model to unsynchronized multi-view counting. Here we adopt the multi-view multi-scale fusion model (MVMS) from [43], which is the state-of-the-art DNN for multi-view counting.

Datasets and metric.  Two multi-view counting datasets used in [43], PETS2009 [12] and CityStreet [43], are selected and desynchronized for the experiments. PETS2009 contains 3 views (cameras 1, 2 and 3), and the first camera view is chosen as the reference view. There are 825 multi-view frames for training and 514 frames for testing. The frame rate of PETS2009 is 7 fps; under constant frame latency a fixed offset is used for cameras 2 and 3, while under random latency the offset is sampled per frame. CityStreet, proposed in [43], consists of 3 views (cameras 1, 3 and 4), and camera 1 is chosen as the reference view. There are 500 multi-view frames; the first 300 are used for training and the remaining 200 for testing. The frame rate of CityStreet is 1 fps (we obtained a higher-fps version from the dataset authors). A fixed offset is used for cameras 3 and 4 under constant latency, and a per-frame offset under random latency. Following [43], the mean absolute error (MAE) and normalized absolute error (NAE) of the predicted counts on the test set are used as the evaluation metrics.

Results for training with synchronized and unsynchronized frames.  The experiment results for training with both synchronized and unsynchronized frames are shown in Table II. The hyperparameter $\lambda$ weights the feature warping loss. On both datasets, our camera-level synchronization methods, CLS-cor and CLS-epi, perform better than the other methods, including the baselines, demonstrating the efficacy of our approach. Scene-level synchronization (SLS) performs worse than the camera-level synchronization methods (CLS), due to the ambiguity of the projected features from multiple views. Furthermore, after projection to the ground plane, the crowd movement between unsynchronized frames is less salient due to the low resolution of the ground-plane feature map. CLS-cat performs worst among the CLS methods because simple concatenation of features cannot capture the image correspondence between different views needed to estimate the motion flow. Finally, the two baselines (BaseS and BaseSU) perform badly on CityStreet because of the larger scene and the larger crowd movement between neighboring frames (due to the lower frame rate).

Results for training with only unsynchronized frames.  The experiment results for training with only unsynchronized frames (the more practical real-world case) are shown in Table III. Since synchronized frames are not available, the MVMS model weights are trained from scratch using only unsynchronized data. Our models are trained with the similarity loss $\ell_{sim}$, which encourages alignment of the projected multi-view features. Generally, without synchronized frames in the training stage, the counting error increases for each method. Nonetheless, the proposed camera-level synchronization models CLS-cor and CLS-epi perform much better than the baseline BaseU. CLS-cor and CLS-epi trained on only unsynchronized data also perform better than (on CityStreet) or on par with (on PETS2009) the baseline BaseSU, which uses both synchronized and unsynchronized training data. These two results demonstrate the efficacy of our synchronization model when only unsynchronized training data is available. Finally, the error of almost all synchronization models increases on both datasets when training without the similarity loss (the lower block in Table III). This demonstrates the effectiveness of using $\ell_{sim}$ to align the multi-view features during training. Example results are shown in Fig. 5.

Latency BaseS BaseU CLS-cor(0) CLS-cor CLS-epi
62.8/59.2 26.5/27.8 25.8/26.9 25.8/27.0 25.7/26.8
78.6/78.2 49.9/50.1 36.5/36.7 38.2/38.7 37.6/37.8
450.0/449.2 69.4/69.2 56.6/56.9 46.8/47.1 45.7/45.6
TABLE IV: Unsynchronized 3D human pose estimation: experiment results under three random-latency settings (one per row, ordered from smallest to largest). ‘CLS-cor(0)’ denotes CLS-cor trained without the consistency loss, which is used for ‘CLS-cor’ and ‘CLS-epi’. The evaluation metrics are MPJPE and absolute-position MPJPE (left/right).
Pose BaseS BaseU CLS-cor(0) CLS-cor CLS-epi
Dir. 42.8 29.3 26.1 25.8 26.1
Dis. 60.7 28.4 27.3 26.7 27.0
Eat. 60.7 26.4 23.9 24.0 23.4
Greet 63.8 19.7 25.3 24.3 25.1
Phone 52.2 25.7 24.7 24.5 24.4
Pose 49.7 22.0 24.1 24.0 24.0
Purch. 67.5 24.4 28.7 27.4 28.8
Sit 33.2 22.6 23.8 24.0 24.0
SitD. 37.4 25.7 25.9 26.8 27.2
Smoke 42.2 25.7 24.8 24.3 24.4
Photo 59.9 24.3 28.2 27.9 27.2
Wait 44.3 19.5 23.2 23.8 24.2
Walk 161.1 31.9 27.0 30.2 27.8
WalkD. 91.5 34.2 30.1 30.1 29.8
WalkT. 126.8 33.9 25.5 26.8 25.5
Average 62.8 26.5 25.8 25.8 25.7
TABLE V: Detailed performance for unsynchronized 3D human pose estimation under the first (smallest) random-latency setting of Table IV. The evaluation metric is MPJPE.
Pose BaseS BaseU CLS-cor(0) CLS-cor CLS-epi
Dir. 46.2 48.7 42.9 42.5 43.9
Dis. 75.6 53.6 38.9 41.0 41.6
Eat. 64.5 39.1 32.5 32.7 30.8
Greet 71.5 48.5 35.7 38.1 36.8
Phone 64.5 43.6 33.9 35.2 35.1
Pose 49.3 42.1 32.7 33.3 30.8
Purch. 111.5 50.9 35.9 42.4 40.4
Sit 55.2 46.0 33.6 33.8 34.7
SitD. 108.3 79.3 36.8 41.8 42.8
Smoke 54.5 44.3 35.5 35.9 35.7
Photo 87.9 57.0 39.3 43.0 41.3
Wait 64.3 45.6 35.5 33.7 35.0
Walk 150.6 47.6 34.2 37.1 34.2
WalkD. 123.1 66.2 44.5 49.2 49.1
WalkT. 125.5 50.3 36.9 38.5 34.9
Average 78.6 49.9 36.5 38.2 37.6
TABLE VI: Detailed performance for unsynchronized 3D human pose estimation under the second random-latency setting of Table IV. The evaluation metric is MPJPE.
Pose BaseS BaseU CLS-cor(0) CLS-cor CLS-epi
Dir. 466.8 83.2 70.3 64.8 66.5
Dis. 450.1 72.0 57.3 48.2 48.4
Eat. 454.1 55.3 44.7 40.4 37.9
Greet 476.8 68.0 54.8 46.9 46.3
Phone 437.0 58.8 49.2 40.7 40.5
Pose 461.7 53.5 42.1 36.6 36.3
Purch. 468.2 69.0 58.1 47.0 50.6
Sit 437.6 64.2 55.2 41.1 39.6
SitD. 427.9 112.3 89.8 54.6 50.6
Smoke 428.8 58.7 49.2 41.7 41.7
Photo 485.3 76.8 64.5 57.7 53.5
Wait 449.2 62.7 51.1 42.0 42.6
Walk 430.7 66.3 49.7 44.0 41.7
WalkD. 456.6 95.9 77.0 62.7 55.5
WalkT. 460.7 67.2 54.5 43.5 42.5
Average 450.0 69.4 56.6 46.8 45.7
TABLE VII: Detailed performance for unsynchronized 3D human pose estimation under the third (largest) random-latency setting of Table IV. The evaluation metric is MPJPE.
Latency setting          λ = 0.005   λ = 0.01   λ = 0.02
1st (Table IV, row 1)    25.6        25.7       26.0
2nd (Table IV, row 2)    38.3        37.6       37.9
3rd (Table IV, row 3)    51.7        45.7       46.8
TABLE VIII: Unsynchronized 3D human pose estimation: CLS-epi experiment results with different values of the hyperparameter λ, under the three random-latency settings of Table IV. The evaluation metric is MPJPE.

IV-D Unsynchronized 3D pose estimation

We next apply our synchronization model to the unsynchronized 3D pose estimation task. The DNN model for the 3D pose estimation task is adopted from [20], which proposed two learnable triangulation methods for multi-view 3D human pose estimation from multiple 2D views: algebraic triangulation and volumetric aggregation. Here we use volumetric aggregation (with softmax aggregation) as the multi-view fusion DNN in the experiments.

Datasets and Metrics.  We use the Human3.6M [19] dataset, which consists of 3.6 million frames from 4 synchronized 50 Hz digital cameras, along with 3D pose annotations. We follow the preprocessing steps recommended in [19] (https://github.com/anibali/h36m-fetch, accessed Oct. 10, 2019), sampling one of every 64 frames for the testing set and one of every 4 frames for the training set. The first camera view is always used as the reference view (if the first camera view is missing, the second one is used). We test desynchronization via random frame latency under three settings of increasing latency. Following [20], the Mean Per Joint Position Error (MPJPE) and absolute-position MPJPE are used as the evaluation metrics. In training, the single-view backbone uses the pretrained weights from the original 3D pose estimation model. The two baselines BaseS and BaseU are compared with our proposed camera-view synchronization models CLS-cor and CLS-epi.

Experiment results. The experiment results are presented in Table IV. The original 3D pose estimation method (BaseS and BaseU) cannot perform well under the unsynchronized test condition, especially under large latencies (e.g., 64/50 s). Our camera-view synchronization methods perform better than the baseline methods, with the performance gap increasing as the latency increases. Using the similarity loss improves the performance of our models, and adding epipolar-guided weights suppresses false matches and further reduces the error. The detailed performance for each pose type under the different frame latency settings is shown in Tables V, VI and VII. From the tables, we find that the proposed methods perform better, especially on the poses with larger movement between unsynchronized frames, e.g., Walk, WalkD, and WalkT.

Ablation study on λ for 3D pose estimation.  The ablation study on the hyperparameter λ of the CLS-epi method for 3D pose estimation is presented in Table VIII. Overall, λ = 0.01 achieves better performance than the other tested weights.

Extra visualization results.  More visualization results of unsynchronized 3D pose estimation are shown in Fig. 6 and Fig. 7. In Fig. 6, the blue lines are 2D key-joints projected from the 3D poses (ground-truth or predictions). The red boxes (around the person’s chest and neck) indicate the better performance of the proposed methods compared to the other methods (BaseS performs so poorly that its predictions’ 2D projections cannot be connected to form 2D skeletons). In Fig. 7, both BaseS and BaseU fail on the example. Note that the frames are unsynchronized, so in the ground-truth the 2D projections for the other views (except the reference view) do not match the poses in the input images.

Fig. 6: Examples of unsynchronized 3D pose estimation. Blue lines are 2D key-joints projected from 3D poses. Red boxes indicate the better performance of the proposed methods.
Fig. 7: Examples of unsynchronized 3D pose estimation. Both BaseS and BaseU fail on this example.

V Conclusion and Discussion

In this paper, we focus on the issue of unsynchronized cameras in DNN-based multi-view computer vision tasks. We propose two view synchronization models based on single frames (not videos) from each view: scene-level synchronization and camera-level synchronization. The two models are trained and evaluated under two training settings (with or without synchronized frame pairs), and a similarity loss on the projected multi-view features is proposed to boost performance when synchronized training pairs are not available. Furthermore, to show their generality to different desynchronization conditions, the proposed models are tested with desynchronization based on both constant and random latency. Finally, the proposed models are applied to unsynchronized multi-view counting and unsynchronized 3D human pose estimation, and achieve better performance compared to the baseline methods.

In our current model, image content matching is used for view synchronization, while the 2D-to-3D projection for multi-view fusion relies on known camera parameters. In future work, the 2D-to-3D projection could be replaced with image matching modules, which would allow the model to handle unsynchronized and uncalibrated multi-camera setups, expanding its generality.

VI Acknowledgements

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R] and CityU 11212518).

References

  • [1] H. Aghajan and A. Cavallaro (2009) Multi-camera networks: principles and applications. Academic press. Cited by: §I.
  • [2] C. Albl, Z. Kukelova, A. Fitzgibbon, J. Heller, M. Smid, and T. Pajdla (2017) On the two-view geometry of unsynchronized cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4847–4856. Cited by: §II-B.
  • [3] H. Altwaijry, E. Trulls, J. Hays, P. Fua, and S. Belongie (2016) Learning to match aerial images with deep attentive architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3539–3547. Cited by: §II-D.
  • [4] M. Bai, W. Luo, K. Kundu, and R. Urtasun (2016) Exploiting semantic information and deep matching for optical flow. In European Conference on Computer Vision, pp. 154–170. Cited by: §II-D.
  • [5] P. Baqué, F. Fleuret, and P. Fua (2017) Deep occlusion reasoning for multi-camera multi-target detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 271–279. Cited by: §I, §I, §II-A.
  • [6] Y. Caspi, D. Simakov, and M. Irani Feature-based sequence-to-sequence matching. International Journal of Computer Vision 68 (1), pp. 53–64. Cited by: §II-C.
  • [7] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret (2018) WILDTRACK: a multi-camera hd dataset for dense unscripted pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5030–5039. Cited by: §I, §I, §II-A, §III.
  • [8] C. Chen, A. Tyagi, A. Agrawal, D. Drover, S. Stojanov, and J. M. Rehg (2019) Unsupervised 3d pose estimation with geometric self-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5714–5724. Cited by: §II-A.
  • [9] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV, pp. 628–644. Cited by: §II-A.
  • [10] C. Dai, Y. Zheng, and L. Xin (2006) Subframe video synchronization via 3d phase correlation. In ICIP, Cited by: §II-C, §IV-B.
  • [11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §II-D.
  • [12] J. Ferryman and A. Shahrokni (2009) Pets2009: dataset and challenge. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 1–6. Cited by: §IV-C.
  • [13] T. Gaspar, P. Oliveira, and P. Favaro (2014) Synchronization of two independently moving cameras without feature correspondences. Cited by: §II-C.
  • [14] N. Hasler, B. Rosenhahn, T. Thormahlen, M. Wand, and H. P. Seidel (2009) Markerless motion capture with unsynchronized moving cameras. In Computer Vision and Pattern Recognition, Cited by: §II-C.
  • [15] P. Huang, K. Matzen, and et al. (2018) Deepmvs: learning multi-view stereopsis. In CVPR, pp. 2821–2830. Cited by: §I, §II-A.
  • [16] T. Hui, X. Tang, and C. Change Loy (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8981–8989. Cited by: §II-D.
  • [17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: §II-D.
  • [18] E. Imre and A. Hilton (2012) Through-the-lens synchronisation for heterogeneous camera networks.. In BMVC, pp. 1–11. Cited by: §II-C.
  • [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3. 6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36 (7), pp. 1325–1339. Cited by: §IV-D.
  • [20] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov (2019) Learnable triangulation of human pose. In ICCV, Cited by: §I, §I, §II-A, §III, §IV-D, §IV-D.
  • [21] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §IV-A.
  • [22] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh (2015) Panoptic studio: a massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3334–3342. Cited by: §II-A.
  • [23] A. Kar, C. Häne, and J. Malik (2017) Learning a multi-view stereo machine. In NIPS, pp. 365–376. Cited by: §I, §II-A.
  • [24] M. Kocabas, S. Karagoz, and E. Akbas (2019) Self-supervised learning of 3d human pose using multi-view geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1077–1086. Cited by: §II-A.
  • [25] T. Kuo, S. Sunderrajan, and B. Manjunath (2013) Camera alignment using trajectory intersections in unsynchronized videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1121–1128. Cited by: §II-B.
  • [26] H. Lai, Y. Tsai, and W. Chiu (2019) Bridging stereo matching and optical flow via spatiotemporal correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1890–1899. Cited by: §II-D.
  • [27] B. Meyer, T. Stich, M. A. Magnor, and M. Pollefeys (2008) Subframe temporal alignment of non-stationary cameras.. In BMVC, pp. 1–10. Cited by: §II-C.
  • [28] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos (2008) Linear sequence-to-sequence alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2), pp. 304–320. Cited by: §II-C, §IV-B.
  • [29] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Harvesting multiple views for marker-less 3d human pose annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6988–6997. Cited by: §II-A.
  • [30] S. Phillips and K. Daniilidis (2019) All graphs lead to rome: learning geometric and cycle-consistent representations with graph convolutional networks. arXiv preprint arXiv:1901.02078. Cited by: §II-D.
  • [31] D. Pundik and Y. Moses (2010) Video synchronization using temporal signals from epipolar lines. In ECCV, Cited by: §II-C, §II-C.
  • [32] A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §II-D.
  • [33] Rao, Gritai, Shah, and Syeda-Mahmood (2003) View-invariant alignment and matching of video sequences. In Proceedings Ninth IEEE International Conference on Computer Vision, pp. 939–945. Cited by: §II-C.
  • [34] I. Rocco, R. Arandjelovic, and J. Sivic (2017) Convolutional neural network architecture for geometric matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6148–6157. Cited by: §II-D.
  • [35] S. N. Sinha and M. Pollefeys (2004) Synchronization and calibration of camera networks from silhouettes. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, Cited by: §II-C.
  • [36] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §II-D.
  • [37] K. Takahashi, D. Mikami, M. Isogawa, and H. Kimata (2018) Human pose as calibration pattern; 3d human pose estimation with multiple unsynchronized and uncalibrated cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1775–1782. Cited by: §II-B.
  • [38] P. A. Tresadern and I. D. Reid Video synchronization from human motion using rank constraints. Computer Vision and Image Understanding 113 (8), pp. 891–906. Cited by: §II-C, §IV-B.
  • [39] H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang (2019-10) Pix2Vox: context-aware 3d reconstruction from single and multi-view images. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §II-A, §III.
  • [40] J. Yan and M. Pollefeys (2004) Video synchronization via space-time interest point distribution. In Advanced Concepts for Intelligent Vision Systems, Vol. 1, pp. 12–21. Cited by: §II-C, §IV-B.
  • [41] Y. Yang, Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Transactions on Image Processing 15 (9), pp. 2473–2480. Cited by: §II-C, §IV-B.
  • [42] W. Yu, K. Yang, Y. Bai, H. Yao, and Y. Rui (2014) DNN flow: dnn feature pyramid based image matching.. In BMVC, Cited by: §II-D.
  • [43] Q. Zhang and A. B. Chan (2019) Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8297–8306. Cited by: §I, §I, §II-A, §III-B, §IV-C, §IV-C.
  • [44] Q. Zhang and A. B. Chan (2020) 3D crowd counting via multi-view fusion with 3d gaussian kernels. In AAAI Conference on Artificial Intelligence. Cited by: §I, §III.
  • [45] X. Zhang, B. Ozbay, M. Sznaier, and O. Camps (2017) Dynamics enhanced multi-camera motion segmentation from unsynchronized videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4668–4676. Cited by: §I, §II-B.
  • [46] E. Zheng, D. Ji, E. Dunn, and J. Frahm (2015) Sparse dynamic 3d reconstruction from unsynchronized videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4435–4443. Cited by: §I, §II-B.