1 Introduction

Rapid progress in media, networking, and playback technologies has resulted in a proliferation of multimedia applications. These range from feature presentations in a cinema setting, to video-on-demand services over the Internet, to real-time applications such as live video streaming and interactive video conferencing. Good synchronization between the audio and video modalities is a significant factor in the quality of a multimedia presentation. For example, “audio and video not in sync” is a major customer issue that video-over-IP providers have to deal with, especially under patchy connection bandwidth conditions.
In a typical multimedia presentation, audio and video signals are managed by independent workflows. This is especially true for highly produced premium content such as movies, where the effort involved in going from raw audio and video (captured at the time of live-action camera shots) to the finished work is managed via separate business processes. Examples of the steps involved include video editing (deciding which portions of the raw footage make it into the finished work), digital special effects (created independently for both video and audio), and sound mastering and mixing. Furthermore, a title intended for global release often has several audio tracks corresponding to different international languages, and dubbed audio tracks are often created after the video and the primary-language audio have been finalized. All of this opens up the possibility of temporal misalignment between audio and video. The high cost of making a movie and the size of the intended audience warrant high multimedia quality, necessitating tight synchronization between the audio and video media.
As humans, we are highly capable of determining whether the audio and visual signals of a multimedia presentation are synchronized. We seamlessly realize when and where to pay attention in a video in order to judge whether there is a misalignment between audio and video. We often pay close attention to spatio-temporal details such as the lip movement of humans while watching a movie, and make our decision solely based on such cues. At the same time, we ignore large spatio-temporal portions of the video when we cannot find the source of the sound in the visual content (e.g., background music in a movie scene). Our ability to attend to the important details while ignoring the rest enables us to make a better judgment.
A movie scene in which the visual content shows a person putting a coffee cup on a table could contain many unrelated audio sources, such as background noise or music. In principle, correctly identifying the sound of the coffee mug being placed on the table, and relating it to the visual content by focusing on the spatial location of the mug while ignoring the rest of the video, could lead to a very accurate audio-visual alignment prediction model. Correctly identifying when and where a bouncing basketball hits the ground could be another such example. In both of these examples, the audio-visual phenomenon exhibits a high degree of temporal and spatial localization. In other scenarios, such as a dialog- or speech-oriented multimedia presentation, there are many informative moments that we can use to identify the synchronization between the utterance of each single word and the lip movement of the speaker. In this effort, we aim to study the possibility of learning an attention model that is able to address all of the scenarios outlined above. To do so, we use a convolutional neural network (CNN) based architecture that is capable of identifying the important portions of a multimedia presentation and using them to determine the synchronization between the audio and visual signals. We study whether introducing attention modules helps the network focus on the relevant parts of the input data in order to make a better decision.
To conduct this study, we define the problem of audio-video synchronization as a binary classification problem. Given a video, a fully convolutional network is trained to decide whether the audio and visual modalities of the video are synchronized with each other or not. To train the network for this task, we expose it to synchronized and non-synchronized audio-video streams alongside their binary labels at training time. We evaluate two different attention modules, taking into account the temporal-only and the spatial-plus-temporal dimensions, respectively. As mentioned above, multimedia data vary hugely in how discriminative they are for synchronization. While some scenes, such as ocean waves, do not exhibit much spatial or temporal localization, others, such as a bouncing ball, exhibit temporal localization, and yet others, such as speech, exhibit good spatial as well as temporal localization. We employ a soft-attention module, weighting different blocks of the video without forcing the network to make hard decisions on each individual block.
In order to take temporal attention into account, we divide each video into temporal blocks and extract joint audio-visual features from each block. We then compute a global feature, pooled across the spatial and temporal domains within the block. The features from the various blocks are passed through a temporal weighting (attention) module, which assigns a confidence score to each temporal block of the video. The confidences for the different temporal blocks are normalized using a softmax function, enforcing the notion of probability across all the weights. The final decision about the alignment of the video is then made based on the weighted mean of the features extracted from the different temporal blocks. Our experiments suggest that incorporating this attention module leads to higher classification accuracy and a faster convergence rate compared to the baseline model (without any attention module).
We also study the effect of incorporating a more general spatio-temporal attention module on classification accuracy. In this setup, the network seeks to distinguish not only between different temporal blocks of the video, but also between different spatial blocks of the visual content. To do so, similar to the first approach, we extract joint spatio-temporal features from each temporal block. Here, however, instead of performing global average pooling, the spatial features within each temporal block are fed directly into the weighting module, which calculates a confidence score for each spatial block within each temporal block. As in the previous approach, a softmax function is applied across all the spatial and temporal confidences, and the final feature is computed as the weighted mean.
In this paper we propose an attention based framework, trained in a self-supervised manner, for the audio-visual synchronization problem. The proposed attention modules learn to determine what to attend to in order to decide about the audio-visual synchrony of the video in the wild. We evaluate the performance of each of the two approaches on publicly available data in terms of classification accuracy. We observe that taking into account temporal and spatio-temporal attention leads to improvement in classification accuracy. We also evaluate the performance of the attention modules qualitatively, verifying that the attention modules are correctly selecting discriminative parts of the video.
2 Related Work
To the best of our knowledge, this is the first attempt at using attention models for the audio-visual synchronization problem. Our approach can be classified as a self-supervised learning approach utilizing attention models for audio-visual synchronization. In the following, we go over recent work in the areas of self-supervised learning, audio-visual synchronization, and attention models.
2.1 Self-supervised Learning
Data labeling and annotation in the “big data” era can be an onerous task. It is common knowledge that the performance of deep neural network based learning approaches improves drastically with task-relevant training data; however, good-quality training data is scarce, and creating it can be a very expensive and time-consuming proposition. Fortunately, for some problem domains, self-supervised learning techniques can be employed to effectively create large amounts of “labeled” training data from input data that has no annotations.
A general framework for feature representation learning is provided in . The impact of this framework has been shown on a variety of problems such as object recognition, detection, and segmentation. The main idea of  is to create a jigsaw puzzle from any given image and learn features that can solve the puzzle.
The authors of  use self-supervised learning to produce relative labels for the crowd counting problem. They crop each image and assume that the number of people in the cropped image is less than in the original image. Finally, they train their network with a triplet loss and then fine-tune it on a limited amount of supervised data.
Similarly, a ranking loss as a self-supervision cue is discussed in  to solve image quality assessment. The main idea of  is to apply noise and blurring filters to natural images and solve a ranking problem between the original images and the degraded ones.
In this paper, we use self-supervised learning as a tool to train our deep neural network. Our work is most similar to the idea of . We randomly shift the audio signal corresponding to the video frames to create training samples. Assuming the audio and visual content are in sync in the original video (positive samples), we generate negative samples by randomly shifting the audio signal of each positive sample.
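The labeling scheme above can be sketched as follows. This is a minimal illustration, assuming raw audio samples stored in a list; `make_training_pair` is a hypothetical helper, not part of the paper's code, and the shift bounds follow the setup described later (2 to 5.8 seconds).

```python
import random

def make_training_pair(frames, audio, sample_rate, min_shift=2.0, max_shift=5.8):
    """Return a (positive, negative) pair from one clip assumed to be in sync.

    The positive example is the clip as-is (label 1); the negative example
    (label 0) circularly shifts the audio relative to the frames by a random
    offset, so the content is unchanged but the alignment is broken.
    """
    shift_s = random.uniform(min_shift, max_shift)
    shift = int(shift_s * sample_rate)
    # Circular shift keeps the audio length unchanged.
    shifted_audio = audio[shift:] + audio[:shift]
    positive = (frames, audio, 1)          # in sync
    negative = (frames, shifted_audio, 0)  # out of sync
    return positive, negative
```

Because labels come for free from the assumed-in-sync originals, arbitrarily many training pairs can be generated without human annotation.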
2.2 Audio-visual Synchronization
As discussed earlier, audio-visual synchronization is a key factor in determining the quality of a multimedia presentation. It also has many real-world applications, such as multimedia over network communications [4, 5], lip reading , and so on.
A similar work on audio-visual synchronization is presented in , in which a simple deep neural network is used to detect audio and video delays. A simple concatenation of audio and visual features is performed, and the videos are recorded in a studio environment. Another work related to ours is that of lip synchronization . As opposed to our method, the authors of  used a two-stream network with a contrastive loss. Lip synchronization is an audio-video synchronization problem limited to the domain of monologue face scenes: faces are tracked, and the mouths are cropped and then fed to the network. In this paper, we use more realistic data.
Synchronization is also an important first step for many other applications, such as lip reading . In this case, given an audio track and a close-up video shot, the task is to predict the words of the speech happening in the video.
We argue that in this paper we solve a more general problem, dealing with unconditioned input data from real-world examples uploaded to YouTube. A video can contain many different types of scenes, abrupt view changes, different types of audio noise, etc. Our proposed attention modules decide where to attend in the video, eliminating the need for further restrictions on the input.
The work most similar to this paper is , where a self-supervised method is used to train a network to classify in-sync and out-of-sync samples. A 3D convolutional network with early fusion of the two modalities is used. This work, however, does not have any notion of attention. Our work, on the other hand, is focused on studying the contribution of temporal and spatio-temporal attention in a similar problem setup. To the best of our knowledge, ours is the first work to use attention mechanisms for the audio-visual synchronization problem.
2.3 Attention Models
Attention mechanisms are a popular set of techniques applied to various AI applications, namely Question Answering (QA) , Visual Question Answering (VQA) [11, 12], Visual Captioning , Machine Translation (MT) [14, 15, 16], Action Recognition [17, 18], and Robotics .
The authors of  show that, to generate a description sentence for a given image, each word relates to a spatial region of the image; several attention maps are generated as the sentence is built. In machine translation , attention helps the translator network align phrases or words of the source sentence to a word in the target-language sentence. Our task differs from QA/VQA problems [11, 12, 10], in which, given a question in the form of a sentence, correlations between visual/textual features from different pieces of a video/image/text are computed as attention scores.
There are two main types of attention-based approaches: soft attention and hard attention. Hard attention methods [20, 13] make a binary sampling decision for each point, attended or not, and mostly need ground truth for the attended points, annotated by humans. Soft attention methods [16, 11, 10], on the other hand, capture the importance of the points in the given data as probability scores toward a given objective. Data points can be frames or shots in a video, audio segments, words in a sentence, or spatial regions in an image. Moreover, soft attention methods use differentiable loss functions and mathematical operations, while hard attention approaches may not have continuous gradients.
The attention mechanism used in this paper falls into the soft attention category, since we use a differentiable loss function and our attention network produces a probability map over spatio-temporal or temporal video segments. The work most similar to our framework is , in which the authors use an attention network to select the best shots of a video for action recognition based on separate streams of information. Similarly, in this paper, we apply attention modules to joint representations of audio-visual data in different blocks of a video. In contrast to , in which all the streams of data represent visual information, in this work we deal with two different data modalities.
We argue that only certain portions of the input data (temporal or spatio-temporal segments of a video) are useful for deciding about synchronization, in that we can measure the alignment of audio and video solely based on them. In contrast to [11, 12, 10], a strong or weak correlation does not imply strong attention in our work. For example, a view of the ocean with the background sound of waves may always have a high correlation in feature space, but that does not mean it is a good shot for deciding whether audio and video are in sync. However, a close shot of a monologue speaker with a lot of background noise can be a very good shot for detecting synchronization.
3 Approach

Our proposed neural network architecture involves three main steps. The first is a feature extraction step: we split the input video into several blocks and extract joint audio-visual features from each block. In the second step, we calculate temporal or spatio-temporal attention, evaluating the importance of the different (temporal or spatio-temporal) parts of the video. Finally, we combine the features extracted from the different parts of the video into a per-video global feature based on a weighted average of all the features. In the following, we provide details on the data representation, the two architectures used for temporal and spatio-temporal attention, and the training and testing procedures.
3.1 Joint Representation
The backbone of our architecture is that of . As shown in Figure 2, we divide the input video into non-overlapping temporal blocks of a fixed number of frames. We extract a joint audio-visual feature tensor from each temporal block by applying the convolutional network introduced in , where visual and audio features are extracted separately in the initial layers of the network and later concatenated across channels. The audio feature is replicated to match the spatial dimensions of the visual feature and concatenated with it across channels. Further convolution layers are applied to the concatenated features, combining the two modalities into a joint representation. The joint representation is the input to the attention modules. We describe the details of applying the temporal and spatio-temporal attention modules in the following sections.
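The replicate-and-concatenate fusion step can be illustrated with a small sketch. The dimensions here are hypothetical toy values (the actual tensor sizes come from the backbone and are not reproduced here); the point is only that the audio vector is copied to every spatial location of the visual feature map before the channel-wise concatenation.

```python
def fuse(visual, audio):
    """Concatenate an audio feature vector onto every spatial cell of a
    visual feature map.

    visual: [H][W][Cv] nested lists; audio: [Ca] list.
    Returns an [H][W][Cv + Ca] nested-list tensor: at each spatial cell,
    the visual channels are followed by the replicated audio channels.
    """
    return [[cell + audio for cell in row] for row in visual]

# Toy example: a 2x3 visual map with 2 channels, a 3-channel audio feature.
visual = [[[1.0, 2.0] for _ in range(3)] for _ in range(2)]
audio = [0.5, 0.6, 0.7]
joint = fuse(visual, audio)  # 2x3 map with 5 channels per cell
```

In the actual network, subsequent convolutions over this concatenated tensor are what mix the two modalities into a joint representation.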
3.2 Attention Modules
Our attention modules consist of two layers of convolutions applied to the joint audio-visual features, resulting in a scalar confidence value per block (temporal or spatio-temporal). The confidences are then passed through a softmax function to obtain a weight for each of these blocks. The weights are used to obtain a weighted mean of all the features of the video. The weighted mean is passed to the decision layer (as depicted in Figure 2). In other words, the attention modules evaluate each portion of the video (a temporal or spatio-temporal block) in terms of its importance and therefore, its contribution to the final decision. In the following, we will go over a more detailed description of the two attention modules studied in this work.
3.2.1 Temporal Attention
As explained in Section 3.1, a video results in a set of per-block joint features. For the temporal attention module, we apply global average pooling to each feature across the spatial and temporal dimensions, representing each block of the video with a single global feature vector. We apply convolution layers to the globally average-pooled features, resulting in a single scalar confidence value for each temporal block. The confidence value intuitively captures the absolute importance of that specific temporal block. Applying a softmax normalization over all the confidence values of the different time blocks of the same video, we obtain a weight for each feature. The normalization is performed to enforce the notion of probability and keep the norm of the output features in the same range as each individual global feature. The weighted mean of the features is passed to the decision layer (see Figure 3).
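Numerically, the temporal attention step amounts to scoring each pooled block feature, softmax-normalizing the scores, and averaging. The sketch below is an illustration only: the real module uses learned convolutions, while here a hypothetical linear scorer `score_weights` stands in for them.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(block_features, score_weights):
    """block_features: per-block global feature vectors (after global
    average pooling); score_weights: a stand-in linear map from a feature
    vector to a scalar confidence."""
    confidences = [sum(w * f for w, f in zip(score_weights, feat))
                   for feat in block_features]
    weights = softmax(confidences)  # one probability per temporal block
    dim = len(block_features[0])
    # Weighted mean of the block features, fed to the decision layer.
    return [sum(w * feat[i] for w, feat in zip(weights, block_features))
            for i in range(dim)]
```

With a zero scorer the weights are uniform and the module reduces to the plain-average baseline, which is exactly the comparison made in the experiments.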
3.2.2 Spatio-temporal Attention
For the spatio-temporal attention module, we apply the convolution layers directly to the joint features, resulting in a set of confidence values, one per spatial block, for each temporal block. We then enforce the notion of probability across all the confidence values of all the blocks. The decision is made based on the weighted average of the spatio-temporal features, where each feature vector corresponds to a single spatial block within a temporal block (see Figure 4).
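The key difference from the temporal variant is that the softmax runs over all spatial cells of all temporal blocks at once, so spatial regions compete across the whole video. A minimal sketch, with a hypothetical `score` function standing in for the learned convolutions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def spatiotemporal_attention(block_maps, score):
    """block_maps: list (over temporal blocks) of lists (over spatial
    cells) of feature vectors; score: feature vector -> scalar confidence,
    a stand-in for the learned scoring convolutions."""
    # Flatten all spatial cells of all temporal blocks into one pool.
    cells = [feat for block in block_maps for feat in block]
    # One softmax over every spatio-temporal cell in the video.
    weights = softmax([score(f) for f in cells])
    dim = len(cells[0])
    return [sum(w * f[i] for w, f in zip(weights, cells)) for i in range(dim)]
```

Note that no global average pooling precedes the scoring here; each spatial cell keeps its own feature vector and receives its own weight.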
3.3 Baseline

In order to evaluate the effect of our temporal and spatio-temporal attention modules, we compare their performance with that of a uniform weighting baseline. Since the attention modules simply calculate weights for the features, and the decision is made based on the weighted average of those features, as a baseline we feed the plain average of the input features directly into the decision layer. In other words, we evaluate the effect of bypassing the weighting step.
3.4 Implementation Details
Here we explain the implementation details of the feature extraction step, the two attention modules (temporal and spatio-temporal), and the decision layer used in our work.
3.4.1 Input and Feature Extraction Step
The feature extraction network consists of a 3D convolutional neural network with an early-fused design. In order to have a fair comparison with the baseline network, we use a setup similar to , meaning that the input video is resized and center cropped. Videos are 4.2 seconds long, and the length of each temporal block is 25 frames. All the videos have a frame rate of 29.97 Hz. The amount of shift applied to the audio signal for the negative examples is a randomly generated value between 2 and 5.8 seconds.
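Given these numbers, blocking a clip is straightforward: 4.2 s at 29.97 fps is roughly 126 frames, which yields five full 25-frame blocks. A small sketch (the helper name is ours, not the paper's):

```python
def temporal_blocks(frames, block_len=25):
    """Split a frame sequence into non-overlapping blocks of block_len
    frames, discarding any trailing partial block."""
    n = len(frames) // block_len
    return [frames[i * block_len:(i + 1) * block_len] for i in range(n)]

# A 4.2 s clip at 29.97 fps has ~126 frames -> five 25-frame blocks.
blocks = temporal_blocks(list(range(126)))
```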
3.4.2 Temporal Attention Module
The temporal attention module consists of two convolution layers with ReLU activation and dropout. The convolutions are applied to the globally average-pooled features extracted from the input video.
3.4.3 Spatio-temporal Attention Module
Similar to the temporal attention module, the spatio-temporal attention module consists of two convolution layers with dropout. Unlike the temporal attention module, however, it is applied to the spatial features directly (without global average pooling).
3.4.4 Decision Layer
For the sake of a fair comparison, we use the same architecture for the decision layers of all three networks (the base network with uniform weighting, and the networks with temporal and spatio-temporal attention modules). It consists of fully connected layers (or, equivalently, convolutions) for the binary classification. The outputs of the attention modules are passed through a softmax function and used to obtain the weighted mean of the features, which is the input to the decision layer.
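The shared decision path, from attention scores to a sync probability, can be sketched end to end. This is an illustrative stand-in, not the trained network: the two-layer classifier weights `w1`, `w2` are hypothetical toy parameters, and a sigmoid output replaces the actual binary classification head.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decide(features, confidences, w1, w2):
    """features: per-block feature vectors; confidences: one attention
    score per block; w1: hidden-layer weight rows (ReLU); w2: output
    weights. Returns a stand-in probability that audio and video are in sync."""
    # Softmax over the attention scores.
    m = max(confidences)
    exps = [math.exp(c - m) for c in confidences]
    s = sum(exps)
    weights = [e / s for e in exps]
    # Weighted mean feature over the blocks.
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    # Two-layer classifier: ReLU hidden layer, then a scalar output.
    hidden = [max(0.0, sum(wi * p for wi, p in zip(row, pooled))) for row in w1]
    return sigmoid(sum(wo * h for wo, h in zip(w2, hidden)))
```

Because only the `confidences` differ between the three networks (uniform, temporal, spatio-temporal), this same decision path makes the comparison between them fair.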
For all three networks (i.e., the base network with uniform weighting, temporal weighting, and spatio-temporal weighting), we used the pre-trained weights of  for our feature extraction step, freezing the early layers' weights. We trained the decision layer and attention modules from scratch. Binary cross-entropy loss and the Adam optimizer were used for all three networks.
4 Experiments

In this section, we go over the dataset used for training and evaluating the performance of the proposed approach in Section 4.1. We report the quantitative results in Section 4.2 and go over some qualitative examples in Section 4.3.
4.1 Dataset

We evaluate the proposed approach on the publicly available AudioSet  dataset, which contains an ontology of 632 audio event categories. We train the temporal and spatio-temporal modules on 3000 examples from the speech subset of the dataset, and test the proposed approach on 7000 examples from the speech subset and 800 examples from the generic impact sound categories, to further show the robustness of our method on sound classes such as breaking, hitting, and bouncing, in which attention plays an important role. We use each video as a positive example, and a misaligned version of the video as a negative example.
4.2 Quantitative Evaluation
We evaluate the performance of the proposed approaches in terms of binary classification accuracy. The classification accuracies are reported in Table 1. The first row shows the performance of the baseline method, where no attention module is used. Comparing the first two rows of the table, we can observe the effect of temporal attention on classification accuracy: in the speech category, using temporal attention leads to an improvement in classification accuracy, and in the generic sound class it yields an even higher accuracy boost. We attribute the lower margin in the speech class to the fact that in speech videos most of the temporal blocks do contain discriminative features (lip movement), and therefore the weights are generally more uniform (see Figure 6). The last row shows the performance of our network with the spatio-temporal attention module, which improves over both temporal attention and no attention at all in the speech class, and also improves on the generic sound class. Spatio-temporal attention has a lower margin of improvement over temporal attention in the generic sound class than in speech; this could be because speech videos tend to be more spatially localized (towards the face of the speaker).
|Method||Speech||Generic sounds|
|Baseline network ||0.716||0.658|
To further illustrate the effect of the attention modules, we plot and compare the distributions of output scores from our classification network in Figure 5. As can be seen, the attention modules help achieve a better separation between the score distributions of the in-sync and out-of-sync classes.
4.3 Qualitative Evaluation
Here we visualize some examples of the weights estimated by the network. We expect the informative parts of the video to receive higher values. Two examples of the temporal attention weights are shown in Figure 6 (one from each class of the dataset). In each example, each row contains one of the temporal blocks, and we show the score for each temporal block. As can be observed, in the example on the left, a high weight has been assigned to the informative moment of the shoe tapping the ground. In the example on the right, the moments when the words are uttered by the actor are selected as the most informative parts.
In Figure 7, we show the weights obtained from the spatio-temporal module on the same examples. It can be observed that the network correctly assigns higher values to the more discriminative regions of the video (e.g., the shoe tapping the floor, and the speaker's face).
5 Conclusion

In this work, we studied the effect of incorporating temporal and spatio-temporal attention modules in the problem of audio-visual synchronization. Our experiments suggest that a simple temporal attention module can lead to substantial gains in classifying synchronized vs. non-synchronized audio and visual content in a video. Moreover, a more general spatio-temporal attention module can achieve even higher accuracy, as it is additionally capable of focusing on the more discriminative spatial blocks of the video. Visualizing the weights generated by the temporal and spatio-temporal attention modules, we observe that the discriminative parts of the video are correctly given higher weights.
To conclude, our experiments suggest that incorporating attention models in the audio-visual synchronization problem can lead to higher accuracy. Other variations of this approach, such as using different backbones for feature extraction or adopting different architectures such as recurrent models, could be explored in the future. Furthermore, estimating the amount of misalignment between the two modalities could be explored through a modification of the architecture. We believe that this work could be a first step in these directions.
References

[1] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision, pp. 69–84, Springer, 2016.
[2] X. Liu, J. van de Weijer, and A. D. Bagdanov, “Leveraging unlabeled data for crowd counting by learning to rank,” arXiv preprint arXiv:1803.03095, 2018.
[3] X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” arXiv preprint arXiv:1707.08347, 2017.
[4] P. Y. Teng, B. A. Thompson, and F. A. Tobagi, “Synchronization of audio and video signals in a live multicast in a LAN,” Sept. 19, 2000. US Patent 6,122,668.
[5] D. E. Lankford and M. S. Deiss, “Audio/video synchronization in a digital transmission system,” July 4, 1995. US Patent 5,430,485.
[6] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in CVPR, pp. 3444–3453, 2017.
[7] E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel, “Detecting audio-visual synchrony using deep neural networks,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[8] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Asian Conference on Computer Vision, pp. 251–263, Springer, 2016.
[9] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” arXiv preprint arXiv:1804.03641, 2018.
[10] S. Sukhbaatar, J. Weston, R. Fergus, et al., “End-to-end memory networks,” in Advances in Neural Information Processing Systems, pp. 2440–2448, 2015.
[11] A. Mazaheri, D. Zhang, and M. Shah, “Video fill in the blank using LR/RL LSTMs with spatial-temporal attentions.”
[12] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29, 2016.
[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, pp. 2048–2057, 2015.
[14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations, 2015.
[15] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[17] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
[18] J. Zang, L. Wang, Z. Liu, Q. Zhang, G. Hua, and N. Zheng, “Attention-based temporal weighted convolutional neural network for action recognition,” in IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 97–108, Springer, 2018.
[19] P. Abolghasemi, A. Mazaheri, M. Shah, and L. Bölöni, “Pay attention! Robustifying a deep visuomotor policy through task-focused attention,” arXiv preprint arXiv:1809.10093, 2018.
[20] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” 2014.
[21] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 776–780, IEEE, 2017.