Soundify: Matching Sound Effects to Video

12/17/2021
by David Chuan-En Lin, et al.
Carnegie Mellon University

In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space. However, through formative interviews with professional video editors, we found that this process can be extremely tedious and time-consuming. We introduce Soundify, a system that matches sound effects to video. By leveraging labeled, studio-quality sound effects libraries and extending CLIP, a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to the results at https://chuanenlin.com/soundify.


1 Introduction

In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or immerse the viewer within a space [3]. However, through formative interviews with 10 professional video editors, we found that this process can be extremely tedious and time-consuming. More specifically, video editors identified three key bottlenecks: (1) finding suitable sounds, (2) precisely aligning sounds to video, and (3) tuning parameters such as pan and gain frame-by-frame. To address these challenges, we introduce Soundify, a system that matches sound effects to video. Prior works have largely explored either learning audio-visual correspondence from large-scale data (Arandjelovic and Zisserman, 2017; Zhao et al., 2018; Tian et al., 2018) or performing audio synthesis from scratch (Oord et al., 2016; Zhou et al., 2018; Ghose and Prevost, 2020). In this work, we take a different approach. By leveraging labeled, studio-quality sound effects libraries [2] and extending CLIP (Radford et al., 2021), a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we are able to produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to the results at https://chuanenlin.com/soundify.

2 Method

The following outlines our method (Figure 3). We implemented Soundify in PyTorch and used Decord, OpenCV, NumPy, and SciPy for image processing and Pydub for audio processing.
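To make the steps below concrete, each is accompanied by a minimal Python sketch; these are illustrations under stated assumptions, not our exact implementation. The first sketch decodes sampled frames with Decord (the stride of every 10th frame is an illustrative choice, not a setting from the paper):

from decord import VideoReader, cpu

def load_frames(video_path, every_n=10):
    """Decode every n-th frame of a video as an RGB array of shape (H, W, 3)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    return [vr[i].asnumpy() for i in range(0, len(vr), every_n)]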

Classify. We match sound effects to a video by classifying "sound emitters" within it (Figure 4). A sound emitter is simply an object or environment that produces sound and is defined based on Epidemic Sound [2], a database of over 90,000 high-quality sound effects. To reduce the number of distinct sound emitters to classify simultaneously, we first split the video into scenes using a boundary detection algorithm based on absolute color histogram distances between neighboring frames. To construct a realistic soundscape, we then classify each scene for two types of sounds: effects (e.g. bicycle, camera, keyboard) and ambients (e.g. street, room, cafe).
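A minimal sketch of the scene-splitting step using OpenCV color histograms; the bin count and boundary threshold are illustrative values rather than tuned parameters from our system:

import cv2
import numpy as np

def color_histogram(frame, bins=32):
    """Concatenated per-channel color histogram, normalized to sum to 1."""
    hists = [cv2.calcHist([frame], [c], None, [bins], [0, 256]) for c in range(3)]
    hist = np.concatenate(hists).flatten()
    return hist / hist.sum()

def split_into_scenes(frames, threshold=0.4):
    """Return lists of frame indices, one list per detected scene."""
    scenes, current = [], [0]
    prev_hist = color_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = color_histogram(frames[i])
        if np.abs(hist - prev_hist).sum() > threshold:  # large jump -> scene boundary
            scenes.append(current)
            current = []
        current.append(i)
        prev_hist = hist
    scenes.append(current)
    return scenes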

For a given scene, we run each frame through the CLIP image encoder and concatenate the frame encodings into a representation for the entire scene. We run each effects-type label in the sound database through the CLIP text encoder. We then perform pairwise comparisons between the encoded scene and each encoded effects label using cosine similarity and obtain the top-5 matching effects labels for the scene. The user may select one or more recommended effects or, by default, the top-matching effect is assigned. For ambients-type labels, we perform the same encoding and pairwise comparison steps. However, ambients classification can be more error-prone because the background is often visually out of focus or occluded. Thus, we additionally run both the predicted ambients and the previously user-selected effect(s) through the CLIP text encoder and rerank the predicted ambients by their cosine similarities to the selected effect(s) (Figure 5). For example, forest may be ranked higher than cafe if the user had previously selected waterfall. The user may select one recommended ambient or, by default, the top-matching ambient is assigned.
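A minimal sketch of the zero-shot classification step using the openai/clip package. The prompt template and mean-pooled frame aggregation are simplifying assumptions (as described above, our system concatenates the frame encodings); the RN50 backbone matches the ResNet-50 used for Grad-CAM in the Mix step.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def encode_scene(frames):
    """Encode a list of RGB frame arrays (H, W, 3) into one scene embedding."""
    images = torch.stack([preprocess(Image.fromarray(f)) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize per frame
    scene = feats.mean(dim=0)                         # mean-pool frames (simplification)
    return scene / scene.norm()

def rank_labels(scene_embedding, labels, top_k=5):
    """Rank candidate sound labels by cosine similarity to the scene embedding."""
    prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = text_feats @ scene_embedding               # cosine similarities
    best = sims.topk(min(top_k, len(labels))).indices.tolist()
    return [labels[i] for i in best]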

Sync. A sound emitter may appear on screen for only a subset of the scene. Therefore, we want to synchronize effects to when their sound emitter appears (Figure 6). We pinpoint such intervals by comparing the effects label with each frame of the scene and identifying consecutive matches above a threshold. There may be multiple intervals, such as when a sound emitter disappears then reappears.
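A minimal sketch of this interval detection, assuming per-frame similarities between the scene's frames and the effects label have already been computed (e.g. with the CLIP helpers above); the threshold is an illustrative value:

def find_intervals(frame_similarities, threshold=0.25):
    """Group consecutive above-threshold frame indices into (start, end) intervals."""
    intervals, start = [], None
    for i, sim in enumerate(frame_similarities):
        if sim >= threshold and start is None:
            start = i                      # sound emitter appears
        elif sim < threshold and start is not None:
            intervals.append((start, i))   # sound emitter disappears
            start = None
    if start is not None:
        intervals.append((start, len(frame_similarities)))
    return intervals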

Mix. Video editors adjust sound according to the state of the scene. For instance, as a bicycle pedals from one side to another, we hear a shift in stereo panning. As an airplane glides up close, we experience a gain in sound intensity. Similarly, we mix an effect’s pan and gain parameters over time (Figure 2). To achieve this, we split an effects interval into around one-second chunks (Figure 6), mix "spatially-aware sound bits" for each chunk (Figure 7), and stitch the chunks smoothly with crossfades. A spatially-aware sound bit uses the first image frame of the chunk as the reference image. We run the reference image through Grad-CAM (Selvaraju et al., 2017) on the ReLU activation of the last visual layer (ResNet-50) to generate an activation map. This localizes the sound emitter, functioning much like a coarse object detector. We then compute the pan parameter from the x-axis position of its center of mass and the gain parameter from its normalized area. Next, we retrieve the effect’s corresponding .wav audio file and remix its pan and gain. For ambients, we assume a constant environment for each scene. Thus, we retrieve the corresponding .wav audio file and simply use it across the entire scene. Finally, we merge all audio tracks of effects and ambients for all scenes into one final audio track for the video.
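A minimal sketch of the pan/gain computation and chunk stitching, assuming an activation map produced by an off-the-shelf Grad-CAM implementation (not shown); the binarization threshold, gain range, and crossfade length are illustrative values:

import numpy as np
from pydub import AudioSegment

def pan_and_gain(activation_map, max_attenuation_db=18.0):
    """Map a (H, W) activation map to a pan in [-1, 1] and a gain offset in dB."""
    mask = activation_map > 0.5 * activation_map.max()   # coarse sound-emitter region
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return 0.0, -max_attenuation_db                   # nothing localized: center, quiet
    pan = 2.0 * xs.mean() / activation_map.shape[1] - 1.0 # x-axis center of mass
    area = mask.sum() / mask.size                         # normalized area
    gain_db = -max_attenuation_db * (1.0 - area)          # smaller emitter -> quieter
    return float(pan), float(gain_db)

def mix_chunks(effect_wav_path, chunk_params, chunk_ms=1000, crossfade_ms=100):
    """Stitch one-second spatially-aware sound bits together with crossfades."""
    effect = AudioSegment.from_wav(effect_wav_path)
    track = None
    for i, (pan, gain_db) in enumerate(chunk_params):
        bit = effect[i * chunk_ms:(i + 1) * chunk_ms].pan(pan).apply_gain(gain_db)
        track = bit if track is None else track.append(bit, crossfade=crossfade_ms)
    return track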

Figure 2: Soundify adapts pan (top row) and gain (bottom row) parameters over time.

3 Conclusion and Future Work

In this paper, we introduce Soundify, a system that automatically matches sound effects to video. Our next step is to evaluate our system through a within-subjects user study with professionals and novices to measure output quality, usability, workload, and satisfaction (Figures 8 and 9 show our user interface). For future work, it may be interesting to also explore ultra-fine synchronizations for certain sounds, such as individual footsteps, to make the matches even more seamless.

Ethical Implications

The introduction of Soundify into the video editing process may also come with potential ethical implications. One example is bias. Several years ago, Google Photos came under criticism for mislabeling Black people as gorillas. Similarly, Soundify may exhibit biases in the sound domain that need to be carefully monitored and addressed over time.

References

  • R. Arandjelovic and A. Zisserman (2017) Look, Listen and Learn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 609–617. Cited by: §1.
  • [2] Epidemic Sound (2021) Website. Cited by: §1, §2.
  • [3] (2020) Website. Cited by: §1.
  • S. Ghose and J. J. Prevost (2020) AutoFoley: Artificial Synthesis of Synchronized Sound Tracks for Silent Videos with Deep Learning. IEEE Transactions on Multimedia. Cited by: §1.
  • A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499. Cited by: §1.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020. Cited by: §1.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §2.
  • Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu (2018) Audio-Visual Event Localization in Unconstrained Videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 247–263. Cited by: §1.
  • H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba (2018) The Sound of Pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586. Cited by: §1.
  • Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg (2018) Visual to Sound: Generating Natural Sound for Videos in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3550–3558. Cited by: §1.

Appendix A System Diagrams

Figure 3: Overview of Soundify. Soundify first splits a video into scenes. For each scene, Soundify classifies for effects and ambients. The matched ambient is used for the entire scene. For each matched effect, Soundify performs more fine-grained synchronization by identifying the intervals in which it appears. For each interval, Soundify mixes spatially-aware sound bits with computed pan and gain parameters. The final result consists of one or more effects tracks and an ambients track.
Figure 4: The Classify step of Soundify. Given the frames of a scene and a database of sound labels, Soundify performs pairwise comparisons to predict the top-5 matching sounds.
Figure 5: Since ambients classification can be more error-prone, given the user-selected effects label and the predicted ambients labels, Soundify performs pairwise comparisons to rerank the ambients.
Figure 6: The Sync step of Soundify. Given the frames of a scene and a sound label, Soundify identifies appearing intervals. An interval is split into chunks. Each chunk takes the first frame as its reference frame.
Figure 7: The Mix step of Soundify. Given a reference frame and a sound label, Soundify retrieves the relevant audio file and mixes its pan and gain parameters, by referencing the activation map, to generate a spatially-aware sound bit.

Appendix B User Interface

Figure 8: The Soundify interface in Sequel, an ML-powered, web-based video editor developed by Runway. For each scene, Soundify recommends matching effects and ambients. The user may then select one or more effects and one ambient. By default, the top-matching effect and the top-matching ambient are selected.
Figure 9: The main interface of Sequel, showing results generated with Soundify. From the bottom timeline, we see that the original video is split into scenes and populated with audio tracks matched with Soundify.