Read, Watch and Scream!
Sound Generation from Text and Video

* Work done during an internship at NAVER AI Lab. † Corresponding author.

NAVER AI Lab

An example of audio generation requiring both text and video control: the text instruction "dog growling" provides the text control. Video-to-audio (V2A) methods cannot understand the detailed semantics of the text (the dog is growling, not barking), while text-to-audio (T2A) methods cannot understand the video (the dog is biting something) or keep the sound temporally aligned with it.

Abstract

Multimodal generative models have shown impressive advances with the help of powerful diffusion models. Despite the progress, generating sound solely from text poses challenges in ensuring comprehensive scene depiction and temporal alignment, while video-to-sound generation limits the flexibility to prioritize sound synthesis for specific objects within the scene. To tackle these challenges, we propose a novel video-and-text-to-sound generation method, called **ReWaS**, where video serves as a conditional control for a text-to-audio generation model. Our method estimates the structural information of audio (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-sound model to consolidate the video control, which is much more efficient than training multimodal diffusion models from scratch on massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our method becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency.
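To make the energy control concrete, below is a minimal sketch (not the authors' code) of how a frame-level energy curve might be extracted from audio, e.g. as the training target for a video-to-energy predictor. The RMS-based definition of energy, the use of librosa, and the frame/hop sizes are our assumptions; the paper's exact energy definition may differ.

```python
# Minimal sketch, assuming RMS energy as the structural control signal.
# Not the authors' implementation; all settings here are illustrative.
import numpy as np
import librosa


def energy_curve(wav_path: str, sr: int = 16000,
                 frame_length: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Frame-level energy curve in [0, 1] for a mono audio clip."""
    y, _ = librosa.load(wav_path, sr=sr)  # mono waveform at the target rate
    # Per-frame RMS energy; librosa returns shape (1, n_frames).
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    # Min-max normalize so curves are comparable across clips.
    return (rms - rms.min()) / (rms.max() - rms.min() + 1e-8)
```

At training time, such a curve computed from ground-truth audio could supervise a lightweight video-to-energy estimator; at inference, the estimated curve conditions the frozen text-to-audio model. This is how ReWaS separates structural control (when, and how loud) from semantic control (what sound).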


Generation results

Temporal alignment

ReWaS can capture small movements, short transitions, and silent moments.

Lions growling

Skateboarding

Alarm clock ringing

Effectiveness of video condition

Multiple semantics in the video are transferred through the energy control. Even when information is missing from the text prompt, the energy control compensates for it.

Prompt: car engine

Generation: 🔊 car engine + 🔊 spray

Prompt: darts

Generation: 🔊 dart + 🔊 people talking

Comparison with baselines


Each of the nine examples below plays the ReWaS result alongside the SpecVQGAN, Diff-Foley, and Im2Wav baselines for the same video. Prompts used for ReWaS:

1. dog growling
2. people screaming
3. orchestra
4. car engine starting
5. sharpen knife
6. playing banjo
7. cat growling
8. canary calling
9. chainsawing trees

BibTeX

@article{jeong2024read,
  author  = {Jeong, Yujin and Kim, Yunji and Chun, Sanghyuk and Lee, Jiyoung},
  title   = {Read, Watch and Scream! Sound Generation from Text and Video},
  journal = {arXiv preprint arXiv:2407.05551},
  year    = {2024},
}