Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, supporting applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherently audio-visual nature of film, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of paired cinematic datasets with isolated sound sources, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and develop a dedicated visual encoder for this dual-stream setup. Trained on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks.
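To make the conditional flow matching (CFM) formulation above concrete, the sketch below shows a minimal CFM training step in PyTorch: a velocity network, conditioned on the mixture and on visual features, regresses the straight-line velocity between a noise sample and the target source. All names (VelocityNet, cfm_loss), the network shape, and the concatenation-based conditioning are illustrative assumptions, not the paper's implementation.

# Minimal conditional flow matching (CFM) training step, in PyTorch.
# Names and the concatenation-based conditioning are illustrative
# assumptions, not the AV-CASS implementation.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Hypothetical velocity field v_theta(x_t, t | mixture, visual)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, dim),
        )

    def forward(self, x_t, t, mixture, visual):
        # Condition on the mixture and the visual embedding by concatenation.
        return self.net(torch.cat([x_t, mixture, visual, t], dim=-1))

def cfm_loss(model, x1, mixture, visual):
    """Regress the constant velocity (x1 - x0) along a straight path."""
    x0 = torch.randn_like(x1)            # Gaussian prior sample
    t = torch.rand(x1.size(0), 1)        # time drawn uniformly from [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation between the two
    return ((model(x_t, t, mixture, visual) - (x1 - x0)) ** 2).mean()

# Example: recover 128-dim target-source features from mixture features,
# conditioned on 64-dim visual features (e.g., from a face-video encoder).
model = VelocityNet(dim=128, cond_dim=64)
loss = cfm_loss(model, torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 64))
loss.backward()

At inference, the learned velocity field is integrated from noise to a source estimate with an ODE solver, with each target source conditioned on its corresponding visual stream.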
Use the toggle below to switch views: "Comparison with CASS models" evaluates AV-CASS against the audio-only CASS models BandIt [1] and MRX [2], while "Comparison with DAVIS-Flow" evaluates it against the audio-visual source separation model DAVIS-Flow [3].
@inproceedings{zhang2026cinematic,
  title={Cinematic Audio Source Separation Using Visual Cues},
  author={Zhang, Kang and Lee, Suyeon and Senocak, Arda and Chung, Joon Son},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}