Learning Source Disentanglement in Neural Audio Codec


1LTCI, Télécom Paris, Institut polytechnique de Paris, France

2CVSSP, University of Surrey, UK

ICASSP, 2025



Abstract

Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks (sets of discrete representations). Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing the interpretability of audio codecs and potentially providing finer control over the audio generation process.



SD-Codec

SD-Codec is a neural audio codec based on residual vector quantization (RVQ). It consists of an encoder and a decoder for signal transformation, with an RVQ-based quantizer between them for information coding. Given an audio input that may contain multiple sources, SD-Codec learns to disentangle the latent features and distribute the features of each source domain to a different quantizer. The decoder of SD-Codec can reconstruct either a single source track from a specific quantizer or the mixture from the sum of the quantized latent features.
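The quantization step described above can be illustrated with a toy NumPy sketch. It is not the SD-Codec implementation: the encoder output is replaced by random latents, the codebooks are random rather than learned, the decoder is omitted, and feeding the same latent to every per-source quantizer is an assumption made only for illustration. The names `rvq_quantize`, `quantizers`, and `SOURCES`, and all sizes, are hypothetical.

```python
import numpy as np

def rvq_quantize(z, codebooks):
    """Residual vector quantization of latent frames.

    z: (T, D) array of latent frames.
    codebooks: list of (K, D) arrays, one per RVQ stage.
    Returns the quantized latents (T, D) and the code indices (num_stages, T).
    """
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual frame to every code.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen   # accumulate this stage's contribution
        residual -= chosen    # the next stage quantizes what is left over
        indices.append(idx)
    return quantized, np.stack(indices)

rng = np.random.default_rng(0)
T, D, K, STAGES = 50, 8, 16, 4          # illustrative sizes only
SOURCES = ("speech", "music", "sfx")

# Hypothetical stand-in for the encoder output of a mixture.
z = rng.normal(size=(T, D))

# One independent set of RVQ codebooks per source domain
# (random here; learned jointly with the encoder and decoder in a real codec).
quantizers = {
    s: [rng.normal(size=(K, D)) * 0.5 ** i for i in range(STAGES)]
    for s in SOURCES
}

per_source, codes = {}, {}
for s in SOURCES:
    per_source[s], codes[s] = rvq_quantize(z, quantizers[s])

# A decoder would take per_source[s] to resynthesize one source track,
# or the sum below to resynthesize the mixture.
mixture_latent = sum(per_source.values())
```

In this sketch, `codes[s]` plays the role of the discrete tokens for source `s`, so a single source can be stored or decoded without touching the other two codebook sets.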

Approach

Figure 1: Framework of SD-Codec.



Resynthesis

Audio ID Ground Truth SD-Codec DAC
4470 (mixture) fname fname fname
4470 (speech) fname fname fname
4470 (music) fname fname fname
4470 (SFX) fname fname fname
47461 (mixture) fname fname fname
47461 (speech) fname fname fname
47461 (music) fname fname fname
47461 (SFX) fname fname fname
54022 (mixture) fname fname fname
54022 (speech) fname fname fname
54022 (music) fname fname fname
54022 (SFX) fname fname fname
71510 (mixture) fname fname fname
71510 (speech) fname fname fname
71510 (music) fname fname fname
71510 (SFX) fname fname fname
98846 (mixture) fname fname fname
98846 (speech) fname fname fname
98846 (music) fname fname fname
98846 (SFX) fname fname fname


Separation

Audio ID Ground Truth SD-Codec TDANet
21594 (mixture) fname
21594 (speech) fname fname fname
21594 (music) fname fname fname
21594 (SFX) fname fname fname
37828 (mixture) fname
37828 (speech) fname fname fname
37828 (music) fname fname fname
37828 (SFX) fname fname fname
58627 (mixture) fname
58627 (speech) fname fname fname
58627 (music) fname fname fname
58627 (SFX) fname fname fname
59313 (mixture) fname
59313 (speech) fname fname fname
59313 (music) fname fname fname
59313 (SFX) fname fname fname
98448 (mixture) fname
98448 (speech) fname fname fname
98448 (music) fname fname fname
98448 (SFX) fname fname fname


BibTeX

@article{bie2024sdcodec,
  author={Bie, Xiaoyu and Liu, Xubo and Richard, Ga{\"e}l},
  title={Learning Source Disentanglement in Neural Audio Codec},
  journal={arXiv preprint arXiv:2409.11228},
  year={2024},
}


Acknowledgement

This work was funded by the European Union (ERC, HI-Audio, 101052978). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This project was provided with computer and storage resources by GENCI at IDRIS thanks to the grant 2024-AD011015054 on the supercomputer Jean Zay's V100 and A100 partition.



Page updated on 18 Sep 2024