Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential differences between sound domains such as speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, i.e., sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, successfully disentangles the different sources in the latent space, thereby enhancing interpretability in audio codecs and offering potentially finer control over the audio generation process.
SD-Codec
SD-Codec is an RVQ-based neural audio codec consisting of an encoder and a decoder for signal transformation, with an RVQ quantizer between them for information coding. Given an audio input that may contain multiple sources, SD-Codec learns to disentangle the latent features and route the features of each source domain to a dedicated quantizer. The decoder can then reconstruct either a single source track from its quantizer or the full mixture from the sum of the quantized latent features.
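To make this design concrete, below is a minimal PyTorch-style sketch of the forward pass under the assumptions stated here: three source domains (speech, music, SFX), a shared encoder and decoder, and one quantizer per domain. All module names, shapes, and the single-layer encoder/decoder are illustrative placeholders (nn.Identity stands in for each per-domain residual vector quantizer); this is a sketch, not the released implementation.

import torch
import torch.nn as nn

class SDCodecSketch(nn.Module):
    """Sketch of a source-disentangled RVQ codec (illustrative only)."""

    def __init__(self, sources=("speech", "music", "sfx"), dim=256):
        super().__init__()
        self.sources = sources
        # Placeholder encoder/decoder; a real codec uses deep convolutional stacks.
        self.encoder = nn.Conv1d(1, dim, kernel_size=7, padding=3)
        self.decoder = nn.Conv1d(dim, 1, kernel_size=7, padding=3)
        # One projection plus one quantizer per source domain.
        self.split = nn.ModuleDict({s: nn.Conv1d(dim, dim, kernel_size=1) for s in sources})
        self.quantizers = nn.ModuleDict({s: nn.Identity() for s in sources})  # stand-in for per-domain RVQ

    def forward(self, mixture):
        z = self.encoder(mixture)  # shared latent, shape (B, dim, T)
        # Disentangle: route a domain-specific latent through that domain's quantizer.
        z_q = {s: self.quantizers[s](self.split[s](z)) for s in self.sources}
        stems = {s: self.decoder(z_q[s]) for s in self.sources}  # single-source reconstructions
        mix_hat = self.decoder(sum(z_q.values()))  # mixture from the sum of quantized latents
        return stems, mix_hat

codec = SDCodecSketch()
stems, mix_hat = codec(torch.randn(1, 1, 16000))  # one second of mono audio (illustrative)

Decoding the quantized latent of one domain alone yields that stem, while decoding the sum of all quantized latents yields the mixture; this additive structure in the latent space is what the separation examples below probe.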
Resynthesis
[Audio examples: resynthesis of clips 4470, 47461, 54022, 71510, and 98846, each with mixture, speech, music, and SFX tracks, comparing Ground Truth, SD-Codec, and DAC.]
Separation
[Audio examples: separation of clips 21594, 37828, 58627, 59313, and 98448, each with mixture, speech, music, and SFX tracks, comparing Ground Truth, SD-Codec, and TDANet.]
BibTeX
@article{bie2024sdcodec,
author={Bie, Xiaoyu and Liu, Xubo and Richard, Ga{\"e}l},
title={Learning Source Disentanglement in Neural Audio Codec},
journal={arXiv preprint arXiv:2409.11228},
year={2024},
}
Acknowledgement
This work was funded by the European Union (ERC, HI-Audio, 101052978). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This project was granted access to computer and storage resources by GENCI at IDRIS (grant 2024-AD011015054) on the V100 and A100 partitions of the Jean Zay supercomputer.