Audio Samples for VCTK (English) Test-Set (unseen)
In recent years, Text-to-Speech (TTS) synthesis has progressed significantly, enabling high-quality voice synthesis in common scenarios. In unseen scenarios, adaptive TTS requires strong generalization to speaker style characteristics. However, existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm information separately. In this paper, we propose AS-Speech, an adaptive style method that integrates speaker timbre characteristics and rhythmic features into a unified framework for text-to-speech synthesis. Specifically, AS-Speech accurately models style characteristics through fine-grained, text-based timbre features and global rhythm information, and achieves high-fidelity speech synthesis with a diffusion model. Experiments show that, compared with a series of adaptive TTS models, the proposed model produces voices with higher similarity in timbre and rhythm while maintaining the naturalness of the synthetic speech.
Audio Samples for VCTK (English) Test-Set (unseen)
Name | Prompt | StyleSpeech | YourTTS | AS_Xvector | AS_ASE | AS-Speech |
---|---|---|---|---|---|---|
p225 | | | | | | |
p234 | | | | | | |
p245 | | | | | | |
p248 | | | | | | |
p294 | | | | | | |
p302 | | | | | | |
p335 | | | | | | |
Audio prompts from Style60 (Mandarin) Test-Set (unseen)
Style | Prompt | Gt (voc) | GradTTS | CSEDT | AS_wo_lort | AS-Speech |
---|---|---|---|---|---|---|
Neutral | | | | | | |
Happy | | | | | | |
Angry | | | | | | |
Sad | | | | | | |
Afraid | | | | | | |
News | | | | | | |
Story | | | | | | |
Poetry | | | | | | |
Audio prompts from Style60 (Mandarin) Test-Set (unseen)
Prompt_01 | Prompt_02 | Prompt_03 | Prompt_04 | Prompt_05 |
---|---|---|---|---|
Story Sample | ||||