AS-Speech

Abstract

Text-to-Speech (TTS) synthesis has progressed significantly in recent years and can now synthesize high-quality voices in common scenarios. In unseen scenarios, however, adaptive TTS must generalize well to a speaker's style characteristics, and existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm information separately. In this paper, we propose AS-Speech, an adaptive style method that integrates speaker timbre characteristics and rhythmic features into a unified framework for text-to-speech synthesis. Specifically, AS-Speech accurately simulates style characteristics through fine-grained, text-based timbre features and global rhythm information, and achieves high-fidelity speech synthesis with a diffusion model. Experiments show that the proposed model produces voices with higher timbre and rhythm similarity than a series of adaptive TTS models while maintaining the naturalness of the synthetic speech.

Zero-Shot Demo

Audio Samples for the VCTK (English) Test Set (unseen)

Transcriptions:

p225: Two important points remained to be settled with that nation: their delivery of the king, and the estimation of their arrears.

p234: He knew now that his absence, for as long as he had to be away, would be covered up and satisfactorily accounted for.

p245: His soul was swooning into some new world, fantastic, dim, uncertain as under sea, traversed by cloudy shapes and beings.

p248: The silence never lasts long, however, for the feminine desire to talk it over usually gets the better of the deepest emotion.

p294: One perceives, without understanding it, a hideous murmur, sounding almost like human accents, but more nearly resembling a howl than an articulate word.

p302: The Land decree of the Congress of Soviets is identical in its fundamentals with the decisions of the first Peasants' Congress.

p335: I will briefly describe them to you, and you shall read the account of them at your leisure in the sacred registers.

Name | Prompt | StyleSpeech | YourTTS | AS_Xvector | AS_ASE | AS-Speech

p225