AS-Speech: Adaptive Style For Speech Synthesis

Abstract

Text-to-Speech (TTS) synthesis has made significant progress in recent years, enabling high-quality voice synthesis in common scenarios. In unseen situations, adaptive TTS requires strong generalization to speaker style characteristics. However, existing adaptive methods can only extract and integrate coarse-grained timbre or mixed rhythm information separately. In this paper, we propose AS-Speech, an adaptive style methodology that integrates speaker timbre characteristics and rhythmic features into a unified framework for text-to-speech synthesis. Specifically, AS-Speech accurately simulates style characteristics through fine-grained text-based timbre features and global rhythm information, and achieves high-fidelity speech synthesis through a diffusion model. Experiments show that the proposed model produces voices with higher timbre and rhythm similarity than a series of adaptive TTS models while maintaining the naturalness of the synthetic speech.

Model Overview

[Figure: AS-Speech model overview]

Zero Shot Demo

Audio Samples for VCTK (English) Test-Set (unseen)

Transcriptions:

  • p225: Two important points remained to be settled with that nation: their delivery of the king, and the estimation of their arrears.
  • p234: He knew now that his absence, for as long as he had to be away, would be covered up and satisfactorily accounted for.
  • p245: His soul was swooning into some new world, fantastic, dim, uncertain as under sea, traversed by cloudy shapes and beings.
  • p248: The silence never lasts long, however, for the feminine desire to talk it over usually gets the better of the deepest emotion.
  • p294: One perceives, without understanding it, a hideous murmur, sounding almost like human accents, but more nearly resembling a howl than an articulate word.
  • p302: The Land decree of the Congress of Soviets is identical in its fundamentals with the decisions of the first Peasants' Congress.
  • p335: I will briefly describe them to you, and you shall read the account of them at your leisure in the sacred registers.
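
Name | Prompt | StyleSpeech | YourTTS | AS_Xvector | AS_ASE | AS-Speech
p225 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p234 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p245 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p248 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p294 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p302 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
p335 | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]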

Rhythm Demo

Audio prompts from Style60 (Mandarin) Test-Set (unseen)

Transcriptions:

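  • Neutral: 这肯定是处理宗教问题的理性态度,但尤里安毕竟是政客。 (This is certainly a rational attitude toward religious issues, but Julian is, after all, a politician.)
  • Happy: 祝我家调皮可爱的小孩儿生日快乐! (Happy birthday to my naughty, adorable little one!)
  • Angry: 奕北忿忿地想,他不会任他们真的逍遥三十天的,绝不! (Yibei thought indignantly that he would not let them really go free for thirty days. Never!)
  • Sad: 疫情一直不结束,公司发的旅游基金马上要过期了,运气真差! (The pandemic just won't end, and the travel fund the company gave out is about to expire. What bad luck!)
  • Afraid: 她们接触到秦倚天的眼神威胁,立即又齐齐的停下了脚步。 (Meeting the threat in Qin Yitian's eyes, they all immediately stopped in their tracks again.)
  • News: 二是规范平台服务,拆除办事关卡。 (Second, standardize platform services and remove the barriers to getting things done.)
  • Story: 当我闭上眼睛的时候,我将怀着感激的心情向他祈祷,感谢他。 (When I close my eyes, I will pray to him with a grateful heart and thank him.)
  • Poetry: 我柔弱的心啊,请试着去忘记,请千万千万别再哭泣。 (O my fragile heart, please try to forget; please, please do not cry anymore.)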
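
Style | Prompt | Gt (voc) | GradTTS | CSEDT | AS_wo_lort | AS-Speech
Neutral | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Happy | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Angry | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Sad | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Afraid | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
News | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Story | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]
Poetry | [audio] | [audio] | [audio] | [audio] | [audio] | [audio]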

Style Story Demo

Audio prompts from Style60 (Mandarin) Test-Set (unseen)

Transcriptions: (Prompt_id + Text)

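  • (02) 开心的一天,我沿着海滩漫步,感受着清新的空气。 (On a happy day, I strolled along the beach, taking in the fresh air.)
  • (02) 阳光透过云层洒在我身上,温暖而快乐。 (Sunlight spilled through the clouds onto me, warm and joyful.)
  • (05) 突然,天空阴云密布,风声凄厉,让我感到一丝不安和恐惧。 (Suddenly, dark clouds filled the sky and the wind howled, leaving me with a trace of unease and fear.)
  • (04) 望着周围,没有朋友的陪伴,让我有点感到伤心。 (Looking around, with no friends by my side, I felt a little sad.)
  • (05) 一刹那间,雷电交加,暴雨如注,我匆忙找了个树荫躲避。 (In an instant, thunder and lightning struck and rain poured down; I hurriedly took shelter under a tree.)
  • (05) 树叶被风吹落,我被惊吓到了,身体不由得发抖。 (Leaves were blown down by the wind; I was frightened and could not help trembling.)
  • (04) 在暴风雨中,我听到了一个孤独的鸟叫声,让我感到一种伤感。 (In the storm, I heard a lonely bird call, which filled me with sorrow.)
  • (03) 手中的雨伞被大风吹走了,我愤怒地看着无情的大自然。 (The umbrella in my hand was blown away by the gale, and I glared angrily at merciless nature.)
  • (01) 突然,风停了,雨渐渐停歇。 (Suddenly, the wind stopped and the rain gradually eased.)
  • (02) 什么事都不可以阻挡我的快乐,我要去喝奶茶。 (Nothing can stand in the way of my happiness; I'm going to get milk tea.)
  • (04) 途中,我看到一只受伤的小鸟,悲伤涌上心头。 (On the way, I saw an injured little bird, and sadness welled up in my heart.)
  • (01) 我蹲下身,轻抚小鸟,愿它早日康复。 (I crouched down and gently stroked the bird, wishing it a speedy recovery.)
  • (03) 可恶,才刚喝了几口,奶茶就被别人撞洒了,无比的愤怒充斥着我的内心。 (Damn it, I had only taken a few sips when someone bumped into me and spilled my milk tea; rage filled my heart.)
  • (04) 看着地上的奶茶,感到伤心,为什么倒霉的总是我,好难过。 (Looking at the milk tea on the ground, I felt sad. Why am I always the unlucky one? I'm so upset.)
  • (01) 回到家里,看到准备好了的饭菜。 (Back home, I saw the meal that had been prepared.)
  • (02) 家里是永远温暖的港湾,我爱我家,幸福快乐! (Home is always a warm harbor. I love my family; happiness and joy!)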

Prompt_01 | Prompt_02 | Prompt_03 | Prompt_04 | Prompt_05
[audio] | [audio] | [audio] | [audio] | [audio]

Story Sample | [audio]
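
The "(Prompt_id + Text)" transcription format above pairs each sentence with the style prompt used for it. As a rough illustration only, the Python sketch below parses such lines into (prompt, text) pairs and concatenates per-sentence outputs into one story. It assumes the story is synthesized sentence by sentence and then joined, which this page does not state. The names `parse_script`, `synthesize_story`, and `dummy_tts` are hypothetical, and `dummy_tts` is a placeholder that returns silence, not the AS-Speech model.

```python
import re
from typing import Callable, List, Tuple

# Matches lines such as "• (02) some text"; the bullet is optional.
LINE_RE = re.compile(r"^\s*(?:•\s*)?\((\d{2})\)\s*(.+?)\s*$")


def parse_script(lines: List[str]) -> List[Tuple[str, str]]:
    """Turn '(NN) text' lines into (prompt_name, text) pairs, skipping other lines."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            pairs.append((f"Prompt_{match.group(1)}", match.group(2)))
    return pairs


def synthesize_story(
    pairs: List[Tuple[str, str]],
    tts: Callable[[str, str], List[float]],
) -> List[float]:
    """Concatenate per-sentence waveforms (assumed workflow, not confirmed by the page)."""
    story: List[float] = []
    for prompt_name, text in pairs:
        story.extend(tts(text, prompt_name))
    return story


def dummy_tts(text: str, prompt_name: str) -> List[float]:
    """Hypothetical stand-in for a synthesizer: one silent sample per character."""
    return [0.0] * len(text)


script = [
    "• (02) 阳光透过云层洒在我身上,温暖而快乐。",
    "• (05) 树叶被风吹落,我被惊吓到了,身体不由得发抖。",
    "• (01) 突然,风停了,雨渐渐停歇。",
]
pairs = parse_script(script)
audio = synthesize_story(pairs, dummy_tts)
```

In a real pipeline, `dummy_tts` would be replaced by a call that conditions the model on the audio file behind each prompt name; that interface is not described on this page.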