Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Permanent link

http://hdl.handle.net/10138/318782

Citation

Suni, A, Kakouros, S, Vainio, M & Šimko, J 2020, Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis. In Proceedings of 10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody. Speech prosody, ISCA, Baixas, pp. 940-944, The 10th International Conference on Speech Prosody, Tokyo, Japan, 25/05/2020. https://doi.org/10.21437/SpeechProsody.2020-192

Title: Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis
Author: Suni, Antti; Kakouros, Sofoklis; Vainio, Martti; Šimko, Juraj
Author's organization: Department of Digital Humanities
Phonetics and Speech Synthesis
Phonetics
Mind and Matter
Publisher: ISCA
Date: 2020
Language: eng
Number of pages: 5
Belongs to series: Proceedings of 10th International Conference on Speech Prosody 2020
Belongs to series: Speech prosody
ISSN: 2333-2042
DOI: https://doi.org/10.21437/SpeechProsody.2020-192
URI: http://hdl.handle.net/10138/318782
Abstract: Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems with a single sentence as their synthesis domain. One possible solution is to condition the synthesized speech on explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, can result in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as input to a Tacotron-like synthesis system alongside textual information. The results of an objective evaluation of the synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.
Keywords: 6161 Phonetics
6121 Languages
Peer reviewed: Yes
Access restrictions: openAccess
Self-archived version: publishedVersion


Files

File(s) Size Format
87.pdf 812.5KB PDF
