Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

dc.contributor.author Suni, Antti
dc.contributor.author Kakouros, Sofoklis
dc.contributor.author Vainio, Martti
dc.contributor.author Šimko, Juraj
dc.date.accessioned 2020-08-30T12:56:01Z
dc.date.available 2020-08-30T12:56:01Z
dc.date.issued 2020
dc.identifier.citation Suni, A, Kakouros, S, Vainio, M & Šimko, J 2020, Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis. in Proceedings of 10th International Conference on Speech Prosody 2020: Communicative and Interactive Prosody. Speech prosody, ISCA, Baixas, pp. 940-944, The 10th International Conference on Speech Prosody, Tokyo, Japan, 25/05/2020. https://doi.org/10.21437/SpeechProsody.2020-192
dc.identifier.citation conference
dc.identifier.other PURE: 139767692
dc.identifier.other PURE UUID: 8433b6f3-88ff-4df6-b3c4-c8a888f5ff28
dc.identifier.other ORCID: /0000-0003-2570-0196/work/79877745
dc.identifier.other ORCID: /0000-0001-8996-0793/work/79880550
dc.identifier.uri http://hdl.handle.net/10138/318782
dc.description.abstract Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing the local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems with a single sentence as their synthesis domain. One possible solution might be to condition the synthesized speech on explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, can result in a more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as input to a Tacotron-like synthesis system alongside textual information. The results of an objective evaluation of the synthesized speech show that using the prosodic labels significantly improves the output in terms of the faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations. en
dc.format.extent 5
dc.language.iso eng
dc.publisher ISCA
dc.relation.ispartof Proceedings of 10th International Conference on Speech Prosody 2020
dc.relation.ispartofseries Speech prosody
dc.rights.uri info:eu-repo/semantics/openAccess
dc.subject 6161 Phonetics
dc.subject 6121 Languages
dc.title Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis en
dc.type Conference contribution
dc.contributor.organization Department of Digital Humanities
dc.contributor.organization Phonetics and Speech Synthesis
dc.contributor.organization Phonetics
dc.contributor.organization Mind and Matter
dc.description.reviewstatus Peer reviewed
dc.relation.doi https://doi.org/10.21437/SpeechProsody.2020-192
dc.relation.issn 2333-2042
dc.rights.accesslevel openAccess
dc.type.version publishedVersion

Files in this item

Files Size Format View
87.pdf 812.5Kb PDF View/Open
