Self-supervised end-to-end ASR for low resource L2 Swedish

Permalink

http://hdl.handle.net/10138/336685

Citation

Al-Ghezi, R., Getman, Y., Rouhe, A., Hildén, R. & Kurimo, M. 2021, 'Self-supervised end-to-end ASR for low resource L2 Swedish', in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2, ISCA, Baixas, pp. 1429-1433, Annual Conference of the International Speech Communication Association, Brno, Czech Republic, 30/08/2021. https://doi.org/10.21437/Interspeech.2021-1710

Title: Self-supervised end-to-end ASR for low resource L2 Swedish
Author: Al-Ghezi, Ragheb; Getman, Yaroslav; Rouhe, Aku; Hildén, Raili; Kurimo, Mikko
Contributor: University of Helsinki, Aalto University
University of Helsinki, Department of Education
Publisher: ISCA
Date: 2021
Language: eng
Number of pages: 5
Belongs to series: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Belongs to series: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISBN: 9781713836902
URI: http://hdl.handle.net/10138/336685
Abstract: Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for separate acoustic, language, and pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as ASR for Second Language (L2) speakers of Swedish. Nonetheless, recent advances in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representations that can achieve low WER when incorporated into end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Although our test set is very small, it indicates that these systems are competitive in performance with a traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data.
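The abstract describes end-to-end models that map acoustic input directly to grapheme sequences. The final step of such a system is typically CTC decoding, which collapses frame-level predictions into a character string. The sketch below illustrates greedy CTC decoding on toy data; the vocabulary and frame scores are illustrative inventions, not values from the paper.

```python
def ctc_greedy_decode(logits, vocab, blank=0):
    """Pick the most likely token per frame, collapse repeats, drop blanks.

    logits: list of per-frame score lists (one score per vocabulary token).
    vocab:  list mapping token index -> grapheme; index `blank` is the CTC blank.
    """
    out = []
    prev = blank
    for frame in logits:
        idx = max(range(len(frame)), key=frame.__getitem__)  # per-frame argmax
        if idx != prev and idx != blank:  # emit only on a change, never a blank
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

# Toy grapheme set and fake frame scores spelling h-h-<blank>-e-j-j.
vocab = ["<blank>", "h", "e", "j"]  # "hej" is Swedish for "hi"
logits = [
    [0.1, 0.9, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.1, 0.9],
]
print(ctc_greedy_decode(logits, vocab))  # -> hej
```

Repeated frame predictions ("h", "h") collapse into one grapheme, and the blank token both separates genuine repeats and absorbs frames with no output, which is what lets the model avoid an explicit pronunciation model.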
Subject: End-to-End L2 ASR
Nonnative ASR
Self-supervised
113 Computer and information sciences
6121 Languages
Rights:


Files in this item

Files Size Format View
alghezi21_interspeech.pdf 208.6Kb PDF View/Open
