Browsing by Subject "Speech analysis"


  • Suviranta, Rosa (Helsingin yliopisto, 2021)
    This preliminary study verifies how well a Conditioned Convolutional Variational Autoencoder (CCVAE) learns the prosodic characteristics of the interaction between the Lombard effect and different focus conditions. Lombard speech is an adaptation to ambient noise manifested by raised vocal intensity, fundamental frequency, and duration. Focus marks new propositional information and is signalled by making the focused word more prominent relative to the others. A CCVAE was trained on the f0 contours and speech envelopes of a Lombard speech corpus of Finnish utterances. The model's capability to reconstruct the prosodic characteristics was statistically evaluated based on the bottleneck representations alone. The following questions were addressed: the appropriate size of the bottleneck layer for the task, the ability of the bottleneck representations to capture the prosodic characteristics, and the encoding of the bottleneck representations. The study shows promising results: the method can elicit representations that quantify the prosodic effects of the underlying influences and interactions. Even low-dimensional bottlenecks can conceptualise and consistently typologize the prosodic events of interest, although finding the optimal bottleneck dimension still needs more research. Subsequently, the model's ability to capture the prosodic characteristics was verified by investigating the generated samples; based on these results, the CCVAE can capture prosodic events, and the quality of the reconstruction is positively correlated with the bottleneck dimension. Finally, the encoding of the bottlenecks was examined. The CCVAE encodes the bottleneck representations similarly regardless of the training instance or the bottleneck dimension. The Lombard effect was captured most efficiently, with the focus conditions second.
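    As a hedged illustration of the prosodic correlates named in the abstract (raised fundamental frequency and lengthened duration under the Lombard effect), a minimal Python sketch is given below. It is not the CCVAE described in the study; the function name and the scaling factors are hypothetical assumptions chosen only to make the effect concrete.

    ```python
    # Illustrative sketch only: simulates the reported prosodic correlates
    # of the Lombard effect (raised f0, lengthened duration) on a frame-wise
    # f0 contour in Hz. Gains are hypothetical, not corpus-derived values.

    def lombard_adapt(f0_contour, f0_gain=1.2, duration_gain=1.3):
        """Return a Lombard-style copy of an f0 contour.

        f0_gain raises the fundamental frequency; duration_gain lengthens
        the contour via linear interpolation between existing frames.
        """
        n_out = max(2, round(len(f0_contour) * duration_gain))
        adapted = []
        for i in range(n_out):
            # Map output frame i back onto a fractional input position.
            pos = i * (len(f0_contour) - 1) / (n_out - 1)
            lo = int(pos)
            hi = min(lo + 1, len(f0_contour) - 1)
            frac = pos - lo
            value = f0_contour[lo] * (1 - frac) + f0_contour[hi] * frac
            adapted.append(value * f0_gain)
        return adapted

    plain = [110.0, 120.0, 130.0, 125.0, 115.0]   # plain-speech contour (Hz)
    lombard = lombard_adapt(plain)                # longer and higher-pitched
    ```

    A model such as the CCVAE in the study would be expected to separate contours like `plain` and `lombard` in its bottleneck space.
    
    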
  • Airaksinen, Manu; Juvela, Lauri; Alku, Paavo; Rasanen, Okko (IEEE, 2019)
    International Conference on Acoustics Speech and Signal Processing ICASSP
    This study explores various speech data augmentation methods for the task of noise-robust fundamental frequency (F0) estimation with neural networks. The explored strategies are split into additive noise and channel-based augmentation on the one hand, and vocoder-based augmentation methods on the other. In vocoder-based augmentation, a glottal vocoder is used to enhance the accuracy of the ground truth F0 used for training the neural network, as well as to expand the diversity of the training data in terms of F0 patterns and the vocal tract lengths of the talkers. Evaluations on the PTDB-TUG corpus indicate that noise and channel augmentation can be used to greatly increase the noise robustness of trained models, and that vocoder-based ground truth enhancement further increases model performance. For smaller datasets, vocoder-based diversity augmentation can also be used to increase performance. The best-performing proposed method greatly outperformed the compared F0 estimation methods in terms of noise robustness.
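    To make the additive-noise branch of the augmentation concrete, below is a hedged Python sketch of mixing noise into speech at a target signal-to-noise ratio and expanding a labelled dataset with noisy copies. The SNR scaling is a standard textbook formulation; the function names, the SNR grid, and the (waveform, f0_label) pair layout are assumptions for illustration, not the paper's implementation.

    ```python
    # Illustrative sketch only: additive-noise augmentation for training
    # a noise-robust F0 estimator. Parameters and data layout are assumed.
    import random

    def add_noise(speech, noise, snr_db):
        """Mix `noise` into `speech` at the requested SNR (in dB)."""
        sp = sum(x * x for x in speech) / len(speech)   # speech power
        npow = sum(x * x for x in noise) / len(noise)   # noise power
        # Choose gain g so that 10*log10(sp / (g**2 * npow)) == snr_db.
        gain = (sp / (npow * 10 ** (snr_db / 10))) ** 0.5
        return [s + gain * n for s, n in zip(speech, noise)]

    def augment(dataset, noise, snrs=(0, 5, 10, 20)):
        """Expand (waveform, f0_label) pairs with noisy copies.

        The F0 label is kept unchanged: the clean ground truth still
        describes the underlying voicing of the noisy waveform.
        """
        out = list(dataset)
        for wav, f0 in dataset:
            snr = random.choice(snrs)       # hypothetical SNR grid
            out.append((add_noise(wav, noise, snr), f0))
        return out
    ```

    Keeping the clean F0 label while corrupting only the input waveform is what makes the trained network robust: it learns to recover the same target under degraded conditions.
    
    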