When Word Embeddings Become Endangered

Title: When Word Embeddings Become Endangered
Author: Alnajjar, Khalid
Date: 2021
URI: http://hdl.handle.net/10138/327791
Abstract: Big languages such as English and Finnish have many natural language processing (NLP) resources and models. This is not the case for low-resourced and endangered languages, where such resources remain scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and Universal Dependencies treebanks. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. The embeddings are then fine-tuned on the sentences in the Universal Dependencies treebanks and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing the cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with those of the resource-rich languages, and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis models, which achieved high accuracies. All our cross-lingual word embeddings and sentiment analysis models will be released openly via an easy-to-use Python library.
Subject: Cross-lingual Word Embeddings; Endangered Languages; Sentiment Analysis
Rights: CC BY 4.0
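
The dictionary-based construction step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so the code below assumes one simple variant: each resource-poor word receives the mean of the vectors of its dictionary translations in a resource-rich language. The function names (`mean_vector`, `project_embeddings`), the placeholder words, and the toy vectors are all hypothetical and are not taken from the paper or its Python library.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def project_embeddings(
    dictionary: Dict[str, List[str]],
    rich_embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Give each resource-poor word the mean vector of its known translations.

    Assumption: simple averaging stands in for whatever combination
    the paper actually uses.
    """
    projected: Dict[str, Vector] = {}
    for word, translations in dictionary.items():
        known = [rich_embeddings[t] for t in translations if t in rich_embeddings]
        if known:  # skip words with no translation in the embedding vocabulary
            projected[word] = mean_vector(known)
    return projected


# Toy data, invented for illustration only.
rich = {
    "house": [1.0, 0.0],
    "home": [0.0, 1.0],
    "water": [2.0, 2.0],
}
toy_dictionary = {
    "w1": ["house", "home"],  # placeholder for a resource-poor word
    "w2": ["water"],
    "w3": ["unlisted"],       # no usable translation, so it is skipped
}

embeddings = project_embeddings(toy_dictionary, rich)
print(embeddings)
```

Because the projected vectors are built from vectors of the resource-rich language, they start out in that language's space. The fine-tuning and alignment steps mentioned in the abstract are not covered by this sketch.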

Files in this item

File: 24_Alnajjar_Multilingual_Facilitation.pdf
Size: 210.0Kb
Format: PDF

Except where otherwise noted, this item's license is described as CC BY 4.0.