When Word Embeddings Become Endangered

Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages, for which such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.


Introduction
The interest in building natural language processing (NLP) solutions for low-resourced languages is constantly increasing [1], not only because of the challenges associated with dealing with scarce resources but also because NLP solutions facilitate documenting and analysing languages. Examples of such solutions are applying optical character recognition to scan books [45], normalizing historical variation [7], using speech recognition [19] and more. However, most of the existing research is conducted in a simulated setting [15,24,22] where a reduced portion of the resource-rich language is used to represent a low-resourced language. Other approaches consider Wikipedias of languages having a small number of articles (i.e., < 500,000) such as Latin, Hindi, Thai and Swahili [9,48].
In this paper, we are dealing with languages that are classified as endangered based on the UNESCO Atlas. These languages are Erzya (myv), Moksha (mdf), Komi-Zyrian (kpv) and Skolt Sami (sms). The most common methodology for documenting endangered languages is constructing translation dictionaries, whether by digitizing physical dictionaries or by reaching out to native speakers. Universal dependencies (UD) written by dedicated researchers studying such endangered languages might also be available, and, in a fortunate scenario, they would include translations to a language with more speakers. The bigger languages that endangered languages are translated to are very inconsistent and vary depending on the language family, geographically close languages and the languages spoken by the documenter.
English, without a doubt, is currently the most resourced language in the field of NLP. However, English translations are not frequently found for endangered and low-resourced languages. To overcome this and make using existing English resources possible, we leverage the recent advances in the field of NLP for aligning word embeddings of big languages such as Finnish and Russian with English word embeddings.
The contributions in this paper are:
- Proposing a method for constructing word embeddings for low-resourced and endangered languages, which are also aligned with word embeddings of big languages.
- Building a universal sentiment analysis model that achieves high accuracy in both endangered and resource-rich languages covered in this work.
- Releasing an open-source and easy-to-use Python library with all the word embeddings and the sentiment analyzer model to support the community and researchers.
This paper is structured as follows. Section 2 contains a brief description of the related work on building cross-lingual low-resource word embeddings. Thereafter, we describe the linguistic resources used in this work, including the translation dictionaries, universal dependencies and existing word embeddings of resource-rich languages. The proposed method for constructing cross-lingual word embeddings for endangered languages is then elaborated, followed by the description of the sentiment analysis model. We then present the results and evaluation for the word embeddings and the sentiment analysis model. Lastly, we discuss and highlight our remarks in the conclusions.

Related work
The largest scale model for capturing the computational semantics of endangered Uralic languages, Erzya, Moksha, Komi-Zyrian and Skolt Sami, is, perhaps, SemUr [16]. The database consists of words that are connected to each other based on their syntactic co-occurrences in a large internet corpus for Finnish. The extracted relations have been automatically translated by using Jack Rueter's XML dictionaries. In human evaluation, the quality was surprisingly acceptable given that the method was based on word-level translations. This gives hope in using these high-quality dictionaries in building computational semantic models.
Apart from SemUr, there have not been any other attempts at automatically modelling semantics for endangered Uralic languages. Some recent studies, however, present interesting work on higher-resourced languages using word embeddings [2,11]. In general, word-embedding-based methods such as word2vec [28] and fastText [6] are well suited for the task of applying high-resource language data to endangered languages as they work on the word level.
Several recent approaches such as GPT-2 [34], ELMo [33] and BERT [10] aim to capture richer semantic representations from text. However, they are very data intensive and their representation is no longer on the level of individual words. This makes it more difficult to use them for endangered languages.
Recently, neural networks have been used heavily in the field of NLP due to their great capabilities in learning a generalization, which has resulted in high accuracies. However, neural networks demand a large amount of data, which usually is not available for low-resource languages. Despite this, researchers have employed neural networks in a low-resource setting by producing synthetic data. For instance, Hämäläinen and Rueter have built a neural network to detect cognates between two endangered languages [18], Skolt Sami and North Sami. Their approach reached a better accuracy when they combined data synthetically produced by a statistical model with real data.

Linguistic resources
Here, we describe the linguistic resources used throughout the research presented in this paper. We will focus on resources related to the endangered languages (i.e., Erzya, Moksha, Komi-Zyrian and Skolt Sami), while still providing a brief introduction to the resources of the resource-rich languages. The resources for endangered languages that we cover here are: 1) translation dictionaries, 2) universal dependencies and 3) finite-state transducers. This list is by no means inclusive of all available and useful resources for endangered languages, as additional resources might exist, such as the work of Jack Rueter on online dictionaries [41] and on making them usable even through click-in-text interfaces [38]. For the resource-rich languages, we describe their word embeddings.

Translation dictionaries
Low-resource and endangered languages commonly have translation dictionaries to a bigger language. In our case, such dictionaries are multilingual and are provided in an Extensible Markup Language (XML) format. Fortunately, the target languages of the translations are mostly consistent in all the dictionaries (which is not the typical case), but each dictionary contains different portions of translations. Table 1 shows a statistical summary of the translations existing in the dictionaries. The source language represents the endangered language and the target language indicates the resource-rich language. A meaning group in the dictionaries may contain multiple translations that can be used interchangeably as they share the same meaning. The analysis shows that for Erzya (myv) and Skolt Sami (sms), Finnish (fin) translations are the most common ones, whereas Russian (rus) and English (eng) translations are the most frequent ones for Komi-Zyrian (kpv) and Moksha (mdf), respectively. Entries in the dictionaries are in the lemma form, and, typically, their part-of-speech tags are provided. Further metadata might exist, such as stems and example usages of the word in the source language. We use the Giella [29] dictionaries, which have been mainly authored by Jack Rueter, through UralicNLP [21]. While Moksha has Finnish translations, the Moksha dictionary in UralicNLP did not contain any of these translations because the data was missing from the repository.
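To make the structure described above concrete, the following is a minimal sketch of reading lemma entries, part-of-speech tags and per-language translations grouped by meaning group. The element names (`e`, `lg`, `l`, `mg`, `tg`, `t`) and the sample entry are assumptions made for illustration, a simplified layout rather than the exact schema of the dictionaries, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, simplified entry layout (an assumption, not the real schema).
SAMPLE = """<r>
  <e>
    <lg><l pos="N">кудо</l></lg>
    <mg>
      <tg xml:lang="fin"><t>talo</t><t>koti</t></tg>
      <tg xml:lang="eng"><t>house</t></tg>
    </mg>
  </e>
</r>"""


def read_entries(xml_text):
    """Return one dict per entry: lemma, POS tag and its meaning groups."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("e"):
        lemma_node = entry.find("lg/l")
        meaning_groups = []
        for mg in entry.iter("mg"):
            group = {}  # language code -> interchangeable translations
            for tg in mg.iter("tg"):
                lang = tg.get(XML_LANG)
                group.setdefault(lang, []).extend(t.text for t in tg.iter("t"))
            meaning_groups.append(group)
        entries.append({
            "lemma": lemma_node.text,
            "pos": lemma_node.get("pos"),
            "meaning_groups": meaning_groups,
        })
    return entries


def count_translations(entries):
    """Count translations per target language across all entries."""
    counts = Counter()
    for entry in entries:
        for group in entry["meaning_groups"]:
            for lang, words in group.items():
                counts[lang] += len(words)
    return counts


entries = read_entries(SAMPLE)
counts = count_translations(entries)
```

A per-language count of this kind is the sort of summary that a table of dictionary statistics would report.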

Universal dependencies
Universal dependencies (UD) [47] is a standard framework for annotating the grammar (parts of speech, morphological features, and syntactic dependencies) of sentences. Additionally, UD allows annotators to supply their own comments. In the UD we are dealing with, translations of the sentences might appear in the comments. The UD of the endangered languages can be obtained directly from the Universal Dependencies website. At the time of writing, 1,690, 167, 104 and 435 sentences were in Erzya's [44], Moksha's [42], Skolt Sami's [30] and Komi-Zyrian's [32] UDs, respectively. These numbers highlight the insufficient amount of data present for training machine learning or NLP models for endangered languages. We have used UralicNLP [21], a Python library, to read the universal dependencies.
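As an illustration of the kind of information used from the treebanks, the sketch below parses text in the CoNLL-U format (ten tab-separated columns, with comment lines starting with `#`) and collects the lemmas and the comment fields of each sentence. It is a standalone toy parser, not the UralicNLP reader used in this work; the comment key `text_en`, the helper names, and the lemma annotation of the sample sentence are assumptions made for the example.

```python
def row(idx, form, lemma):
    # CoNLL-U has ten tab-separated columns; unspecified ones are "_".
    return "\t".join([str(idx), form, lemma] + ["_"] * 7)


# Toy sentence; the lemmas here are illustrative, not taken from the treebank.
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    "# text = Сехте паро шка.",
    "# text_en = The best time of all.",
    row(1, "Сехте", "сехте"),
    row(2, "паро", "паро"),
    row(3, "шка", "шка"),
    row(4, ".", "."),
]) + "\n"


def parse_conllu(text):
    """Return a list of sentences, each with its comments and lemmas."""
    sentences = []
    for block in text.strip().split("\n\n"):
        comments, lemmas = {}, []
        for line in block.splitlines():
            if line.startswith("#"):
                key, sep, value = line[1:].partition("=")
                if sep:  # keep only "key = value" style comments
                    comments[key.strip()] = value.strip()
            elif line.strip():
                cols = line.split("\t")
                # Skip multiword ranges ("1-2") and empty nodes ("1.1").
                if cols[0].isdigit():
                    lemmas.append(cols[2])
        sentences.append({"comments": comments, "lemmas": lemmas})
    return sentences


sentences = parse_conllu(SAMPLE)
```

The lemma lists are what a lemmatized embedding model can consume, while a translation comment, when present, gives access to the sentence in a bigger language.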

Finite-state transducers
The most common automatic tools available for endangered languages are finite-state transducers (FSTs). As they are rule-based, language experts can define how the finite-state machine should behave depending on the language. As a result, FSTs make it possible to lemmatize words and produce mini- and full paradigms. In this work, we use Jack Rueter's FSTs for Skolt Sami [39], Erzya and Moksha [40], and Komi-Zyrian [36]. The FSTs are supplied as part of the UralicNLP [21] Python library.

Word embeddings of resource-rich languages
Word embeddings are vector representations of words, which are built based on the surrounding context of each word. Semantic similarity between words captured in the word embeddings can be measured using cosine similarity, which can then be utilized to cluster meanings in text [17]. A common use of word embeddings is to acquire words that are semantically similar to an input word. For example, the 5 most similar words to "king" are "queen", "monarch", "prince", "sultan", and "ruler". The vector nature of these words makes it possible to perform vector operations such as addition, multiplication and subtraction. With such operations, analogies can be predicted, such as "king" - "man" + "woman" = "queen". Simply put, this asks what the equivalent of a king is that is not a man but rather a woman in the semantic space; the answer is a queen.
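The similarity and analogy operations described above can be sketched with a few lines of plain Python. The three-dimensional vectors below are invented for illustration only (real embeddings have hundreds of dimensions), and the function names are hypothetical.

```python
import math

# Toy 3-dimensional vectors invented for illustration only.
# The (hypothetical) dimensions loosely stand for: royalty, male, female.
VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [0.9, 0.9, 0.1],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vector, topn=5, exclude=()):
    """Rank vocabulary words by cosine similarity to the given vector."""
    scored = [(word, cosine(vector, vec))
              for word, vec in VECTORS.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z
              for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # The query words themselves are excluded from the candidates.
    return most_similar(target, topn=1, exclude={a, b, c})[0][0]


answer = analogy("king", "man", "woman")
```

With these toy vectors, `answer` is "queen", because subtracting the "male" component and adding the "female" one lands exactly on the vector for "queen".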
When building word embeddings, there are many preprocessing configurations and hyperparameters that influence the performance of the models, such as lemmatization, part-of-speech tagging, window size, the dimension size of the embeddings, minimum and maximum thresholds for word frequencies and so on. There is no fixed nor optimal configuration that is apt for all applications.
In the translation dictionaries, words and their translations are provided in their lemma form. Due to this reason, the vocabulary in any word embeddings we will be using has to be lemmatized. Ideally, all the hyperparameters and configurations for word embeddings should be the same to capture similar features and semantics, which would yield better results across models once they are aligned. For the scope of this research, we use the most similar models we could get our hands on.
We utilize the Russian and English [12], and Finnish [25] word embeddings. The Russian embeddings are trained on a news corpus, while the English ones are based on Wikipedia and the Gigaword 5th Edition corpora [31]. The Finnish word embeddings are trained on Common Crawl. The dimension size of the English and Russian embeddings is 300, whereas that of the Finnish embeddings is 200. The window size is 5 for all embeddings except the Finnish ones, for which it is 2. These differences, in addition to other reasons, end up affecting the quality of the models we build for endangered languages. We discuss them more in the Discussion section.

Cross-lingual word embeddings for endangered languages
Cross-lingual word embeddings are word embeddings where vectors across multiple languages are aligned. For instance, the vector for "dog" in the English embeddings points in roughly the same direction as the vectors for the same word in other languages (i.e., "koira" and "собака" for Finnish and Russian, respectively). Example applications of cross-lingual word embeddings are headline generation [5], loan word identification [27] and cognate identification [26]. Before we build and align the word embeddings, we apply dimensionality reduction using the method proposed in [35] to the three pre-trained models (i.e., English, Russian and Finnish). We set the target dimension to 100. This is to ensure that the vectors in all the embeddings share the same size. Subsequently, we process the vocabulary of the Finnish model by removing all occurrences of the hashtag symbol "#", which is there to mark compounds. Regarding the Russian word embeddings, the vocabulary contains part-of-speech information and, hence, each lemma might be present multiple times. To address this, we discard the part-of-speech information and use all vectors matching the target lemma.
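The two vocabulary clean-up steps can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are held as plain dictionaries from vocabulary key to vector, the Russian keys are assumed to have the form "lemma_POS" (the separator is an assumption, as the text does not specify the format), and the function names and toy entries are hypothetical.

```python
from collections import defaultdict


def strip_compound_markers(vectors):
    """Remove the compound marker '#' from every vocabulary key.

    If two keys collapse to the same string, the later one wins here;
    how such collisions should be resolved is left open in this sketch.
    """
    return {word.replace("#", ""): vec for word, vec in vectors.items()}


def group_by_lemma(vectors, sep="_"):
    """Drop the part-of-speech suffix and keep all vectors per lemma.

    Assumes keys look like 'lemma_POS'; keys without the separator
    are kept unchanged.
    """
    grouped = defaultdict(list)
    for key, vec in vectors.items():
        lemma = key.rsplit(sep, 1)[0]
        grouped[lemma].append(vec)
    return dict(grouped)


# Toy entries invented for illustration.
finnish = strip_compound_markers({"koira#koulu": [0.1, 0.2]})
russian = group_by_lemma({
    "печь_NOUN": [1.0, 0.0],
    "печь_VERB": [0.0, 1.0],
})
```

After this step, a lemma such as "печь" maps to both of its vectors, so each of them can contribute when vectors for dictionary translations are collected.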
To align the three main word embedding models, we employ the state-of-the-art supervised multilingual word embeddings alignment technique introduced in MUSE [8]. Figure 1 illustrates transforming the word embeddings of the source language X to match those of the target language Y so that words in both languages are aligned. In this example, the source language is English and the target language is Italian. What supervised means in this context is that the alignment process relies on a bilingual dictionary that guides the transformation process. In our work, we set the target language to English and align both the Russian and Finnish models with it using the bilingual dictionaries released as part of MUSE. The models are refined over 20 iterations. Following the alignment of the resource-rich models, we construct the word embeddings for the endangered languages: Erzya, Moksha, Komi-Zyrian and Skolt Sami. In doing so, we iterate over all the lexemes in the dictionary of a given endangered language. In the case where a lexeme has translations to any of the three resource-rich languages and the translation exists in the word embeddings of the corresponding language, a vector for the lexeme is constructed as the centroid (an average vector) of all translation vectors.
Once the word embeddings for the endangered languages have been constructed, we fine-tune them using the sentences in their universal dependencies. Lastly, we realign each word embeddings model with the resource-rich language into which it has the most translations. In other words, Erzya and Skolt Sami are aligned with Finnish, whereas Komi-Zyrian and Moksha are aligned with Russian and English, respectively. The models are aligned over 5 refinement steps.
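The centroid construction for a single endangered language can be sketched as below. This is a simplified illustration, not the released implementation: the data layout (`dictionary` mapping a lexeme to its translations per language, `aligned` mapping a language to word vectors that already live in one shared space), the function names and the toy two-dimensional vectors are all assumptions.

```python
def centroid(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def build_embeddings(dictionary, aligned):
    """Build one vector per lexeme from the vectors of its translations.

    dictionary: {lexeme: {language: [translation, ...]}}
    aligned:    {language: {word: vector}} with all languages already aligned
    Lexemes with no translation found in any model are skipped.
    """
    embeddings = {}
    for lexeme, translations in dictionary.items():
        found = [
            aligned[lang][word]
            for lang, words in translations.items()
            for word in words
            if word in aligned.get(lang, {})
        ]
        if found:
            embeddings[lexeme] = centroid(found)
    return embeddings


# Toy data invented for illustration.
aligned = {
    "eng": {"house": [1.0, 0.0], "home": [0.0, 1.0]},
    "fin": {"talo": [0.5, 0.5]},
}
dictionary = {
    "кудо": {"eng": ["house", "home"], "fin": ["talo"]},
    "unknown": {"rus": ["нет"]},  # no usable translation, hence skipped
}
embeddings = build_embeddings(dictionary, aligned)
```

Because the resource-rich models share one space after alignment, averaging vectors that come from different languages is meaningful; here the lexeme "кудо" receives the vector [0.5, 0.5].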

Sentiment analysis
In this section, we describe an experiment with the newly produced word embeddings. We apply them in the task of sentiment analysis. We hand pick all positive and negative sentences from the Erzya treebank [44] based on the translations provided in the treebank in English and Finnish. This constitutes our Erzya test corpus that contains 23 negative sentences and 22 positive sentences, giving us a total of 45 sentences.
We use the Stanford Sentiment Treebank for English [46] to train our sentiment analyzer model. As the Erzya test data is binary -negative and positive sentences -we treat the sentiment information in the treebank as binary as well, ignoring any neutral examples. It is important to note that we do not use any examples written in Erzya during the training, only sentences in English.
We train a neural model that takes in a sentence in English as a source and a sentiment label (positive or negative) as a target. We train the neural model with the aligned embeddings by substituting the words in the input sentences with their vectors. As our models are lemmatized, we need to ensure that all words are lemmatized in the input as well. We use spaCy [20] for this lemmatization step. The architecture and training of the neural model are inspired by the work presented in [23], where bi-grams are added to the input sentences during the training phase and the neural network is a linear classifier. Table 3 shows some examples of the input in Erzya, its translation in English and the correctly predicted label. For Erzya, we use the lemmas from the treebank, and get their closest English vectors through the aligned word embeddings. This way, the model treats the Erzya sentences as though they were English and it can predict the sentiment in a language it did not see during the training. The resulting model was trained for 30 epochs and reached 53.3% accuracy for Erzya and 75.5% accuracy for English on the treebank sentences, and an accuracy of 83.5% for English on the Stanford Sentiment Treebank dataset. We obtained an accuracy boost for Erzya predictions, reaching 57.8%, when we also considered vectors of other resource-rich languages with the aid of the translation dictionary (Finnish in this case, as Erzya has many translations to Finnish). The resulting accuracy is respectable given that the test data is fundamentally different from the training data. First of all, the testing and training are in different languages. Second of all, they represent very different genres: the training data is based on movie reviews, whereas the testing data has sentences from novels.
The score after the colon is the semantic similarity score (the higher, the more similar).
Table 3. Example sentences in Erzya and their translations in English, along with the predicted sentiment by our method for each sentence.
Sentiment | Erzya | English translation
Positive | [not recoverable] | It is a warm day.
Positive | Сехте паро шка. | The best time of all.
Negative | Цёрынентень аламодо визькс теевсь. | The boy felt a little ashamed.
Negative | Баягинень ёмавтомась пек берянь тешксэсь. | Losing a bell was a really bad sign.
Negative | Весе те апаро вийтнень тандавтнемс. | This is all meant to scare away the evil spirits.
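The input preparation for an Erzya sentence can be sketched as follows: each lemma is replaced by its nearest English word in the aligned space, and bi-grams are then appended as extra features. The two-dimensional vectors, the "_" joiner for bi-grams and the function names are assumptions made for illustration; the linear classifier itself is omitted.

```python
import math

# Toy aligned vectors invented for illustration only.
ENGLISH = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "time": [0.0, 1.0]}
ERZYA = {"паро": [0.9, 0.1], "берянь": [-0.9, 0.1], "шка": [0.1, 0.9]}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_english(vector):
    """Return the English word whose vector is closest to the given one."""
    return max(ENGLISH, key=lambda word: cosine(vector, ENGLISH[word]))


def to_english(lemmas):
    """Map Erzya lemmas to English words; lemmas without a vector are skipped."""
    return [nearest_english(ERZYA[lemma]) for lemma in lemmas if lemma in ERZYA]


def add_bigrams(tokens):
    """Append adjacent-token bi-grams to the unigram features."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


features = add_bigrams(to_english(["паро", "шка"]))
```

With these toy vectors, `features` is `["good", "time", "good_time"]`, which is the same kind of input the classifier sees for English training sentences.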

Discussion and Conclusions
The work conducted in this paper has been a first step towards using machine learning in modelling the semantics of some of the endangered Uralic languages. It is evident that the alignment-based approaches embraced in the literature cannot get us very far in truly representing the semantics due to socio-cultural mismatches in concepts. For instance, we saw that Finland, which is a very important concept for a Finnish model, was completely misaligned with geographically close countries such as Denmark, Norway and Estonia. Alignment can only get us so far, and using models trained on larger languages has its inherent problems when applied to completely new domains in a completely different language.
Even the starting quality of the pretrained embeddings was low. The Russian model was unacceptably bad and the Finnish model has too many words that are not lemmatized at all, or are lemmatized to a wrong lemma. When the quality of the models available for high-resourced languages is substandard, one cannot expect any sophisticated machine learning method to come to the rescue. Unfortunately, in our field, too little attention is paid to the quality of resources and more attention is paid to single values representing overall accuracies and overall performance.
As there is no shortcut to happiness, we should look into the data available in the endangered languages themselves. For instance, FU-Lab has a plethora of resources for Komi languages [14,13] that are just waiting for lemmatization. Once lemmatized, these resources could be used to build word embeddings directly in that language. Of course, this requires collaboration between many parties and willingness to make data openly available. While this might not be an issue with FU-Lab, it might be with some other institutions holding onto their immaterial rights too tightly.
At the current stage, our dictionary editing system, Ve′rdd [4,3], contains words for multiple endangered languages and their translations in a graph structure. This data could be extended by predicting new relations into the graph with semantic models such as word embeddings. This could help at least in resolving meaning groups and polysemy of the lexical entries. However, the word embeddings available for the endangered languages in question have not yet reached a stage mature enough for their incorporation as a part of the lexicon.

Acknowledgement
I would like to dedicate the acknowledgement section to Jack Rueter, for all his work on endangered languages and his brilliant ideas on improving the current technological state of endangered languages. Jack's enthusiasm and dedication to endangered languages are clearly shown in all the various dictionaries and FSTs built and maintained by him. He supervised my work on building the dictionary editing system, Ve′rdd [4,3]. He was always available for discussing and supporting my work; without him, Ve′rdd would not be at the great level it is at the moment. He truly is a pioneer in the field, and the entire community appreciates all of his work.