  • Gong, Heng (Helsingin yliopisto, 2019)
    With the construction of the largest water dam in the world, China’s Three Gorges Dam, many severe environmental problems have emerged along the Yangtze River. Its constructor, the China Three Gorges Corporation (CTGC), publishes an annual environmental report (AER) to address the ecological problems. This study aims to investigate these reports from the perspective of ecolinguistics under the Story Theory put forward by Arran Stibbe (2015). This study addresses three questions: 1) Are there any beneficial, ambivalent, or destructive discourses in China Three Gorges Corporation's annual environmental reports? 2) If so, how is each story being constructed? 3) What suggestions and implications can we obtain from the analyses of these stories? To answer these questions, 10 English AERs published from 2008 to 2017 by CTGC were collected to compile a corpus with a size of 114,770 tokens. Six story types, including frame, metaphor, evaluation, identity, erasure, and salience, were then chosen for analysis with the combination method of ecolinguistics, corpus linguistics, and critical discourse analysis. The results show that, within the frame story, the sustainable development frame and the green development frame were ambivalent discourses. Within the metaphor story, RIVER AS A TOOL FOR MAKING MONEY, NATURE IS A MACHINE, ECOLOGICAL DAMAGE IS AN ACCIDENT, and COMPANY IS A HUMAN were destructive discourses; NATURE IS A COMPETITION and CLIMATE CHANGE IS A WAR were ambivalent discourses. Within the evaluation story, using three purr-words (clean, new, and renewable) to describe energy formed a destructive discourse. Within the identity story, the use of the pronouns we and our distanced more-than-human participants from human participants, which formed a destructive discourse. Within the erasure story, the nominalization of the word pollute formed a destructive discourse. 
    Within the salience story, describing endangered fish species with abstract words and describing fish as a type of resource formed two destructive discourses, and using the basic level word fish formed a beneficial discourse. Based on these judgments, this study concludes that the beneficial discourse should be promoted, the destructive discourses should be resisted, and the positive parts of the ambivalent discourses should be highlighted while their negative parts should be rejected. These findings can contribute to our understanding of the ecological discourse of the water dam.
  • Kolla, Elena (Helsingin yliopisto, 2018)
    This study combines metadiscourse research and sociolinguistic methods to establish which social variables influence the choice of metadiscourse resources containing first-person pronouns in US opinion news texts. The study has three main goals. The first goal is to establish which first-person pronouns are used by the authors of opinion articles, and which social variables influence or at least correlate with their choice of first-person pronouns the most, as well as to study the contexts in which these pronouns are used. The second goal is to establish which metadiscourse resources are used by authors from different social groups, and to what extent. The third goal is to establish whether there is any correlation between various social factors and the use of particular metadiscourse resources. The corpus for the study was collected from articles posted on the sites of eleven US news publishers and consists of op-ed texts on politics and social issues, along with information about the authors of these texts, including gender, age, ethnic background, education, and occupation. To fulfill these goals, the study uses corpus linguistics methods for calculating and comparing the occurrence frequencies of first-person pronouns by social variables, together with Ken Hyland's interpersonal model of metadiscourse. The results show that social variables do indeed significantly correlate with the choice of first-person pronouns and the metadiscourse resources containing these pronouns. The most frequently used pronouns are the subject pronouns I and we, and the most frequently used metadiscourse resources are Self-mentions and Engagement markers. The most prominent social variables that correlate with the use of pronouns are gender and, to a lesser degree, occupation. 
    The female authors of the articles in the corpus use more first-person pronouns than male authors and show a preference for first-person singular pronouns and plural inclusive pronouns while male authors use more first-person plural pronouns. The most noticeable difference in pronoun usage between genders can be observed between male and female journalists; however, journalists of one gender do not differ from each other in either pronoun or metadiscourse use with other factors being equal.
  • Nevalainen, Terttu (John Benjamins, 2018)
  • McKenzie, Emma (Helsingin yliopisto, 2020)
    This project is a corpus-based study on numeral + noun phrases in Scottish Gaelic. The typical pattern in Scottish Gaelic is to use a singular noun after numerals one and two and a plural noun after numerals three through ten. However, there are some nouns that do not follow this expected pattern. These exceptions are called numeratives and there are three different categories of numeratives in Scottish Gaelic: duals, numeratives identical in form to a singular, and numeratives with a form that differs from singular and plural and only used with numerals. This study aims to find which nouns have numerative forms and how their use varies diachronically and between dialects. While numeratives have been more researched in Welsh and Irish, there is not much research on numeratives in Scottish Gaelic. Ò Maolalaigh (2013) did a more restricted corpus study to find what nouns use singular after numerals three through ten. The past research provides a good comparison for my results and gives me a good foundation to expand on. From the past research, there seems to be a semantic relationship between the kinds of nouns that have numerative forms, so I sort my results into semantic categories as well. I also look at numeratives from the perspective of linguistic complexity since Scottish Gaelic is a minority language with a large proportion of L2 speakers. This project uses Corpas na Gàidhlig (the Corpus of Scottish Gaelic), which is part of the University of Glasgow’s Digital Archive of Scottish Gaelic. I search the corpus for numerals two through four to see which nouns use numeratives and how consistently they use them. I also look at how frequently numeratives are used diachronically and how usage varies across dialects. I focus especially on nouns that have a high number of numerative tokens to see if there is a pattern in their usage. In my results, I found 47 nouns that use a dual form and 105 nouns that use a numerative identical in form to a singular. 
    The overall findings for numerative use are that dual use is decreasing, while use of numeratives identical in form to singular has been increasing since 1900-1949. The semantic category with the most dual tokens is natural pairs. The nouns with numeratives identical in form to singular tend to be nouns frequently used with numerals, such as measurement words.
  • Säily, Tanja; Tyrkkö, Jukka (2021)
    Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.
  • Säily, Tanja (John Benjamins, 2018)
  • Sairio, Anni Riitta Susanna; Kaislaniemi, Samuli; Merikallio, Anna Maria; Nevalainen, Taimi Terttu Annikki (2018)
  • Ruohonen, Juho; Rudanko, Juhani (2019)
    Several factors have been identified in the recent literature to explain variation in the selection of sentential complements in recent English, and the article begins with a survey of such factors. The article then offers a case study of the impact of such factors on non-finite complements of the adjective afraid on the basis of the Strathy Corpus of Canadian English. Attention is paid for instance to the Extraction and Choice Principles, passive lower predicates, and text type. Multivariate analysis is applied to compare and to shed light on such different explanatory principles. The Choice Principle proves to be by far the most significant predictor of the alternation, while the heavily correlated syntactic feature of Voice appears non-significant. Fiction, as opposed to the informative registers, shows a notable preference for to infinitives, though this finding needs to be replicated in datasets where controlling for author idiolect is possible. Theoretically plausible odds ratios are observed on the Extraction Principle and negation of the predicate, but they are not statistically significant. In the former case, this may well be due to the variable’s collinearity with the Choice Principle and its low overall frequency, resulting in a low effective sample size.
  • Silvennoinen, Olli O. (2018)
    This paper discusses constructional variation in the domain of contrastive negation in English, using data from the British National Corpus. Contrastive negation refers to constructs with two parts, one negative and the other affirmative, such that the affirmative offers an alternative to the negative in the frame in question (e.g. shaken, not stirred; not once but twice; I don't like it - I love it). The paper utilises multiple correspondence analysis to explore the degree of synonymy among the various constructional schemas of contrastive negation, finding that different schemas are associated with different semantic, pragmatic and extralinguistic contexts but also that certain schemas do not differ from each other in a significant way.
  • Loureiro-Porto, Lucía; Hiltunen, Turo (2020)
    "Democratization" and "gender-neutrality" are two concepts commonly used in recent studies on language variation. While both concepts link linguistic phenomena to sociocultural changes, the extent to which they overlap and/or interact has not been studied in detail. In particular, not much is known about how linguistic changes related to democratization and gender-neutrality spread across registers or varieties of English, as well as whether speakers are aware of the changes that are taking place. In this paper we review the main theoretical issues regarding these concepts and relate them to the main findings in the articles in this issue, all of which study lexical and grammatical variation from a corpus-based perspective. Taken together, they help unveil some of the conscious and unconscious mechanisms that operate at the interface between democratization and gender-neutrality.
  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (2018)
    This paper describes ongoing work towards a rich analysis of the social contexts of neologism use in historical corpora, in particular the Corpora of Early English Correspondence, with research questions concerning the innovators, meanings and diffusion of neologisms. To enable this kind of study, we are developing new processes, tools and ways of combining data from different sources, including the Oxford English Dictionary, the Historical Thesaurus, and contemporary published texts. Comparing neologism candidates across these sources is complicated by the large amount of spelling variation. To make the issues tractable, we start from case studies of individual suffixes (-ity, -er) and people (Thomas Twining). By developing tools aiding these studies, we build toward more general analyses. Our aim is to develop an open-source environment where information on neologism candidates is gathered from a variety of algorithms and sources, pooled, and presented to a human evaluator for verification and exploration.
  • Kesäniemi, Joonas; Vartiainen, Turo; Säily, Tanja; Nevalainen, Terttu (2018)
    Empirical work on English historical corpus linguistics is plentiful but fragmented, and some of it is hard to come by. This paper proposes a solution for making it more accessible and reusable for meta-analysis. We present an online Language Change Database (LCD), which provides comparative, real-time baseline data from earlier corpus-based studies. LCD entries summarize the findings and include numerical data from the articles. We discuss the LCD from the perspective of database design and linked data management. Furthermore, we illustrate the reuse of LCD data through a meta-analysis of the history of English connectives. For this purpose, we have developed an application called the LCD Aggregated Data Analysis workbench (LADA). We show how researchers can use LADA to filter, refine and visualize LCD data. Thus we are paving the way for a future where both research results and research data are regularly available for verification, validation and re-use.
  • Laitinen, Mikko; Säily, Tanja (John Benjamins, 2018)
  • Vogiatzi, Athina (Helsingin yliopisto, 2021)
    This thesis studies whether reappropriation of the term “bitch” occurs in American TV shows and movies based on corpus data retrieved from the Corpus of Contemporary American English (2021) for the period 2010-2019. The thesis combines methods from corpus linguistics and sociolinguistics. The subcorpus of TV and movies in COCA (2021) for the selected time frame contained 128 013 334 tokens. The research was performed in two stages: first, through a general collocation analysis of the most frequent words paired with the term “bitch”, and then through a concordance analysis of 100 random samples of concordance lines for each five-year period. Reappropriation was explored through the lens of the reappropriation theory of derogatory terms by looking at the meaning of the collocates during the collocation analysis and at the meaning of the words occurring in proximity of the term “bitch” during the concordance analysis. The results revealed that there are very few instances of reappropriation, in which the speakers self-labelled with the term; thus the term appeared to maintain its pejorative nature in the majority of the cases. Swear words and insults were observed in most of the results in both analyses, with the idiomatic phrase son of a bitch having the highest frequency per million tokens. Corpus linguistics methods are applicable to the study of language use and can reveal linguistic patterns that reflect people’s ideologies.
  • Nevalainen, Terttu; Säily, Tanja; Vartiainen, Turo; Liimatta, Aatu; Lijffijt, Jefrey (2020)
    In this paper, we explore the rate of language change in the history of English. Our main focus is on detecting periods of accelerated change in Middle English (1150–1500), but we also compare the Middle English data with the Early Modern period (1500–1700) in order to establish a longer diachrony for the pace at which English has changed over time. Our study is based on a meta-analysis of existing corpus research, which is made available through a new linguistic resource, the Language Change Database (LCD). By aggregating the rates of 44 individual changes, we provide a critical assessment of how well the theory of punctuated equilibria (Dixon 1997) fits with our results. More specifically, by comparing the rate of language change with major language-external events, such as the Norman Conquest and the Black Death, we provide the first corpus-based meta-analysis of whether these events, which had significant societal consequences, also had an impact on the rate of language change. Our results indicate that major changes in the rate of linguistic change in the late medieval period could indeed be connected to the social and cultural after-effects of the Norman Conquest. We also make a methodological contribution to the field of English historical linguistics: by re-using data from existing research, linguists can start to ask new, fundamental questions about the ways in which language change progresses.
  • Siirtola, Harri; Säily, Tanja; Nevalainen, Terttu (IEEE Computer Society, 2017)
    Principal Component Analysis (PCA) is an established and efficient method for finding structure in a multidimensional data set. PCA is based on orthogonal transformations that convert a set of multidimensional values into linearly uncorrelated variables called principal components. The main disadvantage to the PCA approach is that the procedure and outcome are often difficult to understand. The connection between input and output can be puzzling, a small change in input can yield a completely different output, and the user may often wonder if the PCA is doing the right thing. We introduce a user interface that makes the procedure and result easier to understand. We have implemented an interactive PCA view in our text visualization tool called Text Variation Explorer. It allows the user to interactively study the result of PCA, and provides a better understanding of the process. We believe that although we are addressing the problem of interactive principal component analysis in the context of text visualization, these ideas should be useful in other contexts as well.
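    The orthogonal transformation mentioned in the abstract above can be illustrated with a minimal two-variable sketch. It is not the Text Variation Explorer implementation; the function names `pca_2d` and `covariance` and the sample values are hypothetical, chosen only to show that rotating centred data onto its principal axes yields linearly uncorrelated scores.

```python
import math

def pca_2d(points):
    """Rotate centred 2-D points onto their principal axes (hypothetical helper)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)
    # Angle of the first principal axis (closed form for the 2x2 case).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Orthogonal rotation: each point becomes (PC1 score, PC2 score).
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centred]
    return scores, theta

def covariance(scores):
    """Sample covariance between the two columns of already-centred scores."""
    return sum(a * b for a, b in scores) / (len(scores) - 1)

# Made-up measurements of two correlated text features (illustration only).
points = [(2.0, 1.9), (3.1, 3.0), (4.2, 4.1), (5.0, 5.2), (6.1, 5.8)]
scores, theta = pca_2d(points)
print(abs(covariance(scores)) < 1e-9)  # the rotated scores are uncorrelated
```

    With more than two variables, the same idea is usually carried out with an eigendecomposition or singular value decomposition of the centred data rather than a single rotation angle.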
  • Siirtola, Harri; Isokoski, Poika; Säily, Tanja; Nevalainen, Terttu (IEEE Computer Society, 2016)
    Digitalization is changing how research is carried out in all areas of science. The humanities are no exception: materials that used to be hand-written or printed on paper are increasingly available in digital form. This development is changing how scholars interact with their material. We address the problem of interactive text visualization in the context of sociolinguistic language study. When a scholar is reading and analyzing text on a computer screen instead of on paper, we can support this by providing a dashboard for reading, and by creating visualizations of the text structure, variation, and change. We have designed and developed a software tool called Text Variation Explorer (TVE) for sociolinguistic language study. It is based on interactive visualization with a direct manipulation user interface, and is aimed at exploratory corpus linguistics. The TVE software tool has proven to be useful in supporting the study of language variation and change in its social contexts, or sociolinguistics. It is, to a certain degree, language-independent, and generic enough to be useful in other linguistic contexts as well. We are now in the process of designing and implementing the next iteration of TVE. We present the lessons learned from the first version, discuss the old and the new design, and welcome feedback from the communities involved.
  • Nevalainen, Terttu; Vartiainen, Turo; Säily, Tanja; Kesäniemi, Joonas; Dominowska, Agata; Öhman, Emily (2016)
    We introduce the Language Change Database (LCD), which provides access to the results of previous corpus-based research dealing with change in the English language. The LCD will be published on an open-access linked data platform that will allow users to enter information about their own publications into the database and to conduct searches based on linguistic and extralinguistic parameters. Both metadata and numerical data from the original publications will be available for download, enabling systematic reviews, meta-analyses, replication studies and statistical modelling of language change. The LCD will be of interest to scholars, teachers and students of English.
  • Korkiakangas, Timo (2021)
    This paper describes the construction and annotation of the Late Latin Charter Treebank, a set of three dependency treebanks (LLCT1, LLCT2, and LLCT3) which together contain 1,261 Early Medieval Latin documentary texts (i.e., original charters) written in Italy between AD 714 and 1000 (c. 594,000 tokens). The paper focuses on issues which a linguistically or philologically inclined user of LLCT needs to know: the criteria on which the charters were selected, the special characteristics of the annotation types utilized, and the geographical and chronological distribution of the data. In addition to normal queries on forms, lemmas, morphology and syntax, complex philological research settings are enabled by the textual annotation layer of LLCT, which indicates abbreviated and damaged words, as well as the formulaic and non-formulaic passages of each charter.
  • Säily, Tanja; Nurmi, Arja; Sairio, Anni (John Benjamins, 2018)
