Browsing by Subject "corpus linguistics"

Now showing items 1-20 of 37
  • Gong, Heng (Helsingin yliopisto, 2019)
    With the construction of the largest water dam in the world, China’s Three Gorges Dam, many severe environmental problems have emerged along the Yangtze River. Its constructor, the China Three Gorges Corporation (CTGC), publishes an annual environmental report (AER) to address the ecological problems. This study aims to investigate these reports from the perspective of ecolinguistics under the Story Theory put forward by Arran Stibbe (2015). This study addresses three questions: 1) Are there any beneficial, ambivalent, or destructive discourses in China Three Gorges Corporation's annual environmental reports? 2) If so, how is each story being constructed? 3) What suggestions and implications can we obtain from the analyses of these stories? To answer these questions, 10 English AERs published from 2008 to 2017 by CTGC were collected to compile a corpus with a size of 114,770 tokens. Six story types, including frame, metaphor, evaluation, identity, erasure, and salience, were then chosen for analysis with the combination method of ecolinguistics, corpus linguistics, and critical discourse analysis. The results show that, within the frame story, the sustainable development frame and the green development frame were ambivalent discourses. Within the metaphor story, RIVER AS A TOOL FOR MAKING MONEY, NATURE IS A MACHINE, ECOLOGICAL DAMAGE IS AN ACCIDENT, and COMPANY IS A HUMAN were destructive discourses; NATURE IS A COMPETITION and CLIMATE CHANGE IS A WAR were ambivalent discourses. Within the evaluation story, using three purr-words (clean, new, and renewable) to describe energy formed a destructive discourse. Within the identity story, the use of the pronouns we and our distanced more-than-human participants from human participants, which formed a destructive discourse. Within the erasure story, the nominalization of the word pollute formed a destructive discourse. 
Within the salience story, describing endangered fish species with abstract words and describing fish as a type of resource formed two destructive discourses, and using the basic level word fish formed a beneficial discourse. Based on these judgments, this study concludes that the beneficial discourse should be promoted, the destructive discourses should be resisted, and the positive parts of the ambivalent discourses should be highlighted while their negative parts should be rejected. These findings can contribute to our understanding of the ecological discourse of the water dam.
  • Kolla, Elena (Helsingin yliopisto, 2018)
    This study combines metadiscourse research and sociolinguistic methods to establish which social variables influence the choice of metadiscourse resources containing first-person pronouns in US opinion news texts. The study has three main goals. The first goal is to establish which first-person pronouns are used by the authors of opinion articles, and which social variables influence or at least correlate with their choice of first-person pronouns the most, as well as to study the contexts in which these pronouns are used. The second goal is to establish which metadiscourse resources are used by authors from different social groups, and to what extent. The third goal is to establish if there is any correlation between various social factors and the use of particular metadiscourse resources. The corpus for the study was collected from articles posted on the sites of eleven US news publishers and consists of op-ed texts on politics and social issues along with the information about the authors of these texts including gender, age, ethnic background, education, and occupation. To fulfill these goals the study uses corpus linguistics methods for calculating and comparing the occurrence frequencies of first-person pronouns by social variables and Ken Hyland's interpersonal model of metadiscourse. The results show that social variables do indeed significantly correlate with the choice of first-person pronouns and the metadiscourse resources containing these pronouns. The most frequently used pronouns are the subject pronouns I and we, and the most frequently used metadiscourse resources are Self-mentions and Engagement markers. The most prominent social variables that correlate with the use of pronouns are gender and, to a lesser degree, occupation. 
The female authors of the articles in the corpus use more first-person pronouns than male authors and show a preference for first-person singular pronouns and plural inclusive pronouns while male authors use more first-person plural pronouns. The most noticeable difference in pronoun usage between genders can be observed between male and female journalists; however, journalists of one gender do not differ from each other in either pronoun or metadiscourse use with other factors being equal.
  • Nevalainen, Terttu (John Benjamins, 2018)
    Advances in Historical Sociolinguistics
  • McKenzie, Emma (Helsingin yliopisto, 2020)
    This project is a corpus-based study on numeral + noun phrases in Scottish Gaelic. The typical pattern in Scottish Gaelic is to use a singular noun after numerals one and two and a plural noun after numerals three through ten. However, there are some nouns that do not follow this expected pattern. These exceptions are called numeratives and there are three different categories of numeratives in Scottish Gaelic: duals, numeratives identical in form to a singular, and numeratives with a form that differs from singular and plural and only used with numerals. This study aims to find which nouns have numerative forms and how their use varies diachronically and between dialects. While numeratives have been more researched in Welsh and Irish, there is not much research on numeratives in Scottish Gaelic. Ó Maolalaigh (2013) did a more restricted corpus study to find what nouns use singular after numerals three through ten. The past research provides a good comparison for my results and gives me a good foundation to expand on. From the past research, there seems to be a semantic relationship between the kinds of nouns that have numerative forms, so I sort my results into semantic categories as well. I also look at numeratives from the perspective of linguistic complexity since Scottish Gaelic is a minority language with a large proportion of L2 speakers. This project uses Corpas na Gàidhlig (the Corpus of Scottish Gaelic), which is part of the University of Glasgow’s Digital Archive of Scottish Gaelic. I search the corpus for numerals two through four to see which nouns use numeratives and how consistently they use them. I also look at how frequently numeratives are used diachronically and how usage varies across dialects. I focus especially on nouns that have a high number of numerative tokens to see if there is a pattern in their usage. In my results, I found 47 nouns that use a dual form and 105 nouns that use a numerative identical in form to a singular. 
The overall findings for numerative use are that dual use is decreasing, while use of numeratives identical in form to singular has been increasing since 1900-1949. The semantic category with the most dual tokens is natural pairs. The nouns with numeratives identical in form to singular tend to be nouns frequently used with numerals, such as measurement words.
  • Säily, Tanja; Tyrkkö, Jukka (2021)
    Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.
  • Säily, Tanja (John Benjamins, 2018)
    Advances in Historical Sociolinguistics
  • Sairio, Anni Riitta Susanna; Kaislaniemi, Samuli; Merikallio, Anna Maria; Nevalainen, Taimi Terttu Annikki (2018)
  • Ruohonen, Juho; Rudanko, Juhani (2019)
    Several factors have been identified in the recent literature to explain variation in the selection of sentential complements in recent English, and the article begins with a survey of such factors. The article then offers a case study of the impact of such factors on non-finite complements of the adjective afraid on the basis of the Strathy Corpus of Canadian English. Attention is paid for instance to the Extraction and Choice Principles, passive lower predicates, and text type. Multivariate analysis is applied to compare and to shed light on such different explanatory principles. The Choice Principle proves to be by far the most significant predictor of the alternation, while the heavily correlated syntactic feature of Voice appears non-significant. Fiction, as opposed to the informative registers, shows a notable preference for to infinitives, though this finding needs to be replicated in datasets where controlling for author idiolect is possible. Theoretically plausible odds ratios are observed on the Extraction Principle and negation of the predicate, but they are not statistically significant. In the former case, this may well be due to the variable’s collinearity with the Choice Principle and its low overall frequency, resulting in a low effective sample size.
  • Silvennoinen, Olli O. (2018)
    This paper discusses constructional variation in the domain of contrastive negation in English, using data from the British National Corpus. Contrastive negation refers to constructs with two parts, one negative and the other affirmative, such that the affirmative offers an alternative to the negative in the frame in question (e.g. shaken, not stirred; not once but twice; I don't like it - I love it). The paper utilises multiple correspondence analysis to explore the degree of synonymy among the various constructional schemas of contrastive negation, finding that different schemas are associated with different semantic, pragmatic and extralinguistic contexts but also that certain schemas do not differ from each other in a significant way.
  • Loureiro-Porto, Lucía; Hiltunen, Turo (2020)
    "Democratization" and "gender-neutrality" are two concepts commonly used in recent studies on language variation. While both concepts link linguistic phenomena to sociocultural changes, the extent to which they overlap and/or interact has not been studied in detail. In particular, not much is known about how linguistic changes related to democratization and gender-neutrality spread across registers or varieties of English, as well as whether speakers are aware of the changes that are taking place. In this paper we review the main theoretical issues regarding these concepts and relate them to the main findings in the articles in this issue, all of which study lexical and grammatical variation from a corpus-based perspective. Taken together, they help unveil some of the conscious and unconscious mechanisms that operate at the interface between democratization and gender-neutrality.
  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (2018)
    This paper describes ongoing work towards a rich analysis of the social contexts of neologism use in historical corpora, in particular the Corpora of Early English Correspondence, with research questions concerning the innovators, meanings and diffusion of neologisms. To enable this kind of study, we are developing new processes, tools and ways of combining data from different sources, including the Oxford English Dictionary, the Historical Thesaurus, and contemporary published texts. Comparing neologism candidates across these sources is complicated by the large amount of spelling variation. To make the issues tractable, we start from case studies of individual suffixes (-ity, -er) and people (Thomas Twining). By developing tools aiding these studies, we build toward more general analyses. Our aim is to develop an open-source environment where information on neologism candidates is gathered from a variety of algorithms and sources, pooled, and presented to a human evaluator for verification and exploration.
  • Kesäniemi, Joonas; Vartiainen, Turo; Säily, Tanja; Nevalainen, Terttu (2018)
    Empirical work on English historical corpus linguistics is plentiful but fragmented, and some of it is hard to come by. This paper proposes a solution for making it more accessible and reusable for meta-analysis. We present an online Language Change Database (LCD), which provides comparative, real-time baseline data from earlier corpus-based studies. LCD entries summarize the findings and include numerical data from the articles. We discuss the LCD from the perspective of database design and linked data management. Furthermore, we illustrate the reuse of LCD data through a meta-analysis of the history of English connectives. For this purpose, we have developed an application called the LCD Aggregated Data Analysis workbench (LADA). We show how researchers can use LADA to filter, refine and visualize LCD data. Thus we are paving the way for a future where both research results and research data are regularly available for verification, validation and re-use.
  • Laitinen, Mikko; Säily, Tanja (John Benjamins, 2018)
    Advances in Historical Sociolinguistics
  • Vogiatzi, Athina (Helsingin yliopisto, 2021)
    This thesis studies whether reappropriation of the term “bitch” occurs in American TV shows and movies based on corpus data retrieved from the Corpus of Contemporary American English (2021) for the period 2010-2019. The thesis combines methods from corpus linguistics and sociolinguistics. The subcorpus of TV and movies in COCA (2021) for the selected time frame contained 128 013 334 tokens. The research was performed in two stages: first, through a general collocation analysis of the most frequent words paired with the term “bitch”, and then through a concordance analysis of 100 random samples of concordance lines for each five-year period. Reappropriation was explored through the lens of the reappropriation theory of derogatory terms by looking at the meaning of the collocates during the collocation analysis and at the meaning of the words occurring in proximity to the term “bitch” during the concordance analysis. The results revealed that there are very few instances of reappropriation, in which the speakers self-labelled with the term; thus, the term appeared to maintain its pejorative nature in the majority of the cases. Swear words and insults were observed in most of the results in both analyses, with the idiomatic phrase son of a bitch having the highest frequency per million tokens. Corpus linguistics methods are applicable to study language use and reveal linguistic patterns that can reflect people’s ideologies.
  • Nevalainen, Terttu; Säily, Tanja; Vartiainen, Turo; Liimatta, Aatu; Lijffijt, Jefrey (2020)
    In this paper, we explore the rate of language change in the history of English. Our main focus is on detecting periods of accelerated change in Middle English (1150–1500), but we also compare the Middle English data with the Early Modern period (1500–1700) in order to establish a longer diachrony for the pace at which English has changed over time. Our study is based on a meta-analysis of existing corpus research, which is made available through a new linguistic resource, the Language Change Database (LCD). By aggregating the rates of 44 individual changes, we provide a critical assessment of how well the theory of punctuated equilibria (Dixon 1997) fits with our results. More specifically, by comparing the rate of language change with major language-external events, such as the Norman Conquest and the Black Death, we provide the first corpus-based meta-analysis of whether these events, which had significant societal consequences, also had an impact on the rate of language change. Our results indicate that major changes in the rate of linguistic change in the late medieval period could indeed be connected to the social and cultural after-effects of the Norman Conquest. We also make a methodological contribution to the field of English historical linguistics: by re-using data from existing research, linguists can start to ask new, fundamental questions about the ways in which language change progresses.
  • Wu, Junyu; Tissari, Heli (2021)
    It is difficult for L2 English learners in general, and especially Chinese learners of English, to form idiomatic collocations. This article presents a comparison of the use of intensifier-verb collocations in English by native speaker students and Chinese ESL learners, paying particular attention to verbs which collocate with intensifiers. The data consisted of written production from three corpora: two of these are native English corpora: the British Academic Written English (BAWE) Corpus and Michigan Corpus of Upper-Level Student Papers (MICUSP). The third one is a recently created Chinese Learner English corpus, Ten-thousand English Compositions of Chinese Learners (TECCL). Findings suggest that Chinese learners of English produce significantly more intensifier-verb collocations than native speaker students, but that their English attests a smaller variety of intensifier-verb collocations compared with the native speakers. Moreover, Chinese learners of English use the intensifier-verb collocation types just-verb, only-verb and really-verb very frequently compared with native speaker students. As regards verb collocates, the intensifiers hardly, clearly, well, strongly and deeply collocate with semantically different verbs in native and Chinese learner English. Compared with the patterns in Chinese learner English, the intensifiers in native speaker English collocate with a more stable and restricted set of verb collocates.
  • Siirtola, Harri; Säily, Tanja; Nevalainen, Terttu (IEEE Computer Society, 2017)
    Information Visualization
    Principal Component Analysis (PCA) is an established and efficient method for finding structure in a multidimensional data set. PCA is based on orthogonal transformations that convert a set of multidimensional values into linearly uncorrelated variables called principal components. The main disadvantage to the PCA approach is that the procedure and outcome are often difficult to understand. The connection between input and output can be puzzling, a small change in input can yield a completely different output, and the user may often wonder if the PCA is doing the right thing. We introduce a user interface that makes the procedure and result easier to understand. We have implemented an interactive PCA view in our text visualization tool called Text Variation Explorer. It allows the user to interactively study the result of PCA, and provides a better understanding of the process. We believe that although we are addressing the problem of interactive principal component analysis in the context of text visualization, these ideas should be useful in other contexts as well.
  • Siirtola, Harri; Isokoski, Poika; Säily, Tanja; Nevalainen, Terttu (IEEE Computer Society, 2016)
    Information Visualization
    Digitalization is changing how research is carried out in all areas of science. The humanities are no exception: materials that used to be hand-written or printed on paper are increasingly available in digital form. This development is changing how scholars are interacting with their material. We are addressing the problem of interactive text visualization in the context of sociolinguistic language study. When a scholar is reading and analyzing text on a computer screen instead of on paper, we can support this by providing a dashboard for reading, and by creating visualizations of the text structure, variation, and change. We have designed and developed a software tool called Text Variation Explorer (TVE) for sociolinguistic language study. It is based on interactive visualization with a direct manipulation user interface, and aimed at exploratory corpus linguistics. The TVE software tool has proven to be useful in supporting the study of language variation and change in its social contexts, or sociolinguistics. It is, to a certain degree, language-independent, and generic enough to be useful in other linguistic contexts as well. We are now in the process of designing and implementing the next iteration of TVE. We present the lessons learned from the first version, discuss the old and the new design, and welcome feedback from the communities involved.
  • Nevalainen, Terttu; Vartiainen, Turo; Säily, Tanja; Kesäniemi, Joonas; Dominowska, Agata; Öhman, Emily (2016)
    We introduce the Language Change Database (LCD), which provides access to the results of previous corpus-based research dealing with change in the English language. The LCD will be published on an open-access linked data platform that will allow users to enter information about their own publications into the database and to conduct searches based on linguistic and extralinguistic parameters. Both metadata and numerical data from the original publications will be available for download, enabling systematic reviews, meta-analyses, replication studies and statistical modelling of language change. The LCD will be of interest to scholars, teachers and students of English.
  • Vartiainen, Minna (Helsingin yliopisto, 2022)
    This study aims to examine how many of the verb errors first-year Spanish university students make in their informal writing assignments are caused by negative language transfer. Another objective of the study is to analyze the errors using content analysis and an error taxonomy in order to find out what kind of contexts lead the students to make transfer errors in verb structures. The material of the study consists of 150 essays taken from three different essay categories of a learner corpus (WriCLE) and annotated for both transfer and non-transfer errors. Studies on language transfer between English and Spanish form the theoretical framework for this study (e.g. Lahuerta 2018; García-Pastor 2018; Swasey & Iglesias 2015). Language transfer is defined as applying linguistic rules from one’s mother tongue to a foreign or a second language and it can either be negative or positive based on whether it leads a learner to choose an erroneous or a correct form. Negative transfer of vocabulary has been studied considerably more than negative transfer of grammatical items between English and Spanish, which is why this study specifically focuses on the negative transfer of verb structures from Spanish to English. Error analysis is another important part of both the theory and methods of this study as I use a modified version of James’ (1998) Target Modification Taxonomy to classify the errors by their semantic origins and analyze them further. There are five categories in this taxonomy: Misselection, Omission, Complete Transfer, Bilingual Blend and Overinclusion. There were overall 461 (71%) non-transfer errors and 185 (29%) transfer errors. These were distributed quite unevenly between the three essay categories. The non-transfer errors mainly consisted of tense shifts. The biggest transfer error categories were misselections and omissions. 
The most frequent type of misselection was using the infinitive in place of the present participle because in Spanish the infinitive is used in most contexts where English requires the -ing form. Leaving out prepositions and the pronoun it in contexts where they were required made up most of the omissions, which is due to the languages having very differing basic structures. The results of this study may help EFL teachers plan lessons and materials which better respond to students’ learning needs and help anticipate the errors that are likely to occur in the writing of a group of students with a shared native language.
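
Note on the PCA entry (Siirtola, Säily and Nevalainen 2017) listed above: the abstract describes PCA as an orthogonal transformation that turns correlated measurements into linearly uncorrelated principal components. The sketch below illustrates that idea for the two-dimensional case only, in plain Python. It is not the Text Variation Explorer implementation; the function names and the sample values (imagined as two per-text measurements such as type/token ratio and mean word length) are hypothetical and chosen purely for illustration.

```python
import math


def covariance(pairs):
    """Sample covariance between the two coordinates of a list of pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (n - 1)


def variance(values):
    """Sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def pca_2d(points):
    """Rotate centered 2-D points onto their principal axes."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric 2x2 sample covariance matrix.
    sxx = variance([x for x, _ in centered])
    syy = variance([y for _, y in centered])
    sxy = covariance(centered)

    # Angle of the axis with the largest variance (first component).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Orthogonal transformation: a pure rotation of the centered data.
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    return scores, theta


# Hypothetical per-text measurements, invented for this example.
points = [(0.42, 4.1), (0.55, 4.6), (0.47, 4.3), (0.61, 4.9), (0.38, 4.0)]
scores, theta = pca_2d(points)
```

After the rotation, the two columns of `scores` have a covariance of zero up to floating-point error, and the first column carries at least as much variance as the second, which is what "linearly uncorrelated principal components" means in practice. Real corpus work would involve many more than two variables and would normally use an established numerical library rather than this closed-form rotation.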