  • Siddiqui, Saara (Helsingin yliopisto, 2020)
    This thesis examines non-finite verbs and their collocates in translated and non-translated Finnish-language baking recipes. The frequencies of non-finite verb forms in the two language varieties are compared, and collocates and colligate types occurring in connection with non-finite verbs are examined. These results are then viewed in relation to the translation universals of simplification, explicitation, interference and untypical frequencies. In addition, the frequencies are compared with frequencies in standard language. The analysis finds most non-finite forms to occur with fairly concordant frequencies in both language varieties. However, some forms, namely the inessive of the E-infinitive and the illative of the MA-infinitive, present a higher frequency in translated recipes. The overrepresentation of the inessive is in line with earlier studies (Eskola 2002 and Puurtinen 2005) and could be regarded as support for the universals of untypical frequencies and, potentially, of interference. On the other hand, significant differences are also found between translated texts, particularly with regard to the illative of the MA-infinitive and the instructive of the E-infinitive, which occur with frequencies both higher and lower than in the non-translated texts. These discrepancies might be considered a manifestation of untypical frequencies in translations, but overall support for explicitation or simplification is not found. Most frequencies of non-finite forms analyzed are in concordance with frequencies in standard language (according to Ikola et al. 1989). However, the illative of the MA-infinitive is found to occur with a lower frequency and the instructive of the E-infinitive with a higher frequency than in standard Finnish. This thesis suggests that this may be due to the relationship between the function of recipes and the functions of the two verb forms. In an analysis of collocate positions, the recipes present a tendency to left-positioning.
Interestingly, the analysis shows no significant differences between translated and non-translated language. This contradicts earlier studies, which have shown right-positioning to be more prevalent in Finnish translated from English than in non-translated Finnish (Eskola 2004). In contrast with these studies, the results here suggest no interference from the source language in the positioning of collocates. The material consists of forty baking recipes from four cookbooks, two of them translated and two non-translated. Recipe language, more specifically the language of their instructions, presents a highly conventionalized syntax with few complex structures and many imperatives (Pakkala-Weckström 2014). This thesis suggests, however, that non-finite verbs, instructives of the E-infinitive in particular, may be an essential component of recipe Finnish. The collocate analysis performed further suggests that it is the collocates – e.g. adverbials of time, manner and instrument – that make these non-finites meaningful, instructing the reader on how often, in which way and with what to process the ingredients, thus helping to fulfil the operative function of recipes.
  • Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2017)
    A recently proposed encoding for noncrossing digraphs can be used to implement generic inference over families of these digraphs and to carry out first-order factored dependency parsing. It is now shown that the recent proposal can be substantially streamlined without information loss. The improved encoding is less dependent on hierarchical processing and it gives rise to a high-coverage bounded-depth approximation of the space of noncrossing digraphs. This subset is presented elegantly by a finite-state machine that recognizes an infinite set of encoded graphs. The set includes more than 99.99% of the 0.6 million noncrossing graphs obtained from the UDv2 treebanks through planarisation. Rather than taking the low probability of the residual as a flat rate, it can be modelled with a joint probability distribution that is factorised into two underlying stochastic processes – the sentence length distribution and the related conditional distribution for deep nesting. This model points out that deep nesting in the streamlined code requires extreme sentence lengths. High depth is categorically out in common sentence lengths but emerges slowly at infrequent lengths that prompt further inquiry.
  • Jurkiewicz-Rohrbacher, Edyta Kinga; Hansen, Björn; Kolaković, Zrinka (2017)
    In the paper, we discuss the phenomenon of clitic climbing out of finite da2-complements in contemporary Serbian. Scholars’ opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Expanding on the assumption that the correctness of the phenomenon has often been denied due to its rareness, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates.
  • Ruohonen, Juho; Rudanko, Juhani (2019)
    Several factors have been identified in the recent literature to explain variation in the selection of sentential complements in recent English, and the article begins with a survey of such factors. The article then offers a case study of the impact of such factors on non-finite complements of the adjective afraid on the basis of the Strathy Corpus of Canadian English. Attention is paid for instance to the Extraction and Choice Principles, passive lower predicates, and text type. Multivariate analysis is applied to compare and to shed light on such different explanatory principles. The Choice Principle proves to be by far the most significant predictor of the alternation, while the heavily correlated syntactic feature of Voice appears non-significant. Fiction, as opposed to the informative registers, shows a notable preference for to infinitives, though this finding needs to be replicated in datasets where controlling for author idiolect is possible. Theoretically plausible odds ratios are observed on the Extraction Principle and negation of the predicate, but they are not statistically significant. In the former case, this may well be due to the variable’s collinearity with the Choice Principle and its low overall frequency, resulting in a low effective sample size.
  • Halla-aho, Hilla (Brill, 2018)
    In the construction known as left-dislocation, an element appears in a fronted position, before the clause to which it belongs, usually introducing the topic of the sentence. Based on a detailed analysis of syntax, information structure and pragmatic organization, this study explores how left-dislocation is used in republican Latin comedy, prose and inscriptions as a device to introduce topics or other pragmatically prominent elements. Taking into consideration especially relative clause syntax and constraints of each text type, Hilla Halla-aho shows that, in the context of early Latin syntax and the evolving standards of the written language, left-dislocation performs similar functions in dramatic dialogue, legal inscriptions and archaic prose.
  • Yli-Jyrä, Anssi Mikael (CSLI publications, 2005)
    CSLI Studies in Computational Linguistics ONLINE
    We have presented an overview of the FSIG approach and related FSIG grammars to issues of very low complexity and parsing strategy. We ended up with serious optimism according to which most FSIG grammars could be decomposed in a reasonable way and then processed efficiently.
  • Shagal, Ksenia; Volkova, Anna (2018)
  • Spronck, Stef; Nikitina, Tatiana (2019)
    In many languages, expressions of the type ‘x said: “p”’, ‘x said that p’ or ‘allegedly, p’ share properties with common syntactic types such as constructions with subordination, paratactic constructions, and constructions with sentence-level adverbs. On closer examination, however, they often turn out to be atypical members of these syntactic classes. In this paper we argue that a more coherent picture emerges if we analyse these expressions as a dedicated syntactic domain in itself, which we refer to as ‘reported speech’. Based on typological observations we argue for the idiosyncrasy of reported speech as a syntactic class. The article concludes with a proposal for a cross-linguistic characterisation that aims at capturing this broadly conceived domain of reported speech with a single semantic definition.
  • Marjokorpi, Jenni (Helsingfors universitet, 2014)
    According to the recent draft of the renewed Finnish national core curriculum, the basic concepts of grammar are to be learned already in the primary school when they are taught by a classroom teacher. As the basis of metalinguistic awareness, the grammatical concepts are complex and abstract, and a body of research evidence has raised public worry about the teachers' insufficient pedagogical content knowledge in this area; some authorities have even suggested replacing the classroom teachers, who receive very little grammar instruction during their training, with subject teachers of Finnish as the mother tongue in the fifth and sixth grades of basic education. This study aims at understanding student teachers' grammatical thinking from the point of view of the sentence elements subject and object, both usually taught during the fifth grade. I research the students' capability of identifying and defining the sentence elements and the minitheories they used in this cognitive process. I also study the relation between each minitheory and success in the grammar test. The study is part of a project that evaluates the student teachers' grammatical content knowledge, for which the data was collected in 2011. The students (N = 128) took a grammar test in which they identified the sentence elements, explained the strategies they used in the task, and also marked a fifth-grader's grammar test. I studied the minitheories using content analysis of the open-ended questions and examined their effectiveness with quantitative methods. I also considered the students' earlier performance in the national matriculation exam in relation to the level of grammatical content knowledge pictured by the test. The students were familiar with the concepts of subject and object as well as their semantic definitions but only 9.4 % of the participants managed to identify all the five subjects, and 21 % of them all the four objects. 
The separate and content-based analysis of the minitheories of subject and object showed that the students searched for both of them by using the same minitheories that I call semantic, syntactic, interrogative, and morphological. The morphological minitheory appeared effective in both cases, the syntactic minitheory in the subject tasks, and a combination of many minitheories in the object tasks. Therefore, the teacher education needs to put emphasis on the students' content knowledge in order to ensure that they have the profound grammatical understanding required by the curriculum.
  • Rueter, Jack; Partanen, Niko (The Association for Computational Linguistics, 2019)
    This paper attempts to evaluate some of the systematic differences in Uralic Universal Dependencies treebanks from a perspective that would help to introduce reasonable improvements in treebank annotation consistency within this language family. The study finds that the coverage of Uralic languages in the project is already relatively high, and the majority of typically Uralic features are already present and can be discussed on the basis of existing treebanks. Some of the idiosyncrasies found in individual treebanks stem from language-internal grammar traditions, and could be a target for harmonization in later phases.
  • Grünthal, Riho (Finno-Ugrian Society, 2016)
    Uralica Helsingiensia
    In Erzya, transitivity is indexed both in the verb conjugation and the inflection of the object. The degree of definiteness and case of the object alternate, while the verb displays the cross-reference of the person and number of both the subject and object. The morphologically complex transitivity marking system is a major challenge for speakers of Erzya as a second language. The article examines variation in the Erzya transitive clause in the light of data drawn from interviews with non-native Erzya speakers who have a Slavic or Turkic language as their first language. During the HALS fieldwork in the Dubenskiy district of the Republic of Mordovia in August 2013, a survey test was made on 68 transitive clauses representing patterns of both high and low transitivity. The answers of the Erzya second language speakers showed that they had adopted Erzya transitivity as a system involving both nominal and verbal inflection. However, the marking of transitivity varied between individual speakers regardless of their background. Although the interviewed non-native Erzya speakers were very fluent, there was a clear contrast between the answers of native and non-native speakers.
  • Yli-Jyrä, Anssi (The Linguistic Association of Finland, 2006)
    The trees in the Penn Treebank have a standard representation that involves complete balanced bracketing. In this article, an alternative to this standard representation of the treebank is proposed. The proposed representation for the trees is lossless, but it reduces the total number of brackets by 28%. This is possible by omitting the redundant pairs of special brackets that encode initial and final embedding, using a technique proposed by Krauwer and des Tombe (1981). In terms of the paired brackets, the maximum nesting depth in sentences decreases by 78%. The 99.9% coverage is achieved with only five non-top levels of paired brackets. The observed shallowness of the reduced bracketing suggests that finite-state based methods for parsing and searching could be a feasible option for treebank processing.
  • Yli-Jyrä, Anssi Mikael (The Linguistic Association of Finland, 2006)
    SKY journal of linguistics, special supplement
  • Lindstedt, Jouko (University of Sofia "St. Kliment Ohridski", 1981)
  • Rueter, Jack (Издательский центр Историко-социологического института, 2020)
    This paper addresses the issue of a national corpus for language documentation of the Moksha and Erzya literary languages in coordination with dialect archives comprising over 80 years of fieldwork (including Shoksha and Karatai). It shows the necessary development in computer-assisted research tools and ongoing research aligned with a consistent and systematic open research project.
  • Korpinen, Ruut (Helsingfors universitet, 2016)
    The topic of my thesis is periodic sentence structure in Swedish. A period is a long and intricate sentence type inherited from Latin, characterized by the use of subordinate and embedded clauses. In Swedish, periods were formerly used in formal language, and they have enriched the syntax of the language. Scholars nevertheless disagree on how the period should be defined in Swedish. Some accept as periods only those sentences in which a movable interposed clause is embedded in the middle, in the Latin manner. According to others, periods are complex sentences in general, containing many subordinate clauses, clause equivalents and long phrases. Still others regard the period primarily as a skilfully controlled unit of thought. Against this background, the aim of my study is to analyse and describe periodic sentence structure in the business letter collection of the industrialist and art collector Paul Sinebrychoff. My material comprises 130 letters in Swedish that Sinebrychoff wrote to, or received from, three of his art-dealing partners in 1899–1907. I examine how common periods are in the letters, what parts they consist of and how they are constructed, and what differences there are between the periods of different writers. My ultimate purpose is to gain further knowledge of what a period is and how it could be defined in Swedish. My research method is quantitative. First, I extract the periodic sentences from the letters using the grammatical criteria mentioned by earlier researchers. Then I analyse the length of the periods; the number, placement and types of subordinate clauses and clause equivalents; the length, forms and functions of the fundaments; and the use of the sentence adverbial position. I compile statistics on the results and compare them with other studies in which the same syntactic features have been analysed in different texts. The results show that periods account for a small share of all sentences. Because of their length, however, they often cover a considerable part of the total text.
The periodic structure typically contains various interposed clauses; many subordinate clauses placed at the beginning of or inside the sentence, which are exceptionally often adverbial; an abundance of clause equivalents, especially of Latin origin; heavy and often adverbial fundaments; and more than the usual number of other adverbials in the sentence adverbial position. New findings are the prevalence of various adverbial elements in different parts of the sentence and the tendency to place clauses and modifiers to the left in the sentence, which makes the whole structure unusually front-heavy throughout. Syntactic complexity nevertheless varies greatly between periods and between writers. The variation may partly be explained by the writers' mutual relations, education and status, as well as by the content and purpose of the letters. The conclusion is that it is difficult to formulate a precise and unambiguous definition of the period. Not even a movable interposed clause or a large number of subordinate clauses is a sufficient or necessary condition for periodic sentence structure. Essential features nevertheless seem to be abundant and varied subordination and front-heaviness, which give the period a foreign and literary flavour.