Significance testing of word frequencies in corpora

Lijffijt, Jefrey; Nevalainen, Terttu; Säily, Tanja; Papapetrou, Panagiotis; Puolamäki, Kai; Mannila, Heikki
2017-09-14T09:49:56Z 2021-12-17T18:49:20Z 2016
dc.identifier.citation Lijffijt, J, Nevalainen, T, Säily, T, Papapetrou, P, Puolamäki, K & Mannila, H 2016, 'Significance testing of word frequencies in corpora', Digital scholarship in the humanities, vol. 31, no. 2, pp. 374-397.
dc.identifier.other PURE: 56023940
dc.identifier.other PURE UUID: dbfc0409-e16a-43e2-8f2d-4fcc890b4ff4
dc.identifier.other Scopus: 84974696001
dc.identifier.other WOS: 000384763200010
dc.identifier.other ORCID: /0000-0003-3088-4903/work/39874344
dc.identifier.other ORCID: /0000-0003-4407-8929/work/28758644
dc.identifier.other ORCID: /0000-0003-1819-1047/work/53186283
dc.description.abstract Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article, we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank sum test, or bootstrap test for comparing word frequencies across corpora. en
dc.format.extent 24
dc.language.iso eng
dc.relation.ispartof Digital scholarship in the humanities
dc.rights.uri info:eu-repo/semantics/openAccess
dc.subject 113 Computer and information sciences
dc.subject significance testing
dc.subject bootstrap
dc.subject chi-square test
dc.subject log-likelihood ratio test
dc.subject keywords
dc.subject 6121 Languages
dc.subject corpus linguistics
dc.subject text corpora
dc.subject British National Corpus
dc.title Significance testing of word frequencies in corpora en
dc.type Article
dc.contributor.organization Department of Modern Languages 2010-2017
dc.description.reviewstatus Peer reviewed
dc.relation.issn 2055-7671
dc.rights.accesslevel openAccess
dc.type.version acceptedVersion
dc.relation.funder Unknown funder
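
The abstract's central point, that independence should be assumed at the level of texts rather than individual words, can be illustrated with a small text-level bootstrap. The following is a minimal sketch under stated assumptions, not the procedure used in the article: the function names (`relative_frequency`, `bootstrap_p_value`), the one-sided formulation, the tie handling, and the toy corpora are all hypothetical choices made for illustration. Each corpus is assumed to be a list of tokenised texts.

```python
import random


def relative_frequency(texts, word):
    """Pooled relative frequency of `word` over a list of tokenised texts."""
    total_tokens = sum(len(text) for text in texts)
    hits = sum(text.count(word) for text in texts)
    return hits / total_tokens if total_tokens else 0.0


def bootstrap_p_value(corpus_s, corpus_t, word, n_resamples=2000, seed=0):
    """One-sided bootstrap p-value for 'word is more frequent in S than in T'.

    Whole texts (not individual words) are resampled with replacement, so a
    word concentrated in a few texts produces unstable frequency estimates
    and therefore a larger p-value. Illustrative sketch only.
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_resamples):
        sample_s = rng.choices(corpus_s, k=len(corpus_s))
        sample_t = rng.choices(corpus_t, k=len(corpus_t))
        freq_s = relative_frequency(sample_s, word)
        freq_t = relative_frequency(sample_t, word)
        if freq_s < freq_t:
            score += 1.0      # resample contradicts "more frequent in S"
        elif freq_s == freq_t:
            score += 0.5      # ties count as half (an assumption)
    # Add-one smoothing keeps the estimate strictly above zero.
    return (1.0 + score) / (1.0 + n_resamples)


# Toy data (hypothetical): ten texts of 20 tokens per corpus.
filler = ["w"] * 20
# Corpus T: the word "x" occurs once in every text (10 hits in 200 tokens).
corpus_t = [["x"] + ["w"] * 19 for _ in range(10)]
# Well-dispersed corpus S: "x" occurs twice in every text (20 hits in 200).
dispersed_s = [["x", "x"] + ["w"] * 18 for _ in range(10)]
# Bursty corpus S: the same 20 hits, but all of them in a single text.
bursty_s = [["x"] * 20] + [list(filler) for _ in range(9)]

p_dispersed = bootstrap_p_value(dispersed_s, corpus_t, "x")
p_bursty = bootstrap_p_value(bursty_s, corpus_t, "x")
```

Both versions of corpus S contain the word equally often overall, so a test that treats every word as an independent sample would score them identically. In this sketch the well-dispersed word receives a very small p-value, while the bursty word does not reach conventional significance thresholds, which mirrors the abstract's warning about poorly dispersed words.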

Files in this item

Files Size Format
llc_postprint.pdf 1.793Mb PDF
