Browsing by Subject "data mining"


Now showing items 1-8 of 8
  • Galbrun, Esther; Tang, Hui; Fortelius, Mikael; Zliobaite, Indre (2018)
    As organisms are adapted to their environments, assemblages of taxa can be used to describe environments in the present and in the past. Here, we use a data mining method, namely redescription mining, to discover and analyze patterns of association between large herbivorous mammals and their environments via their functional traits. We focus on functional properties of animal teeth, characterized using a recently developed dental trait scoring scheme. The teeth of herbivorous mammals serve as an interface to obtain energy from food, and are therefore expected to match the types of plant food available in their environment. Hence, dental traits are expected to carry a signal of environmental conditions. We analyze a global compilation of occurrences of large herbivorous mammals and of bioclimatic conditions. We identify common patterns of association between dental trait distributions and bioclimatic conditions and discuss their implications. Each pattern can be considered a computational biome. Our analysis distinguishes three global zones, which we refer to as the boreal-temperate moist zone, the tropical moist zone and the tropical-subtropical dry zone. The boreal-temperate moist zone is mainly characterized by seasonal cold temperatures, a lack of hypsodonty and a high share of species with obtuse lophs. The tropical moist zone is mainly characterized by high temperatures, high isothermality, abundant precipitation and a high share of species with acute rather than obtuse lophs. Finally, the tropical-subtropical dry zone is mainly characterized by a high seasonality of temperatures and precipitation, as well as high hypsodonty and horizodonty. We find that the dental trait signature of African rain forests is quite different from the signature of climatically similar sites in North America and Asia, where hypsodont species and species with obtuse lophs are mostly absent. In terms of climate and dental signatures, the African seasonal tropics share many similarities with Central-South Asian sites. Interestingly, the Tibetan plateau is covered both by redescriptions from the tropical-subtropical dry group and by redescriptions from the boreal-temperate moist group, suggesting a combination of features from both zones in its dental traits and climate.
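The core idea of redescription mining, i.e. finding pairs of queries over two data views that describe roughly the same set of entities, can be sketched in a few lines. The mini-dataset, attribute names and the single-attribute queries below are hypothetical simplifications; the study itself applies a dedicated redescription mining method to full trait and climate data.

```python
# Toy illustration of redescription mining: find pairs of single-attribute
# queries, one per data view, whose supports (sets of matching rows)
# have high Jaccard similarity.

def support(view, attr):
    """Rows (site indices) where the boolean attribute holds."""
    return {i for i, row in enumerate(view) if row[attr]}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def mine_redescriptions(view1, view2, min_jaccard=0.6):
    """Return (attr1, attr2, jaccard) pairs describing roughly the same rows."""
    results = []
    for a1 in view1[0]:
        s1 = support(view1, a1)
        for a2 in view2[0]:
            s2 = support(view2, a2)
            j = jaccard(s1, s2)
            if j >= min_jaccard:
                results.append((a1, a2, round(j, 2)))
    return results

# Hypothetical mini-data: 4 sites described by dental traits and by climate.
dental = [{"hypsodont": True,  "obtuse_lophs": False},
          {"hypsodont": True,  "obtuse_lophs": False},
          {"hypsodont": False, "obtuse_lophs": True},
          {"hypsodont": False, "obtuse_lophs": True}]
climate = [{"dry": True,  "cold_winter": False},
           {"dry": True,  "cold_winter": False},
           {"dry": False, "cold_winter": True},
           {"dry": False, "cold_winter": True}]

print(mine_redescriptions(dental, climate))
# e.g. hypsodonty co-occurs with dry sites, obtuse lophs with cold winters
```

Real redescription miners search over conjunctive and disjunctive queries rather than single attributes, but the support-overlap objective is the same.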
  • Fridlund, Mats; Oiva, Mila; Paju, Petri (Helsinki University Press, 2020)
    Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods to digitized source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day, including various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of digital tools and the necessity of new forms of digital source criticism. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution taking stock of how digital research currently advances historical scholarship.
  • Passos, Ives C.; Ballester, Pedro L.; Barros, Rodrigo C.; Librenza-Garcia, Diego; Mwangi, Benson; Birmaher, Boris; Brietzke, Elisa; Hajek, Tomas; Lopez Jaramillo, Carlos; Mansur, Rodrigo B.; Alda, Martin; Haarman, Bartholomeus C. M.; Isometsa, Erkki; Lam, Raymond W.; McIntyre, Roger S.; Minuzzi, Luciano; Kessing, Lars V.; Yatham, Lakshmi N.; Duffy, Anne; Kapczinski, Flavio (2019)
    Objectives The International Society for Bipolar Disorders Big Data Task Force assembled leading researchers with extensive experience in bipolar disorder (BD), machine learning, and big data to evaluate the rationale of machine learning and big data analytics strategies for BD. Method A task force was convened to examine and integrate findings from the scientific literature related to machine learning and big data-based studies, to clarify terminology, and to describe challenges and potential applications in the field of BD. We also systematically searched PubMed, Embase, and Web of Science for articles published up to January 2019 that used machine learning in BD. Results The results suggested that big data analytics has the potential to provide risk calculators to aid in treatment decisions and predict clinical prognosis, including suicidality, for individual patients. This approach can advance diagnosis by enabling discovery of more relevant data-driven phenotypes, as well as by predicting transition to the disorder in high-risk unaffected subjects. We also discuss the most frequent challenges that big data analytics applications can face, such as heterogeneity, lack of external validation and replication of some studies, cost and non-stationary distribution of the data, and lack of appropriate funding. Conclusion Machine learning-based studies, including atheoretical data-driven big data approaches, provide an opportunity to more accurately detect those who are at risk and parse relevant phenotypes, as well as to inform treatment selection and prognosis. However, several methodological challenges need to be addressed in order to translate research findings to clinical settings.
  • Huovelin, Juhani; Gross, Oskar; Solin, Otto; Linden, Krister; Maisala, Sami Petri Tapio; Oittinen, Tero; Toivonen, Hannu; Niemi, Jyrki; Silfverberg, Miikka (2013)
    We have developed tools and applied methods for automated identification of potential news from textual data for an automated news search system called Software Newsroom. The purpose of the tools is to analyze data collected from the internet and to identify information that has a high probability of containing new information. The identified information is summarized in order to help understanding of the semantic contents of the data, and to assist the news editing process. It has been demonstrated that words with a certain set of syntactic and semantic properties are effective when building topic models for English. We demonstrate that words with the same properties in Finnish are useful as well. Extracting such words requires knowledge about the special characteristics of the Finnish language, which are taken into account in our analysis. Two different methodological approaches have been applied for the news search. One of the methods is based on topic analysis and applies Multinomial Principal Component Analysis (MPCA) for topic model creation and data profiling. The second method is based on word association analysis and applies the log-likelihood ratio (LLR). For the topic mining, we have created English and Finnish language corpora from Wikipedia and Finnish corpora from several Finnish news archives, and we have used bag-of-words representations of these corpora as training data for the topic model. We have performed topic analysis experiments with both the training data itself and with arbitrary text parsed from internet sources. The results suggest that the effectiveness of news search strongly depends on the quality of the training data and its linguistic analysis. In the association analysis, we use a combined methodology for detecting novel word associations in the text. For detecting novel associations we use the background corpus, from which we extract common word associations. In parallel, we collect the statistics of word co-occurrences from the documents of interest and search for associations with larger likelihood in these documents than in the background. We have demonstrated the applicability of these methods for Software Newsroom. The results indicate that the background-foreground model has significant potential in news search. The experiments also indicate great promise in employing background-foreground word associations for other applications. A combined application of the two methods is planned, as is applying the methods to social media using a pre-translator of social media language.
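The background-foreground association step rests on the log-likelihood ratio. Below is a minimal sketch of a Dunning-style LLR that compares a word pair's co-occurrence rate in foreground documents against a background corpus; all counts and the example word pair are hypothetical, not taken from the study.

```python
from math import log

def ll(k, n, p):
    """Log-likelihood of k successes in n Bernoulli trials with rate p."""
    eps = 1e-12  # guard against log(0)
    return k * log(p + eps) + (n - k) * log(1 - p + eps)

def llr(k1, n1, k2, n2):
    """Dunning-style log-likelihood ratio: does the foreground rate k1/n1
    differ from the background rate k2/n2?"""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                - ll(k1, n1, p) - ll(k2, n2, p))

# Hypothetical counts: a word pair co-occurs in 40 of 500 foreground
# windows but only 10 of 50,000 background windows -> high LLR, a
# candidate novel association.
novel = llr(40, 500, 10, 50_000)
# A pair with matching foreground and background rates scores near zero.
common = llr(4, 500, 400, 50_000)
print(novel, common)
```

Pairs are then ranked by LLR, so associations that are much more likely in the documents of interest than in the background surface first.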
  • Honkela, Timo; Raitio, Juha; Lagus, Krista; Nieminen, Ilari T.; Honkela, Nina; Pantzar, Mika (IEEE, 2012)
    A substantial amount of subjectivity is involved in how people use language and conceptualize the world. Computational methods and formal representations of knowledge usually neglect this kind of individual variation. We have developed a novel method, Grounded Intersubjective Concept Analysis (GICA), for the analysis and visualization of individual differences in language use and conceptualization. The GICA method first employs a conceptual survey or a text mining step to elicit from varied groups of individuals the particular ways in which terms and associated concepts are used among the individuals. The subsequent analysis and visualization reveals potential underlying groupings of subjects, objects and contexts. One way of viewing the GICA method is to compare it with the traditional word space models. In the word space models, such as latent semantic analysis (LSA), statistical analysis of word-context matrices reveals latent information. A common approach is to analyze term-document matrices. The GICA method extends the basic idea of traditional term-document matrix analysis to include a third dimension of different individuals. This leads to the formation of a third-order tensor of dimensions subject × object × context. Through flattening, these subject-object-context (SOC) tensors can be analyzed using different computational methods, including principal component analysis (PCA), singular value decomposition (SVD), independent component analysis (ICA) or any existing or future method suitable for analyzing high-dimensional data sets. In order to demonstrate the use of the GICA method, we present the results of two case studies. In the first case, a GICA analysis of health-related concepts is conducted. In the second one, the State of the Union addresses by US presidents are analyzed. In these case studies, we apply multidimensional scaling (MDS), the self-organizing map (SOM) and the Neighborhood Retrieval Visualizer (NeRV) as specific data analysis methods within the overall GICA method. The GICA method can be used, for instance, to support education of heterogeneous audiences, public planning processes and participatory design, conflict resolution, environmental problem solving, interprofessional and interdisciplinary communication, product development processes, mergers of organizations, and building enhanced knowledge representations in the semantic web.
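The tensor-flattening step at the heart of GICA can be illustrated with NumPy. The toy subject-object-context tensor below is hypothetical (random survey-style responses); PCA is computed via SVD of the centered, flattened matrix, standing in for whichever analysis method is plugged in.

```python
import numpy as np

# Toy SOC tensor: 5 subjects rate 4 terms (objects) in 3 usage contexts.
rng = np.random.default_rng(0)
soc = rng.integers(0, 5, size=(5, 4, 3)).astype(float)  # subjects x objects x contexts

# Flatten ("matricize") along the subject mode: each subject becomes one
# row of length objects * contexts, so standard matrix methods apply.
flat = soc.reshape(soc.shape[0], -1)            # shape (5, 12)

# PCA via SVD of the centered matrix: the leading components expose
# groupings of subjects by how they use the terms across contexts.
centered = flat - flat.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
coords = U[:, :2] * s[:2]                       # 2-D coordinates per subject

print(flat.shape, coords.shape)                 # (5, 12) (5, 2)
```

Flattening along the object or context mode instead yields groupings of terms or of contexts, which is how the same tensor supports all three views mentioned in the abstract.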
  • Zhou, Fang; Qu, Qiang; Toivonen, Hannu (2017)
    Networks often contain implicit structure. We introduce novel problems and methods that look for structure in networks by grouping nodes into supernodes and edges into superedges, and then make this structure visible to the user in a smaller generalised network. This task of finding generalisations of nodes and edges is formulated as 'network summarisation'. We propose models and algorithms for networks that have weights on edges, on nodes or on both, and study three new variants of the network summarisation problem. In edge-based weighted network summarisation, the summarised network should preserve edge weights as well as possible. A wider class of settings is considered in path-based weighted network summarisation, where the resulting summarised network should preserve longer-range connectivities between nodes. Node-based weighted network summarisation in turn allows weights also on nodes, and summarisation aims to preserve more information related to high-weight nodes. We study theoretical properties of these problems and show them to be NP-hard. We propose a range of heuristic generalisation algorithms with different trade-offs between complexity and quality of the result. Comprehensive experiments on real data show that weighted networks can be summarised efficiently with relatively little error.
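The edge-based objective can be illustrated on a fixed grouping: each superedge gets the average weight of the original edges it covers, and the summary error measures how much edge-weight information is lost. This sketch shows only the objective being minimised, not the paper's heuristics for searching over groupings; node names, weights and the grouping are hypothetical.

```python
# Edge-based weighted network summarisation for a given grouping of
# nodes into supernodes.

def summarise(edges, groups):
    """edges: {(u, v): weight}; groups: {node: supernode label}.
    Returns (superedges, squared error of the summary)."""
    buckets = {}
    for (u, v), w in edges.items():
        key = tuple(sorted((groups[u], groups[v])))
        buckets.setdefault(key, []).append(w)
    # Each superedge weight is the mean of the edge weights it covers,
    # which minimises the squared error for that bucket.
    superedges = {k: sum(ws) / len(ws) for k, ws in buckets.items()}
    error = sum((w - superedges[k]) ** 2
                for k, ws in buckets.items() for w in ws)
    return superedges, error

# Hypothetical network: nodes a and b behave alike towards c and d,
# so merging them into one supernode "AB" loses little information.
edges = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "d"): 3.0, ("b", "d"): 3.2}
groups = {"a": "AB", "b": "AB", "c": "C", "d": "D"}
superedges, err = summarise(edges, groups)
print(superedges, err)
```

A summarisation heuristic would compare the error of candidate groupings like this one and greedily merge the node pair whose merge increases the error least.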
  • Söderholm, Sofia (Helsingin yliopisto, 2020)
    The thesis examines predictive policing algorithms developed to forecast crime, and assesses, in light of current Finnish legislation, the position of a person classified by an algorithm's prediction as a potential offender. Predictive policing methods can be seen as part of the surveillance of people that has increased in the 2000s, justified by the need to combat terrorism and serious crime. Because the thesis concerns a hypothetical situation, it draws on the debate on mass surveillance connected to this phenomenon and on the legislation governing the use of passenger name record data, which can be regarded as part of the same phenomenon. Put simply, predictive policing methods are computer programs used by the police whose algorithms analyse vast masses of data to produce predictions about crime. Predictive policing is intended as a police tool on whose basis the police can allocate their resources as the prediction suggests. Predictive policing methods vary, but the term usually refers to predicting the places and times of future crimes or to predicting potential offenders. A potential offender is a person who, according to the algorithm's assessment, is likely to participate in criminal activity in the future. Predictive policing methods work by exploiting big data, data mining and machine-learning algorithms. They have attracted due-process concerns common to algorithmic systems, relating to the discriminatory nature and inaccuracy of predictive policing algorithms, their opacity, and the automation bias induced by the technology. It is essential to remember that the predictions produced by predictive policing methods are in reality statistical probabilities generated by an algorithm on the basis of past events. The thesis is doctrinal in method and approaches the position of the potential offender through two research questions. The first aims to establish the potential offender's position in the current legislation on police activity by asking: can a potential offender be treated as a criminal suspect within the systematics of the Police Act and the Criminal Investigation Act? Here the thesis examines how well the systematics of current legislation fit the position of the potential offender, who has been marked as suspicious in some way although no crime committed by them has been observed. The second research question continues the assessment by addressing the suspicion directed at the potential offender, asking: would the use of predictive policing methods endanger the potential offender's presumption of innocence? Traditionally, the presumption of innocence has been regarded as one of a suspect's due-process guarantees in criminal proceedings, determining how a suspect and an accused must be treated. The second research question, however, explores whether this guarantee applies to the position of the potential offender and whether it may extend to the time before criminal proceedings begin. The thesis's central finding is that the current systematics of the Finnish Police Act and Criminal Investigation Act, built on the distinction between a suspect and a non-suspect, are not up to date for assessing the position of the potential offender. For a potential offender to be treated as a criminal suspect, a criminal investigation would first have to be pending, to which a suspect is only then attached. Given the nature of predictive policing as a police tool, using the method at all would be pointless if the police did nothing with the prediction they receive. This could lead the police to try to uncover a crime possibly committed by the potential offender, or otherwise to target intensified surveillance at this individual while waiting for a possible crime to occur. The potential offender would thus end up in an ambiguous position in which they would not be entitled to the due-process guarantees of a criminal investigation, yet might still become the target of police measures. The assessment of the presumption of innocence reached the same conclusion: the presumption does not protect the potential offender from police suspicions and the measures that may follow from them, because it does not extend to the time before the police have observed an act suspected of constituting a crime. Legal literature has, however, argued for an expansion of the presumption of innocence as policing changes. The thesis concludes with proposals for developing the legislation, should predictive policing algorithms that predict potential offenders be introduced in Finland.
  • Zhou, Fang; Toivonen, Hannu; King, Ross D. (2014)