Browsing by Subject "112 Statistics and probability"

Now showing items 1-20 of 152
  • Tadei, Alessandro; Pensar, Johan; Corander, Jukka; Finnilä, Katarina; Santtila, Pekka; Antfolk, Jan (2019)
    In assessments of child sexual abuse (CSA) allegations, informative background information is often overlooked or not used properly. We therefore created and tested an instrument that uses accessible background information to calculate the probability of a child being a CSA victim, which can be used as a starting point in the subsequent investigation. Studying 903 demographic and socioeconomic variables from over 11,000 Finnish children, we identified 42 features related to CSA. Using Bayesian logic to calculate the probability of abuse, our instrument, the Finnish Investigative Instrument of Child Sexual Abuse (FICSA), has two separate profiles for boys and girls. A cross-validation procedure suggested excellent diagnostic utility (area under the curve [AUC] = 0.97 for boys and AUC = 0.88 for girls). We conclude that the presented method can be useful in forensic assessments of CSA allegations by adding a reliable statistical approach to considering background information, and to support clinical decision making and guide investigative efforts.
  • Björklund, Heta; Pölönen, Janne (2019)
    This article uses citation analysis to track the citation patterns of works by Fritz Schulz, Paul Koschaker, Fritz Pringsheim, Franz Wieacker and Helmut Coing – key figures in the field of Roman law – and to see whether databases, such as Google Scholar and Web of Science, provide meaningful data that accurately reflects the popularity and influence of these works. The article also takes into account those limitations regarding the availability of the material, which include the language of the publications, as well as the research field.
  • McGillivray, Barbara; Hengchen, Simon; Lähteenoja, Viivi Esteri; Palma, Marco; Vatri, Alessandro (2019)
  • Pihlaja, Miika; Gutmann, Michael Urs; Hyvärinen, Aapo Johannes (2010)
  • Lumme, Sonja; Sund, Reijo Tapani; Leyland, Alastair H; Keskimäki, Ilmo (2015)
    In this paper, we introduce several statistical methods to evaluate the uncertainty in the concentration index (C) for measuring socioeconomic equality in health and health care using aggregated total population register data. The C is a widely used index when measuring socioeconomic inequality, but previous studies have mainly focused on developing statistical inference for sampled data from population surveys. While data from large population-based or national registers provide complete coverage, registration comprises several sources of error. We simulate confidence intervals for the C with different Monte Carlo approaches, which take into account the nature of the population data. As an empirical example, we have an extensive dataset from the Finnish cause-of-death register on mortality amenable to health care interventions between 1996 and 2008. Amenable mortality has often been used as a tool to capture the effectiveness of health care. Thus, inequality in amenable mortality provides evidence on weaknesses in health care performance between socioeconomic groups. Using several approaches with different parametric assumptions, our study shows that previously introduced methods to estimate the uncertainty of the C for sampled data are too conservative for aggregated population register data. Consequently, we recommend that inequality indices based on the register data be presented together with an approximation of the uncertainty, and we suggest using the simulation approach we propose. The approach can also be adapted to other measures of equality in health.
  • Niskanen, Vesa A. (Springer-Verlag, 2017)
    Studies in Computational Intelligence
    A rapid soft computing method for dimensionality reduction of data sets is presented. Traditional approaches are usually based on factor or principal component analysis. Our method applies fuzzy cluster analysis and approximate reasoning instead, and thus it is also applicable to nonparametric and nonlinear models. Comparisons are drawn between the methods with two empirical data sets.
  • Vanhatalo, Jarno; Hartmann, Marcelo; Veneranta, Lari (2020)
    Species distribution models (SDM) are a key tool in ecology, conservation and management of natural resources. Two key components of the state-of-the-art SDMs are the description of the species distribution response along environmental covariates and the spatial random effect that captures deviations from the distribution patterns explained by environmental covariates. Joint species distribution models (JSDMs) additionally include interspecific correlations, which have been shown to improve their descriptive and predictive performance compared to single species models. However, current JSDMs are restricted to the hierarchical generalized linear modeling framework. Their limitation is that parametric models have trouble explaining changes in abundance due to, for example, highly non-linear physical tolerance limits, which is particularly important when predicting species distribution in new areas or under scenarios of environmental change. On the other hand, semi-parametric response functions have been shown to improve the predictive performance of SDMs in these tasks in single species models. Here, we propose JSDMs where the responses to environmental covariates are modeled with additive multivariate Gaussian processes coded as linear models of coregionalization. These allow inference for a wide range of functional forms and interspecific correlations between the responses. We also propose an efficient approach for inference with Laplace approximation and parameterization of the interspecific covariance matrices on the Euclidean space. We demonstrate the benefits of our model with two small scale examples and one real world case study. We use cross-validation to compare the proposed model to analogous semi-parametric single species models and parametric single and joint species models in interpolation and extrapolation tasks. The proposed model outperforms the alternative models in all cases. We also show that the proposed model can be seen as an extension of the current state-of-the-art JSDMs to semi-parametric models.
  • Karttunen, Henri (2020)
    We define a nonlinear autoregressive time series model based on the generalized hyperbolic distribution in an attempt to model time series with non-Gaussian features such as skewness and heavy tails. We show that the resulting process has a simple condition for stationarity and it is also ergodic. An empirical example with a forecasting experiment is presented to illustrate the features of the proposed model.
  • Surakhi, Ola M.; Zaidan, Martha Arbayani; Serhan, Sami; Salah, Imad; Hussein, Tareq (2020)
    Time-series prediction is an important area that inspires numerous research disciplines for various applications, including air quality databases. Developing a robust and accurate model for time-series data is a challenging task, because it involves training different models and optimization. In this paper, we proposed and tested three machine learning techniques, namely recurrent neural networks (RNN), a heuristic algorithm and ensemble learning, to develop a predictive model for estimating atmospheric particle number concentrations in the form of a time-series database. Here, the RNN included three variants (Long Short-Term Memory, Gated Recurrent Network, and Bi-directional Recurrent Neural Network) with various configurations. A Genetic Algorithm (GA) was then used to find the optimal time-lag in order to enhance the model’s performance. The optimized models were used to construct a stacked ensemble model as well as to perform the final prediction. The results demonstrated that the time-lag value can be optimized by using the heuristic algorithm; consequently, this improved the model prediction accuracy. Further improvement can be achieved by using ensemble learning that combines several models for better performance and more accurate predictions.
  • Pöntinen, Anna K.; Top, Janetta; Arredondo-Alonso, Sergio; Tonkin-Hill, Gerry; Freitas, Ana R.; Novais, Carla; Gladstone, Rebecca A.; Pesonen, Maiju; Meneses, Rodrigo; Pesonen, Henri; Lees, John A.; Jamrozy, Dorota; Bentley, Stephen D.; Lanza, Val F.; Torres, Carmen; Peixe, Luisa; Coque, Teresa M.; Parkhill, Julian; Schurch, Anita C.; Willems, Rob J. L.; Corander, Jukka (2021)
    Enterococcus faecalis is a commensal and nosocomial pathogen, which is also ubiquitous in animals and insects, representing a classical generalist microorganism. Here, we study E. faecalis isolates ranging from the pre-antibiotic era in 1936 up to 2018, covering a large set of host species including wild birds, mammals, healthy humans, and hospitalised patients. We sequence the bacterial genomes using short- and long-read techniques, and identify multiple extant hospital-associated lineages, with last common ancestors dating back as far as the 19th century. We find a population cohesively connected through homologous recombination, a metabolic flexibility despite a small genome size, and a stable large core genome. Our findings indicate that the apparent hospital adaptations found in hospital-associated E. faecalis lineages likely predate the "modern hospital" era, suggesting selection in another niche, and underlining the generalist nature of this nosocomial pathogen. Enterococcus faecalis is a commensal microorganism of animals, insects and humans, but also a nosocomial pathogen. Here, the authors analyse genomic sequences from E. faecalis isolates from animals and humans, and find that the last common ancestors of multiple hospital-associated lineages date to the pre-antibiotic era.
  • Simola, Umberto; Cisewski-Kehe, Jessi; Wolpert, Robert L. (2020)
    Finite mixture models are used in statistics and other disciplines, but inference for mixture models is challenging due, in part, to the multimodality of the likelihood function and the so-called label switching problem. We propose extensions of the Approximate Bayesian Computation-Population Monte Carlo (ABC-PMC) algorithm as an alternative framework for inference on finite mixture models. There are several decisions to make when implementing an ABC-PMC algorithm for finite mixture models, including the selection of the kernels used for moving the particles through the iterations, how to address the label switching problem and the choice of informative summary statistics. Examples are presented to demonstrate the performance of the proposed ABC-PMC algorithm for mixture modelling. The performance of the proposed method is evaluated in a simulation study and for the popular recessional velocity galaxy data.
  • Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W.; Francis, Suzanna C.; Fraser, Louise J.; Vehkaperae, Mikko; Lan, Yueheng; Corander, Jukka (2015)
    Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. Results: There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only a single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available. A Matlab implementation is also available.
  • Zheng, Zhong; Wei, Lu; Speicher, Roland; Muller, Ralf R.; Hämäläinen, Jyri; Corander, Jukka (2017)
    The Rayleigh product channel model is useful in capturing the performance degradation due to rank deficiency of MIMO channels. In this paper, such a performance degradation is investigated via the distribution of mutual information assuming the block fading channels and the uniform power transmission scheme. Using techniques of free probability theory, the asymptotic variance of mutual information is derived when the dimensions of the channel matrices approach infinity. In this asymptotic regime, the mutual information is rigorously proven to be Gaussian distributed. Using the obtained results, a fundamental tradeoff between multiplexing gain and diversity gain of Rayleigh product channels under the uniform power transmission can be characterized by the closed-form expression at any finite signal-to-noise ratio. Numerical results are provided to compare the outage performance between the Rayleigh product channels and the conventional Rayleigh MIMO channels.
  • Sipola, Aleksi; Marttinen, Pekka; Corander, Jukka (2018)
    The advent of genomic data from densely sampled bacterial populations has created a need for flexible simulators by which models and hypotheses can be efficiently investigated in the light of empirical observations. Bacmeta provides fast stochastic simulation of neutral evolution within a large collection of interconnected bacterial populations with a completely adjustable connectivity network. Stochastic events of mutations, recombinations, insertions/deletions, migrations and micro-epidemics can be simulated in discrete non-overlapping generations with a Wright-Fisher model that operates on explicit sequence data of any desired genome length. Each model component, including locus, bacterial strain, population and ultimately the whole metapopulation, is efficiently simulated using C++ objects and detailed metadata from each level can be acquired. The software can be executed in a cluster environment using simple textual input files, enabling, e.g., large-scale simulations and likelihood-free inference. Availability and implementation: Bacmeta is implemented with C++ for Linux, Mac and Windows. It is available under the BSD 3-clause license. Supplementary information: Supplementary data are available at Bioinformatics online.
  • Hartmann, Marcelo; Ehlers, Ricardo S. (2017)
    In this article, we propose to evaluate and compare Markov chain Monte Carlo (MCMC) methods to estimate the parameters in a generalized extreme value model. We employed the Bayesian approach using traditional Metropolis-Hastings methods, Hamiltonian Monte Carlo (HMC), and Riemann manifold HMC (RMHMC) methods to obtain the approximations to the posterior marginal distributions of interest. Applications to real datasets and simulation studies provide evidence that the extra analytical work involved in Hamiltonian Monte Carlo algorithms is compensated by a more efficient exploration of the parameter space.
  • Tietavainen, A.; Gutmann, M. U.; Keski-Vakkuri, E.; Corander, J.; Haeggstrom, E. (2017)
    The control of human body sway by the central nervous system, muscles, and conscious brain is of interest since body sway carries information about the physiological status of a person. Several models have been proposed to describe body sway in an upright standing position; however, due to the statistical intractability of the more realistic models, no formal parameter inference has previously been conducted and the expressive power of such models for real human subjects remains unknown. Using the latest advances in Bayesian statistical inference for intractable models, we fitted a nonlinear control model to posturographic measurements, and we showed that it can accurately predict the sway characteristics of both simulated and real subjects. Our method provides a full statistical characterization of the uncertainty related to all model parameters as quantified by posterior probability density functions, which is useful for comparisons across subjects and test settings. The ability to infer intractable control models from sensor data opens new possibilities for monitoring and predicting body status in health applications.
  • Gupta, Rashi; Greco, Dario; Auvinen, Petri; Arjas, Elja (2010)
  • Liu, Jia; Vanhatalo, Jarno (2020)
    In geostatistics, the spatiotemporal design for data collection is central for accurate prediction and parameter inference. An important class of geostatistical models is the log-Gaussian Cox process (LGCP), but there are no formal analyses of spatial or spatiotemporal survey designs for them. In this work, we study traditional balanced and uniform random designs in situations where the analyst has prior information on the intensity function of the LGCP and show that the traditional balanced and random designs are not efficient in such situations. We also propose a new design sampling method, a rejection sampling design, which extends the traditional balanced and random designs by directing survey sites to locations that are a priori expected to provide the most information. We compare our proposal to the traditional balanced and uniform random designs using the expected average predictive variance (APV) loss and the expected Kullback-Leibler (KL) divergence between the prior and the posterior for the LGCP intensity function in simulation experiments and in a real world case study. The APV informs about the expected accuracy of a survey design in point-wise predictions and the KL-divergence measures the expected gain in information about the joint distribution of the intensity field. The case study concerns planning a survey design for analyzing larval areas of two commercially important fish stocks in the Finnish coastal region. Our experiments show that the designs generated by the proposed rejection sampling method clearly outperform the traditional balanced and uniform random survey designs. Moreover, the method is easily applicable to other models in general. (C) 2019 The Author(s). Published by Elsevier B.V.
  • Lanne, Markku; Luoma, Arto; Luoto, Jani (2012)
  • Cheng, Lu; Connor, Thomas R.; Aanensen, David M.; Spratt, Brian G.; Corander, Jukka (2011)