Browsing by Subject "INFERENCE"

Now showing items 1-20 of 51
  • Sahlin, Ullrika; Helle, Inari; Perepolkin, Dmytro (2021)
    Failing to communicate current knowledge limitations, that is, epistemic uncertainty, in environmental risk assessment (ERA) may have severe consequences for decision making. Bayesian networks (BNs) have gained popularity in ERA, primarily because they can combine variables from different models and integrate data and expert judgment. This paper highlights potential gaps in the treatment of uncertainty when using BNs for ERA and proposes a consistent framework (and a set of methods) for treating epistemic uncertainty to help close these gaps. The proposed framework describes the treatment of epistemic uncertainty about the model structure, parameters, expert judgment, data, management scenarios, and the assessment's output. We identify issues related to the differentiation between aleatory and epistemic uncertainty and the importance of communicating both uncertainties associated with the assessment predictions (direct uncertainty) and the strength of knowledge supporting the assessment (indirect uncertainty). Probabilities, intervals, or scenarios are expressions of direct epistemic uncertainty. The type of BN determines the treatment of parameter uncertainty: epistemic, aleatory, or predictive. Epistemic BNs are useful for probabilistic reasoning about states of the world in light of evidence. Aleatory BNs are the most relevant for ERA, but they are not sufficient to treat epistemic uncertainty alone because they do not explicitly express parameter uncertainty. For uncertainty analysis, we recommend embedding an aleatory BN into a model for parameter uncertainty. Bayesian networks do not contain information about uncertainty in the model structure, which requires several models. Statistical models (e.g., hierarchical modeling outside the BNs) are required to consider uncertainties and variability associated with data. 
    We highlight the importance of being open about things one does not know and carefully choosing a method to precisely communicate both direct and indirect uncertainty in ERA. Integr Environ Assess Manag 2020;00:1-12. (c) 2020 The Authors. Integrated Environmental Assessment and Management published by Wiley Periodicals LLC on behalf of Society of Environmental Toxicology & Chemistry (SETAC)
  • Pensar, Johan; Talvitie, Topi; Hyttinen, Antti; Koivisto, Mikko (The Association for the Advancement of Artificial Intelligence (AAAI), 2020)
    AAAI Conference on Artificial Intelligence
    We present a novel Bayesian method for the challenging task of estimating causal effects from passively observed data when the underlying causal DAG structure is unknown. To rigorously capture the inherent uncertainty associated with the estimate, our method builds a Bayesian posterior distribution of the linear causal effect, by integrating Bayesian linear regression and averaging over DAGs. For computing the exact posterior for all cause-effect variable pairs, we give an algorithm that runs in time O(3^d d) for d variables, being feasible up to 20 variables. We also give a variant that computes the posterior probabilities of all pairwise ancestor relations within the same time complexity, significantly improving the fastest previous algorithm. In simulations, our Bayesian method outperforms previous methods in estimation accuracy, especially for small sample sizes. We further show that our method for effect estimation is well-adapted for detecting strong causal effects markedly deviating from zero, while our variant for computing posteriors of ancestor relations is the method of choice for detecting the mere existence of a causal relation. Finally, we apply our method on observational flow cytometry data, detecting several causal relations that concur with previous findings from experimental data.
  • Vanhatalo, Jarno; Li, Zitong; Sillanpää, Mikko J. (2019)
    Motivation: Recent advances in high dimensional phenotyping bring time as an extra dimension into the phenotypes. This promotes the quantitative trait locus (QTL) studies of function-valued traits such as those related to growth and development. Existing approaches for analyzing functional traits utilize either parametric methods or semi-parametric approaches based on splines and wavelets. However, very limited choices of software tools are currently available for practical implementation of functional QTL mapping and variable selection. Results: We propose a Bayesian Gaussian process (GP) approach for functional QTL mapping. We use GPs to model the continuously varying coefficients which describe how the effects of molecular markers on the quantitative trait are changing over time. We use an efficient gradient based algorithm to estimate the tuning parameters of GPs. Notably, the GP approach is directly applicable to the incomplete datasets having even larger than 50% missing data rate (among phenotypes). We further develop a stepwise algorithm to search through the model space in terms of genetic variants, and use a minimal increase of Bayesian posterior probability as a stopping rule to focus on only a small set of putative QTL. We also discuss the connection between GP and penalized B-splines and wavelets. On two simulated and three real datasets, our GP approach demonstrates great flexibility for modeling different types of phenotypic trajectories with low computational cost. The proposed model selection approach finds the most likely QTL reliably in tested datasets.
  • Marshall, H. H.; Johnstone, R. A.; Thompson, F. J.; Nichols, H. J.; Wells, D.; Hoffman, J. I.; Kalema-Zikusoka, G.; Sanderson, J. L.; Vitikainen, E. I. K.; Blount, J. D.; Cant, M. A. (2021)
    Rawls argued that fairness in human societies can be achieved if decisions about the distribution of societal rewards are made from behind a veil of ignorance, which obscures the personal gains that result. Whether ignorance promotes fairness in animal societies, that is, the distribution of resources to reduce inequality, is unknown. Here we show experimentally that cooperatively breeding banded mongooses, acting from behind a veil of ignorance over kinship, allocate postnatal care in a way that reduces inequality among offspring, in the manner predicted by a Rawlsian model of cooperation. In this society synchronized reproduction leaves adults in a group ignorant of the individual parentage of their communal young. We provisioned half of the mothers in each mongoose group during pregnancy, leaving the other half as matched controls, thus increasing inequality among mothers and increasing the amount of variation in offspring birth weight in communal litters. After birth, fed mothers provided extra care to the offspring of unfed mothers, not their own young, which levelled up initial size inequalities among the offspring and equalized their survival to adulthood. Our findings suggest that a classic idea of moral philosophy also applies to the evolution of cooperation in biological systems. Obscuring knowledge of personal gains from individuals can theoretically maintain fairness in a cooperative group. Experiments show that wild, cooperatively breeding banded mongooses uncertain of kinship allocate postnatal care in a way that reduces inequality among offspring, suggesting a classic idea of moral philosophy can apply in biological systems.
  • Simola, Umberto; Cisewski-Kehe, Jessi; Gutmann, Michael U.; Corander, Jukka (2021)
    Approximate Bayesian Computation (ABC) methods are increasingly used for inference in situations in which the likelihood function is either computationally costly or intractable to evaluate. Extensions of the basic ABC rejection algorithm have improved the computational efficiency of the procedure and broadened its applicability. The ABC - Population Monte Carlo (ABC-PMC) approach has become a popular choice for approximate sampling from the posterior. ABC-PMC is a sequential sampler with an iteratively decreasing value of the tolerance, which specifies how close the simulated data need to be to the real data for acceptance. We propose a method for adaptively selecting a sequence of tolerances that improves the computational efficiency of the algorithm over other common techniques. In addition we define a stopping rule as a by-product of the adaptation procedure, which assists in automating termination of sampling. The proposed automatic ABC-PMC algorithm can be easily implemented and we present several examples demonstrating its benefits in terms of computational efficiency.
  • Vanhatalo, Jarno; Hartmann, Marcelo; Veneranta, Lari (2020)
    Species distribution models (SDM) are a key tool in ecology, conservation and management of natural resources. Two key components of the state-of-the-art SDMs are the description for species distribution response along environmental covariates and the spatial random effect that captures deviations from the distribution patterns explained by environmental covariates. Joint species distribution models (JSDMs) additionally include interspecific correlations which have been shown to improve their descriptive and predictive performance compared to single species models. However, current JSDMs are restricted to the hierarchical generalized linear modeling framework. Their limitation is that parametric models have trouble in explaining changes in abundance due to, for example, highly non-linear physical tolerance limits, which is particularly important when predicting species distribution in new areas or under scenarios of environmental change. On the other hand, semi-parametric response functions have been shown to improve the predictive performance of SDMs in these tasks in single species models. Here, we propose JSDMs where the responses to environmental covariates are modeled with additive multivariate Gaussian processes coded as linear models of coregionalization. These allow inference for a wide range of functional forms and interspecific correlations between the responses. We also propose an efficient approach for inference with Laplace approximation and parameterization of the interspecific covariance matrices on the Euclidean space. We demonstrate the benefits of our model with two small scale examples and one real world case study. We use cross-validation to compare the proposed model to analogous semi-parametric single species models and parametric single and joint species models in interpolation and extrapolation tasks. The proposed model outperforms the alternative models in all cases.
    We also show that the proposed model can be seen as an extension of the current state-of-the-art JSDMs to semi-parametric models.
  • Kerr, Shona M.; Klaric, Lucija; Halachev, Mihail; Hayward, Caroline; Boutin, Thibaud S.; Meynert, Alison M.; Semple, Colin A.; Tuiskula, Annukka M.; Swan, Heikki; Santoyo-Lopez, Javier; Vitart, Veronique; Haley, Chris; Dean, John; Miedzybrodzka, Zosia; Aitman, Timothy J.; Wilson, James F. (2019)
    The Viking Health Study Shetland is a population-based research cohort of 2,122 volunteer participants with ancestry from the Shetland Isles in northern Scotland. The high kinship and detailed phenotype data support a range of approaches for associating rare genetic variants, enriched in this isolate population, with quantitative traits and diseases. As an exemplar, the c.1750G > A; p.Gly584Ser variant within the coding sequence of the KCNH2 gene implicated in Long QT Syndrome (LQTS), which occurred once in 500 whole genome sequences from this population, was investigated. Targeted sequencing of the KCNH2 gene in family members of the initial participant confirmed the presence of the sequence variant and identified two further members of the same family pedigree who shared the variant. Investigation of these three related participants for whom single nucleotide polymorphism (SNP) array genotypes were available allowed a unique shared haplotype of 1.22 Mb to be defined around this locus. Searching across the full cohort for this haplotype uncovered two additional apparently unrelated individuals with no known genealogical connection to the original kindred. All five participants with the defined haplotype were shown to share the rare variant by targeted Sanger sequencing. If this result were verified in a healthcare setting, it would be considered clinically actionable, and has been actioned in relatives ascertained independently through clinical presentation. The General Practitioners of four study participants with the rare variant were alerted to the research findings by letters outlining the phenotype (prolonged electrocardiographic QTc interval). A lack of detectable haplotype sharing between c.1750G > A; p.Gly584Ser chromosomes from previously reported individuals from Finland and those in this study from Shetland suggests that this mutation has arisen more than once in human history. 
    This study showcases the potential value of isolate population-based research resources for genomic medicine. It also illustrates some challenges around communication of actionable findings in research participants in this context.
  • Simola, Umberto; Cisewski-Kehe, Jessi; Wolpert, Robert L. (2020)
    Finite mixture models are used in statistics and other disciplines, but inference for mixture models is challenging due, in part, to the multimodality of the likelihood function and the so-called label switching problem. We propose extensions of the Approximate Bayesian Computation-Population Monte Carlo (ABC-PMC) algorithm as an alternative framework for inference on finite mixture models. There are several decisions to make when implementing an ABC-PMC algorithm for finite mixture models, including the selection of the kernels used for moving the particles through the iterations, how to address the label switching problem and the choice of informative summary statistics. Examples are presented to demonstrate the performance of the proposed ABC-PMC algorithm for mixture modelling. The performance of the proposed method is evaluated in a simulation study and for the popular recessional velocity galaxy data.
  • Siren, Jukka; Lens, Luc; Cousseau, Laurence; Ovaskainen, Otso (2018)
    1. Individual-based models (IBMs) allow realistic and flexible modelling of ecological systems, but their parameterization with empirical data is statistically and computationally challenging. Approximate Bayesian computation (ABC) has been proposed as an efficient approach for inference with IBMs, but its applicability to data on natural populations has not yet been fully explored. 2. We construct an IBM for the metapopulation dynamics of a species inhabiting a fragmented patch network, and develop an ABC method for parameterization of the model. We consider several scenarios of data availability from count data to a combination of mark-recapture and genetic data. We analyse both simulated and real data on white-starred robin (Pogonocichla stellata), a passerine bird living in montane forest environment in Kenya, and assess how the amount and type of data affect the estimates of model parameters and indicators of population state. 3. The indicators of the population state could be reliably estimated using the ABC method, but full parameterization was not achieved due to strong posterior correlations between model parameters. While the combination of the data types did not provide more accurate estimates for most of the indicators of population state or model parameters than the most informative data type (ringing data or genetic data) alone, the combined data allowed robust simultaneous estimation of all unknown quantities. 4. Our results show that ABC methods provide a powerful and flexible technique for parameterizing complex IBMs with multiple data sources, and assessing the dynamics of the population in a robust manner.
  • Within-family Consortium; 23andMe Res Team; Brumpton, Ben; Sanderson, Eleanor; Heilbron, Karl; Kaprio, Jaakko; Davies, Neil M. (2020)
    Estimates from Mendelian randomization studies of unrelated individuals can be biased due to uncontrolled confounding from familial effects. Here we describe methods for within-family Mendelian randomization analyses and use simulation studies to show that family-based analyses can reduce such biases. We illustrate empirically how familial effects can affect estimates using data from 61,008 siblings from the Nord-Trøndelag Health Study and UK Biobank and replicated our findings using 222,368 siblings from 23andMe. Both Mendelian randomization estimates using unrelated individuals and within-family methods reproduced established effects of lower BMI reducing risk of diabetes and high blood pressure. However, while Mendelian randomization estimates from samples of unrelated individuals suggested that taller height and lower BMI increase educational attainment, these effects were strongly attenuated in within-family Mendelian randomization analyses. Our findings indicate the necessity of controlling for population structure and familial effects in Mendelian randomization studies. Family-based study designs have been applied to resolve confounding by population stratification, dynastic effects and assortative mating in genetic association analyses. Here, Brumpton et al. describe theory and simulations for overcoming such biases in Mendelian randomization through within-family studies.
  • Liu, Jia; Vanhatalo, Jarno (2020)
    In geostatistics, the spatiotemporal design for data collection is central for accurate prediction and parameter inference. An important class of geostatistical models is log-Gaussian Cox process (LGCP) but there are no formal analyses on spatial or spatiotemporal survey designs for them. In this work, we study traditional balanced and uniform random designs in situations where analyst has prior information on intensity function of LGCP and show that the traditional balanced and random designs are not efficient in such situations. We also propose a new design sampling method, a rejection sampling design, which extends the traditional balanced and random designs by directing survey sites to locations that are a priori expected to provide most information. We compare our proposal to the traditional balanced and uniform random designs using the expected average predictive variance (APV) loss and the expected Kullback-Leibler (KL) divergence between the prior and the posterior for the LGCP intensity function in simulation experiments and in a real world case study. The APV informs about expected accuracy of a survey design in point-wise predictions and the KL-divergence measures the expected gain in information about the joint distribution of the intensity field. The case study concerns planning a survey design for analyzing larval areas of two commercially important fish stocks on Finnish coastal region. Our experiments show that the designs generated by the proposed rejection sampling method clearly outperform the traditional balanced and uniform random survey designs. Moreover, the method is easily applicable to other models in general. (C) 2019 The Author(s). Published by Elsevier B.V.
  • PCAWG Evolution Heterogeneity Work; PCAWG Consortium; Dentro, Stefan C.; Mustonen, Ville (2021)
    Intra-tumor heterogeneity (ITH) is a mechanism of therapeutic resistance and therefore an important clinical challenge. However, the extent, origin, and drivers of ITH across cancer types are poorly understood. To address this, we extensively characterize ITH across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types. Nearly all informative samples (95.1%) contain evidence of distinct subclonal expansions with frequent branching relationships between subclones. We observe positive selection of subclonal driver mutations across most cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations as well as dynamic changes in mutational processes between subclonal expansions. Our results underline the importance of ITH and its drivers in tumor evolution and provide a pan-cancer resource of comprehensively annotated subclonal events from whole-genome sequencing data.
  • Roberts, Sean G.; Killin, Anton; Deb, Angarika; Sheard, Catherine; Greenhill, Simon J.; Sinnemäki, Kaius; Segovia Martín, José; Nölle, Jonas; Berdicevskis, Aleksandrs; Humphreys-Balkwill, Archie; Little, Hannah; Opie, Kit; Jacques, Guillaume; Bromham, Lindell; Tinits, Peeter; Ross, Robert M.; Lee, Sean; Gasser, Emily; Calladine, Jasmine; Spike, Matthew; Mann, Stephen; Shcherbakova, Olena; Singer, Ruth; Zhang, Shuya; Benítez-Burraco, Antonio; Kliesch, Christian; Thomas-Colquhoun, Ewan; Skirgård, Hedvig; Tamariz, Monica; Passmore, Sam; Pellard, Thomas; Jordan, Fiona (2020)
    Language is one of the most complex of human traits. There are many hypotheses about how it originated, what factors shaped its diversity, and what ongoing processes drive how it changes. We present the Causal Hypotheses in Evolutionary Linguistics Database (CHIELD, https://chield.excd.org/), a tool for expressing, exploring, and evaluating hypotheses. It allows researchers to integrate multiple theories into a coherent narrative, helping to design future research. We present design goals, a formal specification, and an implementation for this database. Source code is freely available for other fields to take advantage of this tool. Some initial results are presented, including identifying conflicts in theories about gossip and ritual, comparing hypotheses relating population size and morphological complexity, and an author relation network.
  • Lanne, Markku; Luoto, Jani Pentti (2018)
    We propose imposing data-driven identification constraints to alleviate the multimodality problem arising in the estimation of poorly identified dynamic stochastic general equilibrium models under non-informative prior distributions. We also devise an iterative procedure based on the posterior density of the parameters for finding these constraints. An empirical application to the Smets and Wouters model demonstrates the properties of the estimation method, and shows how the problem of multimodal posterior distributions caused by parameter redundancy is eliminated by identification constraints. Out-of-sample forecast comparisons as well as Bayes factors lend support to the constrained model.
  • Puonti, Paivi (2019)
    We apply a novel Bayesian structural vector autoregressive method to analyze the macroeconomic effects of unconventional monetary policy in Japan, the US and the euro area. The method exploits statistical properties of the data to uniquely identify the model without restrictions, and thus enables formal assessment of the plausibility of given sign restrictions. Unlike previous research, the data-based analysis reveals differences in the output and price effects of the Bank of Japan's, Federal Reserve's and European Central Bank's balance sheet operations.
  • Marttinen, Pekka; Hanage, William P.; Croucher, Nicholas J.; Connor, Thomas R.; Harris, Simon R.; Bentley, Stephen D.; Corander, Jukka (2012)
  • Foster, Scott D.; Vanhatalo, Jarno; Trenkel, Verena M.; Schulz, Torsti; Lawrence, Emma; Przeslawski, Rachel; Hosack, Geoffrey (2021)
    Data are currently being used, and reused, in ecological research at an unprecedented rate. To ensure appropriate reuse however, we need to ask the question: "Are aggregated databases currently providing the right information to enable effective and unbiased reuse?" We investigate this question, with a focus on designs that purposefully favor the selection of sampling locations (upweighting the probability of selection of some locations). These designs are common and examples are those designs that have uneven inclusion probabilities or are stratified. We perform a simulation experiment by creating data sets with progressively more uneven inclusion probabilities and examine the resulting estimates of the average number of individuals per unit area (density). The effect of ignoring the survey design can be profound, with biases of up to 250% in density estimates when naive analytical methods are used. This density estimation bias is not reduced by adding more data. Fortunately, the estimation bias can be mitigated by using an appropriate estimator or an appropriate model that incorporates the design information. These are only available however, when essential information about the survey design is available: the sample location selection process (e.g., inclusion probabilities), and/or covariates used in their specification. The results suggest that such information must be stored and served with the data to support meaningful inference and data reuse.
  • Honkela, Antti; Das, Mrinal; Nieminen, Arttu; Dikmen, Onur; Kaski, Samuel (2018)
    Background: Users of a personalised recommendation system face a dilemma: recommendations can be improved by learning from data, but only if other users are willing to share their private information. Good personalised predictions are vitally important in precision medicine, but genomic information on which the predictions are based is also particularly sensitive, as it directly identifies the patients and hence cannot easily be anonymised. Differential privacy has emerged as a potentially promising solution: privacy is considered sufficient if presence of individual patients cannot be distinguished. However, differentially private learning with current methods does not improve predictions with feasible data sizes and dimensionalities. Results: We show that useful predictors can be learned under powerful differential privacy guarantees, and even from moderately-sized data sets, by demonstrating significant improvements in the accuracy of private drug sensitivity prediction with a new robust private regression method. Our method matches the predictive accuracy of the state-of-the-art non-private lasso regression using only 4x more samples under relatively strong differential privacy guarantees. Good performance with limited data is achieved by limiting the sharing of private information by decreasing the dimensionality and by projecting outliers to fit tighter bounds, therefore needing to add less noise for equal privacy. Conclusions: The proposed differentially private regression method combines theoretical appeal and asymptotic efficiency with good prediction accuracy even with moderate-sized data. As already the simple-to-implement method shows promise on the challenging genomic data, we anticipate rapid progress towards practical applications in many fields.
  • Nene, Nuno R.; Mustonen, Ville; Illingworth, Christopher J. R. (2018)
    The Wright-Fisher model is the most popular population model for describing the behaviour of evolutionary systems with a finite population size. Approximations have commonly been used but the model itself has rarely been tested against time-resolved genomic data. Here, we evaluate the extent to which it can be inferred as the correct model under a likelihood framework. Given genome-wide data from an evolutionary experiment, we validate the Wright-Fisher drift model as the better option for describing evolutionary trajectories in a finite population. This was found by evaluating its performance against a Gaussian model of allele frequency propagation. However, we note a range of circumstances under which standard Wright-Fisher drift cannot be correctly identified. (C) 2017 The Author(s). Published by Elsevier Ltd.
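
For readers unfamiliar with the two models compared in the last entry above, the following is a minimal sketch of one way to simulate neutral allele-frequency trajectories. It is not the authors' code. The function names, the parameters and the specific Gaussian variance p(1-p)/N are illustrative assumptions. Under Wright-Fisher drift, each of the N gene copies in the next generation is drawn independently at the current frequency p (binomial sampling). The Gaussian alternative shown here replaces that step with a normal draw of matching mean and variance, clipped to [0, 1].

```python
import math
import random


def wright_fisher_drift(n_copies, p0, n_generations, rng):
    """Neutral Wright-Fisher drift: binomial resampling each generation."""
    p = p0
    freqs = [p]
    for _ in range(n_generations):
        # Each gene copy independently carries the allele with probability p.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        freqs.append(p)
    return freqs


def gaussian_drift(n_copies, p0, n_generations, rng):
    """Assumed Gaussian approximation: normal step with variance p(1-p)/N."""
    p = p0
    freqs = [p]
    for _ in range(n_generations):
        sd = math.sqrt(p * (1.0 - p) / n_copies)
        # Clip so the frequency stays a valid proportion.
        p = min(1.0, max(0.0, rng.gauss(p, sd)))
        freqs.append(p)
    return freqs


rng = random.Random(0)  # fixed seed so the run is repeatable
wf_traj = wright_fisher_drift(n_copies=100, p0=0.5, n_generations=50, rng=rng)
ga_traj = gaussian_drift(n_copies=100, p0=0.5, n_generations=50, rng=rng)
```

In both sketches a frequency of 0 or 1 is absorbing: the binomial step can no longer change it, and the Gaussian step has zero variance there. The two models differ most at small N or near the boundaries, where the discrete binomial step is poorly approximated by a clipped normal.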