Browsing by Subject "Gaussian process"

  • Perälä, Tommi; Vanhatalo, Jarno; Chrysafi, Anna (2020)
    Expert assessments are routinely used to inform management and other decision making. However, these assessments often contain considerable biases and uncertainties, so they should be calibrated where possible. Moreover, coherently combining multiple expert assessments into one estimate is a long-standing problem in statistics, since modeling expert knowledge is often difficult. Here, we present a hierarchical Bayesian model for expert calibration in the task of estimating a continuous univariate parameter. The model allows experts' biases to vary as a function of the true value of the parameter and according to the expert's background. We follow the fully Bayesian (so-called supra-Bayesian) approach and model experts' bias functions explicitly using hierarchical Gaussian processes. We show how to use calibration data to infer the experts' observation models via the bias functions and to calculate bias-corrected posterior distributions for an unknown system parameter of interest. We demonstrate and test our model and methods with simulated data and a real case study on data-limited fisheries stock assessment. The case study results show that experts' biases vary with respect to the true system parameter value and that calibrating the expert assessments improves inference compared to using uncalibrated expert assessments or a vague uniform guess. Moreover, the bias functions in the real case study reveal important differences in the reliability of the alternative experts. The model and methods presented here can also be applied straightforwardly to applications beyond our case study.
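    A minimal sketch of the supra-Bayesian idea in code, assuming a single expert, a squared-exponential kernel with fixed hyperparameters, Gaussian noise, and a uniform prior on the parameter; all names and values below are invented for illustration and this is not the authors' implementation.

    ```python
    # Toy supra-Bayesian expert calibration with a GP bias function.
    # Assumptions (not from the paper): one expert, fixed kernel
    # hyperparameters, Gaussian noise, uniform prior on theta.
    import numpy as np

    def sq_exp(x1, x2, ell=2.0, sf=1.0):
        """Squared-exponential covariance between two 1-D input vectors."""
        d = x1[:, None] - x2[None, :]
        return sf**2 * np.exp(-0.5 * (d / ell)**2)

    # Calibration data: true parameter values and the expert's assessments.
    rng = np.random.default_rng(0)
    theta_cal = np.linspace(0.0, 10.0, 15)
    assess = theta_cal + 1.5 * np.sin(theta_cal / 3.0) + rng.normal(0, 0.3, 15)
    bias_obs = assess - theta_cal              # observed bias at each true value

    # Standard GP regression gives the posterior of the bias function b(theta).
    sigma_n = 0.3
    grid = np.linspace(0.0, 10.0, 200)
    K = sq_exp(theta_cal, theta_cal) + sigma_n**2 * np.eye(len(theta_cal))
    Ks = sq_exp(grid, theta_cal)
    bias_mean = Ks @ np.linalg.solve(K, bias_obs)
    bias_var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))  # prior var sf**2 = 1

    # Bias-corrected posterior for theta given a new assessment y_new:
    # y_new ~ N(theta + b(theta), sigma_n^2), uniform prior over the grid.
    y_new = 6.2
    sd = np.sqrt(bias_var + sigma_n**2)
    logp = -0.5 * ((y_new - (grid + bias_mean)) / sd)**2 - np.log(sd)
    post = np.exp(logp - logp.max())
    dx = grid[1] - grid[0]
    post /= post.sum() * dx
    print("bias-corrected posterior mean:", (grid * post).sum() * dx)
    ```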
  • Tikhonov, Gleb; Duan, Li; Abrego, Nerea; Newell, Graeme; White, Matt; Dunson, David; Ovaskainen, Otso (2020)
    The ongoing global change and the increased interest in macroecological processes call for the analysis of spatially extensive data on species communities to understand and forecast distributional changes of biodiversity. Recently developed joint species distribution models can deal with numerous species efficiently, while explicitly accounting for spatial structure in the data. However, their applicability is generally limited to relatively small spatial data sets because of their severe computational scaling as the number of spatial locations increases. In this work, we propose a practical alleviation of this scalability constraint for joint species modeling by exploiting two spatial-statistics techniques that facilitate the analysis of large spatial data sets: the Gaussian predictive process and the nearest-neighbor Gaussian process. We devised an efficient Gibbs posterior sampling algorithm for Bayesian model fitting that allows us to analyze community data sets consisting of hundreds of species sampled from up to hundreds of thousands of spatial units. The performance of these methods is demonstrated using an extensive plant data set of 30,955 spatial units as a case study. We provide an implementation of the presented methods as an extension to the Hierarchical Modelling of Species Communities (HMSC) framework.
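    The scalability argument rests on replacing the full n x n spatial covariance with a low-rank, knot-based approximation. Here is a hedged toy sketch of the Gaussian predictive process step, assuming an exponential covariance and invented dimensions; it is not the HMSC extension itself.

    ```python
    # Toy Gaussian predictive process: replace a dense n x n spatial covariance
    # with a low-rank approximation built from m << n knots. Invented sizes and
    # kernel; not the HMSC-extension code.
    import numpy as np

    def exp_cov(a, b, ell=0.2):
        """Exponential spatial covariance between 2-D coordinate arrays."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return np.exp(-d / ell)

    rng = np.random.default_rng(1)
    n, m = 5000, 50                                  # spatial units vs. knots
    sites = rng.uniform(size=(n, 2))
    knots = rng.uniform(size=(m, 2))

    C_kk = exp_cov(knots, knots) + 1e-8 * np.eye(m)  # m x m knot covariance
    C_sk = exp_cov(sites, knots)                     # n x m cross-covariance

    # Draw the latent spatial field as w = C_sk C_kk^{-1} w_knots with
    # w_knots ~ N(0, C_kk); cost is O(n m^2) rather than O(n^3).
    w_knots = np.linalg.cholesky(C_kk) @ rng.normal(size=m)
    w = C_sk @ np.linalg.solve(C_kk, w_knots)
    print(w.shape)                                   # (5000,) latent values
    ```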
  • Siivola, Eero; Vehtari, Aki; Vanhatalo, Jarno; Gonzalez, Javier; Andersen, Michael (IEEE, 2018)
    IEEE International Workshop on Machine Learning for Signal Processing
    Bayesian optimization (BO) is a global optimization strategy designed to find the minimum of an expensive black-box function, typically defined on a compact subset of ℝ^d, by using a Gaussian process (GP) as a surrogate model for the objective. Although currently available acquisition functions address this goal with different degrees of success, over-exploration of the boundary of the search space is typically observed. However, in problems such as the configuration of machine learning algorithms, the function domain is conservatively large, and with high probability the global minimum does not sit on the boundary of the domain. We propose a method to incorporate this knowledge into the search process by adding virtual derivative observations to the GP at the boundary of the search space. We use the properties of GPs to impose conditions on the partial derivatives of the objective. The method is applicable with any acquisition function; it is easy to use and consistently reduces the number of evaluations required to optimize the objective, irrespective of the acquisition function used. We illustrate the benefits of our approach in an extensive experimental comparison.
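    The key device is that derivative observations enter a GP through the kernel's analytic derivatives. Below is a minimal 1-D sketch, assuming a squared-exponential kernel and a fixed pseudo-observation f'(0) = +1 at the left boundary; this simplifies the paper's construction, and all values are invented.

    ```python
    # Toy 1-D GP with a virtual derivative observation at the boundary.
    # The pseudo-observation f'(0) = +1 (slope pointing into the domain)
    # stands in for the paper's boundary condition; values are invented.
    import numpy as np

    ell = 1.0
    k   = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell)**2)
    # Analytic kernel derivatives: cov(f'(a), f(b)) and cov(f'(a), f'(b)).
    k10 = lambda a, b: -((a[:, None] - b[None, :]) / ell**2) * k(a, b)
    k11 = lambda a, b: (1.0 / ell**2 - (a[:, None] - b[None, :])**2 / ell**4) * k(a, b)

    X  = np.array([0.3, 0.7, 1.5])        # ordinary function observations
    y  = np.array([0.9, 0.2, 0.6])
    Xd = np.array([0.0])                  # virtual derivative location (boundary)
    yd = np.array([1.0])                  # inward-pointing slope pseudo-observation

    # Joint covariance of the stacked observation vector [f(X), f'(Xd)].
    K = np.block([[k(X, X),    k10(Xd, X).T],
                  [k10(Xd, X), k11(Xd, Xd)]]) + 1e-6 * np.eye(4)
    obs = np.concatenate([y, yd])

    xs  = np.linspace(0.0, 2.0, 100)
    K_s = np.hstack([k(xs, X), k10(Xd, xs).T])   # cov(f(xs), [f(X), f'(Xd)])
    mu  = K_s @ np.linalg.solve(K, obs)          # posterior mean honours the slope
    print(mu[:5])
    ```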
  • Topa, Hande; Honkela, Antti (2018)
    Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under the two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and the temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of the temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
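    The ranking criterion can be illustrated compactly: score each series under a time-dependent and a time-independent GP and take the Bayes factor, with pre-processing variances entering as per-observation noise. The sketch below uses invented data and fixed hyperparameters in Python; it is not the GPrank R code.

    ```python
    # Toy GPrank-style ranking: score one series under a time-dependent GP
    # (RBF kernel) and a time-independent GP (white/constant model), then take
    # the log Bayes factor. Data and hyperparameters are invented.
    import numpy as np

    def log_marginal(y, K):
        """Log marginal likelihood of y ~ N(0, K) via a Cholesky factor."""
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L, y)
        return -0.5 * a @ a - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

    t  = np.array([0.0, 1.0, 2.0, 4.0, 8.0])       # short, irregular time points
    y  = np.array([0.1, 0.8, 1.2, 1.0, -0.2])      # centred expression values
    s2 = np.array([0.05, 0.04, 0.06, 0.05, 0.07])  # variances from quantification

    rbf     = np.exp(-0.5 * (t[:, None] - t[None, :])**2 / 2.0**2)
    K_dep   = rbf + np.diag(s2)                       # time-dependent model
    K_indep = y.var() * np.eye(len(t)) + np.diag(s2)  # time-independent model

    log_bf = log_marginal(y, K_dep) - log_marginal(y, K_indep)
    print(f"log Bayes factor (dynamic vs. static): {log_bf:.2f}")
    ```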
  • Vanhatalo, Jarno; Foster, Scott D.; Hosack, Geoffrey R. (2021)
    The categorization of multidimensional data into clusters is a common task in statistics. Many applications of clustering, including the majority of tasks in ecology, use data that are inherently spatial and often also temporal. However, spatiotemporal dependence is typically ignored when clustering multivariate data. We present a finite mixture model for spatial and spatiotemporal clustering that incorporates spatial and spatiotemporal autocorrelation by including appropriate Gaussian processes (GPs) in the model for the mixing proportions. We also allow for flexible and semiparametric dependence on environmental covariates, once again using GPs. We propose to use Bayesian inference through three tiers of approximate methods: a Laplace approximation that allows efficient analysis of large datasets, and both partial and full Markov chain Monte Carlo (MCMC) approaches that improve accuracy at the cost of increased computation time. Comparison of the methods shows that the Laplace approximation is a useful alternative to the MCMC methods. A decadal analysis of 253 species of teleost fish from 854 samples collected along the biodiverse northwestern continental shelf of Australia between 1986 and 1997 shows the added clarity provided by accounting for spatial autocorrelation. For these data, the temporal dependence is comparatively small, which is an important finding given the changing human pressures over this time.
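    The core modeling idea, GP-distributed mixing proportions, can be sketched as a toy forward simulation: one latent GP per cluster is passed through a softmax, so nearby sites prefer the same cluster. The sketch assumes an exponential covariance and invented sizes, and omits the Laplace/MCMC inference entirely.

    ```python
    # Toy forward simulation of GP-distributed mixing proportions: one latent
    # GP per cluster, softmaxed so that nearby sites prefer the same cluster.
    # Sizes and kernel are invented; Laplace/MCMC inference is omitted.
    import numpy as np

    rng = np.random.default_rng(2)
    n, n_clusters = 400, 3
    coords = rng.uniform(size=(n, 2))

    # Spatial covariance shared by the cluster-specific latent GPs.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    K = np.exp(-d / 0.3) + 1e-6 * np.eye(n)
    L = np.linalg.cholesky(K)

    f = L @ rng.normal(size=(n, n_clusters))       # latent GP draw per cluster
    p = np.exp(f - f.max(axis=1, keepdims=True))   # softmax -> smooth proportions
    p /= p.sum(axis=1, keepdims=True)

    # Each site's cluster label is a draw from its local mixing proportions.
    labels = np.array([rng.choice(n_clusters, p=pi) for pi in p])
    print(np.bincount(labels, minlength=n_clusters))
    ```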