Browsing by Subject "R"

  • Fischer, Daniel; Mosler, Karl; Mottonen, Jyrki; Nordhausen, Klaus; Pokotylo, Oleksii; Vogel, Daniel (2020)
    The Oja median is one of several extensions of the univariate median to the multivariate case. It has many desirable properties, but is computationally demanding. In this paper, we first review the properties of the Oja median and compare it to other multivariate medians. Then, we discuss four algorithms to compute the Oja median, which are implemented in our R package OjaNP. Besides these algorithms, the package also contains functions to compute Oja signs, Oja signed ranks, Oja ranks, and the related scatter concepts. To illustrate their use, the corresponding multivariate one- and C-sample location tests are implemented.
  • Topa, Hande; Honkela, Antti (2018)
    Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
  • Topa, Hande; Honkela, Antti (BioMed Central, 2018)
    Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
  • Larsson, Aron (Helsingin yliopisto, 2021)
    Fish stock assessment is very resource- and labor-intensive, and stock assessment models have historically been based on data that cause a model to overestimate the strength of a population, sometimes with drastic consequences. The need for cost-effective assessment models and approaches is increasing, which is why I looked into Bayesian modeling and networks, an approach not often used in fisheries science. I wanted to determine whether it could be used to predict both recruitment and spawning stock biomass of four fish species in the North Atlantic (cod, haddock, pollock and capelin) based on no evidence other than the recruitment or biomass data of the other species, and whether these results could be used to lower the uncertainties of fish stock models. I used data available in the RAM Legacy database to produce four different models with the statistical software R, based on four different Bayes algorithms found in the R package bnlearn, two based on continuous data and two based on discrete data. I found that there is much potential in the Bayesian approach to stock prediction and forecasting, as the prediction error ranged between 1 and 40 percent. The best predictions were made when the species used as evidence had a high correlation coefficient with the target species, as was the case with cod and haddock biomass, which had an unusually high correlation of 0.96. As such, this approach could be used to make preliminary models of interactions between a large number of species in a specific area where data are abundantly available, and these models could be used to lower the uncertainties of the stock assessments. However, more research into the applicability of this approach to other species and areas needs to be conducted.
  • Wailan, Alexander M.; Coll, Francesc; Heinz, Eva; Tonkin-Hill, Gerry; Corander, Jukka; Feasey, Nicholas A.; Thomson, Nicholas R. (2019)
    The ability to distinguish different circulating pathogen clones from each other is a fundamental requirement to understand the epidemiology of infectious diseases. Phylogenetic analysis of genomic data can provide a powerful platform to identify lineages within bacterial populations, and thus inform outbreak investigation and transmission dynamics. However, resolving differences between pathogens associated with low-variant (LV) populations carrying low median pairwise single nucleotide variant (SNV) distances remains a major challenge. Here we present rPinecone, an R package designed to define sub-lineages within closely related LV populations. rPinecone uses a root-to-tip directional approach to define sub-lineages within a phylogenetic tree according to SNV distance from the ancestral node. The utility of this software was demonstrated using both simulated outbreaks and real genomic data of two LV populations: a hospital outbreak of methicillin-resistant Staphylococcus aureus and endemic Salmonella Typhi from rural Cambodia. rPinecone identified the transmission branches of the hospital outbreak and geographically confined lineages in Cambodia. Sub-lineages identified by rPinecone in both analyses were phylogenetically robust. It is anticipated that rPinecone can be used to discriminate between lineages of bacteria from LV populations where other methods fail, enabling a deeper understanding of infectious disease epidemiology for public health purposes.
  • Lange, Alexander; Dalheimer, Bernhard; Herwartz, Helmut; Maxand, Simone (2021)
    Structural vector autoregressive (SVAR) models are frequently applied to trace the contemporaneous linkages among (macroeconomic) variables back to an interplay of orthogonal structural shocks. Under Gaussianity the structural parameters are unidentified without additional (often external and not data-based) information. In contrast, the often reasonable assumption of heteroskedastic and/or non-Gaussian model disturbances offers the possibility to identify unique structural shocks. We describe the R package svars which implements statistical identification techniques that can be either heteroskedasticity-based or independence-based. Moreover, it includes a rich variety of analysis tools that are well known in the SVAR literature. Next to a comprehensive review of the theoretical background, we provide a detailed description of the associated R functions. Furthermore, a macroeconomic application serves as a step-by-step guide on how to apply these functions to the identification and interpretation of structural VAR models.
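
For readers unfamiliar with the Oja median mentioned in the first entry, the following is a minimal Python sketch of the idea in two dimensions: the Oja median is a point that minimises the summed areas of the triangles it forms with every pair of data points. The brute-force grid search and the function names (`triangle_area`, `oja_objective`, `oja_median_grid`) are illustrative assumptions only; this is not one of the four algorithms discussed in the paper, and it does not use the OjaNP package.

```python
from itertools import combinations


def triangle_area(p, q, r):
    """Area of the triangle with vertices p, q, r (2D points as tuples)."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2.0


def oja_objective(x, points):
    """Sum of triangle areas formed by x and every pair of data points."""
    return sum(triangle_area(x, p, q) for p, q in combinations(points, 2))


def oja_median_grid(points, steps=41):
    """Approximate the 2D Oja median by brute force on a regular grid.

    Hypothetical helper for illustration: it evaluates the objective on a
    steps-by-steps grid over the bounding box of the data and returns the
    grid point with the smallest value. The cost grows quickly with the
    number of points, which hints at why the exact problem is demanding.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    candidates = [
        (x_lo + (x_hi - x_lo) * i / (steps - 1),
         y_lo + (y_hi - y_lo) * j / (steps - 1))
        for i in range(steps)
        for j in range(steps)
    ]
    return min(candidates, key=lambda c: oja_objective(c, points))


if __name__ == "__main__":
    # Four corners of a square: by symmetry the minimiser is the centre.
    square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    print(oja_median_grid(square))
```

Because the grid only approximates the minimiser, the result is accurate to the grid spacing at best; an exact method would need to search the intersections of lines through pairs of data points instead.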