Browsing by Subject "REGRESSION"

  • Manninen, Antti J.; O'Connor, Ewan J.; Vakkari, Ville; Petäjä, Tuukka (2016)
    Current commercially available Doppler lidars provide an economical and robust solution for measuring vertical and horizontal wind velocities, together with the ability to provide co- and cross-polarised backscatter profiles. The high temporal resolution of these instruments allows turbulent properties to be obtained from studying the variation in radial velocities. However, the instrument specifications mean that certain characteristics, especially the background noise behaviour, become a limiting factor for the instrument sensitivity in regions where the aerosol load is low. Turbulent calculations require an accurate estimate of the contribution from velocity uncertainty estimates, which are directly related to the signal-to-noise ratio. Any bias in the signal-to-noise ratio will propagate through as a bias in turbulent properties. In this paper we present a method to correct for artefacts in the background noise behaviour of commercially available Doppler lidars and reduce the signal-to-noise ratio threshold used to discriminate between noise, and cloud or aerosol signals. We show that, for Doppler lidars operating continuously at a number of locations in Finland, the data availability can be increased by as much as 50% after performing this background correction and subsequent reduction in the threshold. The reduction in bias also greatly improves subsequent calculations of turbulent properties in weak signal regimes.
  • Syrjälä, Essi; Nevalainen, Jaakko; Peltonen, Jaakko; Takkinen, Hanna-Mari; Hakola, Leena; Åkerlund, Mari; Veijola, Riitta; Ilonen, Jorma; Toppari, Jorma; Knip, Mikael; Virtanen, Suvi M. (2019)
    Several dietary factors have been suspected to play a role in the development of advanced islet autoimmunity (IA) and/or type 1 diabetes (T1D), but the evidence is fragmentary. A prospective population-based cohort of 6081 Finnish newborn infants with HLA-DQB1-conferred susceptibility to T1D was followed up to 15 years of age. Diabetes-associated autoantibodies and diet were assessed at 3- to 12-month intervals. We aimed to study the association between consumption of selected foods and the development of advanced IA longitudinally with Cox regression models (CRM), basic joint models (JM) and joint latent class mixed models (JLCMM). The associations of these foods with T1D risk were also studied to investigate consistency between alternative endpoints. The JM showed a marginal association between meat consumption and advanced IA: the hazard ratio adjusted for selected confounding factors was 1.06 (95% CI: 1.00, 1.12). The JLCMM identified two classes in the consumption trajectories of fish and a marginal protective association for high consumers compared to low consumers: the adjusted hazard ratio was 0.68 (0.44, 1.05). Similar findings were obtained for T1D risk with adjusted hazard ratios of 1.13 (1.02, 1.24) for meat and 0.45 (0.23, 0.86) for fish consumption. Estimates from the CRMs were closer to unity and CIs were narrower compared to the JMs. Findings indicate that intake of meat might be directly and fish inversely associated with the development of advanced IA and T1D, and that disease hazards in longitudinal nutritional epidemiology are more appropriately modeled by joint models than with naive approaches.
  • Alghamdi, Mansour A.; Al-Hunaiti, Afnan; Arar, Sharif; Khoder, Mamdouh; Abdelmaksoud, Ahmad S.; Al-Jeelani, Hisham; Lihavainen, Heikki; Hyvärinen, Antti; Shabbaj, Ibrahim I.; Almehmadi, Fahd M.; Zaidan, Martha A.; Hussein, Tareq; Dada, Lubna (2019)
    Ground level ozone (O3) plays an important role in controlling the oxidation budget in the boundary layer and thus affects the environment and causes severe health disorders. Ozone gas, being one of the well-known greenhouse gases, although present in small quantities, contributes to global warming. In this study, we present a predictive model for the steady-state ozone concentrations during daytime (13:00-17:00) and nighttime (01:00-05:00) at an urban coastal site. The model is based on a modified approach of the null cycle of O3 and NOx and was evaluated against a one-year database of O3 and nitrogen oxides (NO and NO2) measured at an urban coastal site in Jeddah, on the west coast of Saudi Arabia. The model for daytime concentrations was found to be linearly dependent on the concentration ratio of NO2 to NO whereas that for the nighttime period was suggested to be inversely proportional to NO2 concentrations. Knowing that reactions involved in tropospheric O3 formation are very complex, this proposed model provides reasonable predictions for the daytime and nighttime concentrations. Since the current description of the model is solely based on the null cycle of O3 and NOx, other precursors could be considered in future development of this model. This study will serve as a basis for future studies that might inform strategies to control ground level O3 concentrations, as well as its precursors' emissions.
  • He, Liang; Pitkaniemi, Janne; Silventoinen, Karri; Sillanpaa, Mikko J. (2017)
    Estimating dynamic effects of age on the genetic and environmental variance components in twin studies may contribute to the investigation of gene-environment interactions, and may provide more insights into more accurate and powerful estimation of heritability. Existing parametric models for estimating dynamic variance components suffer from various drawbacks such as limitation of predefined functions. We present ACEt, an R package for fast estimation of dynamic variance components and heritability that may change with respect to age or other moderators. Building on the twin models using penalized splines, ACEt provides a unified framework to incorporate a class of ACE models, in which each component can be modeled independently and is not limited by a linear or quadratic function. We demonstrate that ACEt is robust against misspecification of the number of spline knots, and offers a refined resolution of dynamic behavior of the genetic and environmental components and thus a detailed estimation of age-specific heritability. Moreover, we develop resampling methods for testing twin models with different variance functions including splines, log-linearity and constancy, which can be easily employed to verify various model assumptions. We evaluated the type I error rate and statistical power of the proposed hypothesis testing procedures under various scenarios using simulated datasets. Potential numerical issues and computational cost were also assessed through simulations. We applied the ACEt package to a Finnish twin cohort to investigate age-specific heritability of body mass index and height. Our results show that the age-specific variance components of these two traits exhibited substantially different patterns despite comparable estimates of heritability. In summary, the ACEt R package offers a useful tool for the exploration of age-dependent heritability and model comparison in twin studies.
  • Cardoso, Pedro; Branco, Vasco V.; Borges, Paulo A.; Carvalho, Jose C.; Rigal, Francois; Gabriel, Rosalina; Mammola, Stefano; Cascalho, Jose; Correia, Luis (2020)
    Ecological systems are the quintessential complex systems, involving numerous high-order interactions and non-linear relationships. The most used statistical modeling techniques can hardly accommodate the complexity of ecological patterns and processes. Finding hidden relationships in complex data is now possible using massive computational power, particularly by means of artificial intelligence and machine learning methods. Here we explored the potential of symbolic regression (SR), commonly used in other areas, in the field of ecology. Symbolic regression searches for both the formal structure of equations and the fitting parameters simultaneously, hence providing the required flexibility to characterize complex ecological systems. Although the method here presented is automated, it is part of a collaborative human-machine effort and we demonstrate ways to do it. First, we test the robustness of SR to extreme levels of noise when searching for the species-area relationship. Second, we demonstrate how SR can model species richness and spatial distributions. Third, we illustrate how SR can be used to find general models in ecology, namely new formulas for species richness estimators and the general dynamic model of oceanic island biogeography. We propose that evolving free-form equations purely from data, often without prior human inference or hypotheses, may represent a very powerful tool for ecologists and biogeographers to become aware of hidden relationships and suggest general theoretical models and principles.
  • Khan, Suleiman A.; Leppaaho, Eemeli; Kaski, Samuel (2016)
    We introduce Bayesian multi-tensor factorization, a model that is the first Bayesian formulation for joint factorization of multiple matrices and tensors. The research problem generalizes the joint matrix-tensor factorization problem to arbitrary sets of tensors of any depth, including matrices, can be interpreted as unsupervised multi-view learning from multiple data tensors, and can be generalized to relax the usual trilinear tensor factorization assumptions. The result is a factorization of the set of tensors into factors shared by any subsets of the tensors, and factors private to individual tensors. We demonstrate the performance against existing baselines in multiple tensor factorization tasks in structural toxicogenomics and functional neuroimaging.
  • Peltola, Tomi; Marttinen, Pekka; Jula, Antti; Salomaa, Veikko; Perola, Markus; Vehtari, Aki (2012)
  • Guo, Qi; Burgess, Stephen; Turman, Constance; Bolla, Manjeet K.; Wang, Qin; Lush, Michael; Abraham, Jean; Aittomäki, Kristiina; Andrulis, Irene L.; Apicella, Carmel; Arndt, Volker; Barrdahl, Myrto; Benitez, Javier; Berg, Christine D.; Blomqvist, Carl; Bojesen, Stig E.; Bonanni, Bernardo; Brand, Judith S.; Brenner, Hermann; Broeks, Annegien; Burwinkel, Barbara; Caldas, Carlos; Campa, Daniele; Canzian, Federico; Chang-Claude, Jenny; Chanock, Stephen J.; Chin, Suet-Feung; Couch, Fergus J.; Cox, Angela; Cross, Simon S.; Cybulski, Cezary; Czene, Kamila; Darabi, Hatef; Devilee, Peter; Diver, W. Ryan; Dunning, Alison M.; Earl, Helena M.; Eccles, Diana M.; Ekici, Arif B.; Eriksson, Mikael; Evans, D. Gareth; Fasching, Peter A.; Figueroa, Jonine; Flesch-Janys, Dieter; Flyger, Henrik; Gapstur, Susan M.; Gaudet, Mia M.; Giles, Graham G.; Muranen, Taru A.; Nevanlinna, Heli; kConFab AOCS Investigators (2017)
    There is increasing evidence that elevated body mass index (BMI) is associated with reduced survival for women with breast cancer. However, the underlying reasons remain unclear. We conducted a Mendelian randomization analysis to investigate a possible causal role of BMI in survival from breast cancer. We used individual-level data from six large breast cancer case-cohorts including a total of 36 210 individuals (2475 events) of European ancestry. We created a BMI genetic risk score (GRS) based on genotypes at 94 known BMI-associated genetic variants. Association between the BMI genetic score and breast cancer survival was analysed by Cox regression for each study separately. Study-specific hazard ratios were pooled using fixed-effect meta-analysis. BMI genetic score was found to be associated with reduced breast cancer-specific survival for estrogen receptor (ER)-positive cases [hazard ratio (HR) = 1.11, per one-unit increment of GRS, 95% confidence interval (CI) 1.01-1.22, P = 0.03]. We observed no association for ER-negative cases (HR = 1.00, per one-unit increment of GRS, 95% CI 0.89-1.13, P = 0.95). Our findings suggest a causal effect of increased BMI on reduced breast cancer survival for ER-positive breast cancer. There is no evidence of a causal effect of higher BMI on survival for ER-negative breast cancer cases.
  • Picazo, Felix; Vilmi, Annika; Aalto, Juha; Soininen, Janne; Casamayor, Emilio O.; Liu, Yongqin; Wu, Qinglong; Ren, Lijuan; Zhou, Jizhong; Shen, Ji; Wang, Jianjun (2020)
    Background Understanding the large-scale patterns of microbial functional diversity is essential for anticipating climate change impacts on ecosystems worldwide. However, studies of functional biogeography remain scarce for microorganisms, especially in freshwater ecosystems. Here we study 15,289 functional genes of stream biofilm microbes along three elevational gradients in Norway, Spain and China. Results We find that alpha diversity declines towards high elevations and assemblage composition shows increasing turnover with greater elevational distances. These elevational patterns are highly consistent across mountains, kingdoms and functional categories and exhibit the strongest trends in China due to its largest environmental gradients. Across mountains, functional gene assemblages differ in alpha diversity and composition between the mountains in Europe and Asia. Climate, such as mean temperature of the warmest quarter or mean precipitation of the coldest quarter, is the best predictor of alpha diversity and assemblage composition at both mountain and continental scales, with local non-climatic predictors gaining more importance at mountain scale. Under future climate, we project substantial variations in alpha diversity and assemblage composition across the Eurasian river network, primarily occurring in northern and central regions, respectively. Conclusions We conclude that climate controls microbial functional gene diversity in streams at large spatial scales; therefore, the underlying ecosystem processes are highly sensitive to climate variations, especially at high latitudes. This biogeographical framework for microbial functional diversity serves as a baseline to anticipate ecosystem responses and biogeochemical feedback to ongoing climate change.
  • van der Wal, Jessica E. M.; Thorogood, Rose; Horrocks, Nicholas P. C. (2021)
    Collaboration and diversity are increasingly promoted in science. Yet how collaborations influence academic career progression, and whether this differs by gender, remains largely unknown. Here, we use co-authorship ego networks to quantify collaboration behaviour and career progression of a cohort of contributors to biennial International Society of Behavioral Ecology meetings (1992, 1994, 1996). Among this cohort, women were slower and less likely to become a principal investigator (PI; approximated by having at least three last-author publications) and published fewer papers over fewer years (i.e. had shorter academic careers) than men. After adjusting for publication number, women also had fewer collaborators (lower adjusted network size) and published fewer times with each co-author (lower adjusted tie strength), albeit more often with the same group of collaborators (higher adjusted clustering coefficient). Authors with stronger networks were more likely to become a PI, and those with less clustered networks did so more quickly. Women, however, showed a stronger positive relationship with adjusted network size (increased career length) and adjusted tie strength (increased likelihood to become a PI). Finally, early-career network characteristics correlated with career length. Our results suggest that large and varied collaboration networks are positively correlated with career progression, especially for women.
  • Fung, Pak L.; Zaidan, Martha A.; Timonen, Hilkka; Niemi, Jarkko V.; Kousa, Anu; Kuula, Joel; Luoma, Krista; Tarkoma, Sasu; Petäjä, Tuukka; Kulmala, Markku; Hussein, Tareq (2021)
    Air quality prediction with black-box (BB) modelling is gaining widespread interest in research and industry. Data-driven models of this type generally perform better in terms of accuracy, but they are limited in their ability to capture physical, chemical and meteorological processes and are therefore harder to interpret. In this paper, we evaluated different white-box (WB) and BB methods that estimate atmospheric black carbon (BC) concentration by a suite of observations from the same measurement site. This study involves data in the period of 1st January 2017–31st December 2018 from two measurement sites, from a street canyon site in Mäkelänkatu and from an urban background site in Kumpula, in Helsinki, Finland. At the street canyon site, WB models performed (R² = 0.81–0.87) in a similar way as the BB models did (R² = 0.86–0.87). The overall performance of the BC concentration estimation methods at the urban background site was much worse, probably because of a combination of smaller dynamic variability in the BC values and longer data gaps. However, the difference between WB (R² = 0.44–0.60) and BB models (R² = 0.41–0.64) was not significant. Furthermore, the WB models are closer to physics-based models, and it is easier to spot the relative importance of the predictor variable and determine if the model output makes sense. This feature outweighs the slightly higher performance of some individual BB models, and inherently the WB models are a better choice due to their transparency in the model architecture. Among all the WB models, IAP and LASSO are recommended due to their flexibility and efficiency, respectively. Our findings also ascertain the importance of temporal properties in statistical modelling. In the future, the developed BC estimation model could serve as a virtual sensor and complement the current air quality monitoring.
  • Kaikkonen, Laura; Virtanen, Elina A.; Kostamo, Kirsi; Lappalainen, Juho; Kotilainen, Aarno T. (2019)
    Ferromanganese (FeMn) concretions are mineral precipitates found on soft sediment seafloors both in the deep sea and coastal sea areas. These mineral deposits potentially form a three-dimensional habitat for marine organisms, and contain minerals targeted by an emerging seabed mining industry. While FeMn concretions are known to occur abundantly in coastal sea areas, specific information on their spatial distribution and significance for marine ecosystems is lacking. Here, we examine the distribution of FeMn concretions in Finnish marine areas. Drawing on an extensive dataset of 140,000 sites visited by the Finnish Inventory Programme for the Underwater Marine Environment (VELMU), we examine the occurrence of FeMn concretions from seabed mapping, and use spatial modeling techniques to estimate the potential coverage of FeMn concretions. Using seafloor characteristics and hydrographical conditions as predictor variables, we demonstrate that the extent of seafloors covered by concretions in the northern Baltic Sea is larger than anticipated, as concretions were found at approximately 7000 sites, and were projected to occur on over 11% of the Finnish sea areas. These results provide new insights into seafloor complexity in coastal sea areas, and further enable examining the ecological role and resource potential of seabed mineral concretions.
  • Batllori, Enric; Lloret, Francisco; Aakala, Tuomas; Anderegg, William R. L.; Aynekulu, Ermias; Bendixsen, Devin P.; Bentouati, Abdallah; Bigler, Christof; Burk, C. John; Camarero, J. Julio; Colangelo, Michele; Coop, Jonathan D.; Fensham, Roderick; Floyd, M. Lisa; Galiano, Lucia; Ganey, Joseph L.; Gonzalez, Patrick; Jacobsen, Anna L.; Kane, Jeffrey Michael; Kitzberger, Thomas; Linares, Juan C.; Marchetti, Suzanne B.; Matusick, George; Michaelian, Michael; Navarro-Cerrillo, Rafael M.; Pratt, Robert Brandon; Redmond, Miranda D.; Rigling, Andreas; Ripullone, Francesco; Sanguesa-Barreda, Gabriel; Sasal, Yamila; Saura-Mas, Sandra; Suarez, Maria Laura; Veblen, Thomas T.; Vila-Cabrera, Albert; Vincke, Caroline; Zeeman, Ben (2020)
    Forest vulnerability to drought is expected to increase under anthropogenic climate change, and drought-induced mortality and community dynamics following drought have major ecological and societal impacts. Here, we show that tree mortality concomitant with drought has led to short-term (mean 5 y, range 1 to 23 y after mortality) vegetation-type conversion in multiple biomes across the world (131 sites). Self-replacement of the dominant tree species was only prevalent in 21% of the examined cases and forests and woodlands shifted to nonwoody vegetation in 10% of them. The ultimate temporal persistence of such changes remains unknown but, given the key role of biological legacies in long-term ecological succession, this emerging picture of postdrought ecological trajectories highlights the potential for major ecosystem reorganization in the coming decades. Community changes were less pronounced under wetter postmortality conditions. Replacement was also influenced by management intensity, and postdrought shrub dominance was higher when pathogens acted as codrivers of tree mortality. Early change in community composition indicates that forests dominated by mesic species generally shifted toward more xeric communities, with replacing tree and shrub species exhibiting drier bioclimatic optima and distribution ranges. However, shifts toward more mesic communities also occurred and multiple pathways of forest replacement were observed for some species. Drought characteristics, species-specific environmental preferences, plant traits, and ecosystem legacies govern post drought species turnover and subsequent ecological trajectories, with potential far-reaching implications for forest biodiversity and ecosystem services.
  • Genetics DNA Methylation Consort; NHLBI Trans-Omics Precision Med; McCartney, Daniel L.; Min, Josine L.; Richmond, Rebecca C.; Palviainen, Teemu; Ollikainen, Miina; Kaprio, Jaakko (2021)
    Background Biological aging estimators derived from DNA methylation data are heritable and correlate with morbidity and mortality. Consequently, identification of genetic and environmental contributors to the variation in these measures in populations has become a major goal in the field. Results Leveraging DNA methylation and SNP data from more than 40,000 individuals, we identify 137 genome-wide significant loci, of which 113 are novel, from genome-wide association study (GWAS) meta-analyses of four epigenetic clocks and epigenetic surrogate markers for granulocyte proportions and plasminogen activator inhibitor 1 levels, respectively. We find evidence for shared genetic loci associated with the Horvath clock and expression of transcripts encoding genes linked to lipid metabolism and immune function. Notably, these loci are independent of those reported to regulate DNA methylation levels at constituent clock CpGs. A polygenic score for GrimAge acceleration showed strong associations with adiposity-related traits, educational attainment, parental longevity, and C-reactive protein levels. Conclusion This study illuminates the genetic architecture underlying epigenetic aging and its shared genetic contributions with lifestyle factors and longevity.
  • Mpindi, John Patrick; Sara, Henri; Haapa-Paananen, Saija; Kilpinen, Sami; Pisto, Tommi; Bucher, Elmar; Ojala, Kalle; Iljin, Kristiina; Vainio, Paula; Bjorkman, Mari; Gupta, Santosh; Kohonen, Pekka; Nees, Matthias; Kallioniemi, Olli (2011)
    Background Meta-analysis of gene expression microarray datasets presents significant challenges for statistical analysis. We developed and validated a new bioinformatic method for the identification of genes upregulated in subsets of samples of a given tumour type (‘outlier genes’), a hallmark of potential oncogenes. Methodology A new statistical method (the gene tissue index, GTI) was developed by modifying and adapting algorithms originally developed for statistical problems in economics. We compared the potential of the GTI to detect outlier genes in meta-datasets with four previously defined statistical methods, COPA, the OS statistic, the t-test and ORT, using simulated data. We demonstrated that the GTI performed equally well to existing methods in a single study simulation. Next, we evaluated the performance of the GTI in the analysis of combined Affymetrix gene expression data from several published studies covering 392 normal samples of tissue from the central nervous system, 74 astrocytomas, and 353 glioblastomas. According to the results, the GTI was better able than most of the previous methods to identify known oncogenic outlier genes. In addition, the GTI identified 29 novel outlier genes in glioblastomas, including TYMS and CDKN2A. The over-expression of these genes was validated in vivo by immunohistochemical staining data from clinical glioblastoma samples. Immunohistochemical data were available for 65% (19 of 29) of these genes, and 17 of these 19 genes (90%) showed a typical outlier staining pattern. Furthermore, raltitrexed, a specific inhibitor of TYMS used in the therapy of tumour types other than glioblastoma, also effectively blocked cell proliferation in glioblastoma cell lines, thus highlighting this outlier gene candidate as a potential therapeutic target. Conclusions/Significance Taken together, these results support the GTI as a novel approach to identify potential oncogene outliers and drug targets. 
    The algorithm is implemented in an R package (Text S1).
  • Murphy, Neil; Ward, Heather A.; Jenab, Mazda; Rothwell, Joseph A.; Boutron-Ruault, Marie-Christine; Carbonnel, Franck; Kvaskoff, Marina; Kaaks, Rudolf; Kuehn, Tilman; Boeing, Heiner; Aleksandrova, Krasimira; Weiderpass, Elisabete; Skeie, Guri; Borch, Kristin Benjaminsen; Tjonneland, Anne; Kyro, Cecilie; Overvad, Kim; Dahm, Christina C.; Jakszyn, Paula; Sanchez, Maria-Jose; Gil, Leire; Huerta, Jose M.; Barricarte, Aurelio; Ramon Quiros, J.; Khaw, Kay-Tee; Wareham, Nick; Bradbury, Kathryn E.; Trichopoulou, Antonia; La Vecchia, Carlo; Karakatsani, Anna; Palli, Domenico; Grioni, Sara; Tumino, Rosario; Fasanelli, Francesca; Panico, Salvatore; Bueno-de-Mesquita, Bas; Peeters, Petra H.; Gylling, Bjorn; Myte, Robin; Jirstrom, Karin; Berntsson, Jonna; Xue, Xiaonan; Riboli, Elio; Cross, Amanda J.; Gunter, Marc J. (2019)
    BACKGROUND & AIMS: Colorectal cancer located at different anatomical subsites may have distinct etiologies and risk factors. Previous studies that have examined this hypothesis have yielded inconsistent results, possibly because most studies have been of insufficient size to identify heterogeneous associations with precision. METHODS: In the European Prospective Investigation into Cancer and Nutrition study, we used multivariable joint Cox proportional hazards models, which accounted for tumors at different anatomical sites (proximal colon, distal colon, and rectum) as competing risks, to examine the relationships between 14 established/suspected lifestyle, anthropometric, and reproductive/menstrual risk factors with colorectal cancer risk. Heterogeneity across sites was tested using Wald tests. RESULTS: After a median of 14.9 years of follow-up of 521,330 men and women, 6291 colorectal cancer cases occurred. Physical activity was related inversely to proximal colon and distal colon cancer, but not to rectal cancer (P heterogeneity = .03). Height was associated positively with proximal and distal colon cancer only, but not rectal cancer (P heterogeneity = .0001). For men, but not women, heterogeneous relationships were observed for body mass index (P heterogeneity = .008) and waist circumference (P heterogeneity = .03), with weaker positive associations found for rectal cancer, compared with proximal and distal colon cancer. Current smoking was associated with a greater risk of rectal and proximal colon cancer, but not distal colon cancer (P heterogeneity = .05). No heterogeneity by anatomical site was found for alcohol consumption, diabetes, nonsteroidal anti-inflammatory drug use, and reproductive/menstrual factors. CONCLUSIONS: The relationships between physical activity, anthropometry, and smoking with colorectal cancer risk differed by subsite, supporting the hypothesis that tumors in different anatomical regions may have distinct etiologies.
  • Cichonska, Anna; Pahikkala, Tapio; Szedmak, Sandor; Julkunen, Heli; Airola, Antti; Heinonen, Markus; Aittokallio, Tero; Rousu, Juho (2018)
    Motivation: Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g. drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for small number of input pairs. Results: We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it automatically identifies also data sources relevant for the prediction problem.
  • Li, Zitong; Kemppainen, Petri; Rastas, Pasi; Merilä, Juha (2018)
    Genomewide association studies (GWAS) aim to identify genetic markers strongly associated with quantitative traits by utilizing linkage disequilibrium (LD) between candidate genes and markers. However, because of LD between nearby genetic markers, the standard GWAS approaches typically detect a number of correlated SNPs covering long genomic regions, making corrections for multiple testing overly conservative. Additionally, the high dimensionality of modern GWAS data poses considerable challenges for GWAS procedures such as permutation tests, which are computationally intensive. We propose a cluster-based GWAS approach that first divides the genome into many large nonoverlapping windows and uses linkage disequilibrium network analysis in combination with principal component (PC) analysis as dimensional reduction tools to summarize the SNP data to independent PCs within clusters of loci connected by high LD. We then introduce single- and multilocus models that can efficiently conduct the association tests on such high-dimensional data. The methods can be adapted to different model structures and used to analyse samples collected from the wild or from biparental F2 populations, which are commonly used in ecological genetics mapping studies. We demonstrate the performance of our approaches with two publicly available data sets from a plant (Arabidopsis thaliana) and a fish (Pungitius pungitius), as well as with simulated data.
  • Louvanto, Karolina; Aro, Karoliina; Nedjai, Belinda; Bützow, Ralf; Jakobsson, Maija; Kalliala, Ilkka; Dillner, Joakim; Nieminen, Pekka; Lorincz, Attila (2020)
    BACKGROUND: There is no baseline prognostic test to ascertain whether cervical intraepithelial neoplasia (CIN) will regress or progress. The majority of CIN regress in young women, and since local treatments are known to increase the risk of adverse pregnancy outcomes, interventions need to be sparing. We investigated the ability of a DNA methylation panel (the S5-classifier) to discriminate between progression and regression among women of childbearing age with untreated CIN grade 2 (CIN2). METHODS: Pyrosequencing methylation and HPV genotyping assays were performed on exfoliated cervical cells from 149 young women with CIN2 in a 2-year cohort study of active surveillance. RESULTS: Twenty-five lesions progressed to CIN grade 3 or worse, 88 regressed to less than CIN grade 1, and 36 lesions persisted as CIN1/2. When cytology, HPV16/18- and HPV16/18/31/33-genotyping, and S5 at baseline were compared to outcomes, S5 was the strongest biomarker associated with regression versus progression. S5 alone or in combination with HPV16/18/31/33-genotyping also showed significantly increased sensitivity versus cytology, comparing regression vs. persistence/progression. With both S5 and cytology tests set at a specificity of 38.6% (95% CI 28.4-49.6), the sensitivity of S5 was significantly higher (83.6%, 95% CI 71.9-91.8) than for cytology (62.3%, 95% CI 49.0-74.4) (p=0.005). The highest area under the curve (AUC) was 0.735 (95% CI 0.621-0.849) in the regression vs. progression outcome with a combination of S5 and cytology, whereas HPV16/18 or HPV16/18/31/33-genotyping did not provide additional prognostic information. CONCLUSIONS: The S5-classifier shows high potential as a prognostic biomarker to identify women with progressive CIN2.
  • Tiittala, Paula; Ristola, Matti; Liitsola, Kirsi; Ollgren, Jukka; Koponen, Päivikki; Surcel, Heljä-Marja; Hiltunen-Back, Eija; Davidkin, Irja; Kivela, Pia (2018)
    Background: Migrants are considered a key population at risk for sexually transmitted and blood-borne diseases in Europe. Prevalence data to support the design of infectious diseases screening protocols are scarce. We aimed to estimate the prevalence of hepatitis B and C, human immunodeficiency virus (HIV) infection and syphilis in specific migrant groups in Finland and to assess risk factors for missed diagnosis. Methods: A random sample of 3000 Kurdish, Russian, or Somali origin migrants in Finland was invited to a migrant population-based health interview and examination survey during 2010-2012. Participants in the health examination were offered screening for hepatitis B and C, HIV and syphilis. Notification prevalence in the National Infectious Diseases Register (NIDR) was compared between participants and non-participants to assess non-participation. Missed diagnosis was defined as a test-positive case in the survey without previous notification in NIDR. Inverse probability weighting was used to correct for non-participation. Results: Altogether 1000 migrants were screened for infectious diseases. No difference in the notification prevalence among participants and non-participants was observed. Seroprevalence of hepatitis B surface antigen (HBsAg) was 2.3%, hepatitis C antibodies 1.7%, and Treponema pallidum antibodies 1.3%. No cases of HIV were identified. Of all test-positive cases, 61% (34/56) had no previous notification in NIDR. 48% of HBsAg, 62.5% of anti-HCV and 84.6% of anti-Trpa positive cases had been missed. Among the Somali population (n = 261), prevalence of missed hepatitis B diagnosis was 3.0%. Of the 324 Russian migrants, 3.0% had not been previously diagnosed with hepatitis C and 2.4% had a missed syphilis diagnosis. In a multivariable regression model, missed diagnosis was associated with migrant origin, living alone, poor self-perceived health, daily smoking, and previous diagnosis of another blood-borne infection. Conclusions: More than half of chronic hepatitis and syphilis diagnoses had been missed among migrants in Finland. Undiagnosed hepatitis B among Somali migrants implies post-migration transmission that could be prevented by enhanced screening and vaccinations. Rate of missed diagnoses among Russian migrants supports implementation of targeted hepatitis and syphilis screening upon arrival and also in later health care contacts. Coverage and up-take of current screening among migrants should be evaluated.