Browsing by Subject "logistic regression"


Now showing items 1-14 of 14
  • Pohjonen, Joona (Helsingin yliopisto, 2020)
    Prediction of the pathological T-stage (pT) in men undergoing radical prostatectomy (RP) is crucial for disease management as curative treatment is most likely when prostate cancer (PCa) is organ-confined (OC). Although multiparametric magnetic resonance imaging (MRI) has been shown to predict pT findings and the risk of biochemical recurrence (BCR), none of the currently used nomograms allow the inclusion of MRI variables. This study aims to assess the possible added benefit of MRI when compared to the Memorial Sloan Kettering, Partin table and CAPRA nomograms and a model built from available preoperative clinical variables. Logistic regression is used to assess the added benefit of MRI in the prediction of non-OC disease and Kaplan-Meier survival curves and Cox proportional hazards in the prediction of BCR. For the prediction of non-OC disease, all models with the MRI variables had significantly higher discrimination and net benefit than the models without the MRI variables. For the prediction of BCR, MRI prediction of non-OC disease separated the high-risk group of all nomograms into two groups with significantly different survival curves but in the Cox proportional hazards models the variable was not significantly associated with BCR. Based on the results, it can be concluded that MRI does offer added value to predicting non-OC disease and BCR, although the results for BCR are not as clear as for non-OC disease.
  • Preussner, Annina (Helsingin yliopisto, 2021)
    The Y chromosome has an essential role in the genetic sex determination in humans and other mammals. It contains a male-specific region (MSY) which escapes recombination and is inherited exclusively through the male line. The genetic variations inherited together on the MSY can be used in classifying Y chromosomes into haplogroups. Y-chromosomal haplogroups are highly informative of genetic ancestry; thus Y chromosomes have been widely used in tracing human population history. However, given the peculiar biology and analytical challenges specific to the Y chromosome, the chromosome is routinely excluded from genetic association studies. Consequently, potential impacts of Y-chromosomal variation on complex disease remain largely uncharacterized. Lately, access to large-scale biobank data has made it possible to extend Y-chromosomal genetic association studies. A recent UK Biobank study suggested links between Y-chromosomal haplogroup I1 and coronary artery disease (CAD) in the British population, but this result has not been validated in other datasets. Since Finland harbours a notable frequency of Y-chromosomal haplogroup I1, the relationship between haplogroup I1 and CAD can be examined further in the Finnish population using data from the FinnGen project. The first aim of this thesis was to determine the prevalence of Y-chromosomal haplogroups in Finland and characterize their geographical distributions using genotyping array data from the FinnGen project. The second aim was to assess the association between Finnish Y-chromosomal haplogroups and CAD by logistic regression. This thesis characterized the Y-chromosomal haplogroups in Finland for 24 160 males and evaluated the association between Y-chromosomal haplogroups and CAD in Finland. 
The dataset used in this study was extensive, providing an opportunity to study the Y-chromosomal variation geographically in Finland and its role in complex disease more accurately compared to previous studies. The geographical distribution of the Y-chromosomal haplogroups was characterized on 20 birth regions, and between eastern and western areas of Finland. Consistent with previous studies, the results demonstrated that two major Finnish Y-chromosomal haplogroup lineages, N1c1 and I1, displayed differing distributions within regions, especially between eastern and western Finland. Results from logistic regression analysis between CAD and Y-chromosomal haplogroups suggested no significant association between haplogroup I1 and CAD. Instead, the major Finnish Y-chromosomal haplogroup N1c1 displayed a decreased risk for CAD in the association analysis when compared against other haplogroups. Moreover, this thesis also demonstrated that the association results were not straightforwardly comparable between populations. For instance, haplogroup I1 displayed a decreased risk for CAD in the FinnGen dataset when compared against haplogroup R1b, whereas the same association was reported as risk increasing for CAD in the UK Biobank. Overall, this thesis demonstrates the possibility to study the genetics of Y chromosome using data from the FinnGen project, and highlights the value of including this part of the genome in the future complex disease studies.
  • Kantola, Tuula; Vastaranta, Mikko; Yu, Xiaowei; Lyytikäinen-Saarenmaa, Päivi; Holopainen, Markus; Talvitie, Mervi; Kaasalainen, Sanna; Solberg, Svein; Hyyppä, Juha (2010)
    Climate change and rising temperatures have been observed to be related to the increase of forest insect damage in the boreal zone. The common pine sawfly (Diprion pini L.) (Hymenoptera, Diprionidae) is regarded as a significant threat to boreal pine forests. Defoliation by D. pini can cause severe growth loss and tree mortality in Scots pine (Pinus sylvestris L.) (Pinaceae). In this study, logistic LASSO regression, Random Forest (RF) and Most Similar Neighbor method (MSN) were investigated for predicting the defoliation level of individual Scots pines using the features derived from airborne laser scanning (ALS) data and aerial images. Classification accuracies from 83.7% (kappa 0.67) to 88.1% (kappa 0.76) were obtained depending on the method. The most accurate result was produced using RF with a combination of data from the two sensors, while the accuracies when using ALS and image features separately were 80.7% and 87.4%, respectively. Evidently, the combination of ALS and aerial images in detecting needle losses is capable of providing satisfactory estimates for individual trees.
  • Toivonen, Jaakko; Fortelius, Mikael; Žliobaite, Indrė (2022)
    A species factory refers to the source that gives rise to an exceptionally large number of species. However, what is it exactly: a place, a time or a combination of places, times and environmental conditions, remains unclear. Here we search for species factories computationally, for which we develop statistical approaches to detect origination, extinction and sorting hotspots in space and time in the fossil record. Using data on European Late Cenozoic mammals, we analyse where, how and how often species factories occur, and how they potentially relate to the dynamics of environmental conditions. We find that in the Early Miocene origination hotspots tend to be located in areas with relatively low estimated net primary productivity. Our pilot study shows that species first occurring in origination hotspots tend to have a longer average longevity and a larger geographical range than other species, thus emphasizing the evolutionary importance of the species factories.
  • Junna, Liina (Helsingfors universitet, 2017)
    Self-rated health (SRH) is a frequently used survey indicator of general health. It is periodically utilised in the study of educational health disparities. Several researchers have, however, suggested that systematic population subgroup differences in health self-ratings (reporting heterogeneity) may result in SRH reflecting a different health status, or aspects of health, for different educational groups. Previous studies imply that the associations between SRH and other indicators of health may be strengthened by higher education. However, the studies disagree on the strength and the scope of the interaction effect. Comparability is also an issue due to, for example, the variation in the selected health indicators by which SRH is assessed. No such studies have so far been conducted in Northern Europe. The purpose of this Master’s thesis is to address educational SRH reporting heterogeneity. Using quantitative methods, this thesis analyses which aspects of health are included in dichotomised poor or very poor SRH ratings, and whether education moderates the relationship between SRH and the indicators of health. The selected health indicators represent five health dimensions identified in previous studies: clinical health, functional health, health behaviours, mental health and bodily symptoms and experiences. The analyses are conducted using logistic regression and regression-based nonlinear decomposition methods. The study utilises the Health 2000 data (n = 5586) for the household and institution dwelling population over the age of 30 residing in mainland Finland. The data are nationally representative and consist of a clinical and mental health examination, and survey sections. Overall, a high volume of somatic complaints was found to be strongly associated with poor self-rated health for all educational groups. Other significant contributors were functional health, diagnosed mental health conditions, and to some extent diagnosed diseases. 
An educational interaction effect was found for cardiovascular disease, subjective functional limitations in everyday tasks, and high volume of somatic complaints. In all cases education strengthened the association. However, for the majority of the indicators that SRH was associated with, no interaction effect was found. Compared to those respondents with a higher education, those with lower educational attainments more often reported poor SRH, but the selected health indicators and demographic variables explained virtually the whole difference. The study thus, to some extent, concurs with earlier findings of higher education strengthening some of the associations between poor SRH and other indicators of health. However, the effect was statistically significant only when comparing basic education to higher educational attainments, and it was less systematic than some of the previous studies have suggested.
  • Masalin, Walter (2007)
    Franchising is not a particularly well-researched topic in organizational economics. However, its economic importance is significant. For example, the total revenue of franchising in the US is now approximately 50% of all retail trade. A typical franchising chain consists of units that are operated by the franchisees, but there are also centrally managed units. Which ones are managed centrally and why? Some of the prior research on the matter is incomplete. The research question of this thesis is: “What factors tend to determine which units of a typical franchising chain are centrally managed?” The main focus of this study is the free riding problems that arise in environments where the franchisee is not compelled to work in the best interests of the franchisor. The objectives of this study are threefold. The goal is to (1) write a thorough literature review on the theory of the firm in order to understand franchising as an organizational form, (2) develop a model describing the franchisor-franchisee relationship and (3) perform a case type empirical analysis on one typical franchising chain to test two hypotheses derived from the theoretical model. The target of the empirical research in this study is McDonald’s, a well established franchising chain with a valuable brand. The data used in the research were collected from three main sources: an interview with the franchising director of McDonald’s Finland, public web pages and an internet-based map service (Eniro). In addition, a qualitative index for metropolitan context was created. It was used to estimate the influence of the franchisee’s operating environment on the business. A logistic regression model was used to test the proposed hypotheses. The sample comprises 84 restaurants. The underlying assumption in the empirical research is that an established franchisor has learned an efficient ownership allocation for its units. 
The key results from the theoretical model, together with the analyzed theory, motivate two hypotheses. According to these hypotheses, this ownership allocation is a function of operating environment and monitoring costs. The empirical results supported the proposed hypotheses: the likelihood for a given McDonald’s restaurant to be owned and operated by the franchisor decreases the closer the unit is to the headquarters (monitoring costs) and the higher the metropolitan index is for that given location (environment). Even though the sample size was not large, the results are statistically significant. The study is concluded with three propositions, suggestions for future research and recommendations for franchising firms.
  • Huong Thi Thanh Nguyen; Trung Minh Doan; Tomppo, Erkki; McRoberts, Ronald E. (2020)
    Information on land use and land cover (LULC) including forest cover is important for the development of strategies for land planning and management. Satellite remotely sensed data of varying resolutions have been an unmatched source of such information that can be used to produce estimates with a greater degree of confidence than traditional inventory estimates. However, use of these data has always been a challenge in tropical regions owing to the complexity of the biophysical environment, clouds, haze, and atmospheric moisture content, all of which impede accurate LULC classification. We tested a parametric classifier (logistic regression) and three non-parametric machine learning classifiers (improved k-nearest neighbors, random forests, and support vector machine) for classification of multi-temporal Sentinel 2 satellite imagery into LULC categories in Dak Nong province, Vietnam. A total of 446 images, 235 from the year 2017 and 211 from the year 2018, were pre-processed to obtain high-quality images for mapping LULC in the 6516 km² study area. The Sentinel 2 images were tested and classified separately for four temporal periods: (i) dry season, (ii) rainy season, (iii) the entirety of the year 2017, and (iv) the combination of dry and rainy seasons. Eleven different LULC classes were discriminated, of which five were forest classes. For each combination of temporal image set and classifier, a confusion matrix was constructed using independent reference data and pixel classifications, and the area on the ground of each class was estimated. Over all temporal periods and classifiers, overall accuracy ranged from 63.9% to 80.3%, and the Kappa coefficient ranged from 0.611 to 0.813. Area estimates for individual classes ranged from 70 km² (1% of the study area) to 2200 km² (34% of the study area) with greater uncertainties for smaller classes.
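Several abstracts in this listing report classification quality as overall accuracy together with a kappa coefficient derived from a confusion matrix. As a minimal sketch of how those two figures are computed (the function names and the 2×2 example matrix are illustrative assumptions, not data from any of the listed studies):

```python
def overall_accuracy(matrix):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohens_kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    # Chance agreement expected from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical matrix: rows are reference classes, columns are predictions.
example = [[40, 10],
           [5, 45]]
print(overall_accuracy(example), cohens_kappa(example))
```

Kappa is lower than overall accuracy whenever some agreement would be expected by chance, which is why both figures are usually reported side by side.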
  • Kuronen, Juri (Helsingin yliopisto, 2017)
    This Master’s thesis introduces a new score-based method for learning the structure of a pairwise Markov network without imposing the assumption of chordality on the underlying graph structure by approximating the joint probability distribution using the popular pseudo-likelihood framework. Together with the local Markov property associated with the Markov network, the joint probability distribution is decomposed into node-wise conditional distributions involving only a tiny subset of variables each, getting rid of the problematic intractable normalizing constant. These conditional distributions can be naturally modeled using logistic regression, giving rise to pseudo-likelihood maximization with logistic regression (plmLR) which is designed to be especially well-suited for capturing pairwise interactions by restricting the explanatory variables to main effects (no interaction terms). To deal with overfitting, plmLR is regularized using an extended variant of the Bayesian information criterion. To select the best model out of the vast discrete model space of network structures, a dynamic greedy hill-climbing search algorithm can be readily implemented with the pseudo-likelihood framework where each Markov blanket is learned separately so that the full graph can be composed from the solutions to these subproblems. This work also presents a novel improvement to the algorithm by drastically reducing the search space associated with each node-wise hill-climbing run by first running a set of pairwise queries to isolate only the promising candidates. In experiments on data sets sampled from synthetic pairwise Markov networks, plmLR performs favorably against competing methods with respect to the Hamming distance between the learned and true network structure. 
Additionally, unlike most logistic regression based methods, plmLR is not limited to binary variables and performs well on learning benchmark network structures based on real-world non-binary models even though plmLR is not designed for their structural form.
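The pseudo-likelihood decomposition described in this abstract can be written out as follows. This is a sketch for the binary case only, chosen for illustration (the thesis itself is not limited to binary variables); the symbols $d$, $\mathrm{mb}(j)$ and $\beta$ are notation assumed here rather than taken from the thesis.

```latex
% Joint distribution approximated by node-wise conditionals,
% each conditioned only on the node's Markov blanket mb(j):
p(x_1, \dots, x_d) \;\approx\; \prod_{j=1}^{d} p\bigl(x_j \mid x_{\mathrm{mb}(j)}\bigr)

% Binary case: each conditional is a main-effects logistic regression,
% so no global normalizing constant is needed:
p\bigl(x_j = 1 \mid x_{\mathrm{mb}(j)}\bigr)
  = \frac{1}{1 + \exp\!\Bigl(-\bigl(\beta_{j0} + \sum_{k \in \mathrm{mb}(j)} \beta_{jk}\, x_k\bigr)\Bigr)}
```

Because each factor is normalized over a single variable, every Markov blanket can be fitted as a separate regression problem.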
  • Lyytikäinen, Minna (Helsingfors universitet, 2013)
    Climate change and the resulting extreme weather patterns can increase forest damage caused by pest insects, especially at higher latitudes. The number, density and intensity of damage events caused by pest insects have already increased because of the changing conditions. Pest insects can cause, for example, reduced tree growth and even tree death. Defoliation by the Common Pine Sawfly (Diprion pini L.) causes severe growth losses and tree mortality of Scots Pine (Pinus sylvestris L.). D. pini caused damage on over 500 000 hectares in Finland between 1997 and 2001. The field work was carried out in the Palokangas area, Ilomantsi, eastern Finland in the years 2002–2010. Stand- and tree-wise characteristics were measured on 11 plots. Tree-wise defoliation with 10% accuracy and the amount of D. pini cocoons and fallen shoots of P. sylvestris were estimated annually. In addition, radial tree growth was measured from a total of n trees in 2010. The aim of this study was to estimate the effect of natural enemies on population densities of D. pini. A further aim was to estimate the effect of the defoliation caused by D. pini on tree growth, and to estimate the consequence of a beetle attack by pith borers (Tomicus spp.) for the defoliation. The effect of natural enemies as regulating factors was estimated from D. pini cocoons. Natural enemies were divided into birds, small mammals and the insect groups Ichneumonidae, Chalcidoidea, Tachinidae, Elateridae and Carabidae. The consequence of the beetle attack was assessed from fallen shoots. Tree growth simulation was used to estimate economic losses. Growth losses were estimated from drill chip samples. Logistic regression was used to explain tree-wise defoliation with tree- and stand-wise variables. Two different classification schemes with threshold values of 20% (class 1) and 30% (class 2) of defoliation were used in the regression. 
The major regulating factor was ichneumonid parasites (22%), and the second most powerful was small mammals (21%). The relative proportion of natural enemies increased over the research period as defoliation percentages decreased. The beetle attack was most severe in 2004 (17 shoots/m²). Plot-wise defoliation level varied significantly between the years and the plots. The mean defoliation level was 37% in 2002 and 22% in 2010. The most substantial defoliation, over 99%, was in plot 9 in 2005. Simulated economic losses were perceptible only on plots 9 and 16: 2785 € and 1623 € per hectare, respectively. Defoliation by D. pini caused losses in radial growth in the different defoliation classes. The mean growth loss of severely damaged trees (70–100% defoliation) was approximately 65%, and that of trees with a low defoliation level (0–10% defoliation) 40%. The classification accuracy of logistic regression was 92.4% (kappa 0.81) for class 1 and 94.2% (kappa 0.84) for class 2. The results of this study showed that control by natural enemies affected D. pini density. Population density of D. pini affected the defoliation level; when population density was low the defoliation was milder. Peak sawfly densities can affect tree growth during outbreaks. The consequence of the beetle attack by the pith borers was only slight and delayed.
  • Kukkonen, Tommi (Helsingin yliopisto, 2020)
    The Arctic is warming at an increasing pace, which can affect ecosystems, infrastructure and communities. By studying periglacial landforms and processes, and using improved methods, more knowledge on these changing environmental conditions and their impacts can be obtained. The aim of this thesis is to map the studied landforms and predict their probability of occurrence in the circumpolar region utilizing different modelling methods. Periglacial environments occur in high latitudes and other cold regions. These environments host permafrost, frozen ground that responds readily to climate warming and underlies areas that host many landform types. Therefore, landform monitoring and modelling in permafrost regions under a changing climate can provide information about the ongoing changes in the Arctic and landform distributions. Here four landform/process types were mapped and studied: patterned ground, pingos, thermokarst activity and solifluction. The study consisted of 10 study areas across the circumpolar Arctic that were mapped for their landforms. The study utilized GLM, GAM and GBM analyses in determining landform occurrences in the Arctic based on environmental variables. Model calibration used a logit link function, and evaluation used the explained deviance. The data were split into calibration and evaluation sets to assess prediction abilities. The predictive accuracy of the models was assessed using ROC/AUC values. Thermokarst activity proved to be most abundant in the studied areas, whereas solifluction activity was most scarce. Pingos were found evenly throughout the studied areas, and patterned ground activity was absent in some areas but rich in others. Climate variables and mean annual ground temperature had the biggest influence in explaining landform occurrence throughout the circumpolar region. GBM proved to be the most accurate and had the best predictive performance. 
The results show that mapping and modelling in mesoscale is possible, and in the future, similar studies could be utilized in monitoring efforts regarding global change and in studying environmental and periglacial landform/process interactions.
  • Salo, Tuukka (Helsingfors universitet, 2016)
    The purpose of the act on the financing of sustainable forestry (Kemera-law) is to advance economically, ecologically and socially sustainable silviculture and use of the forests. A private forest owner may receive financial support from the State for forest management, forest improvement work and nature management. The purpose of this thesis was to identify the factors affecting private forest owners’ participation in the Kemera cost sharing program, and to examine whether forest owners’ objectives in forest ownership and opinions about Kemera-subsidies differ depending on participation in the cost sharing program. The data used in this thesis are from a survey that was implemented in the spring of 2016 as a part of a project in Tapio Oy. Additional information from The Finnish Forest Centre was also used in the regression analysis. The factors affecting the use of the Kemera-subsidy were analyzed with logistic regression. The differences in the forest ownership objectives and in the opinions about the Kemera-subsidy depending on participation in the Kemera cost sharing program were determined by descriptive analysis. With the factors used, the regression analysis did not succeed in producing a model that would successfully predict participation in the cost sharing program. However, the results implied that the factors positively affecting participation in the cost sharing program were forested area owned, forest owners’ self-determined activity and use of external services in the forest. The differences between the forest owners’ objectives depending on participation in the cost sharing program imply that the participants did not value the non-monetary values less than those who had not participated in the cost sharing program, but they did value monetary values more. The average opinions about the Kemera-subsidy did not vary much depending on participation in the cost sharing program. 
Those who had participated in the cost sharing program in the last 10 years were a little more satisfied with the Kemera-subsidies. The majority thought that the best incentive in the Kemera-subsidy is the benefit gained in the future. The most common reason not to participate in the cost sharing program was the difficulty of applying.
  • Sahlberg, Eero (Helsingin yliopisto, 2018)
    This thesis examines underlying causes of customer churn in the Finnish insurance market. Using individual data on moving insurance customers, econometric modeling is conducted to find significant relations between observed customer characteristics and behavior, and the probability to churn. A subscription-based business gains revenue not only from new sales but more importantly from automatic renewals of existing customers, i.e. retention. Significant drops in retention are important to understand for the insurer in order to not lose profit. Churn is an antonym for retention. A change of address – or moving homes – is an event around which churn rates spike, as it is a time when all address-specific subscriptions (electricity, internet, etc.) need to be proactively renewed by the consumer. There were one million moving individuals in 2016, as reported by Posti. This means that a significant share of an insurer’s customers are at a heightened risk to churn, with an address change being the common denominator. This thesis asks which customer characteristics and experiences significantly either increase or decrease the probability of a customer either changing their home insurance or churning completely around the time of their move. Insurance literature such as Hillson & Murray-Webster (2007) and Vaughan (1996) are reviewed to present the nature of risk, the insurance mechanism and the modern insurance business model. An annual report by Finance Finland (2017) provides accounting data via which the Finnish market situation is presented, while data and reports by Posti (2016; 2017a; 2017b) provide the numbers and facts regarding Finnish movers. Churn modeling is based on 20th century discrete choice theory, literature of which is reviewed, most notably by Nobel-laureate Daniel McFadden (1974; 2000). Also presented are modern applications of choice theory into churn problems, such as Madden et al (1999). 
The empirical section of the thesis consists of data presentation, model construction and evaluation and finally discussion of the results. The final sample of customer data consists of 24 230 observations with 21 variables. Following Madden et al (1999) and with help from Cox (1958) and McFadden (1974), binomial logistic regression models are constructed to relate the probability of churning with the specified variables. It is found that customer data can be used to predict churn among movers. Significant weights are found for variables denoting the size of a customer’s insurance portfolio as well as customer age and the duration of customership. Also the presence of personal insurance products and contact with one’s insurer notably affect retention positively. Younger segments and customers with implications of lower income (with fewer insurance products, more payment installments) exhibit a significantly increased probability of churning.
  • Kokko, J.; Remes, U.; Thomas, Owen; Pesonen, H.; Corander, J. (2019)
    Likelihood-free inference for simulator-based models is an emerging methodological branch of statistics which has attracted considerable attention in applications across diverse fields such as population genetics, astronomy and economics. Recently, the power of statistical classifiers has been harnessed in likelihood-free inference to obtain either point estimates or even posterior distributions of model parameters. Here we introduce PYLFIRE, an open-source Python implementation of the inference method LFIRE (likelihood-free inference by ratio estimation) that uses penalised logistic regression. PYLFIRE is made available as part of the general ELFI inference software to benefit both the user and developer communities for likelihood-free inference. © 2019 Kokko J et al.
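The idea behind ratio estimation with penalised logistic regression can be sketched as follows. This is not the PYLFIRE or ELFI API; every name below, the Gaussian toy simulator, the uniform prior and the L2 penalty are assumptions made purely for illustration. A classifier is trained to separate data simulated at a fixed parameter value from data simulated under the prior; with equal class sizes its log-odds at the observed data estimate log p(x | θ) − log p(x), and adding the log prior gives an unnormalised log posterior.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalised_logistic(features, labels, l2=0.01, lr=0.02, steps=1000):
    """L2-penalised logistic regression fitted by plain gradient descent.

    Returns [intercept, w_1, ..., w_d]; the intercept is not penalised.
    """
    n, d = len(features), len(features[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for x, y in zip(features, labels):
            err = sigmoid(w[0] + sum(wk * xk for wk, xk in zip(w[1:], x))) - y
            grad[0] += err
            for k in range(d):
                grad[k + 1] += err * x[k]
        w[0] -= lr * grad[0] / n
        for k in range(d):
            w[k + 1] -= lr * (grad[k + 1] / n + l2 * w[k + 1])
    return w


def summaries(x):
    """Hypothetical summary statistics fed to the classifier."""
    return [x, x * x]


def log_ratio(theta, x_obs, rng, n=200):
    """Estimate log p(x_obs | theta) - log p(x_obs) for a toy model.

    Toy assumptions: x ~ Normal(theta, 1), prior theta ~ Uniform(-3, 3).
    """
    from_theta = [rng.gauss(theta, 1.0) for _ in range(n)]
    from_marginal = [rng.gauss(rng.uniform(-3.0, 3.0), 1.0) for _ in range(n)]
    features = [summaries(x) for x in from_theta + from_marginal]
    labels = [1] * n + [0] * n
    w = fit_penalised_logistic(features, labels)
    s = summaries(x_obs)
    # Equal class sizes, so the fitted log-odds estimate the log ratio directly.
    return w[0] + sum(wk * sk for wk, sk in zip(w[1:], s))
```

Evaluating `log_ratio` over a grid of candidate parameter values gives a rough picture of where the posterior mass lies; each grid point requires its own batch of simulations and its own classifier fit.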
  • Sutela, Tapio; Vehanen, Teppo; Jounela, Pekka; Aroviita, Jukka (John Wiley & Sons, 2021)
    Ecology and Evolution 11 (15), 10457-10467
    Species–environment relationships were studied between the occurrence of 13 fish and lamprey species and 9 mainly map-based environmental variables of Finnish boreal small streams. A self-organizing map (SOM) analysis showed strong relationships between the fish species and environmental variables in a single model (explained variance 55.9%). Besides basic environmental variables such as altitude, catchment size, and mean temperature, land cover variables were also explored. A logistic regression analysis indicated that the occurrence probability of brown trout, Salmo trutta L., decreased with an increasing percentage of peatland ditch drainage in the upper catchment. Ninespine stickleback, Pungitius pungitius (L.), and three-spined stickleback, Gasterosteus aculeatus L., seemed to benefit from urban areas in the upper catchment. Discovered relationships between fish species occurrence and land-use attributes are encouraging for the development of fish-based bioassessment for small streams. The presented ordination of the fish species in the mean temperature gradient will help in predicting fish community responses to climate change.