Browsing by Subject "Feature selection"

Sort by: Order: Results:

Now showing items 1-2 of 2
  • Murtojärvi, Mika; Halkola, Anni S.; Airola, Antti; Laajala, Teemu D.; Mirtti, Tuomas; Aittokallio, Tero; Pahikkala, Tapio (2020)
    Introduction Predictive survival modeling offers systematic tools for clinical decision-making and individualized tailoring of treatment strategies to improve patient outcomes while reducing overall healthcare costs. In 2015, a number of machine learning and statistical models were benchmarked in the DREAM 9.5 Prostate Cancer Challenge, based on open clinical trial data for metastatic castration resistant prostate cancer (mCRPC). However, applying these models into clinical practice poses a practical challenge due to the inclusion of a large number of model variables, some of which are not routinely monitored or are expensive to measure. Objectives To develop cost-specified variable selection algorithms for constructing cost-effective prognostic models of overall survival that still preserve sufficient model performance for clinical decision making. Methods Penalized Cox regression models were used for the survival prediction. For the variable selection, we implemented two algorithms: (i) LASSO regularization approach; and (ii) a greedy cost-specified variable selection algorithm. The models were compared in three cohorts of mCRPC patients from randomized clinical trials (RCT), as well as in a real-world cohort (RWC) of advanced prostate cancer patients treated at the Turku University Hospital. Hospital laboratory expenses were utilized as a reference for computing the costs of introducing new variables into the models. Results Compared to measuring the full set of clinical variables, economic costs could be reduced by half without a significant loss of model performance. The greedy algorithm outperformed the LASSO-based variable selection with the lowest tested budgets. The overall top performance was higher with the LASSO algorithm. Conclusion The cost-specified variable selection offers significant budget optimization capability for the real-world survival prediction without compromising the predictive power of the model.
  • Lange, Moritz Johannes (Helsingin yliopisto, 2020)
    In the context of data science and machine learning, feature selection is a widely used technique that focuses on reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to do feature selection in a semi-supervised setup. Distribution matching is a wrapper method, which trains models to select features that best affect model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function which approximates expected loss on unseen data. In experiments, the method is shown to successfully minimise the expected loss transparently on a synthetic dataset. Additionally, a comparison with related methods is performed on a more complex EMNIST dataset.