Supervised dimensionality reduction for molecular data

Institution: University of Helsinki, Faculty of Science
Author: Prasad, Ayush
Date: 2024-06-14
URN: URN:NBN:fi:hulib-202406143068
Handle: http://hdl.handle.net/10138/577383
Language: eng
Keywords: supervised dimensionality reduction; cheminformatics
Type: Master's thesis (pro gradu)
Specialization: no specialization
Programme: Master's Programme in Theoretical and Computational Methods

Abstract

Machine learning is increasingly being applied to model molecular data in scientific fields such as drug discovery, materials science, and atmospheric science. However, the high dimensionality of molecular features makes it challenging to apply machine learning algorithms directly. Dimensionality reduction methods can help reduce the feature space and create new informative features. In this thesis, we first review current methods for representing molecules for machine learning. We then discuss the importance of evaluating dimensionality reduction visualizations, and we review and propose metrics for this evaluation. We present Gradient Boosting Mapping (GBMAP), a supervised dimensionality reduction method. Through experiments on benchmark datasets and the GeckoQ molecular dataset, we demonstrate that low-dimensional embeddings created by GBMAP can be used as features to significantly improve the performance of simpler, interpretable machine learning models.