Browsing by Subject "principal component analysis"


Now showing items 1-8 of 8
  • Holma, Paula (Helsingfors universitet, 2011)
    Metabolomics is a rapidly growing research field that studies the response of biological systems to environmental factors, disease states and genetic modifications. It aims at measuring the complete set of endogenous metabolites, i.e. the metabolome, in a biological sample such as plasma or cells. Because metabolites are the intermediates and end products of biochemical reactions, metabolite compositions and metabolite levels in biological samples can provide a wealth of information on on-going processes in a living system. Due to the complexity of the metabolome, metabolomic analysis poses a challenge to analytical chemistry. Adequate sample preparation is critical to accurate and reproducible analysis, and the analytical techniques must have high resolution and sensitivity to allow detection of as many metabolites as possible. Furthermore, as the information contained in the metabolome is immense, the data set collected from metabolomic studies is very large. In order to extract the relevant information from such large data sets, efficient data processing and multivariate data analysis methods are needed. In the research presented in this thesis, metabolomics was used to study mechanisms of polymeric gene delivery to retinal pigment epithelial (RPE) cells. The aim of the study was to detect differences in metabolomic fingerprints between transfected cells and non-transfected controls, and thereafter to identify metabolites responsible for the discrimination. The plasmid pCMV-β was introduced into RPE cells using the vector polyethyleneimine (PEI). The samples were analyzed using high performance liquid chromatography (HPLC) and ultra performance liquid chromatography (UPLC) coupled to a triple quadrupole (QqQ) mass spectrometer (MS). The software MZmine was used for raw data processing and principal component analysis (PCA) was used in statistical data analysis. 
The results revealed differences in metabolomic fingerprints between transfected cells and non-transfected controls. However, reliable fingerprinting data could not be obtained because of low analysis repeatability. Therefore, no attempts were made to identify metabolites responsible for discrimination between sample groups. Repeatability and accuracy of analyses can be influenced by protocol optimization. However, in this study, optimization of analytical methods was hindered by the very small number of samples available for analysis. In conclusion, this study demonstrates that obtaining reliable fingerprinting data is technically demanding, and the protocols need to be thoroughly optimized in order to approach the goals of gaining information on mechanisms of gene delivery.
  • Kyrö, Minna (Helsingfors universitet, 2011)
    FTIR spectroscopy (Fourier transform infrared spectroscopy) is a fast method of analysis. The use of interferometers in Fourier devices enables the scanning of the whole infrared frequency region in a couple of seconds. There is no need for elaborate sample preparation when the FTIR spectrometer is equipped with an ATR accessory, and the method is therefore easy to use. The ATR accessory facilitates the analysis of various sample types: it is possible to measure infrared spectra from samples which are not suitable for traditional sample preparation methods. The data from FTIR spectroscopy are frequently combined with statistical multivariate analysis techniques. In cluster analysis the data from spectra can be grouped based on similarity. In hierarchical cluster analysis the similarity between objects is determined by calculating the distance between them. Principal component analysis reduces the dimensionality of the data and establishes new uncorrelated principal components. These principal components should preserve most of the variation of the original data. The possible applications of FTIR spectroscopy combined with multivariate analysis have been studied extensively. In the food industry, for example, its feasibility in quality control has been evaluated. The method has also been used for the identification of chemical compositions of essential oils and for the detection of chemotypes in oil plants. In this study the use of the method was evaluated in the classification of hog's fennel extracts. FTIR spectra of extracts from different plant parts of hog's fennel were compared with the measured FTIR spectra of standard substances. The typical absorption bands in the FTIR spectra of standard substances were identified. The wave number regions of the intensive absorption bands in the spectra of furanocoumarins were selected for multivariate analyses. 
Multivariate analyses were also performed in the fingerprint region of IR spectra, including the wave number region 1785-725 cm-1. The aim was to classify extracts according to the habitat and coumarin concentration of the plants. Grouping according to habitat was detected, which could mainly be explained by coumarin concentrations as indicated by analyses of the wave number regions of the selected absorption bands. In these analyses extracts mainly grouped and differed by their total coumarin concentrations. In analyses of the wave number region 1785-725 cm-1 grouping according to habitat was also detected but this could not be explained by coumarin concentrations. These groupings may have been caused by similar concentrations of other compounds in the samples. Analyses using other wave number regions were also performed, but the results from these experiments did not differ from previous results. Multivariate analyses of second-order derivative spectra in the fingerprint region did not reveal any noticeable changes either. In future studies the method could perhaps be further developed by investigating narrower carefully selected wave number regions of second-order derivative spectra.
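The workflow this abstract describes, grouping measured spectra by hierarchical cluster analysis based on the distances between them, can be sketched in a few lines. This is a minimal illustration on synthetic "spectra", not the study's actual data or pre-processing (pre-processing such as second-order derivatives could precede the clustering step):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Toy "spectra": 10 samples x 200 wavenumber points, two groups whose
# main absorption band differs in intensity (synthetic illustration).
band = np.exp(-((np.arange(200) - 80) ** 2) / 50)
group1 = band + rng.normal(0, 0.02, (5, 200))
group2 = 1.5 * band + rng.normal(0, 0.02, (5, 200))
spectra = np.vstack([group1, group2])

# Hierarchical cluster analysis: similarity is the Euclidean distance
# between spectra; Ward linkage builds the dendrogram bottom-up.
tree = linkage(spectra, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters
```

Cutting the dendrogram at two clusters recovers the two intensity groups, mirroring how extracts with similar coumarin concentrations grouped together in the analyses above.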
  • Siirtola, Harri; Säily, Tanja; Nevalainen, Terttu (IEEE Computer Society, 2017)
    Information Visualization
    Principal Component Analysis (PCA) is an established and efficient method for finding structure in a multidimensional data set. PCA is based on orthogonal transformations that convert a set of multidimensional values into linearly uncorrelated variables called principal components. The main disadvantage of the PCA approach is that the procedure and outcome are often difficult to understand. The connection between input and output can be puzzling, a small change in input can yield a completely different output, and the user may often wonder if the PCA is doing the right thing. We introduce a user interface that makes the procedure and result easier to understand. We have implemented an interactive PCA view in our text visualization tool called Text Variation Explorer. It allows the user to interactively study the result of PCA, and provides a better understanding of the process. We believe that although we are addressing the problem of interactive principal component analysis in the context of text visualization, these ideas should be useful in other contexts as well.
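The orthogonal-transformation view of PCA described above can be verified directly: centring the data and projecting it onto the right singular vectors yields component scores whose covariance matrix is diagonal, i.e. linearly uncorrelated variables. This is a minimal numpy sketch of the standard technique, not the Text Variation Explorer implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated two-dimensional data: the second variable tracks the first.
x = rng.normal(size=(200, 1))
data = np.hstack([x, 0.8 * x + 0.2 * rng.normal(size=(200, 1))])

# PCA as an orthogonal transformation: centre the data, then rotate it
# onto the right singular vectors of the centred data matrix.
centred = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt.T  # principal component scores

# The components are linearly uncorrelated (off-diagonal covariance ~ 0)
# and ordered by decreasing variance.
cov = np.cov(scores, rowvar=False)
```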
  • Pour-Aboughadareh, Alireza; Yousefian, Mohsen; Moradkhani, Hoda; Poczai, Péter; Siddique, Kadambot HM (2019)
    PREMISE: In crop breeding programs, breeders use yield performance in both optimal and stressful environments as a key indicator for screening the most tolerant genotypes. During the past four decades, several yield-based indices have been suggested for evaluating stress tolerance in crops. Despite the well-established use of these indices in agronomy and plant breeding, user-friendly software providing access to these methods has been lacking. METHODS AND RESULTS: The Plant Abiotic Stress Index Calculator (iPASTIC) is an online program based on JavaScript and R that calculates common stress tolerance and susceptibility indices for various crop traits including the tolerance index (TOL), relative stress index (RSI), mean productivity (MP), harmonic mean (HM), yield stability index (YSI), geometric mean productivity (GMP), stress susceptibility index (SSI), stress tolerance index (STI), and yield index (YI). Along with these indices, this easily accessible tool can also calculate their ranking patterns, estimate the relative frequency for each index, and create heat maps based on Pearson's and Spearman's rank-order correlation analyses. In addition, it can render three-dimensional plots based on both yield performances and each index to separate entry genotypes into Fernandez's groups (A, B, C, and D), and perform principal component analysis. The accuracy of the results calculated by the software was tested using two data sets obtained from previous experiments on salinity and drought stress in wheat genotypes, respectively. CONCLUSIONS: iPASTIC can be widely used in agronomy and plant breeding programs as a user-friendly interface for agronomists and breeders dealing with large volumes of data. The software is available at
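Several of the indices listed above have short, commonly cited textbook forms based on each genotype's yield under optimal (Yp) and stress (Ys) conditions. The sketch below uses those standard literature formulas; the exact formulas implemented by iPASTIC should be checked against its own documentation:

```python
import numpy as np

def stress_indices(yp, ys):
    """Common yield-based stress indices in their standard literature
    forms (not necessarily iPASTIC's exact implementation).
    yp: yields under optimal conditions; ys: yields under stress."""
    yp, ys = np.asarray(yp, float), np.asarray(ys, float)
    si = 1 - ys.mean() / yp.mean()        # stress intensity of the trial
    return {
        "TOL": yp - ys,                   # tolerance index
        "MP":  (yp + ys) / 2,             # mean productivity
        "GMP": np.sqrt(yp * ys),          # geometric mean productivity
        "HM":  2 * yp * ys / (yp + ys),   # harmonic mean
        "YSI": ys / yp,                   # yield stability index
        "SSI": (1 - ys / yp) / si,        # stress susceptibility index
        "STI": yp * ys / yp.mean() ** 2,  # stress tolerance index
        "YI":  ys / ys.mean(),            # yield index
    }

# Three hypothetical genotypes: optimal vs stress yields.
idx = stress_indices([4.0, 5.0, 6.0], [2.0, 4.0, 3.0])
```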
  • Ikonen, Juha (Helsingin yliopisto, 2018)
    This study examines how Finnish farmers react to risk. The outcome is that Finnish farmers are, on average, risk averse and that they weight low probabilities more heavily than high ones. A questionnaire was sent to 5 000 farmers, of whom 820 responded. The questionnaire included questions intended for principal component analysis to confirm reliability. The analysis yielded two principal components, which were compared in a regression analysis of the risk parameters alpha (value function parameter) and gamma (weighting function parameter) against the farmers' background information. The two principal components were not significant when alpha or gamma was the dependent variable. Production sector was a significant variable when the weighting function parameter gamma acted as the dependent variable. Age, the amount of field owned and the farm's location had no bearing on attitudes towards risk.
  • Badshah, Ihsan Ullah (Svenska handelshögskolan, 2010)
    Economics and Society
    Modeling and forecasting implied volatility (IV) is important to both practitioners and academics, especially in trading, pricing, hedging, and risk management activities, all of which require accurate volatility estimates. The task has become challenging since the 1987 stock market crash, as implied volatilities (IVs) recovered from stock index options present two patterns, the volatility smirk (skew) and the volatility term structure, which, examined together, form a rich implied volatility surface (IVS). This implies that the assumptions behind the Black-Scholes (1973) model do not hold empirically, as asset prices are influenced by many underlying risk factors. This thesis, consisting of four essays, models and forecasts implied volatility in the presence of these empirical regularities of options markets. The first essay models the dynamics of the IVS by extending the Dumas, Fleming and Whaley (DFW) (1998) framework: using moneyness in the implied forward price and OTM put-call options on the FTSE100 index, nonlinear optimization is used to estimate different models and thereby produce rich, smooth IVSs. Here, the constant-volatility model fails to explain the variation in the rich IVS. Next, it is found that three factors can explain about 69-88% of the variance in the IVS; of this, on average, 56% is explained by the level factor, 15% by the term-structure factor, and an additional 7% by the jump-fear factor. The second essay proposes a quantile regression model of the contemporaneous asymmetric return-volatility relationship, generalizing the Hibbert et al. (2008) model. The results show a strong negative asymmetric return-volatility relationship at various quantiles of the IV distributions that increases monotonically when moving from the median quantile to the uppermost quantile (i.e., 95%); OLS therefore underestimates this relationship at upper quantiles. 
Additionally, the asymmetric relationship is more pronounced with the smirk-adjusted (skew-adjusted) volatility index measure than with the old volatility index measure. The volatility indices are ranked in terms of asymmetric volatility as follows: VIX, VSTOXX, VDAX, and VXN. The third essay examines the information content of the new VDAX volatility index for forecasting daily Value-at-Risk (VaR) estimates and compares its VaR forecasts with those of Filtered Historical Simulation and RiskMetrics. All daily VaR models are then backtested over 1992-2009 using unconditional coverage, independence, conditional coverage, and quadratic-score tests. It is found that the VDAX subsumes almost all the information required for the daily VaR forecasts of a DAX30 index portfolio; implied-VaR models outperform all other VaR models. The fourth essay models the risk factors driving swaption IVs. It is found that three factors can explain 94-97% of the variation in each of the EUR, USD, and GBP swaption IVs. There are significant linkages across factors, and bi-directional causality is at work between the factors implied by EUR and USD swaption IVs. Furthermore, the factors implied by EUR and USD IVs respond to each other's shocks; surprisingly, however, GBP does not affect them. Finally, calibration results show that the string market model can efficiently reproduce (or forecast) the volatility surface for each of the swaption markets.
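The quantile-regression idea in the second essay, estimating the return-volatility slope separately at the median and at upper quantiles, can be illustrated by minimizing the pinball (quantile) loss for a simple linear specification. This is a synthetic sketch of the generic technique only; the essay's actual specification follows Hibbert et al. (2008) and uses real index and volatility-index data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 1000
ret = rng.normal(0.0, 1.0, n)  # daily index returns (synthetic)
# Volatility changes: negative mean relation to returns, with noise whose
# scale grows as returns fall, so upper quantiles react more strongly.
dvol = -0.5 * ret + np.exp(-0.25 * ret) * rng.normal(0.0, 1.0, n)

def pinball(beta, q):
    """Pinball (quantile) loss for the line dvol = b0 + b1 * ret."""
    resid = dvol - (beta[0] + beta[1] * ret)
    return np.mean(np.maximum(q * resid, (q - 1.0) * resid))

# Slope of the return-volatility relation at the median vs the 95% quantile.
b_med = minimize(pinball, x0=[0.0, -0.5], args=(0.50,), method="Nelder-Mead").x
b_hi = minimize(pinball, x0=[0.0, -0.5], args=(0.95,), method="Nelder-Mead").x
# b_hi[1] is more negative than b_med[1]: the asymmetric relationship
# strengthens toward the upper quantiles, which is why OLS (close to the
# median fit here) understates it.
```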
  • Laban, Tracey Leah; Van Zyl, Pieter Gideon; Beukes, Johan Paul; Mikkonen, Santtu; Santana, Leonard; Josipovic, Miroslav; Vakkari, Ville; Thompson, Anne M.; Kulmala, Markku; Laakso, Lauri (2020)
    Statistical relationships between surface ozone (O3) concentration, precursor species and meteorological conditions in continental South Africa were examined from data obtained from measurement stations in north-eastern South Africa. Three multivariate statistical methods were applied in the investigation, i.e. multiple linear regression (MLR), principal component analysis (PCA) and principal component regression (PCR), and generalised additive model (GAM) analysis. The daily maximum 8-h moving average O3 concentrations were considered in these statistical models (dependent variable). MLR models indicated that meteorology and precursor species concentrations are able to explain ~50% of the variability in daily maximum O3 levels. MLR analysis revealed that atmospheric carbon monoxide (CO), temperature and relative humidity were the strongest factors affecting the daily O3 variability. In summer, daily O3 variances were mostly associated with relative humidity, while winter O3 levels were mostly linked to temperature and CO. PCA indicated that CO, temperature and relative humidity were not strongly collinear. GAM also identified CO, temperature and relative humidity as the strongest factors affecting the daily variation of O3. Partial residual plots indicated that temperature, radiation and nitrogen oxides most likely have a non-linear relationship with O3, while the relationship with relative humidity and CO is probably linear. An inter-comparison of O3 levels modelled with the three statistical models against measured O3 concentrations showed that the GAM model offered a slight improvement over the MLR model. These findings emphasise the critical role of regional-scale O3 precursors coupled with meteorological conditions in daily variances of O3 levels in continental South Africa.
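The MLR step described above, regressing daily maximum O3 on precursor and meteorological covariates and asking how much variability is explained, can be sketched on synthetic data. Variable names, coefficients, and distributions below are illustrative assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 365
# Synthetic stand-ins for the predictors (illustration only).
temp = rng.normal(25, 5, n)          # temperature
rh = rng.uniform(20, 90, n)          # relative humidity
co = rng.lognormal(0.0, 0.3, n)      # carbon monoxide
# Synthetic daily maximum O3 with assumed linear effects plus noise.
o3 = 10 + 1.5 * temp - 0.2 * rh + 8 * co + rng.normal(0, 5, n)

# Multiple linear regression by least squares with an intercept column.
X = np.column_stack([np.ones(n), temp, rh, co])
beta, *_ = np.linalg.lstsq(X, o3, rcond=None)
fitted = X @ beta
# Fraction of daily O3 variability explained by the covariates (R^2).
r2 = 1 - ((o3 - fitted) ** 2).sum() / ((o3 - o3.mean()) ** 2).sum()
```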
  • Malmberg, Anni (Helsingin yliopisto, 2020)
    A population is said to be genetically structured when it can be divided into subpopulations based on genetic differences between the individuals. In the case of Finland, for example, the population has been shown to consist of genetic subpopulations that correspond strongly to geographical subgroups. Such information may be interesting when seeking answers to questions related to the settlement and migration history of a population. Information about genetic population structure is also required, for example, in studies looking for associations between genetic variants and some inheritable disease, to ensure that the groups with and without a diagnosis of the disease resemble each other genetically except for the genetic variants causing the disease. In my thesis, I have compared how two different mathematical models, principal component analysis (PCA) and generative topographic mapping (GTM), visualize ancestry and identify genetic structure in the Finnish population. PCA was introduced already in 1901, and nowadays it is a standard tool for identifying genetic structure and visualizing ancestry. GTM, by contrast, was published relatively recently, in 1998, and has not yet been applied in population structure studies as widely as PCA. Both PCA and GTM transform high-dimensional data into a low-dimensional, interpretable representation in which relationships between observations of the data are summarized. For data containing genetic heterogeneity between individuals, this representation gives a visual approximation of the genetic structure of the population. However, Hèlèna A. Gaspar and Gerome Breen found in 2018 that GTM is able to classify the ancestry of populations from around the world more accurately than PCA: the differences recognized by PCA were mainly between the geographically most distant populations, while GTM also detected more of their subpopulations. 
My aims in the thesis were to examine whether applying the methods to Finnish data would give similar results, and to give thorough presentations of the mathematical background of both methods. I also discuss how the results fit into what is currently known about the genetic population structure in Finland. The study results are based on data from the FINRISK Study Survey collected by the National Institute for Health and Welfare (THL) in 1992-2012, comprising 35 499 samples. After performing quality control on the data, I analysed the data with the SmartPCA program and the ugtm Python package, implementing PCA and GTM, respectively. The final results are presented for the 2 010 individuals who participated in the FINRISK Study Survey in 1997 and whose parents were both born close to each other. I assigned the individuals to distinct geographical subgroups according to the birthplaces of their mothers to find out whether PCA and GTM identify individuals with a similar geographical origin as genetically close to each other. Based on the results, the genetic structure in Finland is clearly geographically clustered, which fits what is known from earlier studies. The results were also similar to those observed by Gaspar and Breen: both methods identified the genetic substructure, but GTM was able to recognize more subtle differences in ancestry between the geographically defined subgroups than PCA. For example, GTM discovered that the group corresponding to the region of Northern Ostrobothnia consists of four smaller separate subgroups, while PCA interpreted the individuals with a Northern Ostrobothnian origin to be genetically rather homogeneous. Locating these individuals on the map of Finland according to the birthplaces of their mothers reveals that they also form four geographical clusters corresponding to the genetic subpopulations detected by GTM. 
As a final conclusion, I state that GTM is a noteworthy alternative to PCA for studying genetic population structure, especially when it comes to identifying substructures within a population that PCA may interpret as genetically homogeneous. I also note that the reason why GTM generally seems to be capable of more fine-grained clustering than PCA is probably that PCA, as a linear model, may introduce more bias into the results than GTM, which also accounts for non-linear relationships when transforming the data into a more interpretable form.
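The PCA step both methods build on can be sketched on a toy genotype matrix: with two subpopulations drawn from different allele frequencies, the individuals separate along the first principal component. This is a minimal numpy illustration under assumed synthetic data, not the FINRISK analysis; GTM (e.g. via the ugtm package mentioned above) would replace the linear projection with a non-linear mapping:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy genotype matrix: 60 individuals x 100 variants coded 0/1/2, with
# two subpopulations drawn from different allele frequencies (synthetic).
freq_a = rng.uniform(0.1, 0.9, 100)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, 100), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 100))
pop_b = rng.binomial(2, freq_b, size=(30, 100))
geno = np.vstack([pop_a, pop_b]).astype(float)

# PCA: centre each variant, project individuals onto the top component.
centred = geno - geno.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
pc1 = centred @ vt[0]  # first principal component scores

# The two subpopulations land on opposite sides of PC1, which is how
# PCA scatter plots reveal geographically clustered genetic structure.
```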