Suitability of random forest analysis for epidemiological research: Exploring sociodemographic and lifestyle-related risk factors of overweight in a cross-sectional design.

Permalink

http://hdl.handle.net/10138/303648

Citation

Kanerva, N, Kontto, J, Erkkola, M, Nevalainen, J & Männistö, S 2018, 'Suitability of random forest analysis for epidemiological research: Exploring sociodemographic and lifestyle-related risk factors of overweight in a cross-sectional design', Scandinavian Journal of Public Health, vol. 46, no. 5, pp. 557-564. https://doi.org/10.1177/1403494817736944

Title: Suitability of random forest analysis for epidemiological research: Exploring sociodemographic and lifestyle-related risk factors of overweight in a cross-sectional design.
Author: Kanerva, Noora; Kontto, Jukka; Erkkola, Maijaliisa; Nevalainen, Jaakko; Männistö, Satu
Contributor: University of Helsinki, Clinicum
University of Helsinki, Maijaliisa Erkkola / Principal Investigator
Date: 2018-07
Language: eng
Number of pages: 8
Belongs to series: Scandinavian Journal of Public Health
ISSN: 1403-4948
URI: http://hdl.handle.net/10138/303648
Abstract: Aims: Factors that contribute to the development of overweight are numerous and form a complex structure with many unknown interactions and associations. We aimed to explore this structure (i.e. the mutual importance or hierarchy of sociodemographic and lifestyle-related risk factors of being overweight) using a machine-learning technique called random forest (RF). The results were compared with traditional logistic regression (LR) analysis. Methods: The cross-sectional FINRISK 2007 Study included 4757 Finns (aged 25-74 years). Information on participants' lifestyle and sociodemographic characteristics was collected with questionnaires. Diet was assessed using a validated food-frequency questionnaire. Height and weight were measured. Participants with a body mass index (BMI) ≥ 25 kg/m² were classified as overweight. The R statistical software was used to run the RF analysis (the 'randomForest' package) to derive estimates for variable importance and out-of-bag error, which were compared with those from an LR model. Results: In total, 704 (32%) men and 1119 (44%) women had normal BMI, whereas 1502 (69%) men and 1432 (57%) women had BMI ≥ 25. Estimated error rates for the models were similar (RF vs. LR: 42% vs. 40% for men, 38% vs. 35% for women). Both models ranked age, education and physical activity as the most important risk factors for being overweight, but RF ranked macronutrients (carbohydrates and protein) as more important than LR did. Conclusions: RF did not demonstrate higher power in variable selection compared to LR in our study. The features of RF are more likely to appear beneficial in settings with a larger number of predictors.
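The outcome definition and the error rate mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code (the study used R and 'randomForest'); the function names `bmi`, `is_overweight` and `error_rate` are hypothetical, and the only values taken from the abstract are the BMI cut-off of 25 kg/m² and the idea of comparing predicted with observed overweight status.

```python
from typing import Sequence

# Cut-off used in the abstract to classify participants as overweight.
OVERWEIGHT_BMI = 25.0


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m ** 2)


def is_overweight(weight_kg: float, height_m: float) -> bool:
    """True when BMI is at or above the 25 kg/m^2 cut-off."""
    return bmi(weight_kg, height_m) >= OVERWEIGHT_BMI


def error_rate(observed: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Share of participants whose predicted status differs from the observed one.

    This is a plain misclassification rate. It illustrates what an "error rate"
    measures; it is not the out-of-bag procedure itself, which evaluates each
    tree only on the observations left out of its bootstrap sample.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(o != p for o, p in zip(observed, predicted))
    return wrong / len(observed)
```

A model whose error rate is close to the share of the smaller outcome class is predicting little better than always guessing the majority class, which is one way to read the similar RF and LR error rates reported above.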
Subject: 3142 Public health care science, environmental and occupational health
Machine learning
mutual importance
obesity
random forest
risk factor
FOOD FREQUENCY QUESTIONNAIRE
DOUBLY LABELED WATER
VALIDITY
CLASSIFICATION
REGRESSION
INDEX
Files in this item

Files Size Format View
1403494817736944.pdf 152.2Kb PDF View/Open
