Browsing by Subject "random forest"


Now showing items 1-11 of 11
  • Nadal-Sala, Daniel; Grote, Ruediger; Birami, Benjamin; Lintunen, Anna; Mammarella, Ivan; Preisler, Yakir; Rotenberg, Eyal; Salmon, Yann; Tatarinov, Fedor; Yakir, Dan; Ruehr, Nadine K. (2021)
    Climate change will impact forest productivity worldwide. Forecasting the magnitude of such impact, with multiple environmental stressors changing simultaneously, is only possible with the help of process-based models. In order to assess their performance, such models require careful evaluation against measurements. However, direct comparison of model outputs against observational data is often not reliable, as models may provide the right answers due to the wrong reasons. This would severely hinder forecasting abilities under unprecedented climate conditions. Here, we present a methodology for model assessment, which supplements the traditional output-to-observation model validation. It evaluates model performance through its ability to reproduce observed seasonal changes of the most limiting environmental driver (MLED) for a given process, here daily gross primary productivity (GPP). We analyzed seasonal changes of the MLED for GPP in two contrasting pine forests, the Mediterranean Pinus halepensis Mill. Yatir (Israel) and the boreal Pinus sylvestris L. Hyytiala (Finland) from three years of eddy-covariance flux data. Then, we simulated the same period with a state-of-the-art process-based simulation model (LandscapeDNDC). Finally, we assessed if the model was able to reproduce both GPP observations and MLED seasonality. We found that the model reproduced the seasonality of GPP in both stands, but it was slightly overestimated without site-specific fine-tuning. Interestingly, although LandscapeDNDC properly captured the main MLED in Hyytiala (temperature) and in Yatir (soil water availability), it failed to reproduce high-temperature and high-vapor pressure limitations of GPP in Yatir during spring and summer. We deduced that the most likely reason for this divergence is an incomplete description of stomatal behavior. In summary, this study validates the MLED approach as a model evaluation tool, and opens up new possibilities for model improvement.
  • Kantola, Tuula; Vastaranta, Mikko; Yu, Xiaowei; Lyytikäinen-Saarenmaa, Päivi; Holopainen, Markus; Talvitie, Mervi; Kaasalainen, Sanna; Solberg, Svein; Hyyppä, Juha (2010)
    Climate change and rising temperatures have been observed to be related to the increase of forest insect damage in the boreal zone. The common pine sawfly (Diprion pini L.) (Hymenoptera, Diprionidae) is regarded as a significant threat to boreal pine forests. Defoliation by D. pini can cause severe growth loss and tree mortality in Scots pine (Pinus sylvestris L.) (Pinaceae). In this study, logistic LASSO regression, Random Forest (RF) and Most Similar Neighbor method (MSN) were investigated for predicting the defoliation level of individual Scots pines using the features derived from airborne laser scanning (ALS) data and aerial images. Classification accuracies from 83.7% (kappa 0.67) to 88.1% (kappa 0.76) were obtained depending on the method. The most accurate result was produced using RF with a combination of data from the two sensors, while the accuracies when using ALS and image features separately were 80.7% and 87.4%, respectively. Evidently, the combination of ALS and aerial images in detecting needle losses is capable of providing satisfactory estimates for individual trees.
  • Suikkanen, Sanna; Uusitalo, Laura; Lehtinen, Sirpa; Lehtiniemi, Maiju; Kauppila, Pirkko; Mäkinen, Katja; Kuosa, Harri (Elsevier, 2021)
    Food Webs 28, e00202
    Blooms of cyanobacteria are recurrent phenomena in coastal estuaries. Their maximum abundance coincides with the productive period of zooplankton and pelagic fish. Experimental studies indicate that diazotrophic, i.e. dinitrogen (N2)-fixing cyanobacterial (taxonomic order Nostocales) blooms affect zooplankton, as well as other phytoplankton. We used multidecadal monitoring data from one archipelago station (1992–2013) and ten open sea stations (1979–2013) in the Baltic Sea to explore the potential bottom-up connections between diazotrophic and non-diazotrophic cyanobacteria and phyto- and zooplankton in natural plankton communities. Random forest regression, combined with linear regression analysis showed that the biomass of cyanobacteria (both diazotrophic and non-diazotrophic) was barely connected to any of the phytoplankton and zooplankton variables examined. Instead, physico-chemical variables (salinity, temperature, total phosphorus), as well as spatial and temporal variability seemed to have more significant connections to both phytoplankton and zooplankton variables. Zooplankton variables were also connected to the biomass of phytoplankton groups other than cyanobacteria (such as chrysophytes, cryptophytes and prymnesiophytes), and phytoplankton variables had connections with the biomass of different zooplankton groups, especially copepods. Overall, negative relationships between cyanobacteria and other plankton taxa were scarcer than expected based on previous experimental studies.
  • Roivainen, Hege (Helsingin yliopisto, 2017)
    The metadata catalogues of national libraries are good sources of information, as they contain records of nearly all material published in a given period and area. They are usually comprehensively described, so they can serve as sources for quantitative research. In research it is often worthwhile to divide the material into smaller parts, for example by genre. In many cases, however, gaps in the data reduce its usability. This master's thesis assesses the possibility of using machine learning to find research-relevant subsets in library catalogues. As the example case I chose the English Short Title Catalogue (ESTC), with poetry books as the subset to be found. The genre information of poetry books should be annotated, but in real library catalogues it is often missing. To obtain the best result, I used the random forest algorithm with different types of feature vectors traditionally used in authorship attribution and genre classification, together with the values of metadata fields. Because library catalogues do not contain the full text of books, feature selection focused on the words used in titles and on linguistic properties. Titles are usually short and contain very little information, so I combined the best-performing features of the feature vectors and carried out the final search with them. The main result of the study was confirmation that using titles to construct features is a viable strategy. The study opens up possibilities for defining subsets by means of machine learning in the future and for increasing the use of library catalogues in quantitative research.
  • Huong Thi Thanh Nguyen; Trung Minh Doan; Tomppo, Erkki; McRoberts, Ronald E. (2020)
    Information on land use and land cover (LULC), including forest cover, is important for the development of strategies for land planning and management. Satellite remotely sensed data of varying resolutions have been an unmatched source of such information that can be used to produce estimates with a greater degree of confidence than traditional inventory estimates. However, use of these data has always been a challenge in tropical regions owing to the complexity of the biophysical environment, clouds, haze, and atmospheric moisture content, all of which impede accurate LULC classification. We tested a parametric classifier (logistic regression) and three non-parametric machine learning classifiers (improved k-nearest neighbors, random forests, and support vector machine) for classification of multi-temporal Sentinel 2 satellite imagery into LULC categories in Dak Nong province, Vietnam. A total of 446 images, 235 from the year 2017 and 211 from the year 2018, were pre-processed to gain high quality images for mapping LULC in the 6516 km² study area. The Sentinel 2 images were tested and classified separately for four temporal periods: (i) dry season, (ii) rainy season, (iii) the entirety of the year 2017, and (iv) the combination of dry and rainy seasons. Eleven different LULC classes were discriminated, of which five were forest classes. For each combination of temporal image set and classifier, a confusion matrix was constructed using independent reference data and pixel classifications, and the area on the ground of each class was estimated. Over all temporal periods and classifiers, overall accuracy ranged from 63.9% to 80.3%, and the Kappa coefficient ranged from 0.611 to 0.813. Area estimates for individual classes ranged from 70 km² (1% of the study area) to 2200 km² (34% of the study area), with greater uncertainties for smaller classes.
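The accuracy measures named in this abstract can be illustrated with a minimal sketch. It assumes a square confusion matrix with reference classes in rows and predicted classes in columns; the 3-class matrix and the function names are hypothetical and are not taken from the study.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohens_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    row_totals = [sum(cm[i]) for i in range(n)]
    col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement expected from the marginal totals alone.
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical matrix: rows = reference class, columns = predicted class.
example_cm = [
    [50, 5, 5],
    [4, 30, 6],
    [6, 4, 40],
]
example_oa = overall_accuracy(example_cm)
example_kappa = cohens_kappa(example_cm)
```

Kappa is lower than overall accuracy here because part of the observed agreement would be expected by chance from the class proportions alone.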
  • Rintarunsala, Juhani (Helsingin yliopisto, 2018)
    Climate change, and the ability of forests to accumulate carbon, have long been topics of concern for forestry internationally. In Finland, these concerns have also been tied to the economic, cultural and social value of forests. In view of these values, it is important to maintain forest resources at a sustainable level for all the different sectors. For sustainability, knowing the current state of forests is essential. This information is collected through forest inventories, which today are mainly based on remote sensing methods. To support reliable decision-making, forest information needs to be up to date and accurate. The aim of the thesis was to examine the accuracy of different tree attribute estimates, to compare them with each other, and to investigate the accuracy of growth models in producing the estimates. A further aim was to evaluate how the accuracy of the remote sensing estimates affects the timing of harvests. The study area was located in the boreal coniferous forest zone at Evo in Southern Finland (61.19˚N, 25.11˚E). It covered 5 km x 5 km and comprised about 2000 hectares of forest treated in different ways. Estimates of forest inventory attributes were generated from field measurements, aerial imagery, and airborne laser scanning data, based on three different statistical features derived from the remote sensing data. The forest inventory attributes were volume V, basal area-weighted mean diameter Dg, basal area-weighted mean height, number of stems per hectare, and basal area G. The attributes were predicted with a non-parametric k-NN method, and the random forest algorithm was used to select the nearest neighbors. Growth modeling was carried out using SIMO software.
The results show that, as a rule, airborne laser scanning estimates are more accurate than aerial imagery estimates. In addition, prediction precision was better for coniferous trees than for deciduous trees. In the forest inventory attribute estimates, basal area G and volume V in particular are generally underestimated, which is likely to delay the scheduled timing of harvests. Updating remote sensing estimates with growth models appears to yield more biased estimates than a new remote sensing inventory.
  • Kämäräinen, Emma (Helsingin yliopisto, 2018)
    The topic of this thesis, modelling and predicting the service life of mobile phones, is part of the device model of the telecom operator DNA Oyj. The device model comprises predicting the time of purchase, the price and the manufacturer of a customer's next phone. An estimate of the time of purchase is essential information for companies that sell mobile devices, because it can be used to time device recommendations and to take timely action for the customer. For modelling service life, data were retrieved from DNA Oyj's database and further processed into a form suitable for modelling. New data accumulate continuously, so the data used in the modelling change as often as daily. The device model is run in DNA Oyj's production environment, and its results are in operational use. At the beginning of my thesis I present the random forest algorithm used in the modelling, a method based on an ensemble of decision trees. I first describe the history of the algorithm and its theoretical background. To make the algorithm's operation understandable, I also present other machine learning methods that are an essential part of it. The random forest method has many good properties, which I detail in the theory part. For example, the importance of each explanatory variable can be calculated as the method is run. After presenting the theory, I define a few metrics that I use in the modelling phase to analyse and validate the results. Next I describe the data used in the work. Two data retrievals were made: one for the training data and one for the data for which the final predictions are produced. The data contain many variables, so I present them in two parts: first the properties of the device, then the information on the customer. The purchase dates of the devices yielded the response variable of the modelling, the usage time of the phone, which was classified at a resolution of three months.
In addition to the purchase date, various technical properties of the phone are known, including the operating system and 4G capability. Of the customer data, demographic information such as sex and age was used in the modelling. Broadband availability determined from the address given by the customer and variables related to mobile data use were also utilized. After presenting the data, I describe the actual modelling. In connection with the modelling I study the effect of different parameters on the prediction results. Class predictions for the service life of mobile devices were created with the optimal parameters. One property of the random forest algorithm is that, as the method is run, its results can be evaluated on data that the method did not use for building the model in that run. The evaluation used metrics suitable for classification methods, according to which the algorithm predicts a large part of the data successfully. After setting the parameters and training the model, classes were produced for the prediction data. The accuracy of the final predictions cannot be assessed until the customer buys a new phone; in some cases the replacement may take several years. I conclude my thesis by evaluating how well the method works and by discussing the variables behind device replacement. Although rich data were available, what is probably the most common reason for replacing a phone, device failure, was not available at the time of the work. Adding data on the reasons for replacement would improve the modelling results further. At the end I also discuss the challenges of modelling that is run in production and changes daily. One factor affecting the results is the fixed parameters, which may no longer produce the best predictions as the data change. DNA Oyj intends to develop the device model further.
  • Mäkinen, Antti (Helsingin yliopisto, 2020)
    Urban trees and forests are important for human well-being and the diversity of urban nature. Urban forests maintain biodiversity, improve air quality and offer aesthetic and recreational value. Urban trees also have some negative effects. Trees in bad condition can cause harm or danger to humans or property. Dense and shady urban forests may cause feelings of insecurity, and tree pollen can cause health problems. Urban trees require intensive management, and their condition must be constantly monitored. Maximizing the benefits of urban trees and minimizing the disadvantages requires detailed data on urban trees. For this reason, many municipalities and cities maintain a tree register with accurate information on individual city trees. Traditionally, data on urban trees have been collected and updated by field surveys, which is laborious and expensive. New laser scanning methods that produce accurate three-dimensional information offer the opportunity to update the tree register automatically. Interest in utilizing them in urban tree mapping and monitoring has been growing rapidly in recent years. This thesis studied ALS-based individual tree detection methods in urban tree mapping. The aim of this study was to determine whether the accuracy of the automatically generated canopy map from ALS data could be improved by a semi-automatic method. Initially, a detailed canopy map of trees was produced by an automated method. Tree candidates were delineated from the surface model using watershed segmentation. The canopy segmentation produced by the automated method was visually modified, and incorrectly delimited canopy segments were corrected. This resulted in a semi-automatically produced canopy map. The results of the automatic and semi-automatic canopy segmentation methods were compared by determining the detection accuracy of the trees and the modeling accuracy of the tree diameter.
The results were compared with the number and the diameter of trees measured in the field. The non-parametric random forest method and the nearest neighbor (kNN) method were used in the diameter modeling process. The study area consisted of nine Helsinki hospital areas with a total area of 47.2 ha. There were 4365 trees and 37 different tree species measured in the field. The automatic method produced 6860 trees and the semi-automatic method produced 3500 trees. Thus, the automatic method produced an overestimation of 57.2% and the semi-automatic method produced an underestimation of 19.5% compared to the reference trees. The largest overestimation by the automatic method was in the Koskela study area (221.6%) and the smallest underestimation was produced by the semi-automatic method in the Suursuo study area (75.5%). Of the canopy segments produced by the automatic method, 63% were commission errors, compared with 33% of the segments produced by the semi-automatic method. The absolute RMSE of the diameter prediction was 12.84 cm with the automatic method and 10.99 cm with the semi-automatic method. The diameter predictions of the whole data were 6% more accurate with the semi-automatic method. The results of the study showed that the accuracy of the automatically generated canopy map from the laser scanning data can be improved by the semi-automatic method. Tree mapping accuracy improved in terms of both tree detection accuracy and diameter modeling accuracy. Based on the results of the study, the semi-automatic method is useful especially in parkland areas, but in densely wooded forest areas there are still issues to solve to make this method practical. The benefits of a semi-automated method should be assessed by comparing the workload with the results. Based on this study, the semi-automatic individual tree detection method used in this work could be useful in the operational mapping and monitoring of urban trees.
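The over- and underestimation percentages and the RMSE quoted in this abstract follow standard formulas, sketched below. The tree counts are the ones given in the abstract; the function names and the diameter values passed to `rmse` are hypothetical, and the thesis may have computed its figures from unrounded data.

```python
import math


def relative_difference_pct(detected, reference):
    """Signed difference between detected and reference counts, in percent."""
    return 100.0 * (detected - reference) / reference


def rmse(predicted, observed):
    """Root mean square error between paired predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Counts from the abstract: 4365 field-measured reference trees.
auto_pct = relative_difference_pct(6860, 4365)   # positive: overestimation
semi_pct = relative_difference_pct(3500, 4365)   # negative: underestimation

# Hypothetical diameters in cm, for illustration only.
example_rmse = rmse([10.0, 20.0], [13.0, 16.0])
```

With these counts the automatic method comes out at about +57.2%, matching the abstract. The semi-automatic count of 3500 gives about -19.8%, slightly different from the reported 19.5%, which suggests that the count in the abstract is rounded.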
  • Yu, Xiaowei; Hyyppa, Juha; Litkey, Paula; Kaartinen, Harri; Vastaranta, Mikko; Holopainen, Markus (2017)
    This paper investigated the potential of multispectral airborne laser scanning (ALS) data for individual tree detection and tree species classification. The aim was to develop a single-sensor solution for forest mapping that is capable of providing species-specific information, required for forest management and planning purposes. Experiments were conducted using 1903 ground measured trees from 22 sample plots and multispectral ALS data, acquired with an Optech Titan scanner over a boreal forest, mainly consisting of Scots pine (Pinus sylvestris), Norway spruce (Picea abies), and birch (Betula sp.), in southern Finland. ALS-features used as predictors for tree species were extracted from segmented tree objects and used in random forest classification. Different combinations of features, including point cloud features, and intensity features of single and multiple channels, were tested. Among the field-measured trees, 61.3% were correctly detected. The best overall accuracy (OA) of tree species classification achieved for correctly-detected trees was 85.9% (Kappa = 0.75), using a point cloud and single-channel intensity features combination, which was not significantly different from the ones that were obtained either using all features (OA = 85.6%, Kappa = 0.75), or single-channel intensity features alone (OA = 85.4%, Kappa = 0.75). Point cloud features alone achieved the lowest accuracy, with an OA of 76.0%. Field-measured trees were also divided into four categories. An examination of the classification accuracy for four categories of trees showed that isolated and dominant trees can be detected with a detection rate of 91.9%, and classified with a high overall accuracy of 90.5%. The corresponding detection rate and accuracy were 81.5% and 89.8% for a group of trees, 26.4% and 79.1% for trees next to a larger tree, and 7.2% and 53.9% for trees situated under a larger tree, respectively.
The results suggest that Channel 2 (1064 nm) contains more information for separating pine, spruce, and birch, followed by channel 1 (1550 nm) and channel 3 (532 nm) with an overall accuracy of 81.9%, 78.3%, and 69.1%, respectively. Our results indicate that the use of multispectral ALS data has great potential to lead to a single-sensor solution for forest mapping.
  • Räty, Matti (Helsingin yliopisto, 2020)
    SQL is among the recommended subjects in computer science. It is an efficient way to store data regardless of context. As a subject, however, SQL is difficult for students to learn, and for this reason educational software is used alongside SQL teaching. Such software lets students practise SQL hands-on, compensates for the large number of students relative to the number of teachers, and collects data on student performance. The data collected by learning software on student performance make it possible to predict students' performance in a course with machine learning methods. This thesis trains well-established machine learning algorithms on data from an SQL teaching application to produce models that can predict whether a student will solve an SQL exercise correctly on the next attempt. The aim is not to build a model that could grade SQL exercises, but to let the machine learning algorithms use other statistics collected from students to assess the correctness of an attempt without the student's actual solution. The thesis finds that several machine learning models work for this purpose. Similar models can be used to find students who have difficulty completing the exercises. This information is valuable, for example, to educational software that aims to give hints to those doing SQL exercises at a useful time.
  • Kanerva, Noora; Kontto, Jukka; Erkkola, Maijaliisa; Nevalainen, Jaakko; Männistö, Satu (2018)
    Aims: Factors that contribute to the development of overweight are numerous and form a complex structure with many unknown interactions and associations. We aimed to explore this structure (i.e. the mutual importance or hierarchy of sociodemographic and lifestyle-related risk factors of being overweight) using a machine-learning technique called random forest (RF). The results were compared with traditional logistic regression (LR) analysis. Methods: The cross-sectional FINRISK 2007 Study included 4757 Finns (aged 25-74 years). Information on participants' lifestyle and sociodemographic characteristics was collected with questionnaires. Diet was assessed using a validated food-frequency questionnaire. Height and weight were measured. Participants with a body mass index (BMI) ≥25 kg/m² were classified as overweight. R statistical software was used to run the RF analysis ('randomForest') to derive estimates for variable importance and out-of-bag error, which were compared to an LR model. Results: In total, 704 (32%) men and 1119 (44%) women had normal BMI, whereas 1502 (69%) men and 1432 (57%) women had BMI ≥25. Estimated error rates for the models were similar (RF vs. LR: 42% vs. 40% for men, 38% vs. 35% for women). Both models ranked age, education and physical activity as the most important risk factors for being overweight, but RF ranked macronutrients (carbohydrates and protein) as more important compared to LR. Conclusions: RF did not demonstrate higher power in variable selection compared to LR in our study. The features of RF are more likely to appear beneficial in settings with a larger number of predictors.
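The out-of-bag error mentioned in this abstract rests on bootstrap sampling: each tree is trained on a sample drawn with replacement, and the observations left out of that sample are used to score it. The sketch below illustrates only that bookkeeping in plain Python. It is not the study's R code, the function names and toy data are hypothetical, and the per-tree predictions are taken as given rather than produced by trained trees.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob


def oob_error(y_true, tree_predictions, oob_sets):
    """Out-of-bag error rate by majority vote.

    tree_predictions[t][i] is the class tree t predicts for sample i, and
    oob_sets[t] holds the samples that tree t did not see during training.
    Each sample is scored only by trees for which it was out-of-bag.
    """
    errors = 0
    scored = 0
    for i, truth in enumerate(y_true):
        votes = {}
        for preds, oob in zip(tree_predictions, oob_sets):
            if i in oob:
                votes[preds[i]] = votes.get(preds[i], 0) + 1
        if not votes:
            continue  # sample was in-bag for every tree, so it is not scored
        majority = max(votes, key=votes.get)
        scored += 1
        errors += int(majority != truth)
    if scored == 0:
        raise ValueError("no sample was out-of-bag for any tree")
    return errors / scored


# Toy example with three hypothetical trees and three samples.
toy_truth = [0, 1, 1]
toy_predictions = [[0, 1, 0], [0, 0, 1], [1, 1, 1]]
toy_oob_sets = [{0, 2}, {0, 1, 2}, {0, 2}]
toy_error = oob_error(toy_truth, toy_predictions, toy_oob_sets)

# Roughly a third of the observations are left out of each bootstrap sample.
in_bag_idx, oob_idx = bootstrap_split(10000, random.Random(0))
oob_fraction = len(oob_idx) / 10000
```

Because every observation is scored only by trees that never saw it, the out-of-bag error behaves like a built-in validation estimate and needs no separate hold-out set.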