Using Random forests to predict total career performance of young Finnish trotters with early racing performance and other variables




Title: Using Random forests to predict total career performance of young Finnish trotters with early racing performance and other variables
Author: Niinikoski, Eerik
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2020
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202006012541
http://hdl.handle.net/10138/315768
Thesis level: master's thesis
Degree program: Master's Programme in Life Science Informatics
Specialisation: Biostatistics and Bioinformatics
Discipline: none
Abstract: The aim of this thesis is to predict the total career racing performance of Finnish trotters from their early-career racing performance and other early-career variables. The thesis gives a brief introduction to harness racing and to the horses used in Finnish trotting. The data are described with tables and figures of descriptive statistics and prepared for prediction. The machine learning method of Random forests for regression is introduced and used for the predictions. After the models are trained, the thesis reports the prediction accuracy and the variable importance for total career racing performance in both the Finnhorse and the Finnish Standardbred trotter populations. Finally, the author discusses the shortcomings of the work and possible improvements for future research. The data were provided by the Finnish trotting and breeding association (Suomen Hippos ry) and cover all harness races run in Finland from 1984 to the end of 2019. From almost three million rows, the data were summarised into a table of 46704 rows, each describing a trotter that started its career in one of the three earliest allowed age groups. A total of 37 independent variables were used to predict three outcomes in separate models: total career earnings, total number of career starts and total number of career first placings. The predictors are derived from other studies that estimate the environmental and genetic factors of a trotter's racing performance. The three models performed poorly to moderately, with total earnings having the highest prediction accuracy. The earnings model predicted larger earnings quite well, but it tended to predict some earnings when there were in fact none. Prediction accuracy for the total number of starts was poor, especially when the true number of starts was low. The model for the total number of career first placings performed the worst. This can partly be explained by the fact that winning is, in general, a rare event for a trotter. The models fit Finnish Standardbred trotters better than Finnhorse trotters. This thesis serves as a basis for similar future research in which large amounts of data and machine learning are used to predict a trotter's career, racing performance or other factors. The results suggest that treating the prediction of total career racing performance as a classification problem could be a better fit than regression. Suitable classes, as well as possibly better predictors and suitable imputation of missing values, should be decided in consultation with experts in harness racing.
Subject: harness racing
trotter
racing performance
early career
machine learning
random forests
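
The summarisation step described in the abstract, from one row per race start to one row per trotter carrying the three career outcomes, can be sketched roughly as follows. This is only an illustration: the field names (`horse_id`, `earnings`, `placing`) and the sample rows are hypothetical and are not taken from the Suomen Hippos data or from the thesis itself.

```python
from collections import defaultdict

# Hypothetical race-level rows, one dict per start. The field names are
# illustrative assumptions, not the schema of the actual data set.
race_rows = [
    {"horse_id": "A", "earnings": 500.0, "placing": 1},
    {"horse_id": "A", "earnings": 0.0, "placing": 5},
    {"horse_id": "B", "earnings": 120.0, "placing": 3},
    {"horse_id": "A", "earnings": 250.0, "placing": 2},
    {"horse_id": "B", "earnings": 800.0, "placing": 1},
]


def summarise_careers(rows):
    """Collapse race-level rows into one career summary per horse."""
    totals = defaultdict(
        lambda: {"total_earnings": 0.0, "total_starts": 0, "total_wins": 0}
    )
    for row in rows:
        career = totals[row["horse_id"]]
        career["total_earnings"] += row["earnings"]  # total career earnings
        career["total_starts"] += 1                  # total number of starts
        if row["placing"] == 1:                      # count first placings
            career["total_wins"] += 1
    return dict(totals)


career_table = summarise_careers(race_rows)
for horse_id, career in sorted(career_table.items()):
    print(horse_id, career)
```

Each of the three totals would then serve as the target of its own regression model, with the early-career predictors added as further columns of the same per-trotter table.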


Files in this item

File: Niinikoski_Eerik_Pro_gradu_2020.pdf
Size: 967.6Kb
Format: PDF
