Software Plagiarism Detection Using N-grams

Show full item record

Title: Software Plagiarism Detection Using N-grams
Author: Wahlroos, Kristian
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2019
Language: eng
Thesis level: master's thesis
Discipline: Algorithms and Machine Learning
Abstract: Plagiarism is an act of copying where one doesn’t rightfully credit the original source. The motivations behind plagiarism can vary from completing academic courses to even gaining economical advantage. Plagiarism exists in various domains, where people want to take credit from something they have worked on. These areas can include e.g. literature, art or software, which all have a meaning for an authorship. In this thesis we conduct a systematic literature review from the topic of source code plagiarism detection methods, then based on the literature propose a new approach to detect plagiarism which combines both similarity detection and authorship identification, introduce our tokenization method for the source code, and lastly evaluate the model by using real life data sets. The goal for our model is to point out possible plagiarism from a collection of documents, which in this thesis is specified as a collection of source code files written by various authors. Our data, which we will use to our statistical methods, consists of three datasets: (1) collection of documents belonging to University of Helsinki’s first programming course, (2) collection of documents belonging to University of Helsinki’s advanced programming course and (3) submissions for source code re-use competition. Statistical methods in this thesis are inspired by the theory of search engines, which are related to data mining when detecting similarity between documents and machine learning when classifying document with the most likely author in authorship identification. Results show that our similarity detection model can be used successfully to retrieve documents for further plagiarism inspection, but false positives are quickly introduced even when using a high threshold that controls the minimum allowed level of similarity between documents. We were unable to use the results of authorship identification in our study, as the results with our machine learning model were not high enough to be used sensibly. This was possibly caused by the high similarity between documents, which is due to the restricted tasks and the course setting that teaches a specific programming style during the timespan of the course.

Files in this item

Total number of downloads: Loading...

Files Size Format View
Wahlroos_Kristian_Pro_gradu_2018.pdf 815.1Kb PDF View/Open

This item appears in the following Collection(s)

Show full item record