Differentially Private Robust Linear Regression


Title: Differentially Private Robust Linear Regression
Author: Nieminen, Arttu
Other contributor: Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta, Matematiikan ja tilastotieteen laitos
University of Helsinki, Faculty of Science, Department of Mathematics and Statistics
Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten, Institutionen för matematik och statistik
Publisher: Helsingfors universitet
Date: 2017
Language: eng
URI: http://urn.fi/URN:NBN:fi-fe2017112252535
Thesis level: master's thesis
Discipline: Applied Mathematics
Abstract: Differential privacy is a mathematically defined concept of data privacy based on the idea that a person should not face any additional harm by opting to give their data to a data collector. Data release mechanisms that satisfy the definition are said to be differentially private. They guarantee the privacy of the data at a specified privacy level by using carefully designed randomness that sufficiently masks the participation of each individual in the data set. The introduced randomness decreases the accuracy of the data analysis, but this effect can be diminished by clever algorithmic design. The robust private linear regression algorithm is a differentially private mechanism originally introduced by A. Honkela, M. Das, O. Dikmen, and S. Kaski in 2016. The algorithm is based on projecting the studied data inside known bounds and applying the differentially private Laplace mechanism to perturb the sufficient statistics of a Bayesian linear regression model, which is then fitted to the data using the privatised statistics. In this thesis, the idea, definitions, and most important theorems and properties of differential privacy are presented and discussed. The robust private linear regression algorithm is then presented in detail, including improvements to how the parameters of the mechanism are determined and handled. These improvements were developed during my work as a research assistant in the Probabilistic Inference and Computational Biology research group (Department of Computer Science at the University of Helsinki and Helsinki Institute for Information Technology) in 2016-2017. The performance of the algorithm is evaluated experimentally on both synthetic and real-life data. The latter data are from the Genomics of Drug Sensitivity in Cancer (GDSC) project and consist of the gene expression data of 985 cancer cell lines and their responses to 265 different anti-cancer drugs.
The studied algorithm is applied to the GDSC data with the goal of predicting which cancer cell lines are sensitive to each drug and which are not. Applying a differentially private mechanism to the gene expression data is justified because genomic data are identifying and carry highly sensitive information about, for example, an individual's phenotype, health, and risk of various diseases. The results presented in the thesis show that the studied algorithm works as intended and benefits from having more data: in terms of prediction accuracy, it approaches the non-private version of the same algorithm as the size of the available data set increases. It also reaches considerably better accuracy than the three compared algorithms, which are based on different differentially private mechanisms: private linear regression with no projection, output perturbed linear regression, and functional mechanism linear regression.
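
The three steps named in the abstract (projecting the data inside known bounds, perturbing the sufficient statistics with the Laplace mechanism, and fitting a Bayesian linear regression to the privatised statistics) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the symmetric bounds `bx` and `by`, the add-or-remove-one-record neighbouring relation behind the sensitivities, the even split of the privacy budget `epsilon`, and the Gaussian prior with precision `alpha` and noise precision `beta` are all assumptions made for illustration, and simple clipping stands in for the projection step.

```python
import numpy as np


def private_sufficient_stats(X, y, bx, by, epsilon, rng):
    """Clip the data to assumed bounds and return Laplace-perturbed X^T X and X^T y."""
    d = X.shape[1]
    # Project (here: clip) every feature into [-bx, bx] and every target into [-by, by].
    Xc = np.clip(X, -bx, bx)
    yc = np.clip(y, -by, by)

    # Assumed L1 sensitivities when one record is added or removed:
    # the symmetric matrix X^T X has d(d+1)/2 distinct entries, each changing
    # by at most bx**2; X^T y has d entries, each changing by at most bx*by.
    sens_xx = d * (d + 1) / 2 * bx**2
    sens_xy = d * bx * by

    # Assumption: the privacy budget is split evenly between the two statistics,
    # so that sequential composition gives epsilon in total.
    eps_each = epsilon / 2.0

    # Draw noise for the upper triangle and mirror it to keep the matrix symmetric.
    noise = rng.laplace(scale=sens_xx / eps_each, size=(d, d))
    noise = np.triu(noise) + np.triu(noise, 1).T

    xx_priv = Xc.T @ Xc + noise
    xy_priv = Xc.T @ yc + rng.laplace(scale=sens_xy / eps_each, size=d)
    return xx_priv, xy_priv


def posterior_mean(xx, xy, alpha, beta):
    """Posterior mean of the weights under an assumed N(0, I/alpha) prior
    and Gaussian observation noise with precision beta."""
    d = xx.shape[0]
    precision = alpha * np.eye(d) + beta * xx
    return np.linalg.solve(precision, beta * xy)
```

A call such as `posterior_mean(*private_sufficient_stats(X, y, 1.0, 1.0, 1.0, np.random.default_rng(0)), alpha=1.0, beta=1.0)` would return a weight vector computed only from the perturbed statistics. The sketch does not handle the case where the noise makes the perturbed matrix indefinite, which a practical implementation would need to address.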

Files in this item

File: gradudprlr.pdf
Size: 2.092 MB
Format: PDF