Privacy and limited data

AI Innovation of Sweden, Lindholmspiren 11
2019-11-11 18:00
2019-11-11 21:00

Description

Welcome to another meetup with GAIA!

This meetup will feature one talk, by Olof Mogren, on the topic of privacy and limited data. Olof Mogren heads the deep learning research group in Gothenburg at RISE Research Institutes of Sweden. He received his PhD in computer science from Chalmers University of Technology in 2018, with a thesis on deep representation learning for natural language processing. See the abstract below for more details on his upcoming presentation.

Our host for the evening will be AI Innovation of Sweden (AIIoS), a national center for applied AI research and innovation that aims to strengthen the competitiveness of Swedish industry and the welfare sector. AI Innovation of Sweden is a national and neutral initiative, functioning as an engine in the Swedish AI ecosystem. The focus is on accelerating the implementation of AI through knowledge sharing, data sharing, and collaborative projects.

This meetup will be recorded and posted on our YouTube channel later.

Privacy and limited data

Many areas of advanced data analytics have seen astonishing progress with deep learning. Deep neural networks now power systems that excel in image processing, playing ancient board games, and interpreting natural language. These networks have a high learning capacity, but they require large amounts of training data to reach their full potential. What choices do we have when the required amounts of data cannot be met? And how do we ensure privacy for individuals who may be part of the datasets that underlie our conclusions?

One strategy for dealing with limited data is transfer learning: training a model in two stages, first on a large generic dataset and then on data from the target domain where the model will later be used and evaluated. For instance, you may pretrain a model for classification on a large and readily available dataset such as ImageNet, and then perform the fine-tuning on a different dataset, or even on a different task such as semantic segmentation. For convolutional neural networks in computer vision applications, this kind of initialization has been employed successfully for years, and similar approaches have now started to emerge in natural language processing: Transformer-based architectures such as BERT and GPT-2 can now be trained in similar ways for language applications.
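As a rough illustration (a sketch, not code from the talk), the two-stage recipe could look like the following in PyTorch/torchvision: load weights pretrained on ImageNet, replace the classification head, and fine-tune on a smaller target-domain dataset. The number of target classes and the `target_loader` data loader are placeholders.

```python
# Minimal transfer-learning sketch (assumed PyTorch/torchvision setup,
# not code from the talk). Load an ImageNet-pretrained ResNet, swap its
# classification head, and fine-tune on a small target dataset.
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 5  # placeholder: number of classes in the target domain

# Stage 1: start from weights learned on the large, generic dataset (ImageNet).
# (Newer torchvision versions use the `weights=` argument instead of `pretrained`.)
model = models.resnet18(pretrained=True)

# Optionally freeze the generic feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Stage 2: replace the final layer and fine-tune on the target-domain data.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def fine_tune(model, target_loader, epochs=3):
    """target_loader is a placeholder DataLoader over the (small) target dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in target_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```

Freezing the pretrained layers is one common choice; another is to fine-tune the whole network with a small learning rate, which often works better when the target dataset is not extremely small.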

When a model fits the training data very well yet fails to generalize to unseen test data, this is referred to as overfitting. Interestingly enough, overfitting is closely linked to the issue of privacy, and special care needs to be taken with both, especially when working with datasets of limited size. When a model fails to generalize due to overfitting, it also starts to memorize information that is specific to the training data. For sensitive applications, this may be information that we would like the model not to expose. Limiting overfitting can therefore improve privacy, but this neat side effect may not be enough in practice. Ensuring privacy may also be approached with mechanisms such as particular ensemble setups or adversarial learning.
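As an illustration of the ensemble idea (a hedged sketch in the spirit of PATE-style noisy aggregation, not necessarily the exact setup covered in the talk): several teacher models, each trained on a disjoint partition of the sensitive data, vote on a label, and noise is added to the vote counts before the winning class is released, so that the output depends less on any single teacher's, and hence any single individual's, training data.

```python
# Sketch of noisy vote aggregation over an ensemble of "teacher" models,
# in the spirit of PATE (Papernot et al.). Hypothetical illustration only.
import numpy as np

def noisy_aggregate(teacher_votes, num_classes, epsilon=1.0, rng=None):
    """teacher_votes: array of predicted labels, one per teacher model.

    Adds Laplace noise (scale 1/epsilon) to each class's vote count before
    taking the argmax, so the released label leaks less about any single
    teacher's (and hence any single individual's) training data.
    """
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# Example: 10 teachers vote on a single query; most predict class 2.
votes = np.array([2, 2, 2, 1, 2, 2, 0, 2, 2, 1])
label = noisy_aggregate(votes, num_classes=3, epsilon=0.5)
```

A smaller epsilon means more noise and stronger privacy, at the cost of occasionally releasing a label that disagrees with the majority vote.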

In this talk, we will go through the basics of transfer learning and some issues of data privacy with some possible remedies, illustrated with examples from the AI research at RISE Research Institutes of Sweden.