School of GeoSciences

Gillian Raab

Synthpop

Temporary page for the synthpop package for R

Latest News!

A new version of synthpop (synthpop_1.5-0) has recently been added to CRAN. Note that you should be running a recent version of R (e.g. 3.5.0) before installing it. Several new features and improvements are included, for example a fuller range of options in the utility functions, as described in the RSS A paper listed below. Synthesis by log-linear models for categorical data is also implemented. The method used is iterative proportional fitting, with models defined by the margins they constrain. The appearance of zero cells in the synthesised data is controlled by setting a prior for each cell. See the package NEWS file for more details.
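To make the idea of iterative proportional fitting concrete, here is a minimal sketch in base R. It is not the code used inside synthpop: the counts are made up, `ipf_fit` is a hypothetical helper written for this illustration, and the pseudo-count added to every cell is only a simple stand-in for the prior mentioned above.

```r
# Made-up 2 x 2 x 2 table of counts; the combination A = a2, B = b2 never occurs.
observed <- array(c(12, 7, 9, 0,
                     8, 5, 6, 0),
                  dim = c(2, 2, 2),
                  dimnames = list(A = c("a1", "a2"),
                                  B = c("b1", "b2"),
                                  C = c("c1", "c2")))

# Hypothetical helper: fit a table whose listed margins match those of `target`.
ipf_fit <- function(target, margins, tol = 1e-10, max_iter = 1000) {
  fitted <- array(1, dim = dim(target), dimnames = dimnames(target))
  for (iter in seq_len(max_iter)) {
    previous <- fitted
    for (m in margins) {
      target_margin <- apply(target, m, sum)
      fitted_margin <- apply(fitted, m, sum)
      # Guard against 0/0 once a margin cell has been fitted to zero.
      ratio <- ifelse(fitted_margin > 0, target_margin / fitted_margin, 0)
      # Rescale the fitted table so that this margin matches the target.
      fitted <- sweep(fitted, m, ratio, `*`)
    }
    if (max(abs(fitted - previous)) < tol) break
  }
  fitted
}

# The model is defined by the margins it constrains: the A x B table and C.
margins <- list(c(1, 2), 3)

# Without a prior, the empty A x B combination stays at zero in the fit.
fit_no_prior <- ipf_fit(observed, margins)

# Assumed pseudo-count per cell, so every cell has a positive fitted value.
prior_per_cell <- 0.5
fit_with_prior <- ipf_fit(observed + prior_per_cell, margins)

# Draw one synthetic table of the same size from the fitted probabilities.
set.seed(1)
synthetic_counts <- array(
  rmultinom(1, size = sum(observed), prob = fit_with_prior),
  dim = dim(observed), dimnames = dimnames(observed)
)
```

With no prior, the combination a2/b2 can never appear in a synthetic table drawn from `fit_no_prior`. With the pseudo-count it has a small positive probability, which is the sense in which a prior controls the appearance of zero cells.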

Introduction

The synthpop package allows users to create synthetic versions of confidential data for use by researchers interested in making inferences about the population that the data represent. The synthesised data can be released with fewer restrictions on how they must be held than apply to the original data. Synthetic data can be used to carry out statistical analyses and obtain results, though we would usually recommend that these are confirmed by an analysis of the original data. Synthetic data are also useful for providing data sets for teaching.

The package authors are Beata Nowok (beata.nowok@ed.ac.uk, main author and maintainer) and Gillian Raab (gillian.raab@ed.ac.uk).

The package allows the synthesis process to be customised in many different ways according to the characteristics of the data being synthesised. There are default values for most of the parameters, but if you want your synthetic data to be useful you must set the parameters appropriately.
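As a rough illustration of the difference between relying on the defaults and setting parameters, the sketch below synthesises a few variables from SD2011, the survey data set supplied with the package. The choice of variables, the seed, and the particular settings are assumptions made for this example, not recommendations.

```r
library(synthpop)

# A small subset of the SD2011 survey data shipped with synthpop
# (the variable selection is arbitrary, for illustration only).
ods <- SD2011[, c("sex", "age", "edu", "marital", "income")]

# Synthesis with default settings.
sds_default <- syn(ods, seed = 2018)

# Synthesis with some parameters set to reflect features of the data:
# -8 is treated as a missing-value code for income, and anyone under 18
# is forced to have marital status SINGLE.
sds_custom <- syn(
  ods,
  cont.na = list(income = -8),
  rules   = list(marital = "age < 18"),
  rvalues = list(marital = "SINGLE"),
  seed    = 2018
)

# Compare the distribution of income in the synthetic and original data.
compare(sds_custom, ods, vars = "income")
```

The synthetic data frame is held in `sds_custom$syn`. The reference manual on CRAN lists the full set of arguments to `syn()`.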

Downloading the package

synthpop can be downloaded from the CRAN web site. Make sure you have the latest version (see Latest News above).
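For those installing from within R, a minimal sketch follows. It assumes an internet connection and uses the cloud CRAN mirror as an example; any CRAN mirror should do.

```r
# Install synthpop from CRAN only if it is not already available.
if (!requireNamespace("synthpop", quietly = TRUE)) {
  install.packages("synthpop", repos = "https://cloud.r-project.org")
}

# Report the running R version and the installed synthpop version.
print(R.version.string)
print(packageVersion("synthpop"))

# TRUE when the installed package is at least the release named in Latest News.
print(packageVersion("synthpop") >= "1.5.0")
```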

Material for synthpop course 25/9/2018 at PSD 2018 Valencia

What | Links to material
Preparatory material: please complete before the workshop | Preparation.pdf (for an experienced R user this should take well under 30 minutes)
Notes with overview and instructions for practicals (copies will be provided) | valencia_notes.pdf
PowerPoint presentation, session 1 | Presentation_1.pdf
Sample code for practical 1 | sample_code_1.R
PowerPoint presentation, session 2 | Presentation_2.pdf
Sample code for practical 2 | sample_code_2.R
PowerPoint presentation, session 3 | Presentation_3.pdf

Links to publications

General papers and reports
Reference | Description | Link to source
synthpop: Bespoke Creation of Synthetic Data in R. Beata Nowok, Gillian M. Raab, Chris Dibben, Journal of Statistical Software 2016, Volume 74, Issue 11. doi: 10.18637/jss.v074.i11 | Explains many of the basic features of the package. It is focussed on the use of the package for exploratory analysis without making inferences to the population. It assumes that a final analysis with appropriate standard errors will be carried out on the original data. Since this paper was published, several new features have been added to synthpop. In particular, several new models are available and an option to synthesise within Stata has been added. To review the latest version, consult the reference manual on the CRAN web site. | Journal
Beata Nowok. synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control. Presented at the UNECE Work Session on Statistical Data Confidentiality, 5-7 October 2015, Helsinki, Finland | This is a shorter version of the paper above that might be an easier starting point for someone new to this area. The same caveat applies: it refers to an older version of the package. | Presentation at a workshop
Practical data synthesis for large samples. Gillian M. Raab, Beata Nowok and Chris Dibben, Journal of Privacy and Confidentiality (2016-2017) 7, Number 3, 67–97 | This paper gives a brief description of the motivation for developing synthpop, but it also includes theoretical work which allows inferences from fully synthetic data to be carried out with much less effort than the previous literature had suggested. In particular, the new methods do not require multiple synthetic data sets to be produced for making inferences to populations, thus reducing disclosure risk. | Journal
Inference from fitted models in synthpop. Gillian M Raab & Beata Nowok. Preprint currently available as a vignette on the CRAN web site | Describes how the methods for inference from synthetic data, including those in the paper above and those proposed by others, are implemented in synthpop. | Preprint
Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R. Beata Nowok, Gillian M. Raab, Chris Dibben, Statistical Journal of the IAOS 33 (2017) 785–796, DOI 10.3233/SJI-150153, IOS Press | Describes how synthpop is used in the Scottish Longitudinal Study, and presents an example of the analysis of survey data that is available as part of the synthpop package. | Journal
Measuring the utility of synthetic data
Reference | Description | Link to source
General and specific utility measures for synthetic data. Joshua Snoke, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic, Journal of the Royal Statistical Society, Series A: forthcoming. | Derives a general utility measure, available in synthpop as the function utility.gen, and illustrates its use on examples. When the published version is online (soon, we hope) there will be a link to sample code that can be used. | Journal
Guidelines for Producing Useful Synthetic Data. Gillian M. Raab, Beata Nowok and Chris Dibben, Preprint available from arxiv.org | Gives practical advice on how to create synthetic data and also introduces a utility measure for comparing tables between synthesised data and the original. This is implemented in synthpop as utility.tab. | Preprint
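The two utility functions named in the table can be tried along the following lines. This is only a sketch: the variables, the seed, and the pair of variables tabulated are assumptions for illustration, and the defaults and printed output of both functions differ between package versions.

```r
library(synthpop)

# Synthesise a few variables from the SD2011 data shipped with the package.
ods <- SD2011[, c("sex", "age", "edu", "marital")]
sds <- syn(ods, seed = 2018)

# General utility measure, using the function's default settings.
u_gen <- utility.gen(sds, ods)
print(u_gen)

# Utility measure comparing a cross-tabulation of sex by edu in the
# synthetic data with the same table in the original data.
u_tab <- utility.tab(sds, ods, vars = c("sex", "edu"))
print(u_tab)
```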
Assessing disclosure risk from synthetic data

We have made less progress on this aspect of synthetic data than on others, although the package includes a module on statistical disclosure control (sdc) that implements many standard techniques. We are currently working on this. Meanwhile, two small relevant items are listed below.

Reference | Description | Link to source
Elliot, M. (2014). Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team. | Tested data produced by synthpop for disclosure risk. | Report
Gillian M. Raab. Internal report | An internal report summarising the methods used in synthpop and focusing, in particular, on issues of disclosure control. | Report

Special topics and case studies

Reference | Description | Link to source
Beata Nowok. Utility of synthetic microdata generated using tree-based methods. Presented at the UNECE Work Session on Statistical Data Confidentiality, 5-7 October 2015, Helsinki, Finland | Compares different tree-based methods, including bagging and random forests, as methods for modelling conditional distributions in synthpop. | Presentation at a workshop

Material for synthpop course 20/6/2018 at ADRN Conference Belfast

What | Links to material
Course notes (you have a hard copy) | Notes
PowerPoint presentations, session 1 | Introduction, Using synthpop
PowerPoint presentations, session 2 | Illustrations of practical, Further methods - tips and tricks
R syntax files for practicals | Sample_code_prac1.R, Sample_code_prac2.R, Sample_code_prac2_mid.R
Alternative data set for practical 2 | ICEM data in R format, codebook for ICEM data
Sample code used to synthesise this data set | Code used to synthesise this data set

Course material for survival analysis course

Course notes: You have printed copies.

Other material

What | Links to material
PowerPoint presentation | Lecture 1, Lecture 2, Lecture 3, Examples, Lecture 4
PowerPoint presentation | Examples of survival analysis with SLS
SPSS syntax | practical_1.sps, practical_2.sps, practical_3.sps
Stata syntax | practical_1.do, practical_2.do, practical_3.do
R syntax | practical_1.R, practical_2.R, practical_3.R