Gillian Raab
Synthpop
Temporary page for the synthpop package for R
Latest News!
A new version of synthpop (synthpop_1.5-0) has recently been added to CRAN. Note that you should be running a recent version of R (e.g. 3.5.0) before installing it.
Several new features and improvements are included. For example, the utility functions offer a fuller range of options, as described in the RSS A paper listed below.
Synthesis by loglinear models for categorical data is also implemented. The method used is iterative proportional fitting, with models defined by the margins
they constrain. The appearance of zero cells in the synthesised data is controlled by setting a prior for each cell. See
the package NEWS file for more details.
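The idea of iterative proportional fitting can be sketched in base R, without synthpop. This is a minimal illustration under stated assumptions, not the package's implementation: it uses the built-in UCBAdmissions table, base R's loglin() (which fits loglinear models by iterative proportional fitting), and a simple illustrative prior that spreads one pseudo-observation evenly over all cells. The names prior_total, margin_gap and syn_tab are made up for this example, and synthpop's own prior may be defined differently.

```r
# Observed three-way table shipped with base R: Admit x Gender x Dept.
tab <- UCBAdmissions

# Illustrative prior (an assumption, not synthpop's definition): spread one
# pseudo-observation evenly over all cells so no fitted cell is exactly zero.
prior_total <- 1
tab_prior <- tab + prior_total / length(tab)

# The model is defined by the margins it constrains: here all two-way margins.
margins <- list(c(1, 2), c(1, 3), c(2, 3))

# loglin() fits the loglinear model by iterative proportional fitting.
fit <- loglin(tab_prior, margin = margins, fit = TRUE,
              eps = 0.01, iter = 100, print = FALSE)$fit

# Largest difference between fitted and observed constrained margins;
# it should be no larger than the convergence tolerance eps.
margin_gap <- max(sapply(margins, function(m) {
  max(abs(apply(fit, m, sum) - apply(tab_prior, m, sum)))
}))

# Draw a synthetic table of the same size from the fitted cell probabilities.
set.seed(2018)
n <- sum(tab)
cells <- sample(length(fit), size = n, replace = TRUE, prob = as.vector(fit))
syn_tab <- array(tabulate(cells, nbins = length(fit)),
                 dim = dim(tab), dimnames = dimnames(tab))
```

In synthpop itself the corresponding synthesis is requested through the "ipf" method of syn(); see the package documentation for the arguments that set the margins and the prior.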
Introduction
The synthpop package allows users to create synthetic versions of confidential data for use by researchers interested in making inferences about the population that the data represent.
The synthesised data can be released with fewer restrictions on how it must be held than apply to the original data. Synthetic data can be used to carry out statistical analyses and obtain results,
though we would usually recommend that they are confirmed by an analysis of the original data. They are also useful for providing data sets for teaching.
The package authors are Beata Nowok (beata.nowok@ed.ac.uk, main author and maintainer) and Gillian Raab (gillian.raab@ed.ac.uk).
The package allows the synthesis process to be customised in many different ways according to the characteristics of the data being synthesised. There are default values for
most of the parameters, but if you want your synthetic data to be useful you must set the parameters appropriately.
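As a rough sketch of what such customisation can look like, the example below first runs syn() with its defaults and then overrides a few parameters. It assumes synthpop is installed and uses the SD2011 survey data shipped with the package. The particular choices (which variables to keep, the methods, the visit sequence, and treating -8 as a missing-value code for income) are illustrative assumptions, not recommendations.

```r
library(synthpop)  # install once from CRAN with install.packages("synthpop")

# A small subset of the SD2011 survey data that ships with synthpop.
ods <- SD2011[, c("sex", "age", "edu", "marital", "income")]

# Default synthesis: default methods, variables visited in column order.
s_default <- syn(ods, seed = 1234)

# Customised synthesis (illustrative choices only):
#  - method: one entry per column of ods, in column order
#  - visit.sequence: the order in which variables are synthesised
#  - cont.na: values of a continuous variable to treat as missing codes
#  - smoothing: smooth synthesised income values
s_custom <- syn(ods,
                method = c("sample", "cart", "cart", "cart", "cart"),
                visit.sequence = c("sex", "age", "marital", "edu", "income"),
                cont.na = list(income = c(NA, -8)),
                smoothing = list(income = "density"),
                seed = 1234)

# The synthetic data frame is stored in the $syn component.
head(s_custom$syn)
```

Printing s_custom also shows the methods and visit sequence that were actually used, which is a quick check that the settings were applied as intended.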
Downloading the package
synthpop can be downloaded from the CRAN web site. Make sure you have the latest version; see Latest News above.
Material for synthpop course 25/9/2018 at PSD 2018 Valencia
Links to publications
General papers and reports
synthpop: Bespoke Creation of Synthetic Data in R
Beata Nowok, Gillian M. Raab, Chris Dibben,
Journal of Statistical Software
2016, Volume 74, Issue 11. doi: 10.18637/jss.v074.i11

Explains many of the basic features of the package. It is focussed on the
use of the package for exploratory analysis without making inferences
to the population. It assumes that a final analysis with appropriate standard
errors will be carried out on the original data.
Since this paper was published several new features have been added to synthpop.
In particular several new models are available and an option to
synthesise within Stata has been added. To review the latest version,
consult the reference manual on the CRAN web site.
Journal 
Beata Nowok, synthpop: An R package for generating synthetic versions of sensitive microdata for statistical disclosure control. Presented at the UNECE Work Session on Statistical Data Confidentiality, 5–7 October 2015, Helsinki, Finland.
This is a shorter version of the paper above that might be an easier starting point for someone
new to this area. The same caveat applies: it refers to an older version of the package.
Presentation at a workshop 
Practical data synthesis for large samples.
Gillian M. Raab, Beata Nowok and Chris Dibben,
Journal of Privacy and Confidentiality
(2016–2017) 7, Number 3, 67–97
This paper gives a brief description of the motivation for developing synthpop
but it also includes theoretical work which allows inferences from fully
synthetic data to be carried out with much less effort than the previous literature
had suggested. In particular, the new methods do not require multiple synthetic
data sets to be produced for making inferences to populations, thus reducing disclosure risk. 
Journal 
Inference from fitted models in synthpop,
Gillian M Raab & Beata Nowok
Preprint currently available as a vignette on the CRAN web site 
Describes how the methods for inference from synthetic data, including those in the paper above, and those proposed by others, are implemented in synthpop. 
Preprint 
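A minimal sketch of this workflow, assuming synthpop is installed, is given below. The variables and the logistic model for smoke are illustrative choices using the SD2011 data shipped with the package; the vignette itself should be consulted for which standard errors are appropriate in which setting.

```r
library(synthpop)

# Illustrative subset of the SD2011 data shipped with synthpop.
ods <- SD2011[, c("sex", "age", "edu", "marital", "ls", "smoke")]

# One synthetic data set with default settings.
s <- syn(ods, seed = 2018)

# Fit a logistic regression to the synthetic data.
f <- glm.synds(smoke ~ sex + age + edu + marital + ls,
               family = "binomial", data = s)

# Default summary: inference about what would have been obtained
# from the original data.
summary(f)

# Inference to population quantities instead.
summary(f, population.inference = TRUE)

# Compare the synthetic-data estimates with those from the original data
# (only possible for someone who holds the original data).
compare(f, ods)
```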
Providing bespoke synthetic data for the UK Longitudinal Studies and other sensitive data with the synthpop package for R,
Beata Nowok, Gillian M. Raab, Chris Dibben,
Statistical Journal of the IAOS 33 (2017) 785–796,
DOI 10.3233/SJI-150153
IOS Press

Describes how synthpop is used in the Scottish Longitudinal Study, and presents an example of the analysis of survey data that is available as part of the synthpop package.
Journal 
Measuring the utility of synthetic data
General and specific utility measures for synthetic data
Joshua Snoke, Gillian M Raab, Beata Nowok,
Chris Dibben, and Aleksandra Slavkovic,
Journal of the Royal Statistical Society, Series A: forthcoming.

Derives a general utility measure, available in synthpop as the function utility.gen, and
illustrates its use on examples. When the published version is online (soon, we hope) there will be a link to sample code
that can be used.
Journal 
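Until that sample code is linked, a call to utility.gen can be sketched as follows. This assumes synthpop is installed; the choice of variables and of the "logit" propensity model is illustrative, and the printed output differs between package versions.

```r
library(synthpop)

# Illustrative subset of the SD2011 data shipped with synthpop.
ods <- SD2011[, c("sex", "age", "edu", "marital")]

# One synthetic data set with default settings.
s <- syn(ods, seed = 2018)

# General utility: how well can a propensity model tell synthetic
# records from original ones? Here a logistic model is used.
u <- utility.gen(s, ods, method = "logit")
print(u)
```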
Guidelines for Producing Useful Synthetic Data
Gillian M. Raab, Beata Nowok and Chris Dibben,
Preprint available from arxiv.org 
Gives practical advice on how to create synthetic data and also introduces
a utility measure for comparing tables between synthesised data and the original. This is implemented in synthpop as utility.tab.

Preprint 
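A minimal sketch of a utility.tab call, assuming synthpop is installed, is shown below. The two variables cross-tabulated here are an arbitrary illustrative choice from the SD2011 data shipped with the package.

```r
library(synthpop)

# Illustrative subset of the SD2011 data shipped with synthpop.
ods <- SD2011[, c("sex", "age", "edu", "marital")]

# One synthetic data set with default settings.
s <- syn(ods, seed = 2018)

# Compare the sex-by-marital table in the synthetic and original data.
ut <- utility.tab(s, ods, vars = c("sex", "marital"))
print(ut)
```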
Assessing disclosure risk from synthetic data
We have made less progress on this aspect of synthetic data than on
others, although the package includes a module on statistical disclosure control (sdc) that implements many standard techniques. We are currently working on this aspect. Meanwhile, there are two small relevant items below.
Elliot, M. (2014). Final report on the disclosure risk associated with the synthetic data produced by the SYLLS team.

Tested data produced by synthpop for disclosure risk. 
Report 
Gillian M. Raab
Internal report 
An internal report summarising the methods used in synthpop and focussing particularly
on issues of disclosure control.

Report 
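The statistical disclosure control functions mentioned at the start of this section can be tried out along the following lines. This is a sketch assuming synthpop is installed; it shows only one of the available techniques, removing synthetic records that replicate unique records in the original data, and the variables are an illustrative choice from SD2011.

```r
library(synthpop)

# Illustrative subset of the SD2011 data shipped with synthpop.
ods <- SD2011[, c("sex", "age", "edu", "marital")]

# One synthetic data set with default settings.
s <- syn(ods, seed = 2018)

# Count synthetic records that replicate a unique record in the original data.
ru <- replicated.uniques(s, ods)
ru$no.replications

# Apply disclosure control: drop those replicated unique records.
s_sdc <- sdc(s, ods, rm.replicated.uniques = TRUE)
nrow(s_sdc$syn)
```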
Special topics and case studies
Beata Nowok, Utility of synthetic microdata generated using tree-based methods. Presented at the UNECE Work Session on Statistical Data Confidentiality, 5–7 October 2015, Helsinki, Finland.

Compares different tree-based methods, including bagging and random forests, for modelling conditional distributions in synthpop.
Presentation at a workshop. 
Material for synthpop course 20/6/2018 at ADRN Conference Belfast
Course material for survival analysis course
Course notes: You have printed copies.
Other material