THIS SITE IS UNDER CONSTRUCTION

PLEASE REPORT ANY PROBLEMS OR PROVIDE SUGGESTIONS

beata.nowok@ed.ac.uk

Overview

The synthpop package for R allows users to create synthetic versions of confidential individual-level data for use by researchers interested in making inferences about the population that the data represent. The synthesised data can be released with fewer restrictions on how they must be held than apply to the original data. They can be used to carry out statistical analyses, though we would usually recommend repeating the final analysis on the original data to confirm the results. Synthetic data are also useful as data sets for teaching.

The package allows the synthesis process to be customised in many different ways according to the characteristics of the data being synthesised. There are default values for most of the parameters, but if you want your synthetic data to be useful you must set parameters appropriately.
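As a rough illustration of such customisation, the sketch below adjusts a few settings of syn(). It assumes the SD2011 example data set shipped with synthpop and follows the argument names described in the package's Journal of Statistical Software paper (cont.na, rules, rvalues); the choice of variables, the rule and the seed are arbitrary and only for illustration.

```r
library(synthpop)

# A small subset of the SD2011 example data distributed with synthpop.
vars <- c("sex", "age", "edu", "marital", "income", "ls", "wkabint")
ods <- SD2011[, vars]

# Customised synthesis (illustrative settings, not a recommendation):
# - cont.na: treat -8 in 'income' as a missing-data code, not a real value
# - rules / rvalues: anyone under 18 is assigned marital status "SINGLE"
# - seed: fix the random number seed so the run can be repeated
sds.custom <- syn(
  ods,
  cont.na = list(income = -8),
  rules   = list(marital = "age < 18"),
  rvalues = list(marital = "SINGLE"),
  seed    = 17914709
)

# The synthetic data frame is stored in the 'syn' element.
head(sds.custom$syn)
```

Which settings matter depends on the data: missing-data codes and logical constraints like those above are typical features that the defaults cannot know about.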

To cite the synthpop package in publications use:

Nowok, B., G.M. Raab & C. Dibben (2016), synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74:1-26; DOI:10.18637/jss.v074.i11. Available at: https://www.jstatsoft.org/article/view/v074i11

Methodology

The key objective of producing synthetic versions of original data sets is to replace sensitive values with synthetic ones while causing minimal distortion of the statistical information contained in the data set. In synthpop all values of the synthesised variables are replaced. Variables are synthesised one by one using sequential regression modelling. This means that the conditional distributions from which synthetic values are drawn are defined for each variable separately and are conditioned on the variables that come earlier in the synthesis sequence. Optionally, additional variables that are not themselves synthesised can be used as predictors.

Consider as an example a default synthesis, i.e. synthesis with all values of all variables (Y1, Y2,…, Yp) to be replaced. The first variable to be synthesised Y1 cannot have any predictors and therefore its synthetic values are generated by random sampling with replacement from its observed values. Then the distribution of Y2 conditional on Y1 is estimated and the synthetic values of Y2 are generated using the fitted model and the synthesised values of Y1. Next the distribution of Y3 conditional on Y1 and Y2 is estimated and used along with synthetic values of Y1 and Y2 to generate synthetic values of Y3 and so on. The distribution of the last variable Yp will be conditional on all other variables. Similar conditional specification approaches are used in most implementations of synthetic data generation. They are preferred to joint modelling not only because of the ease of implementation but also because of their flexibility to apply methods that take into account structural features of the data such as logical constraints or missing data patterns.
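The sequence described above can be sketched in base R for three variables. This is a simplified illustration, not synthpop's implementation: the data are simulated, the names (orig, synth, fit2, fit3) are hypothetical, and a linear and a logistic regression stand in for the conditional models.

```r
set.seed(1)

# Simulated "original" data with three variables (purely illustrative).
n  <- 500
y1 <- rnorm(n, mean = 40, sd = 10)
y2 <- 5 + 0.8 * y1 + rnorm(n, sd = 4)
y3 <- rbinom(n, size = 1, prob = plogis(-4 + 0.05 * y1 + 0.04 * y2))
orig <- data.frame(y1, y2, y3)

# Step 1: Y1 has no predictors, so its synthetic values are sampled
# with replacement from the observed values.
synth <- data.frame(y1 = sample(orig$y1, size = n, replace = TRUE))

# Step 2: estimate Y2 | Y1 on the original data, then generate synthetic
# Y2 from the fitted model using the synthetic Y1.
fit2 <- lm(y2 ~ y1, data = orig)
synth$y2 <- predict(fit2, newdata = synth) +
  rnorm(n, mean = 0, sd = summary(fit2)$sigma)

# Step 3: estimate Y3 | Y1, Y2 on the original data, then generate
# synthetic Y3 using the synthetic Y1 and Y2.
fit3 <- glm(y3 ~ y1 + y2, data = orig, family = binomial)
p3 <- predict(fit3, newdata = synth, type = "response")
synth$y3 <- rbinom(n, size = 1, prob = p3)

# Compare a few summaries of the original and synthetic data.
summary(orig)
summary(synth)
```

Each model is fitted to the original data, but predictions are made from the already synthesised predictors, so no original value other than the resampled Y1 enters the synthetic data directly.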

With practicality and flexibility in mind, classification and regression trees (CART) are used as the default conditional models for synthesis but various parametric alternatives are also available.
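A minimal sketch of the two options, assuming the SD2011 example data set shipped with synthpop and the method argument of syn() as described in the package's Journal of Statistical Software paper; the selection of variables and the seed are arbitrary.

```r
library(synthpop)

# A small subset of the SD2011 example data distributed with synthpop.
vars <- c("sex", "age", "edu", "marital", "income", "ls", "wkabint")
ods <- SD2011[, vars]
my.seed <- 17914709

# Default synthesis: CART models for every variable after the first,
# which is sampled from its observed values.
sds.default <- syn(ods, seed = my.seed)

# Parametric alternative: a default parametric method is chosen for
# each variable according to its type.
sds.parametric <- syn(ods, method = "parametric", seed = my.seed)

# The 'method' element records the method used for each variable.
sds.default$method
sds.parametric$method
```

Methods can also be given as a vector with one entry per variable, which allows CART and parametric models to be mixed within one synthesis.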

synthpop story

The R package synthpop has been written as part of the UK Economic and Social Research Council funded SYLLS project (SYnthetic Data Estimation for UK LongitudinaL Studies) to allow support staff of the UK Longitudinal Studies (LSs) to produce synthetic data tailored to the needs of individual research projects. You can read more here.

People behind

The core team is based at the University of Edinburgh and the Administrative Data Research Centre - Scotland.

Several people have contributed to synthpop in various ways. The following list is probably incomplete:

     Joshua Snoke (development of general utility method and function)

     Jörg Drechsler (many useful comments and bug reports)

     ...

 
