THIS SITE IS UNDER CONSTRUCTION

PLEASE REPORT ANY PROBLEMS OR PROVIDE SUGGESTIONS

beata.nowok@ed.ac.uk

General

The synthpop package makes a synthetic version of a microdata set held as an R data frame. If you are new to R you will need to install R and learn the basics of R. Then you should read your data into R as an R data frame, as explained below, and carry out some exploratory analyses to understand their structure.

As far as possible your R data frame should consist of the variables that might be used by an analyst who will be producing data summaries, such as tables or fits to statistical models from the data. Each variable in the data frame should have an appropriate data type (e.g. numeric, factor). The synthpop package provides some tools to help you check this (see below).

Installation

Get R (essential)

In order to run synthpop, you must have R installed on your computer. If you need to install R go here. If you have never used R before you should access some of the resources for getting started with R.

Get RStudio (recommended)

In addition to R, you might also want to install RStudio. It is an integrated development environment (IDE) for R that will make your experience with R much more enjoyable. If you want to install RStudio go here.

Install synthpop (online)

Open R or RStudio and install synthpop by typing the following code into the console:

install.packages("synthpop")

You will only need to do this once. It will install synthpop and all the other packages it uses from the CRAN website.

Install synthpop (offline)

If you are in a secure setting, without internet access, you will need to download the appropriate zip files from CRAN, import them into your secure environment and install packages from those local zip files.
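
As a rough sketch of the offline route, the helper below installs every zip file found in a local folder. The function name `install_local_zips` and the folder are hypothetical, and `type = "win.binary"` assumes Windows binary zips. With `repos = NULL`, R does not fetch dependencies, so the folder must also contain the zips for every package synthpop depends on.

```r
# Hypothetical helper: install all package zips found in a local folder.
# `zip_dir` is an assumed location inside the secure environment.
install_local_zips <- function(zip_dir, dry_run = FALSE) {
  zips <- list.files(zip_dir, pattern = "\\.zip$", full.names = TRUE)
  if (length(zips) == 0) {
    message("No zip files found in ", zip_dir)
    return(invisible(zips))
  }
  if (!dry_run) {
    # repos = NULL tells install.packages() to use local files
    # instead of contacting CRAN.
    install.packages(zips, repos = NULL, type = "win.binary")
  }
  invisible(zips)
}
```

Calling it with `dry_run = TRUE` first lists the files that would be installed without changing your library.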

Start synthpop

To start using the package, load it with the library() function. You will have to repeat this step every time you open R or RStudio and want to run synthpop.

library("synthpop")

You can get a list of all the synthpop functions using the command:

help(package = synthpop)

To quickly access a help file for a specific function, e.g. the main synthpop function syn(), you can type its name preceded by ?:

?syn

First synthesis

You will be working with your own data, but to help you we have provided an R script that uses the data SD2011 that is supplied as part of the synthpop package. Get the sample R script here.

Read the data you want to synthesise into R, if it is not there already. You can use the synthpop function read.obs() to read it in from other formats (check the help file).
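
If your data are in a CSV file, base R alone is enough to read them in. The sketch below writes a tiny made-up file first so that it runs on its own; the file and its columns are hypothetical and stand in for your own data.

```r
# Write a tiny example file so the sketch is self-contained.
# The file and its columns are hypothetical, not part of synthpop.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age,income",
             "1,FEMALE,34,1200",
             "2,MALE,51,-9",
             "3,FEMALE,,800"), csv_path)

# stringsAsFactors = TRUE turns text columns into factors on the way in.
mydata <- read.csv(csv_path, stringsAsFactors = TRUE)

str(mydata)            # variable types and first values
sapply(mydata, class)  # one class per column
```

Blank fields in numeric columns are read as NA, whereas a code such as -9 stays in the data as an ordinary number.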

We strongly advise you to start creating synthetic data from an example with only a modest number of variables (say between 8 and 12) so you can understand synthpop. If your data have more variables than this then make a selection. The synthpop package is intended for large data sets. We do not recommend using it for data sets with fewer than around 500 observations because a small data set will not provide enough information about relationships between many variables.

Now examine your data. Perhaps check the first or last few lines with head() or tail() and any other R functions you know. You can also use the synthpop function codebook.syn() to examine the features that will be relevant to synthesising.

codebook.syn(mydata)

Use the output to do the following things to make your data ready to be synthesised:

  • Remove any identifiers, e.g. study number.
  • Change any character (text) variables into factors and rerun codebook.syn() after this. The syn() function will do this conversion for you but it is better that you do it first.
  • Note which variables have missing values, especially those that are not coded as the R missing value NA. For example, the value -9 often signifies missing data for positive items like income. These can be identified to the syn() function via the cont.na parameter.
  • Note any variables that ought to be derivable from others, e.g. discharge date from length-of-stay and admission date. These could be omitted and recalculated after synthesis or calculated as part of the synthesis process by setting their method to passive (see ?syn.passive).
  • Also note any variables that should obey rules that depend on other variables. For example, the number of cigarettes smoked should be zero or missing for non-smokers. You can set the rules with the parameters rules and rvalues of the syn() function. The syn() function will warn you if the rule is not obeyed in the observed data.
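
Several of the checks above can be done in base R before calling syn(). The following is only a sketch: the data frame, its variable names and the missing-data code -9 are made up for illustration.

```r
# Hypothetical data frame standing in for your own data.
mydata <- data.frame(
  id     = 1:6,
  sex    = c("F", "M", "F", "M", "F", "M"),
  smoke  = c("NO", "YES", "NO", "YES", "NO", "NO"),
  nociga = c(0, 10, NA, 20, 0, 5),
  income = c(1200, -9, 800, 1500, -9, 950),
  stringsAsFactors = FALSE
)

# 1. Remove identifiers.
mydata$id <- NULL

# 2. Convert character variables to factors.
is_char <- sapply(mydata, is.character)
mydata[is_char] <- lapply(mydata[is_char], factor)

# 3. Count NA values and the assumed missing-data code -9 per variable.
n_na    <- colSums(is.na(mydata))
n_code9 <- sapply(mydata, function(x) {
  if (is.numeric(x)) sum(x == -9, na.rm = TRUE) else 0L
})

# 4. Find rows that break the rule "non-smokers smoke zero cigarettes
#    (or the value is missing)".
breaks_rule <- mydata$smoke == "NO" &
  !is.na(mydata$nociga) & mydata$nociga != 0
which(breaks_rule)
```

Rows flagged in step 4 are worth investigating before you pass the rule to syn().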

If your data have more than 12 variables or if you have any factors with a large number of levels (say more than 20) you should create a smaller and simpler data frame that will be easier to synthesise for your first attempt. Omit or recode factors with many levels and select fewer variables. It would be a good idea to select a set of variables you might be interested in analysing.
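
One possible way to recode a factor with many levels is to keep its most frequent levels and pool the rest. The helper `collapse_levels` below is a hypothetical illustration written in base R, not a synthpop function, and the example factor is made up.

```r
# Hypothetical helper: keep the `keep` most frequent levels of a factor
# and pool all remaining levels into a single "other" level.
collapse_levels <- function(x, keep = 20, other = "other") {
  x <- as.factor(x)
  if (nlevels(x) <= keep) return(x)
  top <- names(sort(table(x), decreasing = TRUE))[seq_len(keep)]
  lev <- levels(x)
  # Assigning the same name to several levels merges them.
  levels(x)[!(lev %in% top)] <- other
  x
}

# Made-up factor with 26 levels of decreasing frequency.
region <- factor(rep(letters, times = 26:1))
small  <- collapse_levels(region, keep = 5)
nlevels(small)  # the 5 kept levels plus "other"
table(small)
```

Pooling loses detail, so choose which levels to keep with the intended analyses in mind.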

You are now ready to do your first synthesis, e.g.

mysyn <- syn(mydata, cont.na = list(income = -9))

You have created a synthetic data object mysyn of class synds which is a list with a number of components including the synthesised data and information on how they were created. See the value entry in the help file for syn() for details. To get an overview, use the summary() function for a synds object, e.g.

summary(mysyn)

You will see a list of variables with their synthesis methods in the order in which they were synthesised. Because you used the default values for most syn() parameters, the variables have been synthesised in the order in which they appear in the data frame. The first variable is drawn by random sampling from its observed values ("sample") and all the others use "cart" (classification and regression trees).
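
The components described above can also be inspected directly. The sketch below uses the SD2011 data supplied with synthpop. The choice of variables and the seed are assumptions made for illustration, and -8 is used here because that is how missing income is coded in SD2011.

```r
library(synthpop)

# Assumption: a small selection of SD2011 variables.
vars   <- c("sex", "age", "edu", "marital", "income", "ls", "wkabint")
mydata <- SD2011[, vars]

# In SD2011 missing income is coded -8; the seed makes the run reproducible.
mysyn <- syn(mydata, cont.na = list(income = -8), seed = 2024)

class(mysyn)           # class of the synthetic data object
mysyn$method           # synthesis method for each variable
mysyn$visit.sequence   # order in which variables were synthesised
head(mysyn$syn)        # first rows of the synthetic data frame
```

With the defaults, the synthetic data frame in mysyn$syn has the same number of rows and the same variables as the data you supplied.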

To do an initial comparison of the original and synthetic data as tables and histograms use the compare() function, e.g.

compare(mysyn, mydata, stat = "counts")

We hope that these will indicate similar distributions for the original and synthetic data.

If you want to export your synthetic data to analyse in other programs you can use the synthpop write.syn() function, e.g.

write.syn(mysyn, filename = "mysyn", filetype = "SPSS")

Next steps

Now you have managed your first synthesis you could read our paper in the Journal of Statistical Software and explore other resources on our website with more explanation of different features of synthpop including:

  • statistical disclosure control functions,
  • customising your synthesis by defining methods, order and predictor matrix,
  • evaluating the utility of the synthetic data,
  • comparing model fits between observed and synthetic data.
