Tutorial 2.5 - Simulating data
27 Mar 2017
Individual variables (vectors)
Most statisticians strongly recommend that research questions be designed around sets of well defined statistical procedures. This ensures that the eventual data analyses remain possible and relatively straightforward. Furthermore, many would recommend the generation and mock analysis of dummy data sets that approximate the anticipated structure and variability of the anticipated data. This enables many of the common data analysis problems to be anticipated, thereby allowing solutions to be considered prior to data collection. Dummy data sets are usually created by filling the response variable(s) (and continuous predictor variables) with random data.
R uses the Mersenne-Twister Random Number Generator (RNG) with a random number sequence cycle of $2^{19937} - 1$. All random number generators have what is known as a 'seed'. This is a number that uniquely identifies a series of random number sequences. Strictly, computer generated random numbers are 'pseudo-random' as the sequences themselves are predefined. However, with such a large number of possible sequences ($2^{19937} - 1$), for all intents and purposes they are random. By default, the initial random seed is generated from the computer clock (milliseconds field) and is therefore unbiased. However, it is also possible to specify a random seed. This is often useful for error-checking functions. Additionally, it also facilitates learning how to perform randomizations, as the same outcomes can be repeated.
R has a family of functions (see the following Table) that extract random numbers from a range of mathematical distributions that represent the common sampling and statistical distributions encountered in biology.
Distribution | Example syntax |
---|---|
Normal (gaussian) |
Generate 5 random numbers from a normal distribution with a mean of 10 and a standard deviation of 1
rnorm(5,mean=10,sd=1) [1] 8.071531 8.544727 11.222091 10.318727 9.783264 |
Log-normal |
Generate 5 random numbers from a log-normal distribtution whose logarithm has a mean of 2 and a standard deviation of 1
rlnorm(5,mean=2,sd=1) [1] 7.496357 4.077878 7.872193 3.063754 3.865319 |
Exponential |
Generate 5 random numbers from a exponential distribtution with a lambda rate of 2
rexp(5,rate=2) [1] 0.01285233 0.15786292 0.44019651 0.10174078 0.09513809 |
Uniform |
Generate 5 random numbers from a uniform distribution with a minimum of 2 and a maximum of 10
runif(5, min=2, max=10) [1] 4.913912 4.360669 2.879489 2.544072 4.805009 |
Binomial |
Generate 5 random numbers from a binomial distribution based on 10 Bernoulli trials and a probability of success of 0.5 per trial
rbinom(5, size=10, prob=0.5) [1] 7 6 4 4 4 |
Negative binomial |
Generate 5 random numbers from a negative binomial distribution based on 10 Bernoulli trials and a $\mu$ of 4
rnbinom(5, size=10, mu=4) [1] 6 6 0 2 6 |
Poisson |
Generate 5 random numbers from a Poisson distribution with a lambda parameter of 4
rpois(5, lambda=4) [1] 3 2 5 6 5 |
Dummy data sets
This concept can be extended to generating sets of vectors that have certain individual and collective properties.
For example, imagine that you were interested in examining the effect of four different nitrogen treatments (N1, N2, N3, N4) on the growth rate of a particular species of plant. An ANOVA design appeared suitable for your intended experimental design, and you prudently decided to run a mock analysis prior to data collection.
Previous studies had indicated that the growth rate of the plant species was normally distributed with a mean of around 250 mm per year with a standard deviation of about 20 mm, and you had decided (for whatever reason) to have 3 replicates of each treatment. Furthermore, compared to N1 (sort of a control), N2, N3 and N4 where expected to reduce the growth rate by 10,20 and 50 mm per year. Using these criteria it is possible to generate a dummy data set.
#create the response variable with four sets of 3 random numbers from a normal distribution GROWTH.RATE <- c(rnorm(3,mean=250, sd=20), rnorm(3,mean=240,sd=20), rnorm(3,mean=230, sd=20), rnorm(3,mean=200, sd=20)) #create the nitrogen treatment factor with four levels each replicated 3 times TREATMENT <- gl(4,3,12,lab=paste('N',1:4,sep="")) #combine the vectors into a dataframe NITROGEN <- data.frame(GROWTH.RATE, TREATMENT) NITROGEN
GROWTH.RATE TREATMENT 1 217.0711 N1 2 222.3915 N1 3 249.8471 N1 4 226.9847 N2 5 263.5522 N2 6 249.9296 N2 7 228.0719 N3 8 220.1417 N3 9 206.3110 N3 10 195.5114 N4 11 198.5820 N4 12 216.1429 N4
A much more elegant (yet technically more advanced) approach is to generate the response by calculating the outer product of the expected coefficients and the design (model) matrix. This is the approach is effectively the reverse of the statistical modeling process and is also the technique used to generate data sets used in the tutorial series.
# define the expected effects coefficients BASE <- 250 SD <- 20 N.effects <- c(-10,-20,-50) ALL.effects <- c(BASE,N.effects) # create the nitrogen treatment factor with four levels each replicated 3 times TREATMENT <- gl(4,3,12,lab=paste('N',1:4,sep="")) # create the model matrix XMAT <- model.matrix(~TREATMENT) # calculate the outer product of the effects parameters and the model matrix GROWTH.RATE <- as.numeric(ALL.effects %*% t(XMAT)) + rnorm(12,0,SD) #combine the vectors into a dataframe NITROGEN <- data.frame(GROWTH.RATE, TREATMENT) NITROGEN
GROWTH.RATE TREATMENT 1 255.8536 N1 2 254.8708 N1 3 241.8815 N1 4 254.8649 N2 5 215.5755 N2 6 220.7445 N2 7 260.7054 N3 8 220.2532 N3 9 243.6722 N3 10 204.5869 N4 11 212.1512 N4 12 175.1708 N4
For multifactor designs, the expand.grid() function provides a convenient way to generate dataframes containing all combinations of one or more factors. Following from the previous example, imagine you now wanted to create mock data for a two factor (nitrogen treatment and season) ANOVA design. A dummy data set could be created as follows:
# define the expected effects coefficients #mean growth rate in winter with N1 treatment BASE <- 250 SD <- 20 #nitrogen effects N.effects <- c(-10,-20,-50) #season effects S.effects <- 40 #interaction effects I.effects <- c(10,5,-5) ALL.effects <- c(BASE,N.effects,S.effects,I.effects) # create the nitrogen treatment factor with four levels and the season treatment factor with two levels and with 3 replicates of each combination X <- expand.grid(rep=1:3,SEASON=c("WINTER","SUMMER"), TREATMENT=paste("N",1:4,sep="")) # create the model matrix XMAT <- model.matrix(~TREATMENT*SEASON, data=X) # calculate the outer product of the effects parameters and the model matrix GROWTH.RATE <- as.numeric(ALL.effects %*% t(XMAT)) + rnorm(24,0,SD) #combine the vectors into a dataframe NITROGEN <- data.frame(GROWTH.RATE, X$TREATMENT, X$SEASON) NITROGEN
GROWTH.RATE X.TREATMENT X.SEASON 1 270.8447 N1 WINTER 2 231.6353 N1 WINTER 3 270.6817 N1 WINTER 4 306.9407 N1 SUMMER 5 283.9610 N1 SUMMER 6 284.4013 N1 SUMMER 7 233.9044 N2 WINTER 8 271.2367 N2 WINTER 9 254.0034 N2 WINTER 10 289.8781 N2 SUMMER 11 313.9714 N2 SUMMER 12 307.8634 N2 SUMMER 13 253.2728 N3 WINTER 14 214.2920 N3 WINTER 15 206.1691 N3 WINTER 16 267.3112 N3 SUMMER 17 259.5184 N3 SUMMER 18 337.7054 N3 SUMMER 19 236.3569 N4 WINTER 20 213.0547 N4 WINTER 21 225.8168 N4 WINTER 22 222.7644 N4 SUMMER 23 260.2253 N4 SUMMER 24 250.7643 N4 SUMMER
These data can now be subject to the statistical and graphical procedures. Dummy data sets are also useful for examining the possible impacts of missing data and unbalanced designs.
As indicated above, there are many more examples of simulating data in the tutorial series. The complexity of the data similations increase in complexity as the underlying designs become more complex.