Tutorial 5.3 - Exploratory data analysis
08 May 2017
Data analysis begins with quality control and familiarizing oneself with the collected data. Enormous effort and resources are invested in designing studies and collecting data. It is therefore vital that some time and effort also be invested in checking that the data were transcribed correctly.
Most good research programs will have fully considered the eventual analysis options when designing the surveys or experiments. Nevertheless, real observed ecological data rarely comply exactly with the requirements of all analysis procedures. Ecological pursuits are often plagued by missing or peculiar observations, spurious measurements and unexpected trends. Hence it is also vital that researchers be aware of any issues in their data so that they can best accommodate them in the statistical models.
The reliability of most statistical analyses is dependent on the degree to which the data (or residuals) conform to specific conditions. For example, traditional linear regression assumes that the residuals are normally distributed and homogeneous, and that the relationship between the response and predictor is linear.
Whilst these conditions are most definitively assessed after the model has been fitted, major violations can be pre-identified from a series of simple numerical and graphical summaries. Moreover, many other analytical decisions (such as the choice of analysis type, error structure and the makeup of the linear predictor - the set of predictor variables) can be sensibly pre-informed once a detailed exploration of the data has been carried out.
In general, decisions made from graphical summaries are more subjective than those based on numerical summaries; however, when combined, numerical and graphical summaries provide a rich set of decision tools.
In order to illustrate exploratory data analysis techniques, I will generate artificial data. This has the advantage of allowing us to match the features revealed by simple summaries to the 'real' features and parameters of the data.
Numerical summaries
- Factor A has three levels ("High", "Medium" and "Low") with 18 replicates of each, and Factor B has two levels ("b1" and "b2"), for a total sample size of 54
- The continuous covariate (X) is drawn from a uniform distribution with a minimum of 20 and maximum of 50
- In the linear predictor model matrix, X is centered on 35 (such that the intercept is vaguely informative)
- The intercept is 30 (value of the response in the baseline cell, "High"/"b1", when the centered covariate is zero)
- The slope of the relationship between Y and X in the baseline cell is 2
- The remaining main-effect, interaction and slope-modification effects are as defined in the code below
- The noise (errors) are normally distributed with a mean of 0 and a standard deviation of 5
- Finally, to make the example more realistic, a few rows are deleted, one value is set to missing (NA) and one value is corrupted by a factor of 100
```
set.seed(1)
n.A <- 3
n.Rep <- 18
n <- n.A*n.Rep
A <- as.factor(c(rep("High",n.Rep), rep("Medium",n.Rep), rep("Low",n.Rep)))
n.B <- 2
B <- gl(n=n.B, k=n.Rep/n.B, length=n, lab=paste("b",1:2, sep=""))
X <- runif(n,20,50)
Xmat <- model.matrix(~A*B*I(X-35))
# Effects
base <- 30
X.slope <- 2
A.eff <- c(5,2)
AX.eff <- c(-2,1)
B.eff <- 1
BX.eff <- -1
AB.eff <- c(2,3)
ABX.eff <- c(0.5,1)
all.eff <- c(base, A.eff, B.eff, X.slope, AB.eff, AX.eff, BX.eff, ABX.eff)
sigma <- 5
eps <- rnorm(n,0,sigma)
Y <- as.numeric(Xmat %*% all.eff + eps)
data1 <- data.frame(Y,A,B,X)
# Introduce some imperfections (missing rows, an NA and a typo)
rem <- round(runif(3,0,n),0)
data2 <- data1[-rem,]
data2[10,1] <- NA
data2[2,1] <- data2[2,1]*100
# Save data
write.table(data1, file='../downloads/data/data1.csv', quote=FALSE, row.names=FALSE, sep=",")
write.table(data2, file='../downloads/data/data2.csv', quote=FALSE, row.names=FALSE, sep=",")
```
Data structure
A broad summary of the scale and type of each variable is a very good starting point for exploratory data analysis. Not only can the researcher get a reasonable sense of the data, they can also confirm that R's internal representation of the data is consistent with that of the researcher. For example, does R correctly identify categorical variables as factors, and is the order of factor levels compatible with the researcher's expectations?
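As a minimal sketch (using the data2 set generated above), factor levels can be inspected and, if necessary, re-ordered explicitly:

```
## Inspect the current (alphabetical) ordering of the factor levels
levels(data2$A)

## Re-order the levels into a more intuitive sequence
## (illustrative only - the output below assumes the original ordering)
data2$A <- factor(data2$A, levels=c("Low","Medium","High"))
```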
summary()
summary(data2)
```
       Y                  A          B            X        
 Min.   : -17.44   High  :18   b1:25   Min.   :20.40  
 1st Qu.:  27.39   Low   :16   b2:26   1st Qu.:30.69  
 Median :  35.96   Medium:17           Median :36.59  
 Mean   :  74.89                       Mean   :36.12  
 3rd Qu.:  42.32                       3rd Qu.:43.21  
 Max.   :1993.67                       Max.   :49.76  
 NA's   :1                                            
```
dim(data2)
[1] 51 4
Close inspection of the above summaries would have revealed:
- The variable Y is a continuous, numeric variable with a single missing value
- There is one very large Y observation that is two orders of magnitude larger than most of the other observations in that variable - possibly a typo or instrument misreading/malfunction (see the sketch after this list)
- Variable A is categorical with levels 'High','Low' and 'Medium' - this might not be a logical order for the factor levels.
- There are an unequal number of replicates in Variable A.
- Variable B is a categorical variable with levels 'b1' and 'b2' and unequal numbers of replicates.
- Variable X is a continuous, numeric variable. The quantiles of this variable are fairly symmetrical, suggesting that the data are at least symmetrically distributed
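The suspicious observations flagged above can be isolated for closer inspection; a minimal sketch (the threshold of 1000 is arbitrary, chosen from the summary above):

```
## Extract rows whose Y value is implausibly large
subset(data2, Y > 1000)

## Rows containing any missing values
data2[!complete.cases(data2), ]
```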
glimpse()
```
library(tidyverse)
glimpse(data2)
```
```
Observations: 51
Variables: 4
$ Y <dbl> 8.576758, 1993.668370, 36.460910, 61.285865, 11.586977, 55.841739, 56.411491, 32.7625...
$ A <fctr> High, High, High, High, High, High, High, High, High, High, High, High, High, High, ...
$ B <fctr> b1, b1, b1, b1, b1, b1, b1, b1, b1, b2, b2, b2, b2, b2, b2, b2, b2, b2, b1, b1, b1, ...
$ X <dbl> 27.96526, 31.16372, 37.18560, 47.24623, 26.05046, 46.95169, 48.34026, 39.82393, 38.87...
```
head()
head(data2)
```
            Y    A  B        X
1    8.576758 High b1 27.96526
2 1993.668370 High b1 31.16372
3   36.460910 High b1 37.18560
4   61.285865 High b1 47.24623
5   11.586977 High b1 26.05046
6   55.841739 High b1 46.95169
```
str()
str(data2)
```
'data.frame':	51 obs. of  4 variables:
 $ Y: num  8.58 1993.67 36.46 61.29 11.59 ...
 $ A: Factor w/ 3 levels "High","Low","Medium": 1 1 1 1 1 1 1 1 1 1 ...
 $ B: Factor w/ 2 levels "b1","b2": 1 1 1 1 1 1 1 1 1 2 ...
 $ X: num  28 31.2 37.2 47.2 26.1 ...
```
replications()
Design balance (equal numbers of replicates in each combination of the categorical variables) is important for many analyses involving multiple categorical variables.
replications(Y~A*B, data1)
```
  A   B A:B 
 18  27   9 
```
replications(Y~A*B, data2)
```
$A
A
  High    Low Medium 
    17     16     17 

$B
[1] 25

$`A:B`
        B
A        b1 b2
  High    9  8
  Low     8  8
  Medium  8  9
```
Close inspection of the above summaries would have revealed:
- Data1 is balanced: there are 9 replicates of each A:B combination. The returned object is a vector.
- Data2 is not balanced as there are not the same number of replicates in each AB combination. The returned object is a list.
- Design balance can be assessed by whether the replications() function returns a vector (balanced) or list (unbalanced).
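This vector-versus-list behaviour can be used to test for balance programmatically; a small sketch using the two data sets from above:

```
## TRUE if the design is balanced (replications() returns a vector)
!is.list(replications(Y~A*B, data1))
## FALSE for the unbalanced data2
!is.list(replications(Y~A*B, data2))
```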
Vector (variable) basic statistical summaries
Function | Description | Example |
---|---|---|
`mean()` | Arithmetic mean | `mean(data1$Y, na.rm=TRUE)` returns `34.34877` |
`median()` | Middle value | `median(data1$Y, na.rm=TRUE)` returns `35.40594` |
`var()` | Variance | `var(data1$Y, na.rm=TRUE)` returns `311.2593` |
`sd()` | Standard deviation | `sd(data1$Y, na.rm=TRUE)` returns `17.64254` |
`range()` | Minimum to maximum | `range(data1$Y, na.rm=TRUE)` returns `-17.44169 79.17418` |
`quantile()` | Quantiles (0%, 25%, 50%, 75%, 100%) | `quantile(data1$Y, na.rm=TRUE)` returns `-17.44169 26.15653 35.40594 40.46630 79.17418` |
`fivenum()` | Tukey five-number summary (minimum, lower hinge, median, upper hinge, maximum) | `fivenum(data1$Y, na.rm=TRUE)` returns `-17.44169 25.97521 35.40594 40.47954 79.17418` |
`cor()` | Correlation between continuous variables | `cor(data1$Y, data1$X)` returns `0.7644416` |
`acf()` | Autocorrelation | `acf(data1$Y, plot=FALSE, lag.max=5)` returns autocorrelations at lags 0-5: `1.000 -0.030 -0.104 0.211 -0.208 -0.177` |
`length()` | Number of items in the vector | `length(data1$Y)` returns `54` |
`class()` | R representation of the object | `class(data1$Y)` returns `"numeric"` |
Pivot tables
So-called pivot tables summarize a single variable (typically a numeric vector) by applying a function (such as the mean) to different subsets of the data. For example, we could calculate the mean of Y for each level of A, or we could calculate column means or row totals.
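As a minimal sketch of the first of these (using the data2 set from above):

```
## Mean Y for each level of A alone, ignoring the missing value
with(data2, tapply(Y, A, mean, na.rm=TRUE))
```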
There are numerous functions available for generating pivot-like summary and aggregation tables:
Mean Y of each combination of A and B, via tapply():

```
with(data2, tapply(Y, list(A,B), mean, na.rm=TRUE))
              b1       b2
High   254.69628 33.83451
Low     35.67018 36.53979
Medium  27.62700 42.54609
```

or via ddply() (plyr package):

```
library(plyr)
ddply(data2, ~A*B, function(x) mean(x$Y, na.rm=TRUE))
       A  B        V1
1   High b1 254.69628
2   High b2  33.83451
3    Low b1  35.67018
4    Low b2  36.53979
5 Medium b1  27.62700
6 Medium b2  42.54609
```

Mean of all numeric variables for each combination of A and B, via ddply() and colwise():

```
library(plyr)
ddply(data2, ~A*B, function(x) colwise(mean, is.numeric, na.rm=TRUE)(x))
       A  B         Y        X
1   High b1 254.69628 38.17784
2   High b2  33.83451 34.97503
3    Low b1  35.67018 37.71426
4    Low b2  36.53979 35.97296
5 Medium b1  27.62700 33.58523
6 Medium b2  42.54609 36.16558
```

Class of each variable in a data frame, via sapply():

```
sapply(data2, class)
        Y         A         B         X 
"numeric"  "factor"  "factor" "numeric" 
```

or via colwise():

```
library(plyr)
colwise(class)(data2)
        Y      A      B       X
1 numeric factor factor numeric
```
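For completeness, equivalent summaries can also be produced with dplyr (loaded earlier as part of the tidyverse); a sketch, again assuming the data2 set from above:

```
library(dplyr)
data2 %>%
  group_by(A, B) %>%
  summarize(meanY = mean(Y, na.rm=TRUE))
```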
Graphical summaries
Although numerical summaries can provide broad overviews of the data, they can be a gross oversimplification. Notably, point estimates of location and spread tend to conceal most of the data and do not provide a full sense of the distribution of the data. Similarly, measures of correlation provide only a single, very narrow perspective of the trend between variables.
Graphical summaries by contrast provide a way to visualize data and trends and thus have the potential to be more useful in informing analytical decisions as well as in identifying and diagnosing possible analysis issues.
Exploring distributions
Histograms
Histograms are useful for representing the distribution of observations for large ($n>30$) sample sizes.
Normal data with a mean of 10 and standard deviation of 2:

```
set.seed(1)
VAR <- rnorm(100,10,2)
hist(VAR)
```

Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR <- data.frame(VAR=rnorm(100,10,2))
ggplot(gVAR, aes(x=VAR)) + geom_histogram()
```

Skewed (log-normal) data with a log-scale mean of log(10) and log-scale standard deviation of log(2):

```
set.seed(1)
VAR <- rlnorm(100,log(10),log(2))
hist(VAR)
```

Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR <- data.frame(VAR=rlnorm(100,log(10),log(2)))
ggplot(gVAR, aes(x=VAR)) + geom_histogram()
```
The number or size of the bins can be controlled by passing, respectively, a single number or a vector of bin breakpoints with the breaks= argument. It is also possible to pass a function that computes the number of breaks, or the name of a breaking algorithm. Specifying the probability=TRUE argument will express the counts in each bin as densities (probabilities) rather than as frequencies.
```
set.seed(1)
VAR <- rnorm(100,10,2)
hist(VAR, breaks=18, probability=TRUE)
# or equivalently
hist(VAR, breaks=seq(5.5,15,by=.5), probability=TRUE)
```
Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR <- data.frame(VAR=rnorm(100,10,2))
ggplot(gVAR, aes(x=VAR)) + geom_histogram(aes(y=..density..), binwidth=0.5)
#OR
ggplot(gVAR, aes(x=VAR)) + geom_histogram(aes(y=..density..), breaks=seq(5.5,15,by=.5))
```
Density
Probability density functions are also useful additions or alternatives to histograms as they further assist in describing the patterns of the underlying distribution. Typical kernel density functions fit a series of kernels (symmetric probability functions) to successive subsets (windows) of the ordered data, from which new estimates of the observations are calculated. The resolution and texture (smoothness) of the density function is controlled by a smoothing parameter, which essentially defines the width of the kernel window.
A density function can be plotted by passing the output of the density() function to the high-level overloaded plot() function.
```
set.seed(1)
VAR <- rnorm(100,10,2)
plot(density(VAR))
```
Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR <- data.frame(VAR=rnorm(100,10,2))
ggplot(gVAR, aes(x=VAR)) + geom_density()
#OR
ggplot(gVAR, aes(x=VAR)) + geom_density(aes(y=..count..))
#OR
ggplot(gVAR, aes(x=VAR)) + geom_density(aes(y=..scaled..))
```
The type of smoothing kernel (gaussian, i.e. normal, by default) can be defined with the kernel= argument, and the degree of smoothing is controlled by the bw= (bandwidth) argument. The higher the smoothing bandwidth, the greater the degree of smoothing.

```
set.seed(1)
VAR <- rnorm(100,10,2)
plot(density(VAR, bw=1))
```
Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR <- data.frame(VAR=rnorm(100,10,2))
ggplot(gVAR, aes(x=VAR)) + geom_density(adjust=2) + xlim(0,20)
```
The density function can also be added to a histogram by passing the output of density() to the low-level lines() function.

```
set.seed(1)
VAR1 <- rlnorm(100,log(10),log(2))
hist(VAR1, prob=TRUE)
lines(density(VAR1))
```
Via ggplot

```
library(ggplot2)
set.seed(1)
gVAR1 <- data.frame(VAR1=rlnorm(100,log(10),log(2)))
ggplot(gVAR1, aes(x=VAR1)) + geom_histogram(aes(y=..density..)) + geom_density()
#OR
ggplot(gVAR1, aes(x=VAR1)) + geom_histogram(aes(y=..density..), fill='white', color='black') + geom_density()
#OR
ggplot(gVAR1, aes(x=VAR1)) + geom_histogram(aes(y=..density..), fill='white', color='black') +
  geom_density(fill='blue', alpha=0.3) + theme_classic()
```
Q-Q plots
Q-Q (quantile-quantile) normal plots can also be useful for diagnosing departures from normality by comparing the data quantiles to those of a standard normal distribution. Substantial deviations from linearity indicate departures from normality. Quantiles are a regular spacing of points throughout an ordered data set.
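To demystify what qqnorm() is doing, the same plot can be constructed by hand; a sketch using the normally distributed VAR vector generated earlier:

```
## Ordered sample values against the corresponding standard normal quantiles
n.obs <- length(VAR)
plot(qnorm(ppoints(n.obs)), sort(VAR),
     xlab="Theoretical quantiles", ylab="Sample quantiles")
```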
Normal distribution:

```
qqnorm(VAR)
qqline(VAR)
```

Via ggplot

```
library(ggplot2)
ggplot(gVAR, aes(sample=VAR)) + stat_qq() +
  geom_abline(intercept=mean(gVAR$VAR), slope=sd(gVAR$VAR))
#OR
ggplot(gVAR, aes(sample=VAR)) + geom_point(stat='qq')
```

Log-normal distribution:

```
qqnorm(VAR1)
qqline(VAR1)
```

Via ggplot

```
library(ggplot2)
ggplot(gVAR1, aes(sample=VAR1)) + stat_qq() +
  geom_abline(intercept=mean(gVAR1$VAR1), slope=sd(gVAR1$VAR1))
```
Boxplots
For smaller sample sizes, histograms and density functions can be difficult to interpret. Boxplots (or box-and-whisker plots) provide an alternative means of depicting the location (median), variability and shape of the distribution of data. The dimensions of a boxplot are defined by the five-number summary of the data (minimum value, lower quartile (Q1), median (Q2), upper quartile (Q3) and maximum value), with each adjacent pair of values spanning approximately 25% of the data.
Univariate boxplots are generated by passing a vector to the boxplot() function.
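The statistics that define the box and whiskers can also be extracted numerically; a small sketch using base R's boxplot.stats():

```
set.seed(11)
VAR2 <- rnorm(15,10,2)
## Extreme of lower whisker, lower hinge, median, upper hinge, extreme of upper whisker
boxplot.stats(VAR2)$stats
```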
Normal distribution:

```
set.seed(11)
VAR2 <- rnorm(15,10,2)
boxplot(VAR2)
```

Via ggplot

```
library(ggplot2)
set.seed(11)
gVAR2 <- data.frame(VAR2=rnorm(15,10,2))
ggplot(gVAR2, aes(y=VAR2, x=1)) + geom_boxplot()
```

Log-normal distribution:

```
set.seed(11)
VAR3 <- rlnorm(15,log(10), log(2))
boxplot(VAR3)
```

Via ggplot

```
library(ggplot2)
set.seed(11)
gVAR3 <- data.frame(VAR3=rlnorm(15,log(10), log(2)))
ggplot(gVAR3, aes(y=VAR3, x=1)) + geom_boxplot()
```
The horizontal=TRUE argument is used to produce horizontally aligned boxplots.

```
boxplot(VAR3, horizontal=TRUE)
```
Via ggplot

```
library(ggplot2)
ggplot(gVAR3, aes(y=VAR3, x=1)) + geom_boxplot() + coord_flip()
```
Conditional boxplots
When a categorical predictor has more than three levels, patterns of variance are more readily explored with boxplots. Rather than a single boxplot representing all of the observed responses, the data are split up (conditioned) according to the levels of a categorical variable.
The height (size) of a boxplot is approximately proportional to the variance of the underlying data and the position of the boxplot along the Y-axis is proportional to the mean. Hence we can explore whether there is likely to be a relationship between population means and variances (which is not desirable) via conditional boxplots.
```
set.seed(10)
A <- gl(n=5, k=10, length=50, lab=paste("a",1:5, sep=""))
Xmat <- model.matrix(~A)
Y <- as.numeric(Xmat %*% c(20,2,5,10,20))
Y <- Y + rlnorm(50, log(Y/10), log(Y/10))
boxplot(Y~A)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(Y=Y, A=A)
ggplot(dat, aes(y=Y, x=A)) + geom_boxplot()
```
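The impression gained from the conditional boxplots can be backed up numerically by tabulating group means against group variances; a quick sketch using the Y and A vectors just generated:

```
## An increasing trend down this table suggests a mean-variance relationship
cbind(mean=tapply(Y, A, mean), var=tapply(Y, A, var))
```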
Rug charts
Another representation of the data that can be added to existing plots is a rug chart, which displays the values as a series of ticks along the axis. Rug charts can be particularly useful for revealing artifacts (such as clumping) in the data that are "smoothed over" by histograms, boxplots and density functions.
```
boxplot(VAR3, horizontal=TRUE)
rug(VAR3, side=1)
```

Via ggplot

```
library(ggplot2)
ggplot(gVAR3, aes(y=VAR3, x=1)) + geom_boxplot() + coord_flip() + geom_rug(sides='b')
```

```
plot(density(VAR1))
rug(VAR1, side=1)
```

Via ggplot

```
library(ggplot2)
ggplot(gVAR1, aes(x=VAR1)) + geom_density() + geom_rug(sides='b')
```
Exploring relationships
When two or more continuous variables are collected, we often intend to explore the nature of the relationships between the variables. Such trends can be depicted graphically in scatterplots. Scatterplots display a cloud of points, the coordinates of which correspond to the values of the variables that define the horizontal and vertical axes.
Scatterplots
Although scatterplots do not formally distinguish between response (dependent) and predictor (independent) variables, when such distinctions occur, independent variables are conventionally plotted along the horizontal (x) axis.
Scatterplots are used prior to analyses to help assess the suitability of the data to particular analytical procedures. Of particular importance are the insights they provide into the linearity and patterns of variability of trends. They are also presented post analysis as summaries of the trends and analyses.
Normality, linearity and homogeneity of variance met:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- (2*x) + 3 + rnorm(20, 0, 5)
plot(y~x)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point()
```

Normality, linearity and homogeneity of variance violated:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
plot(y~x)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point()
```
Trendline
Fitting a linear trendline (line of best fit) through the data cloud can help diagnose issues with non-linearity and homogeneity of variance. Ideally, the data should be spread evenly about the trendline. Increasing spread along the trendline is symptomatic of a relationship between mean and variance, and thus evidence of non-homogeneity.
Normality, linearity and homogeneity of variance met:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- (2*x) + 3 + rnorm(20, 0, 5)
plot(y~x)
abline(lm(y~x), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='lm')
```

Normality, linearity and homogeneity of variance violated:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
plot(y~x)
abline(lm(y~x), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='lm')
```
Lowess smoother
A lowess (locally weighted polynomial regression) smoother generates a curve by stitching together a series of polynomials that each apply to different regions of the data cloud. When a trend is consistently linear throughout its length, each of the smaller polynomial trends (and thus the overall smoother) should be approximately linear. Thus major deviations from linearity can be identified via lowess smoothers that display multiple trajectories (directions).
Analytical decisions regarding the use of scale transformations, non-linear modeling or additive models can be made on the basis of these simple smoothers.
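For instance, if a smoother suggests a monotonic yet curvilinear trend, a simple scale transformation can be trialled; a sketch, re-generating the curvilinear data used in the 'violated' example below:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
## The trend is closer to linear on a log x scale
plot(y ~ log(x))
lines(lowess(y ~ log(x)), lty=2)
```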
Normality, linearity and homogeneity of variance met:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- (2*x) + 3 + rnorm(20, 0, 5)
plot(y~x)
lines(lowess(y~x), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='loess')
```

Normality, linearity and homogeneity of variance violated:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
plot(y~x)
lines(lowess(y~x), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='loess')
```
Spline smoothers

Splines approximate a trend with piecewise polynomials joined smoothly at knots; a simple natural spline basis (ns() from the splines package) can be fitted with lm() and used in the same diagnostic manner as a lowess smoother.
Normality, linearity and homogeneity of variance met:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- (2*x) + 3 + rnorm(20, 0, 5)
plot(y~x)
library(splines)
sm <- lm(y ~ ns(x, df=3))
xs <- seq(0, 30, l=100)
lines(xs, predict(sm, newdata=data.frame(x=xs)), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='gam')
#OR
ggplot(dat, aes(x=x, y=y)) + geom_point() +
  geom_smooth(method='gam', formula=y~s(x, k=3, fx=TRUE))
```

Normality, linearity and homogeneity of variance violated:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
plot(y~x)
library(splines)
sm <- lm(y ~ ns(x, df=3))
xs <- seq(0, 30, l=100)
lines(xs, predict(sm, newdata=data.frame(x=xs)), lty=2)
```

Via ggplot

```
library(ggplot2)
dat <- data.frame(y=y, x=x)
ggplot(dat, aes(x=x, y=y)) + geom_point() + geom_smooth(method='gam')
#OR
ggplot(dat, aes(x=x, y=y)) + geom_point() +
  geom_smooth(method='gam', formula=y~s(x, k=3, fx=TRUE))
```
scatterplot()
To facilitate all of these diagnostic features, as well as marginal boxplots, the high-level scatterplot() function (car package) is very useful. Note that the scatterplot() function fits a lowess rather than a loess smoother.
Normality, linearity and homogeneity of variance met:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- (2*x) + 3 + rnorm(20, 0, 5)
library(car)
scatterplot(y~x)
```

Normality, linearity and homogeneity of variance violated:

```
set.seed(7)
x <- runif(20, 0, 30)
y <- 3*x/(8+x) + rnorm(20, 0, (0.15*x/10))
library(car)
scatterplot(y~x)
```
Scatterplot matrix
When there are numerous potential predictors, we not only need to be concerned about the nature of their distributions and their relationships with the response variable; models involving multiple predictor variables are often compromised by the presence of predictors that are correlated with one another (collinearity).
- Normally distributed response ($y$) with mean of 20 and variance of 5 ($N(\mu=20,\sigma^2=5)$)
- Normally distributed predictor 1 ($x1$) with mean of 50 and variance of 10 ($N(\mu=50,\sigma^2=10)$)
- Normally distributed predictor 2 ($x2$) with mean of 100 and variance of 20 ($N(\mu=100,\sigma^2=20)$)
- In order to control (approximately fix) the degree of correlation between the three variables, we need to sample from a multivariate normal distribution. The multivariate normal distribution is defined by the variable means and the variance-covariance matrix. Since \begin{align*} Corr(x,y) &= Cov(x,y)/\sqrt{Var(x) Var(y)}\\ Cov(x,y) &= Corr(x,y)\sqrt{Var(x) Var(y)} \end{align*} we can use the desired correlations to calculate the covariances between each pair of variables
- Correlations of $Corr(Y,X1)=0.7$, $Corr(Y,X2)=-0.6$ and $Corr(X1,X2)=-0.9$
- Use the rmvnorm() function from the mvtnorm package to generate 50 sets of observations from a multivariate normal distribution with the specified means and variance-covariance matrix
```
set.seed(3)
y.mean <- 20; y.var <- 5
x1.mean <- 50; x1.var <- 10
x2.mean <- 100; x2.var <- 20
cor.yx1 <- 0.7;   cov.yx1 <- cor.yx1*sqrt(y.var*x1.var)
cor.yx2 <- -0.6;  cov.yx2 <- cor.yx2*sqrt(y.var*x2.var)
cor.x1x2 <- -0.9; cov.x1x2 <- cor.x1x2*sqrt(x1.var*x2.var)
sigma <- matrix(c(y.var,   cov.yx1,  cov.yx2,
                  cov.yx1, x1.var,   cov.x1x2,
                  cov.yx2, cov.x1x2, x2.var), 3, 3)
library(mvtnorm)
x <- rmvnorm(n=50, mean=c(y.mean, x1.mean, x2.mean), sigma=sigma)
data <- data.frame(Y=x[,1], X1=x[,2], X2=x[,3])
```
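It is worth confirming that the realized correlations approximate the targets; a quick sketch:

```
## Compare with the specified correlations (0.7, -0.6 and -0.9)
round(cor(data), 2)
```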
Scatterplot matrices display a panel of scatterplots between each pair of variables when there are three or more continuous variables. A given variable makes up the x-axis of each of the panels along its column and the y-axis of each of the panels along its row. The diagonal panels are often populated with univariate plots such as boxplots, histograms or density functions, and the upper-right panels mirror the lower-left panels. There are a few high-level plotting functions for producing scatterplot matrices:
- the pairs() function is an extension of the regular plot() function. Different functions can be applied to the lower, upper and diagonal panels of the grid. A lowess smoother is supported by the panel.smooth function, and it is also possible to define alternative panel functions. The example below places horizontal boxplots in the diagonal panels. Since the upper panels mirror the lower panels, they can be removed by setting the upper.panel= parameter to NULL.
```
# define a boxplot panel function
panel.bxp <- function(x, ...) {
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 2))
    boxplot(x, add=TRUE, horizontal=TRUE)
}
pairs(~Y+X1+X2, data=data, lower.panel=panel.smooth,
      diag.panel=panel.bxp, upper.panel=NULL, gap=0)
```
- An increase in X1 appears to be associated with an increase in Y
- An increase in X2 appears to be associated with a decline in Y
- X1 and X2 appear to be correlated to one another. This is likely to destabilize any linear models
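The severity of such collinearity can be quantified with variance inflation factors; a sketch using vif() from the car package (values much greater than 1 indicate that a predictor is largely redundant given the others):

```
library(car)
## VIFs for a model containing both correlated predictors
vif(lm(Y ~ X1 + X2, data=data))
```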
- the ggpairs() function (GGally package) produces a similar matrix and can also differentiate trends for different levels (groups) of a categorical variable. To illustrate, we will use the cut() function to convert the X1 vector into a categorical variable with two levels (small and large).

```
library(GGally)
ggpairs(data, lower=list(continuous='smooth'),
        diag=list(continuous='density'), axisLabels='show')
```

```
library(GGally)
data$A <- cut(data$X1, br=2, lab=c('small','large'))
ggpairs(subset(data, select=-X2), color='A',
        lower=list(continuous='smooth'),
        diag=list(continuous='density'), axisLabels='show')
```
- the scatterplotMatrix() function (car package) supports lowess smoothers and univariate summaries (such as boxplots) in the diagonal panels

```
library(car)
scatterplotMatrix(~Y+X1+X2, data=data, diag='boxplot')
```
The scatterplotMatrix() function can also differentiate trends for different levels (groups) of a categorical variable. To illustrate, we will use the cut() function to convert the X1 vector into a categorical variable with two levels (small and large).
```
library(car)
scatterplotMatrix(~Y+X2, groups=cut(data$X1, br=2, lab=c('small','large')),
                  by.groups=TRUE, data=data, diag='boxplot')
```
- The trends between X2 and Y appear to be different for each level of the categorical variable - this would cause problems for the traditional ANCOVA model
- Each level of the categorical variable appears to be associated with a different span of X2 values - this would cause problems for the traditional ANCOVA model
Exploring interactions - means plots
Interactions are outcomes in which the effects of one factor are dependent on the levels of other factor(s). That is, the effect of one factor is not consistent across all levels of the other factors. Interaction plots depict the mean response value of each combination of factor levels (groups) and are therefore useful for interpreting interactions.
Information on the presence of interactions can help inform whether multiplicative models (models with interaction terms) are necessary as well as provide context for interpreting interaction effects.
- Total sample size of 30
- Factor A has 3 levels, each with 10 replicates
- Factor B has 2 levels, each with 15 replicates
- The mean of the first cell (a1b1) is 20 (this is the base)
- Levels a2 and a3 are 10 and 15 units higher than the base, respectively
- Level b2 is 5 units lower than the base
- The interaction effect for cell a2b2 is 0
- The interaction effect for cell a3b2 is -10
- The response variable (Y) is equal to the linear predictor (fully factorial, multiplicative model) plus normally distributed errors with a mean of 0 and a standard deviation of 5
```
set.seed(1)
n.A <- 3
n.B <- 2
n.Rep <- 5
n <- n.A*n.B*n.Rep
A <- gl(n=n.A, k=n.Rep*n.B, length=n, lab=c("a1","a2","a3"))
B <- gl(n=n.B, k=n.Rep, length=n, lab=c("b1","b2"))
# Effects
base <- 20
A.eff <- c(10,15)
B.eff <- -5
AB.eff <- c(0,-10)
all.eff <- c(base, A.eff, B.eff, AB.eff)
sigma <- 5
eps <- rnorm(n,0,sigma)
X <- model.matrix(~A*B)
Y <- as.numeric(X %*% all.eff + eps)
data <- data.frame(Y,A,B)
```
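The cell means themselves are worth inspecting before plotting; a minimal sketch:

```
## Mean of Y for each A:B combination - differences between the b1 and b2
## columns that change across rows signal an interaction
with(data, tapply(Y, list(A,B), mean))
```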
There are a few high-level plotting functions for producing interaction plots:
- the interaction.plot() function
interaction.plot(data$A,data$B,data$Y, type='b', pch=16)
- The effect of factor B (difference between b1 and b2) on Y is not consistent across all levels of factor A, and vice versa.
- the plotmeans() function (gplots package)

```
library(gplots)
plotmeans(Y~interaction(B,A), data=data, connect=list(c(1,3,5), c(2,4,6)))
```
- the ggplot function is also ideal for these sorts of data summaries.
```
library(ggplot2)
ggplot(data, aes(y=Y, x=A, group=B, linetype=B)) +
  geom_pointrange(stat='summary', fun.data="mean_cl_boot") +
  geom_line(stat='summary', fun.y='mean')
#OR
ggplot(data, aes(y=Y, x=A, group=B, linetype=B)) +
  geom_pointrange(stat='summary', fun.data="mean_cl_boot",
                  position=position_dodge(width=0.5), linetype=1) +
  geom_line(stat='summary', fun.y='mean', position=position_dodge(width=0.5))
```
In the case of models containing a mixture of continuous and categorical predictors (ANCOVA), interaction plots can help explore whether the slopes are likely to be homogeneous (ANCOVA models assume that the relationships between response and continuous predictor are very similar for each level of the categorical predictor).
- Factor A has two levels ("a1" and "a2") and 20 replicates of each for a total sample size of 40
- The continuous covariate (X) is drawn from a uniform distribution with a minimum of 20 and maximum of 50
- In the linear predictor model matrix, X is approximately centered (such that the intercept is vaguely informative).
- The intercept is 30 (value of the response when the centered covariate is zero).
- The slope of the relationship between Y and X at level a1 is 2
- Level a2 is five units higher than a1 at centered X
- The slope of the relationship between Y and X at a2 is 2 units less than it is at a1
- The noise (errors) are normally distributed with a mean of 0 and a standard deviation of 5
```
set.seed(1)
n.A <- 2
n.Rep <- 20
n <- n.A*n.Rep
A <- gl(n=n.A, k=n.Rep, length=n, lab=c("a1","a2"))
X <- runif(n,20,50)
Xmat <- model.matrix(~A*I(X-35))
# Effects
base <- 30
X.slope <- 2
A.eff <- c(5)
AX.eff <- c(-2)
all.eff <- c(base, A.eff, X.slope, AX.eff)
sigma <- 5
eps <- rnorm(n,0,sigma)
Y <- as.numeric(Xmat %*% all.eff + eps)
data1 <- data.frame(Y,A,X)
```
```
library(car)
scatterplot(Y~X|A, data=data1)
```
```
library(ggplot2)
ggplot(data1, aes(y=Y, x=X, group=A, color=A)) +
  geom_point() + geom_smooth(method='lm')
```
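As a numerical complement to these plots, the group-specific slopes can be estimated separately; a rough sketch (formal assessment of slope homogeneity belongs in the fitted model itself):

```
## Slope of Y on X within each level of A
sapply(split(data1, data1$A), function(d) coef(lm(Y ~ X, data=d))["X"])
```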