Outlier removal using normal distributions#
This tutorial shows how to calculate mean and variance of samples belonging to the same taxon and remove outliers based on normal distributions.
If you have questions, suggestions, spot errors, or want to contribute, get in touch with us through planthub@idiv.de.
Author: David Schellenberger Costa
Requirements#
To run the script, the following is needed:
the file stemDensitySample.txt, available here
some R libraries that may need to be installed
Code#
# load in libraries
library(data.table) # handle large datasets
# clear workspace
rm(list = ls())
# set working directory (adapt this!)
setwd(paste0(.brd, "stem density"))
Let us get the data. The sample file contains stem density measurements of about 60 species, with varying numbers of observations per species. We will extract the number of species, the number of the column containing the stem density values, and the maximum stem density for later use. We will also create a matrix to store mean and variance per species.
# read in data
stD <- fread("stemDensitySample.txt")
str(stD) # the dataset has four three columns: rownames, sample (==ID), species, and stem specific density (==SSD)
specs <- unique(stD$species) # unique species in the dataset
dCol <- grep("specific.density", colnames(stD)) # stem density column
stDMax <- max(stD[[dCol]]) # maximum stem density
meanAndVar <- matrix(NA, length(specs), 2) # create a matrix to store mean and variance of the density per species
rownames(meanAndVar) <- specs # rownames should be species names
colnames(meanAndVar) <- c("mean", "variance") # colnames should be mean and variance
There are a couple of species, and we may want to check them visually. To do this, we could create a pdf. If you would like to do so, uncomment the pdf(…) command and uncomment the dev.off() command at the end of the for loop. Otherwise, there will be a lot of plots in the graphics window.
# create a pdf to save figures in
# pdf("stemDensitySample.pdf", width=11.7, height=8.3) # width and height represent A4 paper inch sizes
# number of rows and columns per page for figures
par(mfrow = c(2, 3))
for (i in seq_along(specs)) {
# calculate mean and variance for each species
specMean <- mean(stD[stD$species == specs[i]][[dCol]])
specVar <- var(stD[stD$species == specs[i]][[dCol]])
# save mean and var for later use
meanAndVar[i, ] <- c(specMean, specVar)
# only draw normal distributions if > 4 samples available
if (sum(stD$species == specs[i]) > 4) {
# we first calculate the values of the density distribution for 500 values within the range [0,stDMax]
# and then plot it using type="l" to draw a line instead of single points
densityVals <- sapply(seq(0, stDMax, l = 500), function(x) dnorm(x, specMean, sqrt(specVar)))
plot(seq(0, stDMax, l = 500), densityVals,
main = specs[i],
xlab = "stem density [g*cm^-3]", ylab = "density", type = "l"
)
# we plot points to indicate the raw data used, the x coordinate being the sample
# density value, the y coordinate being set to half the maximum of the density function
# the y coordinate is repeated to match the number of samples
points(stD[stD$species == specs[i]][[dCol]], rep(max(densityVals) / 2, sum(stD$species == specs[i])))
# we plot a text to indicate the number of samples used at stDMax*1/4 on the x axis and
# 3/4 of the maximum of the density function on the y axis
text(x = stDMax * 1 / 4, y = max(densityVals) * 3 / 4, labels = paste("n =", sum(stD$species == specs[i])))
} else {
# we should also plot something when there are not enough samples to not forget
# we may want to resample later or use just the means of those species
# we set the range of the x axis to [0,stDMax] to match the other figures, while we
# (arbitrarly) set the range of the y axis to [0,1]
plot(stD[stD$species == specs[i]][[dCol]], rep(1 / 2, sum(stD$species == specs[i])),
main = specs[i], xlim = c(0, stDMax), ylim = c(0, 1), xlab = "stem density [g*cm^-3]", ylab = "density"
)
text(x = stDMax * 1 / 4, y = 3 / 4, labels = paste("n =", sum(stD$species == specs[i])))
}
}
# close pdf (necessary for viewing it!)
# dev.off()
Now we want to produce another mean and variance matrix with outliers removed beforehand. We will plot a curve with outliers and one with outliers removed in the same figure for each species again (or just for the first couple of species). The outliers are identified by using the means and variances of all observations per species and calculating the probability of each sample assuming the mean and variance per species are the true values. We then remove the values falling into the lowermost or uppermost 2.5% of the probability density distribution.
meanAndVarNO <- matrix(NA, length(specs), 2) # create a matrix to store mean and variance of the density per species
rownames(meanAndVarNO) <- specs
colnames(meanAndVarNO) <- c("mean", "variance")
# create a pdf to save figures in
# pdf("stemDensitySampleNO.pdf", width=11.7, height=8.3) # width and height represent A4 paper inch sizes
# number of rows and columns per page for figures
par(mfrow = c(2, 3))
for (i in seq_along(specs)) {
# save samples of a species in a variable for convenience
specVals <- stD[stD$species == specs[i]][[dCol]]
# use outlier removal only if > 4 samples available
if (length(specVals) > 4) {
# detect outliers by calculating a normal distribution from all other samples of the same
# species and check whether the sample in focus belongs into the outer 5% of the density
# distribution
probs <- c()
for (j in seq_along(specVals)) {
meanTemp <- mean(specVals[-j])
varTemp <- var(specVals[-j])
probs <- c(probs, pnorm(specVals[j], meanTemp, sqrt(varTemp)))
}
probs <- probs <= .975 & probs >= .025 # indicate which samples to use and which not based on the 5% criterion
} else {
# if less than 5 observations are available, keep all
probs <- rep(TRUE, length(specVals))
}
specMean <- meanAndVar[i, 1]
specVar <- meanAndVar[i, 2]
if (sum(stD$species == specs[i]) > 4) { # only draw normal distributions if > 4 samples available
# we first calculate the values of the density distribution for 500 values within the range [0,stDMax]
# and then plot it using type="l" to draw a line instead of single points
densityVals <- sapply(seq(0, stDMax, l = 500), function(x) dnorm(x, specMean, sqrt(specVar)))
if (sum(probs) > 4) { # only draw a outlier-corrected curve if > 4 samples remain
# recalculate mean and variance without outliers
specMean <- mean(specVals[probs])
specVar <- var(specVals[probs])
# recalculate the values of the density distribution and keep the values with outliers
densityValsOld <- densityVals
densityVals <- sapply(seq(0, stDMax, l = 500), function(x) dnorm(x, specMean, sqrt(specVar)))
plot(seq(0, stDMax, l = 500), densityValsOld,
main = specs[i],
ylim = c(0, max(c(densityVals, densityValsOld))), xlab = "stem density [g*cm^-3]", ylab = "density", type = "l"
)
# we add the newly calculated density values to the existing plot, but use red color for distinction
lines(seq(0, stDMax, l = 500), densityVals, col = "red")
# we define a color vector to add the raw data used, with black points indicating outliers
# and red points indicating no outliers
cols <- rep("black", length(specVals))
cols[probs == TRUE] <- "red"
points(specVals, rep(max(c(densityVals, densityValsOld)) / 2, length(specVals)), col = cols)
# we plot a text to indicate the number of samples used at stDMax*1/4 on the x axis and
# 3/4 of the maximum of the density function on the y axis
text(x = stDMax * 1 / 4, y = max(densityVals) * 3 / 4, labels = paste("n =", sum(probs)))
} else { # if not enough data remain, do not draw an outlier-corrected version
plot(seq(0, stDMax, l = 500), densityVals,
main = specs[i],
ylim = c(0, max(densityVals)), xlab = "stem density [g*cm^-3]", ylab = "density", type = "l"
)
# we plot points to indicate the raw data used, the x coordinate being the sample
# density value, the y coordinate being set to half the maximum of the density function
# the y coordinate is repeated to match the number of samples
points(stD[stD$species == specs[i], dCol], rep(max(densityVals) / 2, sum(stD$species == specs[i])))
# we plot a text to indicate the number of samples used at stDMax*1/4 on the x axis and
# 3/4 of the maximum of the density function on the y axis
text(x = stDMax * 1 / 4, y = max(densityVals) * 3 / 4, labels = paste("n =", sum(stD$species == specs[i])))
}
} else {
# we should also plot something when there are not enough samples to not forget
# we may want to resample later or use just the means of those species
# we set the range of the x axis to [0,stDMax] to match the other figures, while we
# (arbitrarly) set the range of the y axis to [0,1]
plot(stD[stD$species == specs[i]][[dCol]], rep(1 / 2, sum(stD$species == specs[i])),
main = specs[i], xlim = c(0, stDMax), ylim = c(0, 1), xlab = "stem density [g*cm^-3]", ylab = "density"
)
text(x = stDMax * 1 / 4, y = 3 / 4, labels = paste("n =", sum(stD$species == specs[i])))
}
# we save mean and var for later use
meanAndVarNO[i, ] <- c(specMean, specVar)
}
# close a pdf to finalize
# dev.off()
We now have two matrices, one with mean and var of the stem density of each species calculated using all values available, and one in which mean and variance for species with > 4 values have been calculated with the values that are not considered outliers. Let’s write the data into a file. Inside the fwrite() command, we convert the matrices used to store the values into data.table format.
fwrite(meanAndVar, file = "stemDensitySampleMeanAndVar.txt")
fwrite(meanAndVarNO, file = "stemDensitySampleMeanAndVarNO.txt")