## The opinionated estimator

You have been lied to. By me.

I once taught a programming class and introduced my students to the notion of an unbiased estimator of the variance of a population. The problem can be stated as follows: given a set of observations $(x_1, x_2, …, x_n)$, what can you say about the variance of the population from which this sample is drawn?

Classical textbooks, MathWorld, the Khan Academy, and Wikipedia all give you the formula for the so-called unbiased estimator of the population variance:

$$\hat{s}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean. The expected error of this estimator is zero:

$$E[\hat{s}^2 - \sigma^2] = 0$$

where $\sigma^2$ is the “true” population variance. Put another way, the expected value of this estimator is exactly the population variance:

$$E[\hat{s}^2] = \sigma^2$$

So far so good. The expected error is zero, therefore it’s the best estimator, right? This is what orthodox statistics (and teachers like me who don’t know better) will have you believe.

But Jaynes (Probability Theory) points out that in practical problems one does not care about the expected error of the estimated variance (or of any estimator, for that matter). What matters is how accurate the estimator is, i.e. how close it is to the true variance. And this calls for an estimator that minimises the expected squared error $E[(\hat{s}^2 - \sigma^2)^2]$. But we can also write this expected squared error as:

$$E[(\hat{s}^2 - \sigma^2)^2] = (E[\hat{s}^2] - \sigma^2)^2 + \mathrm{Var}(\hat{s}^2)$$

The expected squared error of our estimator is thus the sum of two terms: the square of the expected error (the bias), and the variance of the estimator. The cookbooks of orthodox statistics minimise only the first term, so there is no guarantee that the total error is minimised.
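This decomposition follows from adding and subtracting $E[\hat{s}^2]$ inside the square:

$$\begin{aligned}
E[(\hat{s}^2 - \sigma^2)^2] &= E[(\hat{s}^2 - E[\hat{s}^2] + E[\hat{s}^2] - \sigma^2)^2] \\
&= (E[\hat{s}^2] - \sigma^2)^2 + \mathrm{Var}(\hat{s}^2)
\end{aligned}$$

where the cross term vanishes because $E[\hat{s}^2 - E[\hat{s}^2]] = 0$.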

For samples drawn from a Gaussian distribution, Jaynes shows that an estimator that minimises the total (squared) error is

$$\hat{s}^2 = \frac{1}{n+1}\sum_{i=1}^n (x_i - \bar{x})^2$$

Notice that the $n-1$ denominator has been replaced with $n+1$. In a fit of fancifulness I’ll call this an opinionated estimator. Let’s test how well this estimator performs.
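Here is a quick sketch of where the $n+1$ comes from (my summary of the standard argument, assuming Gaussian samples). Writing $S = \sum_i (x_i - \bar{x})^2$, the quantity $S/\sigma^2$ follows a $\chi^2$ distribution with $n-1$ degrees of freedom, so $E[S] = (n-1)\sigma^2$ and $\mathrm{Var}(S) = 2(n-1)\sigma^4$. For any estimator of the form $cS$, the expected squared error is

$$E[(cS - \sigma^2)^2] = (c(n-1) - 1)^2\sigma^4 + 2c^2(n-1)\sigma^4$$

Setting the derivative with respect to $c$ to zero gives $c(n-1) - 1 + 2c = 0$, i.e. $c = 1/(n+1)$.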

First we generate 1000 random sets of 10 samples with mean 0 and variance 25:

```r
samples <- matrix(rnorm(10000, sd = 5), ncol = 10)
```


For each group of 10 samples, we first estimate the population variance with the canonical $n-1$ denominator. This is what R's built-in var function does, according to its documentation:

```r
unbiased <- apply(samples, MARGIN = 1, var)
```


Next we estimate the population variance with the $n+1$ denominator. We take a little shortcut here by multiplying the unbiased estimator by $(n-1)/(n+1)$, but the result is exactly the same:

```r
opinionated <- apply(samples, MARGIN = 1,
                     function(x) var(x) * (length(x) - 1) / (length(x) + 1))
```
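Equivalently, the opinionated estimator can be computed directly from its definition (opinionated.var is a name of my own choosing, not a standard function):

```r
# Opinionated estimator: divide the sum of squared deviations by n + 1
opinionated.var <- function(x) {
    sum((x - mean(x))^2) / (length(x) + 1)
}

# Both formulations agree:
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
all.equal(opinionated.var(x), var(x) * (length(x) - 1) / (length(x) + 1))
## [1] TRUE
```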


Finally we combine everything in one convenient data frame and plot both distributions, with the true variance marked in red:

```r
library(lattice)

estimators <- rbind(data.frame(estimator = "Unbiased", estimate = unbiased),
                    data.frame(estimator = "Opinionated", estimate = opinionated))

histogram(~ estimate | estimator, estimators,
          panel = function(...) {
              panel.histogram(...)
              panel.abline(v = 25, col = "red", lwd = 5)
          })
```


It’s a bit hard to tell visually which one is “better”. But let’s compute the average squared error for each estimator:

```r
aggregate(estimate ~ estimator, estimators, function(x) mean((x - 25)^2))
##     estimator estimate
## 1    Unbiased 145.1007
## 2 Opinionated 115.5074
```
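The trade-off becomes apparent when we look at the average value of each estimator rather than its squared error. Here is a self-contained re-run with a fixed seed (the seed is my addition, so the exact numbers will differ from the run above):

```r
set.seed(42)  # for reproducibility; not used in the run above
samples <- matrix(rnorm(10000, sd = 5), ncol = 10)
unbiased <- apply(samples, MARGIN = 1, var)
opinionated <- unbiased * (10 - 1) / (10 + 1)

# The unbiased estimator averages close to the true variance (25),
# while the opinionated one is biased low (around 25 * 9/11, i.e. 20.5)...
mean(unbiased)
mean(opinionated)

# ...but the opinionated estimator still has the smaller mean squared error.
mean((unbiased - 25)^2)
mean((opinionated - 25)^2)
```

The opinionated estimator deliberately accepts some bias in exchange for a larger reduction in variance, and wins on total error.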


This shows clearly that the $n+1$ denominator yields a smaller total (squared) error than the so-called unbiased $n-1$ estimator, at least for samples drawn from a Gaussian distribution.

So do your brain a favour and question everything I tell you. Including this post.

## Biblical kings and boxplots

If you have read through the biblical books of Kings, you may have been struck by a phrase that repeats itself for every monarch:

> In the Xth year of (king of kingdom B), (name of king) became king of (kingdom A). He reigned N years, and did (evil|good) in the sight of the Lord.

If you’ve read through these books several times, you will probably have noticed that the shorter reigns tend to belong to kings deemed to have done evil, with a record-breaking 3 months for Jehoahaz and Jehoiachin. Let’s see if there’s any relationship between reign duration and “goodness” of the king. First we prepare the data in a form suitable for analysis:

```r
kings.data <- "Name Deeds Reign
David Good 40
Solomon Good 40
Rehoboam Evil 17
Abijah Evil 3
Asa Good 41
Jehoshaphat Good 25
Jehoram Evil 8
Ahaziah Evil 1
Joash Good 40
Amaziah Good 29
Azariah Good 52
Jotham Good 16
Ahaz Evil 16
Hezekiah Good 29
Manasseh Evil 55
Amon Evil 2
Josiah Good 31
Jehoahaz Evil 0.25
Jehoiakim Evil 11
Jehoiachin Evil 0.25
Zedekiah Evil 11"

kings <- read.table(text = kings.data, header = TRUE, row.names = 1)
```


The kings data frame holds one row for each king. The row names are the names of the kings; the column Deeds is a two-level factor telling if their reign was deemed good or evil. (We assume here that Solomon was a good guy in spite of what happened towards the end of his life.) The Reign column records the length of the reign as given in the Bible in years, with fractional values for reigns shorter than one year.
Counting the kings in each category:

```r
> table(kings$Deeds)

Evil Good 
  11   10 
```

Here we compute the median reign duration for each rating of the deeds:

```r
> with(kings, tapply(Reign, Deeds, median))
Evil Good 
 8.0 35.5 
```

There's already a good indication that the length of the reign depends on the deeds. We can now plot the length of the reigns:

```r
library(car)
Boxplot(Reign ~ Deeds, kings)
```

This plot confirms our impression: "evil" kings tend to have shorter reigns than "good" kings, with the obvious exception of Manasseh, the same one of whom it was said:

> I've arranged for four kinds of punishment: death in battle, the corpses dropped off by killer dogs, the rest picked clean by vultures, the bones gnawed by hyenas. They'll be a sight to see, a sight to shock the whole world—and all because of Manasseh son of Hezekiah and all he did in Jerusalem. (Jer 15:3-4)

So what does this all prove? Probably nothing. Plotting data is its own reward. :-)

## Book review: Advanced R

I would like to call this the best second book on R, except that I wouldn't know what the first one would be. I learned R from classes and tutorials about 10 years ago, used it for my PhD and four articles, and use it today on a daily basis at work; yet only now, after reading this book, do I feel like I could possibly be called an R programmer rather than just a user.

The book deals with a variety of topics that are seldom discussed in the R tutorials you are likely to find freely available. Some are perhaps unnecessary in a book like this (Ch. 5, Style guide), some could easily deserve an entire book (Ch. 7, OO field guide), but the chapters on Functions and Environments, the three chapters in Part II (Functional Programming), and the chapter on Non-standard evaluation are easily reason enough to buy this book.

How many times indeed have you spent hours, frustrated, trying to write a function that would act as a wrapper around, say, lattice plotting functions that use non-standard evaluation? Or tried to call subset() from another function, only to see cryptic error messages? Now, for the first time, the solution is not only clear to me; I feel like I could also explain to a colleague why things work the way they do.

R is incredibly powerful and dynamic and will, most of the time, do just what you expect it to do. But if you want to really understand what is going on under the hood, or if you want to write your own packages (whether for self-, internal-, or widespread use), you owe it to yourself to read this book.

## Bayesian tanks

The frequentist vs Bayesian debate has plagued the scientific community for almost a century now, yet most of the arguments I've seen involve philosophical considerations rather than hard data. Instead of letting the sun explode, I propose a simpler experiment to assess the performance of each approach. The problem reads as follows (taken from Jaynes's Probability Theory):

> You are traveling on a night train; on awakening from sleep, you notice that the train is stopped at some unknown town, and all you can see is a taxicab with the number 27 on it. What is then your guess as to the number N of taxicabs in the town, which would in turn give a clue as to the size of the town?

In a different setting, this problem is also known as the German tank problem, where again the goal is to estimate the total size of a population from the serial numbers observed on a small sample of that population.

The frequentist and Bayesian approaches yield completely different estimates for the number $N$. To see which approach gives the most satisfactory estimates, let's generate a large number of such problems and look at the error distributions that arise from either approach.

```r
n.runs <- 10000
N.max <- 1000
N <- sample.int(n = N.max, size = n.runs, replace = TRUE)
m <- sapply(N, sample, size = 1)
```

We run this experiment n.runs times. Each time we select a random population size N uniformly drawn between 1 and N.max. We then draw a random number m between 1 and N, representing the serial number that is observed.

The frequentist approach gives the following formula for estimating $N$: $\hat{N} = 2m - 1$. Intuitively, the argument runs that the expected value of $m$ is about $N/2$, so twice $m$ is our best estimate for $N$. More precisely, for $m$ uniform on $1, \ldots, N$ we have $E[m] = (N+1)/2$, so $E[2m - 1] = N$; this is where the $-1$ comes from.

The Bayesian approach is more complex and requires one to provide an estimate for the maximum possible number of taxicabs. Let's therefore assume that we know that the total number of cabs is no larger than 1000. (The frequentist approach cannot use this information, but to make a fair comparison we will cap the frequentist estimate at 1000 if it is larger.) Then the Bayesian estimate is given by:

$$\hat{N} = \frac{N_\textrm{max} + 1 - m}{\log(N_\textrm{max} / (m - 1))}$$

Putting it all together, we create a data frame with the errors found for both approaches:

```r
frequentist <- pmin(2 * m - 1, N.max) - N
bayesian <- (N.max + 1 - m) / log(N.max / (m - 1)) - N
errors <- rbind(data.frame(approach = "FREQ", errors = frequentist),
                data.frame(approach = "BAYES", errors = bayesian))
```

The mean square error for each approach is then given by:

```r
> by(errors$errors^2, errors$approach, mean)
errors$approach: FREQ
errors$approach: BAYES
[1] 44878.61
```

The Bayesian approach yields close to half the square error of the frequentist approach. The errors can also be plotted:

```r
library(lattice)
histogram(~ errors | approach, errors)
```

Both error distributions are skewed towards negative values, meaning that both approaches tend to underestimate $N$. However, the Bayesian errors have a tighter distribution around 0 than the frequentist ones.
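For reference, the Bayesian formula can be motivated as follows (my sketch, assuming a uniform prior on $N$ between 1 and $N_\textrm{max}$). The likelihood of observing cab number $m$ in a town with $N$ cabs is $1/N$ for $m \le N$, so the posterior is proportional to $1/N$ on $m \le N \le N_\textrm{max}$, and its mean is

$$E[N \mid m] = \frac{\sum_{N=m}^{N_\textrm{max}} N \cdot \frac{1}{N}}{\sum_{N=m}^{N_\textrm{max}} \frac{1}{N}} = \frac{N_\textrm{max} + 1 - m}{\sum_{N=m}^{N_\textrm{max}} 1/N} \approx \frac{N_\textrm{max} + 1 - m}{\log(N_\textrm{max} / (m - 1))}$$

where the harmonic sum in the denominator is approximated by $\log(N_\textrm{max}) - \log(m - 1)$.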