Programming Wingmen
For the past few weeks we’ve been experimenting with a variant of the Pair Programming theme.
Conventional wisdom holds that pairs should be rotated frequently, i.e. several times per day. In our experience, this has been hard to sustain. Instead, we’ve experimented with having two team members bond together and take collective ownership of a feature (or bug, or research item). They stick together until that feature is done. That can last anywhere between a few hours and several days.
The Pair Programming community uses the metaphor of two drivers taking turns in the driver’s seat. Implicit in this metaphor is the idea that you stay with your fellow driver only for part of the whole road; several different programming pairs will end up working on the same feature.
What we are proposing instead is that the same pair sticks together until the feature is complete. If the Pair Programming community uses a land-based metaphor, I’m going to use an aerial one. I’ll liken this to a pair of wingmen flying a mission, sticking together until the mission is done.
I’ve been pairing for almost two weeks now with our test engineer. We write the R scripts that automatically analyze, and generate a report on, this winter’s test data. Speaking for myself, I’m now convinced that this practice carries the same benefits as traditional pair programming, but brings several extra benefits:
First, since we are both responsible for the quality of this work, I feel far more accountable to my fellow team member. If I screw up, it’s our collective work that suffers. Conversely, I have a much stronger incentive for watching out over what he’s doing while he’s flying. We are 100% responsible for the quality of the work; we cannot later claim that some other pair came and introduced errors.
Secondly, it shortens the daily standup, since it’s now the wing (the pair, not each individual) that presents what was done, what will be done, etc. In fact, it’s been the first time in years that we kept our daily standups within the allocated time.
Thirdly, it strongly encourages us to show, not tell, what was done the day before during the standup. This one is a bit tricky to explain; in fact I’m not sure I understand it, but I suspect some psychological rush is experienced from showing off what I, as a wingman, have accomplished with my partner. I see many more charts, pictures, plots, printouts at the standups since we’ve introduced this practice.
Fourthly, it reduces the guilt associated with pairing with someone instead of working on one’s “own” story. I think one of the main reasons why people don’t pair more often is the fear of not completing what they perceive as their “own” tasks. Indeed, perhaps it is time to revise the daily standup mantra. Forcing people to chant “Yesterday I did this, today I’ll do that, etc.” emphasizes their individual accomplishments.
A big issue with traditional Scrum is that by the end of the daily standup, everyone is so energized for the day ahead that they simply can’t be bothered to pair with someone else. In fact, once everyone had spoken their piece, we often found it useful to rediscuss who would be doing what for the day, and with whom. This is obviously a waste of time, and one that the wingmen system eliminates.
How did we get started? By a simple change. So simple, in fact, that it borders on the trivial, and you can try it today. Assuming you use a whiteboard to track your work in progress, figure out a way to indicate that a story is now owned by two people instead of one. We used to have parallel swimlanes across the whiteboard, one lane per team member. All we did was merge them in pairs.
If you have any position of authority in your team, I dare you to try this experiment for just one week, and for just one pair of team members. You’ll thank me for it.
In: Uncategorized
C’est “Test Unité”, m***e!
Most French-speaking professional programmers I’ve worked with will translate “Unit Test” by “Test Unitaire”. This is indeed the term used by the French translation of the Wikipedia article on unit testing.
Forgive my obsessive-compulsive disorder, but I believe the proper French translation of “unit test” should be “test unité”, and not “test unitaire”.
The English “unit” and the French “unitaire” mean two completely different things. “Unit” refers to a small, indivisible part of a system. “Unitaire” is a word that I have never seen used outside of linear algebra. For example a “matrice unitaire” (“unitary matrix”) refers to a complex matrix $U$ whose inverse is its conjugate transpose: $U \times U^* = I$.
I don’t know what a “test unitaire” is, nor do I know what a unitary test is. I do know, however, that a unit test is a test that tests a unit. Therefore, I also know that a “test unité” is a test that tests an “unité”.
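For the curious, the definition can be checked numerically. This is only a small sketch in R; the matrix `U` is an arbitrary example picked for illustration.

```r
# An example 2x2 complex matrix (chosen for illustration only).
U <- matrix(c(1, 1i, 1i, 1), nrow = 2) / sqrt(2)

# Conjugate transpose of U.
U.star <- Conj(t(U))

# For a unitary matrix, U %*% U.star is the identity (up to rounding error).
is.unitary <- isTRUE(all.equal(U %*% U.star, diag(2) + 0i))
print(is.unitary)
```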
In: Uncategorized
When to disable software assertions
I’m currently taking the Software Testing class on Udacity. The instructor spent a couple of videos discussing the pros and cons of leaving software assertions turned on in your production systems. Here is my interpretation:
In: Uncategorized
The only statistical distribution you need to know
If you pick two people at random and measure their heights, what can you say about the average height of the population?
When dealing with small sample sizes (less than, say, 20) it’s generally not possible to assume that the sample variance is a good approximation of the population variance. There simply isn’t enough data, and you cannot use the normal distribution to infer meaningful confidence intervals.
Statistics is like electrical work: it’s very easy to do a job that looks correct but could kill or maim someone. For example we have all, at least once, taken a small sample and used the normal distribution to get confidence intervals. Here is what we would get for the example above, assuming that the measured heights were 184 cm and 178 cm:
> x <- c(184, 178)
> xbar <- mean(x)
> s <- sd(x)
> qnorm(c(0.025, 0.975), xbar, s)
[1] 172.6846 189.3154
The confidence interval we just obtained ($172 \le \mu \le 189$) looks perfectly reasonable, but it’s wrong. What’s worse, there’s nothing obviously wrong with it. If $s$ was the true population standard deviation it might even be correct; but that’s extremely unlikely.
This is why Student’s t-distribution was invented (or discovered?), to deal with small sample sizes. If we draw $N$ random samples from a normally distributed population, compute the sample mean $\bar{x}$ and the sample standard deviation $s$, then the so-called t-value $\frac{\bar{x}-\mu}{s/\sqrt{N}}$ will be distributed according to the t-distribution with $N-1$ degrees of freedom.
With our example above, $N=2$, $\bar{x} = \frac{184+178}{2} = 181$, and $s=\sqrt{(3^2 + 3^2)/(N-1)}=4.24$. If we want a 95% confidence interval for the true mean $\mu$ we use the 2.5% and the 97.5% quantiles of the t-distribution with 1 degree of freedom:
> N <- 2
> x <- c(184, 178)
> xbar <- mean(x)
> s <- sd(x)
> qt(c(0.025, 0.975), N - 1) * s / sqrt(N) + xbar
[1] 142.8814 219.1186
That last result gives you the correct 95% confidence interval for the true mean: $142 \le \mu \le 219$. Compare this with the interval obtained earlier: it is much wider, and therefore the uncertainty on the true mean is much higher than what the normal distribution would have let you believe.
This is, however, pretty good news. As long as your sample size is larger than 1, you can always infer correct (though perhaps useless) confidence intervals for the true mean, provided the population is normally distributed.
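One way to convince yourself is to simulate the experiment many times and count how often each kind of interval actually contains the true mean. The sketch below is only an illustration: the population parameters (180 cm, 20 cm), the seed and the number of repetitions are all made up, and `naive` and `student` are just names I chose for the two intervals.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
mu <- 180; sigma <- 20      # made-up population parameters
N <- 2                      # sample size
reps <- 20000               # number of simulated experiments

covers <- replicate(reps, {
  x <- rnorm(N, mu, sigma)
  xbar <- mean(x)
  s <- sd(x)
  # Interval obtained from the normal distribution, as in the first snippet.
  naive <- qnorm(c(0.025, 0.975), xbar, s)
  # Interval obtained from the t-distribution with N - 1 degrees of freedom.
  student <- xbar + qt(c(0.025, 0.975), N - 1) * s / sqrt(N)
  c(naive = naive[1] <= mu && mu <= naive[2],
    student = student[1] <= mu && mu <= student[2])
})

# Fraction of experiments in which each interval contains the true mean.
coverage <- rowMeans(covers)
print(coverage)
```

The t-based interval should cover the true mean in close to 95% of the runs. For $N=2$ the normal-based one should manage only about 78%, which is the value of `2 * pt(qnorm(0.975) * sqrt(2), 1) - 1`.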
Recently I was discussing statistics in a phone conference. The question was how many experimental test sites were necessary to estimate the CO2 reduction potential of Neurobat’s technology. I pointed out that no matter how many test sites we had, the t-distribution would always let us infer confidence intervals about the true CO2 reduction potential. But the statistics experts on the other side of the line disagreed, arguing that anything less than 10 was not statistically significant.
I tried the analogy above and showed that $N=2$ was enough to derive a confidence interval for the average height of the population.
“No, we don’t think so”, was the answer.
That’s when I decided to shut up and write a blog post.
We’ll run a little computer experiment. Suppose again that the height of the Swiss population is drawn from a normal distribution of mean 180 cm and standard deviation 20 cm (I am totally making this up). We’re going to draw one sample at a time from this distribution and see the evolution of the sample mean and the confidence interval.
The true mean $\mu$ lies with 95% confidence within $\left[\bar{x} + t_{2.5\%, N-1}\frac{s}{\sqrt{N}}, \bar{x} + t_{97.5\%, N-1}\frac{s}{\sqrt{N}}\right]$. Let’s form this confidence interval as we progressively sample elements from the parent population:
N <- 1000
mean.height <- 180
sd.height <- 20
# Draw the whole sample up front, then look at its first i elements.
height.sample <- rnorm(N, mean.height, sd.height)
mean.sample <- numeric(N)
sd.sample <- numeric(N)
for (i in 1:N) {
  mean.sample[i] <- mean(height.sample[1:i])
  sd.sample[i] <- sd(height.sample[1:i])   # NA for i = 1
}
# 95% confidence bounds from the t-distribution with i - 1 degrees of freedom.
low.interval <- sd.sample / sqrt(1:N) * qt(0.025, 1:N - 1) + mean.sample
upper.interval <- sd.sample / sqrt(1:N) * qt(0.975, 1:N - 1) + mean.sample
And here is the result:
Even for small sample sizes, the confidence interval quickly becomes narrower than the parent standard deviation. We can see this by plotting the width of the confidence interval on a log scale against the number of samples:
The confidence interval becomes narrower than the population standard deviation (20 cm) after only 15 samples. However, even for extremely small sample sizes (around 5, say) the width of the confidence interval is roughly 40 cm. The true mean can therefore be estimated within $\pm 20$ cm, or with about 11% error, after only 5 samples. And that is thanks to the t-distribution.
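The figures are not reproduced here, but something like the following sketch should regenerate comparable ones. It is not the code behind the original plots: the seed, axis limits and line styles are my own choices, and since the sample is random the exact crossover point will differ from run to run.

```r
set.seed(42)                       # arbitrary seed so the run is repeatable
N <- 1000
sd.height <- 20
height.sample <- rnorm(N, 180, sd.height)

n <- 2:N                           # the interval is undefined for a single sample
mean.sample <- cumsum(height.sample)[n] / n
sd.sample <- sapply(n, function(i) sd(height.sample[1:i]))
half.width <- qt(0.975, n - 1) * sd.sample / sqrt(n)

# Sample mean (solid) with its 95% confidence band (dashed).
plot(n, mean.sample, type = "l", log = "x", ylim = c(100, 260),
     xlab = "Number of samples", ylab = "Height (cm)")
lines(n, mean.sample - half.width, lty = 2)
lines(n, mean.sample + half.width, lty = 2)
abline(h = 180, col = "grey")      # true mean

# Width of the interval on a log scale, with the population sd for reference.
plot(n, 2 * half.width, type = "l", log = "xy",
     xlab = "Number of samples", ylab = "Width of 95% CI (cm)")
abline(h = sd.height, lty = 3)
```

The first sample size at which the interval is narrower than the population standard deviation can then be read off with `n[which(2 * half.width < sd.height)[1]]`.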
In: Uncategorized
Review: RESTful Web Services
RESTful Web Services by Leonard Richardson
My rating: 5 of 5 stars
I began reading “Restful web services” while researching technical solutions for Neurobat Online, the web service version of our intelligent heating controller. Prior to this, most (all?) web service projects I had been involved in were based on SOAP.
REST is a heavily overloaded term in our industry, and can mean different things to different people. The author avoids that controversy by coining the term “Resource-Oriented Architecture”, and shows different examples of web services that can be built using this approach: a social bookmarking service inspired by Delicious, and a mapping service. Both examples are RESTful but the author does an excellent job at showing why being RESTful is not enough. To fully leverage the existing web architecture (including the full HTTP protocol) you need, he argues, to do more than merely being RESTful, and he shows how.
The author never says so explicitly, but after reading this book I found that RESTful web services have at least two significant advantages over what the author coyly calls “Big Web Service” (aka SOAP):

Testability: do not underestimate the advantage of exposing a service that your testing team can test with cURL instead of having to set up a tool such as SoapUI.

Discoverability: everything is a resource, and all resources will respond to a limited number of HTTP verbs. You don’t have to worry whether adding a bookmark is done through addBookmark() or appendBookmark(); if your service is RESTful, you know that you need to send a POST to some URI.
The biggest takeaway from this book for me was to realise that it’s possible to design an application where “everything is a resource”, and that all resources respond to the same set of methods. Think of it for a moment. Will this not change the way you design non-web services too? Are not all your objects resources? Imagine, for a minute, if all your classes were restricted to expose not more than 5 public methods, and if these methods had the same names. It may sound crazy, but it’s quite possible that you’d end up with a cleaner design built out of many small classes with small interfaces. Is this not an easy way to clean code?
What makes this book great and not merely good is that the author doesn’t simply explain what RESTful web services are about. He is also clearly opinionated about it; however he is never patronising, never condescending. He takes very occasional jabs at systems he calls “Big Web Services” but never belittles them. His message comes across as entirely believable and convincing. By example after example he shows how popular services can be designed as collections of resources, and it is up to the intelligence of the reader to judge which kind of design is better.
Pair-programming girls did just as well as boys
For the past three years I’ve taught a freshman-level programming course at the Swiss Federal Institute of Technology in Lausanne. Students are asked to form groups of 2 and to work on a semester project, consisting of the development of a simple library of numeric routines (e.g. square root function, integrals, etc). I then submit their code to a suite of unit tests (including the Valgrind memory checker) and assign them a grade linearly proportional to the number of unit tests that pass. The same grade is assigned to both members of the pair.
Most students will pair with a fellow student of the same sex. In the spring 2014 session, 43 pairs out of 52 were of the same sex. This year’s class was large enough to consider carrying out statistically significant studies on the students’ grades. More specifically, I wanted to examine whether pairs of girls obtained significantly different results from pairs of boys.
Here I show the boxplots of the grades assigned to the 52 pairs, depending on whether it was two females, mixed sex, or two males. The median grade for females is 5.5 out of 6, while the median grade for males is 5 out of 6.
The Welch two-sample t-test (used to determine whether two samples are drawn from populations with the same mean) yields a p-value of 0.32. The 95% confidence interval for the difference in means between all-female and all-male pairs runs from -0.27 to 0.80. In other words, there is no statistically significant difference between the grades obtained by two-female pairs of students and two-male ones.
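For reference, here is roughly how such a test is run in R. The grades below are randomly generated placeholders, not the actual class data, and the group sizes (12 and 31) are invented for the example; only the call to `t.test()` is the point.

```r
set.seed(2014)   # arbitrary seed
# Hypothetical grades on a 1-6 scale in half-point steps; NOT the real data.
grades.ff <- pmin(6, pmax(1, round(rnorm(12, mean = 5.3, sd = 0.8) * 2) / 2))
grades.mm <- pmin(6, pmax(1, round(rnorm(31, mean = 5.0, sd = 0.9) * 2) / 2))

# t.test() performs Welch's test by default (var.equal = FALSE).
result <- t.test(grades.ff, grades.mm)
print(result$p.value)     # p-value for the null hypothesis of equal means
print(result$conf.int)    # 95% confidence interval for the difference in means
```

The two outputs tell the same story: the 95% confidence interval contains zero exactly when the p-value is above 0.05.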
And what about the pairs of mixed sex? The boxplot suggests that their results are lower, and I can think of a hypothesis to explain that. But with a sample size of only 9 it is hard to draw any conclusion.
In: Uncategorized
Statistically significant energy savings: how many buildings are enough?
From Neurobat’s website it is now possible to download a brochure with the 2012-2013 test results. It summarises the findings we published at the CISBAT 2013 conference, describing the energy savings that we have achieved on four experimental test sites. Of these four, one is an administrative office. Another included the (domestic) hot water in its energy metering. Therefore, only two of them are single-family houses whose energy savings concern the space heating alone. The energy savings we measured on these two sites are 23% and 35%.
It is natural to ask oneself what the average energy savings on a typical house might be. It’s possible to give an estimate, provided that we make several assumptions. We’re going to assume that the energy savings that Neurobat can achieve on a single-family house in Switzerland is a random variable drawn from a normal distribution. Therefore, our best estimates for the mean and standard deviation of that parent distribution are:
$$\mu = \frac{23 + 35}{2} = 29$$
and
$$s = \sqrt{\frac{(23-29)^2 + (35-29)^2}{1}} = 8.49$$
The sample mean estimated from $n$ samples of a normal parent distribution, once centred on the true mean and scaled by $s/\sqrt{n}$, follows a t-distribution with $n-1$ degrees of freedom. The 95% confidence interval for the true (parent) mean can therefore be found by looking up the 0.975 and 0.025 quantiles of the t-distribution with $n-1$ degrees of freedom. In our case, $n = 2$ and the 95% confidence interval of the true mean is therefore:
$$\left[\mu - t_{n-1, 0.975} \frac{s}{\sqrt{n}}, \mu + t_{n-1, 0.975} \frac{s}{\sqrt{n}}\right] = \left[ -47, 105 \right],$$
where $\mu = 29$, $s$ is the sample standard deviation calculated above and $n = 2$. Not the most helpful estimate ever.
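The same numbers can be reproduced in a few lines of R. This is just a sketch of the calculation described above; the variable names are mine.

```r
savings <- c(23, 35)            # measured savings (%) on the two houses
n <- length(savings)
xbar <- mean(savings)           # 29
s <- sd(savings)                # about 8.49

# 95% confidence interval from the t-distribution with n - 1 degrees of freedom.
ci <- xbar + qt(c(0.025, 0.975), n - 1) * s / sqrt(n)
print(round(ci))                # -47 105
```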
We are currently repeating the experiment for this 2013-2014 heating season. A natural question that has come up is “How many buildings do we need to have a usable confidence interval for the average energy savings?”
As always, reformulating the question in precise terms is half of the battle. We want a narrow confidence interval around the mean energy savings. We can make it as narrow as we want by increasing n, or by relaxing our confidence requirement. Suppose then that we want a 95% confidence interval not wider than 10%; i.e., we want to state that Neurobat achieves $X\pm5\%$ energy savings with 95% confidence.
By a bit of arithmetic, what we are looking for is the number $n$ such that
$$n \ge 4t^2_{n-1, 0.975}\frac{s^2}{w^2},$$
where $s$ is our sample standard deviation and $w$ is the desired width of our confidence interval. There is no closed-form solution for this equation (except for large $n$, where the t-distribution can be approximated with a normal distribution), so finding $n$ is an iterative process. In R, the right-hand side of this formula can be computed with:
[code] 4 * qt(.975, n - 1)^2 * s^2 / w^2 [/code]
Evaluate this expression with increasing values of n until it becomes smaller than n.
Assuming that $s = 8.49$ as above, we obtain:
$$n = 14.$$
And that’s it. Again, assuming that the energy savings are drawn from a normal distribution whose parent standard deviation is about 8.49, we will need experimental results on 14 buildings to give a 95% confidence interval not larger than 10% on the expected energy savings. For $n = 10$, the width of the confidence interval increases to 12% and for $n = 5$, it is 21%.
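The iteration is easy to script. The sketch below assumes, as above, $s = 8.49$ and a target width of 10 percentage points; `required` and `ci.width` are helper names I made up for this example.

```r
s <- 8.49    # sample standard deviation from the two houses (%)
w <- 10      # desired full width of the 95% confidence interval (%)

# Right-hand side of the inequality n >= 4 t^2 s^2 / w^2.
required <- function(n) 4 * qt(0.975, n - 1)^2 * s^2 / w^2

# Increase n until it is at least as large as the right-hand side.
n <- 2
while (n < required(n)) n <- n + 1
print(n)                                   # 14

# Width of the 95% confidence interval for a given number of buildings.
ci.width <- function(n) 2 * qt(0.975, n - 1) * s / sqrt(n)
print(round(ci.width(c(14, 10, 5)), 1))    # 9.8 12.1 21.1
```

Changing `s` or `w` shows how sensitive the answer is to the assumed spread: the required number of buildings grows with the square of `s / w`.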
The next time you hear someone claim suspiciously precise energy savings with their miracle device, you have my permission to ask them what their confidence interval is, how they calculated it, and what underlying assumptions they are making.
In: Uncategorized
Welcome back, Climate Charts & Graphs
I was happy to learn a few days ago that the Climate Charts & Graphs blog is being reactivated by its author. I used to subscribe to it back in the Google Reader days. In the current climate change conversation we need more blogs like CCG, where arguments can be conclusively settled with (preferably graphical) evidence.
So welcome back, and there’s no reason to apologise!
In: Uncategorized
Review: The Thoughtworks Anthology: Essays on Software Technology and Innovation
The Thoughtworks Anthology: Essays on Software Technology and Innovation by ThoughtWorks Inc.
My rating: 3 of 5 stars
I read this first volume after reading its successor. Compared with the latter, I found the first volume to be slightly disappointing.
Like its successor, it’s a series of essays from Thoughtworks employees, including Martin Fowler. Whereas the second volume had some detailed, practical advice, I found this one to be much more vague and generic. It sounds almost as if it was written during the early years of the agile movement (which it maybe was), giving advice and recommendations that seem common sense today.
Martin Fowler’s article on Domain Specific Languages, although interesting, is of limited value now that his book on the subject has been published. Rebecca Parsons’s article on programming languages sounds like a yet-another-look-at-how-many-languages-I-know kind of article. Neal Ford’s article on Polyglot Programming recommends we build solutions with more than one language; well, people have been calling Fortran routines from C, or testing Java code with Scala, for several years now.
The only exception I want to make is Jeff Bay’s Object Calisthenics article. He proposes 9 rules to deliberately follow during your next project, and claims that following those rules will yield a superior design. This is one article I definitely want to apply, and which has practical value. Some of the rules sound extreme, such as “Don’t use any classes with more than two instance variables.” But it’s definitely worth a look.
In: Uncategorized
Review: Linkers and Loaders
Linkers and Loaders by John R. Levine
My rating: 4 of 5 stars
You may have written hundreds, maybe thousands of programs, but if you are like most programmers then everything that happens after the compilation is kind of mysterious. Why does the compiler have to create object files? What are they? What is this so-called linker that combines those files into a library, or an executable? What’s its purpose? John Levine’s book answers those questions, and more.
Item 53 in 97 Things Every Programmer Should Know: Collective Wisdom from the Experts is “The Linker Is not a Magical Program”, and this book goes a long way towards taking that magic away. It carefully explains step by step what happens from the moment the code is compiled until it actually runs on the machine; and what’s more important, it makes it very clear why things are as they are today.
I was recommended this book in a reply to a Stack Overflow question, and I am not disappointed. The book occasionally goes a little too far into technical details, which I felt could be safely skipped. Perhaps a case study, i.e. going through every single step towards running a complete program, would have been useful, instead of exposing how different systems solve the different steps one by one.
Until I read this book I simply did not understand how a program actually ran on my computer. A few details are still a bit fuzzy, but now I feel much better equipped for dealing with obscure linker errors or custom linker scripts. Highly recommended for any programmer who wants to get to the bottom of things.
In: Book reviews