# Linus Torvalds thinks like a chess grandmaster

I’ve uncovered evidence that Linus Torvalds, creator of Linux, may entertain a secret hobby.

An interview with Linus Torvalds in a recent issue of IEEE Spectrum had the following passage:

I’d rather make a decision that turns out to be wrong later than waffle about possible alternatives for too long.

On the surface, this sounds like your usual admonition against analysis paralysis (Wikipedia). But what Linus said echoes something that Alexander Kotov (Wikipedia), former chess grandmaster, wrote in 1971 in his Thinking like a Grandmaster (Amazon):

Better to suffer the consequences of an oversight than suffer from foolish and panicky disorder in analysis.

If I didn’t know better I would conclude that the same person wrote these two passages.

In: Uncategorized

# Where all floating-point values are above average

When you just fix a programming bug quickly, you lose. You waste a precious opportunity to think and reflect on what led to this error, and to improve as a craftsman.

Some time ago, I discovered a bug. The firmware was crashing, seemingly at random. It was eventually resolved, the fix reviewed and tested, and temptation was high to just leave it at that and get on with what was next on the backlog.

This is probably how most programmers work. But it’s probably wrong. Here’s Douglas Crockford on the topic, interviewed by Scott Hanselman:

There’s a lot of Groundhog’s Day in the way that we work. One thing you can do is, every time you make a mistake, write it down. Keep a bug journal.

I wanted to give it a try. So what follows is my best recollection of how I solved the bug.

First, the observations. You cannot be a successful debugger if you are not a successful observer. My firmware wasn’t quite crashing at random. It would crash and reboot 18 times in very quick succession (less than a few minutes) following a firmware update. Once this tantrum was over it would behave normally again.

It was a new firmware version. The same firmware had been deployed on other devices, but without the same problem. So why should it happen on some devices but not all of them?

I’ve said it before, but if you don’t observe carefully and keep notes, you’re just not a good debugger. I’ve found the following heuristics useful when debugging:

1. What changed between this release and the previous one?
2. What is different between this environment and another where the failure doesn’t occur?
3. Carefully go through whatever logfiles you may have. Document anything you notice.
4. How often does the failure happen? Any discernible pattern?

In this case, the software changes introduced by this release were relatively minor and I judged it unlikely that those changes were the cause of the problem. If they were, I would expect to see the same problem on all devices.

Now when I say that something is “unlikely”, I mean of course that there must be something else that is more likely to be the real explanation. Nothing is ever unlikely by itself, and if you can remove feelings from your day-to-day work you’ll be a better engineer. But more on this in another post.

I next examined the logfiles, and noticed that the first recorded crash was not a crash. It was the normal system reboot when a new firmware was installed. The second crash was not a crash either. It was a factory reset of the system, performed by the person who updated the system to the new firmware. It’s an operation that can only be done manually, and the only crashing device was the one that had been factory-reset right after the firmware update.

So someone had logged into that device and factory-reset the system. Going through the /var/log/auth logfiles I could determine who had done it. When confronted, he confirmed that he had reset the system in order to try an improved version of our heating schedule detection algorithm.

Now there’s nothing wrong with that; but it’s well-known that bugs are more likely in the largest, most recently changed modules. The module doing heating schedule detection was relatively large, complex, and recently changed.

Now experience had shown that only two events could cause the firmware to crash and reboot:

• a watchdog reset;
• a failed assertion.

(A watchdog is a countdown timer that will reboot the system after a given timeout, typically on the order of a second. You’re supposed to reset the timer explicitly at regular intervals throughout your application. It’s meant to prevent the system from getting stuck in an infinite loop.)
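To make the mechanism concrete, here is a minimal software model of a watchdog in C++. It is only a sketch: the `Watchdog` class and its `kick` and `expired` methods are hypothetical names, not the API of our firmware or of any real hardware, and the current time is passed in explicitly so that the behaviour is easy to test.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical software model of a watchdog timer (not a real hardware API).
// The caller supplies the current time, which keeps the model deterministic.
class Watchdog {
public:
    using Clock = std::chrono::steady_clock;
    using Duration = std::chrono::milliseconds;

    Watchdog(Duration timeout, Clock::time_point now)
        : timeout_(timeout), deadline_(now + timeout) {
        assert(timeout > Duration::zero());
    }

    // "Kick" the watchdog: push the deadline back by one full timeout.
    void kick(Clock::time_point now) { deadline_ = now + timeout_; }

    // True once the deadline has passed without a kick. Real hardware
    // would reboot the system at this point.
    bool expired(Clock::time_point now) const { return now >= deadline_; }

private:
    Duration timeout_;
    Clock::time_point deadline_;
};
```

An application stuck in an infinite loop never reaches its next `kick`, so `expired` eventually becomes true; on a real device that is the moment the hardware forces a reboot.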

At this point I went through the implementation of that algorithm very carefully, keeping an eye on anything that could be an infinite loop or a failed assertion. When I was done, I was fairly confident (i.e. could almost prove) that it would always terminate. But I also came across a section of code whose gist was the following:

float child[24]; // assume child[] is filled here with some floating-point values
float sum = 0;
float avg;
for (int i = 0; i < 24; i++)
    sum += child[i];
avg = sum / 24; // compute the average of the elements of child[]

int n_above_avg = 0; // count how many elements are greater than the average
int n_below_avg = 0; // count how many elements are less than or equal to the average
for (int i = 0; i < 24; i++)
    if (child[i] <= avg)
        n_below_avg++;
    else
        n_above_avg++;
assert(n_below_avg > 0); // at least one element must be less than or equal to the average


That was the only place where an assertion was called. Could this assertion ever fail? This code calculates the average of a set of floating-point values, and counts how many elements are less than or equal to the average (n_below_avg), and how many are greater (n_above_avg). Elementary mathematics tells you that at least one element must be less than or equal to the average.

But we’re dealing with floating-point variables here, where common-sense mathematics doesn’t always hold. Could it be that all the values were greater than their average? I asked that question on Stackoverflow. Several answers came quickly back: it is indeed perfectly possible for a set of floating-point numbers to all be above their average.

In fact, it’s easy to find such a set of numbers if they are all the same. One respondent gave a list of floating-point values that, when averaged, turned out to all be greater than their average. For example:

#include <iostream>
#include <cassert>

using namespace std;

int main() {
    const int sz = 24;  // constant, so that values[] is a standard fixed-size array
    int i;
    double values[sz];
    for (i = 0; i < sz; i++) values[i] = 0.108809;
    double avg;
    for (avg = 0, i = 0; i < sz; i++) avg += values[i];
    avg /= sz;
    assert (values[0] > avg);  // holds when every value exceeds the computed average
    return 0;
}


Once the root cause of the problem was identified, it was relatively easy to write a failing unit-test and implement a solution.
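For illustration, here is one way such a fix could look. This is a sketch and not necessarily what went into our firmware: the helpers `clamped_mean` and `count_at_or_below` are hypothetical names. The idea is to clamp the computed mean into the range spanned by the data, so that rounding in the sum or the division can never leave every element above the "average".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of `values`, clamped into [min, max] of the data so that rounding
// errors cannot push it below the smallest or above the largest element.
// Hypothetical helper; assumes the input is non-empty and contains no NaN.
float clamped_mean(const std::vector<float>& values) {
    assert(!values.empty());
    float sum = 0.0f;
    for (float v : values) sum += v;
    const float mean = sum / static_cast<float>(values.size());
    const float lo = *std::min_element(values.begin(), values.end());
    const float hi = *std::max_element(values.begin(), values.end());
    return std::min(std::max(mean, lo), hi);
}

// Number of elements less than or equal to `threshold`.
std::size_t count_at_or_below(const std::vector<float>& values, float threshold) {
    return static_cast<std::size_t>(
        std::count_if(values.begin(), values.end(),
                      [threshold](float v) { return v <= threshold; }));
}
```

With the clamped mean, the smallest element is always less than or equal to the result, so an assertion like `count_at_or_below(values, clamped_mean(values)) > 0` holds whatever the rounding does.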

Well, that’s the news from the world of programming where all the floating-point values can be above average. Who said Lake Wobegon was pure fiction?

In: Uncategorized

# The one question not to ask at the standup meeting

What is the very first question one is supposed to answer during a standup meeting? If your answer is:

What did you do since the last standup?

then congratulations. You have given the canonical answer recommended by Mike Cohn himself. But I am now convinced that this is the wrong question to ask.

When you ask someone What did you do?, you are inviting an answer along the lines of:

I worked on X.

The problem with this answer is that depending on X, you really don’t know what the team member has achieved. Consider the following possibilities, all perfectly reasonable answers to the question:

I worked on the ABC-123 issue and it is going well.

I worked on some unit tests for this story.

I worked with [team member Y].

You simply cannot tell if any progress is being made. Sure, you can ask clarifying questions, but this will prolong the standup. Instead, I wish to suggest a slightly different version of that first question:

What did you get done since the last standup?

Here the emphasis is on what work was completed, not on what has been “worked” on. The deliverable becomes the object of the conversation, not the activity. The answers above don’t answer the question anymore, and this is what you might instead hear:

I tested and rejected 3 hypotheses for the cause of the ABC-123 issue, but I can think of at least 2 more.

I wrote a custom function for testing object equality and converted some unit tests to use it.

I paired with [team member Y] and we […]

Ambiguity and vagueness during the standups have regularly been an issue for our own team, and I am sure we are not the only ones. If you have fallen into the habit of asking the first version of this question, consider trying the second version and let me know (in the comments below) how that works out for you.

In: Uncategorized

# The opinionated estimator

You have been lied to. By me.

I once taught a programming class and introduced my students to the notion of an unbiased estimator of the variance of a population. The problem can be stated as follows: given a set of observations $(x_1, x_2, \ldots, x_n)$, what can you say about the variance of the population from which this sample is drawn?

Classical textbooks, MathWorld, the Khan Academy, and Wikipedia all give you the formula for the so-called unbiased estimator of the population variance:

$$\hat{s}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean. The expected error of this estimator is zero:

$$E[\hat{s}^2 - \sigma^2] = 0$$

where $\sigma^2$ is the “true” population variance. Put another way, the expected value of this estimator is exactly the population variance:

$$E[\hat{s}^2] = \sigma^2$$

So far so good. The expected error is zero, therefore it’s the best estimator, right? This is what orthodox statistics (and teachers like me who don’t know better) will have you believe.

But Jaynes (Probability Theory) points out that in practical problems one does not care about the expected error of the estimated variances (or of any estimator for that matter). What matters is how accurate this estimator is, i.e. how close it is to the true variance. And this calls for an estimator that will minimise the expected squared error $E[(\hat{s}^2 - \sigma^2)^2]$. But we can also write this expected squared error as:

$$E[(\hat{s}^2 - \sigma^2)^2] = (E[\hat{s}^2] - \sigma^2)^2 + \mathrm{Var}(\hat{s}^2)$$

The expected squared error of our estimator is thus the sum of two terms: the square of the expected error, and the variance of the estimator. When following the cookbooks of orthodox statistics, only the first term is minimised and there is no guarantee that the total error is minimised.

For samples drawn from a Gaussian distribution, Jaynes shows that an estimator that minimises the total (squared) error is

$$\hat{s}^2 = \frac{1}{n+1}\sum_{i=1}^n (x_i - \bar{x})^2$$

Notice that the $n-1$ denominator has been replaced with $n+1$. In a fit of fancifulness I’ll call this an opinionated estimator. Let’s test how well this estimator performs.

First we generate 1000 random sets of 10 samples with mean 0 and variance 25:

samples <- matrix(rnorm(10000, sd = 5), ncol = 10)


For each group of 10 samples, we estimate the population variance first with the canonical $n-1$ denominator. This is what R’s built-in var function will do, according to its documentation:

unbiased <- apply(samples, MARGIN = 1, var)


Next we estimate the population variance with the $n+1$ denominator. We take a little shortcut here by multiplying the unbiased estimator by $(n-1)/(n+1)$, but it makes no difference:

opinionated <- apply(samples, MARGIN = 1, function(x) var(x) * (length(x) - 1) / (length(x) + 1))


Finally we combine everything in one convenient data frame:

estimators <- rbind(data.frame(estimator = "Unbiased", estimate = unbiased),
                    data.frame(estimator = "Opinionated", estimate = opinionated))

library(lattice)  # provides histogram() and the panel.* functions
histogram(~ estimate | estimator, estimators,
          panel = function(...) {
            panel.histogram(...)
            panel.abline(v = 25, col = "red", lwd = 5)
          })


It’s a bit hard to tell visually which one is “better”. But let’s compute the average squared error for each estimator:

aggregate(estimate ~ estimator, estimators, function(x) mean((x - 25)^2))

##     estimator estimate
## 1    Unbiased 145.1007
## 2 Opinionated 115.5074


This shows clearly that the $n+1$ denominator yields a smaller total (squared) error than the so-called unbiased $n-1$ estimator, at least for a sample drawn from a Gaussian distribution.
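As a cross-check on the simulated figures, the expected squared errors can also be worked out by hand. The sketch below assumes Gaussian samples and relies on the standard result that $S/\sigma^2$ follows a chi-squared distribution with $n-1$ degrees of freedom, where $S = \sum_{i=1}^n (x_i - \bar{x})^2$. The symbols $S$ and $c$ are introduced here for illustration only. From the chi-squared distribution:

$$E[S] = (n-1)\sigma^2, \qquad \mathrm{Var}(S) = 2(n-1)\sigma^4$$

For an estimator of the form $\hat{s}^2 = cS$, the decomposition of the expected squared error into squared bias plus variance gives:

$$E[(cS - \sigma^2)^2] = \left((n-1)c - 1\right)^2\sigma^4 + 2(n-1)c^2\sigma^4$$

Setting the derivative with respect to $c$ to zero yields $c = 1/(n+1)$, and the two choices of $c$ give:

$$c = \frac{1}{n-1}: \; E[(\hat{s}^2 - \sigma^2)^2] = \frac{2\sigma^4}{n-1}, \qquad c = \frac{1}{n+1}: \; E[(\hat{s}^2 - \sigma^2)^2] = \frac{2\sigma^4}{n+1}$$

With $n = 10$ and $\sigma^2 = 25$ these evaluate to $1250/9 \approx 138.9$ and $1250/11 \approx 113.6$, reasonably close to the simulated values above given that they come from only 1000 random sets.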

So do your brain a favour and question everything I tell you. Including this post.

In: R

# The SIM4Blocks project kick-off meeting

The SIM4Blocks European project held its kick-off meeting on 5 and 6 April 2016 in Stuttgart, Germany. A consortium of 17 European partners (including 3 Swiss organisations) has answered the H2020-EE-2015-2-RIA call for proposals for:

… real time optimisation of energy demand, storage and supply (including self-production when applicable) using intelligent energy management systems with the objective of reducing the difference between peak power demand and minimum night time demand, thus reducing costs and greenhouse gas emissions. …

The official, “long” title of the project is Simulation Supported Real Time Energy Management in Building Blocks. Led by the Hochschule für Technik Stuttgart, the 5.5 MEUR project will:

… develop innovative demand response (DR) services for smaller residential and commercial customers, implement and test these services in three pilot sites and transfer successful DR models to customers of project partners in further European countries. …

The kick-off meeting was held in the traditional manner: after an introduction by the coordinator, and a short (remote) intervention by our project officer, each work package leader presented their work packages, going through the tasks that were defined, clarifying questions and making sure we had a common understanding of what was to be done.

Neurobat’s part will consist in offering our online heating optimisation server in order to help the manager of a group of buildings optimise the timing of the operation of their heat pumps. The goal is to avoid excessive peak power, for example when all pumps run at the same time.

On the second day we went north to the Wüstenrot village, where a cluster of single-family houses (and a couple of commercial buildings) draw their heating power from what some call an energy ring, a cold water circuit from which the heat pumps draw the heat for each building. This will be one of three pilot sites, the other two being in Spain and in Switzerland.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 695965. It will last four years and we are honored to have been invited to join it. We look forward to a successful collaboration with the other members of the consortium.

In: Uncategorized

# Biblical kings and boxplots

When you read through the biblical books of Kings, you may have been struck by a phrase that repeats itself for every monarch:

In the Xth year of (king of kingdom B), (name of king) became king of (kingdom A). He reigned N years, and did (evil|good) in the sight of the Lord.

If you’ve read through these books several times, you will probably have noticed that the shorter reigns tend to belong to kings deemed to have done evil, with a record-breaking 3 months for Jehoahaz and Jehoiachin. Let’s see if there’s any relationship between reign duration and “goodness” of the king. First we prepare the data in a form suitable for analysis:

kings.data <- c("Name Deeds Reign
David Good 40
Solomon Good 40
Rehoboam Evil 17
Abijah Evil 3
Asa Good 41
Jehoshaphat Good 25
Jehoram Evil 8
Ahaziah Evil 1
Joash Good 40
Amaziah Good 29
Azariah Good 52
Jotham Good 16
Ahaz Evil 16
Hezekiah Good 29
Manasseh Evil 55
Amon Evil 2
Josiah Good 31
Jehoahaz Evil 0.25
Jehoiakim Evil 11
Jehoiachin Evil 0.25
Zedekiah Evil 11")

kings <- read.table(text = kings.data, header = TRUE,
                    stringsAsFactors = TRUE, row.names = 1)


The kings data frame holds one row for each king. The row names are the names of the kings; the column Deeds is a two-level factor telling if their reign was deemed good or evil. (We assume here that Solomon was a good guy in spite of what happened towards the end of his life.) The Reign column records the length of the reign as given in the Bible in years, with fractional values for reigns shorter than one year.

We have 11 evil kings and 10 good kings:

> table(kings$Deeds)

Evil Good
11   10


Here we compute the median reign duration depending on the rating of the deeds:

> with(kings, tapply(Reign, Deeds, median))
Evil Good
8.0 35.5


There’s already a good indication that the length of the reign depends on the deeds. We can now plot the length of the reigns:

library(car)
Boxplot(Reign ~ Deeds, kings)


This plot confirms our impression: “evil” kings tend to have shorter reigns than “good” kings, with the obvious exception of Manasseh, the same one of whom it was said

I’ve arranged for four kinds of punishment: death in battle, the
corpses dropped off by killer dogs, the rest picked clean by vultures,
the bones gnawed by hyenas. They’ll be a sight to see, a sight to
shock the whole world—and all because of Manasseh son of Hezekiah and
all he did in Jerusalem. (Jer 15:3-4)

So what does this all prove? Probably nothing. Plotting data is its own reward. :-)

In: R, Uncategorized

# C++: when delete doesn’t delete

We once spent almost a week chasing after a mysterious memory leak in our application, built on top of the highly regarded eCos real-time operating system. The leak appeared after we had rewritten some of our code in C++ after recognising that C’s object-oriented capabilities were no longer adequate for our needs.

After about half a day we could reproduce the memory leak on the target system with code that essentially looked like this:

Controller* controller = new Controller();
delete controller;


What baffled us most was that running this code in unit tests on the development machines exposed no such memory leak. We routinely run all our unit tests under Valgrind to identify memory usage errors, but in this case there was none. It was very unlikely that the leak was caused by defective code.

What’s more, the leak was almost-but-not-quite consistent. We leaked about 952 bytes on average, but that figure could be as low as 920 or as high as 968. It was always a multiple of 8 bytes. After about 8.5 hours, the system would reboot, presumably because it ran out of memory. We used the mallinfo() function to display the amount of available memory.

After almost a week we found the answer. According to the documentation, the default implementation of the delete operator in eCos is a no-op! I suppose the rationale is that most developers of embedded systems tend to shy away from dynamic memory allocation, and that it is better to reduce the size of the firmware by not providing a (rarely-needed) delete operator.

Except when we need one, of course.

To enable a proper delete operator you simply disable the CYGFUN_INFRA_EMPTY_DELETE_FUNCTIONS option in your eCos configuration file.
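The effect of an empty delete is easy to reproduce on a development machine. The following sketch is plain standard C++, not eCos code: the two structs, the `g_live_bytes` counter and the `leaked_bytes` helper are hypothetical names made up for the illustration. One class has a class-level `operator delete` that really frees memory, the other has one that does nothing, mimicking the default eCos behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Bytes currently allocated through the class-level operators below.
static std::size_t g_live_bytes = 0;

// Shared allocation routine: malloc plus bookkeeping.
static void* tracked_alloc(std::size_t n) {
    void* p = std::malloc(n);
    if (!p) throw std::bad_alloc();
    g_live_bytes += n;
    return p;
}

// Hypothetical class whose operator delete really releases the memory.
struct WithRealDelete {
    char payload[64];
    static void* operator new(std::size_t n) { return tracked_alloc(n); }
    static void operator delete(void* p, std::size_t n) noexcept {
        if (!p) return;
        g_live_bytes -= n;
        std::free(p);
    }
};

// Same layout, but operator delete is a no-op: nothing is ever freed.
struct WithEmptyDelete {
    char payload[64];
    static void* operator new(std::size_t n) { return tracked_alloc(n); }
    static void operator delete(void*, std::size_t) noexcept {}
};

// Allocate and delete `cycles` objects of type T and return the number
// of bytes that are still allocated afterwards.
template <typename T>
std::size_t leaked_bytes(int cycles) {
    assert(cycles >= 0);
    g_live_bytes = 0;
    for (int i = 0; i < cycles; ++i) {
        T* obj = new T();
        delete obj;
    }
    return g_live_bytes;
}
```

Calling `leaked_bytes<WithRealDelete>(10)` leaves nothing allocated, whereas `leaked_bytes<WithEmptyDelete>(10)` leaves ten objects' worth of memory behind even though every `new` was paired with a `delete`, which is exactly the kind of leak that no amount of code review will find.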

In: Uncategorized

# Going for one-week sprints: a good wrong idea

A few weeks ago, our team held a sprint retrospective (which I unfortunately couldn’t attend) during which it was decided to shorten our sprint length from two weeks to one. The team was right in their decision, but probably for the wrong reasons and here’s why I think so.

The main driver behind this decision was Neurobat’s involvement with the Aargau Heizt Schlau project: a canton-wide project to measure the efficiency of our system on 50-100 individual houses in the Aargau canton during the 2015-2016 winter. The goal is to have an independent assessment of the energy-savings potential of our product, a replication of our own peer-reviewed investigation. The project is mostly driven from a team in Brugg, with ample support from our R&D team in Meyrin.

Keeping the project on track and on schedule turned out to be extremely challenging. Very soon, urgent support requests began to come at unpredictable times, and we were having trouble keeping our sprint commitments.

The realisation that urgent, random support requests were going to be the norm for this heating season is the main reason why the team decided to experiment with one-week sprints. They have a 5-year long history of sprint retrospectives and I’m convinced they collectively understand the principles underlying the practice of timeboxed iterations. (A practice always proceeds from a principle, and can be modified only when the principle is fully understood.) A less mature team should not have made this decision and should stick with 2-week sprints; but I believe our team was mature enough to carry out this experiment.

Many teams, when they begin with Scrum, will object to the “overhead” introduced by daily standups, sprintly retrospectives and planning meetings. And since they are likely to miss their sprint commitments during the first few months, they are very likely to ask for longer sprints. Resist this temptation.

Mike Cohn tells the story of a team facing exactly this problem: good quality work but systematic overcommitments. He agreed to let the team change the sprint duration, but went against the team’s request for longer sprints. Instead, they went for shorter ones. His rationale against longer sprints is simple:

The team was already pulling too much work into a four-week sprint.
They were, in fact, probably pulling six weeks of work into each
four-week sprint. But, if they had gone to a six-week sprint, they
probably would have pulled eight or nine weeks of work into those!

So if shorter sprints are generally to be preferred over longer ones, why do I think the decision was a solution to the wrong problem? Because I believe that switching to shorter sprints will only perpetuate the root cause of the situation we are in. We decided to go for shorter sprints because urgent support requests were coming in more frequently than ever. How will switching to shorter sprints solve that problem?

I’m reminded of this quote, which I believe came from Mike Cohn’s Succeeding with Agile:

Few organizations are in industries that change so rapidly that they cannot set priorities at the start of a two-week sprint and then leave them alone. Many organizations may think they exist in that environment; they don’t.

If your organisation has trouble planning for more than a week ahead, then do your development team a favour and try, at all costs, to address the underlying problem. Your team members should not be the ones whose productivity suffers for the lack of foresight elsewhere in the organisation.

In: Uncategorized

# Scrum stories that are juuuust right

One thing has been bugging me for quite some time now as I observe our team at Neurobat. Most stories on our sprint board are being worked on by one developer each, leading to daily scrums where everyone reports on work that is completely independent from that of the others.

Even though we encourage people to pair program, the fact remains that most stories are such that one person can implement them by himself, with the possible exception of testing. (We have a rule that a developer may not write the acceptance tests for his own story, much less execute them.)

That, in turn, leads to a very quiet office. We work in an open-space office where the six of us are in direct line of sight of each other. Yet for most of the time, there is very little chatter as each of us is busy with “his bit”.

Perhaps Mike Cohn summarises the issue best, in his User Stories Applied:

Most user stories should be written such that they need to be worked on by more than one person, such as a user interface designer, programmer, database engineer, and a tester. If most of your stories can be completed by a single person, you should reconsider how your stories are written. Normally, this means they need to be written at a higher level so that work from multiple individuals is included with each.

I’m not a big fan of hyperbole, but this passage was a little bit of a revelation to me. Here we had been faithfully trying hard to break up stories that were too large into tiny weeny stories that could be implemented in a day or two by a motivated developer; and now I’m being told that there is such a thing as a story that is too small? Talk about being in a Goldilocks-ish fix.

Very well Goldilocks er… I mean Mr Cohn, I’ll bring this up at our next retrospective and we’ll see whether our stories are really too small.

In: Uncategorized

# Running CARNOT models under OSX

CARNOT (Conventional And Renewable eNergy systems Optimization Toolbox) is a set of MATLAB & Simulink models for simulating buildings and building systems, e.g. boilers, heat distributors etc. It’s been developed by a collaboration involving several companies and universities and is generally well-regarded. It’s one of several MATLAB toolboxes dedicated to the problem of simulating building physics; other toolboxes with a similar goal include the International Building Physics Toolbox and SIMBAD.

CARNOT is in the process of being moved to a new hosting provider. In the meantime, I’ve recently obtained a copy of this toolbox and here is my experience in getting it to run under OSX.

CARNOT is distributed as a zip file, carnot_60_2013b_public_22oct2015.zip. I decompress it and find what looks like a Simulink top-level model called carnot.slx, and several sub-folders. Very encouragingly, I see there’s an installation guide.

I move the decompressed folder to the folder where I keep all my in-progress projects, and create a symbolic link to it named more simply carnot.

The installation guide is very well written, and the main steps consist in:

1. Decompressing the toolbox;
2. Running the init_carnot.m script, which will set up all the paths correctly;
3. Compiling all the MEX files with the provided script.

When I ran init_carnot.m for the first time on my Mac, it didn’t work and the error message made it very clear that most file paths had been written with the assumption that the toolbox was going to be used on Windows. At this point, I could have done either of two things:

1. Fix the issues myself as quickly as possible and get on with the work;
2. Fix the issues myself carefully, making sure that my fixes could then be sent back to the maintainer of the package.

Being currently on a business trip, I felt like I had the leisure to go for option 2. Seeing that this toolbox was going to need some fixing, I made a git repo out of it and added all its files. (I didn’t know at this point which, if any, files were generated. I figured that this was something I could worry about later and remove those files from the repo.)

I fixed the paths problem, which mostly lay in path_carnot.m, called by init_carnot.m. I created a patch from it and sent it to one of the maintainers. The paths were now correctly set up.

Next I ran mex -setup, which ran fine: MATLAB picked up my Xcode installation with the Clang compiler. The next step in the installation guide was to run MakeMEX.m from the version_manager directory. That ran fine for several .c files until it tried to compile dir2_mex.c. When I opened that file in the editor I saw that it depended on the Windows API. Here I had two options:

1. Try to understand what this file was doing and try to rewrite it without using the Windows API;
2. Skip the compilation of this file and hope that the rest of the toolbox would run fine without it.

The problem with option 1 was that I didn’t know at this point whether there were going to be other files with the same problem. I certainly didn’t want to enter that kind of endless loop of fixing file after file that depended on the Windows API. Since the build process had stopped on encountering the first error, I had no way of knowing if there were going to be many other problems.

So it felt more appropriate to find where in the code mex was being called, and wrap that call in a try/catch block, yielding a warning if any file failed to compile. Re-running MakeMEX now compiled all the files correctly except the single dir2_mex.c, which I hoped would not be needed to run any of the simulations I was planning.

Once this was done, I could finally type carnot at the command line and the toolbox would open:

I was immediately drawn to the box that says double click to open examples, and that yielded another set of errors, again related to file paths. After fixing those I could open an example model, example_House_SFH45, click run, and see the simulation running. I was all set and done.

A couple of days after writing the first draft of this article, I learned from one of the main developers that they plan to set up a proper SVN repository for the code, and that the whole toolbox was going to be released under a BSD licence instead of the current LGPL. Until this is done, I’ve pushed my fixes to a public repository on GitHub, to which I have contributed some extra small fixes. But keep in mind that this is in no way the official repository for CARNOT; that will be announced shortly.

Posted on December 9, 2015 at 10:00 am by David Lindelöf
In: Uncategorized