The NYC Marathon

New York’s annual marathon took place yesterday. Watching a bit of it on television with my friends, I was struck by the much earlier starting time for women than men. Specifically, professional women started running yesterday at 9:10 AM, while professional men started running at 9:40 AM. (This information comes from the runner’s handbook.) I wanted to get a sense of how much this head start depended on real differences in their performance, because I found it very hard to imagine why professional women would run significantly slower than professional men.

Of course, I have seen discussions of the speed difference between men and women before, but I was still very surprised by it yesterday. To get a sense of the scope of the differences, I found some data this morning from the ING Marathon website and made a quick density estimate plot, which you can see below:

hours_gender.png

It’s clear that men and women had quite different average speeds yesterday, and that their times had very different distributions. Of course, these plots are each based on 100 observations, so I’m hesitant to draw any strong conclusions. Having confirmed for myself that there are real differences in the performance of men and women, I have to confess that I still find it surprising.

For those interested in following up on this, the code I used to produce this plot and the data set I used are both available on GitHub. I’m sure there are other interesting questions one can ask of this data beyond simple comparisons across genders.
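The actual plotting code lives in the GitHub repository mentioned above. As a rough stand-in, here is a minimal sketch of a per-group density estimate using base R's density(). The data frame, its column names, the helper draw.densities and every number in it are made-up placeholders for illustration, not the race results, and base graphics are used here only to avoid extra dependencies.

```r
# Simulated placeholder data: these are NOT the real race results.
# The means and standard deviations are arbitrary.
set.seed(1)
times <- data.frame(
  Hours  = c(rnorm(100, mean = 2.6, sd = 0.15),
             rnorm(100, mean = 2.9, sd = 0.20)),
  Gender = rep(c("M", "F"), each = 100)
)

# One kernel density estimate per group.
densities <- lapply(split(times$Hours, times$Gender), density)

# Hypothetical helper: draw every density curve on shared axes.
draw.densities <- function(densities) {
  x.range <- range(sapply(densities, function(d) range(d$x)))
  y.max <- max(sapply(densities, function(d) max(d$y)))
  plot(NULL, xlim = x.range, ylim = c(0, y.max),
       xlab = "Finishing time (hours)", ylab = "Density")
  for (i in seq_along(densities)) {
    lines(densities[[i]], lty = i)
  }
  legend("topright", legend = names(densities), lty = seq_along(densities))
}
```

Calling draw.densities(densities) draws one curve per group, which makes differences in both location and spread visible at a glance.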

The Answer Depends on the Question

To quote from the preface to the first edition in Jeffreys (1961): ‘It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.’1

  1. Generalized Linear Models : P. McCullagh and J. A. Nelder : Chapter 2

Promising R Packages

As a quick note, here are two R packages that were mentioned to me recently and that look promising: reldist and mixtools.

EM and Regression Mixture Modeling

[UPDATE: As Will points out in the comments, this isn't really the EM algorithm. There isn't a proper E step, because there's no distribution being estimated: there's only a maximization step that alternates between maximizing the class labels and the slopes. You can think of this algorithm as a degenerate version of EM in the way that naive k-means implements a degenerate form of EM for Gaussian mixtures.]

Last night, Drew Conway showed me a fascinating graph that he made from the R package data we’ve recently collected from CRAN. That graph will be posted and described in the near future, because it has some really interesting implications for the structure of the R package world.

But for the moment I want to talk about the use of mixture modeling when you have a complex regression problem. I think it’s easiest to see some example data to motivate my interest in this topic, so here we go:

unlabeled.png

If you’ve never seen data like this, let’s just make sure it’s clear how you could have ended up with a plot that looks this way. We could end up with data like this if we had two classes of data points that each separately obey a standard linear regression model, but the models have different slopes for points from each of the two classes of data. In other words, this is the sort of data set you might fit using a varying-slope regression model — if you knew about the classes coming in to the problem. To make this idea really clear, here’s the simulation code that generated the plot I’ve just shown you.

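library(ggplot2)  # provides qplot()

N <- 100

# Assign each point to one of two hidden classes at random.
true.classes <- sample(c(0, 1), N, replace = TRUE)

x <- rep(1:50, 2)

y <- rep(NA, N)

# One slope per class: 0.3 for class 0 and 1.0 for class 1.
beta <- c(0.3, 1.0)

for (i in 1:N)
{
  y[i] <- beta[true.classes[i] + 1] * x[i] + rnorm(1)
}

png('unlabeled.png')
# print() ensures the plot is drawn even when the script is source()d.
print(qplot(x, y, geom = 'point'))
dev.off()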

But what do you do when you don’t know anything about the classes because you’ve only discovered them after visualizing your data? It should be obvious that no amount of regression trickery is going to give us the class information we’re missing. And we also can’t fit a varying slope regression without some sort of class information. It would seem that we can’t get started at all given standard regression techniques, because we have a chicken-and-egg problem where we need either the class labels or the regression parameters to infer the other missing piece of the puzzle.

The solution to this problem may amaze readers who don’t already know the EM algorithm and degenerate forms of EM, because it’s so shockingly simple and seemingly cavalier in its approach: we make up for the missing data by just making new data up out of thin air.

Seriously. The approach I’ll describe reliably works and it works for two reasons that are obvious in retrospect once someone’s told them to you:

  1. If we have an algorithm that will eventually reach the best solution to a problem from any starting point, then we can make up for missing data by randomly selecting values for what we’re missing and moving on from there. We don’t have to be paralyzed by the seemingly insurmountable problem of doubly missing data, because using arbitrary data is enough for us to get started. Now if that’s not data hacking, I don’t know what is.
  2. The first claim isn’t just hypothetical when there’s a finite number of possible classes each point could belong to: our algorithm really will eventually stop at a solution it can’t improve any further, because each step of the algorithm never gives us a worse solution than before, and there are only finitely many steps the algorithm can take, because there is only a finite number of possible class label assignments it could use. Strictly speaking, that stopping point is only guaranteed to be a local optimum, but for data as cleanly separated as ours it coincides with the best solution.

With that said, let’s go through the details for this problem with example code.

First, we have to make up imaginary class labels.

inferred.classes <- sample(c(0, 1), N, replace = TRUE)

Then we’ll plot this assignment of classes to see how well it matches the structure we see visually:

png(paste('state_', 0, '.png', sep = ''))
# print() ensures the plot is drawn even when the script is source()d.
print(qplot(x, y, geom = 'point', color = inferred.classes))
dev.off()

state_0.png

This assignment doesn’t look good at all. That’s not surprising given that we made it up without any reference to the rest of our data. But it’s actually quite easy to go from this made up set of labels to a better set. How? By fitting a varying-slope regression, calculating the errors at each data point for both possible class labels, and then re-assigning data points to the class that makes the errors smallest. We can do that with the following very simple code:

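my.data <- data.frame(Y = y, X = x, Class = inferred.classes)

# Fit one slope for class 0 (the X term) plus a slope adjustment
# for class 1 (the X:Class term), with no intercept.
lm.fit <- lm(Y ~ X + X:Class - 1, data = my.data)

for (i in 1:N)
{
  # Squared prediction error for point i under each candidate label.
  error.zero <- (y[i] - predict(lm.fit, data.frame(Y = y[i], X = x[i], Class = 0)))^2
  error.one <- (y[i] - predict(lm.fit, data.frame(Y = y[i], X = x[i], Class = 1)))^2

  # Keep whichever label gives the smaller error.
  if (error.zero < error.one)
  {
    inferred.classes[i] <- 0
  }
  else
  {
    inferred.classes[i] <- 1
  }
}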

Here we fit a linear regression with two slopes, depending on the class being 0 or 1, and we’ve thrown out any intercept for simplicity. Then we determine which of the two classes would make the data more likely given the slopes we inferred using our imaginary classes. This actually makes a huge improvement in just one step:

state_1.png
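To make the two-slope formula concrete, here is a small sketch of how the fitted coefficients map onto one slope per class. It is self-contained and, as an assumption made purely for illustration, fits the model with the true simulated labels rather than the inferred ones. The names fit, slope.zero and slope.one are not from the code above.

```r
set.seed(1)
N <- 100
classes <- sample(c(0, 1), N, replace = TRUE)
x <- rep(1:50, 2)
y <- c(0.3, 1.0)[classes + 1] * x + rnorm(N)

fit <- lm(Y ~ X + X:Class - 1,
          data = data.frame(Y = y, X = x, Class = classes))

# 'X' is the slope when Class == 0; 'X:Class' is the extra slope
# that gets added when Class == 1.
slope.zero <- unname(coef(fit)["X"])
slope.one  <- unname(coef(fit)["X"] + coef(fit)["X:Class"])
print(c(class0 = slope.zero, class1 = slope.one))
```

With correct labels the two recovered slopes should land close to the simulated values of 0.3 and 1.0, which is the target the relabeling step is working towards.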

Luckily for us, there’s only one data point that’s not been assigned properly, so we can just loop over the steps we took one more time to clean up our model to near perfection:

state_2.png

And that’s it.
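The fit-then-relabel steps above can be wrapped in a loop that stops once the labels no longer change. What follows is a minimal sketch rather than the code behind the figures: the helper names fit.slopes, reassign and run.alternation are hypothetical, and the slopes are computed with the closed form for a no-intercept regression, sum(x * y) / sum(x^2), instead of calling lm().

```r
# Least-squares slope through the origin for each class.
fit.slopes <- function(x, y, classes) {
  sapply(c(0, 1), function(k) {
    idx <- classes == k
    if (!any(idx)) return(NA)  # empty class: no slope to estimate
    sum(x[idx] * y[idx]) / sum(x[idx]^2)
  })
}

# Give each point the label whose slope yields the smaller squared error.
reassign <- function(x, y, slopes) {
  errors <- sapply(slopes, function(b) (y - b * x)^2)  # N x 2 matrix
  errors[is.na(errors)] <- Inf  # an empty class can never win a point
  as.integer(!(errors[, 1] < errors[, 2]))
}

# Alternate the two steps until the labels stop changing.
run.alternation <- function(x, y, classes, max.iter = 50) {
  for (iter in seq_len(max.iter)) {
    slopes <- fit.slopes(x, y, classes)
    updated <- reassign(x, y, slopes)
    if (all(updated == classes)) break
    classes <- updated
  }
  list(classes = classes, slopes = slopes, iterations = iter)
}

# Simulated data in the same style as the example in this post.
set.seed(1)
N <- 100
true.classes <- sample(c(0, 1), N, replace = TRUE)
x <- rep(1:50, 2)
y <- c(0.3, 1.0)[true.classes + 1] * x + rnorm(N)

result <- run.alternation(x, y, sample(c(0, 1), N, replace = TRUE))

# Labels are only identified up to a swap of 0 and 1.
agreement <- mean(result$classes == true.classes)
agreement <- max(agreement, 1 - agreement)
print(result$slopes)
print(agreement)
```

Points with very small x sit close to both lines, so a handful of them may keep the wrong label even after the loop has converged.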

[EDIT: Fixed a typo in the example code that actually made the algorithm work faster, but only because it coincided with the structure of the problem.]
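For contrast with the hard relabeling used in this post, and in line with the update note at the top, here is a rough sketch of what a proper EM iteration could look like for the same kind of data. It assumes a two-component mixture of lines through the origin with a shared Gaussian noise scale. The function em.two.slopes and the choice to start the slopes at the quartiles of y / x are assumptions made for illustration.

```r
em.two.slopes <- function(x, y,
                          slopes = unname(quantile(y / x, c(0.25, 0.75))),
                          n.iter = 50) {
  sigma <- sd(y)       # crude starting value for the noise scale
  mix <- c(0.5, 0.5)   # starting mixing proportions
  for (iter in seq_len(n.iter)) {
    # E step: posterior probability that each point belongs to each class.
    dens <- sapply(1:2, function(k) mix[k] * dnorm(y, slopes[k] * x, sigma))
    resp <- dens / rowSums(dens)

    # M step: weighted least squares for each slope, then update the
    # mixing proportions and the shared noise standard deviation.
    slopes <- sapply(1:2, function(k) {
      sum(resp[, k] * x * y) / sum(resp[, k] * x^2)
    })
    mix <- colMeans(resp)
    resid.sq <- sapply(1:2, function(k) (y - slopes[k] * x)^2)
    sigma <- sqrt(sum(resp * resid.sq) / length(y))
  }
  list(slopes = slopes, mix = mix, sigma = sigma, responsibilities = resp)
}

# Simulated data in the same style as the example in this post.
set.seed(1)
N <- 100
true.classes <- sample(c(0, 1), N, replace = TRUE)
x <- rep(1:50, 2)
y <- c(0.3, 1.0)[true.classes + 1] * x + rnorm(N)

fit <- em.two.slopes(x, y)
print(fit$slopes)
print(head(round(fit$responsibilities, 3)))
```

The difference from the hard version is that each point carries a probability for every class instead of a single label, so ambiguous points near the origin contribute to both slopes in proportion to those probabilities.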

Apologies for Polluting Twitter

I’d like to publicly apologize to anyone that follows me on Twitter and saw the argument I started with two people yesterday morning. While I still believe that the people on the other side of the argument had behaved inappropriately enough that someone needed to confront them, my actual reaction was completely counter-productive and represented exactly the sort of adversarial behavior that never improves a bad situation. I’ll do my best not to repeat that mistake in the future.

R Recommendation Contest Launches on Kaggle

The R Recommendation Engine contest is now live on Kaggle. Please head over there and start submitting your predictions for the test data set. Once you do, you can check the leaderboard to see how your algorithm compares with other people’s work. We know that there’s still plenty of progress that can be made, because we have other models that are much better than the benchmark code we released to the public.

In the future, I’ll be posting some hints on this blog about ways to improve your models for the contest using less well-known methods in machine learning. Because you can use CRAN itself as a data source, this contest offers a lot of opportunities to exploit the state-of-the-art in machine learning based on text and network analysis.

If you have any questions or comments about the contest, please use the contest forum on Kaggle so that others will benefit from the discussion.

Build a Recommendation System for R Packages

On Dataists, a new collaborative blog for data hackers that I’m contributing to, we’ve just announced a data contest that’s custom made for R users. To win the contest, you need to build a recommendation system for R packages.

To find out more, check out the official announcement on Dataists. Then go to GitHub to get the data sets we’re providing, including the official training data set that you should use to build your model. We’re even providing you with a baseline model to get you started.

On Sunday, the contest will officially go live on Kaggle, where you’ll want to make submissions to see how your recommendation algorithm compares with other contestants’ submissions. In February 2011, the contest will end and the team with the best system will win three UseR! books of their choosing.

Happy hacking!

ProjectTemplate Version 0.1-3 Released

I’ve just released the newest version of ProjectTemplate. The primary change is a completely redesigned mechanism for automatically loading data. ProjectTemplate can now read compressed CSV files, access CSV data files over HTTP, read Stata, SPSS and RData binary files and even load MySQL database tables automatically. For my own projects, this is a big step forward. To access the more esoteric data sources like remote datasets and MySQL databases, the end user only needs to provide a YAML file that specifies a few details about the data source being accessed. Hopefully the approach I’ve taken works for a large range of problems.

If you’re interested in data available over HTTP, a sample configuration file, called a.url, is shown below:

url: "http://www.johnmyleswhite.com/ProjectTemplate/sample_data.csv"
separator: ","

And for those interested in accessing data from MySQL, a sample configuration file, called b.sql, is shown below:

type: mysql
user: sample_user
password: sample_password
host: localhost
dbname: sample_database
table: sample_table

My inspiration for these changes came from two people that I’d like to thank: Diego Valle-Jones and David Edgar Liebke. A month ago, Diego submitted a patch for load_data.R that added RData and compressed CSV file type support. At that time, I started thinking about how to make a more extensible data loader, but wasn’t able to return to the topic until this week.

Last night, while I was reading David’s very helpful tutorial on Incanter, I realized that ProjectTemplate could automate many more types of data loading. I hope that I’ve made load_data.R capable of at least some of the magic that Incanter’s get-dataset does.

The full list of file types that are now supported is shown below:

  • .csv: CSV files that use a comma separator.
  • .csv.bz2: CSV files that use a comma separator and are compressed using bzip2.
  • .csv.zip: CSV files that use a comma separator and are compressed using zip.
  • .csv.gz: CSV files that use a comma separator and are compressed using gzip.
  • .tsv: CSV files that use a tab separator.
  • .tsv.bz2: CSV files that use a tab separator and are compressed using bzip2.
  • .tsv.zip: CSV files that use a tab separator and are compressed using zip.
  • .tsv.gz: CSV files that use a tab separator and are compressed using gzip.
  • .wsv: CSV files that use an arbitrary whitespace separator.
  • .wsv.bz2: CSV files that use an arbitrary whitespace separator and are compressed using bzip2.
  • .wsv.zip: CSV files that use an arbitrary whitespace separator and are compressed using zip.
  • .wsv.gz: CSV files that use an arbitrary whitespace separator and are compressed using gzip.
  • .RData: .RData binary files produced by save().
  • .rda: .RData binary files produced by save().
  • .url: A YAML file that contains an HTTP URL and a separator specification for a remote dataset.
  • .sql: A YAML file that contains database connection information for a MySQL database.
  • .sav: Binary file format generated by SPSS.
  • .dta: Binary file format generated by Stata.
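The list above amounts to a lookup from file extension to loading strategy. Here is a minimal sketch of how such a lookup could be written. It is a hypothetical helper, not ProjectTemplate's actual load_data.R, and the function name reader.for and the labels it returns are invented for illustration.

```r
# Hypothetical helper: NOT ProjectTemplate's actual load_data.R.
# Maps a filename to a description of how it could be read.
reader.for <- function(filename) {
  # Peel off an optional compression suffix first.
  compression <- "none"
  m <- regmatches(filename, regexec("\\.(bz2|zip|gz)$", filename))[[1]]
  if (length(m) > 0) {
    compression <- m[2]
    filename <- sub("\\.(bz2|zip|gz)$", "", filename)
  }

  extension <- tools::file_ext(filename)

  # Broad family of the file, following the list of extensions above.
  kind <- switch(extension,
                 csv = "delimited", tsv = "delimited", wsv = "delimited",
                 RData = "rdata", rda = "rdata",
                 url = "yaml-url", sql = "yaml-mysql",
                 sav = "spss", dta = "stata",
                 "unknown")

  # Separator for delimited files; "" means any run of whitespace,
  # matching the convention used by read.table().
  separator <- switch(extension, csv = ",", tsv = "\t", wsv = "", NA)

  list(kind = kind, separator = separator, compression = compression)
}

str(reader.for("survey.tsv.gz"))
str(reader.for("a.url"))
```

Keeping the mapping in one place like this is what makes a loader extensible: supporting a new format means adding one more entry rather than another branch scattered through the code.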

The other major change to ProjectTemplate in this release is that many fewer packages are now being loaded or even installed by default. I am not sure whether this is the ideal practice moving forward, but it was explicitly requested by a user. I’ve decided to see how the change is received by other users before making a final design decision. If you have strong views for or against this change, please speak up here or on the Google Groups mailing list.

Three-Quarter Truths: Correlation Is Not Causation

Other than our culture’s implicit association between lies, damned lies and statistics, I think no idea has stifled the growth of statistical literacy as much as the endless repetition of the words correlation is not causation. This phrase seems to be primarily used to suppress intellectual inquiry by encouraging the unspoken assumption that correlational knowledge is somehow an inferior form of knowledge.

I’d like to defend correlation for a bit. Here are four reasons why I think we should learn to love correlation and stop worrying so much about causation.

Claim 1: Most Knowledge is Correlational Knowledge

The majority of reliable human knowledge is already correlational. Spend a few days making a list of things that you know for certain about the world. I claim that you will find that a solid majority of them will be correlational statements rather than causal statements. For example, you might notice that you know that teenagers who own skateboards generally like punk music more than world music, though you are certainly aware that listening to the Sex Pistols isn’t the cause of their desire to learn how to ollie. And you almost certainly know that ‘s’ is followed by ‘t’ more often than ‘s’ is followed by ‘r’ in English, though you would never claim that an ‘s’ causes a ‘t’. 1

Hopefully those two examples are enough to make you suspect that you have an enormous quantity of correlational information stored inside your head. I’d like to further suggest that, despite its low status in our scientific culture, this sort of correlational knowledge has enormous practical value to you, because it allows you to make sense of a world in which you have incomplete information and are constantly required to fill in the blanks. For example, if you’re out at night in the deep South and suddenly see someone charging towards you dressed in white sheets, you’ll almost certainly run away, even though you don’t believe that white sheets cause lynchings. 2 Correlational knowledge can keep you alive when worrying about causality would get you killed.

Claim 2: The Value of Information is More Complex than the Opposition of Causation and Correlation Would Suggest

Taking this point a step further, it’s worth noting that assessing the value of information is a far more difficult problem than one might think. In practice, you always need to ask yourself what you’re trying to do with information. In many cases, you aren’t trying to control things, which I would claim is the only scenario in which causal knowledge could not be replaced with correlational knowledge in principle. In most real world problems, correlational knowledge is enough to make predictions with very high accuracy. For example, imagine that you run a bank and want to predict whether a person will default on their loan. You find that their zipcode predicts their rate of default quite well. You know full well that a zipcode cannot possibly cause a person to default on their loan, because it’s just a number based on a fairly arbitrary way of cutting up neighborhoods. But the absence of a causal relationship is completely irrelevant to you as a banker, since your interest lies in making money — and not in learning something about the hidden causes of human behavior.

If you want to predict something, rather than control it, the most important thing to ask is how well the information you can acquire will allow you to make predictions. After addressing this problem, you will also need to consider the relative costs of acquiring different sorts of information. For example, suppose that you want to predict a person’s height. Most of us accept that our genes are the ultimate cause of our height, barring serious illness or malnutrition as children. That’s why the heights of identical twins are so similar, while the heights of fraternal twins can be quite different. Focusing on the causal pathway from genes to phenotype might suggest that you should try to measure someone’s genes to predict their height. People have done this and it doesn’t work very well. More importantly, it provides mediocre results at a fairly high cost. The price of acquiring a genotype keeps going down, but it still costs a few hundred dollars.

Another approach comes from the inventor of the concept of correlation: Francis Galton. Galton’s method simply takes your parents’ heights and uses a correlational model to predict your height. This approach is correlational because no one believes that your parents’ heights cause your height: your parents’ genes caused their heights, then their genes caused your genes, and finally your genes caused your height. This is a perfect example of the way in which two things can be correlated because they share a common cause.

By making clever use of correlational information, Galton’s method only requires data that is available at almost zero cost, and yet it is more than ten times as accurate as the genetic screening method described above. Sometimes cheap correlational information provides high predictive accuracy, while costly causal information provides almost no predictive power. If you want to do something with information, you should always consider the possibility that a correlational pathway may be cheaper to observe than a causal one — at the same time that it provides comparable predictive power or even greater predictive power.
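The common-cause structure described above can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: every number is made up, heights are generated purely from an unobserved "gene" score plus noise, and it says nothing about the real-world accuracy of either method. It only shows that parents' heights can predict a child's height without causing it.

```r
set.seed(1)
n <- 5000

# Unobserved genetic scores (arbitrary units) for both parents.
mother.gene <- rnorm(n)
father.gene <- rnorm(n)

# The child's genes come from the parents' genes, not from their heights.
child.gene <- (mother.gene + father.gene) / 2 + rnorm(n, sd = sqrt(0.5))

# Heights in cm are caused by genes plus independent noise (made-up scale).
height <- function(gene) 170 + 7 * gene + rnorm(length(gene), sd = 3)
midparent.height <- (height(mother.gene) + height(father.gene)) / 2
child.height <- height(child.gene)

# Galton-style prediction: regress child height on midparent height.
galton.fit <- lm(child.height ~ midparent.height)
print(cor(midparent.height, child.height))

# Once the child's own genes are known, the parents' heights carry
# essentially no extra information, because they are not a cause.
full.fit <- lm(child.height ~ child.gene + midparent.height)
print(coef(full.fit)["midparent.height"])
```

In this toy world the midparent height is a useful predictor even though changing it would do nothing to the child's height, which is exactly the situation the paragraph describes.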

Claim 3: Causation is a Moving Target

Causation is not an entirely well-defined concept. It is an intuitive notion like justice or intelligence, and therefore may not have any definition that corresponds to all of the ways in which the word “cause” is used in normal language. Despite considerable work by philosophers and mathematicians, our accumulated understanding of what causation means is still very weak.

This vagueness works in causation’s favor. Because correlation is so much more precise as a concept than causation, it’s easier to come up with examples in which correlation doesn’t provide us with useful information than it is to come up with examples of the irrelevance of causal knowledge. This discrepancy in falsifiability is really a general property of mathematical models when compared with intuitive arguments: the precision of mathematical models makes them much more vulnerable to attack than vague ideas. But this brittleness is really an unrecognized virtue, because it is inseparable from the exactness that makes mathematical models directly comparable, precisely communicable and easily modified and extended. Despite their intuitive appeal, ideas whose truth or falsehood is hard to assess are less amenable to the incremental improvements that have made scientific knowledge so valuable to humanity.

Claim 4: Correlation and Causation are Related

Last, but not least, I think correlation and causation are themselves correlated. By this I mean that if you were to list pairs of related things like height and weight, ethnicity and voting preferences, or zipcodes and mortgage default rates, and then classified each relationship as correlational, causal or both, you’d find that many instances of correlation were accompanied by causation. And you’d find that even more instances of causation were accompanied by correlation. Following Drew Conway’s lead, I’ll draw a Venn diagram of the relationship that I believe holds between correlation and causation:

correlation_vs_causation.jpg

This claim is incredibly hard to test: it is merely meant to remind us how wasteful it can be to focus exclusively on the differences between correlation and causation when they also have important similarities. It is true that correlation is not causation. But it is also true that human beings are not chimpanzees. And yet, in spite of that, we’ve been able to learn a lot about the human brain from studying the brains of chimpanzees, because there are many cases in which the similarities between humans and chimps are more important than the differences. Similarly, studying correlations can give us valuable information, including information about where to start looking for causal relationships. And even when it can’t do this, there is nothing wrong with correlational knowledge that is not also causal knowledge. Knowledge of causation is only necessary when we want to control the world. But there are many aspects of the world that we are largely unable to control, even in principle. In those cases, we simply need to have accurate predictions, because prediction without causation is enough for us to make the best of what is going to happen in the future. Assessing our ability to make predictions is vitally important, and it is the habit of making testable and precise predictions that an education in statistics can give to us. So let’s embrace a world with rich data sets that can provide us with formal, testable knowledge based on unambiguous, formal models — even if those models won’t ultimately provide us with causal mechanisms.

With all of that said, if you really want to understand the distinctions between correlation and causality, there is a rich academic literature that is far subtler and more interesting than the folk philosophy of science that I’ve been attacking. The current classic is Judea Pearl’s masterwork, entitled simply “Causality”. It is very challenging material, but well worth the effort. And understanding it will require you to master so much of the machinery of prediction that you’ll walk away enlightened even if you decide in the end that causality doesn’t really interest you.

For most people, though, I have a different closing message. Please don’t allow the absence of causation to be used as a justification for remaining ignorant about the correlational structure of our world. Though there are cases in which knowing that A is related to B is much less useful than knowing that A causes B, knowing that A and B are related at all is still far better than knowing nothing at all — and we currently know nothing about many things. We should stop focusing on the ways in which correlation is not causation and instead follow Voltaire’s advice: do not allow the perfect to become the enemy of the good. 3

  1. You can quickly check this with the following shell script on OS X:

    grep 'sr' /usr/share/dict/words | wc -l
    grep 'st' /usr/share/dict/words | wc -l

    Running those commands should show you that there are 156 examples of ‘sr’ and 21,407 examples of ‘st’ in the standard UNIX dictionary.

  2. Especially not if you’ve seen Semana Santa celebrations in Spain.
  3. These ideas came up during a recent planning session for O’Reilly’s upcoming Strata Conference, during which Chris Wiggins said that he thought the distinction between correlation and causation was a red herring. My desire to expand on the reasons why I agreed with him inspired me to write my own ideas down.

Freedman on Decision Theory

On the other hand, taken as a whole, decision theory seems to have about the same connection to real decisions as war games played on a table do to real wars.1

  1. David Freedman : Some issues in the foundation of statistics