A 3D Version of R’s curve() Function

I like exploring the behavior of functions of a single variable using the curve() function in R. One thing that seems to be missing from R’s base functions is a tool for exploring functions of two variables.

I asked for examples of such a function on Twitter today and didn’t get any answers, so I decided to build my own. As I see it, there are two ways to visualize a function of two variables:

  1. Use a 3D surface.
  2. Use a heatmap.

But 3D surfaces aren’t currently available in ggplot2, so I decided to work with heatmaps. The function below provides a very simple implementation of my 3D version of curve(), which I call curve3D():

1
2
3
4
5
6
7
8
9
10
curve3D <- function(f, from.x, to.x, from.y, to.y, n = 101)
{
	x.seq <- seq(from.x, to.x, (to.x - from.x) / n)
	y.seq <- seq(from.y, to.y, (to.y - from.y) / n)
	eval.points <- expand.grid(x.seq, y.seq)
	names(eval.points) <- c('x', 'y')
	eval.points <- transform(eval.points, z = apply(eval.points, 1, function (r) {f(r['x'], r['y'])}))
	p <- ggplot(eval.points, aes(x = x, y = y, fill = z)) + geom_tile()
	print(p)
}

Here’s an example of the use of curve3D to explore the behavior of Loewenstein and Prelec’s Generalized Hyperbolic discounting function:

g <- function(x, y) {(1 + y * 2) ^ (-x / y) * (1 + y * 1) ^ (x / y)}

curve3D(g, from.x = 0.01, to.x = 1, from.y = 0.01, to.y = 1)
example.png

I'd love suggestions for cleaning this function up. Two obvious improvements are:

  1. Allow the function to accept arbitrary expressions and not just functions as inputs.
  2. Allow the user to see 3D surfaces or heatmaps.

I suspect that the first problem would be a great way to learn about functional programming in R -- especially R's methods for quoting, parsing and deparsing expressions.

Canabalt Revisited: Gamma Distributions, Multinomial Distributions and More JAGS Goodness

Introduction

Neil Kodner recently got me interested again in analyzing Canabalt scores statistically by writing a great post in which he compared the average scores across iOS devices. Thankfully, Neil’s made his code and data freely available, so I’ve been revising my original analyses using his new data whenever I can find a free minute.

Returning to Canabalt has been a lot of fun, especially because my grasp of statistical theory is a lot stronger now than it was when I published my last post on Canabalt scores. For example, I actually know now what I was trying to say when I publicly described my search for a Poisson distribution in the posted scores. At the time, I had just read about Poisson regressions and was therefore eager to fit Poisson models to the scores data, even though the Poisson model gave very poor results. In retrospect, it’s clear that I was misled by superficial similarities in statistical terminology. What I was really looking for in the data was something closer to a Poisson process than to a Poisson distribution. Unfortunately, I didn’t really understand Poisson processes at the time I wrote my original posts, so I only succeeded in showing that Canabalt scores could not reasonably be claimed to be Poisson distributed.

Generating Process

But now that I have more data and more experience, it’s easy to see what I was struggling to articulate before: Canabalt scores seem to be generated by something like a truncated Poisson process. From my current perspective, the generating process for a Canabalt score is essentially the following:

  1. Initialize the player’s score to zero.
  2. Iterate the following steps repeatedly:
    1. Draw the number of meters before the next obstacle from a Poisson distribution. Add this value to the player’s current score.
    2. Draw the identity of the next obstacle from a multinomial distribution.
    3. For every type of obstacle, there is a constant probability p that a player will die when they come up against an instance of that type of obstacle. Given the value of p, draw a Bernoulli variable with probability p of coming up heads. If the result is heads, the player dies and their score is the current value of score, which is therefore simply the sum of several Poisson variables. If the result is tails, go back to the top of this loop.

This intuition, unfortunately, isn’t trivial to express as a model that I can quickly fit to the data I have. I haven’t tried very systematically to build a formal version of this model because there seem to be obvious problems of identification. For example, one can imagine that the values for the obstacle-specific probabilities described above can all be lowered by a constant proportion at the same time that the mean of the underlying Poisson distribution is decreased while the distribution of outcome scores is left constant. My intuition tells me that the only way around this would be to exploit the variance of the data and the restriction of the Poisson distribution to integer values, but I haven’t pushed very hard on this. If others are interested in pursuing it, I think there’s an interesting open problem here.

Thankfully, there’s another approach to modeling the data that’s simpler. As Owe Jessen noted in a comment on my earlier post, the distribution of scores looks like one parameterization of the gamma distribution. When Owe made this suggestion originally, I tried to use fitdistr from the MASS package to fit a gamma model to the scores data, but was never able to get the code to work properly. But now that I’m reasonably fluent in BUGS, it’s quite easy to fit a gamma distribution to the empirical scores data.

Gamma Distribution

Using JAGS, fitting a gamma distribution to the scores data is basically a trivial problem that requires only a few lines of code:

1
2
3
4
5
6
7
8
9
10
model
{
	for (i in 1:N)
	{
		score[i] ~ dgamma(shape, rate)
	}
 
	shape ~ dgamma(0.0001, 0.0001)
	rate ~ dgamma(0.0001, 0.0001)
}

Invoking this BUGS model from R using code that I’ve put on GitHub allows us to estimate the scale and rate parameters for a gamma distribution in a few minutes. The resulting parameterization of the gamma distribution looks very similar to the empirical density that we can estimate using a KDE:

density_comparison.png

Beyond visual comparisons, we can formally test the fit of the gamma model using a K/S test. The K/S test tells us that we should reject the gamma model, but it’s a fairly weak rejection at p = 0.005. Given that we have over a thousand data points, I think this weak rejection suggests that the gamma model is not such a bad approximation to the true score distribution.

Where the gamma model seems to noticeably fail is in the tail of the scores distribution:

tail_density_comparison.png

To exaggerate the differences here, I’ve used a square root transform on the y axis so that we can see the bumps in the estimated density plot that are missing from the theoretical gamma model.

Since generating these images, I’ve had a chance to read a bit about heavy tailed distributions, but haven’t yet tried fitting any of them to this data set. I’ll probably start with a Pareto distribution, though I’d really like to find a discrete heavy tailed distribution over the natural numbers rather than a continuous distribution over the non-negative reals.

Looking further past the gamma model’s misfit in the tails, there’s another reason that I like the gamma model: the gamma distribution has an origins story that has some points of connection to the generative model I outlined above. Specifically, adding a bunch of exponential variables together will give you a gamma distribution (also called an Erlang distribution in this context). While I’m skeptical that the distribution of meters between obstacles can be reasonably treated as if it were an exponential distribution, the summation origin for the gamma distribution is a nice point of connection to my intuitions about a data generating mechanism that behaves like a Poisson process.

Hierarchical Gamma Model

The gamma model has another point in its favor: it’s easy to write down a hierarchical model that fits a distinct gamma distribution to subsets of the original data. By fitting multiple gamma distributions, we can easily make comparisons between the estimated score distributions for the different devices that Neil analyzed. As Neil showed, there are enough rows in the current data to do this in a principled way across devices without resorting to gamma distributions, but a hierarchical model provides us with tools for thinking about comparisons across different types of deaths, where we don’t have enough data to use density estimates or other non-parametric methods.

Without a distributional model, I’d be skeptical about estimating differences between groups with such unequal sample sizes. (Even with a model, I don’t think we can make strong conclusions about differences between all of the groups in this data set.) That said, within a hierarchical model, I feel more comfortable estimating several conditional distributions from a small data set, because the hierarchical model provides enough shrinkage to prevent us from arriving at extreme conclusions simply because some groups were undersampled in our data set. (Of course, shrinking the model parameters for small groups towards the global mean may lead us to miss differences that are real.)

All that said, the analyses I describe below implement a hierarchical model that I’ve used to estimate the mean and standard deviation of the score distribution for each of the iOS devices and each of the death types in our data set.

First, let’s model the expected score for a player who died because of each of the various possible obstacles they might come across. I’ll refer to these different types of deaths as the death types. My JAGS code for estimating the shape and rate parameters for each death type is below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
model
{
	for (i in 1:N)
	{
		score[i] ~ dgamma(shape[death[i]], rate[death[i]])
	}
 
	for (j in 1:K)
	{
		shape[j] ~ dgamma(alpha.shape, beta.shape)
		rate[j] ~ dgamma(alpha.rate, beta.rate)
	}
 
	alpha.shape ~ dgamma(0.0001, 0.0001)
	beta.shape ~ dgamma(0.0001, 0.0001)
 
	alpha.rate ~ dgamma(0.0001, 0.0001)
	beta.rate ~ dgamma(0.0001, 0.0001)
}

As you can see, I estimate group hyperparameters that partially pool data across all of the death types: these hyperparameters are themselves given weakly informative gamma priors in the last four lines of the model. The results give us the following estimates for the mean score for each death type along with the estimated standard deviation:

death_hierarchical_gamma_1.png

I’ve chosen to use means and standard deviations rather than the shape and rate parameters that we’re actually fitting, because the shape and rate parameters are on such different scales that only one of them can be visualized effectively at a time. There are simple formulas for translating between these parameterizations of the gamma distribution: the translation scheme can be found on Wikipedia.

Having estimated these parameters across death types, I can also approach the question that Neil addressed in his post by comparing the score distributions across iOS devices. The JAGS code to do so is almost identical to the code for estimating the gamma parameters across death types:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
model
{
	for (i in 1:N)
	{
		score[i] ~ dgamma(shape[device[i]], rate[device[i]])
	}
 
	for (j in 1:J)
	{
		shape[j] ~ dgamma(alpha.shape, beta.shape)
		rate[j] ~ dgamma(alpha.rate, beta.rate)
	}
 
	alpha.shape ~ dgamma(0.0001, 0.0001)
	beta.shape ~ dgamma(0.0001, 0.0001)
 
	alpha.rate ~ dgamma(0.0001, 0.0001)
	beta.rate ~ dgamma(0.0001, 0.0001)
}

In this case, the results tell a much simpler story about the differences between the three devices in our data set:

device_gamma_1.png

Multinomial Model

As a closing point, there’s one more modeling project for which I’ve used JAGS on the new Canabalt data: I’ve tried to estimate the probability of suffering each type of death along with an indication of the uncertainty in our estimates. Of course, this is a simple problem to solve using maximum likelihood methods: you can just plug in the empirical frequencies. But, given my growing love for Bayesian methods and the use of credible intervals to summarize uncertainty, I decided that I would estimate the multinomial model for death types using a Dirichlet prior centered on a uniform multinomial distribution. Here’s my JAGS code for estimating the multinomial model:

1
2
3
4
5
6
7
8
9
10
model
{
	deaths[1:K] ~ dmulti(p[1:K], N)
	p[1:K] ~ ddirch(alpha[1:K])
 
	for (i in 1:K)
	{
		alpha[i] <- 1 / K
	}
}

And here are the final results:

death_type_probabilities.png

Conclusion

It’s been a lot of fun coming back to this topic. I still want to understand the outliers in the data better, but I’ll leave that for another day. If you’re interested in exploring this data set yourself, I encourage you to go to the GitHub repository I’ve set up and explore the JAGS code that I’ve used to fit these models. There you can also find higher effective resolution PDF’s of the graphs that you see here, which are admittedly a bit hard to read at this resolution.

Finally, I’d like to thank Neil Kodner for having put so much work into collecting more Canabalt data and analyzing it. Essentially all of the work I’ve done here is just a Bayesian reformulation of the questions that Neil addressed already in his own post.

Review of R Graphs Cookbook

The kind people at Packt Publishing recently asked me to review one of their newest R books: the R Graphs Cookbook. In general, I think pretty highly of the book: it provides a nice overview of the basic tools for visualizing data in R. If you’re just getting started with creating graphs in R, this book could be a very valuable resource. It’s clearly targeted at beginners, though it does seem to assume at least a little familiarity with R’s basic data structures and control flow. Perhaps more importantly for potential readers, the book actually works through some comprehensible, extended examples, which makes it substantially more readable than the default R documentation. I’d probably have benefited from having this book when I was first starting to learn to program in R.

Another plus for the book is that it covers some of the graphing functionality that’s provided by the base, lattice and ggplot2 packages. That last bit is particularly important to me: I hope this book represents one of the first members of a new generation of books on R that will treat ggplot2 as a default tool for building graphs in R — if not the default tool. That said, I’d actually have been happy if the book had proselytized more strongly for ggplot2, even though I can imagine that it’s probably a good thing to expose beginner users to the traditional tools in R for visualizing data in addition to the newest favorite contender.

For intermediate users, I suspect that the most useful part of the book will be the discussion in the second chapter of some of the lower level graphic parameters that par() gives you access to. I’d never learned to use those features very effectively, so the chapter covering those details was particularly helpful for me. Also, there’s a later chapter on creating maps that’s quite nice, as well as a final chapter that’s focused on producing publication-ready graphics. Both of those chapters could be very useful for more advanced readers: for example, I knew absolutely nothing about working with fonts in R before reading those sections.

In short, the R Graphs Cookbook seems like a useful book to have around for most R hackers and a potentially very valuable resource for new programmers. If you’re interested in finding a starting point for learning to build graphs in R, I’d suggest considering this book.

Modern Science and the Bayesian-Frequentist Controversy

The Bayesian-Frequentist debate reflects two different attitudes to the process of doing science, both quite legitimate. Bayesian statistics is well-suited to individual researchers, or a research group, trying to use all the information at its disposal to make the quickest possible progress. In pursuing progress, Bayesians tend to be aggressive and optimistic with their modeling assumptions. Frequentist statisticians are more cautious and defensive. One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution. The frequentist aims for universally acceptable conclusions, ones that will stand up to adversarial scrutiny. The FDA for example doesn’t care about Pfizer’s prior opinion of how well it’s new drug will work, it wants objective proof. Pfizer, on the other hand may care very much about its own opinions in planning future drug development.1

To me, it’s amazing how similar the ambiguous regions of behavioral decision theory are to the major questions of theoretical statistics: people seem largely unable to systematically decide whether they want to be minimaxing (which seems very close to Efron’s vision of frequentist thought as stated here) or whether they want to be minimizing expected risk (which is closer to my own vision of Bayesian thinking). My own sense is that we learn as a global culture, over time, which error functions are least erroneous — and we do so largely by trial and error.

Most interesting to me is to consider individual differences in the error functions people effectively use: I suspect political preferences correlate with a propensity to focus on worst case thinking rather than average case thinking. Also, I’m fascinated by the way that a single person switches between worst case and average case thinking: I suspect there’s as much to be learned here as there was in understanding what drives risk seeking behavior and what drives risk average behavior.

HT: John D. Cook


  1. Bradley Efron : Modern Science and the Bayesian-Frequentist Controversy

Inconsistencies in Bayesian Models of Decision-Making

But modeling devices that make sense for an unbiased decisionmaker may not make sense for a biased one. For example, why would individuals have priors and posteriors if they are destined to apply Bayes’ law incorrectly?1

A question I often ask myself.

  1. Wolfgang Pesendorfer : Behavioral Economics Comes of Age: A Review Essay on Advances in Behavioral Economics

Useful One-Line for Mac OS X Users

Today I needed to quickly reverse the order of rows in a LaTeX table. This command nicely did the trick:

1
pbpaste | awk '{ line[NR] = $0 } END { for (i=NR;i>0;i--) print line[i] }'

I stole the awk piece from here.

But Muddle We Must

We are constantly exploited by the tools meant to foil our exploitation. For a progressive to acknowledge as much is tantamount to abandoning progressivism. So it’s no surprise that progressives would rather worry over trivialities such as campaign finance reform than dwell on the paradoxes of political power. But it really isn’t the Citizens United decision that’s about to make Peter Orszag a minor Midas. It’s the vast power of a handful of Washington players, with whom Mr Orszag has become relatively intimate, to make or destroy great fortunes more or less at whim. Well-connected wonks can get rich on Wall Street only because Washington power is now so unconstrained. Washington is so unconstrained in no small part because progressives and New Dealers and Keynesians and neo-cons and neo-liberals for various good and bad reasons wanted it that way. So, what is to be done? Summon a self-bottling genie-bottling genie?

The classically liberal answer is to make government less powerful. The monstrous offspring of entangled markets and states can be defeated only by the most thorough possible separation. But public self-protection through market-state divorce can work only if libertarians are right that unfettered markets are not by nature unstable, that they do not lead to op[p]ressive concentrations of power, that we would do better without a central bank, and so on. Most of us don’t believe that. Until more of us do, we’re not going far in that direction. And maybe that’s just as well. Maybe it’s true that markets hum along smoothly only with relatively active government intervention and it’s also true that relatively active government intervention is eventually inevitably co-opted, exacerbating rather than mitigating capitalism’s injustices. Perhaps the best we can hope ever to achieve is a fleeting state of grace when fundamentally unstable forces are temporarily held in balance by an evanescent combination of complementary cultural currents. This is increasingly my fear: that there is no principled alternative to muddling through; that every ideologue’s op-ed is wrong, except the ones serendipitously right. But muddle we must.

I generally avoid posting about politics these days, but this piece is too beautifully written and too close to an exact statement of my own political beliefs to let it go unmarked.

Academic Jargon: Field-Specific Insults

Every academic field seems to develop a set of generic insults based on their intellectual toolkit. Here are two examples I hear often:

  1. Probabilists and Statisticians: “I think that’s an interesting case, but it’s in a set with measure zero.”
  2. Economists: “X group’s behavior is clearly rent-seeking.”

Do any readers have good examples from other fields?

A Draft of ProjectTemplate v0.2-1

I’ve just uploaded a new binary of ProjectTemplate to GitHub. This is a draft version of the next release, v0.2-1, which includes some fairly substantial changes and is backwards incompatible in several ways with previous versions of ProjectTemplate.

Foremost of the changes is that most of the logic for load.project() is now built into the load.project() function directly, rather than spread out into autogenerated scripts that you can edit by hand. While this makes ProjectTemplate harder for non-experts to modify, the change will make it much easier to make revisions to ProjectTemplate in the future without having to worry about existing projects falling behind because of vestigial code that’s not being automatically updated when you install a new version of ProjectTemplate.

Because more system logic is now hardcoded into functions, each project’s configuration is handled through a YAML file in config/global.yaml. Incidentally, this introduces the new directory, config/, where configuration files will go from now on.

The data loading system is also more complex than it was before. First, there’s a new hierarchy of data sources: now the system will look for data in a cache/ directory before moving on to the data/ directory. This makes it possible for you to permanently store changes to your data set in cache/ that will allow you to skip loading the raw data set. This is helpful when the original data set is enormous and you only need a radically reduced form of it for your future analyses that you’ll store in cache/.

In addition, preprocessing is now handled through a series of ordered scripts in a munge/ directory rather than just a single preprocessing script in the lib/ directory. There’s also a log/ directory, used by the new integrated log4r support, which is off by default, but can be easily set up after installing log4r from CRAN.

Finally, there’s a src/ directory where we’re going to encourage users to place their primary analyses, so that the main directory always has the same files and directories across all projects.

In addition to all of these changes, many of which were inspired by conversations with Mike Dewar, I’ve incorporated some very helpful patches in this release. Specifically, Diego Valle-Jones fixed a bug in clean.variable.name() that lead to trouble when filenames in the data/ directory began with numbers and Patrick D. Schalk contributed code that adds support for SQLite to ProjectTemplate along with general improvements to the database access codebase.

Thanks for all of the support since the last release. Please let me know if there any changes that need to be made before I turn v0.2-1 loose on CRAN.

The NYC Marathon

New York’s annual marathon took place yesterday. Watching a bit of it on television with my friends, I was struck by the much earlier starting time for women than men. Specifically, professional women started running yesterday at 9:10 AM, while professional men start running at 9:40 AM. (This information comes from the runner’s handbook.) I wanted to get a sense of how much this head start depended on real differences in their performance, because I found it very hard to imagine why professional women would run significantly slower than professional men.

Of course, I have seen discussions of the speed difference between men and women before, but I was still very surprised by it yesterday. To get a sense of the scope of the differences, I found some data this morning from the ING Marathon website and made a quick density estimate plot, which you can see below:

hours_gender.png

It’s clear that men and women had quite difference average speeds yesterday, and that their times had very different distributions. Of course, these plots are each based on 100 observations, so I’m hesitant to make any strong conclusions. Having confirmed for myself that there are real differences in the performance of men and women, I have to confess that I still find it surprising.

For those interested in following up on this, the code I used to produce this plot and the data set I used are both available on GitHub. I’m sure there are other interesting questions one can ask of this data beyond simple comparisons across genders.