Overfitting

What do you think when you see a model like the one below?


[Figure: Overfitting]

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that a model able to match white noise is typically a bad thing, especially when you don’t have a clear cross-validation paradigm that lets you demonstrate that your model’s ability to match complex data isn’t just a form of overfitting.

There are many objective reasons to be suspicious of complicated models, but I’d like to offer up a subjective one. A model that fits complex data as perfectly as the model above is unlikely to be an interpretable model¹ because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on its inputs are simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably make predictions that are relatively smooth; at the very least, any non-smooth patterns it reveals should be comprehensible and memorable. While there are fields in which neither of these goals is achievable (and others where they’re not desirable), I think the default reaction to a model fit like the one above should be: “Why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can fit such noisy data so closely?”
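To make the degrees-of-freedom worry concrete, here’s a minimal R sketch of my own (not the model from the figure above): a polynomial regression fit to pure white noise. As the degree grows, the in-sample fit looks better and better even though there is nothing real to learn.

# Fit increasingly flexible polynomials to pure noise and watch the
# in-sample R^2 climb even though the true signal is constant.
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- rnorm(50)

for (degree in c(1, 5, 25)) {
  fit <- lm(y ~ poly(x, degree))
  cat("degree =", degree, "in-sample R^2 =", round(summary(fit)$r.squared, 2), "\n")
}

Held-out data, of course, would immediately expose the high-degree fits as overfitting.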

  1. Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!

EDA Before CDA

One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.

Context

This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I’ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I’ve wondered on and off whether zip codes were originally assigned according to the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states have some of the highest.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I’ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I’ll let the resulting pictures tell the rest of the story.

library(ggplot2)

zipcodes <- read.csv("zipcodes.csv")

# Zip code against latitude and longitude separately.
ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)

ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)

# Latitude against longitude, with points colored by zip code.
ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)

ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)

# Restrict to negative longitudes to drop far-flung non-state territories.
ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)

(Latitude, Zipcode) Scatterplot

[Figure: latitude_vs_zip.png]

(Longitude, Zipcode) Scatterplot

[Figure: longitude_vs_zip.png]

(Latitude, Longitude) Heatmap

[Figure: latitude_vs_longitude_color.png]

(Longitude, Latitude) Heatmap

[Figure: longitude_vs_latitude_color.png]

(Longitude, Latitude) Heatmap without Non-States

[Figure: usa_color.png]

Finder Bug in OS X

Four years after I first noticed it, Finder still has a bug that causes it to report a negative number of items waiting for deletion:

[Figure: FinderBug]

Playing with The Circular Law in Julia

Introduction

Statistically-trained readers of this blog will be very familiar with the Central Limit Theorem, which describes the asymptotic sampling distribution of the mean of a random vector composed of IID variables. Some of the most interesting recent work in mathematics has been focused on the development of increasingly powerful proofs of a similar law, called the Circular Law, which describes the asymptotic sampling distribution of the eigenvalues of a random matrix composed of IID variables.

Julia, which is funded by one of the world’s great experts on random matrix theory, is perfectly designed for generating random matrices to experiment with the Circular Law. The rest of this post shows how you might write the most naive sort of Monte Carlo study of random matrices in order to convince yourself that the Circular Law really is true.

Details

In order to show off the Circular Law, we need to be a bit more formal. We’ll define a random matrix \(M\) as an \(N \times N\) matrix composed of \(N^{2}\) IID complex random variables, each with mean \(0\) and variance \(\frac{1}{N}\). Then the distribution of the \(N\) eigenvalues asymptotically converges to the uniform distribution over the unit disc. I think it’s easiest to see this by doing a Monte Carlo simulation of the eigenvalues of random matrices composed of Gaussian variables for five values of \(N\):

# Write one row per eigenvalue: matrix size, simulation index,
# eigenvalue index, and the real and imaginary parts of the eigenvalue.
f = open("eig.tsv", "w")

println(f, join(["N", "I", "J", "Re", "Im"], "\t"))

ns = [1, 2, 5, 25, 50]
sims = 100

for n in ns
  for i in 1:sims
    # Scale by 1/sqrt(n) so that each entry has variance 1/n.
    m = (1 / sqrt(n)) * randn(n, n)
    # eig returns (eigenvalues, eigenvectors); in current Julia,
    # use eigvals from the LinearAlgebra standard library instead.
    e, v = eig(m)
    for j in 1:n
      println(f, join([n, i, j, real(e[j]), imag(e[j])], "\t"))
    end
  end
end

close(f)

The results from this simulation are shown below:

[Figure: Circular law]

These images highlight two important patterns:

  1. With Gaussian entries, the distribution of eigenvalues converges on the unit disc surprisingly quickly.
  2. There is a very noticeable excess of eigenvalues along the real line, and it disappears much more slowly than the overall convergence towards the unit disc.

If you’re interested in exploring this topic, I’d encourage you to try two things:

  1. Obtain samples from a larger variety of values of \(N\).
  2. Construct samples using entry distributions other than the Gaussian distribution used here.

PS: Drew Conway wants me to note that Julia likes big matrices.

Will Data Scientists Be Replaced by Tools?

The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

  1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxons of the data mining world.
  2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
  3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge workers because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren’t failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande’s claim that many people still die each year because doctors don’t wash their hands often enough. Even though Semmelweis showed the world that hygiene is a life-or-death matter in hospitals more than a century ago, we’re still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.

DataGotham

As some of you may know already, I’m co-organizing an upcoming conference called DataGotham that’s taking place in September. To help spread the word about DataGotham, I’m cross-posting the most recent announcement below:

We’d like to let you know about DataGotham: a celebration of New York City’s data community!

http://datagotham.com

This is an event run by Drew Conway, Hilary Mason, Mike Dewar and John Myles White that will bring together professionals from finance to fashion and from startups to the Fortune 500. This day-and-a-half event will consist of intense discussion, networking, and sharing of wisdom, and will take place September 13th-14th at NYU Stern.

We also have four tutorials running on the afternoon of the 13th, followed by cocktails and The Great Data Extravaganza Show at the Tribeca Rooftop that evening. Tickets are on sale – we would love to see you there!

If you’d like to attend, please see the DataGotham website for more information. Since you’re a reader of this fine blog, we’ve set up a special “Friends and Family” discount that will give you 25% off the ticket price. To get the discount, you need to use the promo code “dataGothamist”.

The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers of the language. Inspired by these claims, I decided to look into this issue quantitatively by measuring the quantity of commits to R’s SVN repository that were made by each of the R Core developers. I wanted to know whether a small group of developers were overwhelmingly responsible for changes to R or whether all of the members of R Core had contributed equally. To follow along with what I did, you can grab the data and analysis scripts from GitHub.

First, I downloaded the R Core team’s SVN logs from http://developer.r-project.org/. I then used a simple regex to parse the SVN logs to count commits coming from each core committer.
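For concreteness, the counting step can be sketched in a few lines of R. This is a rough reconstruction of the idea, not the exact script from the GitHub repo; the file name r_svn.log is a placeholder, and I’m assuming the logs are standard `svn log` output, in which each commit begins with a pipe-delimited header line.

# Count commits per committer from plain `svn log` output, where each
# commit starts with a header line of the form:
#   r60000 | ripley | 2012-01-01 10:00:00 +0000 (Sun, 01 Jan 2012) | 1 line
log_lines <- readLines("r_svn.log")

headers <- grep("^r[0-9]+ \\|", log_lines, value = TRUE)

# The committer name is the second pipe-delimited field of the header.
committers <- sapply(strsplit(headers, " \\| "), function(fields) fields[2])

sort(table(committers), decreasing = TRUE)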

After that, I tabulated the number of commits from each developer, pooling across the years 2003-2012 for which I had logs. You can see the results below, sorted by total commits in decreasing order:

Committer    Total Number of Commits
ripley                         22730
maechler                        3605
hornik                          3602
murdoch                         1978
pd                              1781
apache                           658
jmc                              599
luke                             576
urbaneks                         414
iacus                            382
murrell                          324
leisch                           274
tlumley                          153
rgentlem                         141
root                              87
duncan                            81
bates                             76
falcon                            45
deepayan                          40
plummer                           28
ligges                            24
martyn                            20
ihaka                             14

After that, I tried to visualize evolving trends over the years. First, I visualized the number of commits per developer per year:


[Figure: Commits per developer per year]

And then I visualized the evenness of contributions from different developers by measuring the entropy of the distribution of commits on a yearly basis:


[Figure: Entropy of the yearly distribution of commits]
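For reference, the entropy here is just the Shannon entropy of each year’s distribution of commits across committers; lower entropy means commits are concentrated in fewer hands. A minimal sketch follows, assuming a hypothetical data frame named commits with one row per commit and columns year and author (this mirrors, but is not, the script in the GitHub repo).

# Shannon entropy (in bits) of a vector of commit counts.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

# Entropy of the distribution of commits across authors, year by year.
yearly_entropy <- tapply(
  commits$author,
  commits$year,
  function(authors) shannon_entropy(table(authors))
)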

There seems to be weak evidence that, as consensus becomes harder to reach, the community is either tending towards a single leader who makes final decisions or losing developers who drop out because of the difficulty of achieving consensus. There is unambiguous evidence that a single developer makes the overwhelming majority of commits to R’s SVN repo.

I leave it to others to understand what all of this means for R and for programming language communities in general.

My New Book: Developing, Deploying and Debugging Multi-Armed Bandit Algorithms

I’m happy to announce that I’ve started writing a new book for O’Reilly, which will focus on teaching readers how to use Multi-Armed Bandit Algorithms to build better websites. My hope is that the book can help web developers build up an intuition for the core conundrum facing anyone who wants to build a successful business: you have to constantly make trade-offs between (A) making decisions that are likely to yield safe, short-term successes and (B) taking chances on new opportunities that could be good long-term strategies, but could also be spectacular failures.

In the academic literature, this trade-off is usually called the Explore-Exploit dilemma. It’s been studied for decades because there is no universally applicable, out-of-the-box solution. There are several simple methods that nearly everyone knows about (like A/B testing), but these methods can lead to spectacular failures. For that reason, academics have invested a huge amount of intellectual energy into developing better approaches than the sort of randomized experimentation embodied in A/B testing.

In large part, they’ve made progress by studying an idealized version of the Explore-Exploit tradeoff called the Multi-Armed Bandit Problem. A Multi-Armed Bandit Problem is a thought experiment in which you try to make the most money possible while gambling at a casino that features only an assortment of different slot machines, each of which is called a bandit because of its propensity to steal the gambler’s money.

In the book, I’ll cover a few of the standard algorithms developed based on this thought experiment, including the \(\epsilon\)-greedy algorithm, softmax decision rules and a family of algorithms collectively called UCB. All of these algorithms will be implemented in Python, but my primary interest in writing the book isn’t to provide code for these algorithms to readers, but to give them a framework that will allow them to reason about how to solve the Explore-Exploit dilemma as it impacts their businesses and the general art of developing better websites. There are now many, many different algorithms available for solving the Multi-Armed Bandit Problem, but they all share a few core intuitions and strategies. I’m hoping to communicate those higher level intuitions to readers.

At the same time, I want to provide readers with a toolkit for deciding between the various options that are available to them. Web developers are very familiar with the importance of debugging, so I’d like to teach them the equivalent of debugging for Multi-Armed Bandit Problems, which involves simulation. Instead of writing unit tests that confirm your code produces a known correct output, you run simulations to check that your algorithm reliably learns the right answer on toy problems where the right answer is unambiguous.
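As a taste of what that kind of simulation looks like, here’s a rough R sketch of my own (the book’s examples are in Python): an \(\epsilon\)-greedy agent run against two simulated Bernoulli arms whose true payout rates we control, so we can check that the algorithm ends up favoring the better arm.

# Two Bernoulli arms with payout rates known to the simulation,
# but not to the algorithm.
set.seed(1)
true_rates <- c(0.1, 0.3)
epsilon <- 0.1
n_pulls <- 10000

counts <- c(0, 0)  # pulls of each arm
values <- c(0, 0)  # running estimate of each arm's payout rate

for (t in 1:n_pulls) {
  # Explore with probability epsilon, otherwise exploit the current best estimate.
  arm <- if (runif(1) < epsilon) sample(1:2, 1) else which.max(values)
  reward <- rbinom(1, 1, true_rates[arm])
  counts[arm] <- counts[arm] + 1
  values[arm] <- values[arm] + (reward - values[arm]) / counts[arm]
}

# A sane implementation should pull arm 2 far more often and estimate
# values close to c(0.1, 0.3).
counts
values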

In summary, I hope readers will leave this book with a toolbox of strategies for developing, deploying and debugging multi-armed bandit algorithms in the context of web development. At large companies, these methods are responsible for millions of dollars of increased profits, so I think it’s high time that the core ideas trickle down from elite institutions like Google, Microsoft and Facebook to other web companies.

I’m really excited about this project and look forward to sharing more details as the book progresses.

Automatic Hyperparameter Tuning Methods

At MSR this week, we had two very good talks on algorithmic methods for tuning the hyperparameters of machine learning models. Selecting appropriate settings for hyperparameters is a constant problem in machine learning, which is somewhat surprising given how much expertise the machine learning community has in optimization theory. I suspect there’s interesting psychological and sociological work to be done exploring why a problem that could be answered using known techniques wasn’t given an appropriate solution earlier.

Thankfully, the takeaway message of this blog post is that this problem is starting to be understood.

A Two-Part Optimization Problem

To set up the problem of hyperparameter tuning, it’s helpful to think of the canonical model-tuning and model-testing setup used in machine learning: one splits the original data set into three parts — a training set, a validation set and a test set. If, for example, we plan to use L2-regularized linear regression to solve our problem, we will use the training set and validation set to select a value for the \(\lambda\) hyperparameter that is used to determine the strength of the penalty for large coefficients relative to the penalty for errors in predictions.

With this context in mind, we can set up our problem using five types of variables:

  1. Features: \(x\)
  2. Labels: \(y\)
  3. Parameters: \(\theta\)
  4. Hyperparameters: \(\lambda\)
  5. Cost function: \(C\)

We then estimate our parameters and hyperparameters in the following multi-step way so as to minimize our cost function:

\[
\theta_{Train}(\lambda) = \arg \min_{\theta} C(x_{Train}, y_{Train}, \theta, \lambda)
\]

\[
\lambda_{Validation}^{*} = \arg \min_{\lambda} C(x_{Validation}, y_{Validation}, \theta_{Train}(\lambda), \lambda)
\]

The final model performance is assessed using:
\[
C(x_{Test}, y_{Test}, \theta_{Train + Validation}(\lambda_{Validation}^{*}), \lambda_{Validation}^{*})
\]

This two-part minimization problem is similar in many ways to stepwise regression. Like stepwise regression, it feels like an opportunity for clean abstraction is being passed over, but it’s not clear to me (or, I think, to anyone else) whether there is any analytic way to solve this problem more abstractly.

Instead, the methods we saw presented in our seminars were ways to find better approximations to \(\lambda^{*}\) using less compute time. I’ll go through the traditional approach, then describe the newer and cleaner methods.

Grid Search

Typically, hyperparameters are set using the Grid Search algorithm, which works as follows (a minimal sketch in R appears after the list):

  1. For each hyperparameter \(p_{i}\), the researcher selects a list of values to test empirically.
  2. For each element of the Cartesian product of these values, the computer evaluates the cost function.
  3. The computer selects the hyperparameter settings from this grid with the lowest cost.
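Here’s that minimal sketch, assuming a hypothetical cost() function that trains the model on the training set and returns the validation-set cost for a given pair of hyperparameter values; the hyperparameter names and ranges are placeholders.

# Exhaustive search over the Cartesian product of candidate values.
lambda1_values <- 10 ^ seq(-3, 3, by = 1)
lambda2_values <- 10 ^ seq(-3, 3, by = 1)

grid <- expand.grid(lambda1 = lambda1_values, lambda2 = lambda2_values)

grid$cost <- apply(grid, 1, function(row) cost(row["lambda1"], row["lambda2"]))

grid[which.min(grid$cost), ]

The number of evaluations grows exponentially in the number of hyperparameters, which is part of why the approaches below are appealing.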

Grid Search is about the worst algorithm one could possibly use, but it’s in widespread use because (A) machine learning experts seem to have less familiarity with derivative-free optimization techniques than with gradient-based optimization methods and (B) machine learning culture does not traditionally think of hyperparameter tuning as a formal optimization problem. Almost certainly (B) is more important than (A).

Random Search

James Bergstra’s first proposed solution was entertaining because, absent evidence that it works, it would seem almost flippant even to propose it: he suggested replacing Grid Search with Random Search. Instead of selecting a grid of values and walking through it exhaustively, you select a value for each hyperparameter independently using some probability distribution. You then evaluate the cost function given these random settings for the hyperparameters.

Since this approach seems like it might be worse than Grid Search, it’s worth pondering why it should work. James’ argument is this: most ML models have low effective dimension, which means that a small number of hyperparameters really affect the cost function and most have almost no effect. Random Search lets you explore a greater variety of settings for each hyperparameter, which allows you to find better values for the few that really matter.
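The same toy setup as above, attacked with Random Search instead (again assuming the hypothetical cost() function; the log-uniform sampling distribution is just one reasonable choice):

# Draw each hyperparameter independently rather than walking a grid.
n_draws <- 49  # comparable budget to the 7 x 7 grid above

draws <- data.frame(
  lambda1 = 10 ^ runif(n_draws, min = -3, max = 3),
  lambda2 = 10 ^ runif(n_draws, min = -3, max = 3)
)

draws$cost <- apply(draws, 1, function(row) cost(row["lambda1"], row["lambda2"]))

draws[which.min(draws$cost), ]

With the same budget, each hyperparameter gets 49 distinct candidate values instead of 7, which is exactly the property the low-effective-dimension argument exploits.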

I am sure that Paul Meehl would have a field day with this research if he were alive to hear about it.

Arbitrary Regression Problem

An alternative approach is to view our problem as one of Bayesian Optimization: we have an arbitrary function that we want to minimize which is costly to evaluate and we would like to find a good approximate minimum in a small number of evaluations.

When viewed from this perspective, the natural strategy is to regress the cost function on the settings of the hyperparameters. Because the cost function may depend on the hyperparameters in strange ways, it is wise to use very general-purpose regression methods. I’ve recently seen two clever strategies for this, one of which was presented to us at MSR:

  1. Jasper Snoek, Hugo Larochelle and Ryan Adams suggest that one use a Gaussian Process.
  2. Among other methods, Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown suggest that one use Random Forests.

From my viewpoint, it seems that any derivative-free optimization method might be worth trying. While I have yet to see it published, I’d like to see more people try the Nelder-Mead method for tuning hyperparameters.
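If you want to experiment with that last suggestion, base R’s optim() already implements Nelder-Mead. A minimal sketch, again using the hypothetical cost() function and optimizing over log-scale hyperparameters so they stay positive:

# Derivative-free search over two log-scale hyperparameters.
result <- optim(
  par = c(log_lambda1 = 0, log_lambda2 = 0),
  fn = function(p) cost(exp(p[1]), exp(p[2])),
  method = "Nelder-Mead"
)

exp(result$par)  # hyperparameter settings found by the search

Keep in mind that Nelder-Mead can stall on noisy cost surfaces, so restarting from a few different starting points is a cheap safeguard.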

Criticism 5 of NHST: p-Values Measure Effort, Not Truth

Introduction

In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that we can never pull those two dimensions apart again. Because the size of a p-value depends on two fully independent factors, measured p-values will become smaller when either (A) a researcher focuses on studying an effect whose true magnitude is greater or (B) a researcher works harder to increase the precision of the measurement of the effect they are already committed to studying.
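To see those two dimensions explicitly, consider the statistic behind a one-sample test of a mean. Writing \(\epsilon\) for the true effect size, \(\sigma\) for the standard deviation of the measurements, and \(N\) for the sample size, the expected value of the test statistic is roughly

\[
t \approx \frac{\epsilon}{\sigma / \sqrt{N}} = \frac{\epsilon \sqrt{N}}{\sigma},
\]

so the p-value shrinks either because the true effect \(\epsilon\) is larger or because the sample size \(N\) grows; the statistic itself never tells you which.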

I’d like to dwell on this second strategy for producing low p-values for a moment, because I believe it suggests that p-values — in the hands of a social scientist committed to studying effects that are never exactly zero nor likely to be large in magnitude — are simply a confusing way of reporting the precision of our measurements. Moreover, because the precision of our measurements can always be increased by gathering more data or acquiring better equipment, p-values in the social sciences are nothing more than measures of the effort and money we invest in acquiring more precise measurements, even though many of us would like to think of p-values as a measure of the truth of our hypotheses.

To hammer home the notion that p-values are, in practice, just measurements of sample size, I’m going to provide the reader with a simple algorithm that any researcher can use to ensure that essentially any study in the social sciences will, in expectation, be validated as true by the NHST approach to assessing research. For the sake of argument, we’ll insist that our hypothetical study should pass the classical threshold of p < .05.

At a high level, the approach I'll take is similar in spirit to power analysis, but it is different in the details. In power analysis, you would tell me (1) the size of the effect you expect and (2) the sample size you can expect to gather; I would then hand you back the probability of successfully passing the p-value threshold you've set. In the approach described here, you hand me only the size of the expected effect; I then hand you back the minimum sample size you need to gather so that, in expectation, you'll pass the p < .05 threshold. In other words, for any hypothesis you could possibly invent that isn't absolutely false, you can take the information I'll give you and run an experiment with the expectation that you will be able to use NHST to support your hypothesis. This experiment may be extremely expensive to run and may contribute nothing to human welfare if completed successfully, but you can run the experiment in reasonable confidence that you'll be able to publish your results.

To do this, we only need to turn a simple calculation over to our computer (a rough R sketch of the search follows the list):

  1. For any proposed effect size, epsilon, we generate many different random samples of N normally distributed data points centered at epsilon.
  2. For each sample, we run a one-sample t-test comparing the mean of that sample against 0. We store a copy of the p-value associated with this t-test.
  3. Across samples, we average the p-values derived from these t-tests to estimate the expected p-value for a fixed sample size.
  4. If the expected p-value is greater than .05, we increase N to N + 1 and try again.
  5. Eventually we will find a large enough value of N so that the expected p-value is below .05: we store this in a database and move on to the next effect size.
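In R, that search might look something like this (a reconstruction of the idea; the real script lives in the GitHub repo mentioned below):

# Monte Carlo estimate of the expected p-value from a one-sample t-test
# for a given effect size and sample size.
expected_p_value <- function(effect_size, n, n_sims = 1000) {
  p_values <- replicate(n_sims, {
    x <- rnorm(n, mean = effect_size, sd = 1)
    t.test(x, mu = 0)$p.value
  })
  mean(p_values)
}

# Smallest N whose expected p-value falls below .05. (Slow for tiny effects.)
minimum_n <- function(effect_size, start_n = 2) {
  n <- start_n
  while (expected_p_value(effect_size, n) > 0.05) {
    n <- n + 1
  }
  n
}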

If we generate a large enough number of different data sets for each of our effect sizes, our Monte Carlo estimates of the expected p-values will hopefully be accurate. With a large number of these estimates for different effect sizes, we can extrapolate a smooth curve that shows the minimum sample size required to have an expectation that the p-value for a study with that sample size and effect size will be less than .05.

To calculate these values, I've coded up this simple search algorithm in R and put the code up on GitHub. In this post, I'll simply provide a graph that shows how large a value of N is required to expect to pass a standard t-test as the size of the effect you're studying grows along a logarithmic scale starting at 0.1 and going all the way up to 10:

[Figure: Sample size required for the expected p-value to fall below .05, as a function of effect size]

As you can see, you need huge sample sizes for effects as small as 0.1, but you can use remarkably few data points when the effect is sufficiently large. No matter how small your effect may be, you can always do the hard work of gathering data in order to pass the threshold of p < .05. As long as the effect you're studying isn't non-existent, p-values just measure how much effort you've put into collecting data.

Conclusions

What should we take away from this? A variety of conclusions could be drawn from thinking about the graph above, but I'm really only concerned with one of them: if your field always studies effects that aren't exactly zero, it is always possible to design a study that can be expected to pass a significance test. As such, a significance test cannot, in principle, allow science to filter out any idea in the social sciences, except perhaps for those so outlandish that they do not have even a shred of truth in them. All that NHST does is measure how precise your measurement system is, which in turn is nothing more than a metric of how hard you were willing to work to gather data in support of your hypothesis. But the width of a confidence interval is clearly a superior tool for measuring the precision of your measurements, because it directly tells you the precision of your measurements; a p-value only tells you the precision of a measurement after you've peeled off the residue left behind by the size of the effect you're studying.

Thankfully, I think that there are many viable approaches for presenting scientific research and evaluating the research of others that are far superior to p-values. In the next and final post in this series, I'll focus on some suggestions for developing better strategies for analyzing data in the social sciences. Unfortunately, these better strategies require more than a change of methods in the social sciences; they require a change of culture. We need to do less hypothesis testing and more estimation of formal models. But I'll leave the details of that until my next post.

[P.S. Many thanks to Sean Taylor for useful feedback on a draft of this post.]