The Shape of Floating Point Random Numbers

[Updated 10/18/2012: Fixed a typo in which mantissa was replaced with exponent.]

Over the weekend, Viral Shah updated Julia’s implementation of randn() to give a 20% speed boost. Because we all wanted to be sure that this speed-up had not come at the expense of the validity of Julia’s RNG system, I spent some time trying to get tests up and running. I didn’t get far, but thankfully others chimed in and got things done.

Testing an RNG is serious business. In total, we’ve considered using four different test suites, including NIST’s STS and L’Ecuyer’s TestU01, both of which come up again below.

All of these suites can easily be used to test uniform random numbers over unsigned integers. Some are also appropriate for testing uniform random numbers over floating-point values.

But we wanted to test a Gaussian RNG. To do that, we followed Thomas et al.’s lead and mapped the Gaussian RNG’s output through a high-precision quantile function to produce uniform random floating point values. As our high-precision quantile function we ended up using the one described in Marsaglia’s 2004 JSS paper.

With that in place, I started to try modifying my previous RNG testing code. When we previously tried to test Julia’s rand() function, I got STS working on my machine and deciphered its manual well enough to run a suite of tests on a bit stream from Julia.

Unfortunately, I made a fairly serious error in how I attempted to test Julia’s RNG. Because STS expects a stream of random 0’s and 1’s, I converted the random numbers into 0’s and 1’s by testing whether each floating point number generated fell above or below 0.5. While this test is not completely wrong, it is very, very weak. Its substantive value comes from two points:

  1. It confirms that the median of the RNG is correctly positioned at 0.5.
  2. It confirms that the placement of successive entries relative to 0.5 is effectively random. In short, there is no trivial correlation between successive values.

Unfortunately that’s about all you learn from this method. We needed something more. So I started exploring how to convert a floating point into bits. Others had the good sense to avoid this and pushed us forward by using the TestU01 suite.

I instead got lost exploring the surprising complexity of trying to work with the individual bits of random floating point numbers. The topic is so subtle because the distribution of bits in a randomly generated floating point number is extremely far from a random source of individual bits.

For example, a uniform variable’s representation in floating point has all the following non-random properties:

  1. The sign bit is never random because uniform variables are never negative.
  2. The exponent is not random either because uniform variables are strictly contained in the interval [0, 1].
  3. Even the mantissa isn’t random. Because floating point numbers aren’t evenly spaced in the reals, the mantissa has to have complex patterns in it to simulate the equal-spacing of uniform numbers.

Inspired by all of this, I decided to get a sense for the bit pattern signature of different RNG’s. Below I’ve plotted the patterns for uniform, normal, gamma and Cauchy variables using lines that describe the mean value of the i-th bit in the bit string. At a minimum, a completely random bit stream would have a flat horizontal line through 0.5, which many of the lines touch for a moment, but never perfectly match.


Signatures
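
If you want to compute this kind of signature yourself, a minimal sketch in present-day Julia looks like the code below. This is my own illustration rather than the code used for the plots above: the bitstring function and the Distributions package are assumptions of the sketch, and the parameters chosen for the gamma and Cauchy distributions here are arbitrary.

using Distributions

# Build an n x 64 matrix of bits: one row per sample, one column per bit
# position in the IEEE 754 double-precision representation.
function bit_matrix(d, n::Int)
    B = Array{Int}(undef, n, 64)
    for i in 1:n
        s = bitstring(Float64(rand(d)))  # 64-character string of '0'/'1'
        for j in 1:64
            B[i, j] = s[j] == '1' ? 1 : 0
        end
    end
    return B
end

# The "signature": the mean value of each bit position.
signature(d; n = 100_000) = vec(sum(bit_matrix(d, n), dims = 1) ./ n)

uniform_sig = signature(Uniform(0, 1))
normal_sig  = signature(Normal(0, 1))
gamma_sig   = signature(Gamma(2, 1))
cauchy_sig  = signature(Cauchy(0, 1))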

Some patterns:

  1. The first bit (shown on the far left) is the sign bit: you can clearly see which distributions are symmetric (a mean value of 0.5) and which are strictly positive (a mean value of 0.0).
  2. The next eleven bits are the exponent, and you can clearly see which distributions are largely concentrated in the interval [-1, 1] and which have substantial density outside of that region. These bits would clue you into the variance of the distribution.
  3. You can see that there is a lot of non-randomness in the last few bits of the mantissa for uniform variables. There’s also non-randomness in the first few bits for all variables. I don’t yet have any real intuition for those patterns.

You can go beyond the signatures of mean bit patterns by looking at covariance matrices as well. Below I show these covariance matrices in a white-blue coloring scheme in which white indicates negative values, light blue indicates zero and dark blue indicates positive values. Note that the matrices, which were generated using R’s image() function, are reflections of the more intuitive ordering in which the [1,1] entry of the matrix occurs in the top-left rather than the bottom-left.

Uniform Variables

Uniform covariance

Normal Variables

Normal covariance

Gamma Variables

Gamma covariance

Cauchy Variables

Cauchy covariance
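
Computing these covariance matrices follows the same pattern as the signature sketch earlier in this post. Assuming the hypothetical bit_matrix helper defined there (again, this is my own illustration, not the code that produced the images above, which used R’s image() function for plotting):

using Statistics

# Covariance between bit positions, one 64 x 64 matrix per distribution.
bit_covariance(d; n = 100_000) = cov(float.(bit_matrix(d, n)))

C_uniform = bit_covariance(Uniform(0, 1))
C_normal  = bit_covariance(Normal(0, 1))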

I find these pictures really helpful for reminding me how strangely floating point numbers behave. The complexity of these images is far removed from the simplicity of the bit non-patterns in randomly generated unsigned integers, which can be produced simply by concatenating IID random bits.

Overfitting

What do you think when you see a model like the one below?


Overfitting

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that models that are able to match white noise are typically bad things, especially when you don’t have a clear cross-validation paradigm that will allow you to demonstrate that your model’s ability to match complex data isn’t a form of overfitting.

There are many objective reasons to be suspicious of complicated models, but I’d like to offer up a subjective one. A model that fits complex data as closely as the model above is unlikely to be an interpretable model [1] because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on inputs is simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably make predictions that are relatively smooth, or at the very least reveal comprehensible and memorable patterns where it is non-smooth. While there are fields in which neither of these goals is achievable (and others where it’s not desirable), I think the default reaction to a model fit like the one above should be: “Why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can so closely fit such noisy data?”

  [1] Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!

EDA Before CDA

One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.

Context

This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I’ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I’ve on-and-off wondered if the zip codes were originally assigned in terms of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states had some of the highest zip codes.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I’ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I’ll let the resulting pictures tell the rest of the story.

# ggplot2 provides the plotting functions used below.
library(ggplot2)

zipcodes <- read.csv("zipcodes.csv")
 
ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)
ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)

Pictures

(Latitude, Zipcode) Scatterplot


Latitude vs zip

(Longitude, Zipcode) Scatterplot


Longitude vs zip

(Latitude, Longitude) Heatmap


Latitude vs longitude color

(Longitude, Latitude) Heatmap


Longitude vs latitude color

(Longitude, Latitude) Heatmap without Non-States


Usa color

Finder Bug in OS X

Four years after I first noticed it, Finder still has a bug in it that causes it to report a negative number of items waiting for deletion:

FinderBug

Playing with The Circular Law in Julia

Introduction

Statistically-trained readers of this blog will be very familiar with the Central Limit Theorem, which describes the asymptotic sampling distribution of the mean of a random vector composed of IID variables. Some of the most interesting recent work in mathematics has been focused on the development of increasingly powerful proofs of a similar law, called the Circular Law, which describes the asymptotic sampling distribution of the eigenvalues of a random matrix composed of IID variables.

Julia, which is funded by one of the world’s great experts on random matrix theory, is perfectly designed for generating random matrices to experiment with the Circular Law. The rest of this post shows how you might write the most naive sort of Monte Carlo study of random matrices in order to convince yourself that the Circular Law really is true.

Details

In order to show off the Circular Law, we need to be a bit more formal. We’ll define a random matrix \(M\) as an \(N \times N\) matrix composed of \(N^{2}\) IID complex random variables, each with mean \(0\) and variance \(\frac{1}{N}\). Then the distribution of the \(N\) eigenvalues asymptotically converges to the uniform distribution over the unit disc. I think it’s easiest to see this by doing a Monte Carlo simulation of the eigenvalues of random matrices composed of Gaussian variables for five values of \(N\):

# Write the simulated eigenvalues to a TSV file for later plotting.
f = open("eig.tsv", "w")

println(f, join(["N", "I", "J", "Re", "Im"], "\t"))

# Matrix sizes to sample and the number of Monte Carlo replications per size.
ns = [1, 2, 5, 25, 50]
sims = 100

for n in ns
  for i in 1:sims
    # Scaling by 1 / sqrt(n) gives each entry variance 1 / n.
    m = (1 / sqrt(n)) * randn(n, n)
    e, v = eig(m)
    # Record the real and imaginary part of every eigenvalue.
    for j in 1:n
      println(f, join([n, i, j, real(e[j]), imag(e[j])], "\t"))
    end
  end
end

close(f)

The results from this simulation are shown below:

Circular law

These images highlight two important patterns:

  1. With Gaussian entries, the distribution of eigenvalues converges on the uniform distribution over the unit disc surprisingly quickly.
  2. There is a very noticeable excess of eigenvalues along the real line that goes away much more slowly than the convergence towards the unit disc.

If you’re interested in exploring this topic, I’d encourage you to try two things:

  1. Obtain samples from a larger variety of values of \(N\).
  2. Construct samples from entry distributions other than the Gaussian distribution used here. (A minimal sketch of one option follows this list.)
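
For example, entries drawn from a uniform distribution can be recentered and rescaled so that each entry has mean 0 and variance \(\frac{1}{N}\), which is all the Circular Law asks for. The sketch below is my own illustration in the spirit of the script above, not part of the original post; it uses eigvals from the LinearAlgebra standard library, which postdates the Julia version this post was written against.

using LinearAlgebra

# Eigenvalues of an n x n matrix with IID uniform entries,
# centered and scaled to have mean 0 and variance 1 / n.
function uniform_entry_eigenvalues(n)
    # rand(n, n) has entries with mean 1/2 and variance 1/12, so
    # subtracting 1/2 and multiplying by sqrt(12 / n) gives variance 1 / n.
    m = sqrt(12 / n) .* (rand(n, n) .- 0.5)
    return eigvals(m)
end

e = uniform_entry_eigenvalues(50)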

PS: Drew Conway wants me to note that Julia likes big matrices.

Will Data Scientists Be Replaced by Tools?

The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

  1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxons of the data mining world.
  2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
  3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge workers because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren’t failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande’s claim that many people still die each year because doctors don’t wash their hands often enough. Even though Semmelweis showed the world more than a century ago that hygiene is a life-or-death matter in hospitals, we’re still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.

DataGotham

As some of you may know already, I’m co-organizing an upcoming conference called DataGotham that’s taking place in September. To help spread the word about DataGotham, I’m cross-posting the most recent announcement below:

We’d like to let you know about DataGotham: a celebration of New York City’s data community!

http://datagotham.com

This is an event run by Drew Conway, Hilary Mason, Mike Dewar and John Myles White that will bring together professionals from finance to fashion and from startups to the Fortune 500. This day-and-a-half event will consist of intense discussion, networking, and sharing of wisdom, and will take place September 13th-14th at NYU Stern.

We also have four tutorials running on the afternoon of the 13th, followed by cocktails and The Great Data Extravaganza Show at the Tribeca Rooftop that evening. Tickets are on sale – we would love to see you there!

If you’d like to attend, please see the DataGotham website for more information. Since you’re a reader of this fine blog, we’ve set up a special “Friends and Family” discount that will give you 25% off the ticket price. To get the discount, you need to use the promo code “dataGothamist”.

The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers of the language. Inspired by these claims, I decided to look into this issue quantitatively by measuring the quantity of commits to R’s SVN repository that were made by each of the R Core developers. I wanted to know whether a small group of developers were overwhelmingly responsible for changes to R or whether all of the members of R Core had contributed equally. To follow along with what I did, you can grab the data and analysis scripts from GitHub.

First, I downloaded the R Core team’s SVN logs from http://developer.r-project.org/. I then used a simple regex to parse the SVN logs to count commits coming from each core committer.
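
The scripts on GitHub do the real work, but purely as an illustration of the idea, a per-author commit count could be tallied with a few lines of Julia like the ones below. This sketch assumes the standard svn log output format, in which each entry begins with a header line such as "r60000 | ripley | 2012-07-01 ... | 2 lines", and the file name used here is hypothetical.

# Count commits per author from a dump of "svn log" output.
function count_commits(path)
    counts = Dict{String, Int}()
    for line in eachline(path)
        m = match(r"^r\d+ \| ([^|]+) \|", line)
        if m !== nothing
            author = String(strip(m.captures[1]))
            counts[author] = get(counts, author, 0) + 1
        end
    end
    return counts
end

counts = count_commits("r_svn_log.txt")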

After that, I tabulated the number of commits from each developer, pooling across the years 2003-2012 for which I had logs. You can see the results below, sorted by total commits in decreasing order:

Committer Total Number of Commits
ripley 22730
maechler 3605
hornik 3602
murdoch 1978
pd 1781
apache 658
jmc 599
luke 576
urbaneks 414
iacus 382
murrell 324
leisch 274
tlumley 153
rgentlem 141
root 87
duncan 81
bates 76
falcon 45
deepayan 40
plummer 28
ligges 24
martyn 20
ihaka 14

After that, I tried to visualize evolving trends over the years. First, I visualized the number of commits per developer per year:


Commits

And then I visualized the evenness of contributions from different developers by measuring the entropy of the distribution of commits on a yearly basis:


Entropy
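
For reference, the entropy here is just the Shannon entropy of the per-developer shares of each year’s commits: a perfectly even split across k developers scores log(k), while a year dominated by a single developer scores close to 0. A minimal sketch of the calculation (mine, not taken from the analysis scripts):

# Shannon entropy (in nats) of a vector of per-developer commit counts.
function commit_entropy(counts)
    p = counts ./ sum(counts)
    return -sum(x * log(x) for x in p if x > 0)
end

commit_entropy([22730, 3605, 3602])  # toy example with three developers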

There seems to be some weak evidence either that the community is finding consensus more difficult and tending towards a single leader who makes final decisions, or that some developers are progressively dropping out because of the difficulty of achieving consensus. There is unambiguous evidence that a single developer makes the overwhelming majority of commits to R’s SVN repo.

I leave it to others to understand what all of this means for R and for programming language communities in general.

My New Book: Developing, Deploying and Debugging Multi-Armed Bandit Algorithms

I’m happy to announce that I’ve started writing a new book for O’Reilly, which will focus on teaching readers how to use Multi-Armed Bandit Algorithms to build better websites. My hope is that the book can help web developers build up an intuition for the core conundrum facing anyone who wants to build a successful business: you have to constantly make trade-offs between (A) making decisions that are likely to yield safe, short-term successes and (B) taking chances on new opportunities that could be good long-term strategies, but could also be spectacular failures.

In the academic literature, this trade-off is usually called the Explore-Exploit dilemma. It’s been studied for decades because there is no universally applicable, out-of-the-box solution. There are several simple methods that nearly everyone knows about (like A/B testing), but these methods can lead to spectacular failures. For that reason, academics have invested a huge amount of intellectual energy into developing better approaches than the sort of randomized experimentation embodied in A/B testing.

In large part, they’ve made progress by studying an idealized version of the Explore-Exploit tradeoff called the Multi-Armed Bandit Problem. A Multi-Armed Bandit Problem is a thought experiment in which you try to make the most money possible while gambling at a casino that features only an assortment of different slot machines, each of which is called a bandit because of its propensity to steal the gambler’s money.

In the book, I’ll cover a few of the standard algorithms developed based on this thought experiment, including the \(\epsilon\)-greedy algorithm, softmax decision rules and a family of algorithms collectively called UCB. All of these algorithms will be implemented in Python, but my primary interest in writing the book isn’t to provide code for these algorithms to readers, but to give them a framework that will allow them to reason about how to solve the Explore-Exploit dilemma as it impacts their businesses and the general art of developing better websites. There are now many, many different algorithms available for solving the Multi-Armed Bandit Problem, but they all share a few core intuitions and strategies. I’m hoping to communicate those higher level intuitions to readers.
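
To give a flavor of how simple the core of these decision rules can be, here is a minimal sketch of the \(\epsilon\)-greedy algorithm. The book’s examples are in Python; the sketch below is written in Julia to match the other code on this blog, and it is my own illustration rather than an excerpt from the book.

# epsilon-greedy arm selection: explore with probability epsilon,
# otherwise exploit the arm with the highest estimated value.
function select_arm(values, epsilon)
    if rand() < epsilon
        return rand(1:length(values))  # explore: pick an arm at random
    else
        return argmax(values)          # exploit: pick the best arm so far
    end
end

# Update the running mean reward for the arm that was just played.
function update!(values, counts, arm, reward)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
end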

At the same time, I want to provide readers with a toolkit for deciding between the various options that are available to them. Web developers are very familiar with the importance of debugging, so I’d like to teach them the equivalent of debugging for Multi-Armed Bandit Problems, which involves simulation. Instead of building unit tests in which you confirm that your code can solve toy problems with clear answers, you run simulations to test that your algorithm can learn the right answer to toy problems with unambiguous right answers.
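
In that spirit, a toy simulation might draw rewards from Bernoulli arms with known success probabilities and check that the algorithm concentrates its pulls on the best arm. Again, this is a hedged sketch that builds on the hypothetical select_arm and update! functions above, not material from the book.

# Simulate epsilon-greedy against Bernoulli arms with known probabilities.
function simulate(probs, epsilon, horizon)
    values = zeros(length(probs))
    counts = zeros(Int, length(probs))
    for t in 1:horizon
        arm = select_arm(values, epsilon)
        reward = rand() < probs[arm] ? 1.0 : 0.0
        update!(values, counts, arm, reward)
    end
    return counts
end

# With arm 3 clearly best, the vast majority of pulls should land on arm 3.
simulate([0.1, 0.1, 0.9], 0.1, 10_000)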

In summary, I hope readers will leave this book with a toolbox of strategies for developing, deploying and debugging multi-armed bandit algorithms in the context of web development. At large companies, these methods are responsible for millions of dollars of increased profits, so I think it’s high time that the core ideas trickle down from elite institutions like Google, Microsoft and Facebook to other web companies.

I’m really excited about this project and look forward to sharing more details as the book progresses.

Automatic Hyperparameter Tuning Methods

At MSR this week, we had two very good talks on algorithmic methods for tuning the hyperparameters of machine learning models. Selecting appropriate settings for hyperparameters is a constant problem in machine learning, which is somewhat surprising given how much expertise the machine learning community has in optimization theory. I suspect there’s interesting psychological and sociological work to be done exploring why a problem that could be answered using known techniques wasn’t given an appropriate solution earlier.

Thankfully, the take-away message of this blog post is that this problem is starting to be understood.

A Two-Part Optimization Problem

To set up the problem of hyperparameter tuning, it’s helpful to think of the canonical model-tuning and model-testing setup used in machine learning: one splits the original data set into three parts — a training set, a validation set and a test set. If, for example, we plan to use L2-regularized linear regression to solve our problem, we will use the training set and validation set to select a value for the \(\lambda\) hyperparameter that is used to determine the strength of the penalty for large coefficients relative to the penalty for errors in predictions.

With this context in mind, we can set up our problem using five types of variables:

  1. Features: \(x\)
  2. Labels: \(y\)
  3. Parameters: \(\theta\)
  4. Hyperparameters: \(\lambda\)
  5. Cost function: \(C\)

We then estimate our parameters and hyperparameters in the following multi-step way so as to minimize our cost function:

\[
\theta_{Train}(\lambda) = \arg \min_{\theta} C(x_{Train}, y_{Train}, \theta, \lambda)
\]

\[
\lambda_{Validation}^{*} = \arg \min_{\lambda} C(x_{Validation}, y_{Validation}, \theta_{Train}(\lambda), \lambda)
\]

The final model performance is assessed using:
\[
C(x_{Test}, y_{Test}, \theta_{Train + Validation}(\lambda_{Validation}^{*}), \lambda_{Validation}^{*})
\]

This two-part minimization problem is similar in many ways to stepwise regression. Like stepwise regression, it feels like an opportunity for clean abstraction is being passed over, but it’s not clear to me (or, I think, to anyone) whether there is any analytic way to solve this problem more abstractly.

Instead, the methods we saw presented in our seminars were ways to find better approximations to \(\lambda^{*}\) using less compute time. I’ll go through the traditional approach, then describe the newer and cleaner methods.

Grid Search

Typically, hyperparameters are set using the Grid Search algorithm, which works as follows:

  1. For each hyperparameter \(p_{i}\), the researcher selects a list of values to test empirically.
  2. For each element of the Cartesian product of these values, the computer evaluates the cost function.
  3. The computer selects the hyperparameter settings from this grid with the lowest cost.

Grid Search is about the worst algorithm one could possibly use, but it’s in widespread use because (A) machine learning experts seem to have less familiarity with derivative-free optimization techniques than with gradient-based optimization methods and (B) machine learning culture does not traditionally think of hyperparameter tuning as a formal optimization problem. Almost certainly (B) is more important than (A).
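
To make the two-stage setup from earlier concrete, here is a hedged sketch of Grid Search for the L2-regularized linear regression example, with a single hyperparameter for simplicity. This is my own illustration in Julia, not code from either talk: the ridge fit is computed in closed form, the cost function is validation-set mean squared error, and the toy data is made up on the spot.

using LinearAlgebra, Statistics

# Closed-form ridge regression fit on the training set.
ridge_fit(X, y, λ) = (X' * X + λ * I) \ (X' * y)

# Mean squared prediction error, standing in for the cost function C.
cost(X, y, θ) = mean((y .- X * θ) .^ 2)

# Grid Search: evaluate every candidate λ and keep the one with lowest cost.
function grid_search(X_train, y_train, X_val, y_val, grid)
    best_λ, best_cost = first(grid), Inf
    for λ in grid
        θ = ridge_fit(X_train, y_train, λ)
        c = cost(X_val, y_val, θ)
        if c < best_cost
            best_λ, best_cost = λ, c
        end
    end
    return best_λ
end

# Toy data: 100 observations, 5 features, known coefficients plus noise.
X = randn(100, 5)
y = X * [1.0, 2.0, 0.0, -1.0, 0.5] .+ 0.1 .* randn(100)
X_train, y_train = X[1:60, :], y[1:60]
X_val, y_val = X[61:100, :], y[61:100]

λ_star = grid_search(X_train, y_train, X_val, y_val, 10.0 .^ (-4:4))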

Random Search

James Bergstra’s first proposed solution was so entertaining because, absent evidence that it works, it seems almost flippant to even propose: he suggested replacing Grid Search with Random Search. Instead of selecting a grid of values and walking through it exhaustively, you select a value for each hyperparameter independently using some probability distribution. You then evaluate the cost function given these random settings for the hyperparameters.

Since this approach seems like it might be worse than Grid Search, it’s worth pondering why it should work. James’ argument is this: most ML models have low effective dimension, which means that a small number of hyperparameters really affect the cost function and most have almost no effect. Random Search lets you explore a greater variety of settings for each hyperparameter, which allows you to find better values for the few that really matter.
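
And a correspondingly minimal sketch of Random Search, reusing the hypothetical ridge_fit and cost helpers and the toy train/validation split from the Grid Search sketch above. Each candidate λ is drawn independently, here from a log-uniform distribution, instead of being read off a fixed grid.

# Random Search: sample candidate values of λ rather than enumerating a grid.
function random_search(X_train, y_train, X_val, y_val, draws)
    best_λ, best_cost = NaN, Inf
    for _ in 1:draws
        λ = 10.0 ^ (8.0 * rand() - 4.0)  # log-uniform draw over [1e-4, 1e4]
        θ = ridge_fit(X_train, y_train, λ)
        c = cost(X_val, y_val, θ)
        if c < best_cost
            best_λ, best_cost = λ, c
        end
    end
    return best_λ
end

λ_star = random_search(X_train, y_train, X_val, y_val, 100)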

I am sure that Paul Meehl would have a field day with this research if he were alive to hear about it.

Arbitrary Regression Problem

An alternative approach is to view our problem as one of Bayesian Optimization: we have an arbitrary function that we want to minimize which is costly to evaluate and we would like to find a good approximate minimum in a small number of evaluations.

When viewed from this perspective, the natural strategy is to regress the cost function on the settings of the hyperparameters. Because the cost function may depend on the hyperparameters in strange ways, it is wise to use very general-purpose regression methods. I’ve recently seen two clever strategies for this, one of which was presented to us at MSR:

  1. Jasper Snoek, Hugo Larochelle and Ryan Adams suggest that one use a Gaussian Process. (A toy sketch of this idea appears after this list.)
  2. Among other methods, Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown suggest that one use Random Forests.
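
To give a flavor of the Gaussian Process approach, here is a heavily simplified one-dimensional sketch: fit a GP to the (λ, cost) pairs evaluated so far and pick the next λ to try by minimizing a lower confidence bound. This is my own toy illustration of the general idea, not the method from either paper, and it reuses the hypothetical ridge_fit, cost and toy-data definitions from the Grid Search sketch above.

using LinearAlgebra

# Squared-exponential kernel over log10(λ).
k(a, b; ℓ = 1.0) = exp(-(a - b)^2 / (2 * ℓ^2))

# GP posterior mean and variance at a test point x, given observations (xs, ys).
function gp_posterior(xs, ys, x; noise = 1e-4)
    K = [k(a, b) for a in xs, b in xs] + noise * I
    ks = [k(a, x) for a in xs]
    μ = dot(ks, K \ ys)
    σ2 = max(k(x, x) - dot(ks, K \ ks), 0.0)
    return μ, σ2
end

# Sequentially choose log10(λ) values by minimizing a lower confidence bound.
function gp_tune(X_train, y_train, X_val, y_val; rounds = 20)
    xs = [-4.0, 0.0, 4.0]  # initial design, on the log10(λ) scale
    ys = [cost(X_val, y_val, ridge_fit(X_train, y_train, 10.0^x)) for x in xs]
    candidates = collect(-4.0:0.1:4.0)
    for _ in 1:rounds
        scores = map(candidates) do x
            μ, σ2 = gp_posterior(xs, ys, x)
            μ - 2 * sqrt(σ2)  # low predicted cost or high uncertainty
        end
        xnext = candidates[argmin(scores)]
        push!(xs, xnext)
        push!(ys, cost(X_val, y_val, ridge_fit(X_train, y_train, 10.0^xnext)))
    end
    return 10.0 ^ xs[argmin(ys)]
end

λ_star = gp_tune(X_train, y_train, X_val, y_val)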

From my viewpoint, it seems that any derivative-free optimization method might be worth trying. While I have yet to see it published, I’d like to see more people try the Nelder-Mead method for tuning hyperparameters.