## What is Correctness for Statistical Software?

### Introduction

A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.

The simplest data set we used was the following:
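
The original table is easy to reconstruct; a minimal sketch in R (the exact rows and the IsSpam column name are illustrative assumptions, not the webcast's actual data) looks like this:

```r
# Hypothetical reconstruction: IsSpam is exactly the OR of the
# MentionsViagra and MentionsNigeria indicators.
spam <- data.frame(
  MentionsViagra  = c(0, 0, 1, 1, 0, 1),
  MentionsNigeria = c(0, 1, 0, 1, 1, 0)
)
spam$IsSpam <- as.integer(spam$MentionsViagra | spam$MentionsNigeria)
```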

This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple OR operation on the values of MentionsViagra and MentionsNigeria. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for MentionsViagra and MentionsNigeria should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.

In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:
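
A sketch of that fit, using the reconstructed data above (the coefficient values will vary with the data, but the point is that they stay finite):

```r
# Complete separation: theory says the MLE coefficients are infinite,
# but glm() stops after finitely many IRLS iterations and returns
# large, finite estimates, and should trigger the warning shown below.
fit <- glm(IsSpam ~ MentionsViagra + MentionsNigeria,
           data = spam, family = binomial())
coef(fit)
```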

Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does emit a somewhat cryptic warning during the model-fitting step that alerts the very well-informed user that something has gone awry:

```
glm.fit: fitted probabilities numerically 0 or 1 occurred
```

It seems clear to me that R’s warning would be better off if it were substantially more verbose:

```
Warning from glm.fit():
  Fitted probabilities could not be distinguished from 0's or 1's
  under finite precision floating point arithmetic. As a result, the
  optimization algorithm for GLM fitting may have failed to converge.
  You should check whether your data set is linearly separable.
```
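
In the meantime, here is a hedged sketch of how you might run that separability check yourself; it mirrors the spirit of R's internal test, not its exact code:

```r
# Flag fits whose probabilities can't be distinguished from 0 or 1
# under floating point arithmetic, a symptom of linear separability.
eps <- 10 * .Machine$double.eps
p <- fitted(fit)
if (any(p < eps | p > 1 - eps)) {
  warning("fitted probabilities are numerically 0 or 1; ",
          "check whether your data set is linearly separable")
}
```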

Although I’ve started this piece with a very focused example of how R’s implementation of logistic regression differs from the purely mathematical definition of that model, I’m not really that interested in the details of how different pieces of software implement logistic regression. If you’re interested in learning more about that kind of thing, I’d suggest reading the excellent piece on R’s logistic regression function that can be found on the Win-Vector blog.

Instead, what interests me right now are a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?

Let’s think about each of them separately:

• Usefulness: If you simply want to make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions from R's logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R's decision to produce some model fit is justified.
• Correctness: If you want software to either produce mathematically correct answers or to die trying, then R's implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and teach the end user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you'd like to see happen, one thing is certain: you definitely do not want R to produce model fits that are mathematically incorrect.

It’s remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises because those two goals are mediated by a third consideration: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally expect computers to produce the answers prescribed by theoretical mathematics.

What’s fascinating about this specific example is that there’s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R’s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you’re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.
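
You can watch that implicit regularization at work with a small experiment, sketched here using the reconstructed data from above (expect convergence warnings; they are the phenomenon being illustrated):

```r
# With an effectively unreachable tolerance, glm() simply runs for
# maxit IRLS iterations, and the coefficient estimates grow without
# bound; stopping earlier therefore acts like a regularizer.
for (iters in c(5, 10, 25, 50)) {
  fit <- glm(IsSpam ~ MentionsViagra + MentionsNigeria,
             data = spam, family = binomial(),
             control = glm.control(epsilon = 1e-16, maxit = iters))
  print(coef(fit))
}
```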

Whatever your take is on the virtues or vices of R’s implementation of logistic regression, there’s a broad takeaway from this example that I’ve been grappling with constantly while working on Julia: any programmer designing statistical software has to make decisions that involve personal judgment. The need to strike a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.

## What is Economics Studying?

Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly:

• Cosma Shalizi’s comments on rereading Blanchard and Fischer’s “Lectures on Macroeconomics”:

Blanchard and Fischer is about “modern” macro, models based on agents who know what the economy is like optimizing over time, possibly under some limits. This is the DSGE style of macro, which has lately come into so much discredit — thoroughly deserved discredit. Chaikin and Lubensky is about modern condensed matter physics, especially soft condensed matter, based on principles of symmetry-breaking and phase transitions. Both books are about building stylized theoretical models and solving them to see what they imply; implicitly they are also about the considerations which go into building models in their respective domains.

What is very striking, looking at them side by side, is that while these are both books about mathematical modeling, Chaikin and Lubensky presents empirical data, compares theoretical predictions to experimental results, and goes into some detail about the considerations which lead to this sort of model for nematic liquid crystals, or that model for magnetism. There is absolutely nothing like this in Blanchard and Fischer — no data at all, no comparison of models to reality, no evidence of any kind supporting any of the models. There is not even an attempt, that I can find, to assess different macroeconomic models by comparing their qualitative predictions to each other and to historical reality. I presume that Blanchard and Fischer, as individual scholars, are not quite so indifferent to reality, but their pedagogy is.

I will leave readers to draw their own morals.

• Itzhak Gilboa’s argument that economic theory is a rhetorical apparatus rather than a set of direct predictions about the world in which we live.

## A Cheap Criticism of p-Values

One of these days I am going to finish my series on problems with how NHST is used in the social sciences. Until then, here’s a cheap criticism of p-values that I came up with today.

To make sense of my complaint, you’ll want to head over to Andy Gelman’s blog and read the comments on his recent blog post about p-values. Reading them makes one thing clear: not even a large group of stats wonks can agree on how to think about p-values. How could we ever hope for understanding from the kind of people who are only reporting p-values because they’re forced to do so by their fields?

## The State of Statistics in Julia

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

### Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.

### Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

### Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

```
julia> require("pkg")

julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE

julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.
```

That will get you started with some of the core tools for doing statistical programming in Julia. You’ll probably also want to install another package called “RDatasets”, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after the others:

```
julia> require("pkg")

julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.
```

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

```
julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}). Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any). Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype). Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}). Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer). Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer). Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer). Make sure ^(DataFrame,Integer) is defined first.

julia> using DataFrames

julia> using RDatasets

julia> iris = data("datasets", "iris")
DataFrame (150,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]     1 5.1 3.5 1.4 0.2 "setosa"
[2,]     2 4.9 3.0 1.4 0.2 "setosa"
[3,]     3 4.7 3.2 1.3 0.2 "setosa"
[4,]     4 4.6 3.1 1.5 0.2 "setosa"
[5,]     5 5.0 3.6 1.4 0.2 "setosa"
[6,]     6 5.4 3.9 1.7 0.4 "setosa"
[7,]     7 4.6 3.4 1.4 0.3 "setosa"
[8,]     8 5.0 3.4 1.5 0.2 "setosa"
[9,]     9 4.4 2.9 1.4 0.2 "setosa"
[10,]   10 4.9 3.1 1.5 0.1 "setosa"
[11,]   11 5.4 3.7 1.5 0.2 "setosa"
[12,]   12 4.8 3.4 1.6 0.2 "setosa"
[13,]   13 4.8 3.0 1.4 0.1 "setosa"
[14,]   14 4.3 3.0 1.1 0.1 "setosa"
[15,]   15 5.8 4.0 1.2 0.2 "setosa"
[16,]   16 5.7 4.4 1.5 0.4 "setosa"
[17,]   17 5.4 3.9 1.3 0.4 "setosa"
[18,]   18 5.1 3.5 1.4 0.3 "setosa"
[19,]   19 5.7 3.8 1.7 0.3 "setosa"
[20,]   20 5.1 3.8 1.5 0.3 "setosa"
  :
[131,] 131 7.4 2.8 6.1 1.9 "virginica"
[132,] 132 7.9 3.8 6.4 2.0 "virginica"
[133,] 133 6.4 2.8 5.6 2.2 "virginica"
[134,] 134 6.3 2.8 5.1 1.5 "virginica"
[135,] 135 6.1 2.6 5.6 1.4 "virginica"
[136,] 136 7.7 3.0 6.1 2.3 "virginica"
[137,] 137 6.3 3.4 5.6 2.4 "virginica"
[138,] 138 6.4 3.1 5.5 1.8 "virginica"
[139,] 139 6.0 3.0 4.8 1.8 "virginica"
[140,] 140 6.9 3.1 5.4 2.1 "virginica"
[141,] 141 6.7 3.1 5.6 2.4 "virginica"
[142,] 142 6.9 3.1 5.1 2.3 "virginica"
[143,] 143 5.8 2.7 5.1 1.9 "virginica"
[144,] 144 6.8 3.2 5.9 2.3 "virginica"
[145,] 145 6.7 3.3 5.7 2.5 "virginica"
[146,] 146 6.7 3.0 5.2 2.3 "virginica"
[147,] 147 6.3 2.5 5.0 1.9 "virginica"
[148,] 148 6.5 3.0 5.2 2.0 "virginica"
[149,] 149 6.2 3.4 5.4 2.3 "virginica"
[150,] 150 5.9 3.0 5.1 1.8 "virginica"

julia> head(iris)
DataFrame (6,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]     1 5.1 3.5 1.4 0.2 "setosa"
[2,]     2 4.9 3.0 1.4 0.2 "setosa"
[3,]     3 4.7 3.2 1.3 0.2 "setosa"
[4,]     4 4.6 3.1 1.5 0.2 "setosa"
[5,]     5 5.0 3.6 1.4 0.2 "setosa"
[6,]     6 5.4 3.9 1.7 0.4 "setosa"

julia> tail(iris)
DataFrame (6,6)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
[1,]   145 6.7 3.3 5.7 2.5 "virginica"
[2,]   146 6.7 3.0 5.2 2.3 "virginica"
[3,]   147 6.3 2.5 5.0 1.9 "virginica"
[4,]   148 6.5 3.0 5.2 2.0 "virginica"
[5,]   149 6.2 3.4 5.4 2.3 "virginica"
[6,]   150 5.9 3.0 5.1 1.8 "virginica"
```

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

### The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of missing data, which are called DataVecs. A collection of DataVecs allows one to build up a DataFrame, which provides a tabular data structure like that of R’s data.frame type.

```
julia> load("DataFrames")

julia> using DataFrames

julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}

julia> df = DataFrame(data)
DataFrame (3,2)
     Label Value
[1,]   "A"     1
[2,]   "B"     2
[3,]   "C"     3

julia> df["Value"]
3-element DataVec{Int64}

[1,2,3]

julia> df[1, "Value"] = NA
NA

julia> head(df)
DataFrame (3,2)
     Label Value
[1,]   "A"    NA
[2,]   "B"     2
[3,]   "C"     3
```

### Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

```
julia> load("Distributions")

julia> using Distributions

julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676
  7.25008
  8.82833
 10.6911
  7.1004
 13.7449
  5.96412
  8.57957
 15.2737
  ⋮
  4.89007
 15.1509
  6.32376
  7.83847
 14.4476
 14.2974
  9.74783
  9.67398
 14.4992

julia> mean(x)
11.00366217730023

julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996
```

### Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

```
julia> load("Optim")

julia> using Optim

julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#

julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0

julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)

julia> results.minimum
2-element Float64 Array:
 10.9
  7.29994
```

### MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

```
julia> load("MCMC")

julia> using MCMC

julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)

julia> f = x -> logpdf(d, x)
#

julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775)
 (22.840595204318323,-16.323492094305458)
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)
 (25.794565860846134,-37.08275877393945)
 (25.898128716394307,-37.96887853221083)
 (9.309878825853284,-32.76010551023705)
 (30.824102772255355,-92.50490745818972)
 (9.108789186504177,-34.38504372063516)
 (25.547686903330494,-35.01363502992266)
 ⋮
 (5.795001414731885,-66.98643477086263)
 (15.50115292212293,-2.518925467219337)
 (12.046429369881345,-14.666455009726143)
 (17.25455052645699,-0.919566865791911)
 (25.494698549206657,-34.57747767488159)
 (1.8340810959111111,-120.36165311809079)
 (2.7112428736526177,-107.18901820771696)
 (9.21203292192012,-33.54571459047587)
 (19.12274407701784,-2.5984139591266584)
```

### NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

```
julia> load("Distributions")

julia> using Distributions

julia> load("NHST")

julia> using NHST

julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)

julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)

julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585
 16.6036
 18.962
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.9813
  ⋮
 17.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715

julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885
  0.147182
  0.96265
  0.24282
  1.881
 -0.632478
  0.539297
  0.996562
 -0.483302
  0.514629
  ⋮
  2.06249
 -0.549444
  0.857575
 -1.47464
 -2.33243
  0.510751
 -0.381069
 -1.49165
  0.0521203

julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)
```

### Clustering

The Clustering package provides tools for doing simple k-means style clustering.

```
julia> load("Clustering")

julia> using Clustering

julia> srand(1)

julia> n = 100
100

julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322
 -1.8329     -0.101326
  0.370699   -0.956183
  1.31816    -1.44351
  0.787598    0.148386
  0.712214   -1.293
 -1.8578     -1.06208
 -0.746303   -0.0439182
  1.12082    -2.00616
  0.364646   -1.09331
  ⋮
 10.1974     10.5583
 11.0832      8.92082
 11.5414     11.6022
  9.0453     11.5093
  8.86714    10.4233
 10.7336     10.7201
  8.60415     9.13942
  8.62482     8.51701
 10.5044     10.3841

julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ... 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074
,3,422.9820560670007,true)

julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2
```

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package system has made me more excited than ever about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like the one provided by R’s ggplot2 package. Hopefully something will come into being within the next few months.

## The Shape of Floating Point Random Numbers

[Updated 10/18/2012: Fixed a typo in which mantissa was replaced with exponent.]

Over the weekend, Viral Shah updated Julia’s implementation of randn() to give a 20% speed boost. Because we all wanted to test that this speed-up had not come at the expense of the validity of Julia’s RNG system, I spent some time this weekend trying to get tests up and running. I didn’t get far, but thankfully others chimed in and got things done.

Testing an RNG is serious business. In total, we’ve considered using four different test suites, among them NIST’s STS and TestU01.

All of these suites can easily be used to test uniform random numbers over unsigned integers. Some are also appropriate for testing uniform random numbers over floating-point values.

But we wanted to test a Gaussian RNG. To do that, we followed Thomas et al.’s lead and mapped the Gaussian RNG’s output through a high-precision implementation of the normal CDF to produce uniform random floating point values. For that high-precision function we ended up using the one described in Marsaglia’s 2004 JSS paper.
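
The transformation itself is just the probability integral transform. Here is a one-line sketch using R’s built-in pnorm in place of a higher-precision implementation (fine for illustration, not for serious testing of the extreme tails):

```r
# If x really is N(0, 1), then pnorm(x) is Uniform(0, 1), so the
# standard uniform RNG test suites can be pointed at the result.
u <- pnorm(rnorm(100000))
hist(u)  # should look flat
```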

With that in place, I started to try modifying my previous RNG testing code. When we previously tried to test Julia’s rand() function, I got STS working on my machine and deciphered its manual well enough to run a suite of tests on a bit stream from Julia.

Unfortunately I made a fairly serious error in how I attempted to test Julia’s RNG. Because STS expects a stream of random 0’s and 1’s, I converted random numbers into 0’s and 1’s by testing whether the floating point numbers being generated were greater than 0.5 or less than 0.5. While this test is not completely wrong, it is very, very weak. Its substantive value comes from two points:

1. It confirms that the median of the RNG is correctly positioned at 0.5.
2. It confirms that the placement of successive entries relative to 0.5 is effectively random. In short, there is no trivial correlation between successive values.

Unfortunately, that’s about all you learn from this method. We needed something more. So I started exploring how to convert a floating point number into bits. Others had the good sense to avoid this and pushed us forward by using the TestU01 suite.

I instead got lost exploring the surprising complexity of trying to work with the individual bits of random floating point numbers. The topic is so subtle because the distribution of bits in a randomly generated floating point number is extremely far from a random source of individual bits.

For example, a uniform variable’s representation in floating point has all the following non-random properties:

1. The sign bit is never random because uniform variables are never negative.
2. The exponent is not random either because uniform variables are strictly contained in the interval [0, 1].
3. Even the mantissa isn’t random. Because floating point numbers aren’t evenly spaced in the reals, the mantissa has to have complex patterns in it to simulate the equal-spacing of uniform numbers.

Inspired by all of this, I decided to get a sense for the bit pattern signature of different RNG’s. Below I’ve plotted the patterns for uniform, normal, gamma and Cauchy variables using lines that describe the mean value of the i-th bit in the bit string. At a minimum, a completely random bit stream would have a flat horizontal line through 0.5, which many of the lines touch for a moment, but never perfectly match.
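
Here is a sketch of the computation behind those plots (mine is in R, assumes a little-endian machine, and the helper name bit_means is made up for illustration):

```r
# Estimate the mean of each of the 64 bits in the IEEE 754
# representation of draws from a distribution.
bit_means <- function(x) {
  # writeBin() returns the raw bytes of each double; rawToBits()
  # expands them into 64 bits per value, least significant bit first.
  bits <- matrix(as.integer(rawToBits(writeBin(x, raw()))),
                 ncol = 64, byrow = TRUE)
  colMeans(bits[, 64:1])  # reorder so the sign bit comes first
}

n <- 100000
signatures <- rbind(Uniform = bit_means(runif(n)),
                    Normal  = bit_means(rnorm(n)),
                    Gamma   = bit_means(rgamma(n, shape = 3)),
                    Cauchy  = bit_means(rcauchy(n)))
matplot(t(signatures), type = "l", lty = 1,
        xlab = "Bit position", ylab = "Mean bit value")
legend("topright", legend = rownames(signatures), col = 1:4, lty = 1)
```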

Some patterns:

1. The first bit (shown on the far left) is the sign bit: you can clearly see which distributions are symmetric, because their mean sign bit is 0.5, while the strictly positive distributions have a mean sign bit of 0.0.
2. The next eleven bits are the exponent, and you can clearly see which distributions are largely concentrated in the interval [-1, 1] and which have substantial density outside of that region. These bits would clue you in to the variance of the distribution.
3. You can see that there is a lot of non-randomness in the last few bits of the mantissa for uniform variables. There’s also non-randomness in the first few bits for all variables. I don’t yet have any real intuition for those patterns.

You can go beyond looking at the signatures of mean bit patterns by looking at covariance matrices as well. Below I show these covariance matrices in a white-blue coloring scheme in which white indicates negative values, light blue indicates zero and dark blue indicates positive values. Note that the matrices, generated using R’s image() function, are reflections of the more intuitive ordering in which the [1,1] entry of the matrix occurs in the top-left corner instead of the bottom-left.
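
A sketch of how such a matrix can be computed, reusing the bit extraction from above (the palette here only approximates the white-to-blue scheme just described):

```r
# Covariance between the 64 bit positions of uniform draws, displayed
# with image() as in the post.
bits <- matrix(as.integer(rawToBits(writeBin(runif(100000), raw()))),
               ncol = 64, byrow = TRUE)[, 64:1]
image(cov(bits), col = colorRampPalette(c("white", "darkblue"))(64))
```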

#### Cauchy Variables

I find these pictures really helpful for reminding me how strangely floating point numbers behave. The complexity of these images is far removed from the simplicity of the bit non-patterns in randomly generated unsigned integers, which can be produced by generating IID random bits and concatenating them together.

## Overfitting

What do you think when you see a model like the one below?

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that models that are able to match white noise are typically bad things, especially when you don’t have a clear cross-validation paradigm that will allow you to demonstrate that your model’s ability to match complex data isn’t a form of overfitting.

There are many objective reasons to be suspicious of complicated models, but I’d like to offer up a subjective one. A model that fits complex data as perfectly as the model above is unlikely to be an interpretable model¹ because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on inputs is simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably make predictions that are relatively smooth. At the least, it should reveal comprehensible and memorable patterns where it is non-smooth. While there are fields in which neither of these goals is possible (and others where it’s not desirable), I think the default reaction to a model fit like the one above should be: “Why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can fit such noisy data so closely?”
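
To make that default reaction concrete, here is a toy sketch: fit a deliberately over-flexible model to pure noise and compare in-sample error with held-out error.

```r
# A degree-25 polynomial chases white noise closely in-sample and
# falls apart on held-out points.
set.seed(1)
n <- 100
x <- seq(0, 1, length.out = n)
y <- rnorm(n)  # pure noise: there is no signal to recover
train <- sample(n, n / 2)
fit <- lm(y ~ poly(x, 25), subset = train)
mean(residuals(fit)^2)  # in-sample MSE: deceptively small
mean((y[-train] - predict(fit, data.frame(x = x[-train])))^2)  # held-out MSE: much larger
```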

1. Although it might be a great predictive model if you can confirm that the fit above matches the quality of the fit on held-out data!

## EDA Before CDA

### One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.

### Context

This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I’ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I’ve wondered on and off whether zip codes were originally assigned in terms of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states have some of the highest.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I’ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I’ll let the resulting pictures tell the rest of the story.

```r
library(ggplot2)

zipcodes <- read.csv("zipcodes.csv")

ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)

ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)

ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)

ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)

ggplot(subset(zipcodes, longitude < 0),
       aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)
```

## Finder Bug in OS X

Four years after I first noticed it, Finder still has a bug in it that causes it to report a negative number of items waiting for deletion:

## Playing with The Circular Law in Julia

### Introduction

Statistically-trained readers of this blog will be very familiar with the Central Limit Theorem, which describes the asymptotic sampling distribution of the mean of a random vector composed of IID variables. Some of the most interesting recent work in mathematics has been focused on the development of increasingly powerful proofs of a similar law, called the Circular Law, which describes the asymptotic sampling distribution of the eigenvalues of a random matrix composed of IID variables.

Julia, which is funded by one of the world’s great experts on random matrix theory, is perfectly designed for generating random matrices to experiment with the Circular Law. The rest of this post shows how you might write the most naive sort of Monte Carlo study of random matrices in order to convince yourself that the Circular Law really is true.

### Details

In order to show off the Circular Law, we need to be a bit more formal. We’ll define a random matrix $$M$$ as an $$N \times N$$ matrix composed of $$N^{2}$$ IID complex random variables, each with mean $$0$$ and variance $$\frac{1}{N}$$. Then the distribution of the $$N$$ eigenvalues of $$M$$ asymptotically converges to the uniform distribution over the unit disc. I think it’s easiest to see this by doing a Monte Carlo simulation of the eigenvalues of random matrices composed of Gaussian variables for five values of $$N$$:

```julia
f = open("eig.tsv", "w")

println(f, join(["N", "I", "J", "Re", "Im"], "\t"))

ns = [1, 2, 5, 25, 50]
sims = 100

for n in ns
  for i in 1:sims
    m = (1 / sqrt(n)) * randn(n, n)
    e, v = eig(m)
    for j in 1:n
      println(f, join([n, i, j, real(e[j]), imag(e[j])], "\t"))
    end
  end
end

close(f)
```

The results from this simulation are shown below:

These images highlight two important patterns:

1. For the entry-level Gaussian distribution, the distribution of eigenvalues converges on the unit disc surprisingly quickly.
2. There is a very noticeable excess of values along the real line that goes away much more slowly than the movement towards the unit disc.

If you’re interested in exploring this topic, I’d encourage you to try two things:

1. Obtain samples from a larger variety of values of $$N$$.
2. Construct samples from other entry-level distributions than the Gaussian distribution used here.

PS: Drew Conway wants me to note that Julia likes big matrices.

## Will Data Scientists Be Replaced by Tools?

### The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

### The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxons of the data mining world.
2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge workers because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren’t failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande’s claim that many people still die each year because doctors don’t wash their hands often enough. Even though Semmelweis showed the world that hygiene is a life-or-death matter in hospitals more than a century ago, we’re still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.