A Cheap Criticism of p-Values

One of these days I am going to finish my series on problems with how NHST is used in the social sciences. In the meantime, here's a cheap criticism of p-values that I came up with today.

To make sense of my complaint, you'll want to head over to Andy Gelman's blog and read the comments on his recent post about p-values. Reading them makes one thing clear: not even a large group of stats wonks can agree on how to think about p-values. How could we ever hope for understanding from the kind of people who only report p-values because they're forced to do so by their fields?

The State of Statistics in Julia

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that's already been done to make Julia usable for day-to-day data analysis, I'd like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next-generation language for data analysis.

Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

julia> require("pkg")
 
julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE
 
julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.

That will get you started with some of the core tools for doing statistical programming in Julia. You'll probably also want to install another package called "RDatasets", which provides access to 570 of the classic data sets available in R. Because this package is much larger than the others, I recommend installing it separately, after the rest:

julia> require("pkg")
 
julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
         Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
         Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype).
         Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}).
         Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer).
         Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(DataFrame,Integer) is defined first.
 
julia> using DataFrames
 
julia> using RDatasets
 
julia> iris = data("datasets", "iris")
DataFrame  (150,6)
              Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]        1          5.1         3.5          1.4         0.2    "setosa"
[2,]        2          4.9         3.0          1.4         0.2    "setosa"
[3,]        3          4.7         3.2          1.3         0.2    "setosa"
[4,]        4          4.6         3.1          1.5         0.2    "setosa"
[5,]        5          5.0         3.6          1.4         0.2    "setosa"
[6,]        6          5.4         3.9          1.7         0.4    "setosa"
[7,]        7          4.6         3.4          1.4         0.3    "setosa"
[8,]        8          5.0         3.4          1.5         0.2    "setosa"
[9,]        9          4.4         2.9          1.4         0.2    "setosa"
[10,]      10          4.9         3.1          1.5         0.1    "setosa"
[11,]      11          5.4         3.7          1.5         0.2    "setosa"
[12,]      12          4.8         3.4          1.6         0.2    "setosa"
[13,]      13          4.8         3.0          1.4         0.1    "setosa"
[14,]      14          4.3         3.0          1.1         0.1    "setosa"
[15,]      15          5.8         4.0          1.2         0.2    "setosa"
[16,]      16          5.7         4.4          1.5         0.4    "setosa"
[17,]      17          5.4         3.9          1.3         0.4    "setosa"
[18,]      18          5.1         3.5          1.4         0.3    "setosa"
[19,]      19          5.7         3.8          1.7         0.3    "setosa"
[20,]      20          5.1         3.8          1.5         0.3    "setosa"
  :
[131,]    131          7.4         2.8          6.1         1.9 "virginica"
[132,]    132          7.9         3.8          6.4         2.0 "virginica"
[133,]    133          6.4         2.8          5.6         2.2 "virginica"
[134,]    134          6.3         2.8          5.1         1.5 "virginica"
[135,]    135          6.1         2.6          5.6         1.4 "virginica"
[136,]    136          7.7         3.0          6.1         2.3 "virginica"
[137,]    137          6.3         3.4          5.6         2.4 "virginica"
[138,]    138          6.4         3.1          5.5         1.8 "virginica"
[139,]    139          6.0         3.0          4.8         1.8 "virginica"
[140,]    140          6.9         3.1          5.4         2.1 "virginica"
[141,]    141          6.7         3.1          5.6         2.4 "virginica"
[142,]    142          6.9         3.1          5.1         2.3 "virginica"
[143,]    143          5.8         2.7          5.1         1.9 "virginica"
[144,]    144          6.8         3.2          5.9         2.3 "virginica"
[145,]    145          6.7         3.3          5.7         2.5 "virginica"
[146,]    146          6.7         3.0          5.2         2.3 "virginica"
[147,]    147          6.3         2.5          5.0         1.9 "virginica"
[148,]    148          6.5         3.0          5.2         2.0 "virginica"
[149,]    149          6.2         3.4          5.4         2.3 "virginica"
[150,]    150          5.9         3.0          5.1         1.8 "virginica"
 
julia> head(iris)
DataFrame  (6,6)
          Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
[1,]    1          5.1         3.5          1.4         0.2 "setosa"
[2,]    2          4.9         3.0          1.4         0.2 "setosa"
[3,]    3          4.7         3.2          1.3         0.2 "setosa"
[4,]    4          4.6         3.1          1.5         0.2 "setosa"
[5,]    5          5.0         3.6          1.4         0.2 "setosa"
[6,]    6          5.4         3.9          1.7         0.4 "setosa"
 
julia> tail(iris)
DataFrame  (6,6)
            Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]    145          6.7         3.3          5.7         2.5 "virginica"
[2,]    146          6.7         3.0          5.2         2.3 "virginica"
[3,]    147          6.3         2.5          5.0         1.9 "virginica"
[4,]    148          6.5         3.0          5.2         2.0 "virginica"
[5,]    149          6.2         3.4          5.4         2.3 "virginica"
[6,]    150          5.9         3.0          5.1         1.8 "virginica"

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of data that may contain missing values, which are called DataVecs. A collection of DataVecs allows one to build up a DataFrame, which provides a tabular data structure like that used by R's data.frame type.

julia> load("DataFrames")
 
julia> using DataFrames
 
julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}
 
julia> df = DataFrame(data)
DataFrame  (3,2)
        Label Value
[1,]      "A"     1
[2,]      "B"     2
[3,]      "C"     3
 
julia> df["Value"]
3-element DataVec{Int64}
 
[1,2,3]
 
julia> df[1, "Value"] = NA
NA
 
 
julia> head(df)
DataFrame  (3,2)
        Label Value
[1,]      "A"    NA
[2,]      "B"     2
[3,]      "C"     3

Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

julia> load("Distributions")
julia> using Distributions
 
julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676 
  7.25008
  8.82833
 10.6911 
  7.1004 
 13.7449 
  5.96412
  8.57957
 15.2737
  :
  4.89007
 15.1509 
  6.32376
  7.83847
 14.4476 
 14.2974 
  9.74783
  9.67398
 14.4992 
 
julia> mean(x)
11.00366217730023
 
julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996
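
Because distributions are reified as types, a small set of generic function names covers most day-to-day needs. Here's a minimal sketch of that idea (not a transcript from the session above, and the exact function names may differ across versions of the package, so treat the details as assumptions):

# Illustrative sketch, not output from the post: the same generic names
# are meant to work across distribution types.
load("Distributions")
using Distributions

d = Gamma(2.0, 3.0)        # shape 2, scale 3

pdf(d, 1.0)                # density at 1.0
cdf(d, 1.0)                # cumulative probability at 1.0
quantile(d, 0.5)           # median
mean(d)                    # theoretical mean (should be 6.0)
rand(d, 5)                 # five random draws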

Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

julia> load("Optim")
julia> using Optim
 
julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#<function>
 
julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0
 
julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)
 
julia> results.minimum
2-element Float64 Array:
 10.9    
  7.29994
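
The same optimize(f, initial_guess) pattern carries over to harder problems. Here's a minimal sketch using the Rosenbrock function; the results.minimum field mirrors the output shown above, and other versions of Optim may organize their results differently:

# Illustrative sketch, not part of the session above.
load("Optim")
using Optim

# The Rosenbrock function has its minimum at (1, 1).
rosenbrock = v -> (1.0 - v[1])^2 + 100.0 * (v[2] - v[1]^2)^2

results = optimize(rosenbrock, [0.0, 0.0])
results.minimum    # should be close to [1.0, 1.0]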

MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

julia> load("MCMC")
 
julia> using MCMC
 
julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> f = x -> logpdf(d, x)
#<function>
 
julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775) 
 (22.840595204318323,-16.323492094305458) 
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)  
 (25.794565860846134,-37.08275877393945)  
 (25.898128716394307,-37.96887853221083)  
 (9.309878825853284,-32.76010551023705)   
 (30.824102772255355,-92.50490745818972)  
 (9.108789186504177,-34.38504372063516)   
 (25.547686903330494,-35.01363502992266)
  :
 (5.795001414731885,-66.98643477086263)
 (15.50115292212293,-2.518925467219337)   
 (12.046429369881345,-14.666455009726143) 
 (17.25455052645699,-0.919566865791911)   
 (25.494698549206657,-34.57747767488159)  
 (1.8340810959111111,-120.36165311809079) 
 (2.7112428736526177,-107.18901820771696) 
 (9.21203292192012,-33.54571459047587)    
 (19.12274407701784,-2.5984139591266584)
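
Because the density only needs to be known up to a normalizing constant, you can hand slice_sampler any log density you can write down. Here's a minimal sketch that mirrors the calling pattern above (illustrative rather than a transcript):

# Illustrative sketch: sample from an unnormalized log density.
load("MCMC")
using MCMC

# Log of an unnormalized Gaussian density centered at 5 with scale 2;
# the normalizing constant is deliberately omitted.
logf = x -> -0.5 * ((x - 5.0) / 2.0)^2

# Each call returns a (sample, log density) tuple, as in the output above.
draws = [slice_sampler(0.0, logf) for i in 1:100]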

NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

julia> load("Distributions")
 
julia> using Distributions
 
julia> load("NHST")
 
julia> using NHST
 
julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)
 
julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585 
 16.6036
 18.962 
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.9813
  :
 17.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715 
 
julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885 
  0.147182 
  0.96265  
  0.24282  
  1.881    
 -0.632478 
  0.539297 
  0.996562 
 -0.483302 
  0.514629
  :
  2.06249
 -0.549444 
  0.857575 
 -1.47464  
 -2.33243  
  0.510751 
 -0.381069 
 -1.49165  
  0.0521203
 
julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)

Clustering

The Clustering package provides tools for doing simple k-means style clustering.

julia> load("Clustering")
 
julia> using Clustering
 
julia> srand(1)
 
julia> n = 100
100
 
julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322 
 -1.8329     -0.101326 
  0.370699   -0.956183 
  1.31816    -1.44351  
  0.787598    0.148386 
  0.712214   -1.293    
 -1.8578     -1.06208  
 -0.746303   -0.0439182
  1.12082    -2.00616  
  0.364646   -1.09331
  :
 10.1974     10.5583
 11.0832      8.92082  
 11.5414     11.6022   
  9.0453     11.5093   
  8.86714    10.4233   
 10.7336     10.7201   
  8.60415     9.13942  
  8.62482     8.51701  
 10.5044     10.3841   
 
julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
  :
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 
julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ...  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074  ,3,422.9820560670007,true)
 
julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 :
 2
 2
 2
 2
 2
 2
 2
 2
 2
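
Since the true labels are known in this toy example, it's easy to sanity-check the output. Here's a rough sketch (not part of the session above):

# Illustrative check of how well the recovered clusters line up with
# true_assignments defined above. k-means labels the clusters arbitrarily,
# so we keep the better of the two possible labelings.
agreement = mean((results.assignments .- 1) .== true_assignments)
max(agreement, 1.0 - agreement)    # should be close to 1.0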

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package system has made me more excited than ever before about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R's ggplot2 package. Hopefully something will come into being within the next few months.

The Shape of Floating Point Random Numbers

[Updated 10/18/2012: Fixed a typo in which mantissa was replaced with exponent.]

Over the weekend, Viral Shah updated Julia’s implementation of randn() to give a 20% speed boost. Because we all wanted to test that this speed-up had not come at the expense of the validity of Julia’s RNG system, I spent some time this weekend trying to get tests up and running. I didn’t get far, but thankfully others chimed in and got things done.

Testing an RNG is serious business. In total, we've considered using four different test suites.

All of these suites can be easily used to test uniform random numbers over unsigned integers. Some are also appropriate for testing uniform random numbers over floating-point values.

But we wanted to test a Gaussian RNG. To do that, we followed Thomas et al.’s lead and mapped the Gaussian RNG’s output through a high-precision quantile function to produce uniform random floating point values. As our high-precision quantile function we ended up using the one described in Marsaglia’s 2004 JSS paper.

With that in place, I started to try modifying my previous RNG testing code. When we previously tried to test Julia’s rand() function, I got STS working on my machine and deciphered its manual well enough to run a suite of tests on a bit stream from Julia.

Unfortunately I made a fairly serious error in how I attempted to test Julia's RNG. Because STS expects a stream of random 0's and 1's, I converted random numbers into 0's and 1's by testing whether each floating point number being generated was greater than or less than 0.5. While this test is not completely wrong, it is very, very weak. Its substantive value comes from two points:

  1. It confirms that the median of the RNG is correctly positioned at 0.5.
  2. It confirms that the placement of successive entries relative to 0.5 is effectively random. In short, there is no trivial correlation between successive values.
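
For concreteness, the conversion itself is only a couple of lines. Here's a minimal sketch of the idea (illustrative, not the exact code that was fed to STS):

# Illustrative sketch of the thresholding scheme described above: map each
# uniform draw to a single bit according to whether it exceeds 0.5.
u = rand(100_000)
bitstream = [x > 0.5 ? 1 : 0 for x in u]
mean(bitstream)    # should be very close to 0.5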

Unfortunately that's about all you learn from this method. We needed something more. So I started exploring how to convert a floating point number into bits. Others had the good sense to avoid this and pushed us forward by using the TestU01 suite.

I instead got lost exploring the surprising complexity of trying to work with the individual bits of random floating point numbers. The topic is so subtle because the distribution of bits in a randomly generated floating point number is extremely far from a random source of individual bits.

For example, a uniform variable’s representation in floating point has all the following non-random properties:

  1. The sign bit is never random because uniform variables are never negative.
  2. The exponent is not random either because uniform variables are strictly contained in the interval [0, 1].
  3. Even the mantissa isn’t random. Because floating point numbers aren’t evenly spaced in the reals, the mantissa has to have complex patterns in it to simulate the equal-spacing of uniform numbers.

Inspired by all of this, I decided to get a sense for the bit pattern signature of different RNGs. Below I've plotted the patterns for uniform, normal, gamma and Cauchy variables using lines that describe the mean value of the i-th bit in the bit string. At a minimum, a completely random bit stream would have a flat horizontal line through 0.5, which many of the lines touch for a moment but never perfectly match.


Signatures

Some patterns:

  1. The first bit (shown on the far left) is the sign bit: you can clearly see which distributions are symmetric by looking for a mean value of 0.5 versus those that are strictly positive and have a mean value of 0.0.
  2. The next eleven bits are the exponent, and you can clearly see which distributions are largely concentrated in the interval [-1, 1] and which have substantial density outside of that region. These bits would clue you in to the variance of the distribution.
  3. You can see that there is a lot of non-randomness in the last few bits of the mantissa for uniform variables. There’s also non-randomness in the first few bits for all variables. I don’t yet have any real intuition for those patterns.
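
The signature curves themselves are cheap to recompute. Here's a minimal sketch in current Julia syntax (the plots above were produced with 2012-era code, so treat this as illustrative):

# Illustrative sketch: the mean value of the i-th bit over many draws.
# Bit 1 is the sign bit, bits 2-12 the exponent, bits 13-64 the mantissa.
function bit_signature(draw, n)
    counts = zeros(64)
    for _ in 1:n
        b = bitstring(draw())
        for i in 1:64
            counts[i] += (b[i] == '1')
        end
    end
    counts ./ n
end

bit_signature(rand, 100_000)     # uniform signature
bit_signature(randn, 100_000)    # normal signature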

You can go beyond looking at the signatures of mean bit patterns by looking at covariance matrices as well. Below I show these covariance matrices in a white-blue coloring scheme in which white indicates negative values, light blue indicates zero and dark blue indicates positive values. Note that these matrices, generated using R's image() function, are reflections of the more intuitive matrix ordering, in which the [1,1] entry of the matrix occurs in the top-left instead of the bottom-left.

Uniform Variables

Uniform covariance

Normal Variables

Normal covariance

Gamma Variables

Gamma covariance

Cauchy Variables

Cauchy covariance

I find these pictures really helpful for reminding me how strangely floating point numbers behave. The complexity of these images is far removed from the simplicity of the bit non-patterns in randomly generated unsigned integers, which can be produced simply by concatenating IID random bits.

Overfitting

What do you think when you see a model like the one below?


Overfitting

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that models that are able to match white noise are typically bad things, especially when you don’t have a clear cross-validation paradigm that will allow you to demonstrate that your model’s ability to match complex data isn’t a form of overfitting.

There are many objective reasons to suspect complicated models, but I'd like to offer up a subjective one. A model that fits complex data as perfectly as the model above is unlikely to be an interpretable model [1] because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on inputs is simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably provide predictions that are relatively smooth. At the least it should reveal comprehensible and memorable patterns that are non-smooth. While there are fields in which neither of these goals is possible (and others where it’s not desirable), I think the default reaction to a model fit like the one above should be: “why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can so closely fit such noisy data?”

  1. Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!

EDA Before CDA

One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you'll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the interocular trauma test.

Context

This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I've been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I've on-and-off wondered whether zip codes were originally assigned in terms of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states have some of the highest.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I've posted the code I used to explore the zip code data, in the exact order in which I produced the plots. I'll let the resulting pictures tell the rest of the story.

library(ggplot2)  # provides ggplot() and ggsave()

zipcodes <- read.csv("zipcodes.csv")
 
ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)
ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)

Picture

(Latitude, Zipcode) Scatterplot


Latitude vs zip

(Longitude, Zipcode) Scatterplot


Longitude vs zip

(Latitude, Longitude) Heatmap


Latitude vs longitude color

(Longitude, Latitude) Heatmap


Longitude vs latitude color

(Longitude, Latitude) Heatmap without Non-States


Usa color

Finder Bug in OS X

Four years after I first noticed it, Finder still has a bug in it that causes it to report a negative number of items waiting for deletion:

FinderBug

Playing with The Circular Law in Julia

Introduction

Statistically-trained readers of this blog will be very familiar with the Central Limit Theorem, which describes the asymptotic sampling distribution of the mean of a random vector composed of IID variables. Some of the most interesting recent work in mathematics has been focused on the development of increasingly powerful proofs of a similar law, called the Circular Law, which describes the asymptotic sampling distribution of the eigenvalues of a random matrix composed of IID variables.

Julia, which is funded by one of the world’s great experts on random matrix theory, is perfectly designed for generating random matrices to experiment with the Circular Law. The rest of this post shows how you might write the most naive sort of Monte Carlo study of random matrices in order to convince yourself that the Circular Law really is true.

Details

In order to show off the Circular Law, we need to be a bit more formal. We'll define a random matrix \(M\) as an \(N \times N\) matrix composed of \(N^{2}\) IID complex random variables, each with mean \(0\) and variance \(\frac{1}{N}\). Then the distribution of the \(N\) eigenvalues asymptotically converges to the uniform distribution over the unit disc. I think it's easiest to see this by doing a Monte Carlo simulation of the eigenvalues of random matrices composed of Gaussian variables for five values of \(N\):

f = open("eig.tsv", "w")
 
println(f, join(["N", "I", "J", "Re", "Im"], "\t"))
 
ns = [1, 2, 5, 25, 50]   # matrix sizes to simulate
sims = 100               # random matrices per size
 
for n in ns
  for i in 1:sims
    # Scale by 1 / sqrt(n) so that each entry has mean 0 and variance 1 / n.
    m = (1 / sqrt(n)) * randn(n, n)
    e, v = eig(m)
    # Write one row per eigenvalue with its real and imaginary parts.
    for j in 1:n
      println(f, join([n, i, j, real(e[j]), imag(e[j])], "\t"))
    end
  end
end
 
close(f)

The results from this simulation are shown below:

Circular law

These images highlight two important patterns:

  1. For the entry-level Gaussian distribution, the distribution of eigenvalues converges on the unit disc surprisingly quickly.
  2. There is a very noticeable excess of values along the real line, and that excess goes away much more slowly than the convergence towards the unit disc.

If you’re interested in exploring this topic, I’d encourage you to try two things:

  1. Obtain samples from a larger variety of values of \(N\).
  2. Construct samples from other entry-level distributions than the Gaussian distribution used here; a minimal sketch of this appears below.
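
Here's a rough sketch of that second suggestion (not part of the original script, and written in the same pre-1.0 style as the code above): use uniform entries rescaled to mean 0 and variance 1/N, then check how many eigenvalues land inside the unit disc.

# Illustrative sketch: uniform entries with mean 0 and variance 1 / n.
n = 50
m = sqrt(12 / n) * (rand(n, n) .- 0.5)
e, v = eig(m)
mean(abs(e) .<= 1.0)    # should approach 1 as n grows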

PS: Drew Conway wants me to note that Julia likes big matrices.

Will Data Scientists Be Replaced by Tools?

The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

  1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxons of the data mining world.
  2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
  3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge workers because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren't failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande's claim that many people still die each year because doctors don't wash their hands often enough. Even though Semmelweis showed the world that hygiene is a life-or-death matter in hospitals more than a century ago, we're still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.

DataGotham

As some of you may know already, I’m co-organizing an upcoming conference called DataGotham that’s taking place in September. To help spread the word about DataGotham, I’m cross-posting the most recent announcement below:

We’d like to let you know about DataGotham: a celebration of New York City’s data community!

http://datagotham.com

This is an event run by Drew Conway, Hilary Mason, Mike Dewar and John Myles White that will bring together professionals from finance to fashion and from startups to the Fortune 500. This day-and-a-half event will consist of intense discussion, networking, and sharing of wisdom, and will take place September 13th-14th at NYU Stern.

We also have four tutorials running on the afternoon of the 13th, followed by cocktails and The Great Data Extravaganza Show at the Tribeca Rooftop that evening. Tickets are on sale – we would love to see you there!

If you’d like to attend, please see the DataGotham website for more information. Since you’re a reader of this fine blog, we’ve set up a special “Friends and Family” discount that will give you 25% off the ticket price. To get the discount, you need to use the promo code “dataGothamist”.

The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers of the language. Inspired by these claims, I decided to look into this issue quantitatively by measuring the quantity of commits to R’s SVN repository that were made by each of the R Core developers. I wanted to know whether a small group of developers were overwhelmingly responsible for changes to R or whether all of the members of R Core had contributed equally. To follow along with what I did, you can grab the data and analysis scripts from GitHub.

First, I downloaded the R Core team’s SVN logs from http://developer.r-project.org/. I then used a simple regex to parse the SVN logs to count commits coming from each core committer.

After that, I tabulated the number of commits from each developer, pooling across the years 2003-2012 for which I had logs. You can see the results below, sorted by total commits in decreasing order:

Committer Total Number of Commits
ripley 22730
maechler 3605
hornik 3602
murdoch 1978
pd 1781
apache 658
jmc 599
luke 576
urbaneks 414
iacus 382
murrell 324
leisch 274
tlumley 153
rgentlem 141
root 87
duncan 81
bates 76
falcon 45
deepayan 40
plummer 28
ligges 24
martyn 20
ihaka 14

After that, I tried to visualize evolving trends over the years. First, I visualized the number of commits per developer per year:


Commits

And then I visualized the evenness of contributions from different developers by measuring the entropy of the distribution of commits on a yearly basis:


Entropy
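
For anyone who wants to recompute the entropy curve, the measure is just the Shannon entropy of the distribution you get by normalizing each year's commit counts. Here's a minimal sketch in Julia (the actual analysis scripts on GitHub are in R):

# Illustrative sketch of the entropy measure plotted above: the Shannon
# entropy of a vector of commit counts, after normalizing it into a
# distribution.
function commit_entropy(counts)
    p = counts ./ sum(counts)
    -sum([x > 0.0 ? x * log2(x) : 0.0 for x in p])
end

# Example using a few of the pooled totals from the table above
# (the plot itself uses per-year counts).
commit_entropy([22730, 3605, 3602, 1978, 1781])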

There seems to be some weak evidence either that the community is finding consensus more difficult and tending towards a single leader who makes final decisions, or that some developers are progressively dropping out because of the difficulty of achieving consensus. There is unambiguous evidence that a single developer makes the overwhelming majority of commits to R's SVN repo.

I leave it to others to understand what all of this means for R and for programming language communities in general.