## Will Data Scientists Be Replaced by Tools?

### The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

### The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxon’s of the data mining world.
2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge workers because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren’t failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande’s claim that many people still die each year because doctors don’t wash their hands often enough. Even though Seimelweiss showed the world that hygiene is a life-or-death matter in hospitals more than a century ago, we’re still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.

## DataGotham

As some of you may know already, I’m co-organizing an upcoming conference called DataGotham that’s taking place in September. To help spread the word about DataGotham, I’m cross-posting the most recent announcement below:

We’d like to let you know about DataGotham: a celebration of New York City’s data community!

http://datagotham.com

This is an event run by Drew Conway, Hilary Mason, Mike Dewar and John Myles White that will bring together professionals from finance to fashion and from startups to the Fortune 500. This day-and-a-half event will consist of intense discussion, networking, and sharing of wisdom, and will take place September 13th-14th at NYU Stern.

We also have four tutorials running on the afternoon of the 13th, followed by cocktails and The Great Data Extravaganza Show at the Tribeca Rooftop that evening. Tickets are on sale – we would love to see you there!

If you’d like to attend, please see the DataGotham website for more information. Since you’re a reader of this fine blog, we’ve set up a special “Friends and Family” discount that will give you 25% off the ticket price. To get the discount, you need to use the promo code “dataGothamist”.

## The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers of the language. Inspired by these claims, I decided to look into this issue quantitatively by measuring the quantity of commits to R’s SVN repository that were made by each of the R Core developers. I wanted to know whether a small group of developers were overwhelmingly responsible for changes to R or whether all of the members of R Core had contributed equally. To follow along with what I did, you can grab the data and analysis scripts from GitHub.

First, I downloaded the R Core team’s SVN logs from http://developer.r-project.org/. I then used a simple regex to parse the SVN logs to count commits coming from each core committer.

After that, I tabulated the number of commits from each developer, pooling across the years 2003-2012 for which I had logs. You can see the results below, sorted by total commits in decreasing order:

Committer Total Number of Commits
ripley 22730
maechler 3605
hornik 3602
murdoch 1978
pd 1781
apache 658
jmc 599
luke 576
urbaneks 414
iacus 382
murrell 324
leisch 274
tlumley 153
rgentlem 141
root 87
duncan 81
bates 76
falcon 45
deepayan 40
plummer 28
ligges 24
martyn 20
ihaka 14

After that, I tried to visualize evolving trends over the years. First, I visualized the number of commits per developer per year:

And then I visualized the evenness of contributions from different developers by measuring the entropy of the distribution of commits on a yearly basis:

There seems to be some weak evidence that the community is either finding consensus more difficult and tending towards a single leader who makes final decisions or that some developers are progressively dropping out because of the difficulty of achieving consensus. There is unambiguous evidence that a single developer makes the overwhelming majority of commits to R’s SVN repo.

I leave it to others to understand what all of this means for R and for programming language communities in general.

## My New Book: Developing, Deploying and Debugging Multi-Armed Bandit Algorithms

I’m happy to announce that I’ve started writing a new book for O’Reilly, which will focus on teaching readers how to use Multi-Armed Bandit Algorithms to build better websites. My hope is that the book can help web developers build up an intuition for the core conundrum facing anyone who wants to build a successful business: you have to constantly make trade-offs between (A) making decisions that are likely to yield safe, short-term successes and (B) taking chances on new opportunities that could be good long-term strategies, but could also be spectacular failures.

In the academic literature, this trade-off is usually called the Explore-Exploit dilemma. It’s been studied for decades because there is no universally applicable, out-of-the-box solution. There are several simple methods that nearly everyone knows about (like A/B testing), but these methods can lead to spectacular failures. For that reason, academics have invested a huge amount of intellectual energy into developing better approaches than the sort of randomized experimentation embodied in A/B testing.

In large part, they’ve made progress by studying an idealized version of the Explore-Exploit tradeoff called the Multi-Armed Bandit Problem. A Multi-Armed Bandit Problem is a thought experiment in which you try to make the most money possible while gambling at a casino that features only an assortment of different slot machines, each of which is called a bandit because of its propensity to steal the gambler’s money.

In the book, I’ll cover a few of the standard algorithms developed based on this thought experiment, including the $$\epsilon$$-greedy algorithm, softmax decision rules and a family of algorithms collectively called UCB. All of these algorithms will be implemented in Python, but my primary interest in writing the book isn’t to provide code for these algorithms to readers, but to give them a framework that will allow them to reason about how to solve the Explore-Exploit dilemma as it impacts their businesses and the general art of developing better websites. There are now many, many different algorithms available for solving the Multi-Armed Bandit Problem, but they all share a few core intuitions and strategies. I’m hoping to communicate those higher level intuitions to readers.

At the same time, I want to provide readers with a toolkit for deciding between the various options that are available to them. Web developers are very familiar with the importance of debugging, so I’d like to teach them the equivalent of debugging for Multi-Armed Bandit Problems, which involves simulation. Instead of building unit tests in which you confirm that your code can solve toy problems with clear answers, you run simulations to test that your algorithm can learn the right answer to toy problems with unambiguous right answers.

In summary, I hope readers will leave this book with a toolbox of strategies for developing, deploying and debugging multi-armed bandit algorithms in the context of web development. At large companies, these methods are responsible for millions of dollars of increased profits, so I think it’s high time that the core ideas trickle down from elite institutions like Google, Microsoft and Facebook to other web companies.

I’m really excited about this project and look forward to sharing more details as the book progresses.

## Automatic Hyperparameter Tuning Methods

At MSR this week, we had two very good talks on algorithmic methods for tuning the hyperparameters of machine learning models. Selecting appropriate settings for hyperparameters is a constant problem in machine learning, which is somewhat surprising given how much expertise the machine learning community has in optimization theory. I suspect there’s interesting psychological and sociological work to be done exploring why a problem that could be answered using known techniques wasn’t given an appropriate solution earlier.

Thankfully, the take away message of this blog post is that this problem is starting to be understood.

### A Two-Part Optimization Problem

To set up the problem of hyperparameter tuning, it’s helpful to think of the canonical model-tuning and model-testing setup used in machine learning: one splits the original data set into three parts — a training set, a validation set and a test set. If, for example, we plan to use L2-regularized linear regression to solve our problem, we will use the training set and validation set to select a value for the $$\lambda$$ hyperparameter that is used to determine the strength of the penalty for large coefficients relative to the penalty for errors in predictions.

With this context in mind, we can set up our problem using five types of variables:

1. Features: $$x$$
2. Labels: $$y$$
3. Parameters: $$\theta$$
4. Hyperparameters: $$\lambda$$
5. Cost function: $$C$$

We then estimate our parameters and hyperparameters in the following multi-step way so as to minimize our cost function:

$\theta_{Train}(\lambda) = \arg \min_{\theta} C(x_{Train}, y_{Train}, \theta, \lambda)$

$\lambda_{Validation}^{*} = \arg \min_{\lambda} C(x_{Validation}, y_{Validation}, \theta_{Train}(\lambda), \lambda)$

The final model performance is assessed using:
$C(x_{Test}, y_{Test}, \theta_{Train + Validation}(\lambda_{Validation}^{*}), \lambda_{Validation}^{*})$

This two-part minimization problem is similar in many ways to stepwise regression. Like stepwise regression, it feels like an opportunity for clean abstraction is being passed over, but it’s not clear to me (or anyone I think) if there is any analytic way to solve this problem more abstractly.

Instead, the methods we saw presented in our seminars were ways to find better approximations to $$\lambda^{*}$$ using less compute time. I’ll go through the traditional approach, then describe the newer and cleaner methods.

### Grid Search

Typically, hyperparameters are set using the Grid Search algorithm, which works as follows:

1. For each parameter $$p_{i}$$ the researcher selects a list of values to test empirically.
2. For each element of the Cartesian product of these values, the computer evaluates the cost function.
3. The computer selects the hyperparameter settings from this grid with the lowest cost.

Grid Search is about the worst algorithm one could possibly use, but it’s in widespread use because (A) machine learning experts seem to have less familiarity with derivative-free optimization techniques than with gradient-based optimization methods and (B) machine learning culture does not traditionally think of hyperparameter tuning as a formal optimization problem. Almost certainly (B) is more important than (A).

### Random Search

James Bergstra’s first proposed solution was so entertaining because, absent evidence that it works, it seems almost flippant to even propose: he suggested replacing Grid Search with Random Search. Instead of selecting a grid of values and walking through it exhaustively, you select a value for each hyperparameter independently using some probability distribution. You then evaluate the cost function given these random settings for the hyperparameters.

Since this approach seems like it might be worst than Grid Search, it’s worth pondering why it should work. James’ argument is this: most ML models have low-effective dimension, which means that a small number of parameters really affect the cost function and most have almost no effect. Random search lets you explore a greater variety of settings for each parameter, which allows you to find better values for the few parameters that really matter.

I am sure that Paul Meehl would have a field day with this research if he were alive to hear about it.

### Arbitrary Regression Problem

An alternative approach is to view our problem as one of Bayesian Optimization: we have an arbitrary function that we want to minimize which is costly to evaluate and we would like to find a good approximate minimum in a small number of evaluations.

When viewed in this perspective, the natural strategy is to regress the cost function on the settings of the hyperparameters. Because the cost function may depend on the hyperparameters in strange ways, it is wise to use very general purpose regression methods. I’ve recently seen two clever strategies for this, one of which was presented to us at MSR:

1. Jasper Snoek, Hugo Larochelle and Ryan Adams suggest that one use a Gaussian Process.
2. Among other methods, Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown suggest that one use Random Forests.

From my viewpoint, it seems that any derivative-free optimization method might be worth trying. While I have yet to see it published, I’d like to see more people try the Nelder-Mead method for tuning hyperparameters.

## Criticism 5 of NHST: p-Values Measure Effort, Not Truth

### Introduction

In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that we can never pull those two dimensions apart again. Because the size of a p-value depends on two fully independent factors, measured p-values will become smaller when either (A) a researcher focuses on studying an effect whose true magnitude is greater or (B) a researcher works harder to increase the precision of the measurement of the effect they are already committed to studying.

I’d like to dwell on this second strategy for producing low p-values for a moment, because I believe it suggests that p-values — in the hands of a social scientist committed to studying effects that are never exactly zero nor likely to be large in magnitude — are simply a confusing way of reporting the precision of our measurements. Moreover, because the precision of our measurements can always be increased by gathering more data or acquiring better equipment, p-values in the social sciences are nothing more than measures of the effort and money we invest in acquiring more precise measurements, even though many of us would like to think of p-values as a measure of the truth of our hypotheses.

To hammer home the notion that p-values are, in practice, just measurements of sample size, I’m going to provide the reader with a simple algorithm that any researcher can use to insure that essentially any study in the social sciences will, in expectation, be validated as true by the NHST approach to assessing research. For the sake of argument, we’ll insist that our hypothetical study should pass the classical threshold of p < .05.

At a high level, the approach I'll take is similar in spirit to power analysis, but it is different in the details. In power analysis, you would tell me (1) the size of the effect you expect and (2) the sample size you can expect to gather; I would then hand you back the probability of successfully passing the p-value threshold you've set. In the approach described here, you hand me only the size of the expected effect; I then hand you back the minimum sample size you need to gather so that, in expectation, you'll pass the p < .05 threshold. In other words, for any hypothesis you could possibly invent that isn't absolutely false, you can take the information I'll give you and run an experiment with the expectation that you will be able to use NHST to support your hypothesis. This experiment may be extremely expensive to run and may contribute nothing to human welfare if completed successfully, but you can run the experiment in reasonable confidence that you'll be able to publish your results.

To do this, we only need to turn a simple calculation over to our computer:

1. For any proposed effect size, epsilon, we generate many different random samples of N normally distributed data points centered at epsilon.
2. For each sample, we run a one-sample t-test comparing the mean of that sample against 0. We store a copy of the p-value associated with this t-test.
3. Across samples, we average the p-values derived from these t-tests to estimate the expected p-value for a fixed sample size.
4. If the expected p-value is greater than .05, we increase N to N + 1 and try again.
5. Eventually we will find a large enough value of N so that the expected p-value is below .05: we store this in a database and move on to the next effect size.

If we generate a large enough number for different data sets for each of our effect sizes, our Monte Carlo estimates of the expected p-values will hopefully be accurate. With a large number for these estimates for different effect sizes, we can extrapolate out a smooth curve that shows the minimum sample size required to have an expectation that the p-value for a study with that sample size and effect size will be less than .05.

To calculate these values, I've coded up this simple search algorithm in R and put the code up on GitHub. In this post, I'll simply provide a graph that shows how large a value of N is required to expect to pass a standard t-test as the size of the effect you're studying grows along a logarithmic scale starting at 0.1 and going all the way up to 10:

As you can see, you need huge sample sizes for effects as small as 0.1, but you can use remarkably few data points when the effect is sufficiently large. No matter how small your effect may be, you can always do the hard work of gathering data in order to pass the threshold of p < .05. As long as the effect you're studying isn't non-existent, p-values just measure how much effort you've put into collecting data.

### Conclusions

What should we take away from this? A variety of conclusions could be drawn from thinking about the graph above, but I'm really only concerned with one of them: if your field always studies effects that aren't exactly zero, it is always possible to design a study that can be expected to pass a significance test. As such, a significance test cannot, in principle, allow science to filter out any idea in the social sciences, except perhaps for those so outlandish that they do not have even a shred of truth in them. All that NHST does is measure how precise your measurement system is -- which in turn is nothing more than a metric of how hard you were willing to work to gather data in support of your hypothesis. But the width of a confidence interval is clearly a superior tool for measuring the precision of your measurements, because it directly tells you the precision of your measurements; a p-value only tells you the precision of a measurement after you've pealed off the residue left behind by the size of the effect you're studying.

Thankfully, I think that there are many viable approaches for presenting scientific research and evaluating the research of others that are far superior to p-values. In the next and final post in this series, I'll focus on some suggestions for developing better strategies for analyzing data in the social sciences. Unfortunately, these better strategies require more than a change in methods in the social sciences, they require a change in culture. We need to do less hypothesis testing and more estimation of formal models. But I'll leave the details of that until my next post.

[P.S. Many thanks to Sean Taylor for useful feedback on a draft of this post.]

## Optimization Functions in Julia

Update 10/30/2013: Since this post was written, Julia has acquired a large body of optimization tools, which have been grouped under the heading of JuliaOpt.

Over the last few weeks, I’ve made a concerted effort to develop a basic suite of optimization algorithms for Julia so that Matlab programmers used to using fminunc() and R programmers used to using optim() can start to transition code over to Julia that requires access to simple optimization algorithms like L-BFGS and the Nelder-Mead method.

As of today, I believe that enough progress has been made to justify publicly announcing the first draft of an optimization suite for Julia tentatively called optim.jl. I intend to keep working on optim.jl for the next few months and expect that it will eventually become one of the first Julia packages.

For an end-user, interactions with optim.jl should take place almost entirely through the optimize() function, which I demonstrate how to use below in the context of minimizing the Rosenbrock function:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45  load("src/init.jl")   function rosenbrock(x::Vector) (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 end   function rosenbrock_gradient(x::Vector) [-2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1], 200.0 * (x[2] - x[1]^2)] end   function rosenbrock_hessian(x::Vector) h = zeros(2, 2) h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 h[1, 2] = -400.0 * x[1] h[2, 1] = -400.0 * x[1] h[2, 2] = 200.0 h end   problem = Dict() problem[:f] = rosenbrock problem[:g] = rosenbrock_gradient problem[:h] = rosenbrock_hessian problem[:initial_x] = [0.0, 0.0] problem[:solution] = [1.0, 1.0]   algorithms = ["naive_gradient_descent", "gradient_descent", "newton", "bfgs", "l-bfgs", "nelder-mead", "sa"]   for algorithm = algorithms results = optimize(problem[:f], problem[:g], problem[:h], problem[:initial_x], algorithm, 10e-8, true) print(results) end

I think the code in my current draft of optim.jl is basically usable, but I’m still a little unsure of its correctness as I’ve never developed optimization algorithms before. I’ve set up a basic suite of benchmark functions to test these functions, but I could really use some additional use cases and benchmarks. If you’re interested, please submit pull requests through GitHub that provide new functions that you want optim.jl to be able to minimize automatically. If you submit a function, please provide the function itself, its gradient, its Hessian, a starting point and the global minimum of the function.

I’ve already set up five test functions as benchmarks, which are:

1. A simple exponential function.
2. A simple parabolic function.
3. A simple 4th-degree polynomial function.
4. The Rosenbrock function.
5. The Powell function.

The code for these five functions can be seen on GitHub along with drafts of a few other functions.

Below you can see plots that depict the performance of the seven algorithms that are currently available on each of these five problems. In some of these plots the bar for Naive Gradient Descent drops out: that’s because the constant step-size gradient descent algorithm behaves pathologically and produces NaN’s at some point. I’d stay away from that function unless you’re sure it’s what you want.

Stepping away from the perils of hard-coding a step-size for gradient descent, I think the take away message from these plots is that you should use BFGS, L-BFGS or Newton’s Method if you can compute or safely approximate the Hessian of a function. Only fall back on the other four methods if you find that you really need to avoid using even approximate Hessians or can’t compute the gradient of your to-be-minimized function.

The next few steps ahead are to add a few more optimization methods including Brent’s method and conjugate gradients; then start working on constrained optimization; and finally get to work on incorporating automatic differentiation and finite differencing. With those things in place, I think we’ll have a basic optimization function for Julia that will let us keep up with R’s optim. I think we’ve got a long way to go to catch up with fminunc, but I’m sure we’ll get there by the time Julia hits a 1.0 release.

## Bayesian Nonparametrics in R

On July 25th, I’ll be presenting at the Seattle R Meetup about implementing Bayesian nonparametrics in R. If you’re not sure what Bayesian nonparametric methods are, they’re a family of methods that allow you to fit traditional statistical models, such as mixture models or latent factor models, without having to fully specify the number of clusters or latent factors in advance.

Instead of predetermining the number of clusters or latent factors to prevent a statistical algorithm from using as many clusters as there are data points in a data set, Bayesian nonparametric methods prevent overfitting by using a family of flexible priors, including the Dirichlet Process, the Chinese Restaurant Process or the Indian Buffet Process, that allow for a potentially infinite number of clusters, but nevertheless favor using a small numbers of clusters unless the data demands using more.

This flexibility has several virtues:

1. You don’t have to decide how many clusters are present in your data before seeing the results of your analysis. You just specify the strength of a penalty for creating more clusters.
2. The results of a fitted Bayesian nonparametric model can be used as informed priors when you start processing new data. This is particularly important if you are writing an online algorithm: the use of a prior like the Chinese Restaurant Process that, in principle, allows the addition of new clusters at any time means that you can take advantage of all the information you have about existing clusters without arbitrarily forcing new data into those clusters: if a new data point is unique enough, it will create its own cluster.

To give you a sense of how Bayesian nonparametric methods work, I recently coded up the newly published dp-Means algorithm, which provides a deterministic implementation of a k-means like algorithm that allows for a potentially infinite number of clusters, but, in practice, only uses a small number of clusters by penalizing the creation of additional clusters. This penalty comes from a parameter, lambda, that the user specifies before fitting the model.

The series of images below shows how the dp-Means algorithm behaves when you turn it loose on a data set with four distinct Gaussian clusters. Each image represents a higher penalty for using more clusters.

As you can see, you can shift the model from a preference for using many clusters to using only one by manipulating the lambda parameter. If you’re interested in learning more and will be in the Seattle area in late July, come out.

## The Great Julia RNG Refactor

Many readers of this blog will know that I’m a big fan of Bayesian methods, in large part because automated inference tools like JAGS allow modelers to focus on the types of structure they want to extract from data rather than worry about the algorithmic details of how they will fit their models to data. For me, the ease with which we can construct automated inference tools for Bayesian models is the greatest fruit of the MCMC Revolution. And it’s embodied in lots of great software packages like WinBUGS, JAGS, and PyMC — among severals others I know less well.

When I develop a new model for analyzing small data sets, I typically can do all of my statistical programming in JAGS. But as soon as the data set I’m working with contains more than 100,000 entries, I find that I need to write sampling code by hand. And, of course, I always write my simulation code by hand.

Traditionally, I would use R for both of these tasks because of the wide variety of RNG’s provided by R by default. The convenience of having access to functions like rbinom() and rbeta() in out-of-the-box R makes it easy to forgive R’s faults.

But there’s always been a small hole in the RNG’s provided by R: there’s no rdirichlet() function in the “stats” package. In Bayesian statistics, the Dirichlet distribution comes up over and over again, because it is the conjugate prior for models with multinomial data, which include almost all Bayesian models for text analysis. Now, it is true that you can find an R function for Dirichlet random number generation in the “MCMCpack” package, but this is inconvenient and I find that I frequently forget which package contains rdirichlet() if I haven’t used it recently.

To avoid finding myself in the same situation with Julia, I recently wrote up samplers for the multinomial and Dirichlet distributions. And, as of this week, those samplers are now in the standard branch of Julia.

These new samplers are particularly nice to work with because they can take advantage of a substantial refactor of the Julia RNG design that was worked out by several Julia developers over the past few weeks. I’m going to describe this new interface in general, so that I can show how one can use this interface to draw samples from the multinomial and Dirichlet distributions in Julia.

Inspired by SciPy, the main idea of the new approach to RNG’s in Julia is to exploit Julia’s multiple dispatch system to make it possible for functions for working with distributions to be named as generically as possible. Instead of using naming conventions as is done in R (think of rbinom, dbinom, pbinom and qbinom), Julia uses the exact same function names for all probability distributions. To generate Gaussian, Gamma or Dirichlet draws, you’ll always use the rand() function. For the probability mass or density functions, you’ll always use pmf() or pdf(). For the cumulative distribution function, you’ll always use cdf(). This list goes on and on: there are even customized mean() and var() functions for many distributions that will return the expectation and variance of a theoretical random variable, rather than the empirical mean or variance of a sample from that distribution.

Julia can make this constant naming convention work smoothly by giving each probability distribution its own explicit position in the Julia type system. For example, there is now a Binomial type defined as:

 1 2 3 4  type Binomial size::Int prob::Float64 end

To generate a random binomial sample with 10 draws with probability 0.8 of heads for each draw, you therefore run the following code:

 1 2 3  load("base/distributions.jl")   rand(Binomial(10, 0.8))

I quite like this approach. At times, it makes Julia feel a bit like Mathematica, because you can interact with abstract concepts rather than thinking about data structures, as seen in this calculation of the theoretical variance of a binomial variable:

 1 2 3  load("base/distributions.jl")   var(Binomial(10, 0.8))

That’s enough of an introduction to Julia’s new RNG design for me to show off the multinomial and Dirichlet functionality. Here we go:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40  load("base/distributions.jl")   ## # Multinomial ##   # The multinomial distribution for one draw of one object with p = [0.5, 0.4, 0.1] of drawing each of three objects. d = Multinomial(1, [0.5, 0.4, 0.1])   # Draw one object from three types, all equally probable. d = Multinomial(1, 3)   # Compute the mean and variance of the distribution under its parameterization. mean(d) var(d)   # Compute the probability mass function. pmf(d, [1, 0, 0]) pmf(d, [1, 1, 0]) pmf(d, [0, 1, 0])   # Draw a random sample. rand(d)   ## # Dirichlet ##   # The Dirichlet distribution with concentration parameter = [1.0, 2.0, 1.0]. d = Dirichlet([1.0, 2.0, 1.0])   # The mean and variance of this distribution. mean(d) var(d)   # Compute the probability density function. pdf(d, [0.1, 0.8, 0.1])   # Draw a random sample. rand(d)

I’m excited to have these tools in Julia by default now. They’ll make it much easier to write clean code for Gibbs samplers for models like LDA that make heavy use of Dirichlet priors.

I also want to note that while I was writing up these new samplers, Chris DuBois has been building some very general purpose MCMC functions that will let people write a sampler for an arbitrary Bayesian model fairly easily. You can see his examples on GitHub.

## Criticism 4 of NHST: No Mechanism for Producing Substantive Cumulative Knowledge

[Note to the Reader: This is a much rougher piece than the previous pieces because the argument is more complex. I ask that you please point out places where things are unclear and where claims are not rigorous.]

In this fourth part of my series of criticisms of NHST, I’m going to focus on broad questions of epistemology: I want to ask what types of knowledge we can hope to obtain using NHST and what types of knowledge we would need to obtain as social scientists before our work will offer the sort of real-world value that older sciences like physics and chemistry provide. Above all else, my argument is based upon my conviction that the older sciences are currently superior to the social sciences because they can make precise numerical predictions about important things in the world — after all, it is this numerical precision that allows humanity to construct powerful technologies like satellites and computers. That we have failed so far to construct comparably precise models of human behavior is no great cause for concern: we have chosen to study topics in which there are no low-hanging fruit like the equations defining Newtonian gravity’s effect on small objects — equations which could be discovered from experimental data quite easily using modern (and not so modern) statistical techniques. We social scientists are going slower and reaching less powerful conclusions than the physical scientists have because we are studying something that is intrinsically much harder to learn about. If our subject matter were easier, then we could turn our attention to developing the sort of complex mathematical machinery and rich formal theory that makes modern physics so interesting, but also so difficult for many people.

In large part, this piece can be considered an elaboration of the claim by Paul Meehl that:

The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is… basically unsound.

To do this, we will build up an idealized world in which we can clearly see that NHST provides no mechanism for the accumulation of substantive knowledge. In this idealized world, it will be easy to describe (1) the types of knowledge that we might aspire to possess and (2) the types of knowledge that NHST can actually provide. By doing so, we will show that NHST, even when practiced perfectly using infinite data, can achieve only a very weak approximation of the types of knowledge that one could attain by using the most basic quantitative modeling strategies. In addition, we will be able to show that the two different forms of NHST in popular use, which are either (1) a test against a point hypothesis that some constant m != 0 or (2) a test against a directional hypothesis that some constant m > 0, are equally unhelpful in our pursuit of cumulative knowledge because the conjunctive knowledge of having falsified two different null hypotheses about two different constants is actually much weaker than the knowledge from a single null hypothesis. In short, we will try to demonstrate by example that, even in a much simpler world than our own, NHST cannot form a major part of a successful science’s long-term research program.

To construct our idealized world in which it is easy to measure the extent of our knowledge, let us suppose that the exact structure of the universe is entirely linear and deterministic: there are n basic components of reality (which are measured as x_i‘s) and any one of these basics variables is related to every other variable in a single equation of the form:

beta_n * x_n = beta_1 * x_1 + beta_2 * x_2 + ... + beta_n-1 * x_n-1


In Nature’s caprice, a few of these variables may have zero coefficients (i.e. beta_i = 0 for some i), but we will not assume that most have zero coefficients. In an idealized world similar to the one facing the social sciences, it will generally be the case that there are almost no zero coefficients, because, as David Lykken observed, there is a crud factor in psychology linking all observables about human beings.1

Now that we have constructed our idealized world in which randomness plays no part (except when it is the result of not measuring one of the x_i‘s) and in which non-linearity never arises, let us set up an idealized social science research program. In this idealized science, we will learn about the structure of the world (which, when conducted correctly, is identical to learning about the equation above) by slowing discovering the x_i‘s that could be measured through some unspecified process and then performing randomized experiments to learn the coefficients for each. But, because we test only null hypotheses, our knowledge of each of the coefficients ultimately reduces either to (1) knowledge that beta_i != 0 or (2) knowledge that beta_i > 0 or that beta_i < 0. We will call these two types of NHST Form 1 and Form 2.

Because our knowledge about the beta_i reduces to knowledge at most of the sign of the beta_i's, if we one day wished to learn this model in full quantitative form, the final fruits after an indefinitely long progression of our idealized science's development using NHST Form 1 as our primary mechanism for verifying empirical results would be a list of the coefficients that would need to measured to build up the quantitative theory, i.e. a list of non-zero coefficients. The results of using NHST Form 2 would be a list of the coefficients that need to be measured along with the additional piece of prior information that each of those coefficients has a known sign. In short, the perfect state of knowledge from NHST would be little more than cost-saving measure for the construction of a quantitative model. If we believe in Lykken's crud factor, NHST Form 1 will not even provide us with this much information: there will be no non-zero coefficients and the list of non-zero coefficients will be empty, saving us no work and giving us no information about how to construct a quantitative theory.

But let us ignore the building of a quantitative theory for a moment and consider instead what this perfect state of knowledge would actually mean to us. To do this, we can ask how close we would come to full knowledge of the world: indeed, we could ask how many bits of information we have about the coefficients in this model. Sadly, the answer is either 0 or 1 bit per coefficient, so that we have at most n bits of knowledge after measuring n coefficients with infinitely precise application of NHST. This is in sharp contrast to a person who has measured the actual values of the coefficients: that person will know m * n bits of information about the world, where m is the precision in bits of the average coefficient's measurement. In short, there is an upper bound for any user of NHST on the amount of information they can obtain: it is n bits of information. In contrast, the quantitative theorist has an upper bound of infinite bits of information.

But this argument about bits is unlikely to convince many people who are not already very quantitative by nature: measuring things in bits only appeals to people who like to measure things in units. So let us ask a different sort of question: we can ask what sort of questions about the world we could answer using this perfect state of NHST knowledge. Again, the response is disheartening: under NHST Form 2 (which is strictly stronger than NHST Form 1), we will only be able to answer questions that can be broken apart into pieces, each of which reduces to the form: "is beta_i > 0, beta_i = 0, or beta_i < 0?" But the answer to such question does not merit the name of quantitative knowledge: it is knowledge of direction alone. This is not surprising: our NHST methodology focused on direction alone from the very start. But we will now show that perfect knowledge of direction is of shockingly little value for making predictions about the world. For instance, we could not answer any of the following questions:

(1) If I set all of the x_i to 1 except for x_n, what will be the the value of x_n? This is the idealized form of all prediction tasks: given a set of inputs, make a prediction about an output. Being unable to solve this most basic prediction task means that we have not built up a base of cumulative knowledge with any capacity to predict unseen events. But the value of science is defined in large part by its capacity to predict unseen events.

(2) Is x_1 more important than x_2 in the sense that changing x_1's value from a to b would have more effect on x_n than changing x_2's value from a to b? This is the basic form of any question asking which of our variables is most important. Being unable to answer such questions demonstrates that we have no sense of what objects in our theory really have the capacity to affect human lives.

(3) If we set all of the x_i to 1 except for x_n, what is the sign of x_n? We cannot answer even this question (which is superficially the aggregation of the basic directional question posed earlier), because, if beta_1 > 0, beta_2 < 0, and beta_i = 0 for all i > 3, then it matters whether beta_1 > beta_2 if we want to predict the sign of x_n. This is particularly damning, because it means that the accumulation of knowledge of the signs of many coefficients is not even sufficient to produce knowledge of the sign of one single output that depends on only two variables. This is the basic form of asking whether our knowledge is cumulative, rather than shattered and lacking conceptual coherence.

What can we conclude from our failure to answer such simple questions about our field of study in an idealized world in which reality has a simple mathematical structure and in which we assume that we have performed infinitely many experiments with infinite precision? I believe that we should conclude that NHST is basically unusable as a method for doing science: it amounts to what Noam Chomsky would call butterfly-collecting, except that our butterflies are variables that we have proven to have non-zero coefficients. We have no knowledge of the predictive importance of these variables nor of the scale of their effects on human experiences.

And this is true even in our idealized world in which methods like ANOVA's and linear regression would be able to tell us the exact structure of the universe. Given that those methods exist and are universally taught to social scientists, why do we persist in using NHST, a method that, if used in isolation, is strictly incapable of producing any substantive knowledge of the structure of the world?

One argument that can be given back is that we do not only have access to null hypothesis testing against zero: we can test against hypotheses like beta_1 > 1 and beta_i < 3. By using infinitely many of these enriched tests (let us call these directional test against arbitrary values NHST Form 3), we could ultimately perform a binary search of the space of all values for beta_i and slowly build up an idea of the quantitive value of each of the coefficients.

But admitting that we are interested in learning the actual values of the beta_i should not encourage use to use more elaborate NHST: it should be the first step towards deciding never to use NHST again. Why? Because the binary search paradigm described right above is noticeably inferior to strategies in which we simply use methods like linear regression to estimate the values of the beta_i's directly while placing confidence intervals around those estimated values to keep a record of the precision of our knowledge.

And this realization that simple linear regression would be a giant leap forward leads us to notice what is perhaps the most perverse part of our continued use of NHST in the social sciences: the machinery of NHST actually requires us to at least estimate all of the coefficients in our idealized equation (albeit in a possibly biased way), because t-tests, ANOVA's and linear regression actually require those estimates before one can compute a p-value. We are already producing, as a byproduct of our obsession with NHST, a strictly superior form of knowledge about the beta_i's, but we throw all of that information away! The entire substance of what I would call powerful scientific knowledge is consider a worthless intermediate step in falsifying the null and supporting one's favorite qualitative hypothesis: the estimates of the beta_i's are typically published, but simple verbal questioning of most social scientists will confirm that people do not walk away from papers they have read with this quantitative knowledge in memory.

I find it very odd that, in an age so otherwise concerned with the conservation of resources, we should be so careless with quantitative knowledge that we are willing to discard it as a valueless intermediate byproduct of the process of falsifying the null. I think this careless indifference to the more meaningful quantitative values that are estimated as a means to the end of producing a p-value reflects powerful epistemological misunderstandings on the part of the social sciences: we have much too little interest in quantitative thinking, foremost of which should be an interest in exact numerical prediction. And it is this absence of interest in exact numerical prediction that sets us apart from astronomy, a field that, unlike our own, does not even have limited access to randomized experiments. Astronomers have learned to make due using only observational data by building formal and numerically precise models that make sharp predictions whose accuracy can be rigorously tested. As I have shown, even if practiced perfectly in a simpler world, NHST would not produce a body of knowledge with value even close to the knowledge that any undergraduate in astronomy has.2 As a field, we should not accept this condition. In the long-term, scientific knowledge must be quantitative. We are already capable of producing such knowledge: indeed, we produce it as a waste product of our quest to falsify the null. If we abandon our interest in NHST and focus instead on the prediction of quantitative measurements, we will put our field on a healthier path towards catching up with the physical sciences. While we may always trail behind because of the intrinsic difficulty of our work, I also believe that the social sciences, in principle, have much more to offer to humanity than the natural sciences. We should start to live up to that promise more often. But NHST is a serious impediment to that, because it pushes towards the acquisition of a radically inferior type of knowledge that is not cumulative: there is no viable method for combining the successful falsification of the null about beta_i and the falsification of the null about beta_j into a coherent whole that will predict unseen data.

### References

Paul E. Meehl (1978) "Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology", Journal of Consulting and Clinical Psychology

David A. Freedman (1991), "Statistical Models and Shoe Leather", Sociological Methodology

Paul E. Meehl (1990), "Why Summaries of Research on Psychological Theories are Often Uninterpretable", Psychological Reports

1. I have found that this crud factor surprises some non-psychologists. To convince non-psychologists, I would note that Meehl once collected 15 variables about people and regressed every pair of them. 96% had a highly significant (p < 10e-6) correlation.
2. I note that a side product of this observation should be the conclusion that we psychologists have an excessive faith in the value of randomized experiments. It is the sharp prediction of unseen data in a way that is readily falsifiable that makes a field of inquiry as valuable to humanity as the physical sciences already are: while we social scientists do, in fact, have much such knowledge, it is attributable almost entirely to the intellectual talent and personal integrity of social scientists, all of which function despite our crippling usage of NHST.