Statistics

Overfitting

What do you think when you see a model like the one below? Does this strike you as a good model? Or as a bad model? There’s no right or wrong answer to this question, but I’d like to argue that models that are able to match white noise are typically bad things, especially when […]
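To make the core idea concrete, here is a minimal sketch (my own illustration, not code from the post): fit polynomials of increasing degree to pure white noise and watch the in-sample R² climb even though there is no signal to recover.

```julia
# Minimal sketch: flexible models can "explain" pure noise in-sample.
using Random, Statistics

Random.seed!(1)
n = 50
x = collect(range(-1, 1, length = n))
y = randn(n)                                   # white noise: no signal at all

function insample_r2(degree)
    X = hcat((x .^ d for d in 0:degree)...)    # polynomial design matrix
    beta = X \ y                               # least-squares fit
    yhat = X * beta
    1 - sum((y .- yhat) .^ 2) / sum((y .- mean(y)) .^ 2)
end

for d in (1, 5, 10, 20)
    println("degree = $d, in-sample R² = ", round(insample_r2(d), digits = 3))
end
```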

EDA Before CDA

One Paragraph Summary: Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of […]
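As a toy illustration of why summaries alone can mislead (my own sketch, not from the post): a single statistic such as the linear correlation can be essentially zero even when a scatter plot would reveal a strong relationship at a glance.

```julia
# Sketch: a near-zero correlation hiding an obvious (nonlinear) relationship.
using Random, Statistics

Random.seed!(1)
x = randn(10_000)
y = x .^ 2 .+ 0.1 .* randn(10_000)   # strong, purely nonlinear dependence

println("cor(x, y) = ", round(cor(x, y), digits = 3))   # roughly zero
# A scatter plot of (x, y) would show the parabola immediately.
```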

Playing with The Circular Law in Julia

Introduction: Statistically-trained readers of this blog will be very familiar with the Central Limit Theorem, which describes the asymptotic sampling distribution of the mean of a random vector composed of IID variables. Some of the most interesting recent work in mathematics has been focused on the development of increasingly powerful proofs of a similar law, […]
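A rough sense of the law itself can be had in a few lines (a sketch of my own, not code from the post): the eigenvalues of an n×n matrix of IID entries with mean zero and variance 1/n settle into the unit disk in the complex plane as n grows.

```julia
# Sketch: empirical check that eigenvalues of a scaled IID Gaussian matrix
# fall (almost entirely) inside the unit disk.
using LinearAlgebra, Random, Statistics

Random.seed!(1)
n = 1_000
A = randn(n, n) / sqrt(n)      # IID N(0, 1/n) entries
lambda = eigvals(A)            # complex eigenvalues

println("fraction with |λ| ≤ 1: ", mean(abs.(lambda) .<= 1.0))
```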

Will Data Scientists Be Replaced by Tools?

The Quick-and-Dirty Summary: I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) […]

DataGotham

As some of you may know already, I’m co-organizing an upcoming conference called DataGotham that’s taking place in September. To help spread the word about DataGotham, I’m cross-posting the most recent announcement below: We’d like to let you know about DataGotham: a celebration of New York City’s data community! http://datagotham.com This is an event run […]

The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers of the language. Inspired by these claims, I decided to look into this issue quantitatively by […]

My New Book: Developing, Deploying and Debugging Multi-Armed Bandit Algorithms

I’m happy to announce that I’ve started writing a new book for O’Reilly, which will focus on teaching readers how to use Multi-Armed Bandit Algorithms to build better websites. My hope is that the book can help web developers build up an intuition for the core conundrum facing anyone who wants to build a successful […]
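That conundrum is the explore/exploit trade-off. One of the simplest bandit algorithms, epsilon-greedy, illustrates it in a few lines (my own Julia sketch, not code from the book):

```julia
# Epsilon-greedy sketch: explore a random arm with probability epsilon,
# otherwise exploit the arm with the best running estimate so far.
using Random

Random.seed!(1)
true_rates = [0.05, 0.10, 0.15]    # hypothetical click-through rates
epsilon = 0.1
counts = zeros(Int, length(true_rates))
values = zeros(length(true_rates))

for t in 1:10_000
    arm = rand() < epsilon ? rand(1:length(true_rates)) : argmax(values)
    reward = rand() < true_rates[arm] ? 1.0 : 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
end

println("pulls per arm:   ", counts)
println("estimated rates: ", round.(values, digits = 3))
```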

Automatic Hyperparameter Tuning Methods

At MSR this week, we had two very good talks on algorithmic methods for tuning the hyperparameters of machine learning models. Selecting appropriate settings for hyperparameters is a constant problem in machine learning, which is somewhat surprising given how much expertise the machine learning community has in optimization theory. I suspect there’s interesting psychological and […]
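For flavor, here is a sketch of the simplest baseline, random search over a single hyperparameter (a ridge penalty scored on a held-out split). This is my own illustration, not material from the talks, which dealt with more elaborate algorithmic methods.

```julia
# Random-search sketch: sample candidate ridge penalties on a log scale
# and keep whichever scores best on held-out data.
using LinearAlgebra, Random, Statistics

Random.seed!(1)
n, p = 200, 10
X = randn(n, p)
y = X * randn(p) + randn(n)
train, test = 1:150, 151:200

ridge(Xt, yt, lambda) = (Xt'Xt + lambda * I) \ (Xt'yt)
heldout_mse(lambda) = begin
    beta = ridge(X[train, :], y[train], lambda)
    mean((y[test] .- X[test, :] * beta) .^ 2)
end

candidates = 10 .^ (4 .* rand(25) .- 2)    # lambdas between 0.01 and 100
best = candidates[argmin(map(heldout_mse, candidates))]
println("best lambda ≈ ", round(best, digits = 3))
```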

Criticism 5 of NHST: p-Values Measure Effort, Not Truth

Introduction: In the third installment of my series of criticisms of NHST, I focused on the notion that a p-value is nothing more than a one-dimensional representation of a two-dimensional space in which (1) the measured size of an effect and (2) the precision of this measurement have been combined in such a way that […]
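The slogan in the title can be demonstrated numerically (my own sketch, assuming the Distributions.jl package is installed): hold a tiny true effect fixed and the p-value of a simple z-test marches toward zero as the sample size, i.e. the effort, grows.

```julia
# Sketch: for a fixed, tiny true effect, p-values shrink as n grows.
using Distributions, Random, Statistics

Random.seed!(1)
effect = 0.01                       # small but nonzero true mean

for n in (100, 10_000, 1_000_000)
    x = effect .+ randn(n)
    z = mean(x) / (std(x) / sqrt(n))
    p = 2 * ccdf(Normal(), abs(z))  # two-sided p-value
    println("n = $n, p = ", p)
end
```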

Optimization Functions in Julia

Update 10/30/2013: Since this post was written, Julia has acquired a large body of optimization tools, which have been grouped under the heading of JuliaOpt. Over the last few weeks, I’ve made a concerted effort to develop a basic suite of optimization algorithms for Julia so that Matlab programmers used to using fminunc() and R […]
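As a pointer to where that effort ended up, here is a hedged sketch using the Optim.jl package from the JuliaOpt umbrella mentioned in the update (assuming Optim.jl is installed; this is not the original code from the post), playing roughly the role that fminunc() plays in MATLAB or optim() plays in R.

```julia
# Sketch: unconstrained minimization of the Rosenbrock function with BFGS.
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

result = optimize(rosenbrock, [0.0, 0.0], BFGS())
println("minimizer: ", Optim.minimizer(result))
println("minimum:   ", Optim.minimum(result))
```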