### Introduction

A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.

The simplest data set we used was the following:

This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple `OR` operation on the values of `MentionsViagra` and `MentionsNigeria`. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for `MentionsViagra` and `MentionsNigeria` should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.
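The breakdown is easy to see numerically. Here is a minimal Python sketch (the data set is a made-up stand-in for the one we used) showing that, on perfectly separated data, the logistic log-likelihood can be improved forever by scaling the coefficients up, so no finite maximum likelihood estimate exists:

```python
import math

# Made-up toy version of the spam data: two binary features whose OR gives
# the label, so the classes are perfectly linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 1]

def log_likelihood(b0, b1, b2):
    """Bernoulli log-likelihood of a logistic regression on the toy data."""
    ll = 0.0
    for (x1, x2), yi in zip(X, y):
        z = b0 + b1 * x1 + b2 * x2
        p = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll

# Along the ray (b0, b1, b2) = (-t, 2t, 2t), every point is classified
# correctly and the likelihood improves forever as t grows: no finite
# coefficient vector attains the supremum, so the MLE is "infinite".
lls = [log_likelihood(-t, 2 * t, 2 * t) for t in (1.0, 5.0, 25.0)]
assert lls[0] < lls[1] < lls[2]  # monotonically improving
assert lls[2] > -1e-4            # approaching the supremum of 0
```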

In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:

Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does produce a somewhat cryptic warning during the model-fitting step that alerts the very well-informed user that something has gone awry:

```
glm.fit: fitted probabilities numerically 0 or 1 occurred
```

It seems clear to me that R’s warning would be better off if it were substantially more verbose:

```
Warning from glm.fit():

Fitted probabilities could not be distinguished from 0's or 1's
under finite precision floating point arithmetic. As a result,
the optimization algorithm for GLM fitting may have failed to
converge. You should check whether your data set is linearly
separable.
```

### Broader Questions

Although I’ve started this piece with a very focused example of how R’s implementation of logistic regression differs from the purely mathematical definition of that model, I’m not really that interested in the details of how different pieces of software implement logistic regression. If you’re interested in learning more about that kind of thing, I’d suggest reading the excellent piece on R’s logistic regression function that can be found on the Win-Vector blog.

Instead, what interests me right now are a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?

Let’s think about each of them separately:

*Usefulness*: If you simply want to make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions from R's logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R's decision to produce some model fit is justified.

*Correctness*: If you want software to either produce mathematically correct answers or to die trying, then R's implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and teach the end user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you'd like to see, one thing is certain: you definitely do not want R to produce model fits that are mathematically incorrect.

It’s remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises because the two criteria are linked by a third consideration: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally expect computers to produce answers identical to those prescribed by theoretical mathematics.

What’s fascinating about this specific example is that there’s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R’s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you’re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.
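To make the early-stopping analogy concrete, here is a Python sketch (toy data, with an arbitrary step size and arbitrary step counts) in which plain gradient ascent on the log-likelihood of separable data, halted after a fixed number of steps, returns finite coefficients even though no finite optimum exists:

```python
import math

# Toy OR data with an explicit intercept column; the classes are separable,
# so the likelihood has no finite maximizer.
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]  # leading 1 is the intercept
y = [0, 1, 1, 1]

def log_likelihood(b):
    ll = 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, xi))))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll

def fit(n_steps, lr=0.5):
    """Gradient ascent stopped after n_steps: a crude form of early stopping."""
    b = [0.0, 0.0, 0.0]
    for _ in range(n_steps):
        grad = [0.0, 0.0, 0.0]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, xi))))
            for j in range(3):
                grad[j] += (yi - p) * xi[j]
        b = [bj + lr * gj for bj, gj in zip(b, grad)]
    return b

b_short, b_long = fit(100), fit(10_000)
assert all(math.isfinite(bj) for bj in b_long)           # finite time, finite fit
assert log_likelihood(b_long) > log_likelihood(b_short)  # but no finite optimum
```

The stopping point is doing the regularizing: the optimizer never reaches the infinite "true" MLE because we never let it run forever.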

Whatever your take is on the virtues or vices of R’s implementation of logistic regression, there’s a broad takeaway from this example that I’ve been dealing with constantly while working on Julia: *any programmer designing statistical software has to make decisions that involve personal judgment*. The need to strike a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.

One thing Python got right was having a reasonably consistent philosophy.

If you hit the correctness vs usefulness thing often with Julia, you probably want to make mostly the same decision, except in the egregious cases.

The SQL injection analogy is brilliant …

I like the SQL analogy too.

Two relevant quotes:

“A little bit of ridge never hurt.” – attributed to Trevor Hastie. Here ridge is an L2 penalty, also interpretable as a normal prior.

“Keeping things unridiculous is what regularization’s all about.” – Andrew Gelman, here:

http://andrewgelman.com/2011/12/keeping-things-unridiculous-berger-ohagan-and-me-on-weakly-informative-priors/

Glad everyone liked the SQL analogy. I totally agree that a little bit of ridge goes a long way.
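A minimal Python sketch of why a little ridge helps here (toy data as in the post, with an arbitrary small penalty): an L2 penalty makes the objective fall off for large coefficients, so the penalized fit is finite even under perfect separation.

```python
import math

# Toy OR-style spam data (made up for illustration).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 1]
lam = 0.01  # a "little bit" of ridge; the value is an arbitrary choice

def penalized_ll(t):
    """Log-likelihood minus an L2 penalty along the ray beta = (-t, 2t, 2t)."""
    b = (-t, 2 * t, 2 * t)
    ll = 0.0
    for (x1, x2), yi in zip(X, y):
        z = b[0] + b[1] * x1 + b[2] * x2
        p = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll - lam * sum(bj * bj for bj in b)

# Without the penalty, moving out along this ray improves the likelihood
# forever; with it, the objective peaks at some finite t and then falls.
values = [penalized_ll(t) for t in (1.0, 2.0, 10.0)]
assert values[1] > values[0]  # still improving near the origin
assert values[1] > values[2]  # but large coefficients are now penalized away
```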

In the case of logistic regression I’ve been thinking a lot about the way in which this kind of example touches on a general conflation of concerns in many statistical models: the coefficients of the model simultaneously define (a) the type of predictions being made and (b) the confidence with which those predictions are made. For many applications, I’m really only concerned about the direction of the unit vector that defines the separating hyperplane. In this kind of case, what I really want is to have the direction of the hyperplane reported separately from the norm of its defining normal vector.
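That split can be sketched in a few lines of Python (the coefficient values here are made up): report the unit vector, which fixes *which* predictions are made, separately from the norm, which fixes *how confident* they are.

```python
import math

# Hypothetical non-intercept coefficients from some logistic fit.
w = (4.0, 4.0)
norm = math.hypot(*w)
direction = tuple(wj / norm for wj in w)  # unit vector: the decision boundary

def prob(w, x):
    """Fitted probability for a logistic model with coefficients w."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

x = (1.0, -0.5)
# Rescaling the norm never changes which side of the hyperplane x is on ...
assert (prob(direction, x) > 0.5) == (prob(w, x) > 0.5)
# ... but the confidence of the prediction depends on the norm.
assert abs(prob(w, x) - 0.5) > abs(prob(direction, x) - 0.5)
```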

Thought you might be interested in the warnings that SAS produces for the logistic regression data:

```
WARNING: There is a complete separation of data points. The maximum
         likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning.
         Results shown are based on the last maximum likelihood iteration.
         Validity of the model fit is questionable.
```

Something that was missed is that neither parameter coefficient is significant. However, the likelihood ratio test for both parameters is highly significant, and would be for each parameter separately. This is the Hauck-Donner effect, and it results from the shape of the likelihood becoming such that the asymptotics break down badly. The LRT is probably still OK, but it is worrying.