What do you think when you see a model like the one below?

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that a model able to match white noise is typically a bad thing, especially when you don’t have a clear cross-validation paradigm that lets you demonstrate that your model’s ability to match complex data isn’t a form of overfitting.

There are many objective reasons to be suspicious of complicated models, but I’d like to offer up a subjective one. A model that fits complex data as perfectly as the one above is unlikely to be an interpretable model^{1} because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on inputs are simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably make predictions that are relatively smooth; at the least, it should reveal comprehensible and memorable patterns where it is non-smooth. While there are fields in which neither of these goals is achievable (and others where it’s not desirable), I think the default reaction to a model fit like the one above should be: “Why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can fit such noisy data so closely?”
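To make the degrees-of-freedom worry concrete, here is a small sketch (illustrative, not from the post): a polynomial with as many coefficients as data points reproduces pure white noise exactly, via Lagrange interpolation, so a perfect in-sample fit to squiggly data says nothing on its own.

```python
import random

# A polynomial with as many coefficients as data points can pass
# through every point of a white-noise series exactly.
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique degree-(n-1) polynomial through (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

random.seed(0)
xs = list(range(10))
ys = [random.gauss(0, 1) for _ in xs]      # pure white noise
fitted = [lagrange_interpolate(xs, ys, x) for x in xs]

# The "model" matches the noise perfectly at every observed point.
assert all(abs(f - y) < 1e-9 for f, y in zip(fitted, ys))
```

Ten degrees of freedom, ten points, zero residual error — and no reason at all to expect it to predict an eleventh point.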

- Although it might be a great predictive model if you can confirm that the fit above reflects the quality of the fit to held-out data!↩

Another important fact about that model is that it probably has no predictive power, so it would (probably) fail horribly at leave-one-out validation. In that formulation, the purpose of a model is not to provide intuitions about the data but to make predictions about future data points.
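For readers who haven’t seen it, leave-one-out validation is simple to sketch; `fit` and `predict` below are hypothetical stand-ins for whatever model is under scrutiny:

```python
# Minimal leave-one-out sketch: hold out each point in turn, fit on
# the rest, and score the prediction for the held-out point.
def leave_one_out_mse(xs, ys, fit, predict):
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        total += (predict(model, xs[i]) - ys[i]) ** 2
    return total / len(xs)

# Toy model for demonstration: always predict the training mean.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, x: model

# A constant series is predicted perfectly; pure noise would not be.
loo = leave_one_out_mse([1, 2, 3, 4], [5.0, 5.0, 5.0, 5.0],
                        fit_mean, predict_mean)
assert loo == 0.0
```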

The thing that really surprised me when I took an ML course was that the usual proxy for complexity, the number of degrees of freedom, is actually pretty much worthless. For example, if your problem is binary classification of points lying along a single dimension, sine waves (2 degrees of freedom) can capture literally any pattern of classification (the technical term here is “shatter”) on an infinite set of points (which is to say, their VC dimension is infinite). Of course, such a model also fails your subjective test, as it would be uninterpretable.
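John’s example can be checked numerically. The sketch below uses the standard construction (assumed here, not spelled out in the comment): place the points at x_i = 2^{-i} and encode any desired labelling into a single frequency θ, so that the sign of sin(θx) reproduces the labels.

```python
import math

def shattering_frequency(labels):
    """Given binary labels y_1..y_n for the points x_i = 2**-i, return a
    frequency theta such that sign(sin(theta * x_i)) reproduces the
    labels -- the classic infinite-VC-dimension construction."""
    return math.pi * (1 + sum(y * 2**i for i, y in enumerate(labels, start=1)))

def classify(theta, x):
    # Label 1 where sin(theta * x) < 0, label 0 otherwise.
    return 1 if math.sin(theta * x) < 0 else 0

labels = [1, 0, 1, 1, 0, 0, 1, 0]                    # an arbitrary labelling
points = [2.0**-i for i in range(1, len(labels) + 1)]
theta = shattering_frequency(labels)
predicted = [classify(theta, x) for x in points]

assert predicted == labels   # one parameter reproduces all 8 labels
```

One tunable frequency fits every one of the 2^8 possible labellings of these eight points — exactly why counting parameters misleads.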

When I saw the data above, I thought it looked quite similar to some daily time-series data that we use at my job. The thing is, we build rather simple models that fit such data quite nicely because there is a lot of explanatory power in weekly and monthly periodicity. That periodicity makes the data look noisy when in fact it can be explained fairly well. I guess my point is, I would not assume that all squiggly lines represent pure noise.
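Ian’s point is easy to illustrate with synthetic data (the weekly pattern below is made up): a simple day-of-week mean model explains most of the variance of a series that looks noisy at a glance.

```python
import random

random.seed(1)
weekly_effect = [3, -2, 1, 4, -1, -3, 0]   # hypothetical weekly pattern
series = [weekly_effect[d % 7] + random.gauss(0, 0.3) for d in range(140)]

# "Model": the average value for each day of the week.
day_means = [sum(series[d::7]) / len(series[d::7]) for d in range(7)]
residuals = [y - day_means[d % 7] for d, y in enumerate(series)]

var_total = sum(y**2 for y in series) / len(series)
var_resid = sum(r**2 for r in residuals) / len(residuals)

# Seven fitted means soak up the bulk of the apparent "noise".
assert var_resid < 0.2 * var_total
```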

@John, I’m glad you brought up VC-dimension. I wish that psychologists with an interest in modeling were taught in our courses about VC-dimension and Rademacher complexity since they’re such powerful concepts. You’re completely right that the sine wave example is counter-intuitive. (I actually had to look up the proof again since I remembered the example, but couldn’t recall the details of the argument.) And I agree about interpretability. My own preference is pretty strong for starting with a search for monotonic relationships in data and only then looking for linearity, convexity or concavity. After that I start to be suspicious.

@Ian, You’re right: there are plenty of examples of time series data in which this sort of white noise appearance is perfectly normal. That’s why I hedged my point originally. If you have the right type of periodic inputs, your outputs can easily have this sort of appearance.

The reason the sine example is so counterintuitive is that infinite-precision real numbers defy intuition so systematically. One of my colleagues has argued that the number of parameters is a good proxy for capacity if you measure parameters in bits. If I quantize all my model’s parameters (perhaps even using IEEE floating-point numbers), then there is a hard limit on the size of my hypothesis space for a given number of parameters.
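George’s observation can be turned into a back-of-the-envelope bound: k parameters of b bits each index at most 2^{k·b} distinct functions, and a finite class of N functions has VC dimension at most log2(N).

```python
# Illustrative capacity bound under quantization: k parameters of b
# bits index at most 2**(k*b) distinct functions, and a class of N
# functions can shatter at most log2(N) points.
def vc_upper_bound(num_params, bits_per_param):
    return num_params * bits_per_param

# One IEEE double-precision parameter: at most 2**64 hypotheses, so a
# quantized sine classifier can shatter at most 64 points -- finite,
# but still enormous for "one degree of freedom".
assert vc_upper_bound(1, 64) == 64
```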

You’re totally right, George: the infinite precision of the reals is what breaks intuition. Being able to exploit one parameter to get an infinitely refined partition of a line is something you could never do with finite precision numbers. I suspect that a 64-bit floating point number will still give you a model with excessively high VC-dimension.

“… And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”

“Have you used it much?” I enquired.

“It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

From Sylvie and Bruno Concluded by Lewis Carroll – published in 1893.

That quote is great, Anders! Thanks for posting it.