### Introduction

In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, \(x_i\), and the summary, \(s\), using a formula of the form,

$$

\sum_i |x_i - s|^p,

$$

where \(p\) was either \(0\), \(1\) or \(2\).

### The \(L_p\) Norms

In this post, I’d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in \(\mathbb{R}^n\). Specifically, we were minimizing a measure of discrepancies that was almost identical to the \(L_p\) family of norms that can be used to measure the size of vectors. Understanding \(L_p\) norms makes it much easier to describe several modern generalizations of classical linear regression.

To extend our previous approach to the more standard notion of an \(L_p\) norm, we simply take the sum we used before and rescale things by taking a \(p^{th}\) root. This gives the formula for the \(L_p\) norm of any vector, \(v = (v_1, v_2, \ldots, v_n)\), as,

$$

|v|_p = (\sum_i |v_i|^p)^\frac{1}{p}.

$$

When \(p = 2\), this formula reduces to the familiar formula for the length of a vector:

$$

|v|_2 = \sqrt{\sum_i v_i^2}.

$$

In the last post, the vector we cared about was the vector of elementwise discrepancies, \(v = (x_1 - s, x_2 - s, \ldots, x_n - s)\). We wanted to minimize the overall size of this vector in order to make \(s\) a good summary of \(x_1, \ldots, x_n\). Because we were interested only in the minimum size of this vector, it didn’t matter that we skipped taking the \(p^{th}\) root at the end: one vector, \(v_1\), has a smaller norm than another vector, \(v_2\), only when the \(p^{th}\) power of its norm is smaller than the \(p^{th}\) power of the other’s. What was essential wasn’t the scale of the norm, but rather the value of \(p\) that we chose. Here we’ll follow that approach again. Specifically, we’ll again be working consistently with the \(p^{th}\) power of an \(L_p\) norm:

$$

|v|_p^p = (\sum_i |v_i|^p).

$$
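Both quantities are easy to compute directly. Here’s a minimal Python sketch (the function names are mine, not from the post), assuming \(p > 0\):

```python
def lp_norm(v, p):
    """The L_p norm: (sum_i |v_i|^p)^(1/p)."""
    return sum(abs(vi) ** p for vi in v) ** (1.0 / p)

def lp_norm_pow(v, p):
    """The p-th power of the L_p norm: sum_i |v_i|^p."""
    return sum(abs(vi) ** p for vi in v)

v = [3.0, -4.0]
print(lp_norm(v, 2))      # 5.0, the familiar Euclidean length
print(lp_norm_pow(v, 1))  # 7.0, the sum of absolute values
```

Because \(t \mapsto t^p\) is increasing for \(t \ge 0\), minimizing `lp_norm_pow` and minimizing `lp_norm` select the same vector, which is why dropping the root is harmless.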

### The Regression Problem

Using \(L_p\) norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we’re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.

Instead of a single list, we’ll now work with two vectors: \((x_1, x_2, \ldots, x_n)\) and \((y_1, y_2, \ldots, y_n)\). Because we like simple models, we’ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:

$$

y_i \approx \beta_0 + \beta_1 x_i.

$$

In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for \(\beta_0\) and \(\beta_1\), we have to compute a vector of discrepancies: \(v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))\). The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we’ll work with three norms: the \(L_0\), \(L_1\) and \(L_2\) norms.

As we did with the single vector case, here we’ll define discrepancies as,

$$

d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,

$$

and the total error as,

$$

E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,

$$

which is just the \(p^{th}\) power of the \(L_p\) norm.
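This total error is cheap to evaluate for any candidate pair of coefficients; a minimal Python sketch (the function name is mine, not from the post):

```python
def total_error(x, y, b0, b1, p):
    """E_p = sum_i |y_i - (b0 + b1 * x_i)|^p."""
    return sum(abs(yi - (b0 + b1 * xi)) ** p for xi, yi in zip(x, y))

# A perfect linear relationship drives every discrepancy, and hence E_p, to zero:
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(total_error(x, y, 0.0, 2.0, 2))  # 0.0
```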

### Several Forms of Regression

In general, we want to estimate a set of regression coefficients that minimizes this total error. Different forms of linear regression appear when we alter the value of \(p\). As before, let’s consider three settings:

$$

E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0

$$

$$

E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1

$$

$$

E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2

$$

What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the \(L_0\) norm doesn’t seem to provide a very useful form of linear regression, but I’d be interested to see examples of its use.
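Under the convention that \(0^0 = 0\), \(E_0\) simply counts the nonzero discrepancies, i.e. the points the line misses. A small sketch (the tolerance is my own addition, to make float comparisons sensible):

```python
def e0(x, y, b0, b1, tol=1e-12):
    """E_0: the number of points the line y = b0 + b1 * x misses,
    taking 0**0 = 0 so that exact hits contribute nothing."""
    return sum(1 for xi, yi in zip(x, y)
               if abs(yi - (b0 + b1 * xi)) > tol)

x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 5.0]
print(e0(x, y, 0.0, 1.0))  # 1: the line y = x hits two of the three points
```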

In contrast, minimizing \(E_1\) and \(E_2\) defines quite interesting and familiar forms of linear regression. We’ll start with \(E_2\) because it’s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the \(L_2\) case, we select \(\beta_0\) and \(\beta_1\) to minimize,

$$

E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,

$$

which is the summed squared error over all of the \((x_i, y_i)\) pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the \(L_2\) norm of the vector of discrepancies.
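For a single predictor, the minimizers of \(E_2\) have a familiar closed form; a minimal sketch (the function name is mine):

```python
def ols_fit(x, y):
    """Closed-form minimizers of E_2 for simple linear regression:
    b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

print(ols_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # (0.0, 2.0): the exact line y = 2x
```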

Although OLS regression is clearly king, the coefficients we get from minimizing \(E_1\) are also quite widely used: the \(L_1\) norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach is robust because large outliers, whose errors exceed \(1\) and so would be inflated further by the squaring operation used in defining OLS regression, instead only have their absolute values taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.
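To see the robustness concretely, here is a deliberately naive sketch that minimizes \(E_1\) by brute-force search over a coarse grid of coefficient pairs. Real LAD fits are computed with linear programming or iteratively reweighted least squares; the grid and data below are purely illustrative:

```python
def e1(x, y, b0, b1):
    """E_1 = sum_i |y_i - (b0 + b1 * x_i)|."""
    return sum(abs(yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))

def lad_fit(x, y, grid):
    """Return the grid point (b0, b1) with the smallest E_1."""
    return min(((b0, b1) for b0 in grid for b1 in grid),
               key=lambda b: e1(x, y, b[0], b[1]))

# One wild outlier (y = 100 at x = 5) barely moves the LAD fit:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 100.0]
grid = [i * 0.5 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
print(lad_fit(x, y, grid))  # (0.0, 1.0): the line y = x
```

On this data the OLS slope is \(20\), dragged upward by the single outlier, while the LAD fit recovers the underlying line \(y = x\).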

We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called \(L_2\) regression), we predict the mean of \(y_i\) given the value of \(x_i\). And when we use LAD regression (which would be better called \(L_1\) regression), we predict the median of \(y_i\) given the value of \(x_i\). Just as I said in the previous post, the core theoretical tool that we need to understand is the \(L_p\) norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there’s more: it also leads naturally to the two most popular forms of regularized regression.

### Regularization

If you’re not familiar with regularization, the central idea is that we don’t exclusively try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the discrepancy between \(\beta_0 + \beta_1 x_i\) and \(y_i\), but also simultaneously try to satisfy a competing requirement that \(\beta_1\) not get too large. Note that we don’t try to control the size of \(\beta_0\) because it describes the overall scale of the data rather than the relationship between \(x\) and \(y\).

Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we’ll assume that we want to minimize the \(L_p\) norm of the discrepancies and the \(L_q\) norm of the \(\beta\)’s. This means that we end up trying to minimize an expression of the form,

$$

(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).

$$

In most regularized regression models that I’ve seen in the wild, people tend to use \(p = 2\) and \(q = 1\) or \(q = 2\). When \(q = 1\), this model is called the LASSO. When \(q = 2\), this model is called ridge regression. In a future post, I’ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.
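For a single predictor, the ridge case (\(p = 2\), \(q = 2\)) even has a closed form under the post’s convention that only \(\beta_1\) is penalized; a sketch (the function name is mine):

```python
def ridge_fit(x, y, lam):
    """Minimize E_2 + lam * b1**2, penalizing only the slope.
    Solving the normal equations gives b1 = Sxy / (Sxx + lam)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / (sxx + lam)   # lam = 0 recovers the OLS slope
    return my - b1 * mx, b1

print(ridge_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.0))  # (0.0, 2.0): plain OLS
print(ridge_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 2.0))  # (2.0, 1.0): slope shrunk
```

The LASSO (\(q = 1\)) has no such simple closed form in general, because the absolute value is not differentiable at zero — which is exactly the property that lets it push coefficients all the way to zero.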

### Comments

Great and accessible post, thanks John. It would be great to also explain L2 regularization as a Gaussian prior on the parameters, and L1 regularization as a double-Exponential prior. As you noted in our exchange on Twitter, that does seem to require bringing up a bunch of probability theory, which complicates the accessibility. One way to keep this as an optimization-focused note is to comment, as an aside, that the regularized objective functions are the log-likelihood functions of a probabilistic model, where the additive regularization is really just the log of the multiplicative prior on the likelihood function, before it is logged. Given a concise explanation of this connection, I think it’s helpful to then simply “note” that the L2 regularization “happens” to correspond to a Gaussian prior and L1 “happens” to correspond to a double-Exponential. In principle, this then gives a framework for dreaming up other regularizations that might correspond to meaningful priors on the parameters.

Agreed. I think I’ll just add another post on the regularization / prior parallelism since I had a conversation about precisely that topic the other day and realized that the equivalence between the two ideas is less widely known than it should be.

Just a minor note: your formula for \(v\) under “The Regression Problem” and the final displayed equation are missing some parens: you write \(y_i - \beta_0 + \beta_1 x_i\) rather than \(y_i - (\beta_0 + \beta_1 x_i)\).

Thanks for catching both of those!

Great post! Makes everything intuitive and clear. L_0 can be applied to regularization. According to Hastie, Tibshirani and Friedman, on page 72 of ESL (2nd ed.), “… q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters.”

You’re totally right, but that involves applying an L0 penalty to the coefficients, rather than the residuals. In a future post I’ll try to describe the ways in which the L1 penalty is the best convex approximation to the L0 penalty — when applying them to the coefficients.

I’m surprised you didn’t mention the L_\infty norm (Chebyshev norm), which leads to minimax/Chebyshev regression. Minimax regression to a linear fit (or other function that depends linearly on the parameters) leads to a nice LP problem.

There is also robust least-squares, which minimizes the worst-case least-squares fit assuming some uncertainty in the data, and turns the problem into an SOCP.

The L0 “norm” (zero norm) is a little bit of a misnomer, because it is not a norm in the usual sense (e.g. it doesn’t satisfy ||ax|| = |a| ||x||), and writing it down in the way you did above seems a bit odd because it requires that you define 0^0 = 0.