### Introduction / Warning

Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can’t assume that students have much mathematical maturity, the close relationship between these three summary statistics can’t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.

To do so, I’ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let’s all agree to assume that \(0^0 = 0\). In particular, we’ll want to assume that \(|0|^0 = 0\), even though \(|\epsilon|^0 = 1\) for all \(\epsilon > 0\). This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.

### Constructing a Summary Statistic

To see how modes, medians and means arise, let’s assume that we have a list of numbers, \((x_1, x_2, \ldots, x_n)\), that we want to summarize. We want our summary to be a single number, which we’ll call \(s\). How should we select \(s\) so that it summarizes the numbers, \((x_1, x_2, \ldots, x_n)\), effectively?

To answer that, we’ll assume that \(s\) is an effective summary of the entire list if the typical discrepancy between \(s\) and each of the \(x_i\) is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, \(x_i\) and \(s\); and (2) define the notion of a typical discrepancy. Because each number \(x_i\) produces its own discrepancy, we’ll need to introduce a method for aggregating the individual discrepancies to order to say something about the typical discrepancy.

### Defining a Discrepancy

We could define the discrepancy between a number \(x_i\) and another number \(s\) in many ways. For now, we’ll consider only three possibilities. All of these three options satisfies a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between \(x_i\) and \(s\) should be \(0\) if \(|x_i – s| = 0\) and that the discrepancy should be greater than \(0\) if \(|x_i – s| > 0\). That leaves us with one obvious question: how much greater should the discrepancy be when \(|x_i – s| > 0\)?

To answer that question, let’s consider three definitions of the discrepancy, \(d_i\):

- \(d_i = |x_i – s|^0\)
- \(d_i = |x_i – s|^1\)
- \(d_i = |x_i – s|^2\)

How should we think about these three possible definitions?

The first definition, \(d_i = |x_i – s|^0\), says that the discrepancy is \(1\) if \(x_i \neq s\) and is \(0\) only when \(x_i = s\). This notion of discrepancy is typically called **zero-one loss** in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy. Summarizing \(x_i = 2\) with \(s = 0\) is no better nor worse than using \(s = 1\). In other words, the discrepancy does not increase at all as \(s\) gets further and further from \(x_i\). You can see this reflected in the far-left column of the image below:

The second definition, \(d_i = |x_i – s|^1\), says that the discrepancy is equal to the distance between \(x_i\) and \(s\). This is often called an **absolute deviation** in machine learning. Note that this definition implies that the discrepancy should increase linearly as \(s\) gets further and further from \(x_i\). This is reflected in the center column of the image above.

The third definition, \(d_i = |x_i – s|^2\), says that the discrepancy is the squared distance between \(x_i\) and \(s\). This is often called a **squared error** in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as \(s\) gets further and further from \(x_i\). For example, if \(x_i = 1\) and \(s = 0\), then the discrepancy is \(1\). But if \(x_i = 2\) and \(s = 0\), then the discrepancy is \(4\). This is reflected in the far right column of the image above.

When we consider a list with a single element, \((x_1)\), these definitions all suggest that we should choose the same number: namely, \(s = x_1\).

### Aggregating Discrepancies

Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let’s first assume that we’ll aggregate the discrepancy between \(x_i\) and \(s\) for each of the \(x_i\) into a single summary of the quality of a proposed value of \(s\). To perform this aggregation, we’ll sum up the discrepancies over each of the \(x_i\) and call the result \(E\).

In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we’ll call \(E\) for error:

$$

E_0 = \sum_{i} |x_i – s|^0.

$$

$$

E_1 = \sum_{i} |x_i – s|^1.

$$

$$

E_2 = \sum_{i} |x_i – s|^2.

$$

When we write down these expressions in isolation, they don’t look very different. But if we select \(s\) to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.

### Minimizing Aggregate Discrepancies

For example, suppose that we try to find \(s_0\) that minimizes the zero-one loss definition of the error of a single number summary. In that case, we require that,

$$

s_0 = \arg \min_{s} \sum_{i} |x_i – s|^0.

$$

What value should \(s_0\) take on? If you give this some extended thought, you’ll discover two things: (1) there is not necessarily a single best value of \(s_0\), but potentially many different values; and (2) each of these best values is one of the **modes** of the \(x_i\).

In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.

What happens if we consider some of the other definitions? Let’s start by considering \(s_1\):

$$

s_1 = \arg \min_{s} \sum_{i} |x_i – s|^1.

$$

Unlike \(s_0\), \(s_1\) is a unique number: it is the **median** of the \(x_i\). That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.

Since we’ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,

$$

s_2 = \arg \min_{s} \sum_{i} |x_i – s|^2,

$$

we’ll find that, like \(s_1\), \(s_2\) is again a unique number. Moreover, \(s_2\) is the **mean** of the \(x_i\). That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.

To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between \(s\) and the numbers being summarized. They only differ in the type of discrepancy being considered:

- The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
- The median minimizes the average distance between each number and our summary.
- The mean minimizes the average squared distance between each number and our summary.

In equations,

- \(\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i – s|^0\)
- \(\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i – s|^1\)
- \(\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i – s|^2\)

### Summary

We’ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number \(s\) and a list of numbers, \(x_1, x_2, \ldots, x_n\) that we try to summarize using \(s\). In a future blog post, I’ll describe how the ideas we’ve just introduced relate to the concept of \(L_p\) norms. Thinking about minimizing \(L_p\) norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics — ranging from linear regression to the SVD.

### Thanks

Thanks to Sean Taylor for reading a draft of this post and commenting on it.

I’d say “mind blown” but actually the world makes sense now. Great post. Not sure how I’ve never seen this explanation before. Thanks for sharing and making sense of the relationship between these three concepts.

How do geometric and harmonic means fit into this picture?

The harmonic mean is the reciprocal of the mean of the reciprocals.

The geometric mean is the antilogarithm of the mean of the logarithms.

Years ago I tried to visualize and animate the functions that are minimized for median, arithmetic and geometric mean. Motivated with this post I added the harmonic mean. Animation is available at http://bit.ly/XA8qdX .

Charles, I’m not really sure how to situate geometric or harmonic means relative to this context. It’s tricky to get even an intuitive feel for, because those constructs aren’t so simply described in terms of norms.

Very cool. One would hope that other powers would give additional variants, e.g. would a power of 1.5 give us a new hybrid between the median and the mean? Unfortunately, it looks like other positive powers just give the mean again.

Nice article. I’m surprised at the discrepancy of the names you give the norms. In mathematics and ML when I was at Chicago (which is very mathematically oriented), we called these norms, Hamming distance, l1 and l2. I suppose a good follow up would be on why the l2 is easier to compute with…

Rani, I don’t have much feel for how intermediate powers work. In practice, people usually get intermediate results by mixing things to give Huber loss (for example).

Andy, the loss functions I’m using are technically not pseudo-norms, but monotonic functions of the pseudo-norms because I’m not taking p-th roots. In my next post on linear regression, I’ll introduce the L_p “norms” formally. Judging from my experiences in Princeton, I think we’re much more likely to speak of the L0 “norm” than Hamming distance.

I think I’ll also do a post on the nice analytic properties of L2 and may also do a post on different algorithms for calculating medians and means: classical methods, formal optimization approaches, SGD approaches, etc.

Great post! Curious how this might generalize to continuous, rather than discrete, distributions?

The median and mean definitions generalize very easily to continuous distributions: the median, s1, is the number that minimizes E[|x – s|], whereas the mean, s2, is the number that minimizes E[|x – s|^2]. Something like this should work for modes, but the required argument is less clear to me.

Note that the mode and the median are not uniquely defined numbers. Consider the case of two observations, x1 < x2. Then both s=x1 and s=x2 minimizes E0, and so there are in this case two modes. Any s where x1 < s < x2 minimizes E1, and so there is an infinity of medians when the number of observations is even. However s = (x1+x2)/2 alone minimizes E2, and so there is only one mean value. This makes the mean the definition to be remembered, and the mode and median the definitions to be forgotten and lost in the stream of time.

Bo, you’re right that the median is not uniquely defined in the case of an even number of elements. That said, the constraints on the L1 minimizer become increasingly tight as the number of data points goes towards infinity. In addition, we’ve all adopted a standard definition of the median that makes the non-uniqueness a computational inconvenience.

Also, the enormous recent success of approaches to problems based on L1 minimization makes me think that the median is the future of data analysis rather than the past. See, for example, Tukey’s work on robust statistics.

Excellent post. I will use this explanation next time I am trying to explain todo Rome why median or mean does not make sense in a particular instance .. The there metrics of central tendency have their unique place in every data set .. It is surprising how often people use the wrong metric. Thanks for the explanation which can help explain done of those nuances .,

Uday.

I have seen this for the norm functions but nit these statistics. Thanks for putting this together!

I have seen this for the norm functions but not these statistics. Thanks for putting this together!

Nice post. I’m interested if the generalization from medians to quantiles can be applied to the mode and mean. To get quantiles other than the median, one can simply make the l1 discrepancy function asymmetric by scaling up losses on one side of s compared to the other side. Is there a nice interpretation of the result when applying this asymmetry to the l0 and l2 functions?

Brian, I’m not aware of any such analysis, but I agree that it would be really interesting.

Great post and discussion! The error is basically an unnormalized measure of dispersion, yet even though modes and medians are sometimes used (probably not as often as they should be), it seems that the error for the mean is used pretty much exclusively as basis for any measure of variability, even for the variability of medians, etc.. Any thoughts on that?

Christoph, you’re completely right that the variance we use is the squared error, which is built around the mean. There are some people who use absolute deviations: R has a function called mad() that will calculate a measure of variability around the median. I’ve used it several times for measuring variability in probabilities, where I find the squaring process hard to reason about. I doubt it will grow in popularity anytime soon, though.

Cool post! It also seems to me (just going thru this in my head) that the MidRange = (x_max + x_min)/2 minimizes the infinity (supremum) norm.

Thanks!

Joe

Totally right. I hadn’t thought about that, but will add to my complete writeup when I pool this series of posts together. That makes the infinity norm a lot easier to present.

Really nice post! I taught my students about square loss and absolute loss (Charles Manski’s term) when talking about mean and median, but had no idea that the mode can be expressed via a loss function too.

Joe Y, thanks for pointing out that midrange minimizes the infinity (supremum) norm – a quick simulation seems to confirm this.

Andrej B. – really nice animation. Do you have some more information what kind of loss functions harmonic, and geometric mean minimizes?

John, looking forward to posts that talk about “the nice analytic properties of L2 and … different algorithms for calculating medians and means: classical methods, formal optimization approaches, SGD approaches, etc.”

I guess that geometric and harmonic mean are basically variants of L2 norm on transformed data (log(x) for geometric and 1/x for harmonic mean).

I got interested in the question about geometric and harmonic means posted in the comments. Some searching led me to an interesting page about the concept of “generalized mean” or “power mean,” the most interesting part of which is the list of examples:

http://en.wikipedia.org/wiki/Power_mean#Special_cases

You’ll see that these means are themselves very close to being distances as defined by L_p-norms, where each scalar number in the dataset is taken to be a scalar entry in the vector x. The only differences are (1) the missing absolute-value sign (which gives me a little bit of pause about the power-mean concept), and (2) the factor 1/n (which is not a big deal because it’s just a normalization).

I found this generalization to be quite beautiful, since it includes the max, the min, and all the means one would usually mention (including geometric and harmonic). It’s also very closely related to norms. However, it is *not* the result of a maximization within a norm. Or at least, if such a maximization exists, we don’t know what it is yet. Maybe there’s another dimension to the problem that will let us see both generalized means and medians and modes as answers to a broader generalized problem?

All this got me wondering a bit about whether John is correct to believe it’s useful to think of a central summary statistic as being derived from a minimization of a particular loss function. After all, as just became clearer to me from my consideration of generalized means, the loss functions we use are themselves a bit arbitrary. I agree that John’s pictures are quite clear and help us understand the problem much better, but why do we prefer to minimize the mean absolute distance or the mean squared distance as opposed to, for example, the mean of e^(x_i – s)?

I have checked the relevant calculus and algebra, and I believe Andrej is correct that the harmonic mean can be expressed as a minimization of the sum of the squared distance between (1/s) and (1/x_i). And the geometric mean can similarly be expressed as a minimization of the sum of the squared distance between ln(s) and ln(x_i).

But this becomes decreasingly helpful as a means of understanding the problem, in my view, partly because I don’t see how to draw the function we’re minimizing with x on the horizontal axis (as opposed to having the log or the reciprocal of x on the horizontal axis). I found the “generalized means” page I linked to above to be more illuminating. Maybe there’s another way to look at the minimization problem that will make things more illuminating again?

To add another dimension of generality, we can consider weighted means. For example, if we are measuring the average income in each state, a particular policymaker might care ten times as much about being far from Pennsylvania’s truth than about being far from the truth of any other state. Then we’d want to use a weighted loss function where the Pennsylvania observation gets multiplied by 10 in the loss function, while every other state remains as is. This results in a weighted mean where Pennsylvania’s value of x gets multiplied by 10 and added to the sum of the x_i for every other state, then the whole sum gets divided by 59 instead of merely 50.

This example is a little silly, but it becomes more useful when we consider a scenario in which we know Pennsylvania has more accurate income statistics than other states. For overall accuracy, we end up wanting to give more weight to the observations that we believe to be most accurate. (This can probably itself be expressed as the outcome of a simpler maximization problem, but I’m not seeing exactly what it is at the moment.)

I mention weighted means because I believe their asymmetry is related to that of quantiles, another question that was asked above. If, instead of the median (i.e., the 50th percentile), you want the 10th percentile of the data, you can find this by minimizing an asymmetric loss function. Instead of minimizing the symmetric V-shaped absolute value function graphed by John in his figure above, we would be minimizing an asymmetric V-shaped function whose left side is much more steeply sloped than the left. For example, if the slope on the left is 10 and the slope on the right is 90, then we get the 10th percentile as our solution. If the slope on the left and the slope on the right are both 50, then we get the 50th percentile or median. (Mathematical statisticians usually normalize the slopes so that they sum to 1, but I thought it made more pedagogical sense to sum to 100 in this explanation.)

Wikipedia has a decent explanation of this, but it’s kind of hard to find as well as being mathematically dense, so I tried to unpack it a little more clearly with a concrete example above. Here’s the explanation in denser math-speak:

http://en.wikipedia.org/wiki/Quantile_regression#Quantiles

Thinking about quantiles in this way fits very nicely into John’s framework, now that I have straightened everything out in my head. The reason is that it’s very easy to graph the loss function we’re interested in. If we think that our loss function is linear in distance, but underestimating is nine times as bad as overestimating, then we want to compute the 10th percentile.

I’m much happier with including quantiles in John’s general framework, in fact, than I am with including weighted means, geometric means, or harmonic means. At least so far. Maybe we just don’t see the right generalization yet.

Gee, now you’ve got me thinking about trimmed means. For example, I am interested in the average value, but I have some outliers that are ridiculously large (100x as large as anything else in the dataset), and I fear that somebody made a typo that generated extra digits for those few observations. So I throw out the top 1% and the bottom 1% of the data, and then compute the mean. I guess this is equivalent to doing a minimization of the sum of squared distances, but putting a weight of zero on the top 1% and bottom 1% of the data. (But is there an even more elegant minimization problem that would show us why we want to pick 1%, as opposed to some other amount of outliers?)

Sorry for adding so many comments; perhaps I should have written my own blog entry instead. :-) I hope someone else finds these comments useful.

Great post. I just graduated with a math major and statistics minor. However, I have never

seen explanations tying loss minimization in my four years of study. Now I wonder if there

is a mean,median,mode analog for multivariable data.

This is one of these eye-openers that you only meet every now and then, thx a lot for your contribution John. Nathan’s words ‘Actually the world makes sense now’ is indeed the right description of what this does. The geometric mean is the max likelihood estimator of the median for lognormal distributions. Just checked with some simulations of lognormal samples and indeed the minimum of your first order discrepancy function for the mean gives the geometric mean, and your second order discrepancy function the arithmetic mean.