Turning Distances into Distributions

Deriving Distributions from Distances

Several of the continuous univariate distributions that frequently come up in statistical theory can be derived by transforming distances into probabilities. Essentially, these distributions differ only in how frequently they produce values that lie at a distance \(d\) from the mode. To see how these transformations work (and how they unify many distinct distributions), let’s make some simplifying assumptions.

First, we’ll assume that the distribution \(D\) we want to construct is unimodal. In other words, there is a typical value, which we’ll call \(m\), that occurs more frequently than any other value drawn from that distribution. For the sake of simplicity, we’re going to assume that \(m = 0\) from now on.

Second, we’ll assume that the probability density that \(D\) assigns to a value \(x\) decreases as \(x\) gets further from \(0\). To simplify things, we’ll assume that the density function is inversely proportional to a function \(f\) of \(x\)’s distance from \(0\).

Given these assumptions, we know that the density function will look like, $$ p(x) = \frac{c}{f(\lvert x \rvert)} \propto \frac{1}{f(\lvert x \rvert)}, $$ where \(c\) is chosen to ensure that the density function integrates to \(1\) over \(x \in (-\infty, +\infty)\). From now on, we’ll ignore this constant as it distracts one from seeing the bigger picture.
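
To make the construction concrete, here’s a minimal numerical sketch assuming NumPy and SciPy are available (the helper name `make_density` is just for illustration, not anything standard): given a transformation \(f\), we integrate \(1/f(\lvert x \rvert)\) over the real line and take the reciprocal of that integral as \(c\).

```python
import numpy as np
from scipy.integrate import quad

def make_density(f):
    """Turn a distance transformation f into the normalized density p(x) = c / f(|x|)."""
    total, _ = quad(lambda x: 1.0 / f(abs(x)), -np.inf, np.inf)
    c = 1.0 / total  # chosen so that p integrates to 1
    return lambda x: c / f(abs(x))

# For example, f(d) = 1 + d^2 gives c = 1/pi (this is the Cauchy density discussed below).
p = make_density(lambda d: 1.0 + d ** 2)
print(p(0.0))  # approximately 0.3183 = 1/pi
```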

The big question given this setup is this: what densities do we get when we consider different choices of the transformation \(f\)?

How Must the Function \(f\) Behave to Define a Valid Density?

What constraints must we place on \(f\) to ensure that the proposed density \(p(x)\) is actually a valid probability density function?

First, we’ll assume that \(f(\lvert x \rvert)\) is positive and bounded away from zero to ensure that \(\frac{1}{f(\lvert x \rvert)}\) doesn’t head off towards \(\infty\) or take on negative values. To simplify things we’re going to make an even stronger assumption: we’ll assume that \(f(0) = 1\) and that \(f(\lvert x \rvert) > f(0)\) for all \(x \neq 0\).

Second, we’ll assume that \(f(\lvert x \rvert)\) increases fast enough to ensure that $$ \int_{-\infty}^{+\infty} \frac{1}{f(\lvert x \rvert)} \, dx < \infty. $$ This latter requirement turns out to be the more interesting one. How fast must \(f(\lvert x \rvert)\) increase to guarantee that the probability integral is finite? To see that it’s possible for \(f(\lvert x \rvert)\) not to increase fast enough, imagine that \(f(\lvert x \rvert) = 1 + \lvert x \rvert\). In this case, we’d be integrating something that looks like, $$ \int_{-\infty}^{+\infty} \frac{1}{1 + \lvert x \rvert} \, dx, $$ which is not finite: the integral over \([-T, +T]\) equals \(2 \log(1 + T)\), which grows without bound as \(T\) increases. As such, we know that \(f(\lvert x \rvert)\) must grow faster than the absolute value function. It turns out that the next obvious choice for growth after linear growth is sufficient: if we look at \(f(\lvert x \rvert) = 1 + \lvert x \rvert^k\) for \(k \geq 2\), we already get a valid density function. Indeed, when \(k = 2\) we’ve derived the functional form of the Cauchy distribution: a distribution famous for having tails so heavy that it has no finite moments. More generally, when \(k > 2\), we get something that looks a bit like the t-statistic’s distribution, although it’s definitely not the same. Our densities look like,

$$ p(x) \propto \frac{1}{1 + \lvert x \rvert^k}, $$

whereas the t-statistic’s distribution looks more like,

$$ p(x) \propto \frac{1}{\left(1 + \frac{\lvert x \rvert^2}{2k - 1}\right)^k}. $$

That said, when \(f(\lvert x \rvert)\) is a polynomial, we generally get a heavy-tailed distribution.
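
Both claims are easy to check numerically (a sketch assuming SciPy is available): the truncated integral of \(1/(1 + \lvert x \rvert)\) grows like \(2 \log(1 + T)\) and never settles down, while \(1/(1 + x^2)\) integrates to \(\pi\) and, once normalized, matches SciPy’s Cauchy density.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import cauchy

# k = 1: the truncated integral grows like 2 * log(1 + T), so the full integral diverges.
for T in (1e1, 1e3, 1e5):
    total = 2 * quad(lambda x: 1.0 / (1.0 + x), 0, T)[0]  # symmetric, so integrate [0, T] and double
    print(T, total, 2 * np.log(1 + T))

# k = 2: the integral is finite (it equals pi), and the normalized density is exactly Cauchy.
Z, _ = quad(lambda x: 1.0 / (1.0 + x ** 2), -np.inf, np.inf)
x = np.linspace(-5.0, 5.0, 11)
print(np.allclose((1.0 / Z) / (1.0 + x ** 2), cauchy.pdf(x)))  # True
```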

From Polynomial Growth to Exponential Growth

What happens if we consider exponential growth for \(f(\lvert x \rvert)\) instead of polynomial growth?

(1) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert)\), we get the Laplace distribution. This is the canonical example of a sub-exponential distribution, whose tails drop off exponentially fast, but not as fast as a Gaussian’s tails.

(2) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert^2)\), we get the Gaussian distribution. This is the quintessentially well-behaved distribution that places the vast majority of its probability mass at points within a small distance from zero.

(3) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert^k)\) and \(k > 2\), we get a subset of the sub-Gaussian distributions, whose tails are even thinner than the Gaussian’s tails.

Essentially, the distance-to-distribution way of thinking gives us a way of reasoning about how fast the tail probabilities of a distribution head to zero. By talking in terms of \(f(\lvert x \rvert)\) without reference to anything else, we can derive lots of interesting results that don’t depend upon the many constants we might otherwise get distracted by.
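
As a sanity check on (1) and (2), here’s a small sketch assuming SciPy is available. Normalizing \(1/\exp(\lvert x \rvert)\) reproduces SciPy’s standard Laplace density, and normalizing \(1/\exp(\lvert x \rvert^2)\) reproduces a Gaussian density, though, because we’ve dropped all constants, it’s the Gaussian with standard deviation \(1/\sqrt{2}\) rather than the standard normal.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import laplace, norm

x = np.linspace(-4.0, 4.0, 9)

# f(|x|) = exp(|x|): exp(-|x|) integrates to 2, so p(x) = exp(-|x|) / 2 -- the Laplace density.
Z = 2 * quad(lambda t: np.exp(-t), 0, np.inf)[0]  # integrate the right half and double it
print(np.allclose(np.exp(-np.abs(x)) / Z, laplace.pdf(x)))  # True

# f(|x|) = exp(|x|^2): exp(-x^2) integrates to sqrt(pi), so p(x) = exp(-x^2) / sqrt(pi),
# which is a normal density with mean 0 and standard deviation 1 / sqrt(2).
Z = quad(lambda t: np.exp(-t ** 2), -np.inf, np.inf)[0]
print(np.allclose(np.exp(-x ** 2) / Z, norm.pdf(x, scale=1 / np.sqrt(2))))  # True
```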

Conclusion: Ranking Distributions by Their Tails

We can recover a lot of classic probability distributions by assuming that the mass they put at a value is inversely proportional to a function of that value’s distance from zero. In particular, we can transform distances into distributions as follows:

  • Select a function \(f(\lvert x \rvert)\) that increases faster than \(f(\lvert x \rvert) = 1 + \lvert x \rvert\).
  • Define the unnormalized density as \(p'(x) = \frac{1}{f(\lvert x \rvert)}\).
  • Figure out the integral of \(p'(x)\) from \(-\infty\) to \(+\infty\) and call its reciprocal \(c\).
  • Define the normalized density as \(p(x) = \frac{c}{f(\lvert x \rvert)}\).

This approach gives us the following prominent distributions:

(1) If \(f(\lvert x \rvert) = 1 + \lvert x \rvert^k\) and \(k = 2\), we exactly recover the Cauchy distribution. For larger \(k\), we recover heavy-tailed distributions that seem related to the t-statistic’s distribution, but aren’t identical to it.

(2) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert)\), we exactly recover the Laplace distribution.

(3) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert^k)\) and \(1 < k < 2\), we recover a class of sub-exponential distributions between the Laplace distribution and the Gaussian distribution.

(4) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert^2)\), we exactly recover the Gaussian distribution.

(5) If \(f(\lvert x \rvert) = \exp(\lvert x \rvert^k)\) and \(k > 2\), we recover a class of sub-Gaussian distributions.
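
For what it’s worth, the whole \(f(\lvert x \rvert) = \exp(\lvert x \rvert^k)\) family above matches, after normalization, what SciPy calls the generalized normal distribution (`scipy.stats.gennorm`, with its shape parameter set to \(k\)). A quick numerical check of one sub-exponential case and one sub-Gaussian case, again assuming SciPy is available:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gennorm

x = np.linspace(-3.0, 3.0, 13)

for k in (1.5, 3.0):  # a sub-exponential case (1 < k < 2) and a sub-Gaussian case (k > 2)
    Z = quad(lambda t: np.exp(-np.abs(t) ** k), -np.inf, np.inf)[0]
    ours = np.exp(-np.abs(x) ** k) / Z
    print(k, np.allclose(ours, gennorm.pdf(x, k)))  # prints True for both
```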

If you want to see why this ranking of distributions in terms of distance is so helpful, imagine that the limiting distribution for a mean were Laplace rather than Gaussian: you’d find that inference would be vastly more difficult. The extreme speed with which the tails of the normal distribution head to zero does much of the heavy lifting in statistical inference.

References to the Literature?

I imagine this classification of distributions has a name. It seems linked to the exponential family, although the details of that link aren’t very clear to me yet.

Much of the material for this note is canonical, although I’m less familiar with the focus on relating metrics to distributions. One thing that got me interested in the topic long ago was this blog post by Ben Moran.