People sometimes use \(R^2\) as their preferred measure of model fit. Unlike quantities such as MSE or MAD, \(R^2\) is not a function only of a model’s errors: its definition contains an implicit comparison between the model being analyzed and the constant model that uses only the observed mean to make predictions. As such, \(R^2\) answers the question: *“does my model perform better than a constant model?”* But we often would like to answer a very different question: *“does my model perform worse than the true model?”*

In theoretical examples, it is easy to see that the answers to these two questions are not interchangeable. We can construct examples in which our model performs no better than a constant model, *even though it also performs no worse than the true model*. But we can also construct examples in which our model performs much better than a constant model, *even when it also performs much worse than the true model*.

As with all model comparisons, \(R^2\) is a function not only of the models being compared, but also a function of the data set being used to perform the comparison. For almost all models, there exist data sets that are wholly incapable of distinguishing between the constant model and the true model. In particular, when using a dataset with insufficient model-discriminating power, \(R^2\) can be pushed arbitrarily close to zero — even when we are measuring \(R^2\) for the true model. As such, we must always keep in mind that \(R^2\) does not tell us whether our model is a good approximation to the true model: \(R^2\) only tells us whether our model performs noticeably better than the constant model on our dataset.

To see how comparing a proposed model with a constant model could lead to the opposite conclusion from comparing the same proposed model with the true model, let’s consider a simple example: we want to model the function \(f(x)\), which is observed noisily over an equally spaced grid of \(n\) points between \(x_{min}\) and \(x_{max}\).

To start, let’s assume that:

- \(f(x) = \log(x)\).
- \(x_{min} = 0.99\).
- \(x_{max} = 1.01\).
- At 1,000 evenly spaced values of \(x\) between \(x_{min}\) and \(x_{max}\), we observe \(y_i = f(x_i) + \epsilon_i\), where \(\epsilon_i \sim \mathbf{N}(0, \sigma^2)\).

Based on this data, we’ll attempt to learn a model of \(f(x)\) using univariate OLS regression. We’ll fit both a linear model and a quadratic model. An example realization of this modeling process looks like the following:

In this graph, we can see that \(f(x)\) is well approximated by a line and so both our linear and quadratic regression models come close to recovering the true model. This is because \(x_{min}\) and \(x_{max}\) are very close and our target function is well approximated by a line in this region, especially relative to the amount of noise in our observations.

We can see the quality of these simple regression models if we look at two binary comparisons: our model versus the constant model and our model versus the true logarithmic model. To simplify these calculations, we’ll work with an alternative \(R^2\) calculation that ignores corrections for the number of regressors in a model. Thus, we’ll compute \(R^2\) for a model \(m\) (versus the constant model \(c\)) as follows:

$$

R^2 = \frac{\text{MSE}_c - \text{MSE}_m}{\text{MSE}_c} = 1 - \frac{\text{MSE}_m}{\text{MSE}_c}

$$

As with the official definition of \(R^2\), this quantity tells us how much of the residual error left over by the constant model is accounted for by our model. To see how this comparison leads to different conclusions than a comparison against the true model, we’ll also consider a variant of \(R^2\) that we’ll call \(E^2\) to emphasize that it measures how much worse the errors are than we’d see using the true model \(t\):

$$

E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_t} = \frac{\text{MSE}_m}{\text{MSE}_t} - 1

$$

Note that \(E^2\) has the opposite sense to \(R^2\): better fitting models have lower values of \(E^2\).

Computing these numbers on our example, we find that \(R^2 = 0.007\) for the linear model and \(R^2 = 0.006\) for the true model. In contrast, \(E^2 = -0.0008\) for the linear model, indicating that it is essentially indistinguishable from the true model on this data set. Even though \(R^2\) suggests our model is not very good, \(E^2\) tells us that our model is close to perfect over the range of \(x\).
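The comparison above can be sketched in a few lines of Julia. This is a minimal sketch and not the code behind the numbers quoted in this post: the noise level `sigma = 0.1`, the seed, and the helper names (`mse`, `r2`, `e2`, `compare_models`) are all assumptions made for illustration, so the exact values it prints will differ from the ones reported here.

```julia
using Random, Statistics, LinearAlgebra

# Mean squared error of predictions `yhat` against observations `y`.
mse(y, yhat) = mean((y .- yhat) .^ 2)

# R^2 compares model m with the constant model c;
# E^2 compares model m with the true model t.
r2(mse_m, mse_c) = 1 - mse_m / mse_c
e2(mse_m, mse_t) = mse_m / mse_t - 1

function compare_models(xmin, xmax; n = 1_000, sigma = 0.1, rng = MersenneTwister(1))
    # Evenly spaced grid and noisy observations of f(x) = log(x).
    x = collect(range(xmin, stop = xmax, length = n))
    y = log.(x) .+ sigma .* randn(rng, n)   # sigma is an assumed noise level

    # Univariate OLS for the linear model: least-squares solve of [1 x] * beta ≈ y.
    X = [ones(n) x]
    beta = X \ y

    mse_c = mse(y, fill(mean(y), n))  # constant model: always predict the observed mean
    mse_m = mse(y, X * beta)          # fitted linear model
    mse_t = mse(y, log.(x))           # true model

    (r2_linear = r2(mse_m, mse_c), r2_true = r2(mse_t, mse_c), e2_linear = e2(mse_m, mse_t))
end

narrow = compare_models(0.99, 1.01)
println(narrow)
```

With a narrow grid like this one, both \(R^2\) values should come out close to zero while \(E^2\) for the linear model stays close to zero as well, which is the pattern described above.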

Now what would happen if the gap between \(x_{max}\) and \(x_{min}\) grew larger? For a monotonic function like \(f(x) = \log(x)\), we’ll see that \(E^2\) will constantly increase, but that \(R^2\) will do something very strange: it will increase for a while and then start to decrease.

Before considering this idea in general, let’s look at one more specific example in which \(x_{min} = 1\) and \(x_{max} = 1000\):

In this case, visual inspection makes it clear that the linear and quadratic models are both systematically inaccurate, but their values of \(R^2\) have gone up substantially: \(R^2 = 0.760\) for the linear model and \(R^2 = 0.997\) for the true model. In contrast, \(E^2 = 85.582\) for the linear model, indicating that this data set provides substantial evidence that the linear model is worse than the true model.

These examples show how the linear model’s \(R^2\) can increase substantially, even though most people would agree that the linear model is becoming an increasingly unacceptably bad approximation of the true model. Indeed, although \(R^2\) seems to improve when transitioning from one example to the other, \(E^2\) gets strictly worse as the gap between \(x_{min}\) and \(x_{max}\) increases. This suggests that \(R^2\) might be misleading if we interpret it as a proxy for the generally unmeasurable \(E^2\). But, in truth, our two extreme examples don’t tell the full story: \(R^2\) actually changes non-monotonically as we transition between the two cases we’ve examined.

Consider performing a similar analysis to the ones we’ve done above, but over many possible grids of points starting with \((x_{min}, x_{max}) = (1, 1.1)\) and ending with \((x_{min}, x_{max}) = (1, 1000)\). If we calculate \(R^2\) and \(E^2\) along the way, we get graphs like the following (after scaling the x-axis logarithmically to make the non-monotonicity more visually apparent).
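A sweep of this kind might look like the following sketch. As before, the noise level (`sigma = 0.1`), the seed, the number of grid endpoints (25), and the function name `linear_fit_scores` are assumptions made for illustration, not details taken from the original analysis.

```julia
using Random, Statistics, LinearAlgebra

mse(y, yhat) = mean((y .- yhat) .^ 2)

# Returns (R^2, E^2) for an OLS line fit to noisy observations of log(x)
# on an evenly spaced grid over [xmin, xmax].
function linear_fit_scores(xmin, xmax; n = 1_000, sigma = 0.1, rng = MersenneTwister(1))
    x = collect(range(xmin, stop = xmax, length = n))
    y = log.(x) .+ sigma .* randn(rng, n)
    X = [ones(n) x]
    yhat = X * (X \ y)
    mse_c = mse(y, fill(mean(y), n))  # constant model
    mse_m = mse(y, yhat)              # linear model
    mse_t = mse(y, log.(x))           # true model
    (1 - mse_m / mse_c, mse_m / mse_t - 1)
end

# Upper endpoints spaced evenly on a log scale from 1.1 to 1000; xmin is fixed at 1.
xmaxes = 10 .^ range(log10(1.1), stop = 3.0, length = 25)
scores = [linear_fit_scores(1.0, xmax) for xmax in xmaxes]

for (xmax, (r2, e2)) in zip(xmaxes, scores)
    println(round(xmax, digits = 2), "\t", round(r2, digits = 3), "\t", round(e2, digits = 3))
end
```

Each row prints the upper endpoint, then \(R^2\), then \(E^2\) for the linear model. Under these assumptions, the \(E^2\) column should grow as the grid widens, while the \(R^2\) column should rise and then fall back.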

First, \(R^2\):

Second, \(E^2\):

Notice how strange the graph for \(R^2\) looks compared with the graph for \(E^2\). \(E^2\) always grows: as we consider more discriminative data sets, we find increasingly strong evidence that our linear approximation is not the true model. In contrast, \(R^2\) starts off very low (precisely when \(E^2\) is very low because our linear model is little worse than the true model) and then changes non-monotonically: it peaks when there is enough variation in the data to rule out the constant model, but not yet enough variation in the data to rule out the linear model. After that point it decreases. A similar non-monotonicity is seen in the \(R^2\) value for the quadratic model. Only the true model shows a monotonic increase in \(R^2\).

I wrote down all of this not to discourage people from ever using \(R^2\). I use it sometimes and will continue to do so. But I think it’s important to understand both that (a) the value of \(R^2\) is heavily determined by the data set being used and that (b) the value of \(R^2\) can decrease even when your model is becoming an increasingly good approximation to the true model. When deciding whether a model is useful, a high \(R^2\) can be undesirable and a low \(R^2\) can be desirable.

This is an inescapable problem: whether a wrong model is useful always depends upon the domain in which the model will be applied and the way in which we evaluate all possible errors over that domain. Because \(R^2\) contains an implicit model comparison, it suffers from this general dependence on the data set. \(E^2\) also has such a dependence, but it at least seems not to exhibit a non-monotonic relationship with the amount of variation in the values of the regressor variable, \(x\).

The code for this post is on GitHub.

In retrospect, I would like to have made more clear that measures of fit like MSE and MAD also depend upon the domain of application. My preference for their use is largely that the lack of an implicit normalization means that they are prima facie arbitrary numbers, which I hope makes it more obvious that they are highly sensitive to the domain of application, whereas the normalization in \(R^2\) makes the number seem less arbitrary and might allow one to forget how data-dependent it is.

In addition, I really ought to have defined \(E^2\) using a different normalization. For a model \(m\) compared with both the constant model \(c\) and the true model \(t\), it would be easier to work with the bounded quantity:

$$

E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_c - \text{MSE}_t}

$$

This quantity asks where one lies on a spectrum from the worst defensible model’s performance (which is \(\text{MSE}_c - \text{MSE}_t\)) to the best possible model’s performance (which is \(\text{MSE}_t - \text{MSE}_t = 0\)). Note that in the homoscedastic regression setup I’ve considered, \(\text{MSE}_t = \sigma^2\), which can be estimated from empirical data using the identifying assumption of homoscedasticity and repeated observations at any fixed value of the regressor, \(x\). (I’m fairly certain that this normalized quantity has a real name in the literature, but I don’t know it off the top of my head.)

I should also have noted that some additional hacks are required to guarantee this number lies in \([0, 1]\) because the lack of model nesting means that violations are possible. (Similar violations are not possible in the linear regression setup because the constant model is almost always nested in the more elaborate models being tested.)
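As a minimal sketch of the bounded quantity, here is one way it could be computed. Clamping to \([0, 1]\) is only one possible “hack”, and it is my assumption rather than something specified above; the function name `e2_normalized` and the MSE values in the usage lines are likewise hypothetical.

```julia
# Normalized E^2: 0 means "as good as the true model",
# 1 means "as bad as the constant model".
function e2_normalized(mse_m, mse_c, mse_t)
    raw = (mse_m - mse_t) / (mse_c - mse_t)
    # Assumed fix-up: clamp values that fall outside [0, 1], which can happen
    # in finite samples when a fitted model beats the true model in-sample.
    clamp(raw, 0.0, 1.0)
end

# Hypothetical MSE values, chosen only for illustration.
println(e2_normalized(0.5, 1.0, 0.25))
println(e2_normalized(0.2, 1.0, 0.25))
```

The second call shows the violation case: the model’s MSE is below the true model’s MSE, so the raw ratio is negative and the clamp returns zero.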

After waking up to see this on Hacker News, I should probably have included a graphical representation of the core idea to help readers who don’t have any background in theoretical statistics. Here’s that graphical representation, which shows that, given three models to be compared, we can always place our model on a spectrum from the performance of the worst model (i.e. the constant model) to the performance of the best model (i.e. the true model).

When measuring a model’s position on this spectrum, \(R^2\) and \(E^2\) are complementary quantities.

An alternative presentation of these ideas would have focused on bias/variance decompositions of the quantities being considered. From that perspective, we’d see that:

$$

R^2 = 1 - \frac{\text{MSE}_m}{\text{MSE}_c} = 1 - \frac{\text{bias}_m^2 + \sigma^2}{\text{bias}_c^2 + \sigma^2}

$$

Using the redefinition of \(E^2\) I mentioned in the first round of “Retrospective Edits”, the alternative to \(R^2\) would be:

$$

E^2 = \frac{\text{MSE}_m - \text{MSE}_t}{\text{MSE}_c - \text{MSE}_t} = \frac{(\text{bias}_m^2 + \sigma^2) - (\text{bias}_t^2 + \sigma^2)} {(\text{bias}_c^2 + \sigma^2) - (\text{bias}_t^2 + \sigma^2)} = \frac{\text{bias}_m^2 - \text{bias}_t^2}{\text{bias}_c^2 - \text{bias}_t^2} = \frac{\text{bias}_m^2 - 0}{\text{bias}_c^2 - 0} = \frac{\text{bias}_m^2}{\text{bias}_c^2}

$$

In large part, I find \(R^2\) hard to reason about because of the presence of the \(\sigma^2\) term in the ratios shown above. \(E^2\) essentially removes that term, which I find useful for reasoning about the quality of a model’s fit to data.

Given their intended audience, I think Westfall and Yarkoni’s approach is close to optimal: they motivate their concerns using examples that depend only upon the reader having a basic understanding of linear regression and measurement error. That said, the paper made me wish that more psychologists were familiar with Judea Pearl’s work on graphical techniques for describing causal relationships.

To see how a graphical modeling approach could be used to articulate Westfall and Yarkoni’s point, let’s assume that the true causal structure of the world looks like the graph shown in Figure 1 below:

For those who’ve never seen such a graph before, the edges are directed in a way that indicates causal influence. This graph says that the objective temperature on any given day is the sole cause of (a) the subjective temperature experienced by any person, (b) the amount of ice cream consumed on that day, and (c) the number of deaths in swimming pools on the day.

If this graph reflects the true causal structure of the world, then the only safe way to analyze data about the relationship between ice cream consumption and swimming pool deaths is to condition on objective temperature. If highlighting represents conditioning on a variable, the conclusion is that a correct analysis must look like the following colored-in graph:

Westfall and Yarkoni’s point is that this correct analysis is frequently not performed because the objective temperature variable is not observed and therefore cannot be conditioned on. As such, one conditions on subjective temperature instead of objective temperature, leading to this colored-in graph:

The problem with conditioning on subjective temperature instead of objective temperature is that you have failed to block the relevant path in the graph, which is the one that links ice cream consumption and swimming pool deaths by passing through objective temperature. Because the relevant path of dependence is not properly blocked, an erroneous relationship between ice cream consumption and swimming pool deaths leaks through. It is this leaking through that causes the high false positive rates that are the focus of Westfall and Yarkoni’s paper.
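To make the leak concrete, here is a small simulation sketch. It assumes a linear-Gaussian version of the graph in which every edge has coefficient 1 and every variable has unit-variance noise; those choices, the sample size, and the function name `ice_cream_coefficients` are mine and are not taken from Westfall and Yarkoni’s paper. Conditioning is implemented by including the variable as a regressor in an OLS fit.

```julia
using Random, LinearAlgebra

function ice_cream_coefficients(; n = 100_000, rng = MersenneTwister(1))
    objective  = randn(rng, n)                # common cause, unobserved in practice
    subjective = objective .+ randn(rng, n)   # noisy reading of objective temperature
    ice_cream  = objective .+ randn(rng, n)
    deaths     = objective .+ randn(rng, n)   # no direct effect of ice cream on deaths

    # OLS coefficient on ice cream when regressing deaths on
    # an intercept, ice cream, and the chosen control variable.
    coef_on_ice_cream(control) = ([ones(n) ice_cream control] \ deaths)[2]

    (with_objective = coef_on_ice_cream(objective),
     with_subjective = coef_on_ice_cream(subjective))
end

println(ice_cream_coefficients())
```

Under these assumptions, the coefficient on ice cream should be near zero when objective temperature is the control, and near 1/3 when subjective temperature is used in its place, even though ice cream has no causal effect on deaths in the simulated world.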

One reason I like this graphical approach is that it makes it clear that the opposite problem can occur as well: the world might also have a causal structure such that one must condition on subjective temperature, and conditioning on objective temperature alone is insufficient.

To see how that could occur, consider this alternative hypothetical causal structure for the world:

If this alternative structure were correct, then a correct analysis would need to color in subjective temperature:

An incorrect analysis would be to color in objective temperature rather than subjective temperature:

But, somewhat surprisingly, coloring in both would be fine:

It’s taken me some time to master this formalism, but I now find it quite easy to reason about these kinds of issues thanks to the brevity of graphical models as a notational technique. I’d love to see this approach become more popular in psychology, given that it has already become quite widespread in other fields. Of course, Westfall and Yarkoni are already arguing for something very similar by advocating the use of SEMs, but the graphical approach is strictly more general than SEMs and, in my personal opinion, strictly simpler to reason about.

One of the things that set statistics apart from the rest of applied mathematics is an interest in the problems introduced by sampling: how can we learn about a model if we’re given only a finite and potentially noisy sample of data?

Although frequently important, the issues introduced by sampling can be a distraction when the core difficulties you face would persist even with access to an infinite supply of noiseless data. For example, if you’re fitting a misspecified model \(m_1\) to data generated by a model \(m_2\), this misspecification will persist even as the supply of data becomes infinite. In this setting, the issues introduced by sampling can be irrelevant: it’s often more important to know whether or not the misspecified model, \(m_1\), could ever act as an acceptable approximation to the true model, \(m_2\).

Until recently, I knew very little about these sorts of issues. While reading Cosma Shalizi’s draft book, “Advanced Data Analysis from an Elementary Point of View”, I learned about the concept of **pseudo-truth**, which refers to the version of \(m_1\) that’s closest to \(m_2\). Under model misspecification, estimation procedures often converge to the pseudo-truth as \(n \to \infty\).

Thinking about the issues raised by using a pseudo-true model rather than the true model has gotten me interested in learning more about quantifying approximation error. Although I haven’t yet learned much about the classical theory of approximations, it’s clear that a framework based on approximation errors allows one to formalize issues that would otherwise require some hand-waving.

Suppose that I want to approximate a function \(f(x)\) with another function \(g(x)\) that comes from a restricted set of functions, \(G\). What does the optimal approximation look like when \(g(x)\) comes from a parametric class of functions \(G = \{g(x, \theta) | \theta \in \mathbb{R}^{p}\}\)?

To answer this question with a single number, I think that one natural approach looks like:

- Pick a point-wise loss function, \(L(f(x), g(x))\), that evaluates the gap between \(f(x)\) and \(g(x)\) at any point \(x\).
- Pick a set, \(S\), of values of \(x\) over which you want to evaluate this point-wise loss function.
- Pick an aggregation function, \(A\), that summarizes all of the point-wise losses incurred by any particular approximation \(g(x, \theta)\) over the set \(S\).

Given \(L\), \(S\) and \(A\), you can define an optimal approximation from \(G\) to be a function, \(g \in G\), that minimizes the aggregated point-wise loss function over all values in \(S\).

This framework is so general that I find it hard to reason about. Let’s make some simplifying assumptions so that we can work through an analytically tractable special case:

- Let’s assume that \(f(x) = x^2\). Now we’re trying to find the optimal approximation of a quadratic function.
- Let’s assume that \(g(x, \theta) = g(x, a, b) = ax + b\), where \(\theta = \langle a, b \rangle \in \mathbb{R}^{2}\). Now we’re trying to find the optimal **linear** approximation to the quadratic function, \(f(x) = x^2\).
- Let’s assume that the point-wise loss function is \(L(f(x), g(x, \theta)) = [f(x) - g(x, \theta)]^2\). Now we’re quantifying the point-wise error using the squared error loss that’s commonly used in linear regression problems.
- Let’s assume that \(S\) is a closed interval on the real line, that is \(S = [l, u]\).
- Let’s assume that the aggregation function, \(A\), is the integral of the loss function, \(L\), over the interval, \([l, u]\).

Under these assumptions, the optimal approximation to \(f(x)\) from the parametric family \(g(x, \theta)\) is defined as a solution, \(\theta^{*}\), to the following minimization problem:

$$

\theta^{*} = \arg \min_{\theta} \int_{l}^{u} L(f(x), g(x, \theta)) \, dx = \arg \min_{\theta} \int_{l}^{u} [f(x) - g(x, \theta)]^2 \, dx = \arg \min_{\theta} \int_{l}^{u} [x^2 - (ax + b)]^2 \, dx.

$$

I believe that this optimization problem describes the pseudo-truth towards which OLS univariate regression converges when the values of the covariate, \(x\), are drawn from a uniform distribution over the interval \([l, u]\). Simulations agree with this belief, although I may be missing some important caveats.

To get a feeling for how the pseudo-truth behaves in this example problem, I found it useful to solve for the optimal linear approximation analytically by computing the gradient of the cost function with respect to \(\theta\) and then solving for the roots of the resulting equations:

$$

\begin{aligned}
\frac{\partial}{\partial a} \int_{l}^{u} [x^2 - (ax + b)]^2 \, dx &= 0 \\
\frac{\partial}{\partial b} \int_{l}^{u} [x^2 - (ax + b)]^2 \, dx &= 0
\end{aligned}

$$

After some simplification, this reduces to a matrix equation:

$$

\begin{bmatrix}
\frac{2}{3} (u^3 - l^3) & u^2 - l^2 \\
u^2 - l^2 & 2 (u - l)
\end{bmatrix}
\begin{bmatrix}
a \\
b
\end{bmatrix}
=
\begin{bmatrix}
\frac{1}{2} (u^4 - l^4) \\
\frac{2}{3} (u^3 - l^3)
\end{bmatrix}.

$$
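The matrix equation above is small enough to solve directly. The sketch below does that and, as a rough sanity check on the claim about OLS, compares the result with an OLS line fit to noiseless values of \(x^2\) on a dense, evenly spaced grid (which stands in for uniformly distributed \(x\)). The function names `pseudo_truth` and `ols_line_on_grid` and the grid size are assumptions made for illustration.

```julia
using LinearAlgebra

# Solve the 2x2 system for the optimal linear approximation a*x + b to x^2 on [l, u].
function pseudo_truth(l, u)
    a11 = (2 / 3) * (u^3 - l^3)
    a12 = u^2 - l^2
    a22 = 2 * (u - l)
    A = [a11 a12; a12 a22]
    rhs = [(u^4 - l^4) / 2, (2 / 3) * (u^3 - l^3)]
    A \ rhs   # returns [a, b]
end

# OLS fit of a line to noiseless x^2 on an evenly spaced grid over [l, u].
function ols_line_on_grid(l, u; n = 100_001)
    x = collect(range(l, stop = u, length = n))
    [x ones(n)] \ (x .^ 2)   # returns [slope, intercept]
end

println(pseudo_truth(0.0, 1.0))
println(ols_line_on_grid(0.0, 1.0))
println(pseudo_truth(1.0, 3.0))
```

Solving the system by hand gives \(a = l + u\) and \(b = -(l^2 + 4lu + u^2)/6\), so on \([0, 1]\) the pseudo-truth is the line \(x - 1/6\); the grid-based OLS fit should agree with that up to discretization error.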

In practice, I suspect it’s much easier to solve for the optimal approximation on a computer by using quadrature to approximate the aggregated point-wise loss function and then minimizing that aggregation function using a quasi-Newton method. I experimented a bit with this computational approach and found that it reproduces the analytic solution for this problem. The computational approach generalizes to other problems readily and requires much less work to get something running.

You can see examples of the pseudo-truth versus the truth below for a few intervals \([l, u]\):

I don’t think there’s anything very surprising in these plots, but I nevertheless find the plots useful for reminding myself how sensitive the optimal approximation is to the set \(S\) over which it will be applied.

One reason that I find it useful to think formally about the quality of optimal approximations is that it makes it easier to rigorously define some of the problems that arise in machine learning.

Consider, for example, the issues raised in this blog post by Paul Mineiro:

> In statistics the bias-variance tradeoff is a core concept. Roughly speaking, bias is how well the best hypothesis in your hypothesis class would perform in reality, whereas variance is how much performance degradation is introduced from having finite training data.

I would say that this definition of bias is quite far removed from the definition used in statistics and econometrics, although it can be formalized in terms of the expected point-wise loss.

The core issue, as I see it, is that, for a statistician or econometrician, bias is not an asymptotic property of a model class, but rather a finite-sample property of an estimator. Variance is also a finite-sample property of an estimator. Some estimators, such as the ridge estimator for a linear regression model, decrease variance by increasing bias, which induces a trade-off between bias and variance in finite samples.

But Paul Mineiro is not, I think, interested in these finite-sample properties of estimators. I believe he’s concerned about the intrinsic error introduced by approximating one function with another. And that’s a very important topic that I haven’t seen discussed as often as I’d like.

Put another way, I think Mineiro’s concern is much more closely linked to formalizing an analogue for regression problems of the concepts of VC dimension and Rademacher complexity that come up in classification problems. Because I wasn’t familiar with any such tools, I found it helpful to work through some problems on optimal approximations. Given this framework, I think it’s easy to replace Mineiro’s concept of bias with an approximation error formalism that might be called asymptotic bias or definitional bias, as Efron does in this paper. These tools let one quantify the costs incurred by the use of a pseudo-true model rather than the true model.

I’d love to learn more about this issue. As it stands, I know too much about sampling and too little about quantifying the quality of approximations.

In this post, I’m going to write up my current understanding of the topic. To motivate my perspective, I’m going to describe a specific inferential problem. Trying to solve that problem will require me to explore a small portion of the design space of algorithms for dealing with outliers. It is not at all clear to me how well the conclusions I reach will generalize to other settings, but my hope is that a thorough examination of one specific problem will touch on some of the core issues that would arise in other settings. I’ve also put the code for this simulation study on GitHub so that others can easily build variants of it.

To simplify things, I’m going to define an outlier as any value greater than a certain percentile of all of the observed data points. For example, I might say that all observations above the 90th percentile are outliers. In this case, I will say that the “outlier window” is 10%, because I will consider all observations above the (100 – 10)th percentile to be outliers. This outlier window parameter is the first dimension of the design space that I want to understand. Most applications employ a very gentle outlier window, which labels no more than 5% of observations as outliers. But sometimes one sees more radical windows used. In this post, I will consider windows ranging from 1% up to 40%.

Given that we’ve flagged some observations as outliers, what do we do with them? I’ll consider three options:

- Do nothing and leave the data unadjusted.
- Discard all of the outliers. The removal of extreme values is usually called trimming or truncation.
- Replace all of the outliers with the largest value that is not considered an outlier. The replacement of extreme values is usually called winsorization.
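The last two options can be sketched as follows. This is a minimal illustration rather than the code from the simulation study; the function names are hypothetical, and the threshold is computed with `quantile` from Julia’s `Statistics` standard library, whose default interpolation rule is an assumption on my part about how the percentile should be defined.

```julia
using Statistics

# Observations above the (1 - window) quantile are treated as outliers.
outlier_threshold(x, window) = quantile(x, 1 - window)

# Trimming: discard every observation above the threshold.
trim(x, threshold) = x[x .<= threshold]

# Winsorization: replace every outlier with the largest observation
# that is not itself considered an outlier.
function winsorize(x, threshold)
    cap = maximum(x[x .<= threshold])
    min.(x, cap)
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
t = outlier_threshold(x, 0.2)   # outlier window of 20%
println(trim(x, t))
println(winsorize(x, t))
```

In this toy vector the value 100.0 is the only outlier: trimming drops it, while winsorization replaces it with 4.0, the largest remaining value.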

As a further simplification, I’m only going to consider a single inferential problem: estimating the difference in means between two groups of observations. In this problem, I will generate data as follows:

- Draw \(n\) IID observations from a distribution \(D\). Label this set of observations as Group A.
- Draw \(n\) additional IID observations from the same distribution \(D\) and add a constant, \(\delta\), to every one of them. Label this set of observations as Group B.

In this post, the distribution, \(D\), that I will use is a log-normal distribution with mean 1.649 and standard deviation 2.161. These strange values correspond to a mean of 0 and a standard deviation of 1 on the pre-exponentiation scale used by my RNG.

Given that distribution, \(D\), I’ll consider three values of \(\delta\): 0.00, 0.05 and 0.10. Each value of \(\delta\) defines a new distribution, \(D^{\prime}\).
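As a quick check on those “strange values”, the standard formulas for the moments of a log-normal distribution can be evaluated directly, and the two groups can be generated by exponentiating standard normal draws. The function names, the sample size, and the seed below are assumptions for illustration.

```julia
using Random

# Mean and standard deviation of a log-normal distribution whose logarithm
# has mean `mu` and standard deviation `sd`.
lognormal_mean(mu, sd) = exp(mu + sd^2 / 2)
lognormal_sd(mu, sd) = sqrt((exp(sd^2) - 1) * exp(2 * mu + sd^2))

# Group A is drawn from D; Group B is drawn from D and shifted by delta.
function draw_groups(n, delta; rng = MersenneTwister(1))
    a = exp.(randn(rng, n))
    b = exp.(randn(rng, n)) .+ delta
    a, b
end

println(round(lognormal_mean(0.0, 1.0), digits = 3))
println(round(lognormal_sd(0.0, 1.0), digits = 3))

a, b = draw_groups(1_000, 0.05)
```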

To contrast these distributions, I want to consider the quantity: \(\delta = \mathbb{E}[D^{\prime}] - \mathbb{E}[D]\). The outlier approaches I’ll consider will lead to several different estimators. I’ll refer to each of these estimators as \(\hat{\delta}\). From each estimator, I’ll construct point estimates and interval estimates; I’ll also conduct hypothesis tests based on the estimators using t-tests.

Given that we have two data sets, \(A\) and \(B\), applying outlier detection requires us to make another decision about our position in design space: which data sets do we use to compute the percentile that defines the outlier threshold?

I’ll consider three options:

- Compute a single percentile from a new data set formed by combining A and B. In the results, I’ll call this the “shared threshold” rule.
- Compute a single percentile using data from A only. In the results, I’ll call this the “threshold from A” rule.
- Compute two percentiles: one using data from A only and another using data from B only. We will then apply the threshold from A to A’s data and apply the threshold from B to B’s data. In the results, I’ll call this the “separate thresholds” rule.
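The three rules differ only in which observations feed the percentile calculation. A minimal sketch, with a hypothetical function name and hypothetical rule labels (`:shared`, `:from_a`, `:separate`):

```julia
using Statistics

# Returns the pair (threshold applied to A, threshold applied to B).
function thresholds(a, b, window, rule)
    p = 1 - window
    if rule == :shared
        t = quantile(vcat(a, b), p)   # one percentile from the pooled data
        return t, t
    elseif rule == :from_a
        t = quantile(a, p)            # one percentile from A, applied to both groups
        return t, t
    elseif rule == :separate
        return quantile(a, p), quantile(b, p)   # each group gets its own percentile
    else
        throw(ArgumentError("unknown rule: $rule"))
    end
end

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [11.0, 12.0, 13.0, 14.0, 15.0]
for rule in (:shared, :from_a, :separate)
    println(rule, " => ", thresholds(a, b, 0.5, rule))
end
```

The toy groups are deliberately far apart so that the three rules give visibly different thresholds.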

Combining this new design decision with the others already mentioned, the design space I’m interested in has three dimensions:

- The outlier window.
- The method for addressing outliers, which is either (a) no adjustment, (b) trimming, or (c) winsorization.
- The data set used to compute the outlier threshold, which is a shared threshold, a separate threshold or a threshold derived from A’s observations only.

I consider the quality of inferences in terms of several metrics:

- Point Estimation:
  - The bias of \(\hat{\delta}\) as an estimator of \(\delta\).
  - The RMSE of \(\hat{\delta}\) as an estimator of \(\delta\).
- Interval Estimation:
  - The coverage probability of nominal 95% CI’s for \(\delta\) built around \(\hat{\delta}\).
- Hypothesis Testing:
  - The false positive rate of a t-test when \(\delta = 0.00\).
  - The false negative rates of t-tests when \(\delta = 0.05\) or \(\delta = 0.10\).
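The point-estimation metrics can be approximated by Monte Carlo. The sketch below is not the simulation study itself: the group size (`n = 1_000`), the number of replications, the seed, the 5% window, and the choice to winsorize at the largest pooled non-outlier under the shared-threshold rule are all assumptions made for illustration.

```julia
using Random, Statistics

# Winsorize both groups at a shared threshold, then take the difference in means.
function winsorized_difference(a, b, window)
    pooled = vcat(a, b)
    t = quantile(pooled, 1 - window)
    cap = maximum(pooled[pooled .<= t])   # largest pooled value not flagged as an outlier
    mean(min.(b, cap)) - mean(min.(a, cap))
end

# Monte Carlo estimates of the bias and RMSE of `estimator` for a given delta.
function bias_and_rmse(estimator; delta = 0.05, n = 1_000, reps = 2_000, rng = MersenneTwister(1))
    estimates = map(1:reps) do _
        a = exp.(randn(rng, n))
        b = exp.(randn(rng, n)) .+ delta
        estimator(a, b)
    end
    errors = estimates .- delta
    (bias = mean(errors), rmse = sqrt(mean(errors .^ 2)))
end

unadjusted = bias_and_rmse((a, b) -> mean(b) - mean(a))
winsorized = bias_and_rmse((a, b) -> winsorized_difference(a, b, 0.05))
println(unadjusted)
println(winsorized)
```

Because both calls start from the same seed, the two estimators are evaluated on the same simulated data sets, which makes their bias and RMSE directly comparable.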

The results for the point estimation problem are shown graphically below. I begin by plotting KDE’s of the sampling distributions of the seven estimators I’m considering, which give an intuition for all of the core results shown later. The columns in these plots reflect the different values of \(\delta\), which are either 0.00, 0.05 or 0.10.

Next I plot the bias of these estimators as a function of the outlier window. I split these plots according to the true value of \(\delta\):

Finally, I plot the RMSE of these estimators as a function of the outlier window, again splitting the plots according to the true value of \(\delta\):

To assess the quality of interval estimates derived from these estimators of \(\delta\), I plot the empirical coverage probability of nominal 95% CI’s:

In this last set of analyses, I plot the false positive and false negative rates as a function of the outlier window:

And, for one final analysis, I include a plot of false positive rate when \(\delta = 0.00\) vs false negative rate when \(\delta = 0.05\). This makes it easier to see which methods are strictly dominated by others:

- Both winsorization and trimming can improve the quality of point estimates of \(\delta\). But they can make inferences worse because they can introduce bias into these point estimates. The only way to prevent bias is to compute outlier thresholds in Groups A and B separately.
- There is sometimes an optimal outlier window that minimizes the RMSE of point estimates of \(\delta\). It is not clear how to calculate this without knowing the ground truth parameter that you are trying to infer. Perhaps cross-validation would work. Importantly, choosing a bad value for the outlier window makes both winsorization and trimming inferior to doing no adjustment.
- Both winsorization and trimming can improve the FP/FN tradeoffs made in a system, although one must use outlier thresholds that are shared across the A and B groups to reap these benefits.
- CI’s only provide the desired coverage probabilities when the outlier window is very small.

Putting all this together, my understanding is that there is no universally best method without a more complex approach to outlier control. Whether winsorization or trimming improves the quality of inference depends on the structure of the data being analyzed and the inferential approach being used. Some methods work well for point estimation and interval estimation, others work well for hypothesis testing. No method universally beats out the option of doing no adjustment for outliers.

If you consider a different kind of difference between the A and B sets of observations than the constant additive difference rule that I’ve considered, I am fairly sure that the conclusions I’ve reached will not apply.

Although I’ve recently decided to take a break from working on OSS for a little while, I’m still as excited as ever about Julia as a language.

That said, I’m still unhappy with the performance of Julia’s core data analysis infrastructure. The performance of code that deals with missing values has been substantially improved thanks to the beta release of the NullableArrays package, which David Gold developed during this past Julia Summer of Code. But the DataFrames package is still a source of performance problems.

The goal of this post is to explain why Julia’s DataFrames are still unacceptably slow in many important use cases — and will remain slow even after the current dependency on the DataArrays package is replaced with a dependency on NullableArrays.

The core problem with the DataFrames library is that a DataFrame is, at its core, a black-box container that could, in theory, contain objects of arbitrary types. In practice, a DataFrame contains highly constrained objects, but those constraints are (a) hard to express to the compiler and (b) still too weak to allow the compiler to produce the most efficient machine code.

The use of any black-box container creates the potential for performance problems in Julia because of the way that Julia’s compiler works. In particular, Julia’s compiler is able to execute code quickly because it can generate custom machine code for every function call — and this custom machine code is specialized for the specific run-time types of the function’s arguments.

This run-time generation of custom machine code is called specialization. When working with black-box containers, Julia’s approach to specialization is not used to full effect because machine code specialization based on run-time types only occurs at function call sites. If you access objects from a black-box container and then perform extended computations on the results, those computations will not be fully specialized because there is no function call between (a) the moment at which type uncertainty about the contents of the black-box container is removed and (b) the moment at which code that could benefit from type information is executed.

To see this concern in practice, consider the following minimal example of a hot loop being executed on values that are extracted from a black-box container:

```julia
function g1(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function hot_loop(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function g2(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop(x, y)
end

container = Any[randn(10_000_000), randn(10_000_000)];

@time g1(container)
# 2.258571 seconds (70.00 M allocations: 1.192 GB, 5.03% gc time)

@time g2(container)
# 0.015286 seconds (5 allocations: 176 bytes)
```

`g1` is approximately 150x slower than `g2` on my machine. But `g2` is, at a certain level of abstraction, exactly equivalent to `g1`: the only difference is that the hot loop from `g1` has been moved into a separate function call. To convince yourself that the function call boundary is the only important difference between these two functions, consider the following variation of `g2` and `hot_loop`:

```julia
@inline function hot_loop_alternative(x, y)
    n = length(x)
    s = 0.0
    for i in 1:n
        s += x[i] * y[i]
    end
    s
end

function g3(black_box_container)
    x, y = black_box_container[1], black_box_container[2]
    hot_loop_alternative(x, y)
end

@time g1(container)
# 2.290116 seconds (70.00 M allocations: 1.192 GB, 4.90% gc time)

@time g2(container)
# 0.017835 seconds (5 allocations: 176 bytes)

@time g3(container)
# 2.250301 seconds (70.00 M allocations: 1.192 GB, 5.08% gc time)
```

On my system, forcing the hot loop code to be inlined removes **all** of the performance advantage that `g2` had over `g1`: `g3` is just as slow as `g1`. Somewhat ironically, by inlining the hot loop, we’ve prevented the compiler from generating machine code that’s specialized on the types of the `x` and `y` values we pull out of our `black_box_container`. Inlining removes a function call site, and function call sites are the only times when machine code can be fully specialized based on run-time type information.

This problem is the core issue that needs to be resolved to make Julia’s DataFrames as efficient as they should be. Below I outline three potential solutions to this problem. I do not claim that these three are the only solutions; I offer them only to illustrate important issues that need to be addressed.

One possible solution to the problem of under-specialization is to change Julia’s compiler. I think that work on that front could be very effective, but the introduction of specialization strategies beyond Julia’s current "specialize at function call sites" would make Julia’s compiler much more complex — and could, in theory, make some code slower if the compiler were to spend more time performing compilation and less time performing the actual computations that a user wants to perform.

A second possible solution is to generate custom DataFrame types for every distinct DataFrame object. This could convert DataFrames from black-box containers that contain objects of arbitrary type into fully typed containers that can only contain objects of types that are fully known to the compiler.

The danger with this strategy is that you could generate an excessively large number of different specializations — which would again run the risk of spending more time inside the compiler than inside of the code you actually want to execute. It could also create excessive memory pressure as an increasing number of specialized code paths are stored in memory. Despite these concerns, a more aggressively typed DataFrame might be a powerful tool for doing data analysis.
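A minimal sketch of what a more aggressively typed container might look like, written in current Julia 1.x syntax rather than the syntax of the Julia version this post was written against. The names `TypedTable` and `dot_columns` are hypothetical and are not part of the DataFrames package; the only point is that the column types become part of the container's own type, so the hot loop can be specialized without an extra function call.

```julia
# Hypothetical sketch: a table whose column types are part of its own type.
# `TypedTable` and `dot_columns` are illustrative names, not DataFrames API.
struct TypedTable{C<:NamedTuple}
    columns::C
end

# Because `C` records the concrete type of every column, this loop is compiled
# with full knowledge of the element types of `x` and `y`.
function dot_columns(t::TypedTable)
    x, y = t.columns.x, t.columns.y
    s = 0.0
    for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

t = TypedTable((x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0]))
dot_columns(t)  # 32.0
```

The cost described above shows up here as well: every distinct combination of column names and element types is a distinct `TypedTable` type, and each one triggers its own compilation of `dot_columns`.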

The last possible solution I know of is the introduction of a high-level API that ensures that operations on DataFrames always reduce down to operations on objects whose types are known when hot loops execute. This is essentially the computational model used in traditional databases: take in a SQL specification of a computation, make use of knowledge about the data actually stored in existing tables to formulate an optimized plan for performing that computation, and then perform that optimized computation.

I think this third option is the best because it will also solve another problem Julia’s data infrastructure will hit eventually: the creation of code that is insufficiently generic and not portable to other backends. If people learn to write code that only works efficiently for a specific implementation of DataFrames, then their code will likely not work when they try to apply it to data stored in alternative backends (e.g. traditional databases). This would trap users into data structures that may not suit their needs. The introduction of a layer of appropriate abstractions (as in dplyr and Ibis) would resolve both issues at once.

- Making Julia’s DataFrames better is still a work-in-progress.
- The core issue is still the usage of data structures that are not amenable to Julia’s type inference machinery. One of the two main issues is now resolved; another must be addressed before things function smoothly.
- Several solutions to this remaining issue are possible; we will probably see one or more of them gain traction in the near future.

I said that the crud factor principle is the concrete empirical form, realized in the sciences, of the logician’s formal point that the third figure of the implicative (mixed hypothetical) syllogism is invalid, the error in purported deductive reasoning termed affirming the consequent. Speaking methodologically, in the language of working scientists, what it comes to is that there are quite a few alternative theories \(T'\), \(T''\), \(T'''\), \(\ldots\) (in addition to the theory of interest \(T\)) that are each capable of deriving as a consequence the statistical counternull hypothesis \(H^{*}: \delta = (\mu_1 - \mu_2) > 0\), or, if we are correlating quantitative variables, that \(\rho > 0\). We might imagine (Meehl, 1990e) a big pot of variables and another (not so big but still sizable) pot of substantive causal theories in a specified research domain (e.g., schizophrenia, social perception, maze learning in the rat). We fantasize an experimenter choosing elements from these two pots randomly in picking something to study to get a publication. (We might impose a restriction that the variables have some conceivable relation to the domain being investigated, but such a constraint should be interpreted very broadly. We cannot, e.g., take it for granted that eye color will be unrelated to liking introspective psychological novels, because there is evidence that Swedes tend to be more introverted than Irish or Italians.) Our experimenter picks a pair of variables randomly out of the first pot, and a substantive causal theory randomly out of the second pot, and then randomly assigns an algebraic sign to the variables’ relation, saying, “\(H^{*}: \rho > 0\), if theory \(T\) is true.” In this crazy example there is no semantic-logical-mathematical relation deriving \(H^{*}\) from \(T\), but we pretend there is. Because \(H_0\) is quasi-always false, the counternull hypothesis \(\sim H_0\) is quasi-always true.
Assume perfect statistical power, so that when \(H_0\) is false we shall be sure of refuting it. Given the arbitrary assignment of direction, the directional counternull \(H^{*}\) will be proved half the time; that is, our experiment will “come out right” (i.e., as pseudo-predicted from theory \(T\)) half the time. This means we will be getting what purports to be a “confirmation” of \(T\) 10 times as often as the significance level \(\alpha = .05\) would suggest. This does not mean there is anything wrong with the significance test mathematics; it merely means that the odds of getting a confirmatory result (absent our theory) cannot be equated with the odds given by the \(t\) table, because those odds are based on the assumption of a true zero difference. There is nothing mathematically complicated about this, and it is a mistake to focus one’s attention on the mathematics of \(t\), \(F\), chi-square, or whatever statistic is being employed. The population from which we are drawing is specified by variables chosen from the first pot, and one can think of that population as an element of a superpopulation of variable pairs that is gigantic in size but finite, just as the population, however large, of theories defined as those that human beings will be able to construct before the sun burns out is finite. The methodological point is that \(T\) has not passed a severe test (speaking Popperian), the “successful” experimental outcome does not constitute what philosopher Wesley Salmon called a “strange coincidence” (Meehl, 1990a, 1990b; Nye, 1972; Salmon, 1984), because with high power \(T\) has almost an even chance of doing that, absent any logical connection whatever between the variables and the theory.
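As a rough check on the “10 times” figure in the passage above, under its own stated assumptions (perfect power, \(H_0\) always false, and a direction assigned by coin flip), the chance of a spurious “confirmation” compares with the nominal error rate as follows:

\[
P(\text{result agrees with } H^{*} \mid T \text{ unrelated to the data}) = \frac{1}{2}, \qquad \frac{1/2}{\alpha} = \frac{0.5}{0.05} = 10 .
\]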

Put another way: what’s wrong with psychology is that the theories being debated make such vague predictions as to be quasi-unfalsifiable. NHST is bad for the field because it encourages researchers to make purely directional predictions — that is, to make predictions that are so vague that their success is, to use an old-fashioned word, vain-glorious.

Several months ago, I promised to write an updated version of my old post, “The State of Statistics in Julia”, that would describe how Julia’s support for statistical computing has evolved since December 2012.

I’ve kept putting off writing that post for several reasons, but the most important reason is that all of my attention for the last few months has been focused on what’s wrong with how Julia handles statistical computing. As such, the post I’ve decided to write isn’t a review of what’s already been done in Julia, but a summary of what’s being done right now to improve Julia’s support for statistical computing.

In particular, this post focuses on several big changes to the core data structures that are used in Julia to represent statistical data. These changes should all ship when Julia 0.4 is released.

The primary problem with statistical computing in Julia is that the current tools were all designed to emulate R. Unfortunately, R’s approach to statistical computing isn’t amenable to the kinds of static analysis techniques that Julia uses to produce efficient machine code.

In particular, the following differences between R and Julia have repeatedly created problems for developers:

- In Julia, computations involving scalars are at least as important as computations involving vectors. In particular, iterative computations are first-class citizens in Julia. This implies that statistical libraries must allow developers to write efficient code that iterates over the elements of a vector in pure Julia. Because Julia’s compiler can only produce efficient machine code for computations that are type-stable, the representations of missing values, categorical values and ordinal values in Julia programs must all be type-stable. Whether a value is missing or not, its type must remain the same.
- In Julia, almost all end-users will end up creating their own types. As such, any tools for statistical computing must be generic enough that they can be extended to arbitrary types with little to no effort. In contrast to R, which can heavily optimize its algorithms for a very small number of primitive types, Julia developers must ensure that their libraries are both highly performant and highly abstract.
- Julia, like most mainstream languages, eagerly evaluates the arguments passed to functions. This implies that idioms from R which depend upon non-standard evaluation are not appropriate for Julia, although it is possible to emulate some forms of non-standard evaluation using macros. In addition, Julia doesn’t allow programmers to reify scope. This implies that idioms from R that require access to the caller’s scope are not appropriate for Julia.

The most important way in which these issues came up in the first generation of statistical libraries was in the representation of a single scalar missing value. In Julia 0.3, this concept is represented by the value `NA`, but that representation will be replaced when 0.4 is released. Most of this post will focus on the problems created by `NA`.

In addition to problems involving `NA`, there were also problems with how expressions were being passed to some functions. These problems have been resolved by removing the function signatures for statistical functions that involved passing expressions as arguments to those functions. A prototype package called DataFramesMeta, which uses macros to emulate some kinds of non-standard evaluation, is being developed by Tom Short.

In Julia 0.3, missing values are represented by a singleton object, `NA`, of type `NAtype`. Thus, a variable `x`, which might be either a `Float64` value or a missing value encoded as `NA`, will end up with type `Union(Float64, NAtype)`. This `Union` type is a source of performance problems because it defeats Julia’s compiler’s attempts to assign a unique concrete type to every variable.

We could remove this type-instability by ensuring that every type has a specific value, such as `NaN`, that signals missingness. This is the approach that both R and pandas take. It offers acceptable performance, but does so at the expense of generic handling of non-primitive types. Given Julia’s rampant usage of custom types, the sentinel values approach is not viable.

As such, we’re going to represent missing values in Julia 0.4 by borrowing some ideas from functional languages. In particular, we’ll be replacing the singleton object `NA` with a new parametric type `Nullable{T}`. Unlike `NA`, a `Nullable` object isn’t a direct scalar value. Rather, a `Nullable` object is a specialized container type that either contains one value or zero values. An empty `Nullable` container is taken to represent a missing value.

The `Nullable` approach to representing a missing scalar value offers two distinct improvements:

- `Nullable{T}` provides radically better performance than `Union(T, NA)`. In some benchmarks, I find that iterative constructs can be as much as 100x faster when using `Nullable{Float64}` instead of `Union(Float64, NA)`. Alternatively, I’ve found that `Nullable{Float64}` is about 60% slower than using `NaN` to represent missing values, but involves a generic approach that trivially extends to arbitrary new types, including integers, dates, complex numbers, quaternions, etc.
- `Nullable{T}` provides more type safety by requiring that all attempts to interact with potentially missing values explicitly indicate how missing values should be treated.
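To make the "container that holds zero or one values" idea concrete, here is a minimal sketch in current Julia 1.x syntax. It is not the `Nullable{T}` implementation itself: the names `Maybe`, `hasvalue`, `unwrap_or`, and `sum_present` are hypothetical and chosen only for illustration. The sketch shows both properties described above: the container has one concrete type whether or not a value is present, and callers have to say what should happen when the value is missing.

```julia
# Hypothetical sketch of a Nullable-style container; `Maybe`, `hasvalue`, and
# `unwrap_or` are illustrative names, not the API that ships with Julia.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)     # empty: `value` is left unset
    Maybe{T}(v) where {T} = new{T}(true, v)  # full: holds exactly one value
end

hasvalue(m::Maybe) = m.hasvalue

# Callers must state what to use when the value is missing.
unwrap_or(m::Maybe{T}, default::T) where {T} = m.hasvalue ? m.value : default

# Sum a vector of possibly-missing values, treating missing entries as zero.
function sum_present(xs::Vector{Maybe{Float64}})
    s = 0.0
    for x in xs
        s += unwrap_or(x, 0.0)
    end
    s
end

xs = [Maybe{Float64}(1.5), Maybe{Float64}(), Maybe{Float64}(2.5)]
sum_present(xs)  # 4.0
```

Every element of `xs` has the same concrete type, `Maybe{Float64}`, regardless of whether it is empty, which is the type-stability property that a `Union` with a singleton missing value lacks.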

In a future blog post, I’ll describe how `Nullable` works in greater detail.

In addition to revising the representation of missing values, I’ve also been working on revising our representation of categorical values. Working with categorical data in Julia has always been a little strange, because the main tool for representing categorical data, the `PooledDataArray`, has always occupied an awkward intermediate position between two incompatible objectives:

- A container that keeps track of the unique values present in the container and uses this information to efficiently represent values as pointers to a pool of unique values.
- A container that contains values of a categorical variable drawn from a well-defined universe of possible values. The universe can include values that are not currently present in the container.

These two goals come into severe tension when considering subsets of a `PooledDataArray`. The uniqueness constraint suggests that the pool should shrink, whereas the categorical variable definition suggests that the pool should be maintained without change. In Julia 0.4, we’re going to commit completely to the latter behavior and leave the problem of efficiently representing highly compressible data for another data structure.
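The "maintain the pool without change" behavior can be sketched as follows, in current Julia 1.x syntax. The names `SketchCategorical`, `decode`, and `subset` are hypothetical and are not the types planned for Julia 0.4; the sketch only illustrates that a subset keeps the full universe of levels, including levels that no longer occur in the data.

```julia
# Hypothetical sketch; `SketchCategorical`, `decode`, and `subset` are
# illustrative names, not the types planned for Julia 0.4.
struct SketchCategorical
    levels::Vector{String}  # fixed universe of possible values
    refs::Vector{Int}       # refs[i] is an index into `levels`
end

# Recover the stored values by looking each reference up in the pool.
decode(c::SketchCategorical) = c.levels[c.refs]

# Taking a subset keeps the whole pool, even levels no longer present.
subset(c::SketchCategorical, idx) = SketchCategorical(c.levels, c.refs[idx])

c = SketchCategorical(["low", "medium", "high"], [1, 3, 3, 2])
s = subset(c, 2:3)
decode(s)   # ["high", "high"]
s.levels    # still ["low", "medium", "high"]
```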

We’ll also begin representing scalar values of categorical variables using custom types. The new `CategoricalVariable` and `OrdinalVariable` types that will ship with Julia 0.4 will further the efforts to put scalar computations on an equal footing with vector computations. This will be particularly notable for dealing with ordinal variables, which are not supported at all in Julia 0.3.

Many R functions employ non-standard evaluation as a mechanism for augmenting the current scope with the column names of a `data.frame`. In Julia, it’s often possible to emulate this behavior using macros. The in-progress DataFramesMeta package explores this alternative to non-standard evaluation. We will also be exploring other alternatives to non-standard evaluation in the future.

In the long-term future, I’m hoping to improve several other parts of Julia’s core statistical infrastructure. In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables. I’ll write another post about those issues later.

**-2nd Normal Form**: The database contains at least one table that is a corrupt, out-of-date copy of another table, except with additional columns. It is impossible to determine if these additional columns can be trusted.

**-3rd Normal Form**: The database contains at least one table whose name contains the string `_v1` and another table whose name contains the string `_v2`. At least one column in the `_v1` table must be an undeclared foreign key that refers to rows that no longer exist in any other table.

**-4th Normal Form**: The database contains at least one table that has two or more columns whose contents are exactly the same, except that one of the columns has a name that is a misspelling of the other column’s name.

**-5th Normal Form**: The database contains (A) at least one table whose name contains the string `do_not_use` and (B) at least one other table whose name contains the string `do_not_ever_ever_use`. In each of these tables separately, at least two columns must only contain NULL’s repeated for every single row. In addition, every string in these tables must have random amounts of whitespace padded on the left- and right-hand sides. Finally, at least one row must contain the text, “lasciate ogni speranza, voi ch’entrate.”

```julia
julia> a = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> function foo!(a)
           a[1] = 10
           return
       end
foo! (generic function with 1 method)

julia> foo!(a)

julia> a
3-element Array{Int64,1}:
 10
  2
  3

julia> function bar!(a)
           a = [1, 2]
           return
       end
bar! (generic function with 1 method)

julia> bar!(a)

julia> a
3-element Array{Int64,1}:
 10
  2
  3
```

Why does the first function successfully alter the global variable `a`, but the second function does not?

To answer that question, we need to explain the distinction between values and bindings. We’ll start with a particularly simple example of a value and a binding.

In Julia, the number `1` is a value:

```julia
julia> 1
1
```

In contrast to operating on a value, the Julia assignment operation shown below creates a binding:

```julia
julia> a = 1
1
```

This newly created binding is an association between the symbolic name `a` and the value `1`. In general, a binding operation always associates a specific value with a specific name. In Julia, the valid names that can be used to create bindings are symbols, because it is important that the names be parseable without ambiguity. For example, the string `"a = 1"` is not an acceptable name for a binding, because it would be ambiguous with the code that binds the value `1` to the name `a`.

This first example of values vs. bindings might lead one to believe that values and bindings are very easy to both recognize and distinguish. Unfortunately, the values of many common objects are not obvious to many newcomers.

What, for example, is the value of the following array?

```julia
julia> [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3
```

To answer this question, note that the value of this array is *not* defined by the contents of the array. You can confirm this by checking whether Julia considers two objects to be *exactly identical* using the `===` operator:

```julia
julia> 1 === 1
true

julia> [1, 2, 3] === [1, 2, 3]
false
```

The general rule is simple, but potentially non-intuitive: two arrays with identical contents are *not the same array*. To motivate this, think of arrays as if they were cardboard boxes. If I have two cardboard boxes, each of which contains a single ream of paper, I would not claim that the two boxes are the exact same box just because they have the same contents. Our intuitive notion of object identity is rich enough to distinguish between two containers with the same contents, but it takes some time for newcomers to programming languages to extend this notion to their understanding of arrays.

Because every container is distinct regardless of what it contains, every array is distinct because every array is its own independent container. *An array’s identity is not defined by what it contains*. As such, its value is not equivalent to its contents. Instead, an array’s value is a unique identifier that allows one to reliably distinguish each array from every other array. Think of arrays like numbered cardboard boxes. The value of an array is its identifier: thus the value of `[1, 2, 3]` is something like the identifier “Box 1”. Right now, “Box 1” happens to contain the values `1`, `2` and `3`, but it will continue to be “Box 1” even after its contents have changed.
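The box analogy can be checked directly. The following is a small sketch in current Julia 1.x syntax; the names `a`, `b`, and `c` are just illustrative, and `==` (comparison of contents) is used alongside `===` (comparison of identity) to separate the two notions.

```julia
a = [1, 2, 3]
b = a            # a second name for the same box
c = [1, 2, 3]    # a different box that happens to have equal contents

a == c           # true: the contents are equal
a === c          # false: they are two distinct arrays
a === b          # true: both names refer to one array

b[1] = 10        # change the contents of the shared box
a                # now [10, 2, 3], because `a` and `b` name the same array
a === b          # still true: the box is the same box after its contents change
a == c           # false: the contents now differ
```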

Hopefully that clarifies what the value of an array is. Starting from that understanding, we need to re-examine bindings because bindings themselves behave like containers.

A binding can be thought of as a named box that can contain either 0 or 1 values. Thus, when a new Julia session is launched, the name `a` has no value associated with it: it is an empty container. But after executing the line, `a = 1`, the name has a value: the container now has one element in it. Being a container, the name is distinct from its contents. As such, the name can be rebound by a later operation: the line `a = 2` will change the contents of the box called `a` to refer to the value `2`.

The fact that bindings behave like containers becomes a source of confusion when the value of a binding is itself a container:

```julia
a = [1, 2, 3]
```

In this case, the value associated with the name `a` is the identifier of an array that happens to have the values `1`, `2`, and `3` in it. But if the contents of that array are changed, the name `a` will still refer to the same array, because the value associated with `a` is not the contents of the array, but the identifier of the array.

As such, there is a very large difference between the following two operations:

```julia
a[1] = 10
a = [1, 2]
```

- In the first case, we are changing the contents of the array that `a` refers to.
- In the second case, we are changing which array `a` refers to.

In this second case, we are actually creating a brand new container as an intermediate step to changing the binding of `a`. This new container has, as its initial contents, the values `1` and `2`. After creating this new container, the name `a` is changed to refer to the value that is the identifier of this new container.

This is why the two functions at the start of this post behave so differently: one mutates the contents of an array, while the other changes which array a name refers to. Because variable names in functions are local, changing bindings inside of a function does not change the bindings outside of that function. Thus, the function `bar!` does not behave as some would hope. To change the contents of an array wholesale, you must not change bindings: you must change the contents of the array. To do that, `bar!` should be written as:

```julia
function bar!(a)
    a[:] = [1, 2]  # overwrite the contents in place; the lengths must match
    return
end
```

The notation `a[:]` allows one to talk about the contents of an array, rather than its identifier. In general, you should not expect that you can change the contents of any container without employing some indexing syntax that allows you to talk about the contents of the container, rather than the container itself.
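One caveat about `a[:] = [1, 2]`: the assignment goes element by element, so it raises a `DimensionMismatch` error when `a` does not already have two elements, as is the case for the three-element array at the start of this post. A sketch of one way to replace the contents regardless of length, in current Julia 1.x syntax, is shown here. The helper name `replace_contents!` is hypothetical and not part of the original example.

```julia
# Hypothetical helper, not from the post: replace the contents of `a` with
# `new_contents`, even when the lengths differ.
function replace_contents!(a::Vector, new_contents)
    resize!(a, length(new_contents))  # grow or shrink the same array in place
    a[:] = new_contents               # then overwrite its contents
    return a
end

a = [10, 2, 3]
b = a                         # a second name for the same array
replace_contents!(a, [1, 2])
a                             # [1, 2]
b === a                       # true: the binding never changed, only the contents
```

Because the array is resized rather than rebound, every other name that refers to it, including the caller's variable, sees the new contents.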

Please do not use arithmetic on `data.frame` objects when programming in R. It’s a hack that only works if you know **everything** about your datasets. If anything happens to change the order of the rows in your data set, previously safe `data.frame` arithmetic operations will produce incorrect answers. If you learn to always explicitly merge two tables together before performing arithmetic on their shared columns, you’ll produce code that is both more reliable and more powerful.

You may not be aware of it, but R allows you to do arithmetic on `data.frame` objects. For example, the following code works in R as of version 3.0.2:

```r
> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2), Obs = c(2.0, 3.0))
> df3 <- (df1 + df2) / 2
> df3
  ID Obs
1  1 1.5
2  2 2.5
```

If you discover that you can do this, you might think that it’s a really cool trick. You might even start using `data.frame` arithmetic without realizing that your specific example had a bunch of special structure that was directly responsible for you getting the right answer.

Unfortunately, other examples that you didn’t see would have produced rather less pleasant outputs and led you to realize that arithmetic operations on `data.frame` objects don’t really make sense:

```r
> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- (df1 + df2) / 2
> df3
   ID Obs
1 1.5   2
2 1.5   2
```

What happened here is obvious in retrospect: R added the two tables together position by position, `ID` column included, and then divided the result by two. The problem is that you didn’t actually want to add everything together by position and then divide the result by two, because you had forgotten that the matching rows in `df1` and `df2` were not in the same index positions in the two tables.

Thankfully, it turns out that doing the right thing just requires a few more characters. What you should have done was to call `merge` before doing any arithmetic:

```r
> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     2    1.5
2  2     2     3    2.5
```

What makes `merge` so unequivocally superior to `data.frame` arithmetic is that it still works when the two inputs have different numbers of rows:

```r
> df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0))
> df2 <- data.frame(ID = c(1, 2, 3), Obs = c(5.0, 6.0, 7.0))
> df3 <- merge(df1, df2, by = "ID")
> df3 <- transform(df3, AvgObs = (Obs.x + Obs.y) / 2)
> df3
  ID Obs.x Obs.y AvgObs
1  1     1     5      3
2  2     2     6      4
```

Now that you know why performing arithmetic operations on `data.frame` objects is generally unsafe, I implore you to stop doing it. Learn to love `merge`.

I just got home from JuliaCon, the first conference dedicated entirely to Julia. It was a great pleasure to spend two full days listening to talks about a language that I started advocating for just a little more than two years ago.

What follows is a very brief review of the talks that excited me the most. It’s not in any way exhaustive: there were a bunch of other good talks that I saw as well as a few talks I missed so that I could visit the Data Science for Social Good fellows.

The optimization community seems to be the academic field that’s been most ready to adopt Julia. Two talks about using Julia for optimization stood out: Iain Dunning and Joey Huchette’s talk about JuMP.jl, and Madeleine Udell’s talk about CVX.jl.

JuMP implements a DSL that allows users to describe an optimization problem in purely mathematical terms. This problem encoding can be then passed to one of many backend solvers to determine a solution. By abstracting across solvers, JuMP makes it easier for people like me to get access to well-established tools like GLPK.

CVX is quite similar to JuMP, but it implements a symbolic computation system that’s especially focused on allowing users to encode convex optimization problems. One of the things that’s most appealing about CVX is that it automatically confirms whether the problem you’re encoding is convex or not. Until I saw Madeleine’s talk, I hadn’t realized how much progress had been made on CVX.jl. Now that I’ve seen CVX.jl in action, I’m hoping to start using it for some of my work. I’ll probably also write a blog post about it in the future.

I really enjoyed the statistics talks given by Doug Bates, Simon Byrne and Dan Wlasiuk. I was especially glad to hear Doug Bates remind the audience that, years ago, he’d attended a small meeting about R that was similar in size to this first iteration of JuliaCon. Over the course of the intervening decades, he noted that the R community has grown from dozens to millions of users.

Given that Julia is still something of a language nerd’s language, it’s no surprise that some of the best talks focused on language-level issues.

Arch Robison gave a really interesting talk about the tools used in Julia 0.3 to automatically vectorize code so that it can take advantage of SIMD instructions. For those coming from languages like R or Python, you should be aware that vectorization means almost the exact opposite thing to compiler writers that it means to high-level language users: vectorization involves the transformation of certain kinds of iterative code into the thread-free parallelized instructions that modern CPU’s provide for performing a single operation on multiple data chunks simultaneously. I’ve come to love this kind of compiler design discussion and the invariance properties the compiler needs to prove before it can perform program transformations safely. For example, Arch noted that a loop that accumulates integers can be safely reordered to use SIMD instructions, but that the same reordering is not automatically safe for floating point numbers because of failures of associativity.
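The failure of associativity mentioned above is easy to see in a small sketch (current Julia 1.x syntax, standard 64-bit floats and integers; the variable names are only illustrative). Regrouping a floating point sum can change the result, while regrouping an integer sum cannot, even when the intermediate result wraps around:

```julia
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
x == y                        # false: the two groupings round differently

i = (typemax(Int) + 1) + (-1)
j = typemax(Int) + (1 + (-1))
i == j                        # true: wraparound integer addition is associative
```

A vectorized reduction effectively regroups the additions, which is why a compiler cannot apply it to a floating point sum without changing the answer the program would otherwise produce.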

After Arch spoke, Jeff Bezanson gave a nice description of the process by which Julia code is transformed from raw text users enter into the REPL into the final compiled form that gets executed by CPU’s. For those interested in understanding how Julia works under the hood, this talk is likely to be the best place to start.

In addition, Leah Hanson and Keno Fischer both gave good talks about improved tools for debugging Julia code. Leah spoke about TypeCheck.jl, a system for automatically warning about potential code problems. Keno demoed a very rough draft of a Julia debugger built on top of LLDB. As an added plus, Keno also demoed a new C++ FFI for Julia that I’m really looking forward to. I’m hopeful that the new FFI will make it much easier to wrap C++ libraries for use from Julia.

Both Avik Sengupta and Michael Bean described their experiences using Julia in production systems. Knowing that Julia was being used in production anywhere was inspiring.

Daniel C. Jones and Spencer Russell both gave great talks about the developments taking place in graphics and audio support. Daniel C. Jones’s demo of a theremin built using Shashi Gowda’s React.jl and Spencer Russell’s AudioIO.jl was especially impressive.

The Julia community really is a community now. It was big enough to sell out a small conference and to field a large variety of discussion topics. I’m really excited to see how the next JuliaCon will turn out.

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.

Person B: What do you make of the fact that subsequent studies have found that the correlation is closer to 0.001?

Person A: Oh, I was right all along: those studies continue to support my theoretical assertion that the empirical effect goes in the direction that my theory predicted. Exact numbers are meaningless in the social sciences, since we only conduct proof-of-concept studies and there are so many intervening variables we can’t measure.

Person A: And, just as I predicted, I found in my early studies that the correlation between X and Y is 0.4.

Person B: What do you make of the fact that a conceptual replication, which employed words rather than pictures, found that the correlation between X and Y was -0.05?

Person A: Oh, I was right all along: X does have an effect on Y, even though the effect can switch directions under some circumstances. What matters is that X affects Y at all, which is deeply counter-intuitive.

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma: the papers on the JL Lemma do not simply establish the existence of such projections, but also provide (1) an explicit bound on the dimensionality required for a projection to ensure that it will approximately preserve distances and (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma had a constructive proof, I decided to implement code in Julia to construct examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgupta and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\). Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\), $$ (1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2. $$ Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

*Part 1*: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.

*Part 2*: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not be distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k * d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).
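One way to see why the variance is \(\frac{1}{k}\) is to compute the expected squared length of a projected vector. The following is only a short sketch, assuming the entries \(A_{ij}\) are IID with mean \(0\) and variance \(\frac{1}{k}\) as described above; the cross terms vanish because distinct entries are independent with mean zero:

$$ \mathbb{E}\left[||Ax||^2\right] = \sum_{i=1}^{k} \mathbb{E}\left[\left(\sum_{j=1}^{d} A_{ij} x_j\right)^2\right] = \sum_{i=1}^{k} \sum_{j=1}^{d} \frac{1}{k} x_j^2 = ||x||^2. $$

Taking \(x = u - v\) shows that squared distances are preserved on average. The harder part of the proof, which is not reproduced here, is showing that \(||Ax||^2\) concentrates tightly enough around this mean once \(k\) is as large as the bound requires.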

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

```julia
mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))
```

Having defined this function, we can try it out on a simple problem:

```julia
mindim(3, 0.1)
# => 942
```

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It’s obvious that, for data sets that contain \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensions required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffice to preserve the distances between \(3\) points.
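To get a feel for how the bound behaves, here is a small Python sketch of the same formula as the Julia `mindim` function above. The loop values are illustrative choices of mine; the only thing taken from the lemma is the bound itself. Note that the original dimensionality, \(d\), never appears as an argument.

```python
import math

def mindim(n, eps):
    """Smallest k allowed by the bound: 4 log(n) / (eps^2 / 2 - eps^3 / 3)."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

# k grows only logarithmically in the number of points n.
for n in (3, 1_000, 1_000_000):
    print(n, mindim(n, 0.1))
```

Because \(k\) depends on \(n\) only through \(\log(n)\), going from three points to a million points raises the required dimensionality by roughly a factor of twelve.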

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

```julia
using Distributions

function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end
```

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

```julia
X = eye(3, 3)

ε = 0.1

A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959
#  -0.00501367   0.0316806    0.0460191
#   0.0633815   -0.0136478   -0.0198676
#   0.0262627    0.00187459  -0.0122604
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437
#   0.00685299  -0.0513301    0.0501431
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157
#   0.0612926    0.0276185    0.0271352
#   ⋮
#  -0.00167347  -0.018576     0.0290964
#   0.0158393    0.0124403   -0.0208216
#  -0.00833401   0.0323784    0.0245698
#   0.019355     0.0057538    0.0150561
#   0.00352774   0.031572    -0.0262811
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289
#   0.0106868    0.0341909    0.0116523
#   0.0072586   -0.0337501    0.0405171 ,
#
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959
#  -0.00501367   0.0316806    0.0460191
#   0.0633815   -0.0136478   -0.0198676
#   0.0262627    0.00187459  -0.0122604
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437
#   0.00685299  -0.0513301    0.0501431
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157
#   0.0612926    0.0276185    0.0271352
#   ⋮
#  -0.00167347  -0.018576     0.0290964
#   0.0158393    0.0124403   -0.0208216
#  -0.00833401   0.0323784    0.0245698
#   0.019355     0.0057538    0.0150561
#   0.00352774   0.031572    -0.0262811
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289
#   0.0106868    0.0341909    0.0116523
#   0.0072586   -0.0337501    0.0405171 )
```

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

```julia
function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)

    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            # Compare the projected distance, d_new, against the bounds
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end

    return true
end
```

And then we can test out the results:

```julia
ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.104506
#     Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 2.006130
#     Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955495
#     Within bounds 1.800000 <= 1.955495 <= 2.200000
```

As claimed, the distances are indeed preserved up to a factor of \(\epsilon\). But, as we noted earlier, the JL Lemma has a somewhat perverse consequence for our \(3×3\) matrix: we’ve expanded our input into a \(942×3\) matrix rather than reduced its dimensionality.

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

```julia
X = eye(50_000, 3)

A, k, AX = projection(X, ε)

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#     Old distance: 2.000000
#     New distance: 2.021298
#     Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.955502
#     Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#     Old distance: 2.000000
#     New distance: 1.988945
#     Within bounds 1.800000 <= 1.988945 <= 2.200000
```

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.
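One caveat worth making explicit: the lemma’s guarantee is probabilistic, so a single random draw of \(A\) can occasionally miss the bounds for some pair. The “randomized polynomial time” clause in the lemma amounts to redrawing \(A\) until every pair passes. Below is a minimal pure-Python sketch of that retry loop. The helper names (`gaussian_matrix`, `matvec`, `find_projection`) and the `max_attempts` cap are hypothetical choices for illustration, and \(d = 500\) is used instead of \(50,000\) only to keep the pure-Python run short.

```python
import math
import random

def gaussian_matrix(k, d, rng):
    """k x d matrix of IID N(0, 1/k) draws, as in the construction above."""
    sigma = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]

def matvec(A, x):
    """Matrix-vector product A * x for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def sqdist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def all_pairs_preserved(points, projected, eps):
    """Check the (1 - eps, 1 + eps) bound on squared distances for every pair."""
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            old = sqdist(points[i], points[j])
            new = sqdist(projected[i], projected[j])
            if not (1 - eps) * old <= new <= (1 + eps) * old:
                return False
    return True

def find_projection(points, k, eps, rng, max_attempts=50):
    """Redraw A until every pairwise distance is preserved (hypothetical helper)."""
    d = len(points[0])
    for attempt in range(1, max_attempts + 1):
        A = gaussian_matrix(k, d, rng)
        projected = [matvec(A, p) for p in points]
        if all_pairs_preserved(points, projected, eps):
            return A, projected, attempt
    raise RuntimeError("no suitable projection found; increase max_attempts")

rng = random.Random(2014)  # fixed seed so the run is repeatable
d, k, eps = 500, 942, 0.1
# First three standard basis vectors of R^d, mirroring eye(d, 3) above.
points = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(3)]
A, projected, attempts = find_projection(points, k, eps, rng)
print("attempts needed:", attempts)
```

In practice a draw usually succeeds within the first few attempts, so the loop rarely runs for long.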

Contrary to my naive conception of the JL Lemma, the literature on the lemma does not only tell us that, abstractly, distances can be preserved by dimensionality reduction techniques. It also tells us how to perform this reduction, and the mechanism is both simple and general.

It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.

Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion will take place without any warnings, even though it may lead to data corruption.

The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values. As a consequence of the lower precision of floating point numbers, two **unequal** integer values in the input file may be converted to two **equal** floating point values in the `data.frame` R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.
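The underlying precision limit is not specific to R. As a rough illustration, here is a Python sketch using the same pair of values that appears in the tables below; Python’s `float` is an IEEE 754 double, the same representation R uses for its numeric type. A double has a 53-bit significand, so integers above \(2^{53}\) can no longer all be represented exactly.

```python
a = 100000000000000000
b = 100000000000000001

# As exact integers, the two IDs are different.
print(a == b)                # False

# After conversion to double precision, they collide.
print(float(a) == float(b))  # True

# 2**53 is the point past which consecutive integers stop being representable.
print(2**53)                              # 9007199254740992
print(float(2**53) == float(2**53 + 1))   # True

# Largest value a signed 32-bit integer can hold.
print(2**31 - 1)                          # 2147483647
```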

Below, I demonstrate this general problem using two specific data sets. The specific failure case that I outline occurred for me while using R 3.0.2 on my x86_64-apple-darwin10.8.0 platform laptop, which is a “MacBook Pro Retina, 13-inch, Late 2013” model.

Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:

| ID |
|---|
| 1000 |
| 1001 |

| ID |
|---|
| 100000000000000000 |
| 100000000000000001 |

What happens when they are read into R using the read.csv function?

32-bit compatible integer values are parsed, correctly, using R’s integer type, which does not lead to data corruption:

```r
data <- "MySQLID\n1000\n1001"

ids <- read.csv(text = data)

ids[1, 1] == ids[2, 1]
# [1] FALSE

class(ids$MySQLID)
# [1] "integer"
```

64-bit compatible integer values are parsed, incorrectly, using R’s numeric type, which does lead to data corruption:

```r
data <- "MySQLID\n100000000000000000\n100000000000000001"

ids <- read.csv(text = data)

ids[1, 1] == ids[2, 1]
# [1] TRUE

class(ids$MySQLID)
# [1] "numeric"
```

What should one make of this example? At the minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempts to treat integer literals as anything other than 64-bit integers.

But I would argue that this example also teaches a much more general point: it suggests that the assertion that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I’ve provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption. How that data corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I would hope that we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.

**tl;dr: Every website I use seems to have a slightly different password policy. Here I review some very basic algebraic facts about randomly generated passwords. Based on those facts, I argue that every able-bodied website should adopt a few simple standards for user passwords, including the following rules:**

- Never prevent users from copying-and-pasting passwords.
- Never restrict the length of passwords to anything lower than 40 characters. If you can, give users 255 characters to work with.
- Never truncate the password that a user submits, because truncation will invalidate the user’s password. Do not lock users out of their accounts as a punishment for using strong passwords.
- Never impose requirements on the types of characters used in passwords. Avoid encouraging l33t-speak passwords.

After a long drive last week, I arrived home to find my phone filled with a long sequence of two-factor authentication tokens, which I took as evidence that someone had been trying to break into one of my accounts. After looking into the matter a bit, it became clear that the account to which access had been requested was one of a few accounts that shared a password with my old (and recently compromised) Adobe account. Thankfully, all of my important user accounts had two-factor authentication enabled and, as such, I have not found evidence of any successful intrusions. But the incident was nevertheless sufficient to inspire me to create unique passwords for every single website account that I have.

Unfortunately, the process of resetting my passwords en masse served mostly to remind me how poorly password security is managed by most websites. Every website seems to have its own set of ad hoc standards: some require passwords with a lot of different types of characters, whereas others require long passwords. Many websites will allow passwords to be 40 characters or longer, but a large number of websites impose puzzling restrictions on the maximum length of passwords.

In particular, one site, which will remain nameless to protect the guilty, required that passwords contain (a) 1 lowercase letter, (b) 1 uppercase letter, (c) 1 digit and (d) 1 special character, and simultaneously required that passwords not be longer than 10 characters.

The widespread requirement that users employ l33t-speak passwords puzzles me. A little bit of arithmetic makes it clear that a randomly generated password is enormously more likely to be secure if it consists of 40 randomly chosen lowercase characters than if it consists of 5 randomly chosen characters from the extended character set that contains lowercase characters, uppercase characters, digits and special characters. To see this, consider the number of distinct passwords generated by either (a) varying the diversity of the character set or (b) varying the length of the password. Some sample calculations are shown below in which I consider four types of character sets and randomly generated passwords of length 5, 10 and 20:

| Character Set | Password Length | Number of Distinct Passwords |
|---|---|---|
| Lowercase | 5 | 11881376 |
| Lowercase | 10 | 141167095653376 |
| Lowercase | 20 | 19928148895209409152340197376 |
| Lowercase + Uppercase | 5 | 380204032 |
| Lowercase + Uppercase | 10 | 144555105949057024 |
| Lowercase + Uppercase | 20 | 20896178655943101411324274803736576 |
| Lowercase + Uppercase + Digits | 5 | 916132832 |
| Lowercase + Uppercase + Digits | 10 | 839299365868340224 |
| Lowercase + Uppercase + Digits | 20 | 704423425546998022968330264616370176 |
| Lowercase + Uppercase + Digits + Special Characters | 5 | 1934917632 |
| Lowercase + Uppercase + Digits + Special Characters | 10 | 3743906242624487424 |
| Lowercase + Uppercase + Digits + Special Characters | 20 | 14016833953562607293918185758734155776 |
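The counts in the table can be reproduced with a few lines of Python. The alphabet sizes below are my reading of the table rather than something stated in it: 26 lowercase letters, 52 letters in both cases, 62 once digits are added, and 72 once special characters are added, which implies that 10 special characters were counted.

```python
# Alphabet sizes inferred from the table; the 10 special characters are an
# assumption backed out from the reported counts (72**5 == 1934917632).
alphabets = {
    "Lowercase": 26,
    "Lowercase + Uppercase": 52,
    "Lowercase + Uppercase + Digits": 62,
    "Lowercase + Uppercase + Digits + Special Characters": 72,
}

# Number of distinct passwords is (alphabet size) ** (password length).
counts = {
    (name, length): size ** length
    for name, size in alphabets.items()
    for length in (5, 10, 20)
}

for (name, length), count in counts.items():
    print(f"{name} | {length} | {count}")
```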

It’s hard not to be struck by the evidence the table above provides for the enormous superiority of long passwords over diverse passwords. When you go from the least diverse password to the most diverse password for a 5-character long password, you only go from 11881376 different passwords to 1934917632 passwords. If random guessing is feasible for one, it’s not that much harder for the other.

If, instead, you stick with only lowercase passwords and go from 5-characters to 20-characters, you go from 11881376 different passwords to 19928148895209409152340197376 passwords. Even if you could try out all of the smaller passwords in a second, we’d all be dead before you tried out all of the longer passwords.

To see why this happens, consider the number of distinct passwords you get when you either double the number of characters used or you double the length of the passwords generated. If you start with \(a\) different letters and use \(n\) of them, you’ll end up with \(a^n\) different passwords. And when you double the number of different letters, you increase this number from \(a^n\) to \((2a)^n = (2^n)(a^n)\).

If, instead, you double the length of the password, you increase \(a^n\) to \(a^{2n} = (a^n)^2 = (a^n)(a^n)\). This means that the superiority of using longer passwords over diverse passwords grows like \((\frac{a}{2})^n\), which is a huge number for even simple character sets that contain \(a\) different letters.

For example, if you’re looking at just lowercase characters for English, using a longer password instead of more diverse passwords is going to be \(13^n\) times better. In other words, going from a 10-character password to a 20-character password is more than a billion times more secure than allowing both upper and lowercase letters in passwords that are always 10 characters long.
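The comparison is easy to check numerically. This short Python sketch uses \(a = 26\) and \(n = 10\), the lowercase 10-character case from the text, and compares the two ways of doubling:

```python
a, n = 26, 10

base = a ** n                    # 10 lowercase characters
doubled_alphabet = (2 * a) ** n  # 10 characters, upper and lowercase
doubled_length = a ** (2 * n)    # 20 lowercase characters

# Doubling the alphabet multiplies the count by 2**n.
print(doubled_alphabet // base)            # 1024

# Doubling the length multiplies the count by a**n.
print(doubled_length // base)              # 141167095653376

# Advantage of the longer password over the more diverse one: (a / 2)**n.
print(doubled_length // doubled_alphabet)  # 137858491849
```

The final ratio is \(13^{10}\), roughly \(1.4 \times 10^{11}\), which is where the "more than a billion times" figure comes from.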

In response to my argument, one might ask: *“won’t using longer passwords impose a serious storage cost on websites?”*

Thankfully, the answer is, *“not really”*. Suppose you’re a wildly popular site and have a billion users. Then going from 10 character passwords to 255 characters will impose an additional storage cost of about 245 GB on you. Right now, getting a 512 GB hard drive costs about 50 dollars. So, the increased storage cost for most websites should be on the order of magnitude of 100 dollars. That seems quite affordable to me.
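The estimate above is a worst case that assumes the password text itself is stored. If a site instead stores a fixed-length salted hash of each password (an assumption on my part, though it is the standard recommendation), then the stored size does not depend on the password’s length at all. A minimal sketch using Python’s standard library, with an iteration count and salt size chosen purely for illustration:

```python
import hashlib
import os

# Worst-case arithmetic from the text: 245 extra bytes for a billion users.
extra_bytes = (255 - 10) * 1_000_000_000
print(extra_bytes / 1e9, "GB")  # 245.0 GB

def stored_record(password, iterations=100_000):
    """Return (salt, derived key) for a password; parameters are illustrative."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, key

short_salt, short_key = stored_record("a" * 10)
long_salt, long_key = stored_record("a" * 255)

# The derived key is 32 bytes whether the password has 10 or 255 characters.
print(len(short_key), len(long_key))
```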

Why did we wind up with a requirement for l33t-speak passwords and no ability to use long passwords? Probably because websites have found that users keep using simple dictionary words like “password” as their password, which you can prevent as long as you force users to set up passwords that contain digits and other special characters. It’s hard to get users to employ truly random passwords, so it’s easier to impose some trivial amount of randomness by making them use “random” characters like `@` and `$`.

Once we acknowledge the reasons why we wound up with the current set of heuristics for making passwords secure, we can move on and think about how websites could do a better job in the future. The foremost step towards security, in my mind, is facilitating the use of high-quality password managers, like 1Password. You can make things easier for advanced users by adopting the following steps:

- Allow users to copy-and-paste their passwords from another source. When you force users to set up passwords they can easily type, you incentivize them to use short, memorable passwords. But short, memorable passwords are terrible passwords.
- Drop the unhelpful requirement that passwords contain a “diverse” set of characters. Adding a few special characters to a password does almost nothing to make it more secure. It’s not hard for a hacker to set up a dictionary of passwords that adds l33t-speak equivalents for every real English word.
- Impose a requirement that passwords be at least 12 characters long. Under no circumstances impose on users the requirement that passwords be shorter than 40 characters. Similarly, you should never truncate the password that a user submits to a length shorter than the string they submitted. Truncating passwords makes a password less secure and also locks the user out of their account.
- Offer to automatically generate a fully random password for every new user with an explanation of how this random password is better for security. Most users won’t take you up on the offer, but those that do will benefit substantially from the polite nudge towards using per-website unique passwords. ]]>