cumplyr: Extending the plyr Package to Handle Cross-Dependencies

Introduction

For me, Hadley Wickham's reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split your data into several subsets, perform some computation on each subset and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.

The Split-Apply-Combine Strategy is Broader than plyr

The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that it implements one very specific version of that strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. This makes plyr inapplicable to computations that involve cross-dependencies between observations: cumulative quantities like running means and local quantities like kernelized means cannot be computed using plyr. To illustrate this limitation, let's consider three very simple data analysis problems.

Computing Forward-Running Means

Suppose that you have the following data set:

Time Value
1 1
2 3
3 5

To compute a forward-running mean, you need to split this data into three subsets:

Time Value
1 1
Time Value
1 1
2 3
Time Value
1 1
2 3
3 5

In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:

Time Value
1 1
2 2
3 3

This sort of computation occurs often enough in simpler forms that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to be computed on subsets that are not disjoint.
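For this simple case, the forward-running mean can in fact be written directly with cumsum, without any explicit splitting machinery. A minimal base-R sketch, using the toy data set above:

```r
# Forward-running mean: cumulative sum of Value divided by the
# number of rows seen so far. Reproduces the table above.
data <- data.frame(Time = 1:3, Value = c(1, 3, 5))
running.mean <- cumsum(data$Value) / seq_along(data$Value)
print(running.mean)  # 1 2 3
```

The point of the splitting formulation is that it generalizes to computations where no such closed-form shortcut exists.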

Computing Backward-Running Means

Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:

Time Value
3 5
Time Value
2 3
3 5
Time Value
1 1
2 3
3 5

And the final result is:

Time Value
1 3
2 4
3 5
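As with the forward case, this particular backward-running mean has a direct base-R expression: reverse the data, take cumulative sums, reverse back, and divide by the size of each suffix. A sketch:

```r
# Backward-running mean over the toy data: each entry is the mean
# of the current Value and all later Values.
value <- c(1, 3, 5)
backward.mean <- rev(cumsum(rev(value))) / rev(seq_along(value))
print(backward.mean)  # 3 4 5
```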

Computing Local Means (AKA Kernelized Means)

Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately preceding and succeeding neighbors. This computation must create the following subsets of data:

Time Value
1 1
2 3
Time Value
1 1
2 3
3 5
Time Value
2 3
3 5

Within these non-disjoint subsets, means are computed and the result is:

Time Value
1 2
2 3
3 4
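In base R this local mean can be sketched with a window of radius 1 around each observation, clipped at the edges of the data (a hand-rolled version of what cumplyr's norm-ball splitting automates):

```r
# Local (kernelized) mean with window radius 1: each row is pooled
# with its immediate neighbors, truncated at the data's boundaries.
value <- c(1, 3, 5)
r <- 1
local.mean <- sapply(seq_along(value), function(i) {
  mean(value[max(1, i - r):min(length(value), i + r)])
})
print(local.mean)  # 2 3 4
```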

A Strategy for Handling Non-Disjoint Subsets

How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.
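To see why equality constraints always produce disjoint subsets, note that base R's split function performs exactly this kind of partitioning; a small sketch:

```r
# split() partitions rows by the conjunction of equality constraints
# (here a single constraint, Group == g). Every row lands in exactly
# one piece, so the pieces are disjoint.
df <- data.frame(Group = c('a', 'a', 'b'), Value = c(1, 3, 5))
pieces <- split(df, df$Group)
group.means <- sapply(pieces, function(piece) mean(piece$Value))
print(group.means)  # a: 2, b: 5
```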

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds

Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable 1 = Value 1 AND Variable 2 <= Value 2. If the upper bound is placed on the Time variable, these constraints allow us to compute the forward-running mean we described earlier.
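A hand-rolled sketch of upper-bound splitting in base R (not cumplyr's actual implementation): build one subset per observed Time value t, containing every row with Time <= t. The subsets are nested, hence non-disjoint, and their means reproduce the forward-running mean:

```r
# Upper-bound splitting: for each observed Time value t, take the
# subset of rows with Time <= t, then apply mean within each subset.
df <- data.frame(Time = 1:3, Value = c(1, 3, 5))
subsets <- lapply(unique(df$Time), function(t) df[df$Time <= t, ])
forward.mean <- sapply(subsets, function(s) mean(s$Value))
print(forward.mean)  # 1 2 3
```

Swapping the comparison to `df$Time >= t` gives the lower-bound splitting of the next section.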

Using Lower Bounds

Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable 1 >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-running mean described earlier.

Using Norm Balls

Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like Variable 1 = Value 1 and then loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, this system gives the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose tightness is r. This lets us estimate the local means from our third example.
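The norm-ball relaxation can be sketched the same way (again a base-R illustration, not cumplyr's internals): one subset per center value t, containing the rows with |Time - t| <= r. With r = 1 this reproduces the local means from the third example; with r = 0 it collapses back to equality-constrained, disjoint splitting:

```r
# Norm-ball splitting: for each center t, keep rows whose Time lies
# within distance r of t, then apply mean within each ball.
df <- data.frame(Time = 1:3, Value = c(1, 3, 5))
r <- 1
subsets <- lapply(unique(df$Time), function(t) df[abs(df$Time - t) <= r, ])
ball.mean <- sapply(subsets, function(s) mean(s$Value))
print(ball.mean)  # 2 3 4
```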

Implementation

To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:

library('cumplyr')

data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))

iddply(data,
       equality.variables = c('Time'),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c('Time'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 1),
       func = function (df) {with(df, mean(Value))})

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 2),
       func = function (df) {with(df, mean(Value))})

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 5),
       func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr

In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise any users of cumplyr to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!

12 responses to “cumplyr: Extending the plyr Package to Handle Cross-Dependencies”

1. Very cool.

I think you can use vectors instead of lists in the norm.ball: c('Time' = 1, 'Cats' = 7) should work fine.

What are the advantages of using four parameters instead of some sort of DSL?

iddply(data, ~ eq(Time, 0) + gt(Space, 1) + lt(Cats, 5) + norm(Dogs, 3), func…)

-Harlan

2. Oh, right.

iddply(data, ~ eq(Time) + gt(Space) + lt(Cats) + norm(Dogs, 3), func…)

This said, I’m not sure what the above would mean! Intersection? What does your existing code do if you give it:

iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c('Space'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

3. Nice. I’ve wanted this on several occasions. Would love to have a similar UDAF for Hive.

4. Thanks, Dean! When I settle upon the best algorithm for performing this task in R, I’d be happy to advise someone who wanted to create this sort of UDAF for Hive.

5. This is really interesting. I love plyr but have run across this issue quite a few times and usually just mangle it using ts() objects and for loops. As I’m sure you are aware, there is some built in functionality for this type of thing if you are working with a time series (ts) object. However, I don’t like ts() objects and prefer to keep things in nice neat data.frames with proper col and row names that are preserved after I split-apply-combine. I think you are really tapping into a need here.

6. Thanks, Frank! I did not realize that ts() objects provided this sort of functionality. I agree with you that it would be better in the long term to provide a mechanism for handling these problems without using ts() objects, so I hope to make progress getting code for this approach that runs acceptably fast and produces accurate results. Glad that you think there's a real need for this.

7. Very nice. Perhaps data.table could be extended in a similar way. It already has a feature where the groups are determined by ‘i’ rather than by ‘by’. These groups may be overlapping. As you may know, data.table doesn’t split-apply-combine because that can be slow for larger datasets. So a side effect is that it already handles overlapping groups, efficiently. Upper and lower bound queries may be suited to data.table’s sorted keys too; e.g. it has ‘mult’ and ‘roll’ arguments. It reminds me of feature request #203 (from 2008) but we didn’t have a really good application for it, until now perhaps. That one was to allow each i column to be a set of (lower,upper) ranges. Let me know if you’d like to collaborate on something. (I should state I’m a co-author of data.table).

8. Hi Matthew,

I’d love to talk with you about this more. Earlier this morning I was trying to use data.table to handle the subsetting operations because they’re very slow on even small datasets, but I’m not yet familiar enough with data.table to get it to work. It would be great to have input from you about those details, as well as someone to discuss theoretical questions with about how a system like cumplyr should work. For example, if an item in the theoretical Cartesian product between the columns satisfies the inequality constraints, but does not occur in the empirical Cartesian product (defined by equalities), should that row exist in the output or not? I hadn’t even considered that question until this morning, but it will come up again and again in practice — and both approaches seem to have defensible use cases.

It would be nice to agree upon a proper set of functionality and a core algorithm for providing that functionality efficiently. With that in place, it should be easy to add this idea as a feature to data.table. I’ll e-mail you to follow up.

9. Thanks Harlan! I'll try out vectors and see if that's cleaner. I didn't know how to use R's formula notation well enough to use it, but I think a DSL would be cleaner in the long run. Right now I'm stumbling upon basic bugs (like my code failing when the variables aren't numeric), but I'll put the use of formulas on my TODO.