The Social Dynamics of the R Core Team

Recently a few members of R Core have indicated that part of what slows down the development of R as a language is that it has become increasingly difficult over the years to achieve consensus among the core developers. Inspired by these claims, I decided to look into the issue quantitatively by counting the commits to R’s SVN repository made by each of the R Core developers. I wanted to know whether a small group of developers was overwhelmingly responsible for changes to R or whether all of the members of R Core had contributed equally. To follow along with what I did, you can grab the data and analysis scripts from GitHub.

First, I downloaded the R Core team’s SVN logs from http://developer.r-project.org/. I then used a simple regex to parse the SVN logs to count commits coming from each core committer.
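The parsing step can be sketched roughly as follows. This is a minimal Python illustration rather than the script from the GitHub repository: it assumes the logs use the standard `svn log` header format (`rNNN | author | date | N lines`), and the sample log text, revision numbers and commit messages are invented for the example.

```python
import re
from collections import Counter

# Header line of a standard `svn log` entry (assumed format), e.g.
# r60001 | ripley | 2012-08-01 07:15:02 +0000 (Wed, 01 Aug 2012) | 1 line
HEADER = re.compile(r"^r(\d+) \| ([^|]+?) \| (\d{4})-\d{2}-\d{2} ")

def count_commits(log_text):
    """Return a Counter mapping (year, committer) to a number of commits."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            _, committer, year = match.groups()
            counts[(int(year), committer)] += 1
    return counts

# Made-up log excerpt, for illustration only.
sample_log = """\
------------------------------------------------------------------------
r60001 | ripley | 2012-08-01 07:15:02 +0000 (Wed, 01 Aug 2012) | 1 line

tweak
------------------------------------------------------------------------
r60002 | maechler | 2012-08-01 09:30:11 +0000 (Wed, 01 Aug 2012) | 2 lines

fix typo
more
------------------------------------------------------------------------
r60003 | ripley | 2012-08-02 06:01:45 +0000 (Thu, 02 Aug 2012) | 1 line

update NEWS
"""

counts = count_commits(sample_log)
```

Keying the counter on (year, committer) means the same pass can feed both a pooled table and per-year breakdowns.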

After that, I tabulated the number of commits from each developer, pooling across the years 2003-2012 for which I had logs. You can see the results below, sorted by total commits in decreasing order:

Committer   Total Number of Commits
ripley      22730
maechler    3605
hornik      3602
murdoch     1978
pd          1781
apache      658
jmc         599
luke        576
urbaneks    414
iacus       382
murrell     324
leisch      274
tlumley     153
rgentlem    141
root        87
duncan      81
bates       76
falcon      45
deepayan    40
plummer     28
ligges      24
martyn      20
ihaka       14
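As a rough check on how concentrated these totals are, the share of the top committer can be computed directly from the table. The Python snippet below is only a sketch using the pooled numbers listed above; it is not part of the original analysis scripts.

```python
# Pooled commit totals for 2003-2012, copied from the table above.
commits = {
    "ripley": 22730, "maechler": 3605, "hornik": 3602, "murdoch": 1978,
    "pd": 1781, "apache": 658, "jmc": 599, "luke": 576, "urbaneks": 414,
    "iacus": 382, "murrell": 324, "leisch": 274, "tlumley": 153,
    "rgentlem": 141, "root": 87, "duncan": 81, "bates": 76, "falcon": 45,
    "deepayan": 40, "plummer": 28, "ligges": 24, "martyn": 20, "ihaka": 14,
}

total = sum(commits.values())            # 37632 commits in all
top_share = commits["ripley"] / total    # roughly 0.60

print(f"total commits: {total}")
print(f"share of top committer: {top_share:.1%}")
```

On these numbers, the top committer accounts for about 60% of all logged commits.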

Next, I looked at how these numbers evolved over the years. First, I plotted the number of commits per developer per year:


[Figure: Commits per developer per year]

And then I visualized the evenness of contributions from different developers by measuring the entropy of the distribution of commits on a yearly basis:


[Figure: Entropy of the commit distribution by year]
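The post does not spell out the entropy formula. A common choice, assumed here, is the Shannon entropy H = -sum_i p_i log2(p_i), where p_i is committer i's share of the commits in a given year. The Python sketch below uses made-up yearly counts purely to show how the measure behaves.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy, in bits, of the distribution implied by raw counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one commit")
    # Zero counts contribute nothing, since p * log2(p) -> 0 as p -> 0.
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical yearly commit counts, made up purely for illustration.
even_year = [25, 25, 25, 25]    # four equally active committers
skewed_year = [97, 1, 1, 1]     # one committer dominates

even_entropy = shannon_entropy(even_year)      # log2(4) = 2 bits
skewed_entropy = shannon_entropy(skewed_year)  # much closer to 0
```

Lower entropy means commits are concentrated in fewer hands; for n committers the maximum, reached when everyone commits equally, is log2(n).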

There seems to be some weak evidence that the community is either finding consensus more difficult and tending towards a single leader who makes final decisions or that some developers are progressively dropping out because of the difficulty of achieving consensus. There is unambiguous evidence that a single developer makes the overwhelming majority of commits to R’s SVN repo.

I leave it to others to understand what all of this means for R and for programming language communities in general.

13 responses to “The Social Dynamics of the R Core Team”

  1. Dirk Eddelbuettel

    Yawn.

    You are missing (in chronological order)
    * Ben Bolker on r-devel in 2007 http://article.gmane.org/gmane.comp.lang.r.devel/13482
    * Simon Jackman on his blog in 2007 http://jackman.stanford.edu/blog/?p=271 as well as http://jackman.stanford.edu/blog/?p=278
    * Me on my blog in 2007 http://dirk.eddelbuettel.com/blog/2007/08/11#ripley_commit_analysis
    * John Fox in an invited keynote at useR as well as in the lead-off article in the R Journal 1/2 in December 2009 http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
    and many more.

  2. Juan Carlos Borrás

    - You can refactor R ad nauseam and keep increasing the commit count while not adding any additional functionality.
    - While R certainly provides very expressive language constructs (beating other languages for scientific/statistical computing), I would rather say that its power comes from the functionality supplied by the thousands of individual packages on CRAN (OK, which build upon the foundations laid by the core system, but still). The name of a certain kiwi comes to my mind…
    - The “benevolent dictator” behavioral pattern seen in other very successful community-driven projects would have (in my humble and subjective opinion) prevented many inconsistencies in parameter naming and ordering across the API provided by some of the R core packages, let alone the slight proliferation of different methods for doing the same thing (e.g. indexing). I wonder whether Burns’ Inferno would be quite such an inferno if the number of people committing code had been smaller, but we are talking about a 30-year-old software project with very few deprecations that I can think of during the last 3 years….

    You fellows are likely to be far more familiar with the specifics of the R core team and code, but from a pure user’s perspective those were my two cents.

  3. asdf
  4. rmw

    It’s an interesting analysis, but I’m not sure it captures the whole picture (not that I have a better way to do so): speaking only as an external observer, I tend to think of R-Core’s work as being relatively segmented across various projects: off the top of my head, (using their svn names)

    maechler: RNGs, Matrix, maintaining the svn server and mailing lists
    hornik: CRAN
    murdoch: Help system, parser, Sweave, and maintenance work on base graphics [rgl outside of base R]
    pd: Classical tests & TclTk Bindings; also release manager?
    jmc: Reference Classes
    luke: Deep and complicated things like the gc, object model, bytecode compiler
    urbaneks: Mac OS X guru & RServer, iplots xtreme, rJava
    iacus: Unknown [to me]
    murrell: Grid graphics
    leisch: Sweave
    tlumley: Survey package
    rgentlem: Bioconductor
    duncan: “Next generation things” & language interop
    bates: Rcpp & Mixed Models packages
    falcon: Unknown [to me]
    deepayan: Lattice
    plummer: Unknown [to me]; outside of R, JAGS
    ligges: CRAN & Windows Support [Also, most recent addition to R-Core]
    martyn: [Isn't this the same as "plummer"?]
    ihaka: Has sworn off R publicly but also co-wrote the damn thing ;-)

    BDR is something of an exception to this division scheme and I’ll come back to that.

    Consequently, for some developers, we simply see a smaller number of svn commits because their main work isn’t currently part of “Base R”. [Notably, the "recommended" packages like survey, splines, and Matrix don't show up in the svn logs] I’m not sure there’s a good way to account for that.

    I think there’s also a matter of svn style: luke’s commits tend to be fully developed fundamental changes (the last one being to allow long vectors and the one before that adding an entire bytecode compiler in one fell swoop) while BDR will make a commit out of each “small” change. BDR’s commits tend to be necessary “grub” work that no one else does (is willing to do?) to the same degree: going through each C function and making it long-vector-ready and Ctrl-C-able seems to be his current work; before that there was an under-the-surface change to the string model which was imperceptible (maybe a 3-4x speed-up on all string ops) and which never made it to the NEWS file. I see BDR’s “area” as doing the work at the head that’s necessary to keep everything in tip-top performing condition (e.g., whenever a nifty trick shows up on R-help, it seems to become an element in a help page within 24 hours) while luke feels pressure to work in small deliberate steps because a mistake in “his” code truly breaks _everything_.

    When you add this to BDR’s omni-presence on R-help, CRAN maintainer work, and care of packages like MASS and RCurl, he still seems to be doing a “disproportionate” share of work but I’m not sure I see it as evidence of the same “social dynamics” that I’d guess prompted this post. It might be there, but this isn’t proof of that: just of work styles.

    Anyways — not sure what my point is: just injecting some skepticism.

    Cheers!

    (And awesome work with DSB on porting MM’s RNG work to Julia; I was skeptical about reifying distributions like that, but it seems to work pretty well. And I can’t wait till y’all get named/default args* and can clean a good deal of the repetition up!)
    * Possibly two different steps? I don’t think named args break the multiple dispatch paradigm (maybe they help it?) — default args seem trickier to wedge in….

    @ asdf: Why those links? Certainly there are much more aggressive Ripley-ings you could have dragged up. (Some of them deservedly directed at me ;-) ) The first one was actually quite civil on my read: asking install details, suggesting to change to a more-up-to-date/working version of the software in question (with link), and suggesting a more helpful mailing list for follow up. The second wasn’t even by BDR.

  5. rmw

    My apologies: Brian Ripley doesn’t maintain RCurl; Duncan Temple Lang does.

  6. Ajay

    excellent analysis for starters. if you tie this up with the email poster created at JSM this time, you get a lot of who made who in R.
    But I would like to add, packages are important too, so an analysis of package authors would be awesome, and if you tie it with the dependencies on those particular packages, you can get a weighted average importance of each package author

    once again- awesome analysis, terrible graph for comparing one outlier with the rest.

  7. Ajay

    and let’s not even mention who all maintain the gloriously beautiful http://www.r-project.com website, the most beautiful website of the undisputed champion of graphical statistical computing.
    :)

  8. Ajay

    i meant the gloriously beautiful http://www.r-project.org website (now that’s a great idea for my blog); the .com domain is squatted by someone, in Japanese

  9. Rory Winston

    I did a similar analysis using svn commit logs way back in 2008:

    http://www.theresearchkitchen.com/archives/219

    I used the Gini coefficient as a proxy for the relative (in)equality of distribution in commit numbers.

    However, one issue to keep in mind is that the number of commits is a fragile metric.

    — Rory

  10. John Myles White

    Hi Rory, it seems like a ton of people have done this same thing in the past. And I completely agree that using SVN logs is a very weak metric.

  11. Michael

    Great analysis!
    Could you show us the entropy formula you used?
    Thanks