Introduction
As many people already know, I’ve recently uploaded a new R package called ProjectTemplate to GitHub and CRAN. The ProjectTemplate package provides a function, create.project(), that automatically builds a directory for a new R project with a clean sub-directory structure and automatic data and library loading tools. My hope is that standardized data loading, automatic importing of best practice packages, integrated unit testing and useful nudges towards keeping a cleanly organized codebase will improve the quality of R coding.
My inspiration for this approach comes from the rails command from Ruby on Rails, which initializes a new Rails project with the proper skeletal structure automatically. Also taken from Rails is ProjectTemplate’s approach of preferring convention over configuration: the automatic data and library loading as well as the automatic testing work out of the box because assumptions are made about the directory structure and naming conventions that will be used in your code. You can customize your codebase however you’d like, but you will have to edit the automation scripts to use your conventions instead of the defaults before you’ll get their benefits again.
In what follows, I try to highlight the state of the package as of today.
Installing
ProjectTemplate is available on CRAN and can be installed using a simple call to install.packages():
    install.packages('ProjectTemplate')
If you would like access to changes that are not available in the current version on CRAN, please download the contents of the GitHub repository and then run,
    R CMD build .
    R CMD INSTALL ProjectTemplate_*.tar.gz
Example Code
To create a new project called my-project, open R and type:
    library('ProjectTemplate')
    create.project('my-project')
To enter that project’s home directory and start working, type:
    setwd('my-project')
    load.project()
Once you have code worth testing, you can also type,
    run.tests()
to automatically run all of the unit tests in your tests directory.
If you’re interested in these last two functions, you should know that load.project() is essentially a mnemonic for calling source('lib/boot.R'), which automatically loads all of your libraries and data sets. Similarly, run.tests() is essentially a mnemonic for calling source('lib/run_tests.R'), which automatically runs all of the ‘testthat’ style unit tests contained in your tests directory.
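If you haven’t seen a ‘testthat’ style test before, here is a rough sketch of what a file in your tests directory might contain. The file name and the standardize() helper are hypothetical and chosen only for illustration; they are not part of ProjectTemplate.

```r
# Hypothetical contents of tests/test_utilities.R; the file name and the
# function under test are illustrative assumptions.
library('testthat')

# A small helper of the kind that might live in lib/utilities.R.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# A 'testthat' style test: a description plus a block of expectations.
test_that('standardize() centers and scales a numeric vector', {
  z <- standardize(c(1, 2, 3, 4, 5))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
})
```

In a real project the helper would be defined elsewhere and the test file would hold only the test_that() blocks.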
Overview
As far as ProjectTemplate is concerned, a good project should look like the following:
- project/
  - data/
  - diagnostics/
  - doc/
  - graphs/
  - lib/
    - boot.R
    - load_data.R
    - load_libraries.R
    - preprocess_data.R
    - run_tests.R
    - utilities.R
  - profiling/
  - reports/
  - tests/
  - README
  - TODO
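To make the layout concrete, here is a minimal sketch that builds the same skeleton by hand in a temporary directory. The create_skeleton() helper is hypothetical and only creates empty placeholders; it is not how create.project() is implemented.

```r
# Sketch only: builds a layout like the one listed above inside a
# temporary directory. create_skeleton() is a hypothetical helper, not
# the package's own create.project().
create_skeleton <- function(project_path) {
  directories <- c('data', 'diagnostics', 'doc', 'graphs', 'lib',
                   'profiling', 'reports', 'tests')
  lib_scripts <- c('boot.R', 'load_data.R', 'load_libraries.R',
                   'preprocess_data.R', 'run_tests.R', 'utilities.R')

  # Create the project root and each sub-directory.
  dir.create(project_path, recursive = TRUE, showWarnings = FALSE)
  for (directory in directories) {
    dir.create(file.path(project_path, directory), showWarnings = FALSE)
  }

  # Create empty placeholder files in lib/ and at the top level.
  file.create(file.path(project_path, 'lib', lib_scripts))
  file.create(file.path(project_path, c('README', 'TODO')))

  invisible(project_path)
}

project_path <- file.path(tempdir(), 'my-project')
create_skeleton(project_path)

# Lists the placeholder files (empty directories are not shown).
list.files(project_path, recursive = TRUE)
```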
To do work on such a project, enter the main directory, open R and type source('lib/boot.R'). This will then automatically perform the following actions:
- source('lib/load_libraries.R'), which automatically loads the CRAN packages currently deemed best practices. At present, this list includes:
  - reshape
  - plyr
  - stringr
  - ggplot2
  - testthat
- source('lib/load_data.R'), which automatically imports any CSV or TSV data files inside of the data/ directory.
- source('lib/preprocess_data.R'), which allows you to make any run-time modifications to your data sets automatically. This is blank by default.
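The data loading step can be sketched roughly as follows. This is not the actual lib/load_data.R: the load_data_sketch() function, its arguments, and the rule that each file becomes a variable named after the file are all assumptions made for illustration.

```r
# Sketch only: a hypothetical loader, not the actual lib/load_data.R.
load_data_sketch <- function(data_dir = 'data', envir = globalenv()) {
  # Find every file ending in .csv or .tsv.
  data_files <- list.files(data_dir, pattern = '\\.(csv|tsv)$',
                           ignore.case = TRUE)
  for (data_file in data_files) {
    file_path <- file.path(data_dir, data_file)
    # Assumed naming convention: data/scores.csv becomes `scores`.
    variable_name <- make.names(tools::file_path_sans_ext(data_file))
    if (grepl('\\.csv$', data_file, ignore.case = TRUE)) {
      data_set <- read.csv(file_path, header = TRUE)
    } else {
      data_set <- read.delim(file_path, header = TRUE)
    }
    assign(variable_name, data_set, envir = envir)
  }
  invisible(data_files)
}

# Usage: write two small files to a temporary data/ directory and load them.
data_dir <- file.path(tempdir(), 'data')
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)),
          file.path(data_dir, 'scores.csv'), row.names = FALSE)
write.table(data.frame(id = 1:3, group = c('a', 'b', 'a')),
            file.path(data_dir, 'groups.tsv'), sep = '\t',
            row.names = FALSE, quote = FALSE)

# Load into a separate environment to avoid cluttering the workspace.
loaded <- new.env()
load_data_sketch(data_dir, envir = loaded)
ls(loaded)
```

Relying on the file name for the variable name is the convention-over-configuration idea in miniature: nothing needs to be declared as long as the files sit in data/ with a recognized extension.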
Default Project Layout
Within your project directory, ProjectTemplate creates the following directories and files whose purpose is explained below:
- data/: Store your raw data files here. If they are CSV or TSV files, they will automatically be loaded when you call load.project() or source('lib/boot.R'), for which load.project() is essentially a mnemonic.
- diagnostics/: Store any scripts you use to diagnose your data sets for corruption or problematic data points. You should also put code that globally censors any data points here.
- doc/: Store documentation for your analysis here.
- graphs/: Store any graphs that you produce here.
- lib/: Store here any files that provide useful functionality for your work but do not constitute a statistical analysis per se.
- lib/boot.R: This script handles automatically loading the other files in lib/. Calling load.project() automatically loads this file.
- lib/load_data.R: This script handles the automatic loading of any CSV and TSV files contained in data/.
- lib/load_libraries.R: This script handles the automatic loading of the best practice packages, which are reshape, plyr, stringr, ggplot2 and testthat.
- lib/preprocess_data.R: This script handles the preprocessing of your data, if you need to add columns at run-time or merge normalized data sets.
- lib/run_tests.R: This script automatically runs any test files contained in the tests/ directory using the 'testthat' package. Calling run.tests() automatically runs this script.
- lib/utilities.R: This script should contain quick general purpose code that belongs in a package, but hasn't been packaged up yet.
- profiling/: Store any scripts you use to benchmark and time your code here.
- reports/: Store any output reports, such as HTML or LaTeX versions of tables, here. Sweave documents should also go here.
- tests/: Store any test cases in this directory. Your test files should use 'testthat' style tests.
- README: Write notes to help orient newcomers to your project.
- TODO: Write a list of future improvements and bug fixes you have planned.
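As an illustration of the kind of work lib/preprocess_data.R is meant for, here is a minimal sketch that merges two normalized data sets and adds a column at run-time. The data frames and column names are invented for the example; in a real project they would already have been loaded from data/.

```r
# Sketch only: hypothetical contents for lib/preprocess_data.R. The data
# frames and column names are invented for illustration.
participants <- data.frame(participant_id = c(1, 2, 3),
                           age = c(23, 35, 41))
responses <- data.frame(participant_id = c(1, 1, 2, 3),
                        reaction_time_ms = c(512, 498, 640, 587))

# Merge the normalized tables into one analysis-ready data frame.
analysis_data <- merge(responses, participants, by = 'participant_id')

# Add a derived column at run-time.
analysis_data$log_reaction_time <- log(analysis_data$reaction_time_ms)
```

Keeping steps like these in one script means every analysis file starts from the same prepared data rather than repeating the merge.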
Request for Comments
I would love to hear feedback about things that ProjectTemplate is missing or should do differently. Please leave any and all comments you have.
Hi John,
Thank you very much for the post. Very interesting indeed.
I have not tested your package, so some of my questions below might have obvious answers (sorry if so), but I was wondering how ProjectTemplate fits with the standard R package structure.
For example, is there any explicit relation between your doc and reports directory and the man and inst/doc directories in a package, or your lib/load*R files and the dependencies in the DESCRIPTION file? Is it possible to automatically create a compliant R package from a ProjectTemplate project?
Also, in general, how does a ProjectTemplate project compare to a more classical package structure for development?
Thank you in advance.
Laurent
Hi Laurent,
Thanks for your comment. I think the existing documentation may not answer your question so well.
ProjectTemplate is not meant to be a template for building R packages. The package.skeleton() function already does that pretty well.
As such, there’s no explicit relationship between any of the directories or files automatically created and those of an R package, though there might be some loose similarities.
Instead, ProjectTemplate is meant to be used for a new statistical analysis. For an example of its usage (with empty directories unfortunately being truncated by git), see my most recent public analysis project of the CRAN interrelation graph on GitHub:
http://github.com/johnmyleswhite/cran_analysis/tree/
I hope that helps.
I think it’s a great idea.
Over the years I’ve evolved my own directory structure for R analysis projects (discussed here: http://bit.ly/cYNe3T). However, it’s always been a bit ad hoc.
It would be great if someone, like yourself, could really think through the issues involved in a good directory structure.
In addition to giving nudges to best practice (e.g., testing, documentation, etc.), a degree of standardisation in itself would be useful when reading other people’s code.
I suppose the challenge will be to develop a directory structure that is a reasonable match to people’s workflows.
Thanks for the input, Jeremy! I’d never seen that piece on SO before: it’s really helpful for thinking through new directions to go.
There’s now a Google Group for ProjectTemplate, towards which I’d like to funnel future discussions: it’s at,
http://groups.google.com/group/projecttemplate
Hi John,
Thank you for your post. I’m a regular user of R and, to a lesser degree, a few other programs/languages, such as JAGS. However, I was never trained as a programmer, so I don’t really know about good programming practices. I’m assuming that your new package is a tool to automate good practices, improving reproducibility and the sharing of code across co-authors or project members. If so, I think it would be *extremely* useful for people like me to have access to some kind of (brief) introduction to the benefits of this type of file organization and also how to get the most out of it. Again, thanks for your work, Antonio.
Hi Antonio,
Thanks for the input. I’ll write up an intro to the virtues of default directory structures in the near future and post it here as well as on the ProjectTemplate mailing list.
I don’t know why but I can’t install your package :(
> install.packages('ProjectTemplate')
--- Please select a CRAN mirror for use in this session ---
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
package ‘ProjectTemplate’ is not available
>
Hi Antonio,
Can you try a few different mirrors? I know that other people have been able to install ProjectTemplate through CRAN, so it’s possible that you just have a slower mirror that hasn’t picked it up yet. If not, you can grab the pre-built package from GitHub that I’ve started to leave there. And, as always, you can download it directly from the CRAN website and run R CMD INSTALL ProjectTemplate*.
Thanks. I was able to install it, but not from any of the mirrors I tried. I also installed the dependencies manually. Best, Antonio.
But it is working nicely!!!
Thanks for this. Other than Josh’s LCFD (load.R, clean.R, func.R, do.R), there’s little guidance available today on how to organize your R code. I love this organization (familiarity with Ruby on Rails) and plan to use it. cheers!
Hi,
I think this is all awesome. I’m reaching the point where I know how I should be doing things, but I needed something like this to kick me in the right direction. One concern that I have is how to fit more “data mining” type scenarios into this framework. Specifically, what’s the right way to use this kind of setup, or what changes need to be made, when the raw data either cannot or cannot conveniently (>50% RAM) be loaded into memory before processing? Should the data be “loaded” with something like the bigdata package and then processed later? Right now I just process each file as I load it and extract the statistics I’m interested in. In this case I’m fine taking the performance hit of putting everything in one bigtabular or whatever and then processing it, but is that really the right thing to do? It seems like it’s worth thinking about a way to adapt this to allow these steps to be run in a sort of streaming way. In other words, as you read in data, run the preprocessing on the new data, rather than load all data, stop, preprocess all data.
I realized I had another question. I often use the same data in multiple different analyses. The way you described things, it seems like I’d either be expected to copy the data or create symbolic links to the data. Is there an alternative? Like a project-defined constant containing the path to the data dir? If not, that’d be nice.
Hi Jamie,
Both of those questions still need to be answered definitively for a future draft of ProjectTemplate. If you’d like to e-mail me with suggestions, that would be great. For now, I’m not quite sure how to process data sets that are too large to be stored in memory. To be honest, I’m not really sure R is the best tool for those sorts of projects. That said, we can — and should — add more configuration options to allow people to work around some of the data loading and preprocessing steps. Try the current draft of ProjectTemplate and see how the existing options work out for you.
As for your second point, adding the option you suggested seems like a good idea. I’ll work on it soon.
I have a little nitpicky comment: I would call the “graphs” directory “figures” instead. A lot of images that get produced aren’t really graphs, per se.