A Draft of ProjectTemplate v0.2-1

I’ve just uploaded a new binary of ProjectTemplate to GitHub. This is a draft version of the next release, v0.2-1, which includes some fairly substantial changes and is backwards incompatible in several ways with previous versions of ProjectTemplate.

Foremost among the changes is that most of the logic for load.project() now lives in the function itself, rather than being spread across autogenerated scripts that you can edit by hand. While this makes ProjectTemplate harder for non-experts to modify, it will make future revisions much easier: existing projects will no longer fall behind because of vestigial code that isn’t updated automatically when you install a new version of ProjectTemplate.

Because more system logic is now hardcoded into functions, each project’s configuration is handled through a YAML file in config/global.yaml. Incidentally, this introduces the new directory, config/, where configuration files will go from now on.

The data loading system is also more complex than it was before. First, there’s a new hierarchy of data sources: the system now looks for data in a cache/ directory before moving on to the data/ directory. This lets you permanently store a modified version of your data set in cache/ and skip loading the raw data set entirely. That’s helpful when the original data set is enormous and your future analyses only need a radically reduced form of it.
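As a rough sketch of that lookup order (this is not the actual load.project() code), the idea is to check cache/ first and only fall back to data/ when nothing is cached. The helper name load_with_cache(), the .RData cache format, and the .csv raw format are all assumptions made for illustration.

```r
# Hypothetical helper, not part of ProjectTemplate's API.
# Assumes cached objects live in cache/<name>.RData and raw data in data/<name>.csv.
load_with_cache <- function(name, root = ".") {
  cache.file <- file.path(root, "cache", paste0(name, ".RData"))
  raw.file <- file.path(root, "data", paste0(name, ".csv"))

  # Prefer the cached copy when one exists.
  if (file.exists(cache.file)) {
    env <- new.env()
    load(cache.file, envir = env)
    return(list(source = "cache", value = get(name, envir = env)))
  }

  # Otherwise fall back to the raw file.
  if (file.exists(raw.file)) {
    return(list(source = "data", value = read.csv(raw.file)))
  }

  stop("No data source found for '", name, "'")
}

# Demo in a throwaway project directory.
root <- tempfile("project")
dir.create(file.path(root, "cache"), recursive = TRUE)
dir.create(file.path(root, "data"))
write.csv(data.frame(x = 1:1000), file.path(root, "data", "big.csv"),
          row.names = FALSE)

first <- load_with_cache("big", root)   # nothing cached yet, so data/ is read

# Store a reduced form of the data set in cache/ under the same name.
big <- head(first$value, 10)
save(big, file = file.path(root, "cache", "big.RData"))

second <- load_with_cache("big", root)  # cache/ now wins over data/
```

After the reduced version is saved, the second call never touches the raw file, which is the whole point of putting cache/ ahead of data/ in the hierarchy.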

In addition, preprocessing is now handled through a series of ordered scripts in a munge/ directory rather than just a single preprocessing script in the lib/ directory. There’s also a log/ directory, used by the new integrated log4r support, which is off by default, but can be easily set up after installing log4r from CRAN.
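To illustrate what “ordered scripts” could mean in practice, here is a minimal sketch that runs every R script in a munge/ directory in filename order. The function name run_munge_scripts() and the numeric filename prefixes (01-, 02-) are assumptions for the example, not necessarily how ProjectTemplate does it internally.

```r
# Hypothetical helper, not part of ProjectTemplate's API.
# Runs every .R script in munge.dir in sorted filename order.
run_munge_scripts <- function(munge.dir, envir = globalenv()) {
  scripts <- sort(list.files(munge.dir, pattern = "\\.[rR]$", full.names = TRUE))
  for (script in scripts) {
    sys.source(script, envir = envir)
  }
  invisible(basename(scripts))
}

# Demo: two scripts written out of order, executed in filename order.
munge.dir <- file.path(tempfile("project"), "munge")
dir.create(munge.dir, recursive = TRUE)
writeLines("steps <- c(steps, 'merge')", file.path(munge.dir, "02-merge.R"))
writeLines("steps <- 'clean'", file.path(munge.dir, "01-clean.R"))

env <- new.env()
ran <- run_munge_scripts(munge.dir, envir = env)
```

Because 02-merge.R depends on a variable that 01-clean.R creates, the demo only works if the scripts really do run in order, which is why a naming scheme with sortable prefixes is convenient.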

Finally, there’s a src/ directory where we’re going to encourage users to place their primary analyses, so that the main directory always has the same files and directories across all projects.

In addition to all of these changes, many of which were inspired by conversations with Mike Dewar, I’ve incorporated some very helpful patches in this release. Specifically, Diego Valle-Jones fixed a bug in clean.variable.name() that led to trouble when filenames in the data/ directory began with numbers, and Patrick D. Schalk contributed code that adds support for SQLite to ProjectTemplate along with general improvements to the database access codebase.

Thanks for all of the support since the last release. Please let me know if there are any changes that need to be made before I turn v0.2-1 loose on CRAN.

9 responses to “A Draft of ProjectTemplate v0.2-1”

  1. Chu-Sheng Yang

    # Add one more extension option in dispatch table
    extensions.dispatch.table <- list(…, "\\.xlsx$" = XLSXReader)

    # .xlsx: Excel 2007 XLSX files read by xlsx package.
    # Another option is RExcelXML package for big files (not implemented)
    XLSXReader <- function(data.file, filename, workbook.name)
    {
      require(xlsx)
      wb <- loadWorkbook(filename)
      sheets <- getSheets(wb)

      # Create one variable per sheet, named "<workbook>.<sheet>"
      for (sheet.name in names(sheets)) {
        variable.name <- paste(workbook.name, clean.variable.name(sheet.name), sep = ".")
        assign(variable.name,
               read.xlsx(filename, sheetName = sheet.name, header = TRUE),
               envir = .GlobalEnv)
      }
    }

  2. Chu-Sheng Yang

    I added one more reader for myself since storing data in xlsx files is also common. It will create “xlsxFileName.sheetName” objects in R session. Please feel free to make any changes you would like.

  3. Chu-Sheng Yang

    BTW, here are some of my personal preferences you might be interested in:

    (1) Prefer loading packages before utilities, because I have some user-defined utility scripts which require certain packages to be loaded
    (2) utilities folder: my utility scripts are too many to put into a single file. I’m used to dividing them by functionality
    (3) data.files.filter: most of the time, I don’t wanna load all data but I would like to keep them in the same folder. I plan to add a data.files.filter in global.yaml like

    data.files.filter: "grep(format(prevBizday(), '%y%m%d'), dir('data'), value = TRUE)"
    then, use parse(text=data.files.filter) within load.project()

    For now, I’ve modified your source code to create a customized template, and I sincerely appreciate your great work, which saves us a lot of time and helps us maintain a more disciplined project.

  4. Jan Sunde

    Any Windows 64-bit support yet?
    I’d love to use it, but last time I checked, it was broken (yaml was the culprit, I think).
    I haven’t found any mention of it by anybody else, though, and haven’t tried compiling my own yaml package,
    so I’m wondering whether it’s just my setup? (R 2.12.0 64-bit on Windows 7 64-bit, 4 GB RAM).

    Unfortunately not using my work computer at the moment, so haven’t had the chance to test it yet.

  5. Chu-Sheng Yang

    Yeah, doesn’t work for me either on 64-bit R 2.12.0. Mine is Windows XP. I remember yaml being the cause. I’ve simply gone back to 32-bit R for now.

  6. Jan Sunde

    John,
    I hope the yaml issue somehow gets resolved.
    R on Win 64 is probably gonna take off once all the major packages are ported
    (Rattle is another one that unfortunately is broken)
    Not quite sure why yaml is broken though – CRAN doesn’t give any indication – hopefully it’s a quick fix!
    For now, I guess I just have to keep a double installation of R to keep Rattle and check out ProjectTemplate :)
    (or buy a Mac…)

  7. Michael Schneider

    I really like the changes you’ve made here and recently installed the latest version. However, I just called ‘load.project’ in an older project directory only to be reminded that ‘config/global.yaml’ is not present. It’s easy enough to copy the new directories into my old project, but you might consider adding a function to update old projects to make it quick and transparent.

    Also, I have a number of scripts in src/ that I usually want loaded at the same time as I run ‘load.project()’. That is, they’re functions that I’m actively developing, but the next step after calling load.project() is usually to load these so I can continue development. I previously had these source commands in lib/load_libraries.R (as a bit of a kluge). Did you have a plan for auto-loading stuff in src/ or is that contrary to your intended usage?