ProjectTemplate Version 0.1-3 Released

I’ve just released the newest version of ProjectTemplate. The primary change is a completely redesigned mechanism for automatically loading data. ProjectTemplate can now read compressed CSV files, access CSV data files over HTTP, read Stata, SPSS and RData binary files, and even load MySQL database tables automatically. For my own projects, this is a big step forward. To access the more esoteric data sources, like remote datasets and MySQL databases, you only need to provide a YAML file that specifies a few details about the data source. Hopefully the approach I’ve taken works for a wide range of problems.

If you’re interested in data available over HTTP, a sample configuration file, called a.url, is shown below:

url: "http://www.johnmyleswhite.com/ProjectTemplate/sample_data.csv"
separator: ","
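
To give a rough idea of what a loader might do with a file like this, here is a minimal sketch. It assumes the yaml package is installed, and parse_url_config is a hypothetical helper written for this example; it is not necessarily how load_data.R does it.

```r
# Sketch only: assumes the 'yaml' package is installed.
library(yaml)

# Hypothetical helper (not part of ProjectTemplate): parse the text of a
# .url file and check that both expected keys are present.
parse_url_config <- function(text) {
  config <- yaml.load(text)
  stopifnot(!is.null(config$url), !is.null(config$separator))
  config
}

config <- parse_url_config(paste(
  'url: "http://www.johnmyleswhite.com/ProjectTemplate/sample_data.csv"',
  'separator: ","',
  sep = "\n"
))

# With network access, the remote file could then be read like this:
# sample_data <- read.csv(config$url, sep = config$separator)
```

The download itself is left commented out so that the sketch runs without a network connection.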

And for those interested in accessing data from MySQL, a sample configuration file, called b.sql, is shown below:

type: mysql
user: sample_user
password: sample_password
host: localhost
dbname: sample_database
table: sample_table
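
As a rough illustration of how these fields could be turned into a data frame, here is a sketch that assumes the yaml, DBI and RMySQL packages are installed. The read_mysql_table function is a hypothetical name made up for this example, and the real loader may work differently.

```r
# Sketch only: assumes the 'yaml', 'DBI' and 'RMySQL' packages are installed.
library(yaml)

# Hypothetical helper (not part of ProjectTemplate): connect using the
# fields from a parsed .sql file and read the named table.
read_mysql_table <- function(config) {
  stopifnot(identical(config$type, "mysql"))
  connection <- DBI::dbConnect(
    RMySQL::MySQL(),
    user = config$user,
    password = config$password,
    host = config$host,
    dbname = config$dbname
  )
  # Close the connection even if reading the table fails.
  on.exit(DBI::dbDisconnect(connection))
  DBI::dbReadTable(connection, config$table)
}

config <- yaml.load(paste(
  c("type: mysql",
    "user: sample_user",
    "password: sample_password",
    "host: localhost",
    "dbname: sample_database",
    "table: sample_table"),
  collapse = "\n"
))

# With a running MySQL server that has this user, database and table:
# sample_table <- read_mysql_table(config)
```

The call is commented out because it needs a live database; the parsing step runs on its own.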

My inspiration for these changes came from two people that I’d like to thank: Diego Valle-Jones and David Edgar Liebke. A month ago, Diego submitted a patch for load_data.R that added RData and compressed CSV file type support. At that time, I started thinking about how to make a more extensible data loader, but wasn’t able to return to the topic until this week.

Last night, while I was reading David’s very helpful tutorial on Incanter, I realized that ProjectTemplate could automate many more types of data loading. I hope that I’ve made load_data.R capable of at least some of the magic that Incanter’s get-dataset does.

The full list of file types that are now supported is shown below:

  • .csv: CSV files that use a comma separator.
  • .csv.bz2: CSV files that use a comma separator and are compressed using bzip2.
  • .csv.zip: CSV files that use a comma separator and are compressed using zip.
  • .csv.gz: CSV files that use a comma separator and are compressed using gzip.
  • .tsv: CSV files that use a tab separator.
  • .tsv.bz2: CSV files that use a tab separator and are compressed using bzip2.
  • .tsv.zip: CSV files that use a tab separator and are compressed using zip.
  • .tsv.gz: CSV files that use a tab separator and are compressed using gzip.
  • .wsv: CSV files that use an arbitrary whitespace separator.
  • .wsv.bz2: CSV files that use an arbitrary whitespace separator and are compressed using bzip2.
  • .wsv.zip: CSV files that use an arbitrary whitespace separator and are compressed using zip.
  • .wsv.gz: CSV files that use an arbitrary whitespace separator and are compressed using gzip.
  • .RData: .RData binary files produced by save().
  • .rda: .RData binary files produced by save().
  • .url: A YAML file that contains an HTTP URL and a separator specification for a remote dataset.
  • .sql: A YAML file that contains database connection information for a MySQL database.
  • .sav: Binary file format generated by SPSS.
  • .dta: Binary file format generated by Stata.
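
For the delimited formats in the list above, the extension alone is enough to pick a separator and a decompression method. The sketch below shows one way to do that dispatch in base R. The describe_file and read_delimited functions are hypothetical names for this example, not the functions in load_data.R, and the sketch assumes every file has a header row.

```r
# Hypothetical helper: derive the separator and compression type from a
# file name ending in .csv, .tsv or .wsv, optionally followed by
# .bz2, .zip or .gz.
describe_file <- function(filename) {
  pattern <- "\\.(csv|tsv|wsv)(\\.(bz2|zip|gz))?$"
  match <- regmatches(filename, regexec(pattern, filename))[[1]]
  if (length(match) == 0) {
    stop("Unsupported file type: ", filename)
  }
  # An empty separator tells read.table to split on any whitespace.
  separator <- switch(match[2], csv = ",", tsv = "\t", wsv = "")
  compression <- if (nzchar(match[4])) match[4] else "none"
  list(separator = separator, compression = compression)
}

# Hypothetical helper: read a delimited file, decompressing if needed.
read_delimited <- function(filename) {
  info <- describe_file(filename)
  source <- switch(info$compression,
    gz = gzfile(filename),
    bz2 = bzfile(filename),
    # unz() needs the name of the file inside the archive, which this
    # sketch does not try to guess.
    zip = stop("zip archives are not handled in this sketch; see unz()"),
    none = filename
  )
  # read.table opens and closes an unopened connection itself.
  read.table(source, header = TRUE, sep = info$separator)
}

# Usage: write a small gzip-compressed TSV file and read it back.
tmp <- tempfile(fileext = ".tsv.gz")
con <- gzfile(tmp, "w")
writeLines(c("x\ty", "1\t2", "3\t4"), con)
close(con)
sample_data <- read_delimited(tmp)
```

Keeping the extension parsing separate from the reading step means a new separator or compression type only needs one extra entry in the matching switch.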

The other major change to ProjectTemplate in this release is that far fewer packages are now loaded, or even installed, by default. I am not sure whether this is the ideal practice moving forward, but it was explicitly requested by a user. I’ve decided to see how the change is received by other users before making a final design decision. If you have strong views for or against this change, please speak up here or on the Google Groups mailing list.