3 Getting data in via ProjectTemplate

First, here is a draft of the R header chunk that I will want to include, silently, in every post that uses R:

knitr::opts_chunk$set(comment = '', cache = TRUE)
knitr::opts_knit$set(root.dir = "..")

I’m sure I’ll come up with some more.

Now, on to loading the Arabidopsis data using ProjectTemplate. I started with the Ler population data, which is in multiple xlsx files, each with one sheet of data, one of metadata, and potentially some blank sheets as well. If I follow the standard protocol for ProjectTemplate and drop them into the data directory, then when I execute load.project() I get a data object for every sheet of every file (because there is no way to pass options to read.xlsx()). In addition, once they are loaded as separate data objects, it is hard to access them in a loop to build the combined datasets. Finally, the workspace is cluttered with all of these objects; if I delete them as part of the data munging, they will get reloaded the next time I use load.project(), which is time consuming!
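To illustrate the awkwardness, here is a rough sketch of what combining auto-loaded objects would involve. The object names (`gen1.Data`, `gen1.Metadata`, and so on) and the environment `ws` are hypothetical stand-ins; I have not checked what names load.project() would actually generate for these sheets.

```r
# Hypothetical stand-ins for the objects an automatic load might create,
# one per sheet per file; the real auto-generated names may differ.
ws <- new.env()
assign("gen1.Data", data.frame(Generation = 1, Seedlings = c(132, 21)), envir = ws)
assign("gen2.Data", data.frame(Generation = 2, Seedlings = c(74, 72)), envir = ws)
assign("gen1.Metadata", data.frame(note = "not wanted"), envir = ws)

# The data sheets can only be reached by matching their names as strings
data_names <- ls(ws, pattern = "\\.Data$")
combined <- do.call(rbind, mget(data_names, envir = ws))

# The originals still have to be removed by hand afterwards
rm(list = ls(ws), envir = ws)
```

Everything hinges on string matching against object names, which is fragile compared with reading the files directly.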

So instead I took the data files out of the project directory, and used the option to have a .R file in the data directory. Using this, I can re-use my existing code to read the files in from the Dropbox directory and manipulate them as I want, deleting the intermediate objects and keeping just the final objects. Furthermore, if I then cache the objects, subsequent loads are fast!
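ProjectTemplate does the caching itself (via its cache() function), so none of the following is needed in the project. It is only a base-R sketch of the build-once-then-reload idea; `slow_build`, `load_cached` and the cache file location are hypothetical names made up for the illustration.

```r
# Hypothetical stand-in for the slow step (reading many xlsx files)
slow_build <- function() {
  data.frame(Pot = c(0, 1, 1), Seedlings = c(132, 5, 6))
}

# Hypothetical cache location; ProjectTemplate uses its own cache directory
cache_file <- file.path(tempdir(), "popLer_cm.rds")

load_cached <- function(path, build) {
  if (file.exists(path)) {
    readRDS(path)        # fast path: reuse the stored object
  } else {
    obj <- build()       # slow path: rebuild from the raw files
    saveRDS(obj, path)   # store it for next time
    obj
  }
}

first  <- load_cached(cache_file, slow_build)  # builds and stores (if no cache yet)
second <- load_cached(cache_file, slow_build)  # reads the stored copy
```

Deleting the cache file forces a rebuild on the next call, which is what makes it easy to pick up changes in the raw data.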

So here’s the file:

cat(readLines('data/popLer.R'), sep = '\n')
### Creates the data objects popLer and popLer_cm, representing the Ler populations
###     experiments

# Raw data consists of one file per generation
# Final column is named inconsistently, so needs to be corrected before merge

data_dir <- "~/Documents/Dropbox/Arabidopsis/Data/Exp1"
data_fname <- list.files(data_dir, "seedposition")

popLer_cm <- NULL
for (i in seq_along(data_fname)) {
  tempdata <- xlsx::read.xlsx(file.path(data_dir, data_fname[i]), sheetName = "Data", 
                              header = TRUE)
  names(tempdata)[9] <- "seedlings"
  popLer_cm <- rbind(popLer_cm, tempdata)
  rm(tempdata)
}

# Clean up column names and get a useful order
popLer_cm <- popLer_cm[-1]
names(popLer_cm) <- c("ID", "Treatment", "Rep", "Gap", "Generation", "Pot", "Distance",
                      "Seedlings")
ord <- with(popLer_cm, order(Treatment, Gap, Rep, Generation, Pot, Distance))
popLer_cm <- popLer_cm[ord,]

# Make a version that just has pot totals
library(plyr)
popLer <- ddply(popLer_cm, .(ID, Gap, Rep, Treatment, Generation, Pot), summarize,
                Seedlings = sum(Seedlings))
ord <- with(popLer, order(Treatment, Gap, Rep, Generation, Pot))
popLer <- popLer[ord,]
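As a cross-check on the pot-totals step, here is a minimal sketch of the same summary done with base R's aggregate() instead of ddply(). The data frame `toy_cm` is made-up toy data that merely uses the same column names; it is not the real dataset.

```r
# Toy data with the same columns as popLer_cm (values are made up)
toy_cm <- data.frame(
  ID = 21, Treatment = "A", Rep = 1, Gap = "0p",
  Generation = c(1, 1, 1, 1), Pot = c(0, 1, 1, 1),
  Distance = c(4, 9, 10, 11), Seedlings = c(132, 5, 6, 3)
)

# Sum seedlings over Distance within each pot
toy_totals <- aggregate(
  Seedlings ~ ID + Gap + Rep + Treatment + Generation + Pot,
  data = toy_cm, FUN = sum
)
```

With this toy input there is one row per pot, and the three distance records for pot 1 collapse into a single total.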

Now let’s look at what we get:

ProjectTemplate::load.project()
Loading project configuration
Autoloading helper functions
 Running helper script: helpers.R
Autoloading cache
 Loading cached data set: popLer.cm
 Loading cached data set: popLer
Autoloading data
Munging data
 Running preprocessing script: 01-A.R
ls()
[1] "config"          "helper.function" "popLer"          "popLer_cm"      
[5] "project.info"   
head(popLer)
    ID Gap Rep Treatment Generation Pot Seedlings
359 21  0p   1         A          1   0       132
360 21  0p   1         A          1   1        21
361 21  0p   1         A          2   0        74
362 21  0p   1         A          2   1        72
363 21  0p   1         A          2   2        26
364 21  0p   1         A          2   3         2
head(popLer_cm)
   ID Treatment Rep Gap Generation Pot Distance Seedlings
40 21         A   1  0p          1   0        4       132
42 21         A   1  0p          1   1        9         5
43 21         A   1  0p          1   1       10         6
44 21         A   1  0p          1   1       11         3
45 21         A   1  0p          1   1       13         4
41 21         A   1  0p          1   1       14         3

Looking good!

This has the added advantage that any changes Jenn makes to the data can be picked up simply by deleting the cache, and that I can share the project in a public GitHub repository (Jenn can use it as long as we have our Dropbox in the same place).