3 Getting data in via ProjectTemplate
First, here is a draft of the R header chunk that I will want to include in every post that uses R (silently):
knitr::opts_chunk$set(comment = '', cache = TRUE)
knitr::opts_knit$set(root.dir = "..")
I’m sure I’ll come up with some more.
Now, on to how to load the Arabidopsis data using ProjectTemplate. I started with the Ler population data, which is in multiple xlsx files, with one sheet having data, one having metadata, and potentially some blank sheets as well. If I follow the standard protocol for ProjectTemplate and drop them into the data
directory, then when I execute load.project()
I get a data object for every sheet of every file (because there is no way to pass options to read.xlsx()
. In addition, now that they are loaded as data objects, it is hard to access them in a loop to build a the combined datasets. Finally, the workspace is cluttered with all of these objects; if I delete them as part of the data munging, then they will get reloaded the next time I use load.project()
, which is time consuming!
So instead I took the data files out of the project directory, and used the option to have a .R
file in the data
directory. Using this, I can re-use my existing code to read the files in from the Dropbox directory and manipulate them as I want, deleting the intermediate objects and just keeping the final objects. Furthermore, if I then cache the objects, then subsequent loads are fast!
So here’s the file:
cat(readLines('data/popLer.R'), sep = '\n')
### Creates the data objects popLer and popLer_cm, representing the Ler populations
### experiments
# Raw data consists of one file per generation
# Final column is named inconsistently, so needs to be corrected before merge
data_dir <- "~/Documents/Dropbox/Arabidopsis/Data/Exp1"
data_fname <- list.files(data_dir, "seedposition")
popLer_cm <- NULL
for (i in 1:length(data_fname)) {
tempdata <- xlsx::read.xlsx(file.path(data_dir, data_fname[i]), sheetName = "Data",
header = TRUE)
names(tempdata)[9] <- "seedlings"
popLer_cm <- rbind(popLer_cm, tempdata)
rm(tempdata)
}
# Clean up column names and get a useful order
popLer_cm <- popLer_cm[-1]
names(popLer_cm) <- c("ID", "Treatment", "Rep", "Gap", "Generation", "Pot", "Distance",
"Seedlings")
ord <- with(popLer_cm, order(Treatment, Gap, Rep, Generation, Pot, Distance))
popLer_cm <- popLer_cm[ord,]
# Make a version that just has pot totals
require(plyr)
popLer <- ddply(popLer_cm, .(ID, Gap, Rep, Treatment, Generation, Pot), summarize,
Seedlings = sum(Seedlings))
ord <- with(popLer, order(Treatment, Gap, Rep, Generation, Pot))
popLer <- popLer[ord,]
Now let’s look at what we get:
ProjectTemplate::load.project()
Loading project configuration
Autoloading helper functions
Running helper script: helpers.R
Autoloading cache
Loading cached data set: popLer.cm
Loading cached data set: popLer
Autoloading data
Munging data
Running preprocessing script: 01-A.R
ls()
[1] "config" "helper.function" "popLer" "popLer_cm"
[5] "project.info"
head(popLer)
ID Gap Rep Treatment Generation Pot Seedlings
359 21 0p 1 A 1 0 132
360 21 0p 1 A 1 1 21
361 21 0p 1 A 2 0 74
362 21 0p 1 A 2 1 72
363 21 0p 1 A 2 2 26
364 21 0p 1 A 2 3 2
head(popLer_cm)
ID Treatment Rep Gap Generation Pot Distance Seedlings
40 21 A 1 0p 1 0 4 132
42 21 A 1 0p 1 1 9 5
43 21 A 1 0p 1 1 10 6
44 21 A 1 0p 1 1 11 3
45 21 A 1 0p 1 1 13 4
41 21 A 1 0p 1 1 14 3
Looking good!
This has the added advantage that any changes in the data that Jenn makes can be easily loaded by deleting the cache; and that I can share the project on a public github directory (and Jenn can use it as long as we have our Dropbox in the same place).