Last week Jenn Williams got in touch with me again. She wrote:
> I thought again about the project you did looking at variability across the Arabidopsis replicates when I was at a small working group at the Fields Institute this summer (more spread stuff, this time with mathematicians). We had some interesting discussions about how to partition variability in these kinds of experiments, and then I remembered that of course you’d already taken one approach to doing so with the Arabidopsis data. There is a little momentum (although more interest and enthusiasm than anything), and I wondered if you would be interested in getting back into this. I would like to dive back in - on my current hectic teaching term, that means just a little bit now, and then picking up in more seriousness in December and the start of the new year.
I went back to look at where I had left things (way back in Dec. 2019!), and then we had a Zoom call. It turns out that she’s been almost as unplugged from new research as I have over the past few years (because of kids and Covid), and is excited to get back into this. She’ll be busy teaching until December, then has more time in January (the opposite of me!). We agreed that I would work on getting the code running to re-do the Ler results (from spring 2019) and get new RIL results. I will send intermediate outputs and other updates to Jenn as I go, to have accountability for myself and to keep her engaged.
One note I made during our conversation was to experiment with allowing all pots within a replicate at a given time to have the same kernels; in looking through my code I’ve found that I have already implemented an (undocumented) switch for this called `kernel_stoch_pots`.
This week I read carefully through `lib/model.R`, which contains `iterate_genotype()` (the function that carries a single genotype forward one time step across all replicates) and its supporting functions (except for a distribution or two). There are some random print statements that suggest I may still have been debugging things; but there are also a bunch of unit tests in `lib/tests.R` that I need to explore (these aren’t coded using testthat, instead relying on visual inspection of results; perhaps because of the challenge of creating meaningful absolute tests when random numbers are involved).
Next steps
Looking back at the last post from 2019, I see that the single-genotype model was still throwing some errors (item 1). And the code to run the model (e.g., `src/make_Ler_sims.R`) is not documented (item 5). But first, here are my notes on working through the code.
`model.R` notes
Conceptual issues
- Do we actually have all the parameter values for RIL? E.g., spatial and temporal stochasticity in seed production? I don’t see any mention of the latter in the parameter estimation document (and I certainly wasn’t thinking about this when I generated RIL density dependence parameters in summer 2015). Actually, I’ve just looked at the parameter generation code, and I simply assumed that every RIL had the same random effects and variance inflation as Ler. A heroic assumption, but experience with Ler suggests that these details don’t matter much.
- Sometimes the state variable in the data is seeds (e.g., for the dedicated dispersal experiments) and sometimes it is seedlings (e.g., for some/all of the population data). I have treated these interchangeably. Does this matter? I vaguely recall long-ago discussions of germination rates (very high; the Science paper just says “> 95%”) and seedling death before they are large enough to be counted (potentially an issue at high densities?) [Queried Jenn about this]
- All `n_rep` runways are subjected to completely correlated temporal environmental stochasticity in seed production, meaning that `n_rep` represents the number of replicate runways in the greenhouse. Monte Carlo replication needs to happen at a higher level (in `make_Ler_sims` I called that `nruns`)
- In the greenhouse, all four landscapes were run simultaneously (and also so for evolving and non-evolving populations). In the simulations, each landscape is run separately, losing the correlation in temporal environmental stochasticity. I think this is ok as long as we are not directly comparing the landscapes. However, this could be an issue for evolving vs. non-evolving, where we might well need to look at pairwise differences (if there’s a lot of variability between MC replicates). Thus there may be call for setting up the model to simulate the entire greenhouse.
- This is an even bigger issue for the RIL experiments, for if I follow the current model of iterating each genotype independently, they will get independent temporal stochasticity.
- Perhaps the easiest approach is to generate the temporal deviation outside of `iterate_genotype` and pass it in, keeping it fixed for all RILs/landscapes in an individual generation and MC rep
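A minimal sketch of that idea (the names `n_gen`, `sigma_temp`, and the `temporal_dev` argument are hypothetical, not the actual code): draw one shared temporal deviate per generation, outside the per-genotype loop, so every RIL and landscape in a given generation and MC rep sees the same environment.

```r
# Sketch only: one temporal environmental deviate per generation,
# shared by all genotypes/landscapes within a Monte Carlo replicate.
n_gen <- 10
sigma_temp <- 0.3
temporal_dev <- rnorm(n_gen, mean = 0, sd = sigma_temp)

for (gen in seq_len(n_gen)) {
  # every genotype iterated in this generation would receive the same
  # deviate, preserving the correlation across runways, e.g.:
  # state <- iterate_genotype(state, params, controls,
  #                           temporal_dev = temporal_dev[gen])
}
```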
Documentation
- Add documentation of the inputs and outputs of the helper functions, where they are not obvious
- Add documentation for the `max_pots` and `kernel_stoch_pots` elements of `controls`
- Add documentation for kernel parameters
- Simplify description of parameter dimensions in Ler vs. RIL
- Describe `N_tot`, exactly what `Adults` and `Seeds` represent, and the return value from `iterate_genotype()`. The names are potentially misleading, as seeds become next-generation adults, and perhaps neither were directly measured.
- Fix typos in comments, as well as corrections noted on printout
- Documentation of `Gompertz_seeds()` doesn’t quite get to the equation as used
- Need better commenting on forward/backward dispersal in `seed_sampling`
- Need some documentation for `disp_table()`
Code concerns, inconsistencies, and redundancies
- There are some commented-out `aperm` statements in `seed_sampling`. Can these be deleted, or was this part of my incomplete debugging work?
- In `iterate_genotype`, dispersal further than `max_pots` is truncated. However, this only affects the output from `det_kernel`; in `seed_sampling` there’s a line that effectively puts all the super-dispersers into the last pot. I think I should delete that line for consistency.
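To make the distinction concrete, here is a toy illustration (the data and the `pot_landed` name are made up, not taken from `seed_sampling`) of the two ways dispersal beyond `max_pots` can be handled:

```r
# Two behaviors for seeds dispersing beyond max_pots (toy example)
max_pots   <- 5
pot_landed <- c(1, 3, 7, 9, 2)   # pot index each seed dispersed to

piled     <- pmin(pot_landed, max_pots)          # super-dispersers pile into pot 5
truncated <- pot_landed[pot_landed <= max_pots]  # super-dispersers are dropped
```

Deleting the pile-into-the-last-pot line would make `seed_sampling` consistent with the truncation already applied in `det_kernel`.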
Code organization
- `n_reps` and `new_pots` are features of the experimental design, so should really be in `params` rather than `controls`. The same would go for `max_pots` if there is any empirical basis for putting an upper bound on dispersal distance.
- Actually, it seems like we really should separate “biological parameters,” “experiment parameters,” and “model settings.”
- Helper functions for probability distributions are scattered across `model.R`, `helpers.R`, and `dists.R`. They should all be in the latter (and better documented!)
- Add a `dispersal_buffer()` helper in `combine_dispersed_seeds()`
Efficiency and clarity
- When using no kernel stochasticity in `iterate_genotype()`, the parameters are replicated across a large array. This leads to inefficiencies in `det_kernel()` and `seed_sampling()` (unless the generalized gamma functions take parameter vectors; see below). The benefit (which is not trivial) is that the exact same code is being run, whether or not there is kernel sampling.
- The way that the calculations in `det_kernel` are repeated until the furthest tail gets below 0.5 seems quite inefficient
- Calls to distribution functions are awkward; I seem to have been assuming that they can’t take vectors (or matrices, in this case) of parameters. However, some experiments with `rbinom` indicate that, as of October 2020 (the version currently on my iMac), it could take parameter matrices (the result comes back as a vector, but reshapes properly using the dimensions of the input matrices). I should check this for gengamma, as this will make the code cleaner and probably faster.
- If gengamma doesn’t have this capacity, then I should write helper functions so that the code is cleaner
- The last section of `gapify()` seems redundant to the last section of `combine_dispersed_seeds()`.
  - I guess the exception is for an endpoint that would be in a gap. And the cost here is probably small
- Before taking any of this on, I should do some profiling to see what, if any, of this represents a limiting factor
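The `rbinom` experiment mentioned above is easy to reproduce: base R recycles matrix-valued `size` and `prob` elementwise, returning a plain vector that reshapes with the inputs’ dimensions.

```r
# rbinom() accepts matrix-valued parameters; the result is a vector
# that can be reshaped using the dimensions of the input matrices
size <- matrix(c(10, 20, 30, 40), nrow = 2)
prob <- matrix(c(0, 0.5, 0.9, 1), nrow = 2)

draws <- rbinom(length(size), size = size, prob = prob)
dim(draws) <- dim(size)   # restore the 2 x 2 shape

# per-cell parameters are respected:
# draws[1, 1] is Binomial(10, 0) = 0; draws[2, 2] is Binomial(40, 1) = 40
```

If `rgengamma` behaves the same way, the big parameter-replication arrays could be fed to it directly without looping.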
General housecleaning
- Update to current versions of R on both my iMac and MacBook. This will certainly require migrating the ProjectTemplate material (it is already throwing warnings); the real question is whether all the data munging will still work at all.
- At some point blogdown moved to a new structure where each post has its own folder and the Rmd is named “index.Rmd”. This is annoying, especially if I’m looking at source from previous posts! And it also creates unnecessary folder bloat, given the simplicity of my work. I recall seeing a way to revert this (I think I ran into it previously with the coral project); I need to do this.
Debugging
- When I try to run `make_Ler_sims` I get an Rcpp error thrown from deep within `rgengamma`. I don’t think it’s anything to do with my code; I’m sending in reasonable values. I hope that updating everything will fix this.
- There are some comprehensive (though undocumented) test routines that I had written in `tests.R`, which exercise the various functions in `model.R`. When running the first one, `test_kernel_stoch()`, I noticed that the sample means and covariances for the parameters generated from the multivariate normal exactly match the input values. I had used the `empirical = TRUE` flag, as from the help for `mvrnorm` that seemed like the thing to do. But that forces this rescaling, which is not what I want. The function is from MASS, but Venables & Ripley say nothing about it; I can only imagine it has value in preserving structure in some bootstrapping applications. At any rate, I need to take that out.
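The `empirical = TRUE` behavior is easy to demonstrate: MASS’s `mvrnorm` rescales the sample so its mean and covariance match the inputs exactly, rather than only in expectation.

```r
library(MASS)  # for mvrnorm()

Sigma <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
set.seed(1)
x <- mvrnorm(n = 50, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)

colMeans(x)  # exactly (0, 0), up to floating-point error
cov(x)       # exactly Sigma -- the sampling variability has been rescaled away
# with the default empirical = FALSE, these would match only in expectation
```

Dropping the flag restores honest sampling variability, which is what the kernel-stochasticity machinery needs.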
A bit of work
I updated R (to v. 4.2.1) and the libraries; everything seems to work (though I didn’t try rebuilding the data). As hoped for, it resolved the Rcpp error.
There are some warnings when building the blog. Will need to deal with them.
I’ve found that the -Inf error comes when there’s a runway that is completely extinct: it comes up whenever I use `max(seq_along(y)[y > 0])`, because `max` returns -Inf for an empty vector. I can fix this by max’ing the result with 1 (I think).
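A minimal check of that fix, wrapped in a hypothetical helper (the `furthest_occupied` name is mine, not from the code): passing 1 as an extra argument to `max()` floors the result, so the empty-vector case never produces -Inf.

```r
furthest_occupied <- function(y) {
  # index of the furthest pot with any individuals; the extra argument 1
  # floors the result, so an extinct runway returns 1 instead of -Inf
  max(seq_along(y)[y > 0], 1)
}

furthest_occupied(c(3, 0, 5, 0))  # 3 (pot 3 is the furthest occupied)
furthest_occupied(c(0, 0, 0))     # 1, not -Inf, and no warning
```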