File to load Ler dispersal data
Here is the new file data/disperseLer.R
:
### Creates the data object disperseLer, representing the Ler dispersal experiment
# Get data from Jenn's file
data_dir <- "~/Dropbox/Arabidopsis/analysis"
disperseLer <- read.csv(file.path(data_dir, '2013_08_08_Exp1_Spray.csv'),
header = TRUE)
# Drop the "clipped" treatment
disperseLer <- droplevels(subset(disperseLer, new_trt != "clipped", drop = TRUE))
# Drop the columns with the (irrelevant) info about where the mom pots came from
disperseLer <- disperseLer[, -c(1:4, 6)]
# Clean up column names
names(disperseLer) <- c("ID", "Pot", "Distance", "Seedlings", "Siliques", "Density",
"Treatment")
# Make some factor variables
disperseLer$ID <- as.factor(disperseLer$ID)
# Clean up the workspace
rm(data_dir)
# Auto-cache the data
ProjectTemplate::cache("disperseLer")
Summarizing Ler dispersal data
Here is a file to summarize the data by replicate, in lib/helpers.R
:
calc_dispersal_stats <- function(dispersal_data) {
dispersal_stats <- group_by(dispersal_data, ID, Density, Siliques) %>%
summarise(
Total_seeds = sum(Seedlings),
Home_seeds = Seedlings[1],
Dispersing_seeds = Total_seeds - Home_seeds,
Dispersal_fraction = Dispersing_seeds / Total_seeds,
# Dispersal stats including all seeds
Mean_dispersal_all = sum(Seedlings * (Distance - 4)) / Total_seeds,
RMS_dispersal_all = sqrt(sum(Seedlings * (Distance - 4)^2) / Total_seeds),
# Dispersal stats including just dispersing seeds
Mean_dispersal = sum(Seedlings * (Distance - 4)) / Dispersing_seeds,
RMS_dispersal = sqrt(sum(Seedlings * (Distance - 4)^2) / Dispersing_seeds)
)
return (dispersal_stats)
}
Let’s look at how the total seed number, the fraction dispersing, and the dispersal distances scale with density:
Ler_dispersal_stats <- calc_dispersal_stats(disperseLer)
qplot(Density, Total_seeds, data = Ler_dispersal_stats)
qplot(Density, Dispersal_fraction, data = Ler_dispersal_stats)
qplot(Density, Mean_dispersal_all, data = Ler_dispersal_stats)
qplot(Density, Mean_dispersal, data = Ler_dispersal_stats)
qplot(Density, RMS_dispersal_all, data = Ler_dispersal_stats)
qplot(Density, RMS_dispersal, data = Ler_dispersal_stats)
Except for total seed number, there don’t seem to be any patterns here. For the single plants, let’s look at whether silique number (which might be correlated with plant size) can predict anything:
qplot(Siliques, Total_seeds, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
qplot(Siliques, Dispersal_fraction, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
qplot(Siliques, Mean_dispersal_all, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
qplot(Siliques, Mean_dispersal, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
qplot(Siliques, RMS_dispersal_all, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
qplot(Siliques, RMS_dispersal, data = subset(Ler_dispersal_stats, !is.na(Siliques)))
Again, no patterns. This makes modeling simpler!
One more thing to look at: are the mean dispersal distances correlated with either the fraction dispersing or the number of dispersing seeds?
qplot(Dispersing_seeds, Mean_dispersal, data = Ler_dispersal_stats)
qplot(Dispersing_seeds, RMS_dispersal, data = Ler_dispersal_stats)
qplot(Dispersal_fraction, Mean_dispersal, data = Ler_dispersal_stats)
qplot(Dispersal_fraction, RMS_dispersal, data = Ler_dispersal_stats)
Here we see a correlation with both. This could just be a sampling effect; once we fit kernel parameters we’ll have to see if they have correlations as well. If so, that suggests that the same variables (such as canopy height and structure, and intensity of simulated rainfall) that affect whether seeds leave the home pot also affect how far they go.
Here’s a multiple regression just to tease out the two variables:
summary(lm(Mean_dispersal ~ Dispersal_fraction + Dispersing_seeds, data = Ler_dispersal_stats))
Call:
lm(formula = Mean_dispersal ~ Dispersal_fraction + Dispersing_seeds,
data = Ler_dispersal_stats)
Residuals:
Min 1Q Median 3Q Max
-1.7144 -0.5299 -0.2285 0.3843 4.4228
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.0255317 0.4449107 13.543 1.76e-15 ***
Dispersal_fraction 2.5229534 1.1204944 2.252 0.0307 *
Dispersing_seeds 0.0016311 0.0008397 1.942 0.0602 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.148 on 35 degrees of freedom
Multiple R-squared: 0.3358, Adjusted R-squared: 0.2979
F-statistic: 8.849 on 2 and 35 DF, p-value: 0.000776
summary(lm(RMS_dispersal ~ Dispersal_fraction + Dispersing_seeds, data = Ler_dispersal_stats))
Call:
lm(formula = RMS_dispersal ~ Dispersal_fraction + Dispersing_seeds,
data = Ler_dispersal_stats)
Residuals:
Min 1Q Median 3Q Max
-1.8021 -0.6943 -0.3313 0.5148 4.4709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4639884 0.5020757 12.875 7.77e-15 ***
Dispersal_fraction 2.6898597 1.2644627 2.127 0.0405 *
Dispersing_seeds 0.0020128 0.0009476 2.124 0.0408 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.296 on 35 degrees of freedom
Multiple R-squared: 0.3415, Adjusted R-squared: 0.3039
F-statistic: 9.076 on 2 and 35 DF, p-value: 0.0006679
It appears that both are playing a role. But the dispersal fraction is a stronger predictor, which suggests that there is real mechanism here, and not just sampling getting further into the kernel.
Kernel fitting issues
Our data are actually interval-censored, and we need to ensure that we are taking that into account. I believe that the likelihood functions I wrote previously take that into account, but they lack the rigor of fitdistrplus. Unfortunately, the data have another issue: the seeds in pot zero are a mixture of “non-dispersers” (i.e., seeds that fall straight down and aren’t part of the kernel) and “short-distance dispersers” (seeds moving according to a dispersal kernel, but traveling less than 3 cm). To account for this either requires writing a zero-inflated distribution (which will require custom functions for everything) or telling the objective function to truncate the distribution. Again I believe that my prior functions do the latter, but I’m not truly confident of that. Unfortunately, there doesn’t seem to be an obvious way to account this in fitdistrplus using out-of-the-box distributions.
Actually, I take that back—the issue of truncation is addressed in the FAQs. It does require writing new distribution functions, but they are pretty straightforward. Here is their example for a (doubly) truncated exponential distribution:
dtexp <- function(x, rate, low, upp)
{
PU <- pexp(upp, rate=rate)
PL <- pexp(low, rate=rate)
dexp(x, rate) / (PU-PL) * (x >= low) * (x <= upp)
}
ptexp <- function(q, rate, low, upp)
{
PU <- pexp(upp, rate=rate)
PL <- pexp(low, rate=rate)
(pexp(q, rate)-PL) / (PU-PL) * (q >= low) * (q <= upp) + 1 * (q > upp)
}
Looking at the dispersal patterns
Let’s start by just plotting the raw data. The first plot is just the empirical kernels; the second puts the vertical axis on a log scale (so exponential distributions should look linear and normal distributions should look quadratic) and the third puts both axes on a log scale (so lognormal distributions should look quadratic).
ggplot(aes(x = Distance, y = Seedlings), data = subset(disperseLer, Distance > 4)) +
geom_line() + facet_wrap(~ID, scales = "free_y")
ggplot(aes(x = Distance, y = Seedlings), data = subset(disperseLer, Distance > 4)) +
geom_line() + facet_wrap(~ID, scales = "free_y") + scale_y_log10()
ggplot(aes(x = Distance, y = Seedlings), data = subset(disperseLer, Distance > 4)) +
geom_line() + facet_wrap(~ID, scales = "free_y") + scale_y_log10() + scale_x_log10()
In many cases, it seems like a (truncated) normal distribution might be best! Rep 87_1 is going to be particularly problematic, as it is strongly sigmoid.
Trial fitting
I’ll start by selecting a single rep with a large number of seeds and a well-behaved distribution. Let’s use 87_0.
Pull out the data and convert into the format required by fitdistrplus:
data_87_0 <- subset(disperseLer, ID == "87_0" & Distance > 4)
data_vec <- rep(data_87_0$Distance, data_87_0$Seedlings)
cens_data_tble <- data.frame(left = data_vec - 4.5, right = data_vec - 3.5)
Now construct the distribution functions for a truncated normal. I code in the “standard” truncation level.
dtnorm <- function(x, mean, sd, low = 3.5)
{
PU <- 1
PL <- pnorm(low, mean = mean, sd = sd)
dnorm(x, mean, sd) / (PU-PL) * (x >= low)
}
ptnorm <- function(q, mean, sd, low = 3.5)
{
PU <- 1
PL <- pnorm(low, mean = mean, sd = sd)
(pnorm(q, mean, sd)-PL) / (PU-PL) * (q >= low)
}
So now let’s give it a try:
library(fitdistrplus)
fit_tnorm <- fitdistcens(cens_data_tble, "tnorm", start = list(mean = 6, sd = 10))
summary(fit_tnorm)
Fitting of the distribution ' tnorm ' By maximum likelihood on censored data
Parameters
estimate Std. Error
mean 7.681504 0.5930568
sd 5.588817 0.3661223
Fixed parameters:
data frame with 0 columns and 0 rows
Loglikelihood: -1242.553 AIC: 2489.106 BIC: 2497.343
Correlation matrix:
mean sd
mean 1.000000 -0.803976
sd -0.803976 1.000000
cdfcompcens(fit_tnorm, Turnbull.confint = TRUE)
This looks like a pretty darn good fit!
Interestingly, using the non-censored method gives almost identical parameter estimates and CDF:
fit3<-fitdist(data_vec - 4, "tnorm", start = list(mean = 6, sd = 10))
summary(fit3)
Fitting of the distribution ' tnorm ' by maximum likelihood
Parameters :
estimate Std. Error
mean 7.696363 0.5870556
sd 5.583054 0.3626565
Loglikelihood: -1242.376 AIC: 2488.752 BIC: 2496.988
Correlation matrix:
mean sd
mean 1.000000 -0.801385
sd -0.801385 1.000000
cdfcomp(fit3)
Thus, we can use some of the other diagnostic plots for non-censored data:
denscomp(fit3, breaks = 3:23+0.5, plotstyle = "ggplot", xlim = c(3, 24))
# qqcomp(fit3) # This one requires defining a quantile function!
ppcomp(fit3)
Unfortunately, we need additional work to get qqcomp working; and denscomp only gives a sensible looking histogram with tweaking (and even there, the bins are not centered appropriately, so the histogram is shifted).
Here’s a rough draft of a better figure:
xx = seq(3.3, max(fit3$data) + 0.7, 0.1)
plot(xx, (dtnorm(xx, fit3$estimate[1], fit3$estimate[2])), type = "l", col="red")
points(table(fit3$data)/length(fit3$data))
We can make analagous functions for the log normal distribution:
dtlnorm <- function(x, meanlog, sdlog, low = 3.5)
{
PU <- 1
PL <- plnorm(low, meanlog = meanlog, sdlog = sdlog)
dlnorm(x, meanlog, sdlog) / (PU-PL) * (x >= low)
}
ptlnorm <- function(q, meanlog, sdlog, low = 3.5)
{
PU <- 1
PL <- plnorm(low, meanlog = meanlog, sdlog = sdlog)
(plnorm(q, meanlog, sdlog)-PL) / (PU-PL) * (q >= low)
}
Now fit to the censored data and compare with our normal fit
fit_tlnorm <- fitdistcens(cens_data_tble, "tlnorm", start = list(meanlog = log(6), sdlog = log(10)))
summary(fit_tlnorm)
Fitting of the distribution ' tlnorm ' By maximum likelihood on censored data
Parameters
estimate Std. Error
meanlog 2.1737212 0.02373376
sdlog 0.4584419 0.01879380
Fixed parameters:
data frame with 0 columns and 0 rows
Loglikelihood: -1254.488 AIC: 2512.976 BIC: 2521.212
Correlation matrix:
meanlog sdlog
meanlog 1.0000000 -0.2594226
sdlog -0.2594226 1.0000000
cdfcompcens(list(fit_tnorm, fit_tlnorm), Turnbull.confint = TRUE,
legendtext = c("truncated normal", "truncated lognormal"))
The CDF of the log-normal is not as good, and the AIC is about 24 units worse.
Fit all the replicates
Start by writing a function that takes a single replicate’s data and does the analysis
fit_dispersal_models <- function(dispersal_data) {
data_loc <- subset(dispersal_data, Distance > 4)
data_vec <- rep(data_loc$Distance, data_loc$Seedlings)
cens_data_tble <- data.frame(left = data_vec - 4.5, right = data_vec - 3.5)
fit_tnorm <- fitdistcens(cens_data_tble, "tnorm", start = list(mean = 6, sd = 10))
fit_tlnorm <- fitdistcens(cens_data_tble, "tlnorm", start = list(meanlog = log(6), sdlog = log(10)))
cdfcompcens(list(fit_tnorm, fit_tlnorm), Turnbull.confint = TRUE, main = dispersal_data$ID[1],
legendtext = c("truncated normal", "truncated lognormal"))
stnorm <- summary(fit_tnorm)
stlnorm <- summary(fit_tlnorm)
data.frame(ID = dispersal_data$ID[1], AICnorm = stnorm$aic, AIClnorm = stlnorm$aic,
mu = stnorm$est[1], sd = stnorm$est[2],
mulog = stlnorm$est[1], sdlog = stlnorm$est[2])
}
Test it out:
fit_dispersal_models(subset(disperseLer, ID == "87_0"))
ID AICnorm AIClnorm mu sd mulog sdlog
mean 87_0 2489.106 2512.976 7.681504 5.588817 2.173721 0.4584419
ddply(subset(disperseLer, ID == "87_0" | ID == "128_1"), "ID",
fit_dispersal_models)
ID AICnorm AIClnorm mu sd mulog sdlog
1 128_1 4388.528 4368.020 -41.254702 13.903769 1.723583 0.5179114
2 87_0 2489.106 2512.976 7.681504 5.588817 2.173721 0.4584419
Full run
all_Ler_fits <- ddply(disperseLer, "ID", fit_dispersal_models)
Warning in sqrt(diag(varcovar)): NaNs produced
Warning in sqrt(1/diag(V)): NaNs produced
Warning in cov2cor(varcovar): diag(.) had 0 or NA entries; non-finite
result is doubtful
all_Ler_fits
ID AICnorm AIClnorm mu sd mulog sdlog
1 100_0 210.19617 209.46451 4.0982669 2.598637 1.662753 0.3158626
2 101_0 602.77572 592.94699 5.1014741 3.277311 1.839458 0.3314258
3 104_0 3917.28412 3922.72793 -66.0205798 19.643467 1.781924 0.6310585
4 105_0 1749.42577 1745.88089 4.3716320 3.528681 1.788268 0.3682102
5 106_0 646.69449 643.17978 -207.7151988 30.384323 1.307769 0.7884596
6 107_0 785.44720 789.47637 -69.4381348 16.731328 1.359045 0.6932002
7 107_2 1063.19962 1049.11621 -70.4622184 16.881924 1.740646 0.4708113
8 108_0 3238.39521 3253.82893 5.1491554 5.733904 2.030345 0.4653378
9 109_0 986.23168 993.20962 0.3779014 6.555358 1.830760 0.5150772
10 118_0 204.70929 204.18931 5.0644220 2.761443 1.771622 0.3257241
11 119_0 159.50334 161.20702 -6.3007402 10.223254 1.877411 0.6364096
12 125_0 1535.73278 1535.21286 4.3296332 3.730306 1.802977 0.3811286
13 128_1 4388.52816 4368.01998 -41.2547017 13.903769 1.723583 0.5179114
14 131_0 238.69069 240.39449 -4.1084462 5.925933 1.515535 0.5133134
15 133_0 157.21511 155.12266 5.1511826 3.886229 1.897703 0.3621337
16 134_0 309.97239 302.59321 3.1582072 4.429201 1.839981 0.3572949
17 135_0 2047.52526 2051.75059 6.8726239 4.285294 2.044296 0.3991739
18 73_0 1532.73721 1536.47422 3.8581071 5.121355 1.901724 0.4439891
19 75_0 1268.28878 1260.59024 4.1641279 5.606950 1.984015 0.4344484
20 75_1 713.93094 712.15268 5.1754515 3.910981 1.888703 0.3829246
21 77_0 1155.49401 1152.37149 1.9264317 4.898033 1.780264 0.4203578
22 78_2 3036.48468 3056.08625 6.5758420 5.186990 2.078296 0.4447256
23 79_0 16.43216 16.74522 4.7295954 1.069369 1.567024 0.1989932
24 85_1 732.95641 721.38097 2.8431735 4.174413 1.781615 0.3625076
25 85_2 4355.97098 4360.51040 3.7176899 5.757882 1.957874 0.4585569
26 87_0 2489.10649 2512.97557 7.6815043 5.588817 2.173721 0.4584419
27 87_1 4843.91681 4931.83322 8.2138623 4.811877 2.161850 0.4406535
28 87_2 2518.89435 2497.62958 3.4870029 4.699188 1.861861 0.4004055
29 88_0 505.51054 498.61356 3.1493213 3.455050 1.709980 0.3395724
30 88_1 632.33299 627.93489 0.9579404 4.372635 1.677200 0.3862587
31 89_1 2374.42168 2376.20474 -34.9798971 15.374743 1.835881 0.6015254
32 90_0 1414.00768 1402.33081 4.8038650 4.040552 1.881511 0.3774301
33 90_1 127.58152 128.43750 -14.7687132 7.623299 1.276556 0.5866747
34 91_2 3391.73490 3426.96778 2.8737605 7.010140 1.991896 0.5304049
35 93_0 1029.93150 1033.78692 2.0282346 5.749507 1.853023 0.4719776
36 95_0 639.50220 636.32519 -4.6097825 6.728016 1.717181 0.4532293
37 98_0 1956.80573 1995.93158 12.3030722 4.651470 2.468955 0.3797439
38 99_0 1252.70855 1262.56877 2.1108985 6.230271 1.889331 0.5023501
Some AIC stats
sum(all_Ler_fits$AICnorm)
[1] 58230.28
sum(all_Ler_fits$AIClnorm)
[1] 58366.17
cbind(all_Ler_fits$ID, all_Ler_fits$AICnorm - all_Ler_fits$AIClnorm)
[,1] [,2]
[1,] 1 0.7316622
[2,] 2 9.8287266
[3,] 3 -5.4438047
[4,] 4 3.5448752
[5,] 5 3.5147126
[6,] 6 -4.0291707
[7,] 7 14.0834126
[8,] 8 -15.4337230
[9,] 9 -6.9779437
[10,] 10 0.5199870
[11,] 11 -1.7036794
[12,] 12 0.5199187
[13,] 13 20.5081874
[14,] 14 -1.7037993
[15,] 15 2.0924486
[16,] 16 7.3791782
[17,] 17 -4.2253246
[18,] 18 -3.7370105
[19,] 19 7.6985462
[20,] 20 1.7782603
[21,] 21 3.1225217
[22,] 22 -19.6015753
[23,] 23 -0.3130643
[24,] 24 11.5754364
[25,] 25 -4.5394237
[26,] 26 -23.8690776
[27,] 27 -87.9164058
[28,] 28 21.2647714
[29,] 29 6.8969805
[30,] 30 4.3980948
[31,] 31 -1.7830564
[32,] 32 11.6768719
[33,] 33 -0.8559744
[34,] 34 -35.2328843
[35,] 35 -3.8554141
[36,] 36 3.1770114
[37,] 37 -39.1258517
[38,] 38 -9.8602207
So there’s variation among reps on which model fits better. In sum, the normal distribution is better. However, the parameter estimates are much more variable, in ways that really don’t make sense biologically, unless there is directionally biased dispersal that varies from rep to rep.