Consider again Lena’s wind farm study (Exericise 14.3). She would like to predict what fish occur where. What type of model should she use, to get the best predictions?
The focus here is prediction so she should be using some model designed to have good predictive performance. One thing that might help is putting random effects on terms, that take different values for different species, to borrow strength across species.
Petrus set up 28 pitfall traps on sand dunes, in different environmental conditions (in the open, under shrubs, etc) and classified all hunting spiders that fell into the trap to species. He measured six environmental variables characterising each site. He would like to know: Which environmental variables best predict hunting spider communities?
What analysis approach should he use?
This is a variable selection problem, he should be using multivariate model selection techniques. One approach is to construct models designed to be pretty good for prediction, and choose the one with best predictive performance. For example, he could fit mixed models, with random effects for different species, and use cross-validation to see which set of predictors gives the model has best predictive performance.
library(ecostats)
data(windFarms)
set.seed(5) # use this seed to get the same results as below:
nStations = length(levels(as.factor(windFarms$X$Station)))
isTestStn = sample(nStations,nStations/2)
isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
library(mvabund)
windMV = mvabund(windFarms$abund)
windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_Test = predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,], type="response")
prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata=windFarms$X[isTest,], type="response")
predLogL = dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)
predLogL_Int = dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)
c(sum(predLogL),sum(predLogL_Int))
#> [1] -931.5643 -928.3885
What do these results tell us about which model is preferred?
These results suggest the main effect model is better.
Repeat the analyses of Code Box 15.1 but after removing rare species (observed less than 10 times), using the following code:
windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_TestRare = predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,], type="response")
prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata=windFarms$X[isTest,], type="response")
predLogLRare = dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)
predLogL_IntRare = dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)
c(sum(predLogLRare),sum(predLogL_IntRare))
#> [1] -781.0415 -788.9616
Did you get a similar answer?
Nope – now the model with an interaction term looks better!
Note that so far we have only considered one test sample, and there is randomness in the choice of training/test split. Repeat the analyses of Code Box 15.1 as well as those you have done here, with and without rare species, multiple times (which is a form of cross-validation).
set.seed(1) # use this seed to get the same results as below:
nStations = length(levels(as.factor(windFarms$X$Station)))
isTestStn = sample(nStations,nStations/2)
isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_Test = predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,], type="response")
prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata=windFarms$X[isTest,], type="response")
predLogL = dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)
predLogL_Int = dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)
c(sum(predLogL),sum(predLogL_Int))
#> [1] -915.8434 -970.8314
windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_TestRare = predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,], type="response")
prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata=windFarms$X[isTest,], type="response")
predLogLRare = dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)
predLogL_IntRare = dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)
c(sum(predLogLRare),sum(predLogL_IntRare))
#> [1] -734.4901 -780.1963
set.seed(2) # use this seed to get the same results as below:
nStations = length(levels(as.factor(windFarms$X$Station)))
isTestStn = sample(nStations,nStations/2)
isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
library(mvabund)
windMV = mvabund(windFarms$abund)
windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_Test = predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,], type="response")
prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata=windFarms$X[isTest,], type="response")
predLogL = dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)
predLogL_Int = dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)
c(sum(predLogL),sum(predLogL_Int))
#> [1] -886.4941 -992.3476
windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_TestRare = predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,], type="response")
prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata=windFarms$X[isTest,], type="response")
predLogLRare = dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)
predLogL_IntRare = dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)
c(sum(predLogLRare),sum(predLogL_IntRare))
#> [1] -756.8341 -865.5058
set.seed(3) # use this seed to get the same results as below:
nStations = length(levels(as.factor(windFarms$X$Station)))
isTestStn = sample(nStations,nStations/2)
isTest = windFarms$X$Station %in% levels(windFarms$X$Station)[isTestStn]
library(mvabund)
windMV = mvabund(windFarms$abund)
windFt_Train=manyglm(windMV[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_Train=manyglm(windMV[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_Test = predict.manyglm(windFt_Train,newdata=windFarms$X[isTest,], type="response")
prWind_Int_Test = predict.manyglm(windFt_Int_Train, newdata=windFarms$X[isTest,], type="response")
predLogL = dpois(windMV[isTest,],lambda=prWind_Test,log=TRUE)
predLogL_Int = dpois(windMV[isTest,],lambda=prWind_Int_Test,log=TRUE)
c(sum(predLogL),sum(predLogL_Int))
#> [1] -830.5662 -832.3659
windFt_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year+Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
windFt_Int_TrainRare=manyglm(windMVnotRare[isTest==FALSE,]~Year*Zone,
data=windFarms$X[isTest==FALSE,],family="poisson")
prWind_TestRare = predict.manyglm(windFt_TrainRare,newdata=windFarms$X[isTest,], type="response")
prWind_Int_TestRare = predict.manyglm(windFt_Int_TrainRare, newdata=windFarms$X[isTest,], type="response")
predLogLRare = dpois(windMVnotRare[isTest,],lambda=prWind_TestRare,log=TRUE)
predLogL_IntRare = dpois(windMVnotRare[isTest,],lambda=prWind_Int_TestRare,log=TRUE)
c(sum(predLogLRare),sum(predLogL_IntRare))
#> [1] -734.7738 -738.2911
Which set of results tends to be more reliable (less variable) – the ones with or without the rare species? Why do you think this happened?
Without the rare species, the results are more reliable, with predictive likelihood usually in the mid -700’s, and the interaction model consistently performing better (although sometimes narrowly so). With rare species, it bounces around a lot more, with the smaller model (main effects only) sometimes having better predictive performance. This is likely because rare species with all absences in one treatment combination can blow up the predictive likelihood (if they are predicted to have a mean of zero, but have some non-zero cases in test data).
windComp = manyglm(windMV~Zone*Year,data=windFarms$X,composition=TRUE)
library(glmmTMB)
wind_glmm = glmmTMB(windMV~Year*Zone+diag(Year*Zone|cols),
family=poisson(), data=windComp$data)
summary(wind_glmm)
#> Family: poisson ( log )
#> Formula: windMV ~ Year * Zone + diag(Year * Zone | cols)
#> Data: windComp$data
#>
#> AIC BIC logLik deviance df.resid
#> 3259.9 3331.4 -1617.9 3235.9 2852
#>
#> Random effects:
#>
#> Conditional model:
#> Groups Name Variance Std.Dev. Corr
#> cols (Intercept) 5.3421 2.3113
#> Year2010 1.6302 1.2768 0.00
#> ZoneN 0.6807 0.8251 0.00 0.00
#> ZoneS 3.9012 1.9751 0.00 0.00 0.00
#> Year2010:ZoneN 0.8519 0.9230 0.00 0.00 0.00 0.00
#> Year2010:ZoneS 0.3777 0.6146 0.00 0.00 0.00 0.00 0.00
#> Number of obs: 2864, groups: cols, 16
#>
#> Conditional model:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -3.18913 0.65414 -4.875 1.09e-06 ***
#> Year2010 -0.26989 0.45008 -0.600 0.549
#> ZoneN -0.15541 0.35209 -0.441 0.659
#> ZoneS -0.16619 0.61670 -0.269 0.788
#> Year2010:ZoneN 0.08998 0.41305 0.218 0.828
#> Year2010:ZoneS 0.56211 0.37247 1.509 0.131
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculate the predictive likelihood for the mixed model fitted to the wind farm data (Code Box 15.2), following the method in Code Box 15.1.
isTest = windComp$data$Station %in% levels(windComp$data$Station)[isTestStn]
wind_glmm = glmmTMB(windMV~Year*Zone+diag(Year*Zone|cols),
family=poisson(), data=windComp$data[isTest==FALSE,])
prGLMM_Test = predict(wind_glmm, newdata=windComp$data[isTest==TRUE,], type="response")
predGLMM = sum(dpois(windComp$data$windMV[isTest==TRUE],lambda=prGLMM_Test,log=TRUE))
c(predGLMM,sum(predLogL))
#> [1] -744.1239 -830.5662
Did this have better predictive performance? Why do you think this happened?
This did waaaay better than manyglm
. Probably because it
was borrowing strength across species to better predict rare ones, not
overfitting quite as badly.
library(glmnet)
X = model.matrix(windMV~Year*Zone*cols,data=windComp$data)
y = windComp$data$windMV
windLasso = glmnet(X,y, family="poisson")
isTest = windComp$data$Station %in%
levels(windComp$data$Station)[isTestStn]
windLassoTrain = glmnet(X[isTest==FALSE,], y[isTest==FALSE], family="poisson")
prLassoTest = predict(windLassoTrain,X[isTest,],type="response")
predLLlasso=colSums(dpois(windComp$data$windMV[isTest],prLassoTest,
log=TRUE))
plot(windLassoTrain$lambda,predLLlasso,type="l",log="x")
matplot(windLasso$lambda,t(windLasso$beta),type="n", log="x")
matlines(windLasso$lambda,t(windLasso$beta))
library(grplasso)
windLambdaTrain = lambdamax(windMV~Year*Zone*cols, data=windComp$data,
subset=isTest==FALSE, model = PoissReg()) * 0.7^(0:19)
windGrplasso = grplasso(windMV~Year*Zone*cols, data=windComp$data,
lambda=windLambdaTrain, subset=isTest==FALSE, model = PoissReg())
#> Lambda: 556.505 nr.var: 16
#> Lambda: 389.5535 nr.var: 31
#> Lambda: 272.6875 nr.var: 61
#> Lambda: 190.8812 nr.var: 61
#> Lambda: 133.6169 nr.var: 61
#> Lambda: 93.5318 nr.var: 61
#> Lambda: 65.47226 nr.var: 91
#> Lambda: 45.83058 nr.var: 91
#> Lambda: 32.08141 nr.var: 91
#> Lambda: 22.45699 nr.var: 91
#> Lambda: 15.71989 nr.var: 91
#> Lambda: 11.00392 nr.var: 91
#> Lambda: 7.702746 nr.var: 91
#> Lambda: 5.391922 nr.var: 91
#> Lambda: 3.774346 nr.var: 91
#> Lambda: 2.642042 nr.var: 91
#> Lambda: 1.849429 nr.var: 91
#> Lambda: 1.294601 nr.var: 91
#> Lambda: 0.9062204 nr.var: 91
#> Lambda: 0.6343543 nr.var: 91
prGrpTest = predict(windGrplasso,newdata=windComp$data[isTest,], type="response")
predLLgrplasso = colSums(dpois(windComp$data$windMV[isTest], prGrpTest,log=TRUE))
plot(windGrplasso$lambda,predLLgrplasso,log="x",type="l")
matplot(windGrplasso$lambda,t(windGrplasso$coefficients),type="n",log="x")
matlines(windGrplasso$lambda,t(windGrplasso$coefficients))
Compare the predictive log-likelihoods from Code Box 15.1 and Figures 15.1c-d. These models were all applied using the same test dataset. Which model seems to fit the data better? Why do you think this is?
c(sum(predLogL),sum(predLogL_Int), max(predLLlasso), max(predLLgrplasso))
#> [1] -830.5662 -832.3659 -744.2159 -739.6487
The LASSO models fit way better, as expected. These models shrink parameters together, borrowing strength across taxa. The two LASSO fits did about as well as each other.
library(VGAM)
wind_RR2=rrvglm(as.matrix(windFarms$abund)~Year*Zone, family=poissonff, data=windFarms$X, Rank=2)
wind_manyglm = manyglm(windMV~Year*Zone, data=windFarms$X, family=poisson())
c( BIC(wind_RR2), sum(BIC(wind_manyglm)))
#> [1] 3317.985 3471.901
zoneyear = interaction(windFarms$X$Zone,windFarms$X$Year)
matplot(as.numeric(zoneyear),latvar(wind_RR2),pch=c(1,19))
Which model fits better, according to BIC?
The reduced rank approach fits better.
library(mvabund)
data(spider)
spid.trait = traitglm(spider$abund,spider$x,method="cv.glm1path")
#> No traits matrix entered, so will fit SDMs with different env response for each spp
library(lattice)
a = max( abs(spid.trait$fourth.corner) )
colort = colorRampPalette(c("blue","white","red"))
plot.4th = levelplot(t(as.matrix(spid.trait$fourth.corner)),
xlab="Environmental Variables", ylab="Species",
col.regions=colort(100), at=seq(-a, a, length=100),
scales = list( x= list(rot = 45)) )
print(plot.4th)
Which environmental variables seem to have the strongest effect on spider abundances?
soil.dry
has some strongly negative coefficients,
fallen.leaves
and herb.layer
also have some
strong (but positive) interactions.
spidXY = data.frame(scale(spider$x),spider$abund) # scale standardises data!
library(reshape2)
spiderLong = melt(id=1:6,spidXY,variable.name="cols")
Xformula = paste(colnames(spider$x),collapse="+")
fullFormula = formula(paste0("value~cols+",Xformula,"+(",Xformula,"|cols)"))
library(glmmTMB)
spid_glmm = glmmTMB(fullFormula,family=nbinom2(),data=spiderLong)
#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; non-positive-definite
#> Hessian matrix. See vignette('troubleshooting')
#> Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence problem; false convergence (8).
#> See vignette('troubleshooting'), help('diagnose')
summary(spid_glmm)
#> Family: nbinom2 ( log )
#> Formula: value ~ cols + soil.dry + bare.sand + fallen.leaves + moss +
#> herb.layer + reflection + (soil.dry + bare.sand + fallen.leaves +
#> moss + herb.layer + reflection | cols)
#> Data: spiderLong
#>
#> AIC BIC logLik deviance df.resid
#> NA NA NA NA 289
#>
#> Random effects:
#>
#> Conditional model:
#> Groups Name Variance Std.Dev. Corr
#> cols (Intercept) 0.6278 0.7923
#> soil.dry 1.8414 1.3570 -0.87
#> bare.sand 0.1583 0.3979 -0.13 -0.06
#> fallen.leaves 0.4065 0.6376 0.66 -0.66 -0.71
#> moss 0.2394 0.4893 -0.68 0.38 0.53 -0.60
#> herb.layer 0.6218 0.7885 -0.91 0.80 -0.18 -0.37 0.44
#> reflection 0.6865 0.8286 0.18 -0.55 0.32 0.21 0.37 -0.36
#> Number of obs: 336, groups: cols, 12
#>
#> Dispersion parameter for nbinom2 family (): 1.51
#>
#> Conditional model:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.14046 0.38463 0.365 0.714972
#> colsAlopcune 0.57040 0.35671 1.599 0.109811
#> colsAlopfabr -1.63338 0.58814 -2.777 0.005483 **
#> colsArctlute -1.95026 0.73851 -2.641 0.008270 **
#> colsArctperi -4.59785 1.36464 -3.369 0.000754 ***
#> colsAuloalbi 0.08365 0.45974 0.182 0.855622
#> colsPardlugu -1.35449 0.69980 -1.936 0.052924 .
#> colsPardmont 1.72206 0.42390 4.062 4.86e-05 ***
#> colsPardnigr 0.96816 0.47566 2.035 0.041808 *
#> colsPardpull 1.18339 0.67418 1.755 0.079205 .
#> colsTrocterr 2.28614 0.37644 6.073 1.26e-09 ***
#> colsZoraspin 0.37109 0.43828 0.847 0.397160
#> soil.dry 0.63734 0.42406 1.503 0.132855
#> bare.sand 0.06513 0.15932 0.409 0.682691
#> fallen.leaves -0.35355 0.26504 -1.334 0.182218
#> moss 0.13821 0.18615 0.743 0.457782
#> herb.layer 1.11253 0.27857 3.994 6.50e-05 ***
#> reflection -0.01724 0.29874 -0.058 0.953983
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are some convergence warnings here, which is fair enough because there are only 12 species but heaps of correlated terms in this model. Results seem to be reasonably stable on multiple runs, tweaking some settings, so we’ll go with this fit.
Which predictors seem to be most important to spider community composition?
The largest variance component, by some margin, is for
soil.dry
(about 1.8), followed by herb.layer
and reflection
.