This vignette is meant to introduce users to the
MIAmaxent
package by providing a worked example of a
distribution modeling exercise. It shows how to use all of the top-level
functions included in MIAmaxent
in the order of a typical
analysis. This vignette does NOT describe the theoretical underpinnings
of the package. To learn more about the theory behind
MIAmaxent
, the user is referred to Halvorsen (2013),
Halvorsen et al. (2015), and Vollering et al. (2019).
Help pages for the package and for individual functions in the package can be accessed in R using the standard help command:
?"MIAmaxent"
(after attaching the package usinglibrary(MIAmaxent)
).
Abbreviations: EV = explanatory variable; DV = derived variable; RV = response variable; FOP = frequency of observed presence; AUC = area under the curve of the receiver operating characteristic plot
The data used in this vignette have been used to model the distribution of semi-natural grasslands in Østfold County, in southeastern Norway. The data set consists of 1059 locations where presence of semi-natural grasslands has been recorded, 13 environmental variables covering the extent of the study area, and 122 locations where the presence or absence of semi-natural grasslands has been recorded, independently of the presence-only records. The extent of the study area is about 4000 square kilometers, and the grain of the raster data is 500 meters (0.25 km2).
The data used in this vignette are included in the package as an example data set, so the results shown here can be directly replicated by executing the code as provided.
Before beginning the modeling exercise, it may be useful to see what
some of the data look like in their geographical representation. We can
use the terra
package to plot the 1059 recorded presences
on top of one of the environmental variable rasters:
library(terra)
EV1 <- rast(list.files(system.file("extdata", "EV_continuous",
package="MIAmaxent"), full.names=TRUE)[1])
PO <- read.csv(system.file("extdata", "occurrence_PO.csv", package="MIAmaxent"))
plot(EV1, legend=FALSE)
points(PO$POINT_X, PO$POINT_Y, pch = 20, cex = 0.5, col = 'red')
The starting point for modeling using MIAmaxent
is a
simple data object that contains occurrence data for the modeled target,
and some number of explanatory variables (EVs). This data object must be
formatted as a data frame, with the binary response variable (RV)
representing occurrence in the first column, and corresponding EV values
in subsequent columns. When the occurrence data consist of presence and
absence records, these should be coded as “1” and “0” respectively. When
the occurrence data consist of presence records only, presence locations
are contrasted against locations with unknown occurrence, and the RV
should be coded as “1” or “NA”. EVs may be continuous (numeric class) or
categorical (factor class), as denoted in the data object.
The
readData()
function transforms data in CSV and ASCII or GeoTIFF raster file formats into a single data frame which serves as the starting point for modeling.
Users of the highly popular maxent.jar program for maximum entropy
modeling are accustomed to data in a different format. Specifically,
occurrence data is often in CSV file format, with presences records
followed by coordinates, and explanatory data in ASCII raster file
format. The readData()
function makes it easy to read these
data into the data object that is used in MIAmaxent
. This
function extracts values of the EVs at locations specified in the CSV
file and properly formats these into the starting point for modeling. If
the CSV file contains presence records only, then
readData()
also selects a random set of uninformed
background locations for the data object. Alternatively, the user can
specify a custom set of background locations by giving these in the CSV
file.
We begin by creating our data object from file. Note that continuous and categorical environmental variables must be placed in separate directories:
library(MIAmaxent)
grasslandPO <- readData(
occurrence=system.file("extdata", "occurrence_PO.csv", package="MIAmaxent"),
contEV=system.file("extdata", "EV_continuous", package="MIAmaxent"),
catEV=system.file("extdata", "EV_categorical", package="MIAmaxent"),
maxbkg=20000)
In this case, the number of uninformed background locations to be
randomly selected (maxbkg=20000
) was larger than the total
number of raster cells in the study area, so all cells are included in
the data object.
Most functions in
MIAmaxent
return console output. Therefore, it’s handy to assign function output to an object, so that you can manipulate that object further. If you forget, you can use?.Last.value()
.
If we look at the resulting data object we see the response variable (with 1059 presence and 16420 background locations) along with 6 continuous and 5 categorical EVs:
str(grasslandPO)
#> 'data.frame': 17479 obs. of 14 variables:
#> $ RV : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ pca1 : num 0.000504 0.000506 0.000554 0.00051 0.000486 ...
#> $ prbygall : num 0.00772 0.00134 0.003 0.01049 0.00092 ...
#> $ prtilany : num 0 0 0.3898 0.0884 0.1316 ...
#> $ teraspif : int 180 95 20 111 116 61 24 34 35 35 ...
#> $ terdem : int 0 0 0 12 0 0 0 11 26 39 ...
#> $ terslpdg : num 0 0.827 1.433 1.175 0.813 ...
#> $ tersolrade: num 977 966 1018 956 960 ...
#> $ tertpi09 : num -3.54 -7.51 -6.11 -2.38 -15.16 ...
#> $ geoberg : Factor w/ 16 levels "0","2","3","4",..: 10 10 9 10 10 10 10 10 10 10 ...
#> $ geolmja1 : Factor w/ 15 levels "11","12","15",..: 8 7 15 7 8 7 8 7 7 7 ...
#> $ lcucor1 : Factor w/ 21 levels "111","112","121",..: 21 21 21 8 21 8 21 8 12 2 ...
#> $ lcutilt4 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
#> $ terslpps15: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 6 1 1 1 1 6 6 ...
sum(grasslandPO$RV == 1, na.rm = TRUE)
#> [1] 1059
sum(is.na(grasslandPO$RV))
#> [1] 16420
IMPORTANT: A number of important issues for distribution
modeling – such as accounting for sampling bias and setting study area
extent – are not dealt with in MIAmaxent
, and should be
dealt with beforehand. Good modeling practice requires that these issues
be considered carefully! See Guisan et al. (2017) for a full treatment
of distribution modeling in R.
By its simplest definition, a distribution model examines the relationship between the modeled target and its environment. In this way, distribution modeling follows the long tradition of gradient analysis in vegetation ecology (Halvorsen, 2012). Therefore, before building an actual model, we should have some idea about what influence the environmental variables have on the occurrence of the target.
We can use the plotFOP
function to create a so-called
Frequency of Observed Presence (FOP) plot. An FOP plot shows how
commonly the target occurs across the range of the EV, and makes it
possible to recognize patterns in frequency of occurrence. In theory,
the relationship between a continuous EV and modeled target is expected
to be unimodal, if the observed range of the EV is sufficiently large.
In practice, the pattern seen in the FOP plot depends not only on the
range of the EV — which is affected by the extent of the study area —
but also the scaling of the EV.
Here we examine FOP plots for 2 of the continuous EVs:
The points in these FOP plots show the observed proportion of points in a given interval of the EV which contain presences. The red line is a local regression smoother which aims to summarize the pattern in the empirical FOP values. The grey distribution in the background is an approximation of the data density across the range of the EV.
Notice the difference in the scales of the FOP axes. EVs showing a larger interval on the FOP axis typically carry more explanatory power.
We can change the number of the number of intervals used to calculate FOP, or the neighborhood of the smoother, and we can access the plotted data directly:
terslpdgFOP
#> $EVoptimum
#> [1] 6.382247
#>
#> $FOPdata
#> int n intEV intRV loess
#> 1 (-0.00951,0.476] 2506 0.2601985 0.06304868 0.06692123
#> 2 (0.476,0.951] 3730 0.7188061 0.06729223 0.06253300
#> 3 (0.951,1.43] 3447 1.1799195 0.05860168 0.05922944
#> 4 (1.43,1.9] 2560 1.6513057 0.05507812 0.05698450
#> 5 (1.9,2.38] 1834 2.1238806 0.05507088 0.05588399
#> 6 (2.38,2.85] 1239 2.5967183 0.05004036 0.05601716
#> 7 (2.85,3.33] 830 3.0659710 0.06626506 0.05736699
#> 8 (3.33,3.8] 480 3.5487974 0.06666667 0.05992171
#> 9 (3.8,4.28] 318 4.0297763 0.05974843 0.06412131
#> 10 (4.28,4.76] 202 4.4976457 0.05445545 0.06853793
#> 11 (4.76,5.23] 131 4.9892501 0.08396947 0.07204624
#> 12 (5.23,5.71] 72 5.4752495 0.06944444 0.07506689
#> 13 (5.71,6.18] 44 5.9547193 0.11363636 0.07805324
#> 14 (6.18,6.66] 44 6.3822473 0.04545455 0.07853477
#> 15 (6.66,7.13] 21 6.8873543 0.19047619 0.07539665
#> 16 (7.13,7.61] 12 7.3856734 0.00000000 0.06830319
#> 17 (7.61,8.08] 5 7.7893479 0.00000000 0.05917056
#> 18 (8.08,8.56] 2 8.1287899 0.00000000 0.05042162
#> 19 (8.56,9.04] 1 9.0068598 0.00000000 0.01422974
#> 20 (9.04,9.52] 1 9.5107498 0.00000000 -0.01200278
Based on this FOP plot, the occurrence of semi-natural grasslands seems to be unimodally related to ‘terslopdg’ (terrain slope) with a maximum at around 6.
Now we examine FOP plots for one of the categorical EVs:
We see that geoberg type 4 has the highest rate of observed presence, followed by type 2, and then types 3 and 28. If we look more closely however, we notice also that geoberg type 4 is sampled very rarely (see grey bars), with only 6 locations falling into that category:
geobergFOP
#> $EVoptimum
#> [1] 4
#> Levels: 0 2 3 4 5 21 22 26 27 28 29 35 40 62 82 85
#>
#> $FOPdata
#> level n levelRV
#> 1 0 5 0.00000000
#> 2 2 20 0.45000000
#> 3 3 74 0.28378378
#> 4 4 6 0.50000000
#> 5 5 456 0.10526316
#> 6 21 3448 0.05974478
#> 7 22 46 0.06521739
#> 8 26 14 0.00000000
#> 9 27 21 0.09523810
#> 10 28 58 0.27586207
#> 11 29 6 0.00000000
#> 12 35 369 0.11382114
#> 13 40 1 0.00000000
#> 14 62 8183 0.05230356
#> 15 82 4445 0.05939258
#> 16 85 327 0.05198777
If geoberg type 4 had shown a high FOP value and a large number of observations, the uncertainty associated with its FOP value would be lower and its likelihood of being selected in the model would be increased.
It’s useful to examine FOP plots for all candidate explanatory variables (EVs) before building a model.
Looking at FOP plots should help the modeler decide which EVs are likely to have greatest explanatory power, and gives an idea of the strength and shape of the relationships between the EVs and RV.
To fit the many different kinds of relationships between explanatory and response variables, we need to transform the EVs. This means that we create new “derived” variables (DVs) from the original EVs. Another way of thinking about this is to put it in terms of rescaling; we adjust the scale of the EV in many different ways in order to check which scaling is most ecologically relevant to the occurrence of the modeled target.
The deriveVars()
function produces DVs from EVs by 7
different transformation types: linear, monotonous, deviation, forward
hinge, reverse hinge, threshold, and binary (Halvorsen et al., 2015).
The first 6 of these are relevant for continuous variables and the
binary transformation is relevant only for categorical variables.
Different types of transformations can be turned on or off to balance
model complexity with model fit.
For the spline-type transformations (forward hinge, reverse hinge,
threshold) an endless number of different transformations are possible,
so by default the function produces 20 of each, and then chooses those
which explain the most variation in the RV. This means that 20 models
are built and evaluated for each combination of EV and spline
transformation, so running deriveVars()
with these
transformation types turned on can take a bit of time — depending on the
size of the data set.
Here we produce all types of DVs from our EVs:
grasslandDVs <- deriveVars(grasslandPO,
transformtype = c("L","M","D","HF","HR","T","B"))
#> Pre-selecting forward hinge transformations of pca1
#> Pre-selecting reverse hinge transformations of pca1
#> Pre-selecting threshold transformations of pca1
#> Pre-selecting forward hinge transformations of prbygall
#> Pre-selecting reverse hinge transformations of prbygall
#> Pre-selecting threshold transformations of prbygall
#> Pre-selecting forward hinge transformations of prtilany
#> Pre-selecting reverse hinge transformations of prtilany
#> Pre-selecting threshold transformations of prtilany
#> Pre-selecting forward hinge transformations of teraspif
#> Pre-selecting reverse hinge transformations of teraspif
#> Pre-selecting threshold transformations of teraspif
#> Pre-selecting forward hinge transformations of terdem
#> Pre-selecting reverse hinge transformations of terdem
#> Pre-selecting threshold transformations of terdem
#> Pre-selecting forward hinge transformations of terslpdg
#> Pre-selecting reverse hinge transformations of terslpdg
#> Pre-selecting threshold transformations of terslpdg
#> Pre-selecting forward hinge transformations of tersolrade
#> Pre-selecting reverse hinge transformations of tersolrade
#> Pre-selecting threshold transformations of tersolrade
#> Pre-selecting forward hinge transformations of tertpi09
#> Pre-selecting reverse hinge transformations of tertpi09
#> Pre-selecting threshold transformations of tertpi09
Turn
write
on and (optionally) specify a directory to save the transformation functions produced byderiveVars
to file.
The output of deriveVars()
is a list consisting of 2
parts:
Both list elements also contain the RV vector.
In our grasslands analysis, the contents of the list items look like this:
summary(grasslandDVs$dvdata)
#> Length Class Mode
#> RV 17479 -none- numeric
#> pca1 10 data.frame list
#> prbygall 5 data.frame list
#> prtilany 7 data.frame list
#> teraspif 8 data.frame list
#> terdem 8 data.frame list
#> terslpdg 5 data.frame list
#> tersolrade 5 data.frame list
#> tertpi09 8 data.frame list
#> geoberg 16 data.frame list
#> geolmja1 15 data.frame list
#> lcucor1 21 data.frame list
#> lcutilt4 2 data.frame list
#> terslpps15 6 data.frame list
head(summary(grasslandDVs$transformations))
#> Length Class Mode
#> RV 17479 -none- numeric
#> pca1_L_transf 1 -none- function
#> pca1_M_transf 1 -none- function
#> pca1_D05_transf 1 -none- function
#> pca1_D1_transf 1 -none- function
#> pca1_D2_transf 1 -none- function
length(grasslandDVs$transformations)
#> [1] 117
Note that the names of DVs indicate the type of transformation was used to create them. For example, “terslpdg_D2” is a deviation-type transformation of terslpdg, where the slope of the deviation is controlled by a parameter value of 2. Meanwhile, “terslpdg_HR4” is a reverse hinge transformation, with the knot in the 4th position.
Underscores (’_‘) are used to denote DVs, and colons (’:’) are used to denote interaction terms, so EV names must not contain these characters. EV names should also be unique.
To illustrate, look at how a given DV relates to the original, untransformed EV from which it was derived. Here we examine “terslpdg_D2” and “terslpdg_M”:
plot(grasslandPO$terslpdg, grasslandDVs$dvdata$terslpdg$terslpdg_D2, pch=20,
ylab="terslpdg_D2")
plot(grasslandPO$terslpdg, grasslandDVs$dvdata$terslpdg$terslpdg_M, pch=20,
ylab="terslpdg_M")
“terslpdg_D2” is the squared deviation (hence D2) from the estimated optimum in terslpdg (around 6). “terslpdg_M” is a monotone (hence M) transformation of terslpdg — specifically a zero-skew transformation.
With DVs ready, we are ready to begin the process of choosing which
variables to include in the model. This is arguably the most critical
step in the whole modeling process. Following the principle of
parsimony, the aim in selecting variables is to explain as much
variation in the RV as efficiently as possible. The greater the number
of EVs or DVs included in the model, the more variation in the RV we can
explain, but at the cost of model complexity. In the
MIAmaxent
package, the benefit of additional variation
explained is weighed against the cost in model complexity using an
inference test (Chi-squared or F). Variables are added to the model one
by one in a process termed forward selection, and each new model is
compared to its predecessor. Another term for this process is “nested
model comparison.”
Rather than selecting from the full pool of DVs one by one,
MIAmaxent
performs variable selection in two parts:
selectDVforEV()
function.selectEV()
function.Variable selection occurs hierarchically: first DVs for each EV, then EVs for the full model.
The selectDVforEV()
function performs forward selection
of individual DVs for each EV. In other words, the function takes each
EV one by one, and narrows the group of DVs produced from that EV to a
set which explains variation in the RVs most efficiently.
The alpha
argument to selectDVforEV()
is
used in the inference test during forward selection, setting the
threshold for how much variation a DV must explain to be retained. A
lower alpha
results in a more conservative test, i.e. DVs
must explain more variation to be selected.
Here we use selectDVforEV()
on the grassland data set.
Note the “$dvdata” following grasslandsDV, which identifies the list of
DVs we made using deriveVars()
(see
?deriveVars()
Value).
The output is a list consisting of 2 parts:
Comparing the list of DVs before and after selection, we can see that
selectDVforEV()
reduced the number of DVs considerably:
summary(grasslandDVs$dvdata)
#> Length Class Mode
#> RV 17479 -none- numeric
#> pca1 10 data.frame list
#> prbygall 5 data.frame list
#> prtilany 7 data.frame list
#> teraspif 8 data.frame list
#> terdem 8 data.frame list
#> terslpdg 5 data.frame list
#> tersolrade 5 data.frame list
#> tertpi09 8 data.frame list
#> geoberg 16 data.frame list
#> geolmja1 15 data.frame list
#> lcucor1 21 data.frame list
#> lcutilt4 2 data.frame list
#> terslpps15 6 data.frame list
sum(sapply(grasslandDVs$dvdata[-1], length))
#> [1] 116
summary(grasslandDVselect$dvdata)
#> Length Class Mode
#> RV 17479 -none- numeric
#> pca1 1 data.frame list
#> prbygall 1 data.frame list
#> prtilany 1 data.frame list
#> teraspif 1 data.frame list
#> terdem 1 data.frame list
#> tertpi09 1 data.frame list
#> geoberg 5 data.frame list
#> geolmja1 3 data.frame list
#> lcucor1 2 data.frame list
#> terslpps15 1 data.frame list
sum(sapply(grasslandDVselect$dvdata[-1], length))
#> [1] 17
Note also that the number of EVs was reduced from 13 to 10. EVs for which none of the DVs explained a significant amount of variation are dropped.
Here is an example of one of the (13) trails of forward DV selection. Shown is the trail for the “terdem” EV:
grasslandDVselect$selection$terdem
#> round variables m Dsq Chisq df P
#> 1 1 terdem_D05 1 0.006 106.684 1 5.22e-25
#> 2 1 terdem_M 1 0.005 97.665 1 4.95e-23
#> 3 1 terdem_HR18 1 0.005 97.541 1 5.28e-23
#> 4 1 terdem_L 1 0.005 97.460 1 5.49e-23
#> 5 1 terdem_D1 1 0.005 97.373 1 5.74e-23
#> 6 1 terdem_HR4 1 0.005 93.987 1 3.18e-22
#> 7 1 terdem_D2 1 0.005 83.598 1 6.06e-20
#> 8 1 terdem_T8 1 0.003 55.579 1 8.98e-14
#> 9 2 terdem_D05 + terdem_HR4 2 0.006 0.779 1 3.77e-01
#> 10 2 terdem_D05 + terdem_D2 2 0.006 0.572 1 4.49e-01
#> 11 2 terdem_D05 + terdem_T8 2 0.006 0.344 1 5.58e-01
#> 12 2 terdem_D05 + terdem_M 2 0.006 0.154 1 6.95e-01
#> 13 2 terdem_D05 + terdem_HR18 2 0.006 0.094 1 7.60e-01
#> 14 2 terdem_D05 + terdem_D1 2 0.006 0.092 1 7.61e-01
#> 15 2 terdem_D05 + terdem_L 2 0.006 0.091 1 7.63e-01
The columns in this data frame represent: the round of variable selection (round), the names of the DVs included in the model (variables), the number of DVs in the model (m), the fraction of deviance explained (D2, sensu Guisan & Zimmerman, 2000), the Chi-squared statistic for the nested model comparison (Chisq), the degrees of freedom for the nested model comparison (df), and the p-value for the Chi-squared statistic under the specified degrees of freedom (P).
This table shows, for example, that in the first round of selection, one model was built for each of the 8 DVs, and that all of these explained enough variation to be retained for the second round of selection. Of all the DVs produced from “terdem”, “terdem_D05” explained the most variation in the RV. However, none of the remaining DVs explained enough of the remaining variation to be selected in addition to “terdem_D05” (P > alpha, in round 2).
Part 2 of variable selection using MIAmaxent
is
performed by the selectEV()
function. This function is
similar to the selectDVforEV()
function, but instead of
selecting a parsimonious set of DVs to represent each EV,
selectEV()
selects a parsimonious set of EVs to comprise
the model. This proceeds in the same manner as
selectDVforEV()
, with nested model comparison using
inference tests. Where selectDVforEV()
adds a single DV at
a time, selectEV()
adds a single set of DVs
(representing one EV) at a time.
The selectEV()
function also differs from
selectDVforEV()
in another important way; it provides the
option of testing interaction terms between selected EVs
(interaction = TRUE
). Interaction terms are useful when one
EV changes the effect of another EV on the modeled target. In
MIAmaxent
, only first-order interactions are tested, and
only between EVs both included in the model.
Here we use selectEV()
on the grassland data set. Note
the “$dvdata” following grasslandsDVselect, which identifies the list of
selected DVs we made using selectDVforEV()
(see
?selectDVforEV()
Value).
grasslandEVselect <- selectEV(grasslandDVselect$dvdata, alpha = 0.001,
interaction = TRUE)
#> Forward selection of 10 EVs
#> Round 1 complete.
#> Round 2 complete.
#> Round 3 complete.
#> Round 4 complete.
#> Round 5 complete.
#> Round 6 complete.
#> Forward selection of interaction terms between 6 EVs
#> Round 7 complete.
The output is a list consisting of 3 parts:
Compare the list of EVs before and after:
summary(grasslandDVselect$dvdata)
#> Length Class Mode
#> RV 17479 -none- numeric
#> pca1 1 data.frame list
#> prbygall 1 data.frame list
#> prtilany 1 data.frame list
#> teraspif 1 data.frame list
#> terdem 1 data.frame list
#> tertpi09 1 data.frame list
#> geoberg 5 data.frame list
#> geolmja1 3 data.frame list
#> lcucor1 2 data.frame list
#> terslpps15 1 data.frame list
length(grasslandDVselect$dvdata[-1])
#> [1] 10
summary(grasslandEVselect$dvdata)
#> Length Class Mode
#> RV 17479 -none- numeric
#> prbygall 1 data.frame list
#> geoberg 5 data.frame list
#> lcucor1 2 data.frame list
#> tertpi09 1 data.frame list
#> geolmja1 3 data.frame list
#> teraspif 1 data.frame list
length(grasslandEVselect$dvdata[-1])
#> [1] 6
We can see that selectEV()
reduced the number of EVs to
6. To check whether any interaction terms have been included in the
model, we can look at its formula:
grasslandEVselect$selectedmodel$formula
#> RV ~ prbygall_M + geoberg_BX3 + geoberg_BX28 + geoberg_BX2 +
#> geoberg_BX35 + geoberg_BX5 + lcucor1_BX312 + lcucor1_BX322 +
#> tertpi09_HR12 + geolmja1_BX20 + geolmja1_BX42 + geolmja1_BX60 +
#> teraspif_T6
#> <environment: 0x000001b957468f90>
No interaction terms were significant in the model. These would be denoted by two DV names separated by a colon (e.g. ‘pr.bygall_M:tertpi09_HR12’). This can be confirmed by looking at the trail of forward selection of EVs and interaction terms. Here we show only the best model of each round:
grasslandEVselect$selection[!duplicated(grasslandEVselect$selection$round), ]
#> round variables m Dsq
#> 1 1 prbygall 1 0.012
#> 11 2 prbygall + geoberg 6 0.019
#> 20 3 prbygall + geoberg + lcucor1 8 0.023
#> 27 4 prbygall + geoberg + lcucor1 + tertpi09 9 0.026
#> 33 5 prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 12 0.028
#> 36 6 prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif 13 0.028
#> 37 7 prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif + lcucor1:tertpi09 15 0.029
#> Chisq df P
#> 1 219.973 1 9.17e-50
#> 11 112.063 5 1.50e-22
#> 20 69.863 2 6.75e-16
#> 27 62.059 1 3.33e-15
#> 33 25.978 3 9.64e-06
#> 36 14.735 1 1.24e-04
#> 37 10.593 2 5.01e-03
As expected, the model with the interaction term is not significant under the alpha value of 0.001. Instead, the selected model is the model with 6 single-order terms.
The full selection trail can be saved as a CSV-format file by setting
write = TRUE
in selectEV()
, and optionally
specifying a dir
. A careful examination of the trail of
forward selection can be helpful in choosing the final model for a given
application.
Note that above we started the forward selection procedure from a
null model including none of the EVs. However, if we had an a
priori reason to include one or more of the EVs in the model
regardless of explanatory power, we could do so using the
formula
argument in selectEV()
. Then forward
selection proceeds with the specified model as a starting point.
We may choose to plot some of the data in the forward selection trail to help us decide which model to use. For example, we may plot fraction of deviance explained (D2) against round number, to see how much better the model fit is for each added term:
In this case, we may decide that a simpler model with only 5 EVs is
better than the selected model, since the amount of deviance explained
levels off after round 5. The EVs included in this model are: prbygall +
geoberg + lcucor1 + tertpi09 + geolmja1. To proceed with this model,
instead of the selectedmodel
in
grasslandEVselect
, we can use the
chooseModel()
function:
grasslandmodel <- chooseModel(grasslandDVselect$dvdata,
formula("~ prbygall + geoberg + lcucor1 +
tertpi09 + geolmja1"))
This is the model which we explore further below.
After building a model by selecting which EVs to include, it is useful to explore what the modeled relationships actually look like. A straightforward way to do this is to look at model predictions over a range of explanatory data. This gives the modeler a sense of the relationships that the model has captured, and can help the modeler understand strengths and weaknesses in the model.
We can use the plotResp()
function to create a response
curve. A response curve plots model output across the range of one
particular EV. When other variables are excluded from the model
entirely, this is called a “single-effect response curve”.
Here we examine a single-effect response curve for the most important EV included in the model chosen above:
To assess how well the relationship between the EV and RV is captured by the model, it can be useful to examine FOP plots and response plots side-by-side:
prbygallFOP <- plotFOP(grasslandPO, "prbygall")
plotResp(grasslandmodel, grasslandDVs$transformations, "prbygall")
The values on the y-axes of the plots are not directly comparable, but we expect the shape of the response plot curve to mirror, more or less, the shape of the FOP plot curve.
Here is the same comparison for one of the categorical variables included in the model:
The plotResp2()
function is very similar to the
plotResp()
function in that it takes the same arguments and
is used to produce response plots. However, plotResp2()
plots “marginal-effect” responses rather than “single-effect” responses.
A marginal-effect response plot shows model predictions across the range
of one EV while holding all other EVs in the model constant at their
mean value. The resulting curves are often similar in shape, if not
in scale.
Here is the marginal-effect response plot for the same continuous EV shown above:
As a measure of variable contribution, we can calculate the Relative Variation Accounted for (RVA) by each variable (Halvorsen et al., 2015). Here with our chosen model:
calculateRVA(grasslandEVselect, formula("~ prbygall + geoberg + lcucor1 +
tertpi09 + geolmja1"))
#> variable RVA
#> 1 prbygall 0.429
#> 2 geoberg 0.250
#> 3 lcucor1 0.143
#> 4 tertpi09 0.107
#> 5 geolmja1 0.071
For many modeling applications — although not all — the motivation for building a model is to obtain model predictions. Model predictions, or model output, can be used in many different ways: to make predictions about parts of the study area for which there exist no occurrence data, to predict relative probability of occurrence outside the study area, or to predict relative probability of occurrence in the past or future. In other words, any form of spatial or temporal interpolation or extrapolation is possible (although not always recommended!). The only requirement is that the values of the EVs are known for the time or space for which model output is desired.
The projectModel()
function can be used to obtain model
predictions for any kind of modeling application. As input it takes the
model to be projected (model
), the transformation functions
used by the model (transformations
), and explanatory data
to project across (data
). For “maxent”-type models, the
projectModel
returns model predictions in probability ratio
output (PRO) format for each location represented in data
.
PRO format gives relative probability of presence, and PRO = 1
is a reference value that represents the probability of presence in an
“average” location in the training data.
A value of PRO = 1 can be interpreted as the relative probability of presence of a location randomly drawn from the training data. Put another way, values above 1 represent higher-than-average probability of presence, and vice versa.
Here, we obtain model output across the extent of the study area as
represented by the training data. When we enter the data
argument as a SpatRaster
object,
projectModel()
automatically shows model predictions in
geographical space. Note that the names of the raster layers must match
names of EVs in the model.
EVfiles <- c(list.files(system.file("extdata", "EV_continuous", package="MIAmaxent"),
full.names=TRUE),
list.files(system.file("extdata", "EV_categorical", package="MIAmaxent"),
full.names=TRUE))
EVstack <- rast(EVfiles)
grasslandPreds <- projectModel(model = grasslandmodel,
transformations = grasslandDVs$transformations,
data = EVstack)
It is often easier to visualize probability-ratio values on a log scale, so we plot the raster object again as log2(PRO + 1):
Alternatively, if the data
are supplied in a simple data
frame, model predictions are appended to the data
in column
1, and returned as list item output
. In this case, the
predictions can be mapped to geographical space manually, by including
spatial coordinates in the data
input to the
projectModel()
, and then using the rasterize()
function in the terra
package.
Additionally, the projectModel()
function automatically
checks how the range of the input explanatory data compares to the range
of the data used to train the model. This is important because if the
range of the input data lies outside the range of the training data
(i.e. the model is extrapolated), the predictions are less reliable. The
range of continuous variables is reported on the training data scale,
from 0 to 1, and the range of categorical variables is reported as
“inside” if all categories are represented in the training data:
grasslandPreds
#> $output
#> class : SpatRaster
#> dimensions : 201, 158, 1 (nrow, ncol, nlyr)
#> resolution : 500, 500 (x, y)
#> extent : 249000, 328000, 6531500, 6632000 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source(s) : memory
#> varname : prbygall
#> name : PRO
#> min value : 0.3538549
#> max value : 24.7064347
#>
#> $ranges
#> $ranges$prbygall
#> [1] 0 1
#>
#> $ranges$geoberg
#> [1] "inside"
#>
#> $ranges$lcucor1
#> [1] "inside"
#>
#> $ranges$tertpi09
#> [1] 0 1
#>
#> $ranges$geolmja1
#> [1] "inside"
Since we projected the model across the training data, it makes sense that all the ranges are reported as [0,1] or “inside.”
There are many ways of evaluating the quality of a model, including assessing which EVs are selected and gauging the realism of response curves. Arguably the best way to evaluate a model, however, is to test how often its predictions are correct using occurrence data which are independent from the data used to train the model. This is strongly preferable to using training data because it indicates whether the model reflects patterns specific to the training data or generalized patterns for the modeled target (i.e. whether the model is overfitted). Independent, presence-absence test data are not always available, for example when projecting a model into the future, but when they are, they represent a high standard in model validation.
The evaluation metric which is used most commonly for distribution
models and is implemented in MIAmaxent
is Area Under the
Curve (AUC) of the receiver operating characteristic plot. This is a
metric which measures the performance of the model as a binary
classifier over the full range of discrimination thresholds. When it is
calculated using independent test data we refer to it as “testAUC”.
The testAUC()
function takes a data frame of presence
and absence locations, along with the corresponding values of EVs at
those locations, and calculates testAUC. The evaluation data can easily
be read into R using the readData()
function with
PA = TRUE
if desired, as shown below.
In our example, 122 test locations in Østfold County, Norway, were visited to record presence or absence of semi-natural grasslands:
grasslandPA <- readData(
occurrence = system.file("extdata", "occurrence_PA.csv", package="MIAmaxent"),
contEV = system.file("extdata", "EV_continuous", package="MIAmaxent"),
catEV = system.file("extdata", "EV_categorical", package="MIAmaxent"),
PA = TRUE, XY = TRUE)
head(grasslandPA)
#> RV x y pca1 prbygall prtilany teraspif terdem terslpdg tersolrade tertpi09 geoberg
#> 1 1 262750 6557250 0.000646 0.002871206 0.45159969 93 24 1.114550 967.471 8.67901 21
#> 2 1 263250 6558750 0.000635 0.029484030 0.01523342 109 19 1.236990 956.743 1.50617 21
#> 3 1 263750 6555250 0.000694 0.000970874 0.92038828 41 0 0.365185 981.625 -10.39510 21
#> 4 1 265250 6555750 0.000719 0.003206413 0.27815631 7 13 0.326633 988.487 -4.71605 21
#> 5 1 265250 6559750 0.000652 0.002914238 0.27019149 151 20 1.382270 941.362 6.06173 21
#> 6 1 266750 6553750 0.000848 0.012106540 0.21468930 49 10 0.769991 985.827 2.41975 21
#> geolmja1 lcucor1 lcutilt4 terslpps15
#> 1 100 311 0 6
#> 2 100 322 0 6
#> 3 130 523 1 1
#> 4 130 322 1 5
#> 5 130 311 1 6
#> 6 130 322 1 6
tail(grasslandPA)
#> RV x y pca1 prbygall prtilany teraspif terdem terslpdg tersolrade tertpi09 geoberg
#> 117 0 313250 6575250 0.000788 0 1.0101010 131 132 0.045296 1000.920 -3.39507 29
#> 118 0 313750 6579250 0.000650 0 1.0076010 39 149 1.358030 1022.570 -1.76543 62
#> 119 0 314250 6572750 0.000545 0 0.3865006 78 139 0.146076 1001.790 -7.23457 82
#> 120 0 314250 6576750 0.000694 0 1.0060360 81 141 1.689610 1028.010 3.16049 62
#> 121 0 314750 6571250 0.000640 0 0.0000000 102 174 1.501690 983.568 12.33330 82
#> 122 0 314750 6574750 0.000745 0 0.9434344 166 130 0.449772 988.030 -6.40741 62
#> geolmja1 lcucor1 lcutilt4 terslpps15
#> 117 100 312 1 1
#> 118 130 312 1 6
#> 119 12 312 1 1
#> 120 12 312 1 6
#> 121 130 312 0 6
#> 122 12 312 1 1
Plotted on the raster of (log-scaled) model predictions, these occurrences look like this:
plot(log2(grasslandPreds$output + 1))
presences <- grasslandPA[grasslandPA$RV==1, ]
absences <- grasslandPA[grasslandPA$RV==0, ]
points(presences$x, presences$y, pch = 20, cex = 0.5, col = 'red')
points(absences$x, absences$y, pch = 20, cex = 0.5, col = 'grey')
legend('top', c('presence', 'absence'), col = c('red', 'grey'), pch = c(20, 20))
We can use these data to calculate testAUC for our distribution model:
#> [1] 0.7817506
The ROC plot shows how the rate of true positives as well as false positives increases as the binary classification threshold is lowered. The area under this curve corresponds to the value returned by testAUC.
Note that the testAUC()
function can also be used to
calculate AUC from presence-only training data, but that the
interpretation of these values differs importantly.
In the modeling example above, we used presence-only data to fit
maximum entropy models. It is important to note that
MIAmaxent
provides the exact same functionality for models
fitted by standard logistic regression. Specifically, in all functions
which perform model fitting (i.e. pre-selection of spline DVs in
deriveVars()
, along with selectDVforEV()
,
selectEV()
, and chooseModel()
), the fitting
algorithm
can be specified as "maxent"
or"LR"
, where "maxent"
is the default.
Conventionally, the maximum entropy estimation (
algorithm = "maxent"
) is used with presence-only occurrence data, while logistic regression (algorithm = "LR"
), is used with presence-absence data.
The maximum entropy algorithm represents a form of parametric density
estimation using an exponential family model, and is equivalent to
inhomogeneous Poisson process models (Fithian & Hastie, 2013). It is
implemented in MIAmaxent
using infinitely-weighted logistic
regression, with presences automatically added to the background. The
maximum entropy algorithm has been used primarily with presence-only
occurrence data in the popular maxent.jar software (Phillips et al.,
2006), but it may also be used with presence-absence data (Halvorsen,
2013). In contrast, standard logistic regression (binomial GLM) is
customarily used with presence-absence occurrence data. For more
information about the differences between these approaches, see Guisan
et al. (2017) and Fithian & Hastie (2013).
Below, we follow the same procedure as above, but use the presence-absence data to fit models by logistic regression.
grasslandPA <- readData(
occurrence = system.file("extdata", "occurrence_PA.csv", package="MIAmaxent"),
contEV = system.file("extdata", "EV_continuous", package="MIAmaxent"),
catEV = system.file("extdata", "EV_categorical", package="MIAmaxent"),
PA = TRUE, XY = FALSE)
str(grasslandPA)
#> 'data.frame': 122 obs. of 14 variables:
#> $ RV : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ pca1 : num 0.000646 0.000635 0.000694 0.000719 0.000652 ...
#> $ prbygall : num 0.002871 0.029484 0.000971 0.003206 0.002914 ...
#> $ prtilany : num 0.4516 0.0152 0.9204 0.2782 0.2702 ...
#> $ teraspif : int 93 109 41 7 151 49 62 90 48 167 ...
#> $ terdem : int 24 19 0 13 20 10 2 16 0 0 ...
#> $ terslpdg : num 1.115 1.237 0.365 0.327 1.382 ...
#> $ tersolrade: num 967 957 982 988 941 ...
#> $ tertpi09 : num 8.68 1.51 -10.4 -4.72 6.06 ...
#> $ geoberg : Factor w/ 5 levels "21","29","35",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ geolmja1 : Factor w/ 9 levels "12","15","41",..: 8 8 9 9 9 9 8 9 4 9 ...
#> $ lcucor1 : Factor w/ 11 levels "112","142","211",..: 5 7 11 7 5 7 2 2 7 11 ...
#> $ lcutilt4 : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
#> $ terslpps15: Factor w/ 5 levels "1","2","3","5",..: 5 5 1 4 5 5 1 5 1 1 ...
plotFOP(grasslandPA, "teraspif")
plotFOP(grasslandPA, "terslpdg")
plotFOP(grasslandPA, 10)
Note that since the RV in grasslandPA
contains presence
and absence, the plots above show empirical frequencies of
presence, rather than observed frequencies of presence. This is a subtle
but important distinction.
PA.grasslandDVs <- deriveVars(grasslandPA, algorithm = "LR")
#> Pre-selecting forward hinge transformations of pca1
#> Pre-selecting reverse hinge transformations of pca1
#> Pre-selecting threshold transformations of pca1
#> Pre-selecting forward hinge transformations of prbygall
#> Pre-selecting reverse hinge transformations of prbygall
#> Pre-selecting threshold transformations of prbygall
#> Pre-selecting forward hinge transformations of prtilany
#> Pre-selecting reverse hinge transformations of prtilany
#> Pre-selecting threshold transformations of prtilany
#> Pre-selecting forward hinge transformations of teraspif
#> Pre-selecting reverse hinge transformations of teraspif
#> Pre-selecting threshold transformations of teraspif
#> Pre-selecting forward hinge transformations of terdem
#> Pre-selecting reverse hinge transformations of terdem
#> Pre-selecting threshold transformations of terdem
#> Pre-selecting forward hinge transformations of terslpdg
#> Pre-selecting reverse hinge transformations of terslpdg
#> Pre-selecting threshold transformations of terslpdg
#> Pre-selecting forward hinge transformations of tersolrade
#> Pre-selecting reverse hinge transformations of tersolrade
#> Pre-selecting threshold transformations of tersolrade
#> Pre-selecting forward hinge transformations of tertpi09
#> Pre-selecting reverse hinge transformations of tertpi09
#> Pre-selecting threshold transformations of tertpi09
The DVs produced above are not the same as those produced previously with presence-only data, even though the variable names might be the same. That is because the exact forms of the transformations applied are data-dependent.
PA.grasslandDVselect <- selectDVforEV(PA.grasslandDVs$dvdata, alpha = 0.001,
algorithm = "LR", quiet = TRUE)
PA.grasslandEVselect <- selectEV(PA.grasslandDVselect$dvdata, alpha = 0.001, algorithm = "LR")
#> Forward selection of 8 EVs
#> Round 1 complete.
#> Round 2 complete.
PA.grasslandEVselect$selection
#> round variables m Dsq Chisq df P
#> 1 1 terdem 1 0.205 33.170 1 8.45e-09
#> 2 1 prbygall 1 0.187 30.177 1 3.94e-08
#> 3 1 pca1 1 0.120 19.379 1 1.07e-05
#> 4 1 prtilany 1 0.118 19.015 1 1.30e-05
#> 5 1 geoberg 1 0.107 17.312 1 3.17e-05
#> 6 1 tersolrade 1 0.095 15.338 1 8.99e-05
#> 7 1 terslpdg 1 0.085 13.809 1 2.02e-04
#> 8 1 lcucor1 1 0.080 12.869 1 3.34e-04
#> 9 2 terdem + prbygall 2 0.286 13.041 1 3.05e-04
#> 10 2 terdem + terslpdg 2 0.249 7.032 1 8.01e-03
#> 11 2 terdem + prtilany 2 0.238 5.310 1 2.12e-02
#> 12 2 terdem + tersolrade 2 0.233 4.564 1 3.26e-02
#> 13 2 terdem + pca1 2 0.218 2.151 1 1.42e-01
#> 14 2 terdem + lcucor1 2 0.205 0.003 1 9.59e-01
#> 15 2 terdem + geoberg 2 0.205 0.001 1 9.76e-01
PA.grasslandEVselect$selectedmodel
#>
#> Call: stats::glm(formula = formula, family = "binomial", data = data)
#>
#> Coefficients:
#> (Intercept) terdem_D1 prbygall_D2
#> 3.266 -3.780 -51.989
#>
#> Degrees of Freedom: 121 Total (i.e. Null); 119 Residual
#> Null Deviance: 161.7
#> Residual Deviance: 115.5 AIC: 121.5
plotResp(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, "terdem")
plotFOP(grasslandPA, "terdem")
plotResp(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, "prbygall")
plotFOP(grasslandPA, "prbygall")
PA.grasslandPreds <- projectModel(model = PA.grasslandEVselect$selectedmodel,
transformations = PA.grasslandDVs$transformations,
data = EVstack)
Notice that predictions from the logistic regression model — unlike the maxent model — are on the interval [0,1]. These values represent predicted probability of presence, rather than predicted relative probability of presence.
Now, compare the model trained on presence-absence data by logistic regression with the model trained on presence-only data by maximum entropy estimation. First we calculate their AUCs on the presence-absence data set:
# logistic regression model
testAUC(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations,
grasslandPA, plot = FALSE)
#> [1] 0.8435355
# maxent model
testAUC(grasslandmodel, grasslandDVs$transformations, grasslandPA, plot = FALSE)
#> [1] 0.7817506
Next we calculate their AUCs on the presence-only data set:
# logistic regression model
testAUC(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations,
grasslandPO, plot = FALSE)
#> Warning: The test data consist of 1/NA only, so NA is treated as absence.
#> Be aware of implications for the interpretation of the AUC value.
#> [1] 0.6109549
# maxent model
testAUC(grasslandmodel, grasslandDVs$transformations, grasslandPO, plot = FALSE)
#> Warning: The test data consist of 1/NA only, so NA is treated as absence.
#> Be aware of implications for the interpretation of the AUC value.
#> [1] 0.6898254
Notice that:
In general, the question of how to make best use of presence-only and presence-absence data, when both are available, is an area of active research. Similarly, the use of presence-absence data in maximum entropy models is mostly unexplored (Halvorsen, 2013).
Thank you to Sabrina Mazzoni and Rune Halvorsen for providing the data used in this vignette.